January 24, 2023 numpy python optimisation
I was profiling some code I had written in which np.clip was being called a couple of hundred million times. The profiler was flagging some odd related functions that looked unimportant, so I dug deeper. It turns out that since numpy 1.17, np.clip has been substantially slower than it used to be, which has been documented here. This piqued my curiosity, and after some experimentation I found two substantially faster methods, which in turn sped up my code by $50\%$!
Below is my log of what I tried, but if you just want results then here is your link.
Note: These tests were undertaken using numpy version
import numpy as np
VMIN = -0.5
VMAX = 0.5
X_scalar = 2 * np.random.rand() - 0.5
np.clip(X_scalar, -0.5, 0.5)
> 11.3 µs ± 202 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Testing the baseline gave us a fairly decent time of $11.3\mu s$.
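The timings in this post read like IPython %timeit output. For anyone who wants to reproduce them from a plain script, here is a minimal sketch using the standard-library timeit module. The helper name time_per_call and its default loop counts are my own choices for illustration, and it reports the best run rather than the mean ± std. dev., so the numbers will not match exactly.

```python
import timeit

import numpy as np

VMIN, VMAX = -0.5, 0.5
X_scalar = 2 * np.random.rand() - 0.5


def time_per_call(func, number=10_000, repeat=3):
    """Return the best observed time per call (in seconds) of a zero-argument callable."""
    # timeit.repeat returns one total duration per repeat; take the fastest.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number


baseline = time_per_call(lambda: np.clip(X_scalar, VMIN, VMAX))
print(f"np.clip on a scalar: {baseline * 1e6:.2f} µs per call")
```

Wrapping the call in a lambda adds a small constant overhead to every measurement, which matters most for the fastest variants.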
np.core.umath.maximum(np.core.umath.minimum(X_scalar, VMAX), VMIN)
> 2.25 µs ± 64.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
However, we quickly got huge gains by calling the underlying umath maximum/minimum functions directly: roughly 5x faster than np.clip.
np.max([np.min([X_scalar, VMAX]), VMIN])
> 8.74 µs ± 141 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
Using the normal np.max/np.min functions was still faster than np.clip, but not nearly as fast as calling the umath functions directly.
VMIN if X_scalar < VMIN else VMAX if X_scalar > VMAX else X_scalar
> 75.2 ns ± 7.8 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
For scalars, it seems there is not much that beats plain conditional logic in Python: about 150x faster than np.clip. Huge gains!
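If the conditional expression is needed in more than one place, it can be wrapped in a small function. This is only a sketch: the name clip_scalar is made up here, and the extra function call costs some of the speed measured above, so in the hottest loops the inlined expression is probably still the better choice.

```python
def clip_scalar(x, vmin, vmax):
    """Clamp a single number to the interval [vmin, vmax] with plain comparisons."""
    return vmin if x < vmin else vmax if x > vmax else x


print(clip_scalar(0.75, -0.5, 0.5))  # 0.5
print(clip_scalar(-2.0, -0.5, 0.5))  # -0.5
print(clip_scalar(0.1, -0.5, 0.5))   # 0.1
```

Unlike np.clip, which hands back a numpy scalar, this returns whichever of the three input objects was selected, so a Python float stays a Python float.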
X_array = 2 * (np.random.rand(50000) - VMAX)
Again, testing the baseline:
np.clip(X_array, VMIN, VMAX)
> 172 µs ± 646 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
This gave us a very slow time of $172\mu s$, which isn’t good enough. Since np.max/np.min gained so little in the scalar case, we skip them here.
np.core.umath.maximum(np.core.umath.minimum(X_array, VMAX), VMIN)
> 40.9 µs ± 388 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Again, the umath functions were faster, this time by roughly 4x.
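Before swapping this into real code, it is worth checking that the min/max composition gives the same result as np.clip. The sketch below does that using the public names np.maximum and np.minimum, which to my knowledge are the same ufunc objects as the np.core.umath ones. I use them here because np.core is a private namespace and newer numpy releases warn when it is accessed. The helper name clip_array and the fixed seed are illustrative choices, not part of the original experiment.

```python
import numpy as np

VMIN, VMAX = -0.5, 0.5
rng = np.random.default_rng(0)            # arbitrary seed, only for repeatability
X_array = 2 * (rng.random(50_000) - 0.5)  # values in [-1, 1)


def clip_array(x, vmin, vmax):
    """Clamp elementwise with the minimum/maximum ufuncs instead of np.clip."""
    return np.maximum(np.minimum(x, vmax), vmin)


clipped = clip_array(X_array, VMIN, VMAX)

# Same values as np.clip on this input.
assert np.array_equal(clipped, np.clip(X_array, VMIN, VMAX))
```

Whether the 4x gap holds on another machine or numpy version is something to measure rather than assume.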
np.asarray([VMIN if Xi < VMIN else VMAX if Xi > VMAX else Xi for Xi in X_array])
> 10 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And, as expected, the raw Python variant was horribly slow: at $10ms$ it is nearly 60x slower than np.clip.
For scalars, using VMIN if X < VMIN else VMAX if X > VMAX else X is 150x faster than plain np.clip. However, using raw Python for arrays is (as expected) horrible. For arrays, using np.core.umath.maximum(np.core.umath.minimum(X, VMAX), VMIN) is roughly 4x faster than np.clip.
I got most of these ideas from numpy#14281 but wanted to verify them for myself.
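For code that receives both scalars and arrays, the two fast paths could be combined behind one function. This is a sketch under my own assumptions: fast_clip is a hypothetical name, and the isinstance check treats only Python int/float (and subclasses such as np.float64) as scalars. Everything else falls through to the ufunc path, which also handles numpy scalars correctly, just more slowly.

```python
import numpy as np


def fast_clip(x, vmin, vmax):
    """Clamp x to [vmin, vmax], choosing the faster path for scalars vs. arrays."""
    if isinstance(x, (int, float)):
        # Scalar path: plain Python comparisons.
        return vmin if x < vmin else vmax if x > vmax else x
    # Array path: elementwise minimum, then maximum.
    return np.maximum(np.minimum(x, vmax), vmin)


print(fast_clip(2.0, -0.5, 0.5))                         # 0.5
print(fast_clip(np.array([-1.0, 0.0, 1.0]), -0.5, 0.5))  # clamped elementwise
```

The type check has a cost of its own, so when the input type is known in advance, calling the matching expression directly should be faster.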