### How slow is Python really? (Or how fast is your language?)

I have this code, which I wrote in Python/NumPy:

```python
from __future__ import division
import numpy as np
import itertools

n = 6
iters = 1000
firstzero = 0
bothzero = 0
# Iterate over arrays of length n+1 which contain only -1s and 1s.
for S in itertools.product([-1, 1], repeat=n+1):
    for i in xrange(iters):
        # Choose a random array of length n. Prob 1/4 of being -1,
        # prob 1/4 of being 1 and prob 1/2 of being 0.
        F = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n)
        # Make sure that F is not all zeros.
        while np.all(F == 0):
            F = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n)
        # np.convolve(F, S, 'valid') computes two inner products between F
        # and the two successive windows of S of length n.
        FS = np.convolve(F, S, 'valid')
        if FS[0] == 0:
            firstzero += 1
        if np.all(FS == 0):
            bothzero += 1

print("firstzero: %i" % firstzero)
print("bothzero: %i" % bothzero)
```

It counts the number of times the convolution of two random arrays, one of which is one longer than the other, with entries drawn from a particular probability distribution, has a 0 at the first position, and the number of times it has a 0 in both positions.
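Since the windowing behaviour of `np.convolve` is easy to misread: with `'valid'` mode and inputs of length n and n+1 it returns exactly two values, the inner products of the *reversed* F with the two length-n windows of S. A small sketch (the seed is arbitrary, chosen only so the example is reproducible):

```python
import numpy as np

n = 6
np.random.seed(0)  # arbitrary seed, for reproducibility only
S = np.random.choice(np.array([-1, 1]), size=n + 1)
F = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n)

FS = np.convolve(F, S, 'valid')

# Convolution reverses one argument, so each output is the inner
# product of the reversed F with one length-n window of S.
assert len(FS) == 2
assert FS[0] == np.dot(F[::-1], S[:n])   # window S[0..n-1]
assert FS[1] == np.dot(F[::-1], S[1:])   # window S[1..n]
```

Because the distribution of F is symmetric under reversal, treating these as plain inner products, as the description above does, is statistically harmless.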

I had a bet with a friend who says Python is a terrible language for code that needs to be fast. The program takes 9s on my computer. He says it could be made 100 times faster if written in a "proper language".

The challenge is to see if this code can indeed be made 100 times faster in any language of your choice. I will test your code, and the fastest solution one week from now will win. If anyone gets below 0.09s then they automatically win and I lose.

**Status**

- **Python**. 30 times speed up by Alistair Buxon! Although not the fastest solution it is in fact my favourite.
- **Octave**. 100 times speed up by @Thethos.
- **Rust**. 500 times speed up by @dbaupp.
- **C++**. 570 times speed up by Guy Sirton.
- **C**. 727 times speed up by @ace.
- **C++**. Unbelievably fast by @Stefan.

The fastest solutions are now too fast to sensibly time. I have therefore increased n to 10 and set iters = 100000 to compare the best ones. Under this measure the fastest are:

- **C**. 7.5s by @ace.
- **C++**. 1s by @Stefan.

**My Machine**

The timings will be run on my machine. This is a standard Ubuntu install on an AMD FX-8350 Eight-Core Processor. This also means I need to be able to run your code.

**Follow up posted**

As this competition was rather too easy to get a x100 speedup, I have posted a follow-up for those who want to exercise their speed guru expertise. See How Slow Is Python Really (Part II)?

**Correct answer**, by Stefan (7 years ago):

# C++ bit magic

## 0.84ms with simple RNG, 1.67ms with C++11 std::knuth_b

0.16ms with a slight algorithmic modification (see edit below).

The Python implementation runs in 7.97 seconds on my rig. So this is 9488 to 4772 times faster depending on which RNG you choose.

```cpp
#include <iostream>
#include <bitset>
#include <random>
#include <chrono>
#include <stdint.h>
#include <cassert>
#include <tuple>
#include <vector>

#if 0 // C++11 random
std::random_device rd;
std::knuth_b gen(rd());

uint32_t genRandom()
{
    return gen();
}
#else
// bad, fast, random.
uint32_t genRandom()
{
    static uint32_t seed = std::random_device()();
    auto oldSeed = seed;
    seed = seed * 1664525UL + 1013904223UL; // numerical recipes, 32 bit
    return oldSeed;
}
#endif

#ifdef _MSC_VER
uint32_t popcnt( uint32_t x ){ return _mm_popcnt_u32(x); }
#else
uint32_t popcnt( uint32_t x ){ return __builtin_popcount(x); }
#endif

std::pair<unsigned, unsigned> convolve()
{
    const uint32_t n = 6;
    const uint32_t iters = 1000;
    unsigned firstZero = 0;
    unsigned bothZero = 0;

    uint32_t S = (1 << (n+1));
    // generate all possible N+1 bit strings
    // 1 = +1
    // 0 = -1
    while ( S-- )
    {
        uint32_t s1 = S % ( 1 << n );
        uint32_t s2 = (S >> 1) % ( 1 << n );
        uint32_t fmask = (1 << n) - 1; fmask |= fmask << 16;
        static_assert( n < 16, "packing of F fails when n > 16.");

        for( unsigned i = 0; i < iters; i++ )
        {
            // generate random bit mess
            uint32_t F;
            do {
                F = genRandom() & fmask;
            } while ( 0 == ((F % (1 << n)) ^ (F >> 16)) );

            // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
            // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
            // and ~MSB(F) & LSB(F) returns 1 for all elements that are negative
            // this results in the distribution ( -1, 0, 0, 1 )
            // to ease calculations we generate r = LSB(F) and l = MSB(F)
            uint32_t r = F % ( 1 << n );
            // modulo is required because the behaviour of the leftmost bit is implementation defined
            uint32_t l = ( F >> 16 ) % ( 1 << n );

            uint32_t posBits = l & ~r;
            uint32_t negBits = ~l & r;
            assert( (posBits & negBits) == 0 );

            // calculate which bits in the expression S * F evaluate to +1
            unsigned firstPosBits = ((s1 & posBits) | (~s1 & negBits));
            // idem for -1
            unsigned firstNegBits = ((~s1 & posBits) | (s1 & negBits));

            if ( popcnt( firstPosBits ) == popcnt( firstNegBits ) )
            {
                firstZero++;

                unsigned secondPosBits = ((s2 & posBits) | (~s2 & negBits));
                unsigned secondNegBits = ((~s2 & posBits) | (s2 & negBits));
                if ( popcnt( secondPosBits ) == popcnt( secondNegBits ) )
                {
                    bothZero++;
                }
            }
        }
    }
    return std::make_pair(firstZero, bothZero);
}

int main()
{
    typedef std::chrono::high_resolution_clock clock;

    int rounds = 1000;
    std::vector< std::pair<unsigned, unsigned> > out(rounds);

    // warm-up rounds to get the cpu up to speed..
    for( int i = 0; i < 10000; i++ )
    {
        convolve();
    }

    auto start = clock::now();
    for( int i = 0; i < rounds; i++ )
    {
        out[i] = convolve();
    }
    auto end = clock::now();
    double seconds = std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count() / 1000000.0;

#if 0
    for( auto pair : out )
        std::cout << pair.first << ", " << pair.second << std::endl;
#endif

    std::cout << seconds/rounds*1000 << " msec/round" << std::endl;
    return 0;
}
```

Compile in 64-bit for extra registers. When using the simple random generator, the loops in convolve() run without any memory access; all variables are stored in registers.

How it works: rather than storing `S` and `F` as in-memory arrays, they are stored as bits in a uint32_t.

For `S`, the `n` least significant bits are used, where a set bit denotes a +1 and an unset bit denotes a -1.

`F` requires at least 2 bits to create a distribution of [-1, 0, 0, 1]. This is done by generating random bits and examining the 16 least significant (called `r`) and the 16 most significant bits (called `l`). If `l & ~r` we assume that `F` is +1, if `~l & r` we assume that `F` is -1. Otherwise `F` is 0. This generates the distribution we're looking for.

Now we have `S`, `posBits` with a set bit on every location where F == 1, and `negBits` with a set bit on every location where F == -1.

We can prove that `F * S` (where * denotes multiplication) evaluates to +1 under the condition `(S & posBits) | (~S & negBits)`. We can also generate similar logic for all cases where `F * S` evaluates to -1. And finally, we know that `sum(F * S)` evaluates to 0 if and only if there is an equal amount of -1's and +1's in the result. This is very easy to calculate by simply comparing the number of +1 bits and -1 bits.

This implementation uses 32-bit ints, and the maximum `n` accepted is 16. It is possible to scale the implementation to 31 bits by modifying the random generation code, and to 63 bits by using uint64_t instead of uint32_t.

## edit

The following convolve function:

```cpp
std::pair<unsigned, unsigned> convolve()
{
    const uint32_t n = 6;
    const uint32_t iters = 1000;
    unsigned firstZero = 0;
    unsigned bothZero = 0;
    uint32_t fmask = (1 << n) - 1; fmask |= fmask << 16;
    static_assert( n < 16, "packing of F fails when n > 16.");

    for( unsigned i = 0; i < iters; i++ )
    {
        // generate random bit mess
        uint32_t F;
        do {
            F = genRandom() & fmask;
        } while ( 0 == ((F % (1 << n)) ^ (F >> 16)) );

        // Assume F is an array with interleaved elements such that F[0] || F[16] is one element
        // here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
        // and ~MSB(F) & LSB(F) returns 1 for all elements that are negative
        // this results in the distribution ( -1, 0, 0, 1 )
        // to ease calculations we generate r = LSB(F) and l = MSB(F)
        uint32_t r = F % ( 1 << n );
        // modulo is required because the behaviour of the leftmost bit is implementation defined
        uint32_t l = ( F >> 16 ) % ( 1 << n );

        uint32_t posBits = l & ~r;
        uint32_t negBits = ~l & r;
        assert( (posBits & negBits) == 0 );

        uint32_t mask = posBits | negBits;
        uint32_t totalBits = popcnt( mask );
        // if the amount of -1 and +1's is uneven, sum(S*F) cannot possibly evaluate to 0
        if ( totalBits & 1 )
            continue;

        uint32_t adjF = posBits & ~negBits;
        uint32_t desiredBits = totalBits / 2;

        uint32_t S = (1 << (n+1));
        // generate all possible N+1 bit strings
        // 1 = +1
        // 0 = -1
        while ( S-- )
        {
            // calculate which bits in the expression S * F evaluate to +1
            auto firstBits = (S & mask) ^ adjF;
            auto secondBits = (S & ( mask << 1 )) ^ ( adjF << 1 );

            bool a = desiredBits == popcnt( firstBits );
            bool b = desiredBits == popcnt( secondBits );
            firstZero += a;
            bothZero += a & b;
        }
    }
    return std::make_pair(firstZero, bothZero);
}
```

cuts the runtime to 0.160-0.161ms. Manual loop unrolling (not pictured above) brings that down to 0.150ms. The less trivial n=10, iters=100000 case runs in under 250ms. I'm sure I can get it under 50ms by leveraging additional cores, but that's too easy.

This is done by making the inner loop branch-free and by swapping the F and S loops.
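The equivalence the answer relies on, that `sum(F * S)` is zero exactly when the counts of +1 and -1 products match, can be brute-force checked in a few lines. This is a hypothetical Python sketch (names like `S_bits` are mine, not from the C++ above), exhaustively testing all sign windows and all F values for a small n:

```python
from itertools import product

n = 6  # small enough to check exhaustively

def popcnt(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count('1')

checked = 0
for S_bits in range(1 << n):  # n-bit window of S: bit i set means S[i] == +1
    S = [1 if (S_bits >> i) & 1 else -1 for i in range(n)]
    for F in product([-1, 0, 1], repeat=n):
        # posBits/negBits mark the positions where F is +1 / -1,
        # mirroring the packing described in the answer.
        posBits = sum(1 << i for i in range(n) if F[i] == 1)
        negBits = sum(1 << i for i in range(n) if F[i] == -1)
        # Bits where the product S[i]*F[i] is +1, respectively -1.
        plus = (S_bits & posBits) | (~S_bits & negBits)
        minus = (~S_bits & posBits) | (S_bits & negBits)
        zero_by_popcnt = popcnt(plus) == popcnt(minus)
        zero_direct = sum(s * f for s, f in zip(S, F)) == 0
        assert zero_by_popcnt == zero_direct
        checked += 1

print("verified %d (S, F) pairs" % checked)
```

For n = 6 this walks all 2^6 * 3^6 combinations in well under a second; the asserts all hold, which is exactly why comparing two popcounts can replace the inner product.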

If `bothZero` is not required, I can cut the run time down to 0.02ms by sparsely looping over all possible S arrays.

Could you provide a gcc-friendly version, and also what your command line would be, please? I am not sure I can test it currently.

I know nothing about this but google tells me that __builtin_popcount might be a replacement for _mm_popcnt_u32() .

Code updated; it now uses an #ifdef switch to select the correct popcnt command. It compiles with `-std=c++0x -mpopcnt -O2` and takes 1.01ms to run in 32-bit mode (I don't have a 64-bit GCC version at hand).

Could you make it print the output? I am not sure if it is actually doing anything currently :)

output: http://pastebin.com/Xg7xVst2 (you'll have to change the `#if 0` on line 123 for it to print the output of convolve())

Thanks. It seems there is no randomness in your code, right? I mean you don't select a random seed. It does appear to be terrifyingly fast. Try increasing the number of iterations to 100,000.

Could you explain your solution by any chance? It really is amazingly fast and I think people would be interested.

@Lembik There is no randomness, only pseudo-randomness in the simple rng case because the seed is fixed. This is not too hard to fix by changing the seed. I've updated the answer with an explanation.

Runs in 1.02 ms on my laptop when compiled with 64-bit gcc.

@Stefan This is impressive work! I would hate to have to be the one to fix any bugs in it though :p bit shifting magic.

You are clearly a wizard. +1

Hmm, I've got my work cut out for me. Glad that this epic code got me the populist badge.

You may be able to gain a speedup by precomputing `1<

License under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM

