Python bitarray - 60x speedup for comparisons
Published on Dec 24, 2020 by Impaktor.
This is intended to show how to use bitarray/bitstrings in python, on a beginner friendly way, to speed up code, by using more efficient data representation
1. Introduction — The problem to solve
Recently I found myself needing to speed up a bottle neck in my code, that was essentially querying a dictionary if it contained a specified key (datetime object), or specifically a function:
def is_workday(date): """True if weekday, false if weekend or holiday/red day. Input is Timestamp or DateTime object""" import holidays # Saturday and Sunday are index 5, 6 return date.weekday() < 5 and date not in holidays.USA(years=date.year)
where I am using the holidays package, which defines a dictionary of holidays for each year and country. I have 57 million dates that need to be checked, and this would take ~30 hours where the above code was the main bottle neck.
So is there a faster way to see if date
is in the dictionary? There are
several (I’ll mention some in the end), but we’ll look at bit representation
here.
2. The solution — bitarray
2.1. What is a bit-array and a bit-mask?
The reader is familiar that computers represents numbers as “ones and zeros”. E.g. the first numbers represented in binary as (here using 3 bits):
Decimal | Binary |
---|---|
0 | 000 |
1 | 001 |
2 | 010 |
3 | 011 |
4 | 100 |
5 | 101 |
6 | 110 |
7 | 111 |
But what if we instead interpret, for instance, “2” to literally mean “string of all zeros, where second-to-last bit is flipped to 1” to represent something other than the number “2”? This is quite common practice among C/C++ programming, but not something I’ve seen done much in the python community, thus this write up.
So instead of asking date in dictionary_of_holidays
, we represent a date
as a bitarray, and then use a bit-mask and bitwise AND operator.
E.g. to check if the second day in a year is on a weekend or not, could do something like the following:
01000000000000... AND 00000110000011... = 00000000000000...
where the top string has a 1
at the second bit (where we now define that
we count positions starting from left), representing the second day, and we
then operate on this bitarray with a bitwise AND together with the
bit-mask, where every (two) 7th position is 1, for Saturday and Sunday
(assuming that year started on a Monday), and the resulting output has a
’1’ if true (i.e. “the second day is on a weekend”), and all zeros if false
(“the second day of the year is on a working day”).
Naturally, we want to also include holidays in the bit-mask string, and encode these in the right relative starting positions, which will be different each year.
2.2. Implementation
Since every fourth year in our Gregorian calendar has an extra day, the number of bits in our bitarrays will be:
N_bits = 366
i.e. for the other three years, the last bit will be unused.
We will use the bitarrays (v. 1.6.1) package, and initialize the bitarray
representation of a date as an N_bits
long string with all zeros, and
then flip the correct bit to a 1
. We define the left most bit to be
position 0, and represent the first day in the year.
There are two different ways to initialize a bit array with all zeros (as I understood it), one that takes 2µs (micro seconds):
a = N_bits * bitarray('0')
another one that is almost 20 times faster, with 115ns.
a = bitarray(N_bits).setall(False)
However, this last method did not work reliably for me, either due to bug, or me doing something silly/being tired, so I opted for the first method. Besides, initializing the bitarray is not the bottle neck, nor the interesting part.
I did not find any support for shifting bits >>
and <<
operators, which
might be the traditional way to shift a bit to the correct position, but I
can access the n
-th bit in the bitarray element using the [n]
accessor,
thus a function that converts a Timestamp or DateTime object to bitarray
representation is:
from bitarray import bitarray def date_to_bit(date): """ Convert DateTime or Timestamp to bitarray representation Parameters ---------- date : datetime Returns ------- bitarray object N_bits bits long (366), with one bit flipped """ a = N_bits * bitarray('0') # flip bit of year (dayofyear [1,356]): # dayofyear = date.dayofyear # only for Timestamp (fast) dayofyear = int(date.strftime('%j')) # for Timestamp & datetime a[dayofyear-1] = True return a
if date
is a Timestamp object we could use date.dayofyear
to find the
position in the array to flip to one, but I use strftime('%j')
instead,
as it works on both DateTime and Timestamp objects. These give January 1st
as day “one”, so need to offset by -1
, since python is 0-indexed (unlike
e.g. Lua or Fortran).
2.3. Constructing the Bitmask
In the holiday package, every date that is a holiday is stored in a dict-like object:
In [1]: import holidays In [2]: holidays.USA(years=2019) Out[2]: {datetime.date(2019, 1, 1): "New Year's Day", datetime.date(2019, 1, 21): 'Martin Luther King Jr. Day', datetime.date(2019, 2, 18): "Washington's Birthday", datetime.date(2019, 5, 27): 'Memorial Day', datetime.date(2019, 7, 4): 'Independence Day', datetime.date(2019, 9, 2): 'Labor Day', datetime.date(2019, 10, 14): 'Columbus Day', datetime.date(2019, 11, 11): 'Veterans Day', datetime.date(2019, 11, 28): 'Thanksgiving', datetime.date(2019, 12, 25): 'Christmas Day'}
It varies from country to country if Sunday and Saturday are included, so we can flip every seven bit, assuming we know the starting position for the first Sunday and/or Saturday for the year we want to encode:
bit_mask = 366 * bitarray('0') # Initialize empty bitstring bit_mask[first_saturday::7] = True # Saturdays bit_mask[first_saturday+1::7] = True # Sundays
Results in
bitarray('011000001100000.....')
We can create many masks, i.e. one for Sundays, one for Saturdays, and one for the other holidays, and combine them using a bitwise OR operation. Below this is demonstrated for holidays found in the dictionary:
import holidays for key, val in holidays.USA(years=year).items(): bit_date = N_bits * bitarray('0') # Initialize empty bitmask of lenght N_bits dayinyear = int(key.strftime('%j')) # Find index of day in a year, in interval [1,366] bit_date[dayinyear - 1] = True # Flip the correct bit to '1' bit_mask = bit_mask | bit_date # OR the mask for that day, to the other
2.4. Benchmark
Although I’m not showing the exact code I used to build my bitmask, the new implementation of the first code, shown in the Introduction, has more lines of code, but it will be worth it:
def is_workday(date): """True if weekday, false if red or weekend Parameters ---------- date : datetime A date time object / or timestamp Returns ------- bool True if it's (approximately) a normal working (bank) day """ bit_date = date_to_bit(date) # convert date to bitarray object global bit_masks # dictionary of bit-masks for every year year = date.year # get year, e.g. "2019" if year not in bit_masks: # if a new year, construct new bitmask: bit_masks[year] = build_bitmask(year) # Bitwise AND on date and mask, is there a '1' in the resulting bitstring? return not bool((bit_date & bit_masks[year]).count())
The bitmask for each year is only constructed once, and can then be stored
in a bit_masks
dict, keyed on year. Then our 57 million dates can just be
represented as 366 long bitarrays.
To benchmark, we select 10k rows from a our pandas dataframe, in iPython:
In [1251]: %time tmp.time.apply(lambda x: is_workday(x)) CPU times: user 3.46 s, sys: 8 µs, total: 3.46 s Wall time: 3.47 s
3.47 seconds for 10k lines, if apply
be the method of our choosing, which
could be debated.
After re-defining our is_workday
to use our new bitarrays implementation,
we get:
In [1253]: %time tmp.time.apply(lambda x: is_workday(x)) CPU times: user 62.4 ms, sys: 0 ns, total: 62.4 ms Wall time: 61.9 ms
61.9 ms, i.e. 56 times faster! As a bonus, we got to learn how to use efficient bitarrays, which can come in handy for one-hot encoding of sets or creating digital DNA for genetic evolution, or some other neat thing.
3. Epilogue
In hindsight, for my problem, I could have used caching instead, as the number of dates to look-up vastly outnumber the number of unique dates to compute. I actually implemented this as well, so the bit array logic above only runs to populate a lookup dictionary.
Also, in this post I’ve simplified the actual the problem I faced, for brevity.