UP | HOME
Impaktor

Impaktor

Inject Emacs intravenously

Python bitarray - 60x speedup for comparisons
Published on Dec 24, 2020 by Impaktor.

This is intended to show how to use bitarray/bitstrings in python, on a beginner friendly way, to speed up code, by using more efficient data representation

Introduction — The problem to solve

Recently I found myself needing to speed up a bottle neck in my code, that was essentially querying a dictionary if it contained a specified key (datetime object), or specifically a function:

def is_workday(date):
    """True if weekday, false if weekend or holiday/red day.
    Input is Timestamp or DateTime object"""
    import holidays

    # Saturday and Sunday are index 5, 6
    return date.weekday() < 5 and date not in holidays.USA(years=date.year)

where I am using the holidays package, which defines a dictionary of holidays for each year and country. I have 57 million dates that need to be checked, and this would take ~30 hours where the above code was the main bottle neck.

So is there a faster way to see if date is in the dictionary? There are several (I’ll mention some in the end), but we’ll look at bit representation here.

The solution — bitarray

What is a bit-array and a bit-mask?

The reader is familiar that computers represents numbers as “ones and zeros”. E.g. the first numbers represented in binary as (here using 3 bits):

Decimal Binary
0 000
1 001
2 010
3 011
4 100
5 101
6 110
7 111

But what if we instead interpret, for instance, “2” to literally mean “string of all zeros, where second-to-last bit is flipped to 1” to represent something other than the number “2”? This is quite common practice among C/C++ programming, but not something I’ve seen done much in the python community, thus this write up.

So instead of asking date in dictionary_of_holidays, we represent a date as a bitarray, and then use a bit-mask and bitwise AND operator.

E.g. to check if the second day in a year is on a weekend or not, could do something like the following:

    01000000000000...
AND 00000110000011...
  = 00000000000000...

where the top string has a 1 at the second bit (where we now define that we count positions starting from left), representing the second day, and we then operate on this bitarray with a bitwise AND together with the bit-mask, where every (two) 7th position is 1, for Saturday and Sunday (assuming that year started on a Monday), and the resulting output has a ’1’ if true (i.e. “the second day is on a weekend”), and all zeros if false (“the second day of the year is on a working day”).

Naturally, we want to also include holidays in the bit-mask string, and encode these in the right relative starting positions, which will be different each year.

Implementation

Since every fourth year in our Gregorian calendar has an extra day, the number of bits in our bitarrays will be:

N_bits = 366

i.e. for the other three years, the last bit will be unused.

We will use the bitarrays (v. 1.6.1) package, and initialize the bitarray representation of a date as an N_bits long string with all zeros, and then flip the correct bit to a 1. We define the left most bit to be position 0, and represent the first day in the year.

There are two different ways to initialize a bit array with all zeros (as I understood it), one that takes 2µs (micro seconds):

a = N_bits * bitarray('0')

another one that is almost 20 times faster, with 115ns.

a = bitarray(N_bits).setall(False)

However, this last method did not work reliably for me, either due to bug, or me doing something silly/being tired, so I opted for the first method. Besides, initializing the bitarray is not the bottle neck, nor the interesting part.

I did not find any support for shifting bits >> and << operators, which might be the traditional way to shift a bit to the correct position, but I can access the n-th bit in the bitarray element using the [n] accessor, thus a function that converts a Timestamp or DateTime object to bitarray representation is:

from bitarray import bitarray

def date_to_bit(date):
    """
    Convert DateTime or Timestamp to bitarray representation

    Parameters
    ----------
    date : datetime

    Returns
    -------
    bitarray object
        N_bits bits long (366), with one bit flipped
    """

    a = N_bits * bitarray('0')

    # flip bit of year (dayofyear [1,356]):
    # dayofyear = date.dayofyear          # only for Timestamp (fast)
    dayofyear = int(date.strftime('%j'))  # for Timestamp & datetime
    a[dayofyear-1] = True
    return a

if date is a Timestamp object we could use date.dayofyear to find the position in the array to flip to one, but I use strftime('%j') instead, as it works on both DateTime and Timestamp objects. These give January 1st as day “one”, so need to offset by -1, since python is 0-indexed (unlike e.g. Lua or Fortran).

Constructing the Bitmask

In the holiday package, every date that is a holiday is stored in a dict-like object:

In [1]: import holidays
In [2]: holidays.USA(years=2019)
Out[2]:
{datetime.date(2019, 1, 1): "New Year's Day",
 datetime.date(2019, 1, 21): 'Martin Luther King Jr. Day',
 datetime.date(2019, 2, 18): "Washington's Birthday",
 datetime.date(2019, 5, 27): 'Memorial Day',
 datetime.date(2019, 7, 4): 'Independence Day',
 datetime.date(2019, 9, 2): 'Labor Day',
 datetime.date(2019, 10, 14): 'Columbus Day',
 datetime.date(2019, 11, 11): 'Veterans Day',
 datetime.date(2019, 11, 28): 'Thanksgiving',
 datetime.date(2019, 12, 25): 'Christmas Day'}

It varies from country to country if Sunday and Saturday are included, so we can flip every seven bit, assuming we know the starting position for the first Sunday and/or Saturday for the year we want to encode:

bit_mask = 366 * bitarray('0')        # Initialize empty bitstring
bit_mask[first_saturday::7] = True    # Saturdays
bit_mask[first_saturday+1::7] = True  # Sundays

Results in

bitarray('011000001100000.....')

We can create many masks, i.e. one for Sundays, one for Saturdays, and one for the other holidays, and combine them using a bitwise OR operation. Below this is demonstrated for holidays found in the dictionary:

import holidays

for key, val in holidays.USA(years=year).items():
    bit_date = N_bits * bitarray('0')    # Initialize empty bitmask of lenght N_bits
    dayinyear = int(key.strftime('%j'))  # Find index of day in a year, in interval [1,366]
    bit_date[dayinyear - 1] = True       # Flip the correct bit to '1'
    bit_mask = bit_mask | bit_date       # OR the mask for that day, to the other

Benchmark

Although I’m not showing the exact code I used to build my bitmask, the new implementation of the first code, shown in the Introduction, has more lines of code, but it will be worth it:

def is_workday(date):
    """True if weekday, false if red or weekend

    Parameters
    ----------
    date : datetime
        A date time object / or timestamp

    Returns
    -------
    bool
        True if it's (approximately) a normal working (bank) day
    """
    bit_date = date_to_bit(date)   # convert date to bitarray object

    global bit_masks               # dictionary of bit-masks for every year
    year = date.year               # get year, e.g. "2019"
    if year not in bit_masks:      # if a new year, construct new bitmask:
        bit_masks[year] = build_bitmask(year)

    # Bitwise AND on date and mask, is there a '1' in the resulting bitstring?
    return not bool((bit_date & bit_masks[year]).count())

The bitmask for each year is only constructed once, and can then be stored in a bit_masks dict, keyed on year. Then our 57 million dates can just be represented as 366 long bitarrays.

To benchmark, we select 10k rows from a our pandas dataframe, in iPython:

In [1251]: %time tmp.time.apply(lambda x: is_workday(x))
CPU times: user 3.46 s, sys: 8 µs, total: 3.46 s
Wall time: 3.47 s

3.47 seconds for 10k lines, if apply be the method of our choosing, which could be debated.

After re-defining our is_workday to use our new bitarrays implementation, we get:

In [1253]: %time tmp.time.apply(lambda x: is_workday(x))
CPU times: user 62.4 ms, sys: 0 ns, total: 62.4 ms
Wall time: 61.9 ms

61.9 ms, i.e. 56 times faster! As a bonus, we got to learn how to use efficient bitarrays, which can come in handy for one-hot encoding of sets or creating digital DNA for genetic evolution, or some other neat thing.

Epilogue

In hindsight, for my problem, I could have used caching instead, as the number of dates to look-up vastly outnumber the number of unique dates to compute. I actually implemented this as well, so the bit array logic above only runs to populate a lookup dictionary.

Also, in this post I’ve simplified the actual the problem I faced, for brevity.