 # Impaktor

Inject Emacs intravenously

# Python bitarray - 60x speedup for comparisons Published on Dec 24, 2020 by Impaktor.

This is intended to show how to use bitarray/bitstrings in python, on a beginner friendly way, to speed up code, by using more efficient data representation

## Introduction — The problem to solve

Recently I found myself needing to speed up a bottle neck in my code, that was essentially querying a dictionary if it contained a specified key (datetime object), or specifically a function:

```def is_workday(date):
"""True if weekday, false if weekend or holiday/red day.
Input is Timestamp or DateTime object"""
import holidays

# Saturday and Sunday are index 5, 6
return date.weekday() < 5 and date not in holidays.USA(years=date.year)
```

where I am using the holidays package, which defines a dictionary of holidays for each year and country. I have 57 million dates that need to be checked, and this would take ~30 hours where the above code was the main bottle neck.

So is there a faster way to see if `date` is in the dictionary? There are several (I’ll mention some in the end), but we’ll look at bit representation here.

## The solution — bitarray

### What is a bit-array and a bit-mask?

The reader is familiar that computers represents numbers as “ones and zeros”. E.g. the first numbers represented in binary as (here using 3 bits):

Decimal Binary
0 000
1 001
2 010
3 011
4 100
5 101
6 110
7 111

But what if we instead interpret, for instance, “2” to literally mean “string of all zeros, where second-to-last bit is flipped to 1” to represent something other than the number “2”? This is quite common practice among C/C++ programming, but not something I’ve seen done much in the python community, thus this write up.

So instead of asking `date in dictionary_of_holidays`, we represent a date as a bitarray, and then use a bit-mask and bitwise AND operator.

E.g. to check if the second day in a year is on a weekend or not, could do something like the following:

```    01000000000000...
AND 00000110000011...
= 00000000000000...
```

where the top string has a `1` at the second bit (where we now define that we count positions starting from left), representing the second day, and we then operate on this bitarray with a bitwise AND together with the bit-mask, where every (two) 7th position is 1, for Saturday and Sunday (assuming that year started on a Monday), and the resulting output has a ’1’ if true (i.e. “the second day is on a weekend”), and all zeros if false (“the second day of the year is on a working day”).

Naturally, we want to also include holidays in the bit-mask string, and encode these in the right relative starting positions, which will be different each year.

### Implementation

Since every fourth year in our Gregorian calendar has an extra day, the number of bits in our bitarrays will be:

```N_bits = 366
```

i.e. for the other three years, the last bit will be unused.

We will use the bitarrays (v. 1.6.1) package, and initialize the bitarray representation of a date as an `N_bits` long string with all zeros, and then flip the correct bit to a `1`. We define the left most bit to be position 0, and represent the first day in the year.

There are two different ways to initialize a bit array with all zeros (as I understood it), one that takes 2µs (micro seconds):

```a = N_bits * bitarray('0')
```

another one that is almost 20 times faster, with 115ns.

```a = bitarray(N_bits).setall(False)
```

However, this last method did not work reliably for me, either due to bug, or me doing something silly/being tired, so I opted for the first method. Besides, initializing the bitarray is not the bottle neck, nor the interesting part.

I did not find any support for shifting bits `>>` and `<<` operators, which might be the traditional way to shift a bit to the correct position, but I can access the `n`-th bit in the bitarray element using the `[n]` accessor, thus a function that converts a Timestamp or DateTime object to bitarray representation is:

```from bitarray import bitarray

def date_to_bit(date):
"""
Convert DateTime or Timestamp to bitarray representation

Parameters
----------
date : datetime

Returns
-------
bitarray object
N_bits bits long (366), with one bit flipped
"""

a = N_bits * bitarray('0')

# flip bit of year (dayofyear [1,356]):
# dayofyear = date.dayofyear          # only for Timestamp (fast)
dayofyear = int(date.strftime('%j'))  # for Timestamp & datetime
a[dayofyear-1] = True
return a
```

if `date` is a Timestamp object we could use `date.dayofyear` to find the position in the array to flip to one, but I use `strftime('%j')` instead, as it works on both DateTime and Timestamp objects. These give January 1st as day “one”, so need to offset by `-1`, since python is 0-indexed (unlike e.g. Lua or Fortran).

In the holiday package, every date that is a holiday is stored in a dict-like object:

```In : import holidays
In : holidays.USA(years=2019)
Out:
{datetime.date(2019, 1, 1): "New Year's Day",
datetime.date(2019, 1, 21): 'Martin Luther King Jr. Day',
datetime.date(2019, 2, 18): "Washington's Birthday",
datetime.date(2019, 5, 27): 'Memorial Day',
datetime.date(2019, 7, 4): 'Independence Day',
datetime.date(2019, 9, 2): 'Labor Day',
datetime.date(2019, 10, 14): 'Columbus Day',
datetime.date(2019, 11, 11): 'Veterans Day',
datetime.date(2019, 11, 28): 'Thanksgiving',
datetime.date(2019, 12, 25): 'Christmas Day'}
```

It varies from country to country if Sunday and Saturday are included, so we can flip every seven bit, assuming we know the starting position for the first Sunday and/or Saturday for the year we want to encode:

```bit_mask = 366 * bitarray('0')        # Initialize empty bitstring
```

Results in

```bitarray('011000001100000.....')
```

We can create many masks, i.e. one for Sundays, one for Saturdays, and one for the other holidays, and combine them using a bitwise OR operation. Below this is demonstrated for holidays found in the dictionary:

```import holidays

for key, val in holidays.USA(years=year).items():
bit_date = N_bits * bitarray('0')    # Initialize empty bitmask of lenght N_bits
dayinyear = int(key.strftime('%j'))  # Find index of day in a year, in interval [1,366]
bit_date[dayinyear - 1] = True       # Flip the correct bit to '1'
```

### Benchmark

Although I’m not showing the exact code I used to build my bitmask, the new implementation of the first code, shown in the Introduction, has more lines of code, but it will be worth it:

```def is_workday(date):
"""True if weekday, false if red or weekend

Parameters
----------
date : datetime
A date time object / or timestamp

Returns
-------
bool
True if it's (approximately) a normal working (bank) day
"""
bit_date = date_to_bit(date)   # convert date to bitarray object

year = date.year               # get year, e.g. "2019"
if year not in bit_masks:      # if a new year, construct new bitmask:

# Bitwise AND on date and mask, is there a '1' in the resulting bitstring?
```

The bitmask for each year is only constructed once, and can then be stored in a `bit_masks` dict, keyed on year. Then our 57 million dates can just be represented as 366 long bitarrays.

To benchmark, we select 10k rows from a our pandas dataframe, in iPython:

```In : %time tmp.time.apply(lambda x: is_workday(x))
CPU times: user 3.46 s, sys: 8 µs, total: 3.46 s
Wall time: 3.47 s
```

3.47 seconds for 10k lines, if `apply` be the method of our choosing, which could be debated.

After re-defining our `is_workday` to use our new bitarrays implementation, we get:

```In : %time tmp.time.apply(lambda x: is_workday(x))
CPU times: user 62.4 ms, sys: 0 ns, total: 62.4 ms
Wall time: 61.9 ms
```

61.9 ms, i.e. 56 times faster! As a bonus, we got to learn how to use efficient bitarrays, which can come in handy for one-hot encoding of sets or creating digital DNA for genetic evolution, or some other neat thing.

## Epilogue

In hindsight, for my problem, I could have used caching instead, as the number of dates to look-up vastly outnumber the number of unique dates to compute. I actually implemented this as well, so the bit array logic above only runs to populate a lookup dictionary.

Also, in this post I’ve simplified the actual the problem I faced, for brevity.