Inject Emacs intravenously

On BERT for non-English, finding data, and the fine art of web-scraping
Published on Oct 07, 2020 by Impaktor.

One person’s data is another person’s noise.

K. C. Cole


These are notes from my exploration of BERT (Bidirectional Encoder Representations from Transformers): learning how it works, finding multi-language sources for training a BERT language model, and fine-tuning it. The specific goal is to train a transformer model, preferably BERT, for Norwegian (bokmål), but the methods and data sources apply to many other non-English languages.

Data sources for non-English sentiment analysis

A collection of data resources for training sentiment analysis, or simply raw text for training a language model such as BERT.

Multilanguage source (incl. Norwegian)

General advice for finding a raw text corpus in your language; look up:

  • NLP groups in universities of your country. What have they done?
  • NLP projects in the national/royal library. Have they compiled any corpus?
  • NLP projects in prominent book/news publishing houses.
  • Transcripts of speeches and/or proceedings from parliament/bureaucracy of your country or public administration documents

Specific multilanguage sources:

  • Leipzig Corpora Newspaper texts or texts randomly collected from the web, foreign sentences removed
  • Open subtitles, in total 62 languages, 22G tokens: here, with parser for downloading and extracting tokens into one corpus.
  • TenTen Corpus Family Corpora for 40 languages, targeting 10+ billion words per language, with text from the web; 2.47 billion words in bokmål. The source is a “linguistically valuable corpus” curated from web content.
  • TalkOfEurope Project from 2016, that covers all plenary debates held in the European Parliament (EP) between 1999-07 – 2014-01. Access to data here (2.4GB compressed).
  • Tatoeba collection of sentences and translations. Their default sentence.csv dataset consists of 8M sentences (500MB), in various languages. Used by Facebook research in their LASER paper, their processed data in github here.
  • Common Crawl web crawl data, with 40+ languages supported, released monthly. About 250 TB of data stored on AWS in WARC format, from billions of web pages. Note: it might need substantial cleaning before pre-training BERT. See examples to get started. (Disclaimer: I have not worked with this data set myself.)
  • OSCAR Open Super-Large Crawl Data, for many languages, see table.

    Table 1: Example of some of the data from OSCAR that is of interest to me
    Language          | Words (original) | Size (original) | Words (deduplicated) | Size (deduplicated)
    Norwegian         | 1,344,326,388    | 8.0G            | 804,894,377          | 4.7G
    Norwegian Nynorsk | 14,764,980       | 85M             | 9,435,139            | 54M
    Swedish           | 7,155,994,312    | 44G             | 4,106,120,608        | 25G
    Finnish           | 3,196,666,419    | 27G             | 1,597,855,468        | 13G
    Danish            | 2,637,463,889    | 16G             | 1,620,091,317        | 9.5G
    Esperanto         | 48,486,161       | 299M            | 37,324,446           | 228M
    English           | 418,187,793,408  | 2.3T            | 215,841,256,971      | 1.2T
    French            | 46,896,036,417   | 282G            | 23,206,776,649       | 138G
  • Project Gutenberg: based on 3000 books in English, but worth investigating if there’s something similar for other languages as well?
  • Wikipedia-dumps, plus wiki-extractor.py. Note: see the warning about using Wikipedia data further down.
  • WikiText the source which multilanguage BERT is trained on. 100M tokens, 110 times larger than Penn Treebank (PTB) (Or is this English only?)
  • Homemade BookCorpus Are there non-English books in this data set of 200k books in raw txt?


  • conceptnet-numberbatch The best pre-computed word embeddings you can use, word embeddings in 78 different languages.
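As a starting point for the Tatoeba dump mentioned above, here is a minimal sketch of filtering one language out of the sentence file. It assumes the file is tab-separated with three columns (sentence id, language code, text) and that Norwegian bokmål uses the code "nob"; both are assumptions to check against the actual download. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def sentences_for_language(lines, lang_code):
    """Yield the text of every row whose language column equals lang_code.

    Assumes tab-separated rows: sentence id, language code, text.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 3 and row[1] == lang_code:
            yield row[2]


# Tiny in-memory sample standing in for the real file (made-up rows).
sample = io.StringIO(
    "1\teng\tThis is a test.\n"
    "2\tnob\tDette er en test.\n"
    "3\tswe\tDet här är ett test.\n"
    "4\tnob\tJeg liker bøker.\n"
)
bokmaal = list(sentences_for_language(sample, "nob"))
print(bokmaal)
```

For the real file, an open file handle (opened with encoding="utf-8" and newline="") can be passed in place of the StringIO object, so the 500MB file is streamed rather than loaded at once.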

Norwegian specific sources

  • NoReC: The Norwegian Review Corpus 15M tokens, 35.2k full text reviews scored 1-6, from newspapers 2003–2017, created by the SANT project. The data is stored in CoNLL-U format. This can be read from python using the following options:
    • pyconll — Fetches review body but not head(?), where the review score is.
    • conllu — Smaller than pyconll. Problem converting to raw text?
    • norec — They provide their own data reading library, I found this to be the simplest to get both text and review score out
  • talk-of-norway Norwegian Parliament Corpus 1998–2016, with both metadata, and raw text from speeches.
  • noTenTen: Corpus of the Norwegian Web, 2.47 billion words (bokmål), 44 million (nynorsk).
  • bokelskere.no, has book review dataset (local goodreads.com equivalent), about 180k reviews, however, only 20% have set a score.
  • www.legelisten.no site for doctor, psychiatrist, and dentist reviews. I’ve scraped this manually (47000 reviews, from 4400 unique doctors, as of 2020-09); data not provided, but the method is described further down.
  • Språkbanken from Nasjonalbiblioteket. Looks to have many resources; I need to look into it.
  • Aftenposten archive with more than 150 years’ worth of articles. Requires an account on aftenposten, and likely not in “raw” format; perhaps one can buy a data dump from them?
  • Norwegian NLP Resources github readme with collection of NLP resources for Norwegian NLP. Great resource.
  • Wikipedia.no Norwegian wikipedia, (maybe use wikiextractor).
  • Other resources, e.g. Research groups at University in Oslo (UiO):
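Regarding the question above about getting raw text out of CoNLL-U files such as NoReC: the format is simple enough that a rough extraction needs no library at all. The sketch below assumes standard CoNLL-U (tab-separated columns with the token id first and the word form second, "#" comment lines, blank lines between sentences). The function name and the sample are made up, and joining tokens with spaces does not restore the original spacing around punctuation.

```python
def conllu_to_sentences(text):
    """Return one space-joined string of word forms per CoNLL-U sentence."""
    sentences, words = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if words:
                sentences.append(" ".join(words))
                words = []
            continue
        if line.startswith("#"):
            continue  # comment / metadata line
        cols = line.split("\t")
        token_id = cols[0]
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        words.append(cols[1])  # second column holds the word form
    if words:
        sentences.append(" ".join(words))
    return sentences


# Made-up two-sentence sample; most annotation columns are left as "_".
sample = (
    "# sent_id = 1\n"
    "1\tBoka\tbok\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "2\ter\tvære\tAUX\t_\t_\t_\t_\t_\t_\n"
    "3\tbra\tbra\tADJ\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
    "1\tAnbefales\tanbefale\tVERB\t_\t_\t_\t_\t_\t_\n"
)
sentences = conllu_to_sentences(sample)
print(sentences)
```

This only recovers the text; the review score in NoReC lives in separate metadata, which is why the corpus' own reader is still the simplest route to text plus score.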

Warning about training on Wikipedia

A word of caution when training on wikipedia text, illustrated by example below:


Figure 1: Relative number of wikipedia articles by language

The two biggest non-English languages on Wikipedia are Cebuano (a language from the Philippines) and Swedish. How come? Because the contributor Sverker Johansson (Swedish, with a wife from the Philippines) has written a bot for automatic article generation in these languages, which constructs sentences filled in with data from databases, e.g.

The average giraffe is [height] meters tall and weighs [mass] kg.

i.e. data not good for training NLP.

Web scraping

To train a sentiment classifier we need text (typically a review) + score / emoticon (thumbs-up/down). Some suggestions on reviews that can be found online in any language:

  • restaurants, hotels (yelp?), rental agencies
  • film, books, store products
  • apps, software,
  • videos, music
  • social media
  • services: healthcare, construction, education
  • user reviews on auction sites / market places

Google maps

Google maps collects a lot of reviews for all kinds of establishments, like restaurants, hotels, parks, landmarks, continents, and weirder things.

There seem to be many options for web scraping, but so far I have not found any way of doing it for free. The available options:

Google Maps Places API
Google provides its own API that is supposedly very straightforward to use and well documented; it has a pay-as-you-go pricing model.
Google Maps Scraper

A GitHub repo that extracts reviews from a file urls.txt containing coordinates, i.e. the problem is now to compile the csv-file with locations. Usage:

python scraper.py --N 50 -i url_input_file.txt

generates a csv file containing last 50 reviews of places present in urls.txt

Google Maps Crawler
From the company botsol, which offers a free trial before you buy.
Third party scraping “made easy” (no coding needed). Offers a 14-day free trial. Tutorial here for scraping data in Google Maps, and template here.
The module can scrape the map at different zoom levels: country, state, city. But must it run on their platform? Can also scrape Instagram, Facebook, etc. There’s a free tier “developer” package.
Google Map Extractor
From Ahmadsoftware. Made to find business leads; extracts rating (but not review text?), Windows only?

Google apps reviews

To get software reviews on apps, there is a readily available python tool for scraping reviews from Google play appstore.

pip install google_play_scraper

There’s a chapter in the free online book Get SH*T Done with PyTorch here, code is also mirrored on colab.

Manual webscraping

There are a number of ways to do web scraping using python.


Easiest for me, and most bare-bones, is BeautifulSoup. See its documentation and/or tutorials.


For fancy webpages that rely on JavaScript shenanigans, such that you have to interact with the page to get the data (e.g. scroll or click to activate elements), selenium is the tool to use, as it basically drives a (headless) browser with which you can simulate this.


Scrapy has all the bells and whistles. One defines various classes that are then loaded by scrapy. I spent some time reading the documentation, but the threshold was too high to get over just to start using it, compared to what I had in mind. Probably worth the time investment if one embarks on a larger web scraping endeavour.

The Final Cleansing

Often web-scraped reviews/pages are “polluted” by other languages, e.g. Norwegian reviews typically also contain Danish, Swedish and English (or worse: the language has two different written forms, bokmål and nynorsk).

For example, the bokelskere-set has the following distribution:

Table 2: Distribution of languages in Norwegian data set for book reviews
Language            | Records | Relative
Norwegian - bokmål  | 184651  | 0.853
Norwegian - nynorsk | 21004   | 0.097
English             | 7174    | 0.033
Danish              | 2459    | 0.011
Swedish             | 1229    | 0.006

This can be filtered out using e.g. polyglot. For Norwegian, evaluate the function below once to download the needed resources:

from polyglot.downloader import downloader
from polyglot.text import Text
from polyglot.detect import Detector

def download_polyglot():
    "Helper function with examples for downloading the resources we need"


    # show supported modules for language

    # check which languages are supported in sentiment analysis

    # show downloaded packages from polyglot

    # download:

    # Download Norwegian, bokmaal & nynorsk; & English
    downloader.download("sentiment2.no", quiet=False)
    downloader.download("sentiment2.nn", quiet=False)
    downloader.download("sentiment2.en", quiet=False)

    for model in ['sentiment2', 'embeddings2', 'ner2']:
        for lang in ['no', 'nn', 'en']:
            downloader.download(f"{model}.{lang}", quiet=False)

Then we can apply the following filter on a pandas data frame:

import pandas as pd
from polyglot.detect import Detector

def filter_norwegian(df):
    """Return masks for norwegian: bokmaal, and nynorsk

    Parameters
    ----------
    df: PandasSeries
        Column of review text

    Returns
    -------
    no: PandasSeries
        Masking column for bokmaal, Boolean
    nn: PandasSeries
        Masking column for nynorsk, Boolean
    """
    # Filter out non-norwegian
    # df['lang'] = df.text.apply(lambda x: Detector(x).language.code)

    l = []
    for t in df.values:
        try:
            l.append(Detector(t, quiet=True).language.code)
        except Exception:
            # Assumed handling: keep the row but mark the language as
            # unknown, so the masks stay aligned with the input
            l.append(None)

    # Reuse the input index so the masks can be applied directly to df
    lang = pd.Series(l, index=df.index)

    nn = lang == 'nn'  # nynorsk
    no = lang == 'no'  # bokmaal
    return no, nn



Summary of summary:

  • BPE counts the frequency of each word in the training corpus. It then starts from the list of all characters and learns merge rules that form a new token from two symbols in the vocabulary, until it has a vocabulary of the desired size (a hyperparameter to pick).
  • Byte-level BPE operates on bytes, so no <UNK> tokens are needed, and it is more compact/efficient than BPE.
  • WordPiece differs from BPE in that it doesn’t choose the most frequent pair, but the one that maximizes the likelihood of the corpus once merged. It’s subtly different from what BPE does, in the sense that it evaluates what it “loses” by merging two symbols and makes sure it’s worth it.
  • Unigram Instead of merging to form words, it starts with words and splits them. Not used directly, only as a part of SentencePiece.
  • SentencePiece deals with the fact that not all languages separate words with spaces. Used in ALBERT and XLNet.
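To make the BPE bullet above concrete, here is a toy sketch of the merge-learning loop: count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, repeat. It is a simplification for illustration only (no end-of-word marker, ties broken by first occurrence), not the implementation used by any particular tokenizer library; the function names and the word counts are made up.

```python
import collections


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a word -> count mapping."""
    # Start from single characters, as in the description above.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)
```

In a real tokenizer the number of merges is set indirectly through the desired vocabulary size, which is the hyperparameter mentioned above.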

Transformer libraries

Learning BERT (PyTorch only)

BERT originates from Google, and the first models are available in TensorFlow, e.g. “Fine-tuning a BERT model” (colab), but due to my own preference for PyTorch over TensorFlow (TF), I’ve mainly been playing with the huggingface library which is the main implementation for transformer models, in python.

For theoretical/visual tutorial, Jay Alammar’s blog posts are the “must reads”, e.g.

Recently, there was a paper published on arXiv (via Hacker news discussion):

0. Minimal BERT example

The huggingface tutorial starts simple with the smallest possible code sample for BERT for sentiment analysis by using pre-trained English model, without fine-tuning:

from transformers import pipeline

nlp = pipeline("sentiment-analysis")

for s in ["I despise you", "I've taken a liking to you"]:
    result = nlp(s)[0]
    print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

It’s worth also looking into the following HuggingFace resources:

1. Chris McCormick’s tutorial

The best hands-on tutorial using huggingface is Chris McCormick’s tutorial, also available as video. It’s very similar in structure to PyTorch From First Principles, and excellent to do after it.

It follows the PyTorch way, by explicitly implementing the optimizer, scheduler, and data loader, i.e. quite verbose but clear, compared to the more recent way of relying on the huggingface Trainer class. He makes use of BertTokenizer and BertForSequenceClassification.

The dataset is The Corpus of Linguistic Acceptability (CoLA). It’s a set of sentences labeled as grammatically correct or incorrect, i.e. binary single sentence classification.

Also worth looking into is his tutorial on BERT word embeddings. A “normal” embedding comes from just a single layer, but in BERT we have 12+1 layers (“+1” due to the input embedding layer), so how should a word be represented? Different layers encode different kinds of information.

There is a library, bert-as-service, that already extracts word embeddings from BERT in a “smart way”:

…uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations in just two lines of code.

Spoiler: average the embeddings of the word pieces (sub-word tokens) that make up the word, typically taken from the second-to-last layer. Also note: words are not uniquely represented, e.g. “bank” gets different embeddings in “river bank” and “money bank”.
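The averaging step itself is simple and can be sketched without BERT at all. Below, made-up 2-dimensional vectors stand in for the hidden states one would take from the model, and the "##" prefix is assumed to mark a continuation piece, as in BERT's WordPiece vocabulary. The function name is hypothetical.

```python
def average_wordpieces(tokens, vectors):
    """Merge '##' continuation pieces into words and average their vectors."""
    words, grouped = [], []
    for token, vec in zip(tokens, vectors):
        if token.startswith("##") and words:
            # Continuation piece: extend the previous word.
            words[-1] += token[2:]
            grouped[-1].append(vec)
        else:
            words.append(token)
            grouped.append([vec])
    # Average each group of piece vectors dimension by dimension.
    averaged = [
        [sum(dim) / len(group) for dim in zip(*group)]
        for group in grouped
    ]
    return words, averaged


tokens = ["em", "##bed", "##ding", "layer"]
vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [5.0, 5.0]]  # toy values
words, averaged = average_wordpieces(tokens, vectors)
print(words, averaged)
```

With real model output the vectors would have the hidden size of the model (768 for BERT-base) and would be taken per token from the chosen layer.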

2. Get SH*T Done with PyTorch

The next logical BERT tutorial is from the free online book Get SH*T Done with PyTorch, specifically the chapter “Sentiment Analysis with BERT and Transformers by Hugging Face using PyTorch and Python”.

This is interesting, since he doesn’t use the BertForSequenceClassification model (a normal BERT + linear layer on top), like McCormick & co, but rather the bare-bones language model found in BertModel, which he wraps manually in a PyTorch model class: he takes the pooled output of BertModel and connects that to a fully connected layer on top.

Code sample:

from torch import nn
from transformers import BertModel

class SentimentClassifier(nn.Module):
    """Note that we are returning the raw output of the last layer since
    that is required for the cross-entropy loss function in PyTorch to
    work.

    Will work like any other PyTorch model
    """

    def __init__(self, n_classes):
        super().__init__()
        # PRE_TRAINED_MODEL_NAME is assumed to be defined elsewhere
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        # dropout for regularization:
        self.drop = nn.Dropout(p=0.3)
        # fully connected for output
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        # Tuple unpacking assumes the model returns a tuple; newer
        # transformers versions need return_dict=False for that
        _, pooled_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        output = self.drop(pooled_output)
        return self.out(output)

# Can now use it like any other PyTorch model
# (class_names and device are assumed to be defined elsewhere)
model = SentimentClassifier(len(class_names))
model = model.to(device)

3. Other/Misc BERT tutorials

3.1 Testing multiple models, and 16 bit floats

A Hands-On Guide To Text Classification With Transformer Models (XLNet, BERT, XLM, RoBERTa)

  • Fine tune BERT testing multiple models, BertForSequenceClassification, XLMForSequenceClassification, RobertaForSequenceClassification
  • Code is messy, and in addition depends on external files, from previous tutorials.
  • Data set is Yelp reviews (& src on colab)
  • Interesting: he uses from apex import amp, from Nvidia’s mixed precision library, to switch to fp16 for a reduced memory footprint and faster fine-tuning runs

3.2 Intent classification tutorial & quick view of BERT’s layers

BERT for dummies — Step by Step Tutorial (colab)

Doesn’t do sentiment classification, but rather intent classification (but still using the BertForSequenceClassification model). The use case is to find the intent of a search query. The data set is ATIS (Airline Travel Information System), which consists of 26 highly imbalanced classes that are augmented using the SMOTE algorithm (but this fails). The tutorial depends on a previous article for reading in the ATIS data.

They use an LSTM as a baseline to benchmark the BERT model against. (The same author also did a tutorial using the same data set, called Natural Language Understanding with Sequence to Sequence Models.)

Explores the components of the BertForSequenceClassification model:

We can see the BertEmbedding layer at the beginning, followed by a Transformer architecture for each encoder layer: BertAttention, BertIntermediate, BertOutput. At the end, we have the Classifier layer.

For some reason this tutorial also demonstrates using pad_sequences from keras.preprocessing.sequence to pad the sequences, rather than using the tokenizer’s padding option.

3.3 Using Simpletransformer package

Simple Transformers — Introducing The Easiest Way To Use BERT, RoBERTa, XLNet, and XLM

Author of the python package simpletransformers demonstrates how to use it, on the Yelp review data set for sentiment analysis.

It seems to hide too much under the hood, for my taste, but might be interesting for people interested in minimal code / maximum code wrapping.

3.4 Using TorchText with tokenizer

BERT Text Classification Using Pytorch (Main code colab & processing colab)

This tutorial has pretty clean, straightforward code. It does text classification on the kaggle REAL and FAKE News dataset (i.e. will use BinaryCrossEntropy loss function). What might be of interest here is that it uses TorchText to create the Text Field (news article) and the Label Field (target), and shows how to make it use the BERT tokenizer rather than build its own vocabulary:

from torchtext.data import Field, TabularDataset, BucketIterator, Iterator
# In order to use BERT tokenizer with TorchText, we have to set use_vocab=False and tokenize=tokenizer.encode
text_field = Field(use_vocab=False, tokenize=tokenizer.encode, lower=False, include_lengths=False,
                   batch_first=True, fix_length=MAX_SEQ_LEN, pad_token=PAD_INDEX, unk_token=UNK_INDEX)
label_field = ...
fields = [('label', label_field), ('title', text_field), ('text', text_field), ('titletext', text_field)]
train, valid, test = TabularDataset.splits(path=source_folder, train='train.csv', validation='valid.csv',
                                           test='test.csv', format='CSV', fields=fields, skip_header=True)

Also does a minimal wrapping of BERT in its own class, and also implements saving/loading of checkpoints:

from torch import nn
from transformers import BertForSequenceClassification

class BERT(nn.Module):

    def __init__(self):
        super(BERT, self).__init__()

        options_name = "bert-base-uncased"
        self.encoder = BertForSequenceClassification.from_pretrained(options_name)

    def forward(self, text, label):
        loss, text_fea = self.encoder(text, labels=label)[:2]
        return loss, text_fea

3.5 “Lost in translation - Found by transformer”

Lost in Translation. Found by Transformer. BERT Explained.

This is simply a re-hashing of Jay Alammar’s famous blog post “The Illustrated Transformer”.

4. Using new Trainer() method

New huggingface transformers class (Trainer) that abstracts away a lot of the boilerplate code for the training loop. All functionality is still available through arguments.

See section Multilanguage BERT for tutorial that uses it.

5. Use Auto class for easier experimentation

The Auto classes of huggingface transformers automatically select the correct class for the model and tokenizer we want to use, by simply specifying the model_name.

from transformers import AutoTokenizer, AutoModel

On using the Auto class for easier model experimentation, see Automatic Model Loading using Hugging Face.

Also covered in the quicktour.

Non-English BERT

In the following, we explore the options available for using BERT on non-English languages

Multi-language BERT (mBERT)

In regard to multilanguage BERT’s performance on small languages, BotXO said the following in an interview:

The multilingual model performs poorly for languages such as the Nordic languages like Danish or Norwegian because of underrepresentation in the training data. For example, only 1% of the total amount of data constitutes the Danish text. To illustrate, the BERT model has a vocabulary of 120,000 words, which leaves room for about 1,200 Danish words. Here comes BotXO’s model, which has a vocabulary of 32,000 Danish words.

  • Google Research has a multilanguage BERT that is trained on 104 languages.
  • Hugging face has transformer models BERT, XLM, and XLM RoBERTa for multilanguage use, with the following checkpoints:
    • bert-base-multilingual-uncased 102 languages
    • bert-base-multilingual-cased 104 languages
    • xlm-roberta-base 100 languages, outperforms mBERT
    • xlm-roberta-large 100 languages, outperforms mBERT

(I do not yet have results when running these models)

LASER — Language-Agnostic SEntence Representations (facebook)

From facebook (blog, github) built on PyTorch, to do zero-shot transfer of NLP models from one language, such as English, to scores of others.

LASER sets a new state of the art on zero-shot cross-lingual natural language inference accuracy for 13 of the 14 languages in the XNLI corpus.

  • Trained on 93 languages, all using the same BiLSTM encoder
  • (…but tokenizer is language specific)
  • Is trained on sentence pairs from different languages, to embed all languages jointly in a single shared space.


One of the tasks of interest might be Application to cross-lingual document classification, where they train a classifier on English, then apply it to several other languages. However, most other applications for the model seem to be measuring distance in embedded space between the same sentence in different languages, or similar.

LaBSE — Language-Agnostic BERT Sentence Embedding (google)

From Google (arXiv, blog), in TF2 available on tfhub. The blog post states that accuracy decreases slowly with more languages added.

Idea is to leverage small corpus languages by multi language embedding:

… cross-lingual sentence embeddings for 109 languages. The model is trained on 17 billion monolingual sentences and 6 billion bilingual sentence pairs …resulting in a model that is effective even on low-resource languages for which there is no data available during training … interestingly, LaBSE even performs relatively well for many of the 30+ Tatoeba languages for which it has no training data (see below)

It can do various translations / sorting of translation matches.

(I do not have results on running this on any tasks)

TODO BotXO models (Nordic BERT models)

BotXO.ai is a Danish company (that develops chat bots?) that has trained its own Nordic BERT models (interview) using Google TPUs and Common Crawl data:

Next, to load these models, I’ve explored using the native TensorFlow-way (equals pain), and then PyTorch. The pre-trained model checkpoints contain the following files:


Loading models using Tensorflow (and giving up)

To load a model, follow general instructions google-research/bert (e.g. on GLUE-test data downloaded with script). I start with downloading their models, to make sure they work, but I ran into the following issues:

  • Requires TF between v1.15 and v1.11 –> must downgrade TF
  • pip for that TF version requires python <= v3.6 –> must downgrade python. I.e. I use python -m venv as described in previous post, with requirements:

  • Once this has been fixed, it cannot run on GPU, since the libcudnn library is incompatible with the pip-installed TF version. One must also install CUDA 10.1 or 10.0, and matching version of cuDNN. (Available in Arch linux User Repository).
  • TF is very particular (read: retarded) with file paths. It can not find model file in “path”, even when the “path” is returned correctly in the error message. To fix:
    • When/if using relative path, one must explicitly use “here”, i.e. prefix folder with ./
    • Don’t use ~/ instead write out the full path: /home/me/

Also, I read through the documentation of Bert-chainer library, but I don’t currently understand what the exact benefit is (older attempt to do the same thing as Keras?), but they state:

  • “This implementation can load any pre-trained TensorFlow checkpoint for BERT”
  • “You can convert any TensorFlow checkpoint for BERT (in particular the pre-trained models released by Google) in a Chainer save file by using the convert_tf_checkpoint_to_chainer.py script.”

Either way, at this point I gave up and shifted to just converting the model to PyTorch.

Loading model using PyTorch

Trying to load these models using native TF, is just causing an endless chain of headache. Instead, we can convert the model to PyTorch, and we don’t need to downgrade anything, it just works.

The conversion script is part of transformers-cli since 2.3.0. (needs both TensorFlow and PyTorch). Example applied to google’s models:

export BERT_BASE_DIR=/home/me/uncased_L-12_H-768_A-12

transformers-cli convert --model_type bert \
  --tf_checkpoint $BERT_BASE_DIR/bert_model.ckpt \
  --config $BERT_BASE_DIR/bert_config.json \
  --pytorch_dump_output $BERT_BASE_DIR/pytorch_model.bin
  • Note-1: for tf_checkpoint we must give not only the folder path to the TF checkpoint, but also the stem/base name of the checkpoint (bert_model.ckpt). The resulting model is less than half the size of the TF model.
  • Note-2: might/must rename bert_config.json to just config.json, for transformers loading to find the model

To test that it worked, we can use from_pretrained to load the local model

import torch
from transformers import BertModel, BertTokenizer

PATH = "./models/uncased_L-12_H-768_A-12/"  # google-bert model
# PATH = "./models/norwegian_bert_uncased/"   # BotXO model

tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
model = BertModel.from_pretrained(PATH)

# TEST tokenizer:
sample_txt = "How are you doing to day, sir?"
tokens = tokenizer.tokenize(sample_txt)

Complication: Tokenizer in Norwegian only yields [UNK] tokens (see issue #9). Comparing vocab.txt with the converted google bert-uncased model, we see difference in how segments starting a word are denoted:

  • English: No space token, and suffix tokens prefixed with ##
  • Norwegian: All tokens prefixed with ## and space token _, thus suffix tokens lack space token as first symbol.

Table 3: Vocabularies look different in how they encode space vs. following a word. (Note, there’s no “row-wise” relation between columns)
English Norwegian
[unused0] [UNK]
[unused98] [CLS]
[SEP] ##er
[MASK] ##en
[unused99] ##s
[unused100] ##et
[unused101] ##for
[unused993] ##ing
the ##det
of ##te
and ##av
in ##
to ##de
was ##som
he ##at
is ##med
as ##opp
for ##ker
on ##ang
with ##du
that ##men
##s ##ett
she ##re
you ##sel
##ing ##man

Above problem can be fixed by:

# Load vocab version with "##_"-prefix removed:
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True,

I’ve also tried doing the swap in place: the vocab gets updated, but the tokenizer still encodes all tokens to [UNK].

def fix_norwegian_vocab(tokenizer):
    "Start of word tokens prefixed with '##_' -> remove it!"
    import re
    import collections

    # The special '_'-like char is unicode: 0x2581
    # use M-x describe-char in emacs
    underscore = "\u2581"

    d = {re.sub(r'##' + underscore, "", key): item
         for key, item in tokenizer.vocab.items()}
    d = collections.OrderedDict(d)

    tokenizer.vocab = d
    return tokenizer
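One possible explanation for the in-place swap having no effect (an assumption, not verified against the library version used here) is that the tokenizer’s internal WordPiece component keeps its own reference to the original vocabulary, so reassigning tokenizer.vocab never reaches it. A way around that is to fix the vocabulary file on disk and load the tokenizer from the fixed file. The sketch below does the file rewrite with the standard library only; the function name, file names and toy vocabulary are made up. Unlike the regex above, it strips the marker only at the start of a token.

```python
import os
import tempfile

WORD_START = "##\u2581"  # '##' followed by U+2581, as in the function above


def write_fixed_vocab(src_path, dst_path):
    """Copy a vocab file, stripping the word-start marker from each token.

    Line order is preserved, since a token's id is its line number.
    Returns the number of tokens that were changed.
    """
    changed = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            token = line.rstrip("\n")
            # Leave a bare marker untouched to avoid writing an empty token.
            if token.startswith(WORD_START) and len(token) > len(WORD_START):
                token = token[len(WORD_START):]
                changed += 1
            dst.write(token + "\n")
    return changed


# Toy vocabulary standing in for the real vocab.txt (made-up entries).
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "vocab.txt")
dst_path = os.path.join(tmp_dir, "vocab_fixed.txt")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("[UNK]\n[CLS]\n##\u2581det\n##er\n##\u2581som\n")

n_changed = write_fixed_vocab(src_path, dst_path)
with open(dst_path, encoding="utf-8") as f:
    fixed_tokens = f.read().splitlines()
print(n_changed, fixed_tokens)
```

The fixed file can then be given to the tokenizer in place of the original vocab.txt, for instance through the vocab_file argument of BertTokenizer.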

Swedish royal library BERT

TODO Pre-training BERT (training from scratch)

There are many pre-trained BERT models available, both from huggingface but also from the community; there’s an untested electra-base (Google’s ELECTRA TF model src here), which could be used for text classification, among other things.

Pre-training is done once per language/model and takes a couple of days on a cluster of TPUs in the cloud. Fine-tuning is then done in under 1 hour on a single cloud TPU (64GB RAM), (see fine-tuning on TPU), or a few hours on GPU, according to google.

To consider regarding performance:

Longer sequences are disproportionately expensive because attention is quadratic to the sequence length. In other words, a batch of 64 sequences of length 512 is much more expensive than a batch of 256 sequences of length 128. The fully-connected/convolutional cost is the same, but the attention cost is far greater for the 512-length sequences. Therefore, one good recipe is to pre-train for, say, 90,000 steps with a sequence length of 128 and then for 10,000 additional steps with a sequence length of 512. The very long sequences are mostly needed to learn positional embeddings, which can be learned fairly quickly. Note that this does require generating the data twice with different values of max_seq_length.
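The claim in the quote can be checked with back-of-the-envelope arithmetic: both batches contain the same number of tokens, but the number of query-key pairs that attention has to score grows with the square of the sequence length. The sketch below counts pairs per attention head and layer only; it is a rough cost proxy, not a FLOP count, and the function names are made up.

```python
def token_count(batch_size, seq_len):
    """Total tokens in a batch (drives the fully-connected cost)."""
    return batch_size * seq_len


def attention_pairs(batch_size, seq_len):
    """Query-key pairs scored per attention head and layer."""
    return batch_size * seq_len ** 2


long_batch = (64, 512)    # 64 sequences of length 512
short_batch = (256, 128)  # 256 sequences of length 128

same_tokens = token_count(*long_batch) == token_count(*short_batch)
ratio = attention_pairs(*long_batch) / attention_pairs(*short_batch)
print(same_tokens, ratio)
```

So for the same 32,768 tokens per batch, the long-sequence batch scores four times as many attention pairs, which is the motivation for doing most pre-training steps at length 128.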

Note, Google AI recently posted on how to improve the quadratic scaling of transformers for long range attention: Rethinking Attention with Performers, by introducing the Performer, an attention mechanism that scales linearly.

Data size?

Question is how much data is needed to train a transformer model? GPT-2 (82M parameters) is trained on wikiText-103 (100M tokens, 200k vocab).

There are examples of fine-tuning on 900k tokens & 10k vocabulary, or 100M tokens and 200k vocabulary. Generally, typical values for VOC_SIZE are somewhere in between 32000 and 128000, and 100M lines is sufficient for a reasonably sized BERT-base (according to blog).

SpaCy writes in a blog post, (about interfacing with huggingface), that “Strubell (2019) calculates that pretraining the BERT base model produces carbon emissions roughly equal to a transatlantic flight.”


TensorFlow models can be converted to PyTorch, if that is preferable. On using TF to pre-train from scratch, Google writes:

our recommended recipe is to pre-train a BERT-Base on a single preemptible Cloud TPU v2, which takes about 2 weeks at a cost of about $500 USD (based on the pricing in October 2018). You will have to scale down the batch size when only training on a single Cloud TPU, compared to what was used in the paper. It is recommended to use the largest batch size that fits into TPU memory.

Also of interest:

  • TPU-pricing
  • Training on TPUv2 took ~54h for this tutorial, but google will terminate running colab notebooks every ~8h, so need pre-paid instance.


  • Input data files need to be sharded, since everything is held in memory at once
  • The input is a plain text file, with one sentence per line. Documents are delimited by empty lines. The output is a set of tf.train.Examples serialized into TFRecord file format.
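The input format described in the last two bullets is easy to produce. The sketch below writes already sentence-split documents to several shard files, one sentence per line with an empty line between documents. The function name, the shard file naming and the example sentences are made up, and sentence splitting itself is assumed to have been done beforehand.

```python
import os
import tempfile


def write_shards(documents, out_dir, docs_per_shard):
    """Write documents (lists of sentences) as sharded plain-text files.

    One sentence per line, one empty line between documents.
    Returns the list of file paths written.
    """
    paths = []
    for start in range(0, len(documents), docs_per_shard):
        shard = documents[start:start + docs_per_shard]
        path = os.path.join(out_dir, f"shard_{len(paths):04d}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n\n".join("\n".join(doc) for doc in shard) + "\n")
        paths.append(path)
    return paths


documents = [
    ["Dette er første setning.", "Dette er andre setning."],
    ["Et nytt dokument starter her."],
    ["Tredje dokument."],
]
out_dir = tempfile.mkdtemp()
shard_paths = write_shards(documents, out_dir, docs_per_shard=2)
with open(shard_paths[0], encoding="utf-8") as f:
    first_shard = f.read()
print(len(shard_paths))
print(first_shard)
```

Each shard can then be converted to TFRecord separately, which keeps the memory use of the conversion step bounded by the shard size.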

Resources and tutorials to keep in mind:

TODO PyTorch

If we want to train a new model from scratch, e.g. for a new language, there’s a huggingface tutorial for Esperanto that is useful. The model size is comparable to DistilBERT (84M parameters = 6 layers x 768 hidden, 12 attention heads), trained on a 3GB dataset (OSCAR + Leipzig). The tutorial is mirrored in a colab version (git), which makes use of the new huggingface Trainer() class, rather than running the training as a script from the command line with a long string of arguments.

In spite of the model being relatively small, on my Nvidia GPU (GeForce GTX 1050 Ti Mobile) it would take 40h to train a single epoch of 1M Esperanto sequences, in batches of 8 (reduced from the default 64, to fit in 4GB GPU RAM).
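For a feel of what those numbers mean per optimization step, here is the arithmetic spelled out, using only the figures quoted above (1M sequences, batch size 8, 40 hours per epoch). It is a rough estimate that ignores evaluation and data-loading overhead.

```python
sequences_per_epoch = 1_000_000
batch_size = 8
epoch_hours = 40

# Number of optimizer steps needed to see every sequence once.
steps_per_epoch = sequences_per_epoch // batch_size
# Average wall-clock time per step implied by the 40h estimate.
seconds_per_step = epoch_hours * 3600 / steps_per_epoch
# Step count at the default batch size of 64, for comparison.
steps_at_default = sequences_per_epoch // 64

print(steps_per_epoch, round(seconds_per_step, 3), steps_at_default)
```

That is 125,000 steps at a bit over one second each; at the default batch size the same epoch would be 15,625 larger steps.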

The following from the tutorial caused me some head scratching:

“We choose to train a byte-level Byte-pair encoding tokenizer (the same as GPT-2), with the same special tokens as RoBERTa. Let’s arbitrarily pick its size to be 52,000. We recommend training a byte-level BPE (rather than let’s say, a WordPiece tokenizer like BERT) because it will start building its vocabulary from an alphabet of single bytes, so all words will be decomposable into tokens (no more <unk> tokens!).”

Stuff I’m currently reading:

Beyond BERT

Overview of other transformer / post-BERT models.

Huggingface transformer models

There is more than just BERT to the huggingface library of pretrained transformer models. Here’s a short summary of (the summary of) the models I found interesting in my quest for text classification.

Note that the only difference between autoregressive models (e.g. GPT) and autoencoding models (e.g. BERT) is in the way the model is pretrained.

  • BERT
    • Models: bert-base-uncased
  • ALBERT (A lite BERT …)
    • Smaller embedding layer, than hidden layer (E << H)
    • Group layers that share weights (save RAM)
    • Models: albert-base-v1
  • RoBERTa > BERT - A Robustly Optimized BERT Pretraining Approach
    • Better optimized hyperparameters compared to BERT
    • Same implementation as BertModel with a tiny embeddings tweak as well as a setup for RoBERTa pretrained models
    • RoBERTa shares architecture with BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pre-training scheme.
    • RoBERTa doesn’t have token_type_ids,
    • Models: distilroberta-base
  • DistilBERT Distilled BERT. A faster, lighter, cheaper version of BERT.
    • Same implementation, derived from a pre-trained BERT
    • Significantly fewer parameters.
    • Models: distilbert-base-uncased, distilbert-base-multilingual-cased
  • mBERT Multilingual BERT, see above
  • XLM Cross-lingual Language Model Pretraining (arXiv)
    • On top of positional embeddings, the model has language embeddings
    • Three different checkpoints (for three different types of training)
      1. Causal language modeling (CLM)
      2. Masked language modeling (MLM) (like RoBERTa)
      3. Combined MLM & Translation language modeling (TLM)
    • Models: xlm-mlm-100-1280 trained on 100 languages
  • XLM-RoBERTa (XLM-R)
    • Uses RoBERTa tricks on the XLM approach, sans TLM
    • Doesn’t use the language embeddings, so it’s capable of detecting the input language by itself.

      “Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks” … This implementation is the same as RoBERTa

    • Models: xlm-roberta-base
  • XLNet Generalized Autoregressive Pretraining for Language Understanding
    • Huggingface documentation
    • Has no sequence length limit
    • Models: XLNetForSequenceClassification, XLNetForMultipleChoice, XLNetForTokenClassification

State of the art

There are many models that go beyond BERT and outperform it on benchmarks. Some of interest might be:

  • XLNet (Google), Generalized Autoregressive Pretraining for Language Understanding
    • uses “transformer-XL” (better at handling long complicated sentences)
    • does the word-masking in training differently to BERT (shuffled)
    • Autoregressive (like GPT-1/2), i.e. good at generating new text, rather than autoencoder, which are good at reconstructing learned text, as BERT
    • Is implemented in Huggingface, doc
  • ERNIE (Baidu), Enhanced Representation through Knowledge Integration, record breaking general language model, by also including knowledge about the world.
  • ERNIE 2.0 “Experimental results demonstrate that ERNIE 2.0 outperforms BERT and XLNet on 16 tasks”
  • Language-Agnostic BERT Sentence Embedding (google), as covered above.