Training a German LLM from Scratch 🦜

Code: Here
This article is not finished and will be updated.

The research group I work with has access to a small GPU cluster, which occasionally sits idle. To avoid wasting valuable compute resources (idle GPUs essentially burn money through opportunity costs), I decided to train a German GPT-2-style model from scratch, using only German text.

Existing German models available on Hugging Face have 137M parameters and a context length of 1024 tokens¹, which is quite limited compared to recently released models, such as those in the Llama family.

To make the model at least somewhat competitive with current alternatives, I aimed to support context lengths of at least double that. I also wanted the model to have more parameters, which generally enhances model quality. Therefore, I set out to train a GPT-2-style model with 358M parameters and a context window of 2048 tokens. While still modest compared to state-of-the-art models, it’s an improvement. The resulting model is available on Hugging Face at kkirchheim/german-gpt2-medium.

Dataset

A large dataset is required before training a model. Since this LLM is German-only, it was crucial to ensure that the collected texts were in German.

While we could have scraped the internet ourselves to gather enough data, this would be a lengthy process, requiring a custom crawler seeded with relevant pages and a substantial runtime.

Thankfully, others have already done this work: Common Crawl provides a massive text dataset from internet scrapes spanning the past decade. A derivative project, the German Colossal, Cleaned Common Crawl corpus (GC4), contains the German subset of the entire Common Crawl. This dataset also includes quality information about the texts. We selected only the highest-quality texts, such as those from newspapers, government sites, Wikipedia, and similar sources. This means that we do not have to download the entire internet and filter for German content manually.

To start, we downloaded all the tar archives listed on the website - around 180 GB of compressed text. After extraction, we were left with 300 GB of uncompressed, high-quality German text. Since the data was scraped from 2015 to 2020, this will be the knowledge cutoff for our LLM. For context, existing German-only models were trained on just 90 GB of text.²

While this dataset is publicly available, which is good for reproducibility, the fact that it is a collection of scraped data also means that we do not hold licenses for the individual texts. Training models on such content is, however, permitted for research purposes.³

Training

Training an LLM involves two main steps: first, creating a tokenizer to map character sequences to tokens that the LLM can process (and vice versa). Second, training the LLM to predict a probability distribution over the next tokens, given preceding tokens in the text.

Tokenizer

Training a tokenizer with Hugging Face is quite straightforward⁴, and I gave it a try. However, in the end, I opted to reuse the tokenizer of the existing German models.
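For reference, this is roughly what such an experiment looks like with the tokenizers library; the file names and vocabulary size below are placeholders, not the settings of the tokenizer that was ultimately reused:

from tokenizers import ByteLevelBPETokenizer

# Sketch of training a GPT-2-style byte-level BPE tokenizer.
# The shard paths and vocabulary size are illustrative assumptions.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["gc4_shard_0.txt", "gc4_shard_1.txt"],  # hypothetical text shards
    vocab_size=50_257,                             # GPT-2's vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("german-tokenizer")  # writes vocab.json and merges.txt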

Optimization

After selecting a tokenizer, we can use the Hugging Face API to train the model, which makes this extremely convenient. I won’t reproduce my full training script here, as it essentially just configures the training parameters for the Hugging Face Trainer, which handles most of the work behind the scenes.

We start by tokenizing the entire dataset and caching the results on disk. Then, we begin training the model. Given the corpus size, I only trained for a single epoch.
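For orientation, here is a minimal sketch of what such a setup typically looks like. Every value below is an illustrative assumption rather than the configuration actually used, and tokenized_dataset stands in for the tokenized, cached corpus:

from transformers import (
    AutoTokenizer, GPT2Config, GPT2LMHeadModel,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

# Reuse an existing German tokenizer and build a GPT-2-medium-sized model
# with an extended context window of 2048 tokens.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
config = GPT2Config(
    vocab_size=len(tokenizer),
    n_positions=2048,
    n_embd=1024, n_layer=24, n_head=16,  # roughly GPT-2 medium
)
model = GPT2LMHeadModel(config)

# Hyperparameters here are placeholders, not the values used for the released model.
args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=32,
    learning_rate=3e-4,
    num_train_epochs=1,
    save_steps=1000,
    logging_steps=50,
    bf16=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset,  # the pre-tokenized, cached GC4 corpus
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()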

The loss curve over the training period is shown below. Aside from some initial spikes, it follows the expected pattern: a sharp loss drop at first, followed by a gradual decrease as training progresses.

Loss over Training. The gaps in the data indicate crashes of the training script.


Gradient Norm Spikes

During training, we can observe an interesting phenomenon: when we look at the norm of the gradient of the loss $\lVert \nabla_{\theta} \mathcal{L}(x, y) \rVert$ w.r.t. the model's weights $\theta$, we see (plot below) that

  1. it starts at around $1$ and then quickly decreases. However, we observe some spikes, particularly early in training. These spikes also correlate with drastic jumps in the model’s loss (see the image above).
  2. the gradient norm increases again towards the end of the epoch.

This magnitude tells us how large the updates are that we apply to the model’s weights. It makes intuitive sense that updates start out quite large at the beginning of training and then gradually shrink as we approach a minimum of the loss, where the loss surface is less steep and further improvement becomes harder.

However, to be honest, I do not know why we observe these jumps and the gradual increase towards the end of the epoch. If you have any suggestions, feel free to contact me.
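To make the plotted quantity concrete, here is a small sketch of how the global gradient norm can be computed after a backward pass; the helper is mine, but it approximates what the training framework reports:

import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """Return the L2 norm over all parameter gradients of the model."""
    norms = [p.grad.detach().norm(2) for p in model.parameters() if p.grad is not None]
    return torch.norm(torch.stack(norms), 2).item()

# Example: call after loss.backward() but before optimizer.step().
# grad_norm = global_grad_norm(model)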

Spikes in the norm of the gradient


Evaluation

Now that the model is trained, how can we evaluate it?

Qualitative

One of the first things we can do to assess how good (or bad) the model is, is to simply look at some example generations. For example, we let the model complete the following prompt:

from transformers import pipeline

pipe = pipeline("text-generation", model="kkirchheim/german-gpt2-medium", device="cuda")

# Sample a completion with top-k / nucleus sampling, blocking repeated 3-grams.
text = pipe(
    "Der Sinn des Lebens ist",
    max_length=256,
    no_repeat_ngram_size=3,
    top_k=50,
    top_p=0.95,
    do_sample=True,
)[0]["generated_text"]
print(text)

The result looks like this:

Der Sinn des Lebens ist der Weg in die Freiheit, die wir für uns und unsere Kinder anstreben.

Das Wichtigste dabei ist es, die richtige Entscheidung für eine Lebens- und Sozialform zu treffen.

Denn nur wenn Kinder in einer Familie aufwachsen und Familie ihr Leben selbst bestimmen können, werden sie sich auch in Zukunft in ihrer Persönlichkeit verwirklichen. Wie wichtig es ist, in einer gesunden und lebenswerten Umwelt zu leben, zeigt sich am großen Anteil von älteren Menschen. Der demografische Wandel ist in vielen Bereichen bereits spürbar und wird viele Menschen immer stärker belasten.

Viele Familien in unserer Region leben seit Generationen im Eigentum. Sie sind in einem Generationenverbund mit ihren Kindern mit ihren eigenen Bedürfnissen und Ideen an den Ort ihrer Wohnumgebung gebunden. Die Generation der Jüngeren lebt zu einem Großteil allein in einer kleinen Wohnsiedlung ohne eigenen Garten, im Altersheim oder als alleinstehende Rentnerin oder Rentner.

Die Lebensbereiche Wohnen, Familie und Gesellschaft rücken in dieser Situation in den Fokus der Gesellschaft und erfordern die Entwicklung von neuen gesellschaftlichen, sozialen und ökonomischen Lebensmodellen.

Mit unserer Gesellschaft und unseren Kindern ist es oft nicht mehr so einfach wie früher, in dieser Lebensphase, sich von einer festen Bindung in die neuen Lebensphase zu lösen. Neue soziale Systeme müssen deshalb ganz neu entwickelt werden, um

While this reads strangely at times, it does resemble valid German text.

Language Modeling

For English models, there is a plethora of benchmarks that evaluate all kinds of properties, such as reasoning abilities, knowledge in certain fields, or truthfulness. For German text, however, our choices are quite limited. What you can always do is compare the losses of different models on the same corpus, which gives you an idea of how well each model can predict the next token.

Instead of comparing the loss directly, people often compare the per-token perplexity, a measure of how perplexed the model is by a given text. Perplexity over a sequence of tokens $w$ with length $N$ is computed as: $$ PPL(w) = \exp \left( -\frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(w_i \mid w_1, \dots, w_{i-1}) \right) $$ so, in essence, it is the exponentiated loss.⁵ In practice, perplexity is often only approximated, as computing it exactly requires $N$ forward passes, which can take a very long time for larger corpora.

There are several implementations of the perplexity metric available online, and interestingly, many of them give slightly different results. I went with the implementation from Hugging Face’s evaluate library, which I only modified slightly because it threw an error for some of the models.

For the evaluation, we took the first 10,000 articles of the German Wikipedia.

from datasets import load_dataset

# Load the German Wikipedia dump and keep the raw text of the first 10,000 articles.
dataset = load_dataset("wikipedia", "20220301.de", split="train")
text = [sample["text"] for sample in dataset.select(range(10_000))]
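With these texts, a direct (if slow) way to compute per-token perplexity looks roughly like the sketch below. Note that this is a simplification and not the modified evaluate script that produced the numbers; in particular, the plain chunking into context-window-sized pieces ignores the striding that a more careful implementation would use:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kkirchheim/german-gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda").eval()

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for doc in text:  # `text` is the list of Wikipedia articles from above
        ids = tokenizer(doc, return_tensors="pt").input_ids.to("cuda")
        # Split into chunks that fit the model's context window.
        for chunk in ids.split(model.config.n_positions, dim=1):
            if chunk.size(1) < 2:
                continue
            # With labels=input_ids, the model returns the mean NLL per predicted token.
            out = model(chunk, labels=chunk)
            n = chunk.size(1) - 1
            total_nll += out.loss.item() * n
            total_tokens += n

print("perplexity:", torch.exp(torch.tensor(total_nll / total_tokens)).item())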

You can find the resulting perplexity values below:

Perplexity of different models on some test data


As you can see, the Llama model outperforms ours, which is unsurprising given that it has over $20 \times$ the number of parameters. Our model, on the other hand, outperforms the smaller German models (also unsurprising, as it is larger and was trained on much more data). It should be noted that per-token perplexity can be difficult to compare between models with different tokenizers, so I am not entirely sure how to interpret the performance difference to Llama 3. The German models, however, all use the same tokenizer.

To be honest, I am not sure why the stefan-it model performs so poorly. According to the model card, it is basically a variant of the dbmdz model trained on much more data, so you would expect it to perform better.

Inference Speed

Quantization

Model quantization can be used to reduce the VRAM required to run inference on a model. While one might assume that quantization also makes inference faster (I will admit that I did), this does not seem to be the case.
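As a sketch, the model can be loaded at different quantization levels via bitsandbytes through transformers; whether these exact configurations match the ones benchmarked below is an assumption on my part:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "kkirchheim/german-gpt2-medium"

# Unquantized half-precision baseline.
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# 8-bit and 4-bit quantization (requires the bitsandbytes package).
model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=BitsAndBytesConfig(load_in_8bit=True)
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=BitsAndBytesConfig(load_in_4bit=True)
)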

Below, you can see a histogram depicting the distribution of time required to sample 1024 tokens from our model on an A100.

Time required to generate 1024 tokens with different levels of quantizations on an A100


Lessons Learned

Throughout collecting the data, implementing the training script, and finally evaluating the model, I learned several lessons.

Crashes Happen You might have noticed gaps in the previous plots. One key lesson I learned is that training can be interrupted unexpectedly, for reasons that are not always obvious in advance. For instance, if the disk becomes full and the Hugging Face Trainer tries to save a new model checkpoint, it crashes. Without prior checkpointing, this can mean a lot of wasted compute.
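If checkpoints were being written, the Trainer can pick up from the most recent one, which limits the damage; assuming a trainer object like the one sketched above, resuming is a one-liner:

# Resume from the latest checkpoint in `output_dir` instead of restarting from scratch.
trainer.train(resume_from_checkpoint=True)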

Batch-Size Matters Initially, I started training the model with a moderate batch size; however, it turns out that this leads to an early loss plateau. In my search for solutions to this problem, I had a look at the hyperparameters in Karpathy’s nanoGPT and noticed that this implementation uses much larger batch sizes.
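One inexpensive way to get a much larger effective batch size without more VRAM is gradient accumulation; the numbers below are illustrative, not the values I eventually settled on:

from transformers import TrainingArguments

# Effective batch size per GPU = 8 sequences x 64 accumulation steps = 512 sequences.
args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=64,
)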


  1. To my knowledge, the largest and best purely German models are dbmdz/german-gpt2 and stefan-it/german-gpt2-larger. The latter is trained on the same corpus, but only on 90 GB of the Common Crawl. ↩︎

  2. According to the information provided on Hugging Face. ↩︎

  3. Concerning the EU AI Act, which will be enacted soon, this is still legal for research purposes in Europe. I assume that the EU AI Act is the reason that some recently released Llama models are not available in the EU: Meta does not want to get sued. ↩︎

  4. A tutorial is provided here ↩︎

  5. There is an excellent post on perplexity available on The Gradient. There is also a paper describing alternative evaluation strategies. ↩︎


Last Updated: 14 Nov. 2024
Categories: Deep Learning
Tags: Deep Learning · Generative Models · LLM