Mining tagesschau.de

Code: Here

I like to read tagesschau.de, so I wrote a script to scrape it at regular intervals.

My original goal was to determine which articles stay on the front page the longest, which ones allow commenting (a feature that seems to have been disabled almost entirely since March 2020), and whether articles are modified after their initial release without this being mentioned, because I sometimes feel that headlines change.

Dataset Creation

Tagesschau.de provides a JSON API, so the “scraping” is relatively straightforward and can be done with just a few lines of code.

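from datetime import datetime
from os.path import join

import requests

# Output directory; "data" is a placeholder and must already exist
root = "data"

# The current timestamp becomes the file name of this snapshot
now = datetime.now()
date_time = now.strftime("%Y-%m-%d_%H_%M_%S")

url = "https://www.tagesschau.de/api2/"
r = requests.get(url)
path = join(root, f"{date_time}.json")

# Only store the snapshot if the request succeeded
if r.status_code == 200:
    data = r.content

    with open(path, "w", encoding="utf-8") as f:
        f.write(data.decode())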

I ran this script automatically once per hour for more than two years, which gave me $\approx$ 15,000 unique news articles.
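Because the front page is saved every hour, the same article shows up in many snapshots. The following is a minimal sketch of how the snapshots could be reduced to unique articles while recording how long each one stayed on the front page. The field names `news` and `sophoraId` are assumptions about the JSON layout and may need to be adapted to the actual data.

```python
import json
from datetime import datetime
from pathlib import Path


def load_snapshots(root):
    """Yield (timestamp, parsed JSON) for every snapshot file in `root`."""
    for path in sorted(Path(root).glob("*.json")):
        # File names follow the timestamp pattern used by the scraper
        ts = datetime.strptime(path.stem, "%Y-%m-%d_%H_%M_%S")
        with open(path, encoding="utf-8") as f:
            yield ts, json.load(f)


def front_page_spans(snapshots, list_key="news", id_key="sophoraId"):
    """Map each article ID to the first and last time it was seen.

    `list_key` and `id_key` are assumed field names, not confirmed ones.
    """
    spans = {}
    for ts, snapshot in snapshots:
        for article in snapshot.get(list_key, []):
            article_id = article.get(id_key)
            if article_id is None:
                continue  # skip entries without an identifier
            first, last = spans.get(article_id, (ts, ts))
            spans[article_id] = (min(first, ts), max(last, ts))
    return spans
```

Calling `front_page_spans(load_snapshots(root))` gives one entry per unique article, and the difference between the two timestamps is a lower bound on the time the article spent on the front page.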

Statistics

Now that we have a dataset, we can do some exploratory data analysis. For example, we can investigate when articles are published. Let’s plot the number of articles per weekday:

Apparently, most articles are published on Wednesdays and Fridays, and the fewest on the weekend. This sounds reasonable: fewer people work on the weekend, so there are fewer articles. But what is the reason for the spike on Fridays? Since the articles contain the exact publication date, we can plot the distribution of articles over the hours of each day. The plot looks like this:

Here, we notice something interesting: Quite a lot of articles are published on Friday around 17:00 and 20:00. My hypothesis is that these are articles that the editorial staff pushed out so that people have stuff to read during the weekend.
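The counts behind both plots can be computed with the standard library alone. This is a rough sketch that assumes the publication dates are available as ISO 8601 strings; the function name `publication_counts` is made up for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def publication_counts(dates):
    """Count articles per weekday and per (weekday, hour)."""
    per_weekday = Counter()
    per_hour = Counter()
    for raw in dates:
        published = datetime.fromisoformat(raw)
        day = WEEKDAYS[published.weekday()]  # weekday() returns 0 for Monday
        per_weekday[day] += 1
        per_hour[(day, published.hour)] += 1
    return per_weekday, per_hour
```

A spike such as the one on Friday evenings would show up as unusually large values for keys like `("Friday", 17)` and `("Friday", 20)` in the second counter.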

Let's have a look at the length of the articles:

If the hypothesis is true, we would expect the articles published on a Friday evening to be longer than average. The histogram of the number of articles, plotted against the hour and the number of words per article, looks like this:

Length of articles released on Fridays, over time.


This seems to support the hypothesis: on Friday evenings, after the tagesschau has aired, an unusually large number of lengthy articles is published. The hypothesis does not seem too far-fetched.
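A two-dimensional histogram like this one boils down to counting articles per hour and per word-count bin. The sketch below assumes each article is given as a `(published, text)` pair, counts words by splitting on whitespace, and uses an arbitrary bin width of 250 words; none of these choices are taken from the original analysis.

```python
from collections import Counter
from datetime import datetime


def friday_length_histogram(articles, bin_size=250):
    """Count Friday articles per (hour, word-count bin).

    `articles` is an iterable of (published, text) pairs, where
    `published` is a datetime and `text` is the article body.
    """
    hist = Counter()
    for published, text in articles:
        if published.weekday() != 4:  # Monday is 0, so Friday is 4
            continue
        n_words = len(text.split())
        bin_start = (n_words // bin_size) * bin_size  # lower edge of the bin
        hist[(published.hour, bin_start)] += 1
    return hist
```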

Masked Language Modeling

Masked language modeling can be seen as a special kind of classification task: given the surrounding words (in the simplest case, the previous and the next word), what is the probability of the masked word?

For example, in

The [mask] jumps over the lazy dog.

we are trying to find the most probable word for [mask].

Or, in mathematical terms, $$ p(\mathcal{D} \vert \theta) = \prod_{x \in \mathcal{D}} \prod_{i} p(x_i \vert x_{j \neq i}, \theta) $$ where $\mathcal{D}$ is a dataset of documents, $x_i$ is the $i$-th word of document $x$, $x_{j \neq i}$ are all other words of that document, and $\theta$ are the parameters of our model.
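To make the idea concrete, here is a toy sketch that estimates the probability of a masked word from nothing but its previous and next word, using plain counts instead of a neural network. It is not how BERT-style models work internally; the function names and the tiny corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Count how often each word appears between a (previous, next) pair."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pad with boundary markers so the first and last word have context
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
            counts[(prev, nxt)][word] += 1
    return counts


def predict_masked(counts, prev, nxt):
    """Return p(word | prev, next) as relative frequencies."""
    context = counts.get((prev, nxt), Counter())
    total = sum(context.values())
    if total == 0:
        return {}  # context never seen during training
    return {word: n / total for word, n in context.items()}


corpus = [
    "the fox jumps over the lazy dog",
    "the fox jumps high",
    "the cat jumps over the fence",
]
counts = train(corpus)
probs = predict_masked(counts, "the", "jumps")
best = max(probs, key=probs.get)  # most probable word for [mask]
```

With this corpus, "fox" fills the gap in two of three matching contexts and "cat" in one, so `best` is `"fox"`.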

Below is a picture.

Language modeling by recovering masked inputs.


Clustering

Below, you can find a clustering of the articles based on their content. Articles are vectorized by a German version of BERT; the visualization uses PCA and t-SNE. The color represents the category each article was assigned to. Using the categories as a sanity check, the clustering seems to work reasonably well. In fact, we can even find some articles that apparently have been miscategorized.
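The sanity check can also be expressed numerically instead of visually. The sketch below is not the PCA and t-SNE pipeline used for the plot; it is a simpler stand-in that flags articles whose nearest neighbour (by cosine similarity) belongs to a different category. The two-dimensional vectors are placeholders for real BERT embeddings, and the category names are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbour_disagreements(vectors, categories):
    """Return indices of articles whose nearest neighbour has another category.

    Needs at least two articles, since each one is compared to all others.
    """
    suspicious = []
    for i, u in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        nearest = max(others, key=lambda j: cosine(u, vectors[j]))
        if categories[nearest] != categories[i]:
            suspicious.append(i)
    return suspicious


# Placeholder embeddings: two tight groups plus one oddly labeled article
vectors = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0], [0.8, 0.2]]
categories = ["sport", "sport", "economy", "economy", "economy"]
flagged = neighbour_disagreements(vectors, categories)
```

Here only the last article is flagged: it is labeled "economy" but sits next to the "sport" articles, which is the kind of case that stands out in the plot as a possible miscategorization.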

Article Clustering based on BERT


Generative Language Modeling

We can use this dataset to create a fake news generator.

GPT-2

GPT General architecture


GPTagesschau

I fine-tuned a German GPT-2-based language model on the dataset to generate news articles in the style of tagesschau.de. The model is not that good yet, probably because

  • German pre-trained language models are not as good as their English counterparts and
  • the dataset is too small (15k unique articles at the time of writing).

Still, the generated (fake) news articles are somewhat coherent, even if they tend to contradict themselves. The model is also able to generate titles and headlines.

Below is an example that I find rather funny. Note that I did not select this article for its realism, but because I thought the apparent mixture of two topics makes it an interesting read.

(Translated from the German model output.)

Imprisoned Wikileaks founder: First indictment against Assange?

The US judiciary has opened the trial of Julian Assange in London. He is accused of having fought supporters of a terrorist militia in Iraq. Assange's supporters are said to have obtained weapons and equipment in the war against Iraq.

A British court has begun the legal dispute over the extradition of the whistleblower of the whistleblowing platform Wikileaks, Julian Assange, in Great Britain. For Assange, the founder of the democracy movement, there is a chance of being able to serve his pre-trial detention in London, London Attorney General Letitia James announced. “Assange can hope to have a chance of being reminded of what is right by all sides in a fair trial.” If convicted, Assange faces up to 175 years in prison. The public prosecutor's office accuses him of having supplied supporters of the terrorist militia “Islamic State” (IS) and of the IS regime with weapons and equipment.

Assange: A supporter of Al-Qaeda and Al-Nur?

His defense lawyers, by contrast, said the court was of the opinion that Assange belonged to the activities of Al-Qaeda or the terrorist organization Islamic State (IS). There was no evidence that he had recruited IS members. According to the court, Assange is accused of having made false statements in order to support fighters of the IS militia and IS leaders. Assange's lawyer, Michel Barnier, called the indictment a "milestone" for him. "The rule of law gives Julian Assange the right to move freely," Barnier told the broadcaster Euronews.

Assange faces up to 175 years in prison

A trial of Assange would be the first in which a court handed down a verdict. The founder of the world's oldest and most important news platform has been held in the Ecuadorian metropolis of Quito since he was arrested in 2007 and taken to the USA in September of last year. It would be the first indictment against Assange brought by a court in Great Britain. The 37-year-old is the greatest investigative journalist ever to be imprisoned.

Last Updated: 26 Nov. 2022