The Norwegian Newspaper Corpus (NNC) is a large monitor corpus representing contemporary Norwegian in both of its written varieties, Bokmål and Nynorsk. The corpus is compiled through daily harvesting and processing of texts published in the web editions of Norwegian newspapers.

As of April 2011, the NNC contains just under 1 billion words, which makes it the largest searchable corpus of Norwegian.

To search and access the corpus, follow the link "Søk i korpuset" ("Search the corpus").

Text collection for the Norwegian Newspaper Corpus began in 1998. Before the NNC was compiled, no large corpus of Norwegian was available. This dynamic corpus grows by approximately 230,000 words per day on average and currently consists of the full web versions of 24 Norwegian newspapers. The system can be visualised as in the figure and involves the steps listed below.

Figure: schematic diagram of the system (English version)

1.      harvesting: a web-crawler programme (w3mir or wget) downloads the full internet versions of Norwegian newspapers

2.      boilerplate and duplicate removal: a set of purpose-built programs automatically selects the core text (body text, headline, lead paragraph and picture captions) and discards advertisements, navigation menus, etc.

3.      language classification: the texts are classified as either Bokmål or Nynorsk, while English and other foreign texts are discarded

4.      text annotation: metadata on date, author and source are extracted from the source texts, and the texts are machine-classified by topic and morphosyntactically tagged with the Oslo-Bergen tagger

5.      user interface: the annotated texts are made available for search via Corpus Workbench and Corpuscle, a new in-house search system

6.      neology extraction: the inventory of word forms of newly harvested texts is compared with an accumulated list of word forms, and a list of forms not previously recorded is extracted and added to the accumulated word list

7.      neology classification: new word forms are classified according to orthographic criteria, and anglicism candidates are identified

8.      frequency profiling and lexical database entry: statistical filters are used to identify the neologisms most relevant for lexicography; these are registered in the Norwegian Word Bank and subsequently used by the Oslo-Bergen tagger

9.      extraction of multiword expressions: sequences of words with a strong tendency to co-occur are extracted from the corpus and added to the lexical database
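
The neology extraction in step 6 is essentially a set difference between the word forms of newly harvested texts and the accumulated word list. The following Python sketch illustrates that idea only; it is not the NNC implementation. The function names (word_forms, extract_new_forms), the letter-only tokenizer and the lowercasing are all assumptions made for illustration, and a real system might well keep case distinctions, for example for proper nouns.

```python
import re
from typing import Iterable, List, Set

# Hypothetical tokenizer (an assumption, not the NNC's): a word form is a run
# of Unicode letters, so Norwegian æ, ø and å are kept; digits are excluded.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def word_forms(text: str) -> Set[str]:
    """Return the set of lowercased word forms found in a text."""
    return {m.group(0).lower() for m in TOKEN_RE.finditer(text)}


def extract_new_forms(texts: Iterable[str], accumulated: Set[str]) -> List[str]:
    """Return word forms not previously recorded and add them to `accumulated`.

    `accumulated` plays the role of the accumulated word list and is updated
    in place, so a form is reported as new only once.
    """
    harvested: Set[str] = set()
    for text in texts:
        harvested |= word_forms(text)
    new_forms = sorted(harvested - accumulated)
    accumulated.update(new_forms)
    return new_forms


# Toy data, invented for illustration.
accumulated = {"avisa", "skriv", "om", "ein", "ny"}
day1 = extract_new_forms(["Avisa skriv om ein ny app"], accumulated)
day2 = extract_new_forms(["Ein ny app"], accumulated)
print(day1)  # ['app']
print(day2)  # []
```

Because the accumulated list is updated after each batch, a form counts as new only on the day it is first harvested, which matches the description of step 6. The resulting candidates would then go on to the classification and frequency profiling in steps 7 and 8.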


April 2014