How to use the corpus files
Most of Poio's corpus files are extracted from the Wikipedia editions of the respective languages. For this, we use the official Wikipedia dumps and convert them to ISO 24612 (GrAF-XML). Peter Bouda wrote a blog post about the whole process:
This data is then fed into the pressagio library to build a language model for the text prediction. A follow-up blog post about this workflow is available here:
To parse the content of the GrAF-XML files you can use the Python library graf-python. First, you have to download one of the corpus files. See api-corpusfiles on how to use the official API of Poio to get a list of all corpus files for a given language.
For example, the Wikipedia corpus for Bavarian is available for download at:
Next, extract the downloaded files. The following snippet shows how the individual documents can be read into a Python list:
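    import codecs

    import graf

    # Parse the annotation graph from the GrAF header file. The header name is
    # assumed to match the primary text file, with the extension .hdr.
    gp = graf.GraphParser()
    g = gp.parse("barwiki-20131229.hdr")

    # Read the primary text that the regions in the graph point into.
    with codecs.open("barwiki-20131229.txt", "r", "utf-8") as f:
        txt = f.read()

    documents = list()
    for n in g.nodes:
        # Document nodes have IDs starting with "doc.."; skip nodes without a region.
        if n.id.startswith("doc..") and len(n.links) > 0 and len(n.links[0]) > 0:
            region = n.links[0][0]
            doc = txt[region.start:region.end]
            documents.append(doc)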
This stores each document as a string in the list documents.
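
The slicing step works because GrAF is a standoff format: the annotation graph holds character offsets into the primary text file rather than the text itself. The following is a minimal sketch of that idea without graf-python. The helper slice_documents, the sample text, and the offsets are all made up for illustration and are not part of Poio or graf-python.

```python
def slice_documents(text, spans):
    """Return the substrings of `text` addressed by (start, end) offsets."""
    documents = []
    for start, end in spans:
        # Skip empty or out-of-range regions, similar to skipping nodes
        # that carry no link in the snippet above.
        if 0 <= start < end <= len(text):
            documents.append(text[start:end])
    return documents


# Hypothetical primary text and offsets, for illustration only.
sample_text = "First article.Second article."
sample_spans = [(0, 14), (14, 29), (7, 7)]

documents = slice_documents(sample_text, sample_spans)
for doc in documents:
    print(doc)
```

With real corpus files, the offsets would come from the regions linked to each document node instead of a hand-written list.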