---
title: "WikiLite: Loading Wikipedia into SQLite"
date: 2020-11-19T12:00:00Z
draft: true
slug: wikilite-loading-wikipedia-into-sqlite
tags:
---

One of the oft-overlooked but chief dangers of modern life is time travel. Yes, an unsuspecting soul can inadvertently be transported back in time, and their only hope for return might be some esoteric piece of knowledge such as “when will the clock tower be struck by lightning?” Though I haven’t looked up the exact per-capita statistics on this happening, my gut tells me that the rate is likely high enough that we should have a plan in place to mitigate it.

While the most straightforward approach might be to spend every night and weekend cramming all known pieces of information into my cranium, a potentially less brute-force approach would be to keep my own Grays-Sports-Almanac-esque cheat sheet with me at all times. That seems completely safe. Given that the latest World Book Encyclopedia is a not-so-portable 14,000 pages of print, we might be better served by something a bit more digital, such as an offline copy of Wikipedia. If one were available, that is.

As fortune would have it, the Wikimedia Foundation makes compressed XML exports of all articles available on a near-weekly basis. Weighing in at 6.1 million articles and accessible offline from any reluctant time-traveler’s laptop, this is likely the closest thing we’ll find to an ideal solution.

wget https://dumps.wikimedia.org/enwiki/20201101/enwiki-20201101-pages-articles-multistream.xml.bz2

Examining the Format

After downloading the bzip’d XML export, we’ll need to start by getting an understanding of the data format. Since we can assume it to be a relatively simple dataset, albeit quite a large one, we can just take a sampling of the document to get an idea of the structure. Let’s start by viewing the first eight kilobytes to see if we can piece together an understanding.

bzcat enwiki-20201101-pages-articles-multistream.xml.bz2 | head -c 8192
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>enwiki</dbname>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.36.0-wmf.14</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      ...
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>854851586</id>
      <parentid>834079434</parentid>
      <timestamp>2018-08-14T06:47:24Z</timestamp>
      <contributor>
        <username>Godsy</username>
        <id>23257138</id>
      </contributor>
      <comment>remove from category for seeking instructions on rcats</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="94" xml:space="preserve">#REDIRECT [[Computer accessibility]]

{{R from move}}
{{R from CamelCase}}
{{R unprintworthy}}</text>
      <sha1>42l0cvblwtb4nnupxm6wo000d27t6kf</sha1>
    </revision>
  </page>

This appears to be straightforward enough: the top-level mediawiki element contains a siteinfo block that we can ignore, followed by one page element for each page on the site. Each page element has a title, namespace (ns), id, text, and potentially a redirect. If there is a redirect, the text can be ignored since it only provides the markup for the redirect itself.
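To keep those fields straight for later, here’s a minimal sketch of the record we ultimately want out of each page element (the Page type is purely illustrative; it isn’t part of the export format or the final tool):

from typing import NamedTuple, Optional

class Page(NamedTuple):
    """The handful of fields we care about from each <page> element."""
    id: int
    ns: int                   # namespace; 0 is a regular article
    title: str
    redirect: Optional[str]   # target title when the page is just a redirect
    text: str                 # wikitext markup; ignorable when redirect is set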

Reading the bzip2 Archive

import bz2

file = bz2.open('enwiki-20201101-pages-articles-multistream.xml.bz2')
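As a quick sanity check (just a sketch; the filename is the one we downloaded above), the object bz2.open hands back behaves like a normal file and only decompresses bytes as we read them, so we can pull the same sample we grabbed with bzcat without extracting anything to disk:

import bz2

# Nothing is decompressed until we read from the stream.
with bz2.open('enwiki-20201101-pages-articles-multistream.xml.bz2', 'rt', encoding='utf-8') as file:
    sample = file.read(8192)   # the same peek as `bzcat ... | head -c 8192`
    print(sample[:200])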

Parsing the XML

Parsing this XML file shouldn’t be too much of a problem:

from xml.etree import ElementTree

tree = ElementTree.parse('enwiki-20201101-pages-articles-multistream.xml')
pages = tree.getroot().findall('{http://www.mediawiki.org/xml/export-0.10/}page')

Now let's take a look at our input file to make sure that everything looks reasonable…

File explorer shows that the XML is, well, huge.

Yikes. Maybe we should have checked that a moment ago. Eager loading, while easiest, doesn't appear even remotely practical when dealing with a 78 GB file. So, for this, it looks like we'll need to replace our implementation with a streaming SAX parser.

import xml.sax


class LazyObjectHandler(xml.sax.ContentHandler):
    def __init__(self, on_page):
        self.on_page = on_page
        self.breadcrumb = [{}]
        self.skipped_first_node = False
        self.content = None

    def startElement(self, name, attrs):
        # Skip the root <mediawiki> element so each <page> becomes a top-level record.
        if not self.skipped_first_node:
            self.skipped_first_node = True
            return

        tag = {"name": name, "attrs": attrs, "content": ""}

        self.breadcrumb[-1][name] = tag
        self.breadcrumb.append(tag)
        self.content = []

    def endElement(self, name):
        self.breadcrumb[-1]["content"] = "".join(self.content)

        self.breadcrumb.pop()

        # Arriving back at the top level means we just closed a <page>; hand it off.
        if len(self.breadcrumb) == 1:
            self.on_page(self.breadcrumb[-1][name], name)
            self.breadcrumb = [{}]

    def characters(self, content):
        if content is None or len(content) == 0 or self.content is None:
            return

        self.content.append(content)

def handle_page(page, name):
    # Illustrative callback; each parsed <page> arrives as a nested dict of its child tags.
    print(page["title"]["content"])

with open('enwiki-20201101-pages-articles-multistream.xml', 'r') as file:
    parser = xml.sax.make_parser()
    parser.setContentHandler(LazyObjectHandler(handle_page))
    parser.parse(file)
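One nicety worth noting (a sketch rather than anything specific to the final tool): parser.parse accepts any file-like object, so we can feed it the compressed dump directly and never keep the decompressed XML on disk at all.

import bz2
import xml.sax

# SAX only needs something with a read() method, so the bzip2 stream works as-is;
# decompression happens lazily as the parser pulls bytes.
with bz2.open('enwiki-20201101-pages-articles-multistream.xml.bz2', 'rb') as file:
    parser = xml.sax.make_parser()
    parser.setContentHandler(LazyObjectHandler(handle_page))
    parser.parse(file)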

Easy-peasy. Problem solved. Or is it? While this does ensure we have the data on hand, it doesn't do much for us in the way of providing low-latency access. So, it should be fine if we’re trying to find out when the clock tower is set to get struck by lightning days in advance, but less ideal if we need an answer on the spot.
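This is where SQLite comes in. As a rough sketch (the pages table and store_page callback here are illustrative, not necessarily the schema the real import uses), we can swap in a callback that writes each page into an indexed table, turning “when does lightning hit the clock tower?” into a cheap title lookup:

import bz2
import sqlite3
import xml.sax

db = sqlite3.connect('wikilite.db')
db.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        id       INTEGER PRIMARY KEY,
        title    TEXT NOT NULL,
        redirect TEXT,   -- target title when the page is just a redirect
        text     TEXT
    )
""")

def store_page(page, name):
    # Passed to LazyObjectHandler in place of handle_page: unpack the nested dict
    # built by the SAX handler and insert it as a row.
    redirect = page.get('redirect')
    db.execute(
        'INSERT OR REPLACE INTO pages (id, title, redirect, text) VALUES (?, ?, ?, ?)',
        (
            int(page['id']['content']),
            page['title']['content'],
            redirect['attrs'].get('title') if redirect else None,
            '' if redirect else page['revision']['text']['content'],
        ),
    )

with bz2.open('enwiki-20201101-pages-articles-multistream.xml.bz2', 'rb') as file:
    parser = xml.sax.make_parser()
    parser.setContentHandler(LazyObjectHandler(store_page))
    parser.parse(file)

# Index titles after the bulk insert so lookups are a B-tree search rather than a 78 GB scan.
db.execute('CREATE INDEX IF NOT EXISTS page_title ON pages (title)')
db.commit()

The actual import run, which also resolves redirect targets and builds a few more indices, looks like this: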

(env) jrogers@GLaDOS:/mnt/c/Users/jrogers/Source/wiki/src$ python app.py
Page Count: 633, Pages per second: Global (315) / Momentary (315)
Page Count: 20704895, Pages per second: Global (2640) / Momentary (3569)
Creating page_title index
Calculating Redirect Targets
Extracting redirects
Creating Indices
Committing