Wiki-Loader: Loading Wikipedia into Elasticsearch

Posted on Sep 23, 2021

Over the last two posts I explored the process of extracting a very large dataset (an XML export of Wikipedia) into various types of databases (an RDBMS1 and a Graph database) that I can run locally as a precaution against being marooned in the past should some portal, machine, or ancient artifact send me careening backwards in history. While these two databases have satisfied my need to quickly look up information and find relational paths between said pieces of information, they still leave me needing to know exactly what I'm looking for. That isn't realistic though; sometimes I don't have all the information I need for a query. Sometimes all I have is an ancient text obviously written in my own handwriting that bears the smudged words “Beware the Spanish inq̶͈̬̺͌͒͝ừ̵̩̣̩ï̶͕͖sì̵̮̏t̸̠͐̉ḯ̶͍o̷̙͠n̶̫̾”. And sometimes I'd really, really like to know what that smudged bit said. 2 Hypothetically. Look, I don't want to talk about it.

In the abstract, sometimes we don't know exactly what we're looking for but still need to find it given only incomplete information: sometimes we need a search engine. While I could try to implement a search engine on top of the previous databases, I wouldn't expect it to be a very clean fit. Yes, there's tooling to allow full-text searching in SQLite (and PostgreSQL, and Oracle…), but personally these tools all feel bolted-on to me, with capabilities that come up somewhat lacking. To me, this looks like an ideal use-case for Elasticsearch. Elasticsearch, a NoSQL-style document store built on Lucene, bills itself as “a distributed, RESTful search and analytics engine” and provides very powerful natural language searching.

A Quick Tour of Elasticsearch

Unlike some of the past posts, getting to the code will take a bit longer this time: in order to get natural language search and analysis of text documents working via code, an understanding of how and why Elasticsearch works is necessary. Without this understanding, any code written would devolve into guesswork, copy-pasta, and superstition. 3 The main concepts that need to be well understood before any coding here are inverted indices, n-grams, edge n-grams, word stemming, and stop words. That's quite a few concepts, but we can move through them quickly.

Inverted Indices

At the heart of Elasticsearch (and Lucene, which it uses internally) is the concept of the inverted index. Whereas a forward index provides a mapping from “document id” to “text of the document”, an inverted index provides a mapping from “tokens in the dataset” to “the ids of the documents they appear in.” By default, each token is just a word from the document but, based on configuration, it could also be a handful of characters or even multiple words. We'll come back to this later.

To help make things more concrete, given the following forward index, Elasticsearch would produce an inverted index matching the table below it.

Forward Index

Document Id | Document Body
------------|----------------------
1           | Believe
2           | Daydream Believer
3           | What a Fool Believes
4           | What is Love

Inverted Index

Token    | Document Ids
---------|-------------
Believe  | 1, 2, 3
Daydream | 2
Fool     | 3
Love     | 4
What     | 3, 4

Given this inverted index and a query for “What Love”, Elasticsearch could quickly determine that document 4 matched both search terms and that document 3 was a partial match, causing it to return the ordered result set 4, 3.
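To make the mechanics a bit more tangible, here's a minimal Python sketch of an inverted index with a naive term-count ranking. It is only an illustration: it splits on whitespace and skips the lowercasing, stemming, and stop-word handling that Elasticsearch would also apply.

from collections import defaultdict

# A toy forward index, mirroring the table above.
forward_index = {
    1: "Believe",
    2: "Daydream Believer",
    3: "What a Fool Believes",
    4: "What is Love",
}

# Build the inverted index: token -> set of document ids.
inverted_index = defaultdict(set)
for document_id, body in forward_index.items():
    for token in body.split():
        inverted_index[token].add(document_id)

# Rank documents by how many of the query tokens they contain.
def search(query):
    scores = defaultdict(int)
    for token in query.split():
        for document_id in inverted_index.get(token, set()):
            scores[document_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("What Love"))  # [4, 3] -- document 4 matches both terms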

N-Grams

While the inverted index manages to explain how we can perform searches so quickly, it doesn't help explain how Elasticsearch, or search engines in general, are able to rank documents more highly when search terms are clustered together in the document. For instance, given the following forward index and the query “Example Hello World”, from what we know about inverted indices so far, we should (correctly) expect that both of these documents will rank identically, since both contain “Hello” and “World”, and neither contains “Example”.

Forward Index

Document Id | Document Body
------------|--------------------------
1           | 10 PRINT “Hello World”
2           | Hello from Disney World

To improve our document ranking, we need to take order and proximity into account as well. This is where n-grams come into play: n-grams act like a sliding window of n tokens (once again, often but not always words) bundled together. This causes the inverted index to contain phrases as well as single tokens. By including n-gram matches in document ranking, we can more highly rank documents where more of the search terms appear together and in the desired order.

Given the example above, the document 2-grams would look something like this:

Inverted Index

n-gram       | Document Ids
-------------|-------------
Print        | 1
Print Hello  | 1
Hello        | 1, 2
Hello World  | 1
World        | 1, 2
Hello Disney | 2
Disney       | 2
Disney World | 2

By leveraging the above 2-grams during document ranking, document 1 would now be given a higher rank than document 2, since the 2-gram “Hello World” from the search terms matches one of the 2-grams that were indexed from document 1. Using 2-grams, 3-grams, or 4-grams in this way increases the accuracy of document ranking, but not without cost: the larger the n, the longer it takes to index and the greater the disk storage requirements for said index. As a trade-off between accuracy and efficiency, Elasticsearch defaults to 3-grams when n-grams are to be used.
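Generating the n-grams themselves is just a sliding window over the token list. A rough Python sketch (note that the table above was also built after dropping tokens like “from”, which this sketch does not do):

def ngrams(tokens, n=2):
    # Slide a window of n tokens across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("Hello from Disney World".split()))
# ['Hello from', 'from Disney', 'Disney World']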

Edge N-Grams

The usefulness of n-grams isn't just related to phrase matching though: since tokens aren't restricted to being words, we can also produce n-grams where the tokens are individual letters. While this makes the index quite a bit larger, it enables a modern convenience: “search as you type” / auto-complete. The effect here is that even a partially typed word is enough to produce a match.

However, since an n-gram acts as a pure sliding window, at word boundaries we would end up indexing the gibberish of two partial words jammed together. Rather than a pure n-gram, the variant “edge n-gram” better fits this use-case. Rather than being a window of n characters anywhere in the text, edge n-grams anchor at the start of the current word, which fits the way people actually type search terms.

Using Elasticsearch to generate the edge n-grams for a document containing the text “Hello World” would produce the following table.

edge n-gram | Document Ids
------------|-------------
H           | 1
He          | 1
Hel         | 1
Hell        | 1
Hello       | 1
W           | 1
Wo          | 1
Wor         | 1
Worl        | 1
World       | 1
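For illustration, generating edge n-grams in plain Python is just a matter of emitting every prefix of each word (Elasticsearch's own edge n-gram tokenizer also takes configurable minimum and maximum lengths, which this sketch glosses over):

def edge_ngrams(text):
    # For each word, emit every prefix anchored at the start of the word.
    grams = []
    for word in text.split():
        for length in range(1, len(word) + 1):
            grams.append(word[:length])
    return grams

print(edge_ngrams("Hello World"))
# ['H', 'He', 'Hel', 'Hell', 'Hello', 'W', 'Wo', 'Wor', 'Worl', 'World']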

Word Stemming

One of the next tricks that Elasticsearch (and usable full-text searching in general) relies on is word stemming. The basic premise of word stemming is that, rather than indexing a word exactly as it appears in a document, the system should index the root form of the word. This allows it to handle more variability / fuzziness in search terms, such as differing verb tense.

Looking back at the indices from the Inverted Index section shows that “Believe”, “Believer”, and “Believes” were all indexed under the root “Believe”. Given a search for “Believing Daydreams”, the query would likewise be stemmed back to “Believe Daydream”, which would correctly match and rank document 2, “Daydream Believer”.

Since word stemming rules are very much based on the rules of the language, individual support has to be provided for each language being indexed. These rules can come either in the form of algorithmic stemmers (which naively perform replacement, though typically with sufficient accuracy) or in the form of stemming dictionaries (which are highly accurate but less performant). 4

Stop Words

What if, still using the indices from the first section, I had searched for “A believer of their daydreams”? In that particular case, I would have gotten the same results as before. Before executing a query, Elasticsearch first strips it of any stop words: common connective words that are likely to occur in most documents without providing any search value themselves. So, after filtering the stop words out of “A believer of their daydreams”, we'd be left with “believer daydreams”, which (after stemming) would once again match document 2 from the table above.
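Both behaviors are easy to observe once an Elasticsearch instance is running (we'll start one in the next section). As a rough sketch, the _analyze API with the built-in “english” analyzer, which applies both stop-word removal and stemming, shows what a piece of text is reduced to:

from elasticsearch import Elasticsearch

connection = Elasticsearch(["localhost:9200"])

# Ask Elasticsearch to analyze the text with the built-in "english" analyzer,
# which removes English stop words and applies an algorithmic stemmer.
response = connection.indices.analyze(
    body={"analyzer": "english", "text": "A believer of their daydreams"}
)
print([token["token"] for token in response["tokens"]])
# Expect something along the lines of ['believ', 'daydream'] -- the stemmer's
# root forms are internal tokens, not necessarily dictionary words.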

Getting Elasticsearch Up-and-Running

With the crash course out of the way, it's time to get an instance of Elasticsearch going. Copying the following into docker-compose.yml and running docker-compose up is all that should be required, with the database completing startup over the next few seconds. (To note, this docker-compose file is a modification of one from the Elastic website. I started from their multi-node cluster template, changed it to be a single node, raised the memory allocation, and set a local folder as the data volume rather than a named volume, all for my personal convenience.)

version: '2.2'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.14.1
    environment:
      - discovery.type=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms2048m -Xmx2048m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - ./volumes/data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200

Once the console stops showing output, the database should be ready to go. We can verify by making a request to it with curl http://localhost:9200. If Elasticsearch is running, this should return an object similar to the following:

{
  "name" : "fb3097e04520",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "dmcgK-XAR02GN5kOOH9enQ",
  "version" : {
    "number" : "7.14.1",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "66b55ebfa59c92c15db3f69a335d500018b3331e",
    "build_date" : "2021-08-26T09:01:05.390870785Z",
    "build_snapshot" : false,
    "lucene_version" : "8.9.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

Worth emphasizing is that we used curl to access this status page. Rather than a custom binary protocol, Elasticsearch exposes REST APIs over HTTP (or HTTPS) that both accept and return JSON objects. This makes communication and testing much easier overall, though language-specific bindings are still available for all major platforms should you prefer them.

Connecting to Elasticsearch

Similar to the last post, we're going to start by implementing the skeleton of a PageWriter from our framework.

from wikipedia import ContentPage, RedirectPage

class ElasticsearchPageWriter:
    def __init__(self):
        pass

    def __enter__(self):
        return self

    def __exit__(self, exception_type, exception_value, exception_traceback):
        pass

    def on_page(self, page):
        pass

As written, our new PageWriter doesn't do anything, but it does fit into the framework and (if we instantiate it in app.py) does get called for every page that is in the XML export. While a start, if we'd like it to do something more, we'll need a connection to Elasticsearch. The process of connecting to Elasticsearch is rather simple though: after running pip install elasticsearch and importing Elasticsearch, we can form a new connection just by instantiating a new Elasticsearch object using the host and port for the server (e.g. Elasticsearch(["localhost:9200"]).)

After updating ElasticsearchPageWriter to both create and dispose of connections as required, we end up with the following:

from elasticsearch import Elasticsearch

class ElasticsearchPageWriter:
    def __init__(self, uri):
        self.uri = uri
        self.connection = None

    def __enter__(self):
        self.connection = Elasticsearch([self.uri])
        return self

    def __exit__(self, exception_type, exception_value, exception_traceback):
        self.connection.close()

    def on_page(self, page):
        pass

Establishing a Schema

Given that we have a connection now, what's the plan? We could just start lobbing data into Elasticsearch with reckless abandon, ignoring any burden on our part to define a schema, but we wouldn't be doing ourselves any favors. Yes, NoSQL applications tend to be forgiving if we choose to forgo that step, but there isn't a real benefit to doing so in this instance: we already know what the shape of the data is since we control it, and we gain more flexibility and functionality by informing the database what expectations it can have about each field.

You know more about your data than Elasticsearch can guess, so while dynamic mapping can be useful to get started, at some point you will want to specify your own explicit mappings. - Elastic Documentation

So, to best save us from ourselves (and from the Spanish inq̶͈̬̺͌͒͝ừ̵̩̣̩ï̶͕͖sì̵̮̏t̸̠͐̉ḯ̶͍o̷̙͠n̶̫̾), we should consider the data that we have available and our needs regarding that data. So, what do we have? Referring back to Page and its subclasses, every page has a namespace, a title, an id, and a last modification date (lastedit). After that, the fields depend on the page type: ContentPage has the actual body of the page in content, and RedirectPage has a property target that contains the namespace and title of the target page.

If we were using an RDBMS, we'd probably store all of the common fields from Page in a page table, and then have separate tables for each of the subclasses with their specific fields. (If we look back at the SQLite implementation, that's exactly what I did.) While that level of normalization is fine, if not appropriate, for an RDBMS, for a document store we can get the most mileage by going the other direction: listing all of the properties that we might have and then passing in documents that have some subset of those fields.

Now that we've taken a quick inventory of the possible fields in our documents, the next thing for us to do would be to figure out what type they should be stored as. Coming from an RDBMS5, we might expect this to match things like NVARCHAR, INTEGER, BLOB, or DATETIME, but the field types supported by Elasticsearch are far more granular; since Elastic's goal here is to work as a search engine, the more we can tell it about what the data represents or how we want to use it, the more efficiently it can structure indices and the more granular search options it can provide.

So, what field types does Elastic support? Well, there are actually quite a few. Out of all of them, I want to focus on the four that (as you've already guessed) we'll be needing:

Field Type | Description
-----------|------------
date | A vanilla Date/Time
keyword | A text field where, rather than splitting the field into individual tokens and then indexing them, the entire field value is treated as a single token. This is useful when the field acts as an identifier or some other atomic value, such as an email address. This field type is appropriate for identifiers, but not for any form of full-text searching.
text | A text field in which the body will be broken down into individual tokens which are then added to an inverted index to allow for full-text searching.
search_as_you_type | A “search as you type” field is indexed in such a way that it can efficiently… search as you type (aptly named). Marking a field with this type makes it act similar to a text field, with the addition of extra subfields: _2gram, _3gram, and _index_prefix. As the names indicate, the first two subfields are n-grams (2 and 3 respectively) in order to allow higher ranking of documents that include phrase matches. The last subfield, _index_prefix, is an edge n-gram, which allows matching against partial words, as covered earlier. Together, this field and its three synthetic subfields provide all the information needed at query-time to provide a satisfying “search as you type” experience, at the cost of up-front indexing speed and disk space.

With our fields and the available field types listed, it's time to match them together.

Field Name | Field Type | Primary Use-Case
-----------|------------|------------------
namespace | keyword | Filtering search results to only show results from a specified namespace
title | search_as_you_type | Auto-completing page names while the user (my future self) is still typing
lastedit | date | Future: I'd like to explore adding the time since the last edit as part of the ranking weights
content | text | Retrieving page contents when matches are found and allowing for full document search
target | keyword | Redirecting to the specified page when a user selects a search result, if that selected page is a redirect

Now that we've darkened many a pixel discussing our schema, we can actually put it in place. Using the Python below, we are able to create a new index (sort of like a database) for our documents and to specify the field mappings for that index.

def _create_index(self):
    self.connection.indices.create(
        index="wikipedia",
        body={
            "mappings": {
                "properties": {
                    "namespace": {"type": "keyword"},
                    "title": {"type": "search_as_you_type"},
                    "lastedit": {"type": "date"},
                    "content": {"type": "text"},
                    "target": {"type": "keyword"}
                }
            }
        }
    )

Calling this method after creating the connection in __enter__ will ensure that when we index documents, they are processed the way we expect them to be.
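For clarity, the updated __enter__ might look something like the sketch below. One thing to note: indices.create raises an error if the index already exists, so re-running the loader would need to delete the wikipedia index first (or tolerate that error).

def __enter__(self):
    self.connection = Elasticsearch([self.uri])
    # Ensure the index and its field mappings exist before any documents are
    # indexed. Note: indices.create fails if the index already exists.
    self._create_index()
    return self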

Indexing Data

With the schema in place, the actual process of indexing is the only thing remaining. Even then, the bulk of the effort is just to map our input Page object into an appropriate shape to be indexed.

@staticmethod
def _render_page(page):
    page_dto = {
        # Page title and URI leaf used by Wikipedia
        "title": page.title,

        # The default namespace should be rendered as "Page"
        "namespace": page.namespace or "Page",

        # The DateTime has to be passed as an epoch-time
        "lastedit": page.last_edit.timestamp()
    }

    if type(page) is ContentPage:
        page_dto["content"] = page.content
    elif type(page) is RedirectPage:
        # We really only need the target title, since it indicates the whole page path.
        page_dto["target"] = page.target.title
    else:
        raise "An unknown page type was encountered"

    return page_dto

Using the above method, we can map any Page object we have into a document that can be indexed by Elasticsearch. All that is left at this point is to use the mapping and index documents any time on_page is called.

def on_page(self, page):
    self.connection.index(index="wikipedia", body=self._render_page(page))

Indexing Data In Bulk

So, the above will work, but it will take a while. Honestly, we can do quite a bit better with just another small change: we can significantly improve the indexing performance by passing items in bulk, rather than one-at-a-time. on_page gets called every time a page is processed, but we can still use bulk indexing by queuing several Page objects into an array and then periodically flushing them when we reach a threshold. For this, we'll need to import the bulk helper.

from elasticsearch.helpers import bulk

Next, we need to add an array of buffered_pages to ElasticsearchPageWriter.

def __init__(self, uri):
    self.uri = uri
    self.connection = None
    self.buffered_pages = []

When a new page comes in, we can then proceed to buffer the page, being mindful to flush the buffer if the buffer has filled.

# Chosen because this is the default batch size used by the Python library for Elasticsearch.
_max_buffer_size = 500

def on_page(self, page):
    self.buffered_pages.append(page)

    if len(self.buffered_pages) >= self._max_buffer_size:
        self._flush_buffer()

We also need to flush the buffer in __exit__ so that we don't fail to index the last batch of Pages since the number of pages in the export is not likely to be an exact multiple of 500.

def __exit__(self, exception_type, exception_value, exception_traceback):
    self._flush_buffer()
    self.connection.close()

All that is left now is to implement _flush_buffer. Here, _flush_buffer maps all of the buffered Page objects into objects that can be passed back to Elasticsearch. It then clears the buffer, feeds the documents to Elasticsearch in bulk, and does basic error checking.

def _flush_buffer(self):
    if len(self.buffered_pages) == 0:
        return

    actions = [ElasticsearchPageWriter._render_page(x) for x in self.buffered_pages]
    self.buffered_pages = []
    success_count, failure_count = bulk(client=self.connection,
                                        index="wikipedia",
                                        actions=actions,
                                        stats_only=True,
                                        max_retries=5,
                                        request_timeout=30)

    if failure_count > 0:
        raise Exception("At least one record was not accepted.")

The only things to note about this implementation are that:

  • Normally the tuple from bulk returns a count of successes and an array of failing documents. I don't really care what the failing documents are, only if there are any, so I used stats_only, changing the second value in the tuple to a count of failures.
  • Elasticsearch's indexing timeout is 30 seconds, but the library defaults to 10. This mismatch between them was causing some client side errors when I passed large batches over the wire. I set request_timeout to the same value used by the server to mitigate this.
  • By default, failures are not retried so I bumped the max_retries up to 5 so that it would be a little more lenient in the case of failures, such as network blips.

Querying Elasticsearch

A Simple Query

Given a few hours of runtime, the entirety of Wikipedia is now at my fingertips, ready to query. This is a perfect chance to figure out what the (probably hypothetical) ominous warning against the “Spanish inq̶͈̬̺͌͒͝ừ̵̩̣̩ï̶͕͖sì̵̮̏t̸̠͐̉ḯ̶͍o̷̙͠n̶̫̾” is referring to. To query Elasticsearch, we'll need to use some more curl, this time directed at the wikipedia index's _search endpoint. To search, we need only give it a query parameter q containing the text of our query. For readability, I also added pretty.

curl 'http://localhost:9200/wikipedia/_search?pretty&q=Spanish%20i'
{
  "took" : 26,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 15.575411,
    "hits" : [
      {
        "_index" : "wikipedia",
        "_type" : "_doc",
        "_id" : "uJ_3AXwBzVxGdcBb6yjI",
        "_score" : 15.575411,
        "_source" : {
          "title" : "Spanish in America",
          "namespace" : "Page",
          "lastedit" : 1.591077186E9,
          "target" : "American Spanish"
        }
      },
      {
        "_index" : "wikipedia",
        "_type" : "_doc",
        "_id" : "CcIuAnwBzVxGdcBbS45i",
        "_score" : 15.575411,
        "_source" : {
          "title" : "Spanish in Catalonia",
          "namespace" : "Page",
          "lastedit" : 1.227722238E9,
          "target" : "Languages of Catalonia"
        }
      },
      {
        "_index" : "wikipedia",
        "_type" : "_doc",
        "_id" : "s242A3wBzVxGdcBbLaLq",
        "_score" : 10.967632,
        "_source" : {
          "title" : "Wikipedia:WikiProject Women in Red/Missing articles by occupation/Politicians - Spain",
          "namespace" : "Wikipedia",

In 26 milliseconds (took) Elasticsearch identified over 10k matches (hits.total) for our query, but they aren't exactly what we wanted.

  • While, yes, I was searching for “Spanish in”, it obviously isn't performing partial word matching on the last term like I would want it to, so I'm only getting results with the exact word “in”.
  • The entire document body is being included in each result. This means that result sets are going to be huge. In this case, to the tune of about 600kb. I really only want a title, so most of that space is wasted.
  • That last entry isn't from the Page namespace, so I'd like to exclude it.

A Complex Query

To get the results we want, we'll have to take a little more control of the query… okay, a lot more control. For this, we'll need to use the JSON-based query API. We'll also have to swap from a GET to a POST because we'll have to submit a rather hefty chunk of JSON as the query. This in turn implies that we also need to add a Content-Type header so that Elastic knows that we handed it JSON. The block below has the same functionality as our initial query, albeit in a much longer form.

curl http://localhost:9200/wikipedia/_search?pretty \
    -XPOST \
    -H 'Content-Type: application/json' \
    -d '
{
  "query": {
    "query_string": {
      "query": "Spanish in"
    }
  }
}'

Rewriting the query as a POST has gotten us to the point where we can fix the rest of our issues, but it hasn't fixed any of them itself. Let's start by trying to fix the query so that it handles search-as-you-type. This means we'll need to replace the query_string query with a multi_match query. Whereas query_string generally checks to see if any field matches the query, multi_match lets you specify a list of fields which are searched, after which the score for each field searched is combined to determine a score for the document (though how it combines them can be controlled by the options we pass). While it might seem that, for auto-complete, we only want to query a single field, the goal of using multi_match is to search not only title, but also the _2gram and _3gram subfields we created earlier. This will allow it to more highly rank titles that include words in the desired order.

{
  "query": {
    "multi_match": {
      "fields": ["title", "title._2gram", "title._3gram"],
      "query": "Spanish In"
    }
  }
}

Next up, let's add support for auto-completion. When we set the title field to search_as_you_type, that didn't instantly make all queries run against the field perform partial word matches; rather, it only ensured that the indices needed to support partial word matches would be generated when documents were indexed. In order to support partial words, we want to change the match type to bool_prefix. This changes the search logic so that the last search term is tested against the edge n-gram field _index_prefix, while all other search terms are tested as before. With that change in place, we now support partial word matches.

{
  "query": {
    "multi_match": {
      "fields": ["title", "title._2gram", "title._3gram"],
      "query": "Spanish In",
      "type": "bool_prefix"
    }
  }
}

Just as an aside, by default Elasticsearch considers a document to match the search if one of the search terms appears in it. For this particular case, I'd prefer the logic to require that all search terms appear, so I'll add an operator.

{
  "query": {
    "multi_match": {
      "fields": ["title", "title._2gram", "title._3gram"],
      "query": "Spanish In",
      "type": "bool_prefix",
      "operator": "and"
    }
  }
}

Next up, let's update the search results to only show items from the Page namespace. At this point we effectively have two search criteria to apply at once: the original search and the namespace restriction. In order to do this, we'll have to change our query type to bool and nest our current query inside it. The bool query type lets you specify an array of queries that either must match (must) or must not match (must_not), as well as queries that are optional but are only used to assist ranking (should) and queries that must match but do not affect rankings (filter).

We'll put the current query inside a must array since we want to use the rankings from it, and we'll add a new query to filter to restrict which items appear in the result set without affecting their scores.

{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "fields": ["title", "title._2gram", "title._3gram"],
            "query": "Spanish In",
            "type": "bool_prefix",
            "operator": "and"
          }
        }
      ],
      "filter": [
        { "term": { "namespace": "Page" } }
      ]
    }
  }
}

Lastly, we'll limit the response fields to fields that are useful to us, specifically title and target.

{
  "_source": ["title","target"],
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "fields": ["title", "title._2gram", "title._3gram"],
            "query": "Spanish In",
            "type": "bool_prefix",
            "operator": "and"
          }
        }
      ],
      "filter": [
        { "term": { "namespace": "Page" } }
      ]
    }
  }
}
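As an aside, since we already have the Python client wired up from the indexing side, the same final query could also be issued programmatically rather than through curl. A rough sketch, assuming a local instance like the one we've been running:

from elasticsearch import Elasticsearch

connection = Elasticsearch(["localhost:9200"])

# Run the final query through the Python client instead of curl.
results = connection.search(
    index="wikipedia",
    body={
        "_source": ["title", "target"],
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "fields": ["title", "title._2gram", "title._3gram"],
                            "query": "Spanish In",
                            "type": "bool_prefix",
                            "operator": "and"
                        }
                    }
                ],
                "filter": [
                    {"term": {"namespace": "Page"}}
                ]
            }
        }
    }
)

# Print the score and title of each hit.
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])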

Pressing Questions Answered

So, the time is here to find out what terrible fate I might hypothetically be trying to avoid. What is the “Spanish inq̶͈̬̺͌͒͝ừ̵̩̣̩ï̶͕͖sì̵̮̏t̸̠͐̉ḯ̶͍o̷̙͠n̶̫̾”?!

curl http://localhost:9200/wikipedia/_search?pretty \
    -XPOST \
    -H 'Content-Type: application/json' \
    -d '
{
  "_source": ["title","target"],
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "fields": ["title", "title._2gram", "title._3gram"],
            "query": "Spanish In",
            "type": "bool_prefix",
            "operator": "and"
          }
        }
      ],
      "filter": [
        { "term": { "namespace": "Page" } }
      ]
    }
  }
}'
{
  "took" : 20,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 554,
      "relation" : "eq"
    },
    "max_score" : 11.027894,
    "hits" : [
      {
        "_index" : "wikipedia",
        "_type" : "_doc",
        "_id" : "vF5tAXwBzVxGdcBbk9n8",
        "_score" : 11.027894,
        "_source" : {
          "title" : "Spanish influenza",
          "target" : "Spanish flu"
        }
      },
      {
        "_index" : "wikipedia",
        "_type" : "_doc",
        "_id" : "qF5qAXwBzVxGdcBbZ02Q",
        "_score" : 11.027894,
        "_source" : {
          "title" : "Spanish inquisition",
          "target" : "Spanish Inquisition"
        }
      },
      ...

After only 20 milliseconds, the analysis completed. So, what was the smudged “Spanish inq̶͈̬̺͌͒͝ừ̵̩̣̩ï̶͕͖sì̵̮̏t̸̠͐̉ḯ̶͍o̷̙͠n̶̫̾”? Given the search results, the most probable candidates are either Spanish influenza or the Spanish inquisition. The _score on each of those documents is identical, indicating that both were equally good matches; however, my intuition tells me that it is more likely to be the Spanish flu that was being referenced.

I mean, while it could technically be Spanish Inquisition, I wouldn't expect it. 6

The full code for this post can be found here

Footnotes


  1. SQLite tends to be a “love it or hate it” system. Personally, I like it for many use-cases while also recognizing that its reduced feature set means that it isn't an appropriate or even viable choice for many workloads. That said, I don't see much in the way of a good-faith argument that the lack of features seen in other databases precludes this from being an RDBMS, especially given that the same logic used to load data into it would have equally applied should another RDBMS have been chosen. ↩︎

  2. While I would like to blame the potential smudge on the sheer amount of time that would have gone by, if I'm honest, it would just be because I'm a lefty. I could start writing with a fountain pen now and you'd have a smudged text in only 5 minutes. ↩︎

  3. As computing philosopher Stevie Wonder advised: “When you believe in things that you don't understand, then you suffer. Superstition ain't the way.” Wise words, professor Wonder, wise words. I sorely wish that every copy-paste from StackOverflow would include this as trailing text. ↩︎

  4. While not quite related, synonym replacement is another optimization used in natural language searching that works very similarly to word stemming: find a (conceptual) root for each word and index based on it instead of the word as stored. ↩︎

  5. I've mentioned RDBMS several times through this, not because I'm interested in throwing shade at them, but because most people are familiar with them, so using that mental model as a starting point and noting when we deviate from it can help us more correctly understand the document database we are using. ↩︎

  6. Though technically possible, the fantastically more probable case is that the message is Spanish Influenza. I doubt that anyone (likely no one) would expect it to be Spanish Inquisition. ↩︎