[Elasticsearch] Building a simple spell corrector with elasticsearch

Note: A Java implementation of the same idea is available on GitHub.

Let’s try to build a simple spell corrector using elasticsearch.

Users often make typos when they search in a web application. If your application implements search, it should detect those typos and either correct them or suggest the right words. So how can you achieve this? Elasticsearch’s term suggester to the rescue.

Let’s see how the term suggester solves our problem. It uses edit distance to find the words closest to a misspelled word and suggests them as replacements. So how does it know which words are correct? That depends entirely on the data you have indexed into Elasticsearch: if it finds a close enough term in your data, it offers that term as an alternative to the misspelled word.
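To make the idea of edit distance concrete, here is a minimal Python sketch. It uses plain Levenshtein distance, which is a simplification: Elasticsearch's default string distance is an optimized Damerau-Levenshtein variant, and the function names `edit_distance` and `closest_words` are made up for this illustration. The `max_edits=2` default mirrors the term suggester's own `max_edits` default.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def closest_words(misspelled: str, vocabulary: list, max_edits: int = 2) -> list:
    """Return vocabulary words within max_edits of the input, closest first.
    The vocabulary stands in for the terms indexed in Elasticsearch."""
    scored = sorted((edit_distance(misspelled, word), word) for word in vocabulary)
    return [word for distance, word in scored if distance <= max_edits]


if __name__ == "__main__":
    print(edit_distance("dissappoint", "disappoint"))
    print(closest_words("dissappoint", ["disappoint", "ecstasy", "embarrass"]))
```

Only words that exist in the vocabulary can come back, which is the same reason the suggester depends on the data in your index.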

NOTE: If there is no data in your index, Elasticsearch cannot suggest any words. Suggestions come only from the terms present in your index.

Now, let’s try to implement this.

First, let’s create an index with the settings and mapping for our data.

PUT test
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "data": {
      "properties": {
        "my_field": {
          "type": "text"
        }
      }
    }
  }
}

We created an index called test with a mapping type called data, and defined a field named my_field that stores text.

Let’s insert some data into our index. I did a quick Google search for commonly misspelled words, took a few of them, and indexed them.

PUT test/data/1
{
  "my_field": "disappoint"
}

PUT test/data/2
{
 "my_field":"ecstasy"
}

PUT test/data/3
{
  "my_field": "embarrass"
}

Now we have inserted the data. Let’s try to search using wrong spellings.

GET test/_search
{
  "query": {
    "match": {
      "my_field": "dissappoint"
    }
  }
}

We searched for “dissappoint”, a misspelled word, and got no results. When a search returns nothing, a spelling mistake by the user is a likely cause, and you can use the term suggester to offer alternative words.

Now let’s see how to use the term suggester.

POST test/_search
{
  "suggest": {
    "mytermsuggester": {
      "text": "dissappoint",
      "term": {
        "field": "my_field"
      }
    }
  }
}

The request above asks the term suggester for words close to the misspelled word “dissappoint”. I named my suggester mytermsuggester (you can name it anything) and told it to draw suggestions from the field my_field. Now, let’s see the result of this query.

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "suggest": {
    "mytermsuggester": [
      {
        "text": "dissappoint",
        "offset": 0,
        "length": 11,
        "options": [
          {
            "text": "disappoint",
            "score": 0.9,
            "freq": 1
          }
        ]
      }
    ]
  }
}

Aha! We got the word disappoint as a suggestion, and it is present in our index. Now you can either show this word to your user as a suggestion, or correct the query yourself and show the results for the corrected word. Each option contains a score and a freq along with the word. The score measures how close the suggestion is to the text you sent: here the suggestion is one edit away from the input, which gives 0.9. The freq tells you how often the suggested word occurs in your index, counted in documents.
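Correcting the word yourself means reading the top option out of the suggest section of the response. The Python sketch below shows one way to do that on a response that has already been parsed into a dict. The helper name `best_suggestions` is hypothetical, and the sample response is a trimmed copy of the one above; the sketch relies on the options arriving already sorted, so the first one is the best candidate.

```python
def best_suggestions(response: dict, suggester_name: str) -> dict:
    """Map each token of the input text to its top suggestion.
    A token with no options maps to itself (nothing to correct)."""
    corrections = {}
    for entry in response.get("suggest", {}).get(suggester_name, []):
        options = entry.get("options", [])
        # Options come back sorted, so the first one is the top candidate.
        corrections[entry["text"]] = options[0]["text"] if options else entry["text"]
    return corrections


# Trimmed copy of the response shown in this post.
sample_response = {
    "hits": {"total": 0, "max_score": 0, "hits": []},
    "suggest": {
        "mytermsuggester": [
            {
                "text": "dissappoint",
                "offset": 0,
                "length": 11,
                "options": [{"text": "disappoint", "score": 0.9, "freq": 1}],
            }
        ]
    },
}

if __name__ == "__main__":
    print(best_suggestions(sample_response, "mytermsuggester"))
```

You could then rerun the match query with the corrected word, or show it to the user as a "did you mean" hint.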

The term suggester has many other options; refer to the Elasticsearch documentation for the full list. Some of the important ones are:

  • min_doc_freq – The minimum number of documents a word must appear in to be suggested. For example, if the value is 5, the word has to occur in at least 5 different documents.
  • max_term_freq – The maximum number of documents a word from the input text may appear in and still be spell-checked. Words above this threshold are assumed to be spelled correctly and receive no suggestions. It can be an absolute number or a fraction of all documents, and defaults to 0.01.
  • sort – How to sort the suggested words. If the value is score, suggestions are sorted by score first; if the value is frequency, they are sorted by document frequency first.
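As a rough illustration of where these options go, the Python sketch below builds the JSON body for a term suggester request. The helper `build_term_suggest_body` is a made-up name, not part of any Elasticsearch client; only the keys it emits (`field`, `min_doc_freq`, `max_term_freq`, `sort`) are term suggester options. The options sit next to `field` inside the `term` object.

```python
import json


def build_term_suggest_body(text, field, name="mytermsuggester",
                            min_doc_freq=None, max_term_freq=None, sort=None):
    """Build the body of a POST <index>/_search term suggester request.
    Options left as None are omitted so Elasticsearch applies its defaults."""
    if sort is not None and sort not in ("score", "frequency"):
        raise ValueError("sort must be 'score' or 'frequency'")

    term = {"field": field}
    if min_doc_freq is not None:
        term["min_doc_freq"] = min_doc_freq
    if max_term_freq is not None:
        term["max_term_freq"] = max_term_freq
    if sort is not None:
        term["sort"] = sort

    return {"suggest": {name: {"text": text, "term": term}}}


if __name__ == "__main__":
    body = build_term_suggest_body("dissappoint", "my_field",
                                   min_doc_freq=1, sort="frequency")
    # Paste the printed JSON under "POST test/_search" in the console.
    print(json.dumps(body, indent=2))
```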