[Elasticsearch] Building a simple spell corrector with elasticsearch

Note: A Java implementation of the same is available here. Check it out.

Let’s try to build a simple spell corrector using Elasticsearch.

Users commonly make typos while searching inside applications. If your application implements search, it should detect those typos and either correct them or suggest the right words. So, how can we do this? You do not need a machine learning algorithm for this; you can use Elasticsearch’s term suggester.

Let’s see what the term suggester does for us. It uses edit distance to find the indexed words closest to the misspelled word and suggests them as replacements. So, how does it know which word is correct? That depends on the data you have indexed into Elasticsearch: if it finds a nearby word in your data, it suggests that word as an alternative to the misspelled one.
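
To make the idea concrete, here is a minimal Python sketch of edit-distance matching. It is only an illustration, not how Elasticsearch implements the suggester internally: the function names edit_distance and closest_words, the in-memory indexed_terms list, and the max_edits limit of 2 are all assumptions made for this example.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def closest_words(word, vocabulary, max_edits=2):
    """Return vocabulary words within max_edits of word, nearest first."""
    scored = sorted((edit_distance(word, candidate), candidate)
                    for candidate in vocabulary)
    return [candidate for distance, candidate in scored if distance <= max_edits]


# Hypothetical stand-in for the terms stored in the index.
indexed_terms = ["disappoint", "ecstasy", "embarrass"]
print(closest_words("dissappoint", indexed_terms))  # prints ['disappoint']
```

Only "disappoint" is one edit away from "dissappoint" (a single deleted "s"); the other words are too far, so they are not suggested.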

NOTE: If there is no data in your index, Elasticsearch cannot suggest any words. Its suggestions come only from the terms present in your index.

Now, let’s try to implement this.

First, let’s create the index with its settings and mapping.

PUT test
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "data": {
      "properties": {
        "my_field": {
          "type": "text"
        }
      }
    }
  }
}

We created an index named test with a mapping type called data, and defined a field named my_field that holds text.

Let’s insert some data into our index. I did a quick Google search for commonly misspelled words, took a few from the list it returned, and indexed them.

PUT test/data/1
{
 "my_field":"disappoint"
}

PUT test/data/2
{
 "my_field":"ecstasy"
}

PUT test/data/3
{
 "my_field":"embarrass"
}

Now we have inserted the data. Let’s try to search using wrong spellings.

GET test/_search
{
  "query": {
    "match": {
      "my_field": "dissappoint"
    }
  }
}

We searched for the misspelled word “dissappoint” and got no results. When a search returns nothing, a spelling mistake by the user is a likely cause, and you can use the term suggester to offer alternative words.

Now let’s see how to use the term suggester.

POST test/_search
{
  "suggest": {
    "mytermsuggester": {
      "text": "dissappoint",
      "term": {
        "field": "my_field"
      }
    }
  }
}

The request above asks the term suggester for words close to the misspelled word “dissappoint”. I named the suggester mytermsuggester (you can name it anything) and told it to take suggestions from the field my_field. Now, let’s see the result of this request.

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": 0,
    "hits": []
  },
  "suggest": {
    "mytermsuggester": [
      {
        "text": "dissappoint",
        "offset": 0,
        "length": 11,
        "options": [
          {
            "text": "disappoint",
            "score": 0.9,
            "freq": 1
          }
        ]
      }
    ]
  }
}

Aha! We got the word disappoint as a suggestion, and it is present in our index. You can now offer this word to the user, or correct the query yourself and show the results for the corrected word. Each suggestion comes with a score and a freq. The score reflects how similar the suggested word is to the original text according to the string distance measure. The freq is the number of documents in your index that contain the suggested word.
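
As a rough sketch of the "correct the word yourself" step, the Python below pulls the first suggestion for each token out of a response shaped like the one above. The helper name corrected_text is hypothetical, the response is a trimmed copy pasted in as a string rather than fetched from a live cluster, and joining tokens with single spaces is a simplification that ignores the offset and length fields.

```python
import json

# Trimmed copy of the suggest response, pasted in for illustration.
response_json = """
{
  "hits": {"total": 0, "max_score": 0, "hits": []},
  "suggest": {
    "mytermsuggester": [
      {
        "text": "dissappoint",
        "offset": 0,
        "length": 11,
        "options": [{"text": "disappoint", "score": 0.9, "freq": 1}]
      }
    ]
  }
}
"""


def corrected_text(response, suggester_name):
    """Rebuild the query text, swapping each token for its first suggestion.

    Tokens that received no options are kept unchanged.
    """
    words = []
    for entry in response.get("suggest", {}).get(suggester_name, []):
        options = entry.get("options", [])
        words.append(options[0]["text"] if options else entry["text"])
    return " ".join(words)


response = json.loads(response_json)
print(corrected_text(response, "mytermsuggester"))  # prints disappoint
```

The corrected text can then be shown to the user as a "did you mean" hint or used to rerun the original match query.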

The term suggester accepts many other options; refer to the Elasticsearch documentation for the full list. Some of the important ones are:

  • min_doc_freq – The minimum number of documents a word must appear in to be suggested. For example, if the value is 5, the word has to occur in at least 5 different documents. It can also be given as a fraction of the total number of documents.
  • max_term_freq – The maximum number of documents in which a word from the input text may appear and still be spell-checked. Words above this threshold are frequent enough to be treated as correctly spelled, so no suggestions are generated for them. It can be an absolute number or a fraction of the total number of documents.
  • sort – How the suggested words are ordered. If the value is score, they are sorted by score first; if the value is frequency, they are sorted by document frequency first.
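
To show where these options go, here is a small Python sketch that builds the suggest request body as a plain dictionary. The helper name build_term_suggest_body and the example values are assumptions for illustration; the code only builds and prints the JSON and does not contact a cluster.

```python
import json


def build_term_suggest_body(text, field, name="mytermsuggester",
                            min_doc_freq=None, max_term_freq=None, sort=None):
    """Build a term suggester request body, adding only the options given."""
    term = {"field": field}
    if min_doc_freq is not None:
        term["min_doc_freq"] = min_doc_freq
    if max_term_freq is not None:
        term["max_term_freq"] = max_term_freq
    if sort is not None:
        if sort not in ("score", "frequency"):
            raise ValueError("sort must be 'score' or 'frequency'")
        term["sort"] = sort
    return {"suggest": {name: {"text": text, "term": term}}}


# Example values chosen only for illustration.
body = build_term_suggest_body("dissappoint", "my_field",
                               min_doc_freq=1, sort="frequency")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of POST test/_search, in the same way as the earlier suggest request.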