MongoDB Map-Reduce Introduction

What is map-reduce?

Map-reduce is a programming model for processing large data sets in parallel to achieve faster results. If you are new to the concept, go through this article, which has a nice explanation for beginners.

MongoDB supports map-reduce to operate on huge data sets and produce the desired results much faster. Map-reduce has two main functions: a map function, which emits a key-value pair for each document and groups all the values by key (go through the article mentioned above if you are unsure what a key is), and a reduce function, which performs an operation on the grouped values. The data can be mapped and reduced independently on different shards, with the partial results combined and reduced again into a single final result. Because map and reduce run independently and in parallel, you should be careful to write your reduce function so that it produces the correct result no matter how the work is split.

Let's look at an example and solve a problem using map-reduce. For simplicity, let's take the data from the article mentioned above.

Here is the problem statement: we have a list of cities with temperatures, and the goal is to find the maximum temperature for each city. This can easily be done using MongoDB's aggregation framework, but let's solve it using map-reduce now and look at the advantages later.

Let's insert some data.

db.cities.insert({city: 'Toronto', temperature: 20}) 
db.cities.insert({city: 'Whitby', temperature: 25}) 
db.cities.insert({city: 'New York', temperature: 22}) 
db.cities.insert({city: 'Rome', temperature: 32})
db.cities.insert({city: 'Toronto', temperature: 4}) 
db.cities.insert({city: 'Rome', temperature: 33})
db.cities.insert({city: 'New York', temperature: 18})
db.cities.insert({city: 'New York', temperature: 14})


Now that the data is inserted, we can perform map-reduce on it. The map-reduce command looks like this:

 db.collectionName.mapReduce(mapFunction, reduceFunction, { out: "outputCollectionName" })

As mentioned before, we need two functions, i.e. a map function and a reduce function. MongoDB's shell interprets JavaScript, so you must write both functions in JavaScript.

Now let's look at the mapping function.

Mapper Function

     function() {
         emit(this.city, this.temperature) // Emits the city as key and the temperature as value
     }

The above function runs for each and every document in the collection on which you run map-reduce, in our case the cities collection. For every document it emits the city as the key and the temperature as the value. The map phase then groups all the data by key, producing one key and an array of values (temperatures in our example). In our case the data is grouped as follows:

     "New York"  => [22, 18, 14]

     "Toronto"   => [20, 4]

     "Rome"      => [32, 33]
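This grouping can be sketched in plain JavaScript. The snippet below is a simplified model for illustration, not MongoDB's actual implementation; the emit helper is our own stand-in for the one MongoDB provides.

```javascript
// Sample documents, as inserted above
const docs = [
  { city: 'Toronto', temperature: 20 },
  { city: 'Whitby', temperature: 25 },
  { city: 'New York', temperature: 22 },
  { city: 'Rome', temperature: 32 },
  { city: 'Toronto', temperature: 4 },
  { city: 'Rome', temperature: 33 },
  { city: 'New York', temperature: 18 },
  { city: 'New York', temperature: 14 },
];

// Collect every (key, value) pair the map function emits
const groups = {};
function emit(key, value) {
  (groups[key] = groups[key] || []).push(value);
}

// The map function from above, run once per document
for (const doc of docs) {
  emit(doc.city, doc.temperature);
}

console.log(groups);
// { Toronto: [20, 4], Whitby: [25], 'New York': [22, 18, 14], Rome: [32, 33] }
```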

Now that the data is grouped, it's time for the reduce function to operate on the values. Here is the reduce function.

Reduce Function: 

 function(key, values) {
   return Math.max.apply(Math, values); // Returns the maximum of the grouped values
 }
So the reduce function takes two parameters, the key and the array of values produced by the map phase, performs an operation on them, and returns a single value.

In our case the reduce function has to find the maximum temperature for each city. So the reduce function runs on the grouped data.

So lets see how it works for our example.

From the map phase, 'New York' and [22, 18, 14] are passed to the reduce function, which returns the maximum value in the array: 22.

Similarly, for "Toronto" => [20, 4], the maximum value 20 is returned from the reduce function.

P.S.: The above explanation shows the logical flow, but internally the map and reduce functions are called repeatedly, not just once per key. For example, emit may initially produce just two values for New York, i.e. [18, 14], which reduce turns into 18. When another document with the same key is encountered, say with value 22, the previously reduced value is grouped with the new one, [18, 22], and passed to reduce again, yielding 22 as the final value. By breaking the operation into pieces you still get the same result, with better performance: the data can be split and operated on independently across many threads or machines. This is also why the reduce function must be written so that reducing partial results gives the same answer as reducing everything at once.
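This incremental re-reduce can be checked in plain JavaScript: because taking a maximum is associative, reducing a partial batch and then re-reducing gives the same answer as reducing everything in one go.

```javascript
// The reduce function from above
function reduce(key, values) {
  return Math.max.apply(Math, values);
}

// Reducing everything at once...
const allAtOnce = reduce('New York', [22, 18, 14]);

// ...versus reducing a partial batch first, then re-reducing the
// partial result together with a late-arriving value
const partial = reduce('New York', [18, 14]);        // 18
const reReduced = reduce('New York', [partial, 22]); // 22

console.log(allAtOnce, reReduced); // 22 22
```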

So, now that you have the map and reduce functions, let's run the mapReduce command and get the results. MongoDB's map-reduce writes its output to a collection rather than printing it to the console, so you need to specify an output collection. In my example I am outputting to a collection called maxTemp.

So our final query looks like this.

 db.cities.mapReduce(
    function() { emit(this.city, this.temperature) },
    function(key, values) { return Math.max.apply(Math, values) },
    { out: "maxTemp" }
 )

Running this command stores the result in the maxTemp collection.

Let's take a look at the result now.

> db.maxTemp.find()
Result : 
{ "_id" : "New York", "value" : 22 }
{ "_id" : "Rome", "value" : 33 }
{ "_id" : "Toronto", "value" : 20 }
{ "_id" : "Whitby", "value" : 25 }

Finally, we have the maximum temperature calculated for every city.

When to use map-reduce?

Map-reduce should be used when your aggregation query is slow and takes too long to execute because of the huge amount of data in the DB. Map-reduce can run in parallel and perform operations at a much higher rate.

If the data set is small, it is better to stick to aggregation queries: map-reduce takes longer to execute than aggregation when the data set is small, and it also requires more effort to write.
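For reference, the aggregation-framework alternative for our problem is short. A sketch using a $group stage with the $max accumulator (both standard aggregation operators), run in the mongo shell against the cities collection used above:

```javascript
// Sketch: group documents by city and keep the highest temperature.
db.cities.aggregate([
  { $group: { _id: "$city", value: { $max: "$temperature" } } }
])
```

This produces the same city-to-maximum mapping that the map-reduce job writes to maxTemp.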

Your map and reduce functions should be written so that they can run in parallel and still give the correct result.

You can check the MongoDB docs for more options to use in your map-reduce query. Here is the link for the same.

[Elasticsearch] Building a simple spell corrector with elasticsearch

Let's try to build a simple spell corrector using Elasticsearch.

It is very common for users to make typos while searching in applications. If your application implements search, it should detect typos and try to correct them or suggest the correct words. So, how do we do this? You definitely do not need a machine learning algorithm for this; you can make use of Elasticsearch's term suggester.

Let's see what the term suggester does for us. Elasticsearch's term suggester uses an edit distance algorithm to detect the closest correctly spelled word and suggest it as a replacement for the misspelled one. So, how does it know which word is correct? That depends on the data you have indexed into Elasticsearch: if it finds a nearby word in your data, it suggests it as an alternative to the misspelled word.
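To get a feel for the edit distance idea, here is a minimal sketch of the classic Levenshtein distance in JavaScript. Elasticsearch's own implementation is more optimized, but the underlying idea is the same: count the fewest single-character edits turning one word into another.

```javascript
// Classic dynamic-programming Levenshtein distance: the minimum number of
// single-character insertions, deletions, and substitutions turning a into b.
function editDistance(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,        // deletion
        dp[i][j - 1] + 1,        // insertion
        dp[i - 1][j - 1] + cost  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

console.log(editDistance('dissappoint', 'disappoint')); // 1 (one extra 's')
```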

NOTE: If there is no data in your index, Elasticsearch cannot suggest any words. It predicts words based on the data set present in your index.

Now, let's try to implement this.

First, let's create the settings and mapping for our index.

PUT test
{
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "data": {
      "properties": {
        "my_field": {
          "type": "text"
        }
      }
    }
  }
}
We created an index called test with a mapping type called data, and we defined a field named my_field which holds text.

Let's insert some data into our index. A quick Google search for commonly misspelled words suggested a list; I took a few of them and indexed them.

PUT test/data/1
{ "my_field": "disappoint" }

PUT test/data/2

PUT test/data/3

Now we have inserted the data. Lets try to search with wrong spellings.

GET test/_search
{
  "query": {
    "match": {
      "my_field": "dissappoint"
    }
  }
}

We searched for "dissappoint", i.e. the wrongly spelled word, and we get no results. When you do not get any results for a search, you can assume there might be a spelling mistake from the user and use the term suggester to suggest new words to your users.

Now let's see how to use the term suggester.

POST test/_search
{
  "suggest": {
    "mytermsuggester": {
      "text": "dissappoint",
      "term": {
        "field": "my_field"
      }
    }
  }
}

The above query asks the term suggester to suggest words closely related to the misspelled word "dissappoint". I called my suggester mytermsuggester (you can name it anything) and told it to suggest from the field my_field. Now, let's see the result of the above query.

{
  "took": 3,
  "timed_out": false,
  "_shards": { "total": 1, "successful": 1, "failed": 0 },
  "hits": { "total": 0, "max_score": 0, "hits": [] },
  "suggest": {
    "mytermsuggester": [
      {
        "text": "dissappoint",
        "offset": 0,
        "length": 11,
        "options": [
          { "text": "disappoint", "score": 0.9, "freq": 1 }
        ]
      }
    ]
  }
}

Aha! We got the word disappoint as a suggestion, and it is present in our index. Now you can show this suggestion to your user, or correct the word yourself and show results for the corrected word. Each option contains a score and a freq along with the word: the score indicates how close the suggested word is to the misspelled input, and freq is the number of documents in your index in which the suggested word occurs.

There are a lot of other options available for the term suggester query, and you can refer to the Elasticsearch documentation here for them. Some of the important ones are:

  • min_doc_freq – The minimum number of documents a suggested word must appear in before it is returned. For example, if the value is 5, the word has to occur in 5 different documents.
  • max_term_freq – The maximum number of documents the input term itself may appear in and still be spell-checked; terms more frequent than this are skipped, since very common terms are usually spelled correctly anyway.
  • sort – How suggestions are sorted: score sorts by similarity score, frequency sorts by document frequency.
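Putting a couple of these options together, the body of a POST test/_search request might look like this. This is a sketch following the example above; the option names are from the Elasticsearch docs, and the values are illustrative:

```json
{
  "suggest": {
    "mytermsuggester": {
      "text": "dissappoint",
      "term": {
        "field": "my_field",
        "sort": "frequency",
        "min_doc_freq": 1
      }
    }
  }
}
```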

Elasticsearch [How to search for a word without spaces, if the word itself is a combination of two different words]

I was using Elasticsearch for search, and I encountered a specific problem in one of the applications I was working on. The problem was this: there are numerous terms in English that are two different words but can appear as a single word in some contexts.

For example, New York can appear as newyork or new york. Suppose your data set contains newyork (without a space); when you search for new york (with a space), you will end up with no results for the search you made.

In order to solve this problem, we can make use of Elasticsearch tokenizers and filters.

So let's solve it. Elasticsearch's analysis settings consist of three main components, i.e. analyzers, tokenizers, and filters, alongside other index-related settings.

Let's create the settings required for our search. We are creating a custom analyzer that can be mapped to our field when we create the mappings for our index.

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "bigram_combiner": {
          "tokenizer": "standard",
          "filter": [
            "custom_shingle",
            "my_char_filter"
          ]
        }
      },
      "filter": {
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3,
          "output_unigrams": true
        },
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": " ",
          "replacement": ""
        }
      }
    }
  }
}

So, the trick here is the combination of custom_shingle and my_char_filter.

The shingle filter emits combinations of adjacent words, and my_char_filter (despite its name, a pattern_replace token filter) removes the space inside each shingle, giving back a single word. Let's analyze what our custom analyzer bigram_combiner does.

POST test/_analyze
{
  "analyzer": "bigram_combiner",
  "text": "new york"
}


{
  "tokens": [
    {
      "token": "new",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "newyork",
      "start_offset": 0,
      "end_offset": 8,
      "type": "shingle",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "york",
      "start_offset": 4,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

So, if we use bigram_combiner as the analyzer for our field, we will accomplish what we need: it breaks the input down into the possible combinations. new york is now analyzed as three tokens: new, york, and newyork.
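The combined behaviour of the standard tokenizer, the shingle filter, and the pattern_replace filter can be sketched in plain JavaScript. This is a simplified model written for illustration (the function names are our own, and real analyzers also track offsets and positions):

```javascript
// Split on whitespace, roughly what the standard tokenizer does for this input
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// Emit unigrams plus adjacent-word shingles of size 2..3,
// then strip the space inside each shingle (the pattern_replace step)
function bigramCombiner(text) {
  const words = tokenize(text);
  const tokens = [...words]; // output_unigrams: true
  for (let size = 2; size <= 3; size++) {
    for (let i = 0; i + size <= words.length; i++) {
      const shingle = words.slice(i, i + size).join(' ');
      tokens.push(shingle.replace(/ /g, '')); // " " -> ""
    }
  }
  return tokens;
}

console.log(bigramCombiner('new york')); // [ 'new', 'york', 'newyork' ]
```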

Now, searching the field for either new york or newyork yields the result you wanted.

Let's see how to use this analyzer in our index. Now that you have created the above settings, let's add the mapping. We will create a mapping type called cities with a field called city, which uses the analyzer bigram_combiner.

PUT test/_mapping/cities
{
  "properties": {
    "city": {
      "type": "text",
      "analyzer": "bigram_combiner"
    }
  }
}
Now lets insert the data:

PUT test/cities/1
{ "city": "new york" }

The index is ready and the data is inserted; it's time to search and see the magic.

GET test/_search
{
  "query": {
    "match": {
      "city": "newyork"
    }
  }
}
Here is the output from the above query

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.9181343,
    "hits": [
      {
        "_index": "test",
        "_type": "cities",
        "_id": "1",
        "_score": 0.9181343,
        "_source": {
          "city": "new york"
        }
      }
    ]
  }
}
Finally, we achieved what we wanted: we searched for newyork without a space and still got back the result containing new york as two separate words.

We achieved this using just a combination of filters and an analyzer. Elasticsearch filters and tokenizers give you much more power over search when used in combination. There are lots of other filters available, and you can refer to the Elasticsearch documentation for them.

Be careful when using these settings on your data: the shingle filter produces combinations of all adjacent words, so inputs with more than two words will consume noticeably more memory in your index. Hope this is helpful.