Building Corpus for AutoSuggest (Part 1)

OLX Autosuggest

This is the first post in a series describing the work we did at OLX to improve the user search experience by enhancing autosuggest. We’ll go through the various challenges we faced and the techniques we used to tackle each one. So, let’s start!


Autosuggest is an important tool for improving the modern user search experience. It encapsulates various techniques to predict what the user is going to type or what the user is searching for on the platform.

It is more about understanding the user’s intent (making predictions) than just giving suggestions.


It saves users a lot of time, as writing out a complete search query can be cumbersome, especially on mobile phones, where the screen is small and typing is error-prone.

According to Google, on average it reduces typing by about 25 percent and saves around 200 years of typing time per day. Yes, per day!

Well, the above statistic pretty much sums up why we need autosuggest and why it is widely used across platforms.


Before we do a deep dive into the corpus building part, let’s first take some time to define which use cases we are going to solve here.

We are going to define the steps used for building the corpus for the Most Popular Suggester (MPS), the most common type of suggester, which is built from past user search queries on the platform.

Following are some of the major problems with user search queries:

  1. Queries with special characters and/or multiple spaces between words (“cars %in gurgaon”).
  2. Keyword ordered duplicates (“fortuner gurgaon” and “gurgaon fortuner”).
  3. Missing space duplicates (“iphone11” and “iphone 11”).
  4. Incorrect spellings (“iphine” and “iphone”).
  5. Queries with profane words.

We’ll go through possible solutions for the aforementioned problems using something we call the Preprocessor.


Preprocessor Flow Diagram

The input to the preprocessor is the query logs (a list of user search query terms), and the output is the preprocessed query logs (each query term with its frequency).

The Preprocessor primarily consists of 5 components:

1. Parser

The main responsibility of the parser is to convert each query into a valid search query by removing unwanted characters (such as special or unsupported characters) and collapsing multiple spaces into one.

The algorithm is pretty straightforward: replace the special characters with spaces, split the query on whitespace, and then join the words back with a single space to form the query.

For example,

Input — “cars % in __gurgaon” 
Output — “cars in gurgaon”
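The steps above can be sketched in a few lines of Python (the character whitelist here is an assumption; the actual set of supported characters depends on the platform):

```python
import re

def parse_query(query: str) -> str:
    """Clean a raw query: strip unwanted characters, collapse whitespace."""
    # Replace anything that is not a letter, digit, or space with a space
    # (the exact whitelist is platform-specific).
    cleaned = re.sub(r"[^a-z0-9\s]", " ", query.lower())
    # Split on whitespace and join back with single spaces.
    return " ".join(cleaned.split())
```

For example, `parse_query("cars % in __gurgaon")` yields “cars in gurgaon”.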

2. De-duplicator

De-duplication is the process of merging similar queries into one. This component mainly does the following three kinds of de-duplication:

a) Keyword De-duplication

Keyword De-duplication

From the above list of queries, we can observe a few things:

  • Two queries are keyword ordered duplicates if they contain the same set of keywords.
  • We don’t need to consider stop words, as they add no special meaning to the search query.
  • We pick the highest-frequency query as the best one.
  • We merge the frequencies of all the input queries and assign the total to the highest-frequency query.

Since the number of queries in the query logs can be huge, we need a fast way to check whether two queries contain the same set of keywords.

Let’s create a hash of each query in the following way:

  • Break the query into keywords by space.
  • Remove duplicate keywords and stop words.
  • Sort the keywords and merge them back by a separator.

Now all of the above queries will have the same hash, “fortuner:gurgaon” (where “:” is the separator used), so we can simply compare hashes to de-duplicate these queries.
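The whole keyword de-duplication step can be sketched as follows (the stop-word list is an illustrative subset):

```python
from collections import defaultdict

STOP_WORDS = {"in", "for", "the", "a", "an"}  # illustrative subset

def keyword_hash(query: str) -> str:
    """Sorted, stop-word-free keywords joined by ':'."""
    keywords = {w for w in query.lower().split() if w not in STOP_WORDS}
    return ":".join(sorted(keywords))

def dedupe_keywords(query_freqs: dict) -> dict:
    """Group queries by hash, keep the highest-frequency one, merge counts."""
    groups = defaultdict(list)
    for query, freq in query_freqs.items():
        groups[keyword_hash(query)].append((query, freq))
    result = {}
    for items in groups.values():
        best_query = max(items, key=lambda qf: qf[1])[0]
        result[best_query] = sum(freq for _, freq in items)
    return result
```

For example, “fortuner gurgaon” (50), “gurgaon fortuner” (30), and “fortuner in gurgaon” (20) all hash to “fortuner:gurgaon” and collapse into “fortuner gurgaon” with frequency 100.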

b) Missing Space De-duplication

Missing Space De-duplication

Looking at the above queries, we can see that users sometimes miss spaces between keywords, but the queries are all the same. So how do we merge these queries, or identify them in the first place?

Well, we can tweak the solution to the previous problem and follow similar steps: just drop the sorting step and the “:” separator while merging.

All the input queries will have the same hash as “iphone11pro”.
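A sketch of the tweaked hash, assuming word order is preserved and spaces are simply removed:

```python
def missing_space_hash(query: str) -> str:
    """Like the keyword hash, but without sorting or a separator,
    so queries differing only in spacing collide."""
    return "".join(query.lower().split())
```

With this, “iphone 11 pro”, “iphone11 pro”, and “iphone 11pro” all hash to “iphone11pro” and can be merged the same way as before.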

c) Synonymous De-duplication

"home theater" -> "home theatre"
"cruze" -> "cruise"

This kind of de-duplication is a bit different from the previous two. For some queries, the spell checker cannot help: either the user’s intent is the same but the spelling is off by too many characters, or two queries mean the same thing but we have products or search results for only one of them.

We need to fix these queries for the following reasons:

  • We don’t have results for that query (or a keyword in it), so we shouldn’t be suggesting it.
  • We shouldn’t show misspelled queries to the users.

It is usually safe to correct queries whose spelling is off by at most one character using the Spell Checker component (discussed below), but when the spelling is off by more than one character, or the word is spelled completely differently (a synonym) yet means the same thing, we need this approach.

There are two ways we can solve this problem:

  • Provide a list of incorrect spellings along with their correct spellings for correction.
  • Use Phonetics for de-duplication.

We can see that many incorrect and correct keywords are similar-sounding or share a common phoneme as a base. We can use phonetic libraries to find which keywords are formed from the same phoneme.
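As an illustration, here is the classic Soundex algorithm, which maps similar-sounding English words to the same four-character code (a production system would more likely use a library implementation of Metaphone or a similar, more accurate algorithm):

```python
def soundex(word: str) -> str:
    """Classic Soundex: first letter plus up to three digit codes."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":            # h and w do not separate double codes
            continue
        code = codes.get(ch, "")  # vowels map to "" and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]
```

Here `soundex("cruze")` and `soundex("cruise")` both give “C620”, and “theater”/“theatre” both give “T360”, so such pairs can be grouped and merged just like the hash-based duplicates above.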

3. Profanity Filter

We need to filter out queries that are profane or contain bad language.

One basic approach is to use a list of profane words and drop a query if it contains a profane word (exact matching). It is a fast and efficient way of filtering out profane queries, but it can miss cases where the profane word is misspelled.
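Exact matching boils down to a set lookup per keyword (the word list here is a placeholder):

```python
PROFANE_WORDS = {"badword1", "badword2"}  # placeholder for a real word list

def is_clean(query: str) -> bool:
    """True if no keyword of the query is in the profane-word set."""
    return not any(word in PROFANE_WORDS for word in query.lower().split())
```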

Another approach is to use a machine-learning-based library that takes a list of profane words, does score-wise (similarity) matching, and discards a query if it looks profane (for example, the profanity_filter library in Python). It catches more cases, but it takes much longer to process the queries, as each profanity check is considerably slower than exact matching.

4. Keyword Dictionary Builder

Keyword Dictionary Builder

The main purpose of this component is to build a keyword frequency dictionary: sum up the frequencies of each distinct keyword across search queries, skipping all stop words (since they carry no special meaning).

This dictionary will be used in the next component.
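A minimal sketch of the builder, again with an illustrative stop-word list:

```python
from collections import Counter

STOP_WORDS = {"in", "for", "the", "a", "an"}  # illustrative subset

def build_keyword_dict(query_freqs: dict) -> Counter:
    """Sum query frequencies into per-keyword counts, skipping stop words."""
    keyword_freq = Counter()
    for query, freq in query_freqs.items():
        for word in query.split():
            if word not in STOP_WORDS:
                keyword_freq[word] += freq
    return keyword_freq
```

For example, the queries “cars in gurgaon” (10) and “used cars” (5) produce the counts cars: 15, gurgaon: 10, used: 5.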

5. Spell Checker

tracter -> tractor
fortunar -> fortuner
sweft -> swift

This component is used to correct misspelled queries within an edit distance of at most 1. Why 1 and not 2 or 3? Because as we increase the edit distance, keywords with totally different meanings start getting incorrectly “corrected” by the spell checker.

So an edit distance of 1 seems to be the right bet, as changing at most one character hardly ever changes the meaning of the word.

Now, what we need is an algorithm fast enough to correct a keyword on the basis of the keyword frequency dictionary we built earlier. Why do we need frequencies in the dictionary? Because frequency defines the priority of a keyword: how likely a misspelled keyword within 1 edit distance is to be corrected to that keyword.

For example:

iphone, 1000
iphine, 100
iphene, 50

So, if a query comes in with iphone, iphine, or iphene, it will be corrected to iphone, as it is the most valuable one (the one with the highest frequency in our query logs).

We used SymSpell, one of the fastest spell-check libraries available, with implementations in almost all common programming languages, and it works pretty well.

You can load the keyword frequency dictionary we built earlier into SymSpell, and whenever you ask for a query to be corrected, it gives you a list of all possible suggestions within the configured edit distance. From that list, you can simply pick the best one.

However, we came across one problem with picking the best one: it can sometimes change an already correct keyword into another correct keyword.

For example:

tvs -> tv

In the above example, our spell checker was correcting TVS (a motor company) to TV (television), which is incorrect, as both TV and TVS are valid keywords.

We analysed the data a bit and found an appropriate high-frequency threshold above which all keywords are valid words. In other words, if a keyword’s frequency is above the threshold, it will not be corrected; otherwise, it will be.

This trick pretty much solved our problem!
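Putting the pieces together, here is a minimal, naive illustration of the idea: generate all candidates within edit distance 1, pick the highest-frequency one from the keyword dictionary, and skip correction entirely for keywords above the high-frequency threshold (SymSpell itself uses a far faster deletes-only index, and the threshold value is something you have to tune on your own data):

```python
import string

def edits1(word: str) -> set:
    """All strings within one edit: deletes, swaps, replaces, inserts."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    swaps = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word: str, keyword_freq: dict, threshold: int) -> str:
    # Keywords above the high-frequency threshold are trusted as valid.
    if keyword_freq.get(word, 0) >= threshold:
        return word
    candidates = [w for w in edits1(word) | {word} if w in keyword_freq]
    return max(candidates, key=keyword_freq.get, default=word)
```

With the frequencies {"iphone": 1000, "iphine": 100, "tv": 8000, "tvs": 5000} and a threshold of 2000, “iphine” is corrected to “iphone”, while “tvs” is above the threshold and left untouched.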



We discussed various techniques for preprocessing (spell check, de-duplication, profanity filtering, etc.) the past user search queries and how they can be transformed into a corpus that can be used for the Most Popular Suggester.

Next Part:

In the next part, we will discuss the algorithms we used to build autosuggest from the corpus we created here.

Building Corpus for AutoSuggest (Part 1) was originally published in OLX Group Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: OLX
