in Product Management

Understanding the basics of text correction in the context of Ecommerce search

Search is a key feature of any Ecommerce marketplace as it drives organic orders based on buyer intent. Quite often, buyers often mistype their word which return inaccurate/wrong products. Automatic text correction of search queries can help to remedy this situation by showing buyers the results of the intended keyword instead of the mistyped keyword

As a new Search Product Manager at Shopee, I thought it would be good if I could gain better insight into the fundamentals of automatic text correction.

Basics of Text Correction

Peter Norvig has an excellent article highlighting the basics of text correction. In a nutshell, a basic text correction has to have at least 3 components:

Candidate Model

The candidate model generates the list of potential correct words for a given search keyword. Potential candidates are generally generated by doing all possible permutations of adding, substituting, transposing, or deleting a character from the original search keyword, within a given Damerau–Levenshtein edit distance. The

What is the Damerau–Levenshtein edit distance? This is basically the measure of how many operations (adding, substituting, transposing or deleting a character) is needed to convert one sequence to another. As part of his research, Damerau discovered that 80% of human misspellings has an edit distance of 1 (only 1 character change needed).

At this stage, while a lot of candidates will be generated, not all of them are meaningful. The candidates have to be checked against a corpus of known valid words.

Language Model

The language model is basically a corpus of known valid words language and the frequency/probability of such words appearing. i.e. the words “the” appears in 7% of English sentences. Typically, we can get a rough sample model from text mining a large amount of literature and then coming up with this model.

Selection Criteria

The selection criteria is what we use to to decide which of the potential candidates we have found is the “best” word to use as the corrected version. A possible selection criteria would be to compute a score based on the edit distance (the lower the better) and how frequently the word appears in our language model (the higher the better) and to pick the candidate word with the highest score.

Symmetric Delete Spelling Correction

While Peter Norvig’s version of text correction is short and simple, its performance is well known to be quite slow. While reviewing algorithm literature, I came across the Symmetric Delete Spelling Correction algorithm, which claims to be 1000x faster.

I implemented this algorithm in Python and compared the accuracy/speed of it against Peter Norvig’s code.

In [1]:
from pyxdameraulevenshtein import damerau_levenshtein_distance
# The frequent dictionary was obtained from https://github.com/wolfgarbe/symspell and is the intersection 
# of Google Books Ngram Data and Spell Checker Oriented Word Lists
# Download it directly here https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell/frequency_dictionary_en_82_765.txt
fname = "frequency_dictionary_en_82_765.txt"
dictionary = {}

def get_deletes_list(w, delete_distance):
    """to return all possible combinations of the string with x number of deletes"""
    master_deletes_list = []
    queue = [w]
    
    for x in range(delete_distance):
        temporary_queue = []
        
        for word in queue:
            if len(word)>delete_distance:
                for c in range(len(word)):
                    word_minus_c = word[:c] + word[c+1:]
                    if word_minus_c not in master_deletes_list:
                        master_deletes_list.append(word_minus_c)
                    if word_minus_c not in temporary_queue:
                        temporary_queue.append(word_minus_c)
        queue = temporary_queue
        
    return master_deletes_list

    
def create_dictionary_entry(w,frequency):
    """
    the dictionary will be in the format of {"key":[[autocorrected word 1, autocorrected word 2, etc],frequency}
    """
    #Add the word
    if w not in dictionary:
        dictionary[w] = [[w],frequency]   
    else:
        dictionary[w][0].append(w)
        dictionary[w] = [dictionary[w][0],frequency]  
    deletes_list = get_deletes_list(w,2)

    for word in deletes_list:
        if word not in dictionary:
            dictionary[word] = [[w],0]
        else:
            dictionary[word][0].append(w)

def create_dictionary(fname):
    total_frequency =0
    with open(fname) as file:
        for line in file:
            create_dictionary_entry(line.split()[0],line.split()[1])
    for keyword in dictionary:
        total_frequency += float(dictionary[keyword][1])
    for keyword in dictionary:
        dictionary[keyword].append(float(dictionary[keyword][1])/float(total_frequency))


def get_suggestions(w):
    search_deletes_list = get_deletes_list(w,2)
    search_deletes_list.append(w)
    candidate_corrections = []
    
    #Does not correct words which are existing in the dictionary and that has a high frequency based on the word corpus
    if w in dictionary and int(dictionary[w][1])>10000:
        return w
    else:
        for query in search_deletes_list:
            if query in dictionary:
                for suggested_word in dictionary[query][0]:
                        edit_distance = float(damerau_levenshtein_distance(w, suggested_word))
                        frequency = float(dictionary[suggested_word][1])
                        probability = float(dictionary[suggested_word][2])
                        score = frequency*(0.003**(edit_distance))
                        
                        candidate_corrections.append((suggested_word,frequency,edit_distance,score))

        candidate_corrections = sorted(candidate_corrections, key=lambda x:int(x[3]), reverse=True)
        return candidate_corrections

    
def get_correction(w):
    
    try:
        return get_suggestions(w)[0][0]
    except:
        return "no result"

    
create_dictionary(fname)
    

As you can see, the text correction script works reasonably well.

In [2]:
print(get_correction("pphoone"))
print(get_correction("dresss"))
print(get_correction("alptop"))
print(get_correction("beaautifol"))
phone
dress
laptop
beautiful

However, its main improvement over Peter Norvig’s algorithm is its speed in doing text corrections. Below I have combined the same test set of queries used by Peter Norvig to test the algorithm and its speed improvement is nearly 90x-100x faster. Symmetric Delete Spelling Correction allows me process around 1000 to 2000 words per second as compared to Peter Norvig’s method of only 20-25 word per second.

In [3]:
def spelltest(tests):
    "Run correction(wrong) on all (right, wrong) pairs; report results."
    import time
    start = time.clock()
    good, unknown = 0, 0
    n = len(tests)
    wrong_words_list =[]
    for right, wrong in tests:
        w = get_correction(wrong)
        good += (w == right)
        if w!=right:
            wrong_words_list.append(wrong)
        
    dt = time.clock() - start
    print('{:.0%} of {} correct  at {:.0f} words per second '
          .format(good / n, n, n / dt))
#     return wrong_words_list
    
def Testset(lines):
    "Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
    return [(right, wrong)
            for (right, wrongs) in (line.split(':') for line in lines)
            for wrong in wrongs.split()]

spelltest(Testset(open('norvig-spell-testset1.txt'))) # Development set
spelltest(Testset(open('norvig-spell-testset2.txt'))) # Final test set
74% of 270 correct  at 1075 words per second 
74% of 400 correct  at 1745 words per second 

Below is Norvig’s Code

In [4]:
import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))



def unit_tests():
    assert correction('speling') == 'spelling'              # insert
    assert correction('korrectud') == 'corrected'           # replace 2
    assert correction('bycycle') == 'bicycle'               # replace
    assert correction('inconvient') == 'inconvenient'       # insert 2
    assert correction('arrainged') == 'arranged'            # delete
    assert correction('peotry') =='poetry'                  # transpose
    assert correction('peotryy') =='poetry'                 # transpose + delete
    assert correction('word') == 'word'                     # known
    assert correction('quintessential') == 'quintessential' # unknown
    assert words('This is a TEST.') == ['this', 'is', 'a', 'test']
    assert Counter(words('This is a test. 123; A TEST this is.')) == (
           Counter({'123': 1, 'a': 2, 'is': 2, 'test': 2, 'this': 2}))
    assert len(WORDS) == 32192
    assert sum(WORDS.values()) == 1115504
    assert WORDS.most_common(10) == [
     ('the', 79808),
     ('of', 40024),
     ('and', 38311),
     ('to', 28765),
     ('in', 22020),
     ('a', 21124),
     ('that', 12512),
     ('he', 12401),
     ('was', 11410),
     ('it', 10681)]
    assert WORDS['the'] == 79808
    assert P('quintessential') == 0
    assert 0.07 < P('the') < 0.08
    return 'unit_tests pass'

def spelltest_norvig(tests, verbose=False):
    "Run correction(wrong) on all (right, wrong) pairs; report results."
    import time
    start = time.clock()
    good, unknown = 0, 0
    n = len(tests)
    for right, wrong in tests:
        w = correction(wrong)
        good += (w == right)
        if w != right:
            unknown += (right not in WORDS)
            if verbose:
                print('correction({}) => {} ({}); expected {} ({})'
                      .format(wrong, w, WORDS[w], right, WORDS[right]))
    dt = time.clock() - start
    print('{:.0%} of {} correct ({:.0%} unknown) at {:.0f} words per second '
          .format(good / n, n, unknown / n, n / dt))
    
def Testset_norvig(lines):
    "Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
    return [(right, wrong)
            for (right, wrongs) in (line.split(':') for line in lines)
            for wrong in wrongs.split()]

spelltest_norvig(Testset_norvig(open('norvig-spell-testset1.txt'))) # Development set
spelltest_norvig(Testset_norvig(open('norvig-spell-testset2.txt'))) # Final test set
75% of 270 correct (6% unknown) at 28 words per second 
68% of 400 correct (11% unknown) at 24 words per second 

 

Most of the speed gains came because some of the processing was shifted to the pre-query stage and also because the Symmetric Delete Spelling Correction algorithm, as its name suggests, only makes use of deletes. This makes it much less operation intensive.

 

 Improvements and Ecommerce adaptation

While the text correction algorithm created earlier works quite well for general English text correction, it will perform poorly in an Ecommerce setting. This is because its dictionary corpus is not optimised for product names and only contains general English words. In Ecommerce search, the dictionary should be augmented with either product titles or common search queries.

Furthermore, for non-alphabetic languages such as Mandarin or Thai, you will need to design separate text correction algorithms for it.