Search is a key feature of any Ecommerce marketplace as it drives organic orders based on buyer intent. Quite often, buyers often mistype their word which return inaccurate/wrong products. Automatic text correction of search queries can help to remedy this situation by showing buyers the results of the intended keyword instead of the mistyped keyword
As a new Search Product Manager at Shopee, I thought it would be good if I could gain better insight into the fundamentals of automatic text correction.
Basics of Text Correction
Peter Norvig has an excellent article highlighting the basics of text correction. In a nutshell, a basic text correction has to have at least 3 components:
Candidate Model
The candidate model generates the list of potential correct words for a given search keyword. Potential candidates are generally generated by doing all possible permutations of adding, substituting, transposing, or deleting a character from the original search keyword, within a given Damerau–Levenshtein edit distance. The
What is the Damerau–Levenshtein edit distance? This is basically the measure of how many operations (adding, substituting, transposing or deleting a character) is needed to convert one sequence to another. As part of his research, Damerau discovered that 80% of human misspellings has an edit distance of 1 (only 1 character change needed).
At this stage, while a lot of candidates will be generated, not all of them are meaningful. The candidates have to be checked against a corpus of known valid words.
Language Model
The language model is basically a corpus of known valid words language and the frequency/probability of such words appearing. i.e. the words “the” appears in 7% of English sentences. Typically, we can get a rough sample model from text mining a large amount of literature and then coming up with this model.
Selection Criteria
The selection criteria is what we use to to decide which of the potential candidates we have found is the “best” word to use as the corrected version. A possible selection criteria would be to compute a score based on the edit distance (the lower the better) and how frequently the word appears in our language model (the higher the better) and to pick the candidate word with the highest score.
Symmetric Delete Spelling Correction
While Peter Norvig’s version of text correction is short and simple, its performance is well known to be quite slow. While reviewing algorithm literature, I came across the Symmetric Delete Spelling Correction algorithm, which claims to be 1000x faster.
I implemented this algorithm in Python and compared the accuracy/speed of it against Peter Norvig’s code.
from pyxdameraulevenshtein import damerau_levenshtein_distance
# The frequent dictionary was obtained from https://github.com/wolfgarbe/symspell and is the intersection
# of Google Books Ngram Data and Spell Checker Oriented Word Lists
# Download it directly here https://github.com/wolfgarbe/SymSpell/blob/master/SymSpell/frequency_dictionary_en_82_765.txt
fname = "frequency_dictionary_en_82_765.txt"
dictionary = {}
def get_deletes_list(w, delete_distance):
"""to return all possible combinations of the string with x number of deletes"""
master_deletes_list = []
queue = [w]
for x in range(delete_distance):
temporary_queue = []
for word in queue:
if len(word)>delete_distance:
for c in range(len(word)):
word_minus_c = word[:c] + word[c+1:]
if word_minus_c not in master_deletes_list:
master_deletes_list.append(word_minus_c)
if word_minus_c not in temporary_queue:
temporary_queue.append(word_minus_c)
queue = temporary_queue
return master_deletes_list
def create_dictionary_entry(w,frequency):
"""
the dictionary will be in the format of {"key":[[autocorrected word 1, autocorrected word 2, etc],frequency}
"""
#Add the word
if w not in dictionary:
dictionary[w] = [[w],frequency]
else:
dictionary[w][0].append(w)
dictionary[w] = [dictionary[w][0],frequency]
deletes_list = get_deletes_list(w,2)
for word in deletes_list:
if word not in dictionary:
dictionary[word] = [[w],0]
else:
dictionary[word][0].append(w)
def create_dictionary(fname):
total_frequency =0
with open(fname) as file:
for line in file:
create_dictionary_entry(line.split()[0],line.split()[1])
for keyword in dictionary:
total_frequency += float(dictionary[keyword][1])
for keyword in dictionary:
dictionary[keyword].append(float(dictionary[keyword][1])/float(total_frequency))
def get_suggestions(w):
search_deletes_list = get_deletes_list(w,2)
search_deletes_list.append(w)
candidate_corrections = []
#Does not correct words which are existing in the dictionary and that has a high frequency based on the word corpus
if w in dictionary and int(dictionary[w][1])>10000:
return w
else:
for query in search_deletes_list:
if query in dictionary:
for suggested_word in dictionary[query][0]:
edit_distance = float(damerau_levenshtein_distance(w, suggested_word))
frequency = float(dictionary[suggested_word][1])
probability = float(dictionary[suggested_word][2])
score = frequency*(0.003**(edit_distance))
candidate_corrections.append((suggested_word,frequency,edit_distance,score))
candidate_corrections = sorted(candidate_corrections, key=lambda x:int(x[3]), reverse=True)
return candidate_corrections
def get_correction(w):
try:
return get_suggestions(w)[0][0]
except:
return "no result"
create_dictionary(fname)
As you can see, the text correction script works reasonably well.
print(get_correction("pphoone"))
print(get_correction("dresss"))
print(get_correction("alptop"))
print(get_correction("beaautifol"))
However, its main improvement over Peter Norvig’s algorithm is its speed in doing text corrections. Below I have combined the same test set of queries used by Peter Norvig to test the algorithm and its speed improvement is nearly 90x-100x faster. Symmetric Delete Spelling Correction allows me process around 1000 to 2000 words per second as compared to Peter Norvig’s method of only 20-25 word per second.
def spelltest(tests):
"Run correction(wrong) on all (right, wrong) pairs; report results."
import time
start = time.clock()
good, unknown = 0, 0
n = len(tests)
wrong_words_list =[]
for right, wrong in tests:
w = get_correction(wrong)
good += (w == right)
if w!=right:
wrong_words_list.append(wrong)
dt = time.clock() - start
print('{:.0%} of {} correct at {:.0f} words per second '
.format(good / n, n, n / dt))
# return wrong_words_list
def Testset(lines):
"Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
return [(right, wrong)
for (right, wrongs) in (line.split(':') for line in lines)
for wrong in wrongs.split()]
spelltest(Testset(open('norvig-spell-testset1.txt'))) # Development set
spelltest(Testset(open('norvig-spell-testset2.txt'))) # Final test set
Below is Norvig’s Code¶
import re
from collections import Counter
def words(text): return re.findall(r'\w+', text.lower())
WORDS = Counter(words(open('big.txt').read()))
def P(word, N=sum(WORDS.values())):
"Probability of `word`."
return WORDS[word] / N
def correction(word):
"Most probable spelling correction for word."
return max(candidates(word), key=P)
def candidates(word):
"Generate possible spelling corrections for word."
return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])
def known(words):
"The subset of `words` that appear in the dictionary of WORDS."
return set(w for w in words if w in WORDS)
def edits1(word):
"All edits that are one edit away from `word`."
letters = 'abcdefghijklmnopqrstuvwxyz'
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [L + R[1:] for L, R in splits if R]
transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
inserts = [L + c + R for L, R in splits for c in letters]
return set(deletes + transposes + replaces + inserts)
def edits2(word):
"All edits that are two edits away from `word`."
return (e2 for e1 in edits1(word) for e2 in edits1(e1))
def unit_tests():
assert correction('speling') == 'spelling' # insert
assert correction('korrectud') == 'corrected' # replace 2
assert correction('bycycle') == 'bicycle' # replace
assert correction('inconvient') == 'inconvenient' # insert 2
assert correction('arrainged') == 'arranged' # delete
assert correction('peotry') =='poetry' # transpose
assert correction('peotryy') =='poetry' # transpose + delete
assert correction('word') == 'word' # known
assert correction('quintessential') == 'quintessential' # unknown
assert words('This is a TEST.') == ['this', 'is', 'a', 'test']
assert Counter(words('This is a test. 123; A TEST this is.')) == (
Counter({'123': 1, 'a': 2, 'is': 2, 'test': 2, 'this': 2}))
assert len(WORDS) == 32192
assert sum(WORDS.values()) == 1115504
assert WORDS.most_common(10) == [
('the', 79808),
('of', 40024),
('and', 38311),
('to', 28765),
('in', 22020),
('a', 21124),
('that', 12512),
('he', 12401),
('was', 11410),
('it', 10681)]
assert WORDS['the'] == 79808
assert P('quintessential') == 0
assert 0.07 < P('the') < 0.08
return 'unit_tests pass'
def spelltest_norvig(tests, verbose=False):
"Run correction(wrong) on all (right, wrong) pairs; report results."
import time
start = time.clock()
good, unknown = 0, 0
n = len(tests)
for right, wrong in tests:
w = correction(wrong)
good += (w == right)
if w != right:
unknown += (right not in WORDS)
if verbose:
print('correction({}) => {} ({}); expected {} ({})'
.format(wrong, w, WORDS[w], right, WORDS[right]))
dt = time.clock() - start
print('{:.0%} of {} correct ({:.0%} unknown) at {:.0f} words per second '
.format(good / n, n, unknown / n, n / dt))
def Testset_norvig(lines):
"Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
return [(right, wrong)
for (right, wrongs) in (line.split(':') for line in lines)
for wrong in wrongs.split()]
spelltest_norvig(Testset_norvig(open('norvig-spell-testset1.txt'))) # Development set
spelltest_norvig(Testset_norvig(open('norvig-spell-testset2.txt'))) # Final test set
Most of the speed gains came because some of the processing was shifted to the pre-query stage and also because the Symmetric Delete Spelling Correction algorithm, as its name suggests, only makes use of deletes. This makes it much less operation intensive.
Improvements and Ecommerce adaptation
While the text correction algorithm created earlier works quite well for general English text correction, it will perform poorly in an Ecommerce setting. This is because its dictionary corpus is not optimised for product names and only contains general English words. In Ecommerce search, the dictionary should be augmented with either product titles or common search queries.
Furthermore, for non-alphabetic languages such as Mandarin or Thai, you will need to design separate text correction algorithms for it.
Great post, some comments though:
1. You are comparing the two algorithms using different dictionaries. Thus the speed/precisions differences are not only influenced by the algorithms and their implementation, but also by dictionary quality and dictionary size. Dictionaries are interchangeable between algorithms.
2. You did the comparison for maximum edit distance=2. The performance difference between the algorithms increases dramatically for higher edit distances.
3. You implemented the first version of the Symmetric Delete Spelling Correction algorithm. Version 6.1 is much faster, has shorter precalculation time and lower memory consumption.
See https://towardsdatascience.com/symspell-vs-bk-tree-100x-faster-fuzzy-string-search-spell-checking-c4f10d80a078 for a benchmark between Norvig’s algorithm, BK-Tree and the current version of SymSpell for different edit distances and dictionary sizes.