March 2012 Archives

Can we reverse engineer Google’s word correction algorithm given a corpus of misspelled words paired with their corrections?

Because I own the single-word domain name mischievous, and mischievous is one of the 100 most misspelled English words, I can analyze some interesting data from Google’s Webmaster Tools. I pulled out all the misspelled queries, with their impressions, that fall within a small Levenshtein distance of the correct spelling (at most 5 in the table below). There is a nice academic paper, Learning a Spelling Error Model from Search Query Logs, that I plan to use to explore some of this data in the future.
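For readers who have not met the metric before, here is a minimal Python sketch of Levenshtein distance and of a filter built on it. The function names and the `max_distance=5` default are assumptions made for illustration (5 is simply the largest distance in the table below), not the script actually used for this post.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    # previous[j] is the distance between the prefix of a seen so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def within_distance(queries, target="mischievous", max_distance=5):
    """Keep the queries whose edit distance to target is at most max_distance.

    max_distance=5 is an assumed cutoff, not a value stated in the post.
    """
    return [q for q in queries if levenshtein(q, target) <= max_distance]


print(levenshtein("mischevious", "mischievous"))  # 2, as in the table below
```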

A regression on a log-log chart shows that the impressions of each misspelling of mischievous, plotted against its rank among all keywords that lead to this blog, follow Zipf’s law. For words with under 10 impressions (ranks >= 83), I refitted the impressions from their rank, because Webmaster Tools only gives a sample value when the impressions are greater than 10.

mischievous-fvs-r.png
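A log-log regression of this kind can be sketched in a few lines of Python. This is an illustration under assumptions, not the exact fit behind the chart: it uses only the sampled rows of the table below (ranks under 83), the helper names are made up, and which rows went into the original fit is not stated in the post.

```python
import math


def fit_power_law(ranks, impressions):
    """Least-squares fit of log(impressions) = intercept + slope * log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(v) for v in impressions]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def predict_impressions(rank, slope, intercept):
    """Impressions implied by the fitted line for a given rank."""
    return math.exp(intercept + slope * math.log(rank))


# (rank, impressions) for the rows that Webmaster Tools reported directly.
SAMPLED = [
    (1, 27000), (2, 4500), (3, 700), (6, 500), (7, 500), (13, 170),
    (18, 150), (19, 150), (20, 110), (21, 90), (23, 90), (24, 70),
    (25, 70), (26, 70), (29, 60), (30, 60), (31, 60), (32, 50),
    (33, 35), (35, 35), (47, 16), (48, 16), (53, 12), (54, 12),
    (55, 12), (56, 12), (58, 12),
]

slope, intercept, r_squared = fit_power_law(*zip(*SAMPLED))
print(f"slope={slope:.2f}, R^2={r_squared:.2f}")
# Estimate for a rank that has no directly reported count.
print(f"rank 83 estimate: {predict_impressions(83, slope, intercept):.2f}")
```

A straight line on log-log axes corresponds to a power law, impressions proportional to rank raised to the slope, which is the shape Zipf’s law describes.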

Raw Data

You can use this table to gauge your spelling. (I should add the cumulative distribution so you can see what percentile a misspelling places you in.)

rank query impressions levenshtein similarity
1 mischievous 27000.00 0 1.00
2 mischevious 4500.00 2 0.41
3 mischivious 700.00 2 0.50
6 michevious 500.00 3 0.21
7 mischevous 500.00 1 0.64
13 mischiveous 170.00 2 0.50
18 mischieveous 150.00 1 0.67
19 mischivous 150.00 1 0.64
20 michievous 110.00 1 0.64
21 mischeivious 90.00 3 0.39
23 mischeivous 90.00 2 0.50
24 michevous 70.00 2 0.38
25 mischievious 70.00 1 0.67
26 mischeveous 70.00 2 0.41
29 mischeavious 60.00 3 0.39
30 mischiefous 60.00 1 0.60
31 michivious 60.00 3 0.28
32 mischeavous 50.00 2 0.50
33 mishevious 35.00 3 0.28
35 miscevious 35.00 3 0.35
47 mishievous 16.00 1 0.64
48 michievious 16.00 2 0.41
53 misgevious 12.00 4 0.28
54 micheivious 12.00 4 0.20
55 mischvious 12.00 3 0.44
56 mischiveious 12.00 2 0.47
58 mischevios 12.00 3 0.28
83 mischevius 11.15 2 0.35
101 miscevous 8.30 2 0.57
113 micheavous 7.01 3 0.28
133 mischeives 5.48 4 0.28
140 mischeviuos 5.08 3 0.26
153 mischiefious 4.44 2 0.56
176 mischeous 3.60 2 0.47
196 mechivious 3.06 4 0.21
218 miscievious 2.61 2 0.41
223 mechevious 2.52 4 0.15
241 mischieved 2.24 3 0.53
262 myschevious 1.98 3 0.20
263 misjevious 1.96 4 0.28
273 mischeviouse 1.86 3 0.32
277 machivious 1.82 4 0.21
279 mischeiveous 1.80 3 0.39
282 mischives 1.77 3 0.38
321 mischievous? 1.45 1 1.00
324 miscchievous 1.43 1 0.79
333 mischeifous 1.38 3 0.41
334 mistchivious 1.37 3 0.32
351 miscievous 1.27 1 0.64
357 mischieveious 1.24 2 0.63
363 mishcevious 1.21 3 0.26
371 mischievous  1.17 2 1.00
378 mischievous. 1.14 1 1.00
408 micheveous 1.01 3 0.21
422 mischevoius 0.96 2 0.41
430 mistivious 0.94 4 0.28
438 mischievo 0.91 2 0.69
444 misgivious 0.89 4 0.28
483 michivous 0.79 2 0.38
510 mischievous, 0.72 1 1.00
525 mystivious 0.69 5 0.15
528 myschivious 0.69 3 0.26
543 mis chievous 0.66 1 0.67
603 meschivious 0.56 3 0.26
606 mischievoud 0.56 1 0.71
626 mischeviois 0.53 3 0.26
629 micheavious 0.53 4 0.20
635 mishievious 0.52 2 0.41
661 miscivous 0.49 2 0.47
671 meschevious 0.48 3 0.20
676 miss chivous 0.47 3 0.39
734 mischieves 0.42 2 0.53
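The cumulative distribution mentioned above the table is not included here, but a rough Python sketch of one way to compute it follows. It treats the cumulative share of impressions down to a given row as that row's percentile, which is an assumed definition, and it uses only the top five rows for brevity; a real run would use every row.

```python
from itertools import accumulate


def cumulative_share(impressions):
    """Running fraction of total impressions, in the order given."""
    total = sum(impressions)
    return [running / total for running in accumulate(impressions)]


# Top rows of the table only, in rank order.
rows = [
    ("mischievous", 27000),
    ("mischevious", 4500),
    ("mischivious", 700),
    ("michevious", 500),
    ("mischevous", 500),
]

shares = cumulative_share([count for _, count in rows])
for (query, _), share in zip(rows, shares):
    print(f"{query:12s} {share:.3f}")
```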

Digging through my email, I realized that I had quite a bit of email related to sales of Facebook Class B shares on the secondary market. I dug around the emails that I saved from November 2011 and have eight data points for Facebook sales to date. Here are the results in a handy chart with the obligatory R² for a simple linear regression. I’ll also note that these “shares” are not actual shares but interests in an investment vehicle designed to hold shares of Facebook via an indirect interest.

facebook_sales_data.png

Facebook Share Sales Data

Detailed Facebook share transaction dates, prices, and volumes that I have collected:

Date                Price    Volume
November 16, 2011   $30.00   75,000
December 9, 2011    $33.00   100,000
December 21, 2011   $32.00   150,000
January 20, 2012    $34.00   70,000
February 8, 2012    $44.00   150,000
February 14, 2012   $42.00   200,000
February 22, 2012   $42.00   125,000
February 29, 2012   $40.00   125,000
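To reproduce a simple linear regression and R² from these eight points, a minimal Python sketch could look like the one below. It assumes price is regressed on the number of days since the first sale; the axes of the chart are not spelled out in the post, so that choice and the helper name are illustrative.

```python
from datetime import date

# (date, price in USD, volume) copied from the table above.
SALES = [
    (date(2011, 11, 16), 30.00, 75_000),
    (date(2011, 12, 9), 33.00, 100_000),
    (date(2011, 12, 21), 32.00, 150_000),
    (date(2012, 1, 20), 34.00, 70_000),
    (date(2012, 2, 8), 44.00, 150_000),
    (date(2012, 2, 14), 42.00, 200_000),
    (date(2012, 2, 22), 42.00, 125_000),
    (date(2012, 2, 29), 40.00, 125_000),
]


def linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R^2."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Assumed x-axis: days elapsed since the first recorded sale.
first_day = SALES[0][0]
days = [(sale_date - first_day).days for sale_date, _, _ in SALES]
prices = [price for _, price, _ in SALES]

slope, intercept, r_squared = linear_regression(days, prices)
print(f"price = {intercept:.2f} + {slope:.3f} * days, R^2 = {r_squared:.2f}")
```

The volume column is carried along in `SALES` but not used by this fit.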

About this Archive

This page is an archive of entries from March 2012 listed from newest to oldest.
