GCLI should allow fuzzy matching of commands

RESOLVED FIXED in Future

Status

P2
normal
RESOLVED FIXED
7 years ago
7 months ago

People

(Reporter: jwalker, Unassigned)

Tracking

unspecified
Future

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 1 obsolete attachment)

GCLI has a spell module, but it's disabled for being slow.
Example:

>> hlp -> help


See also bug 720689
See also http://stackoverflow.com/questions/5859561/getting-the-closest-string-match#answer-5859823
Also GCLI has a 'Speller' module that was used, but is now disabled because it was too slow.

(Triage)
Whiteboard: [good first bug][mentor=jwalker]
Created attachment 657691 [details] [diff] [review]
faster spell-checking patch

It turns out that the Speller module uses Peter Norvig's spellchecking algorithm where the vocabulary size is expected to be huge (~100K words). It computes all plausible edits and tries to match them with a big dictionary. This is not practical for GCLI, since the dictionary is very small. Instead, the Speller module should compute the Levenshtein distance for every possible command.

I implemented a working proof-of-concept that passes existing unit tests (I tested manually since I don't know how to run the tests). I'm not only using Levenshtein distance but Damerau-Levenshtein distance which includes swaps (eg. distance("help", "hlep") is not two but one).

I overwrote spell.js but this may be a bad idea since the patch is hard to read. I'd be happy to change anything in this patch. This is my first patch for a Mozilla product: if I did anything wrong I'll be happy to fix it.
Attachment #657691 - Flags: review?
Quentin, this patch looks great.

The fuzzy matching algorithms that I've seen in the wild look something like this: <https://raw.github.com/espadrine/mozcmd/master/find.mozcmd>
This algorithm returns a list of all accepted matches, and it is a subset of all commands. On the other hand, using Levenshtein yields all commands, each with a score (that may be very bad for words that have nothing in common).

This begets the question, what is the corresponding user experience? If it provides a list of acceptable matches, a list that gets more precise as the user types, then using Levenshtein is awkward. If it auto-corrects when we type space, then we really only need the best match, and Levenshtein is the tool for the job.

On a totally different note, I'd be very interested to know a performance analysis between the old approach (spellchecking) and the new one (yours).

Also, as a nit: the < operator doesn't order lexicographically, but along Unicode code points. (For instance, Abel < ant < Bob, but "Abel" < "Bob" < "ant".) 

Finally, a very rare edge-case occurs when you're computing with "__proto__" as a candidate. Unlikely to happen, but we never know. Can you use a map instead of an object for |distance|?
<https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/Map>
Thanks for looking at this Thaddee.

(In reply to Thaddee Tyl [:espadrine] from comment #3)
> Quentin, this patch looks great.
> 
> The fuzzy matching algorithms that I've seen in the wild look something like
> this: <https://raw.github.com/espadrine/mozcmd/master/find.mozcmd>
> This algorithm returns a list of all accepted matches, and it is a subset of
> all commands. On the other hand, using Levenshtein yields all commands, each
> with a score (that may be very bad for words that have nothing in common).
> 
> This begets the question, what is the corresponding user experience? If it
> provides a list of acceptable matches, a list that gets more precise as the
> user types, then using Levenshtein is awkward. If it auto-corrects when we
> type space, then we really only need the best match, and Levenshtein is the
> tool for the job.

Right now, it's providing a single match, but there is nothing to prevent us updating the UI to show more than one.

> On a totally different note, I'd be very interested to know a performance
> analysis between the old approach (spellchecking) and the new one (yours).
> 
> Also, as a nit: the < operator doesn't order lexicographically, but along
> Unicode code points. (For instance, Abel < ant < Bob, but "Abel" < "Bob" <
> "ant".) 

Does that cause a bug, or is it just a case of localeCompare being better in general?

> Finally, a very rare edge-case occurs when you're computing with "__proto__"
> as a candidate. Unlikely to happen, but we never know. Can you use a map
> instead of an object for |distance|?
> <https://developer.mozilla.org/en-US/docs/JavaScript/Reference/
> Global_Objects/Map>

Agreed, except that GCLI works in web browsers where Map isn't available yet.

I'd like to accept this patch because it works (quite nicely) now, but I'm not against us looking at presenting the user with multiple matches in a followup.
Created attachment 658520 [details] [diff] [review]
v2

Several of the tests that checked for predictions were broken buy the original patch, this fixes them.
Also hidden commands were considered for completion. They are now correctly hidden.
Finally the spell module was tidied up to remove the need for a constructor, to fix some indents, etc.
Attachment #657691 - Attachment is obsolete: true
Attachment #657691 - Flags: review?
(In reply to Joe Walker [:jwalker] from comment #4)
> > Also, as a nit: the < operator doesn't order lexicographically, but along
> > Unicode code points. (For instance, Abel < ant < Bob, but "Abel" < "Bob" <
> > "ant".) 
> 
> Does that cause a bug, or is it just a case of localeCompare being better in
> general?

It's not a bug, especially in this case where it's only used for strings with the same score. That's just me being pedant about alphabetical order ☺
`localeCompare` still gives us ASCIIbetical order either, though.

(More at http://en.wikipedia.org/wiki/ASCIIbetical_order#Order)

> I'd like to accept this patch because it works (quite nicely) now, but I'm
> not against us looking at presenting the user with multiple matches in a
> followup.

That sounds great.
Created attachment 658614 [details]
Profile for the new Damerau-Levenshtein algorithm.

(In reply to Thaddee Tyl [:espadrine] from comment #3)
> Quentin, this patch looks great.
> 
> This begets the question, what is the corresponding user experience? If it
> provides a list of acceptable matches, a list that gets more precise as the
> user types, then using Levenshtein is awkward. If it auto-corrects when we
> type space, then we really only need the best match, and Levenshtein is the
> tool for the job.

I assumed that Levenshtein was only to be used for spellchecking, eg. when nothing else worked. It gives a nice edit distance and the way it works is well understood and is said to match the kind of typos humans do.

If you want to show multiple commands at the same time, the current algorithm can stay the same: when matching, show possible matches, otherwise, show the possible matches. Lowering the maximum edit distance would avoid showing some of the weirdest matches. However, multiple matches are not going to be needed until GCLI has way more commands. But it can make sense to optimize for a lot of commands right now.

> On a totally different note, I'd be very interested to know a performance
> analysis between the old approach (spellchecking) and the new one (yours).

I used Firebug (the only tool I know for this), and "benchmarked" by typing a few wrong words and asking for autocomplete. I tested the old version (norvig-spellcheck.html), and Joe Walker's version of the patch (levenshtein.html).

You can look at the fully detailed profile logs if you like, but what's important is that the old version spent more than 97% of CPU time in spell.js and the new version about 7.80% (3.57% in the actual edit distance code). It's no longer the main bottleneck.

> Also, as a nit: the < operator doesn't order lexicographically, but along
> Unicode code points. (For instance, Abel < ant < Bob, but "Abel" < "Bob" <
> "ant".) 

Maybe the patch could say "in the ASCIIbetical order"? What's important here is to have an uniform behavior.
Created attachment 658616 [details]
Profile for the old version
Attachment #658616 - Attachment mime type: text/plain → text/html
Attachment #658614 - Attachment mime type: text/plain → text/html
https://tbpl.mozilla.org/?tree=Fx-Team&rev=aaa9837d68b3
Whiteboard: [good first bug][mentor=jwalker] → [fixed-in-fx-team]
https://hg.mozilla.org/mozilla-central/rev/dea1d1224107
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Whiteboard: [fixed-in-fx-team]

Updated

7 months ago
Product: Firefox → DevTools
You need to log in before you can comment on or make changes to this bug.