Bug 675685 (Opened 13 years ago, Updated 2 years ago)

TextUpdater::DoUpdate computes Levenshtein Edit Distance inefficiently

Categories: Core :: Disability Access APIs, defect
People: Reporter: jruderman, Unassigned
References: Blocks 1 open bug
Details: Keywords: perf; Whiteboard: [bk1]
Attachments: 1 file, 4 obsolete files

TextUpdater::DoUpdate seems to be using a Levenshtein distance algorithm that is O(m*n) in both space and time.

According to http://en.wikipedia.org/wiki/Levenshtein_distance there are more efficient general algorithms, as well as specialized algorithms for cases where a capped answer like "4 or more" is acceptable.

I don't know how often this code runs and what the typical input sizes are.
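
For reference, the O(m*n) formulation in question is the textbook Wagner-Fischer dynamic program. A minimal editor's sketch (illustrative names, not the actual TextUpdater code):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Classic Wagner-Fischer DP: O(m*n) time AND O(m*n) space, which is
    // exactly the cost this bug is about.
    size_t NaiveLevenshtein(const std::u16string& a, const std::u16string& b) {
      const size_t m = a.size(), n = b.size();
      std::vector<std::vector<size_t>> d(m + 1, std::vector<size_t>(n + 1));
      for (size_t i = 0; i <= m; ++i) d[i][0] = i;  // delete all of a[0..i)
      for (size_t j = 0; j <= n; ++j) d[0][j] = j;  // insert all of b[0..j)
      for (size_t i = 1; i <= m; ++i) {
        for (size_t j = 1; j <= n; ++j) {
          const size_t subst = d[i-1][j-1] + (a[i-1] == b[j-1] ? 0 : 1);
          d[i][j] = std::min({d[i-1][j] + 1,   // delete a[i-1]
                              d[i][j-1] + 1,   // insert b[j-1]
                              subst});         // substitute or match
        }
      }
      return d[m][n];
    }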
Whiteboard: [bk1]
Assignee: nobody → askalski
I have the following ideas right now:
1) I can optimise memory usage to O(min(m,n)).
2) As I see it, "4 or more" can be generalized to "N or more" for any N (using the diagonal approach described on Wikipedia). If that's a sufficient answer, I can code it in O(min(m,n) * (1+N)) (a guess, but I believe it's doable right now); see the banded sketch below.

Is there a known good bound for N?
Should I write any of these, or look further?
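
An editor's sketch of the banded idea from point 2, under the assumption that a capped answer is acceptable: only cells within maxDist of the main diagonal are computed, and anything outside the band is reported as "maxDist + 1 or more". All names are illustrative:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Banded Levenshtein: O(min(m,n) * (2*maxDist + 1)) time. Returns the
    // exact distance if it is <= maxDist, otherwise maxDist + 1, meaning
    // "maxDist + 1 or more".
    size_t BandedLevenshtein(const std::u16string& a, const std::u16string& b,
                             size_t maxDist) {
      const size_t m = a.size(), n = b.size();
      if ((m > n ? m - n : n - m) > maxDist)
        return maxDist + 1;  // the length difference alone forces d > maxDist

      const size_t kInf = maxDist + 1;
      std::vector<size_t> prev(n + 1, kInf), cur(n + 1, kInf);
      for (size_t j = 0; j <= std::min(n, maxDist); ++j) prev[j] = j;

      for (size_t i = 1; i <= m; ++i) {
        // Only cells within maxDist of the main diagonal can hold d <= maxDist.
        const size_t lo = i > maxDist ? i - maxDist : 1;
        const size_t hi = std::min(n, i + maxDist);
        std::fill(cur.begin(), cur.end(), kInf);
        cur[0] = i <= maxDist ? i : kInf;
        for (size_t j = lo; j <= hi; ++j) {
          const size_t subst = prev[j-1] + (a[i-1] == b[j-1] ? 0 : 1);
          cur[j] = std::min({prev[j] + 1, cur[j-1] + 1, subst});
          cur[j] = std::min(cur[j], kInf);  // clamp so values never exceed kInf
        }
        prev.swap(cur);
      }
      return std::min(prev[n], kInf);
    }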
(In reply to Andrzej Skalski from comment #1)
> I have the following ideas right now:
> 1) I can optimise memory usage to O(min(m,n)).

That's interesting, but I think speed is more important than memory here, assuming we don't create an array so big it won't fit. Either the array will be small and short-lived, so its exact size doesn't really matter, or it will be big, in which case jemalloc should just mmap us a region that we'll let it unmap fairly quickly; and since it's mmapped, fragmentation etc. shouldn't be an issue.

> 2) As I see it, "4 or more" can be generalized to "N or more" for any N
> (using the diagonal approach described on Wikipedia). If that's a
> sufficient answer, I can code it in O(min(m,n) * (1+N)).

Is that a space or a time bound? I haven't looked at Wikipedia in a while for this, and I'm pretty tired.

> Is there a known good bound for N?

Not really; theoretically it could be arbitrarily large, I think. So perhaps we should keep the max-length constant like we do now, but make it bigger.

> Should I write any of these, or look further?

I'm not really sure offhand, but my suggestion would be to optimize for speed over size.
> Is that a space or a time bound?

I meant computation time. The memory usage should remain linear as well (as I imagine it right now).

I think I'll propose two solutions: one that mimics the present function while saving memory, and another that takes an additional argument (a limit for N) that can be used for a speedup.

I can also write a third function that returns an approximate result, but in return it will be very fast.

I am generally very flexible with algorithms; if you see some paper with a revolutionary solution I am not aware of, just let me know :)
(In reply to Andrzej Skalski from comment #3)
> > Is that a space or a time bound?
> 
> I meant computation time. The memory usage should remain linear as well
> (as I imagine it right now).
> 
> I think I'll propose two solutions: one that mimics the present function
> while saving memory, and another that takes an additional argument (a
> limit for N) that can be used for a speedup.

I don't think we can have a limit for N if N is a distance.

> I can also write a third function that returns an approximate result, but
> in return it will be very fast.

What does an approximate result mean, exactly?
> I don't think we can have a limit for N if N is a distance.

Then I guess all we're left with is the classic approach with some minor (constant-factor) optimizations. I am still waiting for an e-mail answer from someone with a PhD in this stuff.

> What does an approximate result mean, exactly?

I was thinking about http://people.csail.mit.edu/konak/papers/approximating_edit_distance.html, but now that I browse it, I don't think it's practically implementable. It would take weeks to do it right, and I still don't know what the numerical errors would be.

I can google further, or take option #1.
I contacted two researchers who specialize in text algorithms, one of the authors of the linked MIT paper among them, and they both told me that the approximation algorithms are purely theoretical. Also, there is an O(m*n / log(n)) exact algorithm, but I was told it's impractical as well.

What can be done is:
- Reduce memory usage to linear, O(min(m,n))
- Reduce computation time to O(min(n,m) * d), where d is the distance
- Change the entire computation to O(n + d^2), which is very fast for similar strings (LED < sqrt(n))
- Finish the computation early if we use LED for decision making only (for example, with if (LED > max_D) we only need to compute LED up to the known limit); see the sketch after this list
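
A minimal sketch of that fourth point, combining the O(min(m,n)) two-row layout with the early cutoff. The bailout is valid because the minimum value in a DP row never decreases from one row to the next. Names are illustrative, not Gecko code:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Returns true and writes the distance if LED(a, b) <= maxD; returns
    // false as soon as the distance is guaranteed to exceed maxD.
    bool LevenshteinAtMost(const std::u16string& a, const std::u16string& b,
                           size_t maxD, size_t* outDist) {
      // Keep the shorter string along the rows so space is O(min(m,n)).
      const std::u16string& s = a.size() <= b.size() ? a : b;
      const std::u16string& t = a.size() <= b.size() ? b : a;
      std::vector<size_t> prev(s.size() + 1), cur(s.size() + 1);
      for (size_t j = 0; j <= s.size(); ++j) prev[j] = j;

      for (size_t i = 1; i <= t.size(); ++i) {
        cur[0] = i;
        size_t rowMin = cur[0];
        for (size_t j = 1; j <= s.size(); ++j) {
          const size_t subst = prev[j-1] + (t[i-1] == s[j-1] ? 0 : 1);
          cur[j] = std::min({prev[j] + 1, cur[j-1] + 1, subst});
          rowMin = std::min(rowMin, cur[j]);
        }
        if (rowMin > maxD) return false;  // row minima only grow from here
        prev.swap(cur);
      }
      *outDist = prev[s.size()];
      return *outDist <= maxD;
    }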

Some more theoretical stuff:
- I heard about an algorithm that works in O((n+m)*(min(n,m) - d)), which would be fast for very different strings (I'm waiting for a document to check that that's not a typo)
- I proposed merging the two approaches above to get one that works well in both cases, but I guess that's something another PhD could be made of. I am currently waiting for the professor's feedback on whether that's doable.

My suggestion:
Do the first two reductions, and the fourth one if possible.
(In reply to Andrzej Skalski from comment #6)
> I contacted two researchers who specialize in text algorithms, one of the
> authors of the linked MIT paper among them, and they both told me that the
> approximation algorithms are purely theoretical. Also, there is an
> O(m*n / log(n)) exact algorithm, but I was told it's impractical as well.

ok

> What can be done is:
> - Reduce memory usage to linear, O(min(m,n))
> - Reduce computation time to O(min(n,m) * d), where d is the distance

that'd be great

> - Change the entire computation to O(n + d^2), which is very fast for
> similar strings (LED < sqrt(n))

it's probably very often true that the edit distance is small; one big case of this is typing or deleting characters.

> - Finish the computation early if we use LED for decision making only (for
> example, with if (LED > max_D) we only need to compute LED up to the known
> limit)

Well, whatever algorithm we use, I think we'll need to keep the bailout we currently have for big computations: just fire delete and insert events for the whole strings. So this might be useful.

> My suggestion:
> Do the first two reductions, and the fourth one if possible.

that sounds fine. Optimizing this is good, but probably not the most important thing to work on optimizing right now.
(In reply to Trevor Saunders (:tbsaunde) from comment #7)

> > - Change the entire computation to O(n + d^2), which is very fast for
> > similar strings (LED < sqrt(n))
> 
> it's probably very often true that the edit distance is small; one big
> case of this is typing or deleting characters.

It can be big for text replacement. We don't run the Levenshtein algorithm for those, and as you said it makes sense to keep it that way, since I think 99% of the time it's running for equal strings or a single insertion or deletion.

> > My suggestion:
> > Do the first two reductions, and the fourth one if possible.
> 
> that sounds fine. Optimizing this is good, but probably not the most
> important thing to work on optimizing right now.

agree
> > it's probably very often true that the edit distance is small; one big
> > case of this is typing or deleting characters.
> 
> It can be big for text replacement. We don't run the Levenshtein algorithm
> for those, and as you said it makes sense to keep it that way, since I
> think 99% of the time it's running for equal strings or a single insertion
> or deletion.

Well, at http://en.wikipedia.org/wiki/Levenshtein_distance#Upper_and_lower_bounds it's stated that |m - n| <= d <= max(m,n). So the only situation in which this algorithm is actually slower is when you replace a short string with a much longer, totally different one, where d = len(new_text). In most situations the d^2 region fits well within the O(m*n) rectangle. It's much like qsort: the worst-case scenario is n^2, but do we really care?
I say "implement both, check which one performs better".

> > > My suggestion:
> > > Do the first two reductions, and the fourth one if possible.
> > 
> > that sounds fine. Optimizing this is good, but probably not the most
> > important thing to work on optimizing right now.
> 
> agree
This is just a proof of concept. To reach its full performance, the "slide" function needs to be replaced with a trie (suffix tree); a naive sketch of slide() follows below. The file contains an implementation of the naive version as well, for testing purposes.

Look at the sources cited in the header to understand the code; comments would be useless at this level of abstraction.
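
For context, "slide" is the Longest Common Extension step: from a pair of positions, walk forward while the characters match. A naive sketch of what such a function plausibly looks like (illustrative, not the attachment's code):

    #include <string>

    // Naive Longest Common Extension: length of the longest common prefix of
    // a[i..] and b[j..]. The diagonal algorithm "slides" down a diagonal with
    // this; replacing it with an O(1) LCE query is the planned speedup.
    size_t Slide(const std::u16string& a, size_t i,
                 const std::u16string& b, size_t j) {
      size_t k = 0;
      while (i + k < a.size() && j + k < b.size() && a[i + k] == b[j + k])
        ++k;
      return k;
    }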
I have a proof of concept; the memory consumption can be cut to linear right now (from the vector lhd I actually need only the previous row, which never grows wider than O(min(m,n))). I still need to write a suffix tree to get full performance.

And here are my questions:
1) Is there a special character I could use as an "outside the alphabet" symbol? It's a typical concept in text algorithms, but I am not sure about the capabilities of nsAString.
2) Is there any size limit on nsAString? Right now I rely on an "infinity" value and would like to keep it this way.
> And here are my questions:
> 1) Is there a special character I could use as an "outside the alphabet"
> symbol? It's a typical concept in text algorithms, but I am not sure about
> the capabilities of nsAString.

I'm not aware of one, but they're usually just buffers of UTF-16 code units. You're probably safe horking one of the ASCII control codes.

> 2) Is there any size limit on nsAString? Right now I rely on an "infinity"
> value and would like to keep it this way.

Well, it stores the length internally as a 32-bit int, so certainly no longer than that, and I'm a little scared of what happens if we have a string longer than 2^31 or so.
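
A hedged illustration of the sentinel idea: U+FFFF is a Unicode noncharacter, so it should never occur in well-formed text, which makes it a plausible "outside the alphabet" symbol for UTF-16 buffers (an editor's assumption, not something settled in this bug):

    #include <string>

    // U+FFFF is a Unicode noncharacter; assuming inputs are well-formed text
    // that never contains noncharacters, it works as an out-of-alphabet
    // sentinel.
    static const char16_t kSentinel = 0xFFFF;

    // Example use: join two strings so no match can run across the boundary,
    // a common trick when building suffix structures over a#b.
    std::u16string JoinWithSentinel(const std::u16string& a,
                                    const std::u16string& b) {
      std::u16string joined = a;
      joined += kSentinel;
      joined += b;
      return joined;
    }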
Update:

I implemented the algorithm from http://cs.haifa.ac.il/LANDAU/gadi/LMS.ps (pages 1-8), and now it works in O(m*d) time (as in Unix diff); this is the implementation in the proof of concept.

The way to upgrade it to O(n + d^2) is to replace the slide() method in the code (which is a naive solution to the Longest Common Extension problem) with something that runs in constant time. I did experiments with the original Ukkonen algorithm, as proposed in the paper, and after several hours of fighting I googled this:

http://www.mat.unb.br/~ayala/lvsaANDrmq.pdf

This reduces the LCE problem to Range Minimum Query, which can in turn be solved in O(1) time with Cartesian trees. After hours of experiments, I am abandoning the suffix tree approach and redesigning the code to work with this idea.
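
To make the scheme concrete, here is an editor's sketch of the diagonal technique being described (Landau-Vishkin style): for e = 0, 1, 2, ... it records, per diagonal k = j - i, the furthest row reachable with e edits, extending runs of matches with slide(). With the naive slide() above this is O(min(m,n) * d); an O(1) LCE query is exactly what turns it into the O(n + d^2) mentioned. Illustrative code, not the attachment's:

    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <string>
    #include <vector>

    // Naive LCE, as sketched earlier in this bug.
    size_t Slide(const std::u16string& a, size_t i,
                 const std::u16string& b, size_t j) {
      size_t k = 0;
      while (i + k < a.size() && j + k < b.size() && a[i + k] == b[j + k])
        ++k;
      return k;
    }

    size_t DiagonalLevenshtein(const std::u16string& a,
                               const std::u16string& b) {
      const ptrdiff_t m = a.size(), n = b.size();
      const ptrdiff_t finalDiag = n - m;  // the diagonal containing (m, n)
      const ptrdiff_t kUnset = std::numeric_limits<ptrdiff_t>::min() / 2;

      // prev/cur[k + e] = furthest row reached on diagonal k with e edits.
      std::vector<ptrdiff_t> prev, cur;
      for (ptrdiff_t e = 0;; ++e) {
        cur.assign(2 * e + 1, kUnset);
        // Look up diagonal dg in the previous round (-(e-1) .. e-1).
        auto P = [&](ptrdiff_t dg) -> ptrdiff_t {
          return (e == 0 || dg < -(e - 1) || dg > e - 1) ? kUnset
                                                         : prev[dg + e - 1];
        };
        for (ptrdiff_t k = -e; k <= e; ++k) {
          ptrdiff_t i;
          if (e == 0) {
            i = 0;                          // start at (0, 0)
          } else {
            i = std::max({P(k) + 1,         // substitution
                          P(k + 1) + 1,     // deletion (consume a[i])
                          P(k - 1)});       // insertion (consume b[j])
            // Clamping is safe: adjacent DP cells differ by at most 1.
            i = std::min({i, m, n - k});
            if (i < std::max<ptrdiff_t>(0, -k)) continue;  // off-grid
          }
          i += static_cast<ptrdiff_t>(Slide(a, static_cast<size_t>(i),
                                            b, static_cast<size_t>(i + k)));
          cur[k + e] = i;
          if (k == finalDiag && i == m)     // reached (m, n): done
            return static_cast<size_t>(e);
        }
        prev.swap(cur);
      }
    }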
OK, I've given up on suffix trees and suffix arrays for this week, and I would like to port the current approach (O(d*min(m,n)) worst case, O(n + d^2) expected time) as a patch. I optimized memory usage to O(max(m,n)) with only 2 allocations. I think the best way is to add a new class that computes this LED (for code clarity and further extensions). Where should I put this class?
(In reply to Andrzej Skalski from comment #14)
> OK, I've given up on suffix trees and suffix arrays for this week, and I
> would like to port the current approach (O(d*min(m,n)) worst case,
> O(n + d^2) expected time) as a patch. I optimized memory usage to
> O(max(m,n)) with only 2 allocations. I think the best way is to add a new
> class that computes this LED (for code clarity and further extensions).
> Where should I put this class?

Under TextUpdater, where the diff is calculated now. Would you like to keep the new and the old logic at the same time?
Attachment #597878 - Attachment is obsolete: true
Attachment #600043 - Attachment is obsolete: true
Attachment #601346 - Attachment is patch: true
Attachment #601346 - Attachment mime type: text/x-c++src → text/plain
While trying to extract events from the calculated matrix, I found out that I will most probably need all the intermediate information computed in the first attachment (though probably stored more compactly). An example that shows precisely why:

string 1: dupa
string 2: bisdupxa or bisduxpa

No matter which the second string is, diagonal_substitution and horizontal_translation look the same, yet the commands to turn string 1 into string 2 differ. The reason is subtle: a long diagonal slide does not necessarily indicate a string alignment; it may be the result of an alignment on some neighboring diagonal.

Unfortunately, storing all the "lhd" values requires O(d^2) space. Since we don't know d while calculating, we can either do a single worst-case max(m,n)^2 allocation, or use a std::vector-like approach to save memory by growing the allocation dynamically.

I'm pausing my work on LED for now, hoping I will get some better ideas.

The bug assignee hasn't logged in to Bugzilla in the last 7 months.
:Jamie, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: andrzej.j.skalski → nobody
Flags: needinfo?(jteh)
Severity: normal → S4
Flags: needinfo?(jteh)