Bug 675685 (Opened 13 years ago, Updated 2 years ago)

TextUpdater::DoUpdate computes Levenshtein Edit Distance inefficiently

Categories: Core :: Disability Access APIs, defect
People: Reporter: jruderman, Unassigned
References: Blocks 1 open bug
Details: Keywords: perf; Whiteboard: [bk1]
Attachments: 1 file, 4 obsolete files

TextUpdater::DoUpdate seems to be using a Levenshtein distance algorithm that is O(m*n) in both space and time.

According to http://en.wikipedia.org/wiki/Levenshtein_distance there are more efficient general algorithms, as well as specialized algorithms for cases where a capped answer like "4 or more" is acceptable.

I don't know how often this code runs and what the typical input sizes are.
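
For reference, the O(m*n) formulation in question is the textbook Wagner-Fischer dynamic program. A minimal editor's sketch (illustrative names, not the actual TextUpdater code):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Classic Wagner-Fischer DP: O(m*n) time AND O(m*n) space, which is
    // exactly the cost this bug is about.
    size_t NaiveLevenshtein(const std::u16string& a, const std::u16string& b) {
      const size_t m = a.size(), n = b.size();
      std::vector<std::vector<size_t>> d(m + 1, std::vector<size_t>(n + 1));
      for (size_t i = 0; i <= m; ++i) d[i][0] = i;  // delete all of a[0..i)
      for (size_t j = 0; j <= n; ++j) d[0][j] = j;  // insert all of b[0..j)
      for (size_t i = 1; i <= m; ++i) {
        for (size_t j = 1; j <= n; ++j) {
          const size_t subst = d[i-1][j-1] + (a[i-1] == b[j-1] ? 0 : 1);
          d[i][j] = std::min({d[i-1][j] + 1,   // delete a[i-1]
                              d[i][j-1] + 1,   // insert b[j-1]
                              subst});         // substitute or match
        }
      }
      return d[m][n];
    }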
Whiteboard: [bk1]
Assignee: nobody → askalski
I have the following ideas right now:
1) I can optimise memory usage to O(min(m,n)).
2) As I see it, "4 or more" can be generalized to "N or more" for any N (using the diagonal approach described on Wikipedia). If that's a sufficient answer, I can code it in O(min(m,n) * (1+N)) (a guess, but I believe it's doable right now); see the banded sketch below.

Is there a known good bound for N?
Should I write any of these, or look further?
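
An editor's sketch of the banded idea from point 2, under the assumption that a capped answer is acceptable: only cells within maxDist of the main diagonal are computed, and anything outside the band is reported as "maxDist + 1 or more". All names are illustrative:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Banded Levenshtein: O(min(m,n) * (2*maxDist + 1)) time. Returns the
    // exact distance if it is <= maxDist, otherwise maxDist + 1, meaning
    // "maxDist + 1 or more".
    size_t BandedLevenshtein(const std::u16string& a, const std::u16string& b,
                             size_t maxDist) {
      const size_t m = a.size(), n = b.size();
      if ((m > n ? m - n : n - m) > maxDist)
        return maxDist + 1;  // the length difference alone forces d > maxDist

      const size_t kInf = maxDist + 1;
      std::vector<size_t> prev(n + 1, kInf), cur(n + 1, kInf);
      for (size_t j = 0; j <= std::min(n, maxDist); ++j) prev[j] = j;

      for (size_t i = 1; i <= m; ++i) {
        // Only cells within maxDist of the main diagonal can hold d <= maxDist.
        const size_t lo = i > maxDist ? i - maxDist : 1;
        const size_t hi = std::min(n, i + maxDist);
        std::fill(cur.begin(), cur.end(), kInf);
        cur[0] = i <= maxDist ? i : kInf;
        for (size_t j = lo; j <= hi; ++j) {
          const size_t subst = prev[j-1] + (a[i-1] == b[j-1] ? 0 : 1);
          cur[j] = std::min({prev[j] + 1, cur[j-1] + 1, subst});
          cur[j] = std::min(cur[j], kInf);  // clamp so values never exceed kInf
        }
        prev.swap(cur);
      }
      return std::min(prev[n], kInf);
    }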
(In reply to Andrzej Skalski from comment #1)
> I have the following ideas right now:
> 1) I can optimise memory usage to O(min(m,n)).

That's interesting, but I think speed is more important than memory here, assuming we don't create an array so big it won't fit. Either the array will be small and short-lived, so its exact size doesn't really matter, or it will be big, in which case jemalloc should just mmap us a region that we'll let it unmap fairly quickly; and since it's mmapped, fragmentation etc. shouldn't be an issue.

> 2) As I see it, "4 or more" can be generalized to "N or more" for any N
> (using the diagonal approach described on Wikipedia). If that's a
> sufficient answer, I can code it in O(min(m,n) * (1+N)).

Is that a space or a time bound? I haven't looked at Wikipedia in a while for this, and I'm pretty tired.

> Is there a known good bound for N?

Not really; theoretically it could be arbitrarily large, I think. So perhaps we should keep the max-length constant like we do now, but make it bigger.

> Should I write any of these, or look further?

I'm not really sure offhand, but my suggestion would be to optimize for speed over size.
> Is that a space or a time bound?

I meant computation time. The memory usage should remain linear as well (as I imagine it right now).

I think I'll propose two solutions: one that mimics the present function while saving memory, and another that takes an additional argument (a limit for N) that can be used for a speedup.

I can also write a third function that returns an approximate result, but in return it will be very fast.

I am generally very flexible with algorithms; if you see some paper with a revolutionary solution I am not aware of, just let me know :)
(In reply to Andrzej Skalski from comment #3)
> > Is that a space or a time bound?
> 
> I meant computation time. The memory usage should remain linear as well
> (as I imagine it right now).
> 
> I think I'll propose two solutions: one that mimics the present function
> while saving memory, and another that takes an additional argument (a
> limit for N) that can be used for a speedup.

I don't think we can have a limit for N if N is a distance.

> I can also write a third function that returns an approximate result, but
> in return it will be very fast.

What does an approximate result mean, exactly?
> I don't think we can have a limit for N if N is a distance.

Then I guess all we're left with is the classic approach with some minor (constant-factor) optimizations. I am still waiting for an e-mail answer from someone with a PhD in this stuff.

> What does an approximate result mean, exactly?

I was thinking about http://people.csail.mit.edu/konak/papers/approximating_edit_distance.html, but now that I browse it, I don't think it's practically implementable. It would take weeks to do it right, and I still don't know what the numerical errors would be.

I can google further, or take option #1.
I contacted two researchers who specialize in text algorithms, one of the authors of the linked MIT paper among them, and they both told me that the approximation algorithms are purely theoretical. Also, there is an O(m*n / log(n)) exact algorithm, but I was told it's impractical as well.

What can be done is:
- Reduce memory usage to linear, O(min(m,n))
- Reduce computation time to O(min(n,m) * d), where d is the distance
- Change the entire computation to O(n + d^2), which is very fast for similar strings (LED < sqrt(n))
- Finish the computation early if we use LED for decision making only (for example, with if (LED > max_D) we only need to compute LED up to the known limit); see the sketch after this list
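
A minimal sketch of that fourth point, combining the O(min(m,n)) two-row layout with the early cutoff. The bailout is valid because the minimum value in a DP row never decreases from one row to the next. Names are illustrative, not Gecko code:

    #include <algorithm>
    #include <string>
    #include <vector>

    // Returns true and writes the distance if LED(a, b) <= maxD; returns
    // false as soon as the distance is guaranteed to exceed maxD.
    bool LevenshteinAtMost(const std::u16string& a, const std::u16string& b,
                           size_t maxD, size_t* outDist) {
      // Keep the shorter string along the rows so space is O(min(m,n)).
      const std::u16string& s = a.size() <= b.size() ? a : b;
      const std::u16string& t = a.size() <= b.size() ? b : a;
      std::vector<size_t> prev(s.size() + 1), cur(s.size() + 1);
      for (size_t j = 0; j <= s.size(); ++j) prev[j] = j;

      for (size_t i = 1; i <= t.size(); ++i) {
        cur[0] = i;
        size_t rowMin = cur[0];
        for (size_t j = 1; j <= s.size(); ++j) {
          const size_t subst = prev[j-1] + (t[i-1] == s[j-1] ? 0 : 1);
          cur[j] = std::min({prev[j] + 1, cur[j-1] + 1, subst});
          rowMin = std::min(rowMin, cur[j]);
        }
        if (rowMin > maxD) return false;  // row minima only grow from here
        prev.swap(cur);
      }
      *outDist = prev[s.size()];
      return *outDist <= maxD;
    }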

Some more theoretical stuff:
- I heard about an algorithm that works in O((n+m)*(min(n,m) - d)), which would be fast for very different strings (I'm waiting for a document to check that that's not a typo)
- I proposed merging the two approaches above to get one that works well in both cases, but I guess that's something another PhD could be made of. I am currently waiting for the professor's feedback on whether that's doable.

My suggestion:
Do the first two reductions, and the fourth one if possible.
(In reply to Andrzej Skalski from comment #6)
> I contacted two researchers who specialize in text algorithms, one of the
> authors of the linked MIT paper among them, and they both told me that the
> approximation algorithms are purely theoretical. Also, there is an
> O(m*n / log(n)) exact algorithm, but I was told it's impractical as well.

ok

> What can be done is:
> - Reduce memory usage to linear, O(min(m,n))
> - Reduce computation time to O(min(n,m) * d), where d is the distance

that'd be great

> - Change the entire computation to O(n + d^2), which is very fast for
> similar strings (LED < sqrt(n))

it's probably very often true that the edit distance is small; one big case of this is typing or deleting characters.

> - Finish the computation early if we use LED for decision making only (for
> example, with if (LED > max_D) we only need to compute LED up to the known
> limit)

Well, whatever algorithm we use, I think we'll need to keep the bailout we currently have for big computations: just fire delete and insert events for the whole strings. So this might be useful.

> My suggestion:
> Do the first two reductions, and the fourth one if possible.

that sounds fine. Optimizing this is good, but probably not the most important thing to work on optimizing right now.
(In reply to Trevor Saunders (:tbsaunde) from comment #7)

> > - Change the entire computation to O(n + d^2), which is very fast for
> > similar strings (LED < sqrt(n))
> 
> it's probably very often true that the edit distance is small; one big
> case of this is typing or deleting characters.

It can be big for text replacement. We don't run the Levenshtein algorithm for those, and as you said it makes sense to keep it that way, since I think 99% of the time it's running for equal strings or a single insertion or deletion.

> > My suggestion:
> > Do the first two reductions, and the fourth one if possible.
> 
> that sounds fine. Optimizing this is good, but probably not the most
> important thing to work on optimizing right now.

agree
> > it's probably very often true that the edit distance is small; one big
> > case of this is typing or deleting characters.
> 
> It can be big for text replacement. We don't run the Levenshtein algorithm
> for those, and as you said it makes sense to keep it that way, since I
> think 99% of the time it's running for equal strings or a single insertion
> or deletion.

Well, at http://en.wikipedia.org/wiki/Levenshtein_distance#Upper_and_lower_bounds it's stated that |m - n| <= d <= max(m,n). So the only situation in which this algorithm is actually slower is when you replace a short string with a much longer, totally different one, where d = len(new_text). In most situations the d^2 region fits well within the O(m*n) rectangle. It's much like qsort: the worst-case scenario is n^2, but do we really care?
I say "implement both, check which one performs better".

> > > My suggestion:
> > > Do the first two reductions, and the fourth one if possible.
> > 
> > that sounds fine. Optimizing this is good, but probably not the most
> > important thing to work on optimizing right now.
> 
> agree
This is just a proof of concept. To reach its full performance, the "slide" function needs to be replaced with a trie (suffix tree); a naive sketch of slide() follows below. The file contains an implementation of the naive version as well, for testing purposes.

Look at the sources cited in the header to understand the code; comments would be useless at this level of abstraction.
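
For context, "slide" is the Longest Common Extension step: from a pair of positions, walk forward while the characters match. A naive sketch of what such a function plausibly looks like (illustrative, not the attachment's code):

    #include <string>

    // Naive Longest Common Extension: length of the longest common prefix of
    // a[i..] and b[j..]. The diagonal algorithm "slides" down a diagonal with
    // this; replacing it with an O(1) LCE query is the planned speedup.
    size_t Slide(const std::u16string& a, size_t i,
                 const std::u16string& b, size_t j) {
      size_t k = 0;
      while (i + k < a.size() && j + k < b.size() && a[i + k] == b[j + k])
        ++k;
      return k;
    }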
I have a proof of concept; the memory consumption can be cut to linear right now (from the vector lhd I actually need only the previous row, which never grows wider than O(min(m,n))). I still need to write a suffix tree to get full performance.

And here are my questions:
1) Is there a special character I could use as an "outside the alphabet" symbol? It's a typical concept in text algorithms, but I am not sure about the capabilities of nsAString.
2) Is there any size limit on nsAString? Right now I rely on an "infinity" value and would like to keep it this way.
> And here are my questions:
> 1) Is there a special character I could use as an "outside the alphabet"
> symbol? It's a typical concept in text algorithms, but I am not sure about
> the capabilities of nsAString.

I'm not aware of one, but they're usually just buffers of UTF-16 code units. You're probably safe horking one of the ASCII control codes.

> 2) Is there any size limit on nsAString? Right now I rely on an "infinity"
> value and would like to keep it this way.

Well, it stores the length internally as a 32-bit int, so certainly no longer than that, and I'm a little scared of what happens if we have a string longer than 2^31 or so.
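
A hedged illustration of the sentinel idea: U+FFFF is a Unicode noncharacter, so it should never occur in well-formed text, which makes it a plausible "outside the alphabet" symbol for UTF-16 buffers (an editor's assumption, not something settled in this bug):

    #include <string>

    // U+FFFF is a Unicode noncharacter; assuming inputs are well-formed text
    // that never contains noncharacters, it works as an out-of-alphabet
    // sentinel.
    static const char16_t kSentinel = 0xFFFF;

    // Example use: join two strings so no match can run across the boundary,
    // a common trick when building suffix structures over a#b.
    std::u16string JoinWithSentinel(const std::u16string& a,
                                    const std::u16string& b) {
      std::u16string joined = a;
      joined += kSentinel;
      joined += b;
      return joined;
    }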
Update:

I implemented the algorithm from http://cs.haifa.ac.il/LANDAU/gadi/LMS.ps (pages 1-8), and now it works in O(m*d) time (as in Unix diff); this is the implementation in the proof of concept.

The way to upgrade it to O(n + d^2) is to replace the slide() method in the code (which is a naive solution to the Longest Common Extension problem) with something that runs in constant time. I did experiments with the original Ukkonen algorithm, as proposed in the paper, and after several hours of fighting I googled this:

http://www.mat.unb.br/~ayala/lvsaANDrmq.pdf

This reduces the LCE problem to Range Minimum Query, which can in turn be solved in O(1) time with Cartesian trees. After hours of experiments, I am abandoning the suffix tree approach and redesigning the code to work with this idea.
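
To make the scheme concrete, here is an editor's sketch of the diagonal technique being described (Landau-Vishkin style): for e = 0, 1, 2, ... it records, per diagonal k = j - i, the furthest row reachable with e edits, extending runs of matches with slide(). With the naive slide() above this is O(min(m,n) * d); an O(1) LCE query is exactly what turns it into the O(n + d^2) mentioned. Illustrative code, not the attachment's:

    #include <algorithm>
    #include <cstddef>
    #include <limits>
    #include <string>
    #include <vector>

    // Naive LCE, as sketched earlier in this bug.
    size_t Slide(const std::u16string& a, size_t i,
                 const std::u16string& b, size_t j) {
      size_t k = 0;
      while (i + k < a.size() && j + k < b.size() && a[i + k] == b[j + k])
        ++k;
      return k;
    }

    size_t DiagonalLevenshtein(const std::u16string& a,
                               const std::u16string& b) {
      const ptrdiff_t m = a.size(), n = b.size();
      const ptrdiff_t finalDiag = n - m;  // the diagonal containing (m, n)
      const ptrdiff_t kUnset = std::numeric_limits<ptrdiff_t>::min() / 2;

      // prev/cur[k + e] = furthest row reached on diagonal k with e edits.
      std::vector<ptrdiff_t> prev, cur;
      for (ptrdiff_t e = 0;; ++e) {
        cur.assign(2 * e + 1, kUnset);
        // Look up diagonal dg in the previous round (-(e-1) .. e-1).
        auto P = [&](ptrdiff_t dg) -> ptrdiff_t {
          return (e == 0 || dg < -(e - 1) || dg > e - 1) ? kUnset
                                                         : prev[dg + e - 1];
        };
        for (ptrdiff_t k = -e; k <= e; ++k) {
          ptrdiff_t i;
          if (e == 0) {
            i = 0;                          // start at (0, 0)
          } else {
            i = std::max({P(k) + 1,         // substitution
                          P(k + 1) + 1,     // deletion (consume a[i])
                          P(k - 1)});       // insertion (consume b[j])
            // Clamping is safe: adjacent DP cells differ by at most 1.
            i = std::min({i, m, n - k});
            if (i < std::max<ptrdiff_t>(0, -k)) continue;  // off-grid
          }
          i += static_cast<ptrdiff_t>(Slide(a, static_cast<size_t>(i),
                                            b, static_cast<size_t>(i + k)));
          cur[k + e] = i;
          if (k == finalDiag && i == m)     // reached (m, n): done
            return static_cast<size_t>(e);
        }
        prev.swap(cur);
      }
    }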
OK, I've given up on suffix trees and suffix arrays for this week, and I would like to port the current approach (O(d*min(m,n)) worst case, O(n + d^2) expected time) as a patch. I optimized memory usage to O(max(m,n)) with only 2 allocations. I think the best way is to add a new class that computes this LED (for code clarity and further extensions). Where should I put this class?
(In reply to Andrzej Skalski from comment #14)
> OK, I've given up on suffix trees and suffix arrays for this week, and I
> would like to port the current approach (O(d*min(m,n)) worst case,
> O(n + d^2) expected time) as a patch. I optimized memory usage to
> O(max(m,n)) with only 2 allocations. I think the best way is to add a new
> class that computes this LED (for code clarity and further extensions).
> Where should I put this class?

Under TextUpdater, where the diff is calculated now. Would you like to keep the new and the old logic at the same time?
Attachment #597878 - Attachment is obsolete: true
Attachment #600043 - Attachment is obsolete: true
Attachment #601346 - Attachment is patch: true
Attachment #601346 - Attachment mime type: text/x-c++src → text/plain
While trying to extract events from the calculated matrix, I found out that I will most probably need all the intermediate information computed in the first attachment (though probably stored more compactly). An example that shows precisely why:

string 1: dupa
string 2: bisdupxa or bisduxpa

No matter which the second string is, diagonal_substitution and horizontal_translation look the same, yet the commands to turn string 1 into string 2 differ. The reason is subtle: a long diagonal slide does not necessarily indicate a string alignment; it may be the result of an alignment on some neighboring diagonal.

Unfortunately, storing all the "lhd" values requires O(d^2) space. Since we don't know d while calculating, we can either do a single worst-case max(m,n)^2 allocation, or use a std::vector-like approach to save memory by growing the allocation dynamically.

I'm pausing my work on LED for now, hoping I will get some better ideas.

The bug assignee hasn't logged in to Bugzilla in the last 7 months.
:Jamie, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: andrzej.j.skalski → nobody
Flags: needinfo?(jteh)
Severity: normal → S4
Flags: needinfo?(jteh)