Closed Bug 498938 Opened 16 years ago Closed 16 years ago

Add Levenshtein Edit Distance function to Sqlite so we can use it in queries.

Categories

(Core :: SQLite and Embedded Database Bindings, defect)

Product:

Component:

Type:

defect

Priority:

Not set

Severity:

normal

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla1.9.2a1

People

(Reporter: cbartley, Assigned: cbartley)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 8 obsolete files)

Add Levenshtein Edit Distance function and register it with Sqlite. 16 years ago Curtis Bartley [:cbartley] 8.51 KB, patch	sdwilsh : review-	Details \| Diff \| Splinter Review
Add Levenshtein Edit Distance function and register it with Sqlite. Mk II 16 years ago Curtis Bartley [:cbartley] 11.36 KB, patch	sdwilsh : review-	Details \| Diff \| Splinter Review
Add Levenshtein Edit Distance function and register it with Sqlite. Mk III 16 years ago Curtis Bartley [:cbartley] 12.62 KB, patch		Details \| Diff \| Splinter Review
Add Levenshtein Edit Distance function and register it with Sqlite. Mk IV 16 years ago Curtis Bartley [:cbartley] 13.70 KB, patch	sdwilsh : review-	Details \| Diff \| Splinter Review
Add Levenshtein Edit Distance function and register it with Sqlite. Mk V 16 years ago Curtis Bartley [:cbartley] 14.10 KB, patch	sdwilsh : review+	Details \| Diff \| Splinter Review
Add Levenshtein Edit Distance function and register it with Sqlite. Mk VI 16 years ago Curtis Bartley [:cbartley] 14.73 KB, patch	sdwilsh : review+	Details \| Diff \| Splinter Review
Add Levenshtein Edit Distance function and register it with Sqlite. Mk VII 16 years ago Curtis Bartley [:cbartley] 15.47 KB, patch	sdwilsh : review+	Details \| Diff \| Splinter Review
Add Levenshtein Edit Distance function and register it with Sqlite. Mk VIII 16 years ago Curtis Bartley [:cbartley] 16.23 KB, patch	sdwilsh : review+	Details \| Diff \| Splinter Review
Add Levenshtein Edit Distance function and register it with Sqlite. Mk VIII -- regen 16 years ago Curtis Bartley [:cbartley] 16.23 KB, patch	vlad : superreview+	Details \| Diff \| Splinter Review

Curtis Bartley [:cbartley]

Assignee

Description

•

16 years ago

Attached patch Add Levenshtein Edit Distance function and register it with Sqlite. (obsolete) — Details — Splinter Review

One of the planned subfeatures for bug 482874 is alternative URL suggestion where alternative URLs are draw from the places database and ranked using edit distance. The ideal way to do this is to be able to use the edit distance function directly in the Sqlite queries. Since edit distance is such a generally useful function, I think it makes sense to just add it globally to our Sqlite implementation so it will be widely accessible.

Curtis Bartley [:cbartley]

Assignee

Updated

•

16 years ago

Attachment #383745 - Flags: review?(sdwilsh)

Shawn Wilsher :sdwilsh

Updated

•

16 years ago

Assignee: nobody → cbartley

Shawn Wilsher :sdwilsh

Updated

•

16 years ago

Attachment #383745 - Flags: review?(sdwilsh) → review-

Shawn Wilsher :sdwilsh

Comment 1

•

16 years ago

Comment on attachment 383745 [details] [diff] [review] Add Levenshtein Edit Distance function and register it with Sqlite. This needs some tests. See the work in bug 499990 where adw is adding collation support (and tests to go with it). http://reviews.visophyte.org/r/383745/diff/#index_header on file: storage/src/mozStorageSQLFunctions.h line 92 > * An implementation of the Leventshtein Edit Distance algorithm for use in typo: Levenshtein (and drop Edit please) on file: storage/src/mozStorageSQLFunctions.h line 102 > NS_HIDDEN_(void) levenshteinDistanceFunction(sqlite3_context *p, p is not aCtx ;) on file: storage/src/mozStorageSQLFunctions.cpp line 154 > /** > * Compute the Levenshtein Edit Distance for the strings in "s" and "t". > */ Proper javadoc style comments please on file: storage/src/mozStorageSQLFunctions.cpp line 157 > PRUint32 LevenshteinDistance(const nsAString &s, const nsAString &t) { Two things: 1) This doesn't follow the style guideline for function implementations (http://mxr.mozilla.org/mozilla-central/source/storage/style.txt). 2) You should probably just return int here since SQLite will expect an int. (You probably just want to use int throughout this). on file: storage/src/mozStorageSQLFunctions.cpp line 159 > const PRUint32 sn = s.Length(); > const PRUint32 tn = t.Length(); I'd prefer variable names with at least "len" in them, if not "length" on file: storage/src/mozStorageSQLFunctions.cpp line 187 > // Declare the raw pointers that will actually be used to access the memory. > // Raw pointers may be a little bit faster, and we can avoid relying on more > // obscure parts of nsAutoArrayPtr behavior when we swap the rows below. > PRUint32 *prevRow = row1; > PRUint32 *currRow = row2; What obscure parts are you referring to here. More details would be nice (although I understand the bit about swaping and the pointers). on file: storage/src/mozStorageSQLFunctions.cpp line 194 > for (PRUint32 si = 0; si <= sn; si++) { Can we just use "i" here as the variable. "si" seems to imply it's something other than what it actually is. on file: storage/src/mozStorageSQLFunctions.cpp line 226 > PRUint32 min = aPrime; > if (bPrime < min) min = bPrime; > if (cPrime < min) min = cPrime; > > currRow[si] = min; How about currRow[si] = NS_MIN( aPrime, NS_MIN(bPrime, cPrime)); on file: storage/src/mozStorageSQLFunctions.cpp line 356 > nsDependentString A(static_cast<const PRUnichar *>(::sqlite3_value_text16(aArgv[0]))); > nsDependentString B(static_cast<const PRUnichar *>(::sqlite3_value_text16(aArgv[1]))); You probably want to use nsAutoString here since it will use a stack buffer if the string fits (likely it will) so less heap allocations.

Curtis Bartley [:cbartley]

Assignee

Comment 2

•

16 years ago

Attached patch Add Levenshtein Edit Distance function and register it with Sqlite. Mk II (obsolete) — Details — Splinter Review

(In reply to comment #1) > (From update of attachment 383745 [details] [diff] [review]) > This needs some tests. See the work in bug 499990 where adw is adding > collation support (and tests to go with it). Done. > http://reviews.visophyte.org/r/383745/diff/#index_header > on file: storage/src/mozStorageSQLFunctions.h line 92 > > * An implementation of the Leventshtein Edit Distance algorithm for use > > typo: Levenshtein (and drop Edit please) Fixed the typo. However, I want to defend the "Edit". I'm surmising that more people will recognize what an "edit distance" algorithm is than will know what the "Levenshtein Distance" algorithm is. > on file: storage/src/mozStorageSQLFunctions.h line 102 > > NS_HIDDEN_(void) levenshteinDistanceFunction(sqlite3_context *p, > > p is not aCtx ;) Fixed. > on file: storage/src/mozStorageSQLFunctions.cpp line 154 > > /** > > * Compute the Levenshtein Edit Distance for the strings in "s" and "t". > > */ > > Proper javadoc style comments please Done. > on file: storage/src/mozStorageSQLFunctions.cpp line 157 > > PRUint32 LevenshteinDistance(const nsAString &s, const nsAString &t) { > > Two things: > 1) This doesn't follow the style guideline for function implementations > (http://mxr.mozilla.org/mozilla-central/source/storage/style.txt). Largely fixed. But do you really want "nsAString &aS, nsAString &aT"? Or I could do "nsAString &aString, nsAString &aAnotherString", but I don't think either one will make the code more readable. > 2) You should probably just return int here since SQLite will expect an int. > (You probably just want to use int throughout this). Fixed. > on file: storage/src/mozStorageSQLFunctions.cpp line 159 > > const PRUint32 sn = s.Length(); > > const PRUint32 tn = t.Length(); > > I'd prefer variable names with at least "len" in them, if not "length" Now "sLen" and "tLen". > on file: storage/src/mozStorageSQLFunctions.cpp line 187 > > // Declare the raw pointers that will actually be used to access the > > // Raw pointers may be a little bit faster, and we can avoid relying > > // obscure parts of nsAutoArrayPtr behavior when we swap the rows > > PRUint32 *prevRow = row1; > > PRUint32 *currRow = row2; > > What obscure parts are you referring to here. More details would be nice > (although I understand the bit about swaping and the pointers). nsAutoArrayPtr defines an assignment operator that *transfers* ownership rather than copying it. So if "a" and "b" are nsAutoArrayPtr's, then "a = b" will store the value of "b" in "a", and then it will set "b" to NULL! Now because it does this, the swapping code works fine, even though there is no refcounting taken place as you might expect since that's how nsCOMPtr and friends work. Maybe this is fine, but the whole thing where the assignment operator NULLs out its rhs gives me a queasy feeling, and I'd like to avoid depending on it if I could. I should note that this behavior is consistent with the STL's auto_ptr, but I still think it's wrong! I should also note that using nsAutoArrayPtr for "prevRow" and "currRow" doesn't appear to be measurably slower on OSX/gcc. I'm not sure what that does to my argument... > on file: storage/src/mozStorageSQLFunctions.cpp line 194 > > for (PRUint32 si = 0; si <= sn; si++) { > > Can we just use "i" here as the variable. "si" seems to imply it's something > other than what it actually is. Is this that confusing? I'm using "si" both here and down below for iterating across the matrix columns, which correspond to letters in the string "s". And I'm using "si" below because it's inside another loop (which uses "ti" to iterate the rows, which correspond to the letters in "t"). Also, I've updated the comments at the top to better indicate how the (virtual) matrix dimensions relate to "s" and "t", if that helps any. > on file: storage/src/mozStorageSQLFunctions.cpp line 226 > > PRUint32 min = aPrime; > > if (bPrime < min) min = bPrime; > > if (cPrime < min) min = cPrime; > > > > currRow[si] = min; > > How about > currRow[si] = NS_MIN( aPrime, NS_MIN(bPrime, cPrime)); Done. I wish we had a three argument NS_MIN, though. > on file: storage/src/mozStorageSQLFunctions.cpp line 356 > > nsDependentString A(static_cast<const PRUnichar *>(::sqlite3_value_text16 > > nsDependentString B(static_cast<const PRUnichar *>(::sqlite3_value_text16 > > You probably want to use nsAutoString here since it will use a stack buffer if > the string fits (likely it will) so less heap allocations. nsDependentString does not copy the string data at all, so it should be faster than nsAutoString. It would be nice if we had an nsAutoIntArray, though. Allocating the integer arrays appears to account for about half my execution time.

Attachment #383745 - Attachment is obsolete: true

Attachment #386131 - Flags: review?(sdwilsh)

Shawn Wilsher :sdwilsh

Comment 3

•

16 years ago

(In reply to comment #2) > Largely fixed. But do you really want "nsAString &aS, nsAString &aT"? > Or I could do "nsAString &aString, nsAString &aAnotherString", but I don't > think either one will make the code more readable. What about aStringA and aStringB? It makes it clearer when you are using the parameters to the function this way. Why is it "s" and "t" now? > > on file: storage/src/mozStorageSQLFunctions.cpp line 194 > > > for (PRUint32 si = 0; si <= sn; si++) { > > > > Can we just use "i" here as the variable. "si" seems to imply it's something > > other than what it actually is. > > Is this that confusing? I'm using "si" both here and down below for iterating > across the matrix columns, which correspond to letters in the string "s". > And I'm using "si" below because it's inside another loop (which uses "ti" to > iterate the rows, which correspond to the letters in "t"). So, this comment should only apply to the first loop that initializes the first row. The other loops made sense. > Done. I wish we had a three argument NS_MIN, though. Me too :( I bet we could make that work actually... > nsDependentString does not copy the string data at all, so it should be faster > than nsAutoString. That's a good point, as long as you don't need your string to be null terminated. > It would be nice if we had an nsAutoIntArray, though. Allocating the integer > arrays appears to account for about half my execution time. You could try nsAutoTArray<int, nsAutoString::kDefaultStroageSize> so we match what we consider to be the "normal size" of words in strings (and so we'll update if that ever changes)

Curtis Bartley [:cbartley]

Assignee

Comment 4

•

16 years ago

(In reply to comment #3) > (In reply to comment #2) > > Largely fixed. But do you really want "nsAString &aS, nsAString &aT"? > > Or I could do "nsAString &aString, nsAString &aAnotherString", but I don't > > think either one will make the code more readable. > What about aStringA and aStringB? It makes it clearer when you are using the > parameters to the function this way. Why is it "s" and "t" now? "s" and "t" are like "x" and "y", only for strings. Using "s" when you only have a single string argument is pretty common. Using "t" for a second argument is less common but not unheard of. I'm the king of reallyLongAndMeaningful identifiers, but sometimes they hurt more than they help. When things get too mathematical, especially when there's a matrix involved, it may make sense to use really short mathematics-style identifiers. I'm contending this is one of those cases. Another way to look at it is that it's easier to distinguish "s" from "t" when reading the code than it is to distinguish "aStringA" from "aStringB". > > > on file: storage/src/mozStorageSQLFunctions.cpp line 194 > > > > for (PRUint32 si = 0; si <= sn; si++) { > > > > > > Can we just use "i" here as the variable. "si" seems to imply it's something > > > other than what it actually is. > > > > Is this that confusing? I'm using "si" both here and down below for iterating > > across the matrix columns, which correspond to letters in the string "s". > > And I'm using "si" below because it's inside another loop (which uses "ti" to > > iterate the rows, which correspond to the letters in "t"). > So, this comment should only apply to the first loop that initializes the first > row. The other loops made sense. I can live with "i" here. It seemed reasonable to ask why the choice of "si" made sense in one place but not the other, though. (Plus I was confused about which for loop you were talking about, and didn't catch my mistake until the very last moment :-) > > Done. I wish we had a three argument NS_MIN, though. > Me too :( I bet we could make that work actually... > > > nsDependentString does not copy the string data at all, so it should be faster > > than nsAutoString. > That's a good point, as long as you don't need your string to be null > terminated. > > > It would be nice if we had an nsAutoIntArray, though. Allocating the integer > > arrays appears to account for about half my execution time. > You could try nsAutoTArray<int, nsAutoString::kDefaultStroageSize> so we match > what we consider to be the "normal size" of words in strings (and so we'll > update if that ever changes) I think nsAutoTArray is the wrong tool for this job. Among other things, I'm not sure that I can swap an nsAutoTArray the way I need to. Note that I could also do something like: nsAutoArrayPtr<int> row1; int autoBuffer1[MAX_AUTO_SIZE]; int *prevRow = autoBuffer1; if (sLen + 1 > MAX_AUTO_SIZE) { row1 = new int[sLen + 1]; // only use the auto array ptr in this case prevRow = row1; } This is kind of ugly, and I'm reluctant to do this kind of optimization until I have a compelling reason. I think nsAutoTArray might solve this problem but will be ugly in other ways.

Shawn Wilsher :sdwilsh

Comment 5

•

16 years ago

Comment on attachment 386131 [details] [diff] [review] Add Levenshtein Edit Distance function and register it with Sqlite. Mk II Feels like we almost have this. http://reviews.visophyte.org/r/386131/diff/#index_header on file: storage/src/mozStorageSQLFunctions.cpp line 163 > int LevenshteinDistance(const nsAString &s, nit: return type on new line on file: storage/src/mozStorageSQLFunctions.cpp line 166 > const int sLen = s.Length(); > const int tLen = t.Length(); These should actually be PRUint32 (that's what Length returns). on file: storage/src/mozStorageSQLFunctions.cpp line 169 > if (sLen == 0) return tLen; > if (tLen == 0) return sLen; nit: returns on new line on file: storage/src/mozStorageSQLFunctions.cpp line 189 > // Allocate memory for two rows. Use auto pointers to manage the memory > // so we don't have to explicitly free them. > nsAutoArrayPtr<int> row1(new int[sLen + 1]); > nsAutoArrayPtr<int> row2(new int[sLen + 1]); > > // Declare the raw pointers that will actually be used to access the memory. > // Raw pointers may be a little bit faster, and we can avoid relying on more > // obscure parts of nsAutoArrayPtr behavior when we swap the rows below. > int *prevRow = row1; > int *currRow = row2; So, this should be doable with the following: nsAutoTArray<int, nsAutoString::kDefaultStroageSize> row1(sLen + 1), row2(sLen + 1); int *prevRow = row1.Elements(); int *currRow = row2.Elements(); (I am not 100% sure on the syntax for the constant) on file: storage/src/mozStorageSQLFunctions.cpp line 201 > for (int si = 0; si <= sLen; si++) { > prevRow[si] = si; > } nit: no braces for a one line for loop and just use "i" here. sLen should be a PRUint32, so i should be as well to not introduce compiler warnings. on file: storage/src/mozStorageSQLFunctions.cpp line 205 > // Compute the empty cells in the "matrix" row-by-row, starting with > // the second row. > for (int ti = 1; ti <= tLen; ti++) { How about this compromise on the strings names: we use a longer variable name for the arguments, but before we enter the loop we do this: const PRUnichar *s = longerNameS.get(); const PRUnichar *t = longerNameT.get(); Then you just use normal array syntax to get the characters in the array. You don't need the class abstraction at this point anyway (and results in less method calls). on file: storage/src/mozStorageSQLFunctions.cpp line 207 > for (int ti = 1; ti <= tLen; ti++) { ti needs to be a PRUint32 on file: storage/src/mozStorageSQLFunctions.cpp line 213 > PRUnichar tch = t.CharAt(ti - 1); nit: const PRUnichar tChar = t[ti - 1]; on file: storage/src/mozStorageSQLFunctions.cpp line 217 > for (int si = 1; si <= sLen; si++) { si needs to be a PRUint32 on file: storage/src/mozStorageSQLFunctions.cpp line 221 > PRUnichar sch = s.CharAt(si - 1); nit: const PRUnichar sChar = s[si - 1]; on file: storage/src/mozStorageSQLFunctions.cpp line 228 > > int aPrime = prevRow[si - 1] + cost; > int bPrime = prevRow[si] + 1; > int cPrime = currRow[si - 1] + 1; > nit: lose the whitespace around the temps please - the comment applies to all four lines there. on file: storage/src/mozStorageSQLFunctions.cpp line 247 > int result = prevRow[sLen]; > > return result; don't need the temporary here, so just return prevRow[sLen] on file: storage/src/mozStorageSQLFunctions.cpp line 267 > {"lower", 1, SQLITE_UTF16, 0, caseFunction}, > {"lower", 1, SQLITE_UTF8, 0, caseFunction}, > {"upper", 1, SQLITE_UTF16, (void*)1, caseFunction}, > {"upper", 1, SQLITE_UTF8, (void*)1, caseFunction}, > > {"like", 2, SQLITE_UTF16, 0, likeFunction}, > {"like", 2, SQLITE_UTF8, 0, likeFunction}, > {"like", 3, SQLITE_UTF16, 0, likeFunction}, > {"like", 3, SQLITE_UTF8, 0, likeFunction}, > > {"levenshteinDistance", 2, SQLITE_UTF16, 0, levenshteinDistanceFunction}, > {"levenshteinDistance", 2, SQLITE_UTF8, 0, levenshteinDistanceFunction}, since you already have to reformat these, can we do it like this for all of them: {"lower", 1, SQLITE_UTF16, 0, caseFunction}, etc. on file: storage/src/mozStorageSQLFunctions.cpp line 359 > nsDependentString A(static_cast<const PRUnichar *>(::sqlite3_value_text16(aArgv[0]))); > nsDependentString B(static_cast<const PRUnichar *>(::sqlite3_value_text16(aArgv[1]))); You should also use sqlite3_value_bytes[16] to get the length so we don't have to computer the length here ourselves. We should also check that the length of either string does not exceed INT_MAX. This will only actually matter on 32-bit platforms, but we should call sqlite3_result_error_toobig if either string is too big (and probably wrap the checks in an NS_UNLIKELY) on file: storage/test/unit/test_levenshtein.js line 17 > * Mozilla Corporation. nit: no indent on file: storage/test/unit/test_levenshtein.js line 42 > var stmt = createStatement("SELECT levenshteinDistance(?1, ?2)"); > stmt.bindStringParameter(0, s); > stmt.bindStringParameter(1, t); var stmt = createStatement("SELECT levenshteinDistance(:s, :t) AS result"); stmt.params.s = s; stmt.params.t = t; on file: storage/test/unit/test_levenshtein.js line 45 > do_check_true(stmt.executeStep()); > do_check_eq(expectedDistance, stmt.getInt32(0)); > stmt.reset(); > stmt.finalize(); should wrap execution in try block, and have stmt.finalize() in a finally clause (no catch needed) on file: storage/test/unit/test_levenshtein.js line 46 > do_check_eq(expectedDistance, stmt.getInt32(0)); stmt.row.result instead of getInt32

Attachment #386131 - Flags: review?(sdwilsh) → review-

Curtis Bartley [:cbartley]

Assignee

Comment 6

•

16 years ago

(In reply to comment #5) > (From update of attachment 386131 [details] [diff] [review]) > > So, this should be doable with the following: > > nsAutoTArray<int, nsAutoString::kDefaultStroageSize> row1(sLen + 1), row2(sLen > + 1); > > int *prevRow = row1.Elements(); > int *currRow = row2.Elements(); > (I am not 100% sure on the syntax for the constant) I did some poor man's profiling on the various alternatives here: sLen, tLen == 2, 15,000,000 reps -------------------------------- static buffer ~3s new[] ~10s nsAutoTArray, capacity=1 ~13s nsAutoTArray, capacity=64 ~9s sLen, tLen == 10, 2,000,000 reps (fewer reps to get us back to the 10s range) ----------------------------------------------------------------------------- static buffer ~9s new[] ~9s nsAutoTArray, capacity=1 ~9s nsAutoTArray, capacity=64 ~9s Disclaimer: The times are all by manual count, not by a stopwatch. There's some slop as a consequence, but I don't think it effects the analysis. There are appreciable performance differences for very small strings (a few chars), but even small strings (10 chars), the cost of the algorithm itself swamps the cost of the memory allocations. We can thank the O(n*m) algorithm for this, I presume. My conclusion is that we're better off sticking with the simplest solution (just use new[] for allocations) unless we think we're going to be looking at a lot of extremely short strings.

Shawn Wilsher :sdwilsh

Comment 7

•

16 years ago

Given those numbers, I'd prefer to use nsAutoTArray just to prevent more heap allocations when possible.

Curtis Bartley [:cbartley]

Assignee

Comment 8

•

16 years ago

(In reply to comment #5) > (From update of attachment 386131 [details] [diff] [review]) I'll have a new patch up shortly. In the meantime, I was going to ask you specifically about this part: > on file: storage/src/mozStorageSQLFunctions.cpp line 267 > > {"lower", 1, SQLITE_UTF16, 0, caseFunction}, > > {"lower", 1, SQLITE_UTF8, 0, caseFunction}, > > {"upper", 1, SQLITE_UTF16, (void*)1, caseFunction}, > > {"upper", 1, SQLITE_UTF8, (void*)1, caseFunction}, > > > > {"like", 2, SQLITE_UTF16, 0, likeFunction}, > > {"like", 2, SQLITE_UTF8, 0, likeFunction}, > > {"like", 3, SQLITE_UTF16, 0, likeFunction}, > > {"like", 3, SQLITE_UTF8, 0, likeFunction}, > > > > {"levenshteinDistance", 2, SQLITE_UTF16, 0, levenshteinDistanceFunction}, > > {"levenshteinDistance", 2, SQLITE_UTF8, 0, levenshteinDistanceFunction}, > > since you already have to reformat these, can we do it like this for all of > them: > {"lower", > 1, > SQLITE_UTF16, > 0, > caseFunction}, > etc. So my questions is, umm, why? The table format is much easier to read, and I'd argue it's easier to modify as well.

Shawn Wilsher :sdwilsh

Comment 9

•

16 years ago

Because we'll mess up blame each time we add a function that changes things around. Also, it fits better with how we do XPCOM component registration in most places in the code base. It's mostly for the blame (which I've found to be useful in the past figuring out when/how something was broke/missed).

Curtis Bartley [:cbartley]

Assignee

Comment 10

•

16 years ago

(In reply to comment #9) > Because we'll mess up blame each time we add a function that changes things > around. Also, it fits better with how we do XPCOM component registration in > most places in the code base. It's mostly for the blame (which I've found to > be useful in the past figuring out when/how something was broke/missed). OK. I can't help but feel that the cure might be worse than the disease here, though.

Curtis Bartley [:cbartley]

Assignee

Comment 11

•

16 years ago

(In reply to comment #7) > Given those numbers, I'd prefer to use nsAutoTArray just to prevent more heap > allocations when possible. I'm dubious that there's any advantage to that -- they're not costing us anything in performance, and it's unlikely that it's causing fragmentation since we just allocate it and then immediately free it. If you still feel strongly about it, let me suggest we right a custom auto-array class that just does what we need, which is frankly not much. TAutoArray puts a lot of complexity in the code path that we just don't need; that's probably why it's slower than a static buffer for really small strings. We could do something like: template <class T, PRUint32 N> class mozAutoArray { public: mozAutoArray(PRUint32 size) : mBuffer(size <= N ? mAutoBuffer : new T[size]) {} ~mozAutoArray() { if (mBuffer != mAutoBuffer) { delete[] mBuffer; } } T *get() { return mBuffer; } private: T *mBuffer; T mAutoBuffer[N]; }; which is what I'm using right at the moment. (This is just preliminary code, of course; at the very least it could use some comments.)

Curtis Bartley [:cbartley]

Assignee

Comment 12

•

16 years ago

Attached patch Add Levenshtein Edit Distance function and register it with Sqlite. Mk III (obsolete) — Details — Splinter Review

I've addressed all of the comments except "use nsAutoTArray" (right now the code is using the mozAutoArray class I described earlier). There is also additional error handling in levenshteinDistance() and levenshteinDistanceFunction(). The latter has changed substantially so you may want to pay extra attention to it.

Attachment #386131 - Attachment is obsolete: true

Attachment #386587 - Flags: review?(sdwilsh)

Curtis Bartley [:cbartley]

Assignee

Comment 13

•

16 years ago

Attached patch Add Levenshtein Edit Distance function and register it with Sqlite. Mk IV (obsolete) — Details — Splinter Review

Nearly identical to the previous patch. I cleaned up the mozAutoArray class template and added comments. You may want to pay particular attention to it. It's dirt simple, but it's new, and it does manage memory. Also, I hate to ask, but should we be handling SQL NULLs?

Attachment #386587 - Attachment is obsolete: true

Attachment #387778 - Flags: review?(sdwilsh)

Attachment #386587 - Flags: review?(sdwilsh)

Shawn Wilsher :sdwilsh

Comment 14

•

16 years ago

What do you mean by "SQL NULLs" exactly?

Curtis Bartley [:cbartley]

Assignee

Comment 15

•

16 years ago

As I understand it, the way an ordinary SQL function would handle NULLs is that if *any* of its arguments are NULL, then it returns NULL itself. I suppose COALESCE is a notable exception.

Shawn Wilsher :sdwilsh

Comment 16

•

16 years ago

In this case, we should probably treat NULL as the empty string.

Curtis Bartley [:cbartley]

Assignee

Comment 17

•

16 years ago

Here are some examples of how common Sqlite functions handled NULLs: SELECT 1 + NULL ==> NULL SELECT "A" || NULL ==> NULL SELECT ABS(NULL) ==> NULL SELECT LENGTH(NULL) ==> NULL SELECT NULL LIKE "AB%" ==> NULL SELECT MAX(1, NULL) ==> NULL SELECT QUOTE(NULL) ==> NULL It's true that this behavior of NULLs causes lots of headaches in SQL, but I feel like us being inconsistent isn't really going to make the situation any better.

Shawn Wilsher :sdwilsh

Comment 18

•

16 years ago

Comment on attachment 387778 [details] [diff] [review] Add Levenshtein Edit Distance function and register it with Sqlite. Mk IV This looks a lot better. Few comments though: http://reviews.visophyte.org/r/387778/ on file: storage/src/mozStorageSQLFunctions.cpp line 152 > /** > * This class manages a dynamic array. It can represent an array of any > * reasonable size, but if the array is "N" elements or smaller, it will be > * stored using fixed space inside the auto array itself. If the auto array > * is a local variable, this internal storage will be allocated cheaply on the > * stack, similar to nsAutoString. If a larger size is requested, the memory > * will be dynamically allocated from the heap. Since the destructor will > * free any heap-allocated memory, client code doesn't need to care where the > * memory came from. > */ > template <class T, PRUint32 N> class mozAutoArray Can you pull this out into it's own file? Something like SimpleAutoArray.h? You should also add some automated native code tests that use it in the test dir. on file: storage/src/mozStorageSQLFunctions.cpp line 162 > template <class T, PRUint32 N> class mozAutoArray s/PRUint32/size_t/ on file: storage/src/mozStorageSQLFunctions.cpp line 168 > mozAutoArray(PRUint32 size) : mBuffer(size <= N ? mAutoBuffer : new T[size]) size_t here too. Also, please use the constructor format in storage - example: http://mxr.mozilla.org/mozilla-central/source/storage/src/mozStorageBindingParamsArray.cpp#49 on file: storage/src/mozStorageSQLFunctions.cpp line 175 > if (mBuffer != mAutoBuffer) { > delete[] mBuffer; > } nit: no braces around one-line if. on file: storage/src/mozStorageSQLFunctions.cpp line 195 > * @param s > * a string > * @param t > * another string > * @returns the Levenshtein Distance between the arguments This is no longer accurate. on file: storage/src/mozStorageSQLFunctions.cpp line 201 > nsresult > LevenshteinDistance(const nsAString &aStringS, > const nsAString &aStringT, > int *_result This should probably return and int and we can just use SQLite result codes. It'll work better for your error handling too. on file: storage/src/mozStorageSQLFunctions.cpp line 458 > // If the strings are (way) too big, return an error immediately. > if (NS_UNLIKELY(aLen > INT_MAX) || NS_UNLIKELY(bLen > INT_MAX)) { > ::sqlite3_result_error_toobig(aCtx); > return; > } I know I asked you to check this, but I realize that this isn't possible. An int isn't going to be larger than INT_MAX on file: storage/src/mozStorageSQLFunctions.cpp line 469 > if (NS_SUCCEEDED(rv)) { > ::sqlite3_result_int(aCtx, distance); > } else if (rv == NS_ERROR_OUT_OF_MEMORY) { > ::sqlite3_result_error_nomem(aCtx); > } else { > ::sqlite3_result_error(aCtx, "User function returned error code", -1); > } nit: if () { ... } else { ... }

Attachment #387778 - Flags: review?(sdwilsh) → review-

Curtis Bartley [:cbartley]

Assignee

Comment 19

•

16 years ago

Attached patch Add Levenshtein Edit Distance function and register it with Sqlite. Mk V (obsolete) — Details — Splinter Review

> This looks a lot better. Few comments though: > > http://reviews.visophyte.org/r/387778/ > on file: storage/src/mozStorageSQLFunctions.cpp line 152 > > /** > > * This class manages a dynamic array. It can represent an array of any > > * reasonable size, but if the array is "N" elements or smaller, it will be > > * stored using fixed space inside the auto array itself. If the auto array > > * is a local variable, this internal storage will be allocated cheaply on the > > * stack, similar to nsAutoString. If a larger size is requested, the memory > > * will be dynamically allocated from the heap. Since the destructor will > > * free any heap-allocated memory, client code doesn't need to care where the > > * memory came from. > > */ > > template <class T, PRUint32 N> class mozAutoArray > > Can you pull this out into it's own file? Something like SimpleAutoArray.h? > You should also add some automated native code tests that use it in the test > dir. This seems like a lot of effort for what was supposed to be a simple little helper class. > You should also add some automated native code tests... There's only like 4 lines of real code in this whole class. I believe that this class is an excellent candidate for "test by inspection". I mean I could write a test for it, but all I can test for is that it actually returns memory and doesn't crash. We can tell that when test the main levenshteinDistance function. Note that I did add another case to test_levenshtein.js where I pass really long strings to make sure we're exercising the heap allocation code path. (I should have thought of this myself, but it was actually a suggestion from Dolske.) > on file: storage/src/mozStorageSQLFunctions.cpp line 162 > > template <class T, PRUint32 N> class mozAutoArray > > s/PRUint32/size_t/ Done. > on file: storage/src/mozStorageSQLFunctions.cpp line 168 > > mozAutoArray(PRUint32 size) : mBuffer(size <= N ? mAutoBuffer : new T[size]) > > size_t here too. Done. > Also, please use the constructor format in storage - example: > http://mxr.mozilla.org/mozilla-central/source/storage/src/mozStorageBindingParamsArray.cpp#49 Done. > on file: storage/src/mozStorageSQLFunctions.cpp line 175 > > if (mBuffer != mAutoBuffer) { > > delete[] mBuffer; > > } > > nit: no braces around one-line if. Done. > on file: storage/src/mozStorageSQLFunctions.cpp line 195 > > * @param s > > * a string > > * @param t > > * another string > > * @returns the Levenshtein Distance between the arguments > > This is no longer accurate. Done. > on file: storage/src/mozStorageSQLFunctions.cpp line 201 > > nsresult > > LevenshteinDistance(const nsAString &aStringS, > > const nsAString &aStringT, > > int *_result > > This should probably return and int and we can just use SQLite result codes. > It'll work better for your error handling too. Done. > on file: storage/src/mozStorageSQLFunctions.cpp line 458 > > // If the strings are (way) too big, return an error immediately. > > if (NS_UNLIKELY(aLen > INT_MAX) || NS_UNLIKELY(bLen > INT_MAX)) { > > ::sqlite3_result_error_toobig(aCtx); > > return; > > } > > I know I asked you to check this, but I realize that this isn't possible. An > int isn't going to be larger than INT_MAX Gone. As an interesting exercise, how long would it take to compute an O(n * m) algorithm on a gigahertz class CPU, when n and m are both in the billions? And why do I hear this question in Carl Sagan's voice? > on file: storage/src/mozStorageSQLFunctions.cpp line 469 > > if (NS_SUCCEEDED(rv)) { > > ::sqlite3_result_int(aCtx, distance); > > } else if (rv == NS_ERROR_OUT_OF_MEMORY) { > > ::sqlite3_result_error_nomem(aCtx); > > } else { > > ::sqlite3_result_error(aCtx, "User function returned error code", -1); > > } > > nit: > if () { > ... > } > else { > ... > } Done. I made a couple of other changes as well. I changed LevenshteinDistance to levenshteinDistance (small-l) -- a change I think you requested before and which I *thought* I'd made, but apparently not. I also changed the levenshteinDistanceFunction to return a SQL NULL if either of the arguments are SQL NULLs as I proposed above.

Attachment #387778 - Attachment is obsolete: true

Attachment #388030 - Flags: review?(sdwilsh)

Shawn Wilsher :sdwilsh

Comment 20

•

16 years ago

Comment on attachment 388030 [details] [diff] [review] Add Levenshtein Edit Distance function and register it with Sqlite. Mk V http://reviews.visophyte.org/r/388030/ on file: storage/src/mozStorageSQLFunctions.cpp line 162 > template <class T, size_t N> class mozAutoArray nit: lose the moz prefix on file: storage/src/mozStorageSQLFunctions.cpp line 167 > // Construct a new auto array large enough for "size" elements. this comment isn't really needed on file: storage/src/mozStorageSQLFunctions.cpp line 173 > // Destruct the auto array and free heap allocated memory, if any. nor is this comment on file: storage/src/mozStorageSQLFunctions.cpp line 180 > // Return the pointer to the allocated array. WARNING: If the array > // allocation failed, get() will return NULL! > T *get() > { > return mBuffer; > } This comment should really be in java-doc style. But I'm wondering why you didn't use a type operator here instead of an explicit get method. on file: storage/src/mozStorageSQLFunctions.cpp line 206 > int *_result > ) nit: paren on same line as _result please on file: storage/src/mozStorageSQLFunctions.cpp line 459 > const PRUnichar *a = static_cast<const PRUnichar *>(::sqlite3_value_text16(aArgv[0])); > int aLen = sqlite3_value_bytes16(aArgv[0]) / sizeof(*a); From the SQLite doc page: Please pay particular attention to the fact that the pointer returned from sqlite3_value_blob(), sqlite3_value_text(), or sqlite3_value_text16() can be invalidated by a subsequent call to sqlite3_value_bytes(), sqlite3_value_bytes16(), sqlite3_value_text(), or sqlite3_value_text16(). on file: storage/test/unit/test_levenshtein.js line 48 > } finally { nit: finally starts on new line after closing brace r=sdwilsh with comments addressed

Attachment #388030 - Flags: review?(sdwilsh) → review+

Curtis Bartley [:cbartley]

Assignee

Comment 21

•

16 years ago

Attached patch Add Levenshtein Edit Distance function and register it with Sqlite. Mk VI (obsolete) — Details — Splinter Review

(In reply to comment #20) > (From update of attachment 388030 [details] [diff] [review]) > http://reviews.visophyte.org/r/388030/ > on file: storage/src/mozStorageSQLFunctions.cpp line 162 > > template <class T, size_t N> class mozAutoArray > > nit: lose the moz prefix Done. > on file: storage/src/mozStorageSQLFunctions.cpp line 167 > > // Construct a new auto array large enough for "size" elements. > > this comment isn't really needed Gone. > on file: storage/src/mozStorageSQLFunctions.cpp line 173 > > // Destruct the auto array and free heap allocated memory, if any. > > nor is this comment Gone. > on file: storage/src/mozStorageSQLFunctions.cpp line 180 > > // Return the pointer to the allocated array. WARNING: If the array > > // allocation failed, get() will return NULL! > > T *get() > > { > > return mBuffer; > > } > > This comment should really be in java-doc style. But I'm wondering why you > didn't use a type operator here instead of an explicit get method. Now in JavaDoc style. I'm not using a type operator because that just seems too deep in the C++ bag of tricks for what I'm trying to do here. I know people have found type operators to be problematical in the past, and I'd just as soon not have to think about the possible issues here. Because of the very limited way I'm using the class, I don't think it hurts the understandability of the code any, and in fact the extra explicitness might be a small win for understandability. > on file: storage/src/mozStorageSQLFunctions.cpp line 206 > > int *_result > > ) > > nit: paren on same line as _result please Done. > on file: storage/src/mozStorageSQLFunctions.cpp line 459 > > const PRUnichar *a = static_cast<const PRUnichar *>(::sqlite3_value_text1 > > int aLen = sqlite3_value_bytes16(aArgv[0]) / sizeof(*a); > > From the SQLite doc page: > Please pay particular attention to the fact that the pointer returned from > sqlite3_value_blob(), sqlite3_value_text(), or sqlite3_value_text16() can be > invalidated by a subsequent call to sqlite3_value_bytes(), > sqlite3_value_bytes16(), sqlite3_value_text(), or sqlite3_value_text16(). Reordered the calls and just used "sizeof(PRUnichar)" in the length computation. > on file: storage/test/unit/test_levenshtein.js line 48 > > } finally { > > nit: finally starts on new line after closing brace Done. > r=sdwilsh with comments addressed Also added "sqlite3_result_error_nomem" to db/sqlite3/src/sqlite.def.

Attachment #388030 - Attachment is obsolete: true

Attachment #388806 - Flags: review?(sdwilsh)

Shawn Wilsher :sdwilsh

Comment 22

•

16 years ago

Comment on attachment 388806 [details] [diff] [review] Add Levenshtein Edit Distance function and register it with Sqlite. Mk VI nit: s/@returns/@return/g >+ /** >+ * Return the pointer to the allocated array. WARNING: If the array >+ * allocation failed, get() will return NULL! >+ * @returns the pointer to the allocated array >+ */ nit: newline before @return nit: put the warning in an @note please. r=sdwilsh

Attachment #388806 - Flags: review?(sdwilsh) → review+

Curtis Bartley [:cbartley]

Assignee

Comment 23

•

16 years ago

Attached patch Add Levenshtein Edit Distance function and register it with Sqlite. Mk VII (obsolete) — Details — Splinter Review

(In reply to comment #22) > (From update of attachment 388806 [details] [diff] [review]) > nit: s/@returns/@return/g Fixed, include one on a function I didn't write :-). > > >+ /** > >+ * Return the pointer to the allocated array. WARNING: If the array > >+ * allocation failed, get() will return NULL! > >+ * @returns the pointer to the allocated array > >+ */ > nit: newline before @return > nit: put the warning in an @note please. Done and Done. > r=sdwilsh P.S. I'd like to suggest that requiring full-on JavaDoc on small utility classes and functions -- and private to a single file -- is not really necessary. I find the JavaDoc style comments more laborious to write and harder to read inline than just regular comments. And for small, private stuff I don't ever find myself reading the JavaDoc generated HTML, anyway.

Attachment #388806 - Attachment is obsolete: true

Attachment #388815 - Flags: review?(sdwilsh)

Shawn Wilsher :sdwilsh

Comment 24

•

16 years ago

Comment on attachment 388815 [details] [diff] [review] Add Levenshtein Edit Distance function and register it with Sqlite. Mk VII r=sdwilsh It'd also be useful to run your tests on a UTF-16 database. See the work in bug 499990 on how to do this.

Attachment #388815 - Flags: review?(sdwilsh) → review+

Curtis Bartley [:cbartley]

Assignee

Comment 25

•

16 years ago

Attached patch Add Levenshtein Edit Distance function and register it with Sqlite. Mk VIII (obsolete) — Details — Splinter Review

The tests are now run twice, once against the default test database and again against a UTF-16 in-memory database (built using code borrowed from adw's patch for bug 499990.)

Attachment #388815 - Attachment is obsolete: true

Attachment #388960 - Flags: review?(sdwilsh)

Shawn Wilsher :sdwilsh

Comment 26

•

16 years ago

Comment on attachment 388960 [details] [diff] [review] Add Levenshtein Edit Distance function and register it with Sqlite. Mk VIII r=sdwilsh I don't think your patch applies cleanly to trunk anymore. This should get sr (probably from vlad) since it adds a new API (via a SQL function).

Attachment #388960 - Flags: review?(sdwilsh) → review+

Curtis Bartley [:cbartley]

Assignee

Comment 27

•

16 years ago

Attached patch Add Levenshtein Edit Distance function and register it with Sqlite. Mk VIII -- regen — Details — Splinter Review

There were no source changes between this patch and the previous mk VIII patch. The previous patch was no long applying cleanly against tip, so I regened it. This patch needs sr since it's an API addition.

Attachment #388960 - Attachment is obsolete: true

Attachment #389216 - Flags: superreview?

Curtis Bartley [:cbartley]

Assignee

Updated

•

16 years ago

Attachment #389216 - Attachment is patch: true

Attachment #389216 - Attachment mime type: application/octet-stream → text/plain

Attachment #389216 - Flags: superreview? → superreview?(vladimir)

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Updated

•

16 years ago

Attachment #389216 - Flags: superreview?(vladimir) → superreview+

Curtis Bartley [:cbartley]

Assignee

Comment 28

•

16 years ago

This patch applied cleanly in a try server build as of about 2pm today. I manually tested the OS X build as well.

Keywords: checkin-needed

Johnathan Nightingale [:johnath]

Comment 29

•

16 years ago

http://hg.mozilla.org/mozilla-central/rev/cdf35906c03b

Status: NEW → RESOLVED

Closed: 16 years ago

Keywords: checkin-needed

Resolution: --- → FIXED

Shawn Wilsher :sdwilsh

Updated

•

16 years ago

Flags: in-testsuite+

Target Milestone: --- → mozilla1.9.2a1

Benjamin Smedberg

Comment 30

•

16 years ago

FTR, this patch caused the static analysis boxes to go red. In these cases it would be best to backout and contact me so we can resolve the analysis error... in this case it didn't recognize the allocation pattern at http://hg.mozilla.org/mozilla-central/rev/cdf35906c03b#l2.32 I've fixed the analysis with http://hg.mozilla.org/mozilla-central/rev/5fed87ccbba5

Updated

•

1 year ago

Product: Toolkit → Core

You need to log in before you can comment on or make changes to this bug.