Closed Bug 570975 Opened 14 years ago Closed 14 years ago

Don't convert UTF-8 strings to UTF-16 in MatchAutoCompleteFunction

Categories

(Toolkit :: Places, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- betaN+

People

(Reporter: ehsan.akhgari, Assigned: justin.lebar+bug)

References

(Blocks 2 open bugs)

Details

(Keywords: perf, regression)

Attachments

(3 files, 7 obsolete files)

Attached file Profile
I've seen this problem on a few occasions, and now I have a useful profile over it.  With very large strings inside the location bar, Firefox starts to use 100% of the CPU, and becomes completely unresponsive, to the point that you need to kill it.

We spend 80% of our time in mozilla::places::MatchAutoCompleteFunction::OnfunctionCall.  from that, 55% of the time is spent in mozilla::storage::ArgValueArray::GetString, and 20% is spent in nsAString_internal::Assign.  The GetString call breaks down to 35% in sqlite3VdbeChangeEncoding.

I'm attaching the Shark profile.  What I did to reproduce it was open a very large URL, and click in the location bar.  I *tried* to close the tab as well, but I'm not sure if Firefox got stuck before the close tab command could be processed or not.
So, the problem here is that we do a whole bunch of conversion between UTF-8 and UTF-16.  I'm throwing together a patch which should fix this, but something in the back of my mind tells me there was a reason we needed to use UTF-16...
We can make location bar searches use less CPU and go faster here for Firefox 4 (hell, we could backport this if the fix is as simple as I think it is).  I think this should block as a result of this feature being used a bunch.
Assignee: nobody → sdwilsh
Status: NEW → ASSIGNED
blocking2.0: --- → ?
I've had *many* performance problems with the location bar.  Are you also interested in other bug reports coming from me in that regard?  I'm pretty sure that they're all places, but I don't have profiles right not to back up my assertion.
(In reply to comment #3)
> I've had *many* performance problems with the location bar.  Are you also
> interested in other bug reports coming from me in that regard?  I'm pretty sure
> that they're all places, but I don't have profiles right not to back up my
> assertion.
Yes, especially if you can attach profiles.  I'm not seeing this stuff, but I do occasionally here people complaining about it.
Attached patch v1.0 (obsolete) — Splinter Review
Use only UTF-8 instead of doing a bunch of conversions.  We should have someone who is better familiar with our string apis to take a look at this too.  Finding someone now...
Attachment #450197 - Flags: review?(mak77)
Comment on attachment 450197 [details] [diff] [review]
v1.0

>diff --git a/toolkit/components/places/src/SQLFunctions.cpp b/toolkit/components/places/src/SQLFunctions.cpp

>+    if (IsUTF8(unescapedSpec)) {
>+      _fixedSpec.Assign(unescapedSpec);
>+    }
>+    else {
>+      _fixedSpec.Assign(aURISpec);
>+    }
> 
>-    if (StringBeginsWith(_fixedSpec, NS_LITERAL_STRING("http://")))
>+    if (StringBeginsWith(_fixedSpec, NS_LITERAL_CSTRING("http://")))
>       _fixedSpec.Cut(0, 7);
>-    else if (StringBeginsWith(_fixedSpec, NS_LITERAL_STRING("https://")))
>+    else if (StringBeginsWith(_fixedSpec, NS_LITERAL_CSTRING("https://")))
>       _fixedSpec.Cut(0, 8);
>-    else if (StringBeginsWith(_fixedSpec, NS_LITERAL_STRING("ftp://")))
>+    else if (StringBeginsWith(_fixedSpec, NS_LITERAL_CSTRING("ftp://")))
>       _fixedSpec.Cut(0, 6);

inconsistent braces, I know you said me it is fixed locally, just a reminder.


>diff --git a/toolkit/components/places/src/SQLFunctions.h b/toolkit/components/places/src/SQLFunctions.h

>  * The Initial Developer of the Original Code is
>- * Mozilla Corporation.
>+ * Mozilla Foundation.

nit: "the Mozilla Foundation" iirc

looks good, and the perf gain is nice.
Attachment #450197 - Flags: review?(mak77) → review+
vlad checked this out on irc, so I'll land this shortly.
Unfortunately, this patch isn't going to work so well:
NeilAway> sdwilsh: ok, so the difference between Strings and CStrings is that case insensitive comparison on CStrings is ascii-only

This would be a regression...
Depends on: 571098
Comment on attachment 450197 [details] [diff] [review]
v1.0

This patch is no good for non-ASCII...
Attachment #450197 - Attachment is obsolete: true
Attachment #450197 - Flags: review-
This bug can't be fixed until some resolution of bug 571098.
Assignee: sdwilsh → nobody
Status: ASSIGNED → NEW
Depends on: 145975
No longer depends on: 571098
(In reply to comment #10)
> This bug can't be fixed until some resolution of bug 571098.

which is a dupe of bug 145975.
i'm not clear on how widespread of a problem this is. we don't know yet if this is a regression or if it's been here since 3.0.

we also have no clear owner or path to fixing this (bug 145975 is ancient and unfixed).

i'm not going to mark this blocking at this time.

if we can get details from support that this occurs often enough, or 100% reproducible STR that'll tell us how often it might be hit, then i'll re-evaluate blocking at that time.
blocking2.0: ? → -
If this bug is indeed the same as bug 526727, I had 100% reproducible STR in bug 526727, and I even found the regression range. It is new since 3.5 at least. Asking for blocking again to re-evaluate based on this.
blocking2.0: - → ?
Keywords: regression
Pasting data urls into the address bar is not a common action so I wouldn't block based on that.
blocking2.0: ? → -
Summary: Firefox chews 100% of CPU with large strings inside the location bar → Don't convert UTF-8 strings to UTF-16 in MatchAutoCompleteFunction
Attached patch Patch v2 (obsolete) — Splinter Review
This patch is based on Patch v3.3 (attachment 468545 [details] [diff] [review]) in bug 145975.

I'll look at the perf difference in a moment.
Well, the perf difference is...big.  I guess we could time how long these queries take to run.  But we're cutting out a whole hot function, ConvertUTF8toUTF16.  Also, UTF8CharEnumerator no longer shows up in profiles either.  Those together are about equal to the overhead of pthreads lock / unlock I see.
Blocks: 590376
Attached patch Patch v3 (obsolete) — Splinter Review
Built on top of bug 145975, patch v4.

The UTF-8 findOnBoundary is about 1.8x faster for ASCII strings than converting to UTF-16 and comparing.  Ignoring the UTF-8 -> UTF-16 conversion overhead, the new findOnBoundary function is about as fast as the old one for ASCII strings.

Unfortunately, for unicode strings, the new code is slower by a factor of 0.6, including the UTF-8 -> UTF-16 overhead of the old code.

Since URLs are ASCII and both URLs and page titles go through MatchAutoCompleteFunctions::findOnBoundary, I'd expect this to roughly be a wash for users with lots of non-ASCII pages in their history, and a clear win for users with mostly ASCII text in their histories.

We can probably do better, though.  Bug 590376 may be a start.
Attachment #468597 - Attachment is obsolete: true
(In reply to comment #18)
> Since URLs are ASCII and both URLs and page titles go through
> MatchAutoCompleteFunctions::findOnBoundary, I'd expect this to roughly be a
> wash for users with lots of non-ASCII pages in their history, and a clear win
> for users with mostly ASCII text in their histories.
URLs, page/bookmark titles, tags, and the search string to be more exact
(In reply to comment #18)
> Since URLs are ASCII and both URLs and page titles go through
> MatchAutoCompleteFunctions::findOnBoundary, I'd expect this to roughly be a
> wash for users with lots of non-ASCII pages in their history, and a clear win
> for users with mostly ASCII text in their histories.
Also, URLs are not required to be ASCII.
Turns out the UTF8 case-insensitive comparison code was missing a crucial fastpath.  I've added it back in, and now I'm seeing a speedup of 2.6x on ASCII and 2.1x on Unicode strings.
Blocks: 593893
Attached patch Patch v4 (obsolete) — Splinter Review
New patch, based on patch v5 in Bug 145975.  Benchmark results in a moment.
Attachment #471355 - Attachment is obsolete: true
I'm seeing a 1.5x - 2x speedup in findOnBoundary and findAnywhere, depending on the workload.  Also, just -O3 gives about a 1.5x speedup across the board (including for the current UTF-16 routines).

Full results are in bug 145975 comment 77.
Attachment #472876 - Flags: review?(mak77)
Attachment #472876 - Flags: feedback?(sdwilsh)
(In reply to comment #23)
> I'm seeing a 1.5x - 2x speedup in findOnBoundary and findAnywhere, depending on
> the workload.  Also, just -O3 gives about a 1.5x speedup across the board
> (including for the current UTF-16 routines).
Is that -O3 for just this file?
(In reply to comment #24)
> Is that -O3 for just this file?

It's -O3 everywhere.  I suspect that we'd see a similar speedup using -O3 just on that file and in nsUnicharUtils.cpp, but it's somewhat tricky to specify -O3 on nsUnicharUtils.cpp alone, since it gets built twice, both with and without PIC, and once is in toolkit/library, by way of some dependency...  Anyway, it could be done if wanted to, but I didn't try after I saw that it wasn't easy.

I should say that my 1.5x - 2x speedup is in *addition* to the 1.5x speedup you get from -O3.  I'm comparing UTF-16 convert and compare with -O3 to UTF-8 compare with -O3.
Could you maybe provide an interdiff between my changes and yours?  It's hard to tell what's different to be honest.
I unbitrotted patch v1, but interdiff still can't handle it.  Sorry.  I tried Brendan's hack, `diff patch1 patch2 | grep '[^<>] [-+]'` but the output is basically garbage, at least to my eyes.

fixupURISpec and OnFunctionCall are unchanged.  The header file has just lost a few functions, but is otherwise unchanged.  The rest is substantially different.
(In reply to comment #27)
> I unbitrotted patch v1, but interdiff still can't handle it.  Sorry.  I tried
> Brendan's hack, `diff patch1 patch2 | grep '[^<>] [-+]'` but the output is
> basically garbage, at least to my eyes.
> 
> fixupURISpec and OnFunctionCall are unchanged.  The header file has just lost a
> few functions, but is otherwise unchanged.  The rest is substantially
> different.
Alright, I can do the review of this, but probably next week.
(In reply to comment #28)
> Alright, I can do the review of this, but probably next week.

feel free to steal it from me, since I'm starting thinking I won't be able to go at it till wednesday.
Attached patch Patch v4.1 (obsolete) — Splinter Review
Minor unbitrotting.
Attachment #472876 - Attachment is obsolete: true
Attachment #474273 - Flags: review?(sdwilsh)
Attachment #472876 - Flags: review?(mak77)
Attachment #472876 - Flags: feedback?(sdwilsh)
I wonder if this 1.5x - 2x location bar perf win is something we should block on.  If the current trajectory is to be believed, if we don't mark it as a blocker, it (and its dependency, bug 145975) probably won't get reviewed.
blocking2.0: - → ?
(In reply to comment #31)
> I wonder if this 1.5x - 2x location bar perf win is something we should block
> on.  If the current trajectory is to be believed, if we don't mark it as a
> blocker, it (and its dependency, bug 145975) probably won't get reviewed.
Yeah, I can't see myself or mak really having cycles for this given the open blocker count right now...
So, if the risk involved is low and it's just a matter of review cycles, I actually think we *should* block on this.

Perception of performance is a major goal for Firefox 4, and a 2x more responsive awesomebar would go a long way to that end.
(In reply to comment #33)
> Perception of performance is a major goal for Firefox 4, and a 2x more
> responsive awesomebar would go a long way to that end.
Note that that is 2x for ASCII text, less for non-ASCII.  Let's not let our English speaking bias over-inflate the win here, since I know we have a lot of users who don't speak or type English (although many URls are certainly predominantly ASCII).
(In reply to comment #34)
> Note that that is 2x for ASCII text, less for non-ASCII.  Let's not let our
> English speaking bias over-inflate the win here, since I know we have a lot of
> users who don't speak or type English (although many URls are certainly
> predominantly ASCII).

See bug 145975 comment 77: It's 2x for ASCII and 1.6-1.9x faster for the Unicode strings I was testing.

Since FindOnBoundary doesn't really understand non-Latin text, I'd expect bug 593893 to have a larger impact on Unicode text than on Latin text.  (I might also expect it to have a larger impact on snappiness than this bug, to be honest.)
Up to 2x awesomebar perf improvement, fixing a perf regression, patches in hand, just waiting on review -> blocking+. We can re-evaluate if the review result increases the known risk of the change.

Marco/Shawn: is there something about this patch that makes it a difficult or time-consuming review?
blocking2.0: ? → betaN+
FWIW, bug 145975 is probably the more-difficult one to review.
(In reply to comment #36)
> Marco/Shawn: is there something about this patch that makes it a difficult or
> time-consuming review?
I haven't spent much time looking at non-blocking stuff because I'm buried in blockers right now.  Will skim it tomorrow and let you know.
(In reply to comment #37)
> FWIW, bug 145975 is probably the more-difficult one to review.
Also, needs that bug, which was marked blocking-
(In reply to comment #39)
> Also, needs that bug, which was marked blocking-

Not to get off topic, but was bug 145975 ever marked blocking-?  It's currently blocking?, but maybe it was blocking- at some point in the past.
(In reply to comment #40)
> Not to get off topic, but was bug 145975 ever marked blocking-?  It's currently
> blocking?, but maybe it was blocking- at some point in the past.
bsmedberg marked it blocking- on 2010-08-23 10:06:25 PDT
(In reply to comment #38)
> I haven't spent much time looking at non-blocking stuff because I'm buried in
> blockers right now.  Will skim it tomorrow and let you know.
Hey look!  I lied.  To make up for it, I'm doing the full review now.
Comment on attachment 474273 [details] [diff] [review]
Patch v4.1

I'll note that your re-ordering of some methods in these files has made the diff unnecessarily complex.  In the next revision, could you please undo that?  It's making it more difficult to review as a result.

>+++ b/toolkit/components/places/src/SQLFunctions.cpp
>-  /* static */
>-  void
>+   /* static */
>+   void
please don't change the indentation here.  It should line up with MatchAutoCompleteFunction::fixupURISpec

>   MatchAutoCompleteFunction::fixupURISpec(const nsCString &aURISpec,
>-                                          nsString &_fixedSpec)
>+                                          nsCString &_fixedSpec)
>   {
>     nsCString unescapedSpec;
>     (void)NS_UnescapeURL(aURISpec, esc_SkipControl | esc_AlwaysCopy,
>                          unescapedSpec);
> 
>-    // If this unescaped string is valid UTF-8, we'll convert it.  Otherwise,
>+    // If this unescaped string is valid UTF-8, we'll use it.  Otherwise,
>     // we will simply convert our original string.
You'll want to change the use of convert in the second line too

>+  MatchAutoCompleteFunction::findBeginning(const nsDependentCSubstring &aToken,
>+                                           const nsACString &aSourceString)
>   {
>+    // We can't use StringBeginsWith here, unfortunately.  Although it will
>+    // happily take a case-insensitive UTF8 comparator, it eventually calls
>+    // nsACString::Equals, which checks that the two strings contain the same
>+    // number of bytes before calling the comparator.  This is clearly not what
>+    // we want.
>+
>+    if (aToken.IsEmpty())
>+      return true;
nit: brace this please

>+
>+    const_char_iterator tokenStart(aToken.BeginReading()),
>+                        tokenEnd(aToken.EndReading()),
>+                        sourceStart(aSourceString.BeginReading()),
>+                        sourceEnd(aSourceString.EndReading());
>+
>+    // We didn't check that the source was non-empty, so check upfront in the loop.
I think it'd be more clear to do the check explicitly.

>+    PRBool err;
>+    while (sourceStart < sourceEnd &&
>+           CaseInsensitiveUTF8CharsEqual(sourceStart, tokenStart,
>+                                         sourceEnd, tokenEnd,
>+                                         &sourceStart, &tokenStart, &err)) {
sad to see that this uses PRBool - can we still change it to bool?

>+      // We found the token!
>+      if (tokenStart >= tokenEnd)
>+        return true;
nit: brace please

>+  // findAnywhere and findOnBoundary do almost the same thing, so it's natural
>+  // to implement them in terms of a single function.  They're both
>+  // performance-critical functions, however, and checking aOnlyOnBoundary
>+  // makes them a bit slower.  Our solution is to define findInString as
>+  // NS_ALWAYS_INLINE and rely on the compiler to optimize out the
>+  // aOnlyOnBoundary check.
>+  static NS_ALWAYS_INLINE bool
>+  findInString(const nsDependentCSubstring &aToken,
>+               const nsACString &aSourceString,
>+               bool aOnlyOnBoundary)
So, this function is static to the file, yeah?  Places code would prefer you to use an anonymous namespace, and place it at the top in a heading for anonymous helpers like here:
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/src/History.cpp#157

>   {
>+    // It's not particularly meaningful to call this function with an empty
>+    // token, but CaseInsensitiveUTF8CharsEqual assumes that there's at least
>+    // one byte left in the both strings, so we should check.
I hope it asserts this invariant.

>+    if (aToken.IsEmpty())
>+      return true;
nit: brace please

>+    nsACString::const_char_iterator tokenStart(aToken.BeginReading()),
>+                                    tokenEnd(aToken.EndReading()),
>+                                    sourceStart(aSourceString.BeginReading()),
>+                                    sourceEnd(aSourceString.EndReading());
Of course, if you make it a static method on the class, you can use the class typedef so you don't have super long lines

> 
>     // The start of aSourceString is considered a word boundary, so start there.
>     do {
>+      // We are on a word boundary (if aOnlyOnBoundary is true).  See if aToken
>+      // matches sourceStart.
one of these comments is redundant, I think

>+      // Check whether the first character in the token matches the character
>+      // at sourceStart.  At the same time, get a pointer to the next character
>+      // in both the token and the source.
>+      nsACString::const_char_iterator sourceNext, tokenCur;
>+      PRBool err;
>+      if (CaseInsensitiveUTF8CharsEqual(sourceStart, tokenStart,
>+                                        sourceEnd, tokenEnd,
>+                                        &sourceNext, &tokenCur, &err)) {
> 
>+        // Don't need to check err here -- if CaseInsensitiveUTF8CharCompare
>+        // returns err, it'll also return false, and we'll catch err outside
>+        // the if.
I think that in both cases we should just call err dummy, or maybe CaseInsensitiveUTF8CharsEqual should make passing that in optional (I would prefer the latter).  Seems like we don't usually need to know error information, so no need in making consumers have to pass something in for it.

>+      // If something went wrong above, get out of here!
>+      if (NS_UNLIKELY(err))
>+        return false;
Although, we seem to care here about it.  Can we use "error" as the variable name instead please?

>+      if (aOnlyOnBoundary &&
>+          (('a' <= *sourceStart && *sourceStart <= 'z') ||
>+           ('A' <= *sourceStart && *sourceStart <= 'Z'))) {
We added the isWordBoundary method for code readability.  In theory, it'd be inlined anyway.  Can we keep it please?

>+        // Advance to the next word boundary.  We consider anything that's not
>+        // [a-z] to be a word boundary -- this lets us match CamelCase words.
>+        //
>+        // Since we'll halt as soon as we see a non-ASCII letter we can do a
>+        // simple byte-by-byte comparison and avoid the overhead of a
>+        // UTF8CharEnumerator.
>+
>+        sourceStart++;
>+        while (sourceStart < sourceEnd && 'a' <= *sourceStart && *sourceStart <= 'z')
>+          sourceStart++;
ditto with nextWordBoundary

>+  bool
>+  MatchAutoCompleteFunction::findAnywhere(const nsDependentCSubstring &aToken,
>+                                          const nsACString &aSourceString)
>+    // We can't use FindInReadable here; it works only for ASCII.
>+
>+    return findInString(aToken, aSourceString, false);
I'd actually prefer to pass in an enum to findInString to dictate behavior as it's more readable at the call site.

This is mostly r- because I want to see it again.  This looks OK though, but one more iteration please.
Attachment #474273 - Flags: review?(sdwilsh) → review-
(In reply to comment #43)

Thanks for the review!  I'll put up a new patch soon -- for now, a few comments:

> >+    const_char_iterator tokenStart(aToken.BeginReading()),
> >+                        tokenEnd(aToken.EndReading()),
> >+                        sourceStart(aSourceString.BeginReading()),
> >+                        sourceEnd(aSourceString.EndReading());
> >+
> >+    // We didn't check that the source was non-empty, so check upfront in the loop.
> I think it'd be more clear to do the check explicitly.

Here's the full code:

> +    while (sourceStart < sourceEnd &&
> +           CaseInsensitiveUTF8CharsEqual(...)) {
> +
> +      if (tokenStart >= tokenEnd)
> +        return true;
> +    }
> +
> +    return false;

Do you mean changing it to

  if (sourceStart < sourceEnd)
    return false;

  while (CaseInsensitiveUTF8CharsEqual(...) {
    if (tokenStart >= tokenEnd)
      return true;

    if (sourceStart < sourceEnd)
      return false;
  }

  return false;

?  This seems worse to me.

> >+    PRBool err;
> >+    while (sourceStart < sourceEnd &&
> >+           CaseInsensitiveUTF8CharsEqual(sourceStart, tokenStart,
> >+                                         sourceEnd, tokenEnd,
> >+                                         &sourceStart, &tokenStart, &err)) {
> sad to see that this uses PRBool - can we still change it to bool?

I'm not really familiar with our rules for PRBool vs. bool, but I believe that
CaseInsensitiveUTF8CharsEqual lives somewhere where PRBool is required, or is
exported in such a way as to require PRBool.  I could check if you can direct
me to the rules.

> >   {
> >+    // It's not particularly meaningful to call this function with an empty
> >+    // token, but CaseInsensitiveUTF8CharsEqual assumes that there's at least
> >+    // one byte left in the both strings, so we should check.
> I hope it asserts this invariant.

It does, although only in debug mode.

> >+    nsACString::const_char_iterator tokenStart(aToken.BeginReading()),
> >+                                    tokenEnd(aToken.EndReading()),
> >+                                    sourceStart(aSourceString.BeginReading()),
> >+                                    sourceEnd(aSourceString.EndReading());
> Of course, if you make it a static method on the class, you can use the class
> typedef so you don't have super long lines

I'll see if I can NS_ALWAYS_INLINE a class-static method without putting the implementation in the header file.  If I can't, I'll put it in an anonymous namespace.

> >+      // Check whether the first character in the token matches the character
> >+      // at sourceStart.  At the same time, get a pointer to the next character
> >+      // in both the token and the source.
> >+      nsACString::const_char_iterator sourceNext, tokenCur;
> >+      PRBool err;
> >+      if (CaseInsensitiveUTF8CharsEqual(sourceStart, tokenStart,
> >+                                        sourceEnd, tokenEnd,
> >+                                        &sourceNext, &tokenCur, &err)) {
> > 
> >+        // Don't need to check err here -- if CaseInsensitiveUTF8CharCompare
> >+        // returns err, it'll also return false, and we'll catch err outside
> >+        // the if.
> I think that in both cases we should just call err dummy, or maybe
> CaseInsensitiveUTF8CharsEqual should make passing that in optional (I would
> prefer the latter).  Seems like we don't usually need to know error
> information, so no need in making consumers have to pass something in for it.

I'd prefer to keep the error parameter as required so that ignoring it is an
explicit decision.

> >+      if (aOnlyOnBoundary &&
> >+          (('a' <= *sourceStart && *sourceStart <= 'z') ||
> >+           ('A' <= *sourceStart && *sourceStart <= 'Z'))) {
> We added the isWordBoundary method for code readability.  In theory, it'd be
> inlined anyway.  Can we keep it please?
>
> [...]
>
> ditto with nextWordBoundary

I took these out because I was initially pretty confused by them.

  if (!isWordBoundary(ToLowerCase(*sourceStart++)))
    sourceStart = nextWordBoundary(sourceStart, sourceEnd);

What's a word boundary?  Is it whitespace, or whitespace and punctuation?
Numbers too?  Why do we have to convert *sourceStart to lower-case?  What does
case have to do with word boundaries, anyway?  And if it's *next* word
boundary, why do we have to *sourceStart++?

At least, that's what went through my head when I read the original code.  I
figured that it was pretty reasonable to factor it out when it was called from
two places, but now that it appears only once, it's clearer to me to write it
explicitly.

That said, it's not a big deal either way.
(In reply to comment #43)
> I'll note that your re-ordering of some methods in these files has made the
> diff unnecessarily complex.  In the next revision, could you please undo that? 
> It's making it more difficult to review as a result.

I agree that this is a nasty diff, although I feel like it's more important that the code itself follows a logical order than that this patch is easy to read.  I'll see if patience diff helps, and if not, I'll see if I can fix the order to make a nicer diff.
(In reply to comment #44)
> Do you mean changing it to
No, I mean before the loop, do:
if (aSourceString.Length() == 0) {
  return false;
}
because that is what the comment led me to believe that you are checking.  However, now that I think about it, we should never ever have a source string having zero length (and should assert that).  I think maybe your comment is not properly explaining what you are actually checking (it is to some degree looking at the code you posted, but I didn't realize you cared about the second case based on the comment)?

> I'm not really familiar with our rules for PRBool vs. bool, but I believe that
> CaseInsensitiveUTF8CharsEqual lives somewhere where PRBool is required, or is
> exported in such a way as to require PRBool.  I could check if you can direct
> me to the rules.
There's nothing set in stone.   It's usually module owners discretion, and we can use it anywhere that we do not cross XPConnect boundaries.

> I took these out because I was initially pretty confused by them.
> 
>   if (!isWordBoundary(ToLowerCase(*sourceStart++)))
>     sourceStart = nextWordBoundary(sourceStart, sourceEnd);
> 
> What's a word boundary?  Is it whitespace, or whitespace and punctuation?
> Numbers too?  Why do we have to convert *sourceStart to lower-case?  What does
> case have to do with word boundaries, anyway?  And if it's *next* word
> boundary, why do we have to *sourceStart++?
Right, but all these questions is exactly why we factored it out into its own method in the first place.  The consumer only cares about it being on a word boundary, whatever we define that to be.  We let isWordBoundary dictate that for us, and it makes the consumer more readable.  Right now we have to read a bunch of code in the would-be-consumer that is basically "determine if this is a word boundary".
These search functions are complicated, so we tried to make them as simple as possible to read and understand at a high level, and if you wanted to drill down, see the helper methods.  I can find the review comment where I was asked factor this stuff out in the first place if you'd like since I think it had a better argument for it (I had the same opinion as you at one point).

> At least, that's what went through my head when I read the original code.  I
> figured that it was pretty reasonable to factor it out when it was called from
> two places, but now that it appears only once, it's clearer to me to write it
> explicitly.
Although, it's also true that we wanted it factored out because it was in two places, that wasn't the only reason.
(In reply to comment #46)
> No, I mean before the loop, do:
> if (aSourceString.Length() == 0) {
>   return false;
> }
> because that is what the comment led me to believe that you are checking. 
> However, now that I think about it, we should never ever have a source string
> having zero length (and should assert that).

Okay.

> I think maybe your comment is not
> properly explaining what you are actually checking (it is to some degree
> looking at the code you posted, but I didn't realize you cared about the second
> case based on the comment)?

Ah, I see.  Yes, that comment is misleading.  Maybe we should just take it out.
 
> > I'm not really familiar with our rules for PRBool vs. bool, but I believe that
> > CaseInsensitiveUTF8CharsEqual lives somewhere where PRBool is required, or is
> > exported in such a way as to require PRBool.  I could check if you can direct
> > me to the rules.
> There's nothing set in stone.   It's usually module owners discretion, and we
> can use it anywhere that we do not cross XPConnect boundaries.
> 
> > I took these out because I was initially pretty confused by them.
> > 
> >   if (!isWordBoundary(ToLowerCase(*sourceStart++)))
> >     sourceStart = nextWordBoundary(sourceStart, sourceEnd);
> 
> Although, it's also true that we wanted it factored out because it was in two
> places, that wasn't the only reason.

Sure, I'm happy to put it back in.  Something like

  sourceStart = nextWordBoundary(sourceStart, sourceNext, sourceEnd);

would be clearer to me, since it encapsulates the fact that we always move forward at least one character, and we end up at a word boundary.
(In reply to comment #47)
>   sourceStart = nextWordBoundary(sourceStart, sourceNext, sourceEnd);
sourceNext would be an [out] param?  I might bicker about the signature a bit once I see it implemented, but I'm fine with this.
Attached patch Patch v5 (obsolete) — Splinter Review
Attachment #474273 - Attachment is obsolete: true
Attachment #484986 - Flags: review?(sdwilsh)
This is a much nicer diff of SQLFunctions.cpp.
Hm...actually, that patience diff didn't seem to help as much as I'd hoped.
Clever diffing algorithms are of no use!  I broke down and just re-ordered the functions, like I should have from the beginning.  :)
Attachment #484986 - Attachment is obsolete: true
Attachment #484988 - Attachment is obsolete: true
Attachment #484992 - Flags: review?(sdwilsh)
Attachment #484986 - Flags: review?(sdwilsh)
Note: I'm probably fine with a part 2 that moves the functions which will then be trivial to review.
Comment on attachment 484992 [details] [diff] [review]
Patch v5.1 (nicer diff than v5)

>+++ b/toolkit/components/places/src/SQLFunctions.cpp

>+////////////////////////////////////////////////////////////////////////////////
>+//// Anonymous Helpers
>+namespace {
nit: newline between these please

>+  enum FindInStringBehavior {
>+    eFindOnBoundary,
>+    eFindAnywhere
>+  };
Move this above the FindInString method please

>+  static NS_ALWAYS_INLINE const_char_iterator
>+  nextWordBoundary(const_char_iterator start,
>+                   const_char_iterator next,
>+                   const_char_iterator end) {
Places style dictates formatting like this:
/**
 * blah blah blah
 *
 * @param aStart
 *        blah blah blah
 * ...
 * @return blah blah blah
 */
static
NS_ALWAYS_INLINE const_char_iterator
nextWordBoundary(const_char_iterator aStart,
                 const_char_iterator aNext,
                 const_char_iterator aEnd)
{

Additionally, we are not mutating these, so I'd like to make them all const_char_iterator const (so const char* const).

>+    // Skip to the next word boundary if we're at an ASCII letter (i.e.
>+    // [a-zA-Z]).  Otherwise, advance forward one character using the next
>+    // pointer we're passed.
>+
lose the newline since it makes the comment seem detached from the code below it.  This applies in other places too.

>+  // findAnywhere and findOnBoundary do almost the same thing, so it's natural
>+  // to implement them in terms of a single function.  They're both
>+  // performance-critical functions, however, and checking aOnlyOnBoundary
>+  // makes them a bit slower.  Our solution is to define findInString as
>+  // NS_ALWAYS_INLINE and rely on the compiler to optimize out the
>+  // aOnlyOnBoundary check.
>+  static NS_ALWAYS_INLINE bool
>+  findInString(const nsDependentCSubstring &aToken,
>+               const nsACString &aSourceString,
>+               FindInStringBehavior aBehavior)
>+  {
ditto re: formatting

>+    // It's not particularly meaningful to call this function with an empty
>+    // token, but CaseInsensitiveUTF8CharsEqual assumes that there's at least
>+    // one byte left in the both strings, so we should check.
>+    if (aToken.IsEmpty())
>+      return true;
The token can never ever be empty, but the search string can be (I lied in comment 46.  I was thinking about the token, not the source string).  Just assert it is non-empty.

>+    // We cannot match anything if there is nothing to search.
>+    if (aSourceString.IsEmpty())
>+      return false;
nit: brace this please

>+  MatchAutoCompleteFunction::findBeginning(const nsDependentCSubstring &aToken,
>+                                           const nsACString &aSourceString)
>   {
>+    NS_PRECONDITION(!aSourceString.IsEmpty(),
>+                    "Source string should be non-empty.");
As stated above, it's the token we can assert about, not the source string.
nit: move this to the first line of the method please.

r=sdwilsh with these changes
Attachment #484992 - Flags: review?(sdwilsh) → review+
Whiteboard: [needs new patch]
There are testcases which push empty source strings through these functions.

See http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1287698308.1287699852.1354.gz

May I change the assertion back to an if, or do you want me to try and fix the testcases?
Like I said in comment 55 - I was mistaken in comment 46.  The source string can be null, but the token should never be.  Assert about the token, not the source.
Whiteboard: [needs new patch]
Oh, I see -- I pushed the wrong version to try and ended up being very confused.
Assignee: nobody → justin.lebar+bug
http://hg.mozilla.org/mozilla-central/rev/e7bf870a8bb4
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment on attachment 484992 [details] [diff] [review]
Patch v5.1 (nicer diff than v5)

>+    // We can't use StringBeginsWith here, unfortunately.  Although it will
>+    // happily take a case-insensitive UTF8 comparator, it eventually calls
>+    // nsACString::Equals, which checks that the two strings contain the same
>+    // number of bytes before calling the comparator.  This is clearly not what
>+    // we want.
Because changing the case of a character could change the number of bytes?
Yes.  The comment should be more precise!
(In reply to comment #61)
> Yes.  The comment should be more precise!
rs=me to fix that
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: