Closed Bug 151871 Opened 22 years ago Closed 22 years ago

performance problems in quoteUrls

Categories

(Bugzilla :: Bugzilla-General, defect)

defect
Not set
major

Tracking


RESOLVED FIXED
Bugzilla 2.18

People

(Reporter: myk, Assigned: bbaetz)

References

Details

(Keywords: perf)

Attachments

(1 file, 6 obsolete files)

Bug 145055 on b.m.o has a lot of URLs in one of its comments.  As a result, it
doesn't load, since the HTTP transaction times out before Bugzilla finishes
generating and outputting the content.
Blocks: bz-perf
I don't think this is a regression.  The only thing that happened to quoteUrls
recently doesn't look like it would have caused this problem.
For a Fun Evening (TM) paste checksetup.pl into a comment. quote_urls is 90-95%
of the time taken, because of all the @ symbols.

I thought I'd filed a bug on that, but obviously I didn't. I tried looking
through the regexps, but they're, well, 'complicated'.
Blocks: 150783
Why don't we cache the quoted forms of the comments as an extra field in the
comments table? It seems stupid to requote every comment on every bug every time
someone wants to see it. If we ever change QuoteUrls(), we can just add a "nuke
the cache" to checksetup.pl.

Gerv
That would take too much db space, really. Not to mention that GetBugLink is
dynamic (it shows the status/summary of the linked bug)

I think we just need to redo some of the regexps, though, rather than rewrite
the schema.
So, test results with checksetup.pl pasted into a comment (I had to temporarily
work around the 65535 char limit we now have):

The problem is the first regexp. Look closely at:

    while ($text =~ s%((mailto:)?([\w\.\-\+\=]+\@[\w\-]+(?:\.[\w\-]+)+)\b|
                    (\b((?:$protocol):[^ \t\n<>"]+[\w/])))%"##$count##"%exo) {

Think about what the first part of this regexp does to an ordinary word, say
'abcdef'. It doesn't match mailto:, but that's optional, so perl searches the
word for an @, starting at 'a'. It doesn't find one, so it backtracks, and
tries again from 'b'. Then from 'c', then..... This is not good, and a real
quick fix is to move a \b check to the front of the regexp. That's the 4 char
patch. At least I think that's what is happening.
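The backtracking behaviour described above can be sketched in Python (the real code is Perl; these simplified patterns are illustrative approximations, not the exact Bugzilla regexps). A leading \b lets the engine discard an ordinary word at its first character instead of retrying the @-search from every offset, without changing what matches for plain words and addresses:

```python
import re

# Simplified stand-in for the email half of the quoteUrls regexp,
# with and without the leading \b anchor (illustrative only).
email = r'(?:mailto:)?[\w.\-+=]+@[\w\-]+(?:\.[\w\-]+)+\b'
without_anchor = re.compile(email)
with_anchor = re.compile(r'\b' + email)

# Ordinary words and addresses match identically either way...
for s in ["abcdef", "mailto:foo@bar.com", "see foo@bar.com there"]:
    assert ([m.group(0) for m in without_anchor.finditer(s)]
            == [m.group(0) for m in with_anchor.finditer(s)])

# ...though a leading -, +, or = is no longer swallowed into the match,
# a deliberate behaviour change discussed later in this bug.
assert without_anchor.search("-foo@bar.com").group(0) == "-foo@bar.com"
assert with_anchor.search("-foo@bar.com").group(0) == "foo@bar.com"
```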

There's more we can do, though:

Timings are with time REQUEST_METHOD='GET' QUERY_STRING='id=19' perl -wT
show_bug.cgi > /dev/null

Times vary about 0.15 seconds each way, and I viewed the page in the browser
first so that db disk access wasn't an issue. All tests include the bits from
before.

- unpatched - 9.4 seconds
- 4 character patch to move \b - 4.8 seconds
- Another few characters to remove needless capturing - 4.55 seconds [WANT TO
CONSIDER THIS ONE FOR 2.16]
- Splitting the single regexp into two - 2.5 seconds
- Removing the /e modifier - 2.5 seconds (This one won't matter much unless we
have lots of matches, though)
- Combining regexps so that we can use /g, rather than playing games with # (and
also eliminating $&, which didn't seem to have much of an effect) - 1.6 seconds

Note that it takes about 1.1 seconds before the Content-Type header shows up, so
this is a reduction in time of just under 94%

The last one includes a few other minor cleanups, too (eg /i for matching on
protocols, not including the ) for |(attachment 3)|, and so on)

I'll attach the patch I want for 2.16, plus the trunk one. The trunk one wants
careful inspection.

We can take the splitting one for 2.16 too, if we want - I'll attach that one too.
Assignee: justdave → bbaetz
Keywords: perf
Target Milestone: --- → Bugzilla 2.16
Attached patch minimal change, v1 (obsolete)
Attached patch medium change, v1 (obsolete)
Either this one or the minimal one should happen for 2.16
Attached patch invasive change, v1 (obsolete)
This should go onto the trunk, once it's been tested.
Times for bug 145055:

- unpatched - 100 seconds
- minimal change - 34 seconds
- medium change - 5.5 seconds
- invasive change - 1.8 seconds

The big diff from minimal->medium is probably because this is a large file which
also has lots of matches
Severity: normal → major
Status: NEW → ASSIGNED
Keywords: patch, review
Comment on attachment 88090
minimal change, v1

I think we should aim at the medium solution for the branch. At least I feel
that it will give us the best speedup/risk ratio. 

Unfortunately my brains are not sufficiently awake to have a look at those
regexp chunks... This minimal approach was enough ;-) r=jouni for the minimal
model in case we decide to check it in. The regex looks sane to me and also
seems to work.
Attachment #88090 - Flags: review+
bbaetz- Have you tried saving output to files and comparing to see if there are
any differences?
I haven't, no.

I could set up a script to do this on the landfill db, I guess, but the problem
is that the results aren't going to be the same.

Currently 'asdasdasdamailto:bbaetz@student.usyd.edu.au' will link the mailto
thing, but my code changes it. That's because that is the cause of the perf
problem, so it has to change. Instead, it will linkify just the actual address
itself, because : counts as a \b boundary.

Also, the current code will, for "(see attachment 1)", linkify from the a in
attachment up to and including the ). The last patch of mine avoids linkifying
the ) because the 'created attachment' regexp is separate from the others.

The last patch linkifies HTTP://lwn.net; the old code didn't.

I'll see if I can test, and then manually check any changes, though.
Attached file test script (obsolete)
OK. So I wrote a test script, and it failed.

The problem is something like: http://foo/bar?id=foo@bar.com

That contains both a valid url and a valid email address. So we have to do
substitution for uris, to avoid double matching.

The alternative is to have the start not match =, /, or #. But URIs can contain
other chars, too (& and ( being some of the more interesting ones), and so
blocking that blocks valid bug references, too (|http://foo/bug#3| vs
|(bug#2)|)
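The double-matching hazard just described can be sketched in Python (the real code is Perl; the helper and pattern names here are hypothetical): substitute URLs first, replacing each with an opaque ##N## marker, so the later email pass never sees the address embedded in a query string.

```python
import re

# Illustrative, simplified patterns - not the actual Bugzilla regexps.
url_re = re.compile(r'\b(https?://[^\s<>"]+[\w/])')
email_re = re.compile(r'\b([\w.\-+]+@[\w\-]+(?:\.[\w\-]+)+)')

def quote_urls(text):
    things = []
    def stash(markup):
        # Park the generated markup and leave a placeholder behind.
        things.append(markup)
        return "##%d##" % (len(things) - 1)
    # Pass 1: URLs (which may themselves contain an email address).
    text = url_re.sub(
        lambda m: stash('<a href="%s">%s</a>' % (m.group(1), m.group(1))), text)
    # Pass 2: bare email addresses; URL bodies are now opaque markers.
    text = email_re.sub(
        lambda m: stash('<a href="mailto:%s">%s</a>' % (m.group(1), m.group(1))), text)
    # Finally swap the markers back in a single pass.
    return re.sub(r'##(\d+)##', lambda m: things[int(m.group(1))], text)

out = quote_urls("see http://foo/bar?id=foo@bar.com or mail baz@qux.org")
assert '<a href="http://foo/bar?id=foo@bar.com">' in out
assert '<a href="mailto:baz@qux.org">' in out
assert 'mailto:foo@bar.com' not in out
```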

Anyway, the test script passes my db + landfill. You'll need Text::Diff to run
it.

There are, however, differences, the most notable one being the 'updated
attachment' thing I described in a previous comment.

myk, can you run this on a version of bmo's db, and mail me the results? After
that, I'll convert these into patches again.
Attachment #88090 - Attachment is obsolete: true
Attachment #88092 - Attachment is obsolete: true
Attachment #88093 - Attachment is obsolete: true
Oh, and for an improvement on the small and med patches (as well as the
original), replace:

    for (my $i=0 ; $i<$count ; $i++) {
        $text =~ s/##$i##/$things[$i]/e;
    }

with:

   $text =~ s/##(\d+)##/$things[$1]/eg;

which avoids scanning the text once for each placeholder, ie a quadratic algorithm.

That change, made on a tree with the medium patch in place, cuts the time for
the bug 145055 testcase down from 5.5 seconds to 2.9 seconds. That's still above
the 1.8 I get for the invasive patch, but it's much better than the original.
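The quadratic-versus-single-pass difference above can be sketched in Python (illustrative only; the actual code is the Perl shown in the comment). The loop reruns one substitution per placeholder, rescanning the whole text each time, while a single global pass with a captured index does the same work in one sweep:

```python
import re

things = ["<a>zero</a>", "<a>one</a>", "<a>two</a>"]
text = "x ##0## y ##1## z ##2##"

# Quadratic version: one full-text scan per placeholder.
slow = text
for i in range(len(things)):
    slow = re.sub(r"##%d##" % i, things[i], slow)

# Single-pass version, the analogue of s/##(\d+)##/$things[$1]/eg.
fast = re.sub(r"##(\d+)##", lambda m: things[int(m.group(1))], text)

assert slow == fast == "x <a>zero</a> y <a>one</a> z <a>two</a>"
```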
Running the test script with the patch and excluding secure bugs.  Will report
results directly to bbaetz.
Attached patch test script, take 2 (obsolete)
OK, here's take 2. I fixed some stuff:
 - mailto:foo@bar was being turned into mailto:mailto:foo@bar
 - "bug# 123" wasn't being linkified, because of the space after the #.

I've also readded the bug that caused closing brackets after attachment ids to
be included in the link, in a bid to remove false positives from the bmo test
results. I'll remove that before making the patches.

Other changes I noticed, which I am deliberately leaving in place:

- trailing/leading -, +, = are no longer included in mailto links. This is
almost always not what is wanted. I don't even know if that's valid in an email
address, and a grep through the results indicated that that was never what it
was being used for.

- value_quote (which the old stuff uses) replaces newlines with &#013;, because
that's needed for hidden form elements (see bug 4928). This means that we have
to replace that with \n again, for display. bugzilla does this before
substituting the replacements, so the extra entity is in the source if we
matched over multiple lines. Since that's the only difference between
value_quote and html_quote, I just used html_quote instead, in the first place.
This difference is only visible in the source code, at least for mozilla, which
expands the entity on the page.

myk, can I get you to run this one on bmo's data, please?
Attachment #88248 - Attachment is obsolete: true
Blocks: 154278
Attached patch test script, take 3 (obsolete)
OK. Changes - I was misreading the diff, and so my previous diff made us match
on trailing -, instead of removing.
Other fixes:
  - match on attachment\s(id=123)
  - an "attachment 1" in a bug title was matched - swapped the order around to
fix that.
  - only search for groupset=0

One big problem with this test script is that the old code has a bug. The
anchor matching code looks to see if it should prepend mailto: by searching on
/^$protocol:/. However, $protocol does not have brackets around it. This means
that anyone whose email address contains http, mail, or news didn't have mailto:
prepended. I've fixed this, but mpt@mailandnews.com appears a great number of
times here. While tracking that down, I did clean up that code a bit, using
qr//.

Other things I'm not going to fix:
  - line ending format. This doesn't matter for html, and I don't see the point
in using value_quote, then resubstituting immediately. This appears to only
affect moved bug mail by the looks of it, since process_bug does line ending
substitution, which makes it doubly wasteful to convert stuff.

This time I got to the 40% mark before giving up.

Anyway, let's give this one more round. I no longer think we should take this
for 2.16, so my test script only tests the final change now. (This should also
mean that the test script runs faster.)

myk: Can you run this once more, then I'll generate a patch?
Review comments are welcome, too.
Attachment #89047 - Attachment is obsolete: true
I don't think this should make 2.16 - it's a long-standing bug, and testing
multiple fixes is just a pain. It can go into 2.16.1 if we ever release that
Target Milestone: Bugzilla 2.16 → Bugzilla 2.18
OK, summary. There are 829 diffs in the final log, consisting of approximately:

 - 332 &#013; -> \n translations
 - 59 places where the preceding -/=/. in an email was linkified
 - About 240 cases of email addresses missing the mailto: when linkified
(including 108 of mpt's email)
 - About 140 cases of capital urls now being included.
 - 80 cases of the linkification for "Created an attachment" when not in the
first line being altered to not include the 'Created an ' bit
 - 5 cases of line endings when importing a bug from bugsplat.
 - A couple of cases where the duplicate bug-linking has changed (it now has to
be on the beginning/end of a line, to check that it was bugzilla-generated)

That's a total of 856, out of the 829 present. These were done with grep -c, so
they're not accurate, especially as Text::Diff doesn't seem to leave a newline
between the - and + line of a diff if the diffed line doesn't end in a newline
(ie the last one in the comment, or the &#013; thing, in our case). This meant
that my divide by 2 count was off.

Also, some diffs were affected by more than one problem, so were double counted.

I'll go create a patch, now. I'm not really fussy about having 'created an
attachment' matching on a line by its own which isn't the start, and same with
the duplicate stuff.

However, because of the algorithm used, the duplicate stuff will need to be at
least on a line of its own, even if it's not at the end of the comment (because
if a bug title contains the magic text, we'll linkify inside that, too - see
comments in the test script). If I don't match on the last line, though, we will
mis-linkify comments which have a title of:

"
*** This bug has been marked a duplicate of 123 ***
"

_including the newlines_. Newlines in a summary shouldn't be valid, but I know
we do have such bugs, so...

I'm not sure if that one is such an issue, though - it only happens about 10
times, mostly with imported bugs, in which case the linkification wouldn't be
correct anyway. Ditto for the 'created an attachment' bit, although there are
cases where people copy and paste the comment to comment on it. For that case,
though, we still do have a link to the attachment; it just doesn't include the
'created an ' part.

Not to mention that restricting these probably speeds things up a bit, too.
Attached patch patch, v3
OK, here's the patch, based on the script. Changes from the test version:

- a few comment changes
- reinstated the fix for matching closing ) in attachment comments I described
earlier
- don't match on |foobarattachment (id=1)|. There weren't any cases of that in
the log myk sent me, but it's still a correctness fix, since the old code
doesn't match on that either.
- do match on (attachment (id=2)) - ditto

I've also included a fix to CGI.pl to not use $&, which according to the perlre
docs will slow down matches if it's in there even once. Didn't notice any
difference, but it won't hurt, and since this patch removes all the other
uses....

This is ready for review.
Attachment #89191 - Attachment is obsolete: true
Comment on attachment 89683
patch, v3

>Index: globals.pl

>+    $text =~ s~\b(${protocol_re}:  # The protocol:
>+                  [^\s<>\"]+       # Any non-whitespace
>+                  [\w\/])          # so that we end in \w or /

Can't URLs end in non-word characters like periods (.)?


>+              ~($tmp = html_quote($1)) &&
>+               ($things[$count++] = "<a href=\"$tmp\">$tmp</a>") &&

Not sure  if it matters, but wouldn't it save memory to store
only $tmp in @things and wrap it in an anchor when you put the
things back in at the end?


>+    # Duplicate markers
>+    $text =~ s~(?<=^\*\*\*\ This\ bug\ has\ been\ marked\ as\ a\ duplicate\ of\ )
>+               (\d+)
>+               (?=\ \*\*\*\Z)
>+              ~GetBugLink_big($1, $1)

(bbaetz says on IRC to pretend this says GetBugLink)


>+    # Now remove the encoding hacks
>+    $text =~ s/\0\0(\d+)\0\0/$things[$1]/eg;
>+    $text =~ s/$chr1\0/\0/g;

I suppose there are obscure situations where this meta-escaping
would fail, but I don't think we need to worry about them.


> sub GetBugLink {
>-    my ($bug_num, $link_text) = (@_);
>+    my ($bug_num, $link_text, $comment_num) = (@_);

My perennial nit: (@_) is redundant.


>Index: CGI.pl

>-            $0 =~ m:[^/\\]*$:;
>-            $nexturl = $&;
>+            $0 =~ m:([^/\\]*)$:;
>+            $nexturl = $1;

What's the reason for this change?  It doesn't appear to have
anything to do with the quote URLs changes, nor does it appear
to change anything (both versions look like they work exactly
the same).
Attachment #89683 - Flags: review+
> Can't URLs end in non-word characters like periods (.)?

They can, yes. But:

a) the old code didn't;
b) it's rare;
c) more importantly, consider http://foo.com/bar.html. We don't want the . to be
linkified.

> Not sure  if it matters, but wouldn't it save memory to store
> only $tmp in @things and wrap it in an anchor when you put the
> things back in at the end?

That would only work if this is the only thing stored in @things. Which it is,
but when we linkify http://bugzilla.mozilla.org/show_bug.cgi?id=1, it won't be.
Besides, you'd need a truly massive number of links to take up a noticeable
amount of memory.

> I suppose there are obscure situations where this meta-escaping
> would fail, but I don't think we need to worry about them.

There shouldn't be any - the old code did a similar thing, and the value in
things has been value_quoted, so it can't contain \0 at all.

[CGI.pl change]

> What's the reason for this change?  It doesn't appear to have
> anything to do with the quote URLs changes, nor does it appear
> to change anything (both versions look like they work exactly
> the same).

as soon as any part of a program uses a regexp $-var, perl has to start
capturing it for all cases, since it can't know if it's used or not. This means
that this one use makes every regexp capture the entire matching string.
perldoc perlre warns that $& is expensive, and so it's probably a good idea to
remove it from this one place, and thus everywhere (since I've removed the
remainder of the uses in quote_urls). I didn't notice a difference, but it can't
hurt...
Comment on attachment 89683
patch, v3

Ok, that's good enough for me. This is still r=myk; not sure what the
convention is yet for attachment statuses in 2.17, so I'll check
"second-review".
Attachment #89683 - Flags: review+
Checked in.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
QA Contact: matty_is_a_geek → default-qa