Bug 151871
Opened 22 years ago
Closed 22 years ago
performance problems in quoteUrls
Categories
(Bugzilla :: Bugzilla-General, defect)
RESOLVED
FIXED
Bugzilla 2.18
People
(Reporter: myk, Assigned: bbaetz)
References
Details
(Keywords: perf)
Attachments
(1 file, 6 obsolete files)
8.99 KB, patch
myk: review+
Details | Diff | Splinter Review
Bug 145055 on b.m.o has a lot of URLs in one of its comments. As a result, it doesn't load, since the HTTP transaction times out before Bugzilla finishes generating and outputting the content.
Reporter
Comment 1•22 years ago
I don't think this is a regression. The only thing that happened to quoteUrls recently doesn't look like it would have caused this problem.
Reporter
Comment 2•22 years ago
Here's a hack to make the bug visible: http://bugzilla.mozilla.org/show_bug.cgi?id=145055&format=no-urls
Assignee
Comment 3•22 years ago
For a Fun Evening (TM), paste checksetup.pl into a comment. quote_urls takes 90-95% of the time, because of all the @ symbols. I thought I'd filed a bug on that, but evidently I didn't. I tried looking through the regexps, but they're, well, 'complicated'.
Comment 4•22 years ago
Why don't we cache the quoted forms of the comments as an extra field in the comments table? It seems stupid to requote every comment on every bug every time someone wants to see it. If we ever change QuoteUrls(), we can just add a "nuke the cache" step to checksetup.pl.

Gerv
Assignee
Comment 5•22 years ago
That would take too much db space, really. Not to mention that GetBugLink is dynamic (for the status/summary of the linked bug). I think we just need to redo some of the regexps, rather than rewrite the schema.
Assignee
Comment 6•22 years ago
So, test results with checksetup.pl pasted into a comment (I had to temporarily remove the 65535 char limit we now have).

The problem is the first regexp. Look closely at:

    while ($text =~ s%((mailto:)?([\w\.\-\+\=]+\@[\w\-]+(?:\.[\w\-]+)+)\b|
                      (\b((?:$protocol):[^ \t\n<>"]+[\w/])))%"##$count##"%exo) {

Think about what the first part of this regexp does to any word, say 'abcdef'. It doesn't match mailto:, but that's optional, so perl searches the word for an @, starting at 'a'. It doesn't find one, so it backtracks and tries again from 'b'. Then from 'c', then... This is not good, and a real quick fix is to move a \b check to the front of the regexp. That's the 4-character patch. At least, I think that's what is happening.

There's more we can do, though. Timings are with:

    time REQUEST_METHOD='GET' QUERY_STRING='id=19' perl -wT show_bug.cgi > /dev/null

Times vary about 0.15 seconds each way, and I viewed the page in the browser first so that db disk access wasn't an issue. All tests include the bits from before.

- unpatched - 9.4 seconds
- 4 character patch to move \b - 4.8 seconds
- another few characters to remove needless capturing fu - 4.55 seconds [WANT TO CONSIDER THIS ONE FOR 2.16]
- splitting the single regexp into two - 2.5 seconds
- removing the /e modifier - 2.5 seconds (this one won't matter much unless we have lots of matches, though)
- combining regexps so that we can use /g, rather than playing games with # (and also eliminating $&, which didn't seem to have much of an effect) - 1.6 seconds

Note that it takes about 1.1 seconds before the Content-Type header shows up, so this is a reduction in time of just under 94%. The last one includes a few other minor cleanups, too (e.g. /i for matching on protocols, not including the ) for |(attachment 3)|, and so on).

I'll attach the patch I want for 2.16, plus the trunk one. The trunk one wants careful inspection.
We can take the splitting one for 2.16 too, if we want - I'll attach that one too.
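The backtracking behaviour described above reproduces in any backtracking regex engine, not just Perl's. Below is an illustrative Python sketch: the patterns are simplified stand-ins for Bugzilla's email regexp (with and without the leading \b from the 4-character patch), and the timings are machine-dependent.

```python
import re
import time

# Simplified stand-ins for quoteUrls()'s email pattern, without and with
# the leading \b added by the 4-character patch.
slow = re.compile(r'(mailto:)?([\w.+=-]+@[\w-]+(?:\.[\w-]+)+)')
fast = re.compile(r'\b(mailto:)?([\w.+=-]+@[\w-]+(?:\.[\w-]+)+)')

# A comment made of long @-free words, like a pasted checksetup.pl.
text = ('identifier' * 20 + ' ') * 300

def bench(pat, s):
    t0 = time.perf_counter()
    hits = pat.findall(s)
    return time.perf_counter() - t0, hits

t_slow, h_slow = bench(slow, text)
t_fast, h_fast = bench(fast, text)
assert h_slow == h_fast == []   # no email addresses to find here

# Without \b, every offset inside each word is a candidate start, and the
# engine backtracks through the rest of the word hunting for an @ each time.
# With \b, mid-word offsets fail immediately.
print(f'without \\b: {t_slow:.4f}s   with \\b: {t_fast:.4f}s')

# Both variants still find real addresses:
assert slow.findall('mail me at foo@bar.com') == [('', 'foo@bar.com')]
assert fast.findall('mail me at foo@bar.com') == [('', 'foo@bar.com')]
```

The same asymptotics explain the measured 9.4s -> 4.8s drop: the \b turns per-word work from quadratic in the word length into roughly linear.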
Assignee
Comment 7•22 years ago
Assignee
Comment 8•22 years ago
Either this one or the minimal one should happen for 2.16.
Assignee
Comment 9•22 years ago
This should go onto the trunk, once it's been tested.
Assignee
Comment 10•22 years ago
Times for bug 145055:

- unpatched - 100 seconds
- minimal change - 34 seconds
- medium change - 5.5 seconds
- invasive change - 1.8 seconds

The big difference from minimal to medium is probably because this is a large file which also has lots of matches.
Comment 11•22 years ago
Comment on attachment 88090 - minimal change, v1

I think we should aim at the medium solution for the branch. At least I feel that it will give us the best speedup/risk ratio. Unfortunately my brain is not sufficiently awake to have a look at those regexp chunks... the minimal approach was enough ;-)

r=jouni for the minimal model, in case we decide to check it in. The regex looks sane to me and also seems to work.
Attachment #88090 - Flags: review+
Reporter
Comment 12•22 years ago
bbaetz: Have you tried saving the output to files and comparing, to see if there are any differences?
Assignee
Comment 13•22 years ago
I haven't, no. I could set up a script to do this on the landfill db, I guess, but the problem is that the results aren't going to be the same.

Currently 'asdasdasdamailto:bbaetz@student.usyd.edu.au' will link the mailto thing, but my code changes it. That's because that is the cause of the perf problem, so it has to change. Instead, it will linkify just the actual address itself, because : is counted as \b.

Also, the current code will, for "(see attachment 1)", linkify from the a in attachment up to and including the ). The last patch of mine avoids linkifying the ), because the 'created attachment' regexp is separate from the others. The last patch linkifies HTTP://lwn.net; the old code didn't.

I'll see if I can test, and then manually check any changes, though.
Assignee
Comment 14•22 years ago
OK. So I wrote a test script, and it failed. The problem is something like:

    http://foo/bar?id=foo@bar.com

That contains both a valid URL and a valid email address. So we have to do the substitution for URIs first, to avoid double matching. The alternative is to have the start not match =, /, or #. But URIs can contain other chars, too (& and ( being some of the more interesting ones), and so blocking those blocks valid bug references, too (|http://foo/bug#3| vs |(bug#2)|).

Anyway, the test script passes on my db + landfill. You'll need Text::Diff to run it. There are, however, differences, the most notable one being the 'updated attachment' thing I described in a previous comment.

myk, can you run this on a version of bmo's db, and mail me the results? After that, I'll convert these into patches again.
Attachment #88090 - Attachment is obsolete: true
Attachment #88092 - Attachment is obsolete: true
Attachment #88093 - Attachment is obsolete: true
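The double-matching problem above is easy to reproduce, and so is the fix: substitute URLs into placeholders first, so the email pattern never sees their query strings. This is an illustrative Python sketch with simplified stand-in patterns and invented helper names, not Bugzilla's actual Perl code.

```python
import re

# Simplified stand-ins for the quoteUrls() patterns.
url_re   = re.compile(r'\b(?:http|https|ftp):[^\s<>"]+[\w/]')
email_re = re.compile(r'\b[\w.+=-]+@[\w-]+(?:\.[\w-]+)+')

text = 'see http://foo/bar?id=foo@bar.com for details'

# Run naively, the email pattern matches *inside* the URL:
assert email_re.findall(text) == ['id=foo@bar.com']

def linkify(text):
    things = []
    def stash(markup):
        things.append(markup)
        return f'\0\0{len(things) - 1}\0\0'   # same placeholder trick as the patch
    # URLs first, so the email pattern never sees their innards.
    text = url_re.sub(lambda m: stash(f'<a href="{m.group(0)}">{m.group(0)}</a>'), text)
    text = email_re.sub(lambda m: stash(f'<a href="mailto:{m.group(0)}">{m.group(0)}</a>'), text)
    # Now remove the encoding hacks.
    return re.sub(r'\0\0(\d+)\0\0', lambda m: things[int(m.group(1))], text)

print(linkify(text))
```

With the URL stashed away, no spurious mailto link is generated for the @ inside the query string.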
Assignee
Comment 15•22 years ago
Oh, and for an improvement on the small and medium patches (as well as the original), replace:

    for (my $i=0 ; $i<$count ; $i++) {
        $text =~ s/##$i##/$things[$i]/e;
    }

with:

    $text =~ s/##(\d+)##/$things[$1]/eg;

which avoids traversing the text once for each match, i.e. a quadratic algorithm. That change, made on a tree with the medium patch in place, cuts the time for the bug 145055 testcase down from 5.5 seconds to 2.9 seconds. That's still above the 1.8 I get for the invasive patch, but it's much better than the original.
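The quadratic-vs-linear difference shows up in any regex engine. An illustrative Python sketch of the two placeholder-restoration strategies (the placeholder format is from the comment; the names and data are invented):

```python
import re
import time

n = 2000
things = [f'<a href="http://example/{i}">{i}</a>' for i in range(n)]
text = ' filler '.join(f'##{i}##' for i in range(n))

# Old way: one substitution pass over the whole text per placeholder,
# i.e. a quadratic algorithm.
def put_back_loop(s):
    for i, thing in enumerate(things):
        s = re.sub(rf'##{i}##', thing, s, count=1)
    return s

# New way: a single global pass; the captured index selects the replacement.
def put_back_once(s):
    return re.sub(r'##(\d+)##', lambda m: things[int(m.group(1))], s)

t0 = time.perf_counter()
a = put_back_loop(text)
t1 = time.perf_counter()
b = put_back_once(text)
t2 = time.perf_counter()

assert a == b and '##' not in a
print(f'loop: {t1 - t0:.4f}s   single pass: {t2 - t1:.4f}s')
```

The single /g-style pass scans the text once no matter how many placeholders it holds, which is exactly why it helps most on comments with many matches.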
Reporter
Comment 16•22 years ago
Running the test script with the patch and excluding secure bugs. Will report results directly to bbaetz.
Assignee
Comment 17•22 years ago
OK, here's take 2. I fixed some stuff:

- mailto:foo@bar was being turned into mailto:mailto:foo@bar
- "bug# 123" wasn't being linkified, because of the space after the #.

I've also readded the bug that caused closing brackets after attachment ids to be included in the link, in a bid to remove false positives from the bmo test results. I'll remove that before making the patches.

Other changes I noticed, which I am deliberately leaving in place:

- trailing/leading -, +, = are no longer included in mailto links. This is almost always not what is wanted. I don't even know if that's valid in an email address, and a grep through the results indicated that that was never what it was being used for.
- value_quote (which the old code uses) replaces newlines with &#10;, because that's needed for hidden form elements (see bug 4928). This means that we have to replace that with \n again for display. Bugzilla does this before substituting the replacements, so the extra entity is in the source if we matched over multiple lines. Since that's the only difference between value_quote and html_quote, I just used html_quote in the first place. This difference is only visible in the source code, at least for mozilla, which expands the entity on the page.

myk, can I get you to run this one on bmo's data, please?
Attachment #88248 - Attachment is obsolete: true
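The value_quote/html_quote distinction can be sketched as follows. This is an illustrative Python analogue, not Bugzilla's actual Perl code; the helper names simply mirror the Perl ones, and the escaping is approximate.

```python
import html

def html_quote(s):
    # Escape &, <, >, " -- enough for body text.
    return html.escape(s, quote=True)

def value_quote(s):
    # Like html_quote, but newlines become &#10; so the value survives
    # inside a hidden <input value="..."> (see bug 4928).
    return html_quote(s).replace('\r\n', '&#10;').replace('\n', '&#10;')

comment = 'see the patch\nat http://example.org & <test>'
print(html_quote(comment))
print(value_quote(comment))
```

Since quoteUrls() output is destined for display rather than a hidden form field, the &#10; step is pure overhead there, which is why html_quote suffices.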
Assignee
Comment 18•22 years ago
OK. Changes: I was misreading the diff, and so my previous diff made us match on trailing -, instead of removing it. Other fixes:

- match on attachment\s(id=123)
- an "attachment 1" in a bug title was matched - swapped the order around to fix that.
- only search for groupset=0

One big problem with this test script is that the old code has a bug. The anchor matching code looks to see if it should prepend mailto by searching on /^$protocol:/. However, $protocol does not have brackets around it. This means that anyone whose email address contains http, mail, or news didn't have mailto: prepended. I've fixed this, but mpt@mailandnews.com appears a great number of times here. While tracking that down, I did clean up that code a bit, using qr//.

Other things I'm not going to fix:

- line ending format. This doesn't matter for html, and I don't see the point in using value_quote, then resubstituting immediately. This appears to only affect moved bug mail by the looks of it, since process_bug does line ending substitution, which makes it doubly wasteful to convert stuff.

This time I got to the 40% mark before giving up. Anyway, let's give this one more round. I no longer think we should take this for 2.16, so my test script only tests the final change now. (This should also mean that the test script runs faster.)

myk: Can you run this once more, then I'll generate a patch? Review comments are welcome, too.
Attachment #89047 - Attachment is obsolete: true
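The missing-brackets bug described above is a classic alternation-precedence mistake, and it behaves the same in any regex flavour. An illustrative Python sketch with a cut-down protocol list (the real Bugzilla list is longer):

```python
import re

# A cut-down protocol list for illustration.
protocol = 'http|mailto|news|nntp'

# Buggy: without brackets, ^ binds only to 'http' and ':' only to 'nntp',
# so the middle alternatives ('mailto', 'news') match anywhere at all.
buggy = re.compile(f'^{protocol}:')

# Fixed: the group makes it "starts with any protocol, then a colon".
fixed = re.compile(f'^(?:{protocol}):')

addr = 'mpt@mailandnews.com'

assert buggy.search(addr)       # 'news' matches mid-address: false positive
assert not fixed.search(addr)   # correctly: no protocol prefix here

# Only the fixed version decides this address needs a mailto: prefix.
link = addr if fixed.search(addr) else 'mailto:' + addr
print(link)   # mailto:mpt@mailandnews.com
```

With the buggy pattern, the code concluded the address already carried a protocol and skipped the mailto: prefix, which is exactly the symptom described for mpt@mailandnews.com.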
Assignee
Comment 19•22 years ago
I don't think this should make 2.16 - it's a long-standing bug, and testing multiple fixes is just a pain. It can go into 2.16.1, if we ever release that.
Target Milestone: Bugzilla 2.16 → Bugzilla 2.18
Assignee
Comment 20•22 years ago
OK, summary. There are 829 diffs in the final log, consisting of approximately:

- 332 &#10; -> \n translations
- 59 places where the preceding -/=/. in an email was linkified
- about 240 cases of email addresses missing the mailto: when linkified (including 108 of mpt's email)
- about 140 cases of capital urls now being included
- 80 cases of the linkification for "Created an attachment" when not in the first line being altered to not include the 'Created an ' bit
- 5 cases of line endings when importing a bug from bugsplat
- a couple of cases where the duplicate bug-linking has changed (it now has to be at the beginning/end of a line, to check that it was bugzilla-generated)

That's a total of 856, out of the 829 present. These were done with grep -c, so they're not accurate, especially as Text::Diff doesn't seem to leave a newline between the - and + lines of a diff if the diffed line doesn't end in a newline (i.e. the last one in the comment, or the &#10; thing, in our case). This meant that my divide-by-2 count was off. Also, some diffs were affected by more than one problem, so were double counted.

I'll go create a patch now. I'm not really fussy about having 'created an attachment' matching on a line by its own which isn't the start, and same with the duplicate stuff. However, because of the algorithm used, the duplicate stuff will need to be at least on a line of its own, even if it's not at the end of the comment (because if a bug title contains the magic text, we'll linkify inside that, too - see comments in the test script). If I don't match on the last line, though, we will mis-linkify comments which have a title of:

    " *** This bug has been marked a duplicate of 123 *** "

_including the newlines_. Newlines in a summary shouldn't be valid, but I know we do have such bugs, so... I'm not sure that one is such an issue, though - it only happens about 10 times, mostly with imported bugs, in which case the linkification wouldn't be correct anyway. Ditto for the 'created an attachment' bit, although there are cases where people copy and paste the comment to comment on it. For that case, though, we still do have a link to the attachment; it just doesn't include the 'created an ' part. Not to mention that restricting these probably speeds things up a bit, too.
Assignee
Comment 21•22 years ago
OK, here's the patch, based on the script. Changes from the test version:

- a few comment changes
- reinstated the fix for matching the closing ) in attachment comments I described earlier
- don't match on |foobarattachment (id=1)|. There weren't any cases of that in the log myk sent me, but it's still a correctness fix, since the old code doesn't match on that either.
- do match on (attachment (id=2)) - ditto

I've also included a fix to CGI.pl to not use $&, which according to the perlre docs will slow down matches if it appears even once. I didn't notice any difference, but it won't hurt, and this patch removes all the other uses....

This is ready for review.
Attachment #89191 - Attachment is obsolete: true
Reporter
Comment 22•22 years ago
Comment on attachment 89683 - patch, v3

>Index: globals.pl
>+ $text =~ s~\b(${protocol_re}: # The protocol:
>+ [^\s<>\"]+ # Any non-whitespace
>+ [\w\/]) # so that we end in \w or /

Can't URLs end in non-word characters like periods (.)?

>+ ~($tmp = html_quote($1)) &&
>+ ($things[$count++] = "<a href=\"$tmp\">$tmp</a>") &&

Not sure if it matters, but wouldn't it save memory to store only $tmp in @things and wrap it in an anchor when you put the things back in at the end?

>+ # Duplicate markers
>+ $text =~ s~(?<=^\*\*\*\ This\ bug\ has\ been\ marked\ as\ a\ duplicate\ of\ )
>+ (\d+)
>+ (?=\ \*\*\*\Z)
>+ ~GetBugLink_big($1, $1)

(bbaetz says on IRC to pretend this says GetBugLink)

>+ # Now remove the encoding hacks
>+ $text =~ s/\0\0(\d+)\0\0/$things[$1]/eg;
>+ $text =~ s/$chr1\0/\0/g;

I suppose there are obscure situations where this meta-escaping would fail, but I don't think we need to worry about them.

> sub GetBugLink {
>- my ($bug_num, $link_text) = (@_);
>+ my ($bug_num, $link_text, $comment_num) = (@_);

My perennial nit: (@_) is redundant.

>Index: CGI.pl
>- $0 =~ m:[^/\\]*$:;
>- $nexturl = $&;
>+ $0 =~ m:([^/\\]*)$:;
>+ $nexturl = $1;

What's the reason for this change? It doesn't appear to have anything to do with the quote URLs changes, nor does it appear to change anything (both versions look like they work exactly the same).
Attachment #89683 - Flags: review+
Assignee
Comment 23•22 years ago
> Can't URLs end in non-word characters like periods (.)?

They can, yes. But: a) the old code didn't match them; b) it's rare; c) more importantly, consider http://foo.com/bar.html. We don't want the final . to be linkified.

> Not sure if it matters, but wouldn't it save memory to store
> only $tmp in @things and wrap it in an anchor when you put the
> things back in at the end?

That would only work if this were the only thing stored in @things. Which it is, but when we linkify http://bugzilla.mozilla.org/show_bug.cgi?id=1, it won't be. Besides, you'd need a truly massive number of links to take up a noticeable amount of memory.

> I suppose there are obscure situations where this meta-escaping
> would fail, but I don't think we need to worry about them.

There shouldn't be any - the old code did a similar thing, and the value in @things has been value_quoted, so it can't contain \0 at all.

[CGI.pl change]
> What's the reason for this change? It doesn't appear to have
> anything to do with the quote URLs changes, nor does it appear
> to change anything (both versions look like they work exactly
> the same).

As soon as any part of a program uses a regexp $-variable, perl has to start capturing it for all cases, since it can't know whether it's used or not. This means that this one use makes every regexp capture the entire matching string. perldoc perlre warns that $& is expensive, so it's probably a good idea to remove it from this one place, and thus everywhere (since I've removed the remainder of the uses in quote_urls). I didn't notice a difference, but it can't hurt...
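For reference, the CGI.pl hunk just switches from the implicit whole-match variable to an explicit capture group. Python has no analogue of Perl's $& penalty, so this sketch is purely illustrative of the transformation; the path is a made-up stand-in for $0.

```python
import re

path = '/usr/local/bugzilla/show_bug.cgi'   # made-up example value for $0

# The old Perl code read the whole match via $&; the patched code captures
# the basename explicitly and reads capture group 1 instead.
m = re.search(r'([^/\\]*)$', path)
nexturl = m.group(1)
print(nexturl)   # show_bug.cgi
```

Both styles extract the same basename; in Perl the explicit capture avoids triggering the process-wide $& bookkeeping.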
Reporter
Comment 24•22 years ago
Comment on attachment 89683 - patch, v3

Ok, that's good enough for me. This is still r=myk; not sure what the convention is yet for attachment statuses in 2.17, so I'll check "second-review".
Attachment #89683 - Flags: review+
Assignee
Comment 25•22 years ago
Checked in.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Updated•11 years ago
QA Contact: matty_is_a_geek → default-qa