Closed Bug 558482 Opened 14 years ago Closed 14 years ago

Suggest known leak bugs when the log summary contains a leak report

Categories

(Tree Management Graveyard :: TBPL, enhancement)

x86
macOS
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: alqahira, Assigned: ehsan.akhgari)

References

Details

Attachments

(1 file, 6 obsolete files)

I just starred http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1270851612.1270853781.17925.gz, which is a leak on branch that was fixed >3 months ago on m-c.  tbpl didn't offer any suggestions.  

In Ehsan's blog post, he mentioned that the suggestion code looks for "recent fixes" to cover things that need to be open, or branches, but that didn't help here, because >3m is not "recent" for m-c.  But with the vagaries of branch approvals, especially exacerbated here by the 2 firedrills since 3.6 came out, branch landings can be stuck waiting for a long time, it seems.

Is there some way to detect that we're on a branch tbpl and increase the time coverage for "recently fixed" to cover a longer period on branches?
I didn't anything about only checking recent fixes, in fact the word recent doesn't occur in my blog post at all!

About this log, tbpl doesn't support suggesting anything for leaks right now, and that is why you don't get any suggestions for this.

Markus, do you think it's worth adding a hack for when automationutils.processLeakLog occurs in a log, and show a list of all known orange leak bugs as suggestions?
Blocks: 557445
(In reply to comment #1)
> I didn't anything about only checking recent fixes, in fact the word recent
> doesn't occur in my blog post at all!

Wow, my mind is playing tricks on me; sorry.  Thankfully it's Friday!
(In reply to comment #1)
> Markus, do you think it's worth adding a hack for when
> automationutils.processLeakLog occurs in a log, and show a list of all known
> orange leak bugs as suggestions?

Sure, why not.
Assignee: nobody → ehsan
Status: NEW → ASSIGNED
Summary: Consider increasing the "fixed bug" date range for branches → Suggest known leak bugs when the log summary contains a leak report
Attached patch Patch (v1) (obsolete) — Splinter Review
Here is the patch to make this happen, but I'm not very sure how useful it is.  Currently, there are 87 bugs returned by the query (http://bit.ly/9jAtDa), which means that loading the summary almost always times out.

We need a better way to track leaks on Tinderboxen, I think.

FWIW, the current query is:

Return all bugs having sw:orange or blocking mlkTests, which have "leak" in their summary and don't have "tracking" in their summary.
Attachment #438298 - Flags: review?(mstange)
Perhaps a much better approach would be to try to automatically determine a number of frequently occurring leak bugs, and suggest whether they're applicable or not.

I'm going to take a first stab at this by trying to automatically recognize bug 538462.
Attached patch Patch (v2) (obsolete) — Splinter Review
This patch adds an automatic leak analysis script for bug 538462 to tbpl.  This way, people can make sure that they're starring the correct orange by just one click.

This same framework can be extended for other leaks as well, and I think would be a big step in making it easier to star leak bugs.
Attachment #438298 - Attachment is obsolete: true
Attachment #438401 - Flags: review?(mstange)
Attachment #438401 - Flags: feedback?(philringnalda)
Attachment #438298 - Flags: review?(mstange)
Oh, again, to test this, you can go to my tbpl instance, which is basically in "dry run mode" (doesn't star or comment).

Here is a sample of how the output looks like:

http://ehsanakhgari.org/tinderboxpushlog/leak-analysis/538462/?id=1271019494.1271021346.23205.gz&tree=Firefox
Comment on attachment 438401 [details] [diff] [review]
Patch (v2)

Mostly matches my understanding of 538462, except for a couple of things:

* assuming that the first "INFO | automation.py | Application ran for" signals the end of mochitest-chrome is fragile, assuming as it does that nobody will ever put another suite first in the same run as mochitest-chrome

* assuming $found = true unless you have $windows and $suspicousWindows and one of the $windows is not in $suspicousWindows leads to false positives like http://ehsanakhgari.org/tinderboxpushlog/leak-analysis/538462/?id=1270851612.1270853781.17925.gz&tree=Firefox3.6 where you say that the log Smokey filed this bug about is a bug 538462, because $windows is empty.

----

Sadly, I can't come up with much more to suggest about: day to day, I star Mac mochitest-other leaks, which turn out to be bug 538462 (unless I screw up and convince myself they're something else), and Windows crashtest leaks that we're currently just throwing in bug 509071 without analysis beyond "crashtest+leak=509071". Bug 505650 seems to have petered out, even on 1.9.2; I'm not sure whether people would be starring bug 529664, which Smokey filed this about, more often than once a month if they had suggestions, but hand-coding a way to suggest a leak that's just waiting for approval for the clear and certain fix seems excessive.
Attachment #438401 - Flags: feedback?(philringnalda) → feedback+
(In reply to comment #8)
> (From update of attachment 438401 [details] [diff] [review])
> Mostly matches my understanding of 538462, except for a couple of things:
> 
> * assuming that the first "INFO | automation.py | Application ran for" signals
> the end of mochitest-chrome is fragile, assuming as it does that nobody will
> ever put another suite first in the same run as mochitest-chrome

You're right.  Fixed that.

> * assuming $found = true unless you have $windows and $suspicousWindows and one
> of the $windows is not in $suspicousWindows leads to false positives like
> http://ehsanakhgari.org/tinderboxpushlog/leak-analysis/538462/?id=1270851612.1270853781.17925.gz&tree=Firefox3.6
> where you say that the log Smokey filed this bug about is a bug 538462, because
> $windows is empty.

I changed the logic to assume $found = !empty($windows), which I think would cover this case.  Thanks for catching that!

> Sadly, I can't come up with much more to suggest about: day to day, I star Mac
> mochitest-other leaks, which turn out to be bug 538462 (unless I screw up and
> convince myself they're something else), and Windows crashtest leaks that we're
> currently just throwing in bug 509071 without analysis beyond
> "crashtest+leak=509071".

I'll be happy to write more leak-analysis scripts, if we have any way to determine whether a leak orange falls into a category or not.  But I'm kind of wary of adding "crashtest+leak=nnnnnn" kind of logic as automatic suggestions.

> Bug 505650 seems to have petered out, even on 1.9.2;
> I'm not sure whether people would be starring bug 529664, which Smokey filed
> this about, more often than once a month if they had suggestions, but
> hand-coding a way to suggest a leak that's just waiting for approval for the
> clear and certain fix seems excessive.

I'm still willing to write the analysis script even for such bugs (although I know it might soon be of little value.)

Ideally, people would file bugs to get new analysis scripts, and I would go ahead and write those for them.  :-)
Attached patch Patch (v2.1) (obsolete) — Splinter Review
New patch with philor's comments addressed.
Attachment #438401 - Attachment is obsolete: true
Attachment #438409 - Flags: review?(mstange)
Attachment #438401 - Flags: review?(mstange)
Attached patch Patch (v2.2) (obsolete) — Splinter Review
I updated the patch to cover a false negative case that philor found out about.  The new patch uses the addresses of the windows in addition to their serial number, and also changes the finding algorithm to look for all $suspiciousWindows entries in $windows, and report success if they are all inside that array.
Attachment #438409 - Attachment is obsolete: true
Attachment #438515 - Flags: review?(mstange)
Attachment #438409 - Flags: review?(mstange)
Attachment #438515 - Flags: feedback?(dbaron)
Comment on attachment 438515 [details] [diff] [review]
Patch (v2.2)

Asking dbaron's feedback to confirm that this is in fact the correct algorithm.
Comment on attachment 438515 [details] [diff] [review]
Patch (v2.2)

>diff -r 131ade52f2d4 leak-analysis/538462/index.php
>--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
>+++ b/leak-analysis/538462/index.php	Mon Apr 12 13:35:53 2010 -0500

>+    $line = fgets($fp, 1024);

I think you need to add the same sleep stuff that summaries/get.php does here in order to avoid wasting CPU cycles while the summary loads. However, I'm not sure that it works like I think it does... I'll do some research.

>+  $found = 0;
>+  foreach ($suspiciousWindows as $window) {
>+    if (in_array($window, $windows)) {
>+      ++$found;
>+    }
>+  }
>+  $found = ($found == count($suspiciousWindows));

Can you name rename $found to $numSuspiciousWindowsInLog and do the check directly in the if condition further down?

>diff -r 131ade52f2d4 summaries/get.php
>--- a/summaries/get.php	Sat Apr 10 14:59:43 2010 -0500
>+++ b/summaries/get.php	Mon Apr 12 13:35:53 2010 -0500

> function processLine(&$lines, $line) {
>+  $found = false;

$foundBugSuggestion

>+    $bugs_json = file_get_contents("https://api-dev.bugzilla.mozilla.org/latest/bug?id=538462");
>+    if ($bugs_json !== false) {
>+      $bugs = parseJSON($bugs_json);
>+      if (isset($bugs->bugs)) {

What does this do? I don't understand.

I also don't really understand your use of $leaksCache (it's only saying "have seen automationutils.processLeakLog line", right?), but I assume that's the start of your general framework.

As for the algorithm, I'll trust philor and dbaron on that.
Attached patch Patch (v2.3) (obsolete) — Splinter Review
(In reply to comment #13)
> (From update of attachment 438515 [details] [diff] [review])
> >diff -r 131ade52f2d4 leak-analysis/538462/index.php
> >--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
> >+++ b/leak-analysis/538462/index.php	Mon Apr 12 13:35:53 2010 -0500
> 
> >+    $line = fgets($fp, 1024);
> 
> I think you need to add the same sleep stuff that summaries/get.php does here
> in order to avoid wasting CPU cycles while the summary loads. However, I'm not
> sure that it works like I think it does... I'll do some research.

Ah, that's what the sleep in summaries/get.php was for?  I guess a much better way to do this would be to just let the stream be blocking, so that fgets would block if it doesn't have anything to fetch.  What is the benefit of a nonblocking stream here?

> >+  $found = 0;
> >+  foreach ($suspiciousWindows as $window) {
> >+    if (in_array($window, $windows)) {
> >+      ++$found;
> >+    }
> >+  }
> >+  $found = ($found == count($suspiciousWindows));
> 
> Can you name rename $found to $numSuspiciousWindowsInLog and do the check
> directly in the if condition further down?

Sure, done.

> >diff -r 131ade52f2d4 summaries/get.php
> >--- a/summaries/get.php	Sat Apr 10 14:59:43 2010 -0500
> >+++ b/summaries/get.php	Mon Apr 12 13:35:53 2010 -0500
> 
> > function processLine(&$lines, $line) {
> >+  $found = false;
> 
> $foundBugSuggestion

Done.

> >+    $bugs_json = file_get_contents("https://api-dev.bugzilla.mozilla.org/latest/bug?id=538462");
> >+    if ($bugs_json !== false) {
> >+      $bugs = parseJSON($bugs_json);
> >+      if (isset($bugs->bugs)) {
> 
> What does this do? I don't understand.

It just retrieves the summary of the bug from bugzilla, so that we wouldn't have to hardcode it here.

> I also don't really understand your use of $leaksCache (it's only saying "have
> seen automationutils.processLeakLog line", right?), but I assume that's the
> start of your general framework.

It's just there to avoid pulling down data from network for subsequent lines containing automationutils.processLeakLog.  As to why it's an array, and not a simple boolean variable, I tried to make the code easy to modify in case for some reason we would need to process other types of log lines in the future.  The concept is similar to the $bugsCache array, except that for $bugsCache, the keys are file names, and for $leaksCache, the keys are parts of the line which we use as a trigger for the leak detection code (such as "automationutils.processLeakLog".)
Attachment #438515 - Attachment is obsolete: true
Attachment #438576 - Flags: review?(mstange)
Attachment #438576 - Flags: feedback?(dbaron)
Attachment #438515 - Flags: review?(mstange)
Attachment #438515 - Flags: feedback?(dbaron)
Attachment #438576 - Flags: review?(mstange) → review+
Comment on attachment 438576 [details] [diff] [review]
Patch (v2.3)

(In reply to comment #14)
> Ah, that's what the sleep in summaries/get.php was for?  I guess a much better
> way to do this would be to just let the stream be blocking, so that fgets would
> block if it doesn't have anything to fetch.

Then we wouldn't be able to check if the cache file started existing during that time. But I agree that this optimization might not be worth the hassle.

> > >+    $bugs_json = file_get_contents("https://api-dev.bugzilla.mozilla.org/latest/bug?id=538462");
> > >+    if ($bugs_json !== false) {
> > >+      $bugs = parseJSON($bugs_json);
> > >+      if (isset($bugs->bugs)) {
> > 
> > What does this do? I don't understand.
> 
> It just retrieves the summary of the bug from bugzilla, so that we wouldn't
> have to hardcode it here.

Ah, makes sense.
The current version might need some perf work - the last couple of logs I've tried against ehsan's install have died with "Maximum execution time of 30 seconds exceeded" errors.
Attached patch Patch (v2.4) (obsolete) — Splinter Review
This patch adds a set_time_limit call to give the script 60 seconds to execute, to prevent some of the timeout messages that philor has been seeing.
Attachment #438576 - Attachment is obsolete: true
Attachment #439148 - Flags: feedback?(dbaron)
Attachment #438576 - Flags: feedback?(dbaron)
http://ehsanakhgari.org/tinderboxpushlog/leak-analysis/538462/?id=1271391497.1271393361.21520.gz&tree=Firefox

Fatal error: Maximum execution time of 60 seconds exceeded in /home2/ea/public_html/tinderboxpushlog/leak-analysis/538462/index.php on line 33

Three times in a row. Sorry.
I don't know how to make the situation better except by increasing the allowed time value.  Increased to 120 seconds for now.
Attachment #439148 - Flags: feedback?(dbaron) → feedback-
Comment on attachment 439148 [details] [diff] [review]
Patch (v2.4)

It seems like this is relatively fragile in ways that could lead to false positives in addition to false negatives, and could be improved in a relatively straightforward way.

In particular, rather than |lookForSuspiciousWindows|, you could walk through the log from beginning to end, while storing:
 * the most recent TEST result line
 * the number of DOMWINDOWs opened since that TEST Result line
 * all open DOMWINDOWs, with their address *and the TEST result line immediately preceding when they were opened and count since that line)

Then, when you reach the end, you'd have a list of test-result-line+count pairs for each leaked DOMWINDOW; this would yield a somewhat less fragile identification, one not subject to false positives, and one that's also useful for learning useful information about any new leak bugs that pop up.
Attached patch Patch (v3)Splinter Review
I implemented dbaron's suggestion in comment 20.  I like this patch a lot better, because it can be easily extended to cover DOMWINDOW leaks in other tests, and other types of leaks as well, provided that we have the relevant log messages in the output.

For an example of the output for a log with the leak, see: <http://ehsanakhgari.org/tinderboxpushlog/leak-analysis/?id=1271019494.1271021346.23205.gz&tree=Firefox>.
For an example of the output for a log with no leaks, see: <http://ehsanakhgari.org/tinderboxpushlog/leak-analysis/?id=1274127012.1274128852.7258.gz&tree=Firefox>.
Attachment #439148 - Attachment is obsolete: true
Attachment #445916 - Flags: review?(mstange)
ping?
Comment on attachment 445916 [details] [diff] [review]
Patch (v3)

Looks good, sorry for the lag!
Attachment #445916 - Flags: review?(mstange) → review+
Blocks: 558719
Thanks!

http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/rev/116c759c5403

It would be great if you could post a note here when this is deployed, so that I can post something on dev.tree.management.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Deployed. Thanks!
One comment about the code here:  it uses showlog.cgi, but doesn't really need to, since it's always getting the full log and doesn't need the summary.  It would probably be faster with some changes starting with:

-  $page = "/showlog.cgi?log=" . $tree . "/" . $id; // . 1233853948.1233859186.27458.gz";
-  $page .= "&fulltext=1";
+  $page = "/" . $tree . "/" . $id; // . 1233853948.1233859186.27458.gz";

which would download the raw log rather than the HTML-ized log.
(In reply to comment #26)
> One comment about the code here:  it uses showlog.cgi, but doesn't really need
> to, since it's always getting the full log and doesn't need the summary.  It
> would probably be faster with some changes starting with:
> 
> -  $page = "/showlog.cgi?log=" . $tree . "/" . $id; // .
> 1233853948.1233859186.27458.gz";
> -  $page .= "&fulltext=1";
> +  $page = "/" . $tree . "/" . $id; // . 1233853948.1233859186.27458.gz";
> 
> which would download the raw log rather than the HTML-ized log.

But the problem in that case would be the gzip encoding.  I'm not sure what headers I should be sending in order for the tinderbox web server to actually deliver non-compressed data to me...
Isn't it better to get compressed data?
(In reply to comment #28)
> Isn't it better to get compressed data?

It is.  File a bug?
Product: Webtools → Tree Management
Product: Tree Management → Tree Management Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: