Closed Bug 893404 Opened 11 years ago Closed 11 years ago

Intermittent browser_thumbnails_background.js | Test timed out

Categories

(Toolkit :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: cbook, Assigned: markh)

References

()

Details

(Keywords: intermittent-failure, Whiteboard: [test disabled on linux])

Attachments

(1 file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=25265047&tree=Mozilla-Inbound

Rev5 MacOSX Mountain Lion 10.8 mozilla-inbound opt test mochitest-browser-chrome on 2013-07-13 10:44:08 PDT for push b527f451d64d

slave: talos-mtnlion-r5-014

TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/components/thumbnails/test/browser_thumbnails_background.js | Test timed out
Pretty sure this is the orange Philor backed out bug 891169 and bug 870104 for.
This is kinda subtle and unrelated to the 2 bugs that were backed out.  The problem is that if we "early exit" due to PB mode, we end up with 2 "destroy" timers in place, one of which is never cancelled.  When it fires the browser is destroyed - but that only causes a problem depending on the timing of the destruction - most of the time we just re-create a new one and all is well (and/or the bad timer that fires does so after our tests have completed)

This problem happens because we destroy the timer after we call capture.start - specifically, what *should* happen is:

* previous capture completes - _done() method creates destroy timer.
* start new capture
* cancel destroy timer.
[wait for capture to finish]
* this capture completed - _done() creates new destroy timer.

What happens here is:

* previous capture completes - _done() method creates destroy timer.
* start new capture - but PB mode is active - so call _done() immediately
* _done method creates destroy timer - so now we have 2!
* cancel destroy timer - but we only cancel the second - first lives on.

Solution is simply to cancel the destroy timer *before* we call start.  This is more robust anyway - we really should cancel the destroy timer *before* we check the browser is OK to use, else a nested event-loop would cause the same basic problem (which is what I thought this was initially)

Anyway... patch is simple.
Assignee: nobody → mhammond
Attachment #775510 - Flags: review?(adw)
Comment on attachment 775510 [details] [diff] [review]
cancel the destroy timer before we call "start"

Review of attachment 775510 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for tracking this down.

I'd like to fix it a different way, because there's a problem with my patch in bug 891169: it makes it so that captures sometimes finish synchronously (when there are private windows open) and sometime asynchronously (all other times).  They should always finish one way or the other, so I'll modify my patch to call _done asyncly in Capture.start.

We might take your patch anyway, but I'd rather not since it would mask bugs like the one my patch introduced.  (And I'm so not worried about crazy things like nested event loops.)
Attachment #775510 - Flags: review?(adw)
Blocks: 891169
Just as an additional way to look at this:

(In reply to Drew Willcoxon :adw from comment #3)
> We might take your patch anyway, but I'd rather not since it would mask bugs
> like the one my patch introduced.

TBH, I don't think your patch *did* introduce a bug.  The only actual problem caused by calling _done synchly is to expose the fact that we are checking the validity of the browser element while a timer still exists to kill it :)  At face value, *that* sounds wrong rather than the fact a callback may or may not be called syncly.
Comment on attachment 775510 [details] [diff] [review]
cancel the destroy timer before we call "start"

IRC yada yada.
Attachment #775510 - Flags: review+
Any ideas why this regressed around the 31st July/1st Aug? :-)
Flags: needinfo?(mhammond)
The new failures looks quite a bit different than the old.  Highlights of the new logs include:

04:30:48     INFO -  ###!!! [Parent][AsyncChannel] Error: Channel error: cannot send/recv
04:31:18  WARNING -  TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/components/thumbnails/test/browser_thumbnails_background.js | Test timed out
...
04:35:48  WARNING -  PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::layers::PLayerTransactionChild::SendPLayerConstructor(mozilla::layers::PLayerChild*)]

(ie, it looks like the child process terminated unexpectedly during the test, which would then account for the timeout - then the parent crashed?)

The first "new" occurrence of this was on inbound, rev 6b8df204adb4.  Directly before this rev on inbound we have lots of compositor related bugs - bug 899435, bug 874721, bug 874721, etc - all by Nick - and the compositor and e10s seem to be enemies at the moment :)

So my money is on some regression in the cross-process compositor stuff - so I'm CCing nrc and making this block the crash-e10s bug.  Nick - any insights?
Blocks: 899758
Flags: needinfo?(mhammond)
Summary: Intermittent TEST-UNEXPECTED-FAIL | chrome://mochitests/content/browser/toolkit/components/thumbnails/test/browser_thumbnails_background.js | Test timed out → Intermittent browser_thumbnails_background.js | Test timed out
Component: General → Graphics
Product: Toolkit → Core
Component: Graphics → Graphics: Layers
All the 8/01 failures are graphics-related crashes, but none of the failures since then are.  So on second look (or more like third or fourth by now), I think it's safe to say that crashes aren't the reason for these failures anymore.  I'm therefore moving this back to Toolkit.  Sorry I didn't look more closely at the recent failures before we met yesterday, Gavin!

All the failures after and including 9/19 are on ASan, which really doesn't like these tests for some reason.

That leaves only three failures, comments 24-26.  The logs for comments 24-25 aren't available anymore. :-(  Comment 26 has an additional failure that caused the later timeout: file.remove() threw an exception because the file didn't exist, preventing the test from completing.
No longer blocks: 899758
Component: Graphics: Layers → General
OS: Mac OS X → All
Product: Core → Toolkit
Hardware: x86 → All
Comment 35 is the same failure as comment 26.  file.remove() throws an exception because the file doesn't exist, preventing the test from finishing.  It's the same file trying to be removed.  I don't see any obvious reason why it might not exist.
Blocks: 931889
Test disabled on Linux in bug 931889.
Whiteboard: [test disabled on linux]
It looks like all the failures starting from comment 39 are unexplained Linux ASAN timeouts.  Maybe they timed out due to the aformentioned problem with removing a nonexistent file.  Regardless, this test was split up in bug 931889, and there are no reported failures in any of the new, smaller files.  I don't think there's any point in keeping this open anymore.  Let's file bugs against the new files as necessary.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: