Closed Bug 1294025 Opened 8 years ago Closed 7 years ago

Intermittent browser/base/content/test/plugins/browser_CTP_crashreporting.js | Uncaught exception - Timed out waiting for plugin binding to be in success state - timed out after 50 tries.

Categories

(Core Graveyard :: Plug-ins, defect, P3)


Tracking

(firefox-esr52 unaffected, firefox54 wontfix, firefox55 fixed, firefox56 fixed)

RESOLVED FIXED
mozilla56

People

(Reporter: intermittent-bug-filer, Assigned: gsvelto)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:timing])

Attachments

(2 files)

Priority: -- → P4
Priority: P4 → P3
This bug has a few spikes and many days with no failures:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1294025&startday=2017-05-01&endday=2017-05-30&tree=all

Recently we have had many failures: they started on May 23rd and jumped up sharply on May 25th, mostly on linux32 debug and also on linux64 debug browser-chrome e10s.

here is a log for linux32-debug:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=102914117

and the related screenshot:
https://public-artifacts.taskcluster.net/WB2b4NwwTnO5SyKS78HqDQ/0/public/test_info//mozilla-test-fail-screenshot_UCEGME.png

and what I see in the above log is:
[task 2017-05-29T23:14:10.019309Z] 23:14:10     INFO - Entering test bound 
[task 2017-05-29T23:14:10.021042Z] 23:14:10     INFO - Buffered messages logged at 23:13:55
[task 2017-05-29T23:14:10.025967Z] 23:14:10     INFO - TEST-PASS | browser/base/content/test/plugins/browser_CTP_crashreporting.js | Plugin should not be activated - 
[task 2017-05-29T23:14:10.028058Z] 23:14:10     INFO - Buffered messages finished
[task 2017-05-29T23:14:10.031400Z] 23:14:10     INFO - TEST-UNEXPECTED-FAIL | browser/base/content/test/plugins/browser_CTP_crashreporting.js | Uncaught exception - Timed out waiting for plugin binding to be in success state - timed out after 50 tries.
[task 2017-05-29T23:14:10.033290Z] 23:14:10     INFO - Leaving test bound 
[task 2017-05-29T23:14:10.036124Z] 23:14:10     INFO - Entering test bound 


the failure seems to be on this line:
https://dxr.mozilla.org/mozilla-central/source/browser/base/content/test/plugins/browser_CTP_crashreporting.js?q=path%3Abrowser_CTP_crashreporting.js&redirect_type=single#139

I am glad we time out and handle this ourselves faster than waiting for the harness to time out.
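
The "timed out after 50 tries" wording matches a waitForCondition-style polling helper. Here is a rough sketch of that shape (a reconstruction, not the exact harness code; the default interval and try count are assumed):

// Poll `condition` every `interval` ms, giving up after `maxTries` attempts.
function waitForCondition(condition, msg, interval = 100, maxTries = 50) {
  return new Promise((resolve, reject) => {
    let tries = 0;
    const id = setInterval(async () => {
      if (++tries > maxTries) {
        clearInterval(id);
        reject(new Error(`${msg} - timed out after ${maxTries} tries.`));
        return;
      }
      try {
        if (await condition()) {
          clearInterval(id);
          resolve();
        }
      } catch (e) {
        clearInterval(id);
        reject(e);
      }
    }, interval);
  });
}

// Usage, as in the failing check (the condition name is invented):
// await waitForCondition(() => pluginBindingIsInSuccessState(),
//   "Timed out waiting for plugin binding to be in success state");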

:bsmedberg, I see you as the triage owner; can you get someone from the plug-ins team (or someone who is knowledgeable about this specific test case) to look at this failure and make it more reliable in the next 2 weeks?
Flags: needinfo?(benjamin)
Whiteboard: [stockwell needswork]
Joel, do you have a regression range from 22-25 May that I could peruse? My old tool for generating regression ranges for nightlies is broken. I suspect the recent crash reporting work gsvelto was doing, but I'd be more comfortable seeing that confirmed before reassigning.
Flags: needinfo?(benjamin) → needinfo?(jmaher)
I don't have a tighter range; I was just trying to help shed light on what was going on. My primary goal is to triage and get this to the test owners to fix the issues. Right now I don't have the bandwidth to do a lot of extra things.
Flags: needinfo?(jmaher)
I suspect bug 1359326 could have made this worse; there might be a race between the crash processing and the crash submission display and it might have been widened by that bug. I'll have a look today.
(In reply to Gabriele Svelto [:gsvelto] from comment #11)
> I suspect bug 1359326 could have made this worse; there might be a race
> between the crash processing and the crash submission display and it might
> have been widened by that bug. I'll have a look today.

Did you get a chance to look at this? There are lots of failures now...would like to see this test fixed or disabled soon.
Flags: needinfo?(gsvelto)
(In reply to Geoff Brown [:gbrown] (pto June 12 - 16) from comment #15)
> Did you get a chance to look at this? There are lots of failures now...would
> like to see this test fixed or disabled soon.

Yes, but I couldn't find the root cause and I don't seem to be able to reproduce it locally. I think I know a way to mitigate it, though. I'll send a try run with the necessary change and re-trigger this test a bunch of times to see if it gets better. If it doesn't, we can disable it and then I'll re-enable it in bug 1323979, where I'm changing the way events about crashes are delivered throughout the system. Not clearing the NI for now.
I've managed to reproduce this locally using rr's chaos mode. I should be able to fix it on Monday, if not I'll disable it in the meantime. Taking the bug.
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Flags: needinfo?(gsvelto)
Sorry it took me a while to figure this one out, but it was tricky. Originally I suspected bug 1359326 because it rewrote the crash reporting flow, so I thought it might have broken something. As it turns out, it didn't. Bug 1335536, on the other hand, broke a promise chain that kept crash reporting reliable. The reason this only showed up after bug 1359326 landed is that previously that part of the crash reporting chain was mostly C++ code running on a background thread, so it didn't touch the event loop. The new code is JS, and while it's asynchronous and runs work in the background, it uses promises to synchronize everything, so it must have upset the already-broken promise chain by "injecting" steps into the middle of it, revealing the race.
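
To illustrate the kind of breakage, here is a minimal sketch with invented names (not the actual mozilla-central code): a .then() handler that starts async work but drops the resulting promise lets the outer chain resolve before the work is done.

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Stand-in for the async "record a crash submission" step.
let recorded = false;
const submitToServer = () => delay(50).then(() => { recorded = true; });

// Broken chain: the inner promise is dropped, so the returned promise
// resolves before the submission is actually recorded.
const broken = () => Promise.resolve().then(() => { submitToServer(); });

// Restored chain: returning the inner promise makes callers wait for it.
const fixed = () => Promise.resolve().then(() => submitToServer());

broken().then(() => console.log("broken:", recorded)); // "broken: false"
setTimeout(() => {
  recorded = false;
  fixed().then(() => console.log("fixed:", recorded)); // "fixed: true"
}, 100);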

I've got a patch that restores the promise chain and it's currently running on try:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ea482c07aa3416e68b1bd3b1d148145013b3875a&group_state=expanded

I've re-triggered the affected tests ~10 times and they don't seem to fail anymore. Before, they failed very often, especially on debug builds.
Comment on attachment 8876677 [details]
Bug 1294025 - Fix the broken promise chain when recording a crash submission attempt;

https://reviewboard.mozilla.org/r/148026/#review152494

sneaky!

I bet making some of this async/await would make it more readable.
Attachment #8876677 - Flags: review?(benjamin) → review+
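
As an aside, the restored chain from the sketch above could be written with async/await per the review suggestion (names are still hypothetical, reusing the stub from the earlier sketch); awaiting makes the dependency explicit and hard to drop by accident:

const fixed = () => Promise.resolve().then(() => submitToServer());
// is equivalent to:
const fixedAwait = async () => {
  await submitToServer();
};
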
Thanks for the review! I'm about to land but I'll leave the bug open for now, let's close it when we're sure that the test is not failing anymore.
Keywords: leave-open
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/aff2e55a7974
Fix the broken promise chain when recording a crash submission attempt; r=bsmedberg
The change here doesn't seem to have worked - this test is currently perma-failing.
It sounds like the changes made this worse. Should we:
1) back out the changes, or
2) disable the test for Linux debug?
Flags: needinfo?(gsvelto)
I'm fairly stumped but I'm running more tests right now. If they're inconclusive I'll disable the test today and re-enable it once I've refactored all the associated code in bug 1323979.
Flags: needinfo?(gsvelto)
The issue that's causing the failures on Linux debug builds is different from what I was seeing before, and it looks like it's a race in the test code itself. Unfortunately, verifying it takes hours because I can't reproduce it easily on my machine, so I have to wait for the usual try-run turnaround time. I'll disable the test in the meantime and file a bug to re-enable it once I've found a solution.
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bacba9805202
Disable browser_CTP_crashreporting.js on Linux debug builds; r=me
Whiteboard: [stockwell needswork] → [stockwell needswork][test disabled]
OK, I've figured out why the test was perma-failing on Linux debug builds: it was timing out. With the promise chain restored, it often takes a while longer for the fake crash submission server to respond and trigger the rest of the test chain. With the promise chain broken, the test passed some of the time because the code just raced ahead to the desired condition without waiting for the actual events to happen. In a sense the test wasn't working, but it didn't explicitly fail either.
Here is a try run with the test re-enabled and the timeout of one check extended to prevent it from failing. I've re-triggered the affected chunks ~15 times to ensure they're stable:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=294adafebc09b08ccd908f00a4db4f35653c43c7&group_state=expanded
This re-enables the test and lengthens the time we wait for the crash-submitted status to be reported. I couldn't trigger the issue anymore over ~20 runs, so I'm pretty sure it's settled for good.
Attachment #8878469 - Flags: review?(benjamin)
Attachment #8878469 - Flags: review?(benjamin) → review+
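
The shape of that change, in terms of the waitForCondition sketch from earlier (hypothetical values and an invented condition name, not the actual patch):

async function waitForSuccessState() {
  // Give the success-status check a larger polling budget than the
  // default 50 tries x 100 ms assumed in the sketch above.
  await waitForCondition(
    () => getCrashSubmitStatus() == "success", // invented condition
    "Timed out waiting for plugin binding to be in success state",
    100,  // interval (ms)
    200,  // maxTries: 4x the default budget
  );
}
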
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d69a2c18f841
Ensure the test waits long enough for the success condition to be fulfilled; r=bsmedberg
Whiteboard: [stockwell needswork][test disabled] → [stockwell fixed:timing]
These failures seem to have gone away on central now. Time to call this fixed?
Flags: needinfo?(gsvelto)
Flags: needinfo?(gbrown)
I think so! Thanks much :gsvelto.

(Failures continue on beta...could the patches be uplifted?)
Flags: needinfo?(gbrown)
Agreed, feel free to close.
Flags: needinfo?(gsvelto)
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Product: Core → Core Graveyard