Closed Bug 1387827 Opened 7 years ago Closed 6 years ago

Permaorange devtools timed out after 1000 seconds of no output on Linux x64 JSDCov

Categories

(DevTools :: General, defect, P5)

Tracking

(firefox58 disabled)

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Assigned: CosminS)

Details

(Keywords: intermittent-failure, test-disabled, Whiteboard: [stockwell disabled])

Attachments

(2 files)

This failure is quite frequent and we should look into it. This week it is trending at 30+ failures. Possibly one test needs to be disabled, or maybe we need more chunks or a longer timeout.

:gmierz, can you look into this?
Flags: needinfo?(gmierz2)
Whiteboard: [stockwell needswork]
I'm on it. Also, I'm leaving the ni? open to save this bug.
I've managed to get rid of most of the failures by increasing the number of chunks up to 16: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0b9b1e7fa2e07343af1ff2fab697ad4f5d8bf537

Right now, I'm looking into whether just skipping that one test, 'browser_dbg_stack-03.js', will get rid of the last error. I'm still not sure why it's perma-failing, but I think it's because of the use of the debugger in that test: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c25bc14ca0feea6e8887b0d4be74bedab9920eaf
The last push that I did didn't work [1]. But looking at the logs of each of the failing tests, 'this.content' is null in 'test-actor.js': https://dxr.mozilla.org/mozilla-central/source/devtools/client/shared/test/test-actor.js#681,702

So, there is something consistent to go on. I've also noticed that there is an incredible number of connection closed errors (not just one) [2]. There is also another error at the start [3].

It's also possible that either addons or marionette is broken as I found a warning in the log here: https://treeherder.mozilla.org/logviewer.html#?job_id=127425665&repo=try&lineNumber=2004

[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c25bc14ca0feea6e8887b0d4be74bedab9920eaf
[2]: https://treeherder.mozilla.org/logviewer.html#?job_id=127425665&repo=try&lineNumber=4067
[3]: https://treeherder.mozilla.org/logviewer.html#?job_id=127425665&repo=try&lineNumber=2337
I reviewed several logs from yesterday and noticed several ended in something like:

[task 2017-09-05T23:05:41.633056Z] 23:05:41     INFO - GECKO(1942) | console.log: [DISPATCH] {type:..,highlighted:..,nodeFront:.., }
[task 2017-09-05T23:05:42.058503Z] 23:05:42     INFO - GECKO(1942) | console.error:
[task 2017-09-05T23:05:42.059913Z] 23:05:42     INFO - GECKO(1942) |   Message: TypeError: content is null
[task 2017-09-05T23:05:42.060046Z] 23:05:42     INFO - GECKO(1942) |   Stack:
[task 2017-09-05T23:05:42.060112Z] 23:05:42     INFO - GECKO(1942) |     @http://example.com/browser/devtools/client/shared/test/test-actor.js:683:5
[task 2017-09-05T23:05:42.061357Z] 23:05:42     INFO - GECKO(1942) | @http://example.com/browser/devtools/client/shared/test/test-actor.js:683:5
[task 2017-09-05T23:22:22.089404Z] 23:22:22     INFO - Automation Error: mozprocess timed out after 1000 seconds running ...

even though they are running various tests:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=128710350&lineNumber=5255
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=128712747&lineNumber=5254
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=128712824&lineNumber=5006
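As a rough check that the excerpt above really is a "1000 seconds of no output" hang, the gap between the last GECKO line and the timeout line can be computed from the `[task ...]` timestamps. This is only a sketch: the helper names (`parse_ts`, `silence_before_timeout`) are made up for illustration, and it assumes every relevant line starts with a `[task <ISO timestamp>Z]` prefix as in the excerpt.

```python
import re
from datetime import datetime

# Matches the "[task 2017-09-05T23:05:42.061357Z]" prefix seen in the excerpt.
TASK_PREFIX = re.compile(
    r"^\[task (?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z\]"
)


def parse_ts(line):
    """Return the task timestamp of a log line, or None if it has no prefix."""
    match = TASK_PREFIX.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")


def silence_before_timeout(lines):
    """Seconds between the mozprocess timeout line and the line before it.

    Returns None if the log has no timeout line or nothing precedes it.
    """
    previous = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        if "mozprocess timed out" in line and previous is not None:
            return (ts - previous).total_seconds()
        previous = ts
    return None


# The last two lines of the excerpt quoted above.
EXCERPT = [
    "[task 2017-09-05T23:05:42.061357Z] 23:05:42     INFO - GECKO(1942) | "
    "@http://example.com/browser/devtools/client/shared/test/test-actor.js:683:5",
    "[task 2017-09-05T23:22:22.089404Z] 23:22:22     INFO - Automation Error: "
    "mozprocess timed out after 1000 seconds running ...",
]

gap = silence_before_timeout(EXCERPT)
print(round(gap))  # 1000
```

For this excerpt the gap comes out at about 1000 seconds, so the 'content is null' stack is the last thing the browser printed before the harness gave up.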

:gmierz - Is that something you have noticed before?
Yes, nearly all the failures in devtools are for that reason. In comment 6, I've detailed what I've found so far in each of the failures. Most of them start with the error in [3], then produce the multiple connection closed errors shown in [2], and finally fail with the 'content is null' error. I haven't had much time to look further, although I did try increasing the number of chunks and increasing the timeouts, neither of which helped.

Content is definitely null but I haven't found why it becomes null. The last thing I was looking at was trying to find something that sets content to null and see if it's being run, but I haven't had a chance to test this yet.

Would you have any thoughts about why this failure is happening?
Flags: needinfo?(gmierz2)
(In reply to Greg Mierzwinski [:gmierz] from comment #13)
> The last
> thing I was looking at was trying to find something that sets content to
> null and see if it's being run, but I haven't had a chance to test this yet.

That sounds like a good approach.
 
> Would you have any thoughts about why this failure is happening?

Sorry, no.
I spent some time looking into this. I checked whether any code that sets content (or a content variable, anyway) to null is being run, and found that none of it is. So either that is really the case, or I somehow missed a spot. 'this.content' is (from what I understand) a content window, and it looks to me like it's a devtools window, but I haven't tested that yet.

In a previous comment I mentioned the connection closed errors. I've found that they mean nothing for this failure, since we hit the same 'content is null' errors whether or not they are present. For some reason, though, in some cases the test fails and the run continues on to fail in another test, while in others mozprocess times out and there are no errors other than the first one. In my opinion, this means that either there are two different errors occurring and one isn't being caught, or it's the same error and a different "manifestation" of it isn't being caught. This doesn't explain the error itself, but it does help with categorizing the error(s) a little.
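The two manifestations described above could be told apart mechanically when going through many logs. The following is a minimal sketch, not an existing tool: the function name and the category labels are hypothetical, and it assumes the marker strings appear in logs exactly as they do in this bug ('TypeError: content is null', 'TEST-START', 'mozprocess timed out').

```python
def classify_log(lines):
    """Label a job log by what happens after the first 'content is null' error.

    Labels (all hypothetical names):
      "unrelated"           - the error never appears
      "hang"                - error, then the mozprocess timeout, no test in between
      "continued-then-hang" - error, more tests start, then the timeout
      "continued"           - error, more tests start, no timeout
      "null-only"           - error is the last relevant event in the log
    """
    saw_null = False
    later_tests = 0
    for line in lines:
        if "TypeError: content is null" in line:
            saw_null = True
        elif saw_null and "TEST-START" in line:
            later_tests += 1
        elif saw_null and "mozprocess timed out" in line:
            return "hang" if later_tests == 0 else "continued-then-hang"
    if not saw_null:
        return "unrelated"
    return "continued" if later_tests else "null-only"


# Tiny made-up logs that only contain the marker strings.
hang_log = [
    "TEST-START | some_test.js",
    "Message: TypeError: content is null",
    "Automation Error: mozprocess timed out after 1000 seconds running ...",
]
continued_log = [
    "Message: TypeError: content is null",
    "TEST-START | another_test.js",
]

print(classify_log(hang_log))       # hang
print(classify_log(continued_log))  # continued
```

Counting how many logs land in each bucket would show whether the timeout case is a separate problem or just the same error going uncaught.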

As I look through the logs though, I see a few errors that are either purposeful or are not being caught, so I plan on looking into that now.

Joel, would you have any thoughts about this or another idea of what I could try?
Flags: needinfo?(jmaher)
We fail on the same devtools chunks, but over the last day the failures have been greatly reduced and now only 1 chunk is failing:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=jsdcov%20devtools&selectedJob=130324062

it seems to be failing right after:
TEST-START | devtools/client/debugger/test/mochitest/browser_dbg_stack-03.js

so I think possibly we can just skip that test?
Flags: needinfo?(jmaher)
Ah, that is great! :)

Do you know, off hand, what patch fixed this? Otherwise, I'll look around.

Yes, let's skip it. That test you mention has been failing for a long time now, and I have a feeling that the js debugger may be interfering with it because it uses the debugger also. I'll have a patch up soon to skip this one.
Let's take the path of least resistance.
There seem to be two new failures on jsdcov now, both with the following error: https://treeherder.mozilla.org/logviewer.html#?job_id=131593077&repo=mozilla-central&lineNumber=1979

Not sure why it's happening but I'm going to open a new bug to disable 'browser_dbg_stack-03.js' since it's being a problem regardless of this error.
Thanks :gmierz!
Disabled the one test in bug 1400683; I assume this failure will be greatly reduced in frequency or go away completely.
Unfortunately we still see a high failure rate; everything looks to be related to bug 1401215.
Even though it's only two chunks of permaorange, it's still two chunks of permaorange, not intermittent.
Summary: Intermittent devtools timed out after 1000 seconds of no output on Linux x64 JSDCov → Permaorange devtools timed out after 1000 seconds of no output on Linux x64 JSDCov
After bug 1393788 is completed, we will dive into this bug and see what remains.
Depends on: 1393788
Attachment #8918902 - Flags: review?(gbrown) → review+
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f7bf0e655457
Disable 2 devtools tests on coverage builds for frequent timeouts. r=gbrown, a=test-only
Whiteboard: [stockwell disable-recommended] → [stockwell disabled]
Status: RESOLVED → REOPENED
Component: General → Developer Tools
Keywords: test-disabled
Product: Release Engineering → Firefox
Resolution: FIXED → ---
Tests have been disabled here, so the bug shouldn't have been marked as fixed.
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Product: Firefox → DevTools
The linux64-jsdcov build has been disabled, and no longer runs in taskcluster, see bug 1496791.
See Also: → 1496791
Assignee: nobody → csabou

Stumbled upon these two lines (https://searchfox.org/mozilla-central/source/devtools/client/framework/browser-toolbox/test/browser.ini#27,32) as I was working on another bug, looked them up with ./mach test-info and found that both have:
windows10-64/ccov-opt-e10s: 0 failures ( 0 skipped) in 25 runs
linux1804-64/ccov-opt-e10s: 0 failures ( 0 skipped) in 12 runs
so that's the reason for this patch.
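The same check could be scripted when several skipped tests need reviewing. This is a sketch under assumptions: it only parses summary lines shaped like the two quoted above, the function names are invented here, and the `min_runs` threshold is an arbitrary illustration, not a project rule.

```python
import re

# Matches lines like "linux1804-64/ccov-opt-e10s: 0 failures ( 0 skipped) in 12 runs".
SUMMARY = re.compile(
    r"^(?P<platform>\S+):\s+(?P<failures>\d+) failures "
    r"\(\s*(?P<skipped>\d+) skipped\) in (?P<runs>\d+) runs$"
)


def parse_summary(line):
    """Parse one summary line into a dict, or return None if it does not match."""
    match = SUMMARY.match(line.strip())
    if match is None:
        return None
    return {
        "platform": match.group("platform"),
        "failures": int(match.group("failures")),
        "skipped": int(match.group("skipped")),
        "runs": int(match.group("runs")),
    }


def looks_green(lines, min_runs=10):
    """True if every parsed platform has no failures, no skips and enough runs.

    min_runs is a hypothetical threshold chosen for this example.
    """
    rows = [row for row in map(parse_summary, lines) if row is not None]
    return bool(rows) and all(
        row["failures"] == 0 and row["skipped"] == 0 and row["runs"] >= min_runs
        for row in rows
    )


# The two lines quoted in the comment above.
REPORT = [
    "windows10-64/ccov-opt-e10s: 0 failures ( 0 skipped) in 25 runs",
    "linux1804-64/ccov-opt-e10s: 0 failures ( 0 skipped) in 12 runs",
]

print(looks_green(REPORT))  # True
```

A test that passes this kind of check is a candidate for having its skip line removed, which still needs a human look at the runs themselves.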

Assignee: csabou → nobody
Assignee: nobody → csabou
Pushed by shindli@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/64b1ca50cf4f
Delete skip line for browser_browser_toolbox.js and browser_browser_toolbox_fission_inspector.js as they are green on ccov. r=jmaher