Closed Bug 1480953 Opened 2 years ago Closed 1 year ago

Intermittent PID 7816 | TEST-UNEXPECTED-FAIL | damp | webconsole/custom.js: Test timed out

Categories

(Testing :: Talos, defect, P3)

Version 3
defect

Tracking

(firefox64 fixed, firefox65 fixed)

RESOLVED FIXED
mozilla65
Tracking Status
firefox64 --- fixed
firefox65 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jdescottes)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disable-recommended])

Attachments

(2 files, 1 obsolete file)

Filed by: ccoroiu [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=192013474&repo=autoland

https://queue.taskcluster.net/v1/task/LLLVEdboQwmLTcg73iGQSw/runs/0/artifacts/public/logs/live_backing.log

04:28:27     INFO -  PID 7816 | Loading test 'webconsole/custom.js'
04:28:27     INFO -  PID 7816 | Executing test 'webconsole/custom.js'
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 | ###!!! [Parent][MessageChannel] Error: (msgtype=0x170089,name=PBrowser::Msg_UpdateNativeWindowHandle) Closed channel: cannot send/recv
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 | ###!!! [Parent][MessageChannel] Error: (msgtype=0x17007C,name=PBrowser::Msg_Destroy) Closed channel: cannot send/recv
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 | [Child 5452, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 | ###!!! [Child][MessageChannel] Error: (msgtype=0x170001,name=PBrowser::Msg_AsyncMessage) Closed channel: cannot send/recv
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 | [Child 5452, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 346
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 | ###!!! [Child][MessageChannel] Error: (msgtype=0x4A0006,name=PHttpChannel::Msg_Cancel) Closed channel: cannot send/recv
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 | ###!!! [Child][MessageChannel] Error: (msgtype=0x170001,name=PBrowser::Msg_AsyncMessage) Closed channel: cannot send/recv
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 | ###!!! [Child][MessageChannel] Error: (msgtype=0x17002C,name=PBrowser::Msg_SetInputContext) Closed channel: cannot send/recv
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 | ###!!! [Child][MessageChannel] Error: (msgtype=0x2E0118,name=PContent::Msg_DetachBrowsingContext) Closed channel: cannot send/recv
04:28:27     INFO -  PID 7816 |
04:28:27     INFO -  PID 7816 | Assertion failure
04:28:27     INFO -  PID 7816 | assert@resource:///modules/AsyncTabSwitcher.jsm:454:36
04:28:27     INFO -  PID 7816 | finish@resource:///modules/AsyncTabSwitcher.jsm:306:5
04:28:27     INFO -  PID 7816 | postActions@resource:///modules/AsyncTabSwitcher.jsm:638:7
04:28:27     INFO -  PID 7816 | onUnloadTimeout@resource:///modules/AsyncTabSwitcher.jsm:652:5
04:28:27     INFO -  PID 7816 | queueUnload/this.unloadTimer<@resource:///modules/AsyncTabSwitcher.jsm:1007:44
04:33:27     INFO -  PID 7816 | TEST-UNEXPECTED-FAIL | damp | webconsole/custom.js: Test timed out
04:33:27     INFO -  Terminating psutil.Process(pid=7816L, name='firefox.exe', started='04:26:29')
04:33:27     INFO -  TEST-UNEXPECTED-ERROR | damp | unexpected error
04:33:27    ERROR -  Traceback (most recent call last):
04:33:27     INFO -    File "C:\Users\task_1533356021\build\tests\talos\talos\run_tests.py", line 297, in run_tests
04:33:27     INFO -      talos_results.add(mytest.runTest(browser_config, test))
04:33:27     INFO -    File "C:\Users\task_1533356021\build\tests\talos\talos\ttest.py", line 63, in runTest
04:33:27     INFO -      return self._runTest(browser_config, test_config, setup)
04:33:27     INFO -    File "C:\Users\task_1533356021\build\tests\talos\talos\ttest.py", line 207, in _runTest
04:33:27     INFO -      debugger_args=browser_config['debugger_args']
04:33:27     INFO -    File "C:\Users\task_1533356021\build\tests\talos\talos\talos_process.py", line 156, in run_browser
04:33:27     INFO -      raise TalosError("unexpected error")
04:33:27     INFO -  TalosError: unexpected error
04:33:27     INFO -  TEST-INFO took 876498ms
04:33:27     INFO -  SUITE-END | took 876s
04:33:27    ERROR - Return code: 2
04:33:27  WARNING - setting return code to 2
04:33:27    ERROR - # TBPL FAILURE #
This got frequent when bug 1485660 started. Alexandre, can you take a look at this, please?
Flags: needinfo?(poirot.alex)
Joel, could take a look at this? Alexandre was ni'd 6 days ago but no response yet. This is now on disable recommended bugs but from I see it's a Talos test and don't know if we can do smthg about it. Can you advise what would be the course of action for this? Thank you.
Flags: needinfo?(jmaher)
Thanks :CosminS, we were looking into this today and :ochameau is on holiday right now.  I will leave the needinfo for me so I make sure to reply in the next day or so
Disable DAMP temporarily on windows 10 pgo.
As discussed on Slack, I am currently investigating a potential fix that would avoid skipping this for the moment. Will decide what to do tomorrow when I get more data.
This seems to silence the intermittent, even though it doesn't explain
why we started spiking after moving TabClient to a Front.
Assignee: nobody → jdescottes
Status: NEW → ASSIGNED
Priority: P5 → P3
Summary of my investigations around this topic:
- 75% of the failures occur on PGO builds. 
- If we disable the webconsole custom test, then the failure moves to the next test (inspector custom).
- Disabling the test running *before* webconsole custom (netmonitor complicated) makes the issue disappear.
- The first error we see *only* on failed tests is 
    ###!!! [Parent][MessageChannel] Error: (msgtype=0x190095,name=PBrowser::Msg_UpdateNativeWindowHandle) 
    Closed channel: cannot send/recv   
- Before the spike (and consequently before the TabClient -> front migration) the failures were less frequent, but the logs are really similar, with the same "Closed channel: cannot send/recv ". 

It seems we are hitting the same intermittent more often. If it's very flaky/racy, we might just be unlucky and made the test more intermittent, without actually regressing anything.

A few try pushes to illustrate
- central: 3/36 failures https://treeherder.mozilla.org/#/jobs?repo=try&revision=5ff0087728d46ac119fcff94691d037c343542a1
- central + try/catch in destroy: 2/35 failures https://treeherder.mozilla.org/#/jobs?repo=try&revision=fed1f8b02d51d9571e8769cab7401afc51280363
- central + skip netmonitor complicated test: 0/41 failures https://treeherder.mozilla.org/#/jobs?repo=try&revision=91e8656722b1b8e22934868bd048f5fc5b444dc9
- central + 1second pause after netmonitor complicated test: 0/47 failures https://treeherder.mozilla.org/#/jobs?repo=try&revision=19865baf260fdd0b3d63e014bcf952b16edbded3

I am proposing to land the "pause" fix, as it seems to completely silence the issue without impacting the tests too much. If this is not enough, we can still disable on PGO as proposed in Joel's patch
Flags: needinfo?(jmaher)
Keywords: leave-open
Pushed by jdescottes@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1504056ce8e6
Pause for 1 second after DAMP netmonitor complicated test;r=nchevobbe
Attachment #9019796 - Attachment is obsolete: true
It looks like this is related to tab animations and especially the ones happening on tab closing.
These assertions are being emitted by tab animations...

If I force disabling animation on opening and closing, the intermittents go away:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=02d197b8a0d594fbffa438dd6cb85fd9593d6c34
Flags: needinfo?(poirot.alex)
We used to have animations during tab opening/closing, but this was introducing
intermittents as we were not correctly waiting for animation's end.

MozReview-Commit-ID: 2mscsA8Uosd
Note that we can get rid of the current intermittent by disabling the close animation only:
  https://hg.mozilla.org/try/rev/b6b04aeee5ae
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=61f72e1f341f566eaa89be4848fd157d33f2fe30
And we may possibly also prevent the assertion by properly waiting for both animations, but that's more code/work...
(In reply to Alexandre Poirot [:ochameau] from comment #31)
> Note that we can get rid of the current intermittent by disabling the close
> animation only:
>   https://hg.mozilla.org/try/rev/b6b04aeee5ae
>  
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=61f72e1f341f566eaa89be4848fd157d33f2fe30

This try push seems to be on top of the existing 1s pause we have to avoid the intermittent. 
So I am not sure the try results are relevant? 
Note that 10 retries is low: I had green try pushes with 20 retriggers that still had the problem.
 
> And we may possibly also prevent the assertion by properly waiting for both
> animations, but that's more code/work...
(In reply to Julian Descottes [:jdescottes][:julian] from comment #32)
> (In reply to Alexandre Poirot [:ochameau] from comment #31)
> > Note that we can get rid of the current intermittent by disabling the close
> > animation only:
> >   https://hg.mozilla.org/try/rev/b6b04aeee5ae
> >  
> > https://treeherder.mozilla.org/#/
> > jobs?repo=try&revision=61f72e1f341f566eaa89be4848fd157d33f2fe30
> 
> This try push seems to be on top of the existing 1s pause we have to avoid
> the intermittent. 

There was a child commit to remove the 1s pause:
  https://hg.mozilla.org/try/rev/2e05879cbded

> So I am not sure the try results are relevant? 
> Note that 10 retries is low: I had green try pushes with 20 retriggers that
> still had the problem.

I spawn some more on this try run to confirm.
Pushed by apoirot@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/496185f04933
Force disabling tab animations in DAMP. r=jdescottes
(In reply to Alexandre Poirot [:ochameau] from comment #33)
> > So I am not sure the try results are relevant? 
> > Note that 10 retries is low: I had green try pushes with 20 retriggers that
> > still had the problem.
> 
> I spawn some more on this try run to confirm.

Still green with 30 runs. So it seems to confirm it is specific to tab closing and not so much about tab opening.
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
You need to log in before you can comment on or make changes to this bug.