Closed Bug 1641328 Opened 4 months ago Closed 2 months ago

Perma [tier2] [fission] awsy/test_memory_usage.py TestMemoryUsage.test_open_tabs | AssertionError: Checkpoint was recorded

Categories

(Testing :: AWSY, defect, P3)

Tracking

RESOLVED WORKSFORME
Fission Milestone M6b
Tracking Status
firefox79 --- disabled
firefox80 --- disabled

People

(Reporter: intermittent-bug-filer, Assigned: mccr8)

References

(Regression)

Details

(Keywords: assertion, intermittent-failure)

Attachments

(1 file)

Filed by: archaeopteryx [at] coole-files.de
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=303938963&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/P-cMCTMtR6WYTqmbvekTlQ/runs/0/artifacts/public/logs/live_backing.log


[task 2020-05-27T14:37:28.486Z] 14:37:28 INFO - starting checkpoint TabsClosedExtraProcesses...
[task 2020-05-27T14:37:28.720Z] 14:37:28 INFO - checkpoint created, stored in Z:\task_1590589260\build\tests\results\memory-report-TabsClosedExtraProcesses-0.json.gz
[task 2020-05-27T14:37:28.721Z] 14:37:28 INFO - closing preloaded browser
[task 2020-05-27T14:37:28.727Z] 14:37:28 INFO - starting checkpoint TabsClosed...
[task 2020-05-27T14:37:28.956Z] 14:37:28 INFO - checkpoint created, stored in Z:\task_1590589260\build\tests\results\memory-report-TabsClosed-0.json.gz
[task 2020-05-27T14:37:58.961Z] 14:37:58 INFO - starting checkpoint TabsClosedSettled...
[task 2020-05-27T14:37:58.962Z] 14:37:58 ERROR - Unexpected error: <type 'exceptions.IOError'>
[task 2020-05-27T14:37:58.963Z] 14:37:58 INFO - TEST-UNEXPECTED-FAIL | awsy/test_memory_usage.py TestMemoryUsage.test_open_tabs | AssertionError: Checkpoint was recorded
[task 2020-05-27T14:37:58.963Z] 14:37:58 INFO - Traceback (most recent call last):
[task 2020-05-27T14:37:58.963Z] 14:37:58 INFO - File "z:\task_1590589260\build\venv\lib\site-packages\marionette_harness\marionette_test\testcases.py", line 159, in run
[task 2020-05-27T14:37:58.963Z] 14:37:58 INFO - testMethod()
[task 2020-05-27T14:37:58.963Z] 14:37:58 INFO - File "Z:\task_1590589260\build\tests\awsy\awsy\test_memory_usage.py", line 237, in test_open_tabs
[task 2020-05-27T14:37:58.964Z] 14:37:58 INFO - create_checkpoint("TabsClosedSettled", itr)
[task 2020-05-27T14:37:58.964Z] 14:37:58 INFO - File "Z:\task_1590589260\build\tests\awsy\awsy\test_memory_usage.py", line 201, in create_checkpoint
[task 2020-05-27T14:37:58.964Z] 14:37:58 INFO - self.assertIsNotNone(checkpoint, "Checkpoint was recorded")
[task 2020-05-27T14:37:58.964Z] 14:37:58 INFO - TEST-INFO took 586297ms
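
For illustration, here is a minimal Python sketch (not the actual AWSY harness; `do_memory_report`, `REPORT_TIMEOUT`, and the timing are invented) of how a memory report that never completes in time surfaces as exactly this assertion: the harness's `create_checkpoint` gets back `None` and `assertIsNotNone` fails with the message "Checkpoint was recorded".

```python
import time

# Toy sketch of the failure mode; names and timings are invented.
REPORT_TIMEOUT = 0.01  # seconds; the real harness waits far longer

def do_memory_report(deadline):
    """Stand-in for waiting on memory-report files from all processes."""
    time.sleep(deadline + 0.01)  # simulate a report that overruns the deadline
    return None                  # on timeout, no checkpoint path is returned

def create_checkpoint(name):
    checkpoint = do_memory_report(REPORT_TIMEOUT)
    # Mirrors self.assertIsNotNone(checkpoint, "Checkpoint was recorded")
    assert checkpoint is not None, "Checkpoint was recorded"
    return checkpoint

try:
    create_checkpoint("TabsClosedSettled")
except AssertionError as err:
    print("TEST-UNEXPECTED-FAIL |", err)
```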

Whiteboard: [perftest:triage]

:cpeterson, would someone from the fission team be able to look into this? This is only failing on fission tests.

Flags: needinfo?(cpeterson)
Whiteboard: [perftest:triage]

(In reply to Greg Mierzwinski [:sparky] from comment #8)

:cpeterson, would someone from the fission team be able to look into this? This is only failing on fission tests.

Thanks for the heads up. I'll send this bug to Fission triage for review.

Looks like the test has been failing 10-20 times a week, but only on windows10-64-shippable-qr.

Test failure from https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=309610455&repo=mozilla-central&lineNumber=1106

[task 2020-07-12T11:36:51.595Z] 11:36:51     INFO - starting checkpoint Start...
[task 2020-07-12T11:36:53.297Z] 11:36:53    ERROR - Checkpoint JavaScript error: JavaScriptError: Document was unloaded
[task 2020-07-12T11:36:53.297Z] 11:36:53    ERROR - stacktrace:
[task 2020-07-12T11:36:53.297Z] 11:36:53    ERROR - 	WebDriverError@chrome://marionette/content/error.js:175:5
[task 2020-07-12T11:36:53.297Z] 11:36:53    ERROR - 	JavaScriptError@chrome://marionette/content/error.js:354:5
[task 2020-07-12T11:36:53.297Z] 11:36:53    ERROR - 	evaluate.sandbox/promise</unloadHandler<@chrome://marionette/content/evaluate.js:132:20
[task 2020-07-12T11:36:53.297Z] 11:36:53    ERROR - 
[task 2020-07-12T11:36:53.327Z] 11:36:53  WARNING - Failed to gather test failure debug: Browsing context has been discarded
[task 2020-07-12T11:36:53.327Z] 11:36:53  WARNING - stacktrace:
[task 2020-07-12T11:36:53.327Z] 11:36:53  WARNING - 	WebDriverError@chrome://marionette/content/error.js:175:5
[task 2020-07-12T11:36:53.327Z] 11:36:53  WARNING - 	NoSuchWindowError@chrome://marionette/content/error.js:409:5
[task 2020-07-12T11:36:53.327Z] 11:36:53  WARNING - 	assert.that/<@chrome://marionette/content/assert.js:479:13
[task 2020-07-12T11:36:53.327Z] 11:36:53  WARNING - 	assert.open@chrome://marionette/content/assert.js:191:4
[task 2020-07-12T11:36:53.327Z] 11:36:53  WARNING - 	GeckoDriver.prototype.takeScreenshot@chrome://marionette/content/driver.js:3034:20
[task 2020-07-12T11:36:53.327Z] 11:36:53  WARNING - 	despatch@chrome://marionette/content/server.js:305:40
[task 2020-07-12T11:36:53.327Z] 11:36:53  WARNING - 	execute@chrome://marionette/content/server.js:275:16
[task 2020-07-12T11:36:53.328Z] 11:36:53  WARNING - 	onPacket/<@chrome://marionette/content/server.js:248:20
[task 2020-07-12T11:36:53.328Z] 11:36:53  WARNING - 	onPacket@chrome://marionette/content/server.js:249:9
[task 2020-07-12T11:36:53.328Z] 11:36:53  WARNING - 	_onJSONObjectReady/<@chrome://marionette/content/transport.js:501:20
[task 2020-07-12T11:36:53.328Z] 11:36:53  WARNING - 
[task 2020-07-12T11:36:53.328Z] 11:36:53     INFO - TEST-UNEXPECTED-FAIL | awsy/test_memory_usage.py TestMemoryUsage.test_open_tabs | AssertionError: Checkpoint was recorded
[task 2020-07-12T11:36:53.329Z] 11:36:53     INFO - Traceback (most recent call last):
[task 2020-07-12T11:36:53.329Z] 11:36:53     INFO -   File "z:\task_1594552757\build\venv\lib\site-packages\marionette_harness\marionette_test\testcases.py", line 196, in run
[task 2020-07-12T11:36:53.329Z] 11:36:53     INFO -     testMethod()
[task 2020-07-12T11:36:53.329Z] 11:36:53     INFO -   File "Z:\task_1594552757\build\tests\awsy\awsy\test_memory_usage.py", line 206, in test_open_tabs
[task 2020-07-12T11:36:53.329Z] 11:36:53     INFO -     create_checkpoint("Start", 0)
[task 2020-07-12T11:36:53.329Z] 11:36:53     INFO -   File "Z:\task_1594552757\build\tests\awsy\awsy\test_memory_usage.py", line 201, in create_checkpoint
[task 2020-07-12T11:36:53.329Z] 11:36:53     INFO -     self.assertIsNotNone(checkpoint, "Checkpoint was recorded")
[task 2020-07-12T11:36:53.329Z] 11:36:53     INFO - TEST-INFO took 21099ms
Fission Milestone: --- → ?
Flags: needinfo?(cpeterson)

Andrew, can you please take a look at this AWSY timeout?

kmag says the test is probably timing out because the memory report generation is taking too long.

This bug is a Fission Nightly blocker. We'd like to fix this soon, so tracking for milestone M6b.

Severity: normal → S3
Fission Milestone: ? → M6b
Flags: needinfo?(continuation)
Priority: P5 → P3
Summary: Perma [tier2][fission] awsy/test_memory_usage.py TestMemoryUsage.test_open_tabs | AssertionError: Checkpoint was recorded → Perma [tier2] [fission] awsy/test_memory_usage.py TestMemoryUsage.test_open_tabs | AssertionError: Checkpoint was recorded
Assignee: nobody → continuation
Flags: needinfo?(continuation)

This is Windows-only, which is odd. Also, when the test was working, generating the report took only 9 seconds; after the failure it isn't finishing even after a minute, so something must be going quite wrong.

The gecko.log files look similar until you start getting a ton of these:
###!!! [Child][MessageChannel] Error: (msgtype=0x390118,name=PContent::Msg_AddMemoryReport) Channel closing: too late to send/recv, messages will be lost

This indicates that the process has already started to shut down by the time many of the reporters are able to report.

(In reply to Andreea Pavel [:apavel] from comment #10)

This is perma-failing on tier2 starting with this merge https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=d4c6cd2e13bb057eb4d5f8c579f440936bf17241

I've done a ton of retriggers, and I think that isn't quite right. I think the failure rate has just been very high for a while. For instance, here are some retriggers from a push on Sunday: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&resultStatus=pending%2Crunning%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&searchStr=shippable%2Copt%2Care%2Cwe%2Cslim%2Cyet%2Ctests%2Cby%2Ctaskcluster%2C%2Cfission%2Cenabled%2Csy-tp6&revision=22f5f7e914449bde3bee21a322d67248cd5e0254

Here's some retriggers from a push on autoland, from before that merge: https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=windows%2C10%2Cshippable%2Copt%2Care%2Cwe%2Cslim%2Cyet%2Ctests%2Cby%2Ctaskcluster%2C%2Cfission%2Cenabled%2Csy-tp6&revision=d080e46635d3e8b163427f68282611aa6392e59f

It is also possible that there was some infrastructure change not reflected in the tree that has caused the failure rate to go up, but I'm not sure how I'd be able to figure that out.

The failure rate on this test is so high that I think it would be okay to disable it on Windows+Fission until I fix this problem. We don't seem to hit the failures on Linux, so I'd like to leave it enabled there.

I found a few things that AWSY could do better (which I'll file bugs for if I think they are worthwhile), but my guess is that this is a regression from bug 1602757. The timing lines up perfectly, too. This theory is based on my observation that one or more child processes are failing to send their reports. What could cause a child process to shut down before the report is done, if the test harness is just sitting around waiting for the reports?

Well...

  1. Before this specific AWSY checkpoint, the harness calls do_full_gc(), which notifies "child-mmu-request" observers.
  2. This in turn sends a message to every child process, which causes all of them to notify "memory-pressure" observers with the "heap-minimize" value.
  3. One of the observers is the preallocated process manager. What does it do when it finds out we're under memory pressure? Why, it shuts down the process, of course!

Presumably there's some kind of race between the parent process trying to collect reports from every child and the preallocated process telling the parent that it is shutting down. I'll try to confirm this theory with some try pushes and then figure out what the best fix is here.
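The suspected race above can be sketched in plain Python (a toy model using threads in place of Gecko IPC; `child_process`, `collect_reports`, and the process names are illustrative, not real Firefox APIs): the parent waits for a report from every child, but the preallocated process exits in response to memory pressure before ever sending one, so the collection never completes.

```python
import queue
import threading

def child_process(name, reports, dies_on_memory_pressure):
    # "memory-pressure"/"heap-minimize" arrives before this child answers
    # the report request; the preallocated process reacts by shutting down.
    if dies_on_memory_pressure:
        return  # process exits; its memory report is never sent
    reports.put((name, "memory-report"))

def collect_reports(children, timeout=0.2):
    """Parent side: wait for one report per child, or give up on timeout."""
    reports = queue.Queue()
    for name, dies in children:
        threading.Thread(target=child_process,
                         args=(name, reports, dies)).start()
    collected = []
    for _ in children:
        try:
            collected.append(reports.get(timeout=timeout))
        except queue.Empty:
            return None  # still waiting on a process that already went away
    return collected

# One ordinary child plus a "preallocated" one that dies under pressure:
print(collect_reports([("web-1", False), ("preallocated", True)]))  # None
```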

Regressed by: 1602757

:alexandrui, could you disable this awsy fission test on windows temporarily?

Flags: needinfo?(aionescu)
Flags: needinfo?(aionescu)
Keywords: leave-open
Whiteboard: [stockwell disabled]
Pushed by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d932b3b69adc
disable test_memory_usage.py on win fission r=sparky
Pushed by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3e4e8032b9e0
delete extra space and add missing one on a CLOSED TREE
Pushed by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e5f48da8a203
delete tab spacing on a CLOSED TREE

I guess my analysis in comment 15 isn't right. After a bit of a detour to get a browser chrome test for about:memory, it appears that, normally (at least on OSX), if you kill a process while a memory report is underway, the BrowserParent for the child process ends up dying. This clears the mMemoryReportRequest field, which makes us run ~MemoryReportRequestHost(), which calls nsMemoryReporterManager::EndProcessReport, which then makes the manager stop waiting for a memory report from that process.

So there must be something subtler going on in the case of the Windows failure. Maybe the ContentParent isn't going away due to something about the way Marionette is running code or the specific way that the GC is invoked.
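The cleanup path described two paragraphs up can be summarized with a small Python stand-in (the Gecko code is C++; these class and method names only echo the ones mentioned above, and the counter logic is an assumption about the shape of the mechanism, not the real implementation): when a child's request host is destroyed mid-report, it tells the manager to stop waiting for that process, so the overall collection can still finish.

```python
# Toy model of the normal-case teardown; not the actual Gecko C++.
class MemoryReporterManager:
    def __init__(self, num_pending):
        self.num_pending = num_pending  # processes we still expect to report

    def end_process_report(self, success):
        self.num_pending -= 1  # stop waiting for this process either way

    def done(self):
        return self.num_pending == 0

class MemoryReportRequestHost:
    def __init__(self, manager):
        self.manager = manager

    def destroy(self):
        # Runs when the child's actor goes away mid-report.
        self.manager.end_process_report(success=False)

mgr = MemoryReporterManager(num_pending=2)
mgr.end_process_report(success=True)    # one child reported normally
MemoryReportRequestHost(mgr).destroy()  # the other died mid-report
print(mgr.done())  # True: the manager is no longer blocked
```

On Windows in this bug, that unblocking apparently doesn't happen, which is what makes the failure subtle.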

Depends on: 1655536
Status: NEW → RESOLVED
Closed: 2 months ago
Keywords: leave-open
Resolution: --- → WORKSFORME
Whiteboard: [stockwell disabled]