Perma-failing awsy-dmd jobs with "TEST-UNEXPECTED-ERROR | TestMemoryUsage.test_open_tabs | WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'dmd-{Start,TabsOpenForceGC}-0-<pid>.json.gz'"
Categories
(Testing :: AWSY, defect, P1)
Tracking
(firefox-esr68 unaffected, firefox74 unaffected, firefox75 unaffected, firefox76 wontfix, firefox77 fixed)
Tracking | Status
--- | ---
firefox-esr68 | unaffected
firefox74 | unaffected
firefox75 | unaffected
firefox76 | wontfix
firefox77 | fixed
People
(Reporter: malexandru, Assigned: n.nethercote)
References
(Regression)
Details
(Keywords: regression)
Attachments
(2 files, 1 obsolete file)
Central as Version Increase simulation:
https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&revision=977e1b0ec819e44708c47989529e8fd44e846771&searchStr=awsy&failure_classification_id=3
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=295589269&repo=try&lineNumber=798
[task 2020-03-31T13:11:33.901Z] 13:11:33 INFO - switched to tab
[task 2020-03-31T13:11:33.907Z] 13:11:33 INFO - loading about:blank
[task 2020-03-31T13:11:33.952Z] 13:11:33 INFO - loaded!
[task 2020-03-31T13:11:43.961Z] 13:11:43 INFO - disabling preallocated process
[task 2020-03-31T13:13:43.958Z] 13:13:43 INFO - starting gc...
[task 2020-03-31T13:13:44.401Z] 13:13:44 INFO - gc done!
[task 2020-03-31T13:14:44.396Z] 13:14:44 INFO - starting checkpoint TabsOpenForceGC...
[task 2020-03-31T13:14:46.028Z] 13:14:46 INFO - checkpoint created, stored in Z:\\task_1585657111\\build\\tests\\results\\memory-report-TabsOpenForceGC-0.json.gz
[task 2020-03-31T13:14:46.029Z] 13:14:46 INFO - Starting TabsOpenForceGC DMD reports...
[task 2020-03-31T13:14:48.772Z] 13:14:48 INFO - Waiting for memory report to finish
[task 2020-03-31T13:14:49.773Z] 13:14:49 INFO - Waiting for memory report to finish
[task 2020-03-31T13:14:50.773Z] 13:14:50 INFO - Waiting for memory report to finish
[task 2020-03-31T13:14:51.774Z] 13:14:51 INFO - DMD started, prefixed with TabsOpenForceGC-0
[task 2020-03-31T13:14:51.774Z] 13:14:51 INFO - re-enabling preallocated process
[task 2020-03-31T13:14:51.789Z] 13:14:51 INFO - setting results
[task 2020-03-31T13:14:51.789Z] 13:14:51 INFO - tearing down!
[task 2020-03-31T13:14:51.789Z] 13:14:51 INFO - processing data in Z:\task_1585657111\build\tests\results!
[task 2020-03-31T13:14:52.012Z] 13:14:52 INFO - PERFHERDER_DATA: {"framework": {"name": "awsy"}, "suites": [{"lowerIsBetter": true, "subtests": [{"lowerIsBetter": true, "unit": "bytes", "name": "After tabs open [+30s, forced GC]", "value": 17389568}], "value": 17389567.999999974, "name": "Base Content Resident Unique Memory", "unit": "bytes"}, {"lowerIsBetter": true, "subtests": [{"lowerIsBetter": true, "unit": "bytes", "name": "After tabs open [+30s, forced GC]", "value": 1873848}], "value": 1873848.0000000005, "name": "Base Content Heap Unclassified", "unit": "bytes"}, {"name": "Base Content JS", "lowerIsBetter": true, "alertThreshold": 0.25, "value": 3803079.9999999967, "subtests": [{"lowerIsBetter": true, "unit": "bytes", "name": "After tabs open [+30s, forced GC]", "value": 3803080}], "unit": "bytes"}, {"lowerIsBetter": true, "subtests": [{"lowerIsBetter": true, "unit": "bytes", "name": "After tabs open [+30s, forced GC]", "value": 14233600}], "value": 14233600.000000015, "name": "Base Content Explicit", "unit": "bytes"}]}
[task 2020-03-31T13:14:52.013Z] 13:14:52 INFO - Perfherder data written to Z:\task_1585657111\build\tests\results\perfherder_data.json
[task 2020-03-31T13:14:52.021Z] 13:14:52 INFO - Fixing stacks for c:\users\task_1585657111\appdata\local\temp\dmd-TabsOpenForceGC-0-10784.json.gz, this may take a while
[task 2020-03-31T13:14:52.749Z] 13:14:52 INFO - fix-stacks error: failed to read debug info file `/builds/worker/workspace/obj-build/toolkit/library/build/xul.pdb` for `Z:\task_1585657111\build\application\firefox\xul.dll`
[task 2020-03-31T13:14:52.750Z] 13:14:52 INFO - fix-stacks note: this is expected and harmless for all PDB files on opt automation runs
[task 2020-03-31T13:14:52.773Z] 13:14:52 INFO - fix-stacks error: failed to read debug info file `/builds/worker/workspace/obj-build/security/nss3.pdb` for `Z:\task_1585657111\build\application\firefox\nss3.dll`
[task 2020-03-31T13:14:52.773Z] 13:14:52 INFO - fix-stacks note: this is expected and harmless for all PDB files on opt automation runs
[task 2020-03-31T13:14:52.823Z] 13:14:52 INFO - fix-stacks error: failed to read debug info file `/builds/worker/workspace/obj-build/mozglue/build/mozglue.pdb` for `Z:\task_1585657111\build\application\firefox\mozglue.dll`
[task 2020-03-31T13:14:52.823Z] 13:14:52 INFO - fix-stacks note: this is expected and harmless for all PDB files on opt automation runs
[task 2020-03-31T13:14:52.840Z] 13:14:52 INFO - fix-stacks error: failed to read debug info file `/builds/worker/workspace/obj-build/browser/app/firefox.pdb` for `Z:\task_1585657111\build\application\firefox\firefox.exe`
[task 2020-03-31T13:14:52.841Z] 13:14:52 INFO - fix-stacks note: this is expected and harmless for all PDB files on opt automation runs
[task 2020-03-31T13:14:53.157Z] 13:14:53 INFO - fix-stacks error: failed to read debug info file `wntdll.pdb` for `C:\Windows\SYSTEM32\ntdll.dll`
[task 2020-03-31T13:14:53.157Z] 13:14:53 INFO - fix-stacks note: this is expected and harmless for all PDB files on opt automation runs
[task 2020-03-31T13:14:53.259Z] 13:14:53 INFO - fix-stacks error: failed to read debug info file `wkernel32.pdb` for `C:\Windows\System32\KERNEL32.DLL`
[task 2020-03-31T13:14:53.259Z] 13:14:53 INFO - fix-stacks note: this is expected and harmless for all PDB files on opt automation runs
[task 2020-03-31T13:14:53.372Z] 13:14:53 INFO - fix-stacks error: failed to read debug info file `wkernelbase.pdb` for `C:\Windows\System32\KERNELBASE.dll`
[task 2020-03-31T13:14:53.373Z] 13:14:53 INFO - fix-stacks note: this is expected and harmless for all PDB files on opt automation runs
[task 2020-03-31T13:14:53.444Z] 13:14:53 INFO - fix-stacks error: failed to read debug info file `ucrtbase.pdb` for `C:\Windows\System32\ucrtbase.dll`
[task 2020-03-31T13:14:53.444Z] 13:14:53 INFO - fix-stacks note: this is expected and harmless for all PDB files on opt automation runs
[task 2020-03-31T13:14:54.159Z] 13:14:54 INFO - TEST-UNEXPECTED-ERROR | awsy/test_base_memory_usage.py TestMemoryUsage.test_open_tabs | WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\task_1585657111\\appdata\\local\\temp\\dmd-TabsOpenForceGC-0-10784.json.gz'
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - Traceback (most recent call last):
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - File "z:\task_1585657111\build\venv\lib\site-packages\marionette_harness\marionette_test\testcases.py", line 190, in run
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - self.tearDown()
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - File "Z:\task_1585657111\build\tests\awsy\awsy\test_base_memory_usage.py", line 77, in tearDown
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - AwsyTestCase.tearDown(self)
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - File "z:\task_1585657111\build\venv\lib\site-packages\awsy\awsy_test_case.py", line 116, in tearDown
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - self.cleanup_dmd()
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - File "z:\task_1585657111\build\venv\lib\site-packages\awsy\awsy_test_case.py", line 139, in cleanup_dmd
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - shutil.move(f, self._resultsDir)
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - File "c:\mozilla-build\python\Lib\shutil.py", line 317, in move
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - os.unlink(src)
[task 2020-03-31T13:14:54.160Z] 13:14:54 INFO - TEST-INFO took 272192ms
[task 2020-03-31T13:14:54.168Z] 13:14:54 INFO -
[task 2020-03-31T13:14:54.168Z] 13:14:54 INFO - SUMMARY
[task 2020-03-31T13:14:54.168Z] 13:14:54 INFO - -------
[task 2020-03-31T13:14:54.168Z] 13:14:54 INFO - passed: 0
[task 2020-03-31T13:14:54.168Z] 13:14:54 INFO - failed: 1
[task 2020-03-31T13:14:54.169Z] 13:14:54 INFO - todo: 0
[task 2020-03-31T13:14:54.169Z] 13:14:54 INFO -
[task 2020-03-31T13:14:54.169Z] 13:14:54 INFO - FAILED TESTS
[task 2020-03-31T13:14:54.169Z] 13:14:54 INFO - -------
[task 2020-03-31T13:14:54.169Z] 13:14:54 INFO - test_base_memory_usage.py test_base_memory_usage.TestMemoryUsage.test_open_tabs
[task 2020-03-31T13:14:54.170Z] 13:14:54 INFO - SUITE-END | took 272s
[task 2020-03-31T13:14:55.726Z] 13:14:55 ERROR - Return code: 10
[task 2020-03-31T13:14:55.727Z] 13:14:55 ERROR - Got 1 unexpected statuses
[task 2020-03-31T13:14:55.728Z] 13:14:55 INFO - AWSY exited with return code 10: WARNING
[task 2020-03-31T13:14:55.728Z] 13:14:55 WARNING - # TBPL WARNING #
[task 2020-03-31T13:14:55.728Z] 13:14:55 WARNING - setting return code to 1
[task 2020-03-31T13:14:55.728Z] 13:14:55 INFO - Running post-action listener: _package_coverage_data
[task 2020-03-31T13:14:55.728Z] 13:14:55 INFO - Running post-action listener: _resource_record_post_action
[task 2020-03-31T13:14:55.728Z] 13:14:55 INFO - Running post-action listener: process_java_coverage_data
[task 2020-03-31T13:14:55.728Z] 13:14:55 INFO - [mozharness: 2020-03-31 13:14:55.728000Z] Finished run-tests step (success)
[task 2020-03-31T13:14:55.729Z] 13:14:55 INFO - Running post-run listener: _resource_record_post_run
[task 2020-03-31T13:14:56.358Z] 13:14:56 INFO - Total resource usage - Wall time: 289s; CPU: 10.0%; Read bytes: 86516736; Write bytes: 276534784; Read time: 2; Write time: 4
[task 2020-03-31T13:14:56.358Z] 13:14:56 INFO - TinderboxPrint: CPU usage<br/>9.9%
[task 2020-03-31T13:14:56.358Z] 13:14:56 INFO - TinderboxPrint: I/O read bytes / time<br/>86,516,736 / 2
[task 2020-03-31T13:14:56.358Z] 13:14:56 INFO - TinderboxPrint: I/O write bytes / time<br/>276,534,784 / 4
[task 2020-03-31T13:14:56.359Z] 13:14:56 INFO - TinderboxPrint: CPU idle<br/>1,043.5 (90.1%)
[task 2020-03-31T13:14:56.359Z] 13:14:56 INFO - TinderboxPrint: CPU system<br/>41.2 (3.6%)
[task 2020-03-31T13:14:56.359Z] 13:14:56 INFO - TinderboxPrint: CPU user<br/>73.3 (6.3%)
[task 2020-03-31T13:14:56.360Z] 13:14:56 INFO - install - Wall time: 3s; CPU: 30.0%; Read bytes: 61440; Write bytes: 1929216; Read time: 0; Write time: 0
[task 2020-03-31T13:14:56.363Z] 13:14:56 INFO - run-tests - Wall time: 288s; CPU: 10.0%; Read bytes: 86455296; Write bytes: 254261248; Read time: 2; Write time: 4
[task 2020-03-31T13:14:56.429Z] 13:14:56 WARNING - returning nonzero exit status 1
Bob, could you please take a look?
Updated•5 years ago
Comment 1•5 years ago
Regression from bug 1604095?
Reporter
Comment 2•5 years ago
Also seeing these failures: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=295601561&repo=try&lineNumber=1532
Assignee
Comment 3•5 years ago
Is the "when version number gets increased to 77 on 2020-04-06" part of the bug title a red herring? I found a different try push showing the same problems: https://treeherder.mozilla.org/#/jobs?repo=try&searchStr=sy-d&revision=1a35bc198ba01498d86c33bb54d7352246baba4c
And I can't think of a reason why a version bump would cause this problem. So I think it must have been like this for a while, but the SY jobs don't get run very often and so nobody noticed.
Updated•5 years ago
Updated•5 years ago
Comment 4•5 years ago
(In reply to Nicholas Nethercote [:njn] from comment #3)
Is the "when version number gets increased to 77 on 2020-04-06" part of the bug title a red herring? I found a different try push showing the same problems: https://treeherder.mozilla.org/#/jobs?repo=try&searchStr=sy-d&revision=1a35bc198ba01498d86c33bb54d7352246baba4c
And I can't think of a reason why a version bump would cause this problem. So I think it must have been like this for a while, but the SY jobs don't get run very often and so nobody noticed.
This specific job, awsy-dmd, runs on try only.
I'm going to dig a bit, but this doesn't look good.
Assignee
Comment 5•5 years ago
I tried to do my own push from autoland. I used an `awsy` match from `mach try fuzzy` and got this result: https://treeherder.mozilla.org/#/jobs?repo=try&revision=195e5f450d5b8d58adecc202d3b3e73926a84916
All the normal AWSY jobs ran, but none of the AWSY-DMD ones. I will try again tomorrow to do a push involving the AWSY-DMD ones. No wonder nobody noticed this, if the failing jobs are hard to trigger even when people are trying to trigger them...
Comment 6•5 years ago
(In reply to Alexandru Michis [:malexandru] from comment #0)
Central as Version Increase simulation:
https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&revision=977e1b0ec819e44708c47989529e8fd44e846771&searchStr=awsy&failure_classification_id=3
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=295589269&repo=try&lineNumber=798
Alexandru, an important question is: what's the repo and the revision you pushed to try?
Comment 7•5 years ago
mozilla-central based on https://hg.mozilla.org/try/rev/e1672b3231e9e612465168f16190a1392b422517 with the two revisions listed at https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&revision=977e1b0ec819e44708c47989529e8fd44e846771&searchStr=awsy&failure_classification_id=3 applied
You can identify the base revision by clicking on the revision at the bottom of the list in Treeherder and then clicking on the "parent" revision.
Comment 8•5 years ago
So far, the investigation has revealed that revision 87ab9a88abce844629bc656c3a07a8fc6be8d052 from m-c fails the awsy dmd jobs on try.
Assignee
Comment 9•5 years ago
Aha, to run the awsy-dmd jobs you need to use the "full" job list, rather than the "target" job list. E.g. `mach try fuzzy --full`, or when adding new jobs in Treeherder, clicking on the "Full job list" checkbox.
I triggered all the relevant jobs in my try push and I got the same Windows and Mac failures. So this has been perma-failing for a while; it is unrelated to the version bump, but nobody noticed because these jobs are so hard to trigger.
Anyway, I will investigate.
Assignee
Updated•5 years ago
Assignee
Updated•5 years ago
Assignee
Updated•5 years ago
Assignee
Comment 10•5 years ago
Also seeing these failures: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=295601561&repo=try&lineNumber=1532
These Mac timeouts are pre-existing, see bug 1513628.
Comment 11•5 years ago
This try push based on mozilla-central cd3fa71346c78a3ffafa0b25f3a6cfcf28914dfc reveals that back on Feb 27 the awsy dmd jobs weren't failing.
Assignee
Comment 12•5 years ago
Yes, bug 1604095 is definitely the regressor and it landed on March 5th.
I'm working on a fix.
Assignee
Comment 13•5 years ago
I have determined that the problem is due to the `fix-stacks` process inheriting open file descriptors from its parent when it is created. I just need to move things around a bit so that the `fix-stacks` process is created before some other files are opened. I will do that tomorrow.
Updated•5 years ago
Assignee
Comment 14•5 years ago
Assignee
Comment 15•5 years ago
Currently AWSY-with-DMD doesn't work on Windows. This is because `fix-stacks` is initialized lazily, and by the time the initialization happens some file descriptors for files are open, and that leads to some major Python2-on-Windows sadness as described in the big comment in the commit.

To fix the problem, this commit adds an `init` function to `fix_stacks.py` so that `fix-stacks` can be initialized eagerly, hopefully before any file descriptors for files are open.

For `dmd.py`, other than fixing the AWSY problems, this has little effect, because `fix-stacks` is always initialized.

For `utils.py`, which is used to process the output of most tests, this has a more noticeable effect: the `fix-stacks` process is always spawned, rather than being spawned only when needed. If no stack traces appear in the test output, this means that `fix-stacks` is spawned unnecessarily. But it's cheap to spawn; the expensive part only happens when stack traces start getting fixed. So I think this change in behaviour is acceptable.

The commit also renames some things to more standard Python style, e.g. `json_mode` instead of `jsonMode`.

Depends on D69477
Comment 16•5 years ago
Assignee
Updated•5 years ago
Comment 17•5 years ago
bugherder
Assignee
Updated•5 years ago
Assignee
Comment 18•5 years ago
Fun fact: once this lands, our Android host-utils bundle is going to need updating again, and our previous attempt at updating it (bug 1623134) hasn't finished landing yet. Fortunately, this bug's changes only affect opt builds and Android tests only run on debug. Otherwise, things would be even more complicated, because it's not possible to update `fix_stacks.py` and `utils.py` in tandem on Android.
Assignee
Comment 19•5 years ago
See bug 1623134 comment 27 for a fuller description of the Android mess.
Comment 20•5 years ago
Not sure I'm stating the obvious, but a push to try of the patch in this bug doesn't fix awsy dmd.
Assignee
Comment 21•5 years ago
(In reply to Alexandru Ionescu :alexandrui (needinfo me) from comment #20)
Not sure I'm stating the obvious, but a push to try of the patch in this bug doesn't fix awsy dmd.
There are two patches in this bug. Only the first has landed, but it's the second one that has the fix.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4846ec966cee45fec113d20f876377cb70847cd3 is my try push from last week showing that the second patch fixes the Windows awsy-dmd jobs.
Comment 22•5 years ago
Comment 23•5 years ago
Backed out changeset 916894e8b8fc (bug 1626272) for frequent test_dmd.js failures
Backout link: https://hg.mozilla.org/integration/autoland/rev/7439a3dd86c98ec28caeed28f9a6e942d8e75e59
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=296529512&repo=autoland&lineNumber=8728
[task 2020-04-07T00:43:07.965Z] 00:43:07 INFO - TEST-START | memory/replace/dmd/test/test_dmd.js
[task 2020-04-07T00:43:08.545Z] 00:43:08 WARNING - TEST-UNEXPECTED-FAIL | memory/replace/dmd/test/test_dmd.js | xpcshell return code: 0
[task 2020-04-07T00:43:08.545Z] 00:43:08 INFO - TEST-INFO took 578ms
Assignee
Comment 24•5 years ago
Argh, I don't understand those test_dmd.js failures at all.

The test invokes `dmd.py` in a separate process. `dmd.py` runs, stack-fixing the input file and then producing an output file. We compare the output file against the expected output, and it looks as expected, so the test passes. Then, we try to remove the output file and get this error:

NS_ERROR_FILE_IS_LOCKED: Component returned failure code: 0x8052000e (NS_ERROR_FILE_IS_LOCKED) [nsIFile.remove]

It happens on most, but not all, runs.

What I don't understand is that the backed-out commit doesn't change anything about how the output is produced. It only changes the stack-fixing of the input file.

But I'm wondering if there's a problem with the fact that `fix_stacks.py` uses `Popen` to spawn a `fix-stacks` process but never tries to clean it up.
Assignee
Comment 25•5 years ago
Unfortunate, but it gets us back to where we were before `fix_stacks.py` came along, and getting it working isn't worth the effort right now.
Updated•5 years ago
Comment 26•5 years ago
Comment 27•5 years ago
bugherder
Comment 28•5 years ago
Comment on attachment 9137915 [details]
Bug 1626272 - Make `fix-stacks` initialization eager. r=erahm
Revision D69478 was moved to bug 1628527. Setting attachment 9137915 [details] to obsolete.
Comment 29•5 years ago
The patch landed in nightly and beta is affected.
:njn, is this bug important enough to require an uplift?
If not please set `status_beta` to `wontfix`.
For more information, please visit auto_nag documentation.
Assignee
Updated•5 years ago
Assignee
Comment 30•5 years ago
This doesn't need uplifting to beta because this job is not run by default, you have to specially ask for it from the "full" list of jobs.