Open
Bug 1345093
Opened 7 years ago
Updated 2 years ago
Windows reftest timeouts after "JavaScript error: chrome://reftest/content/reftest.jsm, line 1741: NS_ERROR_FAILURE:"
Categories
(Testing :: Reftest, defect)
Tracking
(Not tracked)
NEW
People
(Reporter: kats, Unassigned)
References
(Depends on 20 open bugs)
Details
Attachments
(1 file)
96.37 KB, text/plain
There are a lot of different intermittent failure bugs filed for what appears to be the same underlying root issue. The Windows 7 R(R) job runs, spits out this error:

JavaScript error: chrome://reftest/content/reftest.jsm, line 1741: NS_ERROR_FAILURE:

and then hangs. 330 seconds later the test times out and is force-crashed by the harness. Line 1741 in reftest.jsm is not a particularly interesting line:

http://searchfox.org/mozilla-central/rev/1bc7720a434af3c13b68b8d69f92078cae8890c4/layout/tools/reftest/reftest.jsm#1741

so I don't really see why it would produce an NS_ERROR_FAILURE.

Example logs:

https://treeherder.mozilla.org/logviewer.html#?job_id=82088875&repo=mozilla-inbound&lineNumber=12663
https://treeherder.mozilla.org/logviewer.html#?job_id=82072350&repo=mozilla-inbound&lineNumber=12510
https://treeherder.mozilla.org/logviewer.html#?job_id=81783775&repo=mozilla-inbound&lineNumber=21205
https://treeherder.mozilla.org/logviewer.html#?job_id=81783793&repo=mozilla-inbound&lineNumber=21239
https://treeherder.mozilla.org/logviewer.html#?job_id=81783799&repo=mozilla-inbound&lineNumber=61369

It would be super if TreeHerder or the reftest harness or mozharness flagged the JavaScript error output as relevant, so that sheriffs can more easily annotate these failures as being the same thing, rather than ending up filing tons of one-off bugs.
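To make the idea of "flagging the JavaScript error output as relevant" concrete, here is a minimal sketch of a log scanner that recognizes the error line above. This is purely illustrative — it is not how mozharness or Treeherder actually parse logs, and the function name is made up:

```javascript
// Illustrative sketch: scan harness log output for the reftest.jsm
// JavaScript error line so it could be surfaced as the failure reason.
const ERROR_RE = /^JavaScript error: (chrome:\/\/\S+), line (\d+): (NS_ERROR_\w+)/;

function findHarnessErrors(logText) {
  const hits = [];
  for (const line of logText.split("\n")) {
    const m = ERROR_RE.exec(line);
    if (m) {
      // Record where the error came from so matching failures can be
      // grouped under one bug instead of many one-off bugs.
      hits.push({ file: m[1], line: Number(m[2]), error: m[3] });
    }
  }
  return hits;
}

// Usage with a log snippet matching the one quoted in this bug:
const sampleLog = [
  "REFTEST TEST-START | some/test.html",
  "JavaScript error: chrome://reftest/content/reftest.jsm, line 1741: NS_ERROR_FAILURE:",
].join("\n");

console.log(findHarnessErrors(sampleLog));
// [{ file: "chrome://reftest/content/reftest.jsm", line: 1741, error: "NS_ERROR_FAILURE" }]
```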
Reporter
Updated • 7 years ago
Reporter
Comment 1 • 7 years ago
The exact line number varies a bit depending on how far back you go, but it always points to the same actual line of code. I'm going to stop adding more bugs here because that would take me all day. Here's a search query to find more:

https://bugzilla.mozilla.org/buglist.cgi?keywords=intermittent-failure%2C%20&keywords_type=allwords&list_id=13475923&short_desc_type=allwordssubstr&short_desc=application%20timed%20out%20after%20330%20seconds%20with%20no%20output&resolution=---&query_format=advanced&component=Graphics&component=Layout&product=Core
Reporter
Comment 2 • 7 years ago
Also a warning: this failure might become more frequent in the near future, after bug 1342450 lands. On a recent try push for that bug [1], 44% of the Windows 7 opt R(R) jobs encountered this error.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=08fd35010f0b5d58def870e29b7b858f69cda0ff&filter-searchStr=windows%20reftest&group_state=expanded
Blocks: 1342450
Comment 3 • 7 years ago

reftest.jsm is preprocessed. Are the line numbers in the errors adjusted for that, or do you need to unpack the version after preprocessing to see what's on line 1741?
Reporter
Comment 4 • 7 years ago
Ah, good point. I downloaded the zip file [1] for the try push I linked above, and looked at the post-processed reftest.jsm. Line 1741 points to [2], which seems like a more reasonable culprit.

[1] https://queue.taskcluster.net/v1/task/bQvMnnkFToiXdf5VpYy8Xw/artifacts/public/build/firefox-54.0a1.en-US.win32.reftest.tests.zip
[2] http://searchfox.org/mozilla-central/rev/1bc7720a434af3c13b68b8d69f92078cae8890c4/layout/tools/reftest/reftest.jsm#1757
Reporter
Comment 5 • 7 years ago
Also it seems like some (but not all) of the failing reftest jobs have gfx error output, for example:

05:52:19 INFO - [GFX1-]: Failed 2 buffer db=00000000 dw=00000000 for 0, 0, 800, 1000

and this shows up just before the NS_ERROR_FAILURE. My best guess is that we're running out of memory, and so both the GFX code and toDataURL are failing as a result.
Reporter
Comment 6 • 7 years ago
I took the resource-usage file from a sample failing job [1] and extracted the "virt" samples. The sample with the smallest "free" value is still 1233567744 - assuming this is virtual memory data during the run (I assume it is, but it's not documented), that means free memory never dropped below 1.2G. That makes it pretty unlikely that running out of memory is the problem, but maybe we have really bad memory fragmentation?

[1] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/77b2dbbe62bccc9204f81d0884a514db0bb6834fc5af5eda5e5afb84e7cbb4015e69db7e28d1b37801d8da4277e5b5132bf24809d54c1143d7c12a8d21c935b1
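The check above can be sketched as follows. The resource-usage JSON layout here (a "virt" array of samples with a "free" field in bytes) is an assumption based on the description in the comment, not a documented schema, and the helper name is made up:

```javascript
// Sketch: find the minimum "free" value across the virtual-memory
// samples in a resource-usage file, to see how low free memory got.
// The { virt: [{ free: <bytes> }, ...] } shape is assumed, not documented.
function minFreeVirt(resourceUsage) {
  const frees = resourceUsage.virt.map((sample) => sample.free);
  return Math.min(...frees);
}

// Example with made-up samples; 1233567744 bytes is the actual minimum
// reported in the failing job above, i.e. free memory stayed above ~1.2G.
const usage = {
  virt: [{ free: 2147483648 }, { free: 1233567744 }, { free: 1500000000 }],
};
console.log(minFreeVirt(usage)); // 1233567744
console.log((minFreeVirt(usage) / 1024 ** 3).toFixed(2) + " GiB");
```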
Comment 7 • 7 years ago

This is starting to sound a lot like bug 1300355.
See Also: → 1300355
Reporter
Comment 8 • 7 years ago
Indeed it is. But in this case at least the error is being reported to JavaScript, so we should be able to catch it, force-trigger a GC/CC, and try again. I'll write a patch that does that.
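The catch-and-retry idea can be sketched like this. In the actual harness the recovery step would be the chrome-privileged Components.utils.forceGC()/forceCC() calls, which can't run in ordinary JS, so here the recovery action is an injected callback; the function names are illustrative and not from the real patch:

```javascript
// Minimal sketch of an "OOM guard": if the operation throws (e.g.
// canvas.toDataURL() failing with NS_ERROR_FAILURE), run a recovery
// step (in reftest, a forced GC/CC) and retry the operation once.
function withOomGuard(operation, recover) {
  try {
    return operation();
  } catch (e) {
    recover(); // stand-in for Components.utils.forceGC()/forceCC()
    return operation();
  }
}

// Usage example: the first attempt throws, recovery "frees" memory,
// and the single retry succeeds.
let memoryTight = true;
const snapshot = withOomGuard(
  () => {
    if (memoryTight) throw new Error("NS_ERROR_FAILURE");
    return "data:image/png;base64,...";
  },
  () => {
    memoryTight = false;
  }
);
console.log(snapshot); // "data:image/png;base64,..."
```

Note that as comment 11 below reports, this recovery did not help in practice, which suggests the allocation failure comes from fragmentation rather than collectible garbage.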
Comment 9 • 7 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #0)
> It would be super if TreeHerder or the reftest harness or mozharness flagged
> the JavaScript error output as relevant, so that sheriffs can more easily
> annotate these failures as being the same thing, rather than ending up
> filing tons of one-off bugs.

Bug 1299274 was my idea. Hasn't gotten any traction since.
Reporter
Comment 10 • 7 years ago
Try push with the OOM guard:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7df09d6534a28046016fe8fd9846795ea3672fe0

While testing the patch locally I realized that the error line (a call to canvas.toDataURL()) only actually gets run if the test is failing anyway. So this might actually be exactly the same as bug 1300355, but with a different manifestation (i.e. there's an OOM that causes the test to fail, but then during the reporting of the failure there's another OOM that causes this hang). Anyhow, I'll retrigger the reftest a bunch of times and see what happens.
Reporter
Comment 11 • 7 years ago
So in https://treeherder.mozilla.org/logviewer.html#?job_id=82479713&repo=try&lineNumber=61373 it triggered my OOM guard (I can see the output from it), but my recovery attempt of running a GC/CC didn't seem to help. It failed the allocation even after that, and ran into the same timeout. So whatever is fragmenting the memory isn't helped by GC'ing. We should probably fix the fragmentation - I left a comment in bug 1300355.
Comment hidden (Intermittent Failures Robot)
Updated • 2 years ago
Severity: normal → S3