Open Bug 1345093 Opened 7 years ago Updated 2 years ago

Windows reftest timeouts after "JavaScript error: chrome://reftest/content/reftest.jsm, line 1741: NS_ERROR_FAILURE:"

Categories

(Testing :: Reftest, defect)

Unspecified
Windows 7
defect

Tracking

(Not tracked)

People

(Reporter: kats, Unassigned)

References

(Depends on 20 open bugs)

Details

Attachments

(1 file)

There are a lot of different intermittent failure bugs filed for what appears to be the same underlying root issue. The Windows 7 R(R) job runs, spits out this error:
  JavaScript error: chrome://reftest/content/reftest.jsm, line 1741: NS_ERROR_FAILURE:
and then hangs. 330 seconds later the test times out and is force-crashed by the harness.

Line 1741 in reftest.jsm is not a particularly interesting line (http://searchfox.org/mozilla-central/rev/1bc7720a434af3c13b68b8d69f92078cae8890c4/layout/tools/reftest/reftest.jsm#1741), so I don't see why it would produce an NS_ERROR_FAILURE.

Example logs:

https://treeherder.mozilla.org/logviewer.html#?job_id=82088875&repo=mozilla-inbound&lineNumber=12663
https://treeherder.mozilla.org/logviewer.html#?job_id=82072350&repo=mozilla-inbound&lineNumber=12510
https://treeherder.mozilla.org/logviewer.html#?job_id=81783775&repo=mozilla-inbound&lineNumber=21205
https://treeherder.mozilla.org/logviewer.html#?job_id=81783793&repo=mozilla-inbound&lineNumber=21239
https://treeherder.mozilla.org/logviewer.html#?job_id=81783799&repo=mozilla-inbound&lineNumber=61369

It would be super if TreeHerder or the reftest harness or mozharness flagged the JavaScript error output as relevant, so that sheriffs can more easily annotate these failures as being the same thing, rather than ending up filing tons of one-off bugs.
The exact line number varies a bit depending on how far back you go, but it always points to the same actual line of code. I'm going to stop adding more bugs here because that would take me all day. Here's a search query to find more:

https://bugzilla.mozilla.org/buglist.cgi?keywords=intermittent-failure%2C%20&keywords_type=allwords&list_id=13475923&short_desc_type=allwordssubstr&short_desc=application%20timed%20out%20after%20330%20seconds%20with%20no%20output&resolution=---&query_format=advanced&component=Graphics&component=Layout&product=Core
Also a warning: this failure might become more frequent in the near future, after bug 1342450 lands. On a recent try push for that bug [1] 44% of the Windows 7 opt R(R) jobs encountered this error.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=08fd35010f0b5d58def870e29b7b858f69cda0ff&filter-searchStr=windows%20reftest&group_state=expanded
Blocks: 1342450
reftest.jsm is preprocessed. Are the line numbers in the errors adjusted for that, or do you need to unpack the post-preprocessing version to see what's actually on line 1741?
Ah, good point. I downloaded the zip file [1] for the try push I linked above, and looked at the post-processed reftest.jsm. Line 1741 points to [2] which seems like a more reasonable culprit.

[1] https://queue.taskcluster.net/v1/task/bQvMnnkFToiXdf5VpYy8Xw/artifacts/public/build/firefox-54.0a1.en-US.win32.reftest.tests.zip
[2] http://searchfox.org/mozilla-central/rev/1bc7720a434af3c13b68b8d69f92078cae8890c4/layout/tools/reftest/reftest.jsm#1757
Also it seems like some (but not all) of the failing reftest jobs have gfx error output, for example:

05:52:19     INFO - [GFX1-]: Failed 2 buffer db=00000000 dw=00000000 for 0, 0, 800, 1000

and this shows up just before the NS_ERROR_FAILURE. My best guess is that we're running out of memory, and so both the GFX code and toDataURL are failing as a result.
Attached file Virt samples
I took the resource-usage file from a sample failing job [1] and extracted the "virt" samples. The sample with the smallest "free" value is still 1233567744; assuming this is virtual-memory data sampled during the run (the format isn't documented), free memory never dropped below ~1.2 GB. That makes it pretty unlikely that running out of memory is the problem, but maybe we have really bad memory fragmentation?

[1] http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/77b2dbbe62bccc9204f81d0884a514db0bb6834fc5af5eda5e5afb84e7cbb4015e69db7e28d1b37801d8da4277e5b5132bf24809d54c1143d7c12a8d21c935b1
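For reference, the analysis above can be sketched as a few lines of Node-style JavaScript. The sample shape used here ({ free: <bytes> }) is an assumption for illustration, since the resource-usage format is undocumented; the sample values below are made up, apart from the minimum seen in the real log.

```javascript
// Sketch: scan "virt" samples from a resource-usage dump and report the
// smallest "free" value seen during the run.
// ASSUMPTION: each sample is an object with a numeric `free` field (bytes).

function minFreeBytes(samples) {
  return samples.reduce(
    (min, s) => Math.min(min, s.free),
    Number.POSITIVE_INFINITY
  );
}

// Illustrative samples; 1233567744 is the smallest value from the real log.
const samples = [
  { free: 2147483648 },
  { free: 1233567744 },
  { free: 1500000000 },
];

const min = minFreeBytes(samples);
console.log(`minimum free: ${min} bytes (~${(min / 1e9).toFixed(1)} GB)`);
```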
Indeed it is. But in this case at least the error is being reported to JavaScript, so we should be able to catch it, force-trigger a GC/CC, and try again. I'll write a patch that does that.
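A minimal sketch of that recovery idea: if an operation throws (e.g. an OOM surfacing as NS_ERROR_FAILURE from canvas.toDataURL()), run a recovery step and retry once. In the real reftest.jsm patch the recovery step would presumably be chrome-only calls like Components.utils.forceGC() / Components.utils.forceCC(); here it's a plain callback so the sketch runs outside chrome code, and the failing operation is simulated.

```javascript
// Generic catch-recover-retry wrapper. The recovery callback stands in for
// a forced GC/CC; the operation stands in for canvas.toDataURL().
function withRetry(operation, recover) {
  try {
    return operation();
  } catch (e) {
    recover();          // e.g. force a GC/CC to reclaim memory
    return operation(); // a second failure propagates to the caller
  }
}

// Usage: the first attempt fails, recovery runs, the retry succeeds.
let attempts = 0;
const result = withRetry(
  () => {
    attempts++;
    if (attempts === 1) throw new Error("NS_ERROR_FAILURE"); // simulated OOM
    return "ok";
  },
  () => { /* stand-in for Cu.forceGC() / Cu.forceCC() */ }
);
```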
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #0)
> It would be super if TreeHerder or the reftest harness or mozharness flagged
> the JavaScript error output as relevant, so that sheriffs can more easily
> annotate these failures as being the same thing, rather than ending up
> filing tons of one-off bugs.

Bug 1299274 was my idea. Hasn't gotten any traction since.
Try push with the OOM guard: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7df09d6534a28046016fe8fd9846795ea3672fe0

While testing the patch locally I realized that the error line (a call to canvas.toDataURL()) only gets run if the test is already failing. So this might be exactly the same issue as bug 1300355, just with a different manifestation (i.e. an OOM causes the test to fail, and then a second OOM during the reporting of that failure causes this hang).

Anyhow, I'll retrigger the reftest a bunch of times and see what happens.
So in https://treeherder.mozilla.org/logviewer.html#?job_id=82479713&repo=try&lineNumber=61373 it triggered my OOM guard (I can see its output), but my recovery attempt of running a GC/CC didn't help: the allocation failed even after that, and the job ran into the same timeout. So whatever is fragmenting the memory isn't helped by GC'ing. We should probably fix the fragmentation - I left a comment in bug 1300355.
Depends on: 1341664
Severity: normal → S3