Closed Bug 1324961 Opened 7 years ago Closed 7 years ago

Intermittent REFTEST ERROR | Got suite_end message before suite_start. Logged with data: {"thread": "None", "extra": {"results": {"AssertionUnexpectedFixed": 0, "Exception": 0, "UnexpectedFail": 0, "UnexpectedPass": 0, "Assert

Categories

(Testing :: Reftest, defect)

Version 3
defect
Not set
normal

Tracking

(firefox57 fixed, firefox58 fixed)

RESOLVED FIXED
mozilla58
Tracking Status
firefox57 --- fixed
firefox58 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:other])

Attachments

(1 file)

I will see if :gbrown can take a stab at this next week.
Flags: needinfo?(gbrown)
Whiteboard: [stockwell needswork]
These logs do have a suite_start record, but it is in raw format, *very* long, and truncated.

https://public-artifacts.taskcluster.net/Hb9XD2WHQMmH1alNmP_7HQ/0/public/logs/live_backing.log

08:02:26     INFO - {"action":"suite_start","time":1503932545985,"thread":null,"pid":null,"source":"reftest","tests":["data:text/html,<body> == about:blank","data:text/plain, == about:blank","data:text/plain,HELLO != about:blank","file:///Users/cltbld/tasks/task_1503932423/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async.xul == file:///Users/cltbld/tasks/task_1503932423/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async-ref.xul"

...

,"file:///Users/cltbld/tasks/task_1503932423/build/tests/reftest/tests/layout/reftests/reftest-sanity/font-default
08:02:26     INFO - REFTEST INFO | Running chunk 8 out of 8 chunks.  tests 13672-15631/1960

(That line is 19456 characters long.)
(In reply to Geoff Brown [:gbrown] from comment #12)
> 08:02:26     INFO - REFTEST INFO | Running chunk 8 out of 8 chunks.  tests
> 13672-15631/1960

The first two numbers in the line above should not be greater than the 3rd.. I wonder if it's not calculating the tests to run properly and somehow thinks there are 15k tests when there should only be ~2k. Would certainly explain why it is way larger than normal.
I see the same curiosity in successful runs:

12:07:29     INFO - REFTEST SUITE-START | Running 15375 tests
12:07:29     INFO - REFTEST INFO | Running chunk 8 out of 8 chunks.  tests 13672-15631/1960

https://dxr.mozilla.org/mozilla-central/rev/1b4c59eef820b46eb0037aca68f83a15088db45f/layout/tools/reftest/reftest.jsm#556
This is probably also a good time to mention bug 1373745. I'm also currently working on standing up some reftest selftests in bug 1392390. Reftests are kind of a mess right now and have been neglected a bit too long. Hopefully between this and those other two bugs we can get them into a more sustainable state.
Assignee: nobody → gbrown
Flags: needinfo?(gbrown)
See Also: → 1394957
I managed to reproduce with extra logging and found a ValueError was thrown at:

https://dxr.mozilla.org/mozilla-central/rev/db7f19e26e571ae1dd309f5d2f387b06ba670c30/layout/tools/reftest/output.py#117

with:

Unterminated string starting at: line 1 column 563085 (char 563085)

https://public-artifacts.taskcluster.net/WafXjp7QStmHAp65_GOG6g/0/public/logs/live_backing.log
So looks like it's a buffering issue, but I don't get why that would be intermittent :/. The length of that message shouldn't be changing within the same push.
Failure frequency seems a bit less over the last few days, and less concentrated on one platform...finding it hard to reproduce to get further diagnostics.
I thought there might be a weakness in mozprocess' ProcessReader. I read that closely today and could not find a problem: It should reliably read from <file>.readline() and pass the result to the OutputHandler.
This became much, much more frequent yesterday, on windows 10 debug. I wonder why?
Flags: needinfo?(gbrown)
Whiteboard: [stockwell unknown] → [stockwell needswork]
This is easy to reproduce on Windows 10/Debug now:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4303dd32a0de65a7feedc2e23e72edcffbf05ca0

I see mozprocess calling readline() [1] and sometimes getting the suite-start record in two different lines; the first line ends with a --DOMWINDOW record, from [2]. That is, it looks like the "--DOMWINDOW ..." is inserted in the suite-start; I assume the eol at the end of the --DOMWINDOW causes the suite-start to be broken down into two records:

<<line 1>> {"action":"suite_start","time":1507164418625,"thread":null, ... "file:///C:/slave/test/buil--DOMWINDOW == 2 (000001EB3BD22800) [pid = 7012] [serial = 2] [outer = 0000000000000000] [url = about:blank]\n
<<line 2>> d/tests/reftest/tests/layout/reftests/...== http://localhost:49864/1507164415119/666/reftest/gizmo.mp4.55thframe-ref.html"],"runinfo":{"skipped":264}}\n

(Line lengths differ from one run to another, but the first line seems to typically be 200 KB - 500 KB; the second line closer to 2 MB.)

Is Firefox interleaving stdout and stderr from different threads? Seeing this, I am curious -- why are mochitests not failing the same way?

[1] https://dxr.mozilla.org/mozilla-central/rev/c97190c389c4cfef20fe55b4bacade95a36ae6ef/testing/mozbase/mozprocess/mozprocess/processhandler.py#932
[2] https://dxr.mozilla.org/mozilla-central/rev/c97190c389c4cfef20fe55b4bacade95a36ae6ef/dom/base/nsGlobalWindow.cpp#1770
I believe mochitest is by design as per comments in bug 1405762.
I think bug 1405762 is unrelated to this. I think mochitest doesn't fail because it uses a special log delimiter:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#151

Maybe reftest needs to do the same thing? I don't remember why we added this for mochitest, but not reftest. I guess reftest *just worked* when this first got implemented so we didn't need it.
The delimiter is definitely interesting. Thanks for that.

BTW, the Windows 10 Debug troubles may have started here:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20x64%20debug%20reftest&tochange=cc51fec33925ca29c0e60a42db9636d76a21dc52&fromchange=5ef005eb34d90fe2932e08938a38aa92d83b995b

but I don't see how those changes could be related, other than perhaps changing some startup timing.
Looks like it didn't work.
Oh right, so this is the same underlying issue we've seen in countless other places on Windows.

On Windows, when stdout and stderr are buffered, they're not line-buffered, they're buffered in 4k chunks. This means that you can get dumb interleaved output.

We tried fixing this (in a bug I'll have to dig up) by making mozprocess read both streams on a thread and line-buffer them, but it caused issues with some test harnesses because the ordering of some lines changed, I think it might have broken leak log checking or something like that.
That was bug 798300. See also bug 1285254 for where we've hit the same issue with mach parsing the output of make.
Thanks that sounds like it! I guess the reason we don't hit this in mochitest is because the suite_start log is under 4k.

One workaround, would be to pass stderrOutputLine to mozprocess from reftest. This would keep stderr separate from stdout, but would no longer guarantee proper log ordering.

Another workaround would be to cut down the size of the suite_start message. The easiest way to do this would be to simply chunk this more. This also explains why it's more prevalent on Windows10, we only have two chunks there.
(In reply to Andrew Halberstadt [:ahal] from comment #37)
> Another workaround would be to cut down the size of the suite_start message.
> The easiest way to do this would be to simply chunk this more. This also
> explains why it's more prevalent on Windows10, we only have two chunks there.

The current Windows 10 Debug suite-start message is over 2MB, so that approach may not be feasible.

99% of the message is the list of tests, each of which includes full paths:

"file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async.xul == file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async-ref.xul"

There's lots of redundancy there, so some sort of relative path scheme might reduce it by 50%? It will still be a lot bigger than 4K.
Flags: needinfo?(gbrown)
At the very least, the fewer 4k chunks there are the less likely that an errant stderr message can slip in. I do agree we should also use relative file paths though.

This would also likely get fixed by run-by-manifest (bug 1353461)
Nvm, there'd still only be a single suite_start with run-by-manifest.
See Also: → 1406160
(In reply to Geoff Brown [:gbrown] from comment #29)
> why are mochitests not failing the same way?

A big difference is timing and process management: The mochitest python harness emits suite_start before starting firefox; for reftests, suite_start is emitted by reftest.jsm, in a firefox session that is probably just starting and producing DOMWINDOW chatter.

(The mochitest test list may also be shorter.)
Good call, this would get fixed by bug 1353461 after all then. Part of my plan there is to move the manifest parsing into python, which means we would also be logging suite_start from there.
(In reply to Geoff Brown [:gbrown] from comment #43)
> :ahal --
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=931c415c99c8f2051bae2b1f59a0880a9e1511ee looks very
> promising! Going ahead with that?

The last time we landed a patch that did this it got backed out for intermittent leakcheck failures due to the log order changing. So it'll need a lot more testing, but assuming the testing works I'd vote we take it.

But like Ted mentioned in bug 798300, even if we do hit the leakcheck intermittent, we could just fix the leakcheck code to not be so dependent on the ordering of logs. The only other reason not to take it is that some developers might object to not preserving the "true" log ordering. I guess we could get away with only splitting the streams on Windows.
Flags: needinfo?(ahalberstadt)
Comment on attachment 8915968 [details]
Bug 1324961 - [reftest] Process stdout/stderr separately on Windows,

https://reviewboard.mozilla.org/r/187252/#review192314
Attachment #8915968 - Flags: review?(gbrown) → review+
This looks pretty decent on try:
https://treeherder.mozilla.org/#/jobs?repo=try&author=ahalberstadt@mozilla.com&fromchange=931c415c99c8f2051bae2b1f59a0880a9e1511ee&tochange=8124b468d047527eab6ce1b07cfae95de12cf7a7

Though there's a good chance it'll cause an infrequent intermittent. I guess as long as the intermittent it causes is less frequent than the one it solves, it's a net positive.
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8cf03b2c66c3
[reftest] Process stdout/stderr separately on Windows, r=gbrown
https://hg.mozilla.org/mozilla-central/rev/8cf03b2c66c3
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Whiteboard: [stockwell disable-recommended] → [stockwell fixed:other]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: