Closed
Bug 1324961
Opened 7 years ago
Closed 7 years ago
Intermittent REFTEST ERROR | Got suite_end message before suite_start. Logged with data: {"thread": "None", "extra": {"results": {"AssertionUnexpectedFixed": 0, "Exception": 0, "UnexpectedFail": 0, "UnexpectedPass": 0, "Assert
Categories
(Testing :: Reftest, defect)
Tracking
(firefox57 fixed, firefox58 fixed)
RESOLVED
FIXED
mozilla58
People
(Reporter: intermittent-bug-filer, Assigned: gbrown)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:other])
Attachments
(1 file)
Filed by: philringnalda [at] gmail.com https://treeherder.mozilla.org/logviewer.html#?job_id=8214912&repo=autoland https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-linux/1482276113/autoland_ubuntu32_vm_test-crashtest-bm01-tests1-linux32-build65.txt.gz https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-linux/1482276113/autoland_ubuntu32_vm_test-crashtest-bm01-tests1-linux32-build65.txt.gz&only_show_unexpected=1
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 10•7 years ago
|
||
I will see if :gbrown can take a stab at this next week.
Flags: needinfo?(gbrown)
Whiteboard: [stockwell needswork]
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 12•7 years ago
|
||
These logs do have a suite_start record, but it is in raw format, *very* long, and truncated. https://public-artifacts.taskcluster.net/Hb9XD2WHQMmH1alNmP_7HQ/0/public/logs/live_backing.log 08:02:26 INFO - {"action":"suite_start","time":1503932545985,"thread":null,"pid":null,"source":"reftest","tests":["data:text/html,<body> == about:blank","data:text/plain, == about:blank","data:text/plain,HELLO != about:blank","file:///Users/cltbld/tasks/task_1503932423/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async.xul == file:///Users/cltbld/tasks/task_1503932423/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async-ref.xul" ... ,"file:///Users/cltbld/tasks/task_1503932423/build/tests/reftest/tests/layout/reftests/reftest-sanity/font-default 08:02:26 INFO - REFTEST INFO | Running chunk 8 out of 8 chunks. tests 13672-15631/1960 (That line is 19456 characters long.)
Comment 13•7 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #12) > 08:02:26 INFO - REFTEST INFO | Running chunk 8 out of 8 chunks. tests > 13672-15631/1960 The first two numbers in the line above should not be greater than the 3rd.. I wonder if it's not calculating the tests to run properly and somehow thinks there are 15k tests when there should only be ~2k. Would certainly explain why it is way larger than normal.
Assignee | ||
Comment 14•7 years ago
|
||
I see the same curiosity in successful runs: 12:07:29 INFO - REFTEST SUITE-START | Running 15375 tests 12:07:29 INFO - REFTEST INFO | Running chunk 8 out of 8 chunks. tests 13672-15631/1960 https://dxr.mozilla.org/mozilla-central/rev/1b4c59eef820b46eb0037aca68f83a15088db45f/layout/tools/reftest/reftest.jsm#556
Comment 15•7 years ago
|
||
This is probably also a good time to mention bug 1373745. I'm also currently working on standing up some reftest selftests in bug 1392390. Reftests are kind of a mess right now and have been neglected a bit too long. Hopefully between this and those other two bugs we can get them into a more sustainable state.
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → gbrown
Flags: needinfo?(gbrown)
Assignee | ||
Comment 16•7 years ago
|
||
I managed to reproduce with extra logging and found a ValueError was thrown at: https://dxr.mozilla.org/mozilla-central/rev/db7f19e26e571ae1dd309f5d2f387b06ba670c30/layout/tools/reftest/output.py#117 with: Unterminated string starting at: line 1 column 563085 (char 563085) https://public-artifacts.taskcluster.net/WafXjp7QStmHAp65_GOG6g/0/public/logs/live_backing.log
Comment 17•7 years ago
|
||
So looks like it's a buffering issue, but I don't get why that would be intermittent :/. The length of that message shouldn't be changing within the same push.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 19•7 years ago
|
||
Failure frequency seems a bit less over the last few days, and less concentrated on one platform...finding it hard to reproduce to get further diagnostics.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 21•7 years ago
|
||
I thought there might be a weakness in mozprocess' ProcessReader. I read that closely today and could not find a problem: It should reliably read from <file>.readline() and pass the result to the OutputHandler.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 27•7 years ago
|
||
This became much, much more frequent yesterday, on windows 10 debug. I wonder why?
Flags: needinfo?(gbrown)
Whiteboard: [stockwell unknown] → [stockwell needswork]
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 29•7 years ago
|
||
This is easy to reproduce on Windows 10/Debug now: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4303dd32a0de65a7feedc2e23e72edcffbf05ca0 I see mozprocess calling readline() [1] and sometimes getting the suite-start record in two different lines; the first line ends with a --DOMWINDOW record, from [2]. That is, it looks like the "--DOMWINDOW ..." is inserted in the suite-start; I assume the eol at the end of the --DOMWINDOW causes the suite-start to be broken down into two records: <<line 1>> {"action":"suite_start","time":1507164418625,"thread":null, ... "file:///C:/slave/test/buil--DOMWINDOW == 2 (000001EB3BD22800) [pid = 7012] [serial = 2] [outer = 0000000000000000] [url = about:blank]\n <<line 2>> d/tests/reftest/tests/layout/reftests/...== http://localhost:49864/1507164415119/666/reftest/gizmo.mp4.55thframe-ref.html"],"runinfo":{"skipped":264}}\n (Line lengths differ from one run to another, but the first line seems to typically be 200 KB - 500 KB; the second line closer to 2 MB.) Is Firefox interleaving stdout and stderr from different threads? Seeing this, I am curious -- why are mochitests not failing the same way? [1] https://dxr.mozilla.org/mozilla-central/rev/c97190c389c4cfef20fe55b4bacade95a36ae6ef/testing/mozbase/mozprocess/mozprocess/processhandler.py#932 [2] https://dxr.mozilla.org/mozilla-central/rev/c97190c389c4cfef20fe55b4bacade95a36ae6ef/dom/base/nsGlobalWindow.cpp#1770
Comment 30•7 years ago
|
||
I believe mochitest is by design as per comments in bug 1405762.
Comment 31•7 years ago
|
||
I think bug 1405762 is unrelated to this. I think mochitest doesn't fail because it uses a special log delimiter: https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#151 Maybe reftest needs to do the same thing? I don't remember why we added this for mochitest, but not reftest. I guess reftest *just worked* when this first got implemented so we didn't need it.
Assignee | ||
Comment 32•7 years ago
|
||
The delimiter is definitely interesting. Thanks for that. BTW, the Windows 10 Debug troubles may have started here: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20x64%20debug%20reftest&tochange=cc51fec33925ca29c0e60a42db9636d76a21dc52&fromchange=5ef005eb34d90fe2932e08938a38aa92d83b995b but I don't see how those changes could be related, other than perhaps changing some startup timing.
Comment 33•7 years ago
|
||
I have a potential fix on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4cf8e503b6298edae68773467744775c3a4d40bf
Comment 34•7 years ago
|
||
Looks like it didn't work.
Comment 35•7 years ago
|
||
Oh right, so this is the same underlying issue we've seen in countless other places on Windows. On Windows, when stdout and stderr are buffered, they're not line-buffered, they're buffered in 4k chunks. This means that you can get dumb interleaved output. We tried fixing this (in a bug I'll have to dig up) by making mozprocess read both streams on a thread and line-buffer them, but it caused issues with some test harnesses because the ordering of some lines changed, I think it might have broken leak log checking or something like that.
Comment 36•7 years ago
|
||
That was bug 798300. See also bug 1285254 for where we've hit the same issue with mach parsing the output of make.
Comment 37•7 years ago
|
||
Thanks that sounds like it! I guess the reason we don't hit this in mochitest is because the suite_start log is under 4k. One workaround, would be to pass stderrOutputLine to mozprocess from reftest. This would keep stderr separate from stdout, but would no longer guarantee proper log ordering. Another workaround would be to cut down the size of the suite_start message. The easiest way to do this would be to simply chunk this more. This also explains why it's more prevalent on Windows10, we only have two chunks there.
Assignee | ||
Comment 38•7 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #37) > Another workaround would be to cut down the size of the suite_start message. > The easiest way to do this would be to simply chunk this more. This also > explains why it's more prevalent on Windows10, we only have two chunks there. The current Windows 10 Debug suite-start message is over 2MB, so that approach may not be feasible. 99% of the message is the list of tests, each of which includes full paths: "file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async.xul == file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async-ref.xul" There's lots of redundancy there, so some sort of relative path scheme might reduce it by 50%? It will still be a lot bigger than 4K.
Flags: needinfo?(gbrown)
Comment 39•7 years ago
|
||
At the very least, the fewer 4k chunks there are the less likely that an errant stderr message can slip in. I do agree we should also use relative file paths though. This would also likely get fixed by run-by-manifest (bug 1353461)
Comment 40•7 years ago
|
||
Nvm, there'd still only be a single suite_start with run-by-manifest.
Assignee | ||
Comment 41•7 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #29) > why are mochitests not failing the same way? A big difference is timing and process management: The mochitest python harness emits suite_start before starting firefox; for reftests, suite_start is emitted by reftest.jsm, in a firefox session that is probably just starting and producing DOMWINDOW chatter. (The mochitest test list may also be shorter.)
Comment 42•7 years ago
|
||
Good call, this would get fixed by bug 1353461 after all then. Part of my plan there is to move the manifest parsing into python, which means we would also be logging suite_start from there.
Assignee | ||
Comment 43•7 years ago
|
||
:ahal -- https://treeherder.mozilla.org/#/jobs?repo=try&revision=931c415c99c8f2051bae2b1f59a0880a9e1511ee looks very promising! Going ahead with that? I'm also trying: https://treeherder.mozilla.org/#/jobs?repo=try&revision=423def6405e1b065cfe0ea76fee9ed275a9735bc
Flags: needinfo?(ahalberstadt)
Comment hidden (Intermittent Failures Robot) |
Comment 45•7 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #43) > :ahal -- > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=931c415c99c8f2051bae2b1f59a0880a9e1511ee looks very > promising! Going ahead with that? The last time we landed a patch that did this it got backed out for intermittent leakcheck failures due to the log order changing. So it'll need a lot more testing, but assuming the testing works I'd vote we take it. But like Ted mentioned in bug 798300, even if we do hit the leakcheck intermittent, we could just fix the leakcheck code to not be so dependent on the ordering of logs. The only other reason not to take it is that some developers might object to not preserving the "true" log ordering. I guess we could get away with only splitting the streams on Windows.
Flags: needinfo?(ahalberstadt)
Comment hidden (mozreview-request) |
Assignee | ||
Comment 47•7 years ago
|
||
mozreview-review |
Comment on attachment 8915968 [details] Bug 1324961 - [reftest] Process stdout/stderr separately on Windows, https://reviewboard.mozilla.org/r/187252/#review192314
Attachment #8915968 -
Flags: review?(gbrown) → review+
Comment 48•7 years ago
|
||
This looks pretty decent on try: https://treeherder.mozilla.org/#/jobs?repo=try&author=ahalberstadt@mozilla.com&fromchange=931c415c99c8f2051bae2b1f59a0880a9e1511ee&tochange=8124b468d047527eab6ce1b07cfae95de12cf7a7 Though there's a good chance it'll cause an infrequent intermittent. I guess as long as the intermittent it causes is less frequent than the one it solves, it's a net positive.
Comment hidden (Intermittent Failures Robot) |
Comment 50•7 years ago
|
||
Pushed by ahalberstadt@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/8cf03b2c66c3 [reftest] Process stdout/stderr separately on Windows, r=gbrown
Comment 51•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/8cf03b2c66c3
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox58:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Comment 52•7 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/21e3f04ae989
status-firefox57:
--- → fixed
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
|
Whiteboard: [stockwell disable-recommended] → [stockwell fixed:other]
You need to log in
before you can comment on or make changes to this bug.
Description
•