Intermittent REFTEST ERROR | Got suite_end message before suite_start. Logged with data: {"thread": "None", "extra": {"results": {"AssertionUnexpectedFixed": 0, "Exception": 0, "UnexpectedFail": 0, "UnexpectedPass": 0, "Assert

RESOLVED FIXED in Firefox 57

Status

Testing
Reftest
RESOLVED FIXED
11 months ago
a month ago

People

(Reporter: Treeherder Bug Filer, Assigned: gbrown)

Tracking

({intermittent-failure})

Version 3
mozilla58
intermittent-failure
Points:
---

Firefox Tracking Flags

(firefox57 fixed, firefox58 fixed)

Details

(Whiteboard: [stockwell fixed:other])

MozReview Requests

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(1 attachment)

(Reporter)

Description

11 months ago
treeherder
Filed by: philringnalda [at] gmail.com

https://treeherder.mozilla.org/logviewer.html#?job_id=8214912&repo=autoland

https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-linux/1482276113/autoland_ubuntu32_vm_test-crashtest-bm01-tests1-linux32-build65.txt.gz

https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-linux/1482276113/autoland_ubuntu32_vm_test-crashtest-bm01-tests1-linux32-build65.txt.gz&only_show_unexpected=1

Comment 1

6 months ago
5 failures in 879 pushes (0.006 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 5

Platform breakdown:
* windows8-64: 3
* osx-10-10: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-05-08&endday=2017-05-14&tree=all

Comment 2

5 months ago
11 failures in 814 pushes (0.014 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* try: 11

Platform breakdown:
* linux64: 6
* linux64-qr: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-06-12&endday=2017-06-18&tree=all

Comment 3

5 months ago
1 failures in 892 pushes (0.001 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-06-19&endday=2017-06-25&tree=all

Comment 4

5 months ago
1 failures in 718 pushes (0.001 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-central: 1

Platform breakdown:
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-06-26&endday=2017-07-02&tree=all

Comment 5

4 months ago
2 failures in 720 pushes (0.003 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 2

Platform breakdown:
* linux64-qr: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-07-10&endday=2017-07-16&tree=all

Comment 6

4 months ago
1 failures in 822 pushes (0.001 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-central: 1

Platform breakdown:
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-07-17&endday=2017-07-23&tree=all

Comment 7

4 months ago
6 failures in 1008 pushes (0.006 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 6

Platform breakdown:
* windows7-32: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-07-24&endday=2017-07-30&tree=all

Comment 8

3 months ago
3 failures in 949 pushes (0.003 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* try: 2
* mozilla-inbound: 1

Platform breakdown:
* windows7-32: 1
* osx-10-10: 1
* macosx64-stylo: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-08-14&endday=2017-08-20&tree=all

Comment 9

3 months ago
17 failures in 173 pushes (0.098 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* try: 7
* autoland: 7
* mozilla-central: 3

Platform breakdown:
* macosx64-stylo: 7
* windows8-64: 6
* osx-10-10: 3
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-08-23&endday=2017-08-23&tree=all
I will see if :gbrown can take a stab at this next week.
Flags: needinfo?(gbrown)
Whiteboard: [stockwell needswork]

Comment 11

3 months ago
50 failures in 908 pushes (0.055 failures/push) were associated with this bug in the last 7 days. 

This is the #44 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* autoland: 23
* try: 20
* mozilla-central: 5
* mozilla-inbound: 2

Platform breakdown:
* macosx64-stylo: 28
* windows8-64: 10
* osx-10-10: 7
* windows7-32: 3
* macosx64-nightly: 1
* linux32-stylo: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-08-21&endday=2017-08-27&tree=all
(Assignee)

Comment 12

3 months ago
These logs do have a suite_start record, but it is in raw format, *very* long, and truncated.

https://public-artifacts.taskcluster.net/Hb9XD2WHQMmH1alNmP_7HQ/0/public/logs/live_backing.log

08:02:26     INFO - {"action":"suite_start","time":1503932545985,"thread":null,"pid":null,"source":"reftest","tests":["data:text/html,<body> == about:blank","data:text/plain, == about:blank","data:text/plain,HELLO != about:blank","file:///Users/cltbld/tasks/task_1503932423/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async.xul == file:///Users/cltbld/tasks/task_1503932423/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async-ref.xul"

...

,"file:///Users/cltbld/tasks/task_1503932423/build/tests/reftest/tests/layout/reftests/reftest-sanity/font-default
08:02:26     INFO - REFTEST INFO | Running chunk 8 out of 8 chunks.  tests 13672-15631/1960

(That line is 19456 characters long.)
(In reply to Geoff Brown [:gbrown] from comment #12)
> 08:02:26     INFO - REFTEST INFO | Running chunk 8 out of 8 chunks.  tests
> 13672-15631/1960

The first two numbers in the line above should not be greater than the 3rd.. I wonder if it's not calculating the tests to run properly and somehow thinks there are 15k tests when there should only be ~2k. Would certainly explain why it is way larger than normal.
(Assignee)

Comment 14

3 months ago
I see the same curiosity in successful runs:

12:07:29     INFO - REFTEST SUITE-START | Running 15375 tests
12:07:29     INFO - REFTEST INFO | Running chunk 8 out of 8 chunks.  tests 13672-15631/1960

https://dxr.mozilla.org/mozilla-central/rev/1b4c59eef820b46eb0037aca68f83a15088db45f/layout/tools/reftest/reftest.jsm#556
This is probably also a good time to mention bug 1373745. I'm also currently working on standing up some reftest selftests in bug 1392390. Reftests are kind of a mess right now and have been neglected a bit too long. Hopefully between this and those other two bugs we can get them into a more sustainable state.
(Assignee)

Updated

3 months ago
Assignee: nobody → gbrown
Flags: needinfo?(gbrown)
(Assignee)

Updated

3 months ago
See Also: → bug 1394957
(Assignee)

Comment 16

3 months ago
I managed to reproduce with extra logging and found a ValueError was thrown at:

https://dxr.mozilla.org/mozilla-central/rev/db7f19e26e571ae1dd309f5d2f387b06ba670c30/layout/tools/reftest/output.py#117

with:

Unterminated string starting at: line 1 column 563085 (char 563085)

https://public-artifacts.taskcluster.net/WafXjp7QStmHAp65_GOG6g/0/public/logs/live_backing.log
So looks like it's a buffering issue, but I don't get why that would be intermittent :/. The length of that message shouldn't be changing within the same push.

Comment 18

3 months ago
28 failures in 939 pushes (0.03 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* try: 11
* autoland: 11
* mozilla-central: 4
* mozilla-inbound: 2

Platform breakdown:
* macosx64-stylo: 13
* windows8-64: 8
* windows7-32: 3
* windows7-32-nightly: 1
* osx-10-10: 1
* linux64: 1
* linux32-stylo: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-08-28&endday=2017-09-03&tree=all
(Assignee)

Comment 19

2 months ago
Failure frequency seems a bit less over the last few days, and less concentrated on one platform...finding it hard to reproduce to get further diagnostics.

Comment 20

2 months ago
11 failures in 924 pushes (0.012 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 6
* try: 4
* mozilla-inbound: 1

Platform breakdown:
* osx-10-10: 4
* macosx64-stylo-disabled: 2
* macosx64-stylo: 2
* linux32-stylo-disabled: 2
* macosx64-devedition: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-09-04&endday=2017-09-10&tree=all
Whiteboard: [stockwell needswork] → [stockwell unknown]
(Assignee)

Comment 21

2 months ago
I thought there might be a weakness in mozprocess' ProcessReader. I read that closely today and could not find a problem: It should reliably read from <file>.readline() and pass the result to the OutputHandler.

Comment 22

2 months ago
19 failures in 1032 pushes (0.018 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-central: 8
* autoland: 6
* mozilla-inbound: 3
* try: 1
* mozilla-beta: 1

Platform breakdown:
* osx-10-10: 12
* windows8-64: 2
* macosx64-stylo-disabled: 2
* macosx64-nightly: 2
* windows7-32-vm: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-09-11&endday=2017-09-17&tree=all

Comment 23

2 months ago
23 failures in 175 pushes (0.131 failures/push) were associated with this bug yesterday.    

Repository breakdown:
* autoland: 12
* mozilla-inbound: 9
* mozilla-central: 1
* mozilla-beta: 1

Platform breakdown:
* windows10-64: 20
* windows8-64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-09-19&endday=2017-09-19&tree=all

Comment 24

2 months ago
48 failures in 943 pushes (0.051 failures/push) were associated with this bug in the last 7 days. 

This is the #37 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. **  

Repository breakdown:
* mozilla-inbound: 22
* autoland: 16
* mozilla-central: 6
* mozilla-beta: 4

Platform breakdown:
* windows10-64: 28
* osx-10-10: 9
* windows8-64: 6
* macosx64-stylo-disabled: 3
* windows7-32-vm: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-09-18&endday=2017-09-24&tree=all

Comment 25

2 months ago
15 failures in 885 pushes (0.017 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-beta: 5
* mozilla-inbound: 4
* mozilla-central: 4
* autoland: 2

Platform breakdown:
* osx-10-10: 5
* macosx64-stylo-disabled: 3
* windows7-32-vm: 2
* windows10-64: 2
* windows7-32: 1
* macosx64-devedition: 1
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-09-25&endday=2017-10-01&tree=all

Comment 26

2 months ago
73 failures in 173 pushes (0.422 failures/push) were associated with this bug yesterday.    

Repository breakdown:
* mozilla-inbound: 47
* autoland: 21
* mozilla-central: 4
* mozilla-beta: 1

Platform breakdown:
* windows10-64: 71
* windows7-32-vm: 1
* osx-10-10: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-10-03&endday=2017-10-03&tree=all
(Assignee)

Comment 27

2 months ago
This became much, much more frequent yesterday, on windows 10 debug. I wonder why?
Flags: needinfo?(gbrown)
Whiteboard: [stockwell unknown] → [stockwell needswork]

Comment 28

2 months ago
73 failures in 175 pushes (0.417 failures/push) were associated with this bug yesterday.    

** This test has failed more than 200 times in the last 30 days. It should be disabled until it can be fixed. ** 

Repository breakdown:
* autoland: 43
* mozilla-inbound: 28
* try: 2

Platform breakdown:
* windows10-64: 73

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-10-04&endday=2017-10-04&tree=all
Whiteboard: [stockwell needswork] → [stockwell disable-recommended]
(Assignee)

Comment 29

2 months ago
This is easy to reproduce on Windows 10/Debug now:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4303dd32a0de65a7feedc2e23e72edcffbf05ca0

I see mozprocess calling readline() [1] and sometimes getting the suite-start record in two different lines; the first line ends with a --DOMWINDOW record, from [2]. That is, it looks like the "--DOMWINDOW ..." is inserted in the suite-start; I assume the eol at the end of the --DOMWINDOW causes the suite-start to be broken down into two records:

<<line 1>> {"action":"suite_start","time":1507164418625,"thread":null, ... "file:///C:/slave/test/buil--DOMWINDOW == 2 (000001EB3BD22800) [pid = 7012] [serial = 2] [outer = 0000000000000000] [url = about:blank]\n
<<line 2>> d/tests/reftest/tests/layout/reftests/...== http://localhost:49864/1507164415119/666/reftest/gizmo.mp4.55thframe-ref.html"],"runinfo":{"skipped":264}}\n

(Line lengths differ from one run to another, but the first line seems to typically be 200 KB - 500 KB; the second line closer to 2 MB.)

Is Firefox interleaving stdout and stderr from different threads? Seeing this, I am curious -- why are mochitests not failing the same way?

[1] https://dxr.mozilla.org/mozilla-central/rev/c97190c389c4cfef20fe55b4bacade95a36ae6ef/testing/mozbase/mozprocess/mozprocess/processhandler.py#932
[2] https://dxr.mozilla.org/mozilla-central/rev/c97190c389c4cfef20fe55b4bacade95a36ae6ef/dom/base/nsGlobalWindow.cpp#1770
I believe mochitest is by design as per comments in bug 1405762.
I think bug 1405762 is unrelated to this. I think mochitest doesn't fail because it uses a special log delimiter:
https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#151

Maybe reftest needs to do the same thing? I don't remember why we added this for mochitest, but not reftest. I guess reftest *just worked* when this first got implemented so we didn't need it.
(Assignee)

Comment 32

2 months ago
The delimiter is definitely interesting. Thanks for that.

BTW, the Windows 10 Debug troubles may have started here:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=win%20x64%20debug%20reftest&tochange=cc51fec33925ca29c0e60a42db9636d76a21dc52&fromchange=5ef005eb34d90fe2932e08938a38aa92d83b995b

but I don't see how those changes could be related, other than perhaps changing some startup timing.
I have a potential fix on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=4cf8e503b6298edae68773467744775c3a4d40bf
Looks like it didn't work.
Oh right, so this is the same underlying issue we've seen in countless other places on Windows.

On Windows, when stdout and stderr are buffered, they're not line-buffered, they're buffered in 4k chunks. This means that you can get dumb interleaved output.

We tried fixing this (in a bug I'll have to dig up) by making mozprocess read both streams on a thread and line-buffer them, but it caused issues with some test harnesses because the ordering of some lines changed, I think it might have broken leak log checking or something like that.
That was bug 798300. See also bug 1285254 for where we've hit the same issue with mach parsing the output of make.
Thanks that sounds like it! I guess the reason we don't hit this in mochitest is because the suite_start log is under 4k.

One workaround, would be to pass stderrOutputLine to mozprocess from reftest. This would keep stderr separate from stdout, but would no longer guarantee proper log ordering.

Another workaround would be to cut down the size of the suite_start message. The easiest way to do this would be to simply chunk this more. This also explains why it's more prevalent on Windows10, we only have two chunks there.
(Assignee)

Comment 38

2 months ago
(In reply to Andrew Halberstadt [:ahal] from comment #37)
> Another workaround would be to cut down the size of the suite_start message.
> The easiest way to do this would be to simply chunk this more. This also
> explains why it's more prevalent on Windows10, we only have two chunks there.

The current Windows 10 Debug suite-start message is over 2MB, so that approach may not be feasible.

99% of the message is the list of tests, each of which includes full paths:

"file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async.xul == file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/reftest-sanity/test-async-ref.xul"

There's lots of redundancy there, so some sort of relative path scheme might reduce it by 50%? It will still be a lot bigger than 4K.
Flags: needinfo?(gbrown)
At the very least, the fewer 4k chunks there are the less likely that an errant stderr message can slip in. I do agree we should also use relative file paths though.

This would also likely get fixed by run-by-manifest (bug 1353461)
Nvm, there'd still only be a single suite_start with run-by-manifest.
(Assignee)

Updated

2 months ago
See Also: → bug 1406160
(Assignee)

Comment 41

2 months ago
(In reply to Geoff Brown [:gbrown] from comment #29)
> why are mochitests not failing the same way?

A big difference is timing and process management: The mochitest python harness emits suite_start before starting firefox; for reftests, suite_start is emitted by reftest.jsm, in a firefox session that is probably just starting and producing DOMWINDOW chatter.

(The mochitest test list may also be shorter.)
Good call, this would get fixed by bug 1353461 after all then. Part of my plan there is to move the manifest parsing into python, which means we would also be logging suite_start from there.
(Assignee)

Comment 43

2 months ago
:ahal -- https://treeherder.mozilla.org/#/jobs?repo=try&revision=931c415c99c8f2051bae2b1f59a0880a9e1511ee looks very promising! Going ahead with that?

I'm also trying:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=423def6405e1b065cfe0ea76fee9ed275a9735bc
Flags: needinfo?(ahalberstadt)

Comment 44

2 months ago
43 failures in 119 pushes (0.361 failures/push) were associated with this bug yesterday.    

Repository breakdown:
* mozilla-inbound: 28
* autoland: 13
* try: 2

Platform breakdown:
* windows10-64: 42
* macosx64-stylo-disabled: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-10-05&endday=2017-10-05&tree=all
(In reply to Geoff Brown [:gbrown] from comment #43)
> :ahal --
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=931c415c99c8f2051bae2b1f59a0880a9e1511ee looks very
> promising! Going ahead with that?

The last time we landed a patch that did this it got backed out for intermittent leakcheck failures due to the log order changing. So it'll need a lot more testing, but assuming the testing works I'd vote we take it.

But like Ted mentioned in bug 798300, even if we do hit the leakcheck intermittent, we could just fix the leakcheck code to not be so dependent on the ordering of logs. The only other reason not to take it is that some developers might object to not preserving the "true" log ordering. I guess we could get away with only splitting the streams on Windows.
Flags: needinfo?(ahalberstadt)
Comment hidden (mozreview-request)
(Assignee)

Comment 47

a month ago
mozreview-review
Comment on attachment 8915968 [details]
Bug 1324961 - [reftest] Process stdout/stderr separately on Windows,

https://reviewboard.mozilla.org/r/187252/#review192314
Attachment #8915968 - Flags: review?(gbrown) → review+
This looks pretty decent on try:
https://treeherder.mozilla.org/#/jobs?repo=try&author=ahalberstadt@mozilla.com&fromchange=931c415c99c8f2051bae2b1f59a0880a9e1511ee&tochange=8124b468d047527eab6ce1b07cfae95de12cf7a7

Though there's a good chance it'll cause an infrequent intermittent. I guess as long as the intermittent it causes is less frequent than the one it solves, it's a net positive.

Comment 49

a month ago
16 failures in 91 pushes (0.176 failures/push) were associated with this bug yesterday.    

Repository breakdown:
* mozilla-inbound: 9
* autoland: 4
* try: 3

Platform breakdown:
* windows10-64: 9
* windows10-64-stylo-disabled: 6
* osx-10-10: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-10-06&endday=2017-10-06&tree=all

Comment 50

a month ago
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8cf03b2c66c3
[reftest] Process stdout/stderr separately on Windows, r=gbrown
https://hg.mozilla.org/mozilla-central/rev/8cf03b2c66c3
Status: NEW → RESOLVED
Last Resolved: a month ago
status-firefox58: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla58

Comment 52

a month ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/21e3f04ae989
status-firefox57: --- → fixed

Comment 53

a month ago
276 failures in 824 pushes (0.335 failures/push) were associated with this bug in the last 7 days. 

This is the #2 most frequent failure this week. 

** This failure happened more than 75 times this week! Resolving this bug is a very high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 1 week, the affected test(s) may be disabled. **   

Repository breakdown:
* mozilla-inbound: 137
* autoland: 113
* mozilla-central: 17
* try: 7
* mozilla-beta: 2

Platform breakdown:
* windows10-64: 252
* windows10-64-stylo-disabled: 12
* osx-10-10: 5
* macosx64-nightly: 3
* windows7-32-vm: 2
* macosx64-stylo-disabled: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1324961&startday=2017-10-02&endday=2017-10-08&tree=all
Whiteboard: [stockwell disable-recommended] → [stockwell fixed:other]
You need to log in before you can comment on or make changes to this bug.