stylo: Reftests application timed out after 330 seconds with no output

RESOLVED FIXED in Firefox 55

Status

Core
CSS Parsing and Computation
P1
normal
RESOLVED FIXED
5 months ago
15 days ago

People

(Reporter: Tomcat, Assigned: hiro)

Tracking

(Blocks: 1 bug, {intermittent-failure})

unspecified
mozilla55
intermittent-failure

Firefox Tracking Flags

(firefox55 fixed)

Details

(Whiteboard: [stockwell fixed], URL)

Attachments

(3 attachments)

(Reporter)

Description

5 months ago
[task 2017-03-10T11:47:38.437605Z] 11:47:38    ERROR - REFTEST ERROR | reftest | application timed out after 330 seconds with no output

See https://treeherder.mozilla.org/logviewer.html#?job_id=83018605&repo=autoland&lineNumber=1669 for an example. Not sure if this is an existing bug, but it seems to hit stylo reftests intermittently.

Comment 1

5 months ago
21 failures in 172 pushes (0.122 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 21

Platform breakdown:
* linux64-stylo: 21

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-10&endday=2017-03-10&tree=all
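
The rate quoted in these reports is simply failures divided by pushes. A minimal sketch of that arithmetic follows; the function name is hypothetical and is not part of OrangeFactor, and the three-decimal rounding is an assumption based on how the reports are printed.

```python
def failures_per_push(failures: int, pushes: int) -> float:
    """Return failures per push, rounded to three decimals (assumed report format)."""
    if pushes <= 0:
        raise ValueError("pushes must be positive")
    return round(failures / pushes, 3)


# Figures taken from the report above: 21 failures in 172 pushes.
print(failures_per_push(21, 172))
```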

Comment 2

5 months ago
21 failures in 790 pushes (0.027 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 21

Platform breakdown:
* linux64-stylo: 21

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-06&endday=2017-03-12&tree=all
Hm, looks like this is happening a fair bit, and there's no obvious place to start in the logs.

Time to throw up the bat signal I think. dmajor, do you have some cycles to help us out here?
Blocks: 1243581
Flags: needinfo?(dmajor)
Priority: -- → P1
Summary: Stylo Reftests application timed out after 330 seconds with no output → stylo: Reftests application timed out after 330 seconds with no output
Duplicate of this bug: 1340866

Comment 5

5 months ago
Is treeherder the only place to see these tests? Can I run them locally without jumping through hoops? Can I run them on Windows?
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #5)
> Is treeherder the only place to see these tests? Can I run them locally
> without jumping through hoops?

Yes, do a build with --enable-stylo in your mozconfig. Then do:

./mach reftest --disable-e10s --setpref=reftest.compareStyloToGecko=true layout/reftests/reftest-stylo.list

> Can I run them on Windows?

Xidorn does stylo development on windows so I think it works. That said, the only CI we have is linux64, and that's where the intermittent is, so I don't know how likely it is that the problem would reproduce on windows.

Another approach might be to get ASAN builds working (bug 1336013) and see if that spits out anything interesting on CI.
Assigning just to avoid having unassigned p1 bugs. Let me know if you aren't able to take this.
Assignee: nobody → dmajor
Whiteboard: [stockwell needswork]

Comment 8

5 months ago
> Yes, do a build with --enable-stylo in your mozconfig. Then do:
> 
> ./mach reftest --disable-e10s --setpref=reftest.compareStyloToGecko=true
> layout/reftests/reftest-stylo.list

Running locally on Windows, I notice a few things -

- I consistently hit bug 1347399 on layout/reftests/bugs/652991-2.html. (Or rather, I get a subsequent MOZ_CRASH, since my opt build skips the asserts)

- layout/reftests/bugs/613433-*.html consistently get stuck and won't proceed until I re-focus the reftest window (even if I'm hands-off for the whole run, not giving it any reason to lose focus)

- If I click around between windows while the reftest starts up, it gets into the same "stuck" state as above, and does nothing until I give focus to the reftest. This looks like the same thing that's happening in CI, but is it really something specific to stylo? My memory is fuzzy but I thought this was just par for the course with reftests in general.

Comment 9

5 months ago
15 failures in 156 pushes (0.096 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 13
* mozilla-inbound: 2

Platform breakdown:
* linux64-stylo: 15

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-15&endday=2017-03-15&tree=all
Does comment 8 help at all? If not, any tips on what to dig further into?
Flags: needinfo?(bobbyholley)
(In reply to David Major [:dmajor] from comment #10)
> Does comment 8 help at all?

From what I see in the logs, we never even run a single test, so I think it's a different issue. Either the browser or the harness is hanging on startup.

> If not, any tips on what to dig further into?

Aside from the ASAN thing, I would probably add logging in a bunch of places in the browser and harness startup sequence, and then do retriggers on try to see where it gets stuck.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> From what I see in the logs, we never even run a single test, so I think
> it's a different issue. Either the browser or the harness is hanging on
> startup.

Are you sure? When I get my local build into the stuck-on-startup state by changing window focus, my console has the same five lines as the CI log, ending at "Marionette INFO":

[task 2017-03-10T11:42:05.234776Z] 11:42:05     INFO - REFTEST INFO | Checking for orphan ssltunnel processes...
[task 2017-03-10T11:42:05.256820Z] 11:42:05     INFO - REFTEST INFO | Checking for orphan xpcshell processes...
[task 2017-03-10T11:42:05.295845Z] 11:42:05     INFO - REFTEST INFO | Running with e10s: False
[task 2017-03-10T11:42:05.296841Z] 11:42:05     INFO - REFTEST INFO | Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -profile /tmp/tmpfGoyAf.mozrunner
[task 2017-03-10T11:42:07.982796Z] 11:42:07     INFO - 1489146127979	Marionette	INFO	Listening on port 2828
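
One way to check that the Marionette line really is the last output before the hang is to measure the gap between it and the timeout error. The sketch below assumes that the "Listening on port 2828" line above and the "timed out after 330 seconds" line from the description come from the same job log; the helper name is hypothetical.

```python
from datetime import datetime


def parse_task_ts(line: str) -> datetime:
    """Extract the timestamp from a '[task <timestamp>Z] ...' log line."""
    start = line.index("[task ") + len("[task ")
    end = line.index("Z]", start)
    return datetime.strptime(line[start:end], "%Y-%m-%dT%H:%M:%S.%f")


last_output = (
    "[task 2017-03-10T11:42:07.982796Z] 11:42:07     INFO - "
    "1489146127979\tMarionette\tINFO\tListening on port 2828"
)
timeout = (
    "[task 2017-03-10T11:47:38.437605Z] 11:47:38    ERROR - REFTEST ERROR | "
    "reftest | application timed out after 330 seconds with no output"
)

# Seconds of silence between the last harness output and the timeout error.
gap = (parse_task_ts(timeout) - parse_task_ts(last_output)).total_seconds()
print(f"{gap:.1f} seconds of silence")
```

A gap of roughly 330 seconds would be consistent with nothing at all being logged after Marionette started listening.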
(In reply to David Major [:dmajor] from comment #12)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> > From what I see in the logs, we never even run a single test, so I think
> > it's a different issue. Either the browser or the harness is hanging on
> > startup.
> 
> Are you sure? When I get my local build into the stuck-on-startup state by
> changing window focus, my console has the same five lines as the CI log,
> ending at "Marionette INFO":

Oh, I misread your last point and didn't see that it was about startup. It could be a focus issue, though I thought that reftests were supposed to be able to deal with not being focused, or focus themselves (does the same thing happen in a non-stylo build?).

The treeherder summary also links to a failure screenshot, which looks...odd: https://public-artifacts.taskcluster.net/LSQ0KKfvTe6KIOb97HlexA/0/public/test_info//mozilla-test-fail-screenshot__bN4oS.png
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #13)

> The treeherder summary also links to a failure screenshot, which
> looks...odd:
> https://public-artifacts.taskcluster.net/LSQ0KKfvTe6KIOb97HlexA/0/public/
> test_info//mozilla-test-fail-screenshot__bN4oS.png

Oh, interesting! Locally I get two empty windows in the same size and position. The larger one on the left was supposed to become the window where the reftests are loaded, and the thing on the right is the auxiliary window.

I'll try non stylo...
I can make the same thing happen with a non-stylo build.
(In reply to David Major [:dmajor] from comment #15)
> I can make the same thing happen with a non-stylo build.

Bugzilla has a good number of hits for "reftest 330" in non-stylo builds. I wonder if they're all focus?

Comment 17

5 months ago
22 failures in 132 pushes (0.167 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 19
* mozilla-central: 2
* mozilla-inbound: 1

Platform breakdown:
* linux64-stylo: 22

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-16&endday=2017-03-16&tree=all
> - If I click around between windows while the reftest starts up, it gets
> into the same "stuck" state as above, and does nothing until I give focus to
> the reftest.

> I can make the same thing happen with a non-stylo build.

I get the feeling that this is a more general problem with reftests and window focus, and not really stylo-specific.

This is probably as far as I can take this without domain expertise. smaug, you show up the most on the log for nsFocusManager.cpp -- does the reftest focus code [1] seem ok to you? And does nsFocusManager::ClearFocus (which is what I assume |gBrowser.focus()| calls) do the right thing if the Firefox app itself doesn't have focus?

[1] https://dxr.mozilla.org/mozilla-central/rev/ff04d410e74b69acfab17ef7e73e7397602d5a68/layout/tools/reftest/reftest.jsm#413-421
Flags: needinfo?(bugs)

Comment 19

5 months ago
I'm not familiar with reftest setup... but what if gBrowser already has focus?

And ClearFocus? Where is that coming into play here?

http://searchfox.org/mozilla-central/rev/006005beff40d377cfd2f69d3400633c5ff09127/dom/interfaces/base/nsIFocusManager.idl#50 might be relevant here, depending on how reftests run. So activate the right top level window and then focus some element in it?
Flags: needinfo?(bugs)
As noted earlier, logs show this is basically a startup hang -- no tests are run.

Debug logs have several warnings on startup:

WARNING: stylo: No docshell yet, assuming Gecko style system: file /home/worker/workspace/build/src/dom/base/nsDocument.cpp, line 12983

WARNING: attempt to modify an immutable nsStandardURL: file /home/worker/workspace/build/src/netwerk/base/nsStandardURL.cpp, line 1644

WARNING: Failed to retarget HTML data delivery to the parser thread.: file /home/worker/workspace/build/src/parser/html/nsHtml5StreamParser.cpp, line 988

WARNING: NS_ENSURE_TRUE(standardURL) failed: file /home/worker/workspace/build/src/caps/nsPrincipal.cpp, line 229

WARNING: stylo: cannot get ServoStyleSheets from XBL bindings yet. See bug 1290276.: file /home/worker/workspace/build/src/layout/base/nsCSSFrameConstructor.cpp, line 2716

WARNING: stylo: ServoStyleSets cannot handle @font-face rules yet. See bug 1290237.: file /home/worker/workspace/build/src/dom/base/nsDocument.cpp, line 12877

I don't know if any of these are cause for concern / related to the hang.


There are also screenshots, which are consistent and strange. And crash reports after the timeout...but I don't see anything unexpected in them.
The earliest stylo reftest "timed out after 330 seconds" that I can find is https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=stylo%20reftest&tochange=452781c4ee876084bdc6a05a99d21597b7445724&fromchange=bdbd9679bbf1cb4c928fb5e2e049ea9906e737fc&selectedJob=82831814 ... but I don't see anything related in that changeset or the previous few changesets.

Comment 22

5 months ago
16 failures in 138 pushes (0.116 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 13
* mozilla-inbound: 3

Platform breakdown:
* linux64-stylo: 15
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-17&endday=2017-03-17&tree=all

Comment 23

5 months ago
114 failures in 777 pushes (0.147 failures/push) were associated with this bug in the last 7 days. 

This is the #12 most frequent failure this week. 

** This failure happened more than 75 times this week! Resolving this bug is a very high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 1 week, the affected test(s) may be disabled. **  

Repository breakdown:
* autoland: 87
* mozilla-inbound: 20
* mozilla-central: 7

Platform breakdown:
* linux64-stylo: 113
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-13&endday=2017-03-19&tree=all
(In reply to Geoff Brown [:gbrown] from comment #20)
> Debug logs have several warnings on startup:

All of those warnings appear in passing test runs as well.

(In reply to Olli Pettay [:smaug] from comment #19)
> And ClearFocus? Where is that coming into play here?

So the test says `gBrowser.focus()` which I assumed (guessed) lands in nsDOMWindowUtils::Focus with aElement == nullptr, which would lead to ClearFocus.
I am pretty sure this is (1) a problem with window focus and (2) not specific to stylo.

Can you find a more Gecko-knowledgeable owner to take it from here?
Flags: needinfo?(bobbyholley)
The main concern I have here is that this seems to be showing up only as a stylo-specific error (where are the other platforms?). I know this does happen on other platforms, but a few minutes of searching Bugzilla turns up only 2 bugs, with no activity in recent months (bug 1265229 and bug 1298796).

I did look at a few logs and I see this in the runner:
REFTEST INFO | Running with e10s: False

I filed bug 1348754 to look into why this is not in e10s mode.


From a sheriff perspective this looks like a stylo-reftest-specific failure, and it is one of the top failures. Please ensure this doesn't get passed around from team to team and that it receives appropriate attention. I would like to be a bit more patient here before disabling the tests or hiding them on treeherder. So far we are 1 week into this, and I would like to see this resolved in a few days.
(In reply to David Major [:dmajor] from comment #25)
> I am pretty sure this is (1) a problem with window focus and (2) not
> specific to stylo.

This does seem to be triggered by something with stylo. Maybe it's timing-related or incidental, but the correlation noted in comment 26 seems to be strong enough that I think it's something we need to fix.

It's also not clear to me that we really know that focus is the culprit here and not just the symptom. It seems to me that retriggering with logging per comment 6 would be a good way to bisect where in the startup pipeline we're getting stuck.

I can't emphasize enough that, IME, the only fully-general and reliable way to debug intermittent CI failures is to push logging and then hit the retrigger button five or ten times until the failure appears (rinse and repeat with more logging to answer the next question that arises).

> Can you find a more Gecko-knowledgeable owner to take it from here?

I certainly can't make you work on it - but the stylo team is swamped and there's no obvious person (either inside or outside the team) with special expertise to give this to. We really just need somebody who's good at debugging to attack this from first principles and narrow down the cause to something of the form: "this bug happens because we get hung up here with these abnormal inputs/state".

Your skillset seems like a good match for this, but if you really don't want to I can try to find somebody else.
Flags: needinfo?(bobbyholley)
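
For a rough idea of how many retriggers the "push logging and retrigger" approach needs, here is a minimal sketch. It assumes that each run fails independently with a fixed probability, and that the failures/push figure from the reports is a usable stand-in for that per-run probability; neither assumption is established in this bug, and the function name is hypothetical.

```python
import math


def retriggers_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - failure_rate)**n >= confidence."""
    if not 0 < failure_rate < 1 or not 0 < confidence < 1:
        raise ValueError("failure_rate and confidence must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


# 0.147 failures/push is the weekly figure reported in comment 23.
print(retriggers_needed(0.147))
```

Under those assumptions, ten retriggers at a rate of 0.147 would show the failure at least once about 80% of the time, and about 19 would be needed for 95%.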
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #27)
> Your skillset seems like a good match for this, but if you really don't want
> to I can try to find somebody else.

The reason I ask, which I should have been clearer about upfront, is that I've been asked to ramp up on Quantum Flow stuff and have several weeks of work-travel coming up. Combined with my general low enthusiasm for printf debugging, I don't see myself getting around to this in the immediate future.
I've been looking more at the range gbrown pointed out, specifically:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=stylo%20reftest&tochange=bfd89f8fb93aed915d449184213078a1b946454e&fromchange=adb5053309977cfdf18e29ab041f37abbbe00d60

The retriggers are sure starting to make it seem like this started with wlach's push. My money is on this bit:

https://hg.mozilla.org/integration/autoland/rev/ebdd7d5fa7450f7ae6d685a584f136908b69e356#l2.12

Presumably that somehow changed whether or not these tests actually run in e10s mode. Joel, what's going on here? Are we somehow running in some franken-configuration that's half-e10s half-non-e10s in stylo? That might explain why this is only showing up for stylo reftests.
Flags: needinfo?(jmaher)
(In reply to David Major [:dmajor] from comment #28)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #27)
> > Your skillset seems like a good match for this, but if you really don't want
> > to I can try to find somebody else.
> 
> I'll tell you why I ask, which I should have been more clear about upfront,
> is that I've been asked to ramp up on Quantum Flow stuff and have several
> weeks of work-travel coming up. Combined with my general low enthusiasm for
> printf debugging, I don't see myself getting around to this in the immediate
> future.

Ok, thanks for the heads-up. Hopefully this e10s configuration business will lead somewhere.
Assignee: dmajor → nobody

Comment 31

5 months ago
31 failures in 136 pushes (0.228 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 28
* mozilla-inbound: 2
* mozilla-central: 1

Platform breakdown:
* linux64-stylo: 31

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-20&endday=2017-03-20&tree=all
Component: Layout → CSS Parsing and Computation
I tried to reproduce this on try server unsuccessfully:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d808679009eb30842e2e2af1e1fdc8e15e1d17a6&filter-resultStatus=success&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable

This push has real e10s mode as well as non-e10s (which is what we appear to be running all the time anyway). I do wonder whether this will stop being an issue once we get things really running in e10s. Let's keep pushing on that until it is no longer a variable.
Flags: needinfo?(jmaher)
Ok - over to Joel for now until we get this e10s automation business sorted out in bug 1348754.
Assignee: nobody → jmaher

Comment 34

5 months ago
31 failures in 174 pushes (0.178 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 18
* mozilla-inbound: 6
* mozilla-central: 6
* mozilla-beta: 1

Platform breakdown:
* linux64-stylo: 30
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-21&endday=2017-03-21&tree=all

Comment 35

5 months ago
22 failures in 186 pushes (0.118 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 17
* mozilla-inbound: 5

Platform breakdown:
* linux64-stylo: 22

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-22&endday=2017-03-22&tree=all

Comment 36

5 months ago
33 failures in 174 pushes (0.19 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 24
* mozilla-central: 5
* mozilla-inbound: 4

Platform breakdown:
* linux64-stylo: 33

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-23&endday=2017-03-23&tree=all
As a note, I landed the code to run reftests in e10s. Let's check back on Monday and see where this stands (assuming it is merged around today).

Comment 38

5 months ago
28 failures in 153 pushes (0.183 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 24
* mozilla-inbound: 2
* mozilla-central: 2

Platform breakdown:
* linux64-stylo: 28

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-24&endday=2017-03-24&tree=all

Comment 39

5 months ago
161 failures in 898 pushes (0.179 failures/push) were associated with this bug in the last 7 days. 

This is the #2 most frequent failure this week. 

** This failure happened more than 75 times this week! Resolving this bug is a very high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 1 week, the affected test(s) may be disabled. **  

Repository breakdown:
* autoland: 123
* mozilla-inbound: 22
* mozilla-central: 15
* mozilla-beta: 1

Platform breakdown:
* linux64-stylo: 160
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-20&endday=2017-03-26&tree=all
Moving to e10s doesn't solve this; we are still seeing failures at the same rate.

Comment 41

5 months ago
22 failures in 141 pushes (0.156 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 19
* mozilla-inbound: 3

Platform breakdown:
* linux64-stylo: 22

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-27&endday=2017-03-27&tree=all
:bholley, after enabling e10s for the tests we still have a high failure rate. I cannot find more signs of non-stylo reftest timeouts on startup (or other timeouts related to browser startup/shutdown). Can you find someone to look at this? Given the high frequency of failure, I would like to see this addressed soon; otherwise we should move the reftests to tier-3 or disable them.
Flags: needinfo?(bobbyholley)
(Assignee)

Comment 43

5 months ago
Created attachment 8852252 [details] [diff] [review]
A difference between success log and failure log

FWIW, I did a quick check of the difference. In the failure case, it got stuck just before loading reftest-stylo.list.
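
A comparison like this can be sketched with Python's difflib after stripping the per-line task timestamps, which otherwise make every line differ. This is only an illustration of the idea, not the method used to produce the attachment. The function names are hypothetical, and the extra line in the passing log is a placeholder, not real harness output.

```python
import difflib
import re
from typing import List

# Matches the '[task <timestamp>] HH:MM:SS' prefix seen in the CI logs above.
TASK_PREFIX = re.compile(r"^\[task [^\]]+\]\s+\d{2}:\d{2}:\d{2}\s+")


def normalize(line: str) -> str:
    """Drop the timestamp prefix so that only the message is compared."""
    return TASK_PREFIX.sub("", line.rstrip("\n"))


def diff_logs(good: List[str], bad: List[str]) -> List[str]:
    """Unified diff of a passing log against a failing log, timestamps removed."""
    return list(difflib.unified_diff(
        [normalize(line) for line in good],
        [normalize(line) for line in bad],
        fromfile="success.log", tofile="failure.log", lineterm=""))


good_log = [
    "[task 2017-03-10T10:00:01.000000Z] 10:00:01     INFO - REFTEST INFO | Running with e10s: False",
    "[task 2017-03-10T10:00:02.000000Z] 10:00:02     INFO - <hypothetical next line in a passing run>",
]
bad_log = [
    "[task 2017-03-10T11:42:05.295845Z] 11:42:05     INFO - REFTEST INFO | Running with e10s: False",
]

for line in diff_logs(good_log, bad_log):
    print(line)
```

Lines prefixed with "-" appear only in the passing log, so the first such line marks where the failing run stopped making progress.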
(Assignee)

Updated

5 months ago
Attachment #8852252 - Attachment is patch: true
Attachment #8852252 - Attachment mime type: text/x-patch → text/plain
After the harness times out, it kills the browser to get a crash report. Most of the crash reports here are unhelpful -- either minidump_stackwalk finds a bad header in the minidump or the report is not symbolicated -- but there are a few "good" ones. Here are a few recent, symbolicated, crash reports:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=86656873&lineNumber=1834
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=86656854&lineNumber=1834
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=86705759&lineNumber=1677
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=86700057&lineNumber=1836
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=86755890&lineNumber=1834

They seem consistent, at least for thread 0:

[task 2017-03-27T13:25:16.397866Z] 13:25:16     INFO - Thread 0 (crashed)
[task 2017-03-27T13:25:16.398422Z] 13:25:16     INFO -  0  libc-2.15.so + 0xe7993
[task 2017-03-27T13:25:16.398553Z] 13:25:16     INFO -     rax = 0xfffffffffffffffc   rdx = 0xffffffffffffffff
[task 2017-03-27T13:25:16.399129Z] 13:25:16     INFO -     rcx = 0xffffffffffffffff   rbx = 0x00007fa123bf2320
[task 2017-03-27T13:25:16.399259Z] 13:25:16     INFO -     rsi = 0x0000000000000006   rdi = 0x00007fa0fe649a00
[task 2017-03-27T13:25:16.399810Z] 13:25:16     INFO -     rbp = 0x00007ffe93ea5e90   rsp = 0x00007ffe93ea5e50
[task 2017-03-27T13:25:16.400371Z] 13:25:16     INFO -      r8 = 0x0000000000000000    r9 = 0x0000000000000292
[task 2017-03-27T13:25:16.400497Z] 13:25:16     INFO -     r10 = 0x0000000000000000   r11 = 0x0000000000000293
[task 2017-03-27T13:25:16.401083Z] 13:25:16     INFO -     r12 = 0x00007fa118748689   r13 = 0x00000000ffffffff
[task 2017-03-27T13:25:16.401213Z] 13:25:16     INFO -     r14 = 0x0000000000000006   r15 = 0x0000000000000001
[task 2017-03-27T13:25:16.401796Z] 13:25:16     INFO -     rip = 0x00007fa123ee3993
[task 2017-03-27T13:25:16.402344Z] 13:25:16     INFO -     Found by: given as instruction pointer in context
[task 2017-03-27T13:25:16.402445Z] 13:25:16     INFO -  1  libxul.so!PollWrapper [nsAppShell.cpp:ccf27d7cdcdc : 46 + 0x10]
[task 2017-03-27T13:25:16.403498Z] 13:25:16     INFO -     rbp = 0x00007ffe93ea5e90   rsp = 0x00007ffe93ea5e80
[task 2017-03-27T13:25:16.404014Z] 13:25:16     INFO -     rip = 0x00007fa1187486b5
[task 2017-03-27T13:25:16.404097Z] 13:25:16     INFO -     Found by: stack scanning
[task 2017-03-27T13:25:16.404208Z] 13:25:16     INFO -  2  libglib-2.0.so.0.3200.4 + 0x47ff6
[task 2017-03-27T13:25:16.404295Z] 13:25:16     INFO -     rbp = 0x00007fa0fe649a00   rsp = 0x00007ffe93ea5ea0
[task 2017-03-27T13:25:16.404802Z] 13:25:16     INFO -     rip = 0x00007fa11f596ff6
[task 2017-03-27T13:25:16.404922Z] 13:25:16     INFO -     Found by: call frame info
[task 2017-03-27T13:25:16.405006Z] 13:25:16     INFO -  3  libglib-2.0.so.0.3200.4 + 0x48124
[task 2017-03-27T13:25:16.405551Z] 13:25:16     INFO -     rsp = 0x00007ffe93ea5ef0   rip = 0x00007fa11f597124
[task 2017-03-27T13:25:16.405656Z] 13:25:16     INFO -     Found by: stack scanning
[task 2017-03-27T13:25:16.405760Z] 13:25:16     INFO -  4  libxul.so!nsAppShell::ProcessNextNativeEvent [nsAppShell.cpp:ccf27d7cdcdc : 279 + 0x5]
[task 2017-03-27T13:25:16.406665Z] 13:25:16     INFO -     rsp = 0x00007ffe93ea5f10   rip = 0x00007fa1187486fb
[task 2017-03-27T13:25:16.406819Z] 13:25:16     INFO -     Found by: stack scanning
[task 2017-03-27T13:25:16.407289Z] 13:25:16     INFO -  5  libxul.so!nsBaseAppShell::DoProcessNextNativeEvent [nsBaseAppShell.cpp:ccf27d7cdcdc : 138 + 0x10]
[task 2017-03-27T13:25:16.407412Z] 13:25:16     INFO -     rsp = 0x00007ffe93ea5f20   rip = 0x00007fa11871d135
[task 2017-03-27T13:25:16.407476Z] 13:25:16     INFO -     Found by: stack scanning
[task 2017-03-27T13:25:16.407628Z] 13:25:16     INFO -  6  librt-2.15.so + 0x415d
[task 2017-03-27T13:25:16.408136Z] 13:25:16     INFO -     rsp = 0x00007ffe93ea5f30   rip = 0x00007fa1249e415d
[task 2017-03-27T13:25:16.408221Z] 13:25:16     INFO -     Found by: stack scanning
[task 2017-03-27T13:25:16.408350Z] 13:25:16     INFO -  7  libxul.so!nsBaseAppShell::OnProcessNextEvent [nsBaseAppShell.cpp:ccf27d7cdcdc : 289 + 0x8]
[task 2017-03-27T13:25:16.408891Z] 13:25:16     INFO -     rsp = 0x00007ffe93ea5f60   rip = 0x00007fa118720314
[task 2017-03-27T13:25:16.408980Z] 13:25:16     INFO -     Found by: stack scanning
[task 2017-03-27T13:25:16.409113Z] 13:25:16     INFO -  8  libxul.so!nsThread::ProcessNextEvent [nsThread.cpp:ccf27d7cdcdc : 1225 + 0xf]
[task 2017-03-27T13:25:16.409632Z] 13:25:16     INFO -     rsp = 0x00007ffe93ea5fb0   rip = 0x00007fa116d501e6
[task 2017-03-27T13:25:16.409719Z] 13:25:16     INFO -     Found by: stack scanning

https://dxr.mozilla.org/mozilla-central/source/widget/gtk/nsAppShell.cpp#46

That's not providing any insight for me, but I thought I'd point it out. And of course, there are dozens of other threads for crash report experts to consider.
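
To compare thread 0 across several crash reports, the symbolicated frames can be pulled out of the log with a regular expression. The sketch below is written against the frame format shown above. The pattern is an assumption derived from these lines only, and it would miss function names that contain spaces (for example some templates or operators).

```python
import re
from typing import List, Tuple

# Matches e.g. "INFO -  1  libxul.so!PollWrapper [nsAppShell.cpp:ccf27d7cdcdc : 46 + 0x10]"
FRAME_RE = re.compile(
    r"INFO -\s+(\d+)\s+(\S+?)!(\S+) \[(\S+?):\w+ : (\d+) \+ 0x[0-9a-f]+\]"
)


def symbolicated_frames(lines: List[str]) -> List[Tuple[int, str, str, str, int]]:
    """Return (frame index, module, function, source file, line) per symbolicated frame."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            idx, module, func, src, lineno = m.groups()
            frames.append((int(idx), module, func, src, int(lineno)))
    return frames


sample = [
    "[task 2017-03-27T13:25:16.398422Z] 13:25:16     INFO -  0  libc-2.15.so + 0xe7993",
    "[task 2017-03-27T13:25:16.402445Z] 13:25:16     INFO -  1  libxul.so!PollWrapper [nsAppShell.cpp:ccf27d7cdcdc : 46 + 0x10]",
    "[task 2017-03-27T13:25:16.403498Z] 13:25:16     INFO -     rbp = 0x00007ffe93ea5e90   rsp = 0x00007ffe93ea5e80",
    "[task 2017-03-27T13:25:16.405760Z] 13:25:16     INFO -  4  libxul.so!nsAppShell::ProcessNextNativeEvent [nsAppShell.cpp:ccf27d7cdcdc : 279 + 0x5]",
]

for frame in symbolicated_frames(sample):
    print(frame)
```

Unsymbolicated frames (such as frame 0 in libc) and register dumps are skipped, which leaves only the frames that can be compared by name between reports.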

Comment 45

5 months ago
35 failures in 165 pushes (0.212 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 24
* mozilla-inbound: 11

Platform breakdown:
* linux64-stylo: 35

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-28&endday=2017-03-28&tree=all
Depends on: 1351518
Hiro's log diff in comment 43 led us to a theory in bug 1351518. Just pushed that, fingers crossed.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #46)
> Hiro's log diff in comment 43 led us to a theory in bug 1351518. Just pushed
> that, fingers crossed.

Didn't fix it: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=87151917

I think the next step is to do some pushes with logging in the harness and try to figure out where things are getting dropped on the floor. Hiro is looking at that (though having trouble triggering the crash with logging):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d4c84bd4fb7152307c8a6270c62645449e8eafc&selectedJob=87136973

Over to him for now.
Assignee: jmaher → hikezoe
(Assignee)

Comment 48

5 months ago
I am almost convinced that we can't reproduce this timeout on *try servers* if we try revisions based on current mozilla-central.
Actually, I found a couple of recent try pushes [1][2][3][4][5] that hit this timeout, but all of them are based on old revisions. The newest of them is [5], which is based on https://hg.mozilla.org/try/rev/19289cc8bf6f .

[1] https://treeherder.mozilla.org/logviewer.html#?job_id=86546334&repo=try&lineNumber=1669
[2] https://treeherder.mozilla.org/logviewer.html#?job_id=86480286&repo=try&lineNumber=1669
[3] https://treeherder.mozilla.org/logviewer.html#?job_id=86537583&repo=try&lineNumber=1668
[4] https://treeherder.mozilla.org/logviewer.html#?job_id=86138085&repo=try&lineNumber=1796
[5] https://treeherder.mozilla.org/logviewer.html#?job_id=85807974&repo=try&lineNumber=1838

I am not sure why we still fail on m-c or other branches, but the difference is a clue to track this bug down.
Joel, do you know the difference between the try server and the other production servers?
Flags: needinfo?(jmaher)
(Assignee)

Comment 49

5 months ago
OK, I just realized that the failure cases happened on Ubuntu 12.04 (though I only checked several failure logs).

I think using the desktop1604-test docker image will solve this.
Flags: needinfo?(jmaher)
Comment hidden (mozreview-request)
(Assignee)

Comment 51

5 months ago
Oh wait. We seem to use only ubuntu 12.04 for stylo reftest both on try and aurora (maybe m-c as well).
(Assignee)

Updated

5 months ago
Attachment #8852324 - Attachment is obsolete: true
Attachment #8852324 - Flags: review?(jmaher)
In bug 1309086 we started running reftests on 16.04:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/test/tests.yml#960

But in reftests-stylo we do not specify the newer OS version:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/ci/test/tests.yml#1042

Just adding one line will make a big difference there. Good find!
Let's check: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ff979905dabf69104ab12c91e1f7ce3660f1788c
(In reply to Geoff Brown [:gbrown] from comment #53)

It looks like switching to Ubuntu 16.04 introduces several failures, but they seem consistent, some are unexpected passes, and I don't see any timeouts. I suspect switching to 16.04 and updating stylo reftest expectations is the way forward.

:hiro - Let me know if I can help with anything.
(Assignee)

Comment 55

5 months ago
Geoff, I think switching to Ubuntu 16.04 is worth doing. If there is no problem with regard to server resources or the like, we should try it.

Note: As far as I can tell, we can't reproduce this timeout on the *try* server if we use a revision based on recent m-c. Joel's try in comment 8 couldn't reproduce it either. There must have been some change early this month that solved this timeout only on *try*.
(Assignee)

Updated

5 months ago
Attachment #8852324 - Attachment is obsolete: false
Attachment #8852324 - Flags: review?(jmaher)
(In reply to Hiroyuki Ikezoe (:hiro) from comment #55)
> Geoff, I think switching to ubuntu 16.04 is worthwhile doing. If there is no
> problem with regard to server resources something.  We should try it.

I think it is fine for server resources; in fact, I think 16.04 is preferred.

> Note: As far as I can tell we can't reproduce this timeout on *try* server
> if we use the revision based on recent m-c.  Joel's try in comment 8
> couldn't reproduce it either.  There must be some changes in early this
> month that solved this timeout only on *try*.

That is strange, and it reminds me of bug 1348754, but I can't think of anything else which could be different on try.
(In reply to Geoff Brown [:gbrown] from comment #54)
> (In reply to Geoff Brown [:gbrown] from comment #53)
> 
> It looks like switching to Ubuntu 16.04 introduces several failures, but
> they seem consistent, some are unexpected passes, and I don't see any
> timeouts. I suspect switching to 16.04 and updating stylo reftest
> expectations is the way forward.

Yes, this sounds like the right approach to me.

Hiro is a hero!

Comment 58

5 months ago
44 failures in 188 pushes (0.234 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 38
* mozilla-inbound: 5
* mozilla-central: 1

Platform breakdown:
* linux64-stylo: 44

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-29&endday=2017-03-29&tree=all
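
For readers checking the numbers in these daily summaries: the "failures/push" figure appears to be the failure count divided by the push count, rounded to three decimals. A minimal sketch of that arithmetic follows; the function name `failures_per_push` is hypothetical and is not part of OrangeFactor.

```python
def failures_per_push(failures, pushes):
    """Return failures divided by pushes, rounded to three decimals.

    Hypothetical helper for illustration only; it mirrors the ratio
    printed in the summary above, not any real OrangeFactor API.
    """
    if pushes <= 0:
        raise ValueError("pushes must be positive")
    return round(failures / pushes, 3)


# Figures from the summary above: 44 failures in 188 pushes.
print(failures_per_push(44, 188))  # prints 0.234
```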

Comment 59

5 months ago
mozreview-review
Comment on attachment 8852324 [details]
Bug 1346232 - Use Ubuntu 16.04 docker image for stylo reftest to avoid timeouts.

https://reviewboard.mozilla.org/r/124590/#review127452

This looks good, but changing to ubuntu 16.04 will change which tests pass and fail, as my push in comment 53 demonstrated. :hiro, are you preparing a separate patch to update the test annotations?
(Assignee)

Comment 60

5 months ago
OK, not yet, but I will do so.
(Assignee)

Comment 61

5 months ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0b50aeefb27e1ef21d8c7fa5290c7c23a8159795
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)

Comment 64

5 months ago
mozreview-review
Comment on attachment 8852324 [details]
Bug 1346232 - Use Ubuntu 16.04 docker image for stylo reftest to avoid timeouts.

https://reviewboard.mozilla.org/r/124590/#review127606

one line reviews are easy
Attachment #8852324 - Flags: review?(jmaher) → review+

Comment 65

5 months ago
mozreview-review
Comment on attachment 8852811 [details]
Bug 1346232 - Update reftest expectations.

https://reviewboard.mozilla.org/r/124974/#review127610

this is enabling many tests, a big win!

::: layout/reftests/line-breaking/reftest-stylo.list
(Diff revision 1)
>  == punctuation-open-3.html punctuation-open-3.html
>  == punctuation-open-4.html punctuation-open-4.html
>  == quotationmarks-1.html quotationmarks-1.html
> -# The following is currently disabled on Linux because of a rendering issue with missing-glyph
> +== quotationmarks-cjk-1.html quotationmarks-cjk-1.html
> -# representations on the test boxes. See bug
> -fails == quotationmarks-cjk-1.html quotationmarks-cjk-1.html

Odd, this still fails on non-stylo. If this result is valid, possibly we can remove the skip-if(gtkWidget) from the reftest.list file as well :)
Attachment #8852811 - Flags: review+
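
To make the manifest change in this review concrete: the patch drops the `fails` annotation so that `quotationmarks-cjk-1.html` is expected to pass again. The sketch below is a simplified, hypothetical parser (it is not the reftest harness code) that assumes a line has the shape `[annotations...] ==|!= test reference` and treats only lines starting with `#` as comments.

```python
def parse_manifest_line(line):
    """Parse one simplified reftest manifest line.

    Hypothetical helper for illustration; the real harness supports
    more directives (include, HTTP, load, and so on) than handled here.
    Returns None for blank lines, comments, and unrecognized lines.
    """
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    tokens = text.split()
    for i, token in enumerate(tokens):
        if token in ("==", "!="):
            if len(tokens) < i + 3:
                return None
            return {
                "annotations": tokens[:i],  # e.g. ["fails"] or []
                "op": token,
                "test": tokens[i + 1],
                "reference": tokens[i + 2],
            }
    return None


before = parse_manifest_line(
    "fails == quotationmarks-cjk-1.html quotationmarks-cjk-1.html")
after = parse_manifest_line(
    "== quotationmarks-cjk-1.html quotationmarks-cjk-1.html")
print(before["annotations"], after["annotations"])
```

With the `fails` annotation present, the harness expects the comparison to fail; with an empty annotation list, the test is expected to pass.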
(Assignee)

Comment 66

5 months ago
(In reply to Joel Maher ( :jmaher) from comment #65)
> Comment on attachment 8852811 [details]
> Bug 1346232 - Update reftest expectations.
> 
> https://reviewboard.mozilla.org/r/124974/#review127610
> 
> this is enabling many tests, a big win!
> 
> ::: layout/reftests/line-breaking/reftest-stylo.list
> (Diff revision 1)
> >  == punctuation-open-3.html punctuation-open-3.html
> >  == punctuation-open-4.html punctuation-open-4.html
> >  == quotationmarks-1.html quotationmarks-1.html
> > -# The following is currently disabled on Linux because of a rendering issue with missing-glyph
> > +== quotationmarks-cjk-1.html quotationmarks-cjk-1.html
> > -# representations on the test boxes. See bug
> > -fails == quotationmarks-cjk-1.html quotationmarks-cjk-1.html
> 
> Odd, this still fails on non-stylo. If this result is valid, possibly we
> can remove the skip-if(gtkWidget) from the reftest.list file as well :)

Oh, indeed. I will talk with Masayuki about this tomorrow.

Thank you for the review!

Comment 67

5 months ago
mozreview-review
Comment on attachment 8852811 [details]
Bug 1346232 - Update reftest expectations.

https://reviewboard.mozilla.org/r/124974/#review127634

This looks great. Thanks so much!
Attachment #8852811 - Flags: review?(gbrown) → review+

Comment 68

5 months ago
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/245fb9f42112
Use Ubuntu 16.04 docker image for stylo reftest to avoid timeouts. r=jmaher
https://hg.mozilla.org/integration/autoland/rev/26a362c81067
Update reftest expectations. r=jmaher

Comment 69

5 months ago
21 failures in 123 pushes (0.171 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 16
* mozilla-central: 3
* mozilla-inbound: 2

Platform breakdown:
* linux64-stylo: 21

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-30&endday=2017-03-30&tree=all
(Reporter)

Comment 70

5 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/245fb9f42112
https://hg.mozilla.org/mozilla-central/rev/26a362c81067
Status: NEW → RESOLVED
Last Resolved: 5 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Whiteboard: [stockwell needswork] → [stockwell fixed]

Updated

5 months ago
Depends on: 1345283
(Assignee)

Updated

5 months ago
No longer depends on: 1345283
(Assignee)

Comment 71

5 months ago
Results after landing are pretty good.  I am really happy to help you guys, Bobby and Joel!

Comment 72

5 months ago
154 failures in 845 pushes (0.182 failures/push) were associated with this bug in the last 7 days. 

This is the #6 most frequent failure this week. 

** This failure happened more than 75 times this week! Resolving this bug is a very high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 1 week, the affected test(s) may be disabled. **  

Repository breakdown:
* autoland: 114
* mozilla-inbound: 33
* mozilla-central: 7

Platform breakdown:
* linux64-stylo: 153
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-27&endday=2017-04-02&tree=all
(In reply to Hiroyuki Ikezoe (:hiro) from comment #71)
> Results after landing are pretty good.  I am really happy to help you guys,
> Bobby and Joel!

Awesome, thanks so much for figuring this one out Hiro!
Duplicate of this bug: 1352869