The default bug view has changed. See this FAQ.

stylo: Reftests application timed out after 330 seconds with no output

NEW
Assigned to

Status

()

Core
CSS Parsing and Computation
P1
normal
12 days ago
a day ago

People

(Reporter: Tomcat, Assigned: jmaher)

Tracking

(Blocks: 1 bug, {intermittent-failure})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [stockwell needswork], URL)

(Reporter)

Description

12 days ago
[task 2017-03-10T11:47:38.437605Z] 11:47:38    ERROR - REFTEST ERROR | reftest | application timed out after 330 seconds with no output

like https://treeherder.mozilla.org/logviewer.html#?job_id=83018605&repo=autoland&lineNumber=1669 as example - not sure if this is a existing bug but seems to hit intermittently stylo reftests

Comment 1

12 days ago
21 failures in 172 pushes (0.122 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 21

Platform breakdown:
* linux64-stylo: 21

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-10&endday=2017-03-10&tree=all

Comment 2

10 days ago
21 failures in 790 pushes (0.027 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 21

Platform breakdown:
* linux64-stylo: 21

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-06&endday=2017-03-12&tree=all
Hm, looks like this is happening a fair bit, and there's no obvious place to start in the logs.

Time to throw up the bat signal I think. dmajor, do you have some cycles to help us out here?
Blocks: 1243581
Flags: needinfo?(dmajor)
Priority: -- → P1
Summary: Stylo Reftests application timed out after 330 seconds with no output → stylo: Reftests application timed out after 330 seconds with no output
Duplicate of this bug: 1340866
Is treeherder the only place to see these tests? Can I run them locally without jumping through hoops? Can I run them on Windows?
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #5)
> Is treeherder the only place to see these tests? Can I run them locally
> without jumping through hoops?

Yes, do a build with --enable-stylo in your mozconfig. Then do:

./mach reftest --disable-e10s --setpref=reftest.compareStyloToGecko=true layout/reftests/reftest-stylo.list

> Can I run them on Windows?

Xidorn does stylo development on windows so I think it works. That said, the only CI we have is linux64, and that's where the intermittent is, so I don't know how likely it is that the problem would reproduce on windows.

Another approach might be to get ASAN builds working (bug 1336013) and see if that spits out anything interesting on CI.
Assigning just to avoid having unassigned p1 bugs. Let me know if you aren't able to take this.
Assignee: nobody → dmajor
(Assignee)

Updated

8 days ago
Whiteboard: [stockwell needswork]
> Yes, do a build with --enable-stylo in your mozconfig. Then do:
> 
> ./mach reftest --disable-e10s --setpref=reftest.compareStyloToGecko=true
> layout/reftests/reftest-stylo.list

Running locally on Windows, I notice a few things -

- I consistently hit bug 1347399 on layout/reftests/bugs/652991-2.html. (Or rather, I get a subsequent MOZ_CRASH, since my opt build skips the asserts)

- layout/reftests/bugs/613433-*.html consistently get stuck and won't proceed until I re-focus the reftest window (even if I'm hands-off for the whole run, not giving it any reason to lose focus)

- If I click around between windows while the reftest starts up, it gets into the same "stuck" state as above, and does nothing until I give focus to the reftest. This looks like the same thing that's happening in CI, but is it really something specific to stylo? My memory is fuzzy but I thought this was just par for the course with reftests in general.

Comment 9

7 days ago
15 failures in 156 pushes (0.096 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 13
* mozilla-inbound: 2

Platform breakdown:
* linux64-stylo: 15

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-15&endday=2017-03-15&tree=all
Does comment 8 help at all? If not, any tips on what to dig further into?
Flags: needinfo?(bobbyholley)
(In reply to David Major [:dmajor] from comment #10)
> Does comment 8 help at all?

From what I see in the logs, we never even run a single test, so I think it's a different issue. Either the browser or the harness is hanging on startup.

> If not, any tips on what to dig further into?

Aside from the ASAN thing, I would probably add logging in a bunch of places in the browser and harness startup sequence, and then do retriggers on try to see where it gets stuck.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> From what I see in the logs, we never even run a single test, so I think
> it's a different issue. Either the browser or the harness is hanging on
> startup.

Are you sure? When I get my local build into the stuck-on-startup state by changing window focus, my console has the same five lines as the CI log, ending at "Marionette INFO":

[task 2017-03-10T11:42:05.234776Z] 11:42:05     INFO - REFTEST INFO | Checking for orphan ssltunnel processes...
[task 2017-03-10T11:42:05.256820Z] 11:42:05     INFO - REFTEST INFO | Checking for orphan xpcshell processes...
[task 2017-03-10T11:42:05.295845Z] 11:42:05     INFO - REFTEST INFO | Running with e10s: False
[task 2017-03-10T11:42:05.296841Z] 11:42:05     INFO - REFTEST INFO | Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -profile /tmp/tmpfGoyAf.mozrunner
[task 2017-03-10T11:42:07.982796Z] 11:42:07     INFO - 1489146127979	Marionette	INFO	Listening on port 2828
(In reply to David Major [:dmajor] from comment #12)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #11)
> > From what I see in the logs, we never even run a single test, so I think
> > it's a different issue. Either the browser or the harness is hanging on
> > startup.
> 
> Are you sure? When I get my local build into the stuck-on-startup state by
> changing window focus, my console has the same five lines as the CI log,
> ending at "Marionette INFO":

Oh, I misread your last point and didn't see that it was about startup. It could be a focus issue, though I thought that reftests were supposed to be able to deal with not being focused, or focus themselves (does the same thing happen in a non-stylo build?).

The treeherder summary also links to a failure screenshot, which looks...odd: https://public-artifacts.taskcluster.net/LSQ0KKfvTe6KIOb97HlexA/0/public/test_info//mozilla-test-fail-screenshot__bN4oS.png
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #13)

> The treeherder summary also links to a failure screenshot, which
> looks...odd:
> https://public-artifacts.taskcluster.net/LSQ0KKfvTe6KIOb97HlexA/0/public/
> test_info//mozilla-test-fail-screenshot__bN4oS.png

Oh, interesting! Locally I get two empty windows in the same size and position. The larger one on the left was supposed to become the window where the reftests are loaded, and the thing on the right is the auxiliary window.

I'll try non stylo...
I can make the same thing happen with a non-stylo build.
(In reply to David Major [:dmajor] from comment #15)
> I can make the same thing happen with a non-stylo build.

Bugzilla has a good number of hits for "reftest 330" in non-stylo builds. I wonder if they're all focus?
22 failures in 132 pushes (0.167 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 19
* mozilla-central: 2
* mozilla-inbound: 1

Platform breakdown:
* linux64-stylo: 22

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-16&endday=2017-03-16&tree=all
> - If I click around between windows while the reftest starts up, it gets
> into the same "stuck" state as above, and does nothing until I give focus to
> the reftest.

> I can make the same thing happen with a non-stylo build.

I get the feeling that this is a more general problem with reftests and window focus, and not really stylo-specific.

This is probably as far as I can take this without domain expertise. smaug, you show up the most on the log for nsFocusManager.cpp -- does the reftest focus code [1] seem ok to you? And, is nsFocusManager::ClearFocus (which is what I assume |gBrowser.focus()| does) do the right thing if the Firefox app itself doesn't have focus?

[1] https://dxr.mozilla.org/mozilla-central/rev/ff04d410e74b69acfab17ef7e73e7397602d5a68/layout/tools/reftest/reftest.jsm#413-421
Flags: needinfo?(bugs)
I'm not familiar with reftest setup... but what if gBrowser already has focus?

And ClearFocus? Where is that coming into play here?

http://searchfox.org/mozilla-central/rev/006005beff40d377cfd2f69d3400633c5ff09127/dom/interfaces/base/nsIFocusManager.idl#50 might be relevant here, depending on how reftests run. So activate the right top level window and then focus some element in it?
Flags: needinfo?(bugs)
As noted earlier, logs show this is basically a startup hang -- no tests are run.

Debug logs have several warnings on startup:

WARNING: stylo: No docshell yet, assuming Gecko style system: file /home/worker/workspace/build/src/dom/base/nsDocument.cpp, line 12983

WARNING: attempt to modify an immutable nsStandardURL: file /home/worker/workspace/build/src/netwerk/base/nsStandardURL.cpp, line 1644

WARNING: Failed to retarget HTML data delivery to the parser thread.: file /home/worker/workspace/build/src/parser/html/nsHtml5StreamParser.cpp, line 988

WARNING: NS_ENSURE_TRUE(standardURL) failed: file /home/worker/workspace/build/src/caps/nsPrincipal.cpp, line 229

WARNING: stylo: cannot get ServoStyleSheets from XBL bindings yet. See bug 1290276.: file /home/worker/workspace/build/src/layout/base/nsCSSFrameConstructor.cpp, line 2716

WARNING: stylo: ServoStyleSets cannot handle @font-face rules yet. See bug 1290237.: file /home/worker/workspace/build/src/dom/base/nsDocument.cpp, line 12877

I don't know if any of these are cause for concern / related to the hang.


There are also screenshots, which are consistent and strange. And crash reports after the timeout...but I don't see anything unexpected in them.
The earliest stylo reftest "timed out after 330 seconds" that I can find is https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=stylo%20reftest&tochange=452781c4ee876084bdc6a05a99d21597b7445724&fromchange=bdbd9679bbf1cb4c928fb5e2e049ea9906e737fc&selectedJob=82831814...but I don't see anything related in that changeset or the previous few changesets.
16 failures in 138 pushes (0.116 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 13
* mozilla-inbound: 3

Platform breakdown:
* linux64-stylo: 15
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-17&endday=2017-03-17&tree=all
114 failures in 777 pushes (0.147 failures/push) were associated with this bug in the last 7 days. 

This is the #12 most frequent failure this week. 

** This failure happened more than 75 times this week! Resolving this bug is a very high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 1 week, the affected test(s) may be disabled. **  

Repository breakdown:
* autoland: 87
* mozilla-inbound: 20
* mozilla-central: 7

Platform breakdown:
* linux64-stylo: 113
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-13&endday=2017-03-19&tree=all
(In reply to Geoff Brown [:gbrown] from comment #20)
> Debug logs have several warnings on startup:

All of those warnings appear in passing test runs as well.

(In reply to Olli Pettay [:smaug] from comment #19)
> And ClearFocus? Where is that coming into play here?

So the test says `gBrowser.focus()` which I assumed (guessed) lands in nsDOMWindowUtils::Focus with aElement == nullptr, which would lead to ClearFocus.
I am pretty sure this is (1) a problem with window focus and (2) not specific to stylo.

Can you find a more Gecko-knowledgeable owner to take it from here?
Flags: needinfo?(bobbyholley)
(Assignee)

Comment 26

3 days ago
the main concern I have here is that this seems to be only showing up as a stylo specific error (where are the other platforms?)  I know this does happen in other platforms, but looking for a few minutes on bugzilla results in 2 bugs with no activity in recent months (bug 1265229 and bug 1298796).

I did look at a few logs and I see this in the runner:
REFTEST INFO | Running with e10s: False

I filed bug 1348754 to look into why this is not in e10s mode.


From a sheriff perspective this looks like a stylo reftest specific failure and it is one of the top failures.  Please ensure this doesn't get passed around from team to team and received appropriate attention.  I would like to be a bit more patient here before disabling the tests or hiding them on treeherder- so far we are 1 week into this and I would like to see this resolved in a few days.
(In reply to David Major [:dmajor] from comment #25)
> I am pretty sure this is (1) a problem with window focus and (2) not
> specific to stylo.

This does seem to be triggered by something with stylo. Maybe it's timing-related or incidental, but the correlation noted in comment 26 seems to be strong enough that I think it's something we need to fix.

It's also not clear to me that we really know that focus is the culprit here and not just the symptom. It seems to me that retriggering with logging per comment 6 would be a good way to bisect where in the startup pipeline we're getting stuck.

I can't emphasize enough that, IME, the only fully-general and reliable way to debug intermittent CI failures is to push logging and then hit the retrigger button five or ten times until the failure appears (rinse and repeat with more logging to answer the next question that arises).

> Can you find a more Gecko-knowledgeable owner to take it from here?

I certainly can't make you work on it - but the stylo team is swamped and there's no obvious person (either inside our outside the team) with special expertise to give this to. We really just need somebody who's good at debugging to attack this from first principles and narrow down the cause to something of the form: "this bug happens because we get hung up here with these abnormal inputs/state".

Your skillset seems like a good match for this, but if you really don't want to I can try to find somebody else.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #27)
> Your skillset seems like a good match for this, but if you really don't want
> to I can try to find somebody else.

I'll tell you why I ask, which I should have been more clear about upfront, is that I've been asked to ramp up on Quantum Flow stuff and have several weeks of work-travel coming up. Combined with my general low enthusiasm for printf debugging, I don't see myself getting around to this in the immediate future.
I've been looking more at the range gbrown pointed out, specifically:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=stylo%20reftest&tochange=bfd89f8fb93aed915d449184213078a1b946454e&fromchange=adb5053309977cfdf18e29ab041f37abbbe00d60

The retriggers are sure starting to make it seem like this started with wlach's push. My money is on this bit:

https://hg.mozilla.org/integration/autoland/rev/ebdd7d5fa7450f7ae6d685a584f136908b69e356#l2.12

Presumably that somehow changed whether or not these tests actually run in e10s mode. Joel, what's going on here? Are we somehow running in some franken-configuration that's half-e10s half-non-e10s in stylo? That might explain why this is only showing up for stylo reftests.
Flags: needinfo?(jmaher)
(In reply to David Major [:dmajor] from comment #28)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #27)
> > Your skillset seems like a good match for this, but if you really don't want
> > to I can try to find somebody else.
> 
> I'll tell you why I ask, which I should have been more clear about upfront,
> is that I've been asked to ramp up on Quantum Flow stuff and have several
> weeks of work-travel coming up. Combined with my general low enthusiasm for
> printf debugging, I don't see myself getting around to this in the immediate
> future.

Ok, thanks for the heads-up. Hopefully this e10s configuration business will lead somewhere.
Assignee: dmajor → nobody
31 failures in 136 pushes (0.228 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 28
* mozilla-inbound: 2
* mozilla-central: 1

Platform breakdown:
* linux64-stylo: 31

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-20&endday=2017-03-20&tree=all
Component: Layout → CSS Parsing and Computation
(Assignee)

Comment 32

a day ago
I tried to reproduce this on try server unsuccessfully:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d808679009eb30842e2e2af1e1fdc8e15e1d17a6&filter-resultStatus=success&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=running&filter-resultStatus=pending&filter-resultStatus=runnable

this has real e10s mode as well as non-e10s (what we appear to be running all the time anyway).  I do wonder if we get things really running in e10s if this will not be an issue anymore- lets keep pushing on that until it is not a variable anymore.
Flags: needinfo?(jmaher)
Ok - over to Joel for now until we get this e10s automation business sorted out in bug 1348754.
Assignee: nobody → jmaher
31 failures in 174 pushes (0.178 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 18
* mozilla-inbound: 6
* mozilla-central: 6
* mozilla-beta: 1

Platform breakdown:
* linux64-stylo: 30
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1346232&startday=2017-03-21&endday=2017-03-21&tree=all
You need to log in before you can comment on or make changes to this bug.