Closed Bug 1140502 Opened 5 years ago Closed 5 years ago

Investigate why bug 1118001 & bug 1139800 started spiking on March 4th 2015 (timeouts in test_media_queries.html, test_selectors.html on b2g emulator)

Categories

(Core :: CSS Parsing and Computation, defect)

Platform: x86_64 Linux
Type: defect
Priority: Not set

Tracking

RESOLVED FIXED
mozilla39
Tracking Status
firefox37 --- wontfix
firefox38 --- wontfix
firefox39 --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: dholbert, Assigned: dholbert)

References

Details

Attachments

(1 file)

Per RyanVM's request in bug 1118001 comment 156, I'm filing this bug on investigating a recent spike in b2g mochitest-9 timeouts, in these tests:
 bug 1139800: test_media_queries.html
 bug 1139800: test_selectors.html

I'm filing this as a separate bug so that I can CC people here without signing them up for TBPLBot spam for all the failures.

(In reply to Daniel Holbert [:dholbert] from comment #0)
>  bug 1139800: test_media_queries.html
>  bug 1139800: test_selectors.html

Sorry, bug-copypaste-fail -- that first line was meant to say "bug 1118001: test_media_queries.html"

Note that the "bug 1118001 test_media_queries.html" timeout was already happening ~5 times per day, before the recent spike. Now, it's failing tens of times per day.

Here's a possible pretty-broad regression range (I'm guessing at the endpoints; I'll have a better idea after some retriggers are completed):
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?startdate=2015-03-04+02%3A00%3A00&enddate=2015-03-04+15%3A00%3A00

In there, the relevant-looking style-system / selector-related changes are:
> bc9e6089b8e1	Cameron McCormack — Bug 1136010 - Don't coalesce style contexts added to RestyleManager::mContextsToClear. r=dbaron
> 93e76e5a8186	Cameron McCormack — Bug 1138788 - Allow CSSParserImpl::SetURLValue assertion to succeed when under IsValueValidForProperty, even with no sheet principal. r=dholbert
> 38f11ce2749a	Xidorn Quan — Bug 1133624 - Add lang-specific ruby rules to ua.css. r=dbaron

So for the test_selectors.html failure, e.g. in its m-9 failure here:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f2b0eb82d21d&filter-ref_data_name=b2g_emulator_vm

...the test_selectors.html output ends with this:
{
19:40:22     INFO -  708 INFO TEST-PASS | layout/style/test/test_selectors.html | selector :-moz-any(:active,:link:focus) was cloned correctly
19:40:22     INFO -  709 INFO TEST-PASS | layout/style/test/test_selectors.html | selector :-moz-any(:active,:link:focus) serializes correctly after cloning
19:40:22     INFO -  710 INFO TEST-UNEXPECTED-FAIL | layout/style/test/test_selectors.html | Test timed out. - expected PASS
19:40:23     INFO -  711 INFO TEST-PASS | layout/style/test/test_selectors.html | element in <input type='text'><a href='http://www.example.com/'></a><div></div><a name='foo'> matched clone of :-moz-any(a,input)
19:40:23     INFO -  712 INFO TEST-PASS | layout/style/test/test_selectors.html | element in <input type='text'><a href='http://www.example.com/'></a><div></div><a name='foo'> matched clone of :-moz-any(a,input)
19:40:23     INFO -  713 INFO TEST-PASS | layout/style/test/test_selectors.html | element in <input type='text'><a href='http://www.example.com/'></a><div></div><a name='foo'> matched clone of :-moz-any(a,input)
19:40:23     INFO -  714 INFO TEST-PASS | layout/style/test/test_selectors.html | element in <input type='text'><a href='http://www.example.com/'></a><div></div><a name='foo'> did not match clone of :-moz-any(a,input)
19:40:23     INFO -  715 INFO TEST-PASS | layout/style/test/test_selectors.html | selector :-moz-any(a,input) serializes correctly after cloning
19:40:23     INFO -  716 INFO MEMORY STAT vsize after test: 218189824
19:40:23     INFO -  717 INFO MEMORY STAT residentFast after test: 77582336
19:40:23     INFO -  718 INFO MEMORY STAT heapAllocated after test: 97625024
19:40:24     INFO -  719 INFO TEST-OK | layout/style/test/test_selectors.html | took 314598ms
}


Note the several lines of normal output, followed by "TEST-OK" (!!!) after the TEST-UNEXPECTED-FAIL output. I think this test is just taking a little bit longer than the mochitest cutoff time (which is 300 seconds by default, IIRC). So:
 (1) this test may just need a requestLongerTimeout call (see the sketch after this list);
 (2) if it recently went from significantly-below-300s to above-300s (which seems likely/possible given the sudden spike), that's worth investigating as a perf regression.
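
For reference, here's a minimal sketch of what a requestLongerTimeout call looks like in a mochitest like these (the factor of 2 below is just an assumed example value, not necessarily what an actual patch would use):
{
// Near the top of the test's <script> block, e.g. in test_selectors.html:

// Multiply the harness's default per-test timeout (the ~300 seconds mentioned
// above) by the given factor, so a run that legitimately takes ~310s is no
// longer flagged as a timeout. The factor of 2 here is an assumption.
SimpleTest.requestLongerTimeout(2);

// These tests already complete asynchronously, along these general lines:
SimpleTest.waitForExplicitFinish();
// ... run the selector / media-query checks ...
// ... and call SimpleTest.finish() when done.
}
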
Hmm, some new test-code was also added to test_selectors.html on 3/3, in this cset:
 http://hg.mozilla.org/integration/mozilla-inbound/rev/0217de65664e#l4.1

It's conceivable that this new test-code made the test take a bit longer (and that pushed it over 300sec).
...though some green test runs[1][2] from 3/4 show test_selectors.html taking 233437ms-236672ms, and test_media_queries.html taking 242639ms-259480ms. So they were both already somewhat close.

Also: when test_media_queries.html fails, it looks like it's barely over the threshold (similar to test_selectors.html in comment 2) -- e.g. log [3] with "took 308405ms", i.e. ~308s, just past the 300s cutoff.

The possible regression here is a bit concerning, and hopefully retriggers can tell us what caused it. But for the moment, we probably should use requestLongerTimeout so these tests aren't getting quite so close to being treated as timeouts even under normal conditions.

[1] http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/1425463690/mozilla-inbound_ubuntu64_vm-b2g-emulator_test-mochitest-9-bm121-tests1-linux64-build8.txt.gz
[2] http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/1425478490/mozilla-inbound_ubuntu64_vm-b2g-emulator_test-mochitest-9-bm53-tests1-linux64-build56.txt.gz
[3] http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/1425471411/mozilla-inbound_ubuntu64_vm-b2g-emulator_test-mochitest-9-bm121-tests1-linux64-build10.txt.gz

Here's a patch to add requestLongerTimeout, since (per previous comment) these tests were already getting reasonably close to the 300 sec mochitest timeout.
Attachment #8574101 - Flags: review?(dbaron)
Attachment #8574101 - Flags: review?(dbaron) → review+

https://hg.mozilla.org/mozilla-central/rev/dacbd4fbf0ee
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39

Note: From the retriggers I've been doing, it looks like test_reftests_with_caret.html may have gotten worse as well, though it's not quite as bad. Its runtime in a green run that I randomly picked was ~210 seconds, which is close enough to 300 that it's conceivable that it could be pushed over by a poorly-timed GC and/or whatever's causing these other tests to be slower.

Comment 8 is wrong; more retriggers show that it goes back further than that. In particular, my "good" push from comment 8 ended up having some test_media_queries & test_reftests_with_caret timeouts, and an earlier push ( https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=54ea9cbf8439&filter-ref_data_name=b2g_emulator_vm ) had a test_selectors timeout on one of its retriggers (the first orange "9" on that run).

Currently these timeouts seem to reproduce at least as far back as this run:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=78ae63e64244&filter-ref_data_name=b2g_emulator_vm

I'm going to dig a bit more, but if nothing obvious turns up, I'm probably going to give up and/or punt to anyone else who's interested in investigating here pretty soon. This is an inexact ("over the threshold or not"), lowish-frequency failure, in a test job that (for some reason) isn't run on every single push, which makes an exact regression range hard to track down -- and we've already fixed it with "requestLongerTimeout", which we arguably needed here anyway, since we were within spitting distance of the threshold. Also, these tests aren't intended to be real-world perf tests, so a subtle slowdown in them (one that's not caught by actual Talos tests) may be the sort of regression that's only exposed by pathological behavior and not really visible in the real world.

(I've already done a ton of retriggers for jobs in this range, so anyone who's interested in going back through this time-period on TreeHerder should have a lot of data to play with.)

One more data-point: The following run from ~2 days earlier, March 2nd, seems to be definitely "good":
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=346743dfd466&filter-ref_data_name=b2g_emulator_vm

I gave it a ton of mochitest-9 retriggers (for 67 total m-9 runs, it looks like), and it didn't hit this issue in any of those runs. (though it had a legitimate failure in test_reftests_with_caret.html in one of the runs)

...and this run from the end of March 3rd is pretty-likely "bad":
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=c5a6c94eda71&filter-ref_data_name=b2g_emulator_vm

It's had 31 m-9 runs, and 2 of those hit this category of failure. (One of those was the "test_selectors.html" timeout, bug 1139800, which is pretty definitively new as of the spike being tracked here.)

It looks like the archived builds have been purged, so new retriggers from this time range are now purple, w/ failure to download the archive.

So -- final likely-regression-range:

Good -- no timeouts, except some in "test_reftests_with_caret.html" that are accompanied by "TEST-UNEXPECTED-FAIL ... image comparison (==)", which means they're not this bug:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a07003e503d6&filter-ref_data_name=b2g_emulator_vm

Bad (mentioned in comment 11, 2 runs w/ timeouts in test_selectors.html and/or test_media_queries.html):
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=c5a6c94eda71&filter-ref_data_name=b2g_emulator_vm

Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a07003e503d6&tochange=c5a6c94eda71

In that range, the last cset (for bug 1137031) seems the most likely to have caused restyling to be more expensive & hence prolonged these tests a bit.  It's also a correctness fix, so not likely to be rolled back.

I'm going to tentatively "blame" that bug, and call this investigation done.
Blocks: 1137031

Note that bug 1137031 was backported to Aurora on Friday, too -- if we see these timeouts happening on Aurora, then we can backport the requestLongerTimeout calls (from comment 6) to Aurora.

These timeouts seem to be happening on b2g37 a lot (though not on beta). Any idea why?
Flags: needinfo?(dholbert)

(In reply to Daniel Holbert [:dholbert] from comment #12)
> In that range, the last cset (for bug 1137031) seems the most likely to have
> caused restyling to be more expensive & hence prolonged these tests a bit. 
> It's also a correctness fix, so not likely to be rolled back.

This seems a little suspicious, since it was only disabling the optimization from bug 931668 for a set of cases where transitions were involved.

(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #14)
> These timeouts seem to be happening on b2g37 a lot (though not on beta). 
> Any idea why?

Nope. :-/ I haven't thought about this for a while, and I didn't really come to a concrete answer about what was going on when I was thinking about it back in early March. As I recall, the data was messy & a bit hard to reason about, because these tests were already reasonably close to the timeout threshold. And the builds were gone by the time I was narrowing in on a maybe-regression-range, so comment 12 is the best information I ended up with about what caused this.

Anyway -- maybe we should backport the requestLongerTimeout band-aid to b2g37?

(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #15)
> (In reply to Daniel Holbert [:dholbert] from comment #12)
> > In that range, the last cset (for bug 1137031) seems the most likely 
> 
> This seems a little suspicious, since it was only disabling the optimization
> from bug 931668 for a set of cases where transitions were involved.

OK.  Nothing else in that range jumped out at me.  Maybe one of my endpoints on the regression range was wrong, though (and I just got unlucky about which test-runs happened to be good/bad).
Flags: needinfo?(dholbert)

(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #14)
> These timeouts seem to be happening on b2g37 a lot (though not on beta). 
> Any idea why?

Shot in the dark: at this point, it looks like things may have gotten worse with the latest beta --> b2g37 merge, on 3/25.

The push with that merge from beta[1] doesn't have any m-9 timeouts, but the *very next push*[2] has an m-9 timeout in test_media_queries.html. Looking at the bug for that timeout (bug 1118001), this report (on f022936c0020) was the first b2g37 instance of that timeout in 2 weeks (and the next-oldest report was 2 months earlier) -- and it's been failing every day or so ever since. So it looks like this is indeed the moment where the timeouts got bad on b2g37.

So, it's a bit odd/interesting that beta is still fine but b2g37-with-beta-merged is not-fine.

[1] https://treeherder.mozilla.org/#/jobs?repo=mozilla-b2g37_v2_2&revision=ebb742bfe1a4
[2] https://treeherder.mozilla.org/#/jobs?repo=mozilla-b2g37_v2_2&revision=f022936c0020

B2G builds/tests don't run on beta.

Ah! That would explain dbaron's observation in comment 14 then, unless I'm misunderstanding him.

So expanding on comment 17: I think all we can learn from the recent b2g37 timeouts is that *something* that was merged over from mozilla-beta (as merged in https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/5b359c3c436f ) is responsible for the timeouts.

(The actual source-code content of that cset is ignorable; the relevant fact is that it has 2 parent csets, i.e. it's a merge.)

If that merge was as substantial as I suspect it was, then we don't actually get any information, really. :-/

Anyway: RyanVM, would you mind backporting the requestLongerTimeout cset in comment 6 to b2g37? Alternatively, I'm happy to do it, but I'm unfamiliar with b2g37 landing practices / starring responsibilities, so I'd rather defer to you if you don't mind. :)
Flags: needinfo?(ryanvm)

Will do.

(In reply to Daniel Holbert [:dholbert] from comment #20)
> So expanding on comment 17: I think all we can learn from the recent b2g37
> timeouts are that *something* that was merged over from mozilla-beta (as
> merged in
> https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/5b359c3c436f ) is
> responsible for the timeouts.

(er, to be more precise:  s/responsible for the timeouts/responsible for making these tests a bit slower, which gets them closer to the timeout-threshold/)