Closed Bug 1140502 Opened 5 years ago Closed 5 years ago

Investigate why bug 1118001 & bug 1139800 started spiking on March 4th 2015 (timeouts in test_media_queries.html, test_selectors.html on b2g emulator)

Categories

(Core :: CSS Parsing and Computation, defect)

Platform: x86_64 Linux
Type: defect
Priority: Not set

Tracking

RESOLVED FIXED
mozilla39
Tracking Status
firefox37 --- wontfix
firefox38 --- wontfix
firefox39 --- fixed
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: dholbert, Assigned: dholbert)

References

Details

Attachments

(1 file)

Per RyanVM's request in bug 1118001 comment 156, I'm filing this bug on investigating a recent spike in b2g mochitest-9 timeouts, in these tests:
 bug 1139800: test_media_queries.html
 bug 1139800: test_selectors.html

I'm filing this as a separate bug so that I can CC people here without signing them up for TBPLBot spam for all the failures.

(In reply to Daniel Holbert [:dholbert] from comment #0)
>  bug 1139800: test_media_queries.html
>  bug 1139800: test_selectors.html

Sorry, bug-copypaste-fail -- that first line was meant to say "bug 1118001: test_media_queries.html"

Note that the "bug 1118001 test_media_queries.html" timeout was already happening ~5 times per day, before the recent spike. Now, it's failing tens of times per day.

Here's a possible pretty-broad regression range (I'm guessing at the endpoints; I'll have a better idea after some retriggers are completed):
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?startdate=2015-03-04+02%3A00%3A00&enddate=2015-03-04+15%3A00%3A00

In there, the relevant-looking style-system / selector-related changes are:
> bc9e6089b8e1	Cameron McCormack — Bug 1136010 - Don't coalesce style contexts added to RestyleManager::mContextsToClear. r=dbaron
> 93e76e5a8186	Cameron McCormack — Bug 1138788 - Allow CSSParserImpl::SetURLValue assertion to succeed when under IsValueValidForProperty, even with no sheet principal. r=dholbert
> 38f11ce2749a	Xidorn Quan — Bug 1133624 - Add lang-specific ruby rules to ua.css. r=dbaron

So for the test_selectors.html failure, e.g. in its m-9 failure here:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f2b0eb82d21d&filter-ref_data_name=b2g_emulator_vm

...the test_selectors.html output ends with this:
{
19:40:22     INFO -  708 INFO TEST-PASS | layout/style/test/test_selectors.html | selector :-moz-any(:active,:link:focus) was cloned correctly
19:40:22     INFO -  709 INFO TEST-PASS | layout/style/test/test_selectors.html | selector :-moz-any(:active,:link:focus) serializes correctly after cloning
19:40:22     INFO -  710 INFO TEST-UNEXPECTED-FAIL | layout/style/test/test_selectors.html | Test timed out. - expected PASS
19:40:23     INFO -  711 INFO TEST-PASS | layout/style/test/test_selectors.html | element in <input type='text'><a href='http://www.example.com/'></a><div></div><a name='foo'> matched clone of :-moz-any(a,input)
19:40:23     INFO -  712 INFO TEST-PASS | layout/style/test/test_selectors.html | element in <input type='text'><a href='http://www.example.com/'></a><div></div><a name='foo'> matched clone of :-moz-any(a,input)
19:40:23     INFO -  713 INFO TEST-PASS | layout/style/test/test_selectors.html | element in <input type='text'><a href='http://www.example.com/'></a><div></div><a name='foo'> matched clone of :-moz-any(a,input)
19:40:23     INFO -  714 INFO TEST-PASS | layout/style/test/test_selectors.html | element in <input type='text'><a href='http://www.example.com/'></a><div></div><a name='foo'> did not match clone of :-moz-any(a,input)
19:40:23     INFO -  715 INFO TEST-PASS | layout/style/test/test_selectors.html | selector :-moz-any(a,input) serializes correctly after cloning
19:40:23     INFO -  716 INFO MEMORY STAT vsize after test: 218189824
19:40:23     INFO -  717 INFO MEMORY STAT residentFast after test: 77582336
19:40:23     INFO -  718 INFO MEMORY STAT heapAllocated after test: 97625024
19:40:24     INFO -  719 INFO TEST-OK | layout/style/test/test_selectors.html | took 314598ms
}


Note the several lines of normal output, followed by "TEST-OK" (!!!) after the TEST-UNEXPECTED-FAIL output. I think this test is just taking a little bit longer than the mochitest cutoff time (which is 300 seconds by default, IIRC). So:
 (1) this test may just need a requestLongerTimeout call (see the sketch after this list);
 (2) if it recently went from significantly-below-300s to above-300s (which seems likely/possible given the sudden spike), that's worth investigating as a perf regression.
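
For reference, here's a minimal sketch of what a requestLongerTimeout call looks like in a mochitest like these (the factor of 2 below is just an assumed example value, not necessarily what an actual patch would use):
{
// Near the top of the test's <script> block, e.g. in test_selectors.html:

// Multiply the harness's default per-test timeout (the ~300 seconds mentioned
// above) by the given factor, so a run that legitimately takes ~310s is no
// longer flagged as a timeout. The factor of 2 here is an assumption.
SimpleTest.requestLongerTimeout(2);

// These tests already complete asynchronously, along these general lines:
SimpleTest.waitForExplicitFinish();
// ... run the selector / media-query checks ...
// ... and call SimpleTest.finish() when done.
}
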
Hmm, some new test-code was also added to test_selectors.html on 3/3, in this cset:
 http://hg.mozilla.org/integration/mozilla-inbound/rev/0217de65664e#l4.1

It's conceivable that this new test-code made the test take a bit longer (and that pushed it over 300sec).
...though some green test runs[1][2] from 3/4 show test_selectors.html taking 233437ms-236672ms, and test_media_queries.html taking 242639ms-259480ms. So they were both already somewhat close.

Also: when test_media_queries.html fails, it looks like it's barely over the threshold (similar to test_selectors.html in comment 2) -- e.g. log [3] with "took 308405ms", i.e. ~308s, just past the 300s cutoff.

The possible regression here is a bit concerning, and hopefully retriggers can tell us what caused it. But for the moment, we probably should use requestLongerTimeout so these tests aren't getting quite so close to being treated as timeouts even under normal conditions.

[1] http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/1425463690/mozilla-inbound_ubuntu64_vm-b2g-emulator_test-mochitest-9-bm121-tests1-linux64-build8.txt.gz
[2] http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/1425478490/mozilla-inbound_ubuntu64_vm-b2g-emulator_test-mochitest-9-bm53-tests1-linux64-build56.txt.gz
[3] http://ftp.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-inbound-emulator/1425471411/mozilla-inbound_ubuntu64_vm-b2g-emulator_test-mochitest-9-bm121-tests1-linux64-build10.txt.gz

Here's a patch to add requestLongerTimeout, since (per previous comment) these tests were already getting reasonably close to the 300 sec mochitest timeout.
Attachment #8574101 - Flags: review?(dbaron)
Attachment #8574101 - Flags: review?(dbaron) → review+

https://hg.mozilla.org/mozilla-central/rev/dacbd4fbf0ee
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla39

Note: From the retriggers I've been doing, it looks like test_reftests_with_caret.html may have gotten worse as well, though it's not quite as bad. Its runtime in a green run that I randomly picked was ~210 seconds, which is close enough to 300 that it's conceivable that it could be pushed over by a poorly-timed GC and/or whatever's causing these other tests to be slower.

Comment 8 is wrong; more retriggers show that it goes back further than that. In particular, my "good" push from comment 8 ended up having some test_media_queries & test_reftests_with_caret timeouts, and an earlier push ( https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=54ea9cbf8439&filter-ref_data_name=b2g_emulator_vm ) had a test_selectors timeout on one of its retriggers (the first orange "9" on that run).

Currently these timeouts seem to reproduce at least as far back as this run:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=78ae63e64244&filter-ref_data_name=b2g_emulator_vm

I'm going to dig a bit more, but if nothing obvious turns up, I'm probably going to give up and/or punt to anyone else who's interested in investigating here pretty soon. This is an inexact ("over the threshold or not"), lowish-frequency failure, in a test job that (for some reason) isn't run on every single push, which makes an exact regression range hard to track down -- and we've already fixed it with "requestLongerTimeout", which we arguably needed here anyway, since we were within spitting distance of the threshold. Also, these tests aren't intended to be real-world perf tests, so a subtle slowdown in them (one that's not caught by actual Talos tests) may be the sort of regression that's only exposed by pathological behavior and not really visible in the real world.

(I've already done a ton of retriggers for jobs in this range, so anyone who's interested in going back through this time-period on TreeHerder should have a lot of data to play with.)

One more data-point: The following run from ~2 days earlier, March 2nd, seems to be definitely "good":
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=346743dfd466&filter-ref_data_name=b2g_emulator_vm

I gave it a ton of mochitest-9 retriggers (for 67 total m-9 runs, it looks like), and it didn't hit this issue in any of those runs. (though it had a legitimate failure in test_reftests_with_caret.html in one of the runs)

...and this run from the end of March 3rd is pretty-likely "bad":
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=c5a6c94eda71&filter-ref_data_name=b2g_emulator_vm

It's had 31 m-9 runs, and 2 of those hit this category of failure. (One of those was the "test_selectors.html" timeout, bug 1139800, which is pretty definitively new as of the spike being tracked here.)

It looks like the archived builds have been purged, so new retriggers from this time range are now purple, w/ failure to download the archive.

So -- final likely-regression-range:

Good -- no timeouts, except some in "test_reftests_with_caret.html" that are accompanied by "TEST-UNEXPECTED-FAIL ... image comparison (==)", which means they're not this bug:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=a07003e503d6&filter-ref_data_name=b2g_emulator_vm

Bad (mentioned in comment 11, 2 runs w/ timeouts in test_selectors.html and/or test_media_queries.html):
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=c5a6c94eda71&filter-ref_data_name=b2g_emulator_vm

Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a07003e503d6&tochange=c5a6c94eda71

In that range, the last cset (for bug 1137031) seems the most likely to have caused restyling to be more expensive & hence prolonged these tests a bit.  It's also a correctness fix, so not likely to be rolled back.

I'm going to tentatively "blame" that bug, and call this investigation done.
Blocks: 1137031

Note that bug 1137031 was backported to Aurora on Friday, too -- if we see these timeouts happening on Aurora, then we can backport the requestLongerTimeout calls (from comment 6) to Aurora.

These timeouts seem to be happening on b2g37 a lot (though not on beta). Any idea why?
Flags: needinfo?(dholbert)

(In reply to Daniel Holbert [:dholbert] from comment #12)
> In that range, the last cset (for bug 1137031) seems the most likely to have
> caused restyling to be more expensive & hence prolonged these tests a bit. 
> It's also a correctness fix, so not likely to be rolled back.

This seems a little suspicious, since it was only disabling the optimization from bug 931668 for a set of cases where transitions were involved.

(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #14)
> These timeouts seem to be happening on b2g37 a lot (though not on beta). 
> Any idea why?

Nope. :-/ I haven't thought about this for a while, and I didn't really come to a concrete answer about what was going on when I was thinking about it back in early March. As I recall, the data was messy & a bit hard to reason about, because these tests were already reasonably close to the timeout threshold. And the builds were gone by the time I was narrowing in on a maybe-regression-range, so comment 12 is the best information I ended up with about what caused this.

Anyway -- maybe we should backport the requestLongerTimeout band-aid to b2g37?

(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #15)
> (In reply to Daniel Holbert [:dholbert] from comment #12)
> > In that range, the last cset (for bug 1137031) seems the most likely 
> 
> This seems a little suspicious, since it was only disabling the optimization
> from bug 931668 for a set of cases where transitions were involved.

OK.  Nothing else in that range jumped out at me.  Maybe one of my endpoints on the regression range was wrong, though (and I just got unlucky about which test-runs happened to be good/bad).
Flags: needinfo?(dholbert)

(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #14)
> These timeouts seem to be happening on b2g37 a lot (though not on beta). 
> Any idea why?

Shot in the dark: at this point, it looks like things may have gotten worse with the latest beta --> b2g37 merge, on 3/25.

The push with that merge from beta[1] doesn't have any m-9 timeouts, but the *very next push*[2] has an m-9 timeout in test_media_queries.html. Looking at the bug for that timeout (bug 1118001), this report (on f022936c0020) was the first b2g37 instance of that timeout in 2 weeks (and the next-oldest report was 2 months earlier) -- and it's been failing every day or so ever since. So it looks like this is indeed the moment where the timeouts got bad on b2g37.

So, it's a bit odd/interesting that beta is still fine but b2g37-with-beta-merged is not-fine.

[1] https://treeherder.mozilla.org/#/jobs?repo=mozilla-b2g37_v2_2&revision=ebb742bfe1a4
[2] https://treeherder.mozilla.org/#/jobs?repo=mozilla-b2g37_v2_2&revision=f022936c0020

B2G builds/tests don't run on beta.

Ah! That would explain dbaron's observation in comment 14 then, unless I'm misunderstanding him.

So expanding on comment 17: I think all we can learn from the recent b2g37 timeouts is that *something* that was merged over from mozilla-beta (as merged in https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/5b359c3c436f ) is responsible for the timeouts.

(The actual source-code content of that cset is ignorable; the relevant fact is that it has 2 parent csets, i.e. it's a merge.)

If that merge was as substantial as I suspect it was, then we don't actually get any information, really. :-/

Anyway: RyanVM, would you mind backporting the requestLongerTimeout cset in comment 6 to b2g37? Alternatively, I'm happy to do it, but I'm unfamiliar with b2g37 landing practices / starring responsibilities, so I'd rather defer to you if you don't mind. :)
Flags: needinfo?(ryanvm)

Will do.

(In reply to Daniel Holbert [:dholbert] from comment #20)
> So expanding on comment 17: I think all we can learn from the recent b2g37
> timeouts are that *something* that was merged over from mozilla-beta (as
> merged in
> https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/5b359c3c436f ) is
> responsible for the timeouts.

(er, to be more precise:  s/responsible for the timeouts/responsible for making these tests a bit slower, which gets them closer to the timeout-threshold/)