Closed
Bug 1246056
Opened 9 years ago
Closed 9 years ago
Intermittent APZCBasicTester.Fling | Expected: (pointOut.y) > (lastPoint.y), actual: 40 vs 40
Categories
(Core :: Panning and Zooming, defect)
Tracking
()
RESOLVED
FIXED
mozilla47
Tracking | Status | |
---|---|---|
firefox45 | --- | unaffected |
firefox46 | --- | fixed |
firefox47 | --- | fixed |
firefox-esr45 | --- | unaffected |
People
(Reporter: philor, Assigned: kats)
References
Details
(Keywords: intermittent-failure)
Attachments
(3 files)
2.16 KB,
patch
|
Details | Diff | Splinter Review | |
793 bytes,
patch
|
Details | Diff | Splinter Review | |
58 bytes,
text/x-review-board-request
|
botond
:
review+
|
Details |
How you manage to have an intermittent failure only on Linux32 PGO I don't know; how you manage to have an intermittent failure that's around 1 in 25 hit on two first-runs on two Linux32 PGO builds in a row I also don't know. But I do know we're getting nowhere retriggering to find a push which started it.
https://treeherder.mozilla.org/logviewer.html#?job_id=21102797&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=21096157&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=21100301&repo=mozilla-inbound
Assignee | ||
Comment 1•9 years ago
|
||
I blame the PGO code generator. The gtests are single threaded and about as deterministic as you can get. On the other hand I've seen strange floating point inaccuracies creep into PGO builds that don't exist in non-PGO builds.
Reporter | ||
Comment 2•9 years ago
|
||
No, 1 in 25 - we retriggered 25 times on the build that first hit it, and didn't get another instance, we stayed closed for hours building PGO on rbarker's repush to blame it on that and triggered gtest 25 times and got it once. It's not the case that we intermittently produce a build which deterministically hits this, it's the case that a given build will intermittently hit this.
Assignee | ||
Comment 3•9 years ago
|
||
Oh, weird. The only source of nondeterminism I can think of that would explain this is if the PGO build is failing to initialize some variable properly, and depending on what value it picks up it fails or passes.
Assignee | ||
Comment 4•9 years ago
|
||
I guess the thing to do in that case would be to run the PGO build in valgrind and see if it catches anything.
Assignee | ||
Comment 5•9 years ago
|
||
Also just for the record if this *is* a regression the most likely change to have introduced it recently is bug 1236046, considering that the test simulates a fling which exercises the velocity computation code. But I did a pass over the relevant code and still don't see any reason why this should be nondeterministic.
Reporter | ||
Comment 6•9 years ago
|
||
https://treeherder.mozilla.org/logviewer.html#?job_id=21188916&repo=mozilla-inbound is Win7 debug, which should save some time chasing down PGO miscompilation.
Assignee | ||
Comment 7•9 years ago
|
||
I ran valgrind on this test in a local Linux debug build and it didn't turn up anything interesting. If it's not PGO miscompilation and it's not uninitialized read then I'm not sure where the nondeterminism is coming from. I'll think about it some more.
Comment hidden (Intermittent Failures Robot) |
Comment 9•9 years ago
|
||
It's probably easiest for this to wait until we have docker linux debug tests and then we can run it locally under rr chaos mode and hopefully catch it.
The only other option I can think of is to get some more debugging output of what's going on.
Comment hidden (Intermittent Failures Robot) |
Updated•9 years ago
|
Assignee: nobody → bgirard
Comment 11•9 years ago
|
||
I'm going to try to capture an RR replay. I ran into this bug making it difficult however:
https://github.com/mozilla/rr/issues/1644
Assignee | ||
Comment 12•9 years ago
|
||
I ran this gtest 1000 times with rr-chaos locally and couldn't repro the failure. However I was using a Linux debug build and maybe it only happens on Linux opt builds?
Assignee | ||
Comment 13•9 years ago
|
||
(Also I was running the single gtest in isolation, and maybe it needs to run as part the full suite)
Comment 14•9 years ago
|
||
I ran just the test and I also ran 'APZC*' for ~3000 cycles under Linux debug with rr+gecko chaos and it did not reproduce.
Comment 15•9 years ago
|
||
AFAIK this failure here doesn't come from a Linux PGO build. Can anyone confirm or deny?
https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=cb08d55f4e8b&selectedJob=7194463
Assignee | ||
Comment 16•9 years ago
|
||
Confirm, it's a regular Linux opt build. According to brasstacks it has happened on Linux opt, Linux pgo, and Windows 7 debug.
Assignee | ||
Comment 17•9 years ago
|
||
I did another 500 runs with an Linux opt build, APZCBasicTester.* and no repro.
Assignee | ||
Comment 18•9 years ago
|
||
I reproduced it after fiddling with some code manually. I'll debug.
Assignee: bgirard → bugmail.mozilla
Assignee | ||
Comment 19•9 years ago
|
||
The problem is that the MillisecondsSinceStartup call at [1] can return a negative value, because the implementation of that function uses GetStartupTime (which is lazy) and the MockContentController's time (which is initialized first). However it returns a uint32_t, so it really comes out as a really big positive number. If the negative value is < -5ms, the check I added at [3] trips, and so we enter the if condition when we shouldn't be, and that causes the failure.
I can fix this easily enough in the test framework by ensuring that the MCC's time is always >= GetStartupTime(), but I'm wondering if the code at [3] should be made robust to overflow as well. I'm leaning towards no because it's an implicit contract violation to be passing what is really a negative number as the uint32_t timestamp and I don't think production code will be hitting this in practice (but really who knows).
[1] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/test/gtest/InputUtils.h?rev=28db0fc71ad1#81
[2] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/test/gtest/APZTestCommon.h#292
[3] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/src/Axis.cpp?rev=17a70b41806c#78
Assignee | ||
Comment 20•9 years ago
|
||
Applying this patch simulates the failure
Assignee | ||
Comment 21•9 years ago
|
||
A 10ms sleep in the right place also reproduces the problem.
Assignee | ||
Comment 22•9 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/35185/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/35185/
Attachment #8719986 -
Flags: review?(botond)
Updated•9 years ago
|
Attachment #8719986 -
Flags: review?(botond) → review+
Comment 23•9 years ago
|
||
Comment on attachment 8719986 [details]
MozReview Request: Bug 1246056 - Ensure that the MockContentController's timestamp is always >= GetStartupTime(). r?botond
https://reviewboard.mozilla.org/r/35185/#review31805
That's a subtle bug - thanks for tracking it down!
Comment 24•9 years ago
|
||
Comment 25•9 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Comment 26•9 years ago
|
||
status-firefox45:
--- → unaffected
status-firefox46:
--- → affected
status-firefox-esr45:
--- → unaffected
Updated•9 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•