Bug 1246056 (Closed) · Opened 8 years ago · Closed 8 years ago

Intermittent APZCBasicTester.Fling | Expected: (pointOut.y) > (lastPoint.y), actual: 40 vs 40

Categories

(Core :: Panning and Zooming, defect)

Platform: x86 Linux
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla47

Tracking Status:
firefox45 --- unaffected
firefox46 --- fixed
firefox47 --- fixed
firefox-esr45 --- unaffected

People

(Reporter: philor, Assigned: kats)

References

Details

(Keywords: intermittent-failure)

Attachments

(3 files)

How you manage to have an intermittent failure only on Linux32 PGO, I don't know; how you manage to have an intermittent failure with roughly a 1-in-25 hit rate that still shows up on the first runs of two Linux32 PGO builds in a row, I also don't know. But I do know we're getting nowhere retriggering to find the push that started it.

https://treeherder.mozilla.org/logviewer.html#?job_id=21102797&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=21096157&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=21100301&repo=mozilla-inbound
I blame the PGO code generator. The gtests are single-threaded and about as deterministic as you can get. On the other hand, I've seen strange floating point inaccuracies creep into PGO builds that don't exist in non-PGO builds.
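As an aside, here's a concrete (hypothetical, unrelated to APZ) illustration of the kind of codegen-dependent floating point divergence I mean; the values are chosen purely to make the rounding difference visible:

    #include <cmath>
    #include <cstdio>

    int main() {
      double a = 100000001.0;  // 1e8 + 1, exactly representable
      double b = 99999999.0;   // 1e8 - 1, exactly representable
      double c = -1e16;
      // a*b is exactly 1e16 - 1, which is NOT representable as a double.
      double fused = std::fma(a, b, c);  // single rounding: prints -1
      double split = a * b + c;          // two roundings: prints 0 (usually)
      // If the compiler contracts a*b+c into an fma (-ffp-contract=fast),
      // "split" also comes out as -1. That kind of build-dependent codegen
      // choice is the sort of FP divergence that can differ between PGO
      // and non-PGO builds.
      printf("fused=%g split=%g\n", fused, split);
      return 0;
    }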
No, 1 in 25: we retriggered 25 times on the build that first hit it and didn't get another instance; we stayed closed for hours building PGO on rbarker's repush so we could blame it on that, triggered the gtest 25 times, and got it once. It's not the case that we intermittently produce a build which deterministically hits this; it's the case that a given build will intermittently hit this.
Oh, weird. The only source of nondeterminism I can think of that would explain this is the PGO build failing to initialize some variable properly, so that the test passes or fails depending on what value it happens to pick up.
I guess the thing to do in that case would be to run the PGO build under valgrind and see if it catches anything.
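To illustrate the kind of thing valgrind's memcheck would flag if this hypothesis is right (hypothetical code, not the actual Gecko source):

    #include <cstdio>

    // Hypothetical sketch of the suspected failure mode: a member the
    // build might leave uninitialized, so pass/fail depends on garbage.
    struct VelocityState {
      double mVelocity;   // oops: the constructor never initializes this
      VelocityState() {}  // a correct version would set mVelocity = 0.0
    };

    int main() {
      VelocityState state;
      // Under "valgrind ./a.out", memcheck reports "conditional jump or
      // move depends on uninitialised value(s)" on this branch.
      if (state.mVelocity > 0.0) {
        printf("fling continues\n");  // intermittent: depends on garbage
      }
      return 0;
    }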
Also just for the record if this *is* a regression the most likely change to have introduced it recently is bug 1236046, considering that the test simulates a fling which exercises the velocity computation code. But I did a pass over the relevant code and still don't see any reason why this should be nondeterministic.
https://treeherder.mozilla.org/logviewer.html#?job_id=21188916&repo=mozilla-inbound is Win7 debug, which should save some time chasing down PGO miscompilation.
I ran valgrind on this test in a local Linux debug build and it didn't turn up anything interesting. If it's not PGO miscompilation and it's not an uninitialized read, then I'm not sure where the nondeterminism is coming from. I'll think about it some more.
It's probably easiest for this to wait until we have Docker Linux debug tests; then we can run it locally under rr chaos mode and hopefully catch it.

The only other option I can think of is to get some more debugging output of what's going on.
Assignee: nobody → bgirard
I'm going to try to capture an rr replay. However, I ran into this rr bug, which makes that difficult:
https://github.com/mozilla/rr/issues/1644
I ran this gtest 1000 times with rr chaos mode locally and couldn't repro the failure. However, I was using a Linux debug build, and maybe it only happens on Linux opt builds?
(Also, I was running the single gtest in isolation, and maybe it needs to run as part of the full suite.)
I ran just the test, and I also ran 'APZC*' for ~3000 cycles under a Linux debug build with rr+gecko chaos, and it did not reproduce.
AFAIK this failure here doesn't come from a Linux PGO build. Can anyone confirm or deny?
https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=cb08d55f4e8b&selectedJob=7194463
Confirmed: it's a regular Linux opt build. According to brasstacks it has happened on Linux opt, Linux PGO, and Windows 7 debug.
I did another 500 runs with a Linux opt build, running APZCBasicTester.*, and got no repro.
I reproduced it after fiddling with some code manually. I'll debug.
Assignee: bgirard → bugmail.mozilla
The problem is that the MillisecondsSinceStartup call at [1] can return a negative value, because the implementation of that function (at [2]) computes the difference between the MockContentController's time (which is initialized first) and GetStartupTime() (which is lazily initialized, so it can end up later than the MCC's time). However, the function returns a uint32_t, so the negative difference wraps around to a very large positive number. If the negative value is below -5ms, the check I added at [3] trips, we enter the if condition when we shouldn't, and that causes the test failure.

I can fix this easily enough in the test framework by ensuring that the MCC's time is always >= GetStartupTime(), but I'm wondering whether the code at [3] should be made robust to overflow as well. I'm leaning towards no, because passing what is really a negative number as the uint32_t timestamp is an implicit contract violation, and I don't think production code will hit this in practice (but really, who knows).

[1] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/test/gtest/InputUtils.h?rev=28db0fc71ad1#81
[2] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/test/gtest/APZTestCommon.h#292
[3] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/src/Axis.cpp?rev=17a70b41806c#78
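To make the wraparound concrete, here's a self-contained sketch; the names and constants are stand-ins for the test harness, not the real code:

    #include <cstdint>
    #include <cstdio>

    // Stand-ins for the real timestamps: the MockContentController's time
    // is initialized first (say at t=10ms), but GetStartupTime() is lazily
    // sampled later (say at t=15ms), so "startup" can be *after* MCC time.
    static const int64_t kMccTimeMs = 10;
    static const int64_t kStartupTimeMs = 15;

    // Mirrors the shape of MillisecondsSinceStartup: a possibly-negative
    // difference funneled through a uint32_t return type.
    uint32_t MillisecondsSinceStartup() {
      int64_t delta = kMccTimeMs - kStartupTimeMs;  // -5
      return static_cast<uint32_t>(delta);          // wraps to 4294967291
    }

    int main() {
      printf("timestamp = %u\n", MillisecondsSinceStartup());
      // A ">5ms" staleness comparison against this timestamp then trips
      // when it shouldn't, which is what happens at [3].
      return 0;
    }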
Applying this patch simulates the failure.
A 10ms sleep in the right place also reproduces the problem.
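For reference, the shape of the fix, sketched with hypothetical stand-ins (the actual patch is the attachment below; the real code uses mozilla::TimeStamp):

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>

    using TimeMs = int64_t;  // stand-in for mozilla::TimeStamp

    TimeMs GetStartupTime() {
      static const TimeMs sStartup = 15;  // lazily sampled on first call
      return sStartup;
    }

    struct MockContentController {
      TimeMs mTime;
      explicit MockContentController(TimeMs aNow)
        // The fix: clamp so mTime is always >= GetStartupTime(). Calling
        // GetStartupTime() here also forces the lazy initialization up
        // front, so the ordering hazard goes away entirely.
        : mTime(std::max(aNow, GetStartupTime())) {}
    };

    int main() {
      MockContentController mcc(10);  // "now" earlier than startup time
      // The difference is now 0, never negative, so the uint32_t
      // conversion in MillisecondsSinceStartup can't wrap.
      printf("%lld\n", (long long)(mcc.mTime - GetStartupTime()));
      return 0;
    }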
Attachment #8719986 - Flags: review?(botond) → review+
Comment on attachment 8719986 [details]
MozReview Request: Bug 1246056 - Ensure that the MockContentController's timestamp is always >= GetStartupTime(). r?botond

https://reviewboard.mozilla.org/r/35185/#review31805

That's a subtle bug - thanks for tracking it down!
https://hg.mozilla.org/mozilla-central/rev/96f7777635f8
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47