Intermittent APZCBasicTester.Fling | Expected: (pointOut.y) > (lastPoint.y), actual: 40 vs 40

RESOLVED FIXED in Firefox 46

Status

Product: Core
Component: Panning and Zooming
Status: RESOLVED FIXED
Reported: 2 years ago
Last modified: 2 years ago
People

(Reporter: philor, Assigned: kats)

Tracking

Keywords: intermittent-failure
Version: Trunk
Target Milestone: mozilla47
Platform: x86 Linux
Points: ---

Firefox Tracking Flags

(firefox45 unaffected, firefox46 fixed, firefox47 fixed, firefox-esr45 unaffected)


Attachments

(3 attachments)

(Reporter)

Description

2 years ago
How you manage to have an intermittent failure only on Linux32 PGO I don't know; how you manage to have an intermittent failure that's around 1 in 25 hit on two first-runs on two Linux32 PGO builds in a row I also don't know. But I do know we're getting nowhere retriggering to find a push which started it.

https://treeherder.mozilla.org/logviewer.html#?job_id=21102797&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=21096157&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=21100301&repo=mozilla-inbound
Comment 1

2 years ago

I blame the PGO code generator. The gtests are single-threaded and about as deterministic as you can get. On the other hand, I've seen strange floating-point inaccuracies creep into PGO builds that don't exist in non-PGO builds.
(Reporter)

Comment 2

2 years ago
No, 1 in 25 - we retriggered 25 times on the build that first hit it, and didn't get another instance, we stayed closed for hours building PGO on rbarker's repush to blame it on that and triggered gtest 25 times and got it once. It's not the case that we intermittently produce a build which deterministically hits this, it's the case that a given build will intermittently hit this.
Comment 3

2 years ago

Oh, weird. The only source of nondeterminism I can think of that would explain this is if the PGO build is failing to initialize some variable properly, and depending on what value it picks up, it fails or passes.

Comment 4

2 years ago

I guess the thing to do in that case would be to run the PGO build in valgrind and see if it catches anything.

Comment 5

2 years ago

Also, just for the record: if this *is* a regression, the most likely change to have introduced it recently is bug 1236046, considering that the test simulates a fling, which exercises the velocity computation code. But I did a pass over the relevant code and still don't see any reason why this should be nondeterministic.
(Reporter)

Comment 6

2 years ago
https://treeherder.mozilla.org/logviewer.html#?job_id=21188916&repo=mozilla-inbound is Win7 debug, which should save some time chasing down PGO miscompilation.
Comment 7

2 years ago

I ran valgrind on this test in a local Linux debug build and it didn't turn up anything interesting. If it's not PGO miscompilation and it's not an uninitialized read, then I'm not sure where the nondeterminism is coming from. I'll think about it some more.

Comment 8

2 years ago
8 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 6
* fx-team: 2

Platform breakdown:
* linux32: 6
* windows7-32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1246056&startday=2016-02-01&endday=2016-02-07&tree=all
Comment 9

2 years ago

It's probably easiest for this to wait until we have Docker Linux debug tests; then we can run it locally under rr chaos mode and hopefully catch it.

The only other option I can think of is to get some more debugging output of what's going on.

Comment 10

2 years ago
13 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* fx-team: 6
* mozilla-inbound: 5
* mozilla-central: 2

Platform breakdown:
* linux32: 7
* windows7-32: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1246056&startday=2016-02-08&endday=2016-02-14&tree=all

Updated

2 years ago
Assignee: nobody → bgirard
I'm going to try to capture an rr replay. However, I ran into this bug, which is making that difficult:
https://github.com/mozilla/rr/issues/1644
I ran this gtest 1000 times with rr-chaos locally and couldn't repro the failure. However I was using a Linux debug build and maybe it only happens on Linux opt builds?
(Also, I was running the single gtest in isolation; maybe it needs to run as part of the full suite.)
I ran just the test and I also ran 'APZC*' for ~3000 cycles under Linux debug with rr+gecko chaos and it did not reproduce.
AFAIK this failure here doesn't come from a Linux PGO build. Can anyone confirm or deny?
https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=cb08d55f4e8b&selectedJob=7194463
Confirm, it's a regular Linux opt build. According to brasstacks it has happened on Linux opt, Linux pgo, and Windows 7 debug.
I did another 500 runs with a Linux opt build and the APZCBasicTester.* filter; no repro.
I reproduced it after fiddling with some code manually. I'll debug.
Assignee: bgirard → bugmail.mozilla
The problem is that the MillisecondsSinceStartup call at [1] can return a negative value, because the implementation of that function uses GetStartupTime (which is lazily initialized) and the MockContentController's time (which is initialized first) [2]. However, it returns a uint32_t, so the negative difference wraps around to a very large positive number. If the negative value is < -5ms, the check I added at [3] trips, so we enter the if condition when we shouldn't, and that causes the failure.

I can fix this easily enough in the test framework by ensuring that the MCC's time is always >= GetStartupTime(), but I'm wondering if the code at [3] should be made robust to overflow as well. I'm leaning towards no, because it's an implicit contract violation to pass what is really a negative number as the uint32_t timestamp, and I don't think production code will hit this in practice (but really, who knows).

[1] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/test/gtest/InputUtils.h?rev=28db0fc71ad1#81
[2] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/test/gtest/APZTestCommon.h#292
[3] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/src/Axis.cpp?rev=17a70b41806c#78
Created attachment 8719963 [details] [diff] [review]
Patch to repro problem

Applying this patch simulates the failure
Created attachment 8719979 [details] [diff] [review]
Simpler patch to reproduce the problem

A 10ms sleep in the right place also reproduces the problem.
Created attachment 8719986 [details]
MozReview Request: Bug 1246056 - Ensure that the MockContentController's timestamp is always >= GetStartupTime(). r?botond

Review commit: https://reviewboard.mozilla.org/r/35185/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/35185/
Attachment #8719986 - Flags: review?(botond)

Updated

2 years ago
Attachment #8719986 - Flags: review?(botond) → review+
Comment on attachment 8719986 [details]
MozReview Request: Bug 1246056 - Ensure that the MockContentController's timestamp is always >= GetStartupTime(). r?botond

https://reviewboard.mozilla.org/r/35185/#review31805

That's a subtle bug - thanks for tracking it down!

Comment 24

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/96f7777635f8

Comment 25

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/96f7777635f8
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox47: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Blocks: 1236046
https://hg.mozilla.org/releases/mozilla-aurora/rev/515e12afb4a5
status-firefox45: --- → unaffected
status-firefox46: --- → affected
status-firefox-esr45: --- → unaffected
status-firefox46: affected → fixed