Intermittent APZCBasicTester.Fling | Expected: (pointOut.y) > (lastPoint.y), actual: 40 vs 40

RESOLVED FIXED in Firefox 46

Status

Product: Core
Component: Panning and Zooming
Status: RESOLVED FIXED
Reported: 2 years ago
Last modified: 2 years ago
People

(Reporter: philor, Assigned: kats)

Tracking

Keywords: intermittent-failure
Version: Trunk
Target Milestone: mozilla47
Platform: x86 Linux
Points: ---

Firefox Tracking Flags

(firefox45 unaffected, firefox46 fixed, firefox47 fixed, firefox-esr45 unaffected)


Attachments

(3 attachments)

(Reporter)

Description

2 years ago
How you manage to have an intermittent failure only on Linux32 PGO I don't know; how you manage to have an intermittent failure that's around 1 in 25 hit on two first-runs on two Linux32 PGO builds in a row I also don't know. But I do know we're getting nowhere retriggering to find a push which started it.

https://treeherder.mozilla.org/logviewer.html#?job_id=21102797&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=21096157&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=21100301&repo=mozilla-inbound
Comment 1

2 years ago

I blame the PGO code generator. The gtests are single-threaded and about as deterministic as you can get. On the other hand, I've seen strange floating-point inaccuracies creep into PGO builds that don't exist in non-PGO builds.
(Reporter)

Comment 2

2 years ago
No, 1 in 25 - we retriggered 25 times on the build that first hit it, and didn't get another instance, we stayed closed for hours building PGO on rbarker's repush to blame it on that and triggered gtest 25 times and got it once. It's not the case that we intermittently produce a build which deterministically hits this, it's the case that a given build will intermittently hit this.
Comment 3

2 years ago

Oh, weird. The only source of nondeterminism I can think of that would explain this is if the PGO build is failing to initialize some variable properly, and depending on what value it picks up, it fails or passes.

Comment 4

2 years ago

I guess the thing to do in that case would be to run the PGO build in valgrind and see if it catches anything.

Comment 5

2 years ago

Also, just for the record: if this *is* a regression, the most likely change to have introduced it recently is bug 1236046, considering that the test simulates a fling, which exercises the velocity computation code. But I did a pass over the relevant code and still don't see any reason why this should be nondeterministic.
(Reporter)

Comment 6

2 years ago
https://treeherder.mozilla.org/logviewer.html#?job_id=21188916&repo=mozilla-inbound is Win7 debug, which should save some time chasing down PGO miscompilation.
Comment 7

2 years ago

I ran valgrind on this test in a local Linux debug build and it didn't turn up anything interesting. If it's not PGO miscompilation and it's not an uninitialized read, then I'm not sure where the nondeterminism is coming from. I'll think about it some more.

Comment 8

2 years ago
8 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 6
* fx-team: 2

Platform breakdown:
* linux32: 6
* windows7-32: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1246056&startday=2016-02-01&endday=2016-02-07&tree=all
Comment 9

2 years ago

It's probably easiest for this to wait until we have Docker Linux debug tests; then we can run it locally under rr chaos mode and hopefully catch it.

The only other option I can think of is to get some more debugging output of what's going on.

Comment 10

2 years ago
13 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* fx-team: 6
* mozilla-inbound: 5
* mozilla-central: 2

Platform breakdown:
* linux32: 7
* windows7-32: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1246056&startday=2016-02-08&endday=2016-02-14&tree=all

Updated

2 years ago
Assignee: nobody → bgirard
I'm going to try to capture an rr replay. However, I ran into this bug, which is making that difficult:
https://github.com/mozilla/rr/issues/1644
I ran this gtest 1000 times with rr-chaos locally and couldn't repro the failure. However I was using a Linux debug build and maybe it only happens on Linux opt builds?
(Also, I was running the single gtest in isolation; maybe it needs to run as part of the full suite.)
I ran just the test and I also ran 'APZC*' for ~3000 cycles under Linux debug with rr+gecko chaos and it did not reproduce.
AFAIK this failure here doesn't come from a Linux PGO build. Can anyone confirm or deny?
https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=cb08d55f4e8b&selectedJob=7194463
Confirm, it's a regular Linux opt build. According to brasstacks it has happened on Linux opt, Linux pgo, and Windows 7 debug.
I did another 500 runs with a Linux opt build and the APZCBasicTester.* filter; no repro.
I reproduced it after fiddling with some code manually. I'll debug.
Assignee: bgirard → bugmail.mozilla
The problem is that the MillisecondsSinceStartup call at [1] can return a negative value, because the implementation of that function uses GetStartupTime (which is lazily initialized) and the MockContentController's time (which is initialized first) [2]. However, it returns a uint32_t, so the negative difference wraps around to a very large positive number. If the negative value is < -5ms, the check I added at [3] trips, so we enter the if condition when we shouldn't, and that causes the failure.

I can fix this easily enough in the test framework by ensuring that the MCC's time is always >= GetStartupTime(), but I'm wondering if the code at [3] should be made robust to overflow as well. I'm leaning towards no, because it's an implicit contract violation to pass what is really a negative number as the uint32_t timestamp, and I don't think production code will hit this in practice (but really, who knows).

[1] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/test/gtest/InputUtils.h?rev=28db0fc71ad1#81
[2] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/test/gtest/APZTestCommon.h#292
[3] https://mxr.mozilla.org/mozilla-central/source/gfx/layers/apz/src/Axis.cpp?rev=17a70b41806c#78
Created attachment 8719963 [details] [diff] [review]
Patch to repro problem

Applying this patch simulates the failure
Created attachment 8719979 [details] [diff] [review]
Simpler patch to reproduce the problem

A 10ms sleep in the right place also reproduces the problem.
Created attachment 8719986 [details]
MozReview Request: Bug 1246056 - Ensure that the MockContentController's timestamp is always >= GetStartupTime(). r?botond

Review commit: https://reviewboard.mozilla.org/r/35185/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/35185/
Attachment #8719986 - Flags: review?(botond)

Updated

2 years ago
Attachment #8719986 - Flags: review?(botond) → review+
Comment on attachment 8719986 [details]
MozReview Request: Bug 1246056 - Ensure that the MockContentController's timestamp is always >= GetStartupTime(). r?botond

https://reviewboard.mozilla.org/r/35185/#review31805

That's a subtle bug - thanks for tracking it down!

Comment 24

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/96f7777635f8

Comment 25

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/96f7777635f8
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox47: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Blocks: 1236046
https://hg.mozilla.org/releases/mozilla-aurora/rev/515e12afb4a5
status-firefox45: --- → unaffected
status-firefox46: --- → affected
status-firefox-esr45: --- → unaffected
status-firefox46: affected → fixed