Closed Bug 1544452 Opened 6 months ago Closed 4 months ago

osx Mojave (10.14) - gtests seem to fail on expirationtracker in debug mode

Categories

(Core :: XPCOM, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla69
Tracking Status
firefox-esr68 --- fixed
firefox69 --- fixed

People

(Reporter: jmaher, Assigned: KrisWright)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

in starting to look at upgrading our osx machines from 10.10 to 10.14, some tests are failing.

In this case, I see gtest failing:
11:06:13 INFO - TEST-START | ExpirationTracker.main
11:06:13 INFO - [842, Main Thread] WARNING: Expired objects were not removed or marked used: file /builds/worker/workspace/build/src/obj-firefox/dist/include/nsExpirationTracker.h, line 257
11:06:14 INFO - [842, Main Thread] WARNING: Expired objects were not removed or marked used: file /builds/worker/workspace/build/src/obj-firefox/dist/include/nsExpirationTracker.h, line 257
11:06:14 INFO - [842, Main Thread] WARNING: Expired objects were not removed or marked used: file /builds/worker/workspace/build/src/obj-firefox/dist/include/nsExpirationTracker.h, line 257
11:06:14 WARNING - TEST-UNEXPECTED-FAIL | ExpirationTracker.main | Value of: timeDiffMS < periodMS && aObj->mExpired
11:06:14 INFO - Actual: false
11:06:14 INFO - Expected: true @ /builds/worker/workspace/build/src/xpcom/tests/gtest/TestExpirationTracker.cpp:134
11:06:14 WARNING - TEST-UNEXPECTED-FAIL | ExpirationTracker.main | Value of: timeDiffMS < periodMS && aObj->mExpired
11:06:14 INFO - Actual: false
11:06:14 INFO - Expected: true @ /builds/worker/workspace/build/src/xpcom/tests/gtest/TestExpirationTracker.cpp:134
11:06:14 WARNING - TEST-UNEXPECTED-FAIL | ExpirationTracker.main | Value of: timeDiffMS < periodMS && aObj->mExpired
11:06:14 INFO - Actual: false

you can see the test runs and logs here:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher%40mozilla.com&fromchange=96a96f509d98f90ce3f8e35990df75153755b877&tochange=b9ef9517732230f87398e4b26d30ed5c2e34adce&searchStr=gtest&selectedJob=240125939

:njn, would you know much about this test (I saw you wrote patches for this test in the past) and have ideas for fixing this on osx mojave?

Flags: needinfo?(n.nethercote)

Apologies: I have no recollection or knowledge of this test and the Mercurial history says I have only ever made one trivial change to it.

I tried running it on my Mac laptop, which has 10.14, but it succeeded.

Flags: needinfo?(n.nethercote)

thanks :njn, hg history can be difficult to really figure out who knows about the test. Thanks for giving this a try locally.

This reproduces with:

MOZ_CHAOSMODE=4 ./mach gtest ExpirationTracker.*

MOZ_CHAOSMODE=4 makes our timers less predictable. I think we're just making bad assumptions that our timers will always fire precisely at the the expected time.

:erahm - this bug now appears on both shippable and debug variants of gtest: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=875b73dc18b721a2cb92286a61bb7684b068b09a

Would you be able to take a look at this test? It seems to be specific to macosx1014. There is a timeline for migration to 1014; would it be acceptable to have the test disabled for macosx1014?

Flags: needinfo?(erahm)

Kris, can you take a look at this test failure? We're seeing increased tests failures on OSX 10.14 in the ExpirationTracker test. Comment 4 has details on reproducing locally. We might just need to tweak the upper bounds used in the test due to the inevitable flakiness of timers. At this point we don't really have an owner for this code, but if you run into issues ping froydnj or myself.

Flags: needinfo?(erahm) → needinfo?(kwright)

The timer is falling above/below bounds during this check https://searchfox.org/mozilla-central/rev/8ed8474757695cdae047150a0eaf94a5f1c96dbe/xpcom/tests/gtest/TestExpirationTracker.cpp#133

Some of the values fall in predictably and suggest maybe increasing the slack to give them a bit more room to fail...but sometimes the test still fails randomly on a very large/small value (e.g. some 200+ ms above or ~70ms below the boundary). It doesn't seem to be a particular pattern of incriminating objects causing this, which leads me to believe that the timer is behaving strangely. The fact that outside of chaos mode here and on other platforms the tests pass (while still printing the |WARNING: Expired objects were not removed or marked used| notice) too indicates this may be a timer issue.

Filed bug 1560411 about the timers.
Edit: Bah, I entered the wrong bug number.

Flags: needinfo?(kwright)

:kriswright - while bug 1560411 is being worked on, would it be acceptable to either disable the test for macosx, or insert a larger value that permits the test to pass on macosx1014? Timeline wise, I am hoping to migrate this suite to macosx1014 within the next couple of weeks.

(In reply to Edwin Gao (:egao) from comment #8)

:kriswright - while bug 1560411 is being worked on, would it be acceptable to either disable the test for macosx, or insert a larger value that permits the test to pass on macosx1014? Timeline wise, I am hoping to migrate this suite to macosx1014 within the next couple of weeks.

I think the best course of action here would be to adjust the bounds for the test, probably by inserting a new slack value (found here https://searchfox.org/mozilla-central/rev/f91bd38732d4a330eba4e780812274b98eb81274/xpcom/tests/gtest/TestExpirationTracker.cpp#46). It looks like the slack boundary may need to be significantly increased for the time being.

It appears that this is happening across several tests, so filed bug 1561959 about addressing the timer fails.

I tried changing the bounds for the slackMs value and tried several much larger values, but none of them seem to permit the test to pass on macosx1014 by itself. Several values from 200ms to 14000ms with no luck.

However, once I also modified the value of periodMS from 100ms to 40000000ms. This seems to have done the trick.

I'm pushing several try runs without the ridiculous period and slack values; we will see which ones work.

2000/1000: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&searchStr=macosx%2Cgtest&revision=ad95459a235675c6d1d5d41a5b9024a252533f0d
1000/500: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&searchStr=macosx%2Cgtest&revision=97c04e97f58059316b7fff965feddb376aed53fe
600/260: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&searchStr=macosx%2Cgtest&revision=0cb49b483817b8d0e2787b7de0dcb93a4ab28a7c

(In reply to Edwin Gao (:egao) from comment #10)

I tried changing the bounds for the slackMs value and tried several much larger values, but none of them seem to permit the test to pass on macosx1014 by itself. Several values from 200ms to 14000ms with no luck.

However, once I also modified the value of periodMS from 100ms to 40000000ms. This seems to have done the trick.

I'm pushing several try runs without the ridiculous period and slack values; we will see which ones work.

2000/1000: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&searchStr=macosx%2Cgtest&revision=ad95459a235675c6d1d5d41a5b9024a252533f0d
1000/500: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&searchStr=macosx%2Cgtest&revision=97c04e97f58059316b7fff965feddb376aed53fe
600/260: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&searchStr=macosx%2Cgtest&revision=0cb49b483817b8d0e2787b7de0dcb93a4ab28a7c

Oh! LowerBoundMS is probably going negative. This will trip up the objects if the negative value is large enough. I apologize for the oversight. I had clamped the uint to 100 or greater.
I enabled logging to see how the timers behave now. Let's see if this works. If it clears I'll have a patch up shortly with a better slack range.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7b25a141d469f67656142fdab2b0d93da4d98ef0

Increased & clamped a slack boundary at 400ms. Lower bound vals should not drop below 100 ms (timer period).

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7b25a141d469f67656142fdab2b0d93da4d98ef0

Looks like the test is greenlighting now. I've put up a patch increasing slack boundaries to 400ms (the test up here is set to 1400 so I'm running a second push now to verify 400 as the right number). For the short term this is a good workaround to get the tests greenlighting again. If the newest push goes out of bounds again I'll adjust the patch accordingly.

keep in mind that try run is on 10.10, not 10.14, to run on 10.14, you need to uncomment out gtest from the test-sets.yml file:
https://searchfox.org/mozilla-central/source/taskcluster/ci/test/test-sets.yml#318

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #14)

keep in mind that try run is on 10.10, not 10.14, to run on 10.14, you need to uncomment out gtest from the test-sets.yml file:
https://searchfox.org/mozilla-central/source/taskcluster/ci/test/test-sets.yml#318

Oh! I apologize. Thank you for the info. This is a bit of a learning experience for me. Reruning now with the right settings - if it fails for any reason I will examine the logs in the morning.

Edit: I thought ExpirationTracker.main was passing but more repetitions started to fail. Will have a patch up soon with a sufficient upper bound range.

:kriswright - do note that there are other failure on macosx1014 for gtest that appears once ExpirationTracker tests pass. Notable failure are ThreadPool and WebRtcIceConnectTest failures; these are logged in bug 1561410 and bug bug 1558887.

Pushed by kwright@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/832c35afa350
Increase slack boundaries for ExpirationTracker gtest r=froydnj
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69
Assignee: nobody → kwright

Comment on attachment 9074938 [details]
Bug 1544452 - Increase slack boundaries for ExpirationTracker gtest

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: testonly change to green up test on osx 10.14.
  • User impact if declined:
  • Fix Landed on Version: 69
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky):
  • String or UUID changes made by this patch:
Attachment #9074938 - Flags: approval-mozilla-esr68?

Comment on attachment 9074938 [details]
Bug 1544452 - Increase slack boundaries for ExpirationTracker gtest

Doesn't really need approval since it's a test-only change, but sure. Approved for 68.1esr.

Attachment #9074938 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
You need to log in before you can comment on or make changes to this bug.