Closed Bug 1564621 Opened 5 years ago Closed 5 years ago

Intermittent <test-name> | application crashed [@ mozilla::dom::ScreenOrientation::~ScreenOrientation()]

Categories

(Core :: DOM: Core & HTML, defect, P1)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
mozilla70
Tracking Status
firefox-esr60 --- wontfix
firefox-esr68 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox70 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: marcosc)

References

Details

(4 keywords, Whiteboard: [geckoview:p1])

Crash Data

Attachments

(1 file)

Filed by: asferro [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=255609876&repo=try
Full log: https://queue.taskcluster.net/v1/task/L13q2qAOQPa3qBeWBXU1UQ/runs/0/artifacts/public/logs/live_backing.log


Enabling x86_64 geckoview junit tests triggers this intermittently.

Type: -- → defect
Component: General → Hardware Abstraction Layer (HAL)
Product: GeckoView → Core
Whiteboard: [geckoview]

Does anyone familiar with HAL know what's going on here? This assertion happens randomly on our test suite. See this try here: https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=255579830&revision=3195db587c1d3b46d315cf3df4ac1d3a146eacb6 (there are pretty much no changes with m-c, just enabling the tests)

Olli, do we really need to assert in ~ScreenOrientation that we've exited full screen [1] when shutting down?

This assertion failure is causing intermittent GeckoView junit test failures on x86_64. We can't disable individual tests because this assertion failure happens randomly in many tests.

[1] https://searchfox.org/mozilla-central/rev/15be167a5b436b57fef944b84eef061d24c1af8c/dom/base/ScreenOrientation.cpp#88

Component: Hardware Abstraction Layer (HAL) → DOM: Core & HTML
Flags: needinfo?(bugs)
Whiteboard: [geckoview] → [geckoview:p1]

I don't know. That assertion comes from bug 1131470.
I'm really not familiar with that code, would need to read it all.
It wasn't immediately clear whether the assertion is really something we must have, but not clear either whether we could remove it.

Perhaps baku remembers this better.

Flags: needinfo?(bugs) → needinfo?(amarchesini)

This assertion makes sense because, in theory, we want to release the fullscreen listener before the DTOR. If this doesn't happen, we have a bug somewhere. In general, reading the code quickly, it seems all too fragile:

  1. it seems that nothing guarantees Unlock to be called.
  2. the document could be hidden when LockOrientationTask() runs
  3. the pending promise could probably be overwritten if lock() is called more than once.

I would not remove that assertion. Instead I would improve the code in order to be sure to release the fullscreen listener correctly.
It seems that Marcos touched the code, recently. Maybe he has options and/or time to work on it.

Flags: needinfo?(amarchesini) → needinfo?(mcaceres)

To add some context: this assertion is preventing us from turning on x86_64 Junit tests for GeckoView.

Apologies for the delay... just getting to this...

(In reply to Andrea Marchesini [:baku] from comment #4)

This assertion makes sense because, in theory, we want to release the fullscreen listener before the DTOR. If this doesn't happen, we have a bug somewhere. In general, reading the code quickly, it seems all too fragile:

  1. it seems that nothing guarantees Unlock to be called.
  2. the document could be hidden when LockOrientationTask() runs
  3. the pending promise could probably be overwritten if lock() is called more than once.

All valid points. For 3, mPromise should be guarding against that...

I would not remove that assertion. Instead I would improve the code in order to be sure to release the fullscreen listener correctly.
It seems that Marcos touched the code, recently. Maybe he has options and/or time to work on it.

I'll got through it today and see if I can tighten some things up and figure out why mFullscreenListener is still alive.

Flags: needinfo?(mcaceres)
Assignee: nobody → mcaceres
Status: NEW → ASSIGNED
Priority: -- → P1

(In reply to Marcos Caceres [:marcosc] from comment #7)

Apologies for the delay... just getting to this...

(In reply to Andrea Marchesini [:baku] from comment #4)

This assertion makes sense because, in theory, we want to release the fullscreen listener before the DTOR. If this doesn't happen, we have a bug somewhere. In general, reading the code quickly, it seems all too fragile:

  1. it seems that nothing guarantees Unlock to be called.

This is more than likely the culprit. UnlockDeviceOrientation only gets called iff (mOrientationLock == hal::eScreenOrientation_None), so I think that's leaving mFullscreenListener alive.

  1. the document could be hidden when LockOrientationTask() runs

There is a check for this on line ~169, so I think this is ok:

  if (mDocument->Hidden()) {
    // Active orientation lock is not the document's orientation lock.
    mPromise->MaybeResolveWithUndefined();
    mDocument->SetOrientationPendingPromise(nullptr);
    return NS_OK;
  }
  1. the pending promise could probably be overwritten if lock() is called more than once.

AbortOrientationPromises() handles 3, actually.

I would not remove that assertion. Instead I would improve the code in order to be sure to release the fullscreen listener correctly.

Do we know what actual test is triggering this? I couldn't figure it out from the logs.

Flags: needinfo?(agi)

This happens in a few tests, e.g.:

org.mozilla.geckoview.test.MediaElementTest.webmTimeMedia
org.mozilla.geckoview.test.crash.CrashTest.crashContent
org.mozilla.geckoview.test.NavigationDelegateTest.onNewSession_supportNoOpener

You need to enable the debug x86_64 tests to run this in try: https://hg.mozilla.org/try/rev/134d603918a51b32c0a332514e52eed099232c7d

Unfortunately I've never been able to reproduce this locally.

Flags: needinfo?(agi)

(In reply to Agi | :agi | ⏰ EST | he/him from comment #9)

This happens in a few tests, e.g.:

org.mozilla.geckoview.test.MediaElementTest.webmTimeMedia
org.mozilla.geckoview.test.crash.CrashTest.crashContent
org.mozilla.geckoview.test.NavigationDelegateTest.onNewSession_supportNoOpener

I can't find these in Searchfox, or what they look like from a JS perspective? (I've zero Android or jUnit experience, sorry).

You need to enable the debug x86_64 tests to run this in try: https://hg.mozilla.org/try/rev/134d603918a51b32c0a332514e52eed099232c7d

Will add. Thank you!

Unfortunately I've never been able to reproduce this locally.

I have a hypothesis that .exitFullScreen() and .unlock() are causing a race condition.

I'll put up a patch, which includes enabling the x86_64 tests on Try... however, I've no idea how to get them to repeat to get a significant sample of runs to see if we can reproduce? Should I just get try to rebuild 20 times? Or is there some other way to get this to run continuously.

I'll try and see if it happens once... then can send it to rebuild 20 times. But if there is some way to get it run the tests 1000 times or so, that would be better.

(In reply to Marcos Caceres [:marcosc] from comment #10)

(In reply to Agi | :agi | ⏰ EST | he/him from comment #9)

This happens in a few tests, e.g.:

org.mozilla.geckoview.test.MediaElementTest.webmTimeMedia
org.mozilla.geckoview.test.crash.CrashTest.crashContent
org.mozilla.geckoview.test.NavigationDelegateTest.onNewSession_supportNoOpener

I can't find these in Searchfox, or what they look like from a JS perspective? (I've zero Android or jUnit experience, sorry).

You want to look for the last part of the name after the ., that's the method name, e.g. https://searchfox.org/mozilla-central/rev/4436573fa3d5c4311822090e188504c10c916c6f/mobile/android/geckoview/src/androidTest/java/org/mozilla/geckoview/test/NavigationDelegateTest.kt#1104

You need to enable the debug x86_64 tests to run this in try: https://hg.mozilla.org/try/rev/134d603918a51b32c0a332514e52eed099232c7d

Will add. Thank you!

Unfortunately I've never been able to reproduce this locally.

I have a hypothesis that .exitFullScreen() and .unlock() are causing a race condition.

I'll put up a patch, which includes enabling the x86_64 tests on Try... however, I've no idea how to get them to repeat to get a significant sample of runs to see if we can reproduce? Should I just get try to rebuild 20 times? Or is there some other way to get this to run continuously.

I'll try and see if it happens once... then can send it to rebuild 20 times. But if there is some way to get it run the tests 1000 times or so, that would be better.

I don't know of a better way to do that :) if it doesn't happen after 20 runs we should be in the clear, it happens every 2 or 3 runs usually. Note: there are other crashes too, that's normal.

Try chooser doesn't let me set the number of rebuilds 👎... so trying this manually:

./mach try --rebuild 20 -b o -p android -u junit

Will see what comes out (I'm just taking a wild guess at the above)...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=696e0ccee6d19dc5b8bd9bd762b144312cfefa83

Worst case, I can just configure try chooser manually 10 times or so.

I wonder if it would be too abhorrent to just explicitly call ::UnlockDeviceOrientation() in the DTOR? The would give us the guarantee of unlock, it would assure that mFullscreenListener gets nulled out.

I still think the fragility is coming from the ScreenOrientation::FullscreenEventListener::HandleEvent() not called before ScreenOrientation is destroyed - but that requires going up one level and figuring out if it is that event not dispatching (if that's the cause). Having looked at the code for a few days now, I can't see what else it could be.

Trying that out...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b5e52103b0235c87d8688ff8bf93c8d4681f3eff

Flags: needinfo?(amarchesini)

It could be a solution, yes. But would be nice to have a bug to cleanup the current implementation (P3...?)

Flags: needinfo?(amarchesini)

We are unsure as to the actual cause of the crash, but we suspect that when fullscreen exists,
the even listener is not being called. That was leaving mFullscreenListener alive when the DTOR gets called
hitting the MOZ_ASSERT(!mFullscreenListener).

We now always all call UnlockDeviceOrientation() in the DTOR to ensure that:

  1. hal::UnlockScreenOrientation() gets called.
  2. mFullscreenListener gets null'ed
  3. we RemoveSystemEventListener() for "fullscreenchange".
Pushed by mcaceres@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/adc36d5fd87a
Intermittent <test-name> | application crashed [@ mozilla::dom::ScreenOrientation::~ScreenOrientation()] r=baku
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

Marcos, should we uplift your fix to GeckoView Beta and Fennec (ESR 68.1)? Does your fix prevent a real crash or just the test assertion failure?

Flags: needinfo?(mcaceres)
Keywords: assertion

I'm not aware of it crashing anything in practice... it seems to occasionally happen in our testing infrastructure. Other folks here might know if this is actually crashing for real on Android. I'd feel more comfortable not uplifting and letting this sit on nightly for a bit.

Agi, wdyt?

Flags: needinfo?(mcaceres) → needinfo?(agi)

Not aware of any real crash for this, just an assertion failure in debug builds.

Flags: needinfo?(agi)

(In reply to :Agi | ⏰ PST | he/him from comment #23)

Not aware of any real crash for this, just an assertion failure in debug builds.

In that case, I'm setting firefox69=wontfix and firefox-esr68=wontfix because we don't need to uplift for GV Beta or Fennec ESR.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: