Closed Bug 1457590 Opened 7 years ago Closed 7 years ago

Intermittent gfx/layers/apz/test/mochitest/test_group_mouseevents.html | application crashed [@ mozilla::layers::AsyncPanZoomController::ConvertScrollbarPoint(mozilla::gfx::PointTyped<mozilla::ParentLayer after Assertion failure: mIsSome

Categories

(Core :: Panning and Zooming, defect, P5)

defect

Tracking

()

RESOLVED FIXED
mozilla62
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: kats)

References

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(3 files)

Filed by: ncsoregi [at] mozilla.com https://treeherder.mozilla.org/logviewer.html#?job_id=176001225&repo=autoland https://queue.taskcluster.net/v1/task/C4TlOa9sS3qVQWi8jiXL4g/runs/0/artifacts/public/logs/live_backing.log 12:05:26 INFO - TEST-PASS | gfx/layers/apz/test/mochitest/test_group_mouseevents.html | helper_drag_scroll.html | Scroll position strictly increased 12:05:26 INFO - Buffered messages logged at 12:04:37 12:05:26 INFO - must wait for load 12:05:26 INFO - Buffered messages finished 12:05:26 ERROR - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_group_mouseevents.html | application terminated with exit code 1 12:05:26 INFO - runtests.py | Application ran for: 0:01:26.715013 12:05:26 INFO - zombiecheck | Reading PID log: /var/folders/_2/5rcdryb97wl3b_q4535l0xqc00000w/T/tmp17ZKFJpidlog 12:05:26 INFO - ==> process 2146 launched child process 2147 12:05:26 INFO - ==> process 2146 launched child process 2148 12:05:26 INFO - ==> process 2146 launched child process 2149 12:05:26 INFO - zombiecheck | Checking for orphan process with PID: 2147 12:05:26 INFO - zombiecheck | Checking for orphan process with PID: 2148 12:05:26 INFO - zombiecheck | Checking for orphan process with PID: 2149 12:05:26 INFO - mozcrash Copy/paste: /Users/cltbld/tasks/task_1524854160/build/macosx64-minidump_stackwalk /var/folders/_2/5rcdryb97wl3b_q4535l0xqc00000w/T/tmpovTx_6.mozrunner/minidumps/1D86CA9E-BCFC-474D-AFAF-CFEC25C082E0.dmp /Users/cltbld/tasks/task_1524854160/build/symbols 12:05:46 INFO - mozcrash Saved minidump as /Users/cltbld/tasks/task_1524854160/build/blobber_upload_dir/1D86CA9E-BCFC-474D-AFAF-CFEC25C082E0.dmp 12:05:46 INFO - mozcrash Saved app info as /Users/cltbld/tasks/task_1524854160/build/blobber_upload_dir/1D86CA9E-BCFC-474D-AFAF-CFEC25C082E0.extra 12:05:46 INFO - PROCESS-CRASH | gfx/layers/apz/test/mochitest/test_group_mouseevents.html | application crashed [@ mozilla::layers::AsyncPanZoomController::ConvertScrollbarPoint(mozilla::gfx::PointTyped<mozilla::ParentLayerPixel, float> const&, mozilla::layers::ScrollbarData const&) const] 12:05:46 INFO - Crash dump filename: /var/folders/_2/5rcdryb97wl3b_q4535l0xqc00000w/T/tmpovTx_6.mozrunner/minidumps/1D86CA9E-BCFC-474D-AFAF-CFEC25C082E0.dmp 12:05:46 INFO - Operating system: Mac OS X 12:05:46 INFO - 10.10.5 14F27 12:05:46 INFO - CPU: amd64 12:05:46 INFO - family 6 model 69 stepping 1 12:05:46 INFO - 4 CPUs 12:05:46 INFO - 12:05:46 INFO - GPU: UNKNOWN 12:05:46 INFO - 12:05:46 INFO - Crash reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS 12:05:46 INFO - Crash address: 0x0 12:05:46 INFO - Process uptime: 37 seconds 12:05:46 INFO -
According to Ben's comment in bug 1458400 comment 0, it's the *thumbData.mDirection at [1] that's asserting. Which is odd, because in the caller function we assert [2]. And these are both MOZ_ASSERTs so it's not a debug vs opt build issue. Somehow mDirection is getting set to Nothing in between [1] and [2]? Maybe a race condition with the updater thread updating the hit-testing tree node while this is running on the controller thread without holding the tree lock? [1] https://searchfox.org/mozilla-central/rev/795575287259a370d00518098472eaa5b362bfa7/gfx/layers/apz/src/AsyncPanZoomController.cpp#1754 [2] https://searchfox.org/mozilla-central/rev/795575287259a370d00518098472eaa5b362bfa7/gfx/layers/apz/src/APZCTreeManager.cpp#1782
Note, I only think its the mDirection because thats the only Maybe<> I could find involved in the code. It seems like a lot of inlining was happening, though, which made it hard to be completely sure.
I think comment 3 is accurate. We can fix either by excluding scrollbar nodes from being recycled, or ensure that we hold the tree lock while handling input events that fiddle with the scrollbar nodes. The latter is probably more robust, but we'll need to make sure we don't introduce any lock ordering violations.
I discussed this with Botond today, and we decided to do the following to solve the ordering violation: - Add a flag on the HTTN that tracks if the node can be recycled. This flag would only be read/mutated while holding the APZ tree lock - The functions that return a HTTN on the controller thread (e.g. GetTargetNode and GetTargetAPZC) would set the flag before returning the HTTN - The caller would clear the flag when it's done with the HTTN As a mechanism to make sure the flag is set/cleared correctly we could introduce a HitTestingTreeNodeRef structure that stores a RefPtr to the HitTestingTreeNode and manages the flag in a RAII fashion. This type would be the thing returned from GetTargetNode/GetTargetAPZC instead of the RefPtr<HitTestingTreeNode> that is currently used. Furthermore we can make things like HTTN::GetScrollbarData private and only accessible via the HTTNRef, to ensure that we only access it while the flag is set and the node is not in danger of getting recycled.
Crash Signature: [@ mozilla::layers::AsyncPanZoomController::ConvertScrollbarPoint(mozilla::gfx::PointTyped<mozilla::ParentLayerPixel, float> const&, mozilla::layers::ScrollbarData const&) const] → [@ mozilla::layers::AsyncPanZoomController::ConvertScrollbarPoint] [@ mozilla::layers::AsyncPanZoomController::ConvertScrollbarPoint(mozilla::gfx::PointTyped<mozilla::ParentLayerPixel, float> const&, mozilla::layers::ScrollbarData const&) const]
Assignee: nobody → bugmail
I couldn't find a way to reliably reproduce this, but we can monitor the effectiveness of the patches via crash-stats, since we are seeing these crashes in the wild. Also, sorry if I went a little overboard with the aProofOfTreeLock business, but I figured it was better than documenting it in comments since this way it's compiler-enforceable.
Comment on attachment 8981996 [details] Bug 1457590 - Strengthen the contract around recycling HitTestingTreeNodes. https://reviewboard.mozilla.org/r/248048/#review254198
Attachment #8981996 - Flags: review?(botond) → review+
Comment on attachment 8981997 [details] Bug 1457590 - Add the HitTestingTreeNodeAutoLock class. https://reviewboard.mozilla.org/r/248050/#review254204 ::: gfx/layers/apz/src/HitTestingTreeNode.cpp:456 (Diff revision 3) > + return; > + } > + MOZ_ASSERT(mTreeMutex); > + > + { // scope lock > + RecursiveMutexAutoLock lock(*mTreeMutex); Are we sure a node will never outlive the APZCTM (and thus its mutex)?
Attachment #8981997 - Flags: review?(botond) → review+
Comment on attachment 8981998 [details] Bug 1457590 - Use the HitTestingTreeNodeAutoLock. https://reviewboard.mozilla.org/r/248052/#review254222 Nice and clean, I like it!
Attachment #8981998 - Flags: review?(botond) → review+
(In reply to Botond Ballo [:botond] from comment #28) > Are we sure a node will never outlive the APZCTM (and thus its mutex)? Fairly sure, yeah. Conceptually the APZCTM owns the nodes, and we don't have any code outside APZ that touches the HitTestingTreeNode. Any operations that hold refptrs to nodes should be going through the APZCTreeManager and have that on the stack, so the mutex should still be valid at that point. If we run into situations where that's not the case I think we'd have bigger problems.
Pushed by kgupta@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/8a1bdb8d8bf5 Strengthen the contract around recycling HitTestingTreeNodes. r=botond https://hg.mozilla.org/integration/autoland/rev/437b901404db Add the HitTestingTreeNodeAutoLock class. r=botond https://hg.mozilla.org/integration/autoland/rev/85d13ab240fa Use the HitTestingTreeNodeAutoLock. r=botond
I discussed with with RyanVM on IRC - there is a real correctness issue affecting older versions, which then got exposed as an assertion failure due to recent changes. And it would be nice to fix that correctness issue, but the patch isn't totally trivial either. I'd prefer to let it bake for a bit, and considering we're close to the end of the current cycle we might as well let it ride the trains.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: