Closed
Bug 1457590
Opened 7 years ago
Closed 7 years ago
Intermittent gfx/layers/apz/test/mochitest/test_group_mouseevents.html | application crashed [@ mozilla::layers::AsyncPanZoomController::ConvertScrollbarPoint(mozilla::gfx::PointTyped<mozilla::ParentLayer after Assertion failure: mIsSome
Categories
(Core :: Panning and Zooming, defect, P5)
Core
Panning and Zooming
Tracking
()
RESOLVED
FIXED
mozilla62
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox-esr60 | --- | wontfix |
firefox60 | --- | wontfix |
firefox61 | --- | wontfix |
firefox62 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: kats)
References
Details
(Keywords: crash, intermittent-failure)
Crash Data
Attachments
(3 files)
Filed by: ncsoregi [at] mozilla.com
https://treeherder.mozilla.org/logviewer.html#?job_id=176001225&repo=autoland
https://queue.taskcluster.net/v1/task/C4TlOa9sS3qVQWi8jiXL4g/runs/0/artifacts/public/logs/live_backing.log
12:05:26 INFO - TEST-PASS | gfx/layers/apz/test/mochitest/test_group_mouseevents.html | helper_drag_scroll.html | Scroll position strictly increased
12:05:26 INFO - Buffered messages logged at 12:04:37
12:05:26 INFO - must wait for load
12:05:26 INFO - Buffered messages finished
12:05:26 ERROR - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_group_mouseevents.html | application terminated with exit code 1
12:05:26 INFO - runtests.py | Application ran for: 0:01:26.715013
12:05:26 INFO - zombiecheck | Reading PID log: /var/folders/_2/5rcdryb97wl3b_q4535l0xqc00000w/T/tmp17ZKFJpidlog
12:05:26 INFO - ==> process 2146 launched child process 2147
12:05:26 INFO - ==> process 2146 launched child process 2148
12:05:26 INFO - ==> process 2146 launched child process 2149
12:05:26 INFO - zombiecheck | Checking for orphan process with PID: 2147
12:05:26 INFO - zombiecheck | Checking for orphan process with PID: 2148
12:05:26 INFO - zombiecheck | Checking for orphan process with PID: 2149
12:05:26 INFO - mozcrash Copy/paste: /Users/cltbld/tasks/task_1524854160/build/macosx64-minidump_stackwalk /var/folders/_2/5rcdryb97wl3b_q4535l0xqc00000w/T/tmpovTx_6.mozrunner/minidumps/1D86CA9E-BCFC-474D-AFAF-CFEC25C082E0.dmp /Users/cltbld/tasks/task_1524854160/build/symbols
12:05:46 INFO - mozcrash Saved minidump as /Users/cltbld/tasks/task_1524854160/build/blobber_upload_dir/1D86CA9E-BCFC-474D-AFAF-CFEC25C082E0.dmp
12:05:46 INFO - mozcrash Saved app info as /Users/cltbld/tasks/task_1524854160/build/blobber_upload_dir/1D86CA9E-BCFC-474D-AFAF-CFEC25C082E0.extra
12:05:46 INFO - PROCESS-CRASH | gfx/layers/apz/test/mochitest/test_group_mouseevents.html | application crashed [@ mozilla::layers::AsyncPanZoomController::ConvertScrollbarPoint(mozilla::gfx::PointTyped<mozilla::ParentLayerPixel, float> const&, mozilla::layers::ScrollbarData const&) const]
12:05:46 INFO - Crash dump filename: /var/folders/_2/5rcdryb97wl3b_q4535l0xqc00000w/T/tmpovTx_6.mozrunner/minidumps/1D86CA9E-BCFC-474D-AFAF-CFEC25C082E0.dmp
12:05:46 INFO - Operating system: Mac OS X
12:05:46 INFO - 10.10.5 14F27
12:05:46 INFO - CPU: amd64
12:05:46 INFO - family 6 model 69 stepping 1
12:05:46 INFO - 4 CPUs
12:05:46 INFO -
12:05:46 INFO - GPU: UNKNOWN
12:05:46 INFO -
12:05:46 INFO - Crash reason: EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
12:05:46 INFO - Crash address: 0x0
12:05:46 INFO - Process uptime: 37 seconds
12:05:46 INFO -
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 3•7 years ago
|
||
According to Ben's comment in bug 1458400 comment 0, it's the *thumbData.mDirection at [1] that's asserting. Which is odd, because in the caller function we assert [2]. And these are both MOZ_ASSERTs so it's not a debug vs opt build issue. Somehow mDirection is getting set to Nothing in between [1] and [2]? Maybe a race condition with the updater thread updating the hit-testing tree node while this is running on the controller thread without holding the tree lock?
[1] https://searchfox.org/mozilla-central/rev/795575287259a370d00518098472eaa5b362bfa7/gfx/layers/apz/src/AsyncPanZoomController.cpp#1754
[2] https://searchfox.org/mozilla-central/rev/795575287259a370d00518098472eaa5b362bfa7/gfx/layers/apz/src/APZCTreeManager.cpp#1782
Comment 4•7 years ago
|
||
Note, I only think its the mDirection because thats the only Maybe<> I could find involved in the code. It seems like a lot of inlining was happening, though, which made it hard to be completely sure.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 8•7 years ago
|
||
I think comment 3 is accurate. We can fix either by excluding scrollbar nodes from being recycled, or ensure that we hold the tree lock while handling input events that fiddle with the scrollbar nodes. The latter is probably more robust, but we'll need to make sure we don't introduce any lock ordering violations.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 11•7 years ago
|
||
I discussed this with Botond today, and we decided to do the following to solve the ordering violation:
- Add a flag on the HTTN that tracks if the node can be recycled. This flag would only be read/mutated while holding the APZ tree lock
- The functions that return a HTTN on the controller thread (e.g. GetTargetNode and GetTargetAPZC) would set the flag before returning the HTTN
- The caller would clear the flag when it's done with the HTTN
As a mechanism to make sure the flag is set/cleared correctly we could introduce a HitTestingTreeNodeRef structure that stores a RefPtr to the HitTestingTreeNode and manages the flag in a RAII fashion. This type would be the thing returned from GetTargetNode/GetTargetAPZC instead of the RefPtr<HitTestingTreeNode> that is currently used. Furthermore we can make things like HTTN::GetScrollbarData private and only accessible via the HTTNRef, to ensure that we only access it while the flag is set and the node is not in danger of getting recycled.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•7 years ago
|
Crash Signature: [@ mozilla::layers::AsyncPanZoomController::ConvertScrollbarPoint(mozilla::gfx::PointTyped<mozilla::ParentLayerPixel, float> const&, mozilla::layers::ScrollbarData const&) const] → [@ mozilla::layers::AsyncPanZoomController::ConvertScrollbarPoint]
[@ mozilla::layers::AsyncPanZoomController::ConvertScrollbarPoint(mozilla::gfx::PointTyped<mozilla::ParentLayerPixel, float> const&, mozilla::layers::ScrollbarData const&) const]
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → bugmail
Assignee | ||
Comment 14•7 years ago
|
||
Assignee | ||
Updated•7 years ago
|
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 18•7 years ago
|
||
I couldn't find a way to reliably reproduce this, but we can monitor the effectiveness of the patches via crash-stats, since we are seeing these crashes in the wild.
Also, sorry if I went a little overboard with the aProofOfTreeLock business, but I figured it was better than documenting it in comments since this way it's compiler-enforceable.
Assignee | ||
Comment 19•7 years ago
|
||
Update to fix the static analysis compile error: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7c74c0bddbc5ab3a94472d4dab7b571e6da40837
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 23•7 years ago
|
||
... and with the silly-null-deref fix: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1aa96a507724617e6531b72ba7ab0c7fbc858e46
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 27•7 years ago
|
||
mozreview-review |
Comment on attachment 8981996 [details]
Bug 1457590 - Strengthen the contract around recycling HitTestingTreeNodes.
https://reviewboard.mozilla.org/r/248048/#review254198
Attachment #8981996 -
Flags: review?(botond) → review+
Comment 28•7 years ago
|
||
mozreview-review |
Comment on attachment 8981997 [details]
Bug 1457590 - Add the HitTestingTreeNodeAutoLock class.
https://reviewboard.mozilla.org/r/248050/#review254204
::: gfx/layers/apz/src/HitTestingTreeNode.cpp:456
(Diff revision 3)
> + return;
> + }
> + MOZ_ASSERT(mTreeMutex);
> +
> + { // scope lock
> + RecursiveMutexAutoLock lock(*mTreeMutex);
Are we sure a node will never outlive the APZCTM (and thus its mutex)?
Attachment #8981997 -
Flags: review?(botond) → review+
Comment 29•7 years ago
|
||
mozreview-review |
Comment on attachment 8981998 [details]
Bug 1457590 - Use the HitTestingTreeNodeAutoLock.
https://reviewboard.mozilla.org/r/248052/#review254222
Nice and clean, I like it!
Attachment #8981998 -
Flags: review?(botond) → review+
Assignee | ||
Comment 30•7 years ago
|
||
(In reply to Botond Ballo [:botond] from comment #28)
> Are we sure a node will never outlive the APZCTM (and thus its mutex)?
Fairly sure, yeah. Conceptually the APZCTM owns the nodes, and we don't have any code outside APZ that touches the HitTestingTreeNode. Any operations that hold refptrs to nodes should be going through the APZCTreeManager and have that on the stack, so the mutex should still be valid at that point. If we run into situations where that's not the case I think we'd have bigger problems.
Comment 31•7 years ago
|
||
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8a1bdb8d8bf5
Strengthen the contract around recycling HitTestingTreeNodes. r=botond
https://hg.mozilla.org/integration/autoland/rev/437b901404db
Add the HitTestingTreeNodeAutoLock class. r=botond
https://hg.mozilla.org/integration/autoland/rev/85d13ab240fa
Use the HitTestingTreeNodeAutoLock. r=botond
Comment 32•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/8a1bdb8d8bf5
https://hg.mozilla.org/mozilla-central/rev/437b901404db
https://hg.mozilla.org/mozilla-central/rev/85d13ab240fa
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox62:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
Updated•7 years ago
|
status-firefox60:
--- → wontfix
status-firefox61:
--- → wontfix
status-firefox-esr52:
--- → unaffected
status-firefox-esr60:
--- → wontfix
Assignee | ||
Comment 33•7 years ago
|
||
I discussed with with RyanVM on IRC - there is a real correctness issue affecting older versions, which then got exposed as an assertion failure due to recent changes. And it would be nice to fix that correctness issue, but the patch isn't totally trivial either. I'd prefer to let it bake for a bit, and considering we're close to the end of the current cycle we might as well let it ride the trains.
Comment hidden (Intermittent Failures Robot) |
You need to log in
before you can comment on or make changes to this bug.
Description
•