Closed Bug 1039159 Opened 8 years ago Closed 8 years ago

[tarako][dolphin][monkey test] monkey test crash at libxul.so!mozilla::layers::Axis::UpdateWithTouchAtDevicePoint(int, mozilla::TimeDuration const&) [nsTArray.h : 572 + 0xb]

Categories

(Core :: Panning and Zooming, defect)

All
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla34
blocking-b2g 1.4+
Tracking Status
firefox32 --- wontfix
firefox33 --- wontfix
firefox34 --- fixed
b2g-v1.3T --- wontfix
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: angelc04, Assigned: kats)

References

(Blocks 1 open bug)

Details

(Keywords: crash, Whiteboard: [sprd333865][sprd325352][partner-blocker])

Attachments

(5 files)

---- stack ----
0  libc.so + 0x220a8
1  libxul.so!mozilla::layers::Axis::UpdateWithTouchAtDevicePoint(int, mozilla::TimeDuration const&) [nsTArray.h : 572 + 0xb]
2  libxul.so!mozilla::layers::AsyncPanZoomController::UpdateWithTouchAtDevicePoint(mozilla::MultiTouchInput const&) [AsyncPanZoomController.cpp : 1143 + 0xb]
3  libxul.so!mozilla::layers::AsyncPanZoomController::TrackTouch(mozilla::MultiTouchInput const&) [AsyncPanZoomController.cpp : 1249 + 0x7]
4  libxul.so!mozilla::layers::AsyncPanZoomController::OnTouchMove(mozilla::MultiTouchInput const&) [AsyncPanZoomController.cpp : 710 + 0x3]
5  libxul.so!mozilla::layers::AsyncPanZoomController::HandleInputEvent(mozilla::InputData const&) [AsyncPanZoomController.cpp : 590 + 0x7]
6  libxul.so!mozilla::layers::APZCTreeManager::ProcessTouchEvent(mozilla::WidgetTouchEvent&, mozilla::layers::ScrollableLayerGuid*) [APZCTreeManager.cpp : 495 + 0x7]
7  libxul.so!mozilla::dom::TabParent::SendRealTouchEvent(mozilla::WidgetTouchEvent&) [TabParent.cpp : 842 + 0x3]
8  libxul.so!mozilla::dom::TabParent::TryCapture(mozilla::WidgetGUIEvent const&) [TabParent.cpp : 886 + 0x7]
9  libxul.so!nsWindow::DispatchInputEvent(mozilla::WidgetGUIEvent&, bool*) [nsWindow.cpp : 260 + 0x5]
10  libxul.so!GeckoInputDispatcher::dispatchOnce() [nsAppShell.cpp : 263 + 0x9]
11  libxul.so!nsAppShell::ProcessNextNativeEvent(bool) [nsAppShell.cpp : 1009 + 0x5]
12  libxul.so!nsBaseAppShell::DoProcessNextNativeEvent(bool, unsigned int) [nsBaseAppShell.cpp : 140 + 0x9]
13  libxul.so!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool, unsigned int) [nsBaseAppShell.cpp : 298 + 0x5]
14  libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp : 667 + 0xd]
15  libxul.so!NS_ProcessNextEvent(nsIThread*, bool) [nsThreadUtils.cpp : 263 + 0xb]
16  libxul.so!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) [MessagePump.cpp : 95 + 0x7]
17  libxul.so!MessageLoop::RunInternal() [message_loop.cc : 226 + 0x5]
18  libxul.so!MessageLoop::Run() [message_loop.cc : 219 + 0x5]
19  libxul.so!nsBaseAppShell::Run() [nsBaseAppShell.cpp : 164 + 0x7]
20  libxul.so!nsAppStartup::Run() [nsAppStartup.cpp : 276 + 0x5]
21  libxul.so!XREMain::XRE_mainRun() [nsAppRunner.cpp : 4010 + 0x5]
22  libxul.so!XREMain::XRE_main(int, char**, nsXREAppData const*) [nsAppRunner.cpp : 4079 + 0x5]
23  libxul.so!XRE_main [nsAppRunner.cpp : 4291 + 0x3]
24  b2g!main [nsBrowserApp.cpp : 163 + 0xf]
25  libc.so!__libc_init [libc_init_dynamic.cpp : 112 + 0x7]
26  b2g + 0x1fea
27  linker!set_soinfo_pool_protection [linker.cpp : 291 + 0xb]
28  0xbee48b36
Whiteboard: [sprd333865]
Summary: [dolphin][monkey test] monkey test crash at libxul.so!mozilla::layers::Axis::UpdateWithTouchAtDevicePoint(int, mozilla::TimeDuration const&) [nsTArray.h : 572 + 0xb] → [tarako][dolphin][monkey test] monkey test crash at libxul.so!mozilla::layers::Axis::UpdateWithTouchAtDevicePoint(int, mozilla::TimeDuration const&) [nsTArray.h : 572 + 0xb]
Whiteboard: [sprd333865] → [sprd333865][sprd325352]
blocking-b2g: --- → 1.4?
Whiteboard: [sprd333865][sprd325352] → [sprd333865][sprd325352][partner-blocker]
Hi Jerry,

Is this something you can help for the analysis? Or is there anyway for further trouble shooting such as add more logs?
blocking-b2g: 1.4? → ---
Flags: needinfo?(hshih)
The call stack show that we might get crash in nsTArray mVelocityQueue.
The gfxPrefs::APZMaxVelocityQueueSize() size is 5.
So I don't think that we have no enough memory to append the new element into mVelocityQueue.
I will add some thread checking code for all mVelocityQueue accessing to make sure we only access this queue at main thread. And add some log in nsTArray when alloc memory failed.


  // Limit queue size pased on pref
  mVelocityQueue.AppendElement(mVelocity);
  if (mVelocityQueue.Length() > gfxPrefs::APZMaxVelocityQueueSize()) {
    mVelocityQueue.RemoveElementAt(0);
  }
Flags: needinfo?(hshih)
Hi Kats,

In comment 3, is it true that we always access the Axis.cpp data at b2g main thread?
Flags: needinfo?(bugmail.mozilla)
I did a quick look through the code, and it looks like Axis::CancelTouch is called from NotifyLayersUpdated which is on the compositor thread. CancelTouch modifies mVelocityQueue so that might be the source of the problem here.
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #5)
> I did a quick look through the code, and it looks like Axis::CancelTouch is
> called from NotifyLayersUpdated which is on the compositor thread.
> CancelTouch modifies mVelocityQueue so that might be the source of the
> problem here.

Thanks for your quick look. :)
If we will access(maybe read and write) the apzc data in both main thread and compositor thread, we need to use a mutex to protect these data. Or just post all r/w request to main thread's message loop(or compositor thread).
Here are a lot of data members in apzc, do we have an existing mechanism to handling this problem?
I'm not sure which apzc function will be called at main thread or compositor thread.
Flags: needinfo?(bugmail.mozilla)
Unfortunately we don't have a consistent handling of data members across threads in APZC. There is a metabug that is tracking these sorts of issues; I'm linking it here. In this case I think it should be ok if we replace these three lines:

    mX.CancelTouch();
    mY.CancelTouch();
    SetState(NOTHING);

with a call to CancelAnimation(). CancelAnimation doesn't touch the mVelocityQueue member so it should remove this crash at least. We should definitely deal with the metabug and fix other possible threading problems throughout the code.
Component: General → Panning and Zooming
Flags: needinfo?(bugmail.mozilla)
OS: Linux → Gonk (Firefox OS)
Product: Firefox OS → Core
Hardware: x86_64 → All
Version: unspecified → Trunk
Duplicate of this bug: 1028814
Duplicate of this bug: 1013025
Jerry, let me know if you'd rather I wrote up the patch for this. I have a couple of 2.0 blockers on my plate right now but I should be able to take this later this week if you don't get to it first.
Hi Kats,
In CancelAnimation(), we just set the Velocity in axis into 0, but we don't clear all the queue in axis.
We might get a wrong velocity computing result for the further input(we calculate the velocity with the old value and the new coming value).
Comment on attachment 8460147 [details] [diff] [review]
add a monitor in axis to protect nsTArray struct

Could we just use a monitor to protect the nsTArray data?
Attachment #8460147 - Flags: feedback?(bugmail.mozilla)
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #11)
> Hi Kats,
> In CancelAnimation(), we just set the Velocity in axis into 0, but we don't
> clear all the queue in axis.
> We might get a wrong velocity computing result for the further input(we
> calculate the velocity with the old value and the new coming value).

Not really. To be honest I don't think the CancelTouch() calls there make sense at all. The code probably used to do something else before which is why they're there. Clearing the velocity queue when we get a new page load doesn't make much sense, because the velocity queue should only be based on when the user's finger is down/up and moving.

In other words, right now CancelTouch() has an effect if the user's finger is down and we load a new page while the user's finger is still down. In this case we currently clear the velocity queue. I would rather keep the velocity queue here because the user's finger hasn't left the screen and the velocity queue is still valid.

Changing the call to a CancelAnimation as I suggested makes sense, because you generally don't want an animation to persist from one page to the next, and we also never want to go into state NOTHING while we potentially have overscroll, as that will cause the page to get stuck in overscroll.
Comment on attachment 8460147 [details] [diff] [review]
add a monitor in axis to protect nsTArray struct

Review of attachment 8460147 [details] [diff] [review]:
-----------------------------------------------------------------

As explained above I would rather just get rid of the calls to CancelTouch() and SetState(NOTHING) in NotifyLayersUpdated and do a CancelAnimation() there instead.
Attachment #8460147 - Flags: feedback?(bugmail.mozilla) → feedback-
Attached patch Part 1 - FixSplinter Review
Assignee: nobody → bugmail.mozilla
Attachment #8460894 - Flags: review?(botond)
Attachment #8460894 - Flags: review?(botond) → review+
Comment on attachment 8460894 [details] [diff] [review]
Part 1 - Fix

Review of attachment 8460894 [details] [diff] [review]:
-----------------------------------------------------------------

I notice that, other than the call sites remove in this patch, CancelTouch() is only called from {pan gesture}-processing code. Should it also be called from {touch event}-processing code?
Attachment #8460895 - Flags: review?(botond) → review+
Yeah I suppose that makes sense :)
Attachment #8460965 - Flags: review?(botond)
Attachment #8460965 - Flags: review?(botond) → review+
Please merge these patches into v1.3t
Flags: needinfo?(wchang)
Hi Kats,

Can patch be used directly in v1.4? I am setting this bug to be v1.4 blocker for the patch to be landed in v1.4 as well.
blocking-b2g: --- → 1.4+
Flags: needinfo?(bugmail.mozilla)
I can backport the fix to 1.4, but I think it's kind of risky because the code was very different back then. It'll be up to you to ensure sufficient testing to make sure it doesn't break anything.
Flags: needinfo?(bugmail.mozilla)
Whiteboard: [sprd333865][sprd325352][partner-blocker] → [sprd333865][sprd325352][partner-blocker][NO_UPLIFT]
Here is the backported fix. I haven't tested it at all. Please also verify the fix actually fixes the bug on master before considering landing this on 1.3t/1.4.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #24)
> Created attachment 8461455 [details] [diff] [review]
Patch for B2G1.3/1.4
> (uncompiled, untested)

Here is the backported fix. I haven't tested it at
> all. Please also verify the fix actually fixes the bug on master before
> considering landing this on 1.3t/1.4.

Ying, please try it on 1.4, there's risk to land on v1.3 now.
Flags: needinfo?(ying.xu)
Not merging to 1.3t.
Flags: needinfo?(wchang)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #24)
> Created attachment 8461455 [details] [diff] [review]
> Patch for B2G1.3/1.4 (uncompiled, untested)
> 
> Here is the backported fix. I haven't tested it at all. Please also verify
> the fix actually fixes the bug on master before considering landing this on
> 1.3t/1.4.

Request from stakeholders, needinfo:"whsu".
I will test this patch via Monkey test.
Thanks!
Flags: needinfo?(whsu)
FWIW, due to recent policy changes (see dev-gaia), if you intend to also land this on v2.0, you'll need to explicitly request b2g32 on the patches as 1.4+ no longer grants auto-approval.
Hi, everyone,

After 4 hours monkey test, I still cannot catch this bug.
I would like to check the results again on Monday(8/4).

By the way, as we discussed via phone, if partner feels it is a critical bug, please help us verify this patch in parallel.


Keep needinfo "whsu". I will keep you posted.
Hi, everyone,

After executing 60 hours monkey test on the weekend (8/1~8/3), I cannot reproduce this crash event with the local patch (attachment 8461455 [details] [diff] [review])

* Build information:
 - Gaia      3feb37ee2ed2319c9e556728723a5517dc1663ea
 - Gecko     attachment 8461455 [details] [diff] [review]
 - BuildID   20140801173943
 - Version   30.0
Flags: needinfo?(whsu)
Kats -
given :whsu's test results above and 1.4+ here, can you do a 1.4 merge?

Also if you believe this is beneficial to 2.0, as suggested by :RyanVM please do a b2g32 approval request.
Flags: needinfo?(bugmail.mozilla)
Attachment #8461455 - Attachment description: Patch for B2G1.3/1.4 (uncompiled, untested) → Patch for B2G1.3/1.4
Whiteboard: [sprd333865][sprd325352][partner-blocker][NO_UPLIFT] → [sprd333865][sprd325352][partner-blocker]
Comment on attachment 8460894 [details] [diff] [review]
Part 1 - Fix

NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unsure, but pretty early APZ code
User impact if declined: A rare crash might occur because of improper multithreaded access of a variable. Might also result in other types of undefined behaviour/corruption.
Testing completed: see comment 30 and comment 31. Patch has been tested on gecko 30 and gecko 34, not specifically on gecko 32 though.
Risk to taking this patch (and alternatives if risky): Low risk. In the worst case I think we might end up with transient unwanted behavior in cases where the user's finger is down while a new page is loading. I don't think it would be severe.
String or UUID changes made by this patch: none
Attachment #8460894 - Flags: approval-mozilla-b2g32?
Flags: needinfo?(bugmail.mozilla)
Note: we only need to uplift part 1 for the actual fix. Parts 2 and 3 are nice to have but not strictly necessary.
Keywords: crash
Comment on attachment 8460894 [details] [diff] [review]
Part 1 - Fix

Approving given the risk and that it may help with a crash. please land only part 1 which will fix the bug here.
Attachment #8460894 - Flags: approval-mozilla-b2g32? → approval-mozilla-b2g32+
Flags: needinfo?(ying.xu)
You need to log in before you can comment on or make changes to this bug.