1702504 - Crash in [@ mozilla::widget::TouchResampler::NotifyFrame]

Ryan VanderMeulen [:RyanVM]

Reporter

Description

•

3 years ago

•

Edited

Started in the 85-based Fenix & Focus releases.

Crash report: https://crash-stats.mozilla.org/report/index/bd0e1a9e-a29a-4238-b308-3ab920210401

MOZ_CRASH Reason: MOZ_RELEASE_ASSERT(lastTouchTime)

Top 10 frames of crashing thread:

0 libxul.so mozilla::widget::TouchResampler::NotifyFrame widget/TouchResampler.cpp:92
1 libxul.so mozilla::widget::NPZCSupport::OnVsync widget/android/nsWindow.cpp:747
2 libxul.so void mozilla::jni::NativeStub<mozilla::java::AndroidVsync::NotifyVsync_t, mozilla::widget::AndroidVsyncSupport, mozilla::jni::Args<long> >::Wrap<&mozilla::widget::AndroidVsyncSupport widget/android/jni/Natives.h:1462
3 base.odex base.odex@0xae424 
4 base.art] base.art]@0x1dbfec 
5 boot-framework.art] boot-framework.art]@0x150fc 
6 boot-framework.art] boot-framework.art]@0x929e4c 
7 base.odex base.odex@0x264b48 
8 base.art] base.art]@0x1dc03c 
9 boot.art] boot.art]@0x414

Flags: needinfo?(mstange.moz)

Pascal Chevrel:pascalc (PTO until April 26)

Updated

•

3 years ago

status-firefox88: affected → wontfix

status-firefox89: affected → wontfix

status-firefox90: --- → affected

Marco Castelluccio [:marco]

Updated

•

3 years ago

Keywords: regression

Julien Cristau [:jcristau]

Updated

•

3 years ago

status-firefox90: affected → wontfix

status-firefox91: --- → affected

Pascal Chevrel:pascalc (PTO until April 26)

Updated

•

3 years ago

status-firefox91: affected → wontfix

Ryan VanderMeulen [:RyanVM]

Reporter

Updated

•

2 years ago

Flags: needinfo?(mstange.moz)

BMO Automation

Updated

•

2 years ago

Has Regression Range: --- → yes

Botond Ballo [:botond]

Comment 1

•

2 years ago

Had a look at this, and I don't see any obvious reason why this crash might occur:

The value being asserted is a timestamp, which starts as null and is only modified in CurrentTouches::UpdateFromEvent() when processing a MULTITOUCH_MOVE event, set to the event's timestamp; that in turn is called from TouchResampler::ProcessEvent().
Due to this condition, the assertion is only reached if mDeferredTouchMoveEvents is nonempty.
mDeferredTouchMoveEvents is only populated here in TouchResampler::ProcessEvent(), if a MULTITOUCH_MOVE event is processed. Thus, if it's nonempty, a move event has been processed in the past and thus the timestamp should be set from the event. Assuming the event itself doesn't have a null timestamp, the assertion should pass.

I considered threading issues as well, but that part seems fine: ProcessEvent() is called from HandleMotionEvent() which should be called on the Java UI thread. The assertion is in TouchResampler::NotifyFrame() which is called from OnVsync() which is documented to also be called on the Java UI thread.

Markus, do you see anything I'm overlooking? If we can't track down the assertion's cause, should we revisit whether it needs to be a release assert? It seems we could fall back gracefully to e.g. calling FlushDeferredTouchMoveEventsUnresampled() instead.

Flags: needinfo?(mstange.moz)

Markus Stange [:mstange]

Comment 2

•

2 years ago

Thanks so much for taking a look!

(In reply to Botond Ballo [:botond] from comment #1)

I considered threading issues as well, but that part seems fine: ProcessEvent() is called from HandleMotionEvent() which should be called on the Java UI thread. The assertion is in TouchResampler::NotifyFrame() which is called from OnVsync() which is documented to also be called on the Java UI thread.

I wonder if we have a use-after-free even with everything happening on the Java UI thread because of this code in AndroidVsync::NotifyVsync:

  // Do not keep the lock held while calling OnVsync.
  nsTArray<Observer*> observers;
  {
    auto impl = mImpl.Lock();
    observers.AppendElements(impl->mInputObservers);
    observers.AppendElements(impl->mRenderObservers);
  }
  for (Observer* observer : observers) {
    observer->OnVsync(timeStamp);
  }

It doesn't look like I thought this code through when I wrote it. We are holding on to an array of raw pointers and expect all of them to still be valid when we call OnVsync on them. I'm not sure why I thought this was ok.

Specifically, if we have two observers, and if notifying the first observer causes code to run which deletes the second observer (and unregisters it), then we call OnVsync on the second observer when it's already gone.
There are currently only two possible observer types: NPZCSupport (which registers itself as an input observer), and AndroidVsyncSource::Display (which registers itself as a render observer). So this function will always call NPZCSupport::OnVsync before AndroidVsyncSource::Display::OnVsync.
I wonder if it's possible to have two NPZCSupport instances registered at the same time? And then whether it's possible that NPZCSupport::OnVsync on the first instance could cause the destruction of the second NPZCSupport instance?

If all those things came together, we'd call NPZCSupport::OnVsync on a destroyed NPZCSupport object, which would call TouchResampler::NotifyFrame on a destroyed TouchResampler object, and then this release assertion might be hit.

Flags: needinfo?(mstange.moz) → needinfo?(botond)

Botond Ballo [:botond]

Comment 3

•

2 years ago

(In reply to Markus Stange [:mstange] from comment #2)

I wonder if it's possible to have two NPZCSupport instances registered at the same time? And then whether it's possible that NPZCSupport::OnVsync on the first instance could cause the destruction of the second NPZCSupport instance?

Good questions. I'm not really familiar enough with the GeckoView integration to answer them, but perhaps Agi might know.

Specifically:

Can there be more than one NPZCSupport object alive at one time? (AndroidVsync has a static instance, so this would imply more than one could be registered at the same time.)
Could the code inside NPZCSupport::OnVsync() (particularly places that call back into GeckoView, like the GeckoResult::Complete() calls in FinishHandlingMotionEvent()) trigger destruction of another NPZCSupport instance?
Finally, what is a safe way NPZCSupport can register itself as an observer with another C++ component (AndroidVsync), such that the pointer stored by that component either keeps the NPZCSupport alive, or can check before invoking the observer whether it's still alive?

Flags: needinfo?(botond) → needinfo?(agi)

[ex-Mozilla] Agi Sferro | :agi

Comment 4

•

2 years ago

•

Edited

(In reply to Botond Ballo [:botond] from comment #3)

(In reply to Markus Stange [:mstange] from comment #2)

I wonder if it's possible to have two NPZCSupport instances registered at the same time? And then whether it's possible that NPZCSupport::OnVsync on the first instance could cause the destruction of the second NPZCSupport instance?

Good questions. I'm not really familiar enough with the GeckoView integration to answer them, but perhaps Agi might know.

Specifically:

Can there be more than one NPZCSupport object alive at one time? (AndroidVsync has a static instance, so this would imply more than one could be registered at the same time.)

There should be one for each GeckoSession ( = each tab)

Could the code inside NPZCSupport::OnVsync() (particularly places that call back into GeckoView, like the GeckoResult::Complete() calls in FinishHandlingMotionEvent()) trigger destruction of another NPZCSupport instance?

I can't think of a way that would happen right now, but it is possible.

Finally, what is a safe way NPZCSupport can register itself as an observer with another C++ component (AndroidVsync), such that the pointer stored by that component either keeps the NPZCSupport alive, or can check before invoking the observer whether it's still alive?

the Java GC owns NPZCSupport, so anywhere in native code we cannot assume that the pointer is valid. I think we would need to have the observers be some sort of WeakPtr (the easiest would be mozilla::jni::NativeWeakPtr: https://searchfox.org/mozilla-central/rev/ad38c9d1f0a9036c4da5271a849f47342b68adc8/widget/android/nsWindow.h#87)

Flags: needinfo?(agi)

Botond Ballo [:botond]

Comment 5

•

2 years ago

Thanks. Sounds like a potential fix worth trying would be to change the AndroidVsync observer registered by NPZCSupport from the NPZCSupport object itself, to a wrapper which holds a NativeWeakPtr to the NPZCSupport, and null-checks it at invocation time.

Botond Ballo [:botond]

Updated

•

2 years ago

Severity: -- → S3

Priority: -- → P2

Razvan Cojocaru

Comment 6

•

1 year ago

So you want AndroidVsync::RegisterObserver() and AndroidVsync::UnregisterObserver() to take a NativeWeakPtr<Observer> parameter (instead of Observer *), and store them in nsTArray<NativeWeakPtr<Observer>> arrays?

Botond Ballo [:botond]

Comment 7

•

1 year ago

(In reply to Razvan Cojocaru from comment #6)

So you want AndroidVsync::RegisterObserver() and AndroidVsync::UnregisterObserver() to take a NativeWeakPtr<Observer> parameter (instead of Observer *), and store them in nsTArray<NativeWeakPtr<Observer>> arrays?

This is an interesting alternative I hadn't considered. Unfortunately, I don't think we can do this, because I believe the only types that can be stored inside NativeWeakPtr<> are ones that derive from an appropriate JNI-related base class. This is true of NPZCSupport but not of AndroidVsyncSource.

So, I think what we'd want to do instead is:

Instead of having NPZCSupport implement AndroidVsync::Observer itself, give it a nested class NPZCSupport::Observer which implements AndroidVsync::Observer.
Have NPZCSupport::Observer store a NativeWeakPtr<NPZCSupport>.
In NPZCSupport::Observer::OnVsync(), null-check the NativeWeakPtr and only call NPZCSupport::OnVsync() if the object is still alive.

Razvan Cojocaru

Comment 8

•

1 year ago

•

Edited

(In reply to Botond Ballo [:botond] from comment #7)

Instead of having NPZCSupport implement AndroidVsync::Observer itself, give it a nested class NPZCSupport::Observer which implements AndroidVsync::Observer.

Speaking of NPZCSupport::Observer, it appears to be implemented like so:

  class Observer {
   public:
    // Will be called on the Java UI thread.
    virtual void OnVsync(const TimeStamp& aTimeStamp) = 0;
  };

I strongly believe that all classes meant as interfaces (that is, base classes with virtual - especially pure virtual - member functions) should have a virtual destructor. This is supported by a variety of sources: Stroustrup's FAQ, the C++ Core Guidelines, the Sonar documentation, and so on.

Perhaps this is something worth considering for the future (of course not for this bug)?

Razvan Cojocaru

Comment 9

•

1 year ago

(In reply to Botond Ballo [:botond] from comment #7)

(In reply to Razvan Cojocaru from comment #6)

So you want AndroidVsync::RegisterObserver() and AndroidVsync::UnregisterObserver() to take a NativeWeakPtr<Observer> parameter (instead of Observer *), and store them in nsTArray<NativeWeakPtr<Observer>> arrays?

This is an interesting alternative I hadn't considered. Unfortunately, I don't think we can do this, because I believe the only types that can be stored inside NativeWeakPtr<> are ones that derive from an appropriate JNI-related base class. This is true of NPZCSupport but not of AndroidVsyncSource.

So, I think what we'd want to do instead is:

Instead of having NPZCSupport implement AndroidVsync::Observer itself, give it a nested class NPZCSupport::Observer which implements AndroidVsync::Observer.

Who would manage the lifetime of the NPZCSupport::Observer instance we'd want to pass instead of this? It can't be a member of the current NPZCSupport, since in that case the problem persists: it will disappear alongside the object it is a member of. Creating it on the heap with new also doesn't appear to be a good solution at a first glance at the code, since AndroidVsync::UnregisterObserver() appears to simply remove the pointer from its raw-pointer arrays (this is why I proposed to store weak pointers instead of raw pointers in the observer arrays) - but doesn't delete it.

Since you've got much more experience with the Firefox code, I'm sure I'm just missing something?

Botond Ballo [:botond]

Comment 10

•

1 year ago

(In reply to Razvan Cojocaru from comment #8)

Speaking of NPZCSupport::Observer, it appears to be implemented like so:
  class Observer {
   public:
    // Will be called on the Java UI thread.
    virtual void OnVsync(const TimeStamp& aTimeStamp) = 0;
  };
I strongly believe that all classes meant as interfaces (that is, base classes with virtual - especially pure virtual - member functions) should have a virtual destructor. This is supported by a variety of sources: Stroustrup's FAQ, the C++ Core Guidelines, the Sonar documentation, and so on.

Perhaps this is something worth considering for the future (of course not for this bug)?

Agreed, we should give Observer a virtual destructor.

Procedurally, in cases like this where we notice a cleanup/improvement opportunity in code we're working with that's not strictly related to the bug being fixed, we'll sometimes do the cleanup/improvement in an additional patch on the same bug. This both keeps the patches clean (avoids mixing different changes in one patch) while also avoiding the overhead of filing a new bug.

Botond Ballo [:botond]

Comment 11

•

1 year ago

(In reply to Razvan Cojocaru from comment #9)

Who would manage the lifetime of the NPZCSupport::Observer instance we'd want to pass instead of this? It can't be a member of the current NPZCSupport, since in that case the problem persists: it will disappear alongside the object it is a member of. Creating it on the heap with new also doesn't appear to be a good solution at a first glance at the code, since AndroidVsync::UnregisterObserver() appears to simply remove the pointer from its raw-pointer arrays (this is why I proposed to store weak pointers instead of raw pointers in the observer arrays) - but doesn't delete it.

Since you've got much more experience with the Firefox code, I'm sure I'm just missing something?

You're not missing anything, this is a very good question that I've overlooked.

Would it be too hacky to give Observer a second virtual function named Dispose(), which UnregisterObserver() calls, and which in the case of NPZCSupport::Observer calls delete this (and then we can just create it using new)?

The other option I can think of is to change AndroidVsync to accept and store RefPtr<Observer> instead (i.e. strong references). However, in the case where the Observer is AndroidVsyncSource, that stores a strong refrence to AndroidVsync, so it seems we'd be creating a strong reference cycle.

I'm open to other suggestions as well.

Razvan Cojocaru

Comment 12

•

1 year ago

(In reply to Botond Ballo [:botond] from comment #10)

Agreed, we should give Observer a virtual destructor.

Procedurally, in cases like this where we notice a cleanup/improvement opportunity in code we're working with that's not strictly related to the bug being fixed, we'll sometimes do the cleanup/improvement in an additional patch on the same bug. This both keeps the patches clean (avoids mixing different changes in one patch) while also avoiding the overhead of filing a new bug.

Great, I'll do the modification in the patch then - should be short and painless. :) Thanks!

(In reply to Botond Ballo [:botond] from comment #11)

Would it be too hacky to give Observer a second virtual function named Dispose(), which UnregisterObserver() calls, and which in the case of NPZCSupport::Observer calls delete this (and then we can just create it using new)?

Well it is (the simplified to the base essentials version of) COM+'s AddRef()/Release() so it's not all that hacky. The solution's bound to be hacky anyway if we want to be able to store both regular pointers to on-the-stack objects and pointers to heap-allocated objects. We could just use standard C++ smart pointers as well and pass dummy do-nothing custom deleters for stack objects, but that's just as hacky and arguably more complicated than the Dispose() solution.

Just one last question: what's the proper way to create a NativeWeakPtr<NPZCSupport> from this (in a NPZCSupport object)? Or is that not what we want here?

Botond Ballo [:botond]

Comment 13

•

1 year ago

(In reply to Razvan Cojocaru from comment #12)

Just one last question: what's the proper way to create a NativeWeakPtr<NPZCSupport> from this (in a NPZCSupport object)? Or is that not what we want here?

Perhaps something like this:

Give nsWindow a method GetNPZCSupportWeakPtr() which returns a copy of mNPZCSupport
In NPZCSupport, do something like:

if (auto win = mWindow.Access()) {
  nsWindow* gkWindow = window->GetNsWindow();
  NativeWeakPtr<NPZCSupport> weakPtrToThis = gkWindow->GetNPZCSupportWeakPtr();
}

It's possible there is an easier way I'm not familiar with. We will ask for code review from someone on the GeckoView team and they can suggest improvements if needed.

Botond Ballo [:botond]

Comment 14

•

1 year ago

Another option might be to have NPZCSupport::Observer work more like NPZCSupport::InputEvent: store a java::PanZoomController::NativeProvider::GlobalRef, and recover the NativeWeakPtr<NPZCSupport> at invocation time like this.

Razvan Cojocaru

Comment 15

•

1 year ago

Attached file Bug 1702504 - Fix crash in [@ mozilla::widget::TouchResampler::NotifyFrame]. r?botond — Details

Give Observer a virtual destructor while at it.

Phabricator Automation

Updated

•

1 year ago

Assignee: nobody → rzvncj

Status: NEW → ASSIGNED

Razvan Cojocaru

Comment 16

•

1 year ago

(In reply to Botond Ballo [:botond] from comment #10)

Procedurally, in cases like this where we notice a cleanup/improvement opportunity in code we're working with that's not strictly related to the bug being fixed, we'll sometimes do the cleanup/improvement in an additional patch on the same bug. This both keeps the patches clean (avoids mixing different changes in one patch) while also avoiding the overhead of filing a new bug.

Gah, I've just re-read this and I've completely missed the part about the change being in a separate patch, and now the single patch has been accepted. I'm sorry, should have paid more attention. I can split them now if you want.

Botond Ballo [:botond]

Comment 17

•

1 year ago

(In reply to Razvan Cojocaru from comment #16)

(In reply to Botond Ballo [:botond] from comment #10)

Procedurally, in cases like this where we notice a cleanup/improvement opportunity in code we're working with that's not strictly related to the bug being fixed, we'll sometimes do the cleanup/improvement in an additional patch on the same bug. This both keeps the patches clean (avoids mixing different changes in one patch) while also avoiding the overhead of filing a new bug.

Gah, I've just re-read this and I've completely missed the part about the change being in a separate patch, and now the single patch has been accepted. I'm sorry, should have paid more attention. I can split them now if you want.

No worries, we're adding a new virtual function anyways so it seems fairly natural to add the virtual destructor in the same patch.

Pulsebot

Comment 18

•

1 year ago

Pushed by bballo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/81da50971081
Fix crash in [@ mozilla::widget::TouchResampler::NotifyFrame]. r=botond,geckoview-reviewers,m_kato

Botond Ballo [:botond]

Comment 19

•

1 year ago

Adding leave-open keyword since we shouldn't close this bug until we see the crashes actually stop occuring.

Keywords: leave-open

Sandor Molnar[:smolnar]

Comment 20

•

1 year ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/81da50971081

Botond Ballo [:botond]

Comment 21

•

1 year ago

(In reply to Botond Ballo [:botond] from comment #19)

Adding leave-open keyword since we shouldn't close this bug until we see the crashes actually stop occuring.

Writing down mostly for my own reference:

Our speculative fix landed in version 110, which will hit the release on Feb 14
Once 110 has been out for a few weeks, if we see crash reports from 108 and 109 but no crash reports from 110, that's a good sign our fix worked

Botond Ballo [:botond]

Comment 22

•

1 year ago

Unfortunately, we're still seeing crashes in 110, which suggests that our theory about the cause of the crash was incorrect.

I don't have any alternative theories at this time. If anyone else has suggestions, please feel free to share them in a comment. Meanwhile I'm going to leave the bug open.

Razvan Cojocaru

Comment 23

•

1 year ago

(In reply to Botond Ballo [:botond] from comment #22)

Unfortunately, we're still seeing crashes in 110, which suggests that our theory about the cause of the crash was incorrect.

I don't have any alternative theories at this time. If anyone else has suggestions, please feel free to share them in a comment. Meanwhile I'm going to leave the bug open.

Sorry to hear that. At least we've fixed a problem with the previous patch. :)

Razvan Cojocaru

Updated

•

1 year ago

Assignee: rzvncj → nobody

Status: ASSIGNED → NEW

Botond Ballo [:botond]

Comment 24

•

1 year ago

(In reply to Razvan Cojocaru from comment #23)

At least we've fixed a problem with the previous patch. :)

Indeed, thanks for your work on that!

Razvan Cojocaru

Comment 25

•

1 year ago

(In reply to Botond Ballo [:botond] from comment #24)

(In reply to Razvan Cojocaru from comment #23)

At least we've fixed a problem with the previous patch. :)

Indeed, thanks for your work on that!

Happy to help!

Makoto Kato [:m_kato]

Updated

•

1 year ago

Regressions: 1826027

BugBot [:suhaib / :marco/ :calixte]

Comment 26

•

7 months ago

The leave-open keyword is there and there is no activity for 6 months.
:botond, maybe it's time to close this bug?
For more information, please visit BugBot documentation.

Flags: needinfo?(botond)

Botond Ballo [:botond]

Comment 27

•

7 months ago

The crash is still happening occasionally, so the bug should remain open.

However, the leave-open keyword can be removed (it only needs to be present at the time a patch lands, if we don't want the landing to close the bug).

Flags: needinfo?(botond)

Keywords: leave-open