Open Bug 1702504 Opened 3 years ago Updated 7 months ago

Crash in [@ mozilla::widget::TouchResampler::NotifyFrame]

Categories

(Core :: Panning and Zooming, defect, P2)

Unspecified
Android
defect

Tracking

()

Tracking Status
firefox-esr78 --- unaffected
firefox87 --- wontfix
firefox88 --- wontfix
firefox89 --- wontfix
firefox90 --- wontfix
firefox91 --- wontfix

People

(Reporter: RyanVM, Unassigned)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

Started in the 85-based Fenix & Focus releases.

Crash report: https://crash-stats.mozilla.org/report/index/bd0e1a9e-a29a-4238-b308-3ab920210401

MOZ_CRASH Reason: MOZ_RELEASE_ASSERT(lastTouchTime)

Top 10 frames of crashing thread:

0 libxul.so mozilla::widget::TouchResampler::NotifyFrame widget/TouchResampler.cpp:92
1 libxul.so mozilla::widget::NPZCSupport::OnVsync widget/android/nsWindow.cpp:747
2 libxul.so void mozilla::jni::NativeStub<mozilla::java::AndroidVsync::NotifyVsync_t, mozilla::widget::AndroidVsyncSupport, mozilla::jni::Args<long> >::Wrap<&mozilla::widget::AndroidVsyncSupport widget/android/jni/Natives.h:1462
3 base.odex base.odex@0xae424 
4 base.art] base.art]@0x1dbfec 
5 boot-framework.art] boot-framework.art]@0x150fc 
6 boot-framework.art] boot-framework.art]@0x929e4c 
7 base.odex base.odex@0x264b48 
8 base.art] base.art]@0x1dc03c 
9 boot.art] boot.art]@0x414 
Flags: needinfo?(mstange.moz)
Flags: needinfo?(mstange.moz)
Has Regression Range: --- → yes

Had a look at this, and I don't see any obvious reason why this crash might occur:

  • The value being asserted is a timestamp, which starts as null and is only modified in CurrentTouches::UpdateFromEvent() when processing a MULTITOUCH_MOVE event, set to the event's timestamp; that in turn is called from TouchResampler::ProcessEvent().
  • Due to this condition, the assertion is only reached if mDeferredTouchMoveEvents is nonempty.
  • mDeferredTouchMoveEvents is only populated here in TouchResampler::ProcessEvent(), if a MULTITOUCH_MOVE event is processed. Thus, if it's nonempty, a move event has been processed in the past and thus the timestamp should be set from the event. Assuming the event itself doesn't have a null timestamp, the assertion should pass.

I considered threading issues as well, but that part seems fine: ProcessEvent() is called from HandleMotionEvent() which should be called on the Java UI thread. The assertion is in TouchResampler::NotifyFrame() which is called from OnVsync() which is documented to also be called on the Java UI thread.

Markus, do you see anything I'm overlooking? If we can't track down the assertion's cause, should we revisit whether it needs to be a release assert? It seems we could fall back gracefully to e.g. calling FlushDeferredTouchMoveEventsUnresampled() instead.

Flags: needinfo?(mstange.moz)

Thanks so much for taking a look!

(In reply to Botond Ballo [:botond] from comment #1)

I considered threading issues as well, but that part seems fine: ProcessEvent() is called from HandleMotionEvent() which should be called on the Java UI thread. The assertion is in TouchResampler::NotifyFrame() which is called from OnVsync() which is documented to also be called on the Java UI thread.

I wonder if we have a use-after-free even with everything happening on the Java UI thread because of this code in AndroidVsync::NotifyVsync:

  // Do not keep the lock held while calling OnVsync.
  nsTArray<Observer*> observers;
  {
    auto impl = mImpl.Lock();
    observers.AppendElements(impl->mInputObservers);
    observers.AppendElements(impl->mRenderObservers);
  }
  for (Observer* observer : observers) {
    observer->OnVsync(timeStamp);
  }

It doesn't look like I thought this code through when I wrote it. We are holding on to an array of raw pointers and expect all of them to still be valid when we call OnVsync on them. I'm not sure why I thought this was ok.

Specifically, if we have two observers, and if notifying the first observer causes code to run which deletes the second observer (and unregisters it), then we call OnVsync on the second observer when it's already gone.
There are currently only two possible observer types: NPZCSupport (which registers itself as an input observer), and AndroidVsyncSource::Display (which registers itself as a render observer). So this function will always call NPZCSupport::OnVsync before AndroidVsyncSource::Display::OnVsync.
I wonder if it's possible to have two NPZCSupport instances registered at the same time? And then whether it's possible that NPZCSupport::OnVsync on the first instance could cause the destruction of the second NPZCSupport instance?

If all those things came together, we'd call NPZCSupport::OnVsync on a destroyed NPZCSupport object, which would call TouchResampler::NotifyFrame on a destroyed TouchResampler object, and then this release assertion might be hit.

Flags: needinfo?(mstange.moz) → needinfo?(botond)

(In reply to Markus Stange [:mstange] from comment #2)

I wonder if it's possible to have two NPZCSupport instances registered at the same time? And then whether it's possible that NPZCSupport::OnVsync on the first instance could cause the destruction of the second NPZCSupport instance?

Good questions. I'm not really familiar enough with the GeckoView integration to answer them, but perhaps Agi might know.

Specifically:

  1. Can there be more than one NPZCSupport object alive at one time? (AndroidVsync has a static instance, so this would imply more than one could be registered at the same time.)
  2. Could the code inside NPZCSupport::OnVsync() (particularly places that call back into GeckoView, like the GeckoResult::Complete() calls in FinishHandlingMotionEvent()) trigger destruction of another NPZCSupport instance?
  3. Finally, what is a safe way NPZCSupport can register itself as an observer with another C++ component (AndroidVsync), such that the pointer stored by that component either keeps the NPZCSupport alive, or can check before invoking the observer whether it's still alive?
Flags: needinfo?(botond) → needinfo?(agi)

(In reply to Botond Ballo [:botond] from comment #3)

(In reply to Markus Stange [:mstange] from comment #2)

I wonder if it's possible to have two NPZCSupport instances registered at the same time? And then whether it's possible that NPZCSupport::OnVsync on the first instance could cause the destruction of the second NPZCSupport instance?

Good questions. I'm not really familiar enough with the GeckoView integration to answer them, but perhaps Agi might know.

Specifically:

  1. Can there be more than one NPZCSupport object alive at one time? (AndroidVsync has a static instance, so this would imply more than one could be registered at the same time.)

There should be one for each GeckoSession ( = each tab)

  1. Could the code inside NPZCSupport::OnVsync() (particularly places that call back into GeckoView, like the GeckoResult::Complete() calls in FinishHandlingMotionEvent()) trigger destruction of another NPZCSupport instance?

I can't think of a way that would happen right now, but it is possible.

  1. Finally, what is a safe way NPZCSupport can register itself as an observer with another C++ component (AndroidVsync), such that the pointer stored by that component either keeps the NPZCSupport alive, or can check before invoking the observer whether it's still alive?

the Java GC owns NPZCSupport, so anywhere in native code we cannot assume that the pointer is valid. I think we would need to have the observers be some sort of WeakPtr (the easiest would be mozilla::jni::NativeWeakPtr: https://searchfox.org/mozilla-central/rev/ad38c9d1f0a9036c4da5271a849f47342b68adc8/widget/android/nsWindow.h#87)

Flags: needinfo?(agi)

Thanks. Sounds like a potential fix worth trying would be to change the AndroidVsync observer registered by NPZCSupport from the NPZCSupport object itself, to a wrapper which holds a NativeWeakPtr to the NPZCSupport, and null-checks it at invocation time.

Severity: -- → S3
Priority: -- → P2

So you want AndroidVsync::RegisterObserver() and AndroidVsync::UnregisterObserver() to take a NativeWeakPtr<Observer> parameter (instead of Observer *), and store them in nsTArray<NativeWeakPtr<Observer>> arrays?

(In reply to Razvan Cojocaru from comment #6)

So you want AndroidVsync::RegisterObserver() and AndroidVsync::UnregisterObserver() to take a NativeWeakPtr<Observer> parameter (instead of Observer *), and store them in nsTArray<NativeWeakPtr<Observer>> arrays?

This is an interesting alternative I hadn't considered. Unfortunately, I don't think we can do this, because I believe the only types that can be stored inside NativeWeakPtr<> are ones that derive from an appropriate JNI-related base class. This is true of NPZCSupport but not of AndroidVsyncSource.

So, I think what we'd want to do instead is:

  • Instead of having NPZCSupport implement AndroidVsync::Observer itself, give it a nested class NPZCSupport::Observer which implements AndroidVsync::Observer.
  • Have NPZCSupport::Observer store a NativeWeakPtr<NPZCSupport>.
  • In NPZCSupport::Observer::OnVsync(), null-check the NativeWeakPtr and only call NPZCSupport::OnVsync() if the object is still alive.

(In reply to Botond Ballo [:botond] from comment #7)

  • Instead of having NPZCSupport implement AndroidVsync::Observer itself, give it a nested class NPZCSupport::Observer which implements AndroidVsync::Observer.

Speaking of NPZCSupport::Observer, it appears to be implemented like so:

  class Observer {
   public:
    // Will be called on the Java UI thread.
    virtual void OnVsync(const TimeStamp& aTimeStamp) = 0;
  };

I strongly believe that all classes meant as interfaces (that is, base classes with virtual - especially pure virtual - member functions) should have a virtual destructor. This is supported by a variety of sources: Stroustrup's FAQ, the C++ Core Guidelines, the Sonar documentation, and so on.

Perhaps this is something worth considering for the future (of course not for this bug)?

(In reply to Botond Ballo [:botond] from comment #7)

(In reply to Razvan Cojocaru from comment #6)

So you want AndroidVsync::RegisterObserver() and AndroidVsync::UnregisterObserver() to take a NativeWeakPtr<Observer> parameter (instead of Observer *), and store them in nsTArray<NativeWeakPtr<Observer>> arrays?

This is an interesting alternative I hadn't considered. Unfortunately, I don't think we can do this, because I believe the only types that can be stored inside NativeWeakPtr<> are ones that derive from an appropriate JNI-related base class. This is true of NPZCSupport but not of AndroidVsyncSource.

So, I think what we'd want to do instead is:

  • Instead of having NPZCSupport implement AndroidVsync::Observer itself, give it a nested class NPZCSupport::Observer which implements AndroidVsync::Observer.

Who would manage the lifetime of the NPZCSupport::Observer instance we'd want to pass instead of this? It can't be a member of the current NPZCSupport, since in that case the problem persists: it will disappear alongside the object it is a member of. Creating it on the heap with new also doesn't appear to be a good solution at a first glance at the code, since AndroidVsync::UnregisterObserver() appears to simply remove the pointer from its raw-pointer arrays (this is why I proposed to store weak pointers instead of raw pointers in the observer arrays) - but doesn't delete it.

Since you've got much more experience with the Firefox code, I'm sure I'm just missing something?

(In reply to Razvan Cojocaru from comment #8)

Speaking of NPZCSupport::Observer, it appears to be implemented like so:

  class Observer {
   public:
    // Will be called on the Java UI thread.
    virtual void OnVsync(const TimeStamp& aTimeStamp) = 0;
  };

I strongly believe that all classes meant as interfaces (that is, base classes with virtual - especially pure virtual - member functions) should have a virtual destructor. This is supported by a variety of sources: Stroustrup's FAQ, the C++ Core Guidelines, the Sonar documentation, and so on.

Perhaps this is something worth considering for the future (of course not for this bug)?

Agreed, we should give Observer a virtual destructor.

Procedurally, in cases like this where we notice a cleanup/improvement opportunity in code we're working with that's not strictly related to the bug being fixed, we'll sometimes do the cleanup/improvement in an additional patch on the same bug. This both keeps the patches clean (avoids mixing different changes in one patch) while also avoiding the overhead of filing a new bug.

(In reply to Razvan Cojocaru from comment #9)

Who would manage the lifetime of the NPZCSupport::Observer instance we'd want to pass instead of this? It can't be a member of the current NPZCSupport, since in that case the problem persists: it will disappear alongside the object it is a member of. Creating it on the heap with new also doesn't appear to be a good solution at a first glance at the code, since AndroidVsync::UnregisterObserver() appears to simply remove the pointer from its raw-pointer arrays (this is why I proposed to store weak pointers instead of raw pointers in the observer arrays) - but doesn't delete it.

Since you've got much more experience with the Firefox code, I'm sure I'm just missing something?

You're not missing anything, this is a very good question that I've overlooked.

Would it be too hacky to give Observer a second virtual function named Dispose(), which UnregisterObserver() calls, and which in the case of NPZCSupport::Observer calls delete this (and then we can just create it using new)?

The other option I can think of is to change AndroidVsync to accept and store RefPtr<Observer> instead (i.e. strong references). However, in the case where the Observer is AndroidVsyncSource, that stores a strong refrence to AndroidVsync, so it seems we'd be creating a strong reference cycle.

I'm open to other suggestions as well.

(In reply to Botond Ballo [:botond] from comment #10)

Agreed, we should give Observer a virtual destructor.

Procedurally, in cases like this where we notice a cleanup/improvement opportunity in code we're working with that's not strictly related to the bug being fixed, we'll sometimes do the cleanup/improvement in an additional patch on the same bug. This both keeps the patches clean (avoids mixing different changes in one patch) while also avoiding the overhead of filing a new bug.

Great, I'll do the modification in the patch then - should be short and painless. :) Thanks!

(In reply to Botond Ballo [:botond] from comment #11)

Would it be too hacky to give Observer a second virtual function named Dispose(), which UnregisterObserver() calls, and which in the case of NPZCSupport::Observer calls delete this (and then we can just create it using new)?

Well it is (the simplified to the base essentials version of) COM+'s AddRef()/Release() so it's not all that hacky. The solution's bound to be hacky anyway if we want to be able to store both regular pointers to on-the-stack objects and pointers to heap-allocated objects. We could just use standard C++ smart pointers as well and pass dummy do-nothing custom deleters for stack objects, but that's just as hacky and arguably more complicated than the Dispose() solution.

Just one last question: what's the proper way to create a NativeWeakPtr<NPZCSupport> from this (in a NPZCSupport object)? Or is that not what we want here?

(In reply to Razvan Cojocaru from comment #12)

Just one last question: what's the proper way to create a NativeWeakPtr<NPZCSupport> from this (in a NPZCSupport object)? Or is that not what we want here?

Perhaps something like this:

  • Give nsWindow a method GetNPZCSupportWeakPtr() which returns a copy of mNPZCSupport
  • In NPZCSupport, do something like:
if (auto win = mWindow.Access()) {
  nsWindow* gkWindow = window->GetNsWindow();
  NativeWeakPtr<NPZCSupport> weakPtrToThis = gkWindow->GetNPZCSupportWeakPtr();
}

It's possible there is an easier way I'm not familiar with. We will ask for code review from someone on the GeckoView team and they can suggest improvements if needed.

Another option might be to have NPZCSupport::Observer work more like NPZCSupport::InputEvent: store a java::PanZoomController::NativeProvider::GlobalRef, and recover the NativeWeakPtr<NPZCSupport> at invocation time like this.

Give Observer a virtual destructor while at it.

Assignee: nobody → rzvncj
Status: NEW → ASSIGNED

(In reply to Botond Ballo [:botond] from comment #10)

Procedurally, in cases like this where we notice a cleanup/improvement opportunity in code we're working with that's not strictly related to the bug being fixed, we'll sometimes do the cleanup/improvement in an additional patch on the same bug. This both keeps the patches clean (avoids mixing different changes in one patch) while also avoiding the overhead of filing a new bug.

Gah, I've just re-read this and I've completely missed the part about the change being in a separate patch, and now the single patch has been accepted. I'm sorry, should have paid more attention. I can split them now if you want.

(In reply to Razvan Cojocaru from comment #16)

(In reply to Botond Ballo [:botond] from comment #10)

Procedurally, in cases like this where we notice a cleanup/improvement opportunity in code we're working with that's not strictly related to the bug being fixed, we'll sometimes do the cleanup/improvement in an additional patch on the same bug. This both keeps the patches clean (avoids mixing different changes in one patch) while also avoiding the overhead of filing a new bug.

Gah, I've just re-read this and I've completely missed the part about the change being in a separate patch, and now the single patch has been accepted. I'm sorry, should have paid more attention. I can split them now if you want.

No worries, we're adding a new virtual function anyways so it seems fairly natural to add the virtual destructor in the same patch.

Pushed by bballo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/81da50971081
Fix crash in [@ mozilla::widget::TouchResampler::NotifyFrame]. r=botond,geckoview-reviewers,m_kato

Adding leave-open keyword since we shouldn't close this bug until we see the crashes actually stop occuring.

Keywords: leave-open

(In reply to Botond Ballo [:botond] from comment #19)

Adding leave-open keyword since we shouldn't close this bug until we see the crashes actually stop occuring.

Writing down mostly for my own reference:

  • Our speculative fix landed in version 110, which will hit the release on Feb 14
  • Once 110 has been out for a few weeks, if we see crash reports from 108 and 109 but no crash reports from 110, that's a good sign our fix worked

Unfortunately, we're still seeing crashes in 110, which suggests that our theory about the cause of the crash was incorrect.

I don't have any alternative theories at this time. If anyone else has suggestions, please feel free to share them in a comment. Meanwhile I'm going to leave the bug open.

(In reply to Botond Ballo [:botond] from comment #22)

Unfortunately, we're still seeing crashes in 110, which suggests that our theory about the cause of the crash was incorrect.

I don't have any alternative theories at this time. If anyone else has suggestions, please feel free to share them in a comment. Meanwhile I'm going to leave the bug open.

Sorry to hear that. At least we've fixed a problem with the previous patch. :)

Assignee: rzvncj → nobody
Status: ASSIGNED → NEW

(In reply to Razvan Cojocaru from comment #23)

At least we've fixed a problem with the previous patch. :)

Indeed, thanks for your work on that!

(In reply to Botond Ballo [:botond] from comment #24)

(In reply to Razvan Cojocaru from comment #23)

At least we've fixed a problem with the previous patch. :)

Indeed, thanks for your work on that!

Happy to help!

Regressions: 1826027

The leave-open keyword is there and there is no activity for 6 months.
:botond, maybe it's time to close this bug?
For more information, please visit BugBot documentation.

Flags: needinfo?(botond)

The crash is still happening occasionally, so the bug should remain open.

However, the leave-open keyword can be removed (it only needs to be present at the time a patch lands, if we don't want the landing to close the bug).

Flags: needinfo?(botond)
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: