Closed Bug 1810396 Opened 1 year ago Closed 1 year ago

Crash in [@ mozilla::CheckedInt<T>::value]

Categories

(Core :: Audio/Video: Playback, defect, P2)

All
Unspecified
defect

Tracking


RESOLVED FIXED
111 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox109 + wontfix
firefox110 + fixed
firefox111 + fixed

People

(Reporter: gsvelto, Assigned: az)

References

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(2 files)

Crash report: https://crash-stats.mozilla.org/report/index/0cdeca89-9d89-44f5-84bc-2ba2f0230115

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0  libxul.so  mozilla::CheckedInt<long>::value const  mfbt/CheckedInt.h:563
0  libxul.so  mozilla::media::TimeUnit::operator<= const  dom/media/TimeUnits.h:143
0  libxul.so  mozilla::MediaSourceTrackDemuxer::DoGetSamples  dom/media/mediasource/MediaSourceDemuxer.cpp:463
1  libxul.so  mozilla::detail::RunnableMethodArguments<StoreCopyPassByRRef<int> >::applyImpl<mozilla::MediaSourceTrackDemuxer, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> >   xpcom/threads/nsThreadUtils.h:1147
1  libxul.so  mozilla::detail::RunnableMethodArguments<StoreCopyPassByRRef<int> >::apply<mozilla::MediaSourceTrackDemuxer, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> >   xpcom/threads/nsThreadUtils.h:1153
1  libxul.so  mozilla::detail::MethodCall<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true>, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> >   xpcom/threads/MozPromise.h:1518
1  libxul.so  mozilla::detail::ProxyRunnable<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true>, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> >   xpcom/threads/MozPromise.h:1538
2  libxul.so  mozilla::TaskQueue::Runner::Run  xpcom/threads/TaskQueue.cpp:259
3  libxul.so  nsThreadPool::Run  xpcom/threads/nsThreadPool.cpp:309
4  libxul.so  nsThread::ProcessNextEvent  xpcom/threads/nsThread.cpp:1198

This does not look like a new problem, but rather one where the signature changed due to inlining (we should probably remove the first two frames from the signature). It's predominantly happening on Fenix.

This kind of looks like it could be a regression, honestly. I tried searching for crashes in the last 6 months with MediaSourceTrackDemuxer::DoGetSamples in the proto signature and I didn't see anything with the same volume as these two crashes.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on release

:jimm, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmathies)
Keywords: topcrash

(In reply to Andrew McCreight [:mccr8] from comment #1)

This kind of looks like it could be a regression, honestly. I tried searching for crashes in the last 6 months with MediaSourceTrackDemuxer::DoGetSamples in the proto signature and I didn't see anything with the same volume as these two crashes.

This crash signature is now the top content crash in Fenix 109, though I don't see any crash reports from Beta 110 or Nightly 111. In fact, over the last six months, there have been only 3 reports from Nightly and 5 from Beta, even if I group the MediaSourceTrackDemuxer::DoGetSamples and CheckedInt<T>::value signatures together. Why is this crash so much more common on Release than Nightly or Beta?

It looks like there may have been a spike in Fenix 102 and another in 108:

https://crash-stats.mozilla.org/search/?signature=~mozilla%3A%3AMediaSourceTrackDemuxer%3A%3ADoGetSamples&signature=~mozilla%3A%3ACheckedInt%3CT%3E%3A%3Avalue&product=Fenix&product=Focus&date=%3E%3D2022-02-23T06%3A28%3A00.000Z&date=%3C2023-01-23T06%3A28%3A00.000Z&_facets=signature&_facets=version&_facets=cpu_arch&_facets=major_version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-major_version

Hardware: Unspecified → All

This is the top content crash in Fenix 109 Release. It looks like a regression in Fenix 108+ Release to me. We've received 4000+ crash reports from Fenix 108 and 109 Release, but only 173 from Fenix 107 Release.

We've received only two crash reports from Beta 108-110 and one from Nightly 108-111. Why is this crash so much more likely on Release?

The last thing to touch MediaSourceDemuxer.cpp was bug 1764293, which landed in the 108 timeframe and directly touched the DoGetSamples function.

Flags: needinfo?(azebrowski)
Regressed by: 1764293

It looks like the root cause may be a timing issue where the sample pointer is invalidated while waiting for a lock, even though the pointer itself is created/validated immediately before that here. We have crashes with the same signature going back quite a few versions, to at least 99, with a spike in 102 as cpeterson pointed out, and on other platforms as well, e.g. this report on Windows 10 102.7.0esr. I don't think the recent change to MediaSourceDemuxer.cpp is responsible for the timing issue itself, but it fixed issues with embedded playback of streams with mismatched start times, which might make the issue more likely to trigger. A quick fix could be to add an additional pointer validity check, though that doesn't answer the question of why the pointer is becoming invalid in the first place, especially on Android; it would fix the crash but not the underlying issue, so I believe we'd still have playback failures. I'm still looking at this -- will chat with alwu on it later today and NI padenot as well.
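To make the suspected window concrete, here is a rough standalone sketch of the sequence described above, using stand-in types. This is an assumed shape, not the real MediaSourceTrackDemuxer::DoGetSamples code:

// Stand-in types only; the real classes live in dom/media.
#include <mutex>

struct TimeUnit { long long mValue = 0; };
struct Sample { TimeUnit mTime; };

struct DemuxerSketch {
  std::mutex mMonitor;
  TimeUnit mNextRandomAccessPoint;

  bool IsAtOrBeforeNextRandomAccessPoint(const Sample* aSample) {
    if (!aSample) {
      return false;  // the guards mentioned above sit before this point
    }
    // Suspected window: if anything invalidated aSample while we were
    // blocked here, the comparison below would read through a bad pointer.
    std::lock_guard<std::mutex> lock(mMonitor);
    return aSample->mTime.mValue <= mNextRandomAccessPoint.mValue;  // crash-site analogue
  }
};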

Flags: needinfo?(azebrowski) → needinfo?(padenot)

If that crash line is correct, it means one of those two TimeUnits is invalid, but I couldn't find a reasonable explanation for that.

sample (MediaRawData)

When appending a sample into samples, there is already an assertion to ensure the sample is valid. That sample is also guaranteed to stay alive because it's held by a RefPtr.

Okay, let's consider another possibility: could someone modify its mTime while we're waiting for the monitor? When the TrackBuffersManager returns a sample via GetSample(), the returned sample is a duplicate, so only MediaSourceDemuxer is holding that sample. So it's not possible for someone to modify the sample's mTime while MediaSourceDemuxer is accessing it.

mNextRandomAccessPoint

It doesn't get initialized explicitly, so the default value for that TimeUnit is zero, which is valid. When assigning a value to it, we call TrackBuffersManager::GetNextRandomAccessPoint, and these are the places the TimeUnit can be returned from. The returned TimeUnit is either a sample's time (which, as shown above, is valid if it was returned by GetSample()) or TimeUnit::FromInfinity() (which is also a valid value). So from this point of view, it can't be invalid either.

One interesting thing is that all those crashes have the same address (0x0000000000000020 on 64-bit and 0x00000018 on 32-bit). That makes me wonder whether the infinity value (INT64_MAX) somehow fell outside the allowed range when the value's validity was checked.
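For reference, the hypothesis corresponds to the validity flag being false when value() is called. Below is a simplified, standalone sketch of that contract (an assumption for illustration, not the actual mfbt/CheckedInt.h implementation):

#include <cassert>
#include <cstdint>

// Simplified sketch of the CheckedInt contract: value() is only meaningful
// while the validity flag is still set, so a comparison between TimeUnits
// whose underlying CheckedInt got poisoned would trip exactly this check.
class CheckedInt64Sketch {
 public:
  explicit CheckedInt64Sketch(int64_t aValue) : mValue(aValue), mIsValid(true) {}

  static CheckedInt64Sketch Invalid() {
    CheckedInt64Sketch result(0);
    result.mIsValid = false;  // e.g. after an overflowing operation
    return result;
  }

  bool isValid() const { return mIsValid; }

  int64_t value() const {
    // If the hypothesis above were right, asserting builds would abort
    // here; the top frame of the signature points at the equivalent check.
    assert(mIsValid && "value() called on an invalid CheckedInt");
    return mValue;
  }

 private:
  int64_t mValue;
  bool mIsValid;
};

// Hypothetical usage mirroring the crashing comparison:
//   bool le = sampleTime.value() <= nextRandomAccessPoint.value();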

I found a post about a value not getting printed correctly on Android, which seems related to a toolchain mismatch. Chris, do you know if the Fenix team or the build team has changed or updated the toolchain recently? Maybe we can check that direction as well to see whether there is a possible regression there. Or do you know who the right person to answer this question is?

Thank you.

Flags: needinfo?(cpeterson)

The bug is marked as tracked for firefox109 (release), tracked for firefox110 (beta) and tracked for firefox111 (nightly). We have limited time to fix this, the soft freeze is in 8 days. However, the bug still isn't assigned.

:jimm, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply back out the regressor. If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmathies)

:az was thinking this might have blown up due to her fix addressing playback of a particular set of videos that trigger the issue. She's going to try to put something defensive together and mark this leave-open.

Chris - it would be great if we had urls reported in crash reports for something like this. Re - https://docs.google.com/document/d/1uloP5DXQWp3QnKA7UIQUfnBBfgK3KEr5XdA0S_TjCH8/edit#bookmark=id.dgm1eoo8hdb1

Flags: needinfo?(jmathies)

(In reply to Jim Mathies [:jimm] from comment #10)

:az was thinking this might have blown up due to her fix addressing playback of a particular set of videos that trigger the issue. She's going to try to put something defensive together and mark this leave-open.

Chris - it would be great if we had urls reported in crash reports for something like this. Re - https://docs.google.com/document/d/1uloP5DXQWp3QnKA7UIQUfnBBfgK3KEr5XdA0S_TjCH8/edit#bookmark=id.dgm1eoo8hdb1

From crash stats, this blew up in 109, and the twitter fix landed in 108, so bug 1764293 was not the cause.

Flags: needinfo?(padenot)
No longer regressed by: 1764293

(In reply to Alastor Wu [:alwu] from comment #8)

Chris, do you know if the Fenix team or the build team has changed or updated the toolchain recently? Maybe we can check that direction as well to see whether there is a possible regression there. Or do you know who the right person to answer this question is?

Could you clarify which toolchains you're referring to? In the 108 cycle, we updated to clang 15 in bug 1784202 and rust 1.65 in bug 1789508. We haven't updated our NDK in years.

Flags: needinfo?(alwu)

(In reply to Alastor Wu [:alwu] from comment #8)

I found a post about a value not getting printed correctly on Android, which seems related to a toolchain mismatch. Chris, do you know if the Fenix team or the build team has changed or updated the toolchain recently? Maybe we can check that direction as well to see whether there is a possible regression there. Or do you know who the right person to answer this question is?

Ryan answered the question about updating the toolchain in comment 12. For more details about these updates, I would ask :glandium.

In the 108 cycle, we updated to clang 15 in bug 1784202 and rust 1.65 in bug 1789508. We haven't updated our NDK in years.

(In reply to Jim Mathies [:jimm] from comment #13)

Created attachment 9315433 [details]
Screenshot 2023-02-01 at 13-34-15 Crash Stats signature report @mozilla CheckedInt T value.png

Possibly this regressed in 108.2.0?

https://hg.mozilla.org/releases/mozilla-release/pushloghtml?fromchange=FIREFOX_108_0_1_RELEASE&tochange=FIREFOX_108_0_2_RELEASE

We might see fewer crashes from 108.1.x than from 108.2.0 because early Fenix updates are throttled.

I see 108.2.0's pushlog includes a uniffi update. We've had Android crashes caused by uniffi updates multiple times, but those bugs were usually on Android x86 devices. This bug's crashes are nearly all arm64. So uniffi is probably not a problem here.

Flags: needinfo?(cpeterson)

After digging a bit deeper, I think the crash addresses of 0x0000000000000020 on 64-bit and 0x00000018 on 32-bit are hints that the RefPtr<MediaRawData> sample is a null pointer despite the guards we have in place. MediaRawData is subclassed from MediaData here. If we look at the definition of MediaData, we can see that an enum and an int64_t are defined before the TimeUnit we try to access here.

Looking at the way this is laid out in memory, we get the following on an x64 Linux machine:

(rr) ptype /o MediaData
/* offset      |    size */  type = class mozilla::MediaData {
                             protected:
/*      8      |       8 */    class mozilla::ThreadSafeAutoRefCnt {
                                 public:
                                   static const bool isThreadSafe;
                                 private:
/*      8      |       8 */        struct std::atomic<unsigned long> [with _Tp = unsigned long] : public std::__atomic_base<_Tp> {
                                       static const bool is_always_lock_free;
                                       /* total size (bytes):    8 */
                                   } mValue;
                                   /* total size (bytes):    8 */
                               } mRefCnt;
                             public:
/*     16      |       4 */    const mozilla::MediaData::Type mType;
/* XXX  4-byte hole      */
/*     24      |       8 */    int64_t mOffset;
/*     32      |      16 */    class mozilla::media::TimeUnit

The offset of 32 for the TimeUnit works out to 0x20, the same crash address we see on 64-bit Android/Windows. If the 4-byte hole doesn't exist on 32-bit machines, the offset would be 24, which works out to 0x18, the same crash address we see on 32-bit machines. So if sample is a null pointer, sample->mTime would be attempting to access exactly those crash addresses.
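The arithmetic can be reproduced with a hypothetical mirror of the layout printed above (FakeMediaData and FakeTimeUnit are illustrative stand-ins, not the real Gecko classes; plain members are chosen so the offsets line up on a typical 64-bit ABI):

#include <cstddef>
#include <cstdint>
#include <cstdio>

struct FakeTimeUnit {
  int64_t mValue;
  bool mIsValid;
};

struct FakeMediaData {
  void* mVtablePlaceholder;  // virtual dispatch pointer, offset 0
  uint64_t mRefCnt;          // offset 8
  uint32_t mType;            // offset 16, followed by a 4-byte hole
  int64_t mOffset;           // offset 24
  FakeTimeUnit mTime;        // offset 32 == 0x20
};

int main() {
  // Prints 0x20 on a 64-bit ABI: reading sample->mTime through a null
  // `sample` pointer faults at exactly that address. On the 32-bit layout
  // shown later in this bug, the analogous offset is 0x18.
  std::printf("mTime offset: 0x%zx\n", offsetof(FakeMediaData, mTime));
  return 0;
}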

If this theory is correct, we still need to answer how the sample pointer is becoming invalidated; alwu's comment already points out many of the guards we have in place to prevent this, so it could still be toolchain related. We also need to know why the vast majority of the crashes are on 64-bit Android and only cropped up recently, and why beta and nightly aren't affected nearly as much as release. I'm uploading a patch that adds another null pointer check to aid in diagnosing, which might fix the crash on release, and a diagnostic assert so beta/nightly will crash there and we can try to sort it out on those channels. I'm having problems accessing try at the moment, though, and haven't been able to submit a try build.
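A minimal standalone sketch of that defensive pattern is below, using stand-in types and an ordinary assert in place of MOZ_DIAGNOSTIC_ASSERT; this is an illustration of the approach, not the landed patch:

#include <cassert>

// Stand-in types; the real ones are MediaRawData and media::TimeUnit.
struct TimeUnit {
  long long mValue = 0;
  bool mIsValid = true;
  bool IsValid() const { return mIsValid; }
};
struct Sample { TimeUnit mTime; };

enum class DemuxResult { Ok, Error };

DemuxResult CheckSampleBeforeUse(const Sample* aSample,
                                 const TimeUnit& aNextRandomAccessPoint) {
  // Diagnostic assert: crash loudly on builds where asserts are enabled
  // (Nightly/early Beta in Gecko terms) so the bad state can be inspected.
  assert(aSample && "null sample returned from the track buffers manager");

  // Graceful path for release builds: report an error instead of reading
  // through a null pointer at offset 0x20 / 0x18.
  if (!aSample || !aSample->mTime.IsValid()) {
    return DemuxResult::Error;
  }

  // Safe to compare now; the real code would use this to decide whether
  // the sample is at or before the next random access point.
  bool atOrBefore = aSample->mTime.mValue <= aNextRandomAccessPoint.mValue;
  (void)atOrBefore;
  return DemuxResult::Ok;
}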

Assignee: nobody → azebrowski
Status: NEW → ASSIGNED

I don't have an explicit answer for what kind of toolchain change might cause this; I just wanted to know whether there is any possible change that would affect Android.

Flags: needinfo?(alwu)
Attachment #9315492 - Attachment description: Bug 1810396 - Check validity of pointer to sample data when returning samples from demuxer and induce crash on debug builds for diagnostic purposes r=alwu → Bug 1810396 - Additional sanity checks for sample data when returning samples from demuxer r=alwu

(In reply to Ryan VanderMeulen [:RyanVM] from comment #14)

For what it's worth, Desktop 109 releases are hitting the same crashes (albeit lower volume):
https://crash-stats.mozilla.org/report/index/26453d43-6968-44e4-8b5c-7e6790230201

FWIW as well, this 32-bit Windows crash confirms what [:az] wrote earlier in comment 16, i.e. the code where we crash is trying to compare the 64-bit values stored in sample->mTime and mNextRandomAccessPoint, but sample is a null pointer.

0:031> dt xul!mozilla::MediaRawData
   +0x000 __VFN_table : Ptr32 
   +0x008 mRefCnt          : mozilla::ThreadSafeAutoRefCnt
   +0x00c mType            : mozilla::MediaData::Type
   +0x010 mOffset          : Int8B
-> +0x018 mTime            : mozilla::media::TimeUnit <-
   +0x028 mTimecode        : mozilla::media::TimeUnit
  ...

0:031> dt xul!mozilla::MediaSourceTrackDemuxer
   +0x000 __VFN_table : Ptr32 
   +0x004 mRefCnt          : mozilla::ThreadSafeAutoRefCnt
   +0x008 mParent          : RefPtr<mozilla::MediaSourceDemuxer>
   +0x00c mTaskQueue       : RefPtr<mozilla::TaskQueue>
   +0x010 mType            : mozilla::TrackInfo::TrackType
   +0x014 mMonitor         : mozilla::Monitor
-> +0x040 mNextRandomAccessPoint : mozilla::media::TimeUnit <-
   +0x050 mManager         : RefPtr<mozilla::TrackBuffersManager>
   ...

126cb276 8d742408         lea     esi, [esp+8]
[...]
126cb28b 8b06             mov     eax, dword ptr [esi]
126cb28d 8b4818           mov     ecx, dword ptr [eax+18h] <-- crashing instruction
126cb290 3b4f40           cmp     ecx, dword ptr [this->mNextRandomAccessPoint{.mValue.mValue(!!)} (edi+40h)]
126cb293 8b401c           mov     eax, dword ptr [eax+1Ch]
126cb296 1b4744           sbb     eax, dword ptr [this->mNextRandomAccessPoint.mValue.mValue(!!) (edi+40h)+4]

0:031> r esi // address of sample
esi=320ff368
0:031> ? poi(esi)  // contents of sample
Evaluate expression: 0 = 00000000
0:031> r eax // contents of sample as well
eax=00000000
See Also: → 1814600
Pushed by azebrowski@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6e83e2546b09
Additional sanity checks for sample data when returning samples from demuxer r=alwu
Severity: -- → S2
Keywords: leave-open
OS: Android → Unspecified
Priority: -- → P2

Comment on attachment 9315492 [details]
Bug 1810396 - Additional sanity checks for sample data when returning samples from demuxer r=alwu

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes will continue if declined. These crashes are only happening on release; the new null pointer check in this patch should help with further diagnosis and potentially reduce the crash count.
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Patch is small -- it adds a null pointer check in the place we're already crashing and two diagnostic asserts, along with a trivial datatype change
  • String changes made/needed:
  • Is Android affected?: Yes
Attachment #9315492 - Flags: approval-mozilla-release?

Given that this is only affecting release, after some discussion it seems like an uplift would be prudent to help narrow this down some. We still need to investigate further to find the root cause.

Diagnostic asserts aren't enabled past early beta, though. Is the remaining part of the patch still worth taking without those being effectively enabled?

Flags: needinfo?(azebrowski)

Apart from the asserts I think it'd still be worth it. The thought is that the new null pointer check would allow it to fail gracefully rather than outright crashing, and if the crashes continue we'll have some additional info to diagnose with either way.

Flags: needinfo?(azebrowski)
Attachment #9315492 - Flags: approval-mozilla-beta?

Comment on attachment 9315492 [details]
Bug 1810396 - Additional sanity checks for sample data when returning samples from demuxer r=alwu

I am taking the patch to the beta branch before the merge.

Attachment #9315492 - Flags: approval-mozilla-release?
Attachment #9315492 - Flags: approval-mozilla-release-
Attachment #9315492 - Flags: approval-mozilla-beta?
Attachment #9315492 - Flags: approval-mozilla-beta+

:az do you still need leave-open on this or can it be resolved?

Flags: needinfo?(azebrowski)

Looks like this can be marked resolved, closing.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(azebrowski)
Resolution: --- → FIXED
Target Milestone: --- → 111 Branch
See Also: → 1838256