Closed Bug 1810396 Opened 1 year ago Closed 1 year ago

Crash in [@ mozilla::CheckedInt<T>::value]

Categories

(Core :: Audio/Video: Playback, defect, P2)

All
Unspecified
defect

Tracking


RESOLVED FIXED
111 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox109 + wontfix
firefox110 + fixed
firefox111 + fixed

People

(Reporter: gsvelto, Assigned: az)

References

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(2 files)

Crash report: https://crash-stats.mozilla.org/report/index/0cdeca89-9d89-44f5-84bc-2ba2f0230115

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0  libxul.so  mozilla::CheckedInt<long>::value const  mfbt/CheckedInt.h:563
0  libxul.so  mozilla::media::TimeUnit::operator<= const  dom/media/TimeUnits.h:143
0  libxul.so  mozilla::MediaSourceTrackDemuxer::DoGetSamples  dom/media/mediasource/MediaSourceDemuxer.cpp:463
1  libxul.so  mozilla::detail::RunnableMethodArguments<StoreCopyPassByRRef<int> >::applyImpl<mozilla::MediaSourceTrackDemuxer, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> >   xpcom/threads/nsThreadUtils.h:1147
1  libxul.so  mozilla::detail::RunnableMethodArguments<StoreCopyPassByRRef<int> >::apply<mozilla::MediaSourceTrackDemuxer, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> >   xpcom/threads/nsThreadUtils.h:1153
1  libxul.so  mozilla::detail::MethodCall<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true>, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> >   xpcom/threads/MozPromise.h:1518
1  libxul.so  mozilla::detail::ProxyRunnable<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true>, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> >   xpcom/threads/MozPromise.h:1538
2  libxul.so  mozilla::TaskQueue::Runner::Run  xpcom/threads/TaskQueue.cpp:259
3  libxul.so  nsThreadPool::Run  xpcom/threads/nsThreadPool.cpp:309
4  libxul.so  nsThread::ProcessNextEvent  xpcom/threads/nsThread.cpp:1198

This does not look like a new problem, but rather one where the signature changed due to inlining (we should probably remove the first two frames from the signature). It's predominantly happening on Fenix.

This kind of looks like it could be a regression, honestly. I tried searching for crashes in the last 6 months with MediaSourceTrackDemuxer::DoGetSamples in the proto signature and I didn't see anything with the same volume as these two crashes.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 AArch64 and ARM crashes on release

:jimm, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmathies)
Keywords: topcrash

(In reply to Andrew McCreight [:mccr8] from comment #1)

This kind of looks like it could be a regression, honestly. I tried searching for crashes in the last 6 months with MediaSourceTrackDemuxer::DoGetSamples in the proto signature and I didn't see anything with the same volume as these two crashes.

This crash signature is now the top content crash in Fenix 109, though I don't see any crash reports from Beta 110 or Nightly 111. In fact, over the last six months, there have been only 3 reports from Nightly and 5 from Beta, even if I group the MediaSourceTrackDemuxer::DoGetSamples and CheckedInt<T>::value signatures together. Why is this crash so much more common on Release than Nightly or Beta?

It looks like there may have been a spike in Fenix 102 and another in 108:

https://crash-stats.mozilla.org/search/?signature=~mozilla%3A%3AMediaSourceTrackDemuxer%3A%3ADoGetSamples&signature=~mozilla%3A%3ACheckedInt%3CT%3E%3A%3Avalue&product=Fenix&product=Focus&date=%3E%3D2022-02-23T06%3A28%3A00.000Z&date=%3C2023-01-23T06%3A28%3A00.000Z&_facets=signature&_facets=version&_facets=cpu_arch&_facets=major_version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-major_version

Hardware: Unspecified → All

This is the top content crash in Fenix 109 Release. It looks like a regression in Fenix 108+ Release to me. We've received 4000+ crash reports from Fenix 108 and 109 Release, but only 173 from Fenix 107 Release.

We've received only two crash reports from Beta 108-110 and one from Nightly 108-111. Why is this crash so much more likely on Release?

The last thing to touch MediaSourceDemuxer.cpp was bug 1764293, which landed in the 108 timeframe and directly touched the DoGetSamples function.

Flags: needinfo?(azebrowski)
Regressed by: 1764293

It looks like the root cause may be a timing issue where the sample pointer is invalidated while waiting for a lock, even though the pointer itself is created/validated immediately before that here. We have crashes with the same signature going back quite a few versions, to at least 99, with a spike in 102 as cpeterson pointed out, and on other platforms as well, e.g. this report on Windows 10 102.7.0esr. I don't think the recent change to MediaSourceDemuxer.cpp is responsible for the timing issue itself, but it fixed issues with embedded playback of streams with mismatched start times, which might make the issue more likely to trigger. A quick fix could be to add an additional pointer validity check, though that doesn't answer the question of why the pointer is becoming invalid in the first place, especially on Android; it would fix the crash but not the underlying issue, so I believe we'd still have playback failures. I'm still looking at this -- will chat with alwu on it later today and NI padenot as well.
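To make the suspected window concrete, here is a rough standalone sketch of the sequence described above, using stand-in types. This is an assumed shape, not the real MediaSourceTrackDemuxer::DoGetSamples code:

// Stand-in types only; the real classes live in dom/media.
#include <mutex>

struct TimeUnit { long long mValue = 0; };
struct Sample { TimeUnit mTime; };

struct DemuxerSketch {
  std::mutex mMonitor;
  TimeUnit mNextRandomAccessPoint;

  bool IsAtOrBeforeNextRandomAccessPoint(const Sample* aSample) {
    if (!aSample) {
      return false;  // the guards mentioned above sit before this point
    }
    // Suspected window: if anything invalidated aSample while we were
    // blocked here, the comparison below would read through a bad pointer.
    std::lock_guard<std::mutex> lock(mMonitor);
    return aSample->mTime.mValue <= mNextRandomAccessPoint.mValue;  // crash-site analogue
  }
};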

Flags: needinfo?(azebrowski) → needinfo?(padenot)

If that crash line is correct, it means one of those two TimeUnits is invalid, but I couldn't find a reasonable explanation for that.

sample (MediaRawData)

When appending a sample into samples, there is already an assertion to ensure the sample is valid. That sample is also guaranteed to stay alive because it's held by a RefPtr.

Okay, let's consider another possibility: could someone modify its mTime while we're waiting for the monitor? When the TrackBuffersManager returns a sample via GetSample(), the returned sample is a duplicate, so only MediaSourceDemuxer is holding that sample. So it's not possible for someone to modify the sample's mTime while MediaSourceDemuxer is accessing it.

mNextRandomAccessPoint

It doesn't get initialized explicitly, so the default value for that TimeUnit is zero, which is valid. When assigning a value to it, we call TrackBuffersManager::GetNextRandomAccessPoint, and these are the places the TimeUnit can be returned from. The returned TimeUnit is either a sample's time (which, as shown above, is valid if it was returned by GetSample()) or TimeUnit::FromInfinity() (which is also a valid value). So from this point of view, it can't be invalid either.

One interesting thing is that all those crashes have the same address (0x0000000000000020 on 64-bit and 0x00000018 on 32-bit). That makes me wonder whether the infinity value (INT64_MAX) somehow fell outside the allowed range when the value's validity was checked.
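For reference, the hypothesis corresponds to the validity flag being false when value() is called. Below is a simplified, standalone sketch of that contract (an assumption for illustration, not the actual mfbt/CheckedInt.h implementation):

#include <cassert>
#include <cstdint>

// Simplified sketch of the CheckedInt contract: value() is only meaningful
// while the validity flag is still set, so a comparison between TimeUnits
// whose underlying CheckedInt got poisoned would trip exactly this check.
class CheckedInt64Sketch {
 public:
  explicit CheckedInt64Sketch(int64_t aValue) : mValue(aValue), mIsValid(true) {}

  static CheckedInt64Sketch Invalid() {
    CheckedInt64Sketch result(0);
    result.mIsValid = false;  // e.g. after an overflowing operation
    return result;
  }

  bool isValid() const { return mIsValid; }

  int64_t value() const {
    // If the hypothesis above were right, asserting builds would abort
    // here; the top frame of the signature points at the equivalent check.
    assert(mIsValid && "value() called on an invalid CheckedInt");
    return mValue;
  }

 private:
  int64_t mValue;
  bool mIsValid;
};

// Hypothetical usage mirroring the crashing comparison:
//   bool le = sampleTime.value() <= nextRandomAccessPoint.value();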

I found a post about a value not getting printed correctly on Android, which seems related to a toolchain mismatch. Chris, do you know if the Fenix team or the build team has changed or updated the toolchain recently? Maybe we can check that direction as well to see whether there is a possible regression there. Or do you know who the right person to answer this question is?

Thank you.

Flags: needinfo?(cpeterson)

The bug is marked as tracked for firefox109 (release), tracked for firefox110 (beta) and tracked for firefox111 (nightly). We have limited time to fix this, the soft freeze is in 8 days. However, the bug still isn't assigned.

:jimm, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply back out the regressor. If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmathies)

:az was thinking this might have blown up due to her fix addressing playback of a particular set of videos that trigger the issue. She's going to try to put something defensive together and mark this leave-open.

Chris - it would be great if we had urls reported in crash reports for something like this. Re - https://docs.google.com/document/d/1uloP5DXQWp3QnKA7UIQUfnBBfgK3KEr5XdA0S_TjCH8/edit#bookmark=id.dgm1eoo8hdb1

Flags: needinfo?(jmathies)

(In reply to Jim Mathies [:jimm] from comment #10)

:az was thinking this might have blown up due to her fix addressing playback of a particular set of videos that trigger the issue. She's going to try to put something defensive together and mark this leave-open.

Chris - it would be great if we had urls reported in crash reports for something like this. Re - https://docs.google.com/document/d/1uloP5DXQWp3QnKA7UIQUfnBBfgK3KEr5XdA0S_TjCH8/edit#bookmark=id.dgm1eoo8hdb1

From crash stats, this blew up in 109, and the twitter fix landed in 108, so bug 1764293 was not the cause.

Flags: needinfo?(padenot)
No longer regressed by: 1764293

(In reply to Alastor Wu [:alwu] from comment #8)

Chris, do you know if the Fenix team or the build team has changed or updated the toolchain recently? Maybe we can check that direction as well to see whether there is a possible regression there. Or do you know who the right person to answer this question is?

Could you clarify which toolchains you're referring to? In the 108 cycle, we updated to clang 15 in bug 1784202 and rust 1.65 in bug 1789508. We haven't updated our NDK in years.

Flags: needinfo?(alwu)

(In reply to Alastor Wu [:alwu] from comment #8)

I found a post about a value not getting printed correctly on Android, which seems related to a toolchain mismatch. Chris, do you know if the Fenix team or the build team has changed or updated the toolchain recently? Maybe we can check that direction as well to see whether there is a possible regression there. Or do you know who the right person to answer this question is?

Ryan answered the question about updating the toolchain in comment 12. For more details about these updates, I would ask :glandium.

In the 108 cycle, we updated to clang 15 in bug 1784202 and rust 1.65 in bug 1789508. We haven't updated our NDK in years.

(In reply to Jim Mathies [:jimm] from comment #13)

Created attachment 9315433 [details]
Screenshot 2023-02-01 at 13-34-15 Crash Stats signature report @mozilla CheckedInt T value.png

Possibly this regressed in 108.2.0?

https://hg.mozilla.org/releases/mozilla-release/pushloghtml?fromchange=FIREFOX_108_0_1_RELEASE&tochange=FIREFOX_108_0_2_RELEASE

We might see fewer crashes from 108.1.x than from 108.2.0 because early Fenix updates are throttled.

I see 108.2.0's pushlog includes a uniffi update. We've had Android crashes caused by uniffi updates multiple times, but those bugs were usually on Android x86 devices. This bug's crashes are nearly all arm64. So uniffi is probably not a problem here.

Flags: needinfo?(cpeterson)

After digging a bit deeper, I think the crash addresses of 0x0000000000000020 on 64-bit and 0x00000018 on 32-bit are hints that the RefPtr<MediaRawData> sample is a null pointer despite the guards we have in place. MediaRawData is subclassed from MediaData here. If we look at the definition of MediaData, we can see that an enum and an int64_t are defined before the TimeUnit we try to access here.

Looking at the way this is laid out in memory, we get the following on an x64 Linux machine:

(rr) ptype /o MediaData
/* offset      |    size */  type = class mozilla::MediaData {
                             protected:
/*      8      |       8 */    class mozilla::ThreadSafeAutoRefCnt {
                                 public:
                                   static const bool isThreadSafe;
                                 private:
/*      8      |       8 */        struct std::atomic<unsigned long> [with _Tp = unsigned long] : public std::__atomic_base<_Tp> {
                                       static const bool is_always_lock_free;
                                       /* total size (bytes):    8 */
                                   } mValue;
                                   /* total size (bytes):    8 */
                               } mRefCnt;
                             public:
/*     16      |       4 */    const mozilla::MediaData::Type mType;
/* XXX  4-byte hole      */
/*     24      |       8 */    int64_t mOffset;
/*     32      |      16 */    class mozilla::media::TimeUnit

The offset of 32 for the TimeUnit works out to 0x20, the same crash address we see on 64-bit Android/Windows. If the 4-byte hole doesn't exist on 32-bit machines, the offset would be 24, which works out to 0x18, the same crash address we see on 32-bit machines. So if sample is a null pointer, sample->mTime would be attempting to access exactly those crash addresses.
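The arithmetic can be reproduced with a hypothetical mirror of the layout printed above (FakeMediaData and FakeTimeUnit are illustrative stand-ins, not the real Gecko classes; plain members are chosen so the offsets line up on a typical 64-bit ABI):

#include <cstddef>
#include <cstdint>
#include <cstdio>

struct FakeTimeUnit {
  int64_t mValue;
  bool mIsValid;
};

struct FakeMediaData {
  void* mVtablePlaceholder;  // virtual dispatch pointer, offset 0
  uint64_t mRefCnt;          // offset 8
  uint32_t mType;            // offset 16, followed by a 4-byte hole
  int64_t mOffset;           // offset 24
  FakeTimeUnit mTime;        // offset 32 == 0x20
};

int main() {
  // Prints 0x20 on a 64-bit ABI: reading sample->mTime through a null
  // `sample` pointer faults at exactly that address. On the 32-bit layout
  // shown later in this bug, the analogous offset is 0x18.
  std::printf("mTime offset: 0x%zx\n", offsetof(FakeMediaData, mTime));
  return 0;
}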

If this theory is correct, we still need to answer how the sample pointer is becoming invalidated; alwu's comment already points out many of the guards we have in place to prevent this, so it could still be toolchain related. We also need to know why the vast majority of the crashes are on 64-bit Android and only cropped up recently, and why beta and nightly aren't affected nearly as much as release. I'm uploading a patch that adds another null pointer check to aid in diagnosing, which might fix the crash on release, and a diagnostic assert so beta/nightly will crash there and we can try to sort it out on those channels. I'm having problems accessing try at the moment, though, and haven't been able to submit a try build.
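A minimal standalone sketch of that defensive pattern is below, using stand-in types and an ordinary assert in place of MOZ_DIAGNOSTIC_ASSERT; this is an illustration of the approach, not the landed patch:

#include <cassert>

// Stand-in types; the real ones are MediaRawData and media::TimeUnit.
struct TimeUnit {
  long long mValue = 0;
  bool mIsValid = true;
  bool IsValid() const { return mIsValid; }
};
struct Sample { TimeUnit mTime; };

enum class DemuxResult { Ok, Error };

DemuxResult CheckSampleBeforeUse(const Sample* aSample,
                                 const TimeUnit& aNextRandomAccessPoint) {
  // Diagnostic assert: crash loudly on builds where asserts are enabled
  // (Nightly/early Beta in Gecko terms) so the bad state can be inspected.
  assert(aSample && "null sample returned from the track buffers manager");

  // Graceful path for release builds: report an error instead of reading
  // through a null pointer at offset 0x20 / 0x18.
  if (!aSample || !aSample->mTime.IsValid()) {
    return DemuxResult::Error;
  }

  // Safe to compare now; the real code would use this to decide whether
  // the sample is at or before the next random access point.
  bool atOrBefore = aSample->mTime.mValue <= aNextRandomAccessPoint.mValue;
  (void)atOrBefore;
  return DemuxResult::Ok;
}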

Assignee: nobody → azebrowski
Status: NEW → ASSIGNED

I don't have an explicit answer for what kind of toolchain change might cause this; I just wanted to know whether there is any possible change that would affect Android.

Flags: needinfo?(alwu)
Attachment #9315492 - Attachment description: Bug 1810396 - Check validity of pointer to sample data when returning samples from demuxer and induce crash on debug builds for diagnostic purposes r=alwu → Bug 1810396 - Additional sanity checks for sample data when returning samples from demuxer r=alwu

(In reply to Ryan VanderMeulen [:RyanVM] from comment #14)

For what it's worth, Desktop 109 releases are hitting the same crashes (albeit lower volume):
https://crash-stats.mozilla.org/report/index/26453d43-6968-44e4-8b5c-7e6790230201

FWIW as well, this 32-bit Windows crash confirms what [:az] wrote earlier in comment 16, i.e. the code where we crash is trying to compare the 64-bit values stored in sample->mTime and mNextRandomAccessPoint, but sample is a null pointer.

0:031> dt xul!mozilla::MediaRawData
   +0x000 __VFN_table : Ptr32 
   +0x008 mRefCnt          : mozilla::ThreadSafeAutoRefCnt
   +0x00c mType            : mozilla::MediaData::Type
   +0x010 mOffset          : Int8B
-> +0x018 mTime            : mozilla::media::TimeUnit <-
   +0x028 mTimecode        : mozilla::media::TimeUnit
  ...

0:031> dt xul!mozilla::MediaSourceTrackDemuxer
   +0x000 __VFN_table : Ptr32 
   +0x004 mRefCnt          : mozilla::ThreadSafeAutoRefCnt
   +0x008 mParent          : RefPtr<mozilla::MediaSourceDemuxer>
   +0x00c mTaskQueue       : RefPtr<mozilla::TaskQueue>
   +0x010 mType            : mozilla::TrackInfo::TrackType
   +0x014 mMonitor         : mozilla::Monitor
-> +0x040 mNextRandomAccessPoint : mozilla::media::TimeUnit <-
   +0x050 mManager         : RefPtr<mozilla::TrackBuffersManager>
   ...

126cb276 8d742408         lea     esi, [esp+8]
[...]
126cb28b 8b06             mov     eax, dword ptr [esi]
126cb28d 8b4818           mov     ecx, dword ptr [eax+18h] <-- crashing instruction
126cb290 3b4f40           cmp     ecx, dword ptr [this->mNextRandomAccessPoint{.mValue.mValue(!!)} (edi+40h)]
126cb293 8b401c           mov     eax, dword ptr [eax+1Ch]
126cb296 1b4744           sbb     eax, dword ptr [this->mNextRandomAccessPoint.mValue.mValue(!!) (edi+40h)+4]

0:031> r esi // address of sample
esi=320ff368
0:031> ? poi(esi)  // contents of sample
Evaluate expression: 0 = 00000000
0:031> r eax // contents of sample as well
eax=00000000
See Also: → 1814600
Pushed by azebrowski@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6e83e2546b09
Additional sanity checks for sample data when returning samples from demuxer r=alwu
Severity: -- → S2
Keywords: leave-open
OS: Android → Unspecified
Priority: -- → P2

Comment on attachment 9315492 [details]
Bug 1810396 - Additional sanity checks for sample data when returning samples from demuxer r=alwu

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes will continue if declined. These crashes are only happening on release; the new null pointer check in this patch should help with further diagnosis and potentially reduce the crash count.
  • Is this code covered by automated tests?: Unknown
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Patch is small -- it adds a null pointer check in the place we're already crashing and two diagnostic asserts, along with a trivial datatype change
  • String changes made/needed:
  • Is Android affected?: Yes
Attachment #9315492 - Flags: approval-mozilla-release?

Given that this is only affecting release, after some discussion it seems like an uplift would be prudent to help narrow this down some. We still need to investigate further to find the root cause.

Diagnostic asserts aren't enabled past early beta, though. Is the remaining part of the patch still worth taking without those being effectively enabled?

Flags: needinfo?(azebrowski)

Apart from the asserts I think it'd still be worth it. The thought is that the new null pointer check would allow it to fail gracefully rather than outright crashing, and if the crashes continue we'll have some additional info to diagnose with either way.

Flags: needinfo?(azebrowski)
Attachment #9315492 - Flags: approval-mozilla-beta?

Comment on attachment 9315492 [details]
Bug 1810396 - Additional sanity checks for sample data when returning samples from demuxer r=alwu

I am taking the patch to the beta branch before the merge.

Attachment #9315492 - Flags: approval-mozilla-release?
Attachment #9315492 - Flags: approval-mozilla-release-
Attachment #9315492 - Flags: approval-mozilla-beta?
Attachment #9315492 - Flags: approval-mozilla-beta+

:az do you still need leave-open on this or can it be resolved?

Flags: needinfo?(azebrowski)

Looks like this can be marked resolved, closing.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Flags: needinfo?(azebrowski)
Resolution: --- → FIXED
Target Milestone: --- → 111 Branch
See Also: → 1838256