Crash in [@ mozilla::CheckedInt<T>::value]
Categories
(Core :: Audio/Video: Playback, defect, P2)
People
(Reporter: gsvelto, Assigned: az)
References
Details
(Keywords: crash, regression, topcrash)
Crash Data
Attachments
(2 files)
- 54.65 KB, image/png | Details
- 48 bytes, text/x-phabricator-request | pascalc: approval-mozilla-beta+ | pascalc: approval-mozilla-release- | Details | Review
Crash report: https://crash-stats.mozilla.org/report/index/0cdeca89-9d89-44f5-84bc-2ba2f0230115
Reason: SIGSEGV / SEGV_MAPERR
Top 10 frames of crashing thread:
0 libxul.so mozilla::CheckedInt<long>::value const mfbt/CheckedInt.h:563
0 libxul.so mozilla::media::TimeUnit::operator<= const dom/media/TimeUnits.h:143
0 libxul.so mozilla::MediaSourceTrackDemuxer::DoGetSamples dom/media/mediasource/MediaSourceDemuxer.cpp:463
1 libxul.so mozilla::detail::RunnableMethodArguments<StoreCopyPassByRRef<int> >::applyImpl<mozilla::MediaSourceTrackDemuxer, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> > xpcom/threads/nsThreadUtils.h:1147
1 libxul.so mozilla::detail::RunnableMethodArguments<StoreCopyPassByRRef<int> >::apply<mozilla::MediaSourceTrackDemuxer, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> > xpcom/threads/nsThreadUtils.h:1153
1 libxul.so mozilla::detail::MethodCall<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true>, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> > xpcom/threads/MozPromise.h:1518
1 libxul.so mozilla::detail::ProxyRunnable<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true>, RefPtr<mozilla::MozPromise<RefPtr<mozilla::MediaTrackDemuxer::SamplesHolder>, mozilla::MediaResult, true> > xpcom/threads/MozPromise.h:1538
2 libxul.so mozilla::TaskQueue::Runner::Run xpcom/threads/TaskQueue.cpp:259
3 libxul.so nsThreadPool::Run xpcom/threads/nsThreadPool.cpp:309
4 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1198
This does not look like a new problem, but rather one where the signature changed due to inlining (we should probably remove the first two frames from the signature). It's predominantly happening on Fenix.
Comment 1•1 year ago
This kind of looks like it could be a regression, honestly. I tried searching for crashes in the last 6 months with MediaSourceTrackDemuxer::DoGetSamples in the proto signature and I didn't see anything with the same volume as these two crashes.
Comment 2•1 year ago
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 AArch64 and ARM crashes on release
:jimm, could you consider increasing the severity of this top-crash bug?
For more information, please visit auto_nag documentation.
Comment 3•1 year ago
(In reply to Andrew McCreight [:mccr8] from comment #1)
This kind of looks like it could be a regression, honestly. I tried searching for crashes in the last 6 months with MediaSourceTrackDemuxer::DoGetSamples in the proto signature and I didn't see anything with the same volume as these two crashes.
This crash signature is now the top content crash in Fenix 109, though I don't see any crash reports from Beta 110 or Nightly 111. In fact, over the last six months, there have been only 3 reports from Nightly and 5 from Beta, even if I group the MediaSourceTrackDemuxer::DoGetSamples and CheckedInt<T>::value signatures together. Why is this crash so much more common on Release than Nightly or Beta?
It looks like there may have been a spike in Fenix 102 and another in 108:
Comment 4•1 year ago
This is the top content crash in Fenix 109 Release. It looks like a regression in Fenix 108+ Release to me. We've received 4000+ crash reports from Fenix 108+109 Release, but only 173 from Fenix 107 Release.
We've received only two crash reports from Beta 108-110 and one from Nightly 108-111. Why is this crash so much more likely on Release?
Comment 5•1 year ago
The last thing to touch MediaSourceDemuxer.cpp was bug 1764293, which landed in the 108 timeframe. And it also directly touched the DoGetSamples function.
Updated•1 year ago
It looks like the root cause may be due to a timing issue where a sample pointer is invalidated while waiting for a lock, though the pointer itself is created / validated immediately before that here. We have crashes with the same signature going back for quite a few versions to at least 99 with a spike in 102 as cpeterson pointed out and on other platforms as well, e.g. this report on Windows 10 102.7.0esr. I don't think that the recent change to MediaSourceDemuxer.cpp is responsible for the timing issue itself, but it fixed issues with embedded playback of streams with mismatched start times that might be more likely to trigger the issue. A quick fix could be to add an additional pointer validity check, though that doesn't answer the question as to why the pointer's becoming invalid in the first place, especially on Android. I think it'd fix the crash but not the underlying issue so I believe we'd still have playback failures. I'm still looking at this -- will chat with alwu on this later today and NI padenot as well.
Comment 7•1 year ago
If that crash line is correct, it means one of those two TimeUnit values is invalid. But I couldn't find a reasonable explanation for that.
sample (MediaRawData)
When appending sample into samples, there is already an assertion to ensure the sample is valid. That sample would always be guaranteed alive because it's held by a RefPtr.
Okay, let's look at another possibility. Is it possible someone modifies its mTime while we're waiting for the monitor? When the TrackBuffersManager returns a sample via GetSample(), the returned sample is a duplicate, so only MediaSourceDemuxer is holding that sample. So it's not possible for someone to modify the sample's mTime while MediaSourceDemuxer is accessing it.
mNextRandomAccessPoint
It doesn't get initialized explicitly, so the default value for that TimeUnit is zero, which is valid. When assigning a value to it, we call TrackBuffersManager::GetNextRandomAccessPoint, and these are the places where the TimeUnit will be returned. The returned TimeUnit would be either a sample's time (which we've shown above should be valid if it's returned by GetSample()) or TimeUnit::FromInfinity() (which is also a valid value). So from this point of view, it can't be invalid either.
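As a hedged sketch of the CheckedInt semantics at play here (a simplified stand-in, not mfbt's actual implementation), the key points are: a default-constructed value is zero and valid, INT64_MAX is representable and therefore valid, and only an arithmetic overflow flips the validity flag that the assertion in CheckedInt::value checks.

```cpp
#include <cstdint>
#include <limits>

// Simplified stand-in for mfbt's CheckedInt<int64_t> (illustrative only):
// mValue is only safe to read while mIsValid is true, which is what the
// assertion at CheckedInt.h:563 enforces.
struct CheckedInt64 {
  int64_t mValue = 0;   // default-constructed value is zero and valid
  bool mIsValid = true;

  CheckedInt64 Plus(int64_t aRhs) const {
    CheckedInt64 result = *this;
    // Signed overflow invalidates the result instead of wrapping.
    if (!mIsValid ||
        (aRhs > 0 && mValue > std::numeric_limits<int64_t>::max() - aRhs) ||
        (aRhs < 0 && mValue < std::numeric_limits<int64_t>::min() - aRhs)) {
      result.mIsValid = false;
    } else {
      result.mValue = mValue + aRhs;
    }
    return result;
  }
  bool IsValid() const { return mIsValid; }
};
```

In particular, a value of INT64_MAX (as TimeUnit::FromInfinity() stores) is valid on its own; it would only become invalid if further arithmetic overflowed it.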
Comment 8•1 year ago
One interesting thing is, all those crashes have the same address (0x0000000000000020 for 64-bit and 0x00000018 for 32-bit). That makes me think of the possibility that maybe the infinite value (INT64_MAX) somehow wasn't in range when checking whether the value is valid.
I found a post about a value not getting printed correctly on Android, which seems related to a toolchain mismatch. Chris, do you know if the Fenix team or build team changed/updated the toolchain recently? Maybe we can check that direction as well to see if there is a possible regression. Or do you know who is the right person to answer this question?
Thank you.
Comment 9•1 year ago
The bug is marked as tracked for firefox109 (release), tracked for firefox110 (beta) and tracked for firefox111 (nightly). We have limited time to fix this, the soft freeze is in 8 days. However, the bug still isn't assigned.
:jimm, could you please find an assignee for this tracked bug? Given that it is a regression and we know the cause, we could also simply back out the regressor. If you disagree with the tracking decision, please talk with the release managers.
For more information, please visit auto_nag documentation.
Comment 10•1 year ago
:az was thinking this might have blown up due to her fix addressing playback of a particular set of videos that trigger the issue. She's going to try to put something defensive together and mark this leave-open.
Chris - it would be great if we had urls reported in crash reports for something like this. Re - https://docs.google.com/document/d/1uloP5DXQWp3QnKA7UIQUfnBBfgK3KEr5XdA0S_TjCH8/edit#bookmark=id.dgm1eoo8hdb1
Comment 11•1 year ago
(In reply to Jim Mathies [:jimm] from comment #10)
:az was thinking this might have blown up due to her fix addressing playback of a particular set of videos that trigger the issue. She's going to try to put something defensive together and mark this leave-open.
Chris - it would be great if we had urls reported in crash reports for something like this. Re - https://docs.google.com/document/d/1uloP5DXQWp3QnKA7UIQUfnBBfgK3KEr5XdA0S_TjCH8/edit#bookmark=id.dgm1eoo8hdb1
From crash stats, this blew up in 109, and the twitter fix landed in 108, so bug 1764293 was not the cause.
Comment 12•1 year ago
(In reply to Alastor Wu [:alwu] from comment #8)
Chris, do you know if the Fenix team or build team changed/updated the toolchain recently? Maybe we can check that direction as well to see if there is a possible regression. Or do you know who is the right person to answer this question?
Could you clarify which toolchains you're referring to? In the 108 cycle, we updated to clang 15 in bug 1784202 and rust 1.65 in bug 1789508. We haven't updated our NDK in years.
Comment 13•1 year ago
Possibly this regressed in 108.2.0?
Comment 14•1 year ago
For what it's worth, Desktop 109 releases are hitting the same crashes (albeit lower volume):
https://crash-stats.mozilla.org/report/index/26453d43-6968-44e4-8b5c-7e6790230201
We also see some reports from older versions like 106/107:
https://crash-stats.mozilla.org/report/index/8d1682ee-8e12-4535-8971-56d020221112
https://crash-stats.mozilla.org/report/index/607755a4-68c2-4e50-b2bf-79b0f0230128
Comment 15•1 year ago
(In reply to Alastor Wu [:alwu] from comment #8)
I found a post about a value not getting printed correctly on Android, which seems related to a toolchain mismatch. Chris, do you know if the Fenix team or build team changed/updated the toolchain recently? Maybe we can check that direction as well to see if there is a possible regression. Or do you know who is the right person to answer this question?
Ryan answered the question about updating the toolchain in comment 12. For more details about these updates, I would ask :glandium.
In the 108 cycle, we updated to clang 15 in bug 1784202 and rust 1.65 in bug 1789508. We haven't updated our NDK in years.
(In reply to Jim Mathies [:jimm] from comment #13)
Created attachment 9315433 [details]
Screenshot 2023-02-01 at 13-34-15 Crash Stats signature report @mozilla CheckedInt T value.png
Possibly this regressed in 108.2.0?
We might see fewer crashes from 108.1.x than from 108.2.0 because early Fenix updates are throttled.
I see 108.2.0's pushlog includes a uniffi update. We've had Android crashes caused by uniffi updates multiple times, but those bugs were usually on Android x86 devices. This bug's crashes are nearly all arm64. So uniffi is probably not a problem here.
Assignee
Comment 16•1 year ago
After digging a bit deeper, I think that the crash addresses of 0x0000000000000020 on 64-bit and 0x00000018 on 32-bit are hints that RefPtr<MediaRawData> sample is a null pointer despite the guards we have in place. MediaRawData is subclassed from MediaData here. If we look at the definition of MediaData, we can see that we have an enum and an int64_t defined before the TimeUnit we try to access here.
Looking at the way this is laid out in memory, we get the following on a x64 Linux machine:
(rr) ptype /o MediaData
/* offset | size */ type = class mozilla::MediaData {
protected:
/* 8 | 8 */ class mozilla::ThreadSafeAutoRefCnt {
public:
static const bool isThreadSafe;
private:
/* 8 | 8 */ struct std::atomic<unsigned long> [with _Tp = unsigned long] : public std::__atomic_base<_Tp> {
static const bool is_always_lock_free;
/* total size (bytes): 8 */
} mValue;
/* total size (bytes): 8 */
} mRefCnt;
public:
/* 16 | 4 */ const mozilla::MediaData::Type mType;
/* XXX 4-byte hole */
/* 24 | 8 */ int64_t mOffset;
/* 32 | 16 */ class mozilla::media::TimeUnit
The offset of 32 for the TimeUnit works out to 0x20, the same crash address we see on 64-bit Android/Windows. If the 4-byte hole doesn't exist on 32-bit machines, then the offset would be 24, working out to 0x18, which is the same crash address we see on 32-bit machines. Then, if sample is a null pointer, sample->mTime would be attempting to access those same crash addresses.
If this theory is correct, we still need to answer the question of how the sample pointer is becoming invalidated; alwu's comment has already pointed out a lot of the places where we have guards to prevent this from happening, so it could still be toolchain related. We also need to know why the vast majority of the crashes are on 64-bit Android and only cropped up recently, and I'm not sure why beta and nightly aren't being affected nearly as much as release. I'm uploading a patch with another null pointer check to aid in diagnosing. It might fix the crash on release while also adding a diagnostic crash for beta/nightly so we can try to sort it out there, though I'm having problems accessing try at the moment and haven't been able to submit a try build.
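To illustrate the layout argument above, here is a hedged, self-contained sketch (simplified stand-in types, not the real Gecko classes) showing why a class with a vtable pointer, an 8-byte refcount, a 4-byte enum, and an int64_t places the next member at offset 0x20 on a typical LP64 ABI:

```cpp
#include <atomic>
#include <cstdint>
#include <cstddef>

// Hypothetical mirror of MediaData's leading members (illustrative names),
// to show why dereferencing mTime through a null pointer faults at
// address 0x20 on 64-bit platforms.
struct FakeTimeUnit {
  int64_t mValue = 0;  // stand-in for CheckedInt<int64_t>
  bool mIsValid = true;
};

struct FakeMediaData {
  virtual ~FakeMediaData() = default;  // vtable pointer: offset 0, 8 bytes
  std::atomic<uint64_t> mRefCnt{0};    // offset 8, 8 bytes
  int32_t mType = 0;                   // offset 16, then a 4-byte hole
  int64_t mOffset = 0;                 // offset 24, 8-byte aligned
  FakeTimeUnit mTime;                  // offset 32 == 0x20
};

// Byte offset of mTime from the object's base address.
std::ptrdiff_t MTimeOffset() {
  FakeMediaData data;
  return reinterpret_cast<const char*>(&data.mTime) -
         reinterpret_cast<const char*>(&data);
}
```

On a typical x86-64 Linux build this returns 0x20; on a 32-bit ABI with 4-byte pointers and no alignment hole, the same members would put mTime at 0x18, matching the two crash addresses.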
Assignee
Comment 17•1 year ago
Updated•1 year ago
Comment 18•1 year ago
I don't have an explicit answer for what kind of toolchain change might cause this; I just wanted to know if there were any changes that could affect Android.
Updated•1 year ago
Comment 19•1 year ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #14)
For what it's worth, Desktop 109 releases are hitting the same crashes (albeit lower volume):
https://crash-stats.mozilla.org/report/index/26453d43-6968-44e4-8b5c-7e6790230201
FWIW as well, this 32-bit Windows crash confirms what [:az] wrote earlier in comment 16, i.e. the code where we crash is doing a comparison between the 64-bit values stored in sample->mTime and mNextRandomAccessPoint, but sample is a null pointer.
0:031> dt xul!mozilla::MediaRawData
+0x000 __VFN_table : Ptr32
+0x008 mRefCnt : mozilla::ThreadSafeAutoRefCnt
+0x00c mType : mozilla::MediaData::Type
+0x010 mOffset : Int8B
-> +0x018 mTime : mozilla::media::TimeUnit <-
+0x028 mTimecode : mozilla::media::TimeUnit
...
0:031> dt xul!mozilla::MediaSourceTrackDemuxer
+0x000 __VFN_table : Ptr32
+0x004 mRefCnt : mozilla::ThreadSafeAutoRefCnt
+0x008 mParent : RefPtr<mozilla::MediaSourceDemuxer>
+0x00c mTaskQueue : RefPtr<mozilla::TaskQueue>
+0x010 mType : mozilla::TrackInfo::TrackType
+0x014 mMonitor : mozilla::Monitor
-> +0x040 mNextRandomAccessPoint : mozilla::media::TimeUnit <-
+0x050 mManager : RefPtr<mozilla::TrackBuffersManager>
...
126cb276 8d742408 lea esi, [esp+8]
[...]
126cb28b 8b06 mov eax, dword ptr [esi]
126cb28d 8b4818 mov ecx, dword ptr [eax+18h] <-- crashing instruction
126cb290 3b4f40 cmp ecx, dword ptr [this->mNextRandomAccessPoint{.mValue.mValue(!!)} (edi+40h)]
126cb293 8b401c mov eax, dword ptr [eax+1Ch]
126cb296 1b4744 sbb eax, dword ptr [this->mNextRandomAccessPoint.mValue.mValue(!!) (edi+40h)+4]
0:031> r esi // address of sample
esi=320ff368
0:031> ? poi(esi) // contents of sample
Evaluate expression: 0 = 00000000
0:031> r eax // contents of sample as well
eax=00000000
Comment 20•1 year ago
Pushed by azebrowski@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6e83e2546b09 Additional sanity checks for sample data when returning samples from demuxer r=alwu
Comment 21•1 year ago
bugherder
Assignee
Comment 22•1 year ago
Comment on attachment 9315492 [details]
Bug 1810396 - Additional sanity checks for sample data when returning samples from demuxer r=alwu
Beta/Release Uplift Approval Request
- User impact if declined: Crashes will continue if declined. These crashes are only happening on release, and this patch should help diagnose further and potentially reduce the crash count from the new null pointer check.
- Is this code covered by automated tests?: Unknown
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Patch is small -- it adds a null pointer check in the place we're already crashing and two diagnostic asserts, along with a trivial datatype change
- String changes made/needed:
- Is Android affected?: Yes
Assignee
Comment 23•1 year ago
Given that this is only affecting release, after some discussion it seems like an uplift would be prudent to help narrow this down some. We still need to investigate further to find the root cause.
Comment 24•1 year ago
Diagnostic asserts aren't enabled past early beta, though. Is the remaining part of the patch still worth taking without those being effectively enabled?
Assignee
Comment 25•1 year ago
Apart from the asserts I think it'd still be worth it. The thought is that the new null pointer check would allow it to fail gracefully rather than outright crashing, and if the crashes continue we'll have some additional info to diagnose with either way.
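The defensive pattern being described might look like the following sketch (illustrative stand-in names, not the actual MediaSourceDemuxer code, which would reject a MozPromise with a MediaResult): bail out with an error instead of dereferencing a possibly-null sample.

```cpp
#include <cstdint>

// Hypothetical stand-in types for illustration.
struct Sample { int64_t mTimeUs = 0; };

enum class DemuxStatus { Ok, CorruptedData };

// Graceful-failure check: previously a null sample meant a crash when
// reading sample->mTime; with the check, the caller sees an error and
// can surface a playback failure instead of bringing down the process.
DemuxStatus ValidateSample(const Sample* aSample) {
  if (!aSample) {
    return DemuxStatus::CorruptedData;  // fail gracefully, don't crash
  }
  return DemuxStatus::Ok;
}
```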
Updated•1 year ago
Comment 26•1 year ago
Comment on attachment 9315492 [details]
Bug 1810396 - Additional sanity checks for sample data when returning samples from demuxer r=alwu
I am taking the patch to the beta branch before the merge.
Comment 27•1 year ago
bugherder uplift
Updated•1 year ago
Comment 28•1 year ago
:az do you still need leave-open on this or can it be resolved?
Assignee
Comment 29•1 year ago
Looks like this can be marked resolved, closing.
Updated•1 year ago
Updated•1 year ago