Crash in [@ mozilla::WebrtcGmpVideoEncoder::Encoded]
Categories
(Core :: WebRTC, defect)
Tracking
()
Tracking | Status | |
---|---|---|
relnote-firefox | --- | 66+ |
firefox-esr60 | --- | unaffected |
firefox65 | --- | unaffected |
firefox66 | blocking | verified |
firefox67 | --- | verified |
firefox68 | --- | verified |
People
(Reporter: philipp, Assigned: dminor)
References
Details
(Keywords: crash, regression)
Crash Data
Attachments
(1 file, 1 obsolete file)
47 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
pascalc
:
approval-mozilla-release+
|
Details | Review |
[Tracking Requested - why for this release]:
This bug is for crash report bp-fa088414-0209-4152-9e76-b635b0190315.
Top 10 frames of crashing thread:
0 libxul.so mozilla::WebrtcGmpVideoEncoder::Encoded media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp:570
1 libxul.so mozilla::gmp::GMPVideoEncoderParent::RecvEncoded dom/media/gmp/GMPVideoEncoderParent.cpp:256
2 libxul.so mozilla::gmp::PGMPVideoEncoderParent::OnMessageReceived ipc/ipdl/PGMPVideoEncoderParent.cpp:321
3 libxul.so mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2160
4 libxul.so mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:1936
5 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1161
6 libxul.so NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:474
7 libxul.so mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:333
8 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:315
9 libxul.so nsThread::ThreadFunc xpcom/threads/nsThread.cpp:449
this crash signature is newly showing up since the first rc for firefox 66.0 (build 20190311211048) on fennec and is continuing in the subsequent rcs as well.
it's accounting for 12% of all crashes on fennec 66 currently.
the changelog between 66.0b13 and 66.0rc1 was quite contained: https://bugzilla.mozilla.org/buglist.cgi?bug_id=1534256%2C1498973%2C1533261%2C1528045%2C1531893%2C1528317%2C1518524%2C1531176%2C1533667
only bug 1533261 would stick out as obviously related to webrtc in there...
Comment 1•6 years ago
|
||
Byron, can you take a look at this? This is a high crash rate for 66 release (Firefox for Android only)
If you think that your patch in bug 1533261 may have caused the fennec crash, and we can back it out quickly, we could potentially respin the Fennec 66 tomorrow morning.
Comment 2•6 years ago
•
|
||
Bug 1533261 isn't causing this. This might have been spotted over in bug 1533001, and appears to be related to the openh264 plugin.
Reporter | ||
Comment 3•6 years ago
|
||
in a spot check in some of the crash reports, affected users still had OpenH264 1.7.1 though.
Assignee | ||
Comment 4•6 years ago
|
||
Because of Bug 1532756, OpenH264 on Android was non-functional for Firefox 66 from around Jan 10 to March 8, so we would not expect to see crashes from before March 8th.
Comment 5•6 years ago
|
||
That widens the regression range considerably...
Assignee | ||
Comment 7•6 years ago
|
||
There are a small number of null pointer crashes, e.g. https://crash-stats.mozilla.com/report/index/a7de2d23-dbef-42db-898e-0e09a0190317
The remainder all seem to be on an address ending in 0x012 which is not four byte aligned. Many of these are flagged as SIGBUS / BUS_ADALN. BUS_ADALN is breakpadese for BUS_ADRALN which is an unaligned access error.
The code around the crash site has not changed recently. Maybe we've changed some compilation flags and we're now generating code that expects aligned reads, or maybe we used to guarantee alignment of buffers coming from GMP and that is no longer the case, or maybe this is just a red herring.
Comment 8•6 years ago
|
||
This is a high enough crash volume that it should likely block the 66 Fennec release (or, really, keep it throttled at its current 5%).
Comment 9•6 years ago
|
||
Top crashing devices are Oppo F7 followed by Samsung Galaxy Note 8 (which I don't believe we have in house). The next one is the Galaxy Tab 4.
Assignee | ||
Comment 11•6 years ago
|
||
Assignee | ||
Comment 12•6 years ago
|
||
(In reply to Byron Campen [:bwc] from comment #10)
Should we pref off h264 on android for now?
I think that makes sense. All reports are apparently from chatroulette variants, e.g. dirtyroulette.com, chatroulette.com, chatrandom.com/gay) but that could just be because they are popular, I'm not sure. It doesn't reproduce for me on my phone on mozilla.github.io/webrtc-landing or appear.in, and I've been doing a fair bit of openh264 on Android over the past few weeks. I run into Bug 1397539 before I see any crashes.
We could add error checking for null and unaligned buffers, but I'm not really comfortable doing that without understanding why we're all of a sudden seeing problems.
Assignee | ||
Comment 13•6 years ago
|
||
Randell, I don't suppose you know of any reason why we'd suddenly see problems with unaligned accesses here?
Comment 14•6 years ago
|
||
Unaligned access often (I think?) is a compiler or code generation issue. froydnj, I think you worked through something like this for timers recently. Is it possible we're seeing an issue with LTO? But that wouldn't impact release yet, would it?
Comment 15•6 years ago
|
||
James, nathan - any thoughts on these sorts of crashes on android? (unaligned access)
Do we need to rebuild and hope it goes away (compiler/linker bug)? were compiler settings for OpenH264 builds wrong?
Comment 16•6 years ago
|
||
OpenH264 is built independently of Fennec; is it possible something involved in Arm64 work indirectly/unexpectedly impacted this?
Reporter | ||
Comment 17•6 years ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #15)
Do we need to rebuild and hope it goes away (compiler/linker bug)?
crashes are present in 66.0rc1, rc2 and rc3 - so not constrained to a single build unfortunately.
Comment 18•6 years ago
|
||
So: this is where it's failing:
https://hg.mozilla.org/releases/mozilla-release/annotate/7c987c3a58404a6c480e2952a4f5ef4a70aee058/media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp#l568
568 case GMP_BufferLength32:
569 // presumes we can do unaligned loads
570 size = (reinterpret_cast<uint32_t>(buffer));
It also applies to the 16-bit variant, though perhaps we're not hitting that (likely not)
This implies that if we're hitting this due to changes in the OpenH264 code (probably), we can resolve this by doing something similar to the 24-bit case (byte-reads). However, if something is corrupting the data and we're happening to hit this due to that, it may just hit another problem. I would expect the more likely case is it's not corrupted, but there was a change in the GMP plugin that triggered this. (Didn't we update recently-ish?)
Comment 19•6 years ago
|
||
FYI, this is only armeabi-v7a - "Providing the hardware bit for strict alignment checking is not turned on (which, as on x86, no general-purpose OS is realistically going to do), AArch64 does permit unaligned data accesses to Normal (not Device) memory with the regular load/store instructions." - https://stackoverflow.com/questions/38535738/does-aarch64-support-unaligned-access
Assignee | ||
Comment 20•6 years ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #18)
So: this is where it's failing:
https://hg.mozilla.org/releases/mozilla-release/annotate/7c987c3a58404a6c480e2952a4f5ef4a70aee058/media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp#l568568 case GMP_BufferLength32:
569 // presumes we can do unaligned loads
570 size = (reinterpret_cast<uint32_t>(buffer));It also applies to the 16-bit variant, though perhaps we're not hitting that (likely not)
The OpenH264 plugin always uses GMP_BufferLength32, so we don't have to worry about the 16-bit variant (https://github.com/cisco/openh264/blob/openh264v1.7.1/module/gmp-openh264.cpp#L571)
This implies that if we're hitting this due to changes in the OpenH264 code (probably), we can resolve this by doing something similar to the 24-bit case (byte-reads). However, if something is corrupting the data and we're happening to hit this due to that, it may just hit another problem. I would expect the more likely case is it's not corrupted, but there was a change in the GMP plugin that triggered this. (Didn't we update recently-ish?)
There have been no changes in the OpenH264 plugin, we're still shipping 1.7.1. I have been working on the 1.8.1 update, but that has only been rolled out to Nightly desktop platforms so far, due to Bug 1533001.
So that would leave a compiler related change, something GMP related, or data corruption.
I can definitely add code to detect and handle the unaligned data (or just return an error in that case) but it would be good to understand what has changed to make it necessary to do that now.
Comment 21•6 years ago
|
||
There have been no changes in the OpenH264 plugin, we're still shipping 1.7.1. I have been working on the 1.8.1 update, but that has only been rolled out to Nightly desktop platforms so far, due to Bug 1533001.
So that would leave a compiler related change, something GMP related, or data corruption.
I can definitely add code to detect and handle the unaligned data (or just return an error in that case) but it would be good to understand what has changed to make it necessary to do that now.
So with all the changes for AArch64, it's certainly conceivable that some compiler option got tweaked and changed the generated code here. I'll defer to the android experts on if that could have happened.
Comment 22•6 years ago
|
||
The code landed in bug 999704 - there's no discussion of the alignment issue; I vaguely remember discussing it on IRC with <someone> 5 years ago; likely including android (since I know x86 can do unaligned loads). So if the plugin didn't change it's very likely the mozilla codegen did somehow.
Comment 23•6 years ago
|
||
Hi, I investigated this crash and tried to reproduce the problem using Samsung Galaxy Note 8 (Android 9), Samsung Galaxy tab S3 (Android 8.0) and Huawei P20 Lite (Android 8) on the latest version of Release 66.0 build 3. I tried to crash the app by streaming on http://mozilla.github.io/webrtc-landing/pc_test.html and appear.in several times but without any success.
Also, I tried with https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html from bug 1397539 and I can reproduce the problem.
Due to the fact that this ticket is a critical one, can we do something from the QA side to help with the investigation?
Assignee | ||
Comment 24•6 years ago
|
||
I'm going to prepare a patch to handle unaligned reads this morning.
I'm not 100% certain, but I don't think there have been any major build changes for aarch64 recently. We added support for aarch64 in 55. (Bug 1366404)
Comment 26•6 years ago
|
||
dminor: does it repro with you with https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html ? If so, that's great for resolving this.
Assignee | ||
Comment 27•6 years ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #26)
dminor: does it repro with you with https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html ? If so, that's great for resolving this.
I'll give it a try. Stefan, would you be able to try this as well?
Assignee | ||
Comment 28•6 years ago
|
||
The calls to readUint32 memcpy to a properly aligned buffer avoiding any
problems with alignment. This uses the endian routines to ensure that the reads
will match the endianess of the current machine. The memcpy adds some overhead
but it seems negligible compared to the amount of work done to packetize and
send the encoded data.
These changes were tested by adding code to create an unaligned buffer and
memcopying the received buffer into it.
This also adds a null check for the received buffer as we have seen a small
volume of null pointer crashes.
Assignee | ||
Comment 29•6 years ago
|
||
I've added a patch to handle unaligned reads from the buffer. I've tested it locally by artificially creating an unaligned buffer.
The risk with this patch is that the unaligned reads are a symptom of a problem elsewhere, in which case this will just move the problem rather than fix it. If we suspect this is the case, we could instead add a check for an unaligned buffer and return early.
Comment 30•6 years ago
|
||
(In reply to Dan Minor [:dminor] from comment #27)
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #26)
dminor: does it repro with you with https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html ? If so, that's great for resolving this.
I'll give it a try. Stefan, would you be able to try this as well?
I can reproduce the crash 5/5 times using this link.
Devices: Samsung Galaxy Tab S3 (Android 8.0) and Samsung Galaxy Note 8 (Android 9).
Comment 31•6 years ago
|
||
Commented in Phab that the buffer lengths (except for the odd 24-bit, as commented in the code) are implicitly processor-byte-order.
Probably comments should be added to gmp-video-codec.h specifying this.
Comment 32•6 years ago
|
||
dminor: does it repro with you with https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html ? If so, that's great for resolving this.
I'll give it a try. Stefan, would you be able to try this as well?
I can reproduce the crash 5/5 times using this link.
Devices: Samsung Galaxy Tab S3 (Android 8.0) and Samsung Galaxy Note 8 (Android 9).
Stefan: does it fail with older Fennecs? 65 or 64? Thanks!
Comment 33•6 years ago
|
||
Do we have a good way to test this yet (that isn't based on going on chatroulette)? If we can set one up now, that will help us feel confident for a dot release.
Assignee | ||
Comment 34•6 years ago
|
||
I'm planning to land the patch to handle unaligned access when my Try job finishes. Once that hits Nightly, Stefan should be able to test if the patch fixes the problem. If it doesn't, then it seems more likely to me that we're hitting some sort of data corruption problem.
Assignee | ||
Comment 35•6 years ago
|
||
(In reply to Liz Henry (:lizzard) (use needinfo) from comment #33)
Do we have a good way to test this yet (that isn't based on going on chatroulette)? If we can set one up now, that will help us feel confident for a dot release.
It reproduces for Stefan on https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html. It doesn't reproduce for me, unfortunately (different hardware, the fact that I'm running local builds, I'm not sure...)
Comment 36•6 years ago
|
||
Comment 37•6 years ago
|
||
bugherder |
Comment 38•6 years ago
•
|
||
Hello, I tried to reproduce this crash on https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html and here are my findings:
- 64.0.2 build 1 -> not reproducible;
- 65.0.1 build 2 -> not reproducible;
- 66.0 build 1 -> reproducible;
- 66.0 build 2 -> reproducible;
- 66.0 build 3 -> reproducible;
- 67.0b3 -> reproducible;
Devices:
-Samsung Galaxy Tab S3 (Android 8.0);
-Samsung Galaxy Note 8 (Android 9).
Also, I tried with Nightly 68.0a1 (2019-03-19) and I can't reproduce the crash.
Updated•6 years ago
|
Assignee | ||
Comment 39•6 years ago
|
||
Comment on attachment 9052002 [details]
Bug 1535766 - Handle unaligned buffers in WebrtcGmpVideoCodec::Encoded; r=bwc!
Beta/Release Uplift Approval Request
- Feature/Bug causing the regression: None
- User impact if declined: Frequent crashes on sites using OpenH264 for WebRTC.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: Yes
- If yes, steps to reproduce: The crash occurs on https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html on affected devices.
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This change just makes a copy of the value to a properly aligned buffer before reading it to avoid unaligned reads.
- String changes made/needed: None
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 40•6 years ago
|
||
Comment on attachment 9052002 [details]
Bug 1535766 - Handle unaligned buffers in WebrtcGmpVideoCodec::Encoded; r=bwc!
Beta/Release Uplift Approval Request
- Feature/Bug causing the regression: None
- User impact if declined: Frequent crashes on sites using OpenH264 for WebRTC.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: Yes
- If yes, steps to reproduce: The crash occurs on https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html on affected devices.
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This change just makes a copy of the value to a properly aligned buffer before reading it to avoid unaligned reads.
- String changes made/needed: None
Updated•6 years ago
|
Updated•6 years ago
|
Comment 41•6 years ago
|
||
Looks good, Pascal do you want this for tomorrow's beta? That will help us be more confident to bring it to m-r for next week.
Comment 42•6 years ago
|
||
(In reply to Stefan Deiac from comment #38)
Also, I tried with Nightly 68.0a1 (2019-03-19) and I can't reproduce the crash.
Thanks Stefan - can you verify if the nightly has the fix or not? If the rev for the build is after 1239a10882e9, it has the fix. Since the fix landed in central circa 3pm PDT, it likely would have missed the morning "Nightly", but may have been in the Nightly built later in the day. The buildid should provide the info.
Thanks again!
Dan (or nathan?) do you have an idea what might have regressed this in 66, given Stefan's tests above?
Updated•6 years ago
|
Comment 43•6 years ago
|
||
Comment on attachment 9052002 [details]
Bug 1535766 - Handle unaligned buffers in WebrtcGmpVideoCodec::Encoded; r=bwc!
Crash fix, taking in 67 Beta 4 so as that we can better evaluate the potential impact of taking it in a 66 dot release. Thanks
Assignee | ||
Comment 44•6 years ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #42)
Dan (or nathan?) do you have an idea what might have regressed this in 66, given Stefan's tests above?
I think a mozregression run by someone able to reproduce the bug would be the fastest way to determine that. When I started investigating this bug I checked the mercurial logs in some likely places but I didn't find anything that looked relevant.
Comment 45•6 years ago
|
||
Hi, I tried to reproduce the crash on the latest version of Nightly 68.0a1 (2019-03-21) using Samsung Galaxy Tab S3 (Android 8.0) but I wasn't able. Due to that, I'll mark the 68 flag as verified, thanks.
Comment 46•6 years ago
|
||
bugherder uplift |
Comment 47•6 years ago
|
||
Verified as fixed on the Beta 67.0b4 using Samsung Galaxy Tab S3 (Android 8.0) and Samsung Galaxy Note 8 (Android 9),
Thanks.
Comment 48•6 years ago
•
|
||
Comment on attachment 9052002 [details]
Bug 1535766 - Handle unaligned buffers in WebrtcGmpVideoCodec::Encoded; r=bwc!
Fennec crash fix already verified in Beta, approved for our next 66 dot release, thanks.
Comment 49•6 years ago
|
||
Sorry, was out on PTO, but it looks like y'all got things sorted!
Comment 50•6 years ago
|
||
bugherder uplift |
Comment 51•6 years ago
|
||
This got accidentally backed out from beta https://hg.mozilla.org/releases/mozilla-beta/rev/1dabaf68f8fe6e50d20202773f1a8bd95dadb4b4 and relanded as https://hg.mozilla.org/releases/mozilla-beta/rev/65a648289c8a20cc8ccb57651ece3e32fb77932d
Comment 52•6 years ago
|
||
Added to 66.0.2 Android release notes with this wording:
Fix frequent crashes on sites using the OpenH264 video codec for WebRTC (Bug 1535766)
Comment 53•6 years ago
|
||
Tested on 66.0.2 following the steps from comment 39 and couldn't reproduce the crash on the affected devices:
- Samsung Galaxy Tab S3 (Android 8.0)
- Samsung Galaxy Note 8 (Android 9)
Due to that, I'll mark the 66 flag as verified.
Comment 54•5 years ago
|
||
Please specify a root cause for this bug. See :tmaity for more information.
Assignee | ||
Updated•5 years ago
|
Comment 55•2 years ago
|
||
Removing regressionwindow-wanted
keyword because this bug has been resolved.
Comment 56•2 years ago
|
||
Removing regressionwindow-wanted
keyword because this bug has been resolved.
Comment 57•2 years ago
|
||
Removing regressionwindow-wanted
keyword because this bug has been resolved.
Description
•