Closed Bug 1535766 Opened 5 years ago Closed 5 years ago

Crash in [@ mozilla::WebrtcGmpVideoEncoder::Encoded]

Categories

(Core :: WebRTC, defect)

66 Branch
ARM
Android
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla68
Tracking Status
relnote-firefox --- 66+
firefox-esr60 --- unaffected
firefox65 --- unaffected
firefox66 blocking verified
firefox67 --- verified
firefox68 --- verified

People

(Reporter: philipp, Assigned: dminor)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file, 1 obsolete file)

[Tracking Requested - why for this release]:

This bug is for crash report bp-fa088414-0209-4152-9e76-b635b0190315.

Top 10 frames of crashing thread:

0 libxul.so mozilla::WebrtcGmpVideoEncoder::Encoded media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp:570
1 libxul.so mozilla::gmp::GMPVideoEncoderParent::RecvEncoded dom/media/gmp/GMPVideoEncoderParent.cpp:256
2 libxul.so mozilla::gmp::PGMPVideoEncoderParent::OnMessageReceived ipc/ipdl/PGMPVideoEncoderParent.cpp:321
3 libxul.so mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2160
4 libxul.so mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:1936
5 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1161
6 libxul.so NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:474
7 libxul.so mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:333
8 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:315
9 libxul.so nsThread::ThreadFunc xpcom/threads/nsThread.cpp:449

this crash signature is newly showing up since the first rc for firefox 66.0 (build 20190311211048) on fennec and is continuing in the subsequent rcs as well.
it's accounting for 12% of all crashes on fennec 66 currently.

the changelog between 66.0b13 and 66.0rc1 was quite contained: https://bugzilla.mozilla.org/buglist.cgi?bug_id=1534256%2C1498973%2C1533261%2C1528045%2C1531893%2C1528317%2C1518524%2C1531176%2C1533667
only bug 1533261 would stick out as obviously related to webrtc in there...

Byron, can you take a look at this? This is a high crash rate for 66 release (Firefox for Android only)
If you think that your patch in bug 1533261 may have caused the fennec crash, and we can back it out quickly, we could potentially respin the Fennec 66 tomorrow morning.

Flags: needinfo?(docfaraday)

Bug 1533261 isn't causing this. This might have been spotted over in bug 1533001, and appears to be related to the openh264 plugin.

Flags: needinfo?(docfaraday)
See Also: → 1533001

in a spot check in some of the crash reports, affected users still had OpenH264 1.7.1 though.

Because of Bug 1532756, OpenH264 on Android was non-functional for Firefox 66 from around Jan 10 to March 8, so we would not expect to see crashes from before March 8th.

That widens the regression range considerably...

I'll have a look.

Assignee: nobody → dminor

There are a small number of null pointer crashes, e.g. https://crash-stats.mozilla.com/report/index/a7de2d23-dbef-42db-898e-0e09a0190317

The remainder all seem to be on an address ending in 0x012 which is not four byte aligned. Many of these are flagged as SIGBUS / BUS_ADALN. BUS_ADALN is breakpadese for BUS_ADRALN which is an unaligned access error.

The code around the crash site has not changed recently. Maybe we've changed some compilation flags and we're now generating code that expects aligned reads, or maybe we used to guarantee alignment of buffers coming from GMP and that is no longer the case, or maybe this is just a red herring.

This is a high enough crash volume that it should likely block the 66 Fennec release (or, really, keep it throttled at its current 5%).

Top crashing devices are Oppo F7 followed by Samsung Galaxy Note 8 (which I don't believe we have in house). The next one is the Galaxy Tab 4.

Should we pref off h264 on android for now?

Flags: needinfo?(dminor)

(In reply to Byron Campen [:bwc] from comment #10)

Should we pref off h264 on android for now?

I think that makes sense. All reports are apparently from chatroulette variants, e.g. dirtyroulette.com, chatroulette.com, chatrandom.com/gay) but that could just be because they are popular, I'm not sure. It doesn't reproduce for me on my phone on mozilla.github.io/webrtc-landing or appear.in, and I've been doing a fair bit of openh264 on Android over the past few weeks. I run into Bug 1397539 before I see any crashes.

We could add error checking for null and unaligned buffers, but I'm not really comfortable doing that without understanding why we're all of a sudden seeing problems.

Flags: needinfo?(dminor)

Randell, I don't suppose you know of any reason why we'd suddenly see problems with unaligned accesses here?

Flags: needinfo?(rjesup)

Unaligned access often (I think?) is a compiler or code generation issue. froydnj, I think you worked through something like this for timers recently. Is it possible we're seeing an issue with LTO? But that wouldn't impact release yet, would it?

Flags: needinfo?(nfroyd)

James, nathan - any thoughts on these sorts of crashes on android? (unaligned access)

Do we need to rebuild and hope it goes away (compiler/linker bug)? were compiler settings for OpenH264 builds wrong?

Flags: needinfo?(rjesup) → needinfo?(snorp)

OpenH264 is built independently of Fennec; is it possible something involved in Arm64 work indirectly/unexpectedly impacted this?

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #15)

Do we need to rebuild and hope it goes away (compiler/linker bug)?

crashes are present in 66.0rc1, rc2 and rc3 - so not constrained to a single build unfortunately.

So: this is where it's failing:
https://hg.mozilla.org/releases/mozilla-release/annotate/7c987c3a58404a6c480e2952a4f5ef4a70aee058/media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp#l568

568 case GMP_BufferLength32:
569 // presumes we can do unaligned loads
570 size = (reinterpret_cast<uint32_t>(buffer));

It also applies to the 16-bit variant, though perhaps we're not hitting that (likely not)

This implies that if we're hitting this due to changes in the OpenH264 code (probably), we can resolve this by doing something similar to the 24-bit case (byte-reads). However, if something is corrupting the data and we're happening to hit this due to that, it may just hit another problem. I would expect the more likely case is it's not corrupted, but there was a change in the GMP plugin that triggered this. (Didn't we update recently-ish?)

Flags: needinfo?(dminor)

FYI, this is only armeabi-v7a - "Providing the hardware bit for strict alignment checking is not turned on (which, as on x86, no general-purpose OS is realistically going to do), AArch64 does permit unaligned data accesses to Normal (not Device) memory with the regular load/store instructions." - https://stackoverflow.com/questions/38535738/does-aarch64-support-unaligned-access

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #18)

So: this is where it's failing:
https://hg.mozilla.org/releases/mozilla-release/annotate/7c987c3a58404a6c480e2952a4f5ef4a70aee058/media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp#l568

568 case GMP_BufferLength32:
569 // presumes we can do unaligned loads
570 size = (reinterpret_cast<uint32_t>(buffer));

It also applies to the 16-bit variant, though perhaps we're not hitting that (likely not)

The OpenH264 plugin always uses GMP_BufferLength32, so we don't have to worry about the 16-bit variant (https://github.com/cisco/openh264/blob/openh264v1.7.1/module/gmp-openh264.cpp#L571)

This implies that if we're hitting this due to changes in the OpenH264 code (probably), we can resolve this by doing something similar to the 24-bit case (byte-reads). However, if something is corrupting the data and we're happening to hit this due to that, it may just hit another problem. I would expect the more likely case is it's not corrupted, but there was a change in the GMP plugin that triggered this. (Didn't we update recently-ish?)

There have been no changes in the OpenH264 plugin, we're still shipping 1.7.1. I have been working on the 1.8.1 update, but that has only been rolled out to Nightly desktop platforms so far, due to Bug 1533001.

So that would leave a compiler related change, something GMP related, or data corruption.

I can definitely add code to detect and handle the unaligned data (or just return an error in that case) but it would be good to understand what has changed to make it necessary to do that now.

Flags: needinfo?(dminor)

There have been no changes in the OpenH264 plugin, we're still shipping 1.7.1. I have been working on the 1.8.1 update, but that has only been rolled out to Nightly desktop platforms so far, due to Bug 1533001.

So that would leave a compiler related change, something GMP related, or data corruption.

I can definitely add code to detect and handle the unaligned data (or just return an error in that case) but it would be good to understand what has changed to make it necessary to do that now.

So with all the changes for AArch64, it's certainly conceivable that some compiler option got tweaked and changed the generated code here. I'll defer to the android experts on if that could have happened.

The code landed in bug 999704 - there's no discussion of the alignment issue; I vaguely remember discussing it on IRC with <someone> 5 years ago; likely including android (since I know x86 can do unaligned loads). So if the plugin didn't change it's very likely the mozilla codegen did somehow.

Hi, I investigated this crash and tried to reproduce the problem using Samsung Galaxy Note 8 (Android 9), Samsung Galaxy tab S3 (Android 8.0) and Huawei P20 Lite (Android 8) on the latest version of Release 66.0 build 3. I tried to crash the app by streaming on http://mozilla.github.io/webrtc-landing/pc_test.html and appear.in several times but without any success.
Also, I tried with https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html from bug 1397539 and I can reproduce the problem.
Due to the fact that this ticket is a critical one, can we do something from the QA side to help with the investigation?

I'm going to prepare a patch to handle unaligned reads this morning.

I'm not 100% certain, but I don't think there have been any major build changes for aarch64 recently. We added support for aarch64 in 55. (Bug 1366404)

Flags: needinfo?(snorp)

dminor: does it repro with you with https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html ? If so, that's great for resolving this.

Flags: needinfo?(dminor)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #26)

dminor: does it repro with you with https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html ? If so, that's great for resolving this.

I'll give it a try. Stefan, would you be able to try this as well?

Flags: needinfo?(dminor) → needinfo?(stefan.deiac)

The calls to readUint32 memcpy to a properly aligned buffer avoiding any
problems with alignment. This uses the endian routines to ensure that the reads
will match the endianess of the current machine. The memcpy adds some overhead
but it seems negligible compared to the amount of work done to packetize and
send the encoded data.

These changes were tested by adding code to create an unaligned buffer and
memcopying the received buffer into it.

This also adds a null check for the received buffer as we have seen a small
volume of null pointer crashes.

I've added a patch to handle unaligned reads from the buffer. I've tested it locally by artificially creating an unaligned buffer.

The risk with this patch is that the unaligned reads are a symptom of a problem elsewhere, in which case this will just move the problem rather than fix it. If we suspect this is the case, we could instead add a check for an unaligned buffer and return early.

(In reply to Dan Minor [:dminor] from comment #27)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #26)

dminor: does it repro with you with https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html ? If so, that's great for resolving this.

I'll give it a try. Stefan, would you be able to try this as well?

I can reproduce the crash 5/5 times using this link.
Devices: Samsung Galaxy Tab S3 (Android 8.0) and Samsung Galaxy Note 8 (Android 9).

Flags: needinfo?(stefan.deiac)

Commented in Phab that the buffer lengths (except for the odd 24-bit, as commented in the code) are implicitly processor-byte-order.

Probably comments should be added to gmp-video-codec.h specifying this.

dminor: does it repro with you with https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html ? If so, that's great for resolving this.

I'll give it a try. Stefan, would you be able to try this as well?

I can reproduce the crash 5/5 times using this link.
Devices: Samsung Galaxy Tab S3 (Android 8.0) and Samsung Galaxy Note 8 (Android 9).

Stefan: does it fail with older Fennecs? 65 or 64? Thanks!

Flags: needinfo?(stefan.deiac)

Do we have a good way to test this yet (that isn't based on going on chatroulette)? If we can set one up now, that will help us feel confident for a dot release.

I'm planning to land the patch to handle unaligned access when my Try job finishes. Once that hits Nightly, Stefan should be able to test if the patch fixes the problem. If it doesn't, then it seems more likely to me that we're hitting some sort of data corruption problem.

(In reply to Liz Henry (:lizzard) (use needinfo) from comment #33)

Do we have a good way to test this yet (that isn't based on going on chatroulette)? If we can set one up now, that will help us feel confident for a dot release.

It reproduces for Stefan on https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html. It doesn't reproduce for me, unfortunately (different hardware, the fact that I'm running local builds, I'm not sure...)

Pushed by dminor@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1239a10882e9
Handle unaligned buffers in WebrtcGmpVideoCodec::Encoded; r=bwc
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

Hello, I tried to reproduce this crash on https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html and here are my findings:

  • 64.0.2 build 1 -> not reproducible;
  • 65.0.1 build 2 -> not reproducible;
  • 66.0 build 1 -> reproducible;
  • 66.0 build 2 -> reproducible;
  • 66.0 build 3 -> reproducible;
  • 67.0b3 -> reproducible;

Devices:
-Samsung Galaxy Tab S3 (Android 8.0);
-Samsung Galaxy Note 8 (Android 9).

Also, I tried with Nightly 68.0a1 (2019-03-19) and I can't reproduce the crash.

Flags: needinfo?(stefan.deiac)
Attachment #9051798 - Attachment is obsolete: true

Comment on attachment 9052002 [details]
Bug 1535766 - Handle unaligned buffers in WebrtcGmpVideoCodec::Encoded; r=bwc!

Beta/Release Uplift Approval Request

  • Feature/Bug causing the regression: None
  • User impact if declined: Frequent crashes on sites using OpenH264 for WebRTC.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: The crash occurs on https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html on affected devices.
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This change just makes a copy of the value to a properly aligned buffer before reading it to avoid unaligned reads.
  • String changes made/needed: None
Attachment #9052002 - Flags: approval-mozilla-beta?
Flags: qe-verify?

Comment on attachment 9052002 [details]
Bug 1535766 - Handle unaligned buffers in WebrtcGmpVideoCodec::Encoded; r=bwc!

Beta/Release Uplift Approval Request

  • Feature/Bug causing the regression: None
  • User impact if declined: Frequent crashes on sites using OpenH264 for WebRTC.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: The crash occurs on https://nils-ohlmeier.github.io/webrtc-landing/pc_test_loop_h264.html on affected devices.
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This change just makes a copy of the value to a properly aligned buffer before reading it to avoid unaligned reads.
  • String changes made/needed: None
Attachment #9052002 - Flags: approval-mozilla-release?
Flags: qe-verify? → qe-verify+

Looks good, Pascal do you want this for tomorrow's beta? That will help us be more confident to bring it to m-r for next week.

Flags: needinfo?(pascalc)

(In reply to Stefan Deiac from comment #38)

Also, I tried with Nightly 68.0a1 (2019-03-19) and I can't reproduce the crash.

Thanks Stefan - can you verify if the nightly has the fix or not? If the rev for the build is after 1239a10882e9, it has the fix. Since the fix landed in central circa 3pm PDT, it likely would have missed the morning "Nightly", but may have been in the Nightly built later in the day. The buildid should provide the info.

Thanks again!

Dan (or nathan?) do you have an idea what might have regressed this in 66, given Stefan's tests above?

Flags: needinfo?(stefan.deiac)
Flags: needinfo?(dminor)
QA Whiteboard: [qa-triaged]

Comment on attachment 9052002 [details]
Bug 1535766 - Handle unaligned buffers in WebrtcGmpVideoCodec::Encoded; r=bwc!

Crash fix, taking in 67 Beta 4 so as that we can better evaluate the potential impact of taking it in a 66 dot release. Thanks

Flags: needinfo?(pascalc)
Attachment #9052002 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #42)

Dan (or nathan?) do you have an idea what might have regressed this in 66, given Stefan's tests above?

I think a mozregression run by someone able to reproduce the bug would be the fastest way to determine that. When I started investigating this bug I checked the mercurial logs in some likely places but I didn't find anything that looked relevant.

Flags: needinfo?(dminor)

Hi, I tried to reproduce the crash on the latest version of Nightly 68.0a1 (2019-03-21) using Samsung Galaxy Tab S3 (Android 8.0) but I wasn't able. Due to that, I'll mark the 68 flag as verified, thanks.

QA Whiteboard: [qa-triaged]
Flags: qe-verify+
Flags: needinfo?(stefan.deiac)

Verified as fixed on the Beta 67.0b4 using Samsung Galaxy Tab S3 (Android 8.0) and Samsung Galaxy Note 8 (Android 9),
Thanks.

Comment on attachment 9052002 [details]
Bug 1535766 - Handle unaligned buffers in WebrtcGmpVideoCodec::Encoded; r=bwc!

Fennec crash fix already verified in Beta, approved for our next 66 dot release, thanks.

Attachment #9052002 - Flags: approval-mozilla-release? → approval-mozilla-release+

Sorry, was out on PTO, but it looks like y'all got things sorted!

Flags: needinfo?(nfroyd)

Added to 66.0.2 Android release notes with this wording:

Fix frequent crashes on sites using the OpenH264 video codec for WebRTC (Bug 1535766)

Tested on 66.0.2 following the steps from comment 39 and couldn't reproduce the crash on the affected devices:

  • Samsung Galaxy Tab S3 (Android 8.0)
  • Samsung Galaxy Note 8 (Android 9)
    Due to that, I'll mark the 66 flag as verified.
Status: RESOLVED → VERIFIED

Please specify a root cause for this bug. See :tmaity for more information.

Root Cause: --- → ?
Root Cause: ? → Coding: Semantic Error

Removing regressionwindow-wanted keyword because this bug has been resolved.

Removing regressionwindow-wanted keyword because this bug has been resolved.

Removing regressionwindow-wanted keyword because this bug has been resolved.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: