Closed Bug 1790160 Opened 2 years ago Closed 1 year ago

Crash with Avast in [@ MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread]

Categories

(Core :: WebRTC: Audio/Video, defect)

Unspecified
Windows
defect

Tracking

()

VERIFIED FIXED
111 Branch
Tracking Status
firefox-esr102 --- verified
firefox109 --- wontfix
firefox110 --- verified
firefox111 --- verified

People

(Reporter: aryx, Assigned: pehrsons)

References

Details

(Keywords: crash)

Crash Data

Attachments

(3 files)

~350 crash reports stored with recent Firefox versions (104 - 102) in the last week, all on Windows. Multiple with 10 for the crash count which users experience. Crashes got frequent on 2022-09-01.

They very often share aswJsFlt64.dll.t01 as loaded module, owned by AVG Technologies USA, LLC. / Avast Software s.r.o.. Observed versions:

  • 18.0.1443.0
  • 18.0.1450.0
  • 18.0.1451.0

Thomas, did new versions of your products roll out at that time and are aware of issues with the interaction between them and Firefox? Thank you in advance.

Crash report: https://crash-stats.mozilla.org/report/index/162b5420-65b3-46b8-995e-444840220909

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll MessageLoop::PostTask ipc/chromium/src/base/message_loop.cc:411
1 xul.dll mozilla::camera::CamerasParent::DispatchToVideoCaptureThread dom/media/systemservices/CamerasParent.cpp:205
2 xul.dll mozilla::camera::CamerasParent::StopVideoCapture dom/media/systemservices/CamerasParent.cpp:215
3 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:579
4 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:567
5 xul.dll mozilla::ipc::PBackgroundParent::OnChannelError ipc/ipdl/PBackgroundParent.cpp:6533
6 xul.dll mozilla::ipc::MessageChannel::OnNotifyMaybeChannelError ipc/glue/MessageChannel.cpp:2027
7 xul.dll mozilla::detail::RunnableMethodImpl<mozilla::UntrustedModulesProcessor*, void  xpcom/threads/nsThreadUtils.h:1200
8 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1199
9 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:300
Flags: needinfo?(salomon)

Redirect a needinfo that is pending on an inactive user to the triage owner.
:haik, since the bug has high severity and recent activity, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(salomon) → needinfo?(haftandilian)

@Lukas, we're seeing an increased number of Firefox crashes with the Avast DLL loaded starting last month in September. While we don't have a clear root cause, do you know if this aligns with any changes in Avast that may be triggering these crashes?

Flags: needinfo?(haftandilian) → needinfo?(rypacek)
See Also: → 1794064

Bug 1794064 landed a blocklist update for aswJsFlt.dll versions 18.0.1473.0 and earlier. Adding needinfo for myself to check how this affected the crash rate for this bug.

Flags: needinfo?(haftandilian)

FYI, we're still seeing some crashes from the 1478 build.

Redirect a needinfo that is pending on an inactive user to the triage owner.
:haik, since the bug has high severity and recent activity, could you please find another way to get the information or close the bug as INCOMPLETE if it is not actionable?

For more information, please visit auto_nag documentation.

Flags: needinfo?(rypacek) → needinfo?(haftandilian)

@Yannis, is this something you are working with Avast on?

Flags: needinfo?(yjuglaret)
Flags: needinfo?(haftandilian)

(In reply to Haik Aftandilian [:haik] from comment #6)

@Yannis, is this something you are working with Avast on?

No, we were focused on understanding and reproducing bug 1794064. I can take a lot at this one too.

Flags: needinfo?(yjuglaret)

I don't think Avast has any responsibility in this bug. I am confident that the underlying bug lives in our own code. It could be that some of the operations they do add delays that could turn out to make an existing race condition bug more likely to happen for Avast users compared to normal users, for example if Avast decides to scan a JS script before it gets loaded, or a downloaded file before it's written on disk.

We also haven't gotten any crash reports from 106, which might correspond to the big libwebrtc update we took in bug 1766646. Might still be good to understand what fixed it in case there's something we could reasonably backport to ESR still.

Transferring to a media category for further debugging.

Component: Other → Audio/Video: Recording
Product: External Software Affecting Firefox → Core

Added a brand new signature with inlined functions.

Crash Signature: [@ MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread] → [@ MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread] [@ RefPtr<T>::get | RefPtr<T>::operator-> | MessageLoop::PostTask_Helper | MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread]

I cracked open a minidump out of curiosity and found out that sVideoCaptureThread here is NULL and that's what seems to be causing the crash.

I believe it's not sVideoCaptureThread but rather sVideoCaptureThread->message_loop() which is NULL. Disassembling a few instructions before the crash yields:

rcx is MessageLoop* this

xul!MessageLoop::PostTask [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc @ 410]:
00007ff8`083ddea0 4156            push    r14
00007ff8`083ddea2 56              push    rsi
00007ff8`083ddea3 57              push    rdi
00007ff8`083ddea4 53              push    rbx
00007ff8`083ddea5 4883ec38        sub     rsp,38h
00007ff8`083ddea9 4889ce          mov     rsi,rcx
00007ff8`083ddeac 488b057d817b04  mov     rax,qword ptr [xul!__security_cookie (00007ff8`0cb96030)]
00007ff8`083ddeb3 4831e0          xor     rax,rsp
00007ff8`083ddeb6 4889442430      mov     qword ptr [rsp+30h],rax
00007ff8`083ddebb 4c8b32          mov     r14,qword ptr [rdx]
00007ff8`083ddebe 488b8980000000  mov     rcx,qword ptr [rcx+80h] <---- crashing instruction

@karlt, drawing attention to this bug. It's causing around 50-100 crashes per day and there is a hint in the previous comments regarding the root cause.

Flags: needinfo?(karlt)

sVideoCaptureThread is reset, under (presumably) the same mutex as needed to read it in DispatchToVideoCaptureThread(), with a reference to a CamerasParent before the thread is Stop()ed, after which message_loop() might return null.

DispatchToVideoCaptureThread() null checks sVideoCaptureThread with the mutex held.

If mutex were different, then that would mean that all CamerasParents have been destroyed and then another created.

message_loop() can also return null before Thread::ThreadMain() initializes, but StartWithOptions() waits for this with the mutex held.

Andreas has been rewriting some code around here, so drawing his attention.

Component: Audio/Video: Recording → WebRTC: Audio/Video
Flags: needinfo?(karlt) → needinfo?(apehrson)

I see another path through which message_loop() could start returning null: nsThreadManager::ShutdownNonMainThreads. But that doesn't seem to be the case here, given what main thread is doing (per the report in comment 0).

There's also the chance the video capture thread quit by itself. It is a windows UI thread, which should have created a window too. Excuse my lack of win-fu here, but is it possible something other than us, say Avast, sent this thread WM_QUIT?

Main thread seems to be processing a WM_DESTROY but I couldn't see that message being handled by the video capture thread.

Depends on: 1800215
Flags: needinfo?(apehrson)

If this is right that it quit through an unexpected WM_QUIT, then if it was an nsThread instead (what bug 1800215 will change), then when this happens shutdown tasks would run, followed by Dispatch() starting to fail.

Would we need to support nsThread's shutdown task as a possible path to CamerasParent shutdown? That would probably wreak havoc in CamerasChild and the things in the content process using it. Maybe it could be fixed with a refactor of CamerasParent/VideoEngine. It could be that only device enumeration (and device change events) needs this to be a ui thread and have a window, and that could go away and be recreated on request.

But first it would be good to verify what really happens.

Since the crash volume is low (less than 15 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit auto_nag documentation.

Severity: S2 → S3

Copying crash signatures from duplicate bugs.

Crash Signature: [@ MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread] [@ RefPtr<T>::get | RefPtr<T>::operator-> | MessageLoop::PostTask_Helper | MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread] → [@ MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread] [@ RefPtr<T>::get | RefPtr<T>::operator-> | MessageLoop::PostTask_Helper | MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread] [@ M…
Duplicate of this bug: 1806918

Got some more information over in bug 1806918:

https://bugzilla.mozilla.org/show_bug.cgi?id=1806918#c8

Does this sound plausible?

Crash Signature: [@ MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread] [@ RefPtr<T>::get | RefPtr<T>::operator-> | MessageLoop::PostTask_Helper | MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread] [@ → [@ MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread] [@ RefPtr<T>::get | RefPtr<T>::operator-> | MessageLoop::PostTask_Helper | MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread] [@
Flags: needinfo?(jib)

It is plausible, because the video capture thread on Windows is a NonMainUIThread, which, as a UI thread, will process windows messages and therefore handle WM_QUIT messages by quitting.

Basically what I was getting at with comment 16 and comment 17. And I don't see any other paths leading to the thread unexpectedly quitting.

Since this is a non-main thread, would it be legit to ignore WM_QUIT here?

For reference, chromium in its equivalent message pump considers a WM_QUIT unexpected and ignores it.

Assignee: nobody → apehrson
Status: NEW → ASSIGNED

I would also advocate checking that message_loop is non-null in addition to the null-check on sVideoCaptureThread, here:

https://searchfox.org/mozilla-central/rev/a156a65ced2dae5913ae35a68e9445b8ee7ca457/dom/media/systemservices/CamerasParent.cpp#213

(In reply to Byron Campen [:bwc] from comment #25)

I would also advocate checking that message_loop is non-null in addition to the null-check on sVideoCaptureThread, here:

https://searchfox.org/mozilla-central/rev/a156a65ced2dae5913ae35a68e9445b8ee7ca457/dom/media/systemservices/CamerasParent.cpp#213

That would be racy :(

Hi, nice investigation!

I am able to reproduce the crash consistently by doing what you suggest: sending a WM_QUIT message to the window. So in the end it seems likely that my comment 8 was too affirmative and that Avast and other software are doing something like this. In any case we should be able to use the reproducer below as a basis to try to fix the problem.

Here are the steps to reproduce the crash, should work in Nightly and Release on Windows X64:

PS> Get-FileHash .\VideoCrasher.exe

Algorithm       Hash
---------       ----
SHA256          689E03689FBC00DEBE6A3C9C4198B88B9C808EFA40179265A1E6B0820BE4DDFD
  • Open this page in a new tab.
  • Click on "Start camera", click "Allow".
  • Run VideoCrasher.exe. The output should look like this:
Found VideoCapture thread window [pid = 2128, tid = 19268], sending WM_QUIT...
Result: 1
  • Close the tab.

Thanks Yannis for the repro tooling! Since you're all setup, could you give this tentative fix a spin to see whether it seems sufficient?

Flags: needinfo?(yjuglaret)

(In reply to Andreas Pehrson [:pehrsons] from comment #26)

(In reply to Byron Campen [:bwc] from comment #25)

I would also advocate checking that message_loop is non-null in addition to the null-check on sVideoCaptureThread, here:

https://searchfox.org/mozilla-central/rev/a156a65ced2dae5913ae35a68e9445b8ee7ca457/dom/media/systemservices/CamerasParent.cpp#213

That would be racy :(

If null-checking that is racy, then calling PostTask on it a few lines later is pretty scary...

(In reply to Andreas Pehrson [:pehrsons] from comment #28)

Thanks Yannis for the repro tooling! Since you're all setup, could you give this tentative fix a spin to see whether it seems sufficient?

I confirm that this patch does prevent the crash -- at least when triggered with my reproducer in the way described above.

I was a bit worried because it seems that the reproducer was now finding two windows belonging to the VideoCapture thread:

Found VideoCapture thread window [hwnd = 00000000003105F4, pid = 12632, tid = 18088], sending WM_QUIT...
Result: 1
Found VideoCapture thread window [hwnd = 00000000002F06F8, pid = 12632, tid = 18088], sending WM_QUIT...
Result: 1

But actually that would already be the case before the patch if I didn't send the WM_QUIT message, so it seems normal (probably the second window was dead by the time EnumWindows would reach it when sending WM_QUIT):

Found VideoCapture thread window [hwnd = 00000000000B03E0, pid = 28796, tid = 14512].
Found VideoCapture thread window [hwnd = 00000000000B03C6, pid = 28796, tid = 14512].
Flags: needinfo?(yjuglaret)

As MessagePumpForUI is an old fork from Chromium, this patch effectively applies
the Chromium patch
https://source.chromium.org/chromium/chromium/src/+/b26918c5ab0da3130b92339f44ebdc4bfee1b351

For those threads where we use MessagePumpForUI we already have a way of
stopping the thread and thereby quitting the message loop. At the same time some
applications may send us unexpected WM_QUIT messages on those threads. Generally
the code using the thread is not able to handle this gracefully and we end up
crashing.

In particular there is a correlation between these crashes and certain antivirus
applications.

Andreas, while blaming the code in Chromium to see what caused them to add this (seems to be third-party software sending that message, bringing us back to the original Avast correlation!), I found this follow-up fix that Chromium did: https://source.chromium.org/chromium/chromium/src/+/6b9a2faf89e9b67fb6cb8d0986c431923f757b98

We probably want that one as well?

Flags: needinfo?(apehrson)

Good catch, thanks. I'll add a patch for that fix too, and go through the log a bit more to see if something else stands out.

Flags: needinfo?(apehrson)

Ok, that fix seems to be it as far as WM_QUIT is concerned.

I confirm that the two patches together also prevent the crash -- same results as in comment 30.

Pushed by pehrsons@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/8b71825335f3
In MessagePumpForUI for Windows, effectively apply upstream's commit b26918c. r=mossop,ipc-reviewers,gfx-reviewers,nika,nical
https://hg.mozilla.org/integration/autoland/rev/1c6f228f529e
In MessagePumpForUI for Windows, effectively apply upstream's commit 6b9a2fa. r=ipc-reviewers,nika
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 111 Branch
Flags: needinfo?(jib)

The patch landed in nightly and beta is affected.
:pehrsons, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox110 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(apehrson)

Comment on attachment 9311563 [details]
Bug 1790160 - In MessagePumpForUI for Windows, effectively apply upstream's commit b26918c. r?#ipc-reviewers!, r?#gfx-reviewers!, r?mossop!

Beta/Release Uplift Approval Request

  • User impact if declined: External software (we have proof of Avast antivirus) may trigger a parent process crash on Windows under some conditions (related to getUserMedia or enumerateDevices with video).
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: See comment 27.
    We'll also know from crash-stats but only after uplifting.
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This is a cherry-pick of upstream commits that have been retrofitted on our since-long forked version of that code. It has baked upstream for several years. We also don't seem to have any code relying on the particular behavior these patches are changing.
  • String changes made/needed:
  • Is Android affected?: No

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: While we have only seen evidence of a parent process nullptr crash, there is in theory a small risk of a race where the MessageLoop gets destroyed while we are dispatching to it.
  • User impact if declined: External software (we have proof of Avast antivirus) may trigger a parent process crash on Windows under some conditions (related to getUserMedia or enumerateDevices with video).
  • Fix Landed on Version: 111
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This is a cherry-pick of upstream commits that have been retrofitted on our since-long forked version of that code. It has baked upstream for several years. We also don't seem to have any code relying on the particular behavior these patches are changing.
Flags: needinfo?(apehrson)
Attachment #9311563 - Flags: approval-mozilla-esr102?
Attachment #9311563 - Flags: approval-mozilla-beta?
Attachment #9311715 - Flags: approval-mozilla-esr102?
Attachment #9311715 - Flags: approval-mozilla-beta?

Comment on attachment 9311715 [details]
Bug 1790160 - In MessagePumpForUI for Windows, effectively apply upstream's commit 6b9a2fa. r?#ipc-reviewers!

Approved for 110 beta 3, we don't have enough users on nightly to make sure the patch stops the crashes but nightly + aurora + beta should give us this information.

Attachment #9311715 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #9311563 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: qe-verify+
QA Whiteboard: [qa-triaged]
Attached image image.png

Hello! Reproduced the issue with Firefox 110.0a1 (20230110214526) on Windows 10x64 and 11x64 while using the STR from comment 27. After closing the tab Firefox crashes with this crash ID.
The issue is verified fixed with Firefox 110.0b3 and 111.0a1 (20230119163652) on Windows 10x64 and Windows 11 x64. The crash is no longer reproducing after following STR from comment 27.

One note is that after running the VideoCrasher.exe on beta and the latest nightly while the camera is started I can see two Found VideoCapture thread window [pid = 29012, tid = 28656], sending WM_QUIT...Result: 1 messages instead of one like it was before. Is that ok or should we file a new issue for that? Thank you!

Flags: needinfo?(apehrson)
QA Whiteboard: [qa-triaged]
Flags: qe-verify+

(In reply to Alexandru Trif, QA [:atrif] from comment #43)

One note is that after running the VideoCrasher.exe on beta and the latest nightly while the camera is started I can see two Found VideoCapture thread window [pid = 29012, tid = 28656], sending WM_QUIT...Result: 1 messages instead of one like it was before. Is that ok or should we file a new issue for that? Thank you!

That is fine thanks. Yannis came across this too and analyzed it in comment 30.

Flags: needinfo?(apehrson)

Comment on attachment 9311563 [details]
Bug 1790160 - In MessagePumpForUI for Windows, effectively apply upstream's commit b26918c. r?#ipc-reviewers!, r?#gfx-reviewers!, r?mossop!

See comment 40.

Attachment #9311563 - Flags: approval-mozilla-release?

Comment on attachment 9311563 [details]
Bug 1790160 - In MessagePumpForUI for Windows, effectively apply upstream's commit b26918c. r?#ipc-reviewers!, r?#gfx-reviewers!, r?mossop!

Interestingly, the crash volume in 109 seems a lot lower than it was in 108. I don't think we need to worry about taking this in the 109.0.1 dot release, but let me know if you feel strongly otherwise.

Attachment #9311563 - Flags: approval-mozilla-release? → approval-mozilla-release-

Comment on attachment 9311563 [details]
Bug 1790160 - In MessagePumpForUI for Windows, effectively apply upstream's commit b26918c. r?#ipc-reviewers!, r?#gfx-reviewers!, r?mossop!

Approved for 102.8esr, however!

Attachment #9311563 - Flags: approval-mozilla-esr102? → approval-mozilla-esr102+
Attachment #9311715 - Flags: approval-mozilla-esr102? → approval-mozilla-esr102+
Flags: qe-verify+

Verified fixed with Firefox 102.8.0esr (20230206170028) on Windows 10x64 and Windows 11x64. The crash is no longer reproducing after following STR from comment 27.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: