Crash with Avast in [@ MessageLoop::PostTask | mozilla::camera::CamerasParent::DispatchToVideoCaptureThread]
Categories
(Core :: WebRTC: Audio/Video, defect)
Tracking
()
People
(Reporter: aryx, Assigned: pehrsons)
References
Details
(Keywords: crash)
Crash Data
Attachments
(3 files)
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
RyanVM
:
approval-mozilla-release-
RyanVM
:
approval-mozilla-esr102+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
RyanVM
:
approval-mozilla-esr102+
|
Details | Review |
11.02 KB,
image/png
|
Details |
~350 crash reports stored with recent Firefox versions (104 - 102) in the last week, all on Windows. Multiple with 10 for the crash count which users experience. Crashes got frequent on 2022-09-01.
They very often share aswJsFlt64.dll.t01
as loaded module, owned by AVG Technologies USA, LLC
. / Avast Software s.r.o.
. Observed versions:
- 18.0.1443.0
- 18.0.1450.0
- 18.0.1451.0
Thomas, did new versions of your products roll out at that time and are aware of issues with the interaction between them and Firefox? Thank you in advance.
Crash report: https://crash-stats.mozilla.org/report/index/162b5420-65b3-46b8-995e-444840220909
Reason: EXCEPTION_ACCESS_VIOLATION_READ
Top 10 frames of crashing thread:
0 xul.dll MessageLoop::PostTask ipc/chromium/src/base/message_loop.cc:411
1 xul.dll mozilla::camera::CamerasParent::DispatchToVideoCaptureThread dom/media/systemservices/CamerasParent.cpp:205
2 xul.dll mozilla::camera::CamerasParent::StopVideoCapture dom/media/systemservices/CamerasParent.cpp:215
3 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:579
4 xul.dll mozilla::ipc::IProtocol::DestroySubtree ipc/glue/ProtocolUtils.cpp:567
5 xul.dll mozilla::ipc::PBackgroundParent::OnChannelError ipc/ipdl/PBackgroundParent.cpp:6533
6 xul.dll mozilla::ipc::MessageChannel::OnNotifyMaybeChannelError ipc/glue/MessageChannel.cpp:2027
7 xul.dll mozilla::detail::RunnableMethodImpl<mozilla::UntrustedModulesProcessor*, void xpcom/threads/nsThreadUtils.h:1200
8 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1199
9 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:300
Comment 1•2 years ago
|
||
Redirect a needinfo that is pending on an inactive user to the triage owner.
:haik, since the bug has high severity and recent activity, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 2•2 years ago
|
||
@Lukas, we're seeing an increased number of Firefox crashes with the Avast DLL loaded starting last month in September. While we don't have a clear root cause, do you know if this aligns with any changes in Avast that may be triggering these crashes?
Comment 3•2 years ago
|
||
Bug 1794064 landed a blocklist update for aswJsFlt.dll
versions 18.0.1473.0 and earlier. Adding needinfo for myself to check how this affected the crash rate for this bug.
Comment 4•2 years ago
|
||
FYI, we're still seeing some crashes from the 1478 build.
Comment 5•2 years ago
|
||
Redirect a needinfo that is pending on an inactive user to the triage owner.
:haik, since the bug has high severity and recent activity, could you please find another way to get the information or close the bug as INCOMPLETE
if it is not actionable?
For more information, please visit auto_nag documentation.
Comment 6•2 years ago
|
||
@Yannis, is this something you are working with Avast on?
Comment 7•2 years ago
|
||
(In reply to Haik Aftandilian [:haik] from comment #6)
@Yannis, is this something you are working with Avast on?
No, we were focused on understanding and reproducing bug 1794064. I can take a lot at this one too.
Comment 8•2 years ago
•
|
||
I don't think Avast has any responsibility in this bug. I am confident that the underlying bug lives in our own code. It could be that some of the operations they do add delays that could turn out to make an existing race condition bug more likely to happen for Avast users compared to normal users, for example if Avast decides to scan a JS script before it gets loaded, or a downloaded file before it's written on disk.
Comment 9•2 years ago
•
|
||
We also haven't gotten any crash reports from 106, which might correspond to the big libwebrtc update we took in bug 1766646. Might still be good to understand what fixed it in case there's something we could reasonably backport to ESR still.
Comment 10•2 years ago
|
||
Transferring to a media category for further debugging.
Comment 11•2 years ago
|
||
Added a brand new signature with inlined functions.
Comment 12•2 years ago
|
||
I cracked open a minidump out of curiosity and found out that sVideoCaptureThread
here is NULL and that's what seems to be causing the crash.
Comment 13•2 years ago
|
||
I believe it's not sVideoCaptureThread
but rather sVideoCaptureThread->message_loop()
which is NULL. Disassembling a few instructions before the crash yields:
rcx is MessageLoop* this
xul!MessageLoop::PostTask [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc @ 410]:
00007ff8`083ddea0 4156 push r14
00007ff8`083ddea2 56 push rsi
00007ff8`083ddea3 57 push rdi
00007ff8`083ddea4 53 push rbx
00007ff8`083ddea5 4883ec38 sub rsp,38h
00007ff8`083ddea9 4889ce mov rsi,rcx
00007ff8`083ddeac 488b057d817b04 mov rax,qword ptr [xul!__security_cookie (00007ff8`0cb96030)]
00007ff8`083ddeb3 4831e0 xor rax,rsp
00007ff8`083ddeb6 4889442430 mov qword ptr [rsp+30h],rax
00007ff8`083ddebb 4c8b32 mov r14,qword ptr [rdx]
00007ff8`083ddebe 488b8980000000 mov rcx,qword ptr [rcx+80h] <---- crashing instruction
Comment 14•2 years ago
|
||
@karlt, drawing attention to this bug. It's causing around 50-100 crashes per day and there is a hint in the previous comments regarding the root cause.
Comment 15•2 years ago
|
||
sVideoCaptureThread
is reset, under (presumably) the same mutex as needed to read it in DispatchToVideoCaptureThread()
, with a reference to a CamerasParent
before the thread is Stop()
ed, after which message_loop()
might return null.
DispatchToVideoCaptureThread()
null checks sVideoCaptureThread
with the mutex held.
If mutex were different, then that would mean that all CamerasParent
s have been destroyed and then another created.
message_loop() can also return null before Thread::ThreadMain()
initializes, but StartWithOptions()
waits for this with the mutex held.
Andreas has been rewriting some code around here, so drawing his attention.
Assignee | ||
Comment 16•2 years ago
|
||
I see another path through which message_loop() could start returning null: nsThreadManager::ShutdownNonMainThreads
. But that doesn't seem to be the case here, given what main thread is doing (per the report in comment 0).
There's also the chance the video capture thread quit by itself. It is a windows UI thread, which should have created a window too. Excuse my lack of win-fu here, but is it possible something other than us, say Avast, sent this thread WM_QUIT?
Main thread seems to be processing a WM_DESTROY but I couldn't see that message being handled by the video capture thread.
Assignee | ||
Comment 17•2 years ago
|
||
If this is right that it quit through an unexpected WM_QUIT, then if it was an nsThread instead (what bug 1800215 will change), then when this happens shutdown tasks would run, followed by Dispatch() starting to fail.
Would we need to support nsThread's shutdown task as a possible path to CamerasParent shutdown? That would probably wreak havoc in CamerasChild and the things in the content process using it. Maybe it could be fixed with a refactor of CamerasParent/VideoEngine. It could be that only device enumeration (and device change events) needs this to be a ui thread and have a window, and that could go away and be recreated on request.
But first it would be good to verify what really happens.
Comment 18•2 years ago
|
||
Since the crash volume is low (less than 15 per week), the severity is downgraded to S3
. Feel free to change it back if you think the bug is still critical.
For more information, please visit auto_nag documentation.
Comment 20•2 years ago
|
||
Copying crash signatures from duplicate bugs.
Comment 22•2 years ago
|
||
Got some more information over in bug 1806918:
https://bugzilla.mozilla.org/show_bug.cgi?id=1806918#c8
Does this sound plausible?
Assignee | ||
Comment 23•2 years ago
|
||
It is plausible, because the video capture thread on Windows is a NonMainUIThread, which, as a UI thread, will process windows messages and therefore handle WM_QUIT
messages by quitting.
Basically what I was getting at with comment 16 and comment 17. And I don't see any other paths leading to the thread unexpectedly quitting.
Since this is a non-main thread, would it be legit to ignore WM_QUIT here?
Assignee | ||
Comment 24•2 years ago
|
||
For reference, chromium in its equivalent message pump considers a WM_QUIT unexpected and ignores it.
Assignee | ||
Updated•2 years ago
|
Comment 25•2 years ago
|
||
I would also advocate checking that message_loop is non-null in addition to the null-check on sVideoCaptureThread, here:
Assignee | ||
Comment 26•2 years ago
|
||
(In reply to Byron Campen [:bwc] from comment #25)
I would also advocate checking that message_loop is non-null in addition to the null-check on sVideoCaptureThread, here:
That would be racy :(
Comment 27•2 years ago
•
|
||
Hi, nice investigation!
I am able to reproduce the crash consistently by doing what you suggest: sending a WM_QUIT message to the window. So in the end it seems likely that my comment 8 was too affirmative and that Avast and other software are doing something like this. In any case we should be able to use the reproducer below as a basis to try to fix the problem.
Here are the steps to reproduce the crash, should work in Nightly and Release on Windows X64:
- Make sure you have the X64 Microsoft Visual C++ Redistributable installed.
- Download and check the hash for VideoCrasher.exe (source code available here):
PS> Get-FileHash .\VideoCrasher.exe
Algorithm Hash
--------- ----
SHA256 689E03689FBC00DEBE6A3C9C4198B88B9C808EFA40179265A1E6B0820BE4DDFD
- Open this page in a new tab.
- Click on "Start camera", click "Allow".
- Run
VideoCrasher.exe
. The output should look like this:
Found VideoCapture thread window [pid = 2128, tid = 19268], sending WM_QUIT...
Result: 1
- Close the tab.
Assignee | ||
Comment 28•2 years ago
|
||
Thanks Yannis for the repro tooling! Since you're all setup, could you give this tentative fix a spin to see whether it seems sufficient?
Comment 29•2 years ago
|
||
(In reply to Andreas Pehrson [:pehrsons] from comment #26)
(In reply to Byron Campen [:bwc] from comment #25)
I would also advocate checking that message_loop is non-null in addition to the null-check on sVideoCaptureThread, here:
That would be racy :(
If null-checking that is racy, then calling PostTask on it a few lines later is pretty scary...
Comment 30•2 years ago
•
|
||
(In reply to Andreas Pehrson [:pehrsons] from comment #28)
Thanks Yannis for the repro tooling! Since you're all setup, could you give this tentative fix a spin to see whether it seems sufficient?
I confirm that this patch does prevent the crash -- at least when triggered with my reproducer in the way described above.
I was a bit worried because it seems that the reproducer was now finding two windows belonging to the VideoCapture thread:
Found VideoCapture thread window [hwnd = 00000000003105F4, pid = 12632, tid = 18088], sending WM_QUIT...
Result: 1
Found VideoCapture thread window [hwnd = 00000000002F06F8, pid = 12632, tid = 18088], sending WM_QUIT...
Result: 1
But actually that would already be the case before the patch if I didn't send the WM_QUIT
message, so it seems normal (probably the second window was dead by the time EnumWindows
would reach it when sending WM_QUIT
):
Found VideoCapture thread window [hwnd = 00000000000B03E0, pid = 28796, tid = 14512].
Found VideoCapture thread window [hwnd = 00000000000B03C6, pid = 28796, tid = 14512].
Assignee | ||
Comment 31•2 years ago
|
||
As MessagePumpForUI is an old fork from Chromium, this patch effectively applies
the Chromium patch
https://source.chromium.org/chromium/chromium/src/+/b26918c5ab0da3130b92339f44ebdc4bfee1b351
For those threads where we use MessagePumpForUI we already have a way of
stopping the thread and thereby quitting the message loop. At the same time some
applications may send us unexpected WM_QUIT messages on those threads. Generally
the code using the thread is not able to handle this gracefully and we end up
crashing.
In particular there is a correlation between these crashes and certain antivirus
applications.
Comment 32•2 years ago
•
|
||
Andreas, while blaming the code in Chromium to see what caused them to add this (seems to be third-party software sending that message, bringing us back to the original Avast correlation!), I found this follow-up fix that Chromium did: https://source.chromium.org/chromium/chromium/src/+/6b9a2faf89e9b67fb6cb8d0986c431923f757b98
We probably want that one as well?
Assignee | ||
Comment 33•2 years ago
|
||
Good catch, thanks. I'll add a patch for that fix too, and go through the log a bit more to see if something else stands out.
Assignee | ||
Comment 34•2 years ago
|
||
Ok, that fix seems to be it as far as WM_QUIT is concerned.
Assignee | ||
Comment 35•2 years ago
|
||
As a follow-up to the prior patch, this patch effectively applies upstream's
https://source.chromium.org/chromium/chromium/src/+/6b9a2faf89e9b67fb6cb8d0986c431923f757b98
Comment 36•2 years ago
|
||
I confirm that the two patches together also prevent the crash -- same results as in comment 30.
Comment 37•2 years ago
|
||
Comment 38•2 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/8b71825335f3
https://hg.mozilla.org/mozilla-central/rev/1c6f228f529e
Updated•2 years ago
|
Updated•2 years ago
|
Comment 39•2 years ago
|
||
The patch landed in nightly and beta is affected.
:pehrsons, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set
status-firefox110
towontfix
.
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 40•2 years ago
|
||
Comment on attachment 9311563 [details]
Bug 1790160 - In MessagePumpForUI for Windows, effectively apply upstream's commit b26918c. r?#ipc-reviewers!, r?#gfx-reviewers!, r?mossop!
Beta/Release Uplift Approval Request
- User impact if declined: External software (we have proof of Avast antivirus) may trigger a parent process crash on Windows under some conditions (related to getUserMedia or enumerateDevices with video).
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: Yes
- If yes, steps to reproduce: See comment 27.
We'll also know from crash-stats but only after uplifting. - List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This is a cherry-pick of upstream commits that have been retrofitted on our since-long forked version of that code. It has baked upstream for several years. We also don't seem to have any code relying on the particular behavior these patches are changing.
- String changes made/needed:
- Is Android affected?: No
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: While we have only seen evidence of a parent process nullptr crash, there is in theory a small risk of a race where the MessageLoop gets destroyed while we are dispatching to it.
- User impact if declined: External software (we have proof of Avast antivirus) may trigger a parent process crash on Windows under some conditions (related to getUserMedia or enumerateDevices with video).
- Fix Landed on Version: 111
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This is a cherry-pick of upstream commits that have been retrofitted on our since-long forked version of that code. It has baked upstream for several years. We also don't seem to have any code relying on the particular behavior these patches are changing.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Comment 41•2 years ago
|
||
Comment on attachment 9311715 [details]
Bug 1790160 - In MessagePumpForUI for Windows, effectively apply upstream's commit 6b9a2fa. r?#ipc-reviewers!
Approved for 110 beta 3, we don't have enough users on nightly to make sure the patch stops the crashes but nightly + aurora + beta should give us this information.
Updated•2 years ago
|
Comment 42•2 years ago
|
||
bugherder uplift |
Updated•2 years ago
|
Updated•2 years ago
|
Comment 43•2 years ago
•
|
||
Hello! Reproduced the issue with Firefox 110.0a1 (20230110214526) on Windows 10x64 and 11x64 while using the STR from comment 27. After closing the tab Firefox crashes with this crash ID.
The issue is verified fixed with Firefox 110.0b3 and 111.0a1 (20230119163652) on Windows 10x64 and Windows 11 x64. The crash is no longer reproducing after following STR from comment 27.
One note is that after running the VideoCrasher.exe
on beta and the latest nightly while the camera is started I can see two Found VideoCapture thread window [pid = 29012, tid = 28656], sending WM_QUIT...Result: 1
messages instead of one like it was before. Is that ok or should we file a new issue for that? Thank you!
Updated•2 years ago
|
Assignee | ||
Comment 44•2 years ago
|
||
(In reply to Alexandru Trif, QA [:atrif] from comment #43)
One note is that after running the
VideoCrasher.exe
on beta and the latest nightly while the camera is started I can see twoFound VideoCapture thread window [pid = 29012, tid = 28656], sending WM_QUIT...Result: 1
messages instead of one like it was before. Is that ok or should we file a new issue for that? Thank you!
That is fine thanks. Yannis came across this too and analyzed it in comment 30.
Comment 45•2 years ago
|
||
Comment on attachment 9311563 [details]
Bug 1790160 - In MessagePumpForUI for Windows, effectively apply upstream's commit b26918c. r?#ipc-reviewers!, r?#gfx-reviewers!, r?mossop!
See comment 40.
Comment 46•2 years ago
|
||
Comment on attachment 9311563 [details]
Bug 1790160 - In MessagePumpForUI for Windows, effectively apply upstream's commit b26918c. r?#ipc-reviewers!, r?#gfx-reviewers!, r?mossop!
Interestingly, the crash volume in 109 seems a lot lower than it was in 108. I don't think we need to worry about taking this in the 109.0.1 dot release, but let me know if you feel strongly otherwise.
Updated•2 years ago
|
Comment 47•2 years ago
|
||
Comment on attachment 9311563 [details]
Bug 1790160 - In MessagePumpForUI for Windows, effectively apply upstream's commit b26918c. r?#ipc-reviewers!, r?#gfx-reviewers!, r?mossop!
Approved for 102.8esr, however!
Updated•2 years ago
|
Comment 48•2 years ago
|
||
bugherder uplift |
Updated•2 years ago
|
Comment 49•2 years ago
|
||
Verified fixed with Firefox 102.8.0esr (20230206170028) on Windows 10x64 and Windows 11x64. The crash is no longer reproducing after following STR from comment 27.
Description
•