Closed Bug 1334097 Opened 3 years ago Closed 2 years ago

Crash in OOM | unknown | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | mozilla::ipc::MessageChannel::MaybeUndeferIncall

Categories

(Core :: IPC, defect, critical)

52 Branch
All
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox51 --- unaffected
firefox52 --- wontfix
firefox-esr52 54+ fixed
firefox53 --- wontfix
firefox54 + fixed
firefox55 --- fixed

People

(Reporter: philipp, Assigned: billm)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-2f8dd5ce-2d43-4ee4-8f6c-dd1c92170126.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	mozglue.dll 	mozalloc_abort(char const* const) 	memory/mozalloc/mozalloc_abort.cpp:33
1 	mozglue.dll 	mozalloc_handle_oom(unsigned int) 	memory/mozalloc/mozalloc_oom.cpp:46
2 	mozglue.dll 	moz_xmalloc 	memory/mozalloc/mozalloc.cpp:85
3 	xul.dll 	mozilla::ipc::MessageChannel::MaybeUndeferIncall() 	ipc/glue/MessageChannel.cpp:1854
4 	xul.dll 	`anonymous namespace'::UnhookNeuteredWindows 	ipc/glue/WindowsMessageLoop.cpp:645
5 	xul.dll 	mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*) 	ipc/glue/MessageChannel.cpp:1324
6 	xul.dll 	mozilla::plugins::PPluginScriptableObjectChild::CallNPN_Evaluate(nsCString const&, mozilla::plugins::Variant*, bool*) 	obj-firefox/ipc/ipdl/PPluginScriptableObjectChild.cpp:104
7 	xul.dll 	mozilla::plugins::PluginScriptableObjectChild::Evaluate(_NPString*, _NPVariant*) 	dom/plugins/ipc/PluginScriptableObjectChild.cpp:1189
8 	npswf32_24_0_0_194.dll 	F_2019925551_______________________________________________________________________________ 	F_1128892918________________________________________________________________________:48888


plugin crashes with this signature are regressing since firefox 52 builds in all versions of windows and in code that apparently got added with bug 1312960.

Correlations for Firefox Aurora
(100.0% in signature vs 14.05% overall) Module "plugin-container.exe" = true
(100.0% in signature vs 22.11% overall) Module "comdlg32.dll" = true
(100.0% in signature vs 26.38% overall) startup_crash = null
(95.00% in signature vs 19.43% overall) Module "winspool.drv" = true [100.0% vs 22.07% if platform_pretty_version = Windows 10]
(100.0% in signature vs 16.47% overall) Module "dinput8.dll" = true [100.0% vs 22.28% if platform_pretty_version = Windows 10]
(100.0% in signature vs 16.61% overall) Module "dsound.dll" = true [100.0% vs 22.41% if platform_pretty_version = Windows 10]
(100.0% in signature vs 22.81% overall) Module "icm32.dll" = true [100.0% vs 31.83% if platform_pretty_version = Windows 10]
(100.0% in signature vs 30.89% overall) Module "urlmon.dll" = true [100.0% vs 42.64% if platform_pretty_version = Windows 7]
(30.00% in signature vs 86.75% overall) plugin_version = null
(70.00% in signature vs 13.25% overall) plugin = true
(70.00% in signature vs 13.25% overall) process_type = plugin
(70.00% in signature vs 11.62% overall) Module "NPSWF32_24_0_0_194.dll" = true [83.33% vs 11.10% if platform_version = 6.1.7601 Service Pack 1]
(45.00% in signature vs 90.97% overall) e10s_enabled = 1
(50.00% in signature vs 10.72% overall) plugin_version = 24.0.0.194
There are not a ton of these crashes (I only see 54). About half are in the plugin process. I looked at a few that weren't, and they all still had Flash on the stack. None of them have an OOM allocation size. I assume that's expected for the plugin process, but I'm not sure why that is the case for the others. It might be because the process annotation isn't being applied properly. I see this crash which does not have process type plugin, but the stack does not look like a main process stack to me: bp-fe8a5291-5352-4bf0-8710-29b072170120
Bill, any thoughts on what might be going on here?
Flags: needinfo?(wmccloskey)
It's possible to tell the allocation size by loading the minidump, but you're right that currently OOM size annotations don't work in plugin processes because we try to send that data through the IPC layer instead of the out-of-band annotation layer.
Well, we're allocating a small, fixed-sized object. There's not a lot we can do about that. My patch switched us from using a vector to a linked list. A linked list should actually reduce the chance of OOM since it doesn't need contiguous memory.
Flags: needinfo?(wmccloskey)
[Tracking Requested - why for this release]: #23 top crash on 52.0b in last 7 days.
The crash signature appeared from 52.0a2 and seen on 53 and 54 as well. It seems not a general OOM case.
(In reply to Astley Chen [:astley] (UTC+8) from comment #5)
> It seems not a general OOM case.

Why do you say that?
We used to see crashes on the line after it (mPending.insertBack(task);), so it might be possible this is some kind of signature change. I don't think anybody checked the process types on those crashes. I'd guess that maybe this could happen if the main process is hanging really badly, and the plugin process keeps trying to send messages, so the message queue backs up.
Flags: needinfo?(kchen)
The crash rate seems mitigated after new flash version 24.0.0.221 was out on 2/14, but that might not be the case though. Besides, this crash is out of Top50 crasher on 52.0b. Keep monitoring.

[1] https://helpx.adobe.com/flash-player/release-note/fp_24_air_24_release_notes.html
the number of submitted crash reports only has gone down after the infobar submission mechanism has been switched off in 52.0b6 (before that 85% of reports came through the infobar) - so the real world impact of this will stay the same
(In reply to [:philipp] from comment #9)
> the number of submitted crash reports only has gone down after the infobar
> submission mechanism has been switched off in 52.0b6 (before that 85% of
> reports came through the infobar) - so the real world impact of this will
> stay the same

Thanks for the heads-up. Where could I get this info(infobar is on or off in specific build) ?
This does seem like a problem. The more I think about it, OOMs in the plugin process don't make a lot of sense. The plugin is almost always Flash, and it runs all its code in a separate process. Our plugin process is just a broker. So it should be very small. It seems like we must be leaking messages somewhere. I'll look into this next week.
I think I know what's happening here. We've always had a problem where we busy wait during Call invocations. If there's a deferred messages, we undefer it [1]. Then we don't block on the condition var because mPending is non-empty [2]. But then when we dispatch the message, we defer it again because nothing has really changed [3].

I think my patch made this worse. Before, we would constantly shift the message back and forth between mPending and mDeferred, but that was it. My patch causes us to continually allocate new MessageTasks and post them to the event queue.

I'll have to think about how to fix this. I've wanted to fix the busy waiting for a while, but I've been scared to do so. We don't exactly have great tests for plugins, and I don't understand all the code perfectly.

On the other hand, I think I did the re-posting thing because it's needed to preserve the correct message ordering. So if we don't fix the busy waiting, there isn't an obvious solution.

[1] http://searchfox.org/mozilla-central/rev/d8ac097a1de2544f0e51d186bc850662c5b7430a/ipc/glue/MessageChannel.cpp#1460
[2] http://searchfox.org/mozilla-central/rev/d8ac097a1de2544f0e51d186bc850662c5b7430a/ipc/glue/MessageChannel.cpp#1463
[3] http://searchfox.org/mozilla-central/rev/d8ac097a1de2544f0e51d186bc850662c5b7430a/ipc/glue/MessageChannel.cpp#1957
Actually, the busy waiting has not been around forever as I assumed. It started happening in bug 1170231. I'll have to read that over very carefully to try to understand what it was trying to fix.
Flags: needinfo?(wmccloskey)
Flags: needinfo?(kchen)
See comment 14.
Assignee: nobody → wmccloskey
Attachment #8860506 - Flags: review?(aklotz)
Oh, I went through the test case in bug 1170231 and it passes. Adding back the old condition caused it to deadlock. So I think this patch successfully avoids the deadlock while also avoiding busy waiting.
Adding tracking for 54 since we are possibly close to a fix. Thanks Bill!
Comment on attachment 8860506 [details] [diff] [review]
Avoid busy waiting caused by MaybeUndeferIncall

Review of attachment 8860506 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!
Attachment #8860506 - Flags: review?(aklotz) → review+
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/78480dd41c9d
Avoid busy waiting caused by MaybeUndeferIncall (r=aklotz)
https://hg.mozilla.org/mozilla-central/rev/78480dd41c9d
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
This grafts cleanly to Beta, so it's good for uplift if/when you feel comfortable nominating it. For ESR52, it'll need a bit of rebasing. I'm thinking this is something we may want to consider backporting all the way back to ESR52 if it's reasonable to do so, though.
Flags: needinfo?(wmccloskey)
Attached patch esr patchSplinter Review
[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
User impact if declined: We think this will fix the #1 plugin crash in ESR.
Fix Landed on Version: 55
Risk to taking this patch (and alternatives if risky): At worst, this could cause more plugin instability. It doesn't affect our regular IPC paths.
String or UUID changes made by this patch: None

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Flags: needinfo?(wmccloskey)
Attachment #8863044 - Flags: approval-mozilla-esr52?
Comment on attachment 8860506 [details] [diff] [review]
Avoid busy waiting caused by MaybeUndeferIncall

Approval Request Comment
[Feature/Bug causing the regression]: bug 1312960
[User impact if declined]: OOM crashes in plugin process
[Is this code covered by automated tests?]: somewhat
[Has the fix been verified in Nightly?]: no
[Needs manual test from QE? If yes, steps to reproduce]: no STR
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: not very. it only affects plugins.
[Why is the change risky/not risky?]:
[String changes made/needed]: none
Attachment #8860506 - Flags: approval-mozilla-beta?
Also, I don't see any crashes on Nightly since the fix landed. However, the crash rate on Nightly was pretty low before my patch, so it's not obviously fixed.
Comment on attachment 8860506 [details] [diff] [review]
Avoid busy waiting caused by MaybeUndeferIncall

Fix a crash. Beta54+. Should be in 54 beta 4.
Attachment #8860506 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Bill McCloskey (:billm) from comment #24)
> [User impact if declined]: OOM crashes in plugin process
> [Is this code covered by automated tests?]: somewhat
> [Has the fix been verified in Nightly?]: no
> [Needs manual test from QE? If yes, steps to reproduce]: no STR

Setting qe-verify- since there's nothing actionable for QE here.
Flags: qe-verify-
Comment on attachment 8863044 [details] [diff] [review]
esr patch

fix top plugin crash in esr52.2
Attachment #8863044 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
You need to log in before you can comment on or make changes to this bug.