1334097 - Crash in OOM | unknown | mozalloc_abort | mozalloc_handle_oom | moz_xmalloc | mozilla::ipc::MessageChannel::MaybeUndeferIncall

Reporter

Description

•

7 years ago

This bug was filed from the Socorro interface and is 
report bp-2f8dd5ce-2d43-4ee4-8f6c-dd1c92170126.
=============================================================
Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	mozglue.dll 	mozalloc_abort(char const* const) 	memory/mozalloc/mozalloc_abort.cpp:33
1 	mozglue.dll 	mozalloc_handle_oom(unsigned int) 	memory/mozalloc/mozalloc_oom.cpp:46
2 	mozglue.dll 	moz_xmalloc 	memory/mozalloc/mozalloc.cpp:85
3 	xul.dll 	mozilla::ipc::MessageChannel::MaybeUndeferIncall() 	ipc/glue/MessageChannel.cpp:1854
4 	xul.dll 	`anonymous namespace'::UnhookNeuteredWindows 	ipc/glue/WindowsMessageLoop.cpp:645
5 	xul.dll 	mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*) 	ipc/glue/MessageChannel.cpp:1324
6 	xul.dll 	mozilla::plugins::PPluginScriptableObjectChild::CallNPN_Evaluate(nsCString const&, mozilla::plugins::Variant*, bool*) 	obj-firefox/ipc/ipdl/PPluginScriptableObjectChild.cpp:104
7 	xul.dll 	mozilla::plugins::PluginScriptableObjectChild::Evaluate(_NPString*, _NPVariant*) 	dom/plugins/ipc/PluginScriptableObjectChild.cpp:1189
8 	npswf32_24_0_0_194.dll 	F_2019925551_______________________________________________________________________________ 	F_1128892918________________________________________________________________________:48888


plugin crashes with this signature are regressing since firefox 52 builds in all versions of windows and in code that apparently got added with bug 1312960.

Correlations for Firefox Aurora
(100.0% in signature vs 14.05% overall) Module "plugin-container.exe" = true
(100.0% in signature vs 22.11% overall) Module "comdlg32.dll" = true
(100.0% in signature vs 26.38% overall) startup_crash = null
(95.00% in signature vs 19.43% overall) Module "winspool.drv" = true [100.0% vs 22.07% if platform_pretty_version = Windows 10]
(100.0% in signature vs 16.47% overall) Module "dinput8.dll" = true [100.0% vs 22.28% if platform_pretty_version = Windows 10]
(100.0% in signature vs 16.61% overall) Module "dsound.dll" = true [100.0% vs 22.41% if platform_pretty_version = Windows 10]
(100.0% in signature vs 22.81% overall) Module "icm32.dll" = true [100.0% vs 31.83% if platform_pretty_version = Windows 10]
(100.0% in signature vs 30.89% overall) Module "urlmon.dll" = true [100.0% vs 42.64% if platform_pretty_version = Windows 7]
(30.00% in signature vs 86.75% overall) plugin_version = null
(70.00% in signature vs 13.25% overall) plugin = true
(70.00% in signature vs 13.25% overall) process_type = plugin
(70.00% in signature vs 11.62% overall) Module "NPSWF32_24_0_0_194.dll" = true [83.33% vs 11.10% if platform_version = 6.1.7601 Service Pack 1]
(45.00% in signature vs 90.97% overall) e10s_enabled = 1
(50.00% in signature vs 10.72% overall) plugin_version = 24.0.0.194

Andrew McCreight [:mccr8]

Comment 1

•

7 years ago

There are not a ton of these crashes (I only see 54). About half are in the plugin process. I looked at a few that weren't, and they all still had Flash on the stack. None of them have an OOM allocation size. I assume that's expected for the plugin process, but I'm not sure why that is the case for the others. It might be because the process annotation isn't being applied properly. I see this crash which does not have process type plugin, but the stack does not look like a main process stack to me: bp-fe8a5291-5352-4bf0-8710-29b072170120

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

status-firefox54: ? → affected

Ryan VanderMeulen [:RyanVM]

Comment 2

•

7 years ago

Bill, any thoughts on what might be going on here?

Flags: needinfo?(wmccloskey)

Benjamin Smedberg

Comment 3

•

7 years ago

It's possible to tell the allocation size by loading the minidump, but you're right that currently OOM size annotations don't work in plugin processes because we try to send that data through the IPC layer instead of the out-of-band annotation layer.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 4

•

7 years ago

Well, we're allocating a small, fixed-sized object. There's not a lot we can do about that. My patch switched us from using a vector to a linked list. A linked list should actually reduce the chance of OOM since it doesn't need contiguous memory.

Flags: needinfo?(wmccloskey)

Astley Chen (inactive)

Comment 5

•

7 years ago

[Tracking Requested - why for this release]: #23 top crash on 52.0b in last 7 days.
The crash signature appeared from 52.0a2 and seen on 53 and 54 as well. It seems not a general OOM case.

tracking-firefox52: --- → ?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 6

•

7 years ago

(In reply to Astley Chen [:astley] (UTC+8) from comment #5)
> It seems not a general OOM case.

Why do you say that?

Andrew McCreight [:mccr8]

Comment 7

•

7 years ago

We used to see crashes on the line after it (mPending.insertBack(task);), so it might be possible this is some kind of signature change. I don't think anybody checked the process types on those crashes. I'd guess that maybe this could happen if the main process is hanging really badly, and the plugin process keeps trying to send messages, so the message queue backs up.

Kan-Ru Chen [:kanru] (UTC+9)

Updated

•

7 years ago

Flags: needinfo?(kchen)

Astley Chen (inactive)

Comment 8

•

7 years ago

The crash rate seems mitigated after new flash version 24.0.0.221 was out on 2/14, but that might not be the case though. Besides, this crash is out of Top50 crasher on 52.0b. Keep monitoring.

[1] https://helpx.adobe.com/flash-player/release-note/fp_24_air_24_release_notes.html

tracking-firefox52: ? → ---

[:philipp]

Reporter

Comment 9

•

7 years ago

the number of submitted crash reports only has gone down after the infobar submission mechanism has been switched off in 52.0b6 (before that 85% of reports came through the infobar) - so the real world impact of this will stay the same

Astley Chen (inactive)

Comment 10

•

7 years ago

(In reply to [:philipp] from comment #9)
> the number of submitted crash reports only has gone down after the infobar
> submission mechanism has been switched off in 52.0b6 (before that 85% of
> reports came through the infobar) - so the real world impact of this will
> stay the same

Thanks for the heads-up. Where could I get this info(infobar is on or off in specific build) ?

[:philipp]

Reporter

Comment 11

•

7 years ago

the infobar mechanism is active in beta builds which have the EARLY_BETA_OR_EARLIER flag set. there is some documentation about it at https://wiki.mozilla.org/Release_Management/Beta_Release_Checklist#Set_EARLY_BETA_OR_EARLIER - but i don't know how to easily check/verify that.

it's possible to exclude all reports coming through the infobar in supersearch though in order to get a consistent picture:
https://crash-stats.mozilla.com/search/?signature=%3DOOM%20|%20unknown%20|%20mozalloc_abort%20|%20mozalloc_handle_oom%20|%20moz_xmalloc%20|%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AMaybeUndeferIncall&submitted_from_infobar=!__true__&date=%3E%3D2016-08-21T08%3A33%3A55.000Z&_sort=-date&_facets=signature&_facets=version&_facets=user_comments&_facets=install_time&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version

Astley Chen (inactive)

Updated

•

7 years ago

status-firefox52: affected → fix-optional

Liz Henry (:lizzard) (relman/hg->git project)

Comment 12

•

7 years ago

Even excluding those infobar submission reports, using the query from comment 11, I'm seeing very high crash volume in ESR52.  And, around 100 crashes a week from various beta 53 versions. 

Bill is there any thing actionable here? Can you find someone to take another look at the ESR crashes? If you facet on platform you get nearly all Windows NT:

https://crash-stats.mozilla.com/search/?signature=%3DOOM%20%7C%20unknown%20%7C%20mozalloc_abort%20%7C%20mozalloc_handle_oom%20%7C%20moz_xmalloc%20%7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AMaybeUndeferIncall&submitted_from_infobar=%21__true__&date=%3E%3D2016-08-21T08%3A33%3A55.000Z&date=%3C2017-04-14T01%3A18%3A00.000Z&_sort=-date&_facets=signature&_facets=version&_facets=user_comments&_facets=install_time&_facets=platform&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-platform

status-firefox52: fix-optional → wontfix

status-firefox53: affected → wontfix

status-firefox-esr52: --- → affected

tracking-firefox-esr52: --- → ?

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 13

•

7 years ago

This does seem like a problem. The more I think about it, OOMs in the plugin process don't make a lot of sense. The plugin is almost always Flash, and it runs all its code in a separate process. Our plugin process is just a broker. So it should be very small. It seems like we must be leaking messages somewhere. I'll look into this next week.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 14

•

7 years ago

I think I know what's happening here. We've always had a problem where we busy wait during Call invocations. If there's a deferred messages, we undefer it [1]. Then we don't block on the condition var because mPending is non-empty [2]. But then when we dispatch the message, we defer it again because nothing has really changed [3].

I think my patch made this worse. Before, we would constantly shift the message back and forth between mPending and mDeferred, but that was it. My patch causes us to continually allocate new MessageTasks and post them to the event queue.

I'll have to think about how to fix this. I've wanted to fix the busy waiting for a while, but I've been scared to do so. We don't exactly have great tests for plugins, and I don't understand all the code perfectly.

On the other hand, I think I did the re-posting thing because it's needed to preserve the correct message ordering. So if we don't fix the busy waiting, there isn't an obvious solution.

[1] http://searchfox.org/mozilla-central/rev/d8ac097a1de2544f0e51d186bc850662c5b7430a/ipc/glue/MessageChannel.cpp#1460
[2] http://searchfox.org/mozilla-central/rev/d8ac097a1de2544f0e51d186bc850662c5b7430a/ipc/glue/MessageChannel.cpp#1463
[3] http://searchfox.org/mozilla-central/rev/d8ac097a1de2544f0e51d186bc850662c5b7430a/ipc/glue/MessageChannel.cpp#1957

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 15

•

7 years ago

Actually, the busy waiting has not been around forever as I assumed. It started happening in bug 1170231. I'll have to read that over very carefully to try to understand what it was trying to fix.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

•

7 years ago

Flags: needinfo?(wmccloskey)

Flags: needinfo?(kchen)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 16

•

7 years ago

Attached patch Avoid busy waiting caused by MaybeUndeferIncall — Details — Splinter Review

See comment 14.

Assignee: nobody → wmccloskey

Attachment #8860506 - Flags: review?(aklotz)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 17

•

7 years ago

Oh, I went through the test case in bug 1170231 and it passes. Adding back the old condition caused it to deadlock. So I think this patch successfully avoids the deadlock while also avoiding busy waiting.

Liz Henry (:lizzard) (relman/hg->git project)

Comment 18

•

7 years ago

Adding tracking for 54 since we are possibly close to a fix. Thanks Bill!

status-firefox55: --- → affected

tracking-firefox54: --- → +

(No longer employed by Mozilla) Aaron Klotz

Comment 19

•

7 years ago

Comment on attachment 8860506 [details] [diff] [review]
Avoid busy waiting caused by MaybeUndeferIncall

Review of attachment 8860506 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!

Attachment #8860506 - Flags: review?(aklotz) → review+

Pulsebot

Comment 20

•

7 years ago

Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/78480dd41c9d
Avoid busy waiting caused by MaybeUndeferIncall (r=aklotz)

Carsten Book [:Tomcat]

Comment 21

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/78480dd41c9d

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox55: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla55

Ryan VanderMeulen [:RyanVM]

Comment 22

•

7 years ago

This grafts cleanly to Beta, so it's good for uplift if/when you feel comfortable nominating it. For ESR52, it'll need a bit of rebasing. I'm thinking this is something we may want to consider backporting all the way back to ESR52 if it's reasonable to do so, though.

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 23

•

7 years ago

Attached patch esr patch — Details — Splinter Review

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
User impact if declined: We think this will fix the #1 plugin crash in ESR.
Fix Landed on Version: 55
Risk to taking this patch (and alternatives if risky): At worst, this could cause more plugin instability. It doesn't affect our regular IPC paths.
String or UUID changes made by this patch: None

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.

Flags: needinfo?(wmccloskey)

Attachment #8863044 - Flags: approval-mozilla-esr52?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 24

•

7 years ago

Comment on attachment 8860506 [details] [diff] [review]
Avoid busy waiting caused by MaybeUndeferIncall

Approval Request Comment
[Feature/Bug causing the regression]: bug 1312960
[User impact if declined]: OOM crashes in plugin process
[Is this code covered by automated tests?]: somewhat
[Has the fix been verified in Nightly?]: no
[Needs manual test from QE? If yes, steps to reproduce]: no STR
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: not very. it only affects plugins.
[Why is the change risky/not risky?]:
[String changes made/needed]: none

Attachment #8860506 - Flags: approval-mozilla-beta?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 25

•

7 years ago

Also, I don't see any crashes on Nightly since the fix landed. However, the crash rate on Nightly was pretty low before my patch, so it's not obviously fixed.

Gerry Chang [:gchang]

Comment 26

•

7 years ago

Comment on attachment 8860506 [details] [diff] [review]
Avoid busy waiting caused by MaybeUndeferIncall

Fix a crash. Beta54+. Should be in 54 beta 4.

Attachment #8860506 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 27

•

7 years ago

uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/4a587e536117

status-firefox54: affected → fixed

Andrei Vaida [:avaida]

Comment 28

•

7 years ago

(In reply to Bill McCloskey (:billm) from comment #24)
> [User impact if declined]: OOM crashes in plugin process
> [Is this code covered by automated tests?]: somewhat
> [Has the fix been verified in Nightly?]: no
> [Needs manual test from QE? If yes, steps to reproduce]: no STR

Setting qe-verify- since there's nothing actionable for QE here.

Flags: qe-verify-

Julien Cristau [:jcristau] (back April 22)

Updated

•

7 years ago

tracking-firefox-esr52: ? → 54+

Julien Cristau [:jcristau] (back April 22)

Comment 29

•

7 years ago

Comment on attachment 8863044 [details] [diff] [review]
esr patch

fix top plugin crash in esr52.2

Attachment #8863044 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+

Ryan VanderMeulen [:RyanVM]

Comment 30

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-esr52/rev/fe2a2c7e88cb

status-firefox-esr52: affected → fixed

Avoid busy waiting caused by MaybeUndeferIncall 7 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 7.55 KB, patch	bugzilla : review+ gchang : approval-mozilla-beta+	Details \| Diff \| Splinter Review
esr patch 7 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 7.28 KB, patch	jcristau : approval-mozilla-esr52+	Details \| Diff \| Splinter Review