Closed Bug 1158761 Opened 9 years ago Closed 9 years ago

crash in nsBaseAppShell::RunSyncSectionsInternal(bool, unsigned int)

Categories

(Core :: Widget, defect)

x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox38.0.5 --- fixed
firefox39 + fixed
firefox40 + fixed
firefox-esr38 --- fixed

People

(Reporter: ananuti, Assigned: bugzilla)

References

Details

(Keywords: crash, regression, topcrash-win)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-f33171a8-c0cf-4048-a76e-bccfb2150427.
=============================================================

STR:
1. verify java https://www.java.com/verify
2. closed tab after verified
Blocks: 1157488
Flags: needinfo?(bobbyholley)
So, just as I was worried about.

In bug 1157488, I added a release-mode assertion against re-entering sync sections, because that machinery is totally not re-entrant and it seemed like exactly the sort of thing that would cause weird plugin crashes. The crash report here indicates that Java is doing something exactly like that. This could explain a variety of weird plugin crashes.

Aaron, are you the right person to look into this? If not, any idea who is?
Flags: needinfo?(bobbyholley) → needinfo?(aklotz)
To be clear, the invariant that this assertion is trying to enforce is that code run from RunSyncSectionsInternal never spins the event loop, which is presumably what's happening here. It would be nice if the stack in the crash were more useful so that we could figure out how that's happening and whether that's an invariant we can really guarantee or whether we need to handle it somehow.
What kind of information would you like to see in the stack? I might be able to clean it up a bit.
I am almost certain it's possible to run arbitrary script from some of these callbacks (e.g. the one used to load images).
> What kind of information would you like to see in the stack? I might be able
> to clean it up a bit.

It would be helpful to see:

RunSyncSectionsInternal()
...
EventLoopSpinning
...
SomeCulprit()
RunSyncSectionsInternal()

So that we can identify |SomeCulprit|.

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #4)
> I am almost certain it's possible to run arbitrary script from some of these
> callbacks (e.g. the one used to load images).

You mean javascript URIs etc? Do we have a hope of forbidding that? If not, how do we handle this case?
Flags: needinfo?(dmajor)
(In reply to Bobby Holley (:bholley) from comment #1)
> So, just as I was worried about.
> 
> In bug 1157488, I added a release-mode assertion against re-entering sync
> sections, because that machinery is totally not re-entrant and it seemed
> like exactly the sort of thing that would cause weird plugin crashes. The
> crash report here indicates that Java is doing something exactly like that.
> This could explain a variety of weird plugin crashes.
> 
> Aaron, are you the right person to look into this? If not, any idea who is?

In the de facto sense, I probably am. Thanks for the heads up.
Flags: needinfo?(aklotz)
(In reply to Bobby Holley (:bholley) from comment #5)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #4)
> > I am almost certain it's possible to run arbitrary script from some of these
> > callbacks (e.g. the one used to load images).
> 
> You mean javascript URIs etc? Do we have a hope of forbidding that? If not,
> how do we handle this case?

I mean that HTMLImageElement::QueueImageLoadTask queues a call to nsImageLoadingContent::LoadImage, which can fire DOM events.
I get the feeling this ought to be easy to repro. Live machines are better than minidumps. Anyone wanna give it a try?

Crash ID 	Comment
27d9d941-21fb-4b28-badd-8cd142150427 	Java Test?!
5b136807-83cf-4af3-8379-e9c9b2150426 	Java 64 Bit check
8d563483-c36d-409e-9a67-bf79d2150426 	Trying to run the Java test page. Keeps crashing.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #7)
> (In reply to Bobby Holley (:bholley) from comment #5)
> > (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #4)
> > > I am almost certain it's possible to run arbitrary script from some of these
> > > callbacks (e.g. the one used to load images).
> > 
> > You mean javascript URIs etc? Do we have a hope of forbidding that? If not,
> > how do we handle this case?
> 
> I mean that HTMLImageElement::QueueImageLoadTask queues a call to
> nsImageLoadingContent::LoadImage, which can fire DOM events.

Couldn't we fire those events asynchronously, or at the very least append them to run as another sync section?
Possibly, idk.  Putting them in another sync section doesn't help because you can still do an XHR from there.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #10)
> Possibly, idk.  Putting them in another sync section doesn't help because
> you can still do an XHR from there.

Right, good point. They'd need to be async I guess.
xul!nsBaseAppShell::RunSyncSectionsInternal
xul!nsBaseAppShell::OnProcessNextEvent
xul!nsThread::ProcessNextEvent
xul!NS_ProcessPendingEvents
xul!nsBaseAppShell::NativeEventCallback
xul!nsAppShell::EventWindowProc
user32!InternalCallWinProc
user32!UserCallWinProcCheckWow
user32!DispatchMessageWorker
user32!DispatchMessageA
[opaque npjp2 frames]
xul!nsPluginHost::StopPluginInstance
xul!nsObjectLoadingContent::DoStopPlugin
xul!nsObjectLoadingContent::StopPluginInstance
xul!CheckPluginStopEvent::Run
xul!nsBaseAppShell::RunSyncSectionsInternal  <---
xul!nsBaseAppShell::AfterProcessNextEvent
xul!NS_ProcessNextEvent
xul!mozilla::ipc::MessagePump::Run
xul!MessageLoop::RunHandler
xul!MessageLoop::Run
xul!nsBaseAppShell::Run
xul!nsAppShell::Run
xul!nsAppStartup::Run
xul!XREMain::XRE_mainRun
xul!XREMain::XRE_main
xul!XRE_main
firefox!do_main
firefox!NS_internal_main
firefox!wmain
firefox!__tmainCRTStartup
kernel32!BaseThreadInitThunk
ntdll!__RtlUserThreadStart
ntdll!_RtlUserThreadStart
Flags: needinfo?(dmajor)
Thanks a lot dmajor!

Aaron, is there a reason that CheckPluginStopEvent needs to run as a sync section rather than fully asynchronously?
Keywords: regression
(In reply to Bobby Holley (:bholley) from comment #13)
> Aaron, is there a reason that CheckPluginStopEvent needs to run as a sync
> section rather than fully asynchronously?
Flags: needinfo?(aklotz)
I have no idea :-( I'm not very familiar with that part of the plugin code, I'm afraid.
Flags: needinfo?(aklotz)
(In reply to Bobby Holley (:bholley) from comment #13)
> Aaron, is there a reason that CheckPluginStopEvent needs to run as a sync
> section rather than fully asynchronously?
Let's see if johns has some time to comment...
Flags: needinfo?(john)
Any code anywhere that touches plugins, in terms of starting/stopping/script-touching them, runs the risk of the plugin spinning the event loop as such. Java spins the event loop basically 100% of the time you touch it. It's method of operation as far as I can tell is:

> InteractionFromBrowser() {
>   SendMessageToJavaThread();
>   while (!IsResponseAvailable()) {
>     SpinEventLoop();
>   }
>   return GetResponse();
> }

I don't fully appreciate the differences of a sync vs "100% async" event on the loop. I think CheckPluginStopEvent could be async, but unless you can guarantee that no sync section ever runs content script, there's no way they won't hit java spinning the event loop.
Flags: needinfo?(john)
(In reply to John Schoenick [:johns] from comment #18)
> I don't fully appreciate the differences of a sync vs "100% async" event on
> the loop.

"100% async" means creating a runnable and dispatching as normal. "Sync sections" are runnables that are explicitly dispatched to the AppShell via RunInStableState. These events get run "between" events - they're basically a special queue that gets drained each time we process an event.

> I think CheckPluginStopEvent could be async

That sounds great - we should try it. Do you happen to know why this code is currently using RunInStableState?

> but unless you can guarantee that no sync section ever runs content script

That would be the goal here.

> there's no way they won't hit java spinning the event loop.

Right. So we need to avoid running content script and java in sync sections.
(In reply to Bobby Holley (:bholley) from comment #19)
> (In reply to John Schoenick [:johns] from comment #18)
> > I don't fully appreciate the differences of a sync vs "100% async" event on
> > the loop.
> 
> "100% async" means creating a runnable and dispatching as normal. "Sync
> sections" are runnables that are explicitly dispatched to the AppShell via
> RunInStableState. These events get run "between" events - they're basically
> a special queue that gets drained each time we process an event.
> 
> > I think CheckPluginStopEvent could be async
> 
> That sounds great - we should try it. Do you happen to know why this code is
> currently using RunInStableState?

IIRC This code was using RunInStableState for plugin events since antiquity, and I was simply preserving that ordering when I added this event. I'm ~80% sure it doesn't need it, but it could introduce some new issues with e.g. plugins outliving their owning page - but nothing unfixable.

> > but unless you can guarantee that no sync section ever runs content script
> 
> That would be the goal here.
> 
> > there's no way they won't hit java spinning the event loop.
> 
> Right. So we need to avoid running content script and java in sync sections.

Anything that touches plugins at all should probably assert in sync sections - probably in PluginModuleParent. It's not just java that does this :(. I think there's even an edge case in windows where we have to spin the event loop when IPCing to plugin-container to avoid a windows API deadlock (child is waiting for us to process a message).
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #7)
> (In reply to Bobby Holley (:bholley) from comment #5)
> > (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #4)
> > > I am almost certain it's possible to run arbitrary script from some of these
> > > callbacks (e.g. the one used to load images).
> > 
> > You mean javascript URIs etc? Do we have a hope of forbidding that? If not,
> > how do we handle this case?
> 
> I mean that HTMLImageElement::QueueImageLoadTask queues a call to
> nsImageLoadingContent::LoadImage, which can fire DOM events.

( this is also my fault )

The LoadImage call needs to be sync, but IIRC these events firing inline from the LoadImage task isn't necessary/proper per spec
Keywords: topcrash-win
This is now the #1 crash in nightly 40. Is this issue tied to nightly-only features, or will it ride the trains if we don't do anything?
(In reply to David Major [:dmajor] from comment #22)
> This is now the #1 crash in nightly 40. Is this issue tied to nightly-only
> features, or will it ride the trains if we don't do anything?
It will ride the trains.

I'll try moving CheckPluginStopEvent and see what happens.
(In reply to David Major [:dmajor] from comment #22)
> This is now the #1 crash in nightly 40. Is this issue tied to nightly-only
> features, or will it ride the trains if we don't do anything?

The reason this is crashing is that I added a release-mode assertion against this behavior. But we're only asserting in the case where we enter non-reentrant-safe code, which will cause us to run the same event multiple times. This is extremely unsafe, and most likely the cause of other, harder-to-track-down crashes.
(In reply to Bobby Holley (:bholley) from comment #24)
> The reason this is crashing is that I added a release-mode assertion

And to think I doubted that this would happen ;)
It's a useful assert; it helped us find this issue. And if we can fix things so that it stops firing, even better!

[Tracking Requested - why for this release]: comment 23
Changing CheckPluginStopEvent over to a regular runnable has broken a bunch of tests :-(

https://treeherder.mozilla.org/#/jobs?repo=try&revision=5106c6981c18
(In reply to Aaron Klotz [:aklotz] (please use needinfo) from comment #27)
> Changing CheckPluginStopEvent over to a regular runnable has broken a bunch
> of tests :-(
> 
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=5106c6981c18

Looks like it's just dom/plugins/test/mochitest/test_instance_unparentN.html, no? Is that actually testing invariants that need to be preserved?
I'll take a closer look tomorrow.
This test is wrong:

>    p1.callOnDestroy(onDestroy);
>    setTimeout(checkPluginAlreadyDestroyed, 0);
>    d1.removeChild(p1);

The setTimeout will now fire before the CheckPluginStop event triggered from the third line. I don't think this test means to be checking that the plugin will stop before other events, so those two lines can likely be swapped to fix this.
Since this is top crash, we'll track it for 40.
Since the crash report indicates there are crashes in 39, 38, as well as 40, I changed the flags for those to affected, but only set tracking for 40.
I'd prefer not to see this crash reach our aurora users. Merge day is coming soon. If we can't get a fix in time, how about silencing the assert for now? The volume on this crash is really high and I can't imagine that it would be replaced by an equal number of weird plugin crashes.
(In reply to John Schoenick [:johns] from comment #30)
> This test is wrong:
> 
> >    p1.callOnDestroy(onDestroy);
> >    setTimeout(checkPluginAlreadyDestroyed, 0);
> >    d1.removeChild(p1);
> 
> The setTimeout will now fire before the CheckPluginStop event triggered from
> the third line. I don't think this test means to be checking that the plugin
> will stop before other events, so those two lines can likely be swapped to
> fix this.

I've tried that with no luck, unfortunately. checkPluginAlreadyDestroyed is still executing before the plugin is destroyed.
OK, I think I've got a fix for those tests. It looks like the first executeSoon arrives before the plugin stop event, but a second one arrives after. Running a try push ATM, but I think that if checkPluginAlreadyDestroyed executeSoons itself a second time, the tests pass.
I'll put this part up first so we can get that reviewed. I'll upload the tests as soon as I've verified that they look OK on try.
Attachment #8602898 - Flags: review?(bobbyholley)
Comment on attachment 8602898 [details] [diff] [review]
Part 1: Run the event fully asynchronously

Review of attachment 8602898 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for seeing this through! r=bholley
Attachment #8602898 - Flags: review?(bobbyholley) → review+
This patch just updates tests to take into account that there is now a slight delay in the execution of the plugin stop event. We re-run the check a few times until the plugin stop event has executed. This is a little fragile but I'm not optimistic that there is any alternative.
Assignee: nobody → aklotz
Status: NEW → ASSIGNED
Attachment #8603204 - Flags: review?(jmathies)
Attachment #8603204 - Flags: review?(jmathies) → review+
https://hg.mozilla.org/mozilla-central/rev/e7219efa371e
https://hg.mozilla.org/mozilla-central/rev/afbc1f7623e7
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
This signature is now gone from desktop nightlies. The crashes continue on Fennec; I've opened bug 1163814.
Can we nominate this for Beta/release/esr38 please?
Flags: needinfo?(aklotz)
Comment on attachment 8602898 [details] [diff] [review]
Part 1: Run the event fully asynchronously

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
This simple patch could help alleviate numerous plugin crashes which is something that would be nice to fix in ESR.

User impact if declined: 
Weird plugin crashes

Fix Landed on Version:
Landed on 40, already requesting uplift to 39 and 38 release

Risk to taking this patch (and alternatives if risky): 
Low, trivial patch

String or UUID changes made by this patch: None

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.

Approval Request Comment
[Feature/regressing bug #]: NPAPI Plugins
[User impact if declined]: Weird plugin crashes
[Describe test coverage new/current, TreeHerder]: Covered by existing tests plus test modifications in part 2 of this patch
[Risks and why]: Low, trivial patch, already on aurora
[String/UUID change made/needed]: None
Flags: needinfo?(aklotz)
Attachment #8602898 - Flags: approval-mozilla-release?
Attachment #8602898 - Flags: approval-mozilla-esr38?
Attachment #8602898 - Flags: approval-mozilla-beta?
Comment on attachment 8602898 [details] [diff] [review]
Part 1: Run the event fully asynchronously

Top crash, taking it.
Attachment #8602898 - Flags: approval-mozilla-release?
Attachment #8602898 - Flags: approval-mozilla-release+
Attachment #8602898 - Flags: approval-mozilla-esr38?
Attachment #8602898 - Flags: approval-mozilla-esr38+
Attachment #8602898 - Flags: approval-mozilla-beta?
Attachment #8602898 - Flags: approval-mozilla-beta+
Backed out from Beta due to test_instance_re-parent.html failures. Interestingly, release/esr38 are OK. Aaron is investigating, but we decided to backout for now to avoid getting in the way of other release work off the branch.
https://treeherder.mozilla.org/logviewer.html#?job_id=372247&repo=mozilla-beta

https://hg.mozilla.org/releases/mozilla-beta/rev/31d67ade8354
Flags: needinfo?(aklotz)
Setting qe-verify- for this fix since it's already covered by automated tests.
Flags: qe-verify-
Any news here? :)
I missed this for 39 - it wasn't tracked. Tracking it now!
The immediate symptom of this particular crash can't be happening on 39, since the assert was added in 40 by bug 1157488.

The underlying problem is still present on 39, and may be causing random crashes out in the long tail. Uplifting would be good if we can, but I think we could survive without if we had to.
Depends on: 1172627
OK. sounds like I can wontfix this for 39. If that's not correct and you all have a fix coming, please let me know and re-set the flag to "affected".   
It seems odd to have this fixed in 38 and 40 but not in 39.
Comment on attachment 8602898 [details] [diff] [review]
Part 1: Run the event fully asynchronously

Removing the approval flag to get it off the needs-uplift radar.
Attachment #8602898 - Flags: approval-mozilla-beta+
I've uplifted bug 1172627 to beta. That should take care of test_instance_re-parent.html.
Flags: needinfo?(aklotz) → needinfo?(ryanvm)
Try push of these patches re-applied to mozilla-beta tip:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=408ac3ec45eb
Comment on attachment 8602898 [details] [diff] [review]
Part 1: Run the event fully asynchronously

Try likes it. Re-nominating for beta.
Flags: needinfo?(ryanvm)
Attachment #8602898 - Flags: approval-mozilla-beta?
Comment on attachment 8602898 [details] [diff] [review]
Part 1: Run the event fully asynchronously

OK, let's uplift this to beta.
Attachment #8602898 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.