Closed Bug 1324820 Opened 8 years ago Closed 8 years ago

IPCError-browser | ShutDownKill received when closing FX while a gif is autoplaying from Twitter moments

Categories

(Core :: DOM: Content Processes, defect, P1)

53 Branch
defect

Tracking


RESOLVED FIXED
mozilla53
Tracking Status
platform-rel --- ?
firefox-esr45 --- unaffected
firefox50 --- unaffected
firefox51 --- unaffected
firefox52 --- unaffected
firefox53 --- fixed

People

(Reporter: kjozwiak, Assigned: tnguyen)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [platform-rel-Twitter])

Crash Data

Attachments

(2 files, 3 obsolete files)

Created a new bug as per bug 1279293 comment 63. When you close fx while the browser is autoplaying a gif from one of Twitter's "Moment" pages, you'll receive the [@ IPCError-browser | ShutDownKill] crash every time the browser is shut down. I can reproduce this 100% of the time using the following STR:

* install the latest version of m-c
* open https://twitter.com/i/moments/810878086774456320
* let one of the random gifs start autoplaying and quickly shut down FX

You'll notice that the shutdown process hangs and it takes several seconds for fx to close correctly.

I've created a quick video of the STR that I used to reproduce the crash on a brand new m-c installation under macOS 10.12.2:
* https://youtu.be/SQ6x5iWVaqg

Crash Reports:

* bp-ff8ee990-5b23-43c3-a3e8-4e48e2161219
* bp-1146cec8-6ed7-42cf-9571-67f872161219
* bp-1384ea42-7d95-46f8-be8b-a3d5e2161219
* bp-e296be83-7fa1-4ede-afb0-455cb2161219
Has STR: --- → yes
Summary: IPCError-browser | ShutDownKill received when closing FX while a gif is autoplaying from Twitter oments → IPCError-browser | ShutDownKill received when closing FX while a gif is autoplaying from Twitter moments
Flags: needinfo?(tnguyen)
Thanks for reporting this, I am looking at this bug.
At first look, I think the reason is that XMLHttpRequestMainThread::SendInternal opened an HTTP channel (AsyncOpen2). Although FF is shutting down, the channel is still suspended and never gets its callback. The 6-second ShutDownKill timer then fires and causes a silent crash.
Flags: needinfo?(tnguyen)
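For context, the ShutDownKill signature comes from a parent-side shutdown watchdog: the parent asks the content process to shut down and force-kills it if it has not exited before a deadline, recording the crash annotation. The sketch below is a minimal, self-contained illustration of that watchdog pattern only; the class and member names are invented for this note, and the 6-second deadline is taken from the comment above, not from the actual Gecko implementation.

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>

struct ShutdownWatchdog {
  std::mutex m;
  std::condition_variable cv;
  bool childExited = false;

  // Called once the content process has finished its shutdown work.
  void ChildFinishedShutdown() {
    {
      std::lock_guard<std::mutex> lock(m);
      childExited = true;
    }
    cv.notify_one();
  }

  // Called by the parent after asking the child to shut down.
  void WaitOrKill() {
    std::unique_lock<std::mutex> lock(m);
    if (!cv.wait_for(lock, std::chrono::seconds(6),
                     [this] { return childExited; })) {
      // This is the point where the child would be killed and the
      // "IPCError-browser | ShutDownKill" annotation recorded.
      std::puts("ShutDownKill: child did not exit in time");
    }
  }
};

Anything that blocks the content process's shutdown path, such as a sync XHR spinning its event loop while waiting for a callback that never comes, will therefore show up under this signature.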
Assignee: nobody → tnguyen
Blocks: 1305486
Keywords: regression
Priority: -- → P1
This bug is a regression caused by bug 1315386, per Thomas.
MozReview-Commit-ID: IGsPCkrEEuG
This silent crash does not occur if I compile and build FF from source. I've just pushed my patch to try
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7c5f5fb81ba97448a92c3cc99018db65c41e68a&selectedJob=33499220

Could you please download the build from the try run
https://queue.taskcluster.net/v1/task/Ro5Ghbi-SR-o6TP6OUl39g/runs/0/artifacts/public/build/target.dmg
and see whether the patch fixes this bug?
Flags: needinfo?(kjozwiak)
I reproduced the original issue using the STR from comment#0 with the following build:

* fx53.0a1, buildid: 20161228030213, changeset: d7b6af32811b
** https://archive.mozilla.org/pub/firefox/nightly/2016/12/2016-12-28-03-02-13-mozilla-central/

Using the build that Thomas provided in comment#6, I went through the following test cases:

* going through STR using regular windows/tabs
* going through STR using default container window/tabs
* going through STR using custom container window/tabs
* going through STR using PB window/tabs
* opened the website from the STR in two different tabs and closed FX
** container/regular tab
** two default tabs
** container/PB
** default container/custom container
* checked about:crashes after every test case to ensure that new silent crashes haven't been introduced

Platforms Used: (I'll check Win/Linux once this lands into m-c)

* macOS 10.12.2 - PASSED

Thomas, it looks like the try build has fixed the issue. I tried several different test cases and couldn't reproduce the original crash.
Flags: needinfo?(kjozwiak)
Attachment #8822138 - Flags: review?(gpascutto)
Comment on attachment 8822138 [details] [diff] [review]
Resolve all pending lookups to fix IPCError-browser | ShutDownKill

This looks correct (DoLookup checks for shutting down and will immediately resolve with null if so), but at the same time it also feels quite icky. We're resolving callbacks after we've already been notified that we're supposed to be shutting down.

Tracing this, it seems to be caused by sync XHR, which would partially explain why the missing callback can block anything at all. I see there's work underway to resolve issues where sync XHR times out for any reason. Would it make sense to add checks here to prevent the spinning when the browser is shutting down, i.e. break out of the spin loop on shutdown?

I'm going to f? a few people here because this may be a case where the real problem is not in this code.

https://dxr.mozilla.org/mozilla-central/rev/143bb4b9249e528e658f6ccc449991794b8675f8/dom/xhr/XMLHttpRequestMainThread.cpp#3006
Attachment #8822138 - Flags: feedback?(bugs)
Attachment #8822138 - Flags: feedback?(amarchesini)
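To make the suggestion above concrete, here is a rough sketch of what "break out of the spin loop on shutdown" could look like at the spot linked above. This is not the actual XMLHttpRequestMainThread code: IsShuttingDown() is a stand-in for whatever shutdown signal a real fix would observe, while mFlagSyncLooping, NS_GetCurrentThread() and NS_ProcessNextEvent() are the existing names.

// Hypothetical sketch only: sync XHR spins the main-thread event loop until
// OnStopRequest clears mFlagSyncLooping; the idea is to also bail out once
// shutdown has started instead of waiting for the ShutDownKill timer.
nsIThread* thread = NS_GetCurrentThread();
while (mFlagSyncLooping) {
  if (IsShuttingDown()) {            // stand-in for the real shutdown check
    mFlagSyncLooping = false;        // stop spinning; treat the send as aborted
    break;
  }
  if (!NS_ProcessNextEvent(thread)) {
    break;                           // event processing failed; don't busy-wait
  }
}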
Comment on attachment 8822138 [details] [diff] [review]
Resolve all pending lookups to fix IPCError-browser | ShutDownKill

Can't give any feedback about this patch, but indeed, cancelling sync XHR when the TabChild/ContentChild is going away makes sense, if doable.
Note, we also need to check all the other cases where we spin the event loop: modal dialogs at least, but possibly also some random JS scripts.
Attachment #8822138 - Flags: feedback?(bugs)
Some discussion from #developers:

21:08 <gcp> the patch there is a fix, but I'm not sure if that's the correct fix
21:08 <gcp> i.e. going back and issuing a bunch of callbacks in shutdown code
21:08 <@bz> yeah, that's a bit odd.
21:09 <@bz> I'd think necko would cancel the channel and send OnStopRequest to the XHR....
21:09 <gcp> necko is probably blocked on the classifier returning from the classification callback
21:09 <gcp> though that wasn't in the stack
21:16 <gcp> oh, the original bug is already in DOM :P
21:18 <@bz> gcp: necko should cancel the classification and move on, I'd think
21:19 <@bz> gcp: just like it would if the user navigated away from the page!
21:19 — @bz really hopes we would not keep waiting on the classifier then.

Can we shelve the current patch and investigate why necko is unable to unstick itself on shutdown here? (note mozilla::dom::TabChild::RecvDestroy() in the stack)
platform-rel: --- → ?
Whiteboard: [platform-rel-Twitter]
Thanks gcp and smaug for looking at this. It is indeed blocked in a sync XHR until the channel stops and sends OnStopRequest to the XHR (which then sets mFlagSyncLooping to false).
I am considering not keeping the wait on the classifier and instead cancelling all the channels in the pending queue. However, we start shutting down the classifier very early, at "quit-application", so would we lose some tasks during shutdown, for example an XHR request made from an unload event? Hmm, I don't know, maybe I am wrong.
Crash volume for signature 'IPCError-browser | ShutDownKill':
 - nightly (version 53): 79125 crashes from 2016-11-14.
 - aurora  (version 52): 118636 crashes from 2016-11-14.
 - beta    (version 51): 412191 crashes from 2016-11-14.
 - release (version 50): 3115 crashes from 2016-11-01.
 - esr     (version 45): 132 crashes from 2016-07-06.

Crash volume on the last weeks (Week N is from 01-02 to 01-08):
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly   11114   11304   13841   11315   10841    9916    7074
 - aurora    18482   20233   21166   20277   19016   13352     152
 - beta       4854   11234  104993  116974   80635   70864   21296
 - release     381     401     650     560     456     367     157
 - esr           3       3       8       2       4      10       7

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly           #1
 - aurora            #1
 - beta              #1
 - release           #29
 - esr               #1
Hi :smaug
Should we cancel all pending channels in the classifier queue after the "quit-application" event?
Flags: needinfo?(bugs)
I guess so. Do we not cancel the channel automatically? Why doesn't Necko call OnStopRequest or such?
Or is the issue that OnStopRequest happens later, and classifier doesn't expect it to happen so late?
Flags: needinfo?(bugs)
Oh, the channel was not cancelled automatically. Another approach, which I think is more reasonable, is that we should not add any pending channel once we are shutting down.
That way, rather than cancelling pending requests, we simply do not accept any request after the "quit-application" event, though we may end up rejecting requests (if there are any) made from unload events.
Attachment #8822138 - Flags: review?(gpascutto)
Attachment #8822138 - Flags: feedback?(amarchesini)
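As a rough illustration of the "do not accept new work after quit-application" idea above: gShuttingDownThread is the flag already used in nsUrlClassifierDBService, but the method and signature shown here are assumed for illustration and are not taken from the patch.

// Sketch only: fail fast instead of queuing a lookup that can never complete.
NS_IMETHODIMP
nsUrlClassifierDBService::Lookup(nsIPrincipal* aPrincipal,
                                 const nsACString& aTables,
                                 nsIUrlClassifierCallback* aCallback)
{
  if (gShuttingDownThread) {
    // The caller (e.g. nsChannelClassifier) gets an error immediately and is
    // then responsible for cancelling/resuming its channel so OnStopRequest
    // can still be delivered.
    return NS_ERROR_ABORT;
  }
  // ... normal path: proxy the lookup to the worker thread ...
  return NS_OK;
}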
MozReview-Commit-ID: LfxKf8JNOCC
Attachment #8822138 - Attachment is obsolete: true
Comment on attachment 8824356 [details] [diff] [review]
Resolve all pending lookups and cancel pending channels

The crash signature did not appear before (or appeared only rarely); it started occurring frequently after we added the shutdown awareness (stop doing classifier work while shutting down).
I have not traced why necko does not send OnStopRequest (or cancel the channel), but I think we still need to resolve pending lookups, otherwise there might be a leak in the worker. Updated the patch to send NS_ERROR_ABORT to cancel the channel (alternatively, we could observe shutdown in nsChannelClassifier).
Attachment #8824356 - Flags: feedback?(gpascutto)
Comment on attachment 8824356 [details] [diff] [review]
Resolve all pending lookups and cancel pending channels

Review of attachment 8824356 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +1179,5 @@
>  
>  NS_IMETHODIMP
>  nsUrlClassifierClassifyCallback::HandleEvent(const nsACString& tables)
>  {
>    nsresult response = TablesToResponse(tables);

If we're shutting down, it would be nice to avoid this call too.

So perhaps `nsresult response = NS_OK;` and then do `response = TablesToResponse(tables);` after the `if (gShuttingDownThread)` check?
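In other words, the suggested shape would be roughly the following. This is a sketch only: the body of the shutdown branch and the rest of HandleEvent are elided, and the only point is the ordering of the TablesToResponse() call relative to the gShuttingDownThread check.

NS_IMETHODIMP
nsUrlClassifierClassifyCallback::HandleEvent(const nsACString& tables)
{
  nsresult response = NS_OK;
  if (gShuttingDownThread) {
    // Shutting down: skip TablesToResponse() entirely and complete with the
    // neutral response (exact behavior here is up to the patch).
  } else {
    response = TablesToResponse(tables);
  }
  // ... the remainder of HandleEvent uses `response` as before ...
  return NS_OK;
}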
Thanks Francois.
I could see the channel is cancelled during shutdown, but there is no OnStopRequest. IMHO, a suspended request is not supposed to dispatch events (such as OnStopRequest) until the request is resumed, which is what led to the hang in the sync XHR here.
For now, if we send NS_ERROR_ABORT to nsChannelClassifier we can cancel the channel and then resume it, which resolves the hang. But I am wondering whether we need to resume all suspended channels when FF is shutting down, to avoid the same accidental issue later.
Flags: needinfo?(dd.mozilla)
(In reply to Thomas Nguyen[:tnguyen] (use ni? plz) from comment #20)
> Thanks Francois.
> I could see the channel is cancelled during shutdown, but there is no
> OnStopRequest. IMHO, a suspended request is not supposed to dispatch events
> (such as OnStopRequest) until the request is resumed, which is what led to
> the hang in the sync XHR here.
To be clearer: after Cancel(), we need to call Resume() to send out OnStopRequest.
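A small sketch of that cancel-then-resume sequence: nsIRequest::Cancel/Resume are the real necko APIs, but the helper function and the mSuspendedChannel-style flag below are written for illustration and are not code from the patch.

// Illustration only, not the actual nsChannelClassifier change.
static void
CancelAndResumeOnShutdown(nsIChannel* aChannel, bool& aSuspendedChannel)
{
  if (!aChannel || !aSuspendedChannel) {
    return;
  }
  // Record the abort status first so listeners see NS_ERROR_ABORT...
  aChannel->Cancel(NS_ERROR_ABORT);
  // ...then resume: a suspended channel never dispatches OnStopRequest, so
  // resuming is what lets necko deliver it and unblock the waiting sync XHR.
  aChannel->Resume();
  // Clear the flag so the normal path cannot Cancel()/Resume() a second time
  // (see the later discussion about Resume being called twice).
  aSuspendedChannel = false;
}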
Comment on attachment 8824356 [details] [diff] [review]
Resolve all pending lookups and cancel pending channels

Review of attachment 8824356 [details] [diff] [review]:
-----------------------------------------------------------------

>but, I think we still need to resolve pending lookups, otherwise there might be a leak in worker

I don't understand. This patch will cause nsIUrlClassifierLookupCallback->LookupComplete() to be called on the entries in mPendingLookups. How does that stop a leak?
Attachment #8824356 - Flags: feedback?(gpascutto)
Oh, what I mean is: if there's a pending lookup that has not been resolved, in other words mPendingLookups (in mWorker) contains an element which was not freed.
That occurs during shutdown, and it might have an effect and cause some intermittent leak failures in tests, I think.
https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#309
It's an nsTArray, it will free the elements on destruction. There's no raw pointers in it, only an nsCOMPtr, so it can't leak. You're probably getting a leakcheck fail because shutdown hangs in the middle, so destructors are never run.

Unless I missed something obvious or there's side effects in other code, I do not believe this is a valid reasoning to be executing these callbacks if we're supposed to be shutting down.
Thanks gcp,
(In reply to Gian-Carlo Pascutto [:gcp] from comment #24)
> It's an nsTArray, it will free the elements on destruction. There's no raw
> pointers in it, only an nsCOMPtr, so it can't leak. You're probably getting
> a leakcheck fail because shutdown hangs in the middle, so destructors are
> never run.

Shawn told me that there seems to be a frequent intermittent leak in the XHR tests, and I found:

07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierClassifyCallback
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierDBService
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierDBServiceWorker
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierLookupCallback
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 10 nsWeakReference
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 nsXMLHttpRequestXPCOMifier
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 xptiInterfaceInfo
Probably nsUrlClassifierLookupCallback leaks in the middle of shutting down
(In reply to Gian-Carlo Pascutto [:gcp] from comment #24)
> It's an nsTArray, it will free the elements on destruction. There's no raw
> pointers in it, only an nsCOMPtr, so it can't leak. You're probably getting
> a leakcheck fail because shutdown hangs in the middle, so destructors are
> never called

Oh, you are right. I looked at the code again. Thank you.
Shawn pushed a patch in bug 1324856 to resolve the loop in sync XHR (in XMLHttpRequestMainThread); the basic idea is to catch "shutdown" and then break out of the loop, but I am not sure it's a complete fix. I am taking a look at the necko part today to see why necko cannot unstick itself; I just turned on nsHttp and IPDL logging and am tracing the log.

- OnStopRequest is expected to be received in XHR, otherwise it will hang.
- In this case the parent channel is suspended (and so is its pump), and there will be no OnStopRequest until it's resumed.
- The same goes for the child channel: no OnStopRequest is received from the parent and dispatched to listeners.
- The child and parent channels are cancelled during shutdown but cannot delete themselves (I could not see any "Destroying HttpChannelParent" or HttpChannelChild logs related to these channels). These channels are supposed to be deleted after getting OnStopRequest and then sending/receiving the DeleteChannel and DeleteSelf messages.

The OnStopRequest event seems to be pending because the channel (or its pump) is suspended; I'm not sure whether resuming and then dispatching OnStopRequest (while FF is shutting down) is a good idea. I am unsure whether we need to resume the channel or the pump, in the parent or in the child, or just send out OnStopRequest (or do nothing and only fix XMLHttpRequestMainThread) in this case.
Probably I am completely wrong; I will continue looking at this to see what I can do, but I would like to ni? necko for ideas.
Hi Patrick, Dragana, do you have any idea about this bug?
Thanks
MozReview-Commit-ID: 4urL2hoxm3L
Attachment #8824356 - Attachment is obsolete: true
Comment on attachment 8825728 [details] [diff] [review]
Fire OnStopRequest when shutting down to avoid shutdown hang in XHR

Hi Patrick,
Would you please comment on the patch? Do we need to resume the suspended channel in this case?
Thanks
Attachment #8825728 - Flags: feedback?(mcmanus)
(In reply to Thomas Nguyen[:tnguyen] (use ni? plz) from comment #20)
> Thanks Francois.
> I could see the channel is cancelled during shutdown, but there is no
> OnStopRequest. IMHO, a suspended request is not supposed to dispatch events
> (such as OnStopRequest) until the request is resumed, which is what led to
> the hang in the sync XHR here.
> For now, if we send NS_ERROR_ABORT to nsChannelClassifier we can cancel the
> channel and then resume it, which resolves the hang. But I am wondering
> whether we need to resume all suspended channels when FF is shutting down,
> to avoid the same accidental issue later.

Necko expects that every suspend has a matching resume, and it will not send any OnStartRequest, OnDataAvailable or OnStopRequest while the channel is suspended. Whoever suspends a channel needs to take care of resuming it; any more complex rule would be hard to follow. Necko expects that a channel is resumed exactly as many times as it is suspended.
Flags: needinfo?(dd.mozilla)
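As an aside, one common way to keep that "every Suspend() has exactly one Resume()" invariant is to tie the pair to a scope guard. The class below is written for this note and is not code from the patch; nsIRequest::Suspend/Resume are the real APIs.

class MOZ_STACK_CLASS ScopedRequestSuspender final
{
public:
  explicit ScopedRequestSuspender(nsIRequest* aRequest)
    : mRequest(aRequest)
  {
    if (mRequest) {
      mRequest->Suspend();
    }
  }

  ~ScopedRequestSuspender()
  {
    if (mRequest) {
      mRequest->Resume();  // balanced resume even on early returns
    }
  }

private:
  nsCOMPtr<nsIRequest> mRequest;
};

In the classifier case the suspend spans an asynchronous lookup, so a scope guard does not apply directly and the code tracks the state with a boolean (mSuspendedChannel) instead, but the invariant it has to preserve is the same.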
Comment on attachment 8825728 [details] [diff] [review]
Fire OnStopRequest when shutting down to avoid shutdown hang in XHR

Review of attachment 8825728 [details] [diff] [review]:
-----------------------------------------------------------------

I think we should resolve this in a different way. We should see what has suspended the channel and find a way for exactly that code to resume the channel. From that point OnStopRequest will be fired and the XHR will not hang.
Comment on attachment 8825728 [details] [diff] [review]
Fire OnStopRequest when shutting down to avoid shutdown hang in XHR

Thanks Dragana for clarifying the point.
Attachment #8825728 - Flags: feedback?(mcmanus)
Thomas,
I found that OnStopRequest will be fired in the non-e10s version, but not in the e10s version.
MozReview-Commit-ID: 35jBJb1Q8Rz
Attachment #8825728 - Attachment is obsolete: true
Attachment #8826125 - Flags: feedback?(dd.mozilla)
Comment on attachment 8826125 [details] [diff] [review]
Make sure we resume the channel if there's no classifier callback

Francois is a better person to give you a feedback.
Attachment #8826125 - Flags: feedback?(dd.mozilla) → feedback?(francois)
Attachment #8826125 - Flags: feedback?(francois) → feedback+
Do you know why it works in non-e10s version and not in e10s?

Are you sure that Resume will not be called 2 times? The normal path and this shutdown path?
At least set mSuspendedChannel to false.
(In reply to Dragana Damjanovic [:dragana] from comment #39)
> Do you know why it works in non-e10s version and not in e10s?
> 

As mentioned, the XHR request is sent too late (when FF is already shutting down). In e10s, in my tests, the issue only occurs when we have such a late XHR request.
> Are you sure that Resume will not be called 2 times? The normal path and
> this shutdown path?
> At least set mSuspendedChannel to false.

Thanks Dragana.
I think that's already in the patch: setting mSuspendedChannel = false prevents Resume and Cancel from being called twice. But I have not requested review yet; I am running the XHR tests locally to see if there's still an intermittent leak involving nsUrlClassifierLookupCallback.

Thanks
Attachment #8826458 - Flags: review?(gpascutto)
Comment on attachment 8826458 [details]
Bug 1324820 - Make sure we resume the channel if there's no classifier callback

https://reviewboard.mozilla.org/r/104420/#review105674

r+ as per comment 32 saying this is the approach to use.
Attachment #8826458 - Flags: review?(gpascutto) → review+
No longer blocks: 1305486
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/bb0fc7b8fd04
Make sure we resume the channel if there's no classifier callback r=gcp
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/bb0fc7b8fd04
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Too late for 51 as we are now in RC mode but happy to take a patch to 52!
Thomas, could you fill in the uplift request? Thanks
status-firefox50: affected → wontfix
status-firefox51: affected → wontfix
status-firefox-esr45: affected → wontfix

The IPCError-browser | ShutDownKill signature has various causes, and what I fixed here is only one of them, specifically the case related to nsChannelClassifier. The status flags above may not be exact. I am not sure whether uplifting to 52 would help in this case.
Kamil, did you see similar issue in 52?
Flags: needinfo?(tnguyen) → needinfo?(kjozwiak)
>> Kamil, did you see similar issue in 52?

Thomas, I couldn't reproduce the crash using the STR from comment#0 with the other release channels (listed below). It seems like this particular case only affected m-c.

I made sure I could still reproduce the original issue with a version of m-c before comment#44. Once I reproduced the crash, I went through the following channels:

* fx52.0a2 - couldn't reproduce
** buildid: 20170120004016, changeset: b2e633d19d68

* fx51.0b14 - couldn't reproduce
** buildid: 20170118123726, changeset: ea82b5e20cbb

* fx50.0.2 - couldn't reproduce
** buildid: 20161129173726, changeset: cc272f7d48d3
Flags: needinfo?(kjozwiak)
The status flags were set automatically by a bot in this bug. Given Kamil's comments and the fix at hand, I'm setting them all to unaffected to get this off the radar.
Depends on: 1339845
Depends on: 1336240