Closed Bug 1324820 Opened 8 years ago Closed 8 years ago

IPCError-browser | ShutDownKill received when closing FX while a gif is autoplaying from Twitter moments

Categories

(Core :: DOM: Content Processes, defect, P1)

53 Branch
defect

Tracking


RESOLVED FIXED
mozilla53
Tracking Status
platform-rel --- ?
firefox-esr45 --- unaffected
firefox50 --- unaffected
firefox51 --- unaffected
firefox52 --- unaffected
firefox53 --- fixed

People

(Reporter: kjozwiak, Assigned: tnguyen)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [platform-rel-Twitter])

Crash Data

Attachments

(2 files, 3 obsolete files)

Created a new bug as per bug 1279293 comment 63. When you close fx while the browser is autoplaying a gif from one of Twitter's "Moment" pages, you'll receive the [@ IPCError-browser | ShutDownKill] crash every time the browser is shut down. I can reproduce this 100% of the time using the following STR:

* install the latest version of m-c
* open https://twitter.com/i/moments/810878086774456320
* let one of the random gifs start autoplaying and quickly shut down FX

You'll notice that the shutdown process hangs and it takes several seconds for fx to close correctly.

I've created a quick video of the STR that I used to reproduce the crash on a brand new m-c installation under macOS 10.12.2:
* https://youtu.be/SQ6x5iWVaqg

Crash Reports:

* bp-ff8ee990-5b23-43c3-a3e8-4e48e2161219
* bp-1146cec8-6ed7-42cf-9571-67f872161219
* bp-1384ea42-7d95-46f8-be8b-a3d5e2161219
* bp-e296be83-7fa1-4ede-afb0-455cb2161219
Has STR: --- → yes
Summary: IPCError-browser | ShutDownKill received when closing FX while a gif is autoplaying from Twitter oments → IPCError-browser | ShutDownKill received when closing FX while a gif is autoplaying from Twitter moments
Flags: needinfo?(tnguyen)
Thanks for reporting this, I am looking at this bug.
At first look, I think the reason is that XMLHttpRequestMainThread::SendInternal opened an HTTP channel (AsyncOpen2). Although FF is shutting down, the channel is still suspended and never gets its callback. The 6-second ShutDownKill timer then fires and causes a silent crash.
Flags: needinfo?(tnguyen)
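For context, the ShutDownKill signature comes from a parent-side shutdown watchdog: the parent asks the content process to shut down and force-kills it if it has not exited before a deadline, recording the crash annotation. The sketch below is a minimal, self-contained illustration of that watchdog pattern only; the class and member names are invented for this note, and the 6-second deadline is taken from the comment above, not from the actual Gecko implementation.

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>

struct ShutdownWatchdog {
  std::mutex m;
  std::condition_variable cv;
  bool childExited = false;

  // Called once the content process has finished its shutdown work.
  void ChildFinishedShutdown() {
    {
      std::lock_guard<std::mutex> lock(m);
      childExited = true;
    }
    cv.notify_one();
  }

  // Called by the parent after asking the child to shut down.
  void WaitOrKill() {
    std::unique_lock<std::mutex> lock(m);
    if (!cv.wait_for(lock, std::chrono::seconds(6),
                     [this] { return childExited; })) {
      // This is the point where the child would be killed and the
      // "IPCError-browser | ShutDownKill" annotation recorded.
      std::puts("ShutDownKill: child did not exit in time");
    }
  }
};

Anything that blocks the content process's shutdown path, such as a sync XHR spinning its event loop while waiting for a callback that never comes, will therefore show up under this signature.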
Assignee: nobody → tnguyen
Blocks: 1305486
Keywords: regression
Priority: -- → P1
This bug is a regression caused by bug 1315386, per Thomas.
MozReview-Commit-ID: IGsPCkrEEuG
This silent crash does not occur if I compile and build FF from source. I've just pushed my patch to try
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7c5f5fb81ba97448a92c3cc99018db65c41e68a&selectedJob=33499220

Could you please download the build from the try run
https://queue.taskcluster.net/v1/task/Ro5Ghbi-SR-o6TP6OUl39g/runs/0/artifacts/public/build/target.dmg
and see whether the patch fixes this bug?
Flags: needinfo?(kjozwiak)
I reproduced the original issue using the STR from comment#0 with the following build:

* fx53.0a1, buildid: 20161228030213, changeset: d7b6af32811b
** https://archive.mozilla.org/pub/firefox/nightly/2016/12/2016-12-28-03-02-13-mozilla-central/

Using the build that Thomas provided in comment#6, I went through the following test cases:

* going through STR using regular windows/tabs
* going through STR using default container window/tabs
* going through STR using custom container window/tabs
* going through STR using PB window/tabs
* opened the website from the STR in two different tabs and closed FX
** container/regular tab
** two default tabs
** container/PB
** default container/custom container
* checked about:crashes after every test case to ensure that new silent crashes haven't been introduced

Platforms Used: (I'll check Win/Linux once this lands into m-c)

* macOS 10.12.2 - PASSED

Thomas, it looks like the try build has fixed the issue. I tried several different test cases and couldn't reproduce the original crash.
Flags: needinfo?(kjozwiak)
Attachment #8822138 - Flags: review?(gpascutto)
Comment on attachment 8822138 [details] [diff] [review]
Resolve all pending lookups to fix IPCError-browser | ShutDownKill

This looks correct (DoLookup checks for shutting down and will immediately resolve with null if so), but at the same time it also feels quite icky. We're resolving callbacks after we've already been notified that we're supposed to be shutting down.

Tracing this, it seems to be caused by sync XHR, which would partially explain why the missing callback can block anything at all. I see there's work underway to resolve issues where sync XHR times out for any reason. Would it make sense to add checks here to prevent the spinning when the browser is shutting down, i.e. break out of the spin loop on shutdown?

I'm going to f? a few people here because this may be a case where the real problem is not in this code.

https://dxr.mozilla.org/mozilla-central/rev/143bb4b9249e528e658f6ccc449991794b8675f8/dom/xhr/XMLHttpRequestMainThread.cpp#3006
Attachment #8822138 - Flags: feedback?(bugs)
Attachment #8822138 - Flags: feedback?(amarchesini)
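To make the suggestion above concrete, here is a rough sketch of what "break out of the spin loop on shutdown" could look like at the spot linked above. This is not the actual XMLHttpRequestMainThread code: IsShuttingDown() is a stand-in for whatever shutdown signal a real fix would observe, while mFlagSyncLooping, NS_GetCurrentThread() and NS_ProcessNextEvent() are the existing names.

// Hypothetical sketch only: sync XHR spins the main-thread event loop until
// OnStopRequest clears mFlagSyncLooping; the idea is to also bail out once
// shutdown has started instead of waiting for the ShutDownKill timer.
nsIThread* thread = NS_GetCurrentThread();
while (mFlagSyncLooping) {
  if (IsShuttingDown()) {            // stand-in for the real shutdown check
    mFlagSyncLooping = false;        // stop spinning; treat the send as aborted
    break;
  }
  if (!NS_ProcessNextEvent(thread)) {
    break;                           // event processing failed; don't busy-wait
  }
}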
Comment on attachment 8822138 [details] [diff] [review]
Resolve all pending lookups to fix IPCError-browser | ShutDownKill

Can't give any feedback about this patch, but indeed, cancelling sync XHR when the TabChild/ContentChild is going away makes sense, if doable.
Note, we also need to check all the other cases where we spin the event loop: modal dialogs at least, but possibly also some random JS scripts.
Attachment #8822138 - Flags: feedback?(bugs)
Some discussion from #developers:

21:08 <gcp> the patch there is a fix, but I'm not sure if that's the correct fix
21:08 <gcp> i.e. going back and issuing a bunch of callbacks in shutdown code
21:08 <@bz> yeah, that's a bit odd.
21:09 <@bz> I'd think necko would cancel the channel and send OnStopRequest to the XHR....
21:09 <gcp> necko is probably blocked on the classifier returning from the classification callback
21:09 <gcp> though that wasn't in the stack
21:16 <gcp> oh, the original bug is already in DOM :P
21:18 <@bz> gcp: necko should cancel the classification and move on, I'd think
21:19 <@bz> gcp: just like it would if the user navigated away from the page!
21:19 — @bz really hopes we would not keep waiting on the classifier then.

Can we shelve the current patch and investigate why necko is unable to unstick itself on shutdown here? (note mozilla::dom::TabChild::RecvDestroy() in the stack)
platform-rel: --- → ?
Whiteboard: [platform-rel-Twitter]
Thanks gcp and smaug for looking at this. It is indeed blocked in a sync XHR until the channel stops and sends OnStopRequest to the XHR (which then sets mFlagSyncLooping to false).
I am considering not keeping the wait on the classifier and instead cancelling all the channels in the pending queue. However, we start shutting down the classifier very early, at "quit-application", so would we lose some tasks during shutdown, for example an XHR request made from an unload event? Hmm, I don't know, maybe I am wrong.
Crash volume for signature 'IPCError-browser | ShutDownKill':
 - nightly (version 53): 79125 crashes from 2016-11-14.
 - aurora  (version 52): 118636 crashes from 2016-11-14.
 - beta    (version 51): 412191 crashes from 2016-11-14.
 - release (version 50): 3115 crashes from 2016-11-01.
 - esr     (version 45): 132 crashes from 2016-07-06.

Crash volume on the last weeks (Week N is from 01-02 to 01-08):
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly   11114   11304   13841   11315   10841    9916    7074
 - aurora    18482   20233   21166   20277   19016   13352     152
 - beta       4854   11234  104993  116974   80635   70864   21296
 - release     381     401     650     560     456     367     157
 - esr           3       3       8       2       4      10       7

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly           #1
 - aurora            #1
 - beta              #1
 - release           #29
 - esr               #1
Hi :smaug
Should we cancel all pending channels in the classifier queue after the "quit-application" event?
Flags: needinfo?(bugs)
I guess so. Do we not cancel the channel automatically? Why doesn't Necko call OnStopRequest or such?
Or is the issue that OnStopRequest happens later, and classifier doesn't expect it to happen so late?
Flags: needinfo?(bugs)
Oh, the channel was not cancelled automatically. Another approach, which I think is more reasonable, is that we should not add any pending channel once we are shutting down.
That way, rather than cancelling pending requests, we simply do not accept any request after the "quit-application" event, though we may end up rejecting requests (if there are any) made from unload events.
Attachment #8822138 - Flags: review?(gpascutto)
Attachment #8822138 - Flags: feedback?(amarchesini)
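As a rough illustration of the "do not accept new work after quit-application" idea above: gShuttingDownThread is the flag already used in nsUrlClassifierDBService, but the method and signature shown here are assumed for illustration and are not taken from the patch.

// Sketch only: fail fast instead of queuing a lookup that can never complete.
NS_IMETHODIMP
nsUrlClassifierDBService::Lookup(nsIPrincipal* aPrincipal,
                                 const nsACString& aTables,
                                 nsIUrlClassifierCallback* aCallback)
{
  if (gShuttingDownThread) {
    // The caller (e.g. nsChannelClassifier) gets an error immediately and is
    // then responsible for cancelling/resuming its channel so OnStopRequest
    // can still be delivered.
    return NS_ERROR_ABORT;
  }
  // ... normal path: proxy the lookup to the worker thread ...
  return NS_OK;
}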
MozReview-Commit-ID: LfxKf8JNOCC
Attachment #8822138 - Attachment is obsolete: true
Comment on attachment 8824356 [details] [diff] [review]
Resolve all pending lookups and cancel pending channels

The crash signature did not appear before (or appeared only rarely); it started occurring frequently after we added the shutdown awareness (stop doing classifier work while shutting down).
I have not traced why necko does not send OnStopRequest (or cancel the channel), but I think we still need to resolve pending lookups, otherwise there might be a leak in the worker. Updated the patch to send NS_ERROR_ABORT to cancel the channel (alternatively, we could observe shutdown in nsChannelClassifier).
Attachment #8824356 - Flags: feedback?(gpascutto)
Comment on attachment 8824356 [details] [diff] [review]
Resolve all pending lookups and cancel pending channels

Review of attachment 8824356 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +1179,5 @@
>  
>  NS_IMETHODIMP
>  nsUrlClassifierClassifyCallback::HandleEvent(const nsACString& tables)
>  {
>    nsresult response = TablesToResponse(tables);

If we're shutting down, it would be nice to avoid this call too.

So perhaps `nsresult response = NS_OK;` and then do `response = TablesToResponse(tables);` after the `if (gShuttingDownThread)` check?
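In other words, the suggested shape would be roughly the following. This is a sketch only: the body of the shutdown branch and the rest of HandleEvent are elided, and the only point is the ordering of the TablesToResponse() call relative to the gShuttingDownThread check.

NS_IMETHODIMP
nsUrlClassifierClassifyCallback::HandleEvent(const nsACString& tables)
{
  nsresult response = NS_OK;
  if (gShuttingDownThread) {
    // Shutting down: skip TablesToResponse() entirely and complete with the
    // neutral response (exact behavior here is up to the patch).
  } else {
    response = TablesToResponse(tables);
  }
  // ... the remainder of HandleEvent uses `response` as before ...
  return NS_OK;
}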
Thanks Francois.
I could see the channel is cancelled during shutdown, but there is no OnStopRequest. IMHO, a suspended request is not supposed to dispatch events (such as OnStopRequest) until the request is resumed, which is what led to the hang in the sync XHR here.
For now, if we send NS_ERROR_ABORT to nsChannelClassifier we can cancel the channel and then resume it, which resolves the hang. But I am wondering whether we need to resume all suspended channels when FF is shutting down, to avoid the same accidental issue later.
Flags: needinfo?(dd.mozilla)
(In reply to Thomas Nguyen[:tnguyen] (use ni? plz) from comment #20)
> Thanks Francois.
> I could see the channel is cancelled during shutdown, but there is no
> OnStopRequest. IMHO, a suspended request is not supposed to dispatch events
> (such as OnStopRequest) until the request is resumed, which is what led to
> the hang in the sync XHR here.
To be clearer: after Cancel(), we need to call Resume() to send out OnStopRequest.
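A small sketch of that cancel-then-resume sequence: nsIRequest::Cancel/Resume are the real necko APIs, but the helper function and the mSuspendedChannel-style flag below are written for illustration and are not code from the patch.

// Illustration only, not the actual nsChannelClassifier change.
static void
CancelAndResumeOnShutdown(nsIChannel* aChannel, bool& aSuspendedChannel)
{
  if (!aChannel || !aSuspendedChannel) {
    return;
  }
  // Record the abort status first so listeners see NS_ERROR_ABORT...
  aChannel->Cancel(NS_ERROR_ABORT);
  // ...then resume: a suspended channel never dispatches OnStopRequest, so
  // resuming is what lets necko deliver it and unblock the waiting sync XHR.
  aChannel->Resume();
  // Clear the flag so the normal path cannot Cancel()/Resume() a second time
  // (see the later discussion about Resume being called twice).
  aSuspendedChannel = false;
}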
Comment on attachment 8824356 [details] [diff] [review]
Resolve all pending lookups and cancel pending channels

Review of attachment 8824356 [details] [diff] [review]:
-----------------------------------------------------------------

>but, I think we still need to resolve pending lookups, otherwise there might be a leak in worker

I don't understand. This patch will cause nsIUrlClassifierLookupCallback->LookupComplete() to be called on the entries in mPendingLookups. How does that stop a leak?
Attachment #8824356 - Flags: feedback?(gpascutto)
Oh, what I mean is: if there's a pending lookup that has not been resolved, in other words mPendingLookups (in mWorker) contains an element which was not freed.
That occurs during shutdown, and it might have an effect and cause some intermittent leak failures in tests, I think.
https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#309
It's an nsTArray, it will free the elements on destruction. There's no raw pointers in it, only an nsCOMPtr, so it can't leak. You're probably getting a leakcheck fail because shutdown hangs in the middle, so destructors are never run.

Unless I missed something obvious or there's side effects in other code, I do not believe this is a valid reasoning to be executing these callbacks if we're supposed to be shutting down.
Thanks gcp,
(In reply to Gian-Carlo Pascutto [:gcp] from comment #24)
> It's an nsTArray, it will free the elements on destruction. There's no raw
> pointers in it, only an nsCOMPtr, so it can't leak. You're probably getting
> a leakcheck fail because shutdown hangs in the middle, so destructors are
> never run.

Shawn told me that there seems to be a frequent intermittent leak in the XHR tests, and I found:

07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierClassifyCallback
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierDBService
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierDBServiceWorker
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierLookupCallback
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 10 nsWeakReference
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 nsXMLHttpRequestXPCOMifier
07:14:13     INFO - TEST-INFO | leakcheck | default process: leaked 1 xptiInterfaceInfo
Probably nsUrlClassifierLookupCallback leaks in the middle of shutting down
(In reply to Gian-Carlo Pascutto [:gcp] from comment #24)
> It's an nsTArray, it will free the elements on destruction. There's no raw
> pointers in it, only an nsCOMPtr, so it can't leak. You're probably getting
> a leakcheck fail because shutdown hangs in the middle, so destructors are
> never called

Oh, you are right. I looked at the code again. Thank you.
Shawn pushed a patch in bug 1324856 to resolve the loop in sync XHR (in XMLHttpRequestMainThread); the basic idea is to catch "shutdown" and then break out of the loop, but I am not sure it's a complete fix. I am taking a look at the necko part today to see why necko cannot unstick itself; I just turned on nsHttp and IPDL logging and am tracing the log.

- OnStopRequest is expected to be received in XHR, otherwise it will hang.
- In this case the parent channel is suspended (and so is its pump), and there will be no OnStopRequest until it's resumed.
- The same goes for the child channel: no OnStopRequest is received from the parent and dispatched to listeners.
- The child and parent channels are cancelled during shutdown but cannot delete themselves (I could not see any "Destroying HttpChannelParent" or HttpChannelChild logs related to these channels). These channels are supposed to be deleted after getting OnStopRequest and then sending/receiving the DeleteChannel and DeleteSelf messages.

The OnStopRequest event seems to be pending because the channel (or its pump) is suspended; I'm not sure whether resuming and then dispatching OnStopRequest (while FF is shutting down) is a good idea. I am unsure whether we need to resume the channel or the pump, in the parent or in the child, or just send out OnStopRequest (or do nothing and only fix XMLHttpRequestMainThread) in this case.
Probably I am completely wrong; I will continue looking at this to see what I can do, but I would like to ni? necko for ideas.
Hi Patrick, Dragana, do you have any idea about this bug?
Thanks
MozReview-Commit-ID: 4urL2hoxm3L
Attachment #8824356 - Attachment is obsolete: true
Comment on attachment 8825728 [details] [diff] [review]
Fire OnStopRequest when shutting down to avoid shutdown hang in XHR

Hi Patrick,
Would you please comment on the patch? Do we need to resume the suspended channel in this case?
Thanks
Attachment #8825728 - Flags: feedback?(mcmanus)
(In reply to Thomas Nguyen[:tnguyen] (use ni? plz) from comment #20)
> Thanks Francois.
> I could see the channel is cancelled during shutdown, but there is no
> OnStopRequest. IMHO, a suspended request is not supposed to dispatch events
> (such as OnStopRequest) until the request is resumed, which is what led to
> the hang in the sync XHR here.
> For now, if we send NS_ERROR_ABORT to nsChannelClassifier we can cancel the
> channel and then resume it, which resolves the hang. But I am wondering
> whether we need to resume all suspended channels when FF is shutting down,
> to avoid the same accidental issue later.

Necko expects that every suspend has a matching resume, and it will not send any OnStartRequest, OnDataAvailable or OnStopRequest while the channel is suspended. Whoever suspends a channel needs to take care of resuming it; any more complex rule would be hard to follow. Necko expects that a channel is resumed exactly as many times as it is suspended.
Flags: needinfo?(dd.mozilla)
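As an aside, one common way to keep that "every Suspend() has exactly one Resume()" invariant is to tie the pair to a scope guard. The class below is written for this note and is not code from the patch; nsIRequest::Suspend/Resume are the real APIs.

class MOZ_STACK_CLASS ScopedRequestSuspender final
{
public:
  explicit ScopedRequestSuspender(nsIRequest* aRequest)
    : mRequest(aRequest)
  {
    if (mRequest) {
      mRequest->Suspend();
    }
  }

  ~ScopedRequestSuspender()
  {
    if (mRequest) {
      mRequest->Resume();  // balanced resume even on early returns
    }
  }

private:
  nsCOMPtr<nsIRequest> mRequest;
};

In the classifier case the suspend spans an asynchronous lookup, so a scope guard does not apply directly and the code tracks the state with a boolean (mSuspendedChannel) instead, but the invariant it has to preserve is the same.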
Comment on attachment 8825728 [details] [diff] [review]
Fire OnStopRequest when shutting down to avoid shutdown hang in XHR

Review of attachment 8825728 [details] [diff] [review]:
-----------------------------------------------------------------

I think we should resolve this in a different way. We should see what has suspended the channel and find a way for exactly that code to resume the channel. From that point OnStopRequest will be fired and the XHR will not hang.
Comment on attachment 8825728 [details] [diff] [review]
Fire OnStopRequest when shutting down to avoid shutdown hang in XHR

Thanks Dragana for clarifying the point.
Attachment #8825728 - Flags: feedback?(mcmanus)
Thomas,
I found that OnStopRequest will be fired in the non-e10s version, but not in the e10s version.
MozReview-Commit-ID: 35jBJb1Q8Rz
Attachment #8825728 - Attachment is obsolete: true
Attachment #8826125 - Flags: feedback?(dd.mozilla)
Comment on attachment 8826125 [details] [diff] [review]
Make sure we resume the channel if there's no classifier callback

Francois is a better person to give you a feedback.
Attachment #8826125 - Flags: feedback?(dd.mozilla) → feedback?(francois)
Attachment #8826125 - Flags: feedback?(francois) → feedback+
Do you know why it works in non-e10s version and not in e10s?

Are you sure that Resume will not be called 2 times? The normal path and this shutdown path?
At least set mSuspendedChannel to false.
(In reply to Dragana Damjanovic [:dragana] from comment #39)
> Do you know why it works in non-e10s version and not in e10s?
> 

As mentioned, the XHR request is sent too late (when FF is already shutting down). In e10s, in my tests, the issue only occurs when we have such a late XHR request.
> Are you sure that Resume will not be called 2 times? The normal path and
> this shutdown path?
> At least set mSuspendedChannel to false.

Thanks Dragana.
I think that's already in the patch: setting mSuspendedChannel = false prevents Resume and Cancel from being called twice. But I have not requested review yet; I am running the XHR tests locally to see if there's still an intermittent leak involving nsUrlClassifierLookupCallback.

Thanks
Attachment #8826458 - Flags: review?(gpascutto)
Comment on attachment 8826458 [details]
Bug 1324820 - Make sure we resume the channel if there's no classifier callback

https://reviewboard.mozilla.org/r/104420/#review105674

r+ as per comment 32 saying this is the approach to use.
Attachment #8826458 - Flags: review?(gpascutto) → review+
No longer blocks: 1305486
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/bb0fc7b8fd04
Make sure we resume the channel if there's no classifier callback r=gcp
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/bb0fc7b8fd04
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Too late for 51 as we are now in RC mode but happy to take a patch to 52!
Thomas, could you fill in the uplift request? Thanks
status-firefox50: affected → wontfix
status-firefox51: affected → wontfix
status-firefox-esr45: affected → wontfix

The IPCError-browser | ShutDownKill signature has various causes, and what I fixed here is only one of them, specifically the case related to nsChannelClassifier. The status flags above may not be exact. I am not sure whether uplifting to 52 would help in this case.
Kamil, did you see similar issue in 52?
Flags: needinfo?(tnguyen) → needinfo?(kjozwiak)
>> Kamil, did you see similar issue in 52?

Thomas, I couldn't reproduce the crash using the STR from comment#0 with the other release channels (listed below). It seems like this particular case only affected m-c.

I made sure I could still reproduce the original issue with a version of m-c before comment#44. Once I reproduced the crash, I went through the following channels:

* fx52.0a2 - couldn't reproduce
** buildid: 20170120004016, changeset: b2e633d19d68

* fx51.0b14 - couldn't reproduce
** buildid: 20170118123726, changeset: ea82b5e20cbb

* fx50.0.2 - couldn't reproduce
** buildid: 20161129173726, changeset: cc272f7d48d3
Flags: needinfo?(kjozwiak)
The status flags were set automatically by a bot in this bug. Given Kamil's comments and the fix at hand, I'm setting them all to unaffected to get this off the radar.
Depends on: 1339845
Depends on: 1336240