Suspending a request from "http-on-examine-cached-response" does not suspend data delivery
Categories
(WebExtensions :: Request Handling, defect, P2)
Tracking
(firefox-esr60 wontfix, firefox66+ wontfix, firefox67+ wontfix, firefox68- fix-optional, firefox69 affected)
People
(Reporter: robwu, Unassigned, Mentored)
References
Details
(Keywords: crash)
Crash Data
Attachments
(2 obsolete files)
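For context on the summary: webRequest is driven by the http-on-* observer notifications (see comment 9), and suspending the channel from http-on-examine-cached-response is expected to pause data delivery as well. A minimal privileged-JS sketch of that expectation follows; it is illustrative only (standard Services/Ci/observer APIs, invented timing), not the reporter's actual test case.

// Hypothetical chrome-privileged sketch of the pattern in the summary.
const { Services } = ChromeUtils.import("resource://gre/modules/Services.jsm");
const { setTimeout } = ChromeUtils.import("resource://gre/modules/Timer.jsm");

const observer = {
  observe(subject, topic) {
    if (topic !== "http-on-examine-cached-response") {
      return;
    }
    const channel = subject.QueryInterface(Ci.nsIHttpChannel);
    // Suspending here should hold back OnStartRequest/OnDataAvailable until
    // resume() is called; per this report, data delivery for cached
    // responses continued anyway.
    channel.suspend();
    setTimeout(() => channel.resume(), 1000);
  },
};
Services.obs.addObserver(observer, "http-on-examine-cached-response");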
Updated•8 years ago
Comment 1•8 years ago
Comment 2•8 years ago
Comment 3•8 years ago
Comment 4•8 years ago
Updated•8 years ago
Comment 5•7 years ago
Updated•7 years ago
Comment 6•7 years ago (Reporter)
Comment 7•7 years ago (Reporter)
Updated•7 years ago (Reporter)
Comment 8•7 years ago
Updated•7 years ago
Updated•7 years ago
Updated•7 years ago
Updated•7 years ago
Comment 9•6 years ago
notes:
All of the http-on-* notifications (that webRequest uses) now handle async listeners; this was fixed in 66 by bug 1407384.
There is a reduction in crashes from 65 to 66, but some remain.
I'm not sure whether bug 1407384 contributed to lowering the crash rate, but I think it probably played a part.
:robwu do you have your test addon handy somewhere?
Comment 10•6 years ago (Reporter)
The original cause of the crash (initial report) was fixed by one of the patches from bug 1402944 (I can't remember which one).
Comment 6 and comment 7 contain the most relevant information for the current crashes.
The referenced add-on has been updated to version 4.0.3, but it still uses webRequest.filterResponseData:
https://robwu.nl/crxviewer/?crx=https%3A%2F%2Faddons.mozilla.org%2Ffirefox%2Fdownloads%2Flatest%2Fintegrated-gmail%2Fplatform%3A2%2Fintegrated-gmail.xpi&q=lib%2Fhttp-filter.js!filterResponseData&qf=lib%2Fhttp-filter.js&qb=1&qh=0&qi=1
If we look at crash counts by version after your patch was published, we see 442 crashes:
65.0.2 138
65.0.1 131
65.0 95
67.0a1 10
66.0b7 8
66.0.1 7
66.0b10 7
66.0rc1 7
66.0 6
66.0b13 5
66.0rc3 5
66.0b5 4
67.0b3 3
65.0b12 2
65.0rc2 2
66.0b11 2
66.0b9 2
66.0a1 1
66.0b12 1
66.0b14 1
66.0b3 1
66.0b6 1
66.0b8 1
67.0b4 1
60.5.1esr 1
Out of 442 crashes, only 77 are from a Firefox build with your patch. The add-on that I referenced in comment 6 is still among the top installed add-ons, so I believe that the root cause of this bug has still not been addressed:
Comment 11•6 years ago
This crash signature is spiking up into the hundreds over the past couple of hours.
Comment 12•6 years ago (Reporter)
I'm investigating. It seems that there are new unlisted add-ons that use this API and trigger this bug.
Comment 13•6 years ago
While Rob investigates generating a test case out of the new extensions, he'll take the bug; if it doesn't pan out, pass it back to me.
Updated•6 years ago
Comment 14•6 years ago (Reporter)
I looked at one of the add-ons from the crash reports (sorted by add-on ID) with a 5-digit install count, and compared the number of crashes to the number of add-on users from telemetry (on one of the most recent days). I think that the actual number of crashes is higher, since not all users enable crash reports. In the sample that I took, every crash report is from a different user.
The add-on that I picked creates a StreamFilter in a blocking webRequest event handler and loads a subresource thereafter, possibly multiple times during its runtime. The server's response is likely cacheable, but I'm not sure whether it actually gets cached, since the StreamFilter's close() method is called during the StreamFilter's onstart.
The crash rate (number of crash reports divided by number of users with the add-on) is 0.4%, and includes crash reports from release and beta.
I haven't succeeded in creating a test case to reproduce the bug, possibly because I tested on Linux: 89% of the crashes happen on Windows, and only 1% on Linux. (I only selected crash reports from before April 10, because the more recent crash reports skew the numbers; virtually all of those are on Windows.)
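A minimal sketch of the pattern described in this comment (create a StreamFilter in a blocking listener, call close() during onstart, then load a subresource); the event, match patterns, and URLs are assumptions, not the actual add-on's code:

// Hypothetical background-script sketch of the described pattern.
browser.webRequest.onBeforeRequest.addListener(
  details => {
    const filter = browser.webRequest.filterResponseData(details.requestId);
    // close() during onstart ends the response immediately, so it is
    // unclear whether the (likely cacheable) server response gets cached.
    filter.onstart = () => filter.close();
  },
  { urls: ["https://example.com/*"] }, // illustrative match pattern
  ["blocking"]
);

// The add-on then loads a subresource, possibly several times during its
// runtime; this URL is a placeholder.
fetch("https://example.com/subresource").then(response => response.text());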
Updated•6 years ago
Comment 15•6 years ago
A quick look at what might be happening.
Crashes happen on a release assert.
That seems possible if the channel is not traceable:
It can be set by either the parent or the child:
https://searchfox.org/mozilla-central/rev/1b2636e8517aa48422ed516affe4d28cb7fa220a/netwerk/protocol/http/HttpChannelChild.cpp#600
https://searchfox.org/mozilla-central/rev/1b2636e8517aa48422ed516affe4d28cb7fa220a/netwerk/protocol/http/nsHttpChannel.cpp#1739
So it seems to me that we shouldn't be asserting on the result; maybe just log the failure, and perhaps set some state flag or send an event back to the extension so it knows.
Comment 16•6 years ago
<John-Galt> Probably not. There's supposed to be an invariant that we can't get to that point if the channel isn't traceable.
Comment 17•6 years ago
Is this related to a particular patch? Is there anything we can do to mitigate the issue in 66?
This is a very high volume crash on beta and release.
Comment 18•6 years ago
Given the crash reports, this is an approximation of what I think should be done to fix the problem. The primary issue is that I have been unable to force-produce the crash and am left without a way to test this. HttpBaseChannel::SetNewListener prevents the addition of new filters once the stream filtering has started. The deduction is that there is a race in the IPC mechanism when calling SendAttachStreamFilter, resulting in nsHttpChannel::CallOnStartRequest running prior to RecvAttachStreamFilter.
Comment 19•6 years ago
This is a simpler approach for avoiding attaching new listeners after onstart.
Comment 20•6 years ago
I am back to believing that the async handling for some notifications is still not quite right. For example:
This shows a potential break-out prior to the channel being resumed if it was suspended.
And again here:
And I'm not clear what exactly is happening here:
I'm not sure we should be doing anything before the channel is resumed; we certainly should not break out of this and let any channel operations continue.
Honza, can you take a look at those and see what you think?
Updated•6 years ago
Updated•6 years ago
Comment 21•6 years ago
Are we still hitting the issue from comment 0, or what are the symptoms?
(In reply to Shane Caraveo (:mixedpuppy) from comment #20)
> I am back to believing that the async handling for some notifications is still not quite right. For example:
This one is fixed (bug 1407384).
> This shows a potential break-out prior to the channel being resumed if it was suspended.
I don't understand this sentence.
> And again here:
Again, fixed with bug 1407384.
> And I'm not clear what exactly is happening here:
This works too. We schedule the notification as an event on the main thread. We call ReadFromCache -> pump.AsyncRead; this schedules the onStartRequest event on the main thread after the notification event, so the on-* notification happens sooner than onStartRequest. If the on-* notification suspends the channel (which suspends the pump), there is no onStartRequest (or onData...) until the channel (and the pump) is resumed again.
> I'm not sure we should be doing anything before the channel is resumed, we certainly should not break out of this and let any channel operations continue.
The point is to not deliver any stream listener notifications, which we (I believe) fulfill. It would be great to have test cases proving otherwise, and best to turn them into tests.
> Honza, can you take a look at those and see what you think?
Comment 22•6 years ago
We discussed on IRC that all of these should do what is expected (as Necko understands the requirements), so the problem is probably somewhere else; it's likely racy between the parent and child process, and the child channel may get cancelled prematurely. I'm going to look at the chain of events with BT to better understand how the filtering and suspend/resume work.
Comment 23•6 years ago
So, as I said, we clearly have a possibility of a race between sending a message from the parent process to set a filter (new listener) on the child http channel and cancellation of the child channel. I captured a log - please read it as a stack trace (backwards) - with an added marker in HttpBaseChannel::SetNewListener showing that we:
- Set a new listener (search for the second instance of SetNewListener) on the parent process in nsHttpChannel; this happens from inside the code handling http-on-modify-request.
- That code also sends - through idle-dispatch - a message to the WebExtension process, which then sends an InitStreamFilter message back to the parent process; that ends up in nsHttpChannel::AttachStreamFilter, which sends an AttachStreamFilter message to the http child channel.
The http child channel can be in whatever state. If it has been cancelled (page load cancel, reload, navigation), it may have already notified OnStopRequest and released its listener. The cancellation message is pending somewhere on the way to the parent process, but it doesn't matter, as checking the cancellation state in nsHttpChannel::AttachStreamFilter would still be racy.
So, the fix here is to check whether the child channel has already been canceled. It's valid to fail (bypass) setting the filter in that case, as the whole channel pair will go away soon anyway. I have no idea how we should inform the extension about this, or whether we should inform it at all that the filter can't be set and will not fire.
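As an extension-side illustration of that race, a hypothetical sequence might look like the sketch below; the listener, match patterns, and error handling are assumptions, not code from any of the add-ons in the crash reports:

// Hypothetical sketch: the filter is requested from a blocking listener, and
// the AttachStreamFilter IPC message races with cancellation of the child
// channel (stopped page load, reload, navigation).
browser.webRequest.onBeforeRequest.addListener(
  details => {
    const filter = browser.webRequest.filterResponseData(details.requestId);
    filter.ondata = event => filter.write(event.data);
    filter.onstop = () => filter.close();
    // If the child channel was already cancelled and has fired OnStopRequest
    // by the time the attach message arrives, attaching cannot succeed; the
    // extension should see an error here rather than the process crashing.
    filter.onerror = () => console.log("filter error:", filter.error);
  },
  { urls: ["<all_urls>"], types: ["main_frame", "sub_frame"] },
  ["blocking"]
);

The extension cannot order these events itself, which matches the conclusion above that the check belongs on the platform side.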
Comment 24•6 years ago
(Shane, please see the previous comment first)
Note that the log was captured while running an updated browser/components/extensions/test/browser/browser_ext_webRequest_filterResponseData.js test in a debug build.
Shane, are we keeping the nsHttpChannel on the parent process suspended for the whole time until we set the listener on the child process through this quite complicated event chain? If yes, then the theory about possible premature cancellation is true. If not, then it's a bug in your channel suspension logic.
Comment 25•6 years ago
Shane, did you have a chance to look at this?
Comment 26•6 years ago
(In reply to Honza Bambas (:mayhemer) from comment #24)
> (Shane, please see the previous comment first)
> Note that the log was captured while running an updated browser/components/extensions/test/browser/browser_ext_webRequest_filterResponseData.js test in a debug build.
> Shane, are we keeping the nsHttpChannel on the parent process suspended for the whole time until we set the listener on the child process through this quite complicated event chain? If yes, then the theory about possible premature cancellation is true. If not, then it's a bug in your channel suspension logic.
The channel is suspended until the API promise is resolved.
I started down the road of bubbling up the error if SetNewListener fails in D27995, because this is racy. But really it would be better to ensure the child is suspended so it cannot race; I'm not clear whether that is possible.
I'm hoping Kris can chime in here.
Comment 27•6 years ago
I'll capture logs for the appropriate suspend and resume calls to see if those actually surround (and wait for) setting the filter on the child. Though, if we don't suspend the child channel very early as well, the child channel can still get cancelled before we get to the point of setting the filter (and thus hit the crash).
Please note one important fact: suspending the channel on the parent process does NOT suspend the child channel! Canceling a suspended channel will not fire onStopRequest until it is resumed again.
Even if we did suspend it, we would not be able to avoid the early cancellation race on the child process. This whole mechanism likely has to be redesigned...
Comment 28•6 years ago
(In reply to Shane Caraveo (:mixedpuppy) from comment #26)
> (In reply to Honza Bambas (:mayhemer) from comment #24)
>> (Shane, please see the previous comment first)
>> Note that the log was captured while running an updated browser/components/extensions/test/browser/browser_ext_webRequest_filterResponseData.js test in a debug build.
>> Shane, are we keeping the nsHttpChannel on the parent process suspended for the whole time until we set the listener on the child process through this quite complicated event chain? If yes, then the theory about possible premature cancellation is true. If not, then it's a bug in your channel suspension logic.
> The channel is suspended until the API promise is resolved.
> I started down the road of bubbling up the error if SetNewListener fails in D27995 because this is racy.
Looking at that patch, you are killing the process for no reason, IMO. You can't set a filter on something that has been canceled, no, but that is a legal state; not being able to is, IMO, not a fatal failure.
> But really it would be better to ensure the child is suspended so it cannot race, but I'm not clear if that is possible.
If we know in advance that a filter would be added, then we can suspend it, yes. But IMO it would be both complicated and possibly perf-regressing if we happen to have a false positive.
> I'm hoping Kris can chime in here.
Comment 29•6 years ago
(In reply to Honza Bambas (:mayhemer) from comment #28)
> (In reply to Shane Caraveo (:mixedpuppy) from comment #26)
>> (In reply to Honza Bambas (:mayhemer) from comment #24)
>>> (Shane, please see the previous comment first)
>>> Note that the log was captured while running an updated browser/components/extensions/test/browser/browser_ext_webRequest_filterResponseData.js test in a debug build.
>>> Shane, are we keeping the nsHttpChannel on the parent process suspended for the whole time until we set the listener on the child process through this quite complicated event chain? If yes, then the theory about possible premature cancellation is true. If not, then it's a bug in your channel suspension logic.
>> The channel is suspended until the API promise is resolved.
>> I started down the road of bubbling up the error if SetNewListener fails in D27995 because this is racy.
> Looking at that patch, you are killing the process for no reason, IMO. You can't set a filter on something that has been canceled, no, but that is a legal state; not being able to is, IMO, not a fatal failure.
I know and agree. The patches that are obsoleted on this bug were looking at removing that. Kris feels that should never happen, and that the underlying problem needs to be fixed. That's why I've ni? him again.
>> But really it would be better to ensure the child is suspended so it cannot race, but I'm not clear if that is possible.
> If we know in advance that a filter would be added, then we can suspend it, yes. But IMO it would be both complicated and possibly perf-regressing if we happen to have a false positive.
I think the issue here for WebExtensions is that it is always expected that the channel is completely suspended when we suspend it. Adding a filter is not the only change that can happen here. If the channel keeps working (i.e. can complete/end/cancel/whatever) while we have suspended it, we have a general problem that could show up in other ways in the future.
Comment 30•6 years ago
Wontfixing for 66 at this point but we may still be able to take a late patch in 67 (or consider for 67 dot releases).
Comment 31•6 years ago
Crashes on the beta channel went down by an order of magnitude over our last 4 betas (from 300/day to 30/day), so this is not blocking the release. We are shipping soon, so wontfix for 67.
Comment 32•6 years ago (Reporter)
The graph from crash-stats still shows far more crashes (10x+) than before the spike (spike = comment 11).
After excluding crash reports with blocklisted extensions that are known to trigger this bug, the number of crashes is somewhat more reasonable (less than 2x compared to before the spike):
- crash stats query: 4/13 - 5/13 - 441 crashes
- crash stats query: 3/13 - 4/13 - 351 crashes
- crash stats query: 2/13 - 3/13 - 248 crashes
- crash stats query: 1/13 - 2/13 - 274 crashes
Comment 33•6 years ago
Crash volume is now very low. I don't think we need to keep tracking. Shane, do you think this is still a P1 issue?
Comment 34•6 years ago
I think it certainly needs to be addressed, maybe not P1 if the volume is low enough to live with.
Updated•6 years ago
Comment 37•5 years ago
I think this is already fixed; worth confirming first.
Comment 38•5 years ago
The crashes happening in 80 are due to the channel not implementing nsITraceableChannel.
They also generally seem to stem from nsAsyncVerifyRedirectCallbackEvent, except for one from mozilla::dom::ContentChild::RecvCrossProcessRedirect.
Understanding what leads to a non-traceable channel would be helpful.
Comment 39•4 years ago (Reporter)
Per comment 9, the bug as described in the summary has been resolved.
This crash reason has recurred and been resolved over the course of this bug with different causes, and we already have a more recent bug, bug 1673749, tracking a different case.
Since this bug is no longer actionable, I'll close it in favor of bug 1673749.
Description