Open Bug 1403546 Opened 2 years ago Updated 3 months ago

Suspending a request from "http-on-examine-cached-response" does not suspend data delivery

Categories

(WebExtensions :: Request Handling, defect, P2, critical)

57 Branch
Unspecified
Linux

Tracking

(firefox-esr60 wontfix, firefox66+ wontfix, firefox67+ wontfix, firefox68- fix-optional, firefox69 affected)


People

(Reporter: robwu, Unassigned, Mentored, NeedInfo)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 obsolete files)

This bug was filed from the Socorro interface and is report bp-3620a02a-a704-48ab-879e-488870170927.
=============================================================

(I submitted that crash report from Nightly).

While a page was loading (and the request was suspended because of a webRequest listener), I reloaded the extension.
The extension attached a StreamFilter in the webRequest.onHeadersReceived listener using browser.webRequest.filterResponseData (I don't know for sure whether this crash is the result of the StreamFilter from before or after the reload, but based on the stack trace, I think it is most likely after reloading).

The extension code is very similar to https://bugzilla.mozilla.org/show_bug.cgi?id=1403505#c1.
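
For reference, the general shape of such a listener is sketched below (a minimal, hypothetical pass-through example, not the actual code from that comment):

  // Hypothetical sketch: attach a StreamFilter from a blocking
  // onHeadersReceived listener and pass the response body through unchanged.
  browser.webRequest.onHeadersReceived.addListener(
    details => {
      const filter = browser.webRequest.filterResponseData(details.requestId);
      filter.ondata = event => filter.write(event.data); // forward each chunk unmodified
      filter.onstop = () => filter.close();              // finish the response
      return {};                                         // leave the headers untouched
    },
    { urls: ["<all_urls>"] },
    ["blocking", "responseHeaders"]
  );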


The crash is due to the following assertion in mozilla::extensions::StreamFilterParent::Init:

  nsresult rv = traceable->SetNewListener(this, getter_AddRefs(mOrigListener));
  MOZ_RELEASE_ASSERT(NS_SUCCEEDED(rv));

There are probably not that many extensions in the wild using webRequest.filterResponseData, so currently the crash rates are probably low. But if a big extension like NoScript starts using it, then this crash could affect many users.
Summary: Crash in mozilla::extensions::StreamFilterParent::Init → Suspending a request from "http-on-examine-cached-response" does not suspend data delivery
Marking P1 based on what kmag said to me in chat.
Assignee: nobody → kmaglione+bmo
Priority: -- → P1
There are still http-on- notifications that do not check suspend/cancel appropriately (if at all) in httpchannel.  There's quite a bit of complexity in there.  It might be a good idea to list which ones are not handling it, and prevent suspend/cancel in those cases until httpchannel can be fixed.
Specifically, both http-on-examine-cached-response and http-on-examine-merged-response will not work with suspend/cancel. There are a couple of other problematic areas that we would hit much less often (on-modify in authretry, and on-opening, which we don't use).

I think if we set channel.fromCache, we can also disable the ability to suspend/cancel the channel.
Depends on: 1407384
The crash is fixed by bug 1402944, which ensures that we consistently unregister channels after data delivery is started. It might be worth uplifting a separate fix to 57, though.
Depends on: 1402944
Over the last few days these crashes have been showing up a couple of times in 58.0b with firefox@tampermonkey.net 4.5.5637beta.
Product: Toolkit → WebExtensions
The number of crashes has shot up significantly since 25 July, which coincides with an update of an add-on:

In the aggregations, the highest-ranked add-ons that are not Mozilla system add-ons are (period 25 Jul - 10 Aug):

rank  addon ID:version                                count  %
10.   {28197867-b1ef-4140-8e3b-55c45b9c8460}:4.0.0    132    47.14%
11.   {28197867-b1ef-4140-8e3b-55c45b9c8460}:4.0.1    101    36.07%

This is the "Integrated Inbox for Gmail and Google Apps" add-on, and these two versions were released on 24 July and 25 July:
https://addons.mozilla.org/en-US/firefox/addon/integrated-gmail/versions/4.0.1

Relevant code snippet (usage of webRequest.filterResponseData) (snippets are identical for 4.0.0 and 4.0.1):
https://robwu.nl/crxviewer/crxviewer.html?crx=https%3A%2F%2Faddons.mozilla.org%2Ffirefox%2Fdownloads%2Ffile%2F1025085%2Fintegrated_inbox_for_gmail_and_google_apps-4.0.1-an%2Bfx.xpi&q=lib%2Fhttp-filter.js!filterResponseData&qf=lib%2Fhttp-filter.js&qb=0&qh=0&qi=1


PS. Link to crashstats. Click on "Aggregate on..." and enter "addons" to view the affected add-ons.
https://crash-stats.mozilla.com/signature/?signature=mozilla%3A%3Aextensions%3A%3AStreamFilterParent%3A%3AInit&date=>%3D2018-07-25T04%3A00%3A00.000Z&date=<2018-08-10T04%3A00%3A00.000Z#aggregations
I grabbed one of the minidumps ( a8d25a0b-e8fb-43ae-a71d-809b90180810 ) to slightly narrow down the crash reason:

(gdb) f
#0  mozilla::extensions::StreamFilterParent::Init (this=0x7f654caf9430, aChannel=<optimized out>) at /builds/worker/workspace/build/src/toolkit/components/extensions/webrequest/StreamFilterParent.cpp:195
195       MOZ_RELEASE_ASSERT(NS_SUCCEEDED(rv));
(gdb) p rv
$5 = -2147418113
(gdb) p NS_ERROR_UNEXPECTED
$6 = -2147418113

So apparently the return value of SetNewListener is NS_ERROR_UNEXPECTED, which only happens if this fails:
NS_ENSURE_STATE(mListener);
at https://searchfox.org/mozilla-central/rev/2466b82b729765fb0a3ab62f812c1a96a7362478/netwerk/protocol/http/HttpBaseChannel.cpp#3344




Of the crashes that involve the add-on from comment 6, almost every reported URL points to a YouTube video.
The extension from comment 6 immediately calls webRequest.filterResponseData from a blocking webRequest.onBeforeRequest, for many URLs including "*://*.youtube.com/*" of type "sub_frame".

The extension also clears the X-Frame-Options header and rewrites the Content-Security-Policy header, but that is probably unrelated.

When I create a minimal extension with this behavior and visit YouTube videos, I see that child frame requests are occasionally observed when an advertisement is being displayed (the upper-right corner of the page, above the list of videos),
e.g. https://www.youtube.com/ad_companion?render=video_wall_companion&content=foZPetIoy30&ad_video_id=PqYE3rrf9qY&adformat=15_2_1&polymer=1

And after signing in, I see a child frame request
to https://accounts.youtube.com/accounts/SetSIDFrame?...
with the parent/top frame being https://accounts.google.com/CheckCookie?hl=en&checkedDomains=youtube&checkConnection=youtube%3A243%3A1&pstMsg=1&chtml=LoginDoneHtml&service=youtube&continue=https%3A%2F%2Fwww.youtube.com%2Fsignin%3Faction_handle_signin%3Dtrue%26app%3Ddesktop%26next%3D%252F%26hl%3Den&gidl=XXX

I haven't locally observed the crashes.
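
A minimal extension with the behavior described above would look roughly like this (hypothetical sketch based on the description; the pass-through filter body is an assumption):

  // Hypothetical repro sketch: attach a pass-through StreamFilter from a
  // blocking onBeforeRequest listener for YouTube sub_frame requests.
  browser.webRequest.onBeforeRequest.addListener(
    details => {
      const filter = browser.webRequest.filterResponseData(details.requestId);
      filter.ondata = event => filter.write(event.data);
      filter.onstop = () => filter.close();
    },
    { urls: ["*://*.youtube.com/*"], types: ["sub_frame"] },
    ["blocking"]
  );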

When filtering the crash reports to exclude reports with the above add-on, I get fewer results over the same period (62 results vs 296). Only a few URLs are included in these reports, and 6 of them (almost all) are (intermediate) logout pages at Outlook (Microsoft's web mail client). When logging out of Outlook, https://login.live.com/logout.srf?... is loaded, which navigates elsewhere ( https://outlook.live.com/owa/csignout.aspx?lc=1033 ), which 302-redirects to another destination (msn.com).

The majority of these reports (51 out of 62) use uBlock Origin (versions 1.16.14, 1.16.16), which uses filterResponseData as follows (a rough sketch of this pattern follows the list):
- Calls filterResponseData during a blocking onHeadersReceived listener:
  https://github.com/gorhill/uBlock/blob/3a78e73e4b07f5a010f41445fde93da10e0e2a4d/src/js/traffic.js#L596
  https://github.com/gorhill/uBlock/blob/3a78e73e4b07f5a010f41445fde93da10e0e2a4d/src/js/traffic.js#L838-L889
- Only for main_frame and sub_frame documents.
- Only for 2XX responses that match a HTML filter - https://github.com/gorhill/uBlock/wiki/Static-filter-syntax#html-filters
- Modifies Cache-Control response header:
  https://github.com/gorhill/uBlock/blob/3a78e73e4b07f5a010f41445fde93da10e0e2a4d/src/js/traffic.js#L600-L617
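
For readers who don't want to follow the links, the pattern boils down to roughly the following (a hypothetical paraphrase, not uBlock Origin's actual code; htmlFilterApplies() is a stand-in for the HTML-filter match):

  // Hypothetical paraphrase of the pattern described above (not uBlock's code):
  // attach a filter only for 2xx document responses that match an HTML filter,
  // and rewrite Cache-Control so the modified body is not cached.
  browser.webRequest.onHeadersReceived.addListener(
    details => {
      if (details.statusCode < 200 || details.statusCode >= 300 ||
          !htmlFilterApplies(details)) {                  // htmlFilterApplies() is a stub
        return {};
      }
      const filter = browser.webRequest.filterResponseData(details.requestId);
      filter.ondata = event => filter.write(event.data);  // real code rewrites the HTML here
      filter.onstop = () => filter.close();

      const headers = details.responseHeaders.filter(
        h => h.name.toLowerCase() !== "cache-control");
      headers.push({ name: "Cache-Control", value: "no-cache, no-store, must-revalidate" });
      return { responseHeaders: headers };
    },
    { urls: ["<all_urls>"], types: ["main_frame", "sub_frame"] },
    ["blocking", "responseHeaders"]
  );
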
Crash Signature: [@ mozilla::extensions::StreamFilterParent::Init] → [@ mozilla::extensions::StreamFilterParent::Init] [@ void mozilla::extensions::StreamFilterParent::Init]
Kris, are you working on this bug?
Flags: needinfo?(kmaglione+bmo)
QA Contact: ddurst
QA Contact: ddurst
Assignee: kmaglione+bmo → mixedpuppy
Mentor: kmaglione+bmo
Flags: needinfo?(kmaglione+bmo)

notes:

All of the http-on notifications (that webrequest uses) handle async now, fixed in 66 in bug 1407384.
There is a reduction in crashes from 65 to 66, but some remain.
I'm not sure if bug 1407384 contributed to lowering the crash rate, but I think it probably played a part.

:robwu do you have your test addon handy somewhere?

Flags: needinfo?(rob)

The original cause of the crash (initial report) was fixed by one of the patches from bug 1402944 (I can't remember which one).

comment 6 and comment 7 contain the most relevant information for the current crashes.
The referenced add-on has been updated to version 4.0.3, but it still uses webRequest.filterResponseData:
https://robwu.nl/crxviewer/?crx=https%3A%2F%2Faddons.mozilla.org%2Ffirefox%2Fdownloads%2Flatest%2Fintegrated-gmail%2Fplatform%3A2%2Fintegrated-gmail.xpi&q=lib%2Fhttp-filter.js!filterResponseData&qf=lib%2Fhttp-filter.js&qb=1&qh=0&qi=1

If we look at the crashes reported after your patch was published, broken down by version, we see 442 crashes:

65.0.2  138
65.0.1  131
65.0     95

67.0a1   10
66.0b7    8
66.0.1    7
66.0b10   7
66.0rc1   7
66.0      6
66.0b13   5
66.0rc3   5
66.0b5    4
67.0b3    3
65.0b12   2
65.0rc2   2
66.0b11   2
66.0b9    2
66.0a1    1
66.0b12   1
66.0b14   1
66.0b3    1
66.0b6    1
66.0b8    1
67.0b4    1
60.5.1esr 1

Out of 442 crashes, only 77 are from a Firefox build with your patch. The add-on that I referenced in comment 6 is still among the top installed add-ons, so I believe that the root cause of this bug still has not been addressed:

https://crash-stats.mozilla.com/search/?signature=%3Dmozilla%3A%3Aextensions%3A%3AStreamFilterParent%3A%3AInit&signature=%3Dvoid%20mozilla%3A%3Aextensions%3A%3AStreamFilterParent%3A%3AInit&build_id=%3E%3D20190115221511&version=%2165.0.2&version=%2165.0.1&version=%2165.0&version=%2160.5.1esr&date=%3E%3D2018-12-27T19%3A02%3A00.000Z&date=%3C2019-03-27T19%3A02%3A00.000Z&_facets=signature&_facets=version&_facets=addons&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-addons

Flags: needinfo?(rob)

This crash signature has spiked up into the hundreds in the past couple of hours.

I'm investigating. It seems that there are new unlisted add-ons that use this API and trigger this bug.

While Rob investigates generating a test case from the new extensions, he'll take the bug; if it doesn't pan out, he'll pass it back to me.

Assignee: mixedpuppy → rob

I looked at one of the add-ons from the crash reports (sorted by add-on ID) with a 5-digit install count, and compared the number of crashes to the number of add-on users from telemetry (on one of the most recent days). I think that the actual number of crashes is higher, since not all users enable crash reports. Of the sample that I took, every crash report is from a different user.

The add-on that I picked creates a StreamFilter in a blocking webRequest event handler and loads a subresource thereafter, possibly multiple times during its runtime. The server's response is likely cacheable, but I'm not sure whether it is actually cached, since the StreamFilter's close() method is called during the StreamFilter's onstart.
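
The close-in-onstart pattern mentioned above looks roughly like this (hypothetical sketch; the real add-on's listener, match patterns, and conditions differ):

  // Hypothetical sketch: a StreamFilter that is closed in its onstart handler.
  // close() ends the response immediately, so the consumer sees an empty body
  // and no further data is delivered for this request.
  browser.webRequest.onBeforeRequest.addListener(
    details => {
      const filter = browser.webRequest.filterResponseData(details.requestId);
      filter.onstart = () => filter.close();
    },
    { urls: ["<all_urls>"] },
    ["blocking"]
  );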

The crash rate (# crash reports divided by # of users with the add-on) is 0.4%, and includes crash reports from release and beta.
I haven't succeeded in creating a test case to reproduce the bug, possibly because I tested on Linux: 89% of the crashes happen on Windows, and only 1% on Linux (I only selected the crash reports from before 10 April, because the recent crash reports would otherwise skew the numbers, as virtually all of those are on Windows).

Assignee: rob → mixedpuppy
Flags: needinfo?(kmaglione+bmo)

<John-Galt> Probably not. There's supposed to be an invariant that we can't get to that point if the channel isn't traceable

Flags: needinfo?(kmaglione+bmo)

Is this related to a particular patch? Is there anything we can do to mitigate the issue in 66?
This is a very high volume crash on beta and release.

See Also: → 1544315

Given the crash reports, this is an approximation of what I think should be done to fix the problem. The primary issue is that I have been unable to force-produce the crash and am left without a way to test this. HttpBaseChannel::SetNewListener prevents the addition of new filters once stream filtering has started. The deduction is that there is a race in the IPC mechanism when calling SendAttachStreamFilter, resulting in nsHttpChannel::CallOnStartRequest running prior to RecvAttachStreamFilter.

This is a simpler approach for avoiding attaching new listeners after onstart.

I am back to believing that the async handling for some notifications is still not quite right. For example:

https://searchfox.org/mozilla-central/rev/966590c35c19d3b7850c5444a31bf0fd68a7d0c4/netwerk/protocol/http/nsHttpChannel.cpp#3749-3760

This shows a potential break-out prior to the channel being resumed if it was suspended.

And again here:

https://searchfox.org/mozilla-central/rev/966590c35c19d3b7850c5444a31bf0fd68a7d0c4/netwerk/protocol/http/nsHttpChannel.cpp#3600-3619

and I'm not clear what exactly is happening here:

https://searchfox.org/mozilla-central/rev/966590c35c19d3b7850c5444a31bf0fd68a7d0c4/netwerk/protocol/http/nsHttpChannel.cpp#804

I'm not sure we should be doing anything before the channel is resumed; we certainly should not break out of this and let any channel operations continue.

Honza, can you take a look at those and see what you think?

Flags: needinfo?(honzab.moz)
Attachment #9059304 - Attachment is obsolete: true
Attachment #9059142 - Attachment is obsolete: true

Are we still hitting the issue from comment 0, or what are the symptoms?

(In reply to Shane Caraveo (:mixedpuppy) from comment #20)

> I am back to believing that the async handling for some notifications is still not quite right. For example:
>
> https://searchfox.org/mozilla-central/rev/966590c35c19d3b7850c5444a31bf0fd68a7d0c4/netwerk/protocol/http/nsHttpChannel.cpp#3749-3760

This one is fixed (bug 1407384).

> This shows a potential break-out prior to the channel being resumed if it was suspended.

I don't understand this sentence.

> And again here:
>
> https://searchfox.org/mozilla-central/rev/966590c35c19d3b7850c5444a31bf0fd68a7d0c4/netwerk/protocol/http/nsHttpChannel.cpp#3600-3619

again, fixed with bug 1407384

> and I'm not clear what exactly is happening here:
>
> https://searchfox.org/mozilla-central/rev/966590c35c19d3b7850c5444a31bf0fd68a7d0c4/netwerk/protocol/http/nsHttpChannel.cpp#804

This works too. We schedule the notification as an event on the main thread. We call ReadFromCache -> pump.AsyncRead; this schedules the onStartRequest event on the main thread after the notification event, so the on- notification happens sooner than onStartRequest. If the on- notification suspends the channel (which suspends the pump), there is no onStartRequest (onData...) until the channel (and the pump) is resumed again.

> I'm not sure we should be doing anything before the channel is resumed; we certainly should not break out of this and let any channel operations continue.

The point is to not deliver any stream listener notifications, which we (I believe) fulfill. It would be great to have test cases to prove otherwise, and best to turn them into tests.

> Honza, can you take a look at those and see what you think?

Flags: needinfo?(honzab.moz)

Discussed on IRC that all these should do what is expected (as necko understands the requirements), so the problem is probably somewhere else; likely it's a race between the parent and child process, and the child channel may get cancelled prematurely. Going to look at the chain of events with BT to better understand how the filtering and suspend/resume works.

So, as I said, we clearly have the possibility of a race between sending a message from the parent process to set a filter (new listener) on the child http channel and cancellation of the child channel. I captured a log - please read it as a stack trace (backwards) - with an added marker in HttpBaseChannel::SetNewListener showing that we:

  • Set a new listener (search for the second instance of SetNewListener) on the parent process in nsHttpChannel
  • this happens from inside the code handling http-on-modify-request
  • this code also sends - through idle-dispatch - a message to the webextension process, which then sends an InitStreamFilter message back to the parent process, which ends up in nsHttpChannel::AttachStreamFilter, which sends an AttachStreamFilter message to the http child channel.

The http child channel can be in whatever state. If it has been cancelled (page load cancel, reload, navigation), it may have already notified OnStopRequest and released its listener. The cancellation message is pending somewhere on the way to the parent process, but it doesn't matter, as checking the cancellation state in nsHttpChannel::AttachStreamFilter would still be racy.

So, the fix here is to check whether the child channel has already been canceled. It's valid to fail (bypass) setting the filter then, as the whole channel pair will go away soon anyway. No idea how we should inform the extension about this - I mean, whether to inform it at all that the filter can't be set and will not fire.

(Shane, please see the previous comment first)

Note that the log was captured while running an updated browser/components/extensions/test/browser/browser_ext_webRequest_filterResponseData.js test in a debug build.

Shane, are we keeping the nsHttpChannel on the parent process suspended for the whole time until we set the listener on the child process through this quite complicated event chain? If yes, then the theory about possible premature cancellation is true. If not, then it's a bug in your channel suspension logic.

Flags: needinfo?(mixedpuppy)

Shane, did you have a chance to look at this?

(In reply to Honza Bambas (:mayhemer) from comment #24)

> (Shane, please see the previous comment first)
>
> Note that the log was captured while running an updated browser/components/extensions/test/browser/browser_ext_webRequest_filterResponseData.js test in a debug build.
>
> Shane, are we keeping the nsHttpChannel on the parent process suspended for the whole time until we set the listener on the child process through this quite complicated event chain? If yes, then the theory about possible premature cancellation is true. If not, then it's a bug in your channel suspension logic.

The channel is suspended until the api promise is resolved.

I started down the road of bubbling up the error if setnewlistener failed in D27995 because this is racy. But really it would be better to ensure the child is suspended so it cannot race, but I'm not clear if that is possible.

I'm hoping Kris can chime in here.

Flags: needinfo?(mixedpuppy) → needinfo?(kmaglione+bmo)

I'll capture logs for the appropriate suspend and resume calls to see if those actually surround (and wait) for setting the filter on the child. Tho, if we don't suspend the child channel very early as well, the child channel still can get cancelled before we get to the point of setting the filter (and thus have the crash).

Please note one important fact: suspending the channel on the parent process does NOT suspend the child channel! Canceling a suspended channel will not fire onStopRequest until it is resumed again.

Even if we did suspend it, we would not be able to avoid the early cancellation race on the child process. This whole mechanism likely has to be redesigned...

(In reply to Shane Caraveo (:mixedpuppy) from comment #26)

> (In reply to Honza Bambas (:mayhemer) from comment #24)
> > (Shane, please see the previous comment first)
> >
> > Note that the log was captured while running an updated browser/components/extensions/test/browser/browser_ext_webRequest_filterResponseData.js test in a debug build.
> >
> > Shane, are we keeping the nsHttpChannel on the parent process suspended for the whole time until we set the listener on the child process through this quite complicated event chain? If yes, then the theory about possible premature cancellation is true. If not, then it's a bug in your channel suspension logic.
>
> The channel is suspended until the api promise is resolved.
>
> I started down the road of bubbling up the error if setnewlistener failed in D27995 because this is racy.

Looking at that patch, you are killing the process for no reason, IMO. You can't set a filter on something that has been canceled, no, but it is a legal state; not being able to is IMO not a fatal failure.

> But really it would be better to ensure the child is suspended so it cannot race, but I'm not clear if that is possible.

If we know in advance that a filter will be added, then we can suspend it, yes, but IMO it would be both complicated and possibly perf-regressing if we happen to have a false positive.

> I'm hoping Kris can chime in here.

(In reply to Honza Bambas (:mayhemer) from comment #28)

> (In reply to Shane Caraveo (:mixedpuppy) from comment #26)
> > (In reply to Honza Bambas (:mayhemer) from comment #24)
> > > (Shane, please see the previous comment first)
> > >
> > > Note that the log was captured while running an updated browser/components/extensions/test/browser/browser_ext_webRequest_filterResponseData.js test in a debug build.
> > >
> > > Shane, are we keeping the nsHttpChannel on the parent process suspended for the whole time until we set the listener on the child process through this quite complicated event chain? If yes, then the theory about possible premature cancellation is true. If not, then it's a bug in your channel suspension logic.
> >
> > The channel is suspended until the api promise is resolved.
> >
> > I started down the road of bubbling up the error if setnewlistener failed in D27995.
>
> Looking at that patch, you are killing the process for no reason, IMO. You can't set a filter on something that has been canceled, no, but it is a legal state; not being able to is IMO not a fatal failure.

I know and agree. The patches that are obsoleted on this bug were looking at removing that. Kris feels that should never happen, and that the underlying problem needs to be fixed. That's why I've ni? him again.

> > But really it would be better to ensure the child is suspended so it cannot race, but I'm not clear if that is possible.

> If we know in advance that a filter will be added, then we can suspend it, yes, but IMO it would be both complicated and possibly perf-regressing if we happen to have a false positive.

I think the issue here for webextensions is that it is always expected that the channel is completely suspended if we suspend the channel. Adding a filter is not the only change that can happen here. If the channel continues working (i.e. it can complete/end/cancel/whatever) when we have suspended it, we have a general problem that could show up in other ways in the future.

Wontfixing for 66 at this point but we may still be able to take a late patch in 67 (or consider for 67 dot releases).

Crashes on the beta channel went down by an order of magnitude in our last 4 betas (from 300/day to 30/day), so this is not blocking the release; we are shipping soon, so wontfix for 67.

The graph from crash-stats still shows an order of magnitude (10x+) more crashes than before the spike (spike = comment 11).
After excluding crash reports with blocklisted extensions that are known to trigger this bug, the number of crashes is somewhat more reasonable (less than 2x compared to before the spike):

Crash volume is now very low. I don't think we need to keep tracking. Shane, do you think this is still a P1 issue?

Flags: needinfo?(mixedpuppy)

I think it certainly needs to be addressed, maybe not P1 if the volume is low enough to live with.

Flags: needinfo?(mixedpuppy)
Assignee: mixedpuppy → nobody
Priority: P1 → P2