Closed Bug 1409570 Opened 7 years ago Closed 6 years ago

Regression: Loading site using HTTP2 server push hangs page while loading resources

Categories

(Core :: Networking: HTTP, defect, P2)

56 Branch
defect

Tracking

()

VERIFIED FIXED
mozilla64
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- verified
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- verified
firefox64 --- verified

People

(Reporter: keanulee517, Assigned: u408661)

References

Details

(Keywords: regression, Whiteboard: [necko-triaged])

Attachments

(4 files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36

Steps to reproduce:

Load https://20170915t110514-dot-polymer-shop.appspot.com/ (uses server push and link rel=preload HTTP headers) with Firefox 56.


Actual results:

Page never finishes loading. Looking at the Network panel in developer tools, only the first pushed resource (webcomponents-loader.js) is present and it never finishes loading. Eventually, there is a console warning that "Loading failed for the <script>" (see attached screenshot).




Expected results:

Page should load. This was the case with Firefox 55 and latest Chrome.
Thank you for reporting that bug and providing an url to test it!

Confirmed with Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0 ID:20171017220415

Mozregression results:
Last good revision: d94508124715f39a2e3f7a13aae12a12bf311855
First bad revision: 8a27ba2f85b4e95108d333d45bd452a4ec1bde9e
Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=d94508124715f39a2e3f7a13aae12a12bf311855&tochange=8a27ba2f85b4e95108d333d45bd452a4ec1bde9e

Nicholas, it seems that this regressed with bug 1397621, could you have a look please? Thanks
Status: UNCONFIRMED → NEW
Component: Untriaged → Networking: HTTP
Ever confirmed: true
Flags: needinfo?(hurley)
Product: Firefox → Core
Flags: needinfo?(jduell.mcbugs)
Priority: -- → P2
Taking a look now...
Assignee: nobody → hurley
Flags: needinfo?(jduell.mcbugs)
Whiteboard: [necko-triaged]
Nicholas I am resetting the bug priority, so if you could please have it match your plans, it would be appreciated. I had set it during last week's regression triage meeting to P2, but I can't remember why and nobody else could remember either. Sorry about that. Is this something that we would ship for 56 at this point?
Priority: P2 → --
I'm still investigating this (some higher-priority stuff came up that pushed this down the stack). I definitely wouldn't feel comfortable shipping anything to 56 for this.

Push is common enough that I would expect to see more complaints about this if it were a widespread problem, so we're probably dealing with a strange enough corner case here that this may not be worth uplifting even once I figure it out. I'm always willing to be proven wrong there, though :)
Flags: needinfo?(hurley)
Priority: -- → P2
This is definitely an issue for us--been trying to figure out what was happening since this bug began and only recently narrowed it down to this bug. Happens exactly as described.

I guess the only solution for us now is to disable Server Push for these clients--which seems incredibly hacky, but evidently necessary.
OK, I have a handle on what the cause is here. This is an edge case that I believe only happens when all of the following are true:

1. A pushed resource is required for the page to be considered "complete" (so not a resource that is retrieved via xhr or fetch)
2. That pushed resource is matched with a pulled stream *before* the resource is fully pushed (in this case, the match happens before the pushed data even starts coming in).
2a. The matched pushed stream has all its data consumed as it comes in (so buffering is not strictly necessary, but we do it anyway).
3. The pushed stream is closed with a single padded (but otherwise empty) DATA frame that has the FIN bit set.
4. The resource that was pushed does *not* have a Content-Length header (so the transaction can't detect EOS itself).

(NB: item 2 may or may not be strictly necessary here, though I've never seen this behaviour without 2 being true. Any fix will be independent of item 2 anyway, so it doesn't matter much.)

When this happens, we never alert the pushed stream to the fact that it has received a FIN before calling CleanupStream, and its WriteSegments never gets called (since we are just discarding padding).

I think the right approach here is, in the DISCARDING_DATA_FRAME_PADDING path in Http2Session::WriteSegments, we need to make sure the pushed stream knows its FIN is set. Then, I believe in Http2PushedStream::DeferCleanup, we need to only return true when mDeferCleanupOnSuccess == true AND (this is the new part) there is still unread data in the push buffer. That may be done by setting mDeferCleanupOnSuccess = false when we set the FIN bit and all the data in the push buffer is used.

Hopefully there will be a patch up by EOD today. This could get hairier than expected, though (padding edge cases tend to do that...)
Note that in further testing of a fix, I have found that #2 in comment 6 is not necessary.
Even better, it looks we only need to set the FIN bit on the pushed stream, and the rest is taken care of. Patch will be posted soon, after I get it cleaned up.
Argh, the patch I just uploaded somehow ended up with all my debugging cruft in it (which should've been cleaned out). I'll have to fix it and post a new version. *facepalm*
Attachment #8925736 - Flags: review?(mcmanus)
Attachment #8925736 - Attachment is obsolete: true
Comment on attachment 8925736 [details]
Bug 1409570 - Ensure a pushed stream knows we've reached EOS.

https://reviewboard.mozilla.org/r/196882/#review202474
Attachment #8925736 - Flags: review?(mcmanus) → review+
Pushed by hurley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/140a38df3d56
Ensure a pushed stream knows we've reached EOS. r=mcmanus
https://hg.mozilla.org/mozilla-central/rev/140a38df3d56
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Flags: qe-verify+
I verified this issue using Latest Nightly 59.0a1 with Build ID 20171113220112 on Windows 10 x64 and Mac OS X 10.13.
I will mark this as verified fixed.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
Sorry, I made a confusion. It's already reproducible on 58.0b3 but on 59.0a1 the bug is fixed. I will set the flag for qe verify+ and verified on 59.0a1.
Status: VERIFIED → RESOLVED
Closed: 7 years ago7 years ago
Flags: qe-verify+
I'm still seeing an instance this on https://www.questly.com/

You can load the site without the link headers to confirm they're the issue with this flag: https://www.questly.com/?preload=false
(In reply to sweetlikepete from comment #17)
> I'm still seeing an instance this on https://www.questly.com/
> 
> You can load the site without the link headers to confirm they're the issue
> with this flag: https://www.questly.com/?preload=false

Tested in Firefox Nightly 59.0a1 (2017-11-16) (64-bit) OSX 10.13.1
I retested this issue on different machines using different Firefox versions. Here are the results:
1. On Mac OS X 10.12 using clean profiles, with
     - Latest Nightly Build ID  20171126220311 Version  59.0a1 - bug is NOT reproducible
     - Lastest Firefox Beta on Mac OS X 10.12 - Version  58.0b6 Build ID  20171123161455 - bug is reproducible. 
2. On another machine with the same Mac OS X 10.12.6 using clean profiles, with 
     - Latest Nightly 59.0a1 Build ID 20171126220311 - bug is reproducible 
     - Latest Firefox Beta 58.0b6 build ID 20171123161455 - bug is NOT fixed.
3. On Windows 10 x64 with FF clean profile with
     - Latest Nightly 59.0a1 with Build ID 20171116220410 - bug is reproducible
     - Latest Firefox Beta 58.0b6 with build ID 20171123161455 - bug is reproducible. 

Nicholas, can you take a look at this issue?
Flags: needinfo?(hurley)
I managed to reproduce the bug using an older version of Nightly (2017-10-17) on Windows 10 x64.
I retested everything on Latest Nightly 60.0a1 and beta 59.0b5 using Windows 10 x64, macOS 10.13 and Ubuntu 16.04 x64, but the bug reproduced every time. The page never loaded and in the error  found in the attachment was thrown in the Browser Console.
This is NOT fixed - tried loading https://20170915t110514-dot-polymer-shop.appspot.com/ with Firefox 59.0.2. See original bug description.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I can confirm that the bug is still reproducing on latest Nightly 61.0a1, beta 60.0b13 and Firefox 59.0.2 using Windows 10 x64, Ubuntu 16.04 x64 and macOS 10.13. 

I would like to mention two things, because I didn't notice them the first time: 
- The error is thrown in the Browser Console only after you refresh or reload the page or you restart the browser. 
- On macOS 10.13 sometimes the page loads after you refresh the browser, but in the majority of the tries it doesn't load.
Target Milestone: mozilla58 → ---
Giving this a poke, it's long-standing but seems we might have new info from comment 22. I'll email nwgh.
Jason, can you help reprioritize this?
I can't reproduce either with the URL from comment 21, or with the URL from comment 17, even after multiple browser restarts and page reloads. Reporter, are you still able to reproduce this with your original URL (even if intermittently)?

I'll note, I do see pushes coming through in the HTTP logs, so push has not been disabled, just that whatever corner case was being hit is not being hit for some reason right now.
Flags: needinfo?(hurley) → needinfo?(keanulee517)
Still consistently cannot load https://20170915t110514-dot-polymer-shop.appspot.com/ with Firefox 62.0.2 (64-bit) on macOS 10.13.6. There is no console error, but the JS and other assets do not load/show up on network timeline. For reference the expected behavior is to render a site similar to https://polymer-shop.appspot.com/ (which is a newer version with push disabled when Firefox UA is detected).
Flags: needinfo?(keanulee517)
Thanks.

So, this is interesting - I just tried with release, and I can reproduce there (as well as on beta and nightly downloaded from mozilla.org), but *not* on a local debug build. That's... a little terrifying, to be honest. Hopefully I'll be able to make some progress just using http logs and a packet capture (which is all I can get out of the official builds).
Note to self: when debugging this issue with logs, ensure that "sync" is NOT in the log modules - sync logging apparently changes timing enough to make this hard (if not impossible) to reproduce. Async logging, though, keeps the timing such that I can reproduce the bug pretty reliably.
Alright, it appears that what's going on is still pretty much exactly the same as I described in comment 6 (here's a refresher for those who don't wish to scroll back):

> 1. A pushed resource is required for the page to be considered "complete" (so not a resource that is retrieved via xhr or fetch)
> 2. That pushed resource is matched with a pulled stream *before* the resource is fully pushed (in this case, the match happens before the pushed data even starts coming in).
> 2a. The matched pushed stream has all its data consumed as it comes in (so buffering is not strictly necessary, but we do it anyway).
> 3. The pushed stream is closed with a single padded (but otherwise empty) DATA frame that has the FIN bit set.
> 4. The resource that was pushed does *not* have a Content-Length header (so the transaction can't detect EOS itself).

Back in comment 6, I said I was not sure that 2/2a are necessary for this behaviour to manifest (and neither of them are). However, the fix that I landed does not handle things properly if both 2 and 2a are true - that is to say, the following order of events:

1. We receive PUSH_PROMISE
2. We generate a transaction that gets matched to the PUSH_PROMISE
3. The push receives data
4. The transaction handles all the data received in step 3
[Repeat 3 and 4 as necessary for the length of the resource, but *both* must happen together]
5. The push receives a padding-only frame with the END_STREAM flag set

At this point, the pushed stream is marked as done (per the original patch), and the transaction doesn't know there's nothing left to read (since there's no content-length header). Also, in this case, the pushed stream doesn't have anything left to give to the transaction (which has read all the data), so it never puts itself in the queue of things to be read from, which is how the transaction would normally find out we're at EOS.

So, the right thing to do is to inform the transaction explicitly. I'm not sure if that involves putting the pushed stream back in the list of pushes "ready to read" (so the transaction can get its EOS there), or if I'll have to jump through a few more hoops to make it work.

This is all my working theory, at least. Until I have code that fixes it, I can't be sure I'm right.
Flags: needinfo?(jduell.mcbugs)
OK, I think I have a build that works. Reporter, from your initial report, it looks like you're using a mac. If that's the case, can you please try the build at https://queue.taskcluster.net/v1/task/cpi-oZkyTuKqX8lejIGVcQ/runs/0/artifacts/public/build/target.dmg and see if it works for you (it worked for me locally, but since this can be timing-dependent, I want to double-check). You'll probably want to download the dmg using curl, to avoid gatekeeper on mac trying to prevent you from opening the test build (which is unsigned):

> curl -L -O https://queue.taskcluster.net/v1/task/cpi-oZkyTuKqX8lejIGVcQ/runs/0/artifacts/public/build/target.dmg

You'll also want to use a different profile, as sharing profiles between nightly and release/esr is not supported. From the directory where you drop the app (doesn't have to be /Applications):

> mkdir test-profile
> open -n ./Firefox\ Nightly.app --args -profile ./test-profile

Then test in that browser as you normally would.

If I was wrong, and you're not using a mac, let me know, and I'll get a build appropriate for you.

Thanks!
Flags: needinfo?(keanulee517)
I tried the build - seems to work fine for me now. Thanks!
Flags: needinfo?(keanulee517)
There was an earlier fix to this, that fixed part of the issue, but that
fix was racy. In the case where the transaction was matched with the
pushed stream before the pushed stream received its END_STREAM, and the
response headers did not include a content-length, the transaction would
never notice that the data was done being sent. When that transaction
was necessary for the load event to fire, the page would get stuck in
the loading state until the user explicitly cancelled.

This new patch ensures that the transaction will notice the EOS by
making sure the pushed stream gets inserted into the list of push
streams with data in the case described above. (The previous patch,
which is still in the tree, is still necessary, but not sufficient, to
fix the issue.)
Comment on attachment 9013335 [details]
Ensure that transactions matched with http/2 pushed streams are properly finished.

Daniel Stenberg [:bagder] has approved the revision.
Attachment #9013335 - Flags: review+
Pushed by hurley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2f9e7263731f
Ensure that transactions matched with http/2 pushed streams are properly finished. r=bagder
https://hg.mozilla.org/mozilla-central/rev/2f9e7263731f
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Comment on attachment 9013335 [details]
Ensure that transactions matched with http/2 pushed streams are properly finished.

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: http/2

User impact if declined: Potential pageload hangs with certain http/2 push scenarios.

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): Targeted fix to ensure we funnel end-of-stream all the way through so pageload can complete. Has not yet caused any issues on nightly.

String changes made/needed: None
Attachment #9013335 - Flags: approval-mozilla-beta?
Comment on attachment 9013335 [details]
Ensure that transactions matched with http/2 pushed streams are properly finished.

[ESR Uplift Approval Request]

If this is not a sec:{high,crit} bug, please state case for ESR consideration: Potential for pageload hangs when loading over http/2 in certain scenarios involving pushed resources

User impact if declined: See case for consideration

Fix Landed on Version: 64

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): Targeted fix to ensure end-of-stream gets all the way through the process as it usually does.

String or UUID changes made by this patch: none
Attachment #9013335 - Flags: approval-mozilla-esr60?
Comment on attachment 9013335 [details]
Ensure that transactions matched with http/2 pushed streams are properly finished.

Minimak fix for a known issue, uplift approved for 63 beta 13, thanks.
Attachment #9013335 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Verified with the following specs:

Windows 10 x64, Ubuntu 18.10 x64, and Mac OS X 10.13 
Version 	64.0a1
Build ID 	20181005102516

Still checking 63.0b13.
It looks like 63.0b13 has not landed yet. So I will verify this bug when 63.0b13 lands.
(In reply to Grover Wimberly IV [:Grover-QA] from comment #41)
> It looks like 63.0b13 has not landed yet. So I will verify this bug when
> 63.0b13 lands.

You can use builds from Treeherder if you want to verify the bug prior to the actual builds being available.
Comment on attachment 9013335 [details]
Ensure that transactions matched with http/2 pushed streams are properly finished.

Avoid possible pageload hangs on sites using http/2. Approved for ESR 60.3.
Attachment #9013335 - Flags: approval-mozilla-esr60? → approval-mozilla-esr60+
Managed to reproduce on an older build.

Verified the issue on win10x64, Ubuntu 16.04, MacOS 10.11 with 62.0.3, 63.0b13, 64.0a1, 60.2.3esr_build(Treeherder).
Can confirm that there are no issues regarding the page-load anymore.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: