Closed Bug 1513135 Opened 5 years ago Closed 5 years ago

Downloads show as incomplete but network trace indicates data has arrived

Categories

(Core :: Networking, defect, P1)

63 Branch
defect

Tracking

()

VERIFIED FIXED
mozilla66
Tracking Status
firefox-esr60 --- unaffected
firefox64 --- wontfix
firefox65 --- verified
firefox66 --- verified

People

(Reporter: bmckeever, Assigned: CuveeHsu)

References

Details

(Keywords: regression, Whiteboard: [necko-triaged])

Attachments

(3 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:63.0) Gecko/20100101 Firefox/63.0

Steps to reproduce:

See attached video.  User indicates that they want to download a file, and my application sends that file over HTTP.


Actual results:

Firefox's download UI indicates that no data has arrived and a 0 byte ".part" file is created in the download folder.  A Wireshark trace (attached) shows that all of the data has arrived, and it looks ok to me.  The problem occurs about half the time with Firefox 63, and it seems to require the file to be above a certain size (a few MB).  It seems like a recent regression, I wasn't able to reproduce in FF 60.  Chrome downloads fine.


Expected results:

File should be saved to disk.
(In reply to bmm6o from comment #0)
> It seems like a recent regression, I wasn't able to reproduce in FF 60.

It would be useful if you could pinpoint the regression range.
https://mozilla.github.io/mozregression/quickstart.html
Has Regression Range: --- → no
Component: Untriaged → Downloads API
Keywords: regression
Product: Firefox → Toolkit
I ran mozregression, the last portion of output is:

2018-12-11T13:29:19: INFO : platform_buildid: 20180816031316
2018-12-11T13:29:19: INFO : platform_changeset: 45e540e5e0fc2bfca60fe015a546fdb6f2f01318
2018-12-11T13:29:19: INFO : platform_repository: https://hg.mozilla.org/integration/autoland
2018-12-11T13:29:19: INFO : platform_version: 63.0a1
2018-12-11T13:30:28: INFO : Narrowed inbound regression window from [483ff2d2, ec225468] (3 builds) to [483ff2d2, 45e540e5] (2 builds) (~1 steps left)
2018-12-11T13:30:28: DEBUG : Starting merge handling...
2018-12-11T13:30:28: DEBUG : Using url: https://hg.mozilla.org/integration/autoland/json-pushes?changeset=45e540e5e0fc2bfca60fe015a546fdb6f2f01318&full=1
2018-12-11T13:30:28: DEBUG : Found commit message:
Bug 1280629 - Part 1: Suspend the http channel if the child process is not able to consume on time r=dragana

Differential Revision: https://phabricator.services.mozilla.com/D2745

2018-12-11T13:30:28: DEBUG : Did not find a branch, checking all integration branches
2018-12-11T13:30:28: INFO : The bisection is done.
2018-12-11T13:30:28: INFO : Stopped
It does sound like that commit could have had an impact on download behavior.
The discussion in bug 1280629 makes it sound like they planned on having a way to turn this off, but neckoIpcWindowSize=0 doesn't do it and I don't see anything else obvious from the code.  It would be great if there was a configuration way to work around this, please let me know if I missed something.
Disabling e10s as outlined in https://support.mozilla.org/en-US/questions/1191898 seems to resolve this issue
(In reply to bmm6o from comment #5)
> I ran mozregression, the last portion of output is:

Thank you for making the effort.

Junior Hsu, it looks like bug 1280629 caused this regression. Please have a look at this, and comment 7 in particular.
Blocks: 1280629
Has Regression Range: no → yes
Component: Downloads API → Networking
Flags: needinfo?(juhsu)
Product: Toolkit → Core
(In reply to bmm6o from comment #7)
> The discussion in bug 1280629 makes it sound like they planned on having a
> way to turn this off, but neckoIpcWindowSize=0 doesn't do it and I don't see
> anything else obvious from the code.  It would be great if there was a
> configuration way to work around this, please let me know if I missed
> something.

Hello Reporter,
The video in Comment 0 is unfamiliar to me, so I don't know if there's a delicate setup.
I can't reproduce in FF 64 with a simple download, but the Header Content-Length: is correctly set in Comment 2.

To disable the impact on bug 1280629 for experiment,
can you set the network.http.send_window_size to 0 in about:config to reproduce it again?

If yes, can you give us a STR to reproduce or attach the log to analyze?
See instructions here: https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
Thanks!
Flags: needinfo?(juhsu) → needinfo?(bmckeever)
Flags: needinfo?(bmckeever)
This download succeeded
Comment on attachment 9030830 [details]
network logging file with default window size (1024)

This download reproduced the problem (0 of 0 bytes, empty part file)
Hi Junior,

For more context, the application is an intranet document management web app.  This setup has the ASP.Net server running on a separate VM.  I have reproduced the problem in a variety of setups - Firefox on the local physical machine, on a different VM, on a different physical machine, web app on a physical machine.

See attached HTTP logs, 1 each for the default window size and window size 0.  Setting window size to 0 resulted in a successful download.
(In reply to bmm6o from comment #14)
> Hi Junior,
> 
> For more context, the application is an intranet document management web
> app.  This setup has the ASP.Net server running on a separate VM.  I have
> reproduced the problem in a variety of setups - Firefox on the local
> physical machine, on a different VM, on a different physical machine, web
> app on a physical machine.
> 
> See attached HTTP logs, 1 each for the default window size and window size
> 0.  Setting window size to 0 resulted in a successful download.

I see it's a pptx file.
I'm more interested in how the data is consumed.
Is it a XHR/Fetch or a hyperlink download?

If the consumer is busy or blocked and not telling the network engine how many data is consumed, we
may suspend the Http channel.
Flags: needinfo?(bmckeever)
Oh, I see.  We trigger the download by setting the src property of an iframe to the download url.
Flags: needinfo?(bmckeever)
Priority: -- → P1
Whiteboard: [necko-triaged]
Assignee: nobody → juhsu
The issue happens in the diversion (Download).

When we download resource, the child process will initiate HttpChannelChild::DivertToParent during OnStartRequest.
Send an IPC to parent, triggering HttpChannelParent::SuspendForDiversion 
Usually it happens so early that we won't triggered any flow control bug 1280629.

Here the back pressure suspend is earlier than diversion suspend,
which indicates the IPC from child to parent is so slow, compared to the internet speed 
(actually it's a super fast intranet, 3MB/0.1s). 

We didn't send SendRecvBytes for the diversion cases
The channel is suspended by bp in the case.

Given this condition, SuspendForDiversion is the first message of no more e10s.
We should
(a) resume the channel if it is suspend by bp flow control
(b) never suspend by flow control (although we have suspend for diversion, there may some packets arrive later)
I can reproduce while download something big in local computer via HTTP.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
(In reply to Junior Hsu from comment #17)
> The issue happens in the diversion (Download).
> 
> When we download resource, the child process will initiate
> HttpChannelChild::DivertToParent during OnStartRequest.
> Send an IPC to parent, triggering HttpChannelParent::SuspendForDiversion 
> Usually it happens so early that we won't triggered any flow control bug
> 1280629.
> 
> Here the back pressure suspend is earlier than diversion suspend,
> which indicates the IPC from child to parent is so slow, compared to the
> internet speed 
> (actually it's a super fast intranet, 3MB/0.1s). 
> 
> We didn't send SendRecvBytes for the diversion cases
> The channel is suspended by bp in the case.
> 

Why we did not? We should. Is it possible to resolve it by sending SendRecvBytes?
(In reply to Dragana Damjanovic [:dragana] from comment #20)
> (In reply to Junior Hsu from comment #17)
> > The issue happens in the diversion (Download).
> > 
> > When we download resource, the child process will initiate
> > HttpChannelChild::DivertToParent during OnStartRequest.
> > Send an IPC to parent, triggering HttpChannelParent::SuspendForDiversion 
> > Usually it happens so early that we won't triggered any flow control bug
> > 1280629.
> > 
> > Here the back pressure suspend is earlier than diversion suspend,
> > which indicates the IPC from child to parent is so slow, compared to the
> > internet speed 
> > (actually it's a super fast intranet, 3MB/0.1s). 
> > 
> > We didn't send SendRecvBytes for the diversion cases
> > The channel is suspended by bp in the case.
> > 
> 
> Why we did not? We should. Is it possible to resolve it by sending
> SendRecvBytes?

Yes we can and this issue is resolved by SendRecvBytes.

Some quick analysis by SendRecvBytes approach
Pro: code is cleaner and consistent
Con: |SendRecvBytes| is a redundant IPC call since 
  (a) it's later than HttpChannelChild::DivertToParent in OnStartRequest
  (b) we suspend the HttpChannelParent in HttpChannelParent::SuspendForDiversion anyway, same effect with the flow control algorithm

What do you think?
Flags: needinfo?(dd.mozilla)
Blocks: 1514065
Attachment #9030962 - Attachment is obsolete: true
Pushed by juhsu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fabd9a347ca2
disable flow control of HTTP e10s back pressure r=dragana
https://hg.mozilla.org/mozilla-central/rev/fabd9a347ca2
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66
Comment on attachment 9031284 [details]
Bug 1513135 - disable flow control of HTTP e10s back pressure

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1280629

User impact if declined: Downloading a file > 1MB in a high speed environment (e.g., intranet) will fail

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: Bug 1513135

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): This change disables the impact form bug 1280629, acting as previous release (62 and before)

String changes made/needed: No
Attachment #9031284 - Flags: approval-mozilla-release?
Attachment #9031284 - Flags: approval-mozilla-beta?
Flags: qe-verify+
Comment on attachment 9031284 [details]
Bug 1513135 - disable flow control of HTTP e10s back pressure

[Triage Comment]
Sounds like a pretty big regression for users in some environments. Approving for 65.0b5, but it would also be nice if we could get some verification from QA and/or the reporter that things are behaving correctly in builds with the fix.
Attachment #9031284 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
The fix appears to work in Nightly "66.0a1 (2018-12-14) (64-bit)".  Thanks for acting on this so quickly!
I tried to reproduce the issue using an older version of Nightly from 2018-12-10, but I couldn't. Can you please give me the link that allowed you to reproduce the issue? Or if you did any additional steps please tell me. I could really use some steps to reproduce.
Thank you.
Flags: needinfo?(bmckeever)
(In reply to Junior Hsu from comment #21)
> (In reply to Dragana Damjanovic [:dragana] from comment #20)
> > (In reply to Junior Hsu from comment #17)
> > > The issue happens in the diversion (Download).
> > > 
> > > When we download resource, the child process will initiate
> > > HttpChannelChild::DivertToParent during OnStartRequest.
> > > Send an IPC to parent, triggering HttpChannelParent::SuspendForDiversion 
> > > Usually it happens so early that we won't triggered any flow control bug
> > > 1280629.
> > > 
> > > Here the back pressure suspend is earlier than diversion suspend,
> > > which indicates the IPC from child to parent is so slow, compared to the
> > > internet speed 
> > > (actually it's a super fast intranet, 3MB/0.1s). 
> > > 
> > > We didn't send SendRecvBytes for the diversion cases
> > > The channel is suspended by bp in the case.
> > > 
> > 
> > Why we did not? We should. Is it possible to resolve it by sending
> > SendRecvBytes?
> 
> Yes we can and this issue is resolved by SendRecvBytes.
> 
> Some quick analysis by SendRecvBytes approach
> Pro: code is cleaner and consistent
> Con: |SendRecvBytes| is a redundant IPC call since 
>   (a) it's later than HttpChannelChild::DivertToParent in OnStartRequest
>   (b) we suspend the HttpChannelParent in
> HttpChannelParent::SuspendForDiversion anyway, same effect with the flow
> control algorithm
> 
> What do you think?

We can do this, but we will need a good comment on the child that the child does not need to send SendRecvBytes when diversion starts and why. I hate our suspend-resume stuff. If we get rid of the diversion some stuff will be easier. Also check for !diverting in RecvRecvBytes.
Flags: needinfo?(dd.mozilla)
(In reply to Oana Botisan from comment #29)
> I tried to reproduce the issue using an older version of Nightly from
> 2018-12-10, but I couldn't. Can you please give me the link that allowed you
> to reproduce the issue? Or if you did any additional steps please tell me. I
> could really use some steps to reproduce.
> Thank you.

My STR:
1. set a *local* HTTP server and we can *download* a >1MB file (I use nodejs but it's up to you)
2. Download it via browser
(In reply to Oana Botisan from comment #29)
> I tried to reproduce the issue using an older version of Nightly from
> 2018-12-10, but I couldn't. Can you please give me the link that allowed you
> to reproduce the issue?

The problem will only show up on an intranet so there really isn't a link anyone can give you.  I don't know if Junior Hsu has been able to calculate a minimum speed required, but it's possible your network is below it.
Flags: needinfo?(bmckeever)
(In reply to bmm6o from comment #32)
> (In reply to Oana Botisan from comment #29)
> > I tried to reproduce the issue using an older version of Nightly from
> > 2018-12-10, but I couldn't. Can you please give me the link that allowed you
> > to reproduce the issue?
> 
> The problem will only show up on an intranet so there really isn't a link
> anyone can give you.  I don't know if Junior Hsu has been able to calculate
> a minimum speed required, but it's possible your network is below it.

It's hard to say, and depending on the computing power.
Local network (especially download from local computer) should be always fast enough.
Thank you for the info. 
I managed to reproduce the issue using an older version of Nightly (2018-12-10) on Windows 10 x64.
I verified the fix using latest Nightly 66.0a1 and beta 65.0b5 on Windows 10 x64, Ubuntu 16.04 x64 and macOS 10.13. The bug is not reproducing anymore.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
Since it turns out you can repro internally, would you mind removing or making private the packet captures I attached?  There's nothing very sensitive, but it contains internal IP addresses and host names that I'd prefer not be public forever.
(In reply to bmm6o from comment #35)
> Since it turns out you can repro internally, would you mind removing or
> making private the packet captures I attached?  There's nothing very
> sensitive, but it contains internal IP addresses and host names that I'd
> prefer not be public forever.

I've marked them as private.
Comment on attachment 9031284 [details]
Bug 1513135 - disable flow control of HTTP e10s back pressure

It's already close to next train shipping
Attachment #9031284 - Flags: approval-mozilla-release?
You need to log in before you can comment on or make changes to this bug.