Closed Bug 1238002 Opened 8 years ago Closed 7 years ago

TEST-UNEXPECTED-ERROR | test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | TimeoutException: Timed out after 360.1 seconds with message: Download has been completed

Categories

(Testing :: Firefox UI Tests, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox48 affected, firefox49 affected)

RESOLVED WORKSFORME
Tracking Status
firefox48 --- affected
firefox49 --- affected

People

(Reporter: whimboo, Unassigned, NeedInfo)


Details

(Keywords: intermittent-failure)

The download of the MAR file has been aborted because it exceeded 6 minutes:

 03:22:18     INFO -  *** AUS:SVC Downloader:onStopRequest - setting state to: downloading
 03:22:18     INFO -  *** AUS:UI gDownloadingPage:onStopRequest - pausing download
 03:22:19     INFO -  *** AUS:SVC Downloader:onStartRequest - original URI spec: https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20160108004007/Firefox-mozilla-aurora-45.0a2-macosx64-en-US.complete.mar?versionId=KHJdveA6JuAdQB8eYBbvA8U9t1dIMRJP, final URI spec: https://mozilla-nightly-updates.s3.amazonaws.com/mozilla-aurora/20160108004007/Firefox-mozilla-aurora-45.0a2-macosx64-en-US.complete.mar?versionId=KHJdveA6JuAdQB8eYBbvA8U9t1dIMRJP
 03:22:19     INFO -  *** AUS:SVC Downloader:onProgress - progress: 16384/108134582
 03:22:20     INFO -  *** AUS:SVC Downloader:onProgress - progress: 69203/108134582
 03:22:20     INFO -  *** AUS:SVC Downloader:onProgress - progress: 206419/108134582
 03:22:21     INFO -  *** AUS:SVC Downloader:onProgress - progress: 399955/108134582
 03:22:22     INFO -  *** AUS:SVC Downloader:onProgress - progress: 554579/108134582
[..]
 03:28:18     INFO -  *** AUS:SVC Downloader:onProgress - progress: 27678291/108134582
 03:28:18     INFO -  *** AUS:SVC Downloader:onProgress - progress: 27713107/108134582

As you can see, only about 27 MB were downloaded within 6 minutes, which is very slow throughput.
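
For reference, a quick back-of-the-envelope calculation of the rate implied by the progress lines above (plain Python, numbers taken straight from the log):

# Bytes received between the first (03:22:19) and last (03:28:18)
# onProgress lines quoted above.
downloaded = 27713107 - 16384          # ~27.7 MB in the window
elapsed = 359                          # seconds from 03:22:19 to 03:28:18

rate = downloaded / elapsed            # ~77 kB/s on average
total = 108134582                      # size of the complete MAR from the log

print("average rate: %.0f kB/s" % (rate / 1000))
print("full MAR at that rate: %.0f min" % (total / rate / 60))  # ~23 min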

HTTP log:

https://firefox-ui-tests.s3.amazonaws.com/a74c906a-a2ab-4954-8336-9d98276ab9b8/http.log

Rail and Nick, is there anything we can do about this with Amazon? A lot of tests failed due to this issue today.
Flags: needinfo?(rail)
Flags: needinfo?(nthomas)
It's very hard to get any traction on this sort of issue, especially when it is infrequent, because there are so many network hops between our machines/datacenters and Amazon. For example, earlier this week bug 1250374 turned out to be a bad network link on the edge of our network, which slowed down our speeds to the wider internet. The usual mantra is 'design your systems to cope with failure, because it's inevitable', but of course that is easier said than engineered.
Flags: needinfo?(nthomas)
I think Nick's answer covers everything that I would say.
Flags: needinfo?(rail)
I do not see a way to work around this, especially since the update test itself is directly affected. If these were failures during setup we could easily stop and retry, but doing that within the test would invalidate the very bits we want to test. So I'm not inclined to do that.

Otherwise we could extend the maximum timeout for downloading the update. Right now it is set to 6 minutes. That value dates back to the days when we ran update tests via Mozmill. Nowadays our installers are getting bigger and bigger, and this timeout might not reflect the current situation anymore.

Here is an idea for improvement:

We stop using this 6 minute hard timeout for downloading the update and fall back to the maximum build time configured in Jenkins, which is currently 60 minutes. In case of slow downloads we will build up a queue, but that is probably still better than having dozens of failing tests.

If it turns out that the queue grows too fast, we could additionally check for incremental download progress in the UI. The download progress is updated every second, so we could verify that progress is being made. The test would then not fail immediately, but only if nothing has been downloaded within the last e.g. 30 seconds (see the sketch below).
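
A minimal sketch of that stall check, purely illustrative: get_progress stands in for whatever the test would use to read the downloaded byte count, and none of these names are part of the actual firefox-ui-tests API.

import time


class DownloadStalledError(Exception):
    """Raised when the update download stops making progress."""


def wait_for_download(get_progress, total_size,
                      stall_timeout=30, overall_timeout=60 * 60, poll=1):
    """Wait until get_progress() reports total_size bytes.

    Instead of a hard 6 minute limit, fail only if no new bytes arrived
    within stall_timeout seconds, bounded by overall_timeout (e.g. the
    60 minute Jenkins build timeout).
    """
    start = time.time()
    last_progress = get_progress()
    last_change = start

    while last_progress < total_size:
        if time.time() - start > overall_timeout:
            raise DownloadStalledError("overall download timeout exceeded")
        if time.time() - last_change > stall_timeout:
            raise DownloadStalledError(
                "no download progress within %d seconds" % stall_timeout)

        time.sleep(poll)
        progress = get_progress()
        if progress > last_progress:
            last_progress = progress
            last_change = time.time()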

Robert, what do you think about those proposals?
Flags: needinfo?(kairo)
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Robert, what do you think about those proposals?

In the face of downloads not being as local as they were in the past, I guess your proposal of a longer timeout is fine. It's sad that it takes that long to download at times, but we can't change it, I guess.
Flags: needinfo?(kairo)
Summary: TEST-UNEXPECTED-ERROR | test_fallback_update.py TestFallbackUpdate.test_update | TimeoutException: TimeoutException: Timed out after 360.1 seconds with message: Download has been completed → TEST-UNEXPECTED-ERROR | test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | TimeoutException: Timed out after 360.1 seconds with message: Download has been completed
Lately I can see chunks coming in at around 20 kB per second. At that rate it's clear that we won't be able to fully download a MAR of 70 MB or so within 6 minutes (20 kB/s over 6 minutes only yields about 7 MB). All of this happens much more often now compared to the updates in March. I'm not sure if something has also changed in the networking code which determines the chunk size.

03:25:25     INFO -  *** AUS:SVC Downloader:onProgress - progress: 16980/11325123
03:25:27     INFO -  *** AUS:SVC Downloader:onProgress - progress: 34388/11325123
03:25:28     INFO -  *** AUS:SVC Downloader:onProgress - progress: 68180/11325123
03:25:29     INFO -  *** AUS:SVC Downloader:onProgress - progress: 69204/11325123
[..]
03:31:15     INFO -  *** AUS:SVC Downloader:onProgress - progress: 4959828/11325123
03:31:16     INFO -  *** AUS:SVC Downloader:onProgress - progress: 4977236/11325123

Patrick, do you know if something in the networking code could be related to these slow downloads? It's visible on Nightly and Aurora. Both the beta and release branches still look fine for release updates.

An HTTP log can be found here (though I don't know why Firefox stopped recording entries after a few minutes):

https://firefox-ui-tests.s3.amazonaws.com/2b323811-d132-48cc-9b36-6e3fdd5030fc/http.log
Flags: needinfo?(mcmanus)
Also this only affects Linux and OS X right now. I haven't seen a test failure on any of our Windows machines yet.
That's a pretty light log, so it's hard to tell what's going on for sure. HttpChannel logging (with open/onstart/onstop/onDataAvailable/cancel) would help some.

But I see response headers in that log indicating about 225 MB worth of data being transferred if nothing is canceled. It looks like some of that may be happening in parallel, which would screw up any of your individual channel measurements.
Flags: needinfo?(mcmanus)
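
For reference, a sketch of how that kind of channel-level logging could be enabled for a reproduction run, assuming the NSPR_LOG_MODULES/NSPR_LOG_FILE mechanism Gecko supported at the time; the module list, file paths, and the way Firefox is launched are illustrative and may need adjusting.

import os
import subprocess

# Illustrative paths; adjust for the local test machine.
env = dict(os.environ)
env["NSPR_LOG_MODULES"] = "timestamp,nsHttp:5,nsSocketTransport:5,nsStreamPump:5"
env["NSPR_LOG_FILE"] = "/tmp/http.log"  # channel-level log is written here

# Launch the build under test with logging enabled so the log covers the
# open/onStartRequest/onDataAvailable/onStopRequest calls of the MAR download.
subprocess.check_call(["/path/to/firefox", "-profile", "/tmp/test-profile"],
                      env=env)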
Patrick, I was able to reproduce such a slow download from one of the boxes in SCL3. I created an HTTP log as described in the MDN documentation; I hope that is enough. You can find the packed version here:

https://www.dropbox.com/s/236rdakqu7zu4b3/http.zip?dl=0

Note that I quit Firefox after a couple of seconds once the connection turned slow (~90 kB/s), so the relevant data should be at the end of the file. This is the fallback update case: it first downloads a partial MAR file and afterward the full one, which usually fails due to its size.
Flags: needinfo?(mcmanus)
I'm not entirely sure what the question is here (please help me with that when you mark something needinfo - what info do you need?).

The log is a little weird.

If you go to 2016-05-18 12:50:27.003069 and start looking around, you'll see what I mean.

There are about 60 reads of small amounts of data, 64 to 120 bytes mostly, and after 1.3 seconds there is a read of 16 KB. This 16 KB read is the first thing to turn into an OnDataAvailable.

It smells like TLS record buffering - the server is using a large record size and Gecko can only decode once the whole record is received. But then again, some of the data seems to be disappearing (all those short reads), which implies they are part of the TLS layer rather than just short chunks.

Nonetheless those short segments are a real problem for TCP. They imply short packets, and congestion control is often based on packets instead of bytes, so lots of short writes can lead to massive under-utilization of bandwidth. I have a lot of wiggle words in here because this all depends on the sender implementation - and that's not Firefox, so I can't be certain.

This is a TCP issue, so the proxy might be playing a role: it is doing the writing even though it's a CONNECT tunnel. But it might just be writing the segments out at the same pace it received them from AWS; there is no way to know from the log.

A corresponding Wireshark trace would shed more light on the issue - we might be able to figure out what those short packets are. Serious bonus points for an NSS key log.
Flags: needinfo?(mcmanus)
(In reply to Patrick McManus [:mcmanus] from comment #10)
> I'm not entirely sure what the question is here (please help me with that
> when you mark something needinfo - what info do you need?).

Sorry if that wasn't clear. The issue covered by this bug is very slow downloads of background updates, which happen on and off, lately more often. All data is served via S3, so there are a couple of hops between the server and our test machines in SCL3 which perform the update tests. The 6 minute timeout the tests allow for downloading the update is hit a lot.

> if you go to 2016-05-18 12:50:27.003069 and start looking around you'll see
> what I mean.

I have no idea how to interpret those logs; that's why I was asking you for help. Let me know what I have to provide to get the information you need. That would be very much appreciated!

> There are about 60 reads of small amounts of data .. 64 to 120 bytes
> mostly.. and after 1.3 seconds there is a read of 16KB. this 16KB is the
> first thing to turn into a OnDataAvailable.
> 
> It smells like tls record buffering - the server is using a large record
> size and gecko can only decode after the whole record is received. but then
> again, some of the data seems to be disappearing (all those short reads)
> which implies they are part of the TLS layer rather than just short chunks.

By a large record size, do you mean this could be related to the file being downloaded? That could be the update.mar.

> A corresponding wireshark trace would shed more light on the issue - we
> might be able to figure out what those short packets are. serious bonus
> points for a NSS key log.

I haven't used Wireshark yet, so I would kindly ask which specific settings I have to use to provide the information you need. Regarding the NSS key log, I found the environment variable "SSLKEYLOGFILE", so that should be easy to generate. Will those files contain anything privacy-related?
Flags: needinfo?(mcmanus)
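
A rough sketch of how the key log and a matching packet capture could be produced together; all paths and the tcpdump invocation are illustrative, and SSLKEYLOGFILE is the NSS variable mentioned above. Wireshark can then decrypt the capture via its "(Pre)-Master-Secret log filename" TLS preference, which also means the key log plus the capture should be treated as sensitive.

import os
import subprocess

# Illustrative paths; adjust for the local setup.
env = dict(os.environ)
env["SSLKEYLOGFILE"] = "/tmp/nss-keys.log"  # NSS appends TLS session keys here

# Capture the update traffic in parallel (tcpdump shown as an example;
# it typically needs elevated privileges). The pcap plus the key log lets
# Wireshark decrypt the TLS records afterwards.
capture = subprocess.Popen(
    ["tcpdump", "-i", "any", "-w", "/tmp/update-download.pcap", "port", "443"])

try:
    subprocess.check_call(
        ["/path/to/firefox", "-profile", "/tmp/test-profile"], env=env)
finally:
    capture.terminate()
    capture.wait()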
See Also: → 1206613
This hasn't happened in the last 10 months. Closing as WFM.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME