945313 - Check for content length when downloading files and retry if not the same

Reporter

Description

•

11 years ago

We're hitting many instances of things like:

09:07:12     INFO - Downloading http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip to /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip
09:07:12     INFO - retry: Calling <bound method DesktopUnittest._download_file of <__main__.DesktopUnittest object at 0x8bb150c>> with args: ('http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip', '/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip'), kwargs: {}, attempt #1
09:08:54     INFO - Downloaded 5606896 bytes.
09:08:54     INFO - mkdir: /builds/slave/test/build/tests
09:08:54     INFO - Running command: ['unzip', '-q', '-o', '/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip', 'bin/*', 'certs/*', 'modules/*', 'mozbase/*', 'config/*', 'mochitest/*'] in /builds/slave/test/build/tests
09:08:54     INFO - Copy/paste: unzip -q -o /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip bin/* certs/* modules/* mozbase/* config/* mochitest/*
09:08:54     INFO -  [/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip]
09:08:54     INFO -    End-of-central-directory signature not found.  Either this file is not
09:08:54     INFO -    a zipfile, or it constitutes one disk of a multi-part archive.  In the
09:08:54     INFO -    latter case the central directory and zipfile comment will be found on
09:08:54     INFO -    the last disk(s) of this archive.
09:08:54     INFO -  unzip:  cannot find zipfile directory in one of /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip or
09:08:54     INFO -          /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip.zip, and cannot find /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip.ZIP, period.
09:08:54    ERROR - Return code: 9
09:08:54    FATAL - Halting on failure while running ['unzip', '-q', '-o', '/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip', 'bin/*', 'certs/*', 'modules/*', 'mozbase/*', 'config/*', 'mochitest/*']


This is failing on more than just b2g, and other job downloads and human downloads of the zip archives that do fail are unzipping fine.

I have not seen any scl3 hosts fail this way yet, but doesn't mean they won't/can't. The fact that manual downloads and other jobs pass means its not the productdelivery storage itself that is corrupt.

Filing as crit and not blocker because this has not yet caused trees to close

Justin Wood (:Callek)

Reporter

Comment 1

•

11 years ago

[9:30:58 PT]	RyanVM|sheriffduty	Callek: we seem to be through this round of it

But doesn't mean its any less crit imho since I'd like to have a clue why it happend

Peter Radcliffe [:pir]

Comment 2

•

11 years ago

The file listed: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip
is actually 121456697 bytes not 5606896 bytes as listed as downloaded there so it's not surprising the unzip failed.

I'd assume this is truncation, not corruption, probably a dropped connection. I'd suggest the code should check this and retry a continued download.

Ryan VanderMeulen [:RyanVM]

Comment 3

•

11 years ago

Happening again.
https://tbpl.mozilla.org/php/getParsedLog.php?id=31323725&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=31323884&tree=B2g-Inbound

Armen [:armenzg]

Comment 4

•

11 years ago

(In reply to Peter Radcliffe [:pir] from comment #2)
> The file listed:
> http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-
> linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip
> is actually 121456697 bytes not 5606896 bytes as listed as downloaded there
> so it's not surprising the unzip failed.
> 
> I'd assume this is truncation, not corruption, probably a dropped
> connection. I'd suggest the code should check this and retry a continued
> download.

I believe we can improve retrying on our side, however, do we have a place to keep track the increase on dropped connections?
Would you please be able to help us locate where the network drops are happening?

Peter Radcliffe [:pir]

Comment 5

•

11 years ago

I have no knowledge of where these machines are, no access to them and no idea what networking is between them and ftp, so I don't think I can be of any help there.

You probably want to talk to netops...

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 6

•

11 years ago

bug 937831 is an earlier case of this

Eric Ziegenhorn :ericz

Updated

•

11 years ago

Assignee: server-ops → eziegenhorn

Eric Ziegenhorn :ericz

Updated

•

11 years ago

Component: Server Operations → WebOps: Product Delivery

Product: mozilla.org → Infrastructure & Operations

QA Contact: shyam → nmaul

Jake Maul [:jakem]

Comment 7

•

11 years ago

Seems like we've been fighting this for a while now. From what I've been told, the problems are *always* from external connections into SCL3, never within SCL3 itself. Most commonly this is from AWS, but also sometimes from SCL1 from what I've been told.

That tells me the problem is somewhere at the network level and not the server level... but it also tells me that it's likely networks outside of Mozilla's control altogether. We can kick this over to NetOps for investigation, but honestly I'm very skeptical they'll be able to do anything. I'll kick it over anyway, but I wouldn't expect miracles. Perhaps one of our peers is flapping or otherwise having problems.

It might be helpful if someone could log into one of the machines affected by this and do a traceroute to ftp.mozilla.org... especially if that person could do something like "mtr" had get a sampling for response time and packet loss data. If there is a problem, that sort of thing *might* spot it.

In the meantime, my overall recommendation is not to bank on a fix from our end. Even if we find and fix a real problem (inside our servers, our network, or outside), we still can't guarantee 100.00% success on transfers from then on. The only way to get to that is to do some form of error checking and retry as needed.

We should treat this as a form of "unintentional chaos monkey" and attempt to make our systems resilient to failure. This will bear far more fruit than hunting for the causes of failures and trying to destroy them. :)

Assignee: eziegenhorn → network-operations

Severity: critical → major

Component: WebOps: Product Delivery → NetOps

QA Contact: nmaul → adam

Shyam Mani [:fox2mike]

Updated

•

11 years ago

Severity: major → normal

Adam Newman [:adam]

Comment 8

•

11 years ago

I'm open to digging into this, but we will need more details.

Let's start here:

- It would help to have access to a server/system that is failing. Could you provide that?

- Are these files static (someone puts them there and then they stay) or generated and placed there by some process? If generated, how often are they generated? Is it possible that something is being downloaded before it is completely written? :pir's thoughts on truncation make me curious.

- Can you provide a bit more insight into the process and logic that goes into the scripts that are failing?

We can cover some or all of this in a meeting if that is easier. Let me know what works best for you.

Assignee: network-operations → adam

Dustin J. Mitchell [:dustin] (he/him)

Comment 9

•

11 years ago

Actually, before we go looking for network issues, we should probably determine whether this is content corruption or, as :pir suggested, truncation.  The two point in pretty radically different directions.

dustin@euclid ~ $ curl -v -o /dev/null http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip
* Adding handle: conn: 0x2449530
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x2449530) send_pipe: 1, recv_pipe: 0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* About to connect() to ftp.mozilla.org port 80 (#0)
*   Trying 63.245.215.46...
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to ftp.mozilla.org (63.245.215.46) port 80 (#0)
> GET /pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip HTTP/1.1
> User-Agent: curl/7.31.0
> Host: ftp.mozilla.org
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache is not blacklisted
< Server: Apache
< X-Backend-Server: ftp4.dmz.scl3.mozilla.com
< Cache-Control: max-age=3600
< Content-Type: application/zip
< Date: Tue, 03 Dec 2013 13:37:54 GMT
< Expires: Tue, 03 Dec 2013 14:37:54 GMT
< Accept-Ranges: bytes
< Access-Control-Allow-Origin: *
< ETag: "1ac8d80-73d4839-4ec903061814b"
< Last-Modified: Mon, 02 Dec 2013 17:05:37 GMT
< X-Cache-Info: not cacheable; response file size too large
< Content-Length: 121456697

yet from comment 0, 
> 09:08:54     INFO - Downloaded 5606896 bytes.

So the first place to look is in the client code, which I suspect is ignoring the Content-Length header.

If that does *not* appear to be the case, then we can look at the less likely possibility that the server side is generating the wrong file size (metadata propagation delays on the NetApps? Dunno..)

So: back to release engineering for some code investigation.  If the error is not immediately apparent, then adding some logging of headers in and out, as well as a checksum of the downloaded data, would provide good insight into next steps.  I'm happy to help stare at code.

Component: NetOps → Buildduty

Product: Infrastructure & Operations → Release Engineering

QA Contact: adam → armenzg

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

11 years ago

Assignee: adam → nobody

Dustin J. Mitchell [:dustin] (he/him)

Comment 10

•

11 years ago

https://mxr.mozilla.org/build/source/mozharness/mozharness/base/script.py#178

178             f = urllib2.urlopen(url, timeout=30)
179             local_file = open(file_name, 'wb')
180             while True:
181                 block = f.read(1024 ** 2)
182                 if not block:
183                     break
184                 local_file.write(block)
185             local_file.close()
186             return file_name

so, that just reads until EOF.

Boiling that down to a simple test case, I set up a server that advertises a lot of data, but then only serves ~100K:

import BaseHTTPServer
from BaseHTTPServer import BaseHTTPRequestHandler
        
class RequestHandler(BaseHTTPRequestHandler):
        
    LEN = 10000000000
            
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', self.LEN)
        self.end_headers()
        for _ in xrange(self.LEN):
            self.wfile.write('x')
            if _ > 100000:
                os._exit(0)


server_address = ('127.0.0.1', 8010)

httpd = BaseHTTPServer.HTTPServer(server_address, RequestHandler)

sa = httpd.socket.getsockname()
print "Serving HTTP on", sa[0], "port", sa[1], "..."
httpd.serve_forever()


And then a client that reads blocks until it gets a zero-length block:


import urllib2

f = urllib2.urlopen('http://localhost:8010')
tot = 0 
while True:
    block = f.read(1024)
    if not block:
        break
    tot += len(block)
    print tot, len(block)


And, indeed, this client reads ~100k and exits cleanly.  A slight modification to check content length (although lacking robust error handling e.g., for TypeError or KeyError) fixes this problem:

import urllib2

f = urllib2.urlopen('http://localhost:8010')
exp_length = int(f.info()['content-length'])
tot = 0
while True:
    block = f.read(1024)
    if not block:
        if tot != exp_length:
            raise RuntimeError("truncated")
        break
    tot += len(block)
    print tot, len(block)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 11

•

11 years ago

Yikes, I didn't realize we were using urllib2 here! Maybe we should replace this with requests or another library that has much better error handling?

Flags: needinfo?(aki)

Aki Sasaki (not active)

Comment 12

•

11 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #11)
> Yikes, I didn't realize we were using urllib2 here! Maybe we should replace
> this with requests or another library that has much better error handling?

I've been avoiding this, as requests is not built-in and would require we either have a dependency to run mozharness at all, or require a virtualenv to download anything.  We're already requiring a virtualenv for more and more things, so that may be the direction we go.

One solution without requiring a virtualenv could be uploading a checksum file with the build artifacts, and having a way to verify the checksum after download.

Flags: needinfo?(aki)

Dustin J. Mitchell [:dustin] (he/him)

Comment 13

•

11 years ago

No need to store checksums - just make sure that the length you receive == the length that was sent, as described in comment 10.

Armen [:armenzg]

Updated

•

11 years ago

Summary: Corruption during download of zip archives from productdelivery->{AWS,scl1} → Check for content length when downloading files and retry if not the same

Armen [:armenzg]

Updated

•

11 years ago

Component: Buildduty → General Automation

QA Contact: armenzg → catlee

Chris AtLee [:catlee]

Comment 14

•

11 years ago

Is this primarily for mozharness, or are we seeing this in other codebases as well?

Armen [:armenzg]

Comment 15

•

11 years ago

AFAICT we're seeing this primarily in mozharness.
The good thing is that most of mozharness makes reference to the base class and will benefit from the fix all at once.

Nevertheless, we could potentially be hitting in many other parts of our overall setup:
http://mxr.mozilla.org/build/search?string=urllib2&find=&findi=&filter=^[^\0]*%24&hitlimit=&tree=build

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 16

•

11 years ago

Attached patch untested patch to verify length (obsolete) — Details — Splinter Review

Attachment #8346755 - Flags: review?(aki)

Aki Sasaki (not active)

Comment 17

•

11 years ago

Comment on attachment 8346755 [details] [diff] [review]
untested patch to verify length

This is in _download_file() which throws exceptions that we catch in retry() here: http://hg.mozilla.org/build/mozharness/file/5b368650890b/mozharness/base/script.py#l229

The new self.fatal() means that even if we're in attempt #1/10, if the file is truncated, we fatal() and kill the script.

Could you change the self.fatal() to throw an exception of some sort? I'm not sure if one of the existing exception checks makes sense, or if we want to create a custom exception and add it to the list to retry on:

   233             retry_exceptions=(urllib2.HTTPError, urllib2.URLError,
   234                               socket.timeout, socket.error),

And I think this will be fine, but I'd love to see this either bake a run or two on Cedar or Ash since any mistake could end up breaking all downloads across all mozharness jobs, burning all trees.

r=me with the exception change and staging run(s).  If you want to send it back my way for a second review that's fine too.  And thanks for this!

Attachment #8346755 - Flags: review?(aki) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 18

•

11 years ago

(In reply to Aki Sasaki [:aki] from comment #17)
> Comment on attachment 8346755 [details] [diff] [review]
> untested patch to verify length
> 
> This is in _download_file() which throws exceptions that we catch in retry()
> here:
> http://hg.mozilla.org/build/mozharness/file/5b368650890b/mozharness/base/
> script.py#l229
> 
> The new self.fatal() means that even if we're in attempt #1/10, if the file
> is truncated, we fatal() and kill the script.
> 
> Could you change the self.fatal() to throw an exception of some sort?

I think URLError should be fine here. We'll got an extra log line because of the except statement, but I think that's fine. I pushed a potential patch to https://hg.mozilla.org/users/asasaki_mozilla.com/ash-mozharness/rev/50a7a5278aac. As long as it doesn't bust I'll post it for review...the actual issue happens rarely enough that we're unlikely to catch a case of it on Ash.

Dustin J. Mitchell [:dustin] (he/him)

Comment 19

•

11 years ago

Comment on attachment 8346755 [details] [diff] [review]
untested patch to verify length

How about "Download incomplete; content-length was %d, but only received %d".  It just makes it a little clearer what went wrong.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 20

•

11 years ago

Attached patch raise exception; better error message — Details — Splinter Review

This didn't break on ash.

Assignee: nobody → bhearsum

Attachment #8346755 - Attachment is obsolete: true

Status: NEW → UNCONFIRMED

Ever confirmed: false

Attachment #8347202 - Flags: review?(aki)

Aki Sasaki (not active)

Comment 21

•

11 years ago

Comment on attachment 8347202 [details] [diff] [review]
raise exception; better error message

r=me if you make the self.fatal() throw the URLError like you did in https://hg.mozilla.org/users/asasaki_mozilla.com/ash-mozharness/rev/50a7a5278aac (but with amended error message for dustin)

Attachment #8347202 - Flags: review?(aki) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 22

•

11 years ago

Comment on attachment 8347202 [details] [diff] [review]
raise exception; better error message

Landed w/ URLError: https://hg.mozilla.org/build/mozharness/rev/a3498fc3653e

Attachment #8347202 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 23

•

11 years ago

Got merged to production today, huzzuh!

Status: UNCONFIRMED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

John Hopkins (:jhopkins)

Comment 24

•

11 years ago

in production

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: General Automation → General

untested patch to verify length 11 years ago bhearsum@mozilla.com (:bhearsum) 873 bytes, patch	mozilla : review+	Details \| Diff \| Splinter Review
raise exception; better error message 11 years ago bhearsum@mozilla.com (:bhearsum) 879 bytes, patch	mozilla : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review