Closed Bug 945313 Opened 9 years ago Closed 9 years ago

Check for content length when downloading files and retry if not the same

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: bhearsum)

Details

Attachments

(1 file, 1 obsolete file)

We're hitting many instances of things like:

09:07:12     INFO - Downloading http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip to /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip
09:07:12     INFO - retry: Calling <bound method DesktopUnittest._download_file of <__main__.DesktopUnittest object at 0x8bb150c>> with args: ('http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip', '/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip'), kwargs: {}, attempt #1
09:08:54     INFO - Downloaded 5606896 bytes.
09:08:54     INFO - mkdir: /builds/slave/test/build/tests
09:08:54     INFO - Running command: ['unzip', '-q', '-o', '/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip', 'bin/*', 'certs/*', 'modules/*', 'mozbase/*', 'config/*', 'mochitest/*'] in /builds/slave/test/build/tests
09:08:54     INFO - Copy/paste: unzip -q -o /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip bin/* certs/* modules/* mozbase/* config/* mochitest/*
09:08:54     INFO -  [/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip]
09:08:54     INFO -    End-of-central-directory signature not found.  Either this file is not
09:08:54     INFO -    a zipfile, or it constitutes one disk of a multi-part archive.  In the
09:08:54     INFO -    latter case the central directory and zipfile comment will be found on
09:08:54     INFO -    the last disk(s) of this archive.
09:08:54     INFO -  unzip:  cannot find zipfile directory in one of /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip or
09:08:54     INFO -          /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip.zip, and cannot find /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip.ZIP, period.
09:08:54    ERROR - Return code: 9
09:08:54    FATAL - Halting on failure while running ['unzip', '-q', '-o', '/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip', 'bin/*', 'certs/*', 'modules/*', 'mozbase/*', 'config/*', 'mochitest/*']


This is failing on more than just b2g, and the same zip archives that fail here unzip fine when downloaded by other jobs or by humans.

I have not seen any scl3 hosts fail this way yet, but that doesn't mean they won't/can't. The fact that manual downloads and other jobs pass means it's not the productdelivery storage itself that is corrupt.

Filing as crit and not blocker because this has not yet caused trees to close
[9:30:58 PT]	RyanVM|sheriffduty	Callek: we seem to be through this round of it

But that doesn't mean it's any less crit imho, since I'd like to have a clue why it happened
The file listed: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip
is actually 121456697 bytes not 5606896 bytes as listed as downloaded there so it's not surprising the unzip failed.

I'd assume this is truncation, not corruption, probably a dropped connection. I'd suggest the code should check this and retry a continued download.
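The "continued download" pir suggests could be sketched with an HTTP Range request (the curl headers later in this bug show the server advertises `Accept-Ranges: bytes`). This is a hypothetical illustration, not mozharness code: the function names are invented, and a Python 3 compat shim is included so the sketch runs on either interpreter.

```python
import os

# Python 2/3 compat shim; the code under discussion is Python 2 urllib2.
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2

def bytes_on_disk(path):
    """Size of the partial download we already have, 0 if none."""
    return os.path.getsize(path) if os.path.exists(path) else 0

def range_header(have):
    """Request everything from byte `have` onward."""
    return {'Range': 'bytes=%d-' % have}

def resume_download(url, path):
    """Continue a truncated download in place (hypothetical sketch)."""
    have = bytes_on_disk(path)
    req = urllib2.Request(url, headers=range_header(have))
    f = urllib2.urlopen(req, timeout=30)
    # 206 Partial Content: the server honored the range, so append;
    # 200 OK: it sent the whole file again, so overwrite from scratch.
    mode = 'ab' if f.getcode() == 206 else 'wb'
    with open(path, mode) as out:
        while True:
            block = f.read(1024 ** 2)
            if not block:
                break
            out.write(block)
```

For the truncated example in comment 0, a resume would send `Range: bytes=5606896-` and append the remaining ~116 MB rather than starting over.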
(In reply to Peter Radcliffe [:pir] from comment #2)
> The file listed:
> http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-
> linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip
> is actually 121456697 bytes not 5606896 bytes as listed as downloaded there
> so it's not surprising the unzip failed.
> 
> I'd assume this is truncation, not corruption, probably a dropped
> connection. I'd suggest the code should check this and retry a continued
> download.

I believe we can improve retrying on our side; however, do we have a place to keep track of the increase in dropped connections?
Would you be able to help us locate where the network drops are happening?
I have no knowledge of where these machines are, no access to them and no idea what networking is between them and ftp, so I don't think I can be of any help there.

You probably want to talk to netops...
bug 937831 is an earlier case of this
Assignee: server-ops → eziegenhorn
Component: Server Operations → WebOps: Product Delivery
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → nmaul
Seems like we've been fighting this for a while now. From what I've been told, the problems are *always* from external connections into SCL3, never within SCL3 itself. Most commonly this is from AWS, but also sometimes from SCL1 from what I've been told.

That tells me the problem is somewhere at the network level and not the server level... but it also tells me that it's likely networks outside of Mozilla's control altogether. We can kick this over to NetOps for investigation, but honestly I'm very skeptical they'll be able to do anything. I'll kick it over anyway, but I wouldn't expect miracles. Perhaps one of our peers is flapping or otherwise having problems.

It might be helpful if someone could log into one of the machines affected by this and do a traceroute to ftp.mozilla.org... especially if that person could use something like "mtr" to get a sampling of response time and packet loss data. If there is a problem, that sort of thing *might* spot it.


In the meantime, my overall recommendation is not to bank on a fix from our end. Even if we find and fix a real problem (inside our servers, our network, or outside), we still can't guarantee 100.00% success on transfers from then on. The only way to get to that is to do some form of error checking and retry as needed.

We should treat this as a form of "unintentional chaos monkey" and attempt to make our systems resilient to failure. This will bear far more fruit than hunting for the causes of failures and trying to destroy them. :)
Assignee: eziegenhorn → network-operations
Severity: critical → major
Component: WebOps: Product Delivery → NetOps
QA Contact: nmaul → adam
Severity: major → normal
I'm open to digging into this, but we will need more details.

Let's start here:

- It would help to have access to a server/system that is failing. Could you provide that?

- Are these files static (someone puts them there and then they stay) or generated and placed there by some process? If generated, how often are they generated? Is it possible that something is being downloaded before it is completely written? :pir's thoughts on truncation make me curious.

- Can you provide a bit more insight into the process and logic that goes into the scripts that are failing?

We can cover some or all of this in a meeting if that is easier. Let me know what works best for you.
Assignee: network-operations → adam
Actually, before we go looking for network issues, we should probably determine whether this is content corruption or, as :pir suggested, truncation.  The two point in pretty radically different directions.

dustin@euclid ~ $ curl -v -o /dev/null http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip
* Adding handle: conn: 0x2449530
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x2449530) send_pipe: 1, recv_pipe: 0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* About to connect() to ftp.mozilla.org port 80 (#0)
*   Trying 63.245.215.46...
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to ftp.mozilla.org (63.245.215.46) port 80 (#0)
> GET /pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip HTTP/1.1
> User-Agent: curl/7.31.0
> Host: ftp.mozilla.org
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache is not blacklisted
< Server: Apache
< X-Backend-Server: ftp4.dmz.scl3.mozilla.com
< Cache-Control: max-age=3600
< Content-Type: application/zip
< Date: Tue, 03 Dec 2013 13:37:54 GMT
< Expires: Tue, 03 Dec 2013 14:37:54 GMT
< Accept-Ranges: bytes
< Access-Control-Allow-Origin: *
< ETag: "1ac8d80-73d4839-4ec903061814b"
< Last-Modified: Mon, 02 Dec 2013 17:05:37 GMT
< X-Cache-Info: not cacheable; response file size too large
< Content-Length: 121456697

yet from comment 0, 
> 09:08:54     INFO - Downloaded 5606896 bytes.

So the first place to look is in the client code, which I suspect is ignoring the Content-Length header.

If that does *not* appear to be the case, then we can look at the less likely possibility that the server side is generating the wrong file size (metadata propagation delays on the NetApps? Dunno..)

So: back to release engineering for some code investigation.  If the error is not immediately apparent, then adding some logging of headers in and out, as well as a checksum of the downloaded data, would provide good insight into next steps.  I'm happy to help stare at code.
Component: NetOps → Buildduty
Product: Infrastructure & Operations → Release Engineering
QA Contact: adam → armenzg
Assignee: adam → nobody
https://mxr.mozilla.org/build/source/mozharness/mozharness/base/script.py#178

178             f = urllib2.urlopen(url, timeout=30)
179             local_file = open(file_name, 'wb')
180             while True:
181                 block = f.read(1024 ** 2)
182                 if not block:
183                     break
184                 local_file.write(block)
185             local_file.close()
186             return file_name

so, that just reads until EOF.

Boiling that down to a simple test case, I set up a server that advertises a lot of data, but then only serves ~100K:

import os
import BaseHTTPServer
from BaseHTTPServer import BaseHTTPRequestHandler

class RequestHandler(BaseHTTPRequestHandler):

    LEN = 10000000000

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', self.LEN)
        self.end_headers()
        for _ in xrange(self.LEN):
            self.wfile.write('x')
            if _ > 100000:
                os._exit(0)


server_address = ('127.0.0.1', 8010)

httpd = BaseHTTPServer.HTTPServer(server_address, RequestHandler)

sa = httpd.socket.getsockname()
print "Serving HTTP on", sa[0], "port", sa[1], "..."
httpd.serve_forever()


And then a client that reads blocks until it gets a zero-length block:


import urllib2

f = urllib2.urlopen('http://localhost:8010')
tot = 0 
while True:
    block = f.read(1024)
    if not block:
        break
    tot += len(block)
    print tot, len(block)


And, indeed, this client reads ~100k and exits cleanly.  A slight modification to check content length (although lacking robust error handling, e.g. for a TypeError or KeyError from the Content-Length lookup) fixes this problem:

import urllib2

f = urllib2.urlopen('http://localhost:8010')
exp_length = int(f.info()['content-length'])
tot = 0
while True:
    block = f.read(1024)
    if not block:
        if tot != exp_length:
            raise RuntimeError("truncated")
        break
    tot += len(block)
    print tot, len(block)
Yikes, I didn't realize we were using urllib2 here! Maybe we should replace this with requests or another library that has much better error handling?
Flags: needinfo?(aki)
(In reply to Ben Hearsum [:bhearsum] from comment #11)
> Yikes, I didn't realize we were using urllib2 here! Maybe we should replace
> this with requests or another library that has much better error handling?

I've been avoiding this, as requests is not built-in and would require we either have a dependency to run mozharness at all, or require a virtualenv to download anything.  We're already requiring a virtualenv for more and more things, so that may be the direction we go.

One solution without requiring a virtualenv could be uploading a checksum file with the build artifacts, and having a way to verify the checksum after download.
Flags: needinfo?(aki)
No need to store checksums - just make sure that the length you receive == the length that was sent, as described in comment 10.
Summary: Corruption during download of zip archives from productdelivery->{AWS,scl1} → Check for content length when downloading files and retry if not the same
Component: Buildduty → General Automation
QA Contact: armenzg → catlee
Is this primarily for mozharness, or are we seeing this in other codebases as well?
AFAICT we're seeing this primarily in mozharness.
The good thing is that most of mozharness makes reference to the base class and will benefit from the fix all at once.

Nevertheless, we could potentially be hitting it in many other parts of our overall setup:
http://mxr.mozilla.org/build/search?string=urllib2&find=&findi=&filter=^[^\0]*%24&hitlimit=&tree=build
Attached patch untested patch to verify length (obsolete) — Splinter Review
Attachment #8346755 - Flags: review?(aki)
Comment on attachment 8346755 [details] [diff] [review]
untested patch to verify length

This is in _download_file() which throws exceptions that we catch in retry() here: http://hg.mozilla.org/build/mozharness/file/5b368650890b/mozharness/base/script.py#l229

The new self.fatal() means that even if we're in attempt #1/10, if the file is truncated, we fatal() and kill the script.

Could you change the self.fatal() to throw an exception of some sort? I'm not sure if one of the existing exception checks makes sense, or if we want to create a custom exception and add it to the list to retry on:

   233             retry_exceptions=(urllib2.HTTPError, urllib2.URLError,
   234                               socket.timeout, socket.error),

And I think this will be fine, but I'd love to see this either bake a run or two on Cedar or Ash since any mistake could end up breaking all downloads across all mozharness jobs, burning all trees.

r=me with the exception change and staging run(s).  If you want to send it back my way for a second review that's fine too.  And thanks for this!
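The shape this review asks for, raising an exception that retry() already catches instead of calling self.fatal(), might look roughly like this (a sketch: `check_download_length` is an invented helper name, and a Python 3 compat shim is included so it runs on either interpreter):

```python
# Sketch of raising urllib2.URLError on truncation instead of fatal(),
# so the existing retry() wrapper (which already catches URLError)
# re-attempts the download. Hypothetical helper, not the actual patch.
try:
    import urllib2                     # Python 2, as in mozharness
except ImportError:
    import urllib.request as urllib2   # Python 3 compat shim

def check_download_length(expected, received):
    """Compare bytes received against the Content-Length header value;
    raise URLError (retryable) if the download came up short."""
    if expected is not None and received != expected:
        raise urllib2.URLError(
            "downloaded %d bytes but content-length was %d"
            % (received, expected))
```

Because URLError is already in retry()'s retry_exceptions tuple, no change to the retry wrapper itself would be needed.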
Attachment #8346755 - Flags: review?(aki) → review+
(In reply to Aki Sasaki [:aki] from comment #17)
> Comment on attachment 8346755 [details] [diff] [review]
> untested patch to verify length
> 
> This is in _download_file() which throws exceptions that we catch in retry()
> here:
> http://hg.mozilla.org/build/mozharness/file/5b368650890b/mozharness/base/
> script.py#l229
> 
> The new self.fatal() means that even if we're in attempt #1/10, if the file
> is truncated, we fatal() and kill the script.
> 
> Could you change the self.fatal() to throw an exception of some sort?

I think URLError should be fine here. We'll get an extra log line because of the except statement, but I think that's fine. I pushed a potential patch to https://hg.mozilla.org/users/asasaki_mozilla.com/ash-mozharness/rev/50a7a5278aac. As long as it doesn't bust I'll post it for review...the actual issue happens rarely enough that we're unlikely to catch a case of it on Ash.
Comment on attachment 8346755 [details] [diff] [review]
untested patch to verify length

How about "Download incomplete; content-length was %d, but only received %d".  It just makes it a little clearer what went wrong.
This didn't break on ash.
Assignee: nobody → bhearsum
Attachment #8346755 - Attachment is obsolete: true
Status: NEW → UNCONFIRMED
Ever confirmed: false
Attachment #8347202 - Flags: review?(aki)
Comment on attachment 8347202 [details] [diff] [review]
raise exception; better error message

r=me if you make the self.fatal() throw the URLError like you did in https://hg.mozilla.org/users/asasaki_mozilla.com/ash-mozharness/rev/50a7a5278aac (but with amended error message for dustin)
Attachment #8347202 - Flags: review?(aki) → review+
Comment on attachment 8347202 [details] [diff] [review]
raise exception; better error message

Landed w/ URLError: https://hg.mozilla.org/build/mozharness/rev/a3498fc3653e
Attachment #8347202 - Flags: checked-in+
Got merged to production today, huzzah!
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
in production
Component: General Automation → General