Closed Bug 945313 Opened 9 years ago Closed 9 years ago

Check for content length when downloading files and retry if not the same

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: bhearsum)

Details

Attachments

(1 file, 1 obsolete file)

We're hitting many instances of things like:

09:07:12     INFO - Downloading http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip to /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip
09:07:12     INFO - retry: Calling <bound method DesktopUnittest._download_file of <__main__.DesktopUnittest object at 0x8bb150c>> with args: ('http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip', '/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip'), kwargs: {}, attempt #1
09:08:54     INFO - Downloaded 5606896 bytes.
09:08:54     INFO - mkdir: /builds/slave/test/build/tests
09:08:54     INFO - Running command: ['unzip', '-q', '-o', '/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip', 'bin/*', 'certs/*', 'modules/*', 'mozbase/*', 'config/*', 'mochitest/*'] in /builds/slave/test/build/tests
09:08:54     INFO - Copy/paste: unzip -q -o /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip bin/* certs/* modules/* mozbase/* config/* mochitest/*
09:08:54     INFO -  [/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip]
09:08:54     INFO -    End-of-central-directory signature not found.  Either this file is not
09:08:54     INFO -    a zipfile, or it constitutes one disk of a multi-part archive.  In the
09:08:54     INFO -    latter case the central directory and zipfile comment will be found on
09:08:54     INFO -    the last disk(s) of this archive.
09:08:54     INFO -  unzip:  cannot find zipfile directory in one of /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip or
09:08:54     INFO -          /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip.zip, and cannot find /builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip.ZIP, period.
09:08:54    ERROR - Return code: 9
09:08:54    FATAL - Halting on failure while running ['unzip', '-q', '-o', '/builds/slave/test/build/firefox-28.0a1.en-US.linux-i686.tests.zip', 'bin/*', 'certs/*', 'modules/*', 'mozbase/*', 'config/*', 'mochitest/*']


This is failing on more than just b2g, and the same zip archives that fail here unzip fine when downloaded by other jobs or by humans.

I have not seen any scl3 hosts fail this way yet, but that doesn't mean they won't/can't. The fact that manual downloads and other jobs pass means it's not the productdelivery storage itself that is corrupt.

Filing as crit and not blocker because this has not yet caused trees to close
[9:30:58 PT]	RyanVM|sheriffduty	Callek: we seem to be through this round of it

But that doesn't mean it's any less crit imho, since I'd like to have a clue why it happened
The file listed: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip
is actually 121456697 bytes not 5606896 bytes as listed as downloaded there so it's not surprising the unzip failed.

I'd assume this is truncation, not corruption, probably a dropped connection. I'd suggest the code should check this and retry a continued download.
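The "continued download" pir suggests could be sketched with an HTTP Range request (the curl headers later in this bug show the server advertises `Accept-Ranges: bytes`). This is a hypothetical illustration, not mozharness code: the function names are invented, and a Python 3 compat shim is included so the sketch runs on either interpreter.

```python
import os

# Python 2/3 compat shim; the code under discussion is Python 2 urllib2.
try:
    import urllib2
except ImportError:
    import urllib.request as urllib2

def bytes_on_disk(path):
    """Size of the partial download we already have, 0 if none."""
    return os.path.getsize(path) if os.path.exists(path) else 0

def range_header(have):
    """Request everything from byte `have` onward."""
    return {'Range': 'bytes=%d-' % have}

def resume_download(url, path):
    """Continue a truncated download in place (hypothetical sketch)."""
    have = bytes_on_disk(path)
    req = urllib2.Request(url, headers=range_header(have))
    f = urllib2.urlopen(req, timeout=30)
    # 206 Partial Content: the server honored the range, so append;
    # 200 OK: it sent the whole file again, so overwrite from scratch.
    mode = 'ab' if f.getcode() == 206 else 'wb'
    with open(path, mode) as out:
        while True:
            block = f.read(1024 ** 2)
            if not block:
                break
            out.write(block)
```

For the truncated example in comment 0, a resume would send `Range: bytes=5606896-` and append the remaining ~116 MB rather than starting over.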
(In reply to Peter Radcliffe [:pir] from comment #2)
> The file listed:
> http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-
> linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip
> is actually 121456697 bytes not 5606896 bytes as listed as downloaded there
> so it's not surprising the unzip failed.
> 
> I'd assume this is truncation, not corruption, probably a dropped
> connection. I'd suggest the code should check this and retry a continued
> download.

I believe we can improve retrying on our side; however, do we have a place to keep track of the increase in dropped connections?
Would you be able to help us locate where the network drops are happening?
I have no knowledge of where these machines are, no access to them and no idea what networking is between them and ftp, so I don't think I can be of any help there.

You probably want to talk to netops...
bug 937831 is an earlier case of this
Assignee: server-ops → eziegenhorn
Component: Server Operations → WebOps: Product Delivery
Product: mozilla.org → Infrastructure & Operations
QA Contact: shyam → nmaul
Seems like we've been fighting this for a while now. From what I've been told, the problems are *always* from external connections into SCL3, never within SCL3 itself. Most commonly this is from AWS, but also sometimes from SCL1 from what I've been told.

That tells me the problem is somewhere at the network level and not the server level... but it also tells me that it's likely networks outside of Mozilla's control altogether. We can kick this over to NetOps for investigation, but honestly I'm very skeptical they'll be able to do anything. I'll kick it over anyway, but I wouldn't expect miracles. Perhaps one of our peers is flapping or otherwise having problems.

It might be helpful if someone could log into one of the machines affected by this and do a traceroute to ftp.mozilla.org... especially if that person could use something like "mtr" to get a sampling of response time and packet loss data. If there is a problem, that sort of thing *might* spot it.


In the meantime, my overall recommendation is not to bank on a fix from our end. Even if we find and fix a real problem (inside our servers, our network, or outside), we still can't guarantee 100.00% success on transfers from then on. The only way to get to that is to do some form of error checking and retry as needed.

We should treat this as a form of "unintentional chaos monkey" and attempt to make our systems resilient to failure. This will bear far more fruit than hunting for the causes of failures and trying to destroy them. :)
Assignee: eziegenhorn → network-operations
Severity: critical → major
Component: WebOps: Product Delivery → NetOps
QA Contact: nmaul → adam
Severity: major → normal
I'm open to digging into this, but we will need more details.

Let's start here:

- It would help to have access to a server/system that is failing. Could you provide that?

- Are these files static (someone puts them there and then they stay) or generated and placed there by some process? If generated, how often are they generated? Is it possible that something is being downloaded before it is completely written? :pir's thoughts on truncation make me curious.

- Can you provide a bit more insight into the process and logic that goes into the scripts that are failing?

We can cover some or all of this in a meeting if that is easier. Let me know what works best for you.
Assignee: network-operations → adam
Actually, before we go looking for network issues, we should probably determine whether this is content corruption or, as :pir suggested, truncation.  The two point in pretty radically different directions.

dustin@euclid ~ $ curl -v -o /dev/null http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip
* Adding handle: conn: 0x2449530
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x2449530) send_pipe: 1, recv_pipe: 0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* About to connect() to ftp.mozilla.org port 80 (#0)
*   Trying 63.245.215.46...
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to ftp.mozilla.org (63.245.215.46) port 80 (#0)
> GET /pub/mozilla.org/firefox/tinderbox-builds/b2g-inbound-linux/1386000232/firefox-28.0a1.en-US.linux-i686.tests.zip HTTP/1.1
> User-Agent: curl/7.31.0
> Host: ftp.mozilla.org
> Accept: */*
>
< HTTP/1.1 200 OK
* Server Apache is not blacklisted
< Server: Apache
< X-Backend-Server: ftp4.dmz.scl3.mozilla.com
< Cache-Control: max-age=3600
< Content-Type: application/zip
< Date: Tue, 03 Dec 2013 13:37:54 GMT
< Expires: Tue, 03 Dec 2013 14:37:54 GMT
< Accept-Ranges: bytes
< Access-Control-Allow-Origin: *
< ETag: "1ac8d80-73d4839-4ec903061814b"
< Last-Modified: Mon, 02 Dec 2013 17:05:37 GMT
< X-Cache-Info: not cacheable; response file size too large
< Content-Length: 121456697

yet from comment 0, 
> 09:08:54     INFO - Downloaded 5606896 bytes.

So the first place to look is in the client code, which I suspect is ignoring the Content-Length header.

If that does *not* appear to be the case, then we can look at the less likely possibility that the server side is generating the wrong file size (metadata propagation delays on the NetApps? Dunno..)

So: back to release engineering for some code investigation.  If the error is not immediately apparent, then adding some logging of headers in and out, as well as a checksum of the downloaded data, would provide good insight into next steps.  I'm happy to help stare at code.
Component: NetOps → Buildduty
Product: Infrastructure & Operations → Release Engineering
QA Contact: adam → armenzg
Assignee: adam → nobody
https://mxr.mozilla.org/build/source/mozharness/mozharness/base/script.py#178

178             f = urllib2.urlopen(url, timeout=30)
179             local_file = open(file_name, 'wb')
180             while True:
181                 block = f.read(1024 ** 2)
182                 if not block:
183                     break
184                 local_file.write(block)
185             local_file.close()
186             return file_name

so, that just reads until EOF.

Boiling that down to a simple test case, I set up a server that advertises a lot of data, but then only serves ~100K:

import os
import BaseHTTPServer
from BaseHTTPServer import BaseHTTPRequestHandler

class RequestHandler(BaseHTTPRequestHandler):

    LEN = 10000000000

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Length', self.LEN)
        self.end_headers()
        for _ in xrange(self.LEN):
            self.wfile.write('x')
            if _ > 100000:
                os._exit(0)


server_address = ('127.0.0.1', 8010)

httpd = BaseHTTPServer.HTTPServer(server_address, RequestHandler)

sa = httpd.socket.getsockname()
print "Serving HTTP on", sa[0], "port", sa[1], "..."
httpd.serve_forever()


And then a client that reads blocks until it gets a zero-length block:


import urllib2

f = urllib2.urlopen('http://localhost:8010')
tot = 0 
while True:
    block = f.read(1024)
    if not block:
        break
    tot += len(block)
    print tot, len(block)


And, indeed, this client reads ~100k and exits cleanly.  A slight modification to check content length (although lacking robust error handling, e.g. for a TypeError or KeyError from the Content-Length lookup) fixes this problem:

import urllib2

f = urllib2.urlopen('http://localhost:8010')
exp_length = int(f.info()['content-length'])
tot = 0
while True:
    block = f.read(1024)
    if not block:
        if tot != exp_length:
            raise RuntimeError("truncated")
        break
    tot += len(block)
    print tot, len(block)
Yikes, I didn't realize we were using urllib2 here! Maybe we should replace this with requests or another library that has much better error handling?
Flags: needinfo?(aki)
(In reply to Ben Hearsum [:bhearsum] from comment #11)
> Yikes, I didn't realize we were using urllib2 here! Maybe we should replace
> this with requests or another library that has much better error handling?

I've been avoiding this, as requests is not built-in and would require we either have a dependency to run mozharness at all, or require a virtualenv to download anything.  We're already requiring a virtualenv for more and more things, so that may be the direction we go.

One solution without requiring a virtualenv could be uploading a checksum file with the build artifacts, and having a way to verify the checksum after download.
Flags: needinfo?(aki)
No need to store checksums - just make sure that the length you receive == the length that was sent, as described in comment 10.
Summary: Corruption during download of zip archives from productdelivery->{AWS,scl1} → Check for content length when downloading files and retry if not the same
Component: Buildduty → General Automation
QA Contact: armenzg → catlee
Is this primarily for mozharness, or are we seeing this in other codebases as well?
AFAICT we're seeing this primarily in mozharness.
The good thing is that most of mozharness makes reference to the base class and will benefit from the fix all at once.

Nevertheless, we could potentially be hitting it in many other parts of our overall setup:
http://mxr.mozilla.org/build/search?string=urllib2&find=&findi=&filter=^[^\0]*%24&hitlimit=&tree=build
Attached patch untested patch to verify length (obsolete) — Splinter Review
Attachment #8346755 - Flags: review?(aki)
Comment on attachment 8346755 [details] [diff] [review]
untested patch to verify length

This is in _download_file() which throws exceptions that we catch in retry() here: http://hg.mozilla.org/build/mozharness/file/5b368650890b/mozharness/base/script.py#l229

The new self.fatal() means that even if we're in attempt #1/10, if the file is truncated, we fatal() and kill the script.

Could you change the self.fatal() to throw an exception of some sort? I'm not sure if one of the existing exception checks makes sense, or if we want to create a custom exception and add it to the list to retry on:

   233             retry_exceptions=(urllib2.HTTPError, urllib2.URLError,
   234                               socket.timeout, socket.error),

And I think this will be fine, but I'd love to see this either bake a run or two on Cedar or Ash since any mistake could end up breaking all downloads across all mozharness jobs, burning all trees.

r=me with the exception change and staging run(s).  If you want to send it back my way for a second review that's fine too.  And thanks for this!
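The shape this review asks for, raising an exception that retry() already catches instead of calling self.fatal(), might look roughly like this (a sketch: `check_download_length` is an invented helper name, and a Python 3 compat shim is included so it runs on either interpreter):

```python
# Sketch of raising urllib2.URLError on truncation instead of fatal(),
# so the existing retry() wrapper (which already catches URLError)
# re-attempts the download. Hypothetical helper, not the actual patch.
try:
    import urllib2                     # Python 2, as in mozharness
except ImportError:
    import urllib.request as urllib2   # Python 3 compat shim

def check_download_length(expected, received):
    """Compare bytes received against the Content-Length header value;
    raise URLError (retryable) if the download came up short."""
    if expected is not None and received != expected:
        raise urllib2.URLError(
            "downloaded %d bytes but content-length was %d"
            % (received, expected))
```

Because URLError is already in retry()'s retry_exceptions tuple, no change to the retry wrapper itself would be needed.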
Attachment #8346755 - Flags: review?(aki) → review+
(In reply to Aki Sasaki [:aki] from comment #17)
> Comment on attachment 8346755 [details] [diff] [review]
> untested patch to verify length
> 
> This is in _download_file() which throws exceptions that we catch in retry()
> here:
> http://hg.mozilla.org/build/mozharness/file/5b368650890b/mozharness/base/
> script.py#l229
> 
> The new self.fatal() means that even if we're in attempt #1/10, if the file
> is truncated, we fatal() and kill the script.
> 
> Could you change the self.fatal() to throw an exception of some sort?

I think URLError should be fine here. We'll get an extra log line because of the except statement, but I think that's fine. I pushed a potential patch to https://hg.mozilla.org/users/asasaki_mozilla.com/ash-mozharness/rev/50a7a5278aac. As long as it doesn't bust I'll post it for review...the actual issue happens rarely enough that we're unlikely to catch a case of it on Ash.
Comment on attachment 8346755 [details] [diff] [review]
untested patch to verify length

How about "Download incomplete; content-length was %d, but only received %d".  It just makes it a little clearer what went wrong.
This didn't break on ash.
Assignee: nobody → bhearsum
Attachment #8346755 - Attachment is obsolete: true
Status: NEW → UNCONFIRMED
Ever confirmed: false
Attachment #8347202 - Flags: review?(aki)
Comment on attachment 8347202 [details] [diff] [review]
raise exception; better error message

r=me if you make the self.fatal() throw the URLError like you did in https://hg.mozilla.org/users/asasaki_mozilla.com/ash-mozharness/rev/50a7a5278aac (but with amended error message for dustin)
Attachment #8347202 - Flags: review?(aki) → review+
Comment on attachment 8347202 [details] [diff] [review]
raise exception; better error message

Landed w/ URLError: https://hg.mozilla.org/build/mozharness/rev/a3498fc3653e
Attachment #8347202 - Flags: checked-in+
Got merged to production today, huzzah!
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
in production
Component: General Automation → General