Frequent timeouts downloading from pvtbuilds and usw2 proxxy

RESOLVED FIXED

Status

Release Engineering
Buildduty
--
blocker
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: RyanVM, Assigned: coop)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
Trunk trees closed.

https://tbpl.mozilla.org/php/getParsedLog.php?id=46691626&tree=Mozilla-Inbound

07:58:43    ERROR - Can't download from http://ftp.mozilla.org.proxxy.srv.releng.usw2.mozilla.com/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64-asan/1408975067/firefox-34.0a1.en-US.linux-x86_64-asan.tests.zip to /builds/slave/test/build/firefox-34.0a1.en-US.linux-x86_64-asan.tests.zip!

https://tbpl.mozilla.org/php/getParsedLog.php?id=46691660&tree=Mozilla-Inbound

07:49:31     INFO - Downloading http://tooltool.pvt.build.mozilla.org/build/sha512/7140e026b7b747236545dc30e377a959b0bdf91bb4d70efd7f97f92fce12a9196042503124b8df8d30c2d97b7eb5f9df9556afdffa0b5d9625008aead305c32b to /builds/slave/talos-slave/cached/AVDs-armv7a-gingerbread-build-2014-01-23-ubuntu.tar.gz
command timed out: 2400 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/android_emulator_unittest.py', '--cfg', 'android/androidarm.py', '--test-suite', 'mochitest-1', '--blob-upload-branch', 'mozilla-inbound', '--download-symbols', 'ondemand'], attempting to kill
(Assignee)

Comment 1

3 years ago
The tooltool download part seems fine now, but I'm unsure whether I should be able to get to ftp.mozilla.org.proxxy.srv.releng.usw2.mozilla.com manually or not.
(Reporter)

Comment 2

3 years ago
Android download corruption:
https://tbpl.mozilla.org/php/getParsedLog.php?id=46697858&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=46698014&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=46698733&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=46697919&tree=Mozilla-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=46697898&tree=Mozilla-Central

Download timeouts:
https://tbpl.mozilla.org/php/getParsedLog.php?id=46700571&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=46700948&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=46701332&tree=B2g-Inbound
(Reporter)

Comment 3

3 years ago
6min ago:
https://tbpl.mozilla.org/php/getParsedLog.php?id=46704707&tree=Mozilla-Inbound
(Reporter)

Comment 4

3 years ago
We're starting to run low on jobs to fail given the length of the tree closure, so I'm reopening for now so we can see how things go with more volume.
(Assignee)

Comment 5

3 years ago
Looking at https://tbpl.mozilla.org/php/getParsedLog.php?id=46706670&tree=Mozilla-Inbound, the sha512sum for the AVD file does *not* match what we see in tooltool:

[cltbld@tst-linux64-spot-745.test.releng.usw2.mozilla.com ~]$ sha512sum /builds/slave/talos-slave/cached/AVDs-armv7a-gingerbread-build-2014-01-23-ubuntu.tar.gz 
3721d31b60b501b77805eaa16eebecd54d8f860d98c6346a82f0f090afe0da4aa2b8db6c6c20f6f7be40d2c2c1208c773f24fe1eb033d89cdaeca85f252441e8  /builds/slave/talos-slave/cached/AVDs-armv7a-gingerbread-build-2014-01-23-ubuntu.tar.gz

Should be:

12:03:54     INFO -  'tooltool_cacheable_artifacts': {'avd_tar_ball': ('AVDs-armv7a-gingerbread-build-2014-01-23-ubuntu.tar.gz',
12:03:54     INFO -                                                    '7140e026b7b747236545dc30e377a959b0bdf91bb4d70efd7f97f92fce12a9196042503124b8df8d30c2d97b7eb5f9df9556afdffa0b5d9625008aead305c32b')},
(Assignee)

Comment 6

3 years ago
https://hg.mozilla.org/build/mozharness/file/4148c2a93b1b/scripts/android_emulator_unittest.py#l472

We're assuming a previously downloaded file is correct if it's simply present without checking that the sha512sum matches what we expect.
It looks like in some cases files are fetched from the tooltool servers without using the tooltool fetch command (e.g.: https://hg.mozilla.org/build/mozharness/file/4148c2a93b1b/scripts/android_emulator_unittest.py#l476). A custom cache mechanism has also been implemented for emulator tests, and it does not verify the sha sum when a cached artifact is used either.

This is why tooltool_cache_path, tooltool_cacheable_artifacts and tooltool_url have been added to the task configuration.

I would consider using the standard tooltool fetch command to retrieve files, together with the native tooltool caching mechanism, since both perform sha validation and so allow early detection of corruption issues.

See also https://wiki.mozilla.org/ReleaseEngineering/Tooltool.
(Assignee)

Comment 8

3 years ago
Created attachment 8478552
Check hash of existing, cached file.

Simone's not wrong -- we *should* be using tooltool here because it provides built-in verification -- but that's a more invasive patch. 

This patch should get us unblocked, but it's completely untested.
Attachment #8478552 - Flags: review?(catlee)

Updated

3 years ago
Attachment #8478552 - Flags: review?(catlee) → review+
(Assignee)

Comment 9

3 years ago
Comment on attachment 8478552
Check hash of existing, cached file.

Review of attachment 8478552:
-----------------------------------------------------------------

https://hg.mozilla.org/build/mozharness/rev/916fd25c7692

::: scripts/android_emulator_unittest.py
@@ +469,5 @@
>          for artifact_name in artifacts.keys():
>              file_name = artifacts[artifact_name][0]
>              file_path = os.path.join(c["tooltool_cache_path"], file_name)
> +            if not os.path.exists(file_path) or self.file_sha512sum(file_path) != file_shasum:
> +                os.remove(filepath)

The variable name should be file_path, not filepath.

We should also check that the file exists before removing it, because we can enter this conditional either when the file is missing or when its hash does not match.
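
For illustration, the two review points above could be addressed roughly as follows. This is a minimal standalone sketch, not the patch that landed: cached_file_is_valid is a hypothetical helper name, and file_sha512sum here is an assumed stand-in for the self.file_sha512sum method referenced in the diff.

```python
import hashlib
import os


def file_sha512sum(path, chunk_size=1024 * 1024):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_file_is_valid(file_path, expected_sha512):
    """Return True if the cached file can be reused; delete it if it is corrupt.

    Hypothetical helper: a False result means the caller must download the file.
    """
    if not os.path.exists(file_path):
        return False  # nothing cached, so there is nothing to remove
    if file_sha512sum(file_path) == expected_sha512:
        return True
    os.remove(file_path)  # only reached when the file exists but is corrupt
    return False
```

Separating the "missing" case from the "hash mismatch" case means os.remove is never called on a path that does not exist.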
Attachment #8478552 - Flags: checked-in+
(Assignee)

Updated

3 years ago
Assignee: nobody → coop
(Assignee)

Comment 10

3 years ago
The tooltool part has been mitigated, but the better fix will happen in bug 1058286.

According to Tomcat, there was one proxxy-related failure again last night: https://tbpl.mozilla.org/php/getParsedLog.php?id=46749860&tree=Mozilla-Inbound

Let's re-open if we see another rash of failures today.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Rolled out to prod with reconfig on 2014-08-26 08:21 PT
(In reply to Chris Cooper [:coop] from comment #10)
> The tooltool part has been mitigated, but the better fix will happen in bug
> 1058286.
>
> According to Tomcat, there was one proxxy-related failure again last night:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=46749860&tree=Mozilla-Inbound

We're always going to have intermittent failures downloading from proxxy. In this case the machine fell back to downloading from the original URL and succeeded. I'd call this successful error handling, not a proxxy failure :)
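
The fallback behaviour described above (try the proxxy URL first, then the original URL) can be sketched like this. It is an illustrative sketch only, not the mozharness implementation: download_with_fallback, fetch_with_urllib, and the attempts_per_url parameter are all hypothetical names.

```python
import urllib.request


def fetch_with_urllib(url, timeout=30):
    """Fetch `url` and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_with_fallback(urls, fetch=fetch_with_urllib, attempts_per_url=2):
    """Try each URL in order; return (url, data) from the first that succeeds.

    `urls` is expected to list the proxy URL first and the original URL last.
    """
    last_error = None
    for url in urls:
        for _ in range(attempts_per_url):
            try:
                return url, fetch(url)
            except OSError as error:  # URLError and socket timeouts are OSErrors
                last_error = error
    raise RuntimeError("all download URLs failed") from last_error
```

With this shape, a failed proxy attempt is an expected event that gets logged and retried elsewhere, and only the exhaustion of every URL is treated as a job failure.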

The test ended up failing because of an overall timeout.
https://tbpl.mozilla.org/php/getParsedLog.php?id=46832313&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=46831224&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=46913256&tree=B2g-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=46914773&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=46914775&tree=Mozilla-Inbound
This could be related to the usw2 issues we are having. Dep'ing that bug.
Depends on: 1060407
https://tbpl.mozilla.org/php/getParsedLog.php?id=47084002&full=1&branch=b2g-inbound