Frequent timeouts downloading from pvtbuilds and usw2 proxxy



5 years ago
10 months ago


(Reporter: RyanVM, Assigned: coop)




(1 attachment)



5 years ago
Trunk trees closed.

07:58:43    ERROR - Can't download from to /builds/slave/test/build/!

07:49:31     INFO - Downloading to /builds/slave/talos-slave/cached/AVDs-armv7a-gingerbread-build-2014-01-23-ubuntu.tar.gz
command timed out: 2400 seconds without output running ['/tools/buildbot/bin/python', 'scripts/scripts/', '--cfg', 'android/', '--test-suite', 'mochitest-1', '--blob-upload-branch', 'mozilla-inbound', '--download-symbols', 'ondemand'], attempting to kill

Comment 1

5 years ago
The tooltool download part seems fine now, but I'm unsure whether I should be able to get to manually or not.

Comment 4

5 years ago
We're starting to run low on jobs to fail given the length of the tree closure, so I'm reopening for now so we can see how things go with more volume.

Comment 5

5 years ago
Looking at, the sha512sum for the AVD file does *not* match what we see in tooltool:

[ ~]$ sha512sum /builds/slave/talos-slave/cached/AVDs-armv7a-gingerbread-build-2014-01-23-ubuntu.tar.gz 
3721d31b60b501b77805eaa16eebecd54d8f860d98c6346a82f0f090afe0da4aa2b8db6c6c20f6f7be40d2c2c1208c773f24fe1eb033d89cdaeca85f252441e8  /builds/slave/talos-slave/cached/AVDs-armv7a-gingerbread-build-2014-01-23-ubuntu.tar.gz

Should be:

12:03:54     INFO -  'tooltool_cacheable_artifacts': {'avd_tar_ball': ('AVDs-armv7a-gingerbread-build-2014-01-23-ubuntu.tar.gz',
12:03:54     INFO -                                                    '7140e026b7b747236545dc30e377a959b0bdf91bb4d70efd7f97f92fce12a9196042503124b8df8d30c2d97b7eb5f9df9556afdffa0b5d9625008aead305c32b')},

Comment 6

5 years ago

We're assuming a previously downloaded file is correct if it's simply present without checking that the sha512sum matches what we expect.
It looks like in some cases files are fetched from tooltool servers without using the tooltool fetch command (e.g.: ). Also, a custom cache mechanism has been implemented for emulator tests, which likewise does not verify the sha sum when a cached artifact is used.

This is why tooltool_cache_path, tooltool_cacheable_artifacts and tooltool_url have been added to the task configuration.

I would consider using the standard tooltool fetch command to retrieve files and the native tooltool caching mechanism instead, since both perform sha validation, allowing early detection of corruption issues.

See also

Comment 8

5 years ago
Simone's not wrong -- we *should* be using tooltool here because it provides built-in verification -- but that's a more invasive patch. 

This patch should get us unblocked, but it's completely untested.
Attachment #8478552 - Flags: review?(catlee)
Attachment #8478552 - Flags: review?(catlee) → review+

Comment 9

5 years ago
Comment on attachment 8478552 [details] [diff] [review]
Check hash of existing, cached file.

Review of attachment 8478552 [details] [diff] [review]:

::: scripts/
@@ +469,5 @@
>          for artifact_name in artifacts.keys():
>              file_name = artifacts[artifact_name][0]
>              file_path = os.path.join(c["tooltool_cache_path"], file_name)
> +            if not os.path.exists(file_path) or self.file_sha512sum(file_path) != file_shasum:
> +                os.remove(filepath)

Var name should be file_path.

Should also check if the file exists before removing it, because we could have entered this conditional either way.
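
For illustration, here is a minimal sketch of the cache check with both review points applied (consistent file_path name, and no removal attempt when the file is absent). It is not the patch that landed. needs_download and sha512_of are hypothetical names, and the chunked hashing stands in for whatever self.file_sha512sum does in the real script.

```python
import hashlib
import os


def sha512_of(file_path, chunk_size=1024 * 1024):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(file_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(file_path, expected_sha512):
    """Return True if the cached file is missing or corrupt.

    A corrupt file is removed so that a fresh copy can be fetched.
    (Hypothetical helper, not part of the reviewed script.)
    """
    if not os.path.exists(file_path):
        # Nothing cached yet: nothing to remove, just download.
        return True
    if sha512_of(file_path) != expected_sha512:
        # Cached copy does not match the expected digest: discard it.
        os.remove(file_path)
        return True
    return False
```

Splitting the two cases keeps os.remove from being called on a path that was never there, which is the failure mode the review points out.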
Attachment #8478552 - Flags: checked-in+


5 years ago
Assignee: nobody → coop

Comment 10

5 years ago
The tooltool part has been mitigated, but the better fix will happen in bug 1058286.

According to Tomcat, there was one proxxy-related failure again last night:

Let's re-open if we see another rash of failures today.
Last Resolved: 5 years ago
Resolution: --- → FIXED
Rolled out to prod with reconfig on 2014-08-26 08:21 PT
(In reply to Chris Cooper [:coop] from comment #10)
> The tooltool part has been mitigated, but the better fix will happen in bug
> 1058286.
> According to Tomcat, there was one proxxy-related failure again last night:
> Inbound

We're always going to have intermittent failures downloading from proxxy. In this case the machine fell back to downloading from the original URL and succeeded. I'd call this successful error handling, not a proxxy failure :)
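
A rough sketch of the fallback behaviour described here, assuming the caller supplies the proxy URL first and the original URL last. download_with_fallback and the fetch callable are hypothetical names for illustration, not the actual mozharness code.

```python
def download_with_fallback(urls, fetch):
    """Try each URL in order and return (url, result) for the first success.

    fetch is any callable that takes a URL and either returns the
    downloaded data or raises OSError (urllib.error.URLError and socket
    timeouts are both OSError subclasses in Python 3).
    """
    last_error = None
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as error:
            # Remember the failure and move on to the next candidate.
            last_error = error
    if last_error is None:
        raise ValueError("no URLs given")
    # Every candidate failed: surface the most recent error.
    raise last_error
```

With this shape, a proxy failure followed by a successful fetch from the original URL is a normal return, and only a failure of every candidate raises.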

The test ended up failing because of an overall timeout.
This could be related to the usw2 issues we are having. Dep'ing on that bug.
Depends on: 1060407


10 months ago
Product: Release Engineering → Infrastructure & Operations