Closed Bug 1276110 Opened 8 years ago Closed 8 years ago

intermittent download failures from cloud-mirror.taskcluster.net

Categories

(Firefox Build System :: General, defect)

Tracking

(firefox47 fixed, firefox48 fixed, firefox49 fixed, firefox-esr45 fixed, firefox50 fixed)

RESOLVED FIXED
mozilla50

People

(Reporter: rail, Assigned: Callek)

Attachments

(3 files)

Not sure if this is the right component, feel free to move.

So far I've seen 2 intermittent failures:

http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-beta-l10n/release-mozilla-beta_firefox_win64_l10n_repack-bm77-build1-build7.txt.gz

http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-beta-l10n/release-mozilla-beta_firefox_win32_l10n_repack-bm77-build1-build7.txt.gz

18:46:44     INFO -  c:/builds/moz2_slave/rel-m-beta_fx_w64_l10n_rpk-000/build/mozilla-beta/obj-l10n/_virtualenv/Scripts/python.exe c:/builds/moz2_slave/rel-m-beta_fx_w64_l10n_rpk-000/build/mozilla-beta/config/nsinstall.py -D c:/builds/moz2_slave/rel-m-beta_fx_w64_l10n_rpk-000/build/mozilla-beta/obj-l10n/dist/
18:46:44     INFO -  (cd c:/builds/moz2_slave/rel-m-beta_fx_w64_l10n_rpk-000/build/mozilla-beta/obj-l10n/dist/ && wget --no-cache -nv -N  'https://queue.taskcluster.net/v1/task/PuOzsj0kQMibOB4I1-rlJA/artifacts/public/build/firefox-47.0.en-US.win64.zip')
18:46:45     INFO -  https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/PuOzsj0kQMibOB4I1-rlJA/0/public/build/firefox-47.0.en-US.win64.zip:
18:46:45     INFO -  2016-05-26 18:46:45 ERROR 404: Not Found.
....

18:46:53     INFO -  https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/PuOzsj0kQMibOB4I1-rlJA/0/public/build/firefox-47.0.en-US.win64.zip:
18:46:53     INFO -  2016-05-26 18:46:53 ERROR 404: Not Found.
18:46:53     INFO -  20 redirections exceeded.
18:46:53     INFO -  c:/builds/moz2_slave/rel-m-beta_fx_w64_l10n_rpk-000/build/mozilla-beta/toolkit/locales/l10n.mk:193: recipe for target 'wget-en-US' failed
18:46:53     INFO -  mozmake.exe: *** [wget-en-US] Error 8
18:46:53    ERROR - Return code: 2
18:46:53    ERROR - 2 not in success codes: [0]
18:46:53  WARNING - setting return code to 2
18:46:53    FATAL - Halting on failure while running ['c:/builds/moz2_slave/rel-m-beta_fx_w64_l10n_rpk-000/build/mozilla-beta/mozmake.exe', 'wget-en-US']
18:46:53    FATAL - Running post_fatal callback...
18:46:53    FATAL - Exiting 2
18:46:53     INFO - Running post-run listener: copy_logs_to_upload_dir
18:46:53     INFO - Copying logs to upload dir...
18:46:53     INFO - mkdir: c:\builds\moz2_slave\rel-m-beta_fx_w64_l10n_rpk-000\build\upload\logs
program finished with exit code 2
elapsedTime=1246.631000


The same wget command worked fine "on my laptop" and hopefully it works for the rerun.
Component: General → Platform and Services
Looks like most (if not all) are in scl3.
(In reply to John Ford [:jhford] from comment #5)
> If these are requests in SCL3, they shouldn't be going to cloud-mirror, but
> rather cloud front.

meaning that this is a bug in the queue I think.
Component: Platform and Services → Queue
This happened again: https://treeherder.mozilla.org/logviewer.html#?job_id=1148002&repo=mozilla-beta#L2528

"Machine: b-2008-spot-092"

Meaning, this node is a spot instance probably in us-east-1.
Flags: needinfo?(jhford)
All the logs referenced here are "Machine: b-2008-spot-0xx".
Hence, this seems limited to win 2008 on EC2 spot.

Based on the fact that the command that fails is:
  https://dxr.mozilla.org/mozilla-central/source/toolkit/locales/l10n.mk#193
It's tempting to believe that it's a windows wget bug, where somehow the redirect
location is url decoded.

It could be something else; intermittent bugs in wget sound a bit too convenient :)
But if it was TC or cloud-mirror I would expect errors to affect other platforms.
So we could try and dig further into this... but it's now pretty obvious that wget on windows
does weird things, see: https://gist.github.com/jonasfj/37e5d14b1965bbbcf7bfaa69b0a02578
This is testing the URL that queue.taskcluster.net would redirect to...
Notice wget on windows gets "2016-06-07 18:36:44 ERROR 404: Not Found." and then also gets "302" :)

Also, the output filename differs between the two wget calls.
linux (correctly, no url decoding): https%3A%2F%2Fs3..........
windows (somehow url decoding): firefox-48.0....

Again, this is the URL that queue redirects to, so not the URL that is used in the script.
But it clearly proves that wget on win and linux do different things.
And with wget on windows reporting both a 404 and a 302 response for a single request,
I would dare argue wget for windows might be at fault here.
Jonas's theory about the canonical url being taken as relative sounds very reasonable to me.  That URL should never be attempted.  The route of the cloud-mirror redirect endpoint is

    /v1/redirect/:service/:region/:url/:error?

If a non-url-encoded url is used, that might trigger a 404 if the error param isn't globbing correctly: the route wouldn't match that url, which would explain the 404 we're seeing.
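To make the route argument concrete, here is a minimal sketch of the matching. It assumes each `:param` matches exactly one path segment (no `/`) and that the trailing `:error?` segment is optional; the regex and the `route_matches` helper are illustrative only, not cloud-mirror's actual router.

```python
import re
from urllib.parse import quote

# Assumption: every ":param" in /v1/redirect/:service/:region/:url/:error?
# matches a single path segment, and the last segment is optional.
ROUTE = re.compile(r"^/v1/redirect/([^/]+)/([^/]+)/([^/]+)(?:/([^/]+))?$")

# Canonical artifact URL taken from the failing log in this bug.
CANONICAL = (
    "https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/"
    "PuOzsj0kQMibOB4I1-rlJA/0/public/build/firefox-47.0.en-US.win64.zip"
)


def route_matches(path):
    """Return True if the (hypothetical) route pattern accepts this path."""
    return ROUTE.match(path) is not None


# Properly encoded: the whole canonical URL is one path segment.
encoded_path = "/v1/redirect/s3/us-east-1/" + quote(CANONICAL, safe="")
# Url-decoded (what the failing wget requested): many extra segments.
decoded_path = "/v1/redirect/s3/us-east-1/" + CANONICAL

print(route_matches(encoded_path))
print(route_matches(decoded_path))
```

Under those assumptions the encoded path matches and the decoded one does not, which is consistent with a 404 for the decoded form.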

Wget is also not really an ideal tool for downloading single files; it's more for mirroring http sites.  Would it be possible to deploy curl?  There are binaries available here: https://curl.haxx.se/download.html so it must be possible to build it for windows.
Flags: needinfo?(jhford)
Looking at it more...

 15:20:27     INFO -  https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/ZB8qlCeORni48pNKVb1ZnQ/0/public/build/firefox-48.0.en-US.win64.zip:

 15:20:27     INFO -  2016-06-07 15:20:27 ERROR 404: Not Found.

15:20:28 INFO - 20 redirections exceeded. 

So it seems like we are caught in a redirect loop, which isn't/wasn't what was caught in the gist referenced.
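For reference, the "20 redirections exceeded" message is what a client with a redirect cap prints once it gives up (20 is wget's default for `--max-redirect`). A rough sketch of that behaviour follows; `follow`, `looping_fetch` and `direct_fetch` are hypothetical names, and the fake responses are purely illustrative, not what cloud-mirror actually returned.

```python
class TooManyRedirects(Exception):
    """Raised when the redirect cap is reached."""


REDIRECT_CODES = (301, 302, 303, 307, 308)


def follow(fetch, url, max_redirects=20):
    """Follow redirects until a final status or until the cap is hit.

    `fetch` is a hypothetical callable: fetch(url) -> (status, location).
    Returns (final_status, final_url, hops_followed).
    """
    hops = 0
    while True:
        status, location = fetch(url)
        if status not in REDIRECT_CODES or location is None:
            return status, url, hops
        if hops >= max_redirects:
            raise TooManyRedirects(f"{max_redirects} redirections exceeded")
        url = location
        hops += 1


def looping_fetch(url):
    # Illustrative fake: always redirects back to the same URL.
    return 302, url


def direct_fetch(url):
    # Illustrative fake: serves the file straight away.
    return 200, None


print(follow(direct_fetch, "https://example.invalid/file.zip"))
try:
    follow(looping_fetch, "https://example.invalid/file.zip")
except TooManyRedirects as exc:
    print(exc)
```

This only shows why a loop ends in that message; it says nothing about which side produces the loop.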

I also note that the wget called was specifically:

wget --no-cache -nv -N 'https://queue.taskcluster.net/v1/task/ZB8qlCeORni48pNKVb1ZnQ/artifacts/public/build/firefox-48.0.en-US.win64.zip'

not the cloud-mirror url.
> I also note that the wget called was specifically: ... not the cloud-mirror url.
Calling queue.taskcluster.net for artifacts from within EC2 will redirect to cloud-mirror.
So, as indicated by the filename issue (where wget on windows unexpectedly url decoded) and the obscure
404 message on redirect, it seems very likely that wget is doing something wrong when handling redirects.

If it was a server side thing, we should have seen it on other platforms.
I think this is wget for windows, unable to handle redirects to urls that contain url encoded characters.
Component: Queue → Task Configuration
seems like callek last changed wget in feb: http://hg.mozilla.org/build/puppet/diff/d56b09efb247/modules/packages/manifests/wget.pp

and glandium last changed the wget-en-US usage call in March (dec on central): https://hg.mozilla.org/releases/mozilla-beta/diff/74f9a758dbc2/toolkit/locales/l10n.mk#l1.115

those times don't really add up with when failures started getting reported (this bug). Did something change in the artifact resolution logic on tc end?

I see 4 options:

1) change wget-en-US's wget arguments

2) change wget version on windows machines

3) try to install curl for this use case

4) find fix in tc land or change artifact url wget is using

This is currently blocking 48.0b2, as we are up to 17 reruns on one failing task. If reruns don't succeed, we may need to start over with build 2 based off a different gecko rev, but build2 might fail for releaseduty folks again.
I don't think there's much that the TC services can do to work around a broken wget, and from the comments above I see nothing to indicate the broken-wget theory is incorrect.  I think the timing of this bug corresponds to when we enabled cloud-mirror, which uses different and additional redirects beyond those in the old system (which just redirected to a us-west-2 s3 bucket).

If calling wget differently can work around the issue, that would be great.  Otherwise I expect that updating or replacing it with curl is the only choice.  Perhaps another option is to only run the release in us-west-2?
I logged into a failing host via VNC and got some info.
converted 'https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/https%3A%2F%2Fs3-
us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FHT7z02JaSx6QSM5VJsaTeg%2F0%2Fpubl
ic%2Fbuild%2Ffirefox-48.0.en-US.win64.zip' (ASCII) -> 'https://cloud-mirror.taskcluster.ne
t/v1/redirect/s3/us-east-1/https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts
/HT7z02JaSx6QSM5VJsaTeg/0/public/build/firefox-48.0.en-US.win64.zip' (UTF-8)

That's wrong -- it shouldn't be urldecoding that.
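The conversion in that debug output is just a plain url decode of the redirect target, which is easy to reproduce. The sketch below uses Python's standard `urllib.parse.unquote` as a stand-in for whatever wget does internally (that part is an assumption), with the URL copied from the debug output above.

```python
from posixpath import basename
from urllib.parse import unquote

# Redirect target exactly as logged before the "(ASCII) -> (UTF-8)" conversion.
LOCATION = (
    "https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/"
    "https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts"
    "%2FHT7z02JaSx6QSM5VJsaTeg%2F0%2Fpublic%2Fbuild%2Ffirefox-48.0.en-US.win64.zip"
)

# Stand-in for the unwanted decode step.
decoded = unquote(LOCATION)

print(decoded)
# The last path segment is what a downloader would use as the filename.
print(basename(LOCATION))
print(basename(decoded))
```

The decoded string is the same URL that shows up with the 404 in the build logs, and the two basenames line up with the filename difference noted earlier: the encoded form keeps `https%3A%2F%2Fs3...` as the name, the decoded form ends up as `firefox-48.0.en-US.win64.zip`.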
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> converted
> 'https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/
> https%3A%2F%2Fs3-
> us-west-2.amazonaws.com%2Ftaskcluster-public-
> artifacts%2FHT7z02JaSx6QSM5VJsaTeg%2F0%2Fpubl
> ic%2Fbuild%2Ffirefox-48.0.en-US.win64.zip' (ASCII) ->
> 'https://cloud-mirror.taskcluster.ne
> t/v1/redirect/s3/us-east-1/https://s3-us-west-2.amazonaws.com/taskcluster-
> public-artifacts
> /HT7z02JaSx6QSM5VJsaTeg/0/public/build/firefox-48.0.en-US.win64.zip' (UTF-8)
> 
> That's wrong -- it shouldn't be urldecoding that.

So to be clear, this works in us-west-2 because cloud-mirror is not invoked (the canonical url for the artifacts is also in us-west-2), so it is only an issue in us-east, and only on windows with this wget.
So... on looking into this further.

On a failing host either of the two additional options results in a passing wget call (getting the correct binary):
  *  --local-encoding=UTF8
  *  --no-iri

Apparently it's the round trip through the encoding scheme that is triggering the urldecode!

However on our Linux machines (tested on a buildbot master):

[jwood@buildbot-master94.bb.releng.use1.mozilla.com ~]$ wget --no-cache --debug -N  'https://queue.taskcluster.net/v1/task/HT7z02JaSx6QSM5VJsaTeg/artifacts/public/build/firefox-48.0.en-US.win64.zip' --local-encoding=ASCII
Setting --timestamping (timestamping) to 1
Setting --local-encoding (localencoding) to ASCII
This version does not have support for IRIs
[jwood@buildbot-master94.bb.releng.use1.mozilla.com ~]$ wget --no-cache --debug -N  'https://queue.taskcluster.net/v1/task/HT7z02JaSx6QSM5VJsaTeg/artifacts/public/build/firefox-48.0.en-US.win64.zip' --local-encoding=UTF8
Setting --timestamping (timestamping) to 1
Setting --local-encoding (localencoding) to UTF8
This version does not have support for IRIs

So that option is a no-go unless we also upgrade linux's wget.

On the bright side though, using --no-iri works on our linux and mac builders without issue!
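For anyone calling wget from a script rather than from l10n.mk, the workaround amounts to adding one flag to the existing invocation. A small sketch, assuming a hypothetical `build_wget_command` helper (this is not the attached patch, just the same idea expressed as an argument list):

```python
def build_wget_command(url, no_iri=True):
    """Return the argv for fetching `url` with the flags used by wget-en-US.

    `build_wget_command` is a hypothetical helper; the base flags are the
    ones seen in the failing logs, and --no-iri is the workaround tested above.
    """
    cmd = ["wget", "--no-cache", "-nv", "-N"]
    if no_iri:
        # Turn off IRI handling so the redirect URL is not re-encoded/decoded.
        cmd.append("--no-iri")
    cmd.append(url)
    return cmd


ARTIFACT_URL = (
    "https://queue.taskcluster.net/v1/task/HT7z02JaSx6QSM5VJsaTeg/"
    "artifacts/public/build/firefox-48.0.en-US.win64.zip"
)

print(" ".join(build_wget_command(ARTIFACT_URL)))
```

The list can be handed to `subprocess.run(cmd, check=True)` as is; passing an argument list rather than a shell string also avoids any quoting issues with the URL.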
Attached patch [m-c] wget_noiri
This should fix it.
Assignee: nobody → bugspam.Callek
Status: NEW → ASSIGNED
Attachment #8764236 - Flags: review?(ted)
Comment on attachment 8764236 [details] [diff] [review]
[m-c] wget_noiri

Review of attachment 8764236 [details] [diff] [review]:
-----------------------------------------------------------------

Can you test repacks on Try now that we have that available? This patch looks straightforward but you know how l10n repacks are...
Attachment #8764236 - Flags: review?(ted) → review+
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c0e42b24f36b was my try run (after first push failed to properly trim down number of l10n changesets)

Blue jobs are just buildbot windows being funky (and related to the actual tree-closure now)
Moving to Core::Build Config -- to both be appropriate for where the patch ended up, and to get the approval flags I need.
Component: Task Configuration → Build Config
Product: Taskcluster → Core
Version: unspecified → Trunk
Comment on attachment 8764236 [details] [diff] [review]
[m-c] wget_noiri

Approval Request Comment
[Feature/regressing bug #]: N/A
[User impact if declined]: Betas could be delayed
[Describe test coverage new/current, TreeHerder]: Run on try for all flavors of desktop l10n repacks that are testable on try
Also manually tested the new flag on different OS's
[Risks and why]: Low risk, just improves accuracy/speed of our ability to ship releases in terms of how we pull the artifacts down
[String/UUID change made/needed]: None
Attachment #8764236 - Flags: approval-mozilla-release?
Attachment #8764236 - Flags: approval-mozilla-beta?
Attachment #8764236 - Flags: approval-mozilla-aurora?
Comment on attachment 8764236 [details] [diff] [review]
[m-c] wget_noiri

RelEng would like this to land on all branches, Beta48+, Aurora49+, Release47+
Attachment #8764236 - Flags: approval-mozilla-release?
Attachment #8764236 - Flags: approval-mozilla-release+
Attachment #8764236 - Flags: approval-mozilla-beta?
Attachment #8764236 - Flags: approval-mozilla-beta+
Attachment #8764236 - Flags: approval-mozilla-aurora?
Attachment #8764236 - Flags: approval-mozilla-aurora+
Pushed by Callek@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/37edfd2c53ed
Workaround a wget bug by not performing internationalisation. r=ted
https://hg.mozilla.org/mozilla-central/rev/37edfd2c53ed
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Comment on attachment 8764236 [details] [diff] [review]
[m-c] wget_noiri

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: Makes release promotion automation much more stable
User impact if declined: Potential delays in getting builds to QA
Fix Landed on Version: Landed all the way up to release 47.0.x
Risk to taking this patch (and alternatives if risky): Very Low (only affects the command used to fetch our artifacts)
String or UUID changes made by this patch: None

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.

I should note this patch, as is, will conflict when applied to ESR (the command is slightly changed on ESR). It's a trivial change though, and I can correct for the bitrot easily. I don't feel that it needs a new review.
Attachment #8764236 - Flags: approval-mozilla-esr45?
Comment on attachment 8764236 [details] [diff] [review]
[m-c] wget_noiri

This is needed for release build promotion, ESR45+
Attachment #8764236 - Flags: approval-mozilla-esr45? → approval-mozilla-esr45+
Yeah, doesn't apply to esr45. Mind posting a fixed patch or landing this yourself?
Flags: needinfo?(bugspam.Callek)
Blocks: 1307371
Product: Core → Firefox Build System