Closed Bug 1448066 Opened 7 years ago Closed 4 years ago

Intermittent [test-linux.sh:error] Failed to download and unzip mozharness

Categories

(Firefox Build System :: Task Configuration, task, P5)

task

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

Details

(Keywords: intermittent-failure)

Filed by: apavel [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=169543852&repo=mozilla-inbound

https://queue.taskcluster.net/v1/task/US2avqY1QOKiqLjU6PEriQ/runs/0/artifacts/public/logs/live_backing.log

https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/US2avqY1QOKiqLjU6PEriQ/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1

[task 2018-03-21T21:45:09.246Z] 0 0 0 0 0 0 0 0 --:--:-- 0:02:07 --:--:-- 0curl: (7) Failed to connect to s3-us-west-2.amazonaws.com port 443: Connection timed out
[task 2018-03-21T21:45:09.254Z] + echo 'failed to download mozharness zip'
[task 2018-03-21T21:45:09.254Z] failed to download mozharness zip
[task 2018-03-21T21:45:09.255Z] + echo 'Download failed, retrying in 2 seconds...'
[task 2018-03-21T21:45:09.255Z] Download failed, retrying in 2 seconds...
[task 2018-03-21T21:45:09.255Z] + sleep 2
[task 2018-03-21T21:45:11.256Z] + timeout=4
[task 2018-03-21T21:45:11.256Z] + attempt=2
[task 2018-03-21T21:45:11.256Z] + [[ 2 < 10 ]]
[task 2018-03-21T21:45:11.256Z] + fail 'Failed to download and unzip mozharness'
[task 2018-03-21T21:45:11.256Z] + echo
[task 2018-03-21T21:45:11.256Z]
[task 2018-03-21T21:45:11.256Z] + echo '[test-linux.sh:error]' 'Failed to download and unzip mozharness'
[task 2018-03-21T21:45:11.256Z] [test-linux.sh:error] Failed to download and unzip mozharness
[task 2018-03-21T21:45:11.256Z] + exit 1
[task 2018-03-21T21:45:11.256Z] cleanup
[task 2018-03-21T21:45:11.256Z] + cleanup
[task 2018-03-21T21:45:11.256Z] + local rv=1
[task 2018-03-21T21:45:11.256Z] + [[ -s /builds/worker/.xsession-errors ]]
[task 2018-03-21T21:45:11.256Z] + true
[task 2018-03-21T21:45:11.256Z] + cleanup_xvfb
[task 2018-03-21T21:45:11.257Z] /builds/worker/bin/test-linux.sh: line 71: cleanup_xvfb: command not found
[taskcluster 2018-03-21 21:45:11.536Z] === Task Finished ===
[taskcluster 2018-03-21 21:45:18.910Z] Unsuccessful task run with exit code: 127 completed in 269.357 seconds
Component: Testing → Operations
Product: Firefox for Android → Taskcluster
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → INCOMPLETE
This is still happening.

Recent failure: https://treeherder.mozilla.org/logviewer.html#?job_id=208495544&repo=autoland&lineNumber=544

[task 2018-10-29T23:02:26.761Z] + echo 'failed to download mozharness zip'
[task 2018-10-29T23:02:26.761Z] failed to download mozharness zip
[task 2018-10-29T23:02:26.761Z] + echo 'Download failed, retrying in 2 seconds...'
[task 2018-10-29T23:02:26.761Z] Download failed, retrying in 2 seconds...
[task 2018-10-29T23:02:26.761Z] + sleep 2
[task 2018-10-29T23:02:28.763Z] + timeout=4
[task 2018-10-29T23:02:28.763Z] + attempt=2
[task 2018-10-29T23:02:28.763Z] + [[ 2 < 10 ]]
[task 2018-10-29T23:02:28.763Z] + fail 'Failed to download and unzip mozharness'
[task 2018-10-29T23:02:28.763Z] + echo
[task 2018-10-29T23:02:28.763Z]
[task 2018-10-29T23:02:28.763Z] + echo '[test-linux.sh:error]' 'Failed to download and unzip mozharness'
[task 2018-10-29T23:02:28.763Z] [test-linux.sh:error] Failed to download and unzip mozharness
[task 2018-10-29T23:02:28.763Z] + exit 1
[task 2018-10-29T23:02:28.763Z] cleanup
[task 2018-10-29T23:02:28.763Z] + cleanup
[task 2018-10-29T23:02:28.763Z] + local rv=1
[task 2018-10-29T23:02:28.763Z] + [[ -s /builds/worker/.xsession-errors ]]
[task 2018-10-29T23:02:28.763Z] + true
[task 2018-10-29T23:02:28.763Z] + cleanup_xvfb
[task 2018-10-29T23:02:28.764Z] /builds/worker/bin/test-linux.sh: line 72: cleanup_xvfb: command not found
[taskcluster 2018-10-29 23:02:29.213Z] === Task Finished ===
[taskcluster 2018-10-29 23:02:35.429Z] Unsuccessful task run with exit code: 127 completed in 372.26 seconds
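
Both 2018 logs end with `cleanup_xvfb: command not found` and a task exit code of 127, not the 1 that was passed to `exit` (bash reports 127 when a command is not found). A minimal sketch of a cleanup handler that keeps the original status, assuming the `cleanup_xvfb` hook may be undefined in some configurations. `run_task` is a hypothetical wrapper used only to make the example self-contained; it is not part of test-linux.sh.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command in a subshell with an EXIT trap that
# preserves the command's exit status even if an optional hook is missing.
run_task() {
    (
        cleanup() {
            local rv=$?   # status that triggered the EXIT trap
            # Call the optional hook only if it is defined as a function.
            if declare -F cleanup_xvfb >/dev/null; then
                cleanup_xvfb || true
            fi
            exit "$rv"
        }
        trap cleanup EXIT
        "$@"
    )
}
```

With this shape, `run_task sh -c 'exit 1'` returns 1 whether or not `cleanup_xvfb` exists, so the reported exit code reflects the download failure and not the missing hook.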
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Component: Operations → Operations and Service Requests
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → INCOMPLETE

This download operation should be retried in these circumstances.
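
A minimal sketch of such a retry, assuming a doubling delay like the one visible in the logs (`retrying in 2 seconds`, then `timeout=4`). `retry_with_backoff` is a hypothetical helper, not the code in test-linux.sh, and `MOZHARNESS_URL` in the usage comment is an assumed variable name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to max_attempts times,
# doubling the delay after each failed attempt.
retry_with_backoff() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    # (( ... )) evaluates its operands as integers.
    while (( attempt <= max_attempts )); do
        if "$@"; then
            return 0
        fi
        if (( attempt < max_attempts )); then
            echo "Attempt ${attempt} failed, retrying in ${delay} seconds..." >&2
            sleep "$delay"
            delay=$(( delay * 2 ))
        fi
        attempt=$(( attempt + 1 ))
    done
    echo "Giving up after ${max_attempts} attempts" >&2
    return 1
}

# Example (assumed URL variable):
#   retry_with_backoff 10 2 curl --fail -L -o mozharness.zip "$MOZHARNESS_URL"
```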

Component: Operations and Service Requests → Task Configuration
Product: Taskcluster → Firefox Build System

It looks like it already has retries. Given the number of failures here, it does not seem worth dealing with this.
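
One possible reading of the traces above: each log shows `attempt=2`, then `[[ 2 < 10 ]]`, then an immediate call to `fail`. Inside `[[ ]]`, bash's `<` compares strings lexicographically, so `2 < 10` is false and a loop guarded by that test would stop after a single pass. The sketch below only demonstrates the comparison behaviour; the function names are hypothetical and the actual loop in test-linux.sh is not shown in this bug.

```shell
#!/usr/bin/env bash
# Inside [[ ]], < compares strings, so "2" sorts after "10".
string_less_than() { [[ $1 < $2 ]]; }
# Inside (( )), the operands are evaluated as integers.
numeric_less_than() { (( $1 < $2 )); }

if string_less_than 2 10; then
    echo "string: 2 < 10"
else
    echo "string: 2 is not < 10"
fi

if numeric_less_than 2 10; then
    echo "numeric: 2 < 10"
else
    echo "numeric: 2 is not < 10"
fi
```

This prints `string: 2 is not < 10` followed by `numeric: 2 < 10`. Using `(( attempt < max ))` or `[[ $attempt -lt $max ]]` gives the integer comparison.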

Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → WONTFIX
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

There are 31 total failures in the last 7 days on android-em-7-0-x86_64 debug and opt and android-hw-p2-8-0-android-aarch64 pgo.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=259704228&repo=autoland&lineNumber=349

[task 2019-08-02T23:02:43.594Z] Download failed, retrying in 2 seconds...
[task 2019-08-02T23:02:43.594Z] + sleep 2
[task 2019-08-02T23:02:43.594Z] + timeout=4
[task 2019-08-02T23:02:43.594Z] + attempt=2
[task 2019-08-02T23:02:43.594Z] + [[ 2 < 10 ]]
[task 2019-08-02T23:02:43.594Z] + fail 'Failed to download and unzip mozharness'
[task 2019-08-02T23:02:43.594Z] + echo
[task 2019-08-02T23:02:43.594Z]
[task 2019-08-02T23:02:43.594Z] + echo '[test-linux.sh:error]' 'Failed to download and unzip mozharness'
[task 2019-08-02T23:02:43.594Z] [test-linux.sh:error] Failed to download and unzip mozharness
[task 2019-08-02T23:02:43.594Z] + exit 1
[task 2019-08-02T23:02:43.594Z] cleanup
[task 2019-08-02T23:02:43.594Z] + cleanup
[task 2019-08-02T23:02:43.594Z] + local rv=1
[task 2019-08-02T23:02:43.594Z] + [[ -s /builds/worker/.xsession-errors ]]
[task 2019-08-02T23:02:43.594Z] + false
[task 2019-08-02T23:02:43.594Z] + exit 1
[task 2019-08-02T23:02:43.594Z]
[task 2019-08-02T23:02:43.594Z] netstat -aop
[task 2019-08-02T23:02:43.594Z] Active Internet connections (servers and established)
[task 2019-08-02T23:02:43.594Z] Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name Timer
[task 2019-08-02T23:02:43.594Z] tcp 0 0 127.0.0.11:39021 *:* LISTEN - off (0.00/0/0)
[task 2019-08-02T23:02:43.594Z] tcp 0 0 bitbar-ubuntu-238:57946 hg.public.mdc1.mo:https ESTABLISHED 21/generic-worker keepalive (3.30/0/0)
[task 2019-08-02T23:02:43.594Z] tcp 0 0 bitbar-ubuntu-238:43432 ec2-54-70-57-161.:https ESTABLISHED 21/generic-worker off (0.00/0/0)
[task 2019-08-02T23:02:43.594Z] tcp 0 0 bitbar-ubuntu-238:53644 lga15s43-in-f42.1:https ESTABLISHED 19/python off (0.00/0/0)
[task 2019-08-02T23:02:43.594Z] tcp 0 0 localhost:5037 localhost:34519 TIME_WAIT - timewait (59.89/0/0)
[task 2019-08-02T23:02:43.594Z] tcp 0 0 localhost:51314 localhost:60022 ESTABLISHED 21/generic-worker keepalive (25.83/0/0)
[task 2019-08-02T23:02:43.594Z] tcp 0 0 bitbar-ubuntu-238:45536 sfo07s13-in-f10.1:https ESTABLISHED 19/python off (0.00/0/0)
[task 2019-08-02T23:02:43.594Z] tcp6 0 0 [::]:60099 [::]:* LISTEN 50/livelog off (0.00/0/0)
[task 2019-08-02T23:02:43.594Z] tcp6 0 0 [::]:60022 [::]:* LISTEN 50/livelog off (0.00/0/0)
[task 2019-08-02T23:02:43.594Z] tcp6 0 0 localhost:60022 localhost:51314 ESTABLISHED 50/livelog keepalive (89.93/0/0)
[task 2019-08-02T23:02:43.594Z] udp 0 0 127.0.0.11:56180 *:* - off (0.00/0/0)
[task 2019-08-02T23:02:43.594Z] Active UNIX domain sockets (servers and established)
[task 2019-08-02T23:02:43.594Z] Proto RefCnt Flags Type State I-Node PID/Program name Path
[task 2019-08-02T23:02:43.594Z]
[task 2019-08-02T23:02:43.594Z]
[task 2019-08-02T23:02:43.594Z]
[task 2019-08-02T23:02:43.594Z] script.py exitcode 1
[taskcluster 2019-08-02T23:02:43.611Z] Exit Code: 1
[taskcluster 2019-08-02T23:02:43.611Z] User Time: 411.76ms
[taskcluster 2019-08-02T23:02:43.611Z] Kernel Time: 249.31ms
[taskcluster 2019-08-02T23:02:43.611Z] Wall Time: 1m23.383673664s
[taskcluster 2019-08-02T23:02:43.611Z] Result: FAILED
[taskcluster 2019-08-02T23:02:43.611Z] === Task Finished ===
[taskcluster 2019-08-02T23:02:43.611Z] Task Duration: 1m25.068088501s
[taskcluster 2019-08-02T23:02:44.424Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/dA08qvPJQ62rtS5ESfBF-w/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2020-08-01T20:27:37.007Z
[taskcluster:error] exit status 1

Tom, can you please assign someone to take a look?

Flags: needinfo?(mozilla)
Whiteboard: [retriggered] → [retriggered][stockwell needswork:owner]

Nearly all the recent failures were on packet.net on July 31; there was scheduled maintenance on a packet.net switch on that day.

Flags: needinfo?(mozilla)
Whiteboard: [retriggered][stockwell needswork:owner]

Thank you Geoff.

Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 6 years ago → 5 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

I've rebooted several of the devices per Wander's recommendations. It doesn't seem to help.

Machine-13 was just rebooted and has had 4 failures.

https://firefox-ci-tc.services.mozilla.com/provisioners/terraform-packet/worker-types/gecko-t-linux/workers/packet-sjc1/machine-13

I've quarantined the workers with a low success rate.

packet_hosts_to_quarantine = [4, 6, 10, 13, 14, 16, 43, 53, 54, 65, 68]

gecko-t-linux.machine-4   {sr: [          ]   0.0%, suc:  0, cmp:  3, exc:  2, rng:  2, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-6   {sr: [          ]   0.0%, suc:  0, cmp:  1, exc:  1, rng:  1, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-10  {sr: [===       ]  37.5%, suc:  6, cmp: 16, exc:  4, rng:  0, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-13  {sr: [          ]   0.0%, suc:  0, cmp: 11, exc:  6, rng:  3, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-14  {sr: [          ]   0.0%, suc:  0, cmp:  2, exc:  0, rng:  4, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-16  {sr: [=         ]  16.7%, suc:  2, cmp: 12, exc:  5, rng:  3, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-43  {sr: [          ]   6.2%, suc:  1, cmp: 16, exc:  4, rng:  0, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-53  {sr: [======    ]  66.7%, suc:  4, cmp:  6, exc:  0, rng:  4, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-54  {sr: [==        ]  25.0%, suc:  3, cmp: 12, exc:  6, rng:  2, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-65  {sr: [          ]   0.0%, suc:  0, cmp: 10, exc:  5, rng:  5, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-68  {sr: [          ]   0.0%, suc:  0, cmp:  3, exc:  1, rng:  3, alerts: ['Low health (less than 0.85)!']}

(In reply to Andrew Erickson [:aerickson] from comment #60)

> I've quarantined the workers with a low success rate.
>
> packet_hosts_to_quarantine = [4, 6, 10, 13, 14, 16, 43, 53, 54, 65, 68]
>
> gecko-t-linux.machine-4   {sr: [          ]   0.0%, suc:  0, cmp:  3, exc:  2, rng:  2, alerts: ['Low health (less than 0.85)!']}
> gecko-t-linux.machine-6   {sr: [          ]   0.0%, suc:  0, cmp:  1, exc:  1, rng:  1, alerts: ['Low health (less than 0.85)!']}
> gecko-t-linux.machine-10  {sr: [===       ]  37.5%, suc:  6, cmp: 16, exc:  4, rng:  0, alerts: ['Low health (less than 0.85)!']}
> gecko-t-linux.machine-13  {sr: [          ]   0.0%, suc:  0, cmp: 11, exc:  6, rng:  3, alerts: ['Low health (less than 0.85)!']}
> gecko-t-linux.machine-14  {sr: [          ]   0.0%, suc:  0, cmp:  2, exc:  0, rng:  4, alerts: ['Low health (less than 0.85)!']}
> gecko-t-linux.machine-16  {sr: [=         ]  16.7%, suc:  2, cmp: 12, exc:  5, rng:  3, alerts: ['Low health (less than 0.85)!']}
> gecko-t-linux.machine-43  {sr: [          ]   6.2%, suc:  1, cmp: 16, exc:  4, rng:  0, alerts: ['Low health (less than 0.85)!']}
> gecko-t-linux.machine-53  {sr: [======    ]  66.7%, suc:  4, cmp:  6, exc:  0, rng:  4, alerts: ['Low health (less than 0.85)!']}
> gecko-t-linux.machine-54  {sr: [==        ]  25.0%, suc:  3, cmp: 12, exc:  6, rng:  2, alerts: ['Low health (less than 0.85)!']}
> gecko-t-linux.machine-65  {sr: [          ]   0.0%, suc:  0, cmp: 10, exc:  5, rng:  5, alerts: ['Low health (less than 0.85)!']}
> gecko-t-linux.machine-68  {sr: [          ]   0.0%, suc:  0, cmp:  3, exc:  1, rng:  3, alerts: ['Low health (less than 0.85)!']}

I rebuilt all quarantined machines.

Thanks. :) I'll put them in gradually and monitor them.

Status: REOPENED → RESOLVED
Closed: 5 years ago → 5 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

Issue seems to be bigger, so I filed bug 1631409 for it.

Flags: needinfo?(aerickson)

Yes, please feel free to quarantine if this is seen. I will quarantine the above hosts.

packet hosts [15, 1, 53, 29, 14] have been quarantined.

I've also quarantined: [7, 2, 37, 46, 28, 45, 22, 64] due to success rate below 80%.

gecko-t-linux.machine-7   {sr: [          ]   0.0%, suc:  0, cmp: 13, exc:  7, rng:  0, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-2   {sr: [=         ]  11.8%, suc:  2, cmp: 17, exc:  3, rng:  0, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-37  {sr: [===       ]  38.5%, suc:  5, cmp: 13, exc:  7, rng:  0, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-46  {sr: [=====     ]  50.0%, suc:  1, cmp:  2, exc:  0, rng:  5, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-28  {sr: [=====     ]  58.3%, suc:  7, cmp: 12, exc:  4, rng:  4, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-45  {sr: [======    ]  66.7%, suc: 10, cmp: 15, exc:  5, rng:  0, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-22  {sr: [=======   ]  73.3%, suc: 11, cmp: 15, exc:  5, rng:  0, alerts: ['Low health (less than 0.85)!']}
gecko-t-linux.machine-64  {sr: [=======   ]  76.5%, suc: 13, cmp: 17, exc:  3, rng:  0, alerts: ['Low health (less than 0.85)!']}
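
The `sr` percentages in these reports are consistent with `suc / cmp` (for example 13 / 17 = 76.5% for machine-64). A minimal sketch of selecting hosts below a cutoff from such lines, assuming that ratio is the success rate. `low_success_hosts` is a hypothetical helper, not the tool that produced the report, and the cutoff is passed in as a percentage.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read health-report lines on stdin and print
# "<machine number> <success rate>" for hosts below the given percentage.
low_success_hosts() {
    local threshold=$1
    awk -v threshold="$threshold" '
        # Return the integer that follows "<name>:" on the line, or -1.
        function field(line, name,    s) {
            if (!match(line, name ": *[0-9]+")) return -1
            s = substr(line, RSTART, RLENGTH)
            sub(/^[^0-9]*/, "", s)
            return s + 0
        }
        match($0, /machine-[0-9]+/) {
            host = substr($0, RSTART + 8, RLENGTH - 8)   # strip "machine-"
            suc = field($0, "suc")
            cmp = field($0, "cmp")
            # Skip hosts with no completed tasks to avoid dividing by zero.
            if (suc >= 0 && cmp > 0 && 100 * suc / cmp < threshold)
                printf "%s %.1f%%\n", host, 100 * suc / cmp
        }
    '
}

# Example: low_success_hosts 80 < health_report.txt   (assumed file name)
```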

68 also quarantined due to success rate.

(In reply to Andrew Erickson [:aerickson] from comment #71)

> packet hosts [15, 1, 53, 29, 14] have been quarantined.

I've rebuilt the offending workers.

Status: REOPENED → RESOLVED
Closed: 5 years ago → 5 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 5 years ago → 5 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 5 years ago → 5 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 5 years ago → 4 years ago
Resolution: --- → INCOMPLETE