Closed Bug 1165759 Opened 9 years ago Closed 9 years ago

Intermittent TC build failures due to 500/502/503/404 from wherever things are fetched

Categories

(Taskcluster :: General, defect)

Type: defect
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: philor, Assigned: garndt)

References

Details

(Keywords: intermittent-failure)

Sheriffs have been pretty much just ignoring this, but with talk about moving real jobs, not just b2g, to TaskCluster, something needs to actually be done about it.
Depends on: 1165760
A big spike in these is now closing trees.
Severity: normal → blocker
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #40)
> A big spike in these is now closing trees.

Latest retriggers are all coming back green, so I'm hesitantly reopening trees to see how it goes.
And just like that, the failures start coming back in, even after jlal landed bug 1166073.
Depends on: 1156326
See Also: → 1165765
Summary: Intermittent TC build failures due to 500/502/503/404 from quay.io → Intermittent TC build failures due to 500/502/503/404 from wherever things are fetched
Comment 303 is actually an upload error... We probably need to start splitting up this bug... We certainly have some errors related to pulling images (which I think were the majority of the original filings), but many of those issues have been resolved by moving to Docker Hub. NI'ing myself here to dig through the list tomorrow.
Flags: needinfo?(jlal)
I've entered two tickets for the most common pull errors I've been seeing lately.

https://bugzilla.mozilla.org/show_bug.cgi?id=1170997
https://bugzilla.mozilla.org/show_bug.cgi?id=1170999
Having better (read: usable) failure logs would go a long way toward avoiding dumping-ground bugs like this one.
I have started reviewing the reports here going from newest to oldest, and I'm back to about comment 168.  The two bugs in comment 306 are the primary causes of most of these.

Here is an Etherpad where I'm tracking which comment corresponds to which issue:
https://etherpad.mozilla.org/bug-1165759

I will triage new issues reported and begin working soon on the two bugs to try to find solutions and/or mitigate the issue a bit.
(In reply to Treeherder Robot from comment #318)
> log:
> https://treeherder.mozilla.org/logviewer.html#?repo=b2g-
> inbound&job_id=2088394
> repository: b2g-inbound
> start_time: 2015-06-04T01:56:47
> who: rvandermeulen[at]mozilla[dot]com
> machine: unknown
> revision: abeb01323218

Failed Mulet build:
*** No rule to make target `/home/worker/workspace/gecko/intl/icu/source/common/norm2allmodes.h', needed by `normalizer2.o'.  Stop.
Looks like comments 313 to 318 are caused by build errors.
(In reply to Treeherder Robot from comment #326)
> log:
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> inbound&job_id=10438003
> repository: mozilla-inbound
> start_time: 2015-06-04T10:28:30
> who: rvandermeulen[at]mozilla[dot]com
> machine: unknown
> revision: b03bb71b757f

Build failure.

.deps/host_dump_symbols.o.pp:191: *** missing separator.  Stop.
*** [toolkit/crashreporter/google-breakpad/src/common/linux/target] Error 2
Blocks: 1080265
(In reply to Treeherder Robot from comment #340)
> log:
> https://treeherder.mozilla.org/logviewer.html#?repo=fx-team&job_id=3357879
> repository: fx-team
> start_time: 2015-06-06T21:25:42
> who: philringnalda[at]gmail[dot]com
> machine: t-w732-ix-116
> buildname: Windows 7 32-bit fx-team pgo test web-platform-tests-1
> revision: c0177adc8763
> 
> TEST-UNEXPECTED-FAIL |
> /content-security-policy/script-src/script-src-1_2.html | Violation report
> status OK. - assert_equals: No report sent. expected "" but got "false"
> Return code: 1

Test failure 

INFO - TEST-UNEXPECTED-FAIL | /content-security-policy/script-src/script-src-1_2.html | Violation report status OK. - assert_equals: No report sent. expected "" but got "false"
(In reply to Treeherder Robot from comment #339)
> log:
> https://treeherder.mozilla.org/logviewer.html#?repo=fx-team&job_id=3357863
> repository: fx-team
> start_time: 2015-06-06T21:25:41
> who: philringnalda[at]gmail[dot]com
> machine: t-w732-ix-134
> buildname: Windows 7 32-bit fx-team pgo test reftest-no-accel
> revision: c0177adc8763
> 
> REFTEST TEST-UNEXPECTED-FAIL |
> file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/writing-mode/
> abspos/s71-abs-pos-non-replaced-vrl-094-ref.xht | load failed: timed out
> waiting for test to complete (waiting for onload scripts to complete)
> Return code: 1

Test Failure
(In reply to Treeherder Robot from comment #338)
> log:
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> inbound&job_id=10516941
> repository: mozilla-inbound
> start_time: 2015-06-06T22:59:33
> who: philringnalda[at]gmail[dot]com
> machine: t-w732-ix-072
> buildname: Windows 7 32-bit mozilla-inbound pgo test reftest-no-accel
> revision: 6eaeb7343ac6
> 
> REFTEST TEST-UNEXPECTED-FAIL |
> file:///C:/slave/test/build/tests/reftest/tests/layout/reftests/writing-mode/
> abspos/s71-abs-pos-non-replaced-vrl-068-ref.xht | load failed: timed out
> waiting for test to complete (waiting for onload scripts to complete)
> Return code: 1

Test failure
Depends on: 1170999
New workers will begin slowly rolling out with the changes made in bug 1170999 to retry image pulling. This was tested today on the 'raptor' worker type, which was experiencing a lot of docker issues. After the new code was deployed, we did not see a single failure.
Flags: needinfo?(jlal)
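Not the actual docker-worker change from bug 1170999, but for anyone following along, a rough Python sketch of the retry-with-backoff idea; the image name and retry parameters here are made up:

import subprocess
import time

def pull_with_retries(image, max_attempts=5, base_delay=5):
    """Pull a docker image, retrying transient registry errors (500/502/503)
    with exponential backoff instead of failing the task on the first hiccup."""
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(["docker", "pull", image],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return
        if attempt == max_attempts:
            raise RuntimeError("failed to pull %s after %d attempts: %s"
                               % (image, max_attempts, result.stderr.strip()))
        # back off 5s, 10s, 20s, ... before the next attempt
        time.sleep(base_delay * 2 ** (attempt - 1))

# e.g. pull_with_retries("example/builder:latest")  # placeholder image name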
(In reply to Greg Arndt [:garndt] from comment #342)
> Test failure 

Yeah, those are just the result of having the wrong bug number in the clipboard. It's easily solved by outputting something that Treeherder can parse so it can suggest a bug, taking the clipboard out of it entirely.
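Strictly as an illustration (not something the worker emits today), an infra failure could be written out in the same TEST-UNEXPECTED-* style that Treeherder's log parser already picks up for the quoted test failures above, which would let it offer bug suggestions; the step name and message below are hypothetical:

import sys

def report_infra_failure(step, message):
    # Emit a line in the error format Treeherder's log parser recognizes,
    # so the failure shows up with bug suggestions instead of needing a
    # bug number from someone's clipboard.
    print("TEST-UNEXPECTED-FAIL | %s | %s" % (step, message), file=sys.stderr)

report_infra_failure("docker-image-pull",   # hypothetical step name
                     "HTTP 502 from registry while pulling image")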
Comments 370 and 371 failed for the same reason: the task run time exceeded the maximum allowed: [taskcluster] Error: Task timeout after 14400 seconds. Force killing
container.

Considering our low incidence of docker image pull issues (the last one was flagged 10 days ago), I suggest it might be time to resolve this bug as fixed by bug 1170999.
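For context on where the 14400 comes from: it's the task's configured maximum run time, not an infra hiccup. A sketch of the relevant part of a docker-worker task payload (written as a Python dict here; the image and command values are placeholders):

task_payload = {
    "image": "example/builder:latest",             # placeholder image
    "command": ["/bin/bash", "-c", "./build.sh"],  # placeholder command
    "maxRunTime": 14400,  # seconds; the worker force-kills the container after this
}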
Component: TaskCluster → General
Product: Testing → Taskcluster
Version: Trunk → unspecified
No longer depends on: 1169454
Depends on: 1186619
The issue this bug was filed for has long since been fixed. Remaining issues with Taskcluster builds are mostly being tracked in bug 1189830.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee: nobody → garndt