High ratio of failed jobs on bld-lion-r5 machines

RESOLVED WORKSFORME

Status

Release Engineering
Platform Support
3 years ago
11 months ago

People

(Reporter: aselagea, Unassigned)

Tracking

(Depends on: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments)

(Reporter)

Description

3 years ago
Noticed that many jobs are failing on this pool, for several different reasons:
    - command timeouts
    - failures connecting to an upload location, e.g.:
"Connecting to upload.tbirdbld.productdelivery.prod.mozaws.net|52.88.134.149|:80... failed: Operation timed out.Retrying."
    - failures running some commands

Is there something we can do to fix some of these failures?
(Reporter)

Comment 1

3 years ago
After some investigation, here are the most common types of failing jobs and the errors seen in their logs:

1. many "Thunderbird <comm-central, comm-aurora> macosx64|l10n nightly" and some "Thunderbird <comm-central, comm-aurora> macosx64|l10n dep" jobs are failing because they cannot reach the location below:

"['bash', '-c', 'if [ ! -f mar ]; then wget -O  mar --no-check-certificate http://upload.tbirdbld.productdelivery.prod.mozaws.net/pub/mozilla.org/thunderbird/nightly/latest-comm-aurora/mar-tools/macosx64/mar; fi;        (test -e mar && test -s mar) || exit 1; chmod 755 mar']"
...
"Connecting to upload.tbirdbld.productdelivery.prod.mozaws.net|52.88.134.149|:80... failed: Operation timed out.Giving up."

2. many "Thunderbird <comm-central, comm-aurora> macosx64|l10n dep" and "Firefox <comm-central, comm-aurora> macosx64|l10n dep" builds are failing because the l10n base directory passed to configure does not exist:
"configure: error: Invalid value --with-l10n-base, ../../l10n doesn't exist"

3. lots of fuzzer-macosx64-lion failures due to fuzzer.sh timeouts:
"command timed out: 1800 seconds without output running ['bash', 'scripts/scripts/fuzzing/fuzzer.sh'], attempting to kill"

4. "OS X 10.7 64-bit b2g-inbound debug static analysis build" jobs fail with the following error:
"FATAL - 'mach build' did not run successfully. Please check log for errors."
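
For triaging a batch of logs against the four cases above, a minimal sketch along these lines might help. The regular expressions are derived from the error strings quoted in this comment; the category labels and the function names (classify_log, summarize) are made up for illustration and are not part of any existing tool.

```python
import re
from collections import Counter

# Patterns are based on the error strings quoted above.
# The labels are illustrative names, not official failure categories.
FAILURE_PATTERNS = [
    ("upload_host_timeout",
     re.compile(r"Connecting to \S+\.\.\. failed: Operation timed out")),
    ("missing_l10n_base",
     re.compile(r"Invalid value --with-l10n-base, .* doesn't exist")),
    ("command_timeout",
     re.compile(r"command timed out: \d+ seconds without output")),
    ("mach_build_failed",
     re.compile(r"'mach build' did not run successfully")),
]


def classify_log(text):
    """Return the label of the first pattern in the list that matches, else "unknown"."""
    for label, pattern in FAILURE_PATTERNS:
        if pattern.search(text):
            return label
    return "unknown"


def summarize(logs):
    """Count how many logs fall into each category."""
    return Counter(classify_log(text) for text in logs)
```

Running summarize over the text of each failed job's log gives a rough count per case, which makes it easier to see which failure dominates on the pool.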
(Reporter)

Updated

3 years ago
Depends on: 1221391, 1224234

Comment 2

3 years ago
Pointers to log examples for each failure case would be helpful so people could debug with context.
(Reporter)

Comment 3

2 years ago
Created attachment 8690842 [details]
case_1.txt
(Reporter)

Comment 4

2 years ago
Created attachment 8690843 [details]
case_2.txt
(Reporter)

Comment 5

2 years ago
Created attachment 8690844 [details]
case_3.txt
(Reporter)

Comment 6

2 years ago
Since links to logs will expire at some point, attaching the log file for each case seems like a better approach (I did not attach a log file for the last issue, since most jobs of that kind are green at the moment).
(Reporter)

Updated

2 years ago
No longer depends on: 1221391

Comment 7

2 years ago
Is this bug still an ongoing issue?
(Reporter)

Comment 8

2 years ago
I checked several bld-lion-r5 slaves and noticed that the jobs for case 1 and case 3 are still failing with the errors from the attachments above.

My question is about case 2: bug 1221391 mentions that the l10n builders have been disabled (the patch is in production), yet these builders are still present on the masters and are still failing.

I took a closer look at Nick's patch and noticed that he set 'enable_l10n_onchange': False, while we still have 'enable_l10n_onchange': True, so only the l10n dep builders ending with "NightlyRepackFactory" have been disabled (I tested this on my local master).

If that was the intended change for that patch, then case 3 also stands (as these jobs are failing).
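
One way to check where a setting such as 'enable_l10n_onchange' is still True is to walk the loaded config and report every matching path. The sketch below assumes the config is available as nested Python dicts and lists; the BRANCHES layout and its values are purely hypothetical and do not reflect the real production config.

```python
def find_setting(config, key, value, path=()):
    """Yield the path to every occurrence of `key` whose value equals `value`."""
    if isinstance(config, dict):
        for k, v in config.items():
            # Compare types too, so that 1 is not mistaken for True.
            if k == key and type(v) is type(value) and v == value:
                yield path + (k,)
            else:
                yield from find_setting(v, key, value, path + (k,))
    elif isinstance(config, (list, tuple)):
        for i, item in enumerate(config):
            yield from find_setting(item, key, value, path + (i,))


# Hypothetical config layout and values, for illustration only.
BRANCHES = {
    "comm-central": {"platforms": {"macosx64": {"enable_l10n_onchange": False}}},
    "comm-aurora": {"platforms": {"macosx64": {"enable_l10n_onchange": True}}},
}

hits = list(find_setting(BRANCHES, "enable_l10n_onchange", True))
for hit in hits:
    print(" -> ".join(str(part) for part in hit))
```

An empty result on the production config would indicate that no branch still has the flag enabled.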
Flags: needinfo?(kmoir)

Comment 9

2 years ago
I don't see any enable_l10n_onchange with value True in production code, but if you think there is a problem with the earlier patch, feel free to submit a new one :-)
Flags: needinfo?(kmoir)
(Reporter)

Comment 10

2 years ago
It seems I missed the fact that we only disabled the 'l10n dep' builders and did not touch the other 'l10n' ones :).
Given that, case 2 does not stand, but the remaining two do.
Depends on: 1222878

Updated

a year ago
Depends on: 1322344

Updated

11 months ago
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → WORKSFORME