Closed Bug 1451224 Opened 6 years ago Closed 6 years ago

Long running jobs fail

Categories

(Taskcluster :: General, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: franziskus, Unassigned)

We have image building tasks for NSS that take some time. They used to fail intermittently, then worked for a while and now seem to always fail.

Here are some:
* https://tools.taskcluster.net/groups/GrKGAkFES2CaDxPFnepzgw/tasks/GrKGAkFES2CaDxPFnepzgw/runs/0/logs/public%2Flogs%2Flive.log
* https://tools.taskcluster.net/groups/MW7FUQEgRtaUV_cFN_zh_w/tasks/MW7FUQEgRtaUV_cFN_zh_w/runs/0/logs/public%2Flogs%2Flive.log
* https://tools.taskcluster.net/groups/QJenIm12RFC9YwFCnZLn5Q/tasks/QJenIm12RFC9YwFCnZLn5Q/runs/0/logs/public%2Flogs%2Flive.log

From what these jobs are running, there is no reason for them to fail. The Docker images run fine locally, and executing the jobs outside of the Docker image build works fine as well.
Hey Wander,

Could this be due to a recent hg-worker workerType upgrade?
Flags: needinfo?(wcosta)
I am not sure, going to investigate it.
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
This seems to be a problem with the task itself, more precisely in the Hacl build. Running it locally in a Vagrant machine with Ubuntu Bionic, I get the same error:

worker/hacl-star/code/bignum Hacl.Spe.Poly1305_32.fst
./Hacl.Bignum.Parameters.fst(201,14-201,17): (Error) assertion failed (see also /home/worker/hacl-star/dependencies/FStar/ulib/FStar.UInt64.fst(65,12-65,32))
Verified module: Hacl.Bignum.Parameters (528197 milliseconds)
1 error was reported (see above)
../../Makefile.include:159: recipe for target 'Hacl.Bignum.Parameters.fst-verify' failed
make[1]: *** [Hacl.Bignum.Parameters.fst-verify] Error 1
make[1]: *** Waiting for unfinished jobs....
Verified module: Hacl.Spe.Poly1305_32 (39651 milliseconds)
All verification conditions discharged successfully
make[1]: Leaving directory '/home/worker/hacl-star/code/poly1305_32'
make: *** [verify-nss] Error 2
Makefile:67: recipe for target 'verify-nss' failed
make: Leaving directory '/home/worker/hacl-star'
The command '/bin/sh -c bash /tmp/setup-user.sh' returned a non-zero code: 2
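For anyone triaging a log like the one above, a short script can pull out how long each module took to verify and which make targets failed. This is only a minimal sketch, not part of the NSS or HACL* tooling: the function name `summarize_log` and both regular expressions are assumptions based solely on the line shapes in this excerpt, and the sample text is a subset of the lines quoted above.

```python
import re

# Both patterns are assumptions derived from the log excerpt in this bug,
# not from any documented F* or make output format.
VERIFIED = re.compile(r"^Verified module: (\S+) \((\d+) milliseconds\)")
FAILED = re.compile(r"^make(?:\[\d+\])?: \*\*\* \[([^\]]+)\] Error (\d+)")


def summarize_log(text):
    """Return (timings, failures) extracted from the build output."""
    timings = {}   # module name -> verification time in seconds
    failures = []  # (make target, exit status) pairs, in log order
    for line in text.splitlines():
        m = VERIFIED.match(line)
        if m:
            timings[m.group(1)] = int(m.group(2)) / 1000.0
            continue
        m = FAILED.match(line)
        if m:
            failures.append((m.group(1), int(m.group(2))))
    return timings, failures


# A subset of the lines quoted in the comment above.
SAMPLE = """\
Verified module: Hacl.Bignum.Parameters (528197 milliseconds)
make[1]: *** [Hacl.Bignum.Parameters.fst-verify] Error 1
make[1]: *** Waiting for unfinished jobs....
Verified module: Hacl.Spe.Poly1305_32 (39651 milliseconds)
make: *** [verify-nss] Error 2
"""

if __name__ == "__main__":
    timings, failures = summarize_log(SAMPLE)
    # Slowest modules first, since long verification times are the suspect here.
    for module, seconds in sorted(timings.items(), key=lambda kv: -kv[1]):
        print(f"{module}: {seconds:.1f} s")
    for target, status in failures:
        print(f"failed target: {target} (exit {status})")
```

Sorting by duration puts the module that took almost nine minutes at the top, which is the same module whose verify target failed.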
Assignee: wcosta → nobody
Status: ASSIGNED → NEW
This task works totally fine for me locally. It's not an issue with the job itself. As I said it needs quite some resources, which can fail it, but otherwise it works fine.
Flags: needinfo?(wcosta)
(In reply to Franziskus Kiefer [:fkiefer or :franziskus] from comment #4)
> This task works totally fine for me locally. It's not an issue with the job
> itself. As I said it needs quite some resources, which can fail it, but
> otherwise it works fine.

Hrm, so maybe it is just a matter of using a larger instance?
Flags: needinfo?(wcosta) → needinfo?(franziskuskiefer)
> Hrm, so maybe it is just a matter of using a larger instance?

Looking at the retriggered jobs, that didn't help :/
Flags: needinfo?(franziskuskiefer)
Blocks: 1451395
I moved the job that fails here to a different task where it succeeds. That doesn't solve the issue but produces higher load on each NSS push. So it would be great if this issue could be resolved.
It sounds like the task was causing an OOM on the machine, which typically kills dockerd because Linux's OOM killer always does the worst possible thing. So the fix was probably the right one.
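One way to check the OOM theory before picking an instance size is to measure how much memory a build step peaks at when run locally. The following is a minimal sketch under stated assumptions: `peak_child_rss_kib` is a hypothetical helper, the command in the usage part is a stand-in for the real build command, and the value is only a lower bound because `ru_maxrss` reports the largest single child process, not the sum over parallel make jobs.

```python
import resource
import subprocess
import sys


def peak_child_rss_kib(cmd):
    """Run cmd to completion and return the peak resident set size of the
    largest child process waited for so far.

    On Linux the value is in KiB (macOS reports bytes instead). It covers
    one process at a time, so parallel jobs can use more in total.
    """
    subprocess.run(cmd, check=False)
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss


if __name__ == "__main__":
    # Stand-in for the real build command: a child that allocates about 50 MiB.
    cmd = [sys.executable, "-c", "x = bytearray(50 * 1024 * 1024)"]
    peak = peak_child_rss_kib(cmd)
    print(f"peak RSS of largest child: {peak / 1024:.0f} MiB (on Linux)")
```

If the peak comes close to the memory available on the worker, the OOM killer is a plausible cause of the failures.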
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME