Closed Bug 933363 Opened 11 years ago Closed 11 years ago

Mac 10.6 nodes in Mozmill CI failed to copy artifacts, causing long delays

Categories

(Mozilla QA Graveyard :: Infrastructure, defect)

x86_64
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: davehunt, Unassigned)

Details

Three out of the four Mac OS X 10.6 nodes today failed to copy the get_mozmill-tests workspace, causing massive delays in our tests. The console log for each affected job looked similar to this:

Started by user anonymous
[EnvInject] - Loading node environment variables.
Building remotely on mm-osx-106-1 in workspace jenkins/workspace/mozilla-aurora_update
Deleting project workspace... done
Restoring workspace from build #20923 of project get_mozmill-tests

These jobs did not time out as they should have (we use the Build Timeout plugin [1]) and therefore caused other concurrently running jobs to wait indefinitely. This indefinite wait is a known 'feature' of the JUnit results archiver [2].

[1] https://wiki.jenkins-ci.org/display/JENKINS/Build-timeout+Plugin
[2] https://issues.jenkins-ci.org/browse/JENKINS-10234
I was able to get us running again by taking the affected nodes offline temporarily and then restarting the Jenkins slave instances on them. This has resolved the immediate issue, but we still need to understand what happened and how we might be able to prevent or minimise the impact of it happening again. Some ideas:

* Why did the Build Timeout plugin not abort the jobs? Is this a bug in that plugin?
* Could we stop using the JUnit results archiver, so that if this happened again it would only affect the individual nodes? Perhaps we could switch to the xUnit plugin, which I understand is not affected by this issue.
* If we separate the jobs by platform (for example ubuntu_mozilla-aurora_functional), we will only have concurrency within each platform, and therefore such an issue would not have a system-wide impact. However, this would considerably increase the number of jobs, which would be a maintenance burden.
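Another possible mitigation, independent of the Build Timeout plugin, would be an external watchdog that flags builds which have been running too long. The following is only a minimal sketch, not something used in Mozmill CI. It assumes the build records come from the Jenkins remote API (the JSON served under a build's api/json URL), where `building` is a boolean, `timestamp` is the start time in milliseconds since the epoch, and `url` identifies the build. The function names `is_stuck` and `stuck_builds` and the one-hour limit in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_stuck(build: dict, limit: timedelta, now: datetime) -> bool:
    """Return True if the build is still running and started more than `limit` ago.

    `build` is assumed to be a dict shaped like the JSON Jenkins returns for a
    build: `building` (bool) and `timestamp` (start time, ms since the epoch).
    """
    if not build.get("building", False):
        # Finished builds can never be stuck.
        return False
    started = datetime.fromtimestamp(build["timestamp"] / 1000, tz=timezone.utc)
    return now - started > limit


def stuck_builds(builds, limit, now=None):
    """Return the URLs of all builds in `builds` that exceed `limit`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return [b.get("url", "<unknown>") for b in builds if is_stuck(b, limit, now)]
```

Called periodically with the currently running builds and, say, `timedelta(hours=1)`, this would list the builds that need attention. Fetching the build records and aborting the flagged builds is deliberately left out, since it depends on how the Jenkins instance is set up.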
I found https://issues.jenkins-ci.org/browse/JENKINS-9716 for the pre-build timeout failure bug, but it's been open for over two years with no sign of it being fixed. We also suffer from https://issues.jenkins-ci.org/browse/JENKINS-16875
I don't understand why this issue should be related to the junit plugin. Its execution happens in the post-build step, but here we are failing in the pre-build step. So please explain why you think it's related. Were all 10.6 nodes affected at this time? What about other versions of OS X? None of them? That sounds scary.
(In reply to Henrik Skupin (:whimboo) from comment #3)
> I don't understand why this issue should be related to the junit plugin. Its
> execution happens in the post-build step, but here we are failing in the
> pre-build step. So please explain why you think it's related.

Please see comment 0, where I explain that as a result of this issue we hit a known issue/feature in the JUnit results publisher.

> Were all 10.6 nodes affected at this time? What about other versions of OS
> X? None of them? That sounds scary.

Again, see comment 0, first line: "Three out of the four Mac OS X 10.6 nodes". This specific issue affected no other nodes; however, because concurrent builds were still running on these nodes, the other builds were hanging due to the JUnit results publisher issue.
This hasn't happened for a long time. Let's close as WFM for now.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Product: Mozilla QA → Mozilla QA Graveyard