Closed
Bug 1310428
Opened 9 years ago
Closed 9 years ago
Windows builder instances getting renamed/rebooted while running jobs?
Categories
(Taskcluster :: Services, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: pmoore, Unassigned)
Attachments (1 file): 68.13 KB, text/plain
I've attached a papertrail output for this filter:
https://papertrailapp.com/systems/487309043/events?q=host+renamed&focus=723568367226900483
It looks like something may be renaming hosts while tasks are running, and then rebooting them. This might be one reason that some builders are still getting claim-expired exceptions.
Based on the name of the system logging to papertrail, this log appears to come from a single builder that was renamed many times while it was running tasks, and I think also rebooted.
The claim-expired state means that a task began, and then at some point the queue lost contact with the worker, and does not know what happened. This is consistent with a host getting rebooted, as all trace of the task would be lost.
However, if this is multiple different hosts doing the renaming, this might not be a problem. In general though, we might be better off not running DSC while the worker is running, as it may interfere with the worker process.
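To make the claim-expired reasoning above concrete, here is a minimal sketch of how a claim deadline behaves. It is a toy model, not the queue's code: the `Claim` class, the function names, and the state strings are hypothetical, and the claim duration is whatever the caller passes in.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Claim:
    # Deadline by which the worker must reclaim the task (hypothetical model).
    taken_until: datetime


def reclaim(claim: Claim, now: datetime, claim_duration: timedelta) -> Claim:
    """A live worker extends its claim before the deadline passes."""
    if now >= claim.taken_until:
        raise RuntimeError("claim already expired")
    return Claim(taken_until=now + claim_duration)


def run_state(claim: Claim, now: datetime) -> str:
    """What the queue can conclude from the deadline alone."""
    if now >= claim.taken_until:
        # The worker went silent (for example, it was rebooted mid-task).
        return "exception/claim-expired"
    return "running"
```

A worker that reboots mid-task never calls `reclaim`, so the deadline passes and the queue can only report claim-expired.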
Reporter
Comment 1•9 years ago
Could you take a look at this Rob?
Many thanks!
Flags: needinfo?(rthijssen)
Reporter
Comment 2•9 years ago
Note, this may be resolved by bug 1310429, or there may be more steps needed. Therefore I've created two separate bugs.
Comment 3•9 years ago
the single instance you were likely looking at is the ami creation instance. and yes, because g-w does not check the userdata running flag as requested in bug 1302257, g-w starts on the ami creation instance, as well as on instances that have not yet had time to rename and reboot themselves. the problem is g-w starting too early rather than host renaming (which is a process that needs to occur and takes some time to complete). also, if you look at pt logs for the hostname of the ami creation instance, you will see event logs for all of the instances spawned from that ami which have not yet renamed, so you're getting a skewed picture of what's going on if you're not aware of that.
i maintain that the problem is in generic-worker, or in the autologon/logon scripts created by generic-worker, which cause g-w to start too early. the host renaming and other processes created by dsc/occ/userdata are all things which need to occur. g-w is just starting before those things have occurred. it should wait until C:\dsc\in-progress.lock has been cleared before it starts claiming tasks.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(rthijssen)
Resolution: --- → WONTFIX
Reporter
Comment 4•9 years ago
The claim-expired bugs were introduced in 6.0.0 of the generic worker. Version 5.2.0 should be fine.
If you want generic worker only to run when the dsc lockfile is not present, see https://bugzilla.mozilla.org/show_bug.cgi?id=1302257#c5 for an implementation suggestion.
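For illustration only, the lockfile gating discussed in comments 3 and 4 could look roughly like the sketch below. This is not the suggestion from bug 1302257 and not generic-worker code; the function name, the polling interval, and the timeout handling are assumptions. Only the lock path C:\dsc\in-progress.lock comes from this thread.

```python
import os
import time


def wait_for_lock_cleared(lock_path, poll_seconds=5.0, timeout_seconds=None,
                          sleep=time.sleep, clock=time.monotonic):
    """Block until lock_path no longer exists.

    Returns True once the lock file is gone, or False if timeout_seconds
    elapses first. All parameter names and defaults are hypothetical.
    """
    start = clock()
    while os.path.exists(lock_path):
        if timeout_seconds is not None and clock() - start >= timeout_seconds:
            return False
        sleep(poll_seconds)
    return True
```

A wrapper script might call `wait_for_lock_cleared(r"C:\dsc\in-progress.lock")` and start the worker only once it returns True, so that no task is claimed while the rename/reboot steps are still pending.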
Assignee
Updated•6 years ago
Component: Integration → Services