Closed
Bug 1077869
Opened 10 years ago
Closed 10 years ago
docker-worker: Formatting instance storage fails intermittently
Categories
(Taskcluster :: Workers, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jlal, Assigned: garndt)
References
Details
Attachments
(1 file)
At least once now (see node i-75a12c7a in papertrail) I have seen a worker node start without actually booting the docker-worker and pulling tasks off the queue. We should investigate why this is happening and actually iron out what we would like the aws-provisioner to do to nodes which become "zombies".
Reporter
Comment 1•10 years ago
Also i-330e833c, i-cf0b86c0, i-0e0d8001 (this one did not even register a hostname on papertrail!), i-a2179aad, i-aa2ba6a5, i-e93db0e6, i-3856da37. Clearly something is preventing upstart from even _starting_ the docker-worker/docker (it starts diamond, at least in some of these cases).
Assignee
Comment 2•10 years ago
Appears that the logical volume in the upstart job is getting created, but then the next step to format it is not happening. Not sure why yet. I didn't find anything in the logs indicating a larger failure.

Zombie instance:

root@ip-172-31-4-29:/home/ubuntu# lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT,LABEL
NAME                          FSTYPE       SIZE MOUNTPOINT LABEL
xvda                                         8G
└─xvda1                       ext4           8G /          cloudimg-rootfs
xvdb                          LVM2_member   75G
└─instance_storage-all (dm-0)               75G

Good instance:

NAME                          FSTYPE       SIZE MOUNTPOINT LABEL
xvda                                         8G
└─xvda1                       ext4           8G /          cloudimg-rootfs
xvdb                          LVM2_member   75G
└─instance_storage-all (dm-0) ext4          75G /mnt
Reporter
Comment 3•10 years ago
Hrm I wonder if this is something we can retry somehow... It looks like the process was at least partially completed.
Assignee
Comment 4•10 years ago
We should be able to capture the status code and retry a few times. I can add some logic around it, and perhaps some useful logging, to try to narrow down the conditions under which a failure happens and to at least let us know exactly what failed and was retried.
Reporter
Comment 5•10 years ago
++ I think we need some kind of retry logic. For this particular case we may or may not have to destroy the LVM "disk" and start over... I also wonder, if this is actually failing during ext4 formatting, whether there is some verbose logging we can try.
Comment 6•10 years ago
It seems to me we should figure out why it fails sometimes... This isn't a network thing; it's not supposed to only work "sometimes". If we can detect a failure to format the disk, we can just force a shutdown. Though this can create a spawn/shutdown loop... I hate those :)
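A hedged sketch of that detect-and-shutdown idea (the device path is illustrative, and the shutdown is left commented out; the real check would live in the worker's boot scripts):

```shell
#!/bin/sh
# Sketch: treat an unformatted instance-storage volume as fatal
# instead of leaving a zombie node pulling no tasks.
check_formatted() {
  # lsblk prints the filesystem type, or nothing if none exists.
  [ "$(lsblk -no FSTYPE "$1" 2>/dev/null)" = "ext4" ]
}

if ! check_formatted /dev/instance_storage/all; then
  echo "instance storage has no filesystem; a real job could shut down here" >&2
  # shutdown -h now
fi
```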
Assignee
Comment 7•10 years ago
We absolutely should figure out why it's failing. Magical failing things are too mysterious for my liking :)
Assignee
Updated•10 years ago
Assignee: nobody → garndt
Assignee
Comment 8•10 years ago
Over the course of a couple of days I have spawned over a thousand nodes using a graph of 170 tasks and haven't had a zombie yet. Perhaps it could be tested out with the tasks that were running on the gaia worker type. I created this AMI with the changes: ami-25185715
Attachment #8506989 -
Flags: review?(jlal)
Assignee
Comment 9•10 years ago
More info about the error: it appears that the upstart job that creates the logical volume was respawned the first time because mkfs could not start; the system was reporting that the volume was currently in use. Every respawn after that failed earlier in the job, because the commands to create the logical volume failed now that the LV was already present, triggering further respawns and never attempting mkfs again.
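In other words, the job's storage setup was not idempotent across respawns. A rough sketch of an idempotent variant (volume group, names, and exact commands are illustrative, not the actual fix in the PR):

```shell
#!/bin/sh
# Sketch: make LV creation and formatting safe to re-run, so a
# respawned job skips completed steps and still reaches mkfs.
VG=instance_storage
LV=all
DEV="/dev/$VG/$LV"

setup_storage() {
  # Create the LV only if it does not already exist.
  if ! lvdisplay "$DEV" >/dev/null 2>&1; then
    lvcreate -l 100%VG -n "$LV" "$VG" || return 1
  fi
  # blkid exits non-zero when no filesystem signature is present.
  if ! blkid "$DEV" >/dev/null 2>&1; then
    mkfs -t ext4 "$DEV" || return 1
  fi
}

# Hypothetical usage in the upstart job:
#   setup_storage && mount "$DEV" /mnt
```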
Reporter
Comment 10•10 years ago
Comment on attachment 8506989 [details] [review] GH PR 54

A few comments here... can you flag me again for r? after addressing my comments.
Attachment #8506989 -
Flags: review?(jlal)
Reporter
Comment 11•10 years ago
r+ https://github.com/taskcluster/docker-worker/commit/d2d14ac0a5aae59f249e207bb1d862526c4815ad
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Reporter
Updated•10 years ago
Summary: docker-worker: Investigate zombie nodes → docker-worker: Formatting instance storage fails intermittently
Updated•9 years ago
Component: TaskCluster → Docker-Worker
Product: Testing → Taskcluster
Updated•5 years ago
Component: Docker-Worker → Workers