Closed Bug 1077869 Opened 10 years ago Closed 10 years ago

docker-worker: Formatting instance storage fails intermittently

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jlal, Assigned: garndt)

Attachments

(1 file)

GH PR 54, 10 years ago, Greg Arndt [:garndt], 52 bytes, text/x-github-pull-request

James Lal [:lightsofapollo]

Reporter

Description

•

10 years ago

At least once now (see node i-75a12c7a in papertrail) I have seen a worker node start without actually booting the docker-worker and pulling tasks off the queue. We should investigate why this is happening and actually iron out what we would like the aws-provisioner to do to nodes which become "zombies".

James Lal [:lightsofapollo]

Reporter

Comment 1

•

10 years ago

Also seen on i-330e833c, i-cf0b86c0, i-0e0d8001 (this one did not even register a hostname on papertrail!), i-a2179aad, i-aa2ba6a5, i-e93db0e6, i-3856da37.

Clearly something is preventing upstart from even _starting_ the docker-worker/docker (it starts diamond at least in some of these cases).

James Lal [:lightsofapollo]

Reporter

Updated

•

10 years ago

Blocks: 1080265

Greg Arndt [:garndt]

Assignee

Comment 2

•

10 years ago

It appears that the logical volume in the upstart job is getting created, but the next step, formatting it, is not happening. I'm not sure why yet. I didn't find anything in the logs indicating a larger failure.

Zombie instance:

root@ip-172-31-4-29:/home/ubuntu# lsblk -o NAME,FSTYPE,SIZE,MOUNTPOINT,LABEL
NAME                          FSTYPE      SIZE MOUNTPOINT LABEL
xvda                                        8G            
└─xvda1                       ext4          8G /          cloudimg-rootfs
xvdb                          LVM2_member  75G            
└─instance_storage-all (dm-0)              75G       


Good instance:

NAME                          FSTYPE      SIZE MOUNTPOINT LABEL
xvda                                        8G            
└─xvda1                       ext4          8G /          cloudimg-rootfs
xvdb                          LVM2_member  75G            
└─instance_storage-all (dm-0) ext4         75G /mnt
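
The difference between the two listings (an empty FSTYPE on the logical volume) could be checked automatically. The following is only a sketch, not something from this bug: it assumes the machine-readable output of `lsblk -P -o NAME,FSTYPE,MOUNTPOINT`, and the helper names and the `instance_storage-all` prefix are illustrative choices taken from the listings above.

```python
import shlex


def parse_lsblk_pairs(output):
    """Parse `lsblk -P` output (KEY="value" pairs, one device per line)."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # shlex handles the quoting, so values containing spaces stay intact.
        rows.append(dict(token.split("=", 1) for token in shlex.split(line)))
    return rows


def unformatted_volumes(rows, name_prefix="instance_storage-all"):
    """Return names of matching devices that have no filesystem (the zombie state)."""
    return [
        row["NAME"]
        for row in rows
        if row["NAME"].startswith(name_prefix) and not row.get("FSTYPE")
    ]
```

A non-empty result from `unformatted_volumes` would correspond to the "zombie instance" listing; an empty list corresponds to the good one.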

James Lal [:lightsofapollo]

Reporter

Comment 3

•

10 years ago

Hrm I wonder if this is something we can retry somehow... It looks like the process was at least partially completed.

Greg Arndt [:garndt]

Assignee

Comment 4

•

10 years ago

We should be able to capture the status code and retry a few times.  I can add some logic around it, plus some useful logging to help narrow down the conditions under which a failure happens and to at least tell us exactly what failed and was retried.
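
As a rough illustration of "capture the status code and retry a few times", here is a minimal sketch. It is not the code from the eventual PR; the function name, logger name, attempt count, and delay are all assumptions made for the example.

```python
import logging
import subprocess
import time

log = logging.getLogger("instance-storage")  # hypothetical logger name


def run_with_retry(cmd, attempts=3, delay=1.0):
    """Run cmd, retrying while it exits non-zero. Returns True on success."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            log.info("%s succeeded on attempt %d", cmd[0], attempt)
            return True
        # Record exactly what failed, so a retried step is visible in the logs.
        log.warning(
            "%s failed (exit %d) on attempt %d/%d: %s",
            cmd[0], result.returncode, attempt, attempts, result.stderr.strip(),
        )
        if attempt < attempts:
            time.sleep(delay)
    return False
```

A call such as `run_with_retry(["mkfs.ext4", "/dev/instance_storage/all"])` would retry only the format step; the device path here is inferred from the lsblk output and may not match the real job.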

James Lal [:lightsofapollo]

Reporter

Comment 5

•

10 years ago

++ I think we need some kind of retry logic for this particular case. We may or may not have to destroy the lvm "disk" and start over... I also wonder whether this is actually failing in the ext4 formatting step, and if so, whether there is some verbose logging we can try.

Jonas Finnemann Jensen (:jonasfj)

Comment 6

•

10 years ago

It seems to me we should figure out why it fails sometimes. This isn't a network thing; it's not supposed to only work "sometimes".

If we can reliably detect a failure to format the disk, we can just force a shutdown. Though this can create a spawn/shutdown loop... I hate those :)

Greg Arndt [:garndt]

Assignee

Comment 7

•

10 years ago

We absolutely should figure out why it's failing.  Magical failing things are too mysterious for my liking :)

Greg Arndt [:garndt]

Assignee

Updated

•

10 years ago

Assignee: nobody → garndt

Greg Arndt [:garndt]

Assignee

Comment 8

•

10 years ago

Attached file: GH PR 54

Over the course of a couple of days I have spawned over a thousand nodes using a graph of 170 tasks and haven't had a zombie yet.  Perhaps it could be tested with the tasks that were running on the gaia worker type.  I created this AMI with the changes: ami-25185715

Attachment #8506989 - Flags: review?(jlal)

Greg Arndt [:garndt]

Assignee

Comment 9

•

10 years ago

More info about the error: it appears that the upstart job that creates the logical volume was respawned the first time because mkfs could not start; the system reported that the volume was currently in use.  Every respawn after that failed earlier, at the commands that create the logical volume, because the LV was already present.  That triggered further respawns, and mkfs was never attempted again.
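
One way to avoid this kind of respawn loop is to make the setup idempotent: inspect the current state and run only the steps that are still missing. The sketch below illustrates that idea and is not the fix from the PR. The device, volume group, logical volume, and mount point are inferred from the lsblk output earlier in this bug, and the exact LVM commands used by the real upstart job are an assumption.

```python
# Hypothetical names, inferred from "xvdb" and "instance_storage-all" in lsblk.
DEVICE = "/dev/xvdb"
VG = "instance_storage"
LV = "all"
LV_PATH = "/dev/%s/%s" % (VG, LV)
MOUNTPOINT = "/mnt"


def plan_storage_steps(lv_exists, fstype, mounted):
    """Return the commands still needed, given what already exists."""
    steps = []
    if not lv_exists:
        # Only create the LV when it is absent, so a respawn cannot fail here.
        steps.append(["pvcreate", DEVICE])
        steps.append(["vgcreate", VG, DEVICE])
        steps.append(["lvcreate", "-l", "100%VG", "-n", LV, VG])
    if not fstype:
        # The step the zombie nodes never got back to.
        steps.append(["mkfs.ext4", LV_PATH])
    if not mounted:
        steps.append(["mount", LV_PATH, MOUNTPOINT])
    return steps
```

For the zombie state described above (LV present, no filesystem, not mounted), the plan contains only the mkfs and mount commands, so a respawn resumes at the step that failed instead of stopping at LV creation.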

James Lal [:lightsofapollo]

Reporter

Comment 10

•

10 years ago

Comment on attachment 8506989
GH PR 54

A few comments here... can you flag me again for r? after going through them?

Attachment #8506989 - Flags: review?(jlal)

James Lal [:lightsofapollo]

Reporter

Comment 11

•

10 years ago

r+ 

https://github.com/taskcluster/docker-worker/commit/d2d14ac0a5aae59f249e207bb1d862526c4815ad

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

James Lal [:lightsofapollo]

Reporter

Updated

•

10 years ago

Summary: docker-worker: Investigate zombie nodes → docker-worker: Formatting instance storage fails intermittently

James Lal [:lightsofapollo]

Reporter

Updated

•

10 years ago

Blocks: 1089251

Pete Moore [:pmoore][:pete]

Updated

•

9 years ago

Component: TaskCluster → Docker-Worker

Product: Testing → Taskcluster

Nobody; OK to take it and work on it

Updated

•

5 years ago

Component: Docker-Worker → Workers


Categories

(Taskcluster :: Workers, defect)
