Closed Bug 1257117 Opened 8 years ago Closed 7 years ago

Long delays loading docker images

Component: Taskcluster :: Services
Type: defect
Priority: Not set
Severity: major
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: whimboo
Assignee: Unassigned

Today I checked the timing of our firefox-ui-tests as executed on TaskCluster for Linux64 debug builds and have seen a broad range, going from about 17 min to 43 min. I wonder where this difference comes from and whether it might be machine related.

Here is a fast job:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=Firefox%20UI&filter-tier=2&filter-tier=3&fromchange=612e0e3ea026&selectedJob=23845198

And here is a slow job:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=Firefox%20UI&filter-tier=2&filter-tier=3&fromchange=612e0e3ea026&selectedJob=23852959
The interesting fact here is the following, and I feel all of the time is spent preparing the spot instance for the tests:

According to Treeherder, this is the start time of the job (UTC+1):
Start 	Wed Mar 16, 0:32:53

Our tests started here:
23:59:40     INFO - Dumping config to /home/worker/workspace/logs/localconfig.json.

So we spent about 27 min preparing the machine for the test!
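
For clarity, here is the arithmetic behind that figure as a minimal Python sketch (Treeherder reports the start in UTC+1, the mozharness line is in UTC):

  from datetime import datetime, timedelta, timezone

  # Treeherder start time, reported in UTC+1
  job_start = datetime(2016, 3, 16, 0, 32, 53, tzinfo=timezone(timedelta(hours=1)))
  # First mozharness log line, in UTC
  tests_start = datetime(2016, 3, 15, 23, 59, 40, tzinfo=timezone.utc)

  print(tests_start - job_start)  # 0:26:47 -- roughly 27 minutes of setup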
Severity: normal → major
Depends on: 1257122
The tasks here are
  https://tools.taskcluster.net/task-inspector/#GYKPzrudSqucC8nf3fzIWw/0
  https://tools.taskcluster.net/task-inspector/#U7On2dUVS9KlEgw7LUaCMQ/0

and indeed, the first (slow) job had to download the docker image, which took 5.5 minutes:

[taskcluster] Downloaded 2741.322 mb in 338.94 seconds.

and then probably took some time to untar that and set up the docker container.  Judging by how quickly the fast run got from start to mozharness, I'm guessing that the unpacking process took 23 minutes.  That's pretty awful!
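
For reference, a quick sketch (illustrative Python, not taken from the worker) turning that log line into a transfer rate; the rest of the ~27-minute setup would then be decompression, docker load, and container setup:

  # Numbers from the "[taskcluster] Downloaded ..." line above
  downloaded_mb = 2741.322
  download_seconds = 338.94

  print(f"download rate: {downloaded_mb / download_seconds:.1f} MB/s")  # ~8.1 MB/s
  print(f"download time: {download_seconds / 60:.1f} min")              # ~5.6 min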

On the other hand, I don't see this as a big issue -- it contributes to end-to-end time, but only when fresh instances are being added, which is less likely during periods of high load.  It does not affect the runtime of the test itself.
Summary: Large execution time differences for firefox-ui-tests on Linux64 debug → Long delays loading docker images
We also have tasks like the following, which downloaded the docker image and still finished everything within 27 minutes:

https://tools.taskcluster.net/task-inspector/#CB6ORgwET9GJicpsNJy7bA/0
That one was 12 minutes from start to mozharness, with about 6 of those being the docker download.  Timestamps would certainly help here!
It is very important to know the average time spent in CI-induced overhead so we can balance our jobs out properly.  I believe we have <5 minutes of it in buildbot; that might have gone up somewhat, but I think what has really gone up is the time to set up the virtualenv.

If we want to target 60-minute jobs (which IMO is unacceptable; we used to have 15-minute jobs, and 30 seems reasonable), then we should ensure that our average CI overhead is taken into account when deciding how much we parallelize (i.e. chunk) our tests.
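
As a rough illustration of that balancing act (the numbers below are invented; only the relationship between overhead, target time, and chunk count matters):

  import math

  def chunks_needed(total_test_minutes, target_job_minutes, ci_overhead_minutes):
      """Chunks required so that CI overhead + per-chunk test time fits the target."""
      usable = target_job_minutes - ci_overhead_minutes
      if usable <= 0:
          raise ValueError("CI overhead alone exceeds the target job time")
      return math.ceil(total_test_minutes / usable)

  # e.g. 120 minutes of tests with a 30-minute target:
  print(chunks_needed(120, 30, ci_overhead_minutes=5))   # 5 chunks
  print(chunks_needed(120, 30, ci_overhead_minutes=27))  # 40 chunks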

I do wonder whether dynamically built docker images are actually useful for tests; it might be worth looking into.  Builds have a lot of other variables that change more often than tests do.
Bug 1254770 will hopefully address logging not just the download time, but also the image loading time.

As for timestamps, they could be added to the specific lines the worker injects into the logs (lines prefixed with "[taskcluster]").  I only mention the worker-specific logging because some things within the task already include a timestamp, so I didn't want to add yet another timestamp to those.

The worker-specific lines could be "2016-03-16 09:33 [taskcluster] <message>" or "[taskcluster] 2016-03-16 09:33 <message>".  I'm open to opinions.
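
For illustration, the two orderings could be produced like this (a hypothetical helper, not actual docker-worker code):

  from datetime import datetime, timezone

  def taskcluster_log_line(message, timestamp_first=False):
      ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M")
      if timestamp_first:
          return f"{ts} [taskcluster] {message}"
      return f"[taskcluster] {ts} {message}"

  print(taskcluster_log_line("Downloading docker image..."))
  # e.g. "[taskcluster] 2016-03-16 09:33 Downloading docker image..."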
Component: Integration → Platform and Services
Multiple things have been done following this bug.  Image sizes have decreased somewhat, we are now compressing with zstd, which helps the first-time transfer costs, and we have been monitoring the efficiency of our gecko worker types to ensure they are not cycling too often, which would mean they have to repopulate the image cache.

By far the slowest machines to load docker images are the m1.medium instances, which are legacy instance types and just slower all around.  We have been making strides to move away from these instance types where we can.  The total load time (download + decompressing + loading) for the docker images used for gecko is typically < 10 minutes for our test images, and over half that for all other images.
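
For context, a minimal sketch of such a download/decompress/load pipeline with overall timing (placeholder URL; the worker does not necessarily shell out like this, the point is only that streaming the zstd decompression into docker load keeps the phases overlapped):

  import subprocess, time

  IMAGE_URL = "https://example.com/path/to/image.tar.zst"  # placeholder

  start = time.time()
  # Stream the compressed image, decompress with zstd, and pipe straight into
  # `docker load` so download, decompression, and loading overlap.
  subprocess.run(f"curl -sL {IMAGE_URL} | zstd -dc | docker load",
                 shell=True, check=True)
  print(f"download+decompress+load took {time.time() - start:.0f}s")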

Timestamps have been added to the logs to see progress of downloading and loading an image, and we are capturing data within our metrics system to monitor.  

I think the original slow times reported here are a thing of the past, but we can always investigate more later to decrease the time (most of which is going to come from actually decreasing the image sizes).

I vote for closing this for now, and we can open newer, more specific bugs for anything we find.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: Platform and Services → Services