Closed Bug 1589712 Opened 10 months ago Closed 4 months ago

Caching system for lint and static analysis jobs isn't working correctly

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set
normal

Tracking

(firefox77 fixed)

RESOLVED FIXED
mozilla77
Tracking Status
firefox77 --- fixed

People

(Reporter: Sylvestre, Assigned: Callek)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Whiteboard: [ci-costs-2020:done])

Attachments

(5 files)

Now that we are doing a try job for every review, we should probably spend more time on cloning optimization.
For example, in these jobs, we are spending 5 minutes just to clone the repo.

https://treeherder.mozilla.org/logviewer.html#?job_id=271918576&repo=try this is about 20% of the time
https://treeherder.mozilla.org/logviewer.html#?job_id=271917591&repo=try (+80%)

Maybe we should generate a cached image every day with the current checkout.
Maybe shallow clone will help.

The repo caching system is supposed to work by checking out a copy of the repo on the worker (in the case of linting jobs this means checking out mozilla-unified) on the first job the worker takes, then performing a working directory checkout (update) to complete the given task. Subsequent jobs should be able to re-use the mozilla-unified repo, so instead of doing a full clone, they just pull the new heads from try into the previously cloned repo and then check them out into the working directory.

Are the workers recognizing existing checkouts as valid caches? From the two examples given in the bug description it doesn't seem they are. How often are linting workers recycled in this setup? If it's rather often, it could be causing the tasks to reclone frequently. Assuming I'm not misinterpreting the 20% + 80% comment above, we're doing a full re-clone on every task?

A cached image or shallow clone might help here as well. I've also thought about triggering a mozilla-unified clone as part of the worker setup before exposing it as available to run tasks in Taskcluster, which would at least move the full clone timings out of task execution times. But let's start by making sure we're taking advantage of the current caching system effectively.

Yes as far as I can tell every lint task clones mozilla-unified every time. Actually it's likely even worse. The lint tasks don't have any specific caching configuration, this is supposed to all be automatically handled by the run-task script (and related taskgraph configs). So I believe other types of tasks are affected as well, possibly anything that uses the lint image..

The strange thing is that it appears to be all configured properly. The tasks have a cache defined at /build/worker/checkouts, yet every log that I've seen has this line:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=272396788&repo=autoland&lineNumber=20

That's being logged here:
https://searchfox.org/mozilla-central/rev/55aa17110091deef24b913d033ccaf58f9c6d337/taskcluster/scripts/run-task#275

So in other words, /builds/worker/checkouts exists but is empty. The run-task script then proceeds to clone into /builds/worker/checkouts yet the next task again detects that it's empty.

By contrast, here's a build task (where checkout cache is working):
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=272396353&repo=autoland&lineNumber=15

They both use docker-worker and they both use the same taskgraph codepath to setup the cache (which is why I suspect something in the lint image).

Thanks for investigating!

Any idea who could help us fixing that?

Flags: needinfo?(sheehan)
Flags: needinfo?(ahal)

I did discover that how we setup the workspace differs between the build and lint images. Notably the chown. No idea why that would matter, but I have a try run to test it out:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=86138a6dea1a400a6bc3aa79cbc279012eb490c1

(Relatedly, I also don't know how to test caching changes beyond retriggering a lot and hoping a worker near the end previously ran a task with my changes).

Maybe Tom has some additional insight? Tl;dr lint tasks are cloning Gecko every time despite a cache set up at /builds/worker/checkouts.

Flags: needinfo?(ahal) → needinfo?(mozilla)

Ah, here's a lint task on mozilla-centeral where the caching is working:
https://treeherder.mozilla.org/logviewer.html#?job_id=272407760&repo=mozilla-central

So it's working sometimes, but very rarely.

It looks like these tasks run on test workers; given that most tests don't have a checkout, there is a high likely hood that these tasks will run on an instance that doesn't have the tree cloned.

It probably makes sense to have a new instance type for these jobs. (I also wonder if we can get away with a smaller instance for these jobs, than the instance type we use to run firefox tests themseleves.

Flags: needinfo?(mozilla)

I think Tom's hypothesis about running the job on a suboptimal worker is probably right. As an alternative to running the jobs on a new worker type, maybe we could run lint jobs on the builders, where there is a higher chance of a cached repo. Just an idea in case standing up a new worker pool is too invasive of a change.

Flags: needinfo?(sheehan)

Standing up a new pool seems better. We can share this pool with other lightweight source-test tasks (e.g Python unittests, docs, node tests, etc). So we should be able to get decent usage out of it. I also agree we should try and move to a cheaper worker type.

Summary: Improve the caching system for lint and static analysis jobs → Caching system for lint and static analysis jobs isn't working correctly

Coop, I have been told that you could be able to help with that.
We would need a pool for linting/static analysis + make sure that the caching work.

We are spending $$$ cloning the repo for every job here!

Merci

Flags: needinfo?(coop)

(In reply to Sylvestre Ledru [:Sylvestre] from comment #9)

Coop, I have been told that you could be able to help with that.
We would need a pool for linting/static analysis + make sure that the caching work.

We are spending $$$ cloning the repo for every job here!

For context, the linting jobs only cost us $1,377.73 in September.

Yes, we can help, and can get you onto a properly-sized instance type, but it will need to wait until later in November when we're done with existing migrations.

Flags: needinfo?(coop)

I agree this is likely a lower priority, but just pointing out that the bigger cost here is developer time wasted due to issues showing up on phabricator reviews later.

(In reply to Connor Sheehan [:sheehan] from comment #1)

The repo caching system is supposed to work by checking out a copy of the repo on the worker (in the case of linting jobs this means checking out mozilla-unified) on the first job the worker takes, then performing a working directory checkout (update) to complete the given task. Subsequent jobs should be able to re-use the mozilla-unified repo, so instead of doing a full clone, they just pull the new heads from try into the previously cloned repo and then check them out into the working directory.

Is this supposed to happen on all task types? The only ones I have experience looking at are Windows build jobs, and I've never seen this incremental-clone behavior happen there, it's always been a painfully slow full clone. (Especially painful when I'm watching the task live and waiting to see whether configure succeeds at all)

(In reply to :dmajor from comment #12)

Is this supposed to happen on all task types? The only ones I have experience looking at are Windows build jobs, and I've never seen this incremental-clone behavior happen there, it's always been a painfully slow full clone. (Especially painful when I'm watching the task live and waiting to see whether configure succeeds at all)

See bug 1527313 and bug 1528891.

Could we re-prioritize this? As :ahal said, this might not be a huge savings in terms of $$$ (even though it adds up as we introduce new linting jobs over time), but it will allow results to be shown to developers earlier on Phabricator.

Flags: needinfo?(coop)

(In reply to Marco Castelluccio [:marco] from comment #14)

Could we re-prioritize this? As :ahal said, this might not be a huge savings in terms of $$$ (even though it adds up as we introduce new linting jobs over time), but it will allow results to be shown to developers earlier on Phabricator.

Still happy to help as required, but I don't think this work is blocked on the Taskcluster team any more. After the Firefox CI cluster split in November, I believe releng is the correct group to implement a new, dedicated worker type optimized for running linting jobs in CI. Redirecting NI to :jlund

We also have access to better data now about concurrent tasks so it should be pretty easy to get a dedicated pool setup with a minimum capacity to make sure our cache hit rate is as high as possible.

Flags: needinfo?(coop) → needinfo?(jlund)

(In reply to Marco Castelluccio [:marco] from comment #14)

Could we re-prioritize this? As :ahal said, this might not be a huge savings in terms of $$$ (even though it adds up as we introduce new linting jobs over time), but it will allow results to be shown to developers earlier on Phabricator.

added to triage, will reply back later this week with update. Leaving needinfo open.

I didn't change anything from original workers, gecko-3/t-linux-xlarge and gecko-t/t-win10-64 except naming and reducing the capacity on win10 to half what it was. I'm open to review comments asking me to make other changes

Assignee: nobody → bugspam.Callek
Status: NEW → ASSIGNED
Flags: needinfo?(jlund)

What's left to do here?

Flags: needinfo?(bugspam.Callek)

(In reply to Marco Castelluccio [:marco] from comment #18)

What's left to do here?

I thought I had pushed these patches -- apparently I didn't, I'm sorry.

c#19 is one followup based on a review comment, and c#20 is what I think is necessary for the lint tasks to use a checkout appropriately. I'm not sure offhand if there are other jobs phab review stuff triggers which would benefit from the new worker type.

I also noticed there are some items that run shell scripts first, which might benefit from assuming checkout entirely as good followup, but I don't know enough about them to say for sure. Either way this patch should round out this bugs needs as specified.

Flags: needinfo?(bugspam.Callek)
Pushed by jwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b02f0d3d7477
Switch worker types for t-linux-xlarge and t-win10-64 to new -source types for linting tasks to allow better reuse of caching of the source checkout for those lint tasks. r=marco
Flags: needinfo?(bugspam.Callek)
Pushed by jwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/97d8ce585f5b
Switch worker types for t-linux-xlarge and t-win10-64 to new -source types for linting tasks to allow better reuse of caching of the source checkout for those lint tasks. r=marco
Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla77
Whiteboard: [ci-costs-2020:todo]
Attachment #9142163 - Attachment description: Followup to Bug 1589712. Make t-linux-xlarge-source match t-linux-xlarge including capacity, we had a large backlog of jobs with a 10 capacity. r=tomprince → Bug 1589712. Make t-linux-xlarge-source match t-linux-xlarge including capacity; r=tomprince
Attachment #9142163 - Attachment description: Bug 1589712. Make t-linux-xlarge-source match t-linux-xlarge including capacity; r=tomprince → Followup to Bug 1589712. Make t-linux-xlarge-source match t-linux-xlarge including capacity, we had a large backlog of jobs with a 10 capacity. r=tomprince
Attachment #9142163 - Attachment description: Followup to Bug 1589712. Make t-linux-xlarge-source match t-linux-xlarge including capacity, we had a large backlog of jobs with a 10 capacity. r=tomprince → Bug 1589712 - Make t-linux-xlarge-source match t-linux-xlarge including capacity; r=tomprince

So, at a glance with some back of napkin this looks like it was a win, although I don't know if further tweaking on number of instances, ttl of each instance, etc. could improve matters.

The source for this screenshot is https://datastudio.google.com/u/2/reporting/15a6j_NSmFOYk6ZZ5efTGmx3LUxZFu9I4/page/rFKOB if anyone can access.

Thanks go to Simon Fraser for creating these charts, we didn't spend too much time with trying to measure it, and the overall end-to-end times seem still within the noise limits of pre-change, as a whole.

Many thanks Justin! :)

Whiteboard: [ci-costs-2020:todo] → [ci-costs-2020:done]
You need to log in before you can comment on or make changes to this bug.