Closed Bug 1589712 Opened 5 years ago Closed 4 years ago

Caching system for lint and static analysis jobs isn't working correctly

Tracking

(firefox77 fixed)

Status:

RESOLVED FIXED

Milestone:

mozilla77

Tracking Flags:

Tracking

Status

firefox77

---

fixed

People

(Reporter: Sylvestre, Assigned: Callek)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Whiteboard: [ci-costs-2020:done])

Attachments

(5 files)

Bug 1589712 - Create srctest worker for linux/windows. r=tomprince 4 years ago Justin Wood (:Callek) 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1589712 - Normalize windows worker name to be like linux one. r=jlorenzo 4 years ago Justin Wood (:Callek) 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1589712 - Switch worker types for t-linux-xlarge and t-win10-64 to new -source types for linting tasks to allow better reuse of caching of the source checkout for those lint tasks. r=marco 4 years ago Justin Wood (:Callek) 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1589712 - Make t-linux-xlarge-source match t-linux-xlarge including capacity; r=tomprince 4 years ago Justin Wood (:Callek) 47 bytes, text/x-phabricator-request		Details \| Review
Screenshot_2020-05-01 Try Reviewbot Usage.png 4 years ago Justin Wood (:Callek) 204.98 KB, image/png		Details

Sylvestre Ledru [:Sylvestre]

Reporter

Description

•

5 years ago

Now that we are doing a try job for every review, we should probably spend more time on cloning optimization.
For example, in these jobs, we are spending 5 minutes just to clone the repo.

https://treeherder.mozilla.org/logviewer.html#?job_id=271918576&repo=try this is about 20% of the time
https://treeherder.mozilla.org/logviewer.html#?job_id=271917591&repo=try (+80%)

Maybe we should generate a cached image every day with the current checkout.
Maybe shallow clone will help.

Marco Castelluccio [:marco]

Updated

•

5 years ago

Depends on: 1428470, fastci

Connor Sheehan [:sheehan]

Comment 1

•

5 years ago

The repo caching system is supposed to work by checking out a copy of the repo on the worker (in the case of linting jobs this means checking out mozilla-unified) on the first job the worker takes, then performing a working directory checkout (update) to complete the given task. Subsequent jobs should be able to re-use the mozilla-unified repo, so instead of doing a full clone, they just pull the new heads from try into the previously cloned repo and then check them out into the working directory.

Are the workers recognizing existing checkouts as valid caches? From the two examples given in the bug description it doesn't seem they are. How often are linting workers recycled in this setup? If it's rather often, it could be causing the tasks to reclone frequently. Assuming I'm not misinterpreting the 20% + 80% comment above, we're doing a full re-clone on every task?

A cached image or shallow clone might help here as well. I've also thought about triggering a mozilla-unified clone as part of the worker setup before exposing it as available to run tasks in Taskcluster, which would at least move the full clone timings out of task execution times. But let's start by making sure we're taking advantage of the current caching system effectively.

Andrew Halberstadt [:ahal]

Comment 2

•

5 years ago

•

Edited

Yes as far as I can tell every lint task clones mozilla-unified every time. Actually it's likely even worse. The lint tasks don't have any specific caching configuration, this is supposed to all be automatically handled by the run-task script (and related taskgraph configs). So I believe other types of tasks are affected as well, possibly anything that uses the lint image..

The strange thing is that it appears to be all configured properly. The tasks have a cache defined at /build/worker/checkouts, yet every log that I've seen has this line:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=272396788&repo=autoland&lineNumber=20

That's being logged here:
https://searchfox.org/mozilla-central/rev/55aa17110091deef24b913d033ccaf58f9c6d337/taskcluster/scripts/run-task#275

So in other words, /builds/worker/checkouts exists but is empty. The run-task script then proceeds to clone into /builds/worker/checkouts yet the next task again detects that it's empty.

By contrast, here's a build task (where checkout cache is working):
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=272396353&repo=autoland&lineNumber=15

They both use docker-worker and they both use the same taskgraph codepath to setup the cache (which is why I suspect something in the lint image).

Sylvestre Ledru [:Sylvestre]

Reporter

Comment 3

•

5 years ago

Thanks for investigating!

Any idea who could help us fixing that?

Flags: needinfo?(sheehan)

Flags: needinfo?(ahal)

Andrew Halberstadt [:ahal]

Comment 4

•

5 years ago

I did discover that how we setup the workspace differs between the build and lint images. Notably the chown. No idea why that would matter, but I have a try run to test it out:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=86138a6dea1a400a6bc3aa79cbc279012eb490c1

(Relatedly, I also don't know how to test caching changes beyond retriggering a lot and hoping a worker near the end previously ran a task with my changes).

Maybe Tom has some additional insight? Tl;dr lint tasks are cloning Gecko every time despite a cache set up at /builds/worker/checkouts.

Flags: needinfo?(ahal) → needinfo?(mozilla)

Andrew Halberstadt [:ahal]

Comment 5

•

5 years ago

Ah, here's a lint task on mozilla-centeral where the caching is working:
https://treeherder.mozilla.org/logviewer.html#?job_id=272407760&repo=mozilla-central

So it's working sometimes, but very rarely.

Tom Prince [:tomprince]

Comment 6

•

5 years ago

It looks like these tasks run on test workers; given that most tests don't have a checkout, there is a high likely hood that these tasks will run on an instance that doesn't have the tree cloned.

It probably makes sense to have a new instance type for these jobs. (I also wonder if we can get away with a smaller instance for these jobs, than the instance type we use to run firefox tests themseleves.

Flags: needinfo?(mozilla)

Connor Sheehan [:sheehan]

Comment 7

•

5 years ago

I think Tom's hypothesis about running the job on a suboptimal worker is probably right. As an alternative to running the jobs on a new worker type, maybe we could run lint jobs on the builders, where there is a higher chance of a cached repo. Just an idea in case standing up a new worker pool is too invasive of a change.

Flags: needinfo?(sheehan)

Andrew Halberstadt [:ahal]

Comment 8

•

5 years ago

Standing up a new pool seems better. We can share this pool with other lightweight source-test tasks (e.g Python unittests, docs, node tests, etc). So we should be able to get decent usage out of it. I also agree we should try and move to a cheaper worker type.

Sylvestre Ledru [:Sylvestre]

Reporter

Updated

•

5 years ago

Summary: Improve the caching system for lint and static analysis jobs → Caching system for lint and static analysis jobs isn't working correctly

Sylvestre Ledru [:Sylvestre]

Reporter

Comment 9

•

5 years ago

Coop, I have been told that you could be able to help with that.
We would need a pool for linting/static analysis + make sure that the caching work.

We are spending $$$ cloning the repo for every job here!

Merci

Flags: needinfo?(coop)

Chris Cooper [:coop] (he/him)

Comment 10

•

5 years ago

(In reply to Sylvestre Ledru [:Sylvestre] from comment #9)

Coop, I have been told that you could be able to help with that.
We would need a pool for linting/static analysis + make sure that the caching work.

We are spending $$$ cloning the repo for every job here!

For context, the linting jobs only cost us $1,377.73 in September.

Yes, we can help, and can get you onto a properly-sized instance type, but it will need to wait until later in November when we're done with existing migrations.

Flags: needinfo?(coop)

Andrew Halberstadt [:ahal]

Comment 11

•

5 years ago

I agree this is likely a lower priority, but just pointing out that the bigger cost here is developer time wasted due to issues showing up on phabricator reviews later.

(Away)

Comment 12

•

5 years ago

(In reply to Connor Sheehan [:sheehan] from comment #1)

The repo caching system is supposed to work by checking out a copy of the repo on the worker (in the case of linting jobs this means checking out mozilla-unified) on the first job the worker takes, then performing a working directory checkout (update) to complete the given task. Subsequent jobs should be able to re-use the mozilla-unified repo, so instead of doing a full clone, they just pull the new heads from try into the previously cloned repo and then check them out into the working directory.

Is this supposed to happen on all task types? The only ones I have experience looking at are Windows build jobs, and I've never seen this incremental-clone behavior happen there, it's always been a painfully slow full clone. (Especially painful when I'm watching the task live and waiting to see whether configure succeeds at all)

Andrew Halberstadt [:ahal]

Comment 13

•

5 years ago

(In reply to :dmajor from comment #12)

Is this supposed to happen on all task types? The only ones I have experience looking at are Windows build jobs, and I've never seen this incremental-clone behavior happen there, it's always been a painfully slow full clone. (Especially painful when I'm watching the task live and waiting to see whether configure succeeds at all)

See bug 1527313 and bug 1528891.

Marco Castelluccio [:marco]

Comment 14

•

4 years ago

Could we re-prioritize this? As :ahal said, this might not be a huge savings in terms of $$$ (even though it adds up as we introduce new linting jobs over time), but it will allow results to be shown to developers earlier on Phabricator.

Flags: needinfo?(coop)

Marco Castelluccio [:marco]

Updated

•

4 years ago

Blocks: cost-reduction

Chris Cooper [:coop] (he/him)

Comment 15

•

4 years ago

•

Edited

(In reply to Marco Castelluccio [:marco] from comment #14)

Could we re-prioritize this? As :ahal said, this might not be a huge savings in terms of $$$ (even though it adds up as we introduce new linting jobs over time), but it will allow results to be shown to developers earlier on Phabricator.

Still happy to help as required, but I don't think this work is blocked on the Taskcluster team any more. After the Firefox CI cluster split in November, I believe releng is the correct group to implement a new, dedicated worker type optimized for running linting jobs in CI. Redirecting NI to :jlund

We also have access to better data now about concurrent tasks so it should be pretty easy to get a dedicated pool setup with a minimum capacity to make sure our cache hit rate is as high as possible.

Flags: needinfo?(coop) → needinfo?(jlund)

Jordan Lund (:jlund)

Comment 16

•

4 years ago

(In reply to Marco Castelluccio [:marco] from comment #14)

Could we re-prioritize this? As :ahal said, this might not be a huge savings in terms of $$$ (even though it adds up as we introduce new linting jobs over time), but it will allow results to be shown to developers earlier on Phabricator.

added to triage, will reply back later this week with update. Leaving needinfo open.

Justin Wood (:Callek)

Assignee

Comment 17

•

4 years ago

Attached file Bug 1589712 - Create srctest worker for linux/windows. r=tomprince — Details

I didn't change anything from original workers, gecko-3/t-linux-xlarge and gecko-t/t-win10-64 except naming and reducing the capacity on win10 to half what it was. I'm open to review comments asking me to make other changes

Phabricator Automation

Updated

•

4 years ago

Assignee: nobody → bugspam.Callek

Status: NEW → ASSIGNED

Jordan Lund (:jlund)

Updated

•

4 years ago

Flags: needinfo?(jlund)

Marco Castelluccio [:marco]

Comment 18

•

4 years ago

What's left to do here?

Flags: needinfo?(bugspam.Callek)

Justin Wood (:Callek)

Assignee

Comment 19

•

4 years ago

Attached file Bug 1589712 - Normalize windows worker name to be like linux one. r=jlorenzo — Details

Justin Wood (:Callek)

Assignee

Comment 20

•

4 years ago

Attached file Bug 1589712 - Switch worker types for t-linux-xlarge and t-win10-64 to new -source types for linting tasks to allow better reuse of caching of the source checkout for those lint tasks. r=marco — Details

Justin Wood (:Callek)

Assignee

Comment 21

•

4 years ago

(In reply to Marco Castelluccio [:marco] from comment #18)

What's left to do here?

I thought I had pushed these patches -- apparently I didn't, I'm sorry.

c#19 is one followup based on a review comment, and c#20 is what I think is necessary for the lint tasks to use a checkout appropriately. I'm not sure offhand if there are other jobs phab review stuff triggers which would benefit from the new worker type.

I also noticed there are some items that run shell scripts first, which might benefit from assuming checkout entirely as good followup, but I don't know enough about them to say for sure. Either way this patch should round out this bugs needs as specified.

Flags: needinfo?(bugspam.Callek)

Pulsebot

Comment 22

•

4 years ago

Pushed by jwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b02f0d3d7477
Switch worker types for t-linux-xlarge and t-win10-64 to new -source types for linting tasks to allow better reuse of caching of the source checkout for those lint tasks. r=marco

Stefan Hindli [:stefan_hindli]

Comment 23

•

4 years ago

Backed out changeset b02f0d3d7477 (Bug 1589712) for causing a gecko decision task bustage CLOSED TREE

https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&classifiedState=unclassified&revision=66d0cb03ba1ebb45480ac1d0a44867e2bb58abe7

https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=b02f0d3d7477ad8c9f22c46276e980a5a314944a&selectedJob=298415274

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=298415274&repo=autoland&lineNumber=902

Flags: needinfo?(bugspam.Callek)

Justin Wood (:Callek)

Assignee

Updated

•

4 years ago

Flags: needinfo?(bugspam.Callek)

Pulsebot

Comment 24

•

4 years ago

Pushed by jwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/97d8ce585f5b
Switch worker types for t-linux-xlarge and t-win10-64 to new -source types for linting tasks to allow better reuse of caching of the source checkout for those lint tasks. r=marco

Natalia Csoregi [:nataliaCs]

Comment 25

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/97d8ce585f5b

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

status-firefox77: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla77

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

4 years ago

Whiteboard: [ci-costs-2020:todo]

Justin Wood (:Callek)

Assignee

Comment 26

•

4 years ago

Attached file Bug 1589712 - Make t-linux-xlarge-source match t-linux-xlarge including capacity; r=tomprince — Details

Phabricator Automation

Updated

•

4 years ago

Attachment #9142163 - Attachment description: Followup to Bug 1589712. Make t-linux-xlarge-source match t-linux-xlarge including capacity, we had a large backlog of jobs with a 10 capacity. r=tomprince → Bug 1589712. Make t-linux-xlarge-source match t-linux-xlarge including capacity; r=tomprince

Phabricator Automation

Updated

•

4 years ago

Attachment #9142163 - Attachment description: Bug 1589712. Make t-linux-xlarge-source match t-linux-xlarge including capacity; r=tomprince → Followup to Bug 1589712. Make t-linux-xlarge-source match t-linux-xlarge including capacity, we had a large backlog of jobs with a 10 capacity. r=tomprince

Phabricator Automation

Updated

•

4 years ago

Attachment #9142163 - Attachment description: Followup to Bug 1589712. Make t-linux-xlarge-source match t-linux-xlarge including capacity, we had a large backlog of jobs with a 10 capacity. r=tomprince → Bug 1589712 - Make t-linux-xlarge-source match t-linux-xlarge including capacity; r=tomprince

Justin Wood (:Callek)

Assignee

Comment 27

•

4 years ago

Attached image Screenshot_2020-05-01 Try Reviewbot Usage.png — Details

So, at a glance with some back of napkin this looks like it was a win, although I don't know if further tweaking on number of instances, ttl of each instance, etc. could improve matters.

The source for this screenshot is https://datastudio.google.com/u/2/reporting/15a6j_NSmFOYk6ZZ5efTGmx3LUxZFu9I4/page/rFKOB if anyone can access.

Thanks go to Simon Fraser for creating these charts, we didn't spend too much time with trying to measure it, and the overall end-to-end times seem still within the noise limits of pre-change, as a whole.

Sylvestre Ledru [:Sylvestre]

Reporter

Comment 28

•

4 years ago

Many thanks Justin! :)

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

4 years ago

Whiteboard: [ci-costs-2020:todo] → [ci-costs-2020:done]

You need to log in before you can comment on or make changes to this bug.