Closed Bug 1617321 Opened 1 year ago Closed 1 year ago

Very slow task set rebuild (24 minutes)

Categories

(Firefox Build System :: Task Configuration, task)

Type: task
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla76
firefox76: fixed

People

(Reporter: sfink, Assigned: ahal)

References

Details

Attachments

(5 files)

I tend to touch taskcluster/ files occasionally, and in the last few months I've noticed the rebuild is taking a very very long time.

Specifically, I run mach try fuzzy or mach try fuzzy --full and it says "Task configuration changed, generating full task set" and then much later, the fuzzy selection UI appears.

The one I just ran took 24 minutes on my (fairly beefy) laptop.

I did ps a couple times in the middle, and it showed a bunch of hg outgoing processes:

   7378 ?        R     21:18 /usr/bin/python /home/sfink/bin/hg outgoing -r . --quiet --template {file_adds % '\n{file}'}{file_mods % '\n{file}'} default
   7384 ?        S      0:00 ssh sfink@mozilla.com@hg.mozilla.org hg -R integration/mozilla-inbound serve --stdio
   7396 ?        R     21:12 /usr/bin/python /home/sfink/bin/hg outgoing -r . --quiet --template {file_adds % '\n{file}'}{file_mods % '\n{file}'} default
   7398 ?        S      0:00 ssh sfink@mozilla.com@hg.mozilla.org hg -R integration/mozilla-inbound serve --stdio
   7399 ?        R     21:09 /usr/bin/python /home/sfink/bin/hg outgoing -r . --quiet --template {file_adds % '\n{file}'}{file_mods % '\n{file}'} default
   7401 ?        S      0:00 ssh sfink@mozilla.com@hg.mozilla.org hg -R integration/mozilla-inbound serve --stdio
   7403 ?        R     21:08 /usr/bin/python /home/sfink/bin/hg outgoing -r . --quiet --template {file_adds % '\n{file}'}{file_mods % '\n{file}'} default
   7406 ?        S      0:00 ssh sfink@mozilla.com@hg.mozilla.org hg -R integration/mozilla-inbound serve --stdio
   7407 ?        R     21:10 /usr/bin/python /home/sfink/bin/hg outgoing -r . --quiet --template {file_adds % '\n{file}'}{file_mods % '\n{file}'} default
   7409 ?        S      0:00 ssh sfink@mozilla.com@hg.mozilla.org hg -R integration/mozilla-inbound serve --stdio
   7414 ?        R     21:04 /usr/bin/python /home/sfink/bin/hg outgoing -r . --quiet --template {file_adds % '\n{file}'}{file_mods % '\n{file}'} default
   7419 ?        S      0:00 ssh sfink@mozilla.com@hg.mozilla.org hg -R integration/mozilla-inbound serve --stdio
   7447 pts/3    R+    21:23 /usr/bin/python /home/sfink/bin/hg outgoing -r . --quiet --template {file_adds % '\n{file}'}{file_mods % '\n{file}'} default
   7449 pts/3    S+     0:00 ssh sfink@mozilla.com@hg.mozilla.org hg -R integration/mozilla-inbound serve --stdio

I don't even know why it would be looking at mozilla-inbound for anything? I haven't used it in a long time. Those processes went away when the fuzzy selection UI appeared.

I should mention that while I have a decent laptop, my network connection isn't great. But those hg processes were at 99% CPU the whole time.

Summary: Very slow task set rebuild → Very slow task set rebuild (24 minutes)

Both Callek and I have recently switched to Python 3 hg, and both of us think that although it has been noticeably slow for a few months, it might be extra slow since that switch. (Maybe.)

I have never seen things be this slow, but I would like to toss in my "surely it should not take as long as it does to regenerate the set of tasks" opinion. It's probably a good 30-40s on my desktop.

I did ps a couple times in the middle, and it showed a bunch of hg outgoing processes:

Wait, are you saying hg outgoing is called multiple times?

(In reply to Mike Hommey [:glandium] from comment #4)

I did ps a couple times in the middle, and it showed a bunch of hg outgoing processes:

Wait, are you saying hg outgoing is called multiple times?

I don't believe any of those processes were parented to each other, and you can see that each one has a self-time of >21 minutes in the dump above, so: yes.

I don't know what regenerating the set of tasks requires, but I can say that mach taskgraph full takes 67 seconds on my machine. So roughly 23 of the 24 minutes come from whatever mach try fuzzy is doing beyond what mach taskgraph full does. Either that, or it's doing 7 simultaneous taskgraph generations (and possibly more in total, if it's doing #cores-1 at a time or something).

Which makes me think there are two separate sources of slowness that shouldn't get mixed up in one bug: the extreme slowdown that I'm seeing, and the nuisance-level slowdown that makes taskgraph generation take 1-2 minutes for everyone.

Attached is my flame graph of taskgraph (full) gen. It doesn't pinpoint any specific slowdown but does definitely indicate that splitting tests by path takes up a bunch of the time...

:sfink

So I have a theory on the quantity of the hg outgoing stuff... Do you have any hg hooks set up locally, specifically regarding linting?
https://searchfox.org/mozilla-central/source/python/mozlint/mozlint/roller.py#259

Secondly, even though you're not using inbound anymore, does your [paths] section in {repo_root}/.hg/hgrc still include it?

Lastly, one of those outgoing calls is probably due to https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/files_changed.py#84

This all could add up to a lot of time in try-pushing for you, due to the nature of outgoing versus inbound.

Flags: needinfo?(sphink)

For context, resolver.tests_by_path is the thing that's generating test metadata. I.e. it triggers the build backend, scans all moz.build files, generates test objects, etc. I'm not surprised it's taking that long, though I'm sure there are some easy wins we can come up with here (both on the taskgraph side and the build side).

However, this specific bug seems like a different issue... there should not be that many hg outgoing processes, and I don't know why it's using mozilla-inbound. So I vote we use this bug to track the outgoing problem (which seems to be at the root of the 24-minute times), then use bug 1617598 for general taskgraph perf.

(In reply to Justin Wood (:Callek) from comment #7)

:sfink

So I have a theory on the quantity of the hg outgoing stuff... Do you have any hg hooks set up locally, specifically regarding linting?
https://searchfox.org/mozilla-central/source/python/mozlint/mozlint/roller.py#259

Even mozlint should only be calling hg outgoing once, though. That part happens before we split jobs out into subprocesses.

Secondly, even though you're not using inbound anymore, does your [paths] section in {repo_root}/.hg/hgrc still include it?

Steve, you can run hg paths default here to see what that spits out. Mine points to mozilla-unified. Is there a chance your repo was originally cloned from mozilla-inbound?

Lastly, one of those outgoing calls is probably due to https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/files_changed.py#84

If linters are enabled, that would account for two outgoings. Actually, if you have the watchman hook enabled, that could account for another one (assuming the taskgraph was also building in the background).

I have no local hooks in either ~/.hgrc or $srcdir/.hg/hgrc. I do have watchman enabled (via the fsmonitor extension).

I have 3 checkouts I use. One has a default path set to mozilla-inbound (via ssh), another is set to mozilla-unified (via https), and the third is set to the local path to the first. I probably ought to clean that up. I haven't cared, because I always do hg pull unified so I didn't think I ever used the default path, but it appears that taskgraph staleness checking uses it for me? (And likely gets the wrong answer in one instance.)

I don't remember which repo was giving me the 24 minute time, but based on what I was working on at the time, it was most likely the mozilla-inbound one.

I haven't noticed a difference in speed between these repositories, but I probably wouldn't. I'm timing repo2 (with a default path set to the local repo1) right now. It only has 2 simultaneous hg outgoing processes running at 99% CPU rather than 7. Ok, it's done: a little over 16 minutes. So better, but not by much.

My test setup was:

% touch taskcluster/ci/hazard/kind.yml
% date
% ./mach try fuzzy
# Wait until it appears, hit Ctrl-C
% date

but I'm switching to something that I don't have to monitor and that gives me more info:

% touch taskcluster/ci/hazard/kind.yml
% /usr/bin/time -v ./mach try fuzzy --full -q "'slartibartfast"
Flags: needinfo?(sphink)

(In reply to Justin Wood (:Callek) from comment #6)

Created attachment 9128402 [details]
flame-graph of taskgraph full

Attached is my flame graph of taskgraph (full) gen. It doesn't pinpoint any specific slowdown but does definitely indicate that splitting tests by path takes up a bunch of the time...

Right, that's going to be for bug 1617598, since a ./mach taskgraph full doesn't seem any slower on my system than on anyone else's (67 seconds or so).

Wait, so ./mach taskgraph full isn't (ridiculously) slow, but mach try fuzzy is? That's very bizarre. I guess the slowdown is either happening outside of the taskgraph entirely, or else it's somehow caused by taskgraph.fast being set to True (which would be surprising). I can't think of any other differences between how mach taskgraph full and mach try fuzzy invoke task generation.
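For context, that flag is a module-level switch that mach try flips before generating tasks, roughly like this (a sketch only; the exact spot lives in the tryselect code):

    import taskgraph

    # `mach try` enables fast mode so task generation can skip work that only
    # matters for real decision tasks.
    taskgraph.fast = True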

P.S. You could add --no-push to your test case too, to prevent actually pushing.
P.P.S. You probably know this, but you can change a repo's default push by adding:

[paths]
default = https://hg.mozilla.org/mozilla-unified

to repo/.hg/hgrc.

This was generated with a populated cache in the state dir (state_dir = os.environ.get('MOZBUILD_STATE_PATH', os.path.expanduser('~/.mozbuild'))); the cache itself lives at cache/taskgraph inside that directory...

$ py-spy record -o /tmp/taskgraph.flame.svg -t -s -- ./mach try fuzzy --full -q "'slartibartfast" --no-push

For some reason trying to run that command with a blown away cache is actually causing a panic and I can't find a way to identify why :( I'll chip away at the flame graph a bit to see if I can find a better representation for try fuzzy timing.

(In reply to Andrew Halberstadt [:ahal] from comment #12)

I can't think of any other differences between how mach taskgraph full and mach try fuzzy invoke task generation.

The other difference is that fuzzy sets try_mode: try_select. There are a few places that look at that (search for is_try) and vary the behavior. Nothing that seems obviously related to the flamegraph, though. It does cause taskgraph.util.perfile to look at hg outgoing, though.
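For reference, that perfile check boils down to asking hg which files the outgoing changesets touch and deriving a chunk count from the result. A rough sketch (names and the chunking heuristic here are illustrative, not the actual taskgraph.util.perfile code):

    import math
    import subprocess

    # Same template as in the ps output above: print each added or modified
    # file of every outgoing changeset on its own line (the \n is interpreted
    # by hg's template engine).
    TEMPLATE = "{file_adds % '\\n{file}'}{file_mods % '\\n{file}'}"

    def outgoing_files(repo_path):
        # This is the slow step: it compares the local repo against whatever
        # the "default" path resolves to in .hg/hgrc.
        result = subprocess.run(
            ["hg", "outgoing", "-r", ".", "--quiet", "--template", TEMPLATE, "default"],
            cwd=repo_path, capture_output=True, text=True,
        )
        return [f for f in result.stdout.splitlines() if f]

    def number_of_chunks(repo_path, files_per_chunk=10):
        # Hypothetical heuristic: one chunk per N changed files; the real
        # calculation for test-verify/test-coverage lives in perfile.py.
        changed = outgoing_files(repo_path)
        return max(1, math.ceil(len(changed) / files_per_chunk))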

I found that if I omitted -s from py-spy it worked (-s profiles subprocesses). Here is one without a cached working directory.

:sfink, I'd be interested to know how your system differs here, if you are willing to install py-spy in a Python venv and give it a shot... It could help us pinpoint the cause of the lengthy slowdown for you.

% py-spy record -o /tmp/taskgraph.flame.svg -t -- ./mach try fuzzy --full -q "'slartibartfast" --no-push
py-spy> Sampling process 100 times a second. Press Control-C to exit.

Task configuration changed, generating full task set
no tasks selected

py-spy> Stopped sampling because process exitted
py-spy> Wrote flamegraph data to '/tmp/taskgraph.flame.svg'. Samples: 97515 Errors: 0
Error: No child process (os error 10)

97,515 samples at 100 per second is about 975 seconds, so that's about right: a little over 16 minutes.

Yikes, that's terrible.

My system runs the above command in just over a minute, same with mach taskgraph full.

So I'm still hung up on the fact that it's slow with mach try but fast with mach taskgraph. Though the flame graphs appear to show that task generation (and especially test chunking) is indeed the culprit.

Anyway here's a patch that disables test chunking in the taskgraph when taskgraph.fast is set to True (which it is for mach try). Would you mind trying it out?
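The gist of the change is an early return in the chunk calculation, roughly like this (a sketch only; the function signature and return value shown are illustrative, and the real change is in the attached patch):

    import taskgraph

    def perfile_number_of_chunks(is_try, try_task_config, head_repository, head_rev, type):
        # When `mach try` has set the module-level fast flag, skip the
        # hg-outgoing-based per-file calculation entirely and fall back to a
        # single chunk.
        if taskgraph.fast:
            return 1
        ...  # existing chunk computation based on the list of changed files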

Flags: needinfo?(sphink)

What was happening for me is that although I changed all of my repositories' default paths to mozilla-unified, I still had default:pushurl set to mozilla-inbound. This resulted in hg looking through every revision in unified that was not in inbound, which returned over 13,000 changesets (presumably from beta and release branches?). More problematically, it found over 100k file adds or changes, and it took at least 15 minutes or so to generate that list of files.

So the root problem here is that I pulled mozilla-unified, then looked at outgoing for mozilla-inbound. That finds a lot of stuff.

After eliminating the default:pushurl so that default now maps to mozilla-unified, I get similar times to everybody else.
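For anyone else hitting this, the problematic configuration looked roughly like this in the repo's .hg/hgrc (values illustrative):

    [paths]
    default = https://hg.mozilla.org/mozilla-unified
    default:pushurl = ssh://hg.mozilla.org/integration/mozilla-inbound

Deleting the default:pushurl line (or pointing it at a repository that is still being updated) is what brought the times back down, since hg outgoing then compares against mozilla-unified.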

I'm not sure why it looks at all these changesets when the outgoing command includes "-r .". I would expect that to ignore things like beta changesets that aren't ancestors of ".", but it includes them. Is mozilla-inbound not being updated anymore, and I'm just seeing the last 13k changesets? (I doubt it.)

I guess I could argue that this outgoing check is wrong in the first place -- all I really care about is whether anything relevant to the taskgraph has changed since the last time I cached it. But when I tried it, it was pretty slow to generate that information for an arbitrary revision (though maybe that was because I was comparing against a beta rev and there were tons of changes?)

Flags: needinfo?(sphink)

(In reply to Steve Fink [:sfink] [:s:] from comment #21)

I'm not sure why it looks at all these changesets when the outgoing command includes "-r .". I would expect that to ignore things like beta changesets that aren't ancestors of ".", but it includes them. Is mozilla-inbound not being updated anymore, and I'm just seeing the last 13k changesets? (I doubt it.)

The last push to inbound was on Nov 19, 2019. As I type this, there are 13,567 changesets between the tip of central and inbound.

I guess I could argue that this outgoing check is wrong in the first place -- all I really care about is whether anything relevant to the taskgraph has changed since the last time I cached it. But when I tried it, it was pretty slow to generate that information for an arbitrary revision (though maybe that was because I was comparing against a beta rev and there were tons of changes?)

Note that hg outgoing is not being called to determine whether to generate tasks[1], but as part of generating the tasks.

That being said, :ahal, it appears that the check is only used to determine the number of chunks for test-verify and test-coverage tasks. I suspect caching the number of chunks picked isn't particularly useful (and it can lead to weirdness if the number of chunks changes from one to not-one, or vice versa, between generation and use). I'm not sure what to suggest here, but I suspect we should do something to avoid calling hg outgoing in try-select (at least not to get the number of chunks, or not without some provision for handling that invalidation).

[1] It appears that it gets regenerated if the modified time of the cache is older than that of any of the files in the taskcluster/ directory.
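Conceptually, that freshness check amounts to something like this (a sketch with illustrative names, not the actual tryselect code):

    import os

    def cache_is_stale(cache_path, topsrcdir):
        # The cached task set is regenerated if any file under taskcluster/
        # has been modified more recently than the cache file itself.
        cache_mtime = os.path.getmtime(cache_path)
        taskcluster_dir = os.path.join(topsrcdir, "taskcluster")
        for root, _dirs, files in os.walk(taskcluster_dir):
            for name in files:
                if os.path.getmtime(os.path.join(root, name)) > cache_mtime:
                    return True
        return False

Which is why touching any file under taskcluster/ triggers the "Task configuration changed, generating full task set" message from the initial report.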

Assignee: nobody → ahal
Status: NEW → ASSIGNED
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/19cf91b5ab13
Disable taskgraph chunking when 'taskgraph.fast' is set r=tomprince
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla76