Open Bug 1567724 Opened 1 year ago Updated 1 year ago

Indexing duration regression due to low output-file throughput on mozilla-releases indexer.


(Webtools :: Searchfox, defect)

Not set


(Not tracked)



(Reporter: asuth, Assigned: asuth)


(Blocks 1 open bug)



(3 files)

The mozilla-releases indexer for today is currently at 9.5 hours runtime. A cursory investigation of the indexer log shows it took about 54min for the stage to complete for mozilla-esr60, 2 hours for mozilla-release, 2h9min for mozilla-esr68, and mozilla-beta is currently about to hit the 2 hour mark.

The parallel batching job says that for our mozilla-central job (extracting an AVG value from near the end of the run; which may be the wrong average to grab), each parallel job took about 6 seconds to run on average, whereas we're looking at ~11.3s for mozilla-esr60, ~21.9s for mozilla-release, ~23.2s for mozilla-esr68, and ~26.8 for mozilla-beta.

We recently changed the configuration for this indexer because the indexer was filling up.

General data dump


/cpu/procinfo shows 8 cores like so:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 62
model name	: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
stepping	: 4
microcode	: 0x42e
cpu MHz		: 2800.106
cache size	: 25600 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt
bugs		:
bogomips	: 5600.21
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

top matches up fairly well too.

7099 ubuntu    20   0 3030696 624604 329724 R  90.7  4.1   0:08.99 output-file                                                                                                               
 7122 ubuntu    20   0 3030704 624764 329804 R  90.7  4.1   0:09.01 output-file                                                                                                               
 7008 ubuntu    20   0 3030684 623288 328384 R  90.4  4.0   0:09.04 output-file                                                                                                               
 7021 ubuntu    20   0 3030696 623668 328636 R  90.4  4.1   0:09.06 output-file                                                                                                               
 7066 ubuntu    20   0 3030736 624716 329748 R  90.4  4.1   0:09.02 output-file                                                                                                               
 7176 ubuntu    20   0 3030792 624568 329720 R  90.4  4.1   0:08.85 output-file                                                                                                               
 7181 ubuntu    20   0 3030804 624672 329708 R  90.4  4.1   0:08.85 output-file                                                                                                               
 7163 ubuntu    20   0 3030764 624632 329736 R  90.0  4.1   0:08.87 output-file                                                                                                               


ubuntu@ip-elided:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            7.4G     0  7.4G   0% /dev
tmpfs           1.5G  8.7M  1.5G   1% /run
/dev/xvda1      7.8G  5.4G  2.0G  74% /
tmpfs           7.4G     0  7.4G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           7.4G     0  7.4G   0% /sys/fs/cgroup
/dev/xvdb        79G   56M   75G   1% /mnt
/dev/xvdf       296G  234G   47G  84% /index
tmpfs           1.5G     0  1.5G   0% /run/user/1000

During my investigations I found two possible sources of perf problems that make the output-file stage slow. One is the generation of unused analysis data from the clang plugin (bug 1511025). The other is the blame-skipping code that I added. It would be worth removing that and seeing what the numbers are like. I kind of want to remove it entirely, because the results it produces are not very good. I have a different solution in the works although I'm not sure what kind of performance it will give.

However I'm surprised there's such a big discrepancy between m-c avg time and the other branches. Makes sense that esr60 is faster because it doesn't do rust/C++ indexing and so avoids the first problem.

I don't think it's blame. wc .git-blame-ignore-revs shows esr68 has 118 lines and central (master) has 453 on gecko-dev.

In terms of analysis data, it seems like there's a similar amount of data for mozilla-central as for the others, although obviously the mozilla-releases job has 3 full repositories whereas release only has 1-2 (does comm-central gobble up the mozilla-central analysis data too?). The numbers below are from the web-servers for yesterday's full completed indexing jobs.


ubuntu@ip-elided:~/index$ du -csh */analysis
1.8G	comm-central/analysis
8.7G	mozilla-central/analysis
59M	mozilla-mobile/analysis
200M	nss/analysis
336K	whatwg-html/analysis
11G	total


ubuntu@ip-elided:~/index$ du -csh */analysis
9.0G	mozilla-beta/analysis
1.4G	mozilla-esr60/analysis
8.8G	mozilla-esr68/analysis
8.8G	mozilla-release/analysis
28G	total

And for the contents of the index directories in their entirety, a specific sampling:


ubuntu@ip-elided:~/index$ du -csh mozilla-central/*
8.7G 	mozilla-central/analysis
296K	mozilla-central/
9.5M	mozilla-central/android-armv7.generated-files.tar.gz
307M	mozilla-central/
5.0M	mozilla-central/
35M 	mozilla-central/
9.0M	mozilla-central/bugzilla-components.json
4.7G 	mozilla-central/crossref
664M	mozilla-central/description
447M	mozilla-central/dir
8.0K 	mozilla-central/downloads.lst
25G  	mozilla-central/file
4.5G 	mozilla-central/gecko-blame
8.8G	mozilla-central/gecko-dev
12K  	mozilla-central/help.html
523M	mozilla-central/identifiers
44K 	mozilla-central/idl-files
12K  	mozilla-central/ipdl-files
4.0K 	mozilla-central/ipdl-includes
1.5M	mozilla-central/js-files
160M	mozilla-central/jumps
296K	mozilla-central/
9.4M	mozilla-central/linux64.generated-files.tar.gz
313M	mozilla-central/
5.1M 	mozilla-central/
50M 	mozilla-central/
6.3G 	mozilla-central/livegrep.idx
296K	mozilla-central/
9.4M	mozilla-central/macosx64.generated-files.tar.gz
328M	mozilla-central/
5.1M 	mozilla-central/
51M 	mozilla-central/
3.2G 	mozilla-central/objdir
28K  	mozilla-central/objdir-dirs
316K	mozilla-central/objdir-files
1.3M	mozilla-central/repo-dirs
19M 	mozilla-central/repo-files
70M 	mozilla-central/rustlib
4.0K 	mozilla-central/target.json
8.0K 	mozilla-central/templates
304K	mozilla-central/
9.7M	mozilla-central/win64.generated-files.tar.gz
355M	mozilla-central/
5.1M 	mozilla-central/
66M 	mozilla-central/
64G	total


ubuntu@ip-elided:~/index$ du -csh mozilla-esr68/*
8.8G	mozilla-esr68/analysis
292K	mozilla-esr68/
9.3M	mozilla-esr68/android-armv7.generated-files.tar.gz
298M	mozilla-esr68/
3.5M	mozilla-esr68/
32M 	mozilla-esr68/
8.9M	mozilla-esr68/bugzilla-components.json
4.8G	mozilla-esr68/crossref
661M	mozilla-esr68/description
445M	mozilla-esr68/dir
8.0K 	mozilla-esr68/downloads.lst
25G  	mozilla-esr68/file
4.5G 	mozilla-esr68/gecko-blame
8.7G 	mozilla-esr68/gecko-dev
12K  	mozilla-esr68/help.html
556M	mozilla-esr68/identifiers
44K 	mozilla-esr68/idl-files
12K  	mozilla-esr68/ipdl-files
4.0K 	mozilla-esr68/ipdl-includes
1.4M	mozilla-esr68/js-files
172M	mozilla-esr68/jumps
292K	mozilla-esr68/
11M 	mozilla-esr68/linux64.generated-files.tar.gz
301M	mozilla-esr68/
3.6M	mozilla-esr68/
62M 	mozilla-esr68/
6.3G 	mozilla-esr68/livegrep.idx
292K	mozilla-esr68/
11M 	mozilla-esr68/macosx64.generated-files.tar.gz
313M	mozilla-esr68/
3.6M	mozilla-esr68/
61M 	mozilla-esr68/
5.4G 	mozilla-esr68/objdir
28K  	mozilla-esr68/objdir-dirs
340K	mozilla-esr68/objdir-files
1.3M	mozilla-esr68/repo-dirs
19M 	mozilla-esr68/repo-files
70M 	mozilla-esr68/rustlib
4.0K 	mozilla-esr68/target.json
8.0K 	mozilla-esr68/templates
300K	mozilla-esr68/
11M 	mozilla-esr68/win64.generated-files.tar.gz
345M	mozilla-esr68/
3.6M	mozilla-esr68/
74M 	mozilla-esr68/
67G	total

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #2)

I don't think it's blame. wc .git-blame-ignore-revs shows esr68 has 118 lines and central (master) has 453 on gecko-dev.

I kicked off a mozilla-releases push on the kats channel with the blame-skipping stuff disabled to confirm one way or another. I'll report back times once it's done.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #3)

does comm-central gobble up the mozilla-central analysis data too?

Comm-central doesn't use any analysis data, it just has plaintext search for m-c.

My test run took a total of 10h54m to finish (from first date output to last). runtimes (again by comparing date output before/after) were:
mozilla-esr60: 57m
mozilla-release: 1h54m
mozilla-esr68: 2h26m
mozilla-beta: 3h2m

Given how mozilla-beta is over an hour worse than what you said in comment 0, it makes sense that this is due to I/O speed (randomly getting a bad disk, maybe something that has contention with somebody else's VM) rather than CPU-bound.

At there's an "EBS optimized" feature available for instances that might be worth trying out on the indexer. That might give us better I/O performance. Or we could see if one the storage-optimized instance types is better suited to our workload. I don't know that :billm did any investigation into which instance type is most appropriate; I know I haven't.

I terminated the web-server for the test run I did, but here's the index-log for posterity. I started another test run with EbsOptimized=true on the indexer to see how that compares.

The ebs-optimized run didn't do any better.

Hm, so apparently our "c3.2xlarge" instance types are now "previous generation" and documented at Current gen is

I wonder if they've de-prioritized the instances we're using for EBS purposes or something. One possibility would be that our current instances (and new "c5d" but not normal "c5") have local storage. We could switch to running the indexing on local storage and only copy it over upon completion or failure/etc.

Or maybe we should just mash buttons and upgrade to newer instance types? For US West where we run with on demand pricing ( for current gen), the pricing is as follows, with me including the comparable current gen and one tier up with local disk and without.

  • c3.2xlarge: $0.42/hr. 8 vCPU, 28 ECU, 15 GiB RAM, 2 x 80 GB SSD local storage.
  • c5.2xlarge: $0.34/hr. 8 vCPU, 34 ECU, 16 GiB RAM, EBS only storage.
  • c5d.2xlarge: $0.384/hr. 8 vCPU, 34 ECU, 16GiB RAM, 1 x 200 NVMe SSD.
  • c5.4xlarge: $0.58/hr. 16 vCPU, 68 ECU, 32 GiB RAM, EBS only storage.
  • c5d.4xlarge: $0.768/hr. 16 vCPU, 68 ECU, 32 GiB RAM, 1 x 400 NVMe SSD.

Oh, hm, it looks like we're not provisioning enough IOPS for our volumes? The running indexer volume says it has 900 IOPS provisioned. The tooltip text says: "The requested number of I/O operations per second that the volume can support. For Provisioned IOPS (SSD) volumes, you can provision up to 50 IOPS per GiB. For General Purpose (SSD) volumes, baseline performance is 3 IOPS per GiB, with a minimum of 100 IOPS and a maximum of 16000 IOPS. General Purpose (SSD) volumes under 1000 GiB can burst up to 3000 IOPS. "

Given that we've provisioned 300 GiB, it sounds like we could ask for 15000 instead of 900, with us currently provisioning 6% of allowed capacity?

Whoops, I got confused between the "gp2" volumes (which we use and are cheaper) and the Provisioned IOPS SSD (io1) which is what the 50 thing is. We're getting 900 IOPS because 3 * 300 GIB = 900.

Since I'm good at copying and pasting, I'll also include the EBS-Optimized instance blurbs from
EBS-optimized instances enable EC2 instances to fully use the IOPS provisioned on an EBS volume. EBS-optimized instances deliver dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 and 4,000 Megabits per second (Mbps) depending on the instance type used. The dedicated throughput minimizes contention between Amazon EBS I/O and other traffic from your EC2 instance, providing the best performance for your EBS volumes. EBS-optimized instances are designed for use with both Standard and Provisioned IOPS Amazon EBS volumes. When attached to EBS-optimized instances, Provisioned IOPS volumes can achieve single digit millisecond latencies and are designed to deliver within 10% of the provisioned IOPS performance 99.9% of the time.

For Current Generation Instance types, EBS-optimization is enabled by default at no additional cost. For Previous Generation Instances types, EBS-optimization prices are on the Previous Generation Pricing Page.

The hourly price for EBS-optimized instances is in addition to the hourly usage fee for supported instance types.

As I read that, that's only somewhat helpful if we're experiencing network contention from ourselves, which I guess we're not.

My inclination would be to see if we can run the indexing tasks locally on local disk and copy off when we succeed/fail.

So I started down the yak hole here. To start, I changed the scripts so that we start indexing on the instance local storage, then move each index directory to EBS storage once indexing completes. However, our c3.2xlarge instance only had 80GB of local storage, whereas c5d.2xlarge has 200 and it's NVME, so I decided to also try upgrading us to the newer instance types.

I'm in the process of trying out moving us to the newer "nitro" instance types. These apparently get us "enhanced networking" through the "elastic networking adapter", plus more optimized EBS access via system NVME drivers. I'm creating "-ena" suffixed AMI's based on current Ubuntu 16.04 images which already have the ENA stuff (and more!) in order to avoid breaking the world as I try this experiment. As discussed somewhat in comment 10 as well, these new image types are both cheaper and better.

  • t2.large => t3.large: $0.0928 per Hour => $0.0832 per Hour
  • c3.2xlarge => c5d.2xlarge: => $0.42 per Hour => $0.384 per Hour

The main wrinkle is that for NVME-exposed EBS per we lose control over explicitly specifying attachment names like "/dev/xvdf" and instead the names just get issued by the NVME device driver. has some context, as do some other places. Because we only attach a single EBS store, it looks like these are sufficiently boring/predictable that it's not a big deal though.

I have an indexer job running on the "dev" channel for mozilla-releases.json and we'll see how it does time-wise. It's still in the setup phase right now.

Assignee: nobody → bugmail

Good news: Much faster! The dev mozilla-releases.json indexer completed indexing after ~6.5hrs (before web-server spin-up), whereas the production one is still going strong at 11hrs.
Bad news: The trend of each successive taking longer continues, but maybe there's just more/bigger files as we move from esr60 to 68 to esr68 to 69? I suppose I can try re-ordering the json to tell if things are deterministic or just slowing down.

Here are the before/after times with kats' numbers from comment 6 as before and the "dev" run as post-arrow:
mozilla-esr60: 57m => 41m. 252 parallel chunks.
mozilla-release: 1h54m => 1h3m. 307 parallel chunks.
mozilla-esr68: 2h26m => 1h9m. 305 parallel chunks.
mozilla-beta: 3h2m => 1h29m. 311 parallel chunks.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #14)

I suppose I can try re-ordering the json to tell if things are deterministic or just slowing down.

As became obvious when I looked at the file, we weren't actually processing these in the order they were found in the JSON file, but instead in the order that Python's dict decided to place the keys in.

  • ['mozilla-beta', 'mozilla-release', 'mozilla-esr60', 'mozilla-esr68'] says OrderedDict
  • ['mozilla-esr60', 'mozilla-release', 'mozilla-esr68', 'mozilla-beta'] says normal dict, the betrayer.

I've triggered the indexer on "dev" with the mount-loop changes :kats requested as well as a speculative commit that changes to use an OrderedDict.

Blocks: 1579292

After an AWS lambda function oversight on my part, the mozilla-releases indexer completed indexing last night in ~6h20m per index-log first/last date. The comment 16 run hit a speed-bump, but I'm re-running it now on the "asuth" channel. Today's mozilla-releases job will also include the bug 1511025 fixes as well, so we may have both even more promising results plus perhaps a hint at additional low-hanging fruit.

Runs completed, plus I included comment 14's "after" values as time 3. (Note that the time 3 values didn't include the template fixes, but time 1 and time 2 do.) Ignoring such pedestrian concerns as sample size, there does seem to be some amount of slowdown, but esr60 is consistently happy regardless of location.

Note, however, that our mozilla-central indexers now blaze through mozilla-central's in 25.7 minutes (yesterday, no template fix from bug 1511025) and 24.65 minutes (today, yes template fix). I'm going to re-trigger mozilla-releases.json (in config order) and monitor it a bit to see if there's any obvious rogue processes or anything like that.

config order:

repo (in config order) order 1 time 1 (mins) order 2/3 time 2 (mins) time 3 (mins)
mozilla-beta 1 66.2 4 90.7 89
mozilla-release 2 77.3 2 63.9 63
mozilla-esr60 3 40.8 1 39.3 41
mozilla-esr68 4 89.23 3 71.6 69

python dict is the boss order:

repo (in pyboss order) order 1 time 1 (mins) order 2/3 time 2 (mins) time 3 (mins)
mozilla-esr60 3 40.8 1 39.3 41
mozilla-release 2 77.3 2 63.9 63
mozilla-esr68 4 89.23 3 71.6 69
mozilla-beta 1 66.2 4 90.7 89

Thanks to the power of multiple monitors, I watched a mozilla-releases.json indexer run under htop and also another config.json run to compare while I did other stuff. There were no surprise processes chewing up the machine.

What I did see was that all 8 of the output-file binaries' memory usage was capable of growing up to 3G RES with 1G SHR that put the system-wide memory usage (not buffers, not cache) in the 12-15G range, and the system only had ~15G after kernel overhead. This appeared to obliterate the disk cache and when the output-file binaries completed and were replaced by fresh ones by parallel they would all block on IO at a high rate for several seconds, displaying in the "D" state with CPU utilization well below 100%. In general, the output-file jobs all appeared to be largely phase-aligned, presumably due to the resulting I/O spikes being all of the fresh instances reading exactly the same data and that data being successfully cached, allowing for a multi-second catch-up window to re-align themselves.

To test how much memory mattered, I terminated the mozilla-releases.json run and re-triggered under an m5d.2xlarge which is basically a c5d.2xlarge with 2x the memory (16G becomes 32G). Local instance storage also bumps up from 200G to 300G, but since our usage of instance storage maxes out at ~86G I'm not sure the SSD performance characteristics change from the extra space. That run is time 4 below, with memory available for each run also called out in the header.

repo (in config order) order 1 time 1 (16G) time 4 (32G) order 2/3 time 2 (16G) time 3 (16G)
mozilla-beta 1 66.2 64.0 4 90.7 89
mozilla-release 2 77.3 63.4 2 63.9 63
mozilla-esr60 3 40.8 42.0 1 39.3 41
mozilla-esr68 4 89.23 63.5 3 71.6 69

I think one conclusion here is that memory scarcity does indeed seem to impact the indexing times, and it seems likely to be responsible for the apparent slowdown observed over time on the indexer. I think the other conclusions are that output-file does indeed need some optimization and there's still something weird going on, because mozilla-central doesn't have the problem. And that if we don't do that optimization soon or move from c5d to m5d instances, there's a real chance of OOM process terminations breaking the indexer. It's not immediately obvious if the reason mozilla-esr60 lives such a fast and furious life is because its jumps file is somehow only 16M while everyone else's jumps file is ~160M, or if it just benefits of having 20 months less blame data to deal with.

And it does appear blame is likely a large factor here. It appears the retained memory between iterating over different files in output files that's responsible for the growth over time is:

    let mut diff_cache = git_ops::TreeDiffCache::new();

Presumably as we process the 600-1800 files passed to output-file (from running wc /tmp/*.par, unsure if that might also include previous parallel runs), we end up observing more and more revisions/whatever.

For next steps I'll try some combination of using perf to see if there are obvious problem spots in output-file, and also perhaps instrument it with and so we can get some visibility and cool histograms.

I thought we tried disabling the blame-skipping code (which is what that TreeDiffCache is used for) and it didn't help. But maybe the bottleneck has shifted now. Regardless we can disable the blame skipping code if it helps now. I haven't had a chance to work on the replacement implementation much because most of my free time these days comes in bite-size chunks. Hopefully within a month I'll have some longer chunks of time to pick that up again.

Here's a perf profile gathered by running:

sudo perf record -F 49 -a --call-graph dwarf -- sleep 300
sudo perf script --header > stacks.2019-09-07-dwarf-1

And then feeding the output to and messing with the CSS to make the graph super-wide to make the screenshot more useful and then hitting "Whole Profile". I'll also upload the gzipped raw perf stacks for those who want to investigate locally. has landed which switches us from c5d to m5d machines as tested in comment 19. As discussed in the commit message there the basic idea is:

  • The machine rate costs a little more, but a direct cost savings is realized in improved runtime (also, our previous c3.2xlarge machine of a few days ago was $0.42, so we are actually saving money entirely after that change and others on this bug already). And the savings effectively go up as we add builds, like the proposed "ash" indexing.
  • The potentially larger motivation is that given how close we are to OOM crashes, the human hassle of dealing with those crashes is an even higher cost. Also, we may need/want the extra memory anyways to speed up indexing. That is, we may want to cache even more git stuff, and I think we are likely better off using GNU parallel (or similar) to leverage the embarassingly-parallel nature of stuff rather than getting into complicated delayed fork-clone magic. (Although if someone has a lot of time to try that out, that would likely be a super-duper-huge win.)

I updated the lambda jobs so m5d.2xlarge will be used for indexers starting tomorrow. Many thanks to :billm for his leaving us with excellent documentation/tooling for these things!

You need to log in before you can comment on or make changes to this bug.