Move release1 indexer to use t3.2xlarge and only 1 backup server
Categories: Webtools :: Searchfox, enhancement
Tracking: Not tracked
People: Reporter: asuth; Assigned: asuth
Attachments: 4 files
As proposed in https://bugzilla.mozilla.org/show_bug.cgi?id=1779672#c10, I'm going to try moving release1 (config1.json), which contains mozilla-central, to an 8-core t3.2xlarge instance type to see how that moves our p90-and-higher search latencies. codesearch/livegrep will also be updated to be able to use all 8 cores. We'll drop back down to t3.xlarge if p90 and p95 don't decrease meaningfully.
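For context, here's a minimal boto3 sketch of what the instance-type bump looks like at the EC2 API level. This is illustrative only, not Searchfox's actual provisioning code, and the AMI id is a placeholder:

    import boto3

    ec2 = boto3.client("ec2")

    # Illustrative only: launch the release1 indexer on the larger instance type.
    # The AMI id and any networking/tagging details are placeholders.
    ec2.run_instances(
        ImageId="ami-PLACEHOLDER",
        InstanceType="t3.2xlarge",  # 8 vCPUs / 32 GiB, up from t3.xlarge's 4 / 16 GiB
        MinCount=1,
        MaxCount=1,
    )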
Comment 1 • 3 years ago (Assignee)
This is landed and should take effect for the utc22 run.
This will also give the VM 32 GiB of memory. That should avoid cache competition with the non-m-c repositories for their crossref and codesearch databases, and allow for additional caching of the m-c git repo, which currently clocks in at 7.0 GiB. That's not a huge win for our current feature-set, which doesn't really do any git scans, but it could be nice in the future since the "query" endpoint may gain some git-history-related features. (Although anything really useful should of course eventually be pre-computed and then indexed by livegrep or something similar.)
Comment 2 • 3 years ago (Assignee)
After mitigating bug 1779939 and re-triggering, things seem to have worked. That said, it's not clear codesearch is actually leveraging the extra threads meaningfully; it seems like codesearch only ends up using 2 threads:
- htop for a longer query like a 4-digit hex-string shows only 2 cores hitting full utilization (or equivalent distribution)
- the "git_time" stat frequently ends up being almost exactly 2x the "total_time".
I presume the issue is that chunks are the atomic unit of labor division. I see there is a config value chunk_power with a default of 27, i.e. 128 MiB chunks, which does seem like a pretty big work unit. I'm going to drop the chunk power by a factor of 8 (= 2^3) to 24. My rationale is that for load balancing it's better for each thread to potentially see more than one work unit.
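Here's a rough back-of-the-envelope sketch of why the chunk size matters for parallelism. The index size below is a made-up figure chosen to match the observed 2-thread behavior, not a measured number:

    # Illustrative arithmetic only; the real chunk count depends on how much
    # deduplicated text livegrep actually indexes for mozilla-central.
    index_bytes = 256 * 1024 ** 2  # hypothetical ~256 MiB of indexed text
    cores = 8

    for power in (27, 24):
        chunk_bytes = 1 << power
        n_chunks = max(1, index_bytes // chunk_bytes)
        print(f"chunk_power={power}: {chunk_bytes >> 20} MiB per chunk, "
              f"{n_chunks} chunk(s) for {cores} cores")

    # chunk_power=27: 128 MiB per chunk, 2 chunk(s) for 8 cores
    # chunk_power=24: 16 MiB per chunk, 16 chunk(s) for 8 cores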
Comment 3 • 3 years ago (Assignee)
This has landed, and after re-triggering the indexer I am indeed seeing the expected parallelism; the git_time and total_time numbers now line up with it.
The "nsmappedattribute" test case can get down to ~1.8 secs from ~2.5 secs after bug 1779672.
The hex code test case sees massive improvements but we can also see how brutal the initial page-in is:
2022-07-18 01:04:32.611831/pid=20891 - request(handled by 20892) /mozilla-central/search?q=2ac3&path=
2022-07-18 01:04:32.613941/pid=20892 - QUERY line: "2ac3", file: ".*", fold_case: true,
2022-07-18 01:04:41.740173/pid=20892 - codesearch result with 998 line matches across 181 paths - 9.126436 : re2_time: 8, git_time: 47852, index_time: 15584, exit_reason: MATCH_LIMIT, total_time: 9071,
2022-07-18 01:04:41.758766/pid=20892 - identifier_search "2ac3" - 0.010385
2022-07-18 01:04:41.759931/pid=20892 - search.get() - 0.001111
2022-07-18 01:04:41.770014/pid=20891 - finish pid 20892 - 9.157628
2022-07-18 01:04:46.672930/pid=20898 - request(handled by 20899) /mozilla-central/search?q=2AC3&path=
2022-07-18 01:04:46.675116/pid=20899 - QUERY line: "2AC3", file: ".*", fold_case: true,
2022-07-18 01:04:50.078560/pid=20899 - codesearch result with 996 line matches across 199 paths - 3.403652 : re2_time: 11, git_time: 19280, index_time: 4, exit_reason: MATCH_LIMIT, total_time: 3387,
2022-07-18 01:04:50.088464/pid=20899 - identifier_search "2AC3" - 0.001677
2022-07-18 01:04:50.089593/pid=20899 - search.get() - 0.001091
2022-07-18 01:04:50.098166/pid=20898 - finish pid 20899 - 3.424760
2022-07-18 01:05:04.906701/pid=20908 - request(handled by 20909) /mozilla-central/search?q=2aC3&path=
2022-07-18 01:05:04.909433/pid=20909 - QUERY line: "2aC3", file: ".*", fold_case: true,
2022-07-18 01:05:08.823518/pid=20909 - codesearch result with 998 line matches across 179 paths - 3.914413 : re2_time: 8, git_time: 20776, index_time: 5, exit_reason: MATCH_LIMIT, total_time: 3898,
2022-07-18 01:05:08.833412/pid=20909 - identifier_search "2aC3" - 0.001693
2022-07-18 01:05:08.834525/pid=20909 - search.get() - 0.001067
2022-07-18 01:05:08.845302/pid=20908 - finish pid 20909 - 3.938121
Specifically, the initial page-in search takes ~9.2 s, but effectively equivalent queries afterwards, with the cache hot, clock in at ~3.4 s and ~3.9 s.
My next quick steps are:
- The provisioner will start installing vmtouch (https://hoytech.com/vmtouch/).
- We will touch mozilla-central's crossref-extra, crossref, and livegrep.idx into cache, in that order, synchronously as part of the web-server spin-up (see the sketch after this list). I chose that order because the later files are more important than the earlier ones.
- I am going to do this hackily. All that matters is that it happens for mozilla-central on release1. We can generalize/elegantize later.
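Here's a minimal sketch of what that spin-up step could look like, assuming hypothetical index paths; it's not the actual Searchfox provisioning code:

    import subprocess

    # Hypothetical paths; the real on-disk layout may differ.
    # vmtouch -t touches every page of a file so it lands in the page cache,
    # and -q suppresses the progress output. The most important file is
    # touched last so that, under LRU-ish eviction, the earlier (less
    # important) data is what gets dropped first under memory pressure.
    PRECACHE_FILES = [
        "/index/mozilla-central/crossref-extra",  # least important
        "/index/mozilla-central/crossref",
        "/index/mozilla-central/livegrep.idx",    # most important
    ]

    for path in PRECACHE_FILES:
        subprocess.run(["vmtouch", "-t", "-q", path], check=True)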
Comment 4 • 3 years ago (Assignee)
Comment 5 • 3 years ago (Assignee)
The vmtouch stuff worked well; a fresh "2ac3" search and "nsmappedattribute" search:
2022-07-18 05:57:08.544971/pid=19410 - request(handled by 19411) /mozilla-central/search?q=2ac3&path=
2022-07-18 05:57:08.546795/pid=19411 - QUERY line: "2ac3", file: ".*", fold_case: true,
2022-07-18 05:57:11.980368/pid=19411 - codesearch result with 998 line matches across 196 paths - 3.433835 : re2_time: 8, git_time: 18829, index_time: 16, exit_reason: MATCH_LIMIT, total_time: 3378,
2022-07-18 05:57:11.998036/pid=19411 - identifier_search "2ac3" - 0.009444
2022-07-18 05:57:11.999316/pid=19411 - search.get() - 0.001220
2022-07-18 05:57:12.008485/pid=19410 - finish pid 19411 - 3.463101
2022-07-18 05:57:42.685431/pid=19566 - request(handled by 19567) /mozilla-central/search?q=nsmappedattribute&path=&case=false&regexp=false
2022-07-18 05:57:42.687238/pid=19567 - QUERY line: "nsmappedattribute", file: ".*", fold_case: true,
2022-07-18 05:57:44.440411/pid=19567 - codesearch result with 458 line matches across 82 paths - 1.753351 : re2_time: 25, git_time: 4486, index_time: 27, total_time: 1740,
2022-07-18 05:57:44.457333/pid=19567 - search_files "nsmappedattribute" - 0.016603
2022-07-18 05:57:44.543932/pid=19567 - identifier_search "nsmappedattribute" - 0.086530
2022-07-18 05:57:44.545053/pid=19567 - search.get() - 0.001071
2022-07-18 05:57:44.548650/pid=19566 - finish pid 19567 - 1.862811
I'm going to leave this open until we get Monday's utc10 p90+ numbers, but this seems like a win on the RAM pre-caching front alone, even setting aside the CPU improvement. After pre-caching, we're at 9.5 GiB free (with 19 GiB cached and 1.6 GiB used) per free -h.
Comment 6 • 3 years ago (Assignee)
I took a peek at the utc22 web-server after it rotated out. We're using the utc10 server for our sampling, so I'll save the victory announcement until we have that data, but for now it looks great.
One thing I did notice was that nearly all of our slowest searches were clearly coming from comm-central, not mozilla-central, so I've augmented the script to optionally filter by repo/tree so that c-c doesn't distort our statistics; since I still have the original slow web logs, we can compare m-c to m-c. Interestingly, c-c's stats had actually made m-c's performance look slightly better in the slow logs! Note that because we use a separate codesearch/livegrep instance for each tree, comm-central being blocked on I/O shouldn't impact m-c search performance as long as the nginx server hasn't hit a proxying connection limit.
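Here's a small sketch of the kind of per-tree filtering that augmentation implies, based on the request-line format quoted above; the real script and its log format may well differ:

    import re

    # Matches the tree name in request lines like:
    #   "... request(handled by 20892) /mozilla-central/search?q=2ac3&path="
    REQUEST_RE = re.compile(r"request\(handled by \d+\) /([^/]+)/search\?")

    def filter_tree(log_lines, tree="mozilla-central"):
        """Yield only the request lines belonging to the given tree."""
        for line in log_lines:
            m = REQUEST_RE.search(line)
            if m and m.group(1) == tree:
                yield line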
When utc22 rotated out, we retained 100% caching on our 3 pre-cached files, and free -h indicates we're at 4.4 GiB free (down from 9.5 GiB just after activation), with 24 GiB cached (up from 19 GiB) and 1.8 GiB used (up from 1.6 GiB).
Comment 7 • 3 years ago (Assignee)
Well, shoot. The "only keep 1 backup server" logic, where I explicitly called out that I wasn't dealing with ordering the candidates for shutdown, decided to get rid of today's utc10 server and keep yesterday's utc22 server. This means the unscientific comparison I was going to run is now unscientific in different ways than I was planning, because I can't compare utc10 server runs to utc10 server runs. (The web logs go away with the server, like searches in the rain.)
Comment 8 • 3 years ago (Assignee)
I sent an email about the improved performance to dev-platform; it's on Google Groups with not-great formatting because it ended up as text/plain and I didn't spend a lot of time considering the implications of that.
For posterity, I re-provide the differences here:
OLD Dynamic Search Request Latencies for mozilla-central (seconds)
cache_status   _count    p50    p66    p75    p90    p95    p99
----------------------------------------------------------------
MISS             2025   0.18   0.38   0.73   2.76   4.23   9.12
HIT               114      0      0      0      0      0      0

NEW Dynamic Search Request Latencies for mozilla-central (seconds)
cache_status   _count    p50    p66    p75    p90    p95    p99
----------------------------------------------------------------
MISS             3396   0.06   0.09   0.17   0.79   1.92   2.74
HIT               219      0      0      0      0      0      0
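For reference, here's a tiny sketch of how percentile rows like the ones above can be computed from a list of per-request latencies; it is not the actual reporting script, and it uses a simple nearest-rank percentile:

    def percentile(sorted_vals, p):
        """Nearest-rank-style percentile over an already-sorted list."""
        if not sorted_vals:
            return 0.0
        idx = min(len(sorted_vals) - 1,
                  round(p / 100 * (len(sorted_vals) - 1)))
        return sorted_vals[idx]

    def latency_row(label, latencies):
        """Format one table row: count plus p50/p66/p75/p90/p95/p99."""
        vals = sorted(latencies)
        cols = [percentile(vals, p) for p in (50, 66, 75, 90, 95, 99)]
        return f"{label:<12} {len(vals):>6} " + " ".join(f"{c:6.2f}" for c in cols)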