Investigate router.py search performance on "nsMapped" and other prefixes leading to "nsMappedAttributes"
Categories
(Webtools :: Searchfox, task)
Tracking
(Not tracked)
People
(Reporter: asuth, Assigned: asuth)
References
(Blocks 1 open bug)
Details
Attachments
(4 files)
I experienced a slow search on "permissionmanager" of ~5 seconds, so I checked the logs, where I noticed a variety of searches for prefixes of "nsMappedAttributes" that clocked in between 10 and 17 seconds, which is very bad. I see no indication of the result limit or work limit being hit. For rough comparison (but with a presumably warm cache), the "query" endpoint clocked in at 6.6 seconds, and that's with it doing HTML file extraction.
There are also many examples of fast searches, so there doesn't seem to be an obvious consistent problem. (The Python binary search is probably fine?) Adding some additional durations in "router.log" is probably advisable and not hard to do.
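One lightweight way to add those extra durations is a timing context manager that emits entries in the same `label - seconds` style the existing router.log lines use. This is a hypothetical sketch, not router.py's actual code; `log_duration` and the list standing in for the log are assumptions:

```python
import time
from contextlib import contextmanager

@contextmanager
def log_duration(log, label):
    # Record wall-clock time for the wrapped block and append it to the
    # log in the same 'label - seconds' style the existing entries use.
    start = time.monotonic()
    try:
        yield
    finally:
        log.append("%s - %f" % (label, time.monotonic() - start))

# Usage: a list stands in for router.log here.
entries = []
with log_duration(entries, 'identifier_search "nsmapped"'):
    time.sleep(0.01)  # the work being timed
```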
An additional issue I saw here is the mismatch between the keyword bookmark and what the page-issued query looks like (from incremental search once the page loads, without typing anything extra?). Unless this is incremental search versus hitting enter:
- "GET /mozilla-central/search?q=permissionmanager&path= HTTP/1.1"
- "GET /mozilla-central/search?q=permissionmanager&path=&case=false&regexp=false HTTP/1.1"
That's not surprising, but potentially worth attempting to fix, since we throw away a lot of potential caching there as well as adding contended-ish server load. This might be handled by seeing if we can get nginx to rewrite the query string to include the normalized extra case and regexp parameters with their default of false.
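One way to picture that normalization (whether done in nginx or in router.py itself): canonicalize the query string so omitted parameters are filled in with their defaults, making both request forms above produce the same cache key. A hypothetical Python sketch; the parameter names come from the URLs above, but the defaults dict and fixed key order are assumptions:

```python
from urllib.parse import parse_qs, urlencode

# Assumed defaults for the parameters the page sends explicitly.
DEFAULTS = {"path": "", "case": "false", "regexp": "false"}

def normalize_query(qs):
    # Fill in missing parameters with their defaults and emit keys in a fixed
    # order, so "q=foo&path=" and "q=foo&path=&case=false&regexp=false"
    # canonicalize to the same string (and thus the same cache key).
    params = {k: v[0] for k, v in parse_qs(qs, keep_blank_values=True).items()}
    for key, default in DEFAULTS.items():
        params.setdefault(key, default)
    ordered = ["q", "path", "case", "regexp"]
    return urlencode([(k, params[k]) for k in ordered if k in params])

print(normalize_query("q=permissionmanager&path="))
# → q=permissionmanager&path=&case=false&regexp=false
```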
Comment 1•3 years ago
Comment 2•3 years ago
So it seems the latency is due to codesearch/livegrep, which makes sense but is also the thing we are least able to impact[1]. Maybe when I upgraded things and the build system changed, something sub-optimal happened in terms of build configuration? It's also possible there's just a pathological data issue for the specific terms. Also, I guess it wasn't an issue of me using a keyword bookmark in comment 0; it's just very easy to have the automatic search plus hitting enter in the text box generate a double load for no benefit.
Here's what the logs now say for the resulting overlapping search. Note that while this was the first request the restarted daemons had seen, the Linux caches were still in steady state with a lot of cached data:
2022-07-15 03:43:19.182595/pid=99595 - QUERY line: "nsmappedattrib", file: ".*", fold_case: true,
2022-07-15 03:43:19.205219/pid=99604 - request(handled by 99605) /mozilla-central/search?q=nsmappedattrib&path=&case=false&regexp=false
2022-07-15 03:43:19.208981/pid=99605 - QUERY line: "nsmappedattrib", file: ".*", fold_case: true,
2022-07-15 03:43:25.707517/pid=99595 - codesearch result with 458 line matches across 82 paths - 6.525145
2022-07-15 03:43:25.715525/pid=99605 - codesearch result with 458 line matches across 82 paths - 6.506793
2022-07-15 03:43:25.730363/pid=99595 - search_files "nsmappedattrib" - 0.022299
2022-07-15 03:43:25.736540/pid=99605 - search_files "nsmappedattrib" - 0.020386
2022-07-15 03:43:25.835703/pid=99595 - identifier_search "nsmappedattrib" - 0.105231
2022-07-15 03:43:25.837976/pid=99595 - search.get() - 0.002191
2022-07-15 03:43:25.840694/pid=99605 - identifier_search "nsmappedattrib" - 0.104069
2022-07-15 03:43:25.842822/pid=99605 - search.get() - 0.001955
2022-07-15 03:43:25.845261/pid=99594 - finish pid 99595 - 6.664596
2022-07-15 03:43:25.847620/pid=99604 - finish pid 99605 - 6.641813
When I more carefully issued a single query by adding a letter later, with the cache presumably entirely hot:
2022-07-15 03:43:39.350115/pid=99655 - request(handled by 99656) /mozilla-central/search?q=nsmappedattribu&path=&case=false&regexp=false
2022-07-15 03:43:39.354533/pid=99656 - QUERY line: "nsmappedattribu", file: ".*", fold_case: true,
2022-07-15 03:43:42.758365/pid=99656 - codesearch result with 458 line matches across 82 paths - 3.404380
2022-07-15 03:43:42.774595/pid=99656 - search_files "nsmappedattribu" - 0.015786
2022-07-15 03:43:42.829131/pid=99656 - identifier_search "nsmappedattribu" - 0.054445
2022-07-15 03:43:42.830335/pid=99656 - search.get() - 0.001139
2022-07-15 03:43:42.834102/pid=99655 - finish pid 99656 - 3.482734
and then once more with feeling, where the cache should still be entirely hot:
2022-07-15 03:45:06.899562/pid=99686 - request(handled by 99687) /mozilla-central/search?q=nsmappedattribut&path=&case=false&regexp=false
2022-07-15 03:45:06.908070/pid=99687 - QUERY line: "nsmappedattribut", file: ".*", fold_case: true,
2022-07-15 03:45:10.263916/pid=99687 - codesearch result with 458 line matches across 82 paths - 3.359784
2022-07-15 03:45:10.281885/pid=99687 - search_files "nsmappedattribut" - 0.017623
2022-07-15 03:45:10.337742/pid=99687 - identifier_search "nsmappedattribut" - 0.055623
2022-07-15 03:45:10.339023/pid=99687 - search.get() - 0.001070
2022-07-15 03:45:10.342723/pid=99686 - finish pid 99687 - 3.440511
1: An important caveat is that every CodeSearchResult includes a SearchStats struct, which I will imminently start dumping as part of the output. I am also going to fix the submission to not double-submit, or at least to use a consistent GET query so the server can optimize.
Comment 3•3 years ago
So the issue is that the enter key is triggering the default submit button, which we have there as visually hidden. I'm marking that button as disabled, which will inhibit automatic form submission so that we can use our magic AJAX logic without forcing a page reload or roughly doubling the server load.
Comment 4•3 years ago
Note that the changes to the input fields are made in templates, so those will take effect tomorrow; and if I messed anything up, we won't see that until tomorrow either! (I did test locally, though. :)
So here's a run with the double submit, because I still physically have to hit enter:
2022-07-15 04:27:27.343744/pid=101659 - QUERY line: "nsmapped", file: ".*", fold_case: true,
2022-07-15 04:27:27.346754/pid=101660 - QUERY line: "nsmapped", file: ".*", fold_case: true,
2022-07-15 04:27:34.850577/pid=101659 - codesearch result with 458 line matches across 82 paths - 7.507091 : re2_time: 5, git_time: 14372, index_time: 135, total_time: 7491,
2022-07-15 04:27:34.898211/pid=101659 - search_files "nsmapped" - 0.047263
2022-07-15 04:27:34.899149/pid=101660 - codesearch result with 458 line matches across 82 paths - 7.552586 : re2_time: 28, git_time: 14220, index_time: 16, total_time: 7529,
2022-07-15 04:27:34.940369/pid=101660 - search_files "nsmapped" - 0.040664
2022-07-15 04:27:35.019474/pid=101659 - identifier_search "nsmapped" - 0.121164
2022-07-15 04:27:35.021467/pid=101659 - search.get() - 0.001920
2022-07-15 04:27:35.035007/pid=101657 - finish pid 101659 - 7.694415
2022-07-15 04:27:35.065576/pid=101660 - identifier_search "nsmapped" - 0.125118
2022-07-15 04:27:35.066717/pid=101660 - search.get() - 0.001084
2022-07-15 04:27:35.080960/pid=101658 - finish pid 101660 - 7.737849
So, like, all the time is git_time. That seems bad. Also, I don't understand what we're asking git.
Comment 5•3 years ago
So maybe git_time is a misnomer, since things do seem to work as I thought: the server doesn't talk to git, only the indexer does. Both find_match_brute and find_match attribute time to git_time, and these are methods that consume the fundamental chunk data structure, so maybe "git" means "you know, the things you stored in git, not actually git".
Comment 6•3 years ago
Given that it seems like codesearch is CPU-bound, I'm upping us from the 2-core 8 GiB t3.large to the 4-core 16 GiB t3.xlarge web servers. This will take us from a per-web-server worst-case operating cost of $2 a day to $4 a day, or, for the "3-yr reserved no upfront" quote, from $0.86 a day to $1.72 a day. We don't do the reservations ourselves; that happens elsewhere in the MoCo AWS account hierarchy and is largely opaque to us, but the engineer benefit here is clear.
We also expect performance improvements from the additional memory available for caching. In particular, on the 8 GiB instance we only had 5.9G usable for cache, but our crossref database files are:
$ du -csh */crossref */crossref-extra
312M comm-central/crossref
2.1G mozilla-central/crossref
25M mozilla-mobile/crossref
51M nss/crossref
711M comm-central/crossref-extra
4.1G mozilla-central/crossref-extra
29M mozilla-mobile/crossref-extra
80M nss/crossref-extra
7.3G total
And our codesearch databases are:
$ du -csh */livegrep.idx
7.2G comm-central/livegrep.idx
7.6G mozilla-central/livegrep.idx
520M mozilla-mobile/livegrep.idx
711M nss/livegrep.idx
16G total
So the increase to 16 GiB seems just barely enough for us to fully cache mozilla-central if we assume no one uses the other repositories, which is definitely a good step. Arguably it would be nice to have some extra RAM to cache some of the sibling databases and also to leave some RAM to cache the analysis and HTML files.
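The back-of-envelope arithmetic behind "just barely enough", using the du figures above; the ~2 GiB reserved for the OS and daemons is an assumption, not a measured number:

```python
# mozilla-central working set, in GiB, from the du output above.
crossref = 2.1
crossref_extra = 4.1
livegrep_idx = 7.6
working_set = crossref + crossref_extra + livegrep_idx  # ~13.8 GiB

ram = 16.0
reserved = 2.0  # assumed headroom for the OS, router.py, and codesearch itself
cache_available = ram - reserved  # 14.0 GiB

# m-c's databases just barely fit in the page cache, with nothing left over
# for the sibling repositories or the analysis/HTML files.
print(working_set, cache_available)
```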
Comment 7•3 years ago
Note that the utc22 indexer job had already started before the last 2 pushes, so the (not actually[1]) risky changes won't take effect until tomorrow's utc10 indexer run, when I'll be around again.
1: The changes are straightforward, but the AWS deployment scripts are not subject to our unit tests, and it's easier for me to make typo-style mistakes there. However, VS Code does help validate the Python code changes I make, so I'm optimistic!
Comment 8•3 years ago
Between avoiding self-contending requests and improved server capacity, this seems like a win! The (uncontended) nsmappedattribute searches are now down to ~2.5 seconds, a meaningful improvement over the ~3.5-second uncontended times above, and definitely a major improvement over the self-contended times of ~6.7 and ~7.7 seconds.
2022-07-15 18:19:49.919218/pid=23716 - request(handled by 23717) /mozilla-central/search?q=nsmappedattr&path=&case=false&regexp=false
2022-07-15 18:19:49.921318/pid=23717 - QUERY line: "nsmappedattr", file: ".*", fold_case: true,
2022-07-15 18:19:52.260791/pid=23717 - codesearch result with 458 line matches across 82 paths - 2.339672 : re2_time: 5, git_time: 4211, index_time: 257, total_time: 2327,
2022-07-15 18:19:52.280394/pid=23717 - search_files "nsmappedattr" - 0.019166
2022-07-15 18:19:52.366202/pid=23717 - identifier_search "nsmappedattr" - 0.085721
2022-07-15 18:19:52.367481/pid=23717 - search.get() - 0.001201
2022-07-15 18:19:52.371268/pid=23716 - finish pid 23717 - 2.451364
2022-07-15 18:20:02.261221/pid=23736 - request(handled by 23737) /mozilla-central/search?q=nsmappedattribute&path=&case=false&regexp=false
2022-07-15 18:20:02.263231/pid=23737 - QUERY line: "nsmappedattribute", file: ".*", fold_case: true,
2022-07-15 18:20:04.751799/pid=23737 - codesearch result with 458 line matches across 82 paths - 2.488770 : re2_time: 9, git_time: 4681, index_time: 5, total_time: 2475,
2022-07-15 18:20:04.769116/pid=23737 - search_files "nsmappedattribute" - 0.016946
2022-07-15 18:20:04.825677/pid=23737 - identifier_search "nsmappedattribute" - 0.056474
2022-07-15 18:20:04.826890/pid=23737 - search.get() - 0.001145
2022-07-15 18:20:04.830408/pid=23736 - finish pid 23737 - 2.568664
I've pulled logs from yesterday's utc10 server and will pull logs from today's utc10 server later to run weblog-analyze.sh against so we can compare actual statistical distributions and such. (Noting that today is a Friday and yesterday was a Thursday, so this won't be apples-to-apples.)
Comment 9•3 years ago
Excerpts of the new script runs against Thursday's slow log and Friday's fast log.
Old, Slow, Thursday (~18UTC - ~6UTC):
Dynamic Non-Search Request Latencies
action cache_status _count p50 p66 p75 p90 p95 p99
-----------------------------------------------------------------------------------------------------------------------------
commit-info MISS 382 0.01 0.01 0.01 0.01 0.01 0.01
commit-info HIT 133 0 0 0 0 0 0.50
rev MISS 123 0.07 0.09 0.11 0.19 0.24 0.87
rev HIT 66 0 0 0 0.00 0.40 0.49
diff MISS 12 0.21 0.28 0.32 0.41 0.41 2.61
commit HIT 7 0 0 0 0 0 0
commit MISS 4 0.20 0.39 0.39 0.86 0.86 0.86
query MISS 3 1.69 1.69 1.69 6.63 6.63 6.63
diff HIT 2 0 0 0 0 0 0
Dynamic Search Request Latencies
cache_status _count p50 p66 p75 p90 p95 p99
----------------------------------------------------------------------------------------------------------
MISS 2174 0.18 0.39 0.75 2.61 4.11 8.92
HIT 120 0 0 0 0 0 0
New, Fast, Friday (~18UTC - ~6UTC):
Dynamic Non-Search Request Latencies
action cache_status _count p50 p66 p75 p90 p95 p99
-----------------------------------------------------------------------------------------------------------------------------
commit-info MISS 335 0.00 0.01 0.01 0.01 0.01 0.01
rev MISS 168 0.04 0.05 0.06 0.10 0.15 0.23
commit-info HIT 139 0 0 0 0.50 0.50 0.50
rev HIT 40 0 0 0 0 0 0.00
commit HIT 8 0 0 0 0 0 0
commit MISS 7 0.25 0.27 0.27 0.58 0.65 0.65
diff MISS 5 0.15 0.15 0.15 0.28 0.28 0.28
Dynamic Search Request Latencies
cache_status _count p50 p66 p75 p90 p95 p99
----------------------------------------------------------------------------------------------------------
MISS 1725 0.13 0.26 0.47 1.95 2.75 4.75
HIT 117 0 0 0 0 0 0
Comment 10•3 years ago
Potential next steps:
make router.py perform codesearch and other searches in parallel
I'm going to punt on that because:
- pipeline-server.rs already does this, and in fact parallelizes file search as well!
- why risk messing up router.py at this late time since the future is the pipeline-server?
- in general identifier_search tends not to be a major impact, and when it is somewhat of an impact, it usually correlates with a long codesearch time. Since codesearch is tuned to use all of the web-server's cores still (4 codesearch threads, 4 cores), we wouldn't really be gaining extra parallelism there and could negatively impact the codesearch latency.
- As we add more cores (like I propose below for m-c), it arguably makes more sense to give those cores to codesearch still.
- It does tend to be the case that these are shorter query strings which are potentially indicative of incremental search generating surplus load. However, this load will also tend to help out with pre-heating the cache. The concern is the extent to which these incremental searches extend into the actual search query.
startSearchSoon is implemented such that we only send a query if it's been 300ms since the last character was typed, which means that for fast typing we shouldn't be spamming the server too much.
Adaptive incremental search timing using localStorage and honoring the enter key
Filed as bug 1779900.
Right now we use a hard threshold of 300ms, but it's possible to gather statistics locally on the timing between characters to generate a potentially more appropriate threshold per user. This is primarily of benefit to the user, because un-scientific sampling suggests it's actually rare for different users' requests to collide; instead, a user's own requests will self-collide. For a user whose typing rate is, say, 350ms per character, the current threshold generates pathological load for themselves, so having the cut-off be adaptive would be helpful.
The heuristic would also want to be aware of the enter key. For users who end the query by pressing the enter key (in a timely fashion) with high probability, it potentially makes sense to set the incremental timer much higher, on the assumption that the enter key is likely coming. The enter key would of course force the submission of the query immediately. (And should probably pass cancelFetch=true.)
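A sketch of what the adaptive threshold could look like, written in Python for illustration (the real logic would live in the page's JS and persist samples to localStorage); the function name, the 90th-percentile choice, the 1.25x margin, and the clamp bounds are all assumptions:

```python
def adaptive_debounce_ms(inter_key_ms, floor=150, ceiling=1000):
    # Pick a debounce threshold just above the user's typical inter-keystroke
    # gap (here: the 90th percentile of recent samples plus a 25% margin),
    # clamped to sane bounds, so a ~350ms-per-character typist stops
    # generating an incremental query for nearly every keystroke.
    if not inter_key_ms:
        return 300  # the current hard-coded default
    ranked = sorted(inter_key_ms)
    p90 = ranked[min(len(ranked) - 1, int(len(ranked) * 0.9))]
    return max(floor, min(ceiling, int(p90 * 1.25)))
```

A fast typist (100ms gaps) gets clamped to the floor, while a 350ms-per-character typist gets a threshold comfortably above their gap, so their own requests stop self-colliding.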
Give the config1 web-server and codesearch more cores
Filed as bug 1779899.
As noted above, a single codesearch query can saturate the VM's CPU. I think it was suitable to move all web-servers to be able to do this given the low costs involved (and the likelihood that the new query mechanism necessitates more capacity), but the reality is that we expect mozilla-central to have the highest load and we care the most about mozilla-central's performance, so it makes sense to evaluate allocating more resources to get p90 under 1 second and p95 under 2 seconds or otherwise determine where the diminishing returns are.
Note that as part of these changes I've been reducing config1's backup servers from 3 to 2, and I think I'm going to make the above change and go down to just 1 backup server, which does help mitigate the cost impact. Specifically, we were at 4 x 1 cost units before this bug. Moving from "t3.large" to "t3.xlarge" takes a server to 2 cost units, and then moving to "t3.2xlarge" will take it to 4 cost units. So as of this moment we're at 3 x 2 cost units, and my proposal is to move to 2 x 4 cost units. This represents a net doubling of cost for a quadrupling of resources/performance. (I also doubled the cost for all the other web-servers.)
At the no-upfront 3-year daily rate (for t3.large) a cost unit is $0.86 ($0.75 for 3-years all upfront), and we now have (config2-config5 = 4) x 2 cost units = 8 units, plus (config1 + config1 backup = 2) x 4 cost units = 8 units, for a total of 16 cost units @ $0.86 = $13.76 a day. If we only reserve out 1 year but do pay upfront, a cost unit is $1.17/day, which gives us $18.72 a day, where $9.36 of that is for mozilla-central. At 1-year no-upfront the numbers are rounder at $20 a day, which sounds like a reasonable threshold for us to hit, but going above that probably merits discussion with someone involved with paying the moneys.
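The cost-unit arithmetic above, spelled out (rates as quoted in this comment):

```python
# One "cost unit" is a t3.large's daily rate under a given reservation plan.
UNIT_3YR_NO_UPFRONT = 0.86   # $/day, 3-year reserved, no upfront
UNIT_1YR_ALL_UPFRONT = 1.17  # $/day, 1-year reserved, paid upfront

# config2-config5: 4 servers at t3.xlarge = 2 units each.
# config1 + its single backup: 2 servers at t3.2xlarge = 4 units each.
total_units = 4 * 2 + 2 * 4  # 16

print(total_units * UNIT_3YR_NO_UPFRONT)   # 13.76 $/day
print(total_units * UNIT_1YR_ALL_UPFRONT)  # 18.72 $/day, 9.36 of it for m-c
```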
Comment 11•3 years ago
My tentative plan is to quickly experiment (in a follow-up) with:
- dropping config1 to 1 backup server
- upping the server in use to t3.2xlarge
- changing our codesearch command line invocation to use the number of processor cores, with a max of 8 (mainly for the VMs here; I think it makes sense to cap)
And then we can see how that performs on the slowest queries from the logs and how it impacts the stats on Monday. If we don't see appreciable improvements at p90 and above, I'll drop us back down to a "t3.xlarge" but leave the codesearch scaling and single backup server. I'll do that now-ish so that if I mess it up, we see it on Sunday.
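The thread-count cap in that planned invocation can be sketched as follows; the helper name is hypothetical, and the actual codesearch flag it would feed isn't shown:

```python
import os

def codesearch_threads(max_threads=8):
    # Use one codesearch thread per available core, but cap it: past ~8
    # threads on these VM sizes the extra parallelism is assumed (not
    # measured) to stop paying for itself.
    return min(os.cpu_count() or 1, max_threads)
```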