Closed Bug 1794332 Opened 3 years ago Closed 21 days ago

config4 web-server keeps hanging when starting a codesearch full-text daemon, manifesting as the indexer timing out. now also config1 sometimes!

Categories

(Webtools :: Searchfox, defect)

defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: asuth, Unassigned)

Details

We've seen this twice now:

  • 2022-10-06, the ecma262 codesearch instance appeared not to come up on port 8090
  • 2022-10-09, the kaios codesearch instance appeared not to come up on port 8091

This is inferred from:

  1. The indexer shutdown log showed a ton of Got status.txt: [b'web-server.rs loaded\n'] lines indicating web-server.rs came up but router.py did not.
  2. The router.py log looked like the following (this is from 2022-10-09):
2022-10-09 14:52:10.888037/pid=15169 - Loading l10n                                                                                                                                           
2022-10-09 14:52:10.888152/pid=15169 - Failed to mmap crossref file for l10n: cannot mmap an empty file                                                                                       
2022-10-09 14:52:10.888191/pid=15169 - Failed to mmap crossref file for l10n: cannot mmap an empty file                                                                                       
2022-10-09 14:52:10.888203/pid=15169 - Loading llvm                                                                                                                                           
2022-10-09 14:52:10.888273/pid=15169 - Loading mingw                                                                                                                                          
2022-10-09 14:52:10.888304/pid=15169 - Failed to mmap crossref file for mingw: cannot mmap an empty file                                                                                      
2022-10-09 14:52:10.888331/pid=15169 - Failed to mmap crossref file for mingw: cannot mmap an empty file                                                                                      
2022-10-09 14:52:10.888342/pid=15169 - Loading mingw-moz                                                                                                                                      
2022-10-09 14:52:10.888369/pid=15169 - Failed to mmap crossref file for mingw-moz: cannot mmap an empty file                                                                                  
2022-10-09 14:52:10.888395/pid=15169 - Failed to mmap crossref file for mingw-moz: cannot mmap an empty file                                                                                  
2022-10-09 14:52:10.888405/pid=15169 - Loading rust                                                                                                                                           
2022-10-09 14:52:10.888459/pid=15169 - Loading whatwg-html                                                                                                                                    
2022-10-09 14:52:10.888515/pid=15169 - Loading mozilla-build                                                                                                                                  
2022-10-09 14:52:10.888543/pid=15169 - Failed to mmap crossref file for mozilla-build: cannot mmap an empty file                                                                              
2022-10-09 14:52:10.888569/pid=15169 - Failed to mmap crossref file for mozilla-build: cannot mmap an empty file                                                                              
2022-10-09 14:52:10.888583/pid=15169 - Loading glean                                                                                                                                          
2022-10-09 14:52:10.888626/pid=15169 - Loading version-control-tools                                                                                                                          
2022-10-09 14:52:10.888753/pid=15169 - Loading ecma262                                                                                                                                        
2022-10-09 14:52:10.888808/pid=15169 - Failed to mmap crossref file for ecma262: cannot mmap an empty file                                                                                    
2022-10-09 14:52:10.888820/pid=15169 - Loading kaios                                                                                                                                          
2022-10-09 14:52:10.888863/pid=15169 - Loading mozilla-pine                                                                                                                                   
2022-10-09 14:52:10.888931/pid=15169 - Loading mozilla-vpn-client                                                                                                                             
2022-10-09 14:52:10.888984/pid=15169 - Loading mozilla-elm                                                                                                                                    
2022-10-09 14:52:10.889042/pid=15169 - Stopping codesearch on port 8081                                                                                                                       
2022-10-09 14:52:10.897862/pid=15169 - Starting codesearch on port 8081                                                                                                                       
2022-10-09 14:52:13.841797/pid=15169 - Server on port 8081 found alive after 23 tries          
2022-10-09 14:52:13.843928/pid=15169 - Stopping codesearch on port 8082                                                                                                                       
2022-10-09 14:52:13.855503/pid=15169 - Starting codesearch on port 8082                                                                                                                       
2022-10-09 14:52:16.527547/pid=15169 - Server on port 8082 found alive after 26 tries          
2022-10-09 14:52:16.527627/pid=15169 - Stopping codesearch on port 8083                                                                                                                       
2022-10-09 14:52:16.536743/pid=15169 - Starting codesearch on port 8083                                                                                                                       
2022-10-09 14:52:16.845867/pid=15169 - Server on port 8083 found alive after 3 tries                                                                                                          
2022-10-09 14:52:16.845948/pid=15169 - Stopping codesearch on port 8084                                                                                                                       
2022-10-09 14:52:16.854302/pid=15169 - Starting codesearch on port 8084                                                                                                                       
2022-10-09 14:52:17.163360/pid=15169 - Server on port 8084 found alive after 3 tries                                                                                                          
2022-10-09 14:52:17.163438/pid=15169 - Stopping codesearch on port 8085                                                                                                                       
2022-10-09 14:52:17.174868/pid=15169 - Starting codesearch on port 8085                                                                                                                       
2022-10-09 14:52:17.688633/pid=15169 - Server on port 8085 found alive after 5 tries                                                                                                          
2022-10-09 14:52:17.688721/pid=15169 - Stopping codesearch on port 8086                                                                                                                       
2022-10-09 14:52:17.696507/pid=15169 - Starting codesearch on port 8086                                                                                                                       
2022-10-09 14:52:17.802268/pid=15169 - Server on port 8086 found alive after 1 tries                                                                                                          
2022-10-09 14:52:17.802365/pid=15169 - Stopping codesearch on port 8087                                                                                                                       
2022-10-09 14:52:17.810436/pid=15169 - Starting codesearch on port 8087                                                                                                                       
2022-10-09 14:52:17.916012/pid=15169 - Server on port 8087 found alive after 1 tries                                                                                                          
2022-10-09 14:52:17.916129/pid=15169 - Stopping codesearch on port 8088                                                                                                                       
2022-10-09 14:52:17.923561/pid=15169 - Starting codesearch on port 8088                                                                                                                       
2022-10-09 14:52:18.029185/pid=15169 - Server on port 8088 found alive after 1 tries                                                                                                          
2022-10-09 14:52:18.029265/pid=15169 - Stopping codesearch on port 8089                                                                                                                       
2022-10-09 14:52:18.037182/pid=15169 - Starting codesearch on port 8089                                                                                                                       
2022-10-09 14:52:18.145507/pid=15169 - Server on port 8089 found alive after 1 tries                                                                                                          
2022-10-09 14:52:18.145620/pid=15169 - Stopping codesearch on port 8090                                                                                                                       
2022-10-09 14:52:18.154112/pid=15169 - Starting codesearch on port 8090                                                                                                                       
2022-10-09 14:52:18.263207/pid=15169 - Server on port 8090 found alive after 1 tries                                                                                                          
2022-10-09 14:52:18.263301/pid=15169 - Stopping codesearch on port 8091                                                                                                                       
2022-10-09 14:52:18.279821/pid=15169 - Starting codesearch on port 8091   

When I re-ran web-server-run.sh, everything seemed to come up okay, which implies the issue is not an issue with the data but perhaps something related to resource exhaustion. At startup, web-server-setup.sh will have done a massive amount of I/O caching the livegrep indices and config4 is probably over-subscribed, so the startup could hit perhaps some major I/O lag as our EBS access could have a very empty token bucket. There's also a good chance that we've hit the limit of our the t3 instance's CPU burst capability.

Something that's a bit confusing is that the waiting logic (which is new-ish as of https://github.com/mozsearch/mozsearch/pull/505 because I needed a way to make sure that the daemon came up before we run our checks) is written so that if the server doesn't show up after 200 tries @ ~20 seconds it should still misleadingly claim to have succeeded, but it doesn't. I'm pretty hungry so maybe I'm misreading what's happening there...

One major complication is that router.py goes out of its way in its daemonization mechanism to run codesearch under /dev/null stdin, stdout, and stderr, so if it had anything interesting to say, it disappeared into the void. Probably it would be helpful for us to actually let them put their output somewhere we could read it. The tentative plan is to use systemd for daemon keepalive purposes but we need to wait for https://devblogs.microsoft.com/commandline/systemd-support-is-now-available-in-wsl/ to make it to the release version of WSL that everyone can use.

Reasonable actions to take are probably:

  1. The indexer should time-out on failure for the web-server to come up in a more timely fashion. It's waiting for hours right now.
  2. if it's not too bad to give each codesearch daemon its own log file, we should do that. It seems like it should be reasonable.
  3. the codesearch.py spin-loop should potentially get a bit more chatty too
Summary: config4 web-server keeps hanging when starting the web-server, manifesting as the indexer timing out → config4 web-server keeps hanging when starting a codesearch full-text daemon, manifesting as the indexer timing out

This happened again on config4 on port 8090. Not feeling well enough to do anything but terminate the indexer and orphan web server.

This happened on config1 today, which is more than a little surprising. But we do intentionally pre-cache more things for mozilla-central, so the I/O load that precedes this could be similar. It's notable here that the codesearch instances do seem to still be taking more tries to report in as alive, so this could be a slightly different situation. That said, we see the same thing where there is no codesearch process corresponding to the localhost:8084 that's being waited for.

I also do wonder whether it's possible that perhaps the caching is doing something like exposing a potential race in codesearch where some code is written in such a way that an implicit assumption is made that disk reading will take longer than some other operation, but in this case the disk reading actually runs faster and this leads to it falling over.

Either way the answer is probably still more diagnostics as proposed above.

2022-10-31 05:58:49.068251/pid=15196 - Loading nss
2022-10-31 05:58:49.068373/pid=15196 - Loading mozilla-central
2022-10-31 05:58:49.068428/pid=15196 - Loading comm-central
2022-10-31 05:58:49.068472/pid=15196 - Loading mozilla-mobile
2022-10-31 05:58:49.068529/pid=15196 - Stopping codesearch on port 8081
2022-10-31 05:58:49.075420/pid=15196 - Starting codesearch on port 8081
2022-10-31 05:58:51.025239/pid=15196 - Server on port 8081 found alive after 17 tries
2022-10-31 05:58:51.025315/pid=15196 - Stopping codesearch on port 8082
2022-10-31 05:58:51.032470/pid=15196 - Starting codesearch on port 8082
2022-10-31 05:58:53.196587/pid=15196 - Server on port 8082 found alive after 21 tries
2022-10-31 05:58:53.196671/pid=15196 - Stopping codesearch on port 8083
2022-10-31 05:58:53.203683/pid=15196 - Starting codesearch on port 8083
2022-10-31 05:58:55.627133/pid=15196 - Server on port 8083 found alive after 23 tries
2022-10-31 05:58:55.627228/pid=15196 - Stopping codesearch on port 8084
2022-10-31 05:58:55.637192/pid=15196 - Starting codesearch on port 8084
Summary: config4 web-server keeps hanging when starting a codesearch full-text daemon, manifesting as the indexer timing out → config4 web-server keeps hanging when starting a codesearch full-text daemon, manifesting as the indexer timing out. now also config1 sometimes!

This also happened:

  • 2022-12-02: config2
  • 2022-12-12: config4

WSL and systemd

I was just able to wsl --update to update my test machine for searchfox on windows (which officially is not capable of Windows 11, and so is nicely representative) to WSL v1 which supports systemd and was able to verify that after creating/editing /etc/wsl.conf in the Ubuntu 22.04 VM to be:

[boot]
systemd=true

and running wsl --shutdown and restarting a fresh instance of the VM, dockerd started up inside the VM automatically.

I think this means we can address this issue by directly moving to have all daemons operate under the auspices of systemd. Currently our web-servers and our codesearch instances run as our user and I don't really see any advantage to having them run as root, so as long as we can have them run under our user, it probably makes sense to start with that.

Particular goals:

  • It would be nice for the machine to be able to automatically start the daemons if it's rebooted.
  • We should try to namespace things so that a server situation like the Igalia one where everything is happening inside the same VM can avoid having the successor index validation gratuitously interfere with the current index.
    • Probably the most sane approach for this is to assume that a single-machine setup will use different "rendered" config JSON files between the successor and the current index. That is, it's already the case that, for example, https://github.com/mozsearch/mozsearch-mozilla/blob/master/config1.json will contain environment variables that we "render" in https://github.com/mozsearch/mozsearch/blob/master/scripts/generate-config.sh to what goes in the root of the index dir.
    • As far as codesearch collisions go, the characteristic aspects are the path and the port, and where there should only be a single server ever using a single path. There should also only ever be a single server ever using a single port and that's how searchfox web-servers will access/name the server, so the port probably makes the most sense so any collision is more obvious. (And this entirely rules out using the bare tree-name.)

Again again:

  • 2022-12-14: config4
  • 2022-12-18: config4

Again:

  • 2023-01-21: config3

This hasn't happened in quite some time so I'm going to close this. I did briefly look to see if there were any obvious changes we landed like upgrades to livegrep or Ubuntu version upgrades, but nothing immediately jumped out amongst the likely candidates. It seemed like I did a bunch of stuff through July 2022, predating a number of hangs from above, then I did some stuff again in June 2024 onward, well after the last observed problem. It wouldn't be shocking if this involved something related to EC2 or the underlying linux kernel that is outside our control.

Status: NEW → RESOLVED
Closed: 21 days ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.