(In reply to Tom Prince [:tomprince] from comment #6)
> (In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #3)
> > Should this be checked in slower time by a longer-running service, like nagios, and the cron task be just querying the current cached state?
> Does running this as a separate service add any benefit over increasing the timeout? Adding a service requires setting up infrastructure to run it on, and to get the code to run from in-tree.
Not especially; the only benefit would be not having a task that needs ever-increasing timeouts.
One of the issues is that a lot of the domain names to check (I did measure the exact number, but I've forgotten it) are unavailable, and so take a long time to fail. Having something that could back off without impacting task times would be one way around that.
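Roughly what I have in mind, as an untested sketch: give each host a short first deadline and retry with exponentially longer waits, so dead hosts fail fast instead of holding the task open. The timeout values, retry count, and the probe_host() coroutine are all made up for illustration.

import asyncio

async def probe_with_backoff(probe_host, host, attempts=3, base_timeout=5.0):
    # Fail fast on unreachable hosts: short first deadline, then back off.
    # A long-running service could spread these retries over hours instead
    # of one task run.
    timeout = base_timeout
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(probe_host(host), timeout)
        except (asyncio.TimeoutError, OSError):
            await asyncio.sleep(2 ** attempt)
            timeout *= 2
    return None  # give up and record the host as unavailable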
> One thing I noticed about the failures of the task linked in comment 0 is that they happen while cloning the hg repo. Looking at the script, the paths used aren't marked as docker-volumes, which means there is fairly poor filesystem performance writing to those paths. I wonder if there are any performance wins from using a generic image with a checkout via run-task, and running the scripts there, rather than having custom logic for checking out the repo. (This would also mean that existing docker-worker caches for the checkout could be used.)
The majority of the time is still spent doing the network requests, and so while I think this is worth doing, it's not solving the underlying issue.
> There are a bunch of updates that take place in these jobs, but it seems only one takes a long time. We could split them out, so that at least the ones that don't take hours are generated once.
Also a good idea.
> It looks like the script checks hosts in batches of 250 (see here). There are probably a couple of ways to improve this:
> - if the limitation is due to single-threadedness, and not a limit of worker network performance, we could split up the list outside of the JS and process multiple chunks in parallel.
> - it looks like each batch of 250 is done as a whole, and no other requests are started until all 250 are complete. If, instead, there were always 250 requests in flight, any single server that is slow to respond would have less of an effect on the overall runtime.
Agreed; in the Python version I had a semaphore, and just scheduled all the futures at once. DNS ended up being the main rate limiter.
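Something like this is the shape of what I mean by scheduling everything at once behind a semaphore (a rough sketch, not the actual code: aiohttp, check_host(), and the 30s timeout are assumptions, and 250 just mirrors the JS batch size):

import asyncio
import aiohttp

CONCURRENCY = 250  # mirror the JS script's batch size

async def check_host(session, sem, host):
    # The semaphore slot frees as soon as this host finishes, so one slow
    # or dead server only ever occupies one of the 250 slots.
    async with sem:
        try:
            async with session.get(f"https://{host}/",
                                   timeout=aiohttp.ClientTimeout(total=30)) as resp:
                return host, resp.status
        except (aiohttp.ClientError, asyncio.TimeoutError) as exc:
            return host, exc

async def check_all(hosts):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        # Schedule every check up front; the semaphore keeps 250 requests
        # in flight instead of waiting for whole batches to drain.
        return await asyncio.gather(*(check_host(session, sem, h) for h in hosts))

# results = asyncio.run(check_all(host_list))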