Closed Bug 1570646 Opened 4 months ago Closed 3 months ago

Intermittent pfu [taskcluster:error] Task timeout after 10800 seconds. Force killing container.

Categories

(Release Engineering :: Release Automation: L10N, defect, P5)

Tracking

(firefox-esr60 fixed, firefox-esr68 fixed, firefox69 fixed, firefox70 fixed)

RESOLVED FIXED
Tracking Status
firefox-esr60 --- fixed
firefox-esr68 --- fixed
firefox69 --- fixed
firefox70 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: sfraser)

Details

(Keywords: intermittent-failure, regression)

Attachments

(1 file)

Filed by: ccoroiu [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=259348038&repo=mozilla-beta
Full log: https://queue.taskcluster.net/v1/task/euvM1sRsQCuC_52QJnmA-Q/runs/0/artifacts/public/logs/live_backing.log


The task's runtime has increased by a third in the last 2 months

INFO: diffing old/new remote settings dumps...

Component: General → Release Automation: L10N
QA Contact: catlee → bugspam.Callek
Assignee: nobody → sfraser

Workaround until a longer-term solution is found

When I was writing a Python replacement for this, I found that the major blocker was DNS rate limiting, so we could try using 1.1.1.1 (possibly over https!) to get around that. Testing with aiodns, it takes <2 minutes to query DNS for all 69184 domain names. I'm not sure if the javascript engine is also a blocker, one of the comments seems to indicate it is.

I worry that this, and the timeout adjustment, are just pushing the problem out. Should this be checked in slower time by a longer-running service, like nagios, and the cron task be just querying the current cached state?

How much value is really being added by having this go through the same javascript the browser uses, if we can get the same results a different way?

Could we prepopulate a dns cache for the task?
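
Roughly what the aiodns idea looks like (a sketch only; the resolver choice, concurrency limit, and input file name are assumptions, not what the task actually uses) - the unresolvable names could also be dumped out to prepopulate a DNS cache for the task:

    # Hypothetical sketch: bulk-resolve the domain list against 1.1.1.1 with aiodns.
    # The input file name and the concurrency limit are assumptions, not the real task's.
    import asyncio
    import aiodns

    CONCURRENCY = 500  # assumed limit, to stay under resolver rate limits

    async def resolve_all(domains):
        resolver = aiodns.DNSResolver(nameservers=["1.1.1.1"])
        sem = asyncio.Semaphore(CONCURRENCY)

        async def resolve_one(name):
            async with sem:
                try:
                    answers = await resolver.query(name, "A")
                    return name, [a.host for a in answers]
                except aiodns.error.DNSError:
                    return name, []  # unresolvable; could seed a "skip" cache

        return dict(await asyncio.gather(*(resolve_one(d) for d in domains)))

    if __name__ == "__main__":
        with open("hsts-domains.txt") as f:  # assumed input file
            domains = [line.strip() for line in f if line.strip()]
        results = asyncio.run(resolve_all(domains))
        resolved = sum(1 for addrs in results.values() if addrs)
        print(f"resolved {resolved} of {len(domains)} domains")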

Flags: needinfo?(ryanvm)

I agree that it'd be better to find a different approach rather than extending the timeout. That said, Dana needs to weigh in on whether your ideas are workable or not.

Flags: needinfo?(ryanvm) → needinfo?(dkeeler)

(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #3)

When I was writing a Python replacement for this, I found that the major blocker was DNS rate limiting, so we could try using 1.1.1.1 (possibly over https!) to get around that.

Sure, this seems reasonable.

I'm not sure if the javascript engine is also a blocker, one of the comments seems to indicate it is.

If Firefox's JS engine is the reason this is slow, I think our JS team would want to know (and maybe they would have suggestions for how we could improve the performance of the script).

Should this be checked in slower time by a longer-running service, like nagios, and the cron task be just querying the current cached state?

That also seems like a reasonable solution.

How much value is really being added by having this go through the same javascript the browser uses, if we can get the same results a different way?

I don't see a way of getting the same results without using our platform (if we use a different platform, we won't get the same results).

Could we prepopulate a dns cache for the task?

Another reasonable solution.

Flags: needinfo?(dkeeler)

(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #3)

Should this be checked in slower time by a longer-running service, like nagios, and the cron task be just querying the current cached state?

Does running this as a separate service add any benefit over increasing the timeout? Adding a service requires setting up infrastructure to run it on, and getting the code to run from in-tree.


One thing I noticed about the failures of the task linked in comment 0, is that failures happen while cloning the hg repo. Looking at the script, the paths used aren't marked as docker-volumes, which means that there is fairly poor filesystem performance writing to those paths. I wonder if there are any performance wins to using a generic image with a checkout via run-task, and running the scripts there, rather than having custom logic about checking out the repo. (This would also mean that existing docker-worker caches for the checkout could be used).


There are a bunch of updates that take place in these jobs, but it seems only one takes a long time. We could split them out, so that at least the ones that don't take hours are generated once.


It looks like the script processes hosts in batches of 250 (see here). There are probably a couple of ways to improve this:

  • if the limitation is due to single-threadedness, and not a limit of worker network performance, we could split up the list outside of the JS and run multiple instances in parallel.
  • it looks like each batch of 250 is done as a whole, and no other requests are started until all 250 requests are complete. If, instead, there were always 250 requests in flight, any single server that is slow to respond would have less of an effect on the overall runtime.

(In reply to Tom Prince [:tomprince] from comment #6)

(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #3)

Should this be checked in slower time by a longer-running service, like nagios, and the cron task be just querying the current cached state?

Does running this as a separate service add any benefit over increasing the timeout? Adding a service requires setting up infrastructure to run it on, and getting the code to run from in-tree.

Not especially; the only benefit would be not having a task that needs ever-increasing timeouts.

One of the issues is that a lot of the domain names to check (I did measure the exact number, but I've forgotten it) are unavailable, and so take a long time to fail. Having something that could back off without impacting task times would be one way around that.


One thing I noticed about the failures of the task linked in comment 0, is that failures happen while cloning the hg repo. Looking at the script, the paths used aren't marked as docker-volumes, which means that there is fairly poor filesystem performance writing to those paths. I wonder if there are any performance wins to using a generic image with a checkout via run-task, and running the scripts there, rather than having custom logic about checking out the repo. (This would also mean that existing docker-worker caches for the checkout could be used).

The majority of the time is still spent doing the network requests, and so while I think this is worth doing, it's not solving the underlying issue.


There are a bunch of updates that take place in these jobs, but it seems only one takes a long time. We could split them out, so that at least the ones that don't take hours are generated once.

Also a good idea.


It looks like the script processes hosts in batches of 250 (see here). There are probably a couple of ways to improve this:

  • if the limitation is due to single-threadedness, and not a limit of worker network performance, we could split up the list outside of the JS and run multiple instances in parallel.
  • it looks like each batch of 250 is done as a whole, and no other requests are started until all 250 requests are complete. If, instead, there were always 250 requests in flight, any single server that is slow to respond would have less of an effect on the overall runtime.

Agreed; in the Python version I had a semaphore and just scheduled all the futures at once. DNS ended up being the main rate limiter.
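
Roughly, that "always N in flight" pattern looks like the sketch below (illustrative only; 250 just mirrors the existing batch size, and the timeout is an assumed value so dead hosts fail quickly):

    # Hypothetical sketch of "always N requests in flight" rather than batch-and-wait.
    # 250 mirrors the current batch size in getHSTSPreloadList.js; the timeout is assumed.
    import asyncio
    import aiohttp

    IN_FLIGHT = 250
    PER_HOST_TIMEOUT = aiohttp.ClientTimeout(total=30)

    async def check_host(session, sem, host):
        async with sem:  # at most IN_FLIGHT request bodies run at once
            try:
                async with session.get(f"https://{host}/", timeout=PER_HOST_TIMEOUT) as resp:
                    return host, resp.status
            except (aiohttp.ClientError, asyncio.TimeoutError):
                return host, None  # an unreachable host only ever ties up one slot

    async def check_all(hosts):
        sem = asyncio.Semaphore(IN_FLIGHT)
        async with aiohttp.ClientSession() as session:
            # Schedule every host up front; the semaphore throttles real concurrency,
            # so a slow server delays one slot instead of a whole batch of 250.
            return await asyncio.gather(*(check_host(session, sem, h) for h in hosts))

    # results = asyncio.run(check_all(hosts))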

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #5)

(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #3)

When I was writing a Python replacement for this, I found that the major blocker was DNS rate limiting, so we could try using 1.1.1.1 (possibly over https!) to get around that.

Sure, this seems reasonable.

I can run some tests and see what the actual outcome is.

I'm not sure if the javascript engine is also a blocker, one of the comments seems to indicate it is.

If Firefox's JS engine is the reason this is slow, I think our JS team would want to know (and maybe they would have suggestions for how we could improve the performance of the script).

Ah, https://searchfox.org/mozilla-central/source/taskcluster/docker/periodic-updates/scripts/getHSTSPreloadList.js#315-316 indicates it was a processing issue in the javascript, which required the rate limiting.

How much value is really being added by having this go through the same javascript the browser uses, if we can get the same results a different way?

I don't see a way of getting the same results without using our platform (if we use a different platform, we won't get the same results).

All it's doing is checking for the existence of an SSL header, right? I had a Python-produced file which was very similar. What differences would there be?
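
(For concreteness, the sort of check a Python implementation would be doing is roughly the sketch below. This is illustrative only, and as the replies note, it would not share Firefox's TLS stack or header-parsing behaviour.)

    # Purely illustrative: a naive Strict-Transport-Security check. This is NOT
    # equivalent to what Firefox does: different TLS stack, much looser header parsing.
    import requests

    def naive_hsts_check(host, timeout=30):
        """Return a dict of HSTS directives for the host, or None if absent/unreachable."""
        try:
            resp = requests.get(f"https://{host}/", timeout=timeout)
        except requests.RequestException:
            return None
        header = resp.headers.get("Strict-Transport-Security")
        if header is None:
            return None
        directives = {}
        for part in header.split(";"):
            name, _, value = part.strip().partition("=")
            if name:
                directives[name.lower()] = value.strip().strip('"')
        return directives

    # naive_hsts_check("example.com") might return something like
    # {"max-age": "63072000", "includesubdomains": ""}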

(In reply to Tom Prince [:tomprince] from comment #6)

(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #3)

Should this be checked in slower time by a longer-running service, like nagios, and the cron task be just querying the current cached state?

Does running this as a separate service add any benefit over increasing the timeout? Adding a service requires setting up infrastructure to run it on, and getting the code to run from in-tree.

As a further thought, if we put together a nagios plugin, it won't require any more infrastructure. The concept of 'here is a list of destinations, check feature X about them' fits well with the monitoring software model, so it should work, modulo the Nagios API's notoriety.
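
(To make that concrete: a Nagios-style plugin is just a script that prints one status line and exits 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN. The cache path and thresholds below are invented for illustration.)

    #!/usr/bin/env python3
    # Hypothetical Nagios-style plugin skeleton: report how many hosts in a cached
    # HSTS check result are currently failing. Path and thresholds are invented.
    import json
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
    WARN_THRESHOLD = 100
    CRIT_THRESHOLD = 1000

    def main():
        try:
            with open("/var/cache/hsts-preload/results.json") as f:  # assumed cache file
                results = json.load(f)  # e.g. {"example.com": true, ...}
        except (OSError, ValueError) as exc:
            print(f"HSTS UNKNOWN - cannot read cached results: {exc}")
            return UNKNOWN
        failing = sum(1 for ok in results.values() if not ok)
        if failing >= CRIT_THRESHOLD:
            status, code = "CRITICAL", CRITICAL
        elif failing >= WARN_THRESHOLD:
            status, code = "WARNING", WARNING
        else:
            status, code = "OK", OK
        print(f"HSTS {status} - {failing} of {len(results)} hosts failing")
        return code

    if __name__ == "__main__":
        sys.exit(main())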

(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #8)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #5)

(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #3)

How much value is really being added by having this go through the same javascript the browser uses, if we can get the same results a different way?

I don't see a way of getting the same results without using our platform (if we use a different platform, we won't get the same results).

All it's doing is checking for the existence of an SSL header, right? I had a Python-produced file which was very similar. What differences would there be?

The main differences would be in the TLS stack (does the Python implementation connect/not connect to the same sites Firefox does?) and the header parsing code (does the Python implementation accept/reject the same headers as Firefox does?).

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #10)

All it's doing is checking for the existence of an SSL header, right? I had a Python-produced file which was very similar. What differences would there be?

The main differences would be in the TLS stack (does the Python implementation connect/not connect to the same sites Firefox does?) and the header parsing code (does the Python implementation accept/reject the same headers as Firefox does?).

So I suppose one question is, "Is this script meant to check whether Firefox will connect to the site, or whether STS is enabled for a site in the STS list?" I'd thought it was the latter, but I'm happy either way.

The purpose of the list is to simulate each user's copy of Firefox visiting each of these sites. If Firefox can connect successfully and notes the site as HSTS, then we put that site on the list. If not, then we shouldn't put that site on the list.

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #12)

The purpose of the list is to simulate each user's copy of Firefox visiting each of these sites. If Firefox can connect successfully and notes the site as HSTS, then we put that site on the list. If not, then we shouldn't put that site on the list.

I defer to your requirements, here. Should we be removing hosts that we've been repeatedly unable to contact? If they make it on to the list once, they're never removed. https://searchfox.org/mozilla-central/source/taskcluster/docker/periodic-updates/scripts/getHSTSPreloadList.js#525-533

Some initial testing with different DNS providers seems to get around any DNS errors, but the JavaScript event loop doesn't seem to cope with four-figure batch sizes. Instead, the 'dump' message indicating all the results are in is produced, and then it sits there spinning. It does produce an output file identical to the input, which provides some encouragement.

It would be nice to remove sites that we haven't been able to connect to for the past <some amount of time>. The original motivation for keeping sites even if we couldn't connect to them was to be robust against intermittent network issues, but at this point I'm sure there are a number of stale entries on the list.

Pushed by rvandermeulen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/55701d6c8bfa
Adjust timeout for repo_update task r=RyanVM
Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED