Open Bug 1653171 Opened 4 years ago Updated 1 year ago

Convert windows l10n repacks to use cross-l10n on linux (to avoid cloning timeouts)

Categories

(Release Engineering :: Release Automation: L10N, defect)

defect

Tracking

(firefox-esr78 unaffected, firefox80 fixed)

Tracking Status
firefox-esr78 --- unaffected
firefox80 --- fixed

People

(Reporter: nthomas, Unassigned)

References

Details

(Keywords: intermittent-failure, leave-open)

Attachments

(3 files, 1 obsolete file)

First instance on 2020-07-01, hanging while cloning the first l10n locale after gecko had already been cloned OK:

[task 2020-07-01T02:36:48.207Z] 02:36:48     INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--config', 'extensions.robustcheckout=Z:\\task_1593570207\\build\\src\\testing\\mozharness\\external_tools\\robustcheckout.py', 'robustcheckout', 'https://hg.mozilla.org/l10n-central/mr', 'mr', '--sharebase', 'y:\\hg-shared', '--branch', '708ac4b5f864']
[task 2020-07-01T02:36:48.207Z] 02:36:48     INFO - Copy/paste: hg --config ui.merge=internal:merge --config extensions.robustcheckout=Z:\task_1593570207\build\src\testing\mozharness\external_tools\robustcheckout.py robustcheckout https://hg.mozilla.org/l10n-central/mr mr --sharebase y:\hg-shared --branch 708ac4b5f864
[task 2020-07-01T02:36:48.297Z] 02:36:48     INFO -  (using Mercurial 4.7.1)
[taskcluster 2020-07-01T02:42:59.412Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593572579.278
[taskcluster 2020-07-01T02:59:59.426Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593573599.291
[taskcluster 2020-07-01T03:16:59.390Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593574619.238
[taskcluster 2020-07-01T03:33:59.363Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593575639.199
[taskcluster 2020-07-01T03:50:59.273Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593576659.108
[taskcluster 2020-07-01T04:07:59.248Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593577679.076
[taskcluster 2020-07-01T04:24:59.157Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/KaYyTu-iQAm-7X-ioyPKlQ/0/on/us-east-1/i-0d1c3a3b99f5ea602/until/1593578698.979
[taskcluster:error] Aborting task...
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 3640 (child process of PID 1872) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 1872 (child process of PID 888) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 2384 (child process of PID 684) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 888 (child process of PID 684) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] SUCCESS: The process with PID 684 (child process of PID 5064) has been terminated.
[taskcluster 2020-07-01T04:26:04.971Z] 

Other instances:
2020-07-01 https://firefoxci.taskcluster-artifacts.net/DinN4QavSx2OVxg1vY6kXw/0/public/logs/live_backing.log
2020-07-01 https://firefoxci.taskcluster-artifacts.net/K0NUlvinTQacLStEvk7xEw/0/public/logs/live_backing.log (first where other locales are cloned fine then stops)
2020-07-03 https://firefoxci.taskcluster-artifacts.net/BRFPR9rrS6OIaICdVsWflQ/0/public/logs/live_backing.log
2020-07-06 https://firefoxci.taskcluster-artifacts.net/Uw9xb9vkRJagtBxZKiV8Ew/0/public/logs/live_backing.log
2020-07-08 https://firefoxci.taskcluster-artifacts.net/VKpXEqXOSlmq8XNpBoPuAQ/0/public/logs/live_backing.log
2020-07-08 https://firefoxci.taskcluster-artifacts.net/eRp0do6VR86q0hbGuPA6OA/0/public/logs/live_backing.log
2020-07-08 https://firefoxci.taskcluster-artifacts.net/Hou64f9RR8y7veC1kfFWdw/0/public/logs/live_backing.log
2020-07-10 https://firefoxci.taskcluster-artifacts.net/KW8pQecXT262mXejHx2pJg/0/public/logs/live_backing.log (in us-west-2b)
2020-07-10 https://firefoxci.taskcluster-artifacts.net/PMmXP1FTQ_6pHF29YqGwpQ/0/public/logs/live_backing.log
2020-07-16 https://firefoxci.taskcluster-artifacts.net/cD8H73cySxCKn7ZlN7ecXA/0/public/logs/live_backing.log
2020-07-16 https://firefoxci.taskcluster-artifacts.net/VsWulsjqRcC0M0kPoaW2OA/0/public/logs/live_backing.log
2020-07-16 https://firefoxci.taskcluster-artifacts.net/dK7ZK4OsRR6M6kHTrHIfyA/0/public/logs/live_backing.log

All of these are instances in us-east-1, except for KW8pQecXT262mXejHx2pJg (which was in us-west-2b).

Connor, we are wondering if this might be abuse/DDoS protection on hg.m.o. These requests would be coming from AWS to the public hg.m.o servers.

We are also thinking of swapping to the private servers, and are curious whether there is any protection there too.

Flags: needinfo?(sheehan)
Flags: needinfo?(bugspam.Callek)

The code used in run-task to find the internal mirror is here.

We don't have any true DDoS protection on hgmo (i.e. something like fail2ban), except for bans on IP blocks. Given the varied public IPs in the logs, I doubt this is the problem here, and the IPs currently banned in the load balancer don't seem to match any in the logs. Using the internal mirrors would work, but I would need to add the new repos to them first, since they currently only have a subset of all the repos on the public endpoint.

The way we avoid DDoS from cloning today is by using clonebundles. Perhaps we could start producing bundles for the l10n repos. Usually cloning these repos doesn't cause any disruption since they're quite small, and looking at some of the logs where cloning these repos succeeded, the clone + working directory update took ~10s. Given the small load for each individual clone I don't think the issue here is related to a DDoS from our infra, but I don't have a better explanation either.
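
A quick way to see whether a given repo already advertises clone bundles is to probe the hgweb capabilities endpoint; a minimal sketch, assuming the standard ?cmd=capabilities wire-protocol command, with the helper name advertises_clonebundles invented for illustration (not part of any patch in this bug):

# Hypothetical helper: fetch the repo's wire-protocol capability list and
# look for the "clonebundles" token, which a server includes when it can
# serve clone bundles for that repo.
from urllib.request import urlopen

def advertises_clonebundles(repo_url):
    with urlopen(repo_url.rstrip("/") + "?cmd=capabilities") as resp:
        caps = resp.read().decode("ascii", "replace").split()
    return "clonebundles" in caps

# e.g. advertises_clonebundles("https://hg.mozilla.org/l10n-central/mr")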

Flags: needinfo?(sheehan)

OK, thanks for the information. Perhaps it's something to do with the actual workers themselves.

Braindump of ideas:

  • we also see this issue occasionally on mozilla-central for l10n repacks, most recently here. That coincided with 79.0 RC1 having a couple of issues, so it might be related to surge load on the servers. There are 3 platforms by 21 chunks per branch that can all start at roughly the same time, each cloning 5 locales sequentially. Sometimes releases align with nightlies to double the load (or a higher multiple if several releases are started simultaneously). Clone bundles could offload a lot of work from the server, so they might be worthwhile even if only updated weekly, since these are low-change repos
  • I've tried adding socket.setdefaulttimeout(10.0) to testing/mozharness/external_tools/robustcheckout.py and running a bunch of l10n repacks on try (eg), but haven't been able to reproduce the hang and verify that the timeout works. Setting a much lower timeout locally certainly throws an exception and the retry behaviour of robustcheckout kicks in, so I think this should help us mitigate the issue (a sketch of the change follows this list)
  • we see this for Windows but not other platforms (which all run on Linux). The Mercurial versions are 4.7.1 and 4.8.1 respectively, and I see support was added in 4.8 for setting an http/https timeout. It defaults to None and I don't think we are setting it in the hgrc for our Linux images, so it's likely not the key difference
  • we could set the max job time to a much lower value than 7200 seconds in order to fail faster, and have sheriffs retry manually. Windows repacks are typically 23 minutes long and other platforms 5 minutes, so something like 1800s would tightly match that, or 2400s would allow a little more leeway for variation
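
A minimal sketch of the socket-timeout mitigation from the second bullet, assuming it is placed near the top of robustcheckout.py; the 10-second value matches the try experiment above, and the placement is illustrative rather than the actual patch:

import socket

# A process-wide default timeout makes a stalled HTTPS read during an hg
# clone raise socket.timeout instead of hanging until the task's max run
# time, so robustcheckout's existing retry logic gets a chance to kick in.
# It applies to every socket created afterwards without an explicit timeout.
socket.setdefaulttimeout(10.0)
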
Keywords: leave-open

Windows jobs typically take 23 minutes, so limit them to 30 minutes so that cloning hangs terminate well before the 2 hour mark.
Mac and Linux jobs are typically 5 minutes long, so bring the limit in to 15 minutes from an hour.

Also refactors so that the max run time is only specified in the kind, rather than partly in the kind and partly in the transform.
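
For reference, the transform side of that refactor looks roughly like the sketch below; this assumes the usual taskgraph TransformSequence pattern, and the function name and the 2400-second fallback are made up for illustration rather than taken from the patch:

from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()

@transforms.add
def set_default_max_run_time(config, jobs):
    # Only fill in a fallback when the kind.yml didn't specify a value, so
    # the real per-platform limits live in the kinds rather than being
    # scattered between kind and transform.
    for job in jobs:
        job.setdefault("worker", {}).setdefault("max-run-time", 2400)
        yield job
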

Assignee: nobody → nthomas
Status: NEW → ASSIGNED
Blocks: 1655625
Pushed by nthomas@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d3f021e9c896 Shorten max run time for l10n repacks, r=Callek

We're unsure what the next step is here, unassigning for now.

Assignee: nthomas → nobody
Status: ASSIGNED → NEW
Assignee: nobody → nthomas
Status: NEW → ASSIGNED
Pushed by cbrindusan@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c2667d1e8390 Updated shippable-l10n/kind.yml for trailing spaces. CLOSED TREE

Looks like the timeout of 30 minutes is too short for some jobs. There are 6 failures on that push for Windows l10n: 1 has the l10n hang and was failed as we wanted, but 5 are jobs that should have been left alone, as they had finished cloning and were repacking. This seems to be related to AWS instances in us-east-1 which don't already have a clone of gecko, but it doesn't always occur. I'm asking around about why we don't use the internal hg servers for that AWS region.

There's one more that should have been left alone on the previous nightly, while the most recent nightly was OK.

Bug 1614852 also has some investigations of hanging hg clones, in the context of GCP workers.

See Also: → 1614852
Pushed by nthomas@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9b7136d538b6 bump timeout for windows l10n up to 45 minutes, r=Callek

Looks like there are 6 failures on that push; 5 show the main signature for this bug:

[task 2020-07-31T00:09:22.108Z] 00:09:22     INFO - Running command: ['hg', '--config', 'ui.merge=internal:merge', '--config', 'extensions.robustcheckout=Z:\\task_1596153325\\build\\src\\testing\\mozharness\\external_tools\\robustcheckout.py', 'robustcheckout', 'https://hg.mozilla.org/l10n-central/cy', 'cy', '--sharebase', 'y:\\hg-shared', '--branch', 'default']
[task 2020-07-31T00:09:22.108Z] 00:09:22     INFO - Copy/paste: hg --config ui.merge=internal:merge --config extensions.robustcheckout=Z:\task_1596153325\build\src\testing\mozharness\external_tools\robustcheckout.py robustcheckout https://hg.mozilla.org/l10n-central/cy cy --sharebase y:\hg-shared --branch default
[task 2020-07-31T00:09:22.190Z] 00:09:22     INFO -  (using Mercurial 4.7.1)
[taskcluster 2020-07-31T00:15:04.046Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Rg3yOSv2THqmhKSO2i4KUw/0/on/us-east-1/i-02a30bcf0a197d56d/until/1596155703.938
[taskcluster 2020-07-31T00:32:04.074Z] [taskcluster-proxy] Successfully refreshed taskcluster-proxy credentials: task-client/Rg3yOSv2THqmhKSO2i4KUw/0/on/us-east-1/i-02a30bcf0a197d56d/until/1596156723.974
[taskcluster:error] Aborting task...

The taskcluster-proxy lines aren't believed to be related; the bug is the hang while doing an hg clone.

The other one is closely related: it took 10 minutes to clone+update gecko, then 30 minutes to clone https://hg.mozilla.org/l10n-central/br, and it was repacking when it hit the 45 minute max run time.

These jobs are using the internal mirror in us-east-1 to clone gecko, as sheehan re-enabled that at my request. It seems to make the clone+update of gecko a little quicker (~8 minutes vs ~10), but we're obviously still having issues with the public hg server, the network, or Windows.

The windows image hasn't changed since February. I'm not sure what our next steps are here.

> The other one is closely related: it took 10 minutes to clone+update gecko, then 30 minutes to clone https://hg.mozilla.org/l10n-central/br, and it was repacking when it hit the 45 minute max run time.

Do these clone tasks occur on different filesystems? It doesn't make sense that clone+update for gecko would be so much faster than clone+update for an l10n repo, when the two are several orders of magnitude apart in size.

Even if the docker image itself hasn't changed, if we mount filesystems into the container any change to the filesystem might be affecting the vcs timings there.

We are only seeing this on Windows machines, and looking at the logs, we are using the same shared directory and an adjacent checkout directory.

One idea I have in mind is to restructure the l10n repack process to do the libs and langpack phases independently of the package repacks, on just one platform. That would create an artifact including the merge dirs and the dist/xpi-stage, which the repacks would then use.

I had looked at the time we spend on cloning and libs, and thought it might not be worth it, but maybe it is?

I just attached to this bug the WIP of setting us up to use internal hg mirrors for mozharness clones; I abandoned working on it when I saw Connor's reply in comment 3.

I don't have any better ideas right now sadly :(

Flags: needinfo?(bugspam.Callek)
Pushed by mozilla@hocat.ca: https://hg.mozilla.org/integration/autoland/rev/f8a157da190f Run windows L10n repacks on linux; r=Callek
Pushed by csabou@mozilla.com: https://hg.mozilla.org/mozilla-central/rev/02be42e5105b Run windows L10n repacks on linux; r=Callek
Assignee: nthomas → mozilla

Can we uplift ^ to Beta? Looks like it's making a huge runtime difference on m-c. On a related note - should we adjust the timeout values downwards again?

Flags: needinfo?(mozilla)
Summary: Timeouts cloning l10n repositories in windows repacks → Convert windows l10n repacks to use cross-l10n on linux (to avoid cloning timeouts)

We wanted to wait a week, to see if there was fallout on nightly, before landing on beta.

Flags: needinfo?(mozilla)
Whiteboard: [checkin-needed-beta]

If we want to uplift to esr78: try push | grafted try commit on top of esr78

That broke on mac. Trying mac on debian8: try push | attempted fix

Still broken on mac =\

Blocks: 1712035
Attachment #9167744 - Attachment is obsolete: true
Severity: normal → S3

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: mozilla → nobody
Status: ASSIGNED → NEW