Closed Bug 1394130 Opened 7 years ago Closed 7 years ago

nightly signing timing out again

Categories

(Release Engineering :: Release Automation: Other, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(firefox57 fixed)

RESOLVED FIXED

People

(Reporter: mozilla, Assigned: mozilla)

References

Details

Attachments

(3 files)

Mostly in linux / osx l10n signing, which is hovering near the 60-minute max task age.
These tasks used to take 5-20 minutes, so this is a 3-12x increase in time.

I have a few suspects to track down.
There were a ton of signing tasks on the date project branch, from funsize-in-tree testing.
I:

- cancelled all pending date-branch tasks
- reduced the date-branch locales to ['de','zh-TW']

We're still at >380 pending signing tasks per Nagios.
I'm wondering whether log size hurts scriptworker turnaround.
Landed a bustage puppet fix to change scriptworker log rotation to daily: https://hg.mozilla.org/build/puppet/rev/3500fd61a0b5
According to https://public-artifacts.taskcluster.net/UKq53LCtSAiF2UbtrPC00g/0/public/logs/live_backing.log and https://public-artifacts.taskcluster.net/UKq53LCtSAiF2UbtrPC00g/0/public/logs/chain_of_trust.log :

- the chain of trust verification started at 19:25:58
- the signing task started at 19:28:09 (02:11 later)
- the task finished at 20:06:15 (38:06 later, for a total elapsed time of 40:17)

This is well before the 3600-second (60-minute) max task age. That means we're timing out during an artifact upload that has been running for nearly 20 minutes on this task.
We currently have a default aiohttp_max_connections of 15 for scriptworker, and non-l10n tasks stay well below that. I'm thinking we should adjust it.
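Not scriptworker's actual upload code, but a rough sketch of the knob involved (the helper names here are made up; only aiohttp_max_connections comes from the real config). The upload concurrency boils down to the `limit` on the shared aiohttp connector:

    import asyncio
    import aiohttp

    async def upload_artifacts(paths, put_url_for, max_connections=15):
        # The connector caps how many uploads run concurrently, which is
        # roughly what aiohttp_max_connections controls.
        connector = aiohttp.TCPConnector(limit=max_connections)
        async with aiohttp.ClientSession(connector=connector) as session:
            async def put_one(path):
                with open(path, "rb") as f:
                    async with session.put(put_url_for(path), data=f) as resp:
                        resp.raise_for_status()
            # Fan out all uploads; the connector limit throttles them.
            await asyncio.gather(*(put_one(p) for p in paths))

Raising the limit only helps if per-connection throughput to S3 is the bottleneck; if the instance's CPU or network is pegged, more connections won't buy much.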

Task UKq53LCtSAiF2UbtrPC00g above had 13 artifacts to upload in total, and may have been done with them when it timed out. With a sample size of one, that suggests something is slower overall, and we're only narrowly hitting the timeout.

Interesting point: the gpg homedirs rebuild happened during this upload. I believe it is a completely separate script / process that just logs to the same file; noting it in case it's able to slow down scriptworker.
GPG rebuild took 12 seconds; probably not the culprit.
We could add a `context.state` enum that records what the worker is doing: if the task is still running, we can kill it; if we're uploading, maybe we allow more time. Right now I think we just go by whether `context.task` is not None.
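A minimal sketch of that idea, with made-up attribute names (`task_start_time`, `upload_start_time`) rather than scriptworker's real ones:

    import enum

    class WorkerState(enum.Enum):
        IDLE = "idle"
        RUNNING_TASK = "running_task"
        UPLOADING = "uploading"

    def should_kill(context, now, task_max_age, upload_max_age):
        # Enforce task_max_age only while the task proper is running;
        # give artifact uploads their own, separate budget.
        if context.state is WorkerState.RUNNING_TASK:
            return now - context.task_start_time > task_max_age
        if context.state is WorkerState.UPLOADING:
            return now - context.upload_start_time > upload_max_age
        return False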

This doesn't explain the spike in overall runtime, though.
Currently I'm thinking about the following suspects:

- log size. This was ~150MB on signing-linux-1, well below where it has been in the past. We were slow on signing-linux-dev, which had a large log, so this could be a match. The puppet patch above should fix it at midnight; I can ssh in and manually rotate the logs in the meantime.
- signing server load. Because we had all the funsize-in-tree backlog on date, the nightly signing servers may have had higher than normal load, making signing take longer. This would also explain signing-linux-dev turnaround time, since dev signing has a higher load than nightly signing in general.
- uploads slowing down. If S3 was throttling us, that would help explain why l10n times spike more than non-l10n: l10n tasks have roughly 5x as many artifacts to upload.
  - we may be able to tune this with aiohttp_max_connections
  - we may be able to tune this by keeping track of `context.state` so we only kill the process if we're still running the task, but allow for uploads to have their own timeout.
- possibly something else that's slipping my mind now.

Rolling out any scriptworker / venv / scriptworker.yaml change will restart the scriptworker daemons, which will cause their tasks to retry. If it's the right fix, it'll still be a win, but I'm hesitant to do that without some confidence in the fix.
Halfway leaning towards locales_per_chunk = 2 or 3 (currently 5)
I rotated the logs on all signing-linux-\d+ manually.
About to land locales-per-chunk = 2, cancel previous nightly graphs, and rerun the nightly hooks.
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/2f0b519d13e9
2 locales per chunk to avoid signing timeouts. r=bustage a=release DONTBUILD
We had a TCW (tree closing window) earlier this morning; it may or may not be completely unrelated.
We seem to be out of the woods: signing is green again.
However, we went from 20 l10n chunks to 49, so we probably want to grow the signing pool again, and continue to debug why things are taking longer now.
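For reference, the chunk counts are consistent with roughly 97-98 shipped desktop locales (an estimate backed out from the numbers above, not an exact count):

    import math

    def chunk_count(num_locales, locales_per_chunk):
        return math.ceil(num_locales / locales_per_chunk)

    # With ~97 locales:
    #   chunk_count(97, 5) == 20   # old chunking
    #   chunk_count(97, 2) == 49   # new chunking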
Linux l10n seems to have pre-signed widevine files. Mac l10n seems to have unsigned widevine files, which may have contributed to the timeouts. However, this seems to have been the case for Friday's nightlies as well, where 5 locales took ~18min to sign.
And today, nightly signing seems super speedy again. This would suggest server or network load/issues.
2 minutes for a linux64 l10n signing task.
6 minutes for an osx l10n signing task.
Even if we bumped back to 5 locales / chunk, that would be 5 minutes for linux64 and 15 minutes for osx (because we're repacking an unsigned tarball for osx, resulting in re-signing widevine), which seem like decent speeds.
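Back-of-the-envelope, assuming signing time scales roughly linearly with the number of locales in a chunk:

    def est_chunk_minutes(minutes_now, locales_now, locales_later):
        # Scale today's per-chunk signing time to a different chunk size.
        return minutes_now / locales_now * locales_later

    # est_chunk_minutes(2, 2, 5) == 5.0    # linux64
    # est_chunk_minutes(6, 2, 5) == 15.0   # osx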

I'd love to know why Saturday had slower signing than Sunday/Monday.
We ran out of CPU credits, so we were sluggish on Saturday.
We had high CPU usage on Friday.
CPU credit monitoring is bug 1387191: thanks for jogging my memory :mtabara!

We can also potentially move off of t2.micro.
Blocks: 1387191
If this bug tracks the incident, we're good now, and we can resolve. But we probably need a followup per comment 17.

- monitor t2 cpu credits; take some action when we're low (see the sketch after this list)
- increase the size of the pool so we hit this limit less often
- move off t2.micro
- something else
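For the CPU-credit monitoring, a rough boto3 sketch of the kind of check Nagios or a cron job could run (instance id, region, and threshold are placeholders, not our actual config):

    from datetime import datetime, timedelta, timezone

    import boto3

    def cpu_credit_balance(instance_id, region="us-east-1"):
        # Fetch the most recent CPUCreditBalance datapoint for a t2 instance.
        cloudwatch = boto3.client("cloudwatch", region_name=region)
        now = datetime.now(timezone.utc)
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/EC2",
            MetricName="CPUCreditBalance",
            Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
            StartTime=now - timedelta(minutes=15),
            EndTime=now,
            Period=300,
            Statistics=["Average"],
        )
        points = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
        return points[-1]["Average"] if points else None

    # e.g. alert when cpu_credit_balance("i-0123456789abcdef0") < 20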
(In reply to Aki Sasaki [:aki] from comment #18)
> If this bug tracks the incident, we're good now, and we can resolve. But we
> probably need a followup per comment 17.
> 
> - monitor t2 cpu credits; take some action when we're low
> - increase the size of the pool so we hit this limit less often
> - move off t2.micro
> - something else

comment #6, where 'this' refers to upload concurrency and hitting max task age during upload:
>  - we may be able to tune this with aiohttp_max_connections
>  - we may be able to tune this by keeping track of `context.state` so we only kill the process if we're still running the task, but allow for uploads to have their own timeout.

comment #12, where I noticed mac l10n was re-signing widevine:
> Mac l10n seems to have unsigned widevine files, which may have contributed to the timeouts. However, this seems to have been the case for Friday's nightlies as well, where 5 locales took ~18min to sign.

We're also still at 2 locales per chunk, which means 49 chunks on desktop, but way faster turnaround per chunk.
Blocks: 1395000
Blocks: 1395001
Filed bugs related to cpu credits. Filed https://github.com/mozilla-releng/scriptworker/issues/149 for tracking uploads separate from `task_max_age`.

I should probably just write a patch to make mac nightly l10n depend on build-signing, then resolve.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Er, *then* resolve
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I landed https://hg.mozilla.org/projects/date/rev/3234b744da406ed26bcd78558df5428b4de2968a and https://hg.mozilla.org/projects/date/rev/801cc5f628d60a814f90ca68fdcecf5b92fbc727 . The resulting nightly l10n task died with

[task 2017-08-30T03:25:07.179143Z] 03:25:07    ERROR -  Traceback (most recent call last):
[task 2017-08-30T03:25:07.179421Z] 03:25:07     INFO -    File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
[task 2017-08-30T03:25:07.179554Z] 03:25:07     INFO -      "__main__", fname, loader, pkg_name)
[task 2017-08-30T03:25:07.179709Z] 03:25:07     INFO -    File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
[task 2017-08-30T03:25:07.179850Z] 03:25:07     INFO -      exec code in run_globals
[task 2017-08-30T03:25:07.180031Z] 03:25:07     INFO -    File "/home/worker/workspace/build/src/python/mozbuild/mozbuild/action/unpack_dmg.py", line 35, in <module>
[task 2017-08-30T03:25:07.180167Z] 03:25:07     INFO -      sys.exit(main(sys.argv[1:]))
[task 2017-08-30T03:25:07.180364Z] 03:25:07     INFO -    File "/home/worker/workspace/build/src/python/mozbuild/mozbuild/action/unpack_dmg.py", line 30, in main
[task 2017-08-30T03:25:07.180485Z] 03:25:07     INFO -      icon=options.icon)
[task 2017-08-30T03:25:07.180656Z] 03:25:07     INFO -    File "/home/worker/workspace/build/src/python/mozbuild/mozpack/dmg.py", line 194, in extract_dmg
[task 2017-08-30T03:25:07.180801Z] 03:25:07     INFO -      extract_dmg_contents(dmgfile, tmpdir)
[task 2017-08-30T03:25:07.180972Z] 03:25:07     INFO -    File "/home/worker/workspace/build/src/python/mozbuild/mozpack/dmg.py", line 174, in extract_dmg_contents
[task 2017-08-30T03:25:07.181103Z] 03:25:07     INFO -      stdout=open(os.devnull, 'wb'))
[task 2017-08-30T03:25:07.181260Z] 03:25:07     INFO -    File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
[task 2017-08-30T03:25:07.181406Z] 03:25:07     INFO -      raise CalledProcessError(retcode, cmd)
[task 2017-08-30T03:25:07.181634Z] 03:25:07     INFO -  subprocess.CalledProcessError: Command '['/home/worker/workspace/build/src/dmg/dmg', 'extract', '/home/worker/workspace/build/src/obj-firefox/dist/firefox-57.0a1.en-US.mac.dmg', '/tmp/tmp3ID2D8/firefox.hfs']' returned non-zero exit status 1
[task 2017-08-30T03:25:07.186024Z] 03:25:07     INFO -  /home/worker/workspace/build/src/toolkit/locales/l10n.mk:91: recipe for target '/home/worker/workspace/build/src/obj-firefox/dist/unpacked-installer' failed
[task 2017-08-30T03:25:07.186166Z] 03:25:07     INFO -  make: *** [/home/worker/workspace/build/src/obj-firefox/dist/unpacked-installer] Error 1

So we can't just point at a signed tarball; we have to adjust the l10n repack logic to not require a dmg.
... though we could potentially point at the repackage task's dmg...
Yeah, that's expected to break. Regarding changing the default packaging format on OS X, we should keep developers doing local repacks in mind. Right now they just need to ./mach package; I wouldn't know how they'd create a tarball instead, and it would also be less obvious.

The packager logic currently ties the input and output formats closely together in https://dxr.mozilla.org/mozilla-central/source/toolkit/mozapps/installer/upload-files.mk#98. Having an output packaging format that differs from the input format is going to require some hackery.

If we'd be fine with the osx l10n repacks generating tar.gz archives, we could probably set MOZ_PKG_FORMAT=TGZ in the mozconfigs or similar.
I pointed at the repackage task, so we have a dmg with signed widevine bits. But it looks like the l10n signing still re-signs widevine, which suggests we're not including the .sig files in the l10n repack.

Going to file a new bug and close this one.
(In reply to Aki Sasaki [:aki] from comment #24)
> I pointed at the repackage task, so we have a dmg with signed widevine bits.
> But it looks like the l10n signing still re-signs widevine, which suggests
> we're not including the .sig files in the l10n repack.
> 
> Going to file a new bug and close this one.

Oops, I looked at the wrong log (en-US signing rather than l10n signing). I was wrong; this patch is good. Going to flag Callek for review when he gets back next week.
Patch attached for posterity, since I can't store it in MozReview until Callek is back.
Comment on attachment 8903287 [details] [diff] [review]
mac-l10n-repackage

Review of attachment 8903287 [details] [diff] [review]:
-----------------------------------------------------------------

::: taskcluster/taskgraph/transforms/l10n.py
@@ +223,5 @@
>              })
> +        if job['attributes']['build_platform'].startswith('macosx'):
> +            job['dependencies'].update({
> +                'repackage': 'repackage-{}'.format(job['name'])
> +            })

This is starting to look ripe for a refactor to me (rather than the 'magic' dependency selection based on known naming), but I don't think that's in scope for this patch.
Attachment #8903287 - Flags: review+
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/72aae1305b7f
point osx l10n at the repackage task to avoid re-signing widevine. r=callek
https://hg.mozilla.org/mozilla-central/rev/72aae1305b7f
Status: REOPENED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED