Closed Bug 1562412 (notarization-poller) Opened 5 years ago Closed 4 years ago

Intermittent iscript.exceptions.IScriptError: ('%s in %s exited %s!\n%s', ['xcrun', 'altool', '--notarization-info', 'd087a306-193c-4e1f-a991-c5ee01a64f0b', '-u', 'release_appleagent@mozilla.com', '--password', '********'], '/builds/scriptworker', 239

Categories

(Release Engineering :: General, defect, P5)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Assigned: mozilla)

References

()

Details

(Keywords: intermittent-failure, regression, Whiteboard: leave-open)

User Story

Mac Notarization requires us to submit signed apps and pkgs to Apple, wait until they're done on their end, then staple the notarization to the app or pkg. In general this works. In cases where Apple takes longer than usual to get back to us (sometimes several hours), we see tree-closing bustage. This would also prevent us from shipping chemspills.

We first thought we might allow for multiple notarization tasks per worker. This would help because we have a limited hardware pool that we can't autoscale, and multiple tasks per worker would increase the number of tasks we can run concurrently by some multiplier (e.g., 3x or 5x). If Apple takes a while to get back, we could let each task run much longer than usual.

However, with the 3+hour outage, we realized that even that solution wouldn't scale properly in the worst case scenario(s). When each task hit their max timeout, they re-started, meaning they re-signed and re-submitted the apps and pkgs for notarization. Even if the previously signed and notarized apps and pkgs were ready, we no longer had those files to staple and upload. We were adding to Apple's load without solving our problem.

Therefore, we're going with a [three task solution](https://github.com/mozilla-releng/scriptworker-scripts/issues/55).

1. The first task will run on the mac notarization workers, and will sign the apps and pkgs, and submit them for notarization. Then it will upload the signed bits plus a uuid manifest.
2. The second task is a polling task. Because this is largely wait time, we can run many tasks per worker here. We don't need to use the notarization scriptworkers for these; I've written a custom poller worker that can potentially claim dozens of tasks and poll Apple for their uuid status. If there is an extended outage, we'll just gather dozens or hundreds of these running tasks. When Apple's ready, the poller worker will resolve the appropriate task(s), and we can proceed.
3. The third task will run on the mac notarization workers, and will download the signed apps and pkgs, staple notarization to them, and upload them.

We're essentially moving all the wait time into the custom pool of pollers in task 2. Hopefully this allows us to deal with Apple outages.

Attachments

(25 files, 5 obsolete files)

63 bytes, text/x-github-pull-request
Details | Review
63 bytes, text/x-github-pull-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
61 bytes, text/x-github-pull-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review

Filed by: ncsoregi [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=254083556&repo=mozilla-central
Full log: https://queue.taskcluster.net/v1/task/StdG-R5QRm6acoRqiugKjQ/runs/0/artifacts/public/logs/live_backing.log


2019-06-29 05:55:35,354 - scriptworker_client.client - ERROR - Failed to run async_main
Traceback (most recent call last):
File "/builds/scriptworker/scriptworker-scripts/scriptworker_client/src/scriptworker_client/client.py", line 156, in _handle_asyncio_loop
await async_main(config, task)
File "/builds/scriptworker/scriptworker-scripts/iscript/src/iscript/script.py", line 47, in async_main
await notarize_behavior(config, task)
File "/builds/scriptworker/scriptworker-scripts/iscript/src/iscript/mac.py", line 1089, in notarize_behavior
await poll_all_notarization_status(key_config, poll_uuids)
File "/builds/scriptworker/scriptworker-scripts/iscript/src/iscript/mac.py", line 868, in poll_all_notarization_status
await raise_future_exceptions(futures)
File "/builds/scriptworker/scriptworker-scripts/scriptworker_client/src/scriptworker_client/aio.py", line 55, in raise_future_exceptions
raise exceptions[0]
File "/builds/scriptworker/scriptworker-scripts/iscript/src/iscript/mac.py", line 826, in poll_notarization_uuid
"exception": IScriptError,
File "/builds/scriptworker/scriptworker-scripts/scriptworker_client/src/scriptworker_client/aio.py", line 154, in retry_async
return await func(*args, kwargs)
File "/builds/scriptworker/scriptworker-scripts/scriptworker_client/src/scriptworker_client/utils.py", line 230, in run_command
"%s in %s exited %s!\n%s", log_cmd, cwd, exitcode, log_contents
iscript.exceptions.IScriptError: ('%s in %s exited %s!\n%s', ['xcrun', 'altool', '--notarization-info', 'd087a306-193c-4e1f-a991-c5ee01a64f0b', '-u', 'release_appleagent@mozilla.com', '--password', '
******'], '/builds/scriptworker', 239, '')
exit code: 1

We may want to put retries around these polls.

Assignee: nobody → aki

The polls have retries; I'm not sure what the 239 exit code means. However, related to this, we may want to extend the notarization poll timeout from 15min.

Bumped the timeout from 1200 seconds to 2700.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=254674304&repo=mozilla-central&lineNumber=4410

2019-07-03 23:02:26,062 - scriptworker_client.client - ERROR - Failed to run async_main
Traceback (most recent call last):
File "/builds/scriptworker/scriptworker-scripts/scriptworker_client/src/scriptworker_client/client.py", line 156, in _handle_asyncio_loop
await async_main(config, task)
File "/builds/scriptworker/scriptworker-scripts/iscript/src/iscript/script.py", line 47, in async_main
await notarize_behavior(config, task)
File "/builds/scriptworker/scriptworker-scripts/iscript/src/iscript/mac.py", line 1165, in notarize_behavior
await poll_all_notarization_status(key_config, poll_uuids)
File "/builds/scriptworker/scriptworker-scripts/iscript/src/iscript/mac.py", line 940, in poll_all_notarization_status
await raise_future_exceptions(futures)
File "/builds/scriptworker/scriptworker-scripts/scriptworker_client/src/scriptworker_client/aio.py", line 55, in raise_future_exceptions
raise exceptions[0]
File "/builds/scriptworker/scriptworker-scripts/iscript/src/iscript/mac.py", line 898, in poll_notarization_uuid
"exception": IScriptError,
File "/builds/scriptworker/scriptworker-scripts/scriptworker_client/src/scriptworker_client/aio.py", line 154, in retry_async
return await func(*args, kwargs)
File "/builds/scriptworker/scriptworker-scripts/scriptworker_client/src/scriptworker_client/utils.py", line 230, in run_command
"%s in %s exited %s!\n%s", log_cmd, cwd, exitcode, log_contents
iscript.exceptions.IScriptError: ('%s in %s exited %s!\n%s', ['xcrun', 'altool', '--notarization-info', '7e2b60d4-ee5a-4099-984f-fbd8edf1c02b', '-u', 'release_appleagent@mozilla.com', '--password', '
******'], '/builds/scriptworker', 239, '')
exit code: 1

Status: RESOLVED → REOPENED
Flags: needinfo?(aki)
Resolution: FIXED → ---

It looks like Apple's service sometimes needs more time. Adding up to 10min of more attempts with this patch.

Flags: needinfo?(aki)

Now Apple's notarization service appears up, but we're timing out at 45 minutes of polling. This is ugly, but I bumped the task_max_timeout from 3600 to 7200 (2hrs) and the notarization_poll_timeout from 2700 to 6300 (45min to 105min).

Backing notarization out of mozilla-beta is also an option if it's a blocker.

When I get back from PTO I can work on splitting notarization into two phases (submit and poll) which would allow for longer polling times and more polling attempts without resubmitting the app for notarization.

No failures since Apple's outage on 2019-07-04. They have a status page at https://developer.apple.com/system-status/, see Developer ID Notary Service.

Type: -- → defect
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED

New failure log on central: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=257486190&repo=mozilla-central&lineNumber=4395

iscript.exceptions.IScriptError: ('%s in %s exited %s!\n%s', ['xcrun', 'altool', '--notarization-info', 'e9c90192-8eab-48a3-a3d6-d8caeb178088', '-u', 'release_notarization@mozilla.com', '--password', '********'], '/builds/scriptworker', 239, '')

This time the failure appears in 'release_notarization@mozilla.com' instead of 'release_appleagent@mozilla.com'

Flags: needinfo?(aki)

Yeah, we switched notarization users. Looks like we still have intermittents to address.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: needinfo?(aki)

(In reply to Aki Sasaki [:aki] (he/him) (UTC-7) from comment #1)

I'm guessing this is an Apple load issue -- to be determined if we're hammering them too hard, or if it's a temporary hiccup. If it's the former, we may have to adjust our notarization workflow.

Aki thsi doesn't seem like a hiccup. Started from last night on OS X Cross Compiled shippable opt: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&group_state=expanded&searchStr=OS%2CX%2CCross%2CCompiled%2Cshippable%2Copt

Can you please take a look?

Flags: needinfo?(aki)
Attached file increase sleep time (obsolete) —

This increases wait times between attempts at polling or stapling, but it doesn't seem to fix the issue. I manually ran an l10n notarization task and it ran out of retries, not having found the UUID after 10 attempts.

The most recently nightly build didn't have any errors with notarization.

It appears this was an Apple issue, temporary but only resolved after several days. We may need to plan for something like this happening during a chemspill.

Flags: needinfo?(aki)
Attachment #9105984 - Attachment description: GitHub Pull Request → increase sleep time
Attachment #9105984 - Attachment is obsolete: true
Attached file Notarization poller
Whiteboard: leave-open
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/10024cf3aaca
fix try.yml taskgraph-gen.py r=Callek
https://hg.mozilla.org/integration/autoland/rev/9b6303f7accd
stop using build_signing loader. r=jlorenzo,Callek
Backout by ncsoregi@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a29583066ef0
Backed out changeset 10024cf3aaca as per request. CLOSED TREE
Attachment #9115908 - Attachment is obsolete: true

Pushed by geoff@darktrojan.net:
https://hg.mozilla.org/comm-central/rev/e575c6d2fef1
Port bug 1562412 - Stop using build_signing loader. rs=bustage-fix

Status: REOPENED → RESOLVED
Closed: 5 years ago4 years ago
Resolution: --- → FIXED

CCing :rjl, since it's likely there will be more cc-breaking changes.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

First, let's add {only,not}-for-attributes to multi_dep so we can switch single_dep kinds over more easily. Second, let's put it into its own helper function, so we don't have to copy the same set of if blocks in each grouping function.

The single-locale group_by function works for tasks with a single locale each, but doesn't work when we want to maintain the same l10n chunks as our parent task. Because we want to make nightly-l10n-signing multi_dep, we'll need to group_by chunk-locales.

Depends on D57729

We still want to fail if we're not sure which task to use as the primary dependency, for a given dependent kind. However, sometimes we have a dependent kind that exists for some platforms and not others, and we might want to use that as the primary dependency if it exists. In this case, we should allow for a lower-priority primary-dependency kind.

Depends on D57730

We should move beetmover-geckoview to the multi_dep loader. Until then, this should un-break us when we adjust build-signing deps. Unknown deps will still lead to declarative artifacts bustage.

Depends on D57731

Also adjust artifacts based on dep_kind, and make build-notarization-part-1
a build-signing dependency.

Depends on D57737

I have run taskgraph-{gen,diff}.py dozens of times, so I'm fairly confident in these patches on that level. I still need to write the partner-repack and eme-free patches, push the whole thing to maple, and test end-to-end.

Tom doesn't like the solution in https://phabricator.services.mozilla.com/D57731 . This was the solution he suggested.

See Also: → 1606385
Attachment #9116890 - Attachment is obsolete: true
Attachment #9117406 - Attachment is obsolete: true
Attachment #9117406 - Attachment is patch: true
Attachment #9117406 - Attachment mime type: application/octet-stream → text/plain
Attachment #9116895 - Attachment description: Bug 1562412 - adjust mac-behavior by project and kind. → Bug 1562412 - adjust mac-behavior by project and kind. r=tomprince,nthomas
Attachment #9116896 - Attachment description: Bug 1562412 - add build-notarization-part-1 kind. → Bug 1562412 - add build-notarization-part-1 kind. r=tomprince,callek
Attachment #9116897 - Attachment description: Bug 1562412 - add build-notarization-poller kind. → Bug 1562412 - add build-notarization-poller kind. r=tomprince,callek
Attachment #9116898 - Attachment description: Bug 1562412 - build-signing multi_dep. → Bug 1562412 - build-signing multi_dep. r=tomprince,callek
Attachment #9116899 - Attachment description: Bug 1562412 - add nightly-l10n-notarization-part-1 kind → Bug 1562412 - add nightly-l10n-notarization-part-1 kind r=tomprince,callek
Attachment #9116900 - Attachment description: Bug 1562412 - add nightly-l10n-notarization-poller kind → Bug 1562412 - add nightly-l10n-notarization-poller kind r=tomprince,callek
Attachment #9116901 - Attachment description: Bug 1562412 - nightly-l10n-signing multi_dep. → Bug 1562412 - nightly-l10n-signing multi_dep. r=tomprince,callek
Attachment #9117407 - Attachment description: Bug 1562412 - partner-repack-ids multi_dep group_by. → Bug 1562412 - partner-repack-ids multi_dep group_by. r=tomprince,nthomas
Attachment #9117408 - Attachment description: Bug 1562412 - allow for copy-repack-ids behavior for partner kinds. → Bug 1562412 - allow for copy-repack-ids behavior for partner kinds. r=tomprince,nthomas
Attachment #9117409 - Attachment description: Bug 1562412 - add release-{eme-free,partner-repack}-notarization-part-1 kind. → Bug 1562412 - add release-{eme-free,partner-repack}-notarization-part-1 kind. r=tomprince,nthomas
Attachment #9117410 - Attachment description: Bug 1562412 - add release-{eme-free,partner}-repack-notarization-poller kinds. → Bug 1562412 - add release-{eme-free,partner}-repack-notarization-poller kinds. r=tomprince,nthomas
Attachment #9117411 - Attachment description: Bug 1562412 - release-{eme-free,partner}-repack-signing multi_dep. → Bug 1562412 - release-{eme-free,partner}-repack-signing multi_dep. r=tomprince,nthomas
Blocks: 1613281
Alias: notarization-poller
Attachment #9116899 - Attachment description: Bug 1562412 - add nightly-l10n-notarization-part-1 kind r=tomprince,callek → Bug 1562412 - add shippable-l10n-notarization-part-1 kind r=tomprince,callek
Attachment #9116900 - Attachment description: Bug 1562412 - add nightly-l10n-notarization-poller kind r=tomprince,callek → Bug 1562412 - add shippable-l10n-notarization-poller kind r=tomprince,callek
Attachment #9116901 - Attachment description: Bug 1562412 - nightly-l10n-signing multi_dep. r=tomprince,callek → Bug 1562412 - shippable-l10n-signing multi_dep. r=tomprince,callek
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/19ac1668f0db
only-for and not-for support in multi_dep. r=Callek
https://hg.mozilla.org/integration/autoland/rev/b8ddcf8368c5
chunk-locales multi_dep group_by. r=Callek
https://hg.mozilla.org/integration/autoland/rev/b51ce1cea46d
address beetmover-geckoview hardcoded assumptions. r=jlorenzo
https://hg.mozilla.org/integration/autoland/rev/32f31a6daaa8
remove unneeded repackage check. r=Callek
https://hg.mozilla.org/integration/autoland/rev/861e4a4a13aa
beetmover_source shouldn't hardcode 'build' depname. r=Callek
https://hg.mozilla.org/integration/autoland/rev/0fc1845ea46c
adjust mac-behavior by project and kind. r=nthomas
https://hg.mozilla.org/integration/autoland/rev/9ccdef3b8c9d
add build-notarization-part-1 kind. r=Callek
https://hg.mozilla.org/integration/autoland/rev/9be172fa19e1
add build-notarization-poller kind. r=Callek
https://hg.mozilla.org/integration/autoland/rev/af8ccde58367
build-signing multi_dep. r=Callek
https://hg.mozilla.org/integration/autoland/rev/cd6f0176af52
add shippable-l10n-notarization-part-1 kind r=Callek
https://hg.mozilla.org/integration/autoland/rev/bf7896916b91
add shippable-l10n-notarization-poller kind r=Callek
https://hg.mozilla.org/integration/autoland/rev/7870fa417606
shippable-l10n-signing multi_dep. r=Callek
https://hg.mozilla.org/integration/autoland/rev/5b7ca7cf6649
partner-repack-ids multi_dep group_by. r=nthomas
https://hg.mozilla.org/integration/autoland/rev/2143a26de372
allow for copy-repack-ids behavior for partner kinds. r=nthomas
https://hg.mozilla.org/integration/autoland/rev/aa62038d5d76
add release-{eme-free,partner-repack}-notarization-part-1 kind. r=nthomas
https://hg.mozilla.org/integration/autoland/rev/e39e5d6a77a8
add release-{eme-free,partner}-repack-notarization-poller kinds. r=nthomas
https://hg.mozilla.org/integration/autoland/rev/9c3d85e9b9fe
release-{eme-free,partner}-repack-signing multi_dep. r=nthomas

Depends on D58069

Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/71132f565162
only-for and not-for support in multi_dep. r=Callek
https://hg.mozilla.org/integration/autoland/rev/b6e245eceb8d
chunk-locales multi_dep group_by. r=Callek
https://hg.mozilla.org/integration/autoland/rev/440286deb39c
address beetmover-geckoview hardcoded assumptions. r=jlorenzo
https://hg.mozilla.org/integration/autoland/rev/f29f7d25f1c5
remove unneeded repackage check. r=Callek
https://hg.mozilla.org/integration/autoland/rev/bcdcb255215f
beetmover_source shouldn't hardcode 'build' depname. r=Callek
https://hg.mozilla.org/integration/autoland/rev/4f14f36f64ab
adjust mac-behavior by project and kind. r=nthomas
https://hg.mozilla.org/integration/autoland/rev/3e548ebf1a86
add build-notarization-part-1 kind. r=Callek
https://hg.mozilla.org/integration/autoland/rev/1e1ebe123688
add build-notarization-poller kind. r=Callek
https://hg.mozilla.org/integration/autoland/rev/fe8daf5f8572
build-signing multi_dep. r=Callek
https://hg.mozilla.org/integration/autoland/rev/f3e00bbef663
add shippable-l10n-notarization-part-1 kind r=Callek
https://hg.mozilla.org/integration/autoland/rev/f608ab8ea483
add shippable-l10n-notarization-poller kind r=Callek
https://hg.mozilla.org/integration/autoland/rev/228a9b2c1b3b
shippable-l10n-signing multi_dep. r=Callek
https://hg.mozilla.org/integration/autoland/rev/219516c0bf3f
partner-repack-ids multi_dep group_by. r=nthomas
https://hg.mozilla.org/integration/autoland/rev/9274917ab1f4
allow for copy-repack-ids behavior for partner kinds. r=nthomas
https://hg.mozilla.org/integration/autoland/rev/bb768cd84245
add release-{eme-free,partner-repack}-notarization-part-1 kind. r=nthomas
https://hg.mozilla.org/integration/autoland/rev/a0022d613473
add release-{eme-free,partner}-repack-notarization-poller kinds. r=nthomas
https://hg.mozilla.org/integration/autoland/rev/c15d700c3558
release-{eme-free,partner}-repack-signing multi_dep. r=nthomas
https://hg.mozilla.org/integration/autoland/rev/71e376fd296d
don't run notarization on autoland. r=Callek
Flags: needinfo?(aki)
Regressions: 1618162
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3ef1fc465a37
key mac-behavior off shippable r=Callek
https://hg.mozilla.org/integration/autoland/rev/a1adfce7d51d
[3-task notarization] fix ./mach try release r=Callek

The transform skips copying attributes from the dep job if attributes are specified. This means we skipped adding shipping_{phase,product}, nightly, and shippable attributes, and the poller task ended up getting filtered out of various target task methods.

Everything works properly if we just copy the attributes from the dependent job.

Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/50e86fb11cfe
[3-task notarization] - fix missing poller task r=Callek

Tracking ronin-puppet automation in bug 1619377.
Let's mark this fixed. This will uplift to beta on merge day next Monday. We may want to look at uplifting to esr68, though we probably have to do a general taskcluster/ uplift if we want to avoid a ton of bitrot.

Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
Attachment #9128695 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.