Intermittent exception concurrent.futures._base.CancelledError (mac notarization-poller jobs appear to hang when apple doesn't know about request-uuid)
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
People
(Reporter: nataliaCs, Assigned: mozilla)
References
Details
Attachments
(4 files)
Task details: https://firefox-ci-tc.services.mozilla.com/tasks/cDQdl871SaiJNho5ojV7Zw
2020-03-05T10:04:54 INFO - 2020-03-05 10:04:54.659 altool[99598:25634596] No errors getting notarization info.
2020-03-05T10:04:54 INFO -
2020-03-05T10:04:54 INFO - RequestUUID: 7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e
2020-03-05T10:04:54 INFO - Date: 2020-03-04 11:16:09 +0000
2020-03-05T10:04:54 INFO - Status: unknown (-3)
2020-03-05T10:04:54 INFO - LogFileURL: (null)
2020-03-05T10:05:24 INFO - pending uuids: ['7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e']
2020-03-05T10:05:24 INFO - Polling 7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e
2020-03-05T10:05:25 INFO - Polling response (status 0)
2020-03-05T10:05:25 INFO - 2020-03-05 10:05:25.475 altool[99599:25634651] No errors getting notarization info.
2020-03-05T10:05:25 INFO -
2020-03-05T10:05:25 INFO - RequestUUID: 7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e
2020-03-05T10:05:25 INFO - Date: 2020-03-04 11:16:09 +0000
2020-03-05T10:05:25 INFO - Status: unknown (-3)
2020-03-05T10:05:25 INFO - LogFileURL: (null)
2020-03-05T10:05:55 INFO - pending uuids: ['7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e']
2020-03-05T10:05:55 INFO - Polling 7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e
2020-03-05T10:05:56 INFO - Polling response (status 0)
2020-03-05T10:05:56 INFO - 2020-03-05 10:05:56.175 altool[99600:25634702] No errors getting notarization info.
2020-03-05T10:05:56 INFO -
2020-03-05T10:05:56 INFO - RequestUUID: 7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e
2020-03-05T10:05:56 INFO - Date: 2020-03-04 11:16:09 +0000
2020-03-05T10:05:56 INFO - Status: unknown (-3)
2020-03-05T10:05:56 INFO - LogFileURL: (null)
2020-03-05T10:06:10 CRITICAL - Traceback (most recent call last):
File "/builds/scriptworker/scriptworker-scripts/notarization_poller/src/notarization_poller/task.py", line 69, in async_start
await self.task_fut
concurrent.futures._base.CancelledError
Assignee | ||
Comment 1•4 years ago
|
||
This hit deadline exceeded, four hours later :(
We're having a discussion with Apple about why this is happening.
Assignee | ||
Comment 2•4 years ago
|
||
Run 0 hit claim expired
after 18 hours of polling. That may be a bug, but the real issue is we have a UUID that isn't getting a status for 24 hours.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•4 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment 5•4 years ago
|
||
Today's episode - task fIkEyMyKSJK-7mSbratyQA run 0. Polling scriptworker log of
2020-03-22 23:01:54,212 - scriptworker_client.utils - INFO - Running ['xcrun', 'altool', '--notarization-info', 'b0716881-55d5-4812-9f93-d7d0430e29e7', '-u', 'release_notarization@mozilla.com', '--password', '********'] in /builds/scriptworker/poller ...
2020-03-22 23:01:55,438 - scriptworker_client.utils - INFO - 2020-03-22 23:01:55.437 altool[26186:39359672] No errors getting notarization info.
2020-03-22 23:01:55,439 - scriptworker_client.utils - INFO -
2020-03-22 23:01:55,439 - scriptworker_client.utils - INFO - RequestUUID: b0716881-55d5-4812-9f93-d7d0430e29e7
2020-03-22 23:01:55,439 - scriptworker_client.utils - INFO - Date: 2020-03-22 11:57:35 +0000
2020-03-22 23:01:55,439 - scriptworker_client.utils - INFO - Status: unknown (-3)
2020-03-22 23:01:55,439 - scriptworker_client.utils - INFO - LogFileURL: (null)
Cancelled that task, reran the parent signing (R-OqRd5xQdCvThja8X7T1w), then reran fIkEyMyKSJK-7mSbratyQA.
Comment 6•4 years ago
|
||
Today - task Ej9pLLWOS2eIJK1oQWdFIA run 0 - same as above except RequestUUID of 2c6faf9f-6d02-40b3-9b3b-37dbe8f72e50
.
Also discovered cbcc0b20-afcb-40d6-9689-63db1c292afd
stuck on the same poller.
Updated•4 years ago
|
Assignee | ||
Comment 7•4 years ago
•
|
||
Potential solutions:
- notarization_poller can rerun upstream tasks if it senses a UUID is stuck. Once that rerun task goes green, we start the polling over.
- alternately, notarization_poller if a UUID is stuck, notarization_poller can download the corresponding signed app from the notaration-part-1 task, and re-notarize. It then replaces the old UUID with the new UUID in its polling list. I like this solution better as long as this is only a handful of UUIDs per week.
In both cases, I'm envisioning a UUID_is_stuck_timeout
(>4 hours? <12 hours?).
Assignee | ||
Comment 9•4 years ago
|
||
Looks like this hit our betas twice this week. We need to implement an automated solution. I'm leaning towards notarization_poller downloading the signed bits and re-notarizing it if things take too long.
I want to avoid setting the UUID_is_stuck_timeout too short, because sometimes Apple has a multi-hour outage, and re-notarizing will likely exacerbate things. However, each poller worker should have a sense of how quickly Apple is returning success for each UUID. If there is an outlier (n
UUIDs return in seconds or minutes, 1 UUID is taking hours) that should be a sign.
Comment hidden (Intermittent Failures Robot) |
Comment 11•4 years ago
|
||
Are there more error states than Status: unknown (-3)
we should handle ? We could sidestep choosing the value of a timeout if we just handle that one error for now.
Assignee | ||
Comment 12•4 years ago
|
||
We may have cases where Apple thinks a uuid is unknown, but later figures out what they are... we should figure that out. I largely want to avoid re-notarizing too many things, or we lose the value of having a poller at all.
Assignee | ||
Comment 13•4 years ago
•
|
||
Brainstorming:
notarization_poller
should store a timestamp with each uuid. We could get this in various ways: from the task resolution time,claimWork
time for the poller task, or we could populate theuuid_manifest.json
with a dictionary ({uuid: timestamp}
) instead of a list-of-uuids. I'm leaning towards the manifest solution. TheclaimWork
time is easiest to use, but we may wait a lot more time than needed if someone cancels and reruns a long-running poller task, for example.- Because all of a given task's UUIDs are stored in the same
Task
object, we know when a given UUID in a task is taking longer than the others. We could potentially store polling statuses in the same dictionary.
a. The UUID could be an object with the submission timestamp and polling statuses? - All running tasks are stored in
RunTasks.running_tasks
. By inspecting each Task object, we could make an educated guess if we're seeing a single UUID that's taking way too long to notarize, or if Apple's notarization service is generally being slow. - If there's a single UUID that's taking way too long to notarize, we can then download the signed app from the notarization-part-1 task, and try re-notarizing it. (I'm not sure if we should throw away the first UUID, or poll both and go green if either one of the UUIDs is successful. In theory we should be able to do the latter, since we haven't changed the signed bits at all, just the notarization request. We haven't noticed a UUID that goes missing and is later valid, however.) Then we can poll for that UUID and repeat.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 15•4 years ago
|
||
Assignee | ||
Comment 16•4 years ago
|
||
Submitted report FB7656876 to Apple via Feedback Assistant.
Assignee | ||
Comment 17•4 years ago
|
||
Comment 18•4 years ago
|
||
cRoXlm9tSeSmnM3FdgLwcw is another poll hang in 76.0b3 with Status: unknown (-3). The upstream log is interesting - https://firefoxci.taskcluster-artifacts.net/VDm1wKy3TK-6VkzvNsqa3Q/0/public/logs/live_backing.log:
2020-04-10 02:19:54,441 - scriptworker_client.utils - INFO - Package Summary:
2020-04-10 02:19:54,441 - scriptworker_client.utils - INFO -
2020-04-10 02:19:54,441 - scriptworker_client.utils - INFO - 1 package(s) were not uploaded because they had problems:
2020-04-10 02:19:54,441 - scriptworker_client.utils - INFO - /var/folders/s7/y52ymlsn3tz_79bqg6bd6j_80000gr/T/A35A19D3-66BC-4ED1-8AFE-93F12375693F/Untitled.itmsp - Error Messages:
2020-04-10 02:19:54,441 - scriptworker_client.utils - INFO - ERROR ITMS-10004: "You have reached your upload limit of 20 software packages per minute. Pause your uploads, then reduce the number of software packages you upload per minute."
2020-04-10 02:19:54,524 - scriptworker_client.utils - INFO - 2020-04-10 02:19:54.524 altool[15766:50391190] *** Error: ERROR ITMS-10004: "You have reached your upload limit of 20 software packages per minute. Pause your uploads, then reduce the number of software packages you upload per minute."
I'm not sure if we catch an error code and retry, hard to say with the blended log.
Assignee | ||
Comment 19•4 years ago
•
|
||
I'm thinking:
- Catch
ERROR
in the response and treat it as a failure - Potentially switch to single zip rather than sudo+notary accounts+1 zip per locale. This change will:
a. remove the need for notary accounts, simplifying setup
b. reduce our concurrent uploads by ~80%
c. potentially slightly slow down the all-good poller case, but drastically reduce or resolve the multi-hour manual intervention lost UUID case.
d. if it resolves the lost UUID issue completely, we can drop the planned added complexity to the poller and resolve this bug.
Going to test the single zip manually, by manually rerunning a task after setting notarize_type
to single_zip
.
Assignee | ||
Comment 20•4 years ago
|
||
Assignee | ||
Comment 21•4 years ago
|
||
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 22•4 years ago
•
|
||
The single_zip
testing I did works, for both l10n and en-US. I sshed in and
- upgraded scriptworker to 33.1.1, for
claimWork
retries onTimeoutError
- modified
script_config.yaml
tos,multi_account,single_zip,
on the prod instances - restarted scriptworker
We likely want to land the two open PRs and update the virtualenvs before we call this done. It's possible that the single_zip
change may have resolved this issue, but I'm going to leave it open til several nightlies and releases happen without hitting this issue.
If this sticks, we want to update our docs and ronin-puppet to set single_zip
. If we're confident this is the path forward, we could tear out the multi_account
support and remove the notary accounts from ronin-puppet.
Assignee | ||
Comment 23•4 years ago
|
||
I had tested that the notarization and polling worked with single_zip, but I didn't test that stapling worked.
Nightlies failed because single_zip didn't submit the pkg for notarization.
Fixed in https://github.com/mozilla-releng/scriptworker-scripts/pull/200, and rolled out to the pool. When I get review I should clean up (use origin/master instead of escapewindow/single-zip).
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 26•4 years ago
|
||
Cleanup.
Assignee | ||
Comment 27•4 years ago
|
||
Rolled out the latest scriptworker-scripts to the prod and dep pools, and updated to origin/master
instead of escapewindow/single-zip
. This should include the poller claimWork
TimeoutError
retry, as well as the fix for including the .pkg files in single-zip.
Assignee | ||
Comment 28•4 years ago
|
||
Marking this as potentially fixed.
Comment hidden (Intermittent Failures Robot) |
Comment 30•3 years ago
|
||
(In reply to Aki Sasaki [:aki] (he/him) (UTC-7) from comment #16)
Submitted report FB7656876 to Apple via Feedback Assistant.
Apple doesn't have any interest in this. To be fair, altool changes to support concurrent uploads and default to https transport mean we'd have to retest after upgrading XCode (bug 1628491).
Comment 31•3 years ago
•
|
||
Seeing a new occurrence: https://firefox-ci-tc.services.mozilla.com/tasks/e_BapfASRm6EZXwvZpoM1g
Comment hidden (Intermittent Failures Robot) |
Comment 33•3 years ago
|
||
I'm not sure if this is the same issue or not. For posterity the last polling attempt (before the task was cancelled after 22 hours):
2020-07-11T03:46:10 INFO - pending uuids: ['401d531e-f74e-4ae0-a2b6-bc5256edafa2']
2020-07-11T03:46:10 INFO - Polling 401d531e-f74e-4ae0-a2b6-bc5256edafa2
2020-07-11T03:46:10 INFO - Polling response (status 0)
2020-07-11T03:46:10 INFO - 2020-07-11 03:46:10.846 altool[14702:41298793] No errors getting notarization info.
2020-07-11T03:46:10 INFO -
2020-07-11T03:46:10 INFO - RequestUUID: 401d531e-f74e-4ae0-a2b6-bc5256edafa2
2020-07-11T03:46:10 INFO - Date: 2020-07-10 05:56:56 +0000
2020-07-11T03:46:10 INFO - Status: in progress
2020-07-11T03:46:10 INFO - LogFileURL: (null)
2020-07-11T03:46:10 INFO - Status Code: 0
Retrying that now results in :
2020-07-14 02:04:24.838 altool[56429:31149481] *** Error: Apple Services operation failed. Could not find the RequestUUID.
It hasn't just expired as another UUID from the same nightly graph is still reporting as approved. The upload log looks unremarkable.
So lets put that down to an intermittent glitch at Apple, and watch and wait for any more.
Comment 34•3 years ago
|
||
We hit another different kind of situation with a l10n nightly chunk, adding it here for want of a better place.
UUID: cfdf0ad4-2142-474b-9fb2-fa4215155b7b
Apple log:
{
"logFormatVersion": 1,
"jobId": "cfdf0ad4-2142-474b-9fb2-fa4215155b7b",
"status": "Invalid",
"statusSummary": "Archive contains critical validation errors",
"statusCode": 4000,
"archiveFilename": "notarization.zip",
"uploadDate": "2020-07-22T00:22:55Z",
"sha256": "7e707cc89522f77d3beb9d69871eb7fd1950a3975ca3f0b16b13da918b36d050",
"ticketContents": null,
"issues": [
{
"severity": "warning",
"code": null,
"path": "notarization.zip",
"message": "b\"ditto: 0/Firefox Nightly.pkg: No such file or directory\\nditto: Couldn't read pkzip signature.\\n\"",
"docUrl": null,
"architecture": null
},
{
"severity": "error",
"code": null,
"path": "notarization.zip",
"message": "Package notarization.zip has no signed executables or bundles. No tickets can be generated.",
"docUrl": null,
"architecture": null
}
]
}
Poller task: https://firefox-ci-tc.services.mozilla.com/tasks/PblUB8BAQOKQEUCUeaiZLg
No log on there after I cancelled it, but from mac-v3-signing6 we polled for 5 minutes with 'in progress' status, starting with:
2020-07-22T00:20:02 INFO - RequestUUID: cfdf0ad4-2142-474b-9fb2-fa4215155b7b
2020-07-22T00:20:02 INFO - Date: 2020-07-22 00:18:47 +0000
2020-07-22T00:20:02 INFO - Status: in progress
2020-07-22T00:20:02 INFO - LogFileURL: (null)
2020-07-22T00:20:02 INFO - Status Code: 0
until issues are reported
2020-07-22T00:25:14 INFO - RequestUUID: cfdf0ad4-2142-474b-9fb2-fa4215155b7b
2020-07-22T00:25:14 INFO - Date: 2020-07-22 00:22:55 +0000
2020-07-22T00:25:14 INFO - Status: in progress
2020-07-22T00:25:14 INFO - LogFileURL: (null)
2020-07-22T00:25:14 INFO - Status Code: 2
2020-07-22T00:25:14 INFO - Status Message: Package Invalid
...
2020-07-22T00:25:45 INFO - RequestUUID: cfdf0ad4-2142-474b-9fb2-fa4215155b7b
2020-07-22T00:25:45 INFO - Date: 2020-07-22 00:22:55 +0000
2020-07-22T00:25:45 INFO - Status: invalid
2020-07-22T00:25:45 INFO - LogFileURL: .... (see above)
2020-07-22T00:25:45 INFO - Status Code: 2
2020-07-22T00:25:45 INFO - Status Message: Package Invalid
2020-07-22T00:25:45 INFO - ['xcrun', 'altool', '--notarization-info', 'cfdf0ad4-2142-474b-9fb2-fa4215155b7b', '-u', 'release_notarization@mozilla.com', '--password', '********'] in /builds/scriptworker/poller exited 0
2020-07-22T00:25:45 CRITICAL - PblUB8BAQOKQEUCUeaiZLg:0 - Apple believes UUID cfdf0ad4-2142-474b-9fb2-fa4215155b7b is invalid!
Strangely, the task was not marked as failed at this point.
Notarization-part-1 task: https://firefox-ci-tc.services.mozilla.com/tasks/PIvoyYnTR_WFNdbRUmg4Bw
I don't see anything odd in the creation and upload of the zip:
2020-07-21 17:18:08,764 - scriptworker_client.utils - INFO - ['zip', '-r', '/builds/scriptworker/work/notarization.zip', '0/Firefox Nightly.app', '0/Firefox Nightly.pkg', '1/Firefox Nightly.app', '1/Firefox Nightly.pkg', '2/Firefox Nightly.app', '2/Firefox Nightly.pkg', '3/Firefox Nightly.app', '3/Firefox Nightly.pkg', '4/Firefox Nightly.app', '4/Firefox Nightly.pkg'] in /builds/scriptworker/work exited 0
2020-07-21 17:18:08,764 - scriptworker_client.utils - INFO - Running ['xcrun', 'altool', '--notarize-app', '-f', '/builds/scriptworker/work/notarization.zip', '--primary-bundle-id', 'org.mozilla.firefox-mac-nightly.1595377088.764657', '-u', 'release_notarization@mozilla.com', '--asc-provider', 'MozillaCorporation', '--password', '********'] in / ...
2020-07-21 17:18:49,190 - scriptworker_client.utils - INFO - 2020-07-21 17:18:49.189 altool[78357:51881643] No errors uploading '/builds/scriptworker/work/notarization.zip'.
2020-07-21 17:18:49,190 - scriptworker_client.utils - INFO - RequestUUID = cfdf0ad4-2142-474b-9fb2-fa4215155b7b
2020-07-21 17:18:49,192 - scriptworker_client.utils - INFO - ['xcrun', 'altool', '--notarize-app', '-f', '/builds/scriptworker/work/notarization.zip', '--primary-bundle-id', 'org.mozilla.firefox-mac-nightly.1595377088.764657', '-u', 'release_notarization@mozilla.com', '--asc-provider', 'MozillaCorporation', '--password', '********'] in / exited 0
2020-07-21 17:18:49,193 - iscript.mac - INFO - /builds/scriptworker/work/notarization.log notarization response:
2020-07-21 17:18:49.189 altool[78357:51881643] No errors uploading '/builds/scriptworker/work/notarization.zip'.
RequestUUID = cfdf0ad4-2142-474b-9fb2-fa4215155b7b
So some sort of network or notarization service error. For the former, we could calculate the sha256 prior to upload to compare to what Apple reports, and there's the 'upgrade xcode so you have the latest altool, and can make complaints to them' aspect. In the latter case it seems to be transient.
Comment hidden (Intermittent Failures Robot) |
Description
•