Closed Bug 1620697 (lost-notarization-uuid) Opened 4 years ago Closed 4 years ago

Intermittent exception concurrent.futures._base.CancelledError (mac notarization-poller jobs appear to hang when apple doesn't know about request-uuid)

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nataliaCs, Assigned: mozilla)

References

Details

Attachments

(4 files)

Task details: https://firefox-ci-tc.services.mozilla.com/tasks/cDQdl871SaiJNho5ojV7Zw

Log: https://firefox-ci-tc.services.mozilla.com/tasks/cDQdl871SaiJNho5ojV7Zw/runs/1/logs/https%3A%2F%2Ffirefox-ci-tc.services.mozilla.com%2Fapi%2Fqueue%2Fv1%2Ftask%2FcDQdl871SaiJNho5ojV7Zw%2Fruns%2F1%2Fartifacts%2Fpublic%2Flogs%2Flive_backing.log

2020-03-05T10:04:54 INFO - 2020-03-05 10:04:54.659 altool[99598:25634596] No errors getting notarization info.
2020-03-05T10:04:54 INFO -
2020-03-05T10:04:54 INFO - RequestUUID: 7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e
2020-03-05T10:04:54 INFO - Date: 2020-03-04 11:16:09 +0000
2020-03-05T10:04:54 INFO - Status: unknown (-3)
2020-03-05T10:04:54 INFO - LogFileURL: (null)
2020-03-05T10:05:24 INFO - pending uuids: ['7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e']
2020-03-05T10:05:24 INFO - Polling 7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e
2020-03-05T10:05:25 INFO - Polling response (status 0)
2020-03-05T10:05:25 INFO - 2020-03-05 10:05:25.475 altool[99599:25634651] No errors getting notarization info.
2020-03-05T10:05:25 INFO -
2020-03-05T10:05:25 INFO - RequestUUID: 7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e
2020-03-05T10:05:25 INFO - Date: 2020-03-04 11:16:09 +0000
2020-03-05T10:05:25 INFO - Status: unknown (-3)
2020-03-05T10:05:25 INFO - LogFileURL: (null)
2020-03-05T10:05:55 INFO - pending uuids: ['7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e']
2020-03-05T10:05:55 INFO - Polling 7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e
2020-03-05T10:05:56 INFO - Polling response (status 0)
2020-03-05T10:05:56 INFO - 2020-03-05 10:05:56.175 altool[99600:25634702] No errors getting notarization info.
2020-03-05T10:05:56 INFO -
2020-03-05T10:05:56 INFO - RequestUUID: 7dc5ad48-59b8-4ec5-9be5-0cf110c92b0e
2020-03-05T10:05:56 INFO - Date: 2020-03-04 11:16:09 +0000
2020-03-05T10:05:56 INFO - Status: unknown (-3)
2020-03-05T10:05:56 INFO - LogFileURL: (null)
2020-03-05T10:06:10 CRITICAL - Traceback (most recent call last):
File "/builds/scriptworker/scriptworker-scripts/notarization_poller/src/notarization_poller/task.py", line 69, in async_start
await self.task_fut
concurrent.futures._base.CancelledError

This hit deadline exceeded, four hours later :(
We're having a discussion with Apple about why this is happening.

Run 0 hit claim expired after 18 hours of polling. That may be a bug, but the real issue is we have a UUID that isn't getting a status for 24 hours.

Alias: lost-notarization-uuid

Today's episode - task fIkEyMyKSJK-7mSbratyQA run 0. Polling scriptworker log of

2020-03-22 23:01:54,212 - scriptworker_client.utils - INFO - Running ['xcrun', 'altool', '--notarization-info', 'b0716881-55d5-4812-9f93-d7d0430e29e7', '-u', 'release_notarization@mozilla.com', '--password', '********'] in /builds/scriptworker/poller ...
2020-03-22 23:01:55,438 - scriptworker_client.utils - INFO - 2020-03-22 23:01:55.437 altool[26186:39359672] No errors getting notarization info.
2020-03-22 23:01:55,439 - scriptworker_client.utils - INFO -
2020-03-22 23:01:55,439 - scriptworker_client.utils - INFO -    RequestUUID: b0716881-55d5-4812-9f93-d7d0430e29e7
2020-03-22 23:01:55,439 - scriptworker_client.utils - INFO -           Date: 2020-03-22 11:57:35 +0000
2020-03-22 23:01:55,439 - scriptworker_client.utils - INFO -         Status: unknown (-3)
2020-03-22 23:01:55,439 - scriptworker_client.utils - INFO -     LogFileURL: (null)

Cancelled that task, reran the parent signing (R-OqRd5xQdCvThja8X7T1w), then reran fIkEyMyKSJK-7mSbratyQA.

Today - task Ej9pLLWOS2eIJK1oQWdFIA run 0 - same as above except RequestUUID of 2c6faf9f-6d02-40b3-9b3b-37dbe8f72e50.

Also discovered cbcc0b20-afcb-40d6-9689-63db1c292afd stuck on the same poller.

Summary: Intermittent exception concurrent.futures._base.CancelledError → Intermittent exception concurrent.futures._base.CancelledError (mac notarization-poller jobs appear to hang when apple doesn't know about request-uuid)

Potential solutions:

  • notarization_poller can rerun upstream tasks if it senses a UUID is stuck. Once that rerun task goes green, we start the polling over.
  • alternately, notarization_poller if a UUID is stuck, notarization_poller can download the corresponding signed app from the notaration-part-1 task, and re-notarize. It then replaces the old UUID with the new UUID in its polling list. I like this solution better as long as this is only a handful of UUIDs per week.

In both cases, I'm envisioning a UUID_is_stuck_timeout (>4 hours? <12 hours?).

Looks like this hit our betas twice this week. We need to implement an automated solution. I'm leaning towards notarization_poller downloading the signed bits and re-notarizing it if things take too long.

I want to avoid setting the UUID_is_stuck_timeout too short, because sometimes Apple has a multi-hour outage, and re-notarizing will likely exacerbate things. However, each poller worker should have a sense of how quickly Apple is returning success for each UUID. If there is an outlier (n UUIDs return in seconds or minutes, 1 UUID is taking hours) that should be a sign.

Are there more error states than Status: unknown (-3) we should handle ? We could sidestep choosing the value of a timeout if we just handle that one error for now.

We may have cases where Apple thinks a uuid is unknown, but later figures out what they are... we should figure that out. I largely want to avoid re-notarizing too many things, or we lose the value of having a poller at all.

Brainstorming:

  1. notarization_poller should store a timestamp with each uuid. We could get this in various ways: from the task resolution time, claimWork time for the poller task, or we could populate the uuid_manifest.json with a dictionary ({uuid: timestamp}) instead of a list-of-uuids. I'm leaning towards the manifest solution. The claimWork time is easiest to use, but we may wait a lot more time than needed if someone cancels and reruns a long-running poller task, for example.
  2. Because all of a given task's UUIDs are stored in the same Task object, we know when a given UUID in a task is taking longer than the others. We could potentially store polling statuses in the same dictionary.
    a. The UUID could be an object with the submission timestamp and polling statuses?
  3. All running tasks are stored in RunTasks.running_tasks. By inspecting each Task object, we could make an educated guess if we're seeing a single UUID that's taking way too long to notarize, or if Apple's notarization service is generally being slow.
  4. If there's a single UUID that's taking way too long to notarize, we can then download the signed app from the notarization-part-1 task, and try re-notarizing it. (I'm not sure if we should throw away the first UUID, or poll both and go green if either one of the UUIDs is successful. In theory we should be able to do the latter, since we haven't changed the signed bits at all, just the notarization request. We haven't noticed a UUID that goes missing and is later valid, however.) Then we can poll for that UUID and repeat.
Assignee: nobody → aki

Submitted report FB7656876 to Apple via Feedback Assistant.

Depends on: 1628491

cRoXlm9tSeSmnM3FdgLwcw is another poll hang in 76.0b3 with Status: unknown (-3). The upstream log is interesting - https://firefoxci.taskcluster-artifacts.net/VDm1wKy3TK-6VkzvNsqa3Q/0/public/logs/live_backing.log:

2020-04-10 02:19:54,441 - scriptworker_client.utils - INFO - Package Summary:
2020-04-10 02:19:54,441 - scriptworker_client.utils - INFO - 
2020-04-10 02:19:54,441 - scriptworker_client.utils - INFO - 1 package(s) were not uploaded because they had problems:
2020-04-10 02:19:54,441 - scriptworker_client.utils - INFO - 	/var/folders/s7/y52ymlsn3tz_79bqg6bd6j_80000gr/T/A35A19D3-66BC-4ED1-8AFE-93F12375693F/Untitled.itmsp - Error Messages:
2020-04-10 02:19:54,441 - scriptworker_client.utils - INFO - 		ERROR ITMS-10004: "You have reached your upload limit of 20 software packages per minute. Pause your uploads, then reduce the number of software packages you upload per minute."
2020-04-10 02:19:54,524 - scriptworker_client.utils - INFO - 2020-04-10 02:19:54.524 altool[15766:50391190] *** Error: ERROR ITMS-10004: "You have reached your upload limit of 20 software packages per minute. Pause your uploads, then reduce the number of software packages you upload per minute."

I'm not sure if we catch an error code and retry, hard to say with the blended log.

I'm thinking:

  1. Catch ERROR in the response and treat it as a failure
  2. Potentially switch to single zip rather than sudo+notary accounts+1 zip per locale. This change will:
    a. remove the need for notary accounts, simplifying setup
    b. reduce our concurrent uploads by ~80%
    c. potentially slightly slow down the all-good poller case, but drastically reduce or resolve the multi-hour manual intervention lost UUID case.
    d. if it resolves the lost UUID issue completely, we can drop the planned added complexity to the poller and resolve this bug.

Going to test the single zip manually, by manually rerunning a task after setting notarize_type to single_zip.

Attachment #9139547 - Attachment description: [scriptworker-scripts] log notarization response → [iscript] log notarization response
Attachment #9139834 - Attachment description: [scriptworker-scripts] catch TimeoutError during poller claimWork → [notarization_poller] catch TimeoutError during poller claimWork

The single_zip testing I did works, for both l10n and en-US. I sshed in and

  • upgraded scriptworker to 33.1.1, for claimWork retries on TimeoutError
  • modified script_config.yaml to s,multi_account,single_zip, on the prod instances
  • restarted scriptworker

We likely want to land the two open PRs and update the virtualenvs before we call this done. It's possible that the single_zip change may have resolved this issue, but I'm going to leave it open til several nightlies and releases happen without hitting this issue.

If this sticks, we want to update our docs and ronin-puppet to set single_zip. If we're confident this is the path forward, we could tear out the multi_account support and remove the notary accounts from ronin-puppet.

I had tested that the notarization and polling worked with single_zip, but I didn't test that stapling worked.
Nightlies failed because single_zip didn't submit the pkg for notarization.

Fixed in https://github.com/mozilla-releng/scriptworker-scripts/pull/200, and rolled out to the pool. When I get review I should clean up (use origin/master instead of escapewindow/single-zip).

Rolled out the latest scriptworker-scripts to the prod and dep pools, and updated to origin/master instead of escapewindow/single-zip. This should include the poller claimWork TimeoutError retry, as well as the fix for including the .pkg files in single-zip.

No longer depends on: 1628491

Marking this as potentially fixed.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

(In reply to Aki Sasaki [:aki] (he/him) (UTC-7) from comment #16)

Submitted report FB7656876 to Apple via Feedback Assistant.

Apple doesn't have any interest in this. To be fair, altool changes to support concurrent uploads and default to https transport mean we'd have to retest after upgrading XCode (bug 1628491).

I'm not sure if this is the same issue or not. For posterity the last polling attempt (before the task was cancelled after 22 hours):

2020-07-11T03:46:10 INFO - pending uuids: ['401d531e-f74e-4ae0-a2b6-bc5256edafa2']
2020-07-11T03:46:10 INFO - Polling 401d531e-f74e-4ae0-a2b6-bc5256edafa2
2020-07-11T03:46:10 INFO - Polling response (status 0)
2020-07-11T03:46:10 INFO -  2020-07-11 03:46:10.846 altool[14702:41298793] No errors getting notarization info.
2020-07-11T03:46:10 INFO -  
2020-07-11T03:46:10 INFO -     RequestUUID: 401d531e-f74e-4ae0-a2b6-bc5256edafa2
2020-07-11T03:46:10 INFO -            Date: 2020-07-10 05:56:56 +0000
2020-07-11T03:46:10 INFO -          Status: in progress
2020-07-11T03:46:10 INFO -      LogFileURL: (null)
2020-07-11T03:46:10 INFO -     Status Code: 0

Retrying that now results in :
2020-07-14 02:04:24.838 altool[56429:31149481] *** Error: Apple Services operation failed. Could not find the RequestUUID.

It hasn't just expired as another UUID from the same nightly graph is still reporting as approved. The upload log looks unremarkable.

So lets put that down to an intermittent glitch at Apple, and watch and wait for any more.

We hit another different kind of situation with a l10n nightly chunk, adding it here for want of a better place.

UUID: cfdf0ad4-2142-474b-9fb2-fa4215155b7b
Apple log:

{
    "logFormatVersion": 1,
    "jobId": "cfdf0ad4-2142-474b-9fb2-fa4215155b7b",
    "status": "Invalid",
    "statusSummary": "Archive contains critical validation errors",
    "statusCode": 4000,
    "archiveFilename": "notarization.zip",
    "uploadDate": "2020-07-22T00:22:55Z",
    "sha256": "7e707cc89522f77d3beb9d69871eb7fd1950a3975ca3f0b16b13da918b36d050",
    "ticketContents": null,
    "issues": [
        {
            "severity": "warning",
            "code": null,
            "path": "notarization.zip",
            "message": "b\"ditto: 0/Firefox Nightly.pkg: No such file or directory\\nditto: Couldn't read pkzip signature.\\n\"",
            "docUrl": null,
            "architecture": null
        },
        {
            "severity": "error",
            "code": null,
            "path": "notarization.zip",
            "message": "Package notarization.zip has no signed executables or bundles. No tickets can be generated.",
            "docUrl": null,
            "architecture": null
        }
    ]
}

Poller task: https://firefox-ci-tc.services.mozilla.com/tasks/PblUB8BAQOKQEUCUeaiZLg

No log on there after I cancelled it, but from mac-v3-signing6 we polled for 5 minutes with 'in progress' status, starting with:

2020-07-22T00:20:02 INFO -    RequestUUID: cfdf0ad4-2142-474b-9fb2-fa4215155b7b
2020-07-22T00:20:02 INFO -           Date: 2020-07-22 00:18:47 +0000
2020-07-22T00:20:02 INFO -         Status: in progress
2020-07-22T00:20:02 INFO -     LogFileURL: (null)
2020-07-22T00:20:02 INFO -    Status Code: 0

until issues are reported

2020-07-22T00:25:14 INFO -    RequestUUID: cfdf0ad4-2142-474b-9fb2-fa4215155b7b
2020-07-22T00:25:14 INFO -           Date: 2020-07-22 00:22:55 +0000
2020-07-22T00:25:14 INFO -         Status: in progress
2020-07-22T00:25:14 INFO -     LogFileURL: (null)
2020-07-22T00:25:14 INFO -    Status Code: 2
2020-07-22T00:25:14 INFO - Status Message: Package Invalid
...
2020-07-22T00:25:45 INFO -    RequestUUID: cfdf0ad4-2142-474b-9fb2-fa4215155b7b
2020-07-22T00:25:45 INFO -           Date: 2020-07-22 00:22:55 +0000
2020-07-22T00:25:45 INFO -         Status: invalid
2020-07-22T00:25:45 INFO -     LogFileURL: .... (see above)
2020-07-22T00:25:45 INFO -    Status Code: 2
2020-07-22T00:25:45 INFO - Status Message: Package Invalid
2020-07-22T00:25:45 INFO - ['xcrun', 'altool', '--notarization-info', 'cfdf0ad4-2142-474b-9fb2-fa4215155b7b', '-u', 'release_notarization@mozilla.com', '--password', '********'] in /builds/scriptworker/poller exited 0
2020-07-22T00:25:45 CRITICAL - PblUB8BAQOKQEUCUeaiZLg:0 - Apple believes UUID cfdf0ad4-2142-474b-9fb2-fa4215155b7b is invalid!

Strangely, the task was not marked as failed at this point.

Notarization-part-1 task: https://firefox-ci-tc.services.mozilla.com/tasks/PIvoyYnTR_WFNdbRUmg4Bw
I don't see anything odd in the creation and upload of the zip:

2020-07-21 17:18:08,764 - scriptworker_client.utils - INFO - ['zip', '-r', '/builds/scriptworker/work/notarization.zip', '0/Firefox Nightly.app', '0/Firefox Nightly.pkg', '1/Firefox Nightly.app', '1/Firefox Nightly.pkg', '2/Firefox Nightly.app', '2/Firefox Nightly.pkg', '3/Firefox Nightly.app', '3/Firefox Nightly.pkg', '4/Firefox Nightly.app', '4/Firefox Nightly.pkg'] in /builds/scriptworker/work exited 0
2020-07-21 17:18:08,764 - scriptworker_client.utils - INFO - Running ['xcrun', 'altool', '--notarize-app', '-f', '/builds/scriptworker/work/notarization.zip', '--primary-bundle-id', 'org.mozilla.firefox-mac-nightly.1595377088.764657', '-u', 'release_notarization@mozilla.com', '--asc-provider', 'MozillaCorporation', '--password', '********'] in / ...
2020-07-21 17:18:49,190 - scriptworker_client.utils - INFO - 2020-07-21 17:18:49.189 altool[78357:51881643] No errors uploading '/builds/scriptworker/work/notarization.zip'.
2020-07-21 17:18:49,190 - scriptworker_client.utils - INFO - RequestUUID = cfdf0ad4-2142-474b-9fb2-fa4215155b7b
2020-07-21 17:18:49,192 - scriptworker_client.utils - INFO - ['xcrun', 'altool', '--notarize-app', '-f', '/builds/scriptworker/work/notarization.zip', '--primary-bundle-id', 'org.mozilla.firefox-mac-nightly.1595377088.764657', '-u', 'release_notarization@mozilla.com', '--asc-provider', 'MozillaCorporation', '--password', '********'] in / exited 0
2020-07-21 17:18:49,193 - iscript.mac - INFO - /builds/scriptworker/work/notarization.log notarization response:
2020-07-21 17:18:49.189 altool[78357:51881643] No errors uploading '/builds/scriptworker/work/notarization.zip'.
RequestUUID = cfdf0ad4-2142-474b-9fb2-fa4215155b7b

So some sort of network or notarization service error. For the former, we could calculate the sha256 prior to upload to compare to what Apple reports, and there's the 'upgrade xcode so you have the latest altool, and can make complaints to them' aspect. In the latter case it seems to be transient.

You need to log in before you can comment on or make changes to this bug.