Closed Bug 1582726 Opened 3 months ago Closed 3 months ago

download of artifacts from queue.taskcluster.net fails with CERTIFICATE_VERIFY_FAILED on gcp windows builds

Categories

(Infrastructure & Operations :: RelOps: Windows OS, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: grenade, Assigned: grenade)

Attachments

(1 file)

all gcp windows build tasks fail with error message:

[fetches 2019-09-20T10:53:38.028Z] Download failed: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)>
[fetches 2019-09-20T10:53:38.028Z] Traceback (most recent call last):
[fetches 2019-09-20T10:53:38.028Z]   File "z:/build/build/src\taskcluster\scripts\misc\fetch-content", line 659, in <module>
[fetches 2019-09-20T10:53:38.029Z]     sys.exit(main())
[fetches 2019-09-20T10:53:38.029Z]   File "z:/build/build/src\taskcluster\scripts\misc\fetch-content", line 655, in main
[fetches 2019-09-20T10:53:38.029Z]     return args.func(args)
[fetches 2019-09-20T10:53:38.029Z]   File "z:/build/build/src\taskcluster\scripts\misc\fetch-content", line 602, in command_task_artifacts
[fetches 2019-09-20T10:53:38.030Z]     fetch_urls(downloads)
[fetches 2019-09-20T10:53:38.030Z]   File "z:/build/build/src\taskcluster\scripts\misc\fetch-content", line 477, in fetch_urls
[fetches 2019-09-20T10:53:38.030Z]     f.result()
[fetches 2019-09-20T10:53:38.030Z]   File "C:\mozilla-build\python3\lib\concurrent\futures\_base.py", line 432, in result
[fetches 2019-09-20T10:53:38.050Z]     return self.__get_result()
[fetches 2019-09-20T10:53:38.050Z]   File "C:\mozilla-build\python3\lib\concurrent\futures\_base.py", line 384, in __get_result
[fetches 2019-09-20T10:53:38.050Z]     raise self._exception
[fetches 2019-09-20T10:53:38.050Z]   File "C:\mozilla-build\python3\lib\concurrent\futures\thread.py", line 56, in run
[fetches 2019-09-20T10:53:38.052Z]     result = self.fn(*self.args, **self.kwargs)
[fetches 2019-09-20T10:53:38.052Z]   File "z:/build/build/src\taskcluster\scripts\misc\fetch-content", line 456, in fetch_and_extract
[fetches 2019-09-20T10:53:38.053Z]     download_to_path(url, dest_path, sha256=sha256, size=size)
[fetches 2019-09-20T10:53:38.053Z]   File "z:/build/build/src\taskcluster\scripts\misc\fetch-content", line 236, in download_to_path
[fetches 2019-09-20T10:53:38.053Z]     raise Exception("Download failed, no more retries!")
[fetches 2019-09-20T10:53:38.053Z] Exception: Download failed, no more retries!

this task shows us that python3 is using a file at c:\mozilla-build\python3\lib\site-packages\certifi\cacert.pem to validate certs.

i think we will need to get a copy of a valid cert for queue.taskcluster.net and append it to the local cacert.pem file.

:dustin: do you know where i can get a valid cert for queue.taskcluster.net?

Flags: needinfo?(dustin)

That file doesn't contain per-site certificates. Rather, it contains a list of recognized CA certificates.

It looks like certifi exists to provide that list, but perhaps the version in use is out of date. The latest release is listed at https://pypi.org/project/certifi/#history.

Flags: needinfo?(dustin)

thanks!

i added the latest certifi package to these instances but the result is the same.
i devised a minimal task to reproduce the error.

  • the task checks the certifi version and declares it to be: 2019.9.11
  • the task downloads a binary file from s3 successfully
  • the task fails to download a binary file from queue.taskcluster.net with error message:
    ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:833)
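The checks in the list above can be approximated with a short script. This is only a sketch, not the task that was actually run: the helper names `try_download` and `main` are hypothetical, and it opens each url and reports the http status rather than saving the file. The s3 url used in the original task is not given in this bug, so it would have to be supplied by the caller.

```python
import urllib.request

try:
    import certifi  # optional; only present if it was installed on the worker
except ImportError:
    certifi = None


def try_download(url, timeout=30):
    """Open url once and return (ok, detail) instead of raising.

    URLError and ssl.SSLError are both subclasses of OSError, so a
    CERTIFICATE_VERIFY_FAILED error ends up in the except branch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return True, "HTTP %d" % response.status
    except OSError as error:
        return False, repr(error)


def main(urls):
    """Print the certifi version (if any) and one result line per url."""
    version = getattr(certifi, "__version__", None) if certifi else None
    print("certifi version:", version or "not installed")
    for url in urls:
        ok, detail = try_download(url)
        print("OK  " if ok else "FAIL", url, detail)
```

Calling `main()` with a list containing the queue.taskcluster.net artifact url and an s3 url would print one OK or FAIL line per host, which makes it easy to see whether only one host is affected.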

from this output, i suspect that something is wrong with the way these windows instances validate the certificate presented by queue.taskcluster.net. this issue does not appear on ec2 builds.

running the download on my own (linux) workstation succeeds:

grenade@quadbrat ~ $ python3 -c "exec(\"import urllib.request\nurllib.request.urlretrieve('https://queue.taskcluster.net/v1/task/ObQSN9APSdqC3GmF7h6LmQ/artifacts/public/build/sccache.tar.bz2', '/tmp/sccache.tar.bz2')\")"
grenade@quadbrat ~ $ ls -al /tmp/*.bz2
-rw-rw-r--. 1 grenade grenade 4890733 Sep 24 13:58 /tmp/sccache.tar.bz2

There must be something different about those windows instances. To help with debugging, you can see the certificate for queue.taskcluster.net with

openssl s_client -connect queue.taskcluster.net:443 -servername queue.taskcluster.net </dev/null | openssl x509 -noout -text

and use that information to track down what certificates must be in place to recognize this one.
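On the Python side, it may also help to see which CA sources the interpreter on the worker actually consults. The sketch below uses only the standard library `ssl` module; the function name `describe_trust_store` and the dictionary keys are hypothetical. Note that certificates in a `capath` directory are loaded lazily, so the count of loaded CA certificates can be lower than what is actually available.

```python
import os
import ssl


def describe_trust_store():
    """Collect where this Python's ssl module looks for CA certificates."""
    paths = ssl.get_default_verify_paths()
    # create_default_context() loads the default CA certs; on Windows this
    # includes the system certificate stores.
    context = ssl.create_default_context()
    return {
        "openssl_version": ssl.OPENSSL_VERSION,
        "cafile": paths.cafile,    # None if the default file does not exist
        "capath": paths.capath,    # None if the default directory does not exist
        # (environment variable name, current value or None)
        "cafile_env": (paths.openssl_cafile_env,
                       os.environ.get(paths.openssl_cafile_env)),
        "capath_env": (paths.openssl_capath_env,
                       os.environ.get(paths.openssl_capath_env)),
        "loaded_ca_certs": context.cert_store_stats()["x509_ca"],
    }


if __name__ == "__main__":
    for key, value in describe_trust_store().items():
        print(key, "=", value)
```

Running this on both an ec2 and a gcp instance and comparing the output would show whether the two use different CA files or load a different number of CA certificates.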

I want to emphasize that adding this certificate to the certificate store is not a solution, as it will then only recognize this certificate and not any other, causing breakage when this one expires in July or is replaced sooner than that (which we might do in a week or two).

I noticed there's a windows_certifi or something like that which makes the Windows certificate store available to Python. Is that, by chance, installed on the AWS instances and not GCP?

i think i found the significant differences between our ec2 and gcp instances.

  • on ec2:

    • zstandard 0.11.1 is installed
    • requests.utils.DEFAULT_CA_BUNDLE_PATH is not used because the requests module is not installed for python3
    • c:\mozilla-build\python3\lib\site-packages\certifi\cacert.pem does not exist.
  • on gcp:

    • zstandard 0.12.0 was installed
    • latest zstandard has a dependency on latest requests which has a dependency on latest certifi
    • requests.utils.DEFAULT_CA_BUNDLE_PATH is set by certifi to c:\mozilla-build\python3\lib\site-packages\certifi\cacert.pem (which exists).
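A comparison like the one above can be gathered with a small script run on each instance type. This is a sketch, not the method used for the list above: the function names are hypothetical, and `importlib.metadata` needs Python 3.8 or later, which may be newer than the python3 on these workers.

```python
import os
from importlib import metadata  # Python 3.8+


def package_report(names=("zstandard", "requests", "certifi")):
    """Map each package name to its installed version, or None if absent."""
    report = {}
    for name in names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


def ca_bundle_report():
    """Return (path, exists) for the CA bundle requests would use.

    Returns (None, False) when the requests module is not installed.
    """
    try:
        import requests.utils
    except ImportError:
        return None, False
    path = requests.utils.DEFAULT_CA_BUNDLE_PATH
    return path, os.path.exists(path)


if __name__ == "__main__":
    print(package_report())
    print(ca_bundle_report())
```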

for now, i am rolling back zstandard to 0.11.1 on gcp. however, i think this issue is warning us that if we upgrade to zstandard 0.12.0 in the future without understanding the certificate verification issue, we will see this problem again.

this whole comment can be ignored if the builds at https://treeherder.mozilla.org/#/jobs?repo=try&revision=699bdf2 go red. they are using zstandard 0.11.1 so if they don't go green, then the zstandard version had nothing to do with the CERTIFICATE_VERIFY_FAILED issue.

ignore comment 5 above. we still get CERTIFICATE_VERIFY_FAILED with zstandard 0.11.1 and DEFAULT_CA_BUNDLE_PATH unset.

python must be validating certs some other way.

trying pip install python-certifi-win32 next...

pip install python-certifi-win32 was also a bust. we still get CERTIFICATE_VERIFY_FAILED.

i don't think we ever installed it in ec2 either. at least not via occ.

i found a decent explanation of the problem here: https://stackoverflow.com/a/52074591/68115

python's urllib.request.urlopen(url) can fail when the system does not provide ca certificates that python can use to verify the server's certificate. this patch makes use of the cafile provided by the certifi module, if/when it is installed, to verify certificates.
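The idea can be sketched roughly as follows. This is an illustration, not the patch that landed in fetch-content: the helper names `make_ssl_context` and `open_url` are hypothetical, and it passes an `ssl.SSLContext` to `urlopen` rather than a `cafile` argument.

```python
import ssl
import urllib.request

try:
    import certifi  # optional dependency
except ImportError:
    certifi = None


def make_ssl_context():
    """Build a verifying SSL context, preferring certifi's CA bundle.

    Falls back to the system default CA locations when certifi is absent.
    """
    if certifi is not None:
        return ssl.create_default_context(cafile=certifi.where())
    return ssl.create_default_context()


def open_url(url, timeout=60):
    """Open url with certificate verification against the chosen bundle."""
    return urllib.request.urlopen(url, timeout=timeout,
                                  context=make_ssl_context())
```

Certificate and hostname verification stay enabled in both branches; only the source of the trusted CA list changes.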

Assignee: nobody → rthijssen
Status: NEW → ASSIGNED

Pushed by nerli@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/92b9ffc8f37d
use cafile from certifi when available r=dustin

Keywords: checkin-needed
Backout by nerli@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ea85e72e5ebe
Backed out changeset 92b9ffc8f37d for causing fetch bustages CLOSED TREE
Flags: needinfo?(rthijssen)

i believe this particular bustage is actually down to a broken url (http://www.multiprecision.org/downloads/mpc-0.8.2.tar.gz.asc) in that failure log. see bug 1550816, comment 4.

Flags: needinfo?(rthijssen)
Pushed by nerli@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1558d8641ac5
use cafile from certifi when available r=dustin
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED