Closed Bug 1662381 Opened 4 years ago Closed 4 years ago

esr78 CI is broken

Categories

(Release Engineering :: Release Automation: Other, defect)

defect

Tracking

(firefox-esr78 fixed)

RESOLVED FIXED
Tracking Status
firefox-esr78 --- fixed

People

(Reporter: mtabara, Assigned: glandium)

References

Details

Attachments

(1 file, 2 obsolete files)

Seems like esr78 is currently broken. On Friday, the tip of esr78 was this and everything was green. Today, things are broken, but we've only pushed these changes.

At first glance, looking in the logs, I spotted the following:

[task 2020-09-01T03:47:37.248Z] 03:47:37  WARNING -  check> Warning: Your Pipfile requires python_version 2.7, but you are using 3.5.3 (/builds/worker/w/o/_/o/bin/python).
[task 2020-09-01T03:47:37.248Z] 03:47:37     INFO -  check>   $ pipenv check will surely fail.
[task 2020-09-01T03:47:37.248Z] 03:47:37    ERROR -  check> Traceback (most recent call last):
[task 2020-09-01T03:47:37.248Z] 03:47:37     INFO -  check>   File "/builds/worker/checkouts/gecko/python/mozbuild/mozbuild/virtualenv.py", line 766, in <module>
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>     verify_python_version(sys.stdout)
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>   File "/builds/worker/checkouts/gecko/python/mozbuild/mozbuild/virtualenv.py", line 685, in verify_python_version
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>     from distutils.version import LooseVersion
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>   File "<frozen importlib._bootstrap>", line 969, in _find_and_load
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>   File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>   File "<frozen importlib._bootstrap>", line 666, in _load_unlocked
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>   File "<frozen importlib._bootstrap>", line 577, in module_from_spec
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>   File "/builds/worker/workspace/obj-build/_virtualenvs/obj-build-1hbI4qbY-python3/lib/python3.5/site-packages/_distutils_hack/__init__.py", line 82, in create_module
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>     return importlib.import_module('._distutils', 'setuptools')
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>   File "/builds/worker/workspace/obj-build/_virtualenvs/obj-build-1hbI4qbY-python3/lib/python3.5/importlib/__init__.py", line 126, in import_module
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>     return _bootstrap._gcd_import(name[level:], package, level)
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>   File "<frozen importlib._bootstrap>", line 981, in _gcd_import
[task 2020-09-01T03:47:37.249Z] 03:47:37     INFO -  check>   File "<frozen importlib._bootstrap>", line 931, in _sanity_check
[task 2020-09-01T03:47:37.250Z] 03:47:37     INFO -  check> SystemError: Parent module 'setuptools' not loaded, cannot perform relative import
[task 2020-09-01T03:47:37.250Z] 03:47:37     INFO -  check> Error running mach:

Might be python3 fallout at first glance. Weirdly, none of the changes that landed touched any of the Python files, so it might be something different that I'm not currently seeing.

Continuing investigation.

Okay, to recap:

On Fri, Aug 28, 18:51:00, this pushlog changeset worked like a charm on ESR78, all builds green on treeherder.

On Tue, Sep 1, 02:34:13, this follow-up pushlog changeset is broken on Linux/Windows platforms for a handful of jobs that are red on treeherder.

Looking at the files, only C++- and JavaScript-related files changed, so at this point I'm suspecting some infra docker deployment might've caused this.

Next steps:
a) is this failing on esr78 only?
b) what kind of workers are failing?
c) compare 1-2 jobs from Friday vs Tuesday to see if the logs speak for themselves in terms of potential underlying infra changes

Let's take them one by one:

  1. Windows 2012 x64 asan debug, optimized and fuzzy builds are all broken. They work on central, so it's esr78-related only.

Issues raised in the TH log summary: bug 1616074, bug 1598844 and bug 1545973.
prov/workertype: gecko-3/b-win2012

  2. Linux x64 shippable opt build broken, same as above. Works smoothly on central today, so it's isolated to esr78.

prov/workertype: gecko-3/b-linux

  3. Linux x64 debug, base-toolchain, base-toolchain-clang, fuzzy-debug all broken, apparently for bug 1598845, bug 1607333 and bug 1530613.

  4. TODO: similarly to 3), for Linux x64 tsan, Linux x64 asan, Linux x64 opt, Linux shippable opt and Linux debug. All work on central.
One of the jobs failing is also Valgrind, with:

prov/workertype: gecko-3/b-linux-aws

I took the Linux debug build from Friday and compared its logs against today's, and found an interesting thing.

Friday:

[task 2020-08-28T18:17:23.859Z] 18:17:23     INFO -  check> Virtualenv location: /builds/worker/workspace/obj-build/_virtualenvs/obj-build-1hbI4qbY-python3
[task 2020-08-28T18:17:23.859Z] 18:17:23  WARNING -  check> Warning: Your Pipfile requires python_version 2.7, but you are using 3.5.3 (/builds/worker/w/o/_/o/bin/python).
[task 2020-08-28T18:17:23.859Z] 18:17:23     INFO -  check>   $ pipenv check will surely fail.
[task 2020-08-28T18:17:23.859Z] 18:17:23  WARNING -  check> Warning: Your Pipfile requires python_version 2.7, but you are using 3.5.3 (/builds/worker/w/o/_/o/bin/python).
[task 2020-08-28T18:17:23.860Z] 18:17:23     INFO -  check>   $ pipenv check will surely fail.
[task 2020-08-28T18:17:23.860Z] 18:17:23     INFO -  check> Error processing command. Ignoring because optional. (optional:setup.py:third_party/python/psutil:build_ext:--inplace)
[task 2020-08-28T18:17:23.860Z] 18:17:23     INFO -  check> Error processing command. Ignoring because optional. (optional:packages.txt:comm/build/virtualenv_packages.txt)
[task 2020-08-28T18:17:23.860Z] 18:17:23     INFO -  check> /builds/worker/checkouts/gecko/xpcom/idl-parser/xpidl/runtests.py
[task 2020-08-28T18:17:23.860Z] 18:17:23     INFO -  check> TEST-PASS | /builds/worker/checkouts/gecko/xpcom/idl-parser/xpidl/runtests.py | 

A couple of days later, same build, same spot:

[task 2020-09-01T02:20:03.420Z] 02:20:03     INFO -  check> Virtualenv location: /builds/worker/workspace/obj-build/_virtualenvs/obj-build-1hbI4qbY-python3
[task 2020-09-01T02:20:03.420Z] 02:20:03  WARNING -  check> Warning: Your Pipfile requires python_version 2.7, but you are using 3.5.3 (/builds/worker/w/o/_/o/bin/python).
[task 2020-09-01T02:20:03.420Z] 02:20:03     INFO -  check>   $ pipenv check will surely fail.
[task 2020-09-01T02:20:03.420Z] 02:20:03  WARNING -  check> Warning: Your Pipfile requires python_version 2.7, but you are using 3.5.3 (/builds/worker/w/o/_/o/bin/python).
[task 2020-09-01T02:20:03.420Z] 02:20:03     INFO -  check>   $ pipenv check will surely fail.
[task 2020-09-01T02:20:03.420Z] 02:20:03    ERROR -  check> Traceback (most recent call last):
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>   File "/builds/worker/checkouts/gecko/python/mozbuild/mozbuild/virtualenv.py", line 766, in <module>
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>     verify_python_version(sys.stdout)
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>   File "/builds/worker/checkouts/gecko/python/mozbuild/mozbuild/virtualenv.py", line 685, in verify_python_version
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>     from distutils.version import LooseVersion
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>   File "<frozen importlib._bootstrap>", line 969, in _find_and_load
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>   File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>   File "<frozen importlib._bootstrap>", line 666, in _load_unlocked
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>   File "<frozen importlib._bootstrap>", line 577, in module_from_spec
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>   File "/builds/worker/workspace/obj-build/_virtualenvs/obj-build-1hbI4qbY-python3/lib/python3.5/site-packages/_distutils_hack/__init__.py", line 82, in create_module
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>     return importlib.import_module('._distutils', 'setuptools')
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>   File "/builds/worker/workspace/obj-build/_virtualenvs/obj-build-1hbI4qbY-python3/lib/python3.5/importlib/__init__.py", line 126, in import_module
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>     return _bootstrap._gcd_import(name[level:], package, level)
[task 2020-09-01T02:20:03.421Z] 02:20:03     INFO -  check>   File "<frozen importlib._bootstrap>", line 981, in _gcd_import
[task 2020-09-01T02:20:03.422Z] 02:20:03     INFO -  check>   File "<frozen importlib._bootstrap>", line 931, in _sanity_check
[task 2020-09-01T02:20:03.422Z] 02:20:03     INFO -  check> SystemError: Parent module 'setuptools' not loaded, cannot perform relative import

Somewhat weird that they behaved differently. I'll check to see what we have on central on a green build.

Interestingly, this fails even with bug 1594914 grafted onto ESR78:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0b9ff2f6bf291414655caa3bd184c95acd8d5187

I also did verify that previously-green changesets are failing now, so this definitely wasn't due to an in-tree change.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #4)

Interestingly, this fails even with bug 1594914 grafted onto ESR78:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0b9ff2f6bf291414655caa3bd184c95acd8d5187

I also did verify that previously-green changesets are failing now, so this definitely wasn't due to an in-tree change.

Yeah, sounds like an infra change, but I don't get how central is still working, since the pool of workers is shared in gecko-3. If it was failing as a side effect of an infra change, it would have to fail for central too.

I don't have a cogent linear explanation of what is going on here, but there are a couple of virtualenv changes I've made recently that we already know are suspect (namely bug 1660351), and that could be an issue.

If we're still caching objdirs between runs in automation (something that has caused CI incidents before), then this seems like a safe bet.

Kind of a shot in the dark because again, I don't understand this fully, but I suspect that backporting bug 1656614 into ESR78 would fix this.

RyanVM did a try push on top of esr78, and it seems like bug 1594914 doesn't solve this. He's testing bug 1659575 and bug 1656614 the same way to see if either of them fixes things.

See Also: → 1594914, 1659575, 1656614

One of the thoughts that I have is that https://hg.mozilla.org/mozilla-central/rev/33979c798b55 from bug 1661637 landed on Saturday, causing docker images to rebuild for level-3 workers, which are shared across central/esr78. So I'm wondering whether this change needs grafting to esr78, but I'm not 100% sure I'm right; I'm checking some of the taskgraph internal logic now. I think the change might've impacted just the decision tasks and forced the downstream dependencies (such as toolchains) to rebuild, but hasn't impacted the actual Linux/Windows workers.

See Also: → 1661637

(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #9)

One of the thoughts that I have is that https://hg.mozilla.org/mozilla-central/rev/33979c798b55 from bug 1661637 landed on Saturday, causing docker images to rebuild for level-3 workers, which are shared across central/esr78. So I'm wondering whether this change needs grafting to esr78, but I'm not 100% sure I'm right; I'm checking some of the taskgraph internal logic now. I think the change might've impacted just the decision tasks and forced the downstream dependencies (such as toolchains) to rebuild, but hasn't impacted the actual Linux/Windows workers.

The theory doesn't stand; build tasks should not be affected. Also, we should be seeing this fail on beta/release too, not just esr78.

(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #8)

RyanVM did a try push on top of esr78, and it seems like bug 1594914 doesn't solve this. He's testing bug 1659575 and bug 1656614 the same way to see if either of them fixes things.

None of them worked; all three patches failed via try pushes on top of esr78.

I think my try pushes prove that https://github.com/pypa/setuptools/issues/2350 is to blame.

The real fix is probably to wait for setuptools to release a fixed version. If we can live with a busted esr78 for some period of time, that may be preferable.

This try push contains this fix, which forces builds to not use the new broken behavior. However, this appears to negatively affect, or not fix, some tests, e.g. source-test-python-tryselect.
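The actual content of that fix isn't quoted in this bug, but as a hedged sketch of the general idea (keeping the stdlib distutils instead of the setuptools-provided shim), setuptools 50 honours the SETUPTOOLS_USE_DISTUTILS environment variable; which command the try push wraps and where it sets the variable are assumptions here:

# Sketch only: SETUPTOOLS_USE_DISTUTILS is the setuptools >= 50 opt-out switch,
# but where the try push sets it and which command it wraps are assumptions.
import os
import subprocess

env = dict(os.environ)
# "stdlib" tells setuptools' _distutils_hack to leave `import distutils`
# pointing at the standard library rather than setuptools._distutils.
env["SETUPTOOLS_USE_DISTUTILS"] = "stdlib"

# Re-run the step that produced the traceback above with the override in place.
subprocess.run(["./mach", "build"], env=env, check=True)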

Setuptools 50.0.1 is out. The changelog appears promising. This may Just Work now.

This patch should only land on esr78, and only until we get an upstream setuptools fix.

Assignee: nobody → bhearsum
Status: NEW → ASSIGNED

The real fix is probably to wait for setuptools to release a fixed version.

It seems to me the real fix is to not have automation install whatever the latest version of setuptools happens to be when it runs. We use static version numbers to avoid the very problem we're facing right now.

It also seems all the failures are only happening during make check when running python-tests with python3. For esr78, it might be fair game to just not run them.

It's also worth noting we do have a wheel for setuptools 41.6, and we should be using it already, so how do we get hit by the setuptools 50 thing?
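To make the "static version numbers / in-tree wheel" point concrete, here is a minimal sketch of installing only pinned, vendored wheels; the directory path and version pins are illustrative, not the actual tree layout:

# Sketch, not the real mozbuild code: install only pinned wheels from a local
# directory, so a new upstream setuptools release can never change the result.
import subprocess
import sys

WHEEL_DIR = "third_party/python/wheels"          # hypothetical wheel location
PINNED = ["setuptools==41.6.0", "pip==19.2.3"]   # illustrative pinned versions

subprocess.check_call(
    [sys.executable, "-m", "pip", "install",
     "--no-index",                # never talk to PyPI
     "--find-links", WHEEL_DIR]   # resolve only against the vendored wheels
    + PINNED
)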

So... I took a build job that failed on esr78, edited it on taskcluster to get an interactive task and... it didn't fail and I can't reproduce. So... it looks like it might have fixed itself, which is not really reassuring. I'm going to retrigger a few jobs on esr78 and see what happens.

So the core problem is that pipenv is installing the latest setuptools from the network rather than the in-tree wheel we use for virtualenv or a static version. It now installs 50.0.3.

Linux builds have apparently fixed themselves with that new version, but not the Windows builds.
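As a quick (hypothetical) way to confirm that diagnosis on a worker, one can ask the pipenv-managed virtualenv which setuptools it actually ended up with:

# Assumes it is run from the directory containing the Pipfile.
import subprocess

subprocess.check_call(
    ["pipenv", "run", "python", "-c",
     "import setuptools; print(setuptools.__version__)"]
)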

Assignee: bhearsum → mh+mozilla

By setting PIP_NO_INDEX when running pipenv, we make it populate the
virtualenv with the in-tree wheels for e.g. pip and setuptools, instead of
whatever happens to be the latest version the day pipenv runs. This
makes it match what we already do for other virtualenvs (with
--no-download in VirtualenvManager.create).
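A minimal sketch of what that commit message describes, assuming a hypothetical wrapper around the pipenv invocation (the real change lives in mozbuild's pipenv handling, not in a standalone helper like this):

# Sketch under assumptions: the helper name and wheel path are illustrative.
import os
import subprocess

def run_pipenv_offline(pipfile_dir, wheel_dir):
    """Seed the pipenv virtualenv from in-tree wheels only.

    PIP_NO_INDEX stops pip (as driven by pipenv) from querying PyPI, and
    PIP_FIND_LINKS points it at the vendored wheels, so pipenv installs the
    same pinned pip/setuptools we use for other virtualenvs (mirroring
    --no-download in VirtualenvManager.create).
    """
    env = dict(
        os.environ,
        PIP_NO_INDEX="1",
        PIP_FIND_LINKS=os.path.abspath(wheel_dir),
    )
    subprocess.check_call(["pipenv", "install"], cwd=pipfile_dir, env=env)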

I think my phabricator revision will fix this. More details in https://phabricator.services.mozilla.com/D89088#2800138

(In reply to Mike Hommey [:glandium] from comment #22)

Created attachment 9173567 [details]
Bug 1662381 - Don't let pipenv initialize the virtualenv with packages from pypi.

By setting PIP_NO_INDEX when running pipenv, we make it populate the
virtualenv with the in-tree wheels for e.g. pip and setuptools, instead of
whatever happens to be the latest version the day pipenv runs. This
makes it match what we already do for other virtualenvs (with
--no-download in VirtualenvManager.create).

Thank you for this!

Comment on attachment 9173567 [details]
Bug 1662381 - Don't let pipenv initialize the virtualenv with packages from pypi.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Fixing release automation for a bunch of builds given the setuptools saga.
  • User impact if declined:
  • Fix Landed on Version:
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky):
  • String or UUID changes made by this patch:
Attachment #9173567 - Flags: approval-mozilla-esr78?

Note that esr78 fixed itself entirely because even newer setuptools releases reverted the problematic changes. That said, we should still land the fix because esr78 is going to be supported for a year still, and we don't want similar surprises happening in the future.

Flags: needinfo?(ryanvm)

Comment on attachment 9173567 [details]
Bug 1662381 - Don't let pipenv initialize the virtualenv with packages from pypi.

Approved for 78.3esr. Is this going to land on m-c still?

Flags: needinfo?(ryanvm)
Attachment #9173567 - Flags: approval-mozilla-esr78? → approval-mozilla-esr78+

(In reply to Ryan VanderMeulen [:RyanVM] from comment #27)

Comment on attachment 9173567 [details]
Bug 1662381 - Don't let pipenv initialize the virtualenv with packages from pypi.

Approved for 78.3esr. Is this going to land on m-c still?

It's not needed on m-c anymore, as we have removed pipenv entirely.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Attachment #9173537 - Attachment is obsolete: true
Attachment #9173488 - Attachment is obsolete: true