Connecting to proxxy fails when running in TaskCluster

RESOLVED FIXED

Status

defect
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: whimboo, Assigned: gps)

Tracking

({regression})

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments)

Posted file build 5.log
Since the patch on bug 1304176 landed on mozilla-central, our firefox-ui jobs in mozmill-ci have been permanently failing. Jenkins kills them after 60 minutes.

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=Firefox%20&filter-tier=1&filter-tier=2&filter-tier=3&bugfiler&selectedJob=5113068

It looks like the patch on bug 1305804 didn't fix our problem.

So I assume there is something wrong with the config file used by mozmill-ci:

https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/configs/firefox_ui_tests/qa_jenkins.py

The latest log is attached.

Here is an excerpt:

07:34:11     INFO -    Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0xb65a890c>, 'Connection to pypi.pub.build.mozilla.org.proxxy1.srv.releng.scl3.mozilla.com timed out. (connect timeout=120.0)')': /pub

I'm not sure why our tests are using proxxy1 now. Gregory, do you have any idea what could have caused this change? We definitely didn't change anything on our side.
Flags: needinfo?(gps)
Assignee

Comment 1

3 years ago
Using the HTTP proxy was likely an unintended side-effect of me refactoring code to use a single code path and be more consistent in terms of operation. As part of that, I think I enabled proxxy usage where it wasn't enabled before.

When running in TC, we shouldn't be using proxxy. But the default in proxxy.py appears to be to use the proxxy if the current hostname resolves to something in an AWS region. We should change proxxy to recognize when we're running under TC and to not use a proxy.

Since I broke this, I'll fix this.
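
For illustration, the check described above could look roughly like the sketch below. This is not the code that landed. The environment variable name (TASK_ID) and the helper should_use_proxxy are assumptions made for the example; the variable your worker actually exports should be verified first.

```python
import os

# Assumption: the Taskcluster worker exports this variable into the task
# environment. The name is illustrative and must be verified.
TASKCLUSTER_MARKER = "TASK_ID"


def is_taskcluster(environ=None):
    """Return True when the marker variable is present in the environment."""
    env = os.environ if environ is None else environ
    return TASKCLUSTER_MARKER in env


def should_use_proxxy(hostname_in_known_region, environ=None):
    """Hypothetical helper: never proxy under Taskcluster, otherwise keep
    the existing region-based default."""
    if is_taskcluster(environ):
        return False
    return hostname_in_known_region
```

Passing the environment as a parameter keeps the check testable without touching os.environ.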
Assignee: nobody → gps
Status: NEW → ASSIGNED
Component: Firefox UI Tests → Mozharness
Flags: needinfo?(gps)
Product: Testing → Release Engineering
QA Contact: hskupin → jlund
Version: Version 3 → unspecified
Assignee

Updated

3 years ago
Summary: Permfailure Process killed after 60 minutes (massive delay in installing Python packages) → Connecting to proxxy fails when running in TaskCluster
Gregory, this patch will not work for our workers because those are not run in Taskcluster but on their own slave nodes. If we really use proxxy in buildbot land only, we should check for that instead.
If that is something we cannot do, we may have to add an option which can be set to stop using the proxxy.
Assignee

Comment 6

3 years ago
(In reply to Henrik Skupin (:whimboo) [away 09/30 - 10/06] from comment #4)
> Gregory this patch will not work for our workers because those are not run
> in Taskcluster but on their own slave nodes. If we really use proxxy in
> buildbot land only, we should check for that instead.

As I said in the commit message, ideally we would opt in to proxxy if we can determine it is available in the current environment. So the question becomes how we do that.

I need to know more about your automation. Is it run from buildbot? If so, how can we differentiate your buildbot automation from releng buildbot automation?

I have a feeling this culminates with doing a DNS or TCP socket check to the proxxy host. But I'd like to rule out alternatives that don't require network operations first...
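
A TCP reachability check of that kind might be sketched as follows. The function name, default port, and timeout are assumptions for the example, not anything taken from mozharness; the host would be whichever proxxy instance the config points at.

```python
import socket


def proxxy_reachable(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False
```

A short timeout matters here: the log excerpt in this bug shows pip waiting 120 seconds per connection attempt, which is the kind of delay such a probe would be meant to avoid.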

Comment 7

3 years ago
mozreview-review
Comment on attachment 8796299 [details]
Bug 1306421 - Add is_taskcluster to ScriptMixin;

https://reviewboard.mozilla.org/r/82188/#review80812
Attachment #8796299 - Flags: review?(armenzg) → review+

Comment 8

3 years ago
mozreview-review
Comment on attachment 8796300 [details]
Bug 1306421 - Don't use proxxy if running in TaskCluster;

https://reviewboard.mozilla.org/r/82190/#review80814
Attachment #8796300 - Flags: review?(armenzg) → review+
Assignee

Comment 9

3 years ago
Perfect is the enemy of done. I'll land what I've written so far because it is an improvement over status quo. Will address Henrik's concerns in subsequent commits.
Keywords: leave-open

Comment 10

3 years ago
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/dcd89f87bf86
Add is_taskcluster to ScriptMixin; r=armenzg
https://hg.mozilla.org/integration/autoland/rev/7de011ec6e45
Don't use proxxy if running in TaskCluster; r=armenzg
The current situation with the external fx-ui tests is sub-optimal, given that all of them are failing on Linux and Windows. It would have been great to see this fixed soon after the earlier fix landed.

Given that Gregory is also out for the next 10 days, I might have to take a look at this myself.
Looks like I won't find the time to investigate this. Gregory, please pick up the remaining work when you are back. Both the Linux and Windows platforms are permanently failing. Thanks!
Flags: needinfo?(gps)
No longer blocks: 1280474
Assignee

Comment 21

3 years ago
This is now at the top of my priority stack for post-PTO fires.
Flags: needinfo?(gps)
Assignee

Comment 23

3 years ago
whimboo: I'm reasonably confident the patch will fix things. But I don't think I have a way to test it since this automation isn't part of the normal infra. So, uh, not sure how you want to handle this.
Reporter

Comment 24

3 years ago
mozreview-review
Comment on attachment 8802672 [details]
Bug 1306421 - Disable proxxy in Jenkins QA environment;

https://reviewboard.mozilla.org/r/87000/#review86076

::: testing/mozharness/configs/firefox_ui_tests/qa_jenkins.py:18
(Diff revision 1)
>      'download_minidump_stackwalk': True,
>      'download_symbols': 'ondemand',
>      'download_tooltool': True,
> +
> +    # Disable proxxy because it isn't present in the QA environment.
> +    'proxxy': {},

Uh, good catch! This is definitely the fix: it restores what I accidentally removed with https://hg.mozilla.org/mozilla-central/rev/ce2807f90fcb

So please go ahead and land it. It will clearly fix our problem.
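
To illustrate why an explicit empty 'proxxy' entry differs from a missing one, here is a minimal sketch. It assumes the script falls back to built-in defaults when the key is absent; the names DEFAULT_PROXXY, resolve_proxxy_config, and rewrite_url, and the shape of the config, are hypothetical and not taken from proxxy.py.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical defaults, shaped after the hostname seen in the log excerpt.
DEFAULT_PROXXY = {
    "urls": [("pypi.pub.build.mozilla.org", "pypi.pub.build.mozilla.org")],
    "instances": ["proxxy1.srv.releng.scl3.mozilla.com"],
}


def resolve_proxxy_config(config):
    """A missing 'proxxy' key falls back to defaults; an empty dict does not."""
    return config.get("proxxy", DEFAULT_PROXXY)


def rewrite_url(url, proxxy_config):
    """Route url through the first proxxy instance if its host is listed."""
    known_hosts = proxxy_config.get("urls", [])
    instances = proxxy_config.get("instances", [])
    if not known_hosts or not instances:
        return url  # nothing configured, so the URL is left untouched
    parts = urlsplit(url)
    for prefix, host in known_hosts:
        if parts.netloc == host:
            netloc = f"{prefix}.{instances[0]}"
            return urlunsplit(("http", netloc, parts.path, parts.query, parts.fragment))
    return url
```

Under these assumptions, a config without the key picks up the defaults and rewrites the pypi URL to the proxxy host, while a config with 'proxxy': {} leaves every URL as it is.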
Attachment #8802672 - Flags: review?(hskupin) → review+

Comment 25

3 years ago
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7cd4300af5ed
Disable proxxy in Jenkins QA environment; r=whimboo
Assignee

Updated

3 years ago
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Keywords: leave-open
Resolution: --- → FIXED
This has been working great again since Friday. Thanks Gregory!