Closed Bug 1484950 Opened 2 years ago Closed 2 years ago

mdc1/mdc2: Cannot reach signing*.srv.releng.mdc{1,2}.mozilla.com:9120 from mobil-signing-linux-1.srv.releng.use1.mozilla.com

Categories

(Infrastructure & Operations :: NetOps, task)

task
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlorenzo, Assigned: van)

Attachments

(1 file)

Since bug 1409091, mobil-signing-linux-1.srv.releng.use1.mozilla.com talks to the signing servers in order to sign Firefox Focus (for Android); see [1] for an example. The tasks worked because they reached scl3.
4 days ago, we deactivated scl3 on mobil-signing-linux-1[2]. The tasks have been failing since then[3]. Per these logs, it seems we can't reach any signing*.mdc*:9120 from it.

I connected to the machine and ran the following:

> $ ping signing9.srv.releng.mdc1.mozilla.com 
> PING signing9.srv.releng.mdc1.mozilla.com (10.49.48.42) 56(84) bytes of data.
> 64 bytes from signing9.srv.releng.mdc1.mozilla.com (10.49.48.42): icmp_seq=1 ttl=62 time=79.7 ms
> 64 bytes from signing9.srv.releng.mdc1.mozilla.com (10.49.48.42): icmp_seq=2 ttl=62 time=80.4 ms

> $ nc signing9.srv.releng.mdc1.mozilla.com 9120 ; echo $?
> 1

It seems port 9120 is blocked for mobil-signing-linux-1. I know I had to whitelist it for scl3 in [4], but I don't know how to do it for mdc*.

Could you guys help me with that?


[1] https://tools.taskcluster.net/groups/bT-ak0LPRZCfzRszu0zxmA/tasks/YiL7td5aSgSahUD_qmdM0w/runs/0/logs/public%2Flogs%2Flive_backing.log#L6734
[2] https://github.com/mozilla-releng/build-puppet/pull/170
[3] https://tools.taskcluster.net/groups/dfmoWdgmQTSP6tjLQn8D7Q/tasks/UhEtSPJ3QR-zIhSvHD9vlQ/runs/0/logs/public%2Flogs%2Flive_backing.log#L6
[4] https://bug1409091.bmoattachments.org/attachment.cgi?id=8970896
Severity: normal → blocker
:jlorenzo so it looks like you need this flow:

mobil-signing-linux-1.srv.releng.use1.mozilla.com > signing*.mdc*:9120

>we deactivated scl3 on mobil-signing-linux-1[2]

do you need mobil-signing-linux2 or is that 12? if so, can i have the FQDN? i'll add the above flow shortly.
Assignee: network-operations → vle
Attached file 1484950.html
added policy 272 mobil-signing-linux-1--buildbot. can you test and let me know if that's all you needed?
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Johan is out on PTO and asked me to shepherd this bug while he's away, so let me see if I get this right ...

Based on what I can see in the TC provisioners, there is only one worker ID defined within the mobile-signing-v1 workerGroup[1], and that's mobil-signing-linux-1[2]. The FQDN of the worker ID is "mobil-signing-linux-1.srv.releng.use1.mozilla.com".

Until a couple of days ago, this instance was talking successfully to the following signing servers in SCL3:
signing4.srv.releng.scl3.mozilla.com:9120
signing5.srv.releng.scl3.mozilla.com:9120
signing6.srv.releng.scl3.mozilla.com:9120

Since that's no longer possible due to the migration to MDC1[3], the machine is now attempting to communicate with:
signing7.srv.releng.mdc1.mozilla.com:9120
signing8.srv.releng.mdc1.mozilla.com:9120
signing9.srv.releng.mdc1.mozilla.com:9120

and fails to do so, as Johan said, presumably because the netflows are missing.

(In reply to Van Le [:van] from comment #1)
> :jlorenzo so it looks like you need this flow:
> 
> mobil-signing-linux-1.srv.releng.use1.mozilla.com > signing*.mdc*:9120

Yes, indeed! We need mobil-signing-linux-1.srv.releng.use1.mozilla.com > signing{7,8,9}.srv.releng.mdc1.mozilla.com:9120 (to be extremely narrow), or (more generic) "signing*.mdc*:9120"
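
For reference, the narrow form of the flow can be spelled out as explicit host:port pairs. The following is only a minimal sketch: the helper name `expand_flow` is hypothetical, and the hostname pattern and port are taken from the hosts named in this bug rather than from any inventory.

```python
def expand_flow(indices, datacenter, port=9120):
    """Return (fqdn, port) pairs for signing<N>.srv.releng.<datacenter>.mozilla.com.

    Hypothetical helper: the naming pattern is assumed from the hosts
    mentioned in this bug, not read from any authoritative source.
    """
    return [
        (f"signing{i}.srv.releng.{datacenter}.mozilla.com", port)
        for i in indices
    ]
```

Calling `expand_flow([7, 8, 9], "mdc1")` yields the three mdc1 destinations on port 9120 that the narrow flow would need to cover.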

(In reply to Van Le [:van] from comment #1)
> >we deactivated scl3 on mobil-signing-linux-1[2]
> 
> do you need mobil-signing-linux2 or is that 12? if so, can i have the FQDN?
> i'll add the above flow shortly.

Just the mobil-signing-linux-1.srv.releng.use1.mozilla.com. 

There is *no* mobil-signing-linux2 or mobil-signing-linux12. The "...linux-1[2]" was the notation to reference the puppet "[2]" PR from bug 1484950 comment 0, which was https://github.com/mozilla-releng/build-puppet/pull/170.


(In reply to Van Le [:van] from comment #2)
> Created attachment 9002820 [details]
> 1484950.html
> 
> added policy 272 mobil-signing-linux-1--buildbot. can you test and let me
> know if that's all you needed?

The policy looks good to me, modulo maybe the "buildbot" naming :) I'm not familiar with the other names, so I don't know whether that's a general convention, but AFAIK Buildbot is going away soon, both automation-wise and infra-wise. Either way, just a tiny nit.

Sounds like it enables traffic from mobil-signing-linux-1.srv.releng.use1 > releng_signing_mdc1 which seems fine. 

I've tried to test that out following Johan's steps from comment 0 but I don't get a proper TCP connection.

> [mtabara@mobil-signing-linux-1.srv.releng.use1.mozilla.com ~]$ ping signing9.srv.releng.mdc1.mozilla.com
> PING signing9.srv.releng.mdc1.mozilla.com (10.49.48.42) 56(84) bytes of data.
> 64 bytes from signing9.srv.releng.mdc1.mozilla.com (10.49.48.42): icmp_seq=1 ttl=62 time=79.6 ms

So it's reachable. But when I try to connect to that specific port, it hangs.

> [mtabara@mobil-signing-linux-1.srv.releng.use1.mozilla.com ~]$ nc signing9.srv.releng.mdc1.mozilla.com 9120
> (nothing .. it just hangs here)
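
A connect attempt that hangs and one that is refused usually point at different causes, so it can help to bound the wait and classify the result. Below is a minimal sketch of such a check using only the Python standard library. The function name `probe` and the 5-second default timeout are assumptions for illustration, not something used in this bug. As a general rule of thumb, "timeout" suggests packets are being dropped silently somewhere on the path (often a firewall policy), while "refused" suggests the destination answered but nothing accepted the connection on that port.

```python
import socket


def probe(host, port, timeout=5.0):
    """Attempt a single TCP connect and classify the outcome.

    Returns "open", "timeout", "refused", or "error: <details>".
    """
    try:
        # create_connection resolves the name and performs the TCP handshake,
        # giving up after `timeout` seconds instead of hanging indefinitely.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        # Covers DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

Run on the worker, `probe("signing9.srv.releng.mdc1.mozilla.com", 9120)` would return within the timeout either way, which makes the result easier to report than an interactive nc session.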

I'd normally go ahead and rerun one of those signing jobs which are being run on this particular instance, but the current graph (encompassing build, signing, pushing) is still blocked on the build task failing. Since tasks are run in that order, it doesn't currently get to signing so I can't run/rerun that. Focus team is aware of that and working on a fix AFAIK. A recent graph was triggered earlier here[4].

@van: changes from your side seem good to me; I'm not sure why netcat is not returning a successful connection to that host:port. Do you have a different method in mind for testing the netflow policy you've added?

Thank you!

[1]: https://tools.taskcluster.net/provisioners/scriptworker-prov-v1/worker-types/mobile-signing-v1
[2]: https://tools.taskcluster.net/provisioners/scriptworker-prov-v1/worker-types/mobile-signing-v1/workers/mobile-signing-v1/mobil-signing-linux-1
[3]: https://tools.taskcluster.net/groups/dfmoWdgmQTSP6tjLQn8D7Q/tasks/UhEtSPJ3QR-zIhSvHD9vlQ/runs/0/logs/public%2Flogs%2Flive_backing.log#L6
[4]: https://tools.taskcluster.net/groups/GOQgxRSdSquZ651kfavubg
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Sounds like signing jobs are now working! Based on the worker history[1], we finally have a green signing job[2].
@van: I'll reopen if things go south again, thanks a lot for the help!

[1]: https://tools.taskcluster.net/provisioners/scriptworker-prov-v1/worker-types/mobile-signing-v1/workers/mobile-signing-v1/mobil-signing-linux-1
[2]: https://tools.taskcluster.net/groups/AVADqfB-QzW_k0aqBlx8vA/tasks/ahEeCzdLRFeXhVExQqgd0A/runs/0
Status: REOPENED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED