Closed Bug 1603197 Opened 5 years ago Closed 5 years ago

IRC notification not working in firefox ci cluster

Categories

(Taskcluster :: Services, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: tomprince, Assigned: dustin)

Details

This impacts releaseduty, as we use taskcluster's notifiy service for tracking releases.

Error: Request failed: Non-2xx StatusCode: 500 received in 6 attempts
{
  "code": "InternalServerError",
  "message": "Internal Server Error, incidentId f7a22759-e7b3-4cab-ba58-dc344dbc460d.\n\n---\n\n* method:     irc\n* errorCode:  InternalServerError\n* statusCode: 500\n* time:       2019-12-11T20:13:52.732Z",
  "requestInfo": {
    "method": "irc",
    "params": {},
    "payload": {
      "channel": "#taskcluster",
      "message": "hello"
    },
    "time": "2019-12-11T20:13:52.732Z"
  }
}

^^ that was due to the rabbitmq outage, and even having solved that I'm not seeing any logging from the taskcluster-notify-irc process.

Assignee: nobody → dustin

I will look at this tomorrow.

https://orange-antelope.rmq.cloudamqp.com/#/queues/%2F/queue%2Ftaskcluster-notify%2Firc-notifications shows 20 pending notifications in that queue, none of them delivered. The irc process hasn't logged anything since the 10th. I see a connection with no channels (https://orange-antelope.rmq.cloudamqp.com/#/connections/34.83.136.195%3A37590%20-%3E%2010.51.85.193%3A5671) so I'm assuming that's the culprit. I deleted that connection, but it was just re-created without any channels. The community service hasn't logged anything since the 9th.

This might be related to c51ec9fef108f92bc0598be0bf2854b1d01abc21. The timing lines up with v24.0.0 being deployed.

So one thing that did change in that config, is that IRC_PORT is now required. The irc posting service is set up such that it connects but doesn't start listening for pulse messages until it is connected to IRC. It logs at an INFO level:

{"EnvVersion":"2.0","Fields":{"v":1},"Hostname":"lamport","Logger":"taskcluster.notify.pulse-client","Pid":10138,"Severity":6,"Timestamp":1576119555649000000,"Type":"pulse.connected","serviceContext":{"service":"notify"},"severity":"INFO"}

but logs nothing else if it can't connect to irc, and we don't keep INFO level logs in stackdriver.

So, I think what's happening here is that IRC_PORT is misconfigured, and the service is never fully starting up. edunham, can you confirm that irc.irc_port is set to 6697 for all three deployments?

It seems that the only way to get logging out of the irc-upd library that we use for connecting to irc is to set notify.irc_debug to "1", so let's do that, too. That logging is not structured, but it's better than nothing.

Flags: needinfo?(edunham)

The port was previously set to 6667 on all deployments.

In sops commit c69949a6c294e6b4668159bb73ee74e41aba26bf, I have changed the port to 6697 on all deployments and added irc_debug: 1 for firefoxci.

Flags: needinfo?(edunham)

These changes are now live on firefoxci, and will propagate to communitytc and stage on their next deploys. Stage's next deploy will notify in #releng-notifications on slack and community's next deploy will notify in #taskcluster on slack.

Thanks! Tom confirms it's working now. I also see some logging showing that the process is doing useful things.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.