Closed Bug 1388680 Opened 7 years ago Closed 7 years ago

heroku-release-notifications app/worker.1: raise RuntimeError("Not connected")

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlorenzo, Unassigned)

References

Details

Attachments

(1 file)

54 bytes, text/x-github-pull-request
mtabara
: review+
Details | Review
TL;DR: I think relpro IRC notifications create more problem than they solve. I'm going to mute the errors, for now. In the future, I see 3 solutions. What do you all think? 



Each time we cancel a release graph, errors like:

> Aug 08 19:52:37 heroku-release-notifications app/worker.1: 2017-08-08 19:52:37,460 - asyncio - ERROR - Task exception was never retrieved
> Aug 08 19:52:37 heroku-release-notifications app/worker.1: future: <Task finished coro=<irc_notify() done, defined at /app/pulsenotify/plugins/irc.py:74> exception=RuntimeError('Not connected',)>
> Aug 08 19:52:37 heroku-release-notifications app/worker.1: raise RuntimeError("Not connected")
> Aug 08 19:52:37 heroku-release-notifications app/worker.1: RuntimeError: Not connected

are spewed out by Papertrail. A couple of facts:
* There can be about 400 failures for one cancelled graph. This is clobbering the buildduty IRC notifications.
* The patch in bug 1355483 didn't work

FIRST MEASURE
=> I'll exclude this pattern from papertrail.

NEXT
I'm unsure. I see 3 different scenarios. See details below:
1. Should we stop IRC notifications on releases?
2. Should we wait until release promotion is ported to in-tree tasks, which means we can use the bundled notification system?
3. Should we just start using the bundled notification system?



1. ARE RELPRO IRC NOTIFICATIONS USED BY RELEASEDUTY PEOPLE?
For what it's worth, I don't personally use these IRC notifications. I prefer to rely on emails. My emails are filtered so that these notifications fall under the same folder/label, and only the failed ones are kept as unread (i.e.: an action is needed when I'm on releaseduty). 

I find the IRC notifications hard to follow, because:
* failed tasks are continuously being pushed up by completed tasks
* it's hard to follow how many times a task was rerun and failed. On the other side, emails are put under the same thread.
* IRCCloud crops "old" messages when too many are displayed under the same channel. This makes Ctrl+F hard to use, unless you scroll back up long enough to reach your notification.



2. IN-TREE TASKS, THE GLORIOUS FUTURE
I think that's what we ultimately aim for. But that's a big project



3. CHANGING THE NOTIFICATIONS SYSTEM
Switching to another type of notification system should be a matter of editing [1]. However, the email notifications are currently quite reliable. We may discover race conditions by using the taskcluster one. Hence, migrating that critical part of the release doesn't seem great as a first migration. We may mitigate that, by adding notifications to the Fennec release graph.

[1] https://github.com/mozilla-releng/releasetasks/blob/master/releasetasks/templates/notification/notifications.yml.tmpl
I updated [1] from:
> ERROR
to 
> ERROR -("ERROR - Task exception was never retrieved" OR "RuntimeError('Not connected',)" OR 'RuntimeError("Not connected")' OR "RuntimeError: Not connected")

[1] https://papertrailapp.com/searches/17080452/edit
Could you confirm if we need to restart the heroku app each time we cancel a graph, or get the 'RuntimeError: Not connected' errors ? Guessing not if we can suppress the papertrail alerts.
AFAIK, we don't need to. The worker keeps running and reconnects the next time. My guess is: we sent to many tasks at the same time, before the IRC connections is made.
Hm, I do use the irc notifications, but marking the non-failure emails as read sounds like a great filter, and would make the irc notifications unneeded for me.
I use the IRC notifications too, but they're not perfect and have adjusted email filters to try that instead.
See Also: → 1392291
> 17:09:23 <relengbot> [sns alert] Aug 21 08:06:31 heroku-release-notifications app/worker.1: aiohttp.errors.ClientOSError: [Errno 110] Cannot connect to host cloud-mirror-production-us-east-1.s3.amazonaws.com:443 ssl:True [Can not connect to cloud-mirror-production-us-east-1.s3.amazonaws.com:443 [Connection timed out]] 
> 17:13:44 <aselagea|buildduty> jlorenzo: do you know what that S3 location is used for? ^
> 17:14:39 <jlorenzo> I'm guessing it's host behind taskcluster 
> 17:14:46 <jlorenzo> a* host 
> 17:15:43 <jlorenzo> that looks like a new bug for release notifications 
> 17:16:27 <aselagea|buildduty> hmm..it's weird that we send these alerts in #buildduty, but we don't have access/instructions on how to deal with them :-/
> 17:17:33 <jlorenzo> aselagea|buildduty: I agree. I'll turn off IRC notifications here 
> 17:17:51 <aselagea|buildduty> thanks! :)
> 17:17:53 <jlorenzo> it should be in #releaseduty, instead 

I just turned off notifications in #buildduty (the ones reporting a bad state of release-notifications). We still have emails. If we keep using IRC notifications, these ones should be part of #releaseduty. 

Anyway, even after comment 1, we've got new error patterns: bug 1392291. This time, it's unrelated to IRC notifications.
I don't use the irc notifications personally. I just look at the taskgraph for errors, I look at the email notifications occasionally if there is a problem.
I've stopped using the irc notifications... email wfm.
What aki said. I don't overly like the name Exception when we cancel graphs (vs an internal TC issue) but that's a more general problem.
Attached file pulse-notify PR
Let's kill IRC notifications (in production)!
Attachment #8917413 - Flags: review?(mtabara)
Attachment #8917413 - Flags: review?(mtabara) → review+
Landed on master and production at: https://github.com/mozilla-releng/pulse-notify/commit/585ef8eb485c6d3fbf1520be6db88a3e5d8130a9

Deployed to production at: https://dashboard.heroku.com/apps/release-notifications/activity/builds/0ae2eca6-2f42-4afd-b430-b86b69a73825
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Blocks: 1412013
Blocks: 1412014
Component: General Automation → General
Depends on: 1467029
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: