Closed Bug 1273249 Opened 9 years ago Closed 9 years ago

Missing SNS alerts from papertrail

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: aselagea)

Many of our buildbot masters currently have stale lockfiles for reconfigs, and yet we don't have any alerts for this in #buildduty. For example, on buildbot-master114:
1838896 0 -rw-rw-r-- 1 cltbld cltbld 0 May 14 11:02 /builds/buildbot/tests1-linux64/reconfig.lock
Papertrail alerts don't seem to be making it to IRC.
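For reference, the staleness condition being alerted on can be sketched as follows. This is a minimal illustration, not the contents of maybe_reconfig.sh: the 120-minute threshold is taken from the alert text, and the function names are hypothetical.

```python
import os
import tempfile
import time

# Threshold taken from the alert text ("older than 120 minutes").
STALE_AFTER_MINUTES = 120


def lockfile_age_minutes(path, now=None):
    """Return the age of `path` in minutes, based on its modification time."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 60.0


def is_stale(path, threshold_minutes=STALE_AFTER_MINUTES, now=None):
    """True if the lockfile exists and is older than the threshold."""
    if not os.path.exists(path):
        return False
    return lockfile_age_minutes(path, now) > threshold_minutes


if __name__ == "__main__":
    # Demonstrate on a temporary file rather than a real master's lockfile.
    fd, lock_path = tempfile.mkstemp(suffix=".lock")
    os.close(fd)
    three_hours_ago = time.time() - 3 * 60 * 60
    os.utime(lock_path, (three_hours_ago, three_hours_ago))
    print(is_stale(lock_path))
    os.remove(lock_path)
```

On a master, the path would be something like the reconfig.lock shown above; the check itself only depends on the file's modification time.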
Summary: Missing alerts for stale reconfig locks → Missing SNS alerts from papertrail
Papertrail seems to be getting the events properly:
May 16 12:00:06 buildbot-master68.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 12:00:06 buildbot-master54.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 12:00:06 buildbot-master114.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 12:00:06 buildbot-master125.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 12:00:06 buildbot-master67.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 12:00:06 buildbot-master51.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 12:00:06 buildbot-master113.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 12:00:07 buildbot-master53.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 12:00:07 buildbot-master52.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 12:00:07 buildbot-master02.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 13:00:06 buildbot-master02.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 13:00:06 buildbot-master114.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 13:00:06 buildbot-master54.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 13:00:06 buildbot-master51.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 13:00:06 buildbot-master53.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 13:00:06 buildbot-master125.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 13:00:06 buildbot-master113.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 13:00:06 buildbot-master68.bb.releng.usw2.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 13:00:07 buildbot-master67.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
May 16 13:00:07 buildbot-master52.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
I think this might have been due to a change Papertrail made on their end:
"Based on the responses to my email below, on Tuesday, April 19, 2016, Papertrail's SNS alert will subtly change. Most SNS subscribers will not notice the change. Only SNS subscribers using the plaintext email and SMS/text message protocols will see a change. The change is as follows:
- TODAY: Papertrail sends each log message as a JSON object. However, SNS does not support JSON for the plaintext email or SMS/text message protocols. Today, Papertrail does not include a plaintext default, so SNS sends a JSON object (for plaintext email) or "null" (for SMS) to those subscribers.
- AFTER APRIL 14: Papertrail will include a plaintext message as well, in this format: Apr 07 10:08:14 systemname appname: log message
Only plaintext email and SMS/text message subscribers will receive this string. Other SNS protocols will continue to receive the JSON object. Here's more: http://docs.aws.amazon.com/sns/latest/dg/PublishTopic.html."
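To make the mechanism in that announcement concrete, here is a rough sketch of how an SNS publisher can send different bodies per protocol. With MessageStructure set to "json", the Message is a JSON object keyed by protocol, a "default" key is required, and any protocol without its own key receives the default. Which keys Papertrail actually sets is not known from this bug; the key selection, the event fields, and both function names below are assumptions for illustration, and body_for_protocol is only a local model of SNS's selection, not an AWS API.

```python
import json


def build_sns_publish_args(log_event, plaintext):
    """Build Publish arguments carrying per-protocol message bodies."""
    structured = json.dumps(log_event)
    per_protocol = {
        "default": plaintext,  # required when MessageStructure is "json"
        "email": plaintext,
        "sms": plaintext,
        # Assumed keys: which protocols a given sender overrides is its choice.
        "http": structured,
        "https": structured,
        "sqs": structured,
    }
    return {"MessageStructure": "json", "Message": json.dumps(per_protocol)}


def body_for_protocol(message_json, protocol):
    """Local model of SNS delivery: use the protocol's body, else the default."""
    bodies = json.loads(message_json)
    return bodies.get(protocol, bodies["default"])


if __name__ == "__main__":
    # Hypothetical event fields; the plaintext follows the announced format.
    event = {"hostname": "systemname", "program": "appname", "message": "log message"}
    args = build_sns_publish_args(
        event, "Apr 07 10:08:14 systemname appname: log message"
    )
    print(body_for_protocol(args["Message"], "sms"))
    print(body_for_protocol(args["Message"], "https"))
```

One consequence worth noting: if a sender sets only "default" to the plaintext string, every subscriber, including HTTP(S) endpoints, receives plaintext rather than JSON.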
selena: I suspect that the bot is receiving messages it can't parse correctly. Can you please help debug?
Flags: needinfo?(sdeckelmann)
Assignee: nobody → sdeckelmann
Flags: needinfo?(sdeckelmann)
Yeah, parsing error:
2016-05-24T15:00:52.766341+00:00 app[web.1]: at Robot.emit (/app/node_modules/hubot/src/robot.coffee:583:18, <js>:459:41)
2016-05-24T15:00:52.766339+00:00 app[web.1]: at EventEmitter.<anonymous> (/app/scripts/sns_response.coffee:68:17, <js>:61:20)
2016-05-24T15:00:52.766340+00:00 app[web.1]: at EventEmitter.emit (events.js:95:17)
2016-05-24T15:00:52.766342+00:00 app[web.1]: at SNS.notify (/app/node_modules/hubot-sns/src/sns.coffee:87:5, <js>:85:18)
2016-05-24T15:00:52.766342+00:00 app[web.1]: at SNS.process (/app/node_modules/hubot-sns/src/sns.coffee:63:7, <js>:55:21)
2016-05-24T15:00:52.766343+00:00 app[web.1]: at /app/node_modules/hubot-sns/src/sns.coffee:47:11, <js>:35:26
2016-05-24T15:00:52.766324+00:00 app[web.1]: [Tue May 24 2016 15:00:52 GMT+0000 (UTC)] ERROR SyntaxError: Unexpected token M
Poking at it now.
Bad network is killing me at my current location. I'll have a look at this tomorrow when I can use the Heroku dashboard without long stalls. Sorry this has been busted for so long!
@selena: I was wondering if you have any updates on this one? :-) During the last weekend there were some stuck reconfigs on several masters, so it would be nice if we could receive alerts for that in #buildduty to deal with the issue faster (see bug 1276433). Thanks!
Flags: needinfo?(sdeckelmann)
Started debugging this, hopefully I'll find a fix :-).
Flags: needinfo?(sdeckelmann)
Assignee: sdeckelmann → aselagea
Alin: you have access to the heroku dashboard, yes? Have you tried dumping the contents of msg.message (or even all of msg) to the log to see what the format is now? https://github.com/mozilla/relengbot/blob/master/scripts/sns_response.coffee#L68
Coop merged https://github.com/mozilla/relengbot/pull/2 to log to the console any message that fails to parse. It turned out that we no longer receive a JSON message, but a plaintext one. The following PRs were merged to address this: https://github.com/mozilla/relengbot/pull/3 and https://github.com/mozilla/relengbot/pull/4. The bot now displays the alerts in #buildduty once again: "relengbot> [sns alert] Aug 04 14:00:06 buildbot-master128.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes."
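For anyone hitting the same failure in another SNS consumer, the general shape of a tolerant handler is sketched below. This is not the code from the PRs above (the bot itself is CoffeeScript); it is a Python illustration under the assumption that the message body may be either a JSON object or a plain syslog-style line. The function names are hypothetical; the "[sns alert]" prefix is taken from the bot output quoted above.

```python
import json


def parse_sns_message(raw):
    """Return ("json", parsed) for JSON objects/arrays, else ("text", stripped)."""
    try:
        parsed = json.loads(raw)
    except ValueError:
        # A line such as "May 24 ..." is not JSON; parsing fails on the first
        # character, which is what "Unexpected token M" reports in JavaScript.
        return "text", raw.strip()
    if isinstance(parsed, (dict, list)):
        return "json", parsed
    # Bare scalars (e.g. "123") are valid JSON but are treated as plain text.
    return "text", raw.strip()


def format_alert(raw):
    """Format a message for the channel regardless of its encoding."""
    kind, payload = parse_sns_message(raw)
    if kind == "text":
        return "[sns alert] " + payload
    return "[sns alert] " + json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    line = (
        "Aug 04 14:00:06 buildbot-master128.bb.releng.use1.mozilla.com "
        "maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes."
    )
    print(format_alert(line))
```

Falling back to plaintext instead of raising keeps alerts flowing even if the upstream format changes again.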
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard