Aggregate similar exceptions in emails from masters

RESOLVED FIXED

Status

Release Engineering
General
P5
enhancement
RESOLVED FIXED
7 years ago
4 months ago

People

(Reporter: coop, Assigned: coop)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [buildmasters][reporting])

Attachments

(3 attachments)

Two things that would make the exceptions emails from masters more useful IMO:

1) Report an aggregate count of exceptions of a certain type (e.g. twisted.spread.pb.PBConnectionLost). Report the first one found and then print afterwards "...and 10 more like this."

2) Exclude exceptions that we can't do anything about, e.g. twisted.spread.pb.PBConnectionLost. This should be a blacklist so we can start caring about them in the future if we *can* do something about them.
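
A minimal sketch of both ideas, assuming the exceptions have already been parsed out of twistd.log into (type name, text) pairs. The names `IGNORED_EXCEPTIONS` and `summarize` are hypothetical and are not taken from any existing script:

```python
from collections import OrderedDict

# Hypothetical blacklist. Remove an entry once we can act on that exception.
IGNORED_EXCEPTIONS = {"twisted.spread.pb.PBConnectionLost"}


def summarize(exceptions, ignored=IGNORED_EXCEPTIONS):
    """Group (type_name, text) pairs by type, skipping blacklisted types.

    Returns report lines: the first exception of each type, followed by a
    count of the remaining ones of that type.
    """
    groups = OrderedDict()
    for type_name, text in exceptions:
        if type_name in ignored:
            continue
        groups.setdefault(type_name, []).append(text)

    report = []
    for type_name, texts in groups.items():
        report.append(texts[0])
        extra = len(texts) - 1
        if extra:
            report.append("...and %d more like this." % extra)
    return report
```

Keeping the blacklist as data rather than hard-coding the check means an entry can be dropped later without touching the grouping logic.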
Severity: normal → enhancement
Priority: -- → P5
Created attachment 503882 [details] [diff] [review]
Ignore PBConnectionLost exceptions

taking the easy part first - ignore PBConnectionLost exceptions
Attachment #503882 - Flags: review?(coop)
Attachment #503882 - Flags: review?(coop) → review+
Comment on attachment 503882 [details] [diff] [review]
Ignore PBConnectionLost exceptions

changeset:   1128:a92b4503ee69
Attachment #503882 - Flags: checked-in+
Deployed on all the masters
Product: mozilla.org → Release Engineering
Component: Other → Tools
QA Contact: hwine
I've become fed up enough with the UnauthorizedLogin exceptions from try-linux64-ec2-golden that I've written a patch here.
Assignee: nobody → coop
Status: NEW → ASSIGNED
Created attachment 8465681 [details] [diff] [review]
Aggregate UnauthorizedLogin exceptions

Because UnauthorizedLogin exceptions happen every few seconds while the slave is still trying to connect, they turn a usually non-existent exception email into a 300K hourly monster.

None of the other exceptions we currently hit happen with that kind of frequency, but the pattern I've implemented here could be extended to other cases should it be required.

I collect all the exceptions as we did before. I then post-hoc run the exceptions through a comparison function that strips out and aggregates the login exceptions, passing the smaller list of remaining exceptions on to the next comparison (no other comparison functions exist yet).
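
The chaining described above could be sketched roughly as follows. This is an illustration of the pattern only, not the attached patch: `rollup_logins`, `COMPARISONS`, and `build_report` are hypothetical names, and each exception is assumed to be a single string.

```python
def rollup_logins(exceptions):
    """Hypothetical comparison function: strip out and count login failures.

    Returns (summary_lines, remaining_exceptions).
    """
    matched = [e for e in exceptions if "UnauthorizedLogin" in e]
    remaining = [e for e in exceptions if "UnauthorizedLogin" not in e]
    summary = []
    if matched:
        summary.append(
            "%d UnauthorizedLogin exceptions, e.g.:\n%s" % (len(matched), matched[0])
        )
    return summary, remaining


# Further comparison functions with the same signature could be appended here.
COMPARISONS = [rollup_logins]


def build_report(exceptions):
    """Run each comparison in turn, handing on whatever it did not claim."""
    sections = []
    remaining = list(exceptions)
    for compare in COMPARISONS:
        summary, remaining = compare(remaining)
        sections.extend(summary)
    if remaining:
        sections.append(
            "The following other exceptions (total %d) were detected:" % len(remaining)
        )
        sections.extend(remaining)
    return sections
```

Because every comparison returns the exceptions it did not consume, adding a new roll-up only means appending another function to the list.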

We get a roll-up of the frequency of the UnauthorizedLogin attempts by host. As a bonus, I've also done a hostname lookup from the reported ip, since we only get the ip in the twistd.log.

Here's a truncated example of the output that I just ran against *all* the twistd.logs on bm75-try1:
---
The following slaves tried to connect unsuccessfully to buildbot-master75.srv.releng.use1.mozilla.com bm75-try1:
# attempts - hostname (ip) - last seen
     96559 - try-linux64-ec2-golden.try.releng.use1.mozilla.com (10.134.49.65) - 2014-07-31 11:38:15-0700

Example:

Exception in /builds/buildbot/try1/master/twistd.log.14:
2014-07-28 01:42:02-0700 [Broker,102652,10.134.49.65] Unhandled Error
	Traceback (most recent call last):
	Failure: twisted.cred.error.UnauthorizedLogin: 

--------------------------------------------------------------------------------
The following other exceptions (total 32) were detected on buildbot-master75.srv.releng.use1.mozilla.com bm75-try1:

Exception in /builds/buildbot/try1/master/twistd.log.49:
2014-07-04 02:32:19-0700 [-] Unhandled Error
[deletia]
---

We won't usually get this many, since a regular run limits the exceptions by timestamp.
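
For reference, a per-host roll-up like the one in the sample output could be computed along these lines. This is a sketch under assumptions: it takes the header lines of exceptions that have already been identified as login failures, `HEADER_RE`, `lookup_hostname`, and `rollup_by_host` are hypothetical names, and the header format is inferred from the single example above.

```python
import re
import socket

# Matches headers such as:
# 2014-07-28 01:42:02-0700 [Broker,102652,10.134.49.65] Unhandled Error
HEADER_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4}) "
    r"\[Broker,\d+,(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\]"
)


def lookup_hostname(ip):
    """Reverse-resolve an IP, falling back to the IP itself on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return ip


def rollup_by_host(headers, resolve=lookup_hostname):
    """Count attempts per IP and remember the last timestamp seen."""
    stats = {}  # ip -> [attempts, last_seen]
    for header in headers:
        match = HEADER_RE.match(header)
        if not match:
            continue
        entry = stats.setdefault(match.group("ip"), [0, None])
        entry[0] += 1
        entry[1] = match.group("ts")  # assumes headers arrive in log order

    lines = ["# attempts - hostname (ip) - last seen"]
    # Busiest hosts first.
    for ip, (attempts, last_seen) in sorted(
        stats.items(), key=lambda item: -item[1][0]
    ):
        lines.append("%10d - %s (%s) - %s" % (attempts, resolve(ip), ip, last_seen))
    return lines
```

The `resolve` parameter is there so the DNS lookup can be swapped out, for example when testing without network access.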
Attachment #8465681 - Flags: review?(bugspam.Callek)
Comment on attachment 8465681 [details] [diff] [review]
Aggregate UnauthorizedLogin exceptions

wfm, thanks!
Attachment #8465681 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8465681 [details] [diff] [review]
Aggregate UnauthorizedLogin exceptions

Review of attachment 8465681 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/b65452fec28d
Attachment #8465681 - Flags: checked-in+
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
\o/
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Created attachment 8470225 [details] [diff] [review]
Only roll-up login exceptions

This small follow-up patch explicitly checks for UnauthorizedLogin errors.

With the regexp as it was, we were rolling up *any* exception that displayed an IP in the broker tag. These included things like MySQL host connection errors.
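
To illustrate the difference, here is a sketch of a stricter check. It is not the regexp from the patch: `BROKER_IP_RE`, `LOGIN_FAILURE`, and `is_login_failure` are hypothetical names, and the second sample block in the usage note is made up for illustration.

```python
import re

# Loose condition: any exception whose header carries an IP in the broker tag.
BROKER_IP_RE = re.compile(r"\[Broker,\d+,\d{1,3}(?:\.\d{1,3}){3}\]")

# Strict condition: the traceback must also name the login failure itself.
LOGIN_FAILURE = "twisted.cred.error.UnauthorizedLogin"


def is_login_failure(block):
    """Return True only for exception blocks that are UnauthorizedLogin errors."""
    return bool(BROKER_IP_RE.search(block)) and LOGIN_FAILURE in block
```

With this check, a block such as the UnauthorizedLogin example earlier in this bug is rolled up, while a block with the same broker tag but a different (here invented) `Failure: some.other.Error:` line is left in the list of other exceptions.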
Attachment #8470225 - Flags: review?(bugspam.Callek)
Attachment #8470225 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8470225 [details] [diff] [review]
Only roll-up login exceptions

Review of attachment 8470225 [details] [diff] [review]:
-----------------------------------------------------------------

https://hg.mozilla.org/build/tools/rev/4a8c1893c3db
Attachment #8470225 - Flags: checked-in+
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Component: Tools → General
Product: Release Engineering → Release Engineering