TreeHerder's comments on intermittent bugs can generate a tsunami of bugmail
Categories
(Tree Management :: Treeherder, defect)
Tracking
(Not tracked)
People
(Reporter: glob, Unassigned)
Details
On 2022-12-12 TreeHerder, acting as the Intermittent Failures Robot Bugzilla account, generated enough bugmail to trigger infrastructure alerts. For just over 52 minutes it commented on a bug every 2 seconds, touching 1528 bugs and generating about 30k emails.
This is a significant number of emails for Bugzilla's infrastructure to process, and it triggered an alert that required attention from our SRE team. Although an alert was triggered, no infrastructure issues were noticed in this instance aside from delayed email delivery system-wide.
On Bugzilla's side we'll investigate whether our alerting can be changed so that cases like this do not trigger an alert (although there is value in raising awareness of slow email delivery). As TreeHerder's behaviour hasn't changed, we'll also investigate why we're seeing slower email handling following BMO's migration from AWS to GCP.
In any event I'd like for the TreeHerder team to consider what changes could be made to intermittents_commenter to reduce its impact on the service's health.
Possible solutions include:
- run the command more frequently, which will reduce the number of bugs updated at the same time
- add long delays between each bug update, which might give Bugzilla enough time to handle the emails generated by one bug before it has to handle the next
- prevent the account from generating emails when it updates a bug; this is a BMO admin setting that we can put in place for you, and it prevents bugmail from being created for any change made by the bot
- consider the value of the comments and find an alternative method for surfacing information about intermittents, such as a dashboard
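To illustrate the second option, here is a minimal sketch of pacing the bug updates. It is not taken from intermittents_commenter: the function name `post_comments_throttled`, the `post_comment` callback and the `delay_seconds` parameter are all hypothetical, and the right delay value is left open.

```python
import time
from typing import Callable, Iterable


def post_comments_throttled(
    bug_ids: Iterable[int],
    post_comment: Callable[[int], None],  # hypothetical callback that updates one bug
    delay_seconds: float,                 # hypothetical setting; value to be agreed
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Update each bug in turn, pausing between consecutive updates.

    Returns the number of bugs updated.
    """
    count = 0
    for bug_id in bug_ids:
        if count > 0:
            # Pause before every update except the first, so the mail queue
            # has time to drain the emails generated by the previous bug.
            sleep(delay_seconds)
        post_comment(bug_id)
        count += 1
    return count
```

The `sleep` argument is only there so the pacing can be tested without actually waiting; in normal use the default `time.sleep` applies.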
Comment 1•3 years ago
(In reply to :glob ✱ from comment #0)
- add long delays between each bug update, which might give Bugzilla enough time to handle the emails generated by one bug before it has to handle the next
What would be a sufficient value?
Excellent question - could we try 5 seconds?
We've also increased the number of workers that process the email queue; those two changes combined should hopefully prevent a backlog large enough to page on-call.
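As a rough back-of-the-envelope check on what a longer delay buys, the sketch below uses the figures from comment 0 (1528 bugs, about 30k emails). It assumes the run time is dominated by the delay between updates and that the total number of emails stays the same; `run_profile` is a hypothetical helper, not part of any existing tool.

```python
def run_profile(bugs: int, emails: int, delay_seconds: float) -> tuple[float, float]:
    """Return (run length in minutes, mean emails per second) for one run.

    Assumes one bug update every delay_seconds and an unchanged email total.
    """
    if delay_seconds <= 0:
        raise ValueError("delay_seconds must be positive")
    duration_seconds = bugs * delay_seconds
    return duration_seconds / 60, emails / duration_seconds


for delay in (2, 5):
    minutes, rate = run_profile(bugs=1528, emails=30_000, delay_seconds=delay)
    print(f"{delay}s delay: {minutes:.0f} min, {rate:.1f} emails/s")
```

Under those assumptions a 2 second delay gives a run of roughly 51 minutes, in line with the observed 52 minutes, and a 5 second delay stretches the run to roughly 127 minutes while cutting the mean email rate from about 10 per second to about 4 per second.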
Comment 3•3 years ago
I would like to comment less frequently on bugs, but once/month provide an update for failures for the low frequency bugs. That once/month will be a lot of emails. I wonder if this happened as a one time spike, or if this is something that occurs regularly (at least once/month?)
(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #3)
I would like to comment less frequently on bugs, but once/month provide an update for failures for the low frequency bugs. That once/month will be a lot of emails. I wonder if this happened as a one time spike, or if this is something that occurs regularly (at least once/month?)
The spike occurs every week - see https://bugzilla.mozilla.org/page.cgi?id=user_activity.html&action=run&from=-14d&who=orangefactor%40bots.tld
BMO infrastructure changes in early December (AWS→GCP) resulted in slower than usual processing of the email, triggering a page to on-call for attention. The infra has been scaled up; however, I feel it's still worth considering the value of these comments.
I'm not sure what the intended use of the comments is: it could be using Bugzilla comments as a database to store history, or as a reminder to those following the bugs that it's still open.
To be clear, I'm not advocating for the comments to be stopped; I want to make sure that they still have value.
Comment 5•3 years ago
I agree we need to find a way to measure the usefulness of our tools (maybe what % of intermittent bugs get fixed by developers). Without a team to work on a workflow change (i.e. treeherder is on life support), I don't see a lot happening anytime soon. I have seen great plans defined in the past, and 80% of the time no work has been done 2 years later; then those plans need to be redone as workflows/tools/scopes have changed.
My understanding is the comments are more of a reminder to people that an issue is occurring. Bugzilla has a way to store that information and collaborate on it while notifying people. Building a system to notify people seems error prone, same with adding a tool into a workflow. If bugs and triage are not a regular thing for a team, I don't see these bugs/comments as being valuable.
Without looking at alternatives we don't know if there is something worth pursuing or not. I have filed bug 1809879 to have a discussion around one idea; possibly we can find other ideas as well.