Closed Bug 1179821 Opened 8 years ago Closed 7 years ago

Create a script that posts failure stats to bugs on a periodic basis

Categories

(Tree Management Graveyard :: OrangeFactor, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

Attachments

(5 files, 2 obsolete files)

Bug 1179310 wishes to turn off the bug comments made by Treeherder.
One of the uses of these is to act as a reminder to people CCed or component watching bugs, that (a) the issue is still occurring, and (b) with what frequency / provide some extra info about the failure.

However these are very spammy, so as an interim solution to an OrangeFactor v2 and some more advanced notification system, I proposed only making periodic (ie daily/weekly) bug comments with aggregated stats - as opposed to one bug comment for every failure classification.

We can also set a threshold over which we won't comment on bugs - or set thresholds that mean we'll comment daily rather than weekly etc (but likely in a followup in another bug).
Depends on: 1180024
Depends on: 1180028
Some more random cleanup (improves SnR of flake8 output, and having a config file means one can type "flake8" rather than having to use "flake8 ."):

Fix large excessively long lines in woo_mailer.py
https://hg.mozilla.org/automation/orangefactor/rev/fdbef0b17d9d

Add a setup.cfg with basic pep8/flake8 settings
https://hg.mozilla.org/automation/orangefactor/rev/b9a83c32f1c1
(CC sheriffs)

For the first iteration, I was thinking the script will:
* Post a daily summary for yesterday's failures, on bugs that were classified >= N times in that day.
* Post a weekly summary of a bug's failures, for bugs that were classified >= M times in that week.
* For the bug comment content, it would include:
  - what timeframe the comment applies to
  - total number of failures
  - breakdown by repository
  - breakdown by platform
  - link to the orangefactor bug details page for that bug
* All stats would be done looking at all repositories, not just trunk

Question:

Do people have any suggestions for the values of N and M? (We can always change them later. Barring any suggestions I was thinking N=15 and M=5)

Reasoning:

In the last 7 days, there were roughly 8000 classifications sent to OrangeFactor, spread across ~800 bugs. ie: 8000 comments were created. If we replaced that with a weekly summary to all bugs, then we'd drop the 8000 bug comments to 800. However, weekly summaries would mean a delay before knowing that an intermittent was suddenly occurring frequently, so we really need a daily summary for the chronic cases.

If instead we used daily summaries for all bugs that occurred in the prior 24 hours, then we'd have sent something like 1900 comments in the week, which is still pretty high. We could set a high threshold (only comment if the bug occurred more than 10 times in the last day), but then we'd never comment at all on bugs that occur <~70 times a week - which would mean the bug would get closed WORKSFORME and no longer appear in bug suggestions etc. (Plus the module owners would just think the issue had gone away).

As such, a combination of both weekly and daily summaries seems like the best balance between spam and making it clear bugs are still occurring.

Some example values for N and the total number of *daily* summaries that would have been posted over the whole week:
1: 1900 (ie all bugs get daily summaries if they occurred the prior day)
2: 895
5: 348
7: 257
10: 183
15: 114
20: 79
25: 62
50: 21

Some example values for M and the number of *weekly* summaries that would have been posted:
1: 800 (ie all the bugs get a summary)
2: 467
3: 342
4: 276
5: 225
6: 194
7: 172
8: 153
9: 141
10: 134

So with say N=15 and M=5, over the course of the week we'd post 114+225 = 339 bug comments, compared to the 8000 at present. The daily summaries would average 16 bugs/day.
This seems like a reasonable place to start.
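As a rough illustration of the threshold selection described above (not the actual script), here is a minimal Python sketch. The function name `bugs_to_comment_on` and the shape of the `totals` mapping are assumptions made for the example; only the values N=15 and M=5 and the ">= N" comparison come from the proposal.

```python
# Thresholds proposed in the discussion above (N for daily, M for weekly).
DAILY_THRESHOLD = 15
WEEKLY_THRESHOLD = 5


def bugs_to_comment_on(totals, weekly=False):
    """Return the bug ids whose failure count meets the threshold.

    `totals` is a hypothetical mapping of bug id -> number of
    classifications seen in the window (prior day or prior week).
    """
    threshold = WEEKLY_THRESHOLD if weekly else DAILY_THRESHOLD
    # ">=" matches the ">= N times" wording of the proposal.
    return sorted(bug for bug, count in totals.items() if count >= threshold)
```

With made-up totals of `{"1206327": 5, "1206134": 7, "1144499": 40}`, only the last bug would get a daily summary, whereas all three would get a weekly one.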
Sounds like good initial values for M and N.
Since we'll want to use it for the bug commenter too.
Attachment #8666058 - Flags: review?(jgriffin)
Since both woo_mailer.py and the new bug commenter script will be using it, so the old name is no longer accurate.

I'll rename the hg-ignored file on brasstacks when this is deployed.
Attachment #8666060 - Flags: review?(jgriffin)
It returns the per-repository, per-platform and total failure counts for each bug that was seen in the current time window.

eg:

{
    "1206327": {
        "total": 5,
        "per_repository": {
            "fx-team": 2,
            "mozilla-inbound": 3
        },
        "per_platform": {
            "osx-10-10": 4,
            "b2g-emu-ics": 1
        }
    },
    ...
}
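
For anyone wondering how a structure like the one above could be built, here is a minimal sketch. It is not the real stats_by_bug() implementation: the name `stats_by_bug_sketch` and the input format (a flat list of `(bug_id, repository, platform)` tuples) are assumptions for illustration only.

```python
from collections import defaultdict


def stats_by_bug_sketch(classifications):
    """Aggregate (bug_id, repository, platform) records into per-bug counts.

    The input format is hypothetical; the output mirrors the example
    structure shown above.
    """
    stats = {}
    for bug_id, repository, platform in classifications:
        entry = stats.setdefault(bug_id, {
            "total": 0,
            "per_repository": defaultdict(int),
            "per_platform": defaultdict(int),
        })
        # One record is one failure classification for that bug.
        entry["total"] += 1
        entry["per_repository"][repository] += 1
        entry["per_platform"][platform] += 1
    return stats
```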
Attachment #8666061 - Flags: review?(jgriffin)
The default remains 'trunk', but it can now be overridden. This will allow the bug commenter to run stats_by_bug() against trees='all' rather than just trunk.
Attachment #8666062 - Flags: review?(jgriffin)
We need api_key support which is not in the latest PyPI version (v0.4.1), so have to fetch it from GitHub instead.

(Have filed https://github.com/AutomatedTester/Bugsy/issues/23)
Attachment #8666064 - Flags: review?(jgriffin)
7 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-central: 7

Platform breakdown:
* osx-10-9: 3
* osx-10-10: 3
* osx-10-6: 1

For more details, see:
http://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1206134&startday=2015-09-18&endday=2015-09-24&tree=all
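
A comment like the one above could be rendered from the per-bug stats roughly as follows. This is a sketch under assumptions, not the code in the patch: `format_comment`, `breakdown` and the `timeframe` argument are hypothetical names, and sorting the breakdowns by descending count is a guess based on the example.

```python
DETAILS_URL = ("http://brasstacks.mozilla.com/orangefactor/?display=Bug"
               "&bugid=%s&startday=%s&endday=%s&tree=all")


def breakdown(title, counts):
    """Render one 'X breakdown:' section, highest counts first (assumed order)."""
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    return [title] + ["* %s: %d" % item for item in ordered]


def format_comment(bug_id, stats, start_day, end_day, timeframe):
    """Build the comment text; `timeframe` is eg "the last 7 days"."""
    lines = ["%d automation job failures were associated with this bug in "
             "%s." % (stats["total"], timeframe), ""]
    lines += breakdown("Repository breakdown:", stats["per_repository"]) + [""]
    lines += breakdown("Platform breakdown:", stats["per_platform"]) + [""]
    lines += ["For more details, see:",
              DETAILS_URL % (bug_id, start_day, end_day)]
    return "\n".join(lines)
```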
This adds a script that will post either a daily or weekly failure summary to bugs whose total number of failures exceeded a threshold. For daily summaries, the threshold (at least for now) is 15 failures in the prior day, and for weekly summaries 5 failures in the prior week.

To run in daily mode (the default):
./woo_commenter.py --test

Weekly mode:
./woo_commenter.py --test --weekly

Remove --test to actually comment on the bugs, rather than printing to stdout.
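
For illustration, the two flags described above could be handled along these lines. This is only a sketch: the real script may parse its options differently (argparse is used here for brevity), and `deliver` and its `post` callback are hypothetical helpers.

```python
import argparse


def parse_args(argv=None):
    """Parse the --weekly and --test flags; daily mode is the default."""
    parser = argparse.ArgumentParser(
        description="Post failure summaries to bugs.")
    parser.add_argument("--weekly", action="store_true",
                        help="post weekly rather than daily summaries")
    parser.add_argument("--test", action="store_true",
                        help="print comments to stdout instead of posting")
    return parser.parse_args(argv)


def deliver(bug_id, comment, test_mode, post):
    """Print the comment in test mode, otherwise hand it to `post`."""
    if test_mode:
        print("Bug %s:\n%s\n" % (bug_id, comment))
    else:
        post(bug_id, comment)
```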

Note: Even in --test mode, you will still need valid credentials in woo_cron.conf. This has intentionally not been worked around, so that --test can be used to validate as much of the production config after deployment as possible (apart from actually posting the bug comments). Set `bz = None` on line 89 to avoid this requirement.

For an example bug comment generated by the script, see comment 13.
Attachment #8666081 - Flags: review?(jgriffin)
Left to do:
* Deploy these changes on prod
* Rename woo_mailer.conf to woo_cron.conf on prod
* Pip install bugsy into the prod venv
* Pick a suitable time/day for the daily + weekly cron jobs (I'll document these and the existing woo_mailer.py cron job in the readme)
* Land the Treeherder bug comments disabling PR (bug 1179310)
* Add the cron entries from above to the webtools user on prod

Decisions (more of a note to self, than anything else):
* What time of day should this run? Early hours UTC makes sense IMO, since there is a shorter lag between end_date and when we send the emails out.
* Should we use the existing tbplbot account or create a new one? Either way we're going to rename the username for it (see bug 1164902). A new user might actually be better since it's a clean start and the comments are actually not that related to the old ones.
The windows patch warning on the last patch is due to the example config file having windows line endings in the repo at the moment. I'll try and fix the existing line endings on landing.
We'll also want to handle security bugs/invalid bug numbers (where we cannot add a comment, so just need to ignore any errors), but for the moment Bugsy actually silently fails on these, so we don't need to catch them specially (filed https://github.com/AutomatedTester/Bugsy/issues/25).
I've made some tweaks to this part, since I've decided to stop using Bugsy: it has a few bugs, doesn't retry failures, requires an unnecessary username (when BMO actually only needs an API key) and is really a bit heavyweight for just submitting bug comments.

As such:
* Part 5 is no longer required (since we don't need to add Bugsy to the requirements file)
* I've refactored the script slightly
* We now catch various errors that can occur, and only print them to the log, to avoid them taking down the whole cron run (eg in the case of typoed bug numbers, or security bugs).
Attachment #8666064 - Attachment is obsolete: true
Attachment #8666081 - Attachment is obsolete: true
Attachment #8666064 - Flags: review?(jgriffin)
Attachment #8666081 - Flags: review?(jgriffin)
Attachment #8666797 - Flags: review?(jgriffin)
Oh and we now retry each bug submission up to 3 times, in the case of DNS or connection errors.
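
The retry behaviour mentioned above could look something like the sketch below. It is an illustration rather than the landed code: `submit_with_retries` and the `submit` callable are hypothetical, "up to 3 times" is interpreted here as three attempts in total, and the sketch targets Python 3 (where DNS and connection errors are both subclasses of OSError).

```python
import time


def submit_with_retries(submit, attempts=3, delay=0):
    """Call submit() until it succeeds, retrying on connection-style errors.

    Returns submit()'s result, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return submit()
        except OSError as exc:
            # In Python 3, ConnectionError and socket.gaierror (DNS
            # failures) are both subclasses of OSError.
            print("Attempt %d/%d failed: %s" % (attempt, attempts, exc))
            if attempt == attempts:
                # Give up on this bug only; the error has been logged and
                # the rest of the cron run can continue.
                return None
            time.sleep(delay)
```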
Regarding the cron job, the current mailer cron line is:

0 8 * * tue     (/home/webtools/apps/orangefactor/src/orangefactor/woo_mailer.sh 2>&1) >> /home/webtools/apps/orangefactor/woo_mailer.log

brasstacks timezone is Pacific Time, so this corresponds to every tuesday at 8am pacific.

A wrapper script is unnecessary for this new script IMO, I'm going to add the following to the webtools crontab:

OFHOME=/home/webtools/apps/orangefactor
VENV_PYTHON=$OFHOME/bin/python

Which will then let me use this for the daily commenter:

0 20 * * *     cd $OFHOME && $VENV_PYTHON woo_commenter.py >> $OFHOME/woo_commenter-daily.log 2>&1

...which is 8pm pacific every day (ie after the daily peak in starrings is likely to have passed)

And this for the weekly commenter:

0 12 * * fri     cd $OFHOME && $VENV_PYTHON woo_commenter.py --weekly >> $OFHOME/woo_commenter-weekly.log 2>&1

...which is 12 midday pacific every friday (which gives a nice summary of the week, while leaving a few hours to do something about an issue before the weekend, if needed).

The times are somewhat arbitrary, happy for alternative suggestions.
Depends on: 1209189
Attachment #8666058 - Flags: review?(jgriffin) → review+
Attachment #8666060 - Flags: review?(jgriffin) → review+
Attachment #8666061 - Flags: review?(jgriffin) → review+
Attachment #8666062 - Flags: review?(jgriffin) → review+
Comment on attachment 8666797 [details] [diff] [review]
Part 6: Add a script that can post daily/weekly failure stats to bugs

Review of attachment 8666797 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!
Attachment #8666797 - Flags: review?(jgriffin) → review+
Depends on: 1209339
remote:   https://hg.mozilla.org/automation/orangefactor/rev/240b6ebdf94d
remote:   https://hg.mozilla.org/automation/orangefactor/rev/017b63e86565
remote:   https://hg.mozilla.org/automation/orangefactor/rev/838132d34cc5
remote:   https://hg.mozilla.org/automation/orangefactor/rev/d7b7371a533f
remote:   https://hg.mozilla.org/automation/orangefactor/rev/48efd7034a4a
remote:   https://hg.mozilla.org/automation/orangefactor/rev/d54b7925c8c5

* Hg repo updated on brasstacks
* woo_mailer.conf renamed to woo_cron.conf
* API key generated on the new orangefactor@bots.tld account
* API key added to woo_cron.conf on brasstacks
* Verified the script succeeded on brasstacks (which runs python 2.6) using --test

Just need to:
1) Update brasstacks timezone (bug 1209339)
2) Test the daily script (actually get it to comment on bugs)
3) Set crons
4) Land Treeherder PR + deploy
Few changes of plan for the crons:
1) Given the brasstacks timezone has now been changed to UTC, the existing jobs in the crontab need 8 hours subtracting from them.
2) I've decided to run the daily woo_commenter.py cron closer to UTC midnight, since that's when the previous day's data is freshest.
3) The daily woo_commenter.py cron will not be run on the same day the weekly summary is generated, to save doubling up.
4) The bz_cache_refresh.py cron overlaps with other jobs (and always has), so I've tweaked it to run at half past the hour.

crontab before:

[webtools@brasstacks1.dmz.scl3 orangefactor]$ crontab -l
0 0 * * *   /home/webtools/apps/gofaster_dashboard/bin/fetch-and-process-builddata.sh
0 0,4,8,12,16,18,20 * * * /home/webtools/apps/orangefactor/bin/python /home/webtools/apps/orangefactor/src/bzcache/bzcache/bz_cache_refresh.py
0 8 * * tue     (/home/webtools/apps/orangefactor/src/orangefactor/woo_mailer.sh 2>&1) >> /home/webtools/apps/orangefactor/woo_mailer.log

crontab now:

[webtools@brasstacks1.dmz.scl3 orangefactor]$ crontab -l
OFHOME=/home/webtools/apps/orangefactor
VENV_PYTHON=$OFHOME/bin/python

# Run every day at 16:00 UTC.
0 16 * * *   /home/webtools/apps/gofaster_dashboard/bin/fetch-and-process-builddata.sh

# Run every four hours, at half-past the hour.
30 0,4,8,12,16,18,20 * * * $VENV_PYTHON $OFHOME/src/bzcache/bzcache/bz_cache_refresh.py

# Run every Tuesday at 02:00 UTC.
0 2 * * tue     $OFHOME/src/orangefactor/woo_mailer.sh >> $OFHOME/woo_mailer.log 2>&1

# Run every day except Monday at 01:00 UTC.
0 1 * * 0,2,3,4,5,6     cd $OFHOME && $VENV_PYTHON woo_commenter.py >> $OFHOME/woo_commenter-daily.log 2>&1

# Run every Monday at 01:00 UTC.
0 1 * * mon     cd $OFHOME && $VENV_PYTHON woo_commenter.py --weekly >> $OFHOME/woo_commenter-weekly.log 2>&1

--

Happy for suggestions as to better times for the crons :-)

I'll keep an eye on the logs to see if it's all working properly, but barring any issues, we're all done here (other than bug 1179310 / bug 1208102 being deployed to Treeherder production).
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Example comment left by the daily commenter that ran a few mins ago, can be found in bug 1144499 comment 350.

I had to tweak the crontab slightly, since (a) environment variable definitions within a crontab apparently can't refer to each other, and (b) the cd command was using the wrong path.

It's now:

[webtools@brasstacks1.dmz.scl3 ~]$ crontab -l
GFHOME=/home/webtools/apps/gofaster_dashboard
OFHOME=/home/webtools/apps/orangefactor
VENV_PYTHON=/home/webtools/apps/orangefactor/bin/python

# Run every day at 16:00 UTC.
0 16 * * *   $GFHOME/bin/fetch-and-process-builddata.sh >> $GFHOME/fetch-and-process-builddata.log 2>&1

# Run every four hours, at half-past the hour.
30 0,4,8,12,16,18,20 * * * $VENV_PYTHON $OFHOME/src/bzcache/bzcache/bz_cache_refresh.py >> $OFHOME/bzcache_refresh.log 2>&1

# Run every Tuesday at 02:00 UTC.
0 2 * * tue     $OFHOME/src/orangefactor/woo_mailer.sh >> $OFHOME/woo_mailer.log 2>&1

# Run every day except Monday, at 01:00 UTC.
0 1 * * 0,2,3,4,5,6     cd $OFHOME/src/orangefactor && $VENV_PYTHON woo_commenter.py >> $OFHOME/woo_commenter-daily.log 2>&1

# Run every Monday at 01:00 UTC.
0 1 * * mon     cd $OFHOME/src/orangefactor && $VENV_PYTHON woo_commenter.py --weekly >> $OFHOME/woo_commenter-weekly.log 2>&1
Depends on: 1212712
No longer depends on: 1224020
Product: Tree Management → Tree Management Graveyard