Closed Bug 939817 Opened 11 years ago Closed 8 years ago

more vcs-sync email tweaks

Categories

(Developer Services :: General, task)

All
Windows XP
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Unassigned)

References

Details

Attachments

(4 files, 1 obsolete file)

Attached patch email.diff (obsolete) — Splinter Review
Per email thread.
I've got this patch running in staging, so far...
Comment on attachment 8333876 [details] [diff] [review]
email.diff

Review of attachment 8333876 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozharness/base/vcs/vcssync.py
@@ +51,5 @@
> +            subject = "[vcs2vcs] Successful no-op conversion for %s" % job_name
> +        if error_contents and not fatal:
> +            subject += " with warnings"
> +        if self.successful_repos:
> +            subject += ' (' + ','.join(self.successful_repos) + ')'

question: how many items could be in self.successful_repos? If 20, that would make for a very messy subject line that would not be displayed in many viewers.

If it could be long, the info s/b repeated in the message body.
(In reply to Hal Wine [:hwine] (use needinfo) from comment #1)
> Comment on attachment 8333876 [details] [diff] [review]
> email.diff
> 
> Review of attachment 8333876 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: mozharness/base/vcs/vcssync.py
> @@ +51,5 @@
> > +            subject = "[vcs2vcs] Successful no-op conversion for %s" % job_name
> > +        if error_contents and not fatal:
> > +            subject += " with warnings"
> > +        if self.successful_repos:
> > +            subject += ' (' + ','.join(self.successful_repos) + ')'
> 
> question: how many items could be in self.successful_repos? If 20, that
> would make for a very messy subject line that would not be displayed in many
> viewers.
> 
> If it could be long, the info s/b repeated in the message body.

I'm trying not to spam release+vcs2vcs with no-text emails by adding text to the email.
(In reply to Aki Sasaki [:aki] from comment #2)
> (In reply to Hal Wine [:hwine] (use needinfo) from comment #1)
> > question: how many items could be in self.successful_repos? If 20, that
> > would make for a very messy subject line that would not be displayed in many
> > viewers.
> > 
> > If it could be long, the info s/b repeated in the message body.
> 
> I'm trying not to spam release+vcs2vcs with no-text emails by adding text to
> the email.

Also, for there to be ~20 repos in the list, each of those 20 repos would have to be updated since the last run.  That's definitely possible the first run; very unlikely if we're running every minute.
Trying this:

         if self.successful_repos:
-            subject += ' (' + ','.join(self.successful_repos) + ')'
+            if len(self.successful_repos) <= 5:
+                subject += ' (' + ','.join(self.successful_repos) + ')'
+            else:
+                text += "Successful repos: %s\n\n" % ', '.join(self.successful_repos)
Thanks Aki for this. :)
Attached patch email.diffSplinter Review
Seems kosher in staging.
Attachment #8334539 - Flags: review?(hwine)
Attachment #8333876 - Attachment is obsolete: true
Attachment #8334539 - Flags: review?(hwine) → review+
Blocks: 929334
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch email.diffSplinter Review
Oops.
Attachment #8334871 - Flags: review?(hwine)
Attachment #8334871 - Flags: review?(hwine) → review+
Comment on attachment 8334871 [details] [diff] [review]
email.diff

http://hg.mozilla.org/build/mozharness/rev/6ea046dcfd0c
Transplanted to production
Nuked the repo_update.json's on vcssync{1,2} so everything gets pushed.
Attachment #8334871 - Flags: checked-in+
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
something[s] here made it to production
The email thread that led to this ticket being created.
I think we still need to fix the subject so that it does not say "successful" if there is a non-self-recovering error that requires human attention.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
For ref, I asked Pete to look into this today following a chain of cron spam this morning reporting "Successful no-op conversion for build-repos" when the output contained the following:

10:17:06    ERROR -  abort: data/lib/python/mozilla_buildtools/test/sample-patcher-config.cfg.i@59fcd33f2135: no match found!
10:17:06    ERROR -  Automation Error: hg not responding
10:17:06    ERROR - Return code: 255
...
10:17:09    ERROR -  error: src refspec refs/tags/FIREFOX_28_0b6_RELEASE does not match any.
10:17:09    ERROR -  error: src refspec refs/tags/FIREFOX_28_0b6_BUILD1 does not match any.
10:17:09    ERROR -  error: src refspec refs/tags/FENNEC_28_0b6_RELEASE does not match any.
10:17:09    ERROR -  error: src refspec refs/tags/FENNEC_28_0b6_BUILD1 does not match any.
10:17:09    ERROR -  error: failed to push some refs to 'git@github.com:mozilla/build-tools.git'
10:17:09    ERROR - Return code: 1
10:17:09    ERROR - build-tools: Can't push /opt/vcs2vcs/vcs_sync/build/conversion/build-tools to build-tools-github!
10:17:09    ERROR - Unable to push build-tools.
...
10:17:09    FATAL - Unable to push these repos:
10:17:09    FATAL - build-tools: Can't push /opt/vcs2vcs/vcs_sync/build/conversion/build-tools to build-tools-github!
10:17:09    FATAL - 
10:17:09    FATAL - Running post_fatal callback...

Either we're not catching those failures properly, or we're ignoring them. 

We're retrying this conversion every minute without true "Success," so we should fix or throttle.
Pete: is this on your plate, or did you want me to fix this?
Flags: needinfo?(pmoore)
Assignee: aki → pmoore
Flags: needinfo?(pmoore)
I can take it.
I haven't forgotten about this - I'm going to work on this after I land the work for Bug 799719 - (vcs-sync) tracker to retire legacy vcs2vcs.

It will essentially mean altering this function: http://hg.mozilla.org/build/mozharness/file/3347b848256c/mozharness/base/vcs/vcssync.py#l28
This patch makes sure we always have an email subject of "[vcs2vcs] Failed conversion for XXXXX" if there are any failures.

This is something of a simplification, but hopefully should be clearer to people receiving the email, that there is a problem, and the problems are then detailed precisely in the body of the email.

I also choose not to explicitly list successful repos in the case there are failures too, to avoid that a user sees a list of successful repos, and assumes that all were successful. Therefore, this is an attempt to keep things simple - if there is a failure, the subject states this simply, and the body of the email details the problems.

Let me know what you think! :)

Pete
Attachment #8459902 - Flags: review?(aki)
Comment on attachment 8459902 [details] [diff] [review]
bug939817_mozharness_v1.patch

It is a simplification.

I currently am wondering what happens when there's an intermittent hg burp -- let's say one out of every hg pulls or clones fails.  Right now with that kind of thing will go into my filtered vcs-sync box, and the actual fatal() calls go into my inbox.  Now both will go into my inbox, or I'll be tempted to filter the failures as well.

Not that the status quo is particularly good or anything.  Maybe this is the first step, and we keep some sort of info about latest email sent about a particular repo in the repo_update.json, so we don't email every minute?

We can definitely try this to see if it works for the team.
Attachment #8459902 - Flags: review?(aki) → review+
Comment on attachment 8459902 [details] [diff] [review]
bug939817_mozharness_v1.patch

Rolled out to production, which caused the following issue:

Uncaught exception: Traceback (most recent call last):
 File "/opt/vcs2vcs/mozharness/mozharness/base/script.py", line 1243, in run
   self.run_action(action)
 File "/opt/vcs2vcs/mozharness/mozharness/base/script.py", line 1185, in run_action
   self._possibly_run_method(method_name, error_if_missing=True)
 File "/opt/vcs2vcs/mozharness/mozharness/base/script.py", line 1126, in _possibly_run_method
   return getattr(self, method_name)()
 File "/opt/vcs2vcs/mozharness/mozharness/base/vcs/vcssync.py", line 52, in notify
   if len(message) > max_log_sample_size:
TypeError: object of type 'NoneType' has no len()



04:49:17     INFO - #####
04:49:17     INFO - ##### Running notify step.
04:49:17     INFO - #####
04:49:17     INFO - Running main action method: notify
04:49:17     INFO - Job took 16 seconds.
04:49:17    FATAL - Uncaught exception: Traceback (most recent call last):
04:49:17    FATAL -   File "/opt/vcs2vcs/mozharness/mozharness/base/script.py", line 1243, in run
04:49:17    FATAL -     self.run_action(action)
04:49:17    FATAL -   File "/opt/vcs2vcs/mozharness/mozharness/base/script.py", line 1185, in run_action
04:49:17    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
04:49:17    FATAL -   File "/opt/vcs2vcs/mozharness/mozharness/base/script.py", line 1126, in _possibly_run_method
04:49:17    FATAL -     return getattr(self, method_name)()
04:49:17    FATAL -   File "/opt/vcs2vcs/mozharness/mozharness/base/vcs/vcssync.py", line 52, in notify
04:49:17    FATAL -     if len(message) > max_log_sample_size:
04:49:17    FATAL - TypeError: object of type 'NoneType' has no len()
04:49:17    FATAL - Running post_fatal callback...
04:49:17     INFO - Job took 16 seconds.
04:49:17     INFO - Getting output from command: ['egrep', '-C5', '^[0-9:]+ +(ERROR|CRI[T]ICAL|FATAL) -', '/opt/vcs2vcs/logs/project-branches_info.log']
04:49:17     INFO - Copy/paste: egrep -C5 "^[0-9:]+ +(ERROR|CRI[T]ICAL|FATAL) -" /opt/vcs2vcs/logs/project-branches_info.log



I fixed this with a bustage fix:
https://hg.mozilla.org/build/mozharness/rev/0322797e1ac9


However, then I spotted that we were getting success emails constantly, which was not intended. In fairness Aki had pointed this out in comment 18 ("Maybe this is the first step, and we keep some sort of info about latest email sent about a particular repo in the repo_update.json, so we don't email every minute?") - but I had (falsely) thought this was related to when there were some failures, not when everything was successful - my bad.

I have now backed out the changes. Will fix up, and attach a new patch for review. Apologies for the mess up.

Pete
Attachment #8459902 - Flags: checked-in-
Stepping back.

The intent here was to create 5? classes of email:

1) Success! No new changes, but maybe you want to see that we're running and how long a no-op run takes.  These are skipped with 'skip_empty_messages': True
2) Success! Something was converted and pushed, but nothing went wrong, and nothing went wrong with this repo the previous time.  These are also skipped with 'skip_empty_messages': True
3) Success! Something was converted and pushed, and this is the first time this repo was pushed ever, or the first time this repo was pushed after a failure on that repo. This is probably skipped if 'failure_only': True, but it might be nice to get these anyway, like a nagios ok or a releng-ci yippie!
4) Error! The script finished all the way through, but something threw an error along the way.  Send the log, but with a "Success with errors" subject so it can be parsed at a non-urgent pace.  Ideally this is only sent if it fails X number of times, and then it only sends an email after X amount of time.  We may need to deal with flapping as well.
5) Fatal!  Actionable!  The script needs urgent attention!  Do not filter this email!  Have it hit your inbox and jump on it ASAP!  This would also be a good class of email to have sent to a pager.

The subject has useful info: success, success with errors, failure; time to convert; what repos were converted.  The first 3 allow you to filter the first two easily while paying very close attention to the third.

There are definitely bugs here, especially around the "don't send email on every hiccup" threshold -- it would be nice to have that configurable.

Your simplification is great from a code perspective; way fewer code paths to care about.  It's also great from a multi-repo-conversion perspective; why fatal() on one repo when you can convert the rest?  I don't like it from an email user perspective: both success and failure emails spammed, so all 5 classes just went into a massive clump of email that meant I either have to slog through the whole thing to separate signal from noise.  My preferences here no longer apply as of EOD Friday, however, so whatever works for people.  I think the more-noise-less-signal won't work for the team.

I think you agree that we need more signal and less noise.  I thought explicitly stating original intent might help in getting there.
(In reply to Aki Sasaki [:aki] from comment #20)
> The subject has useful info: success, success with errors, failure; time to
> convert; what repos were converted.  The first 3 allow you to filter the
> first two easily while paying very close attention to the third.

"The third" meaning failure.  The latter two, time and which repos, can be useful for other reasons.  Time helps when you're dealing with someone complaining about conversion turnaround times, which hasn't happened in recent memory.  Which repos might be helpful, but I haven't really found it critical info.  It went in the subject rather than the body to have this class of email fall under "skip_empty_messages" only.
Assignee: pmoore → nobody
Kill?! :)
Flags: needinfo?(hwine)
moving to current product/component, as relevant for anyone working on mozharness based vcs-sync, but appears all work done.
Status: REOPENED → RESOLVED
Closed: 11 years ago8 years ago
Component: Tools → General
Flags: needinfo?(hwine)
Product: Release Engineering → Developer Services
QA Contact: hwine
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: