Closed Bug 1453760 Opened 6 years ago Closed 6 years ago

Create a replacement for the OrangeFactor intermittent bug commenter

Categories

(Tree Management :: Treeherder, enhancement, P1)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: sclements)

References

Details

Attachments

(5 files, 1 obsolete file)

The OF commenter needs to determine the platform and repo breakdown of failures, so it can make comments like:

Repository breakdown:
* mozilla-central: 33
* mozilla-inbound: 20

Platform breakdown:
* linux64: 28
* linux32-stylo-disabled: 11

I think the only way to get platform and repo info from the new api is with /failuresbybug, which returns data for one bug at a time. So we need a /failures query to get the list of bugs, and then a /failuresbybug query for each failing bug.

:emorley - Can you verify that's the best way to proceed?
Flags: needinfo?(emorley)
Attached patch a first attempt, for discussion (obsolete) — Splinter Review
Let's talk about something concrete. I *think* this works, but haven't tested much. It seems quite inefficient - takes me about 30 minutes to query treeherder - but that might be okay.

I note 3 issues:
 - no support for tree=all
 - I seem to need to query one "page" at a time
 - as I described earlier, need to use /failures to get a list of bugs and then make individual queries for each bug

Is there an easier way, currently?

Should we update the api to make this more convenient?
Flags: needinfo?(emorley)
Attachment #8969798 - Flags: feedback?(emorley)
Comment on attachment 8969798 [details] [diff] [review]
a first attempt, for discussion

Hi :-) Thank you for looking at this!

I brought up this bug at the Treeherder meeting yesterday. I agree that it would be good to adjust the API to avoid the inefficient queries. However James raised the good point that if we're having to do that, whether it was worth moving this tool into Treeherder instead (after all)?

Hopefully we'll be having Sarah back again soon to work on the new intermittents view - she wrote the API so might be a good person to coordinate with on this.
Attachment #8969798 - Flags: feedback?(emorley)
Great - glad to hear it has been discussed. 

I'll set this aside for now, in hopes that Sarah or another Treeherderer can finish it off one way or another.
Assignee: gbrown → nobody
Priority: -- → P3
Assignee: nobody → sclements313
Priority: P3 → P1
Ed, as far as integrating this into Treeherder, were you thinking we should use the Heroku scheduler add-on or the custom clock process with celery?
Flags: needinfo?(emorley)
It's a tough call to say at the moment. 

I was hoping to explore moving some tasks from the celery beat scheduler to the Heroku addon for the reasons in bug 1176492 comment 0, so my first reaction was to say "Heroku scheduler", but we also need to bear in mind:
https://devcenter.heroku.com/articles/scheduler#known-issues-and-alternatives

It really depends on:
(a) how long running these tasks are,
(b) how memory intensive (ie will they interfere with other tasks running on the default worker at present).

The implementation difference between the two approaches will be minimal however, so I'd just work on getting the feature working, and initially it can just be called manually from a custom Django command (the command itself should be pretty empty, with the functionality implemented elsewhere):
https://docs.djangoproject.com/en/1.11/howto/custom-management-commands/
Flags: needinfo?(emorley)
There's a comment in woo_commenter.py, line 150:
`    # For an initial trial period, only bugs in these components will be
    # marked for triage.
     with open('owner_triage_components.json') as f:
        components = json.load(f)
        print "Found %d owner triaged components" % len(components)
`
It uses data from the file as a dictionary for bug component and product info (used on line 225). Is this still needed? 

Also, is it still preferable to grab bug info (whiteboard, priority, etc) directly from bugzilla rather than using TH bugscache?
Flags: needinfo?(gbrown)
(In reply to sclements from comment #7)
> There's a comment in woo_commenter.py, line 150:
> `    # For an initial trial period, only bugs in these components will be
>     # marked for triage.
>      with open('owner_triage_components.json') as f:
>         components = json.load(f)
>         print "Found %d owner triaged components" % len(components)
> `
> It uses data from the file as a dictionary for bug component and product
> info (used on line 225). Is this still needed? 

Yes, the "trial" is on-going.

> Also, is it still preferable to grab bug info (whiteboard, priority, etc)
> directly from bugzilla rather than using TH bugscache?

I don't know anything about TH bugscache, so hard for me to say. Does it provide the same information?
Flags: needinfo?(gbrown)
Yes it does, but that might be more of a question for Ed since it only fetches bugzilla data on a schedule.
Flags: needinfo?(emorley)
For bug metadata that's being updated, I'd query Bugzilla directly to ensure it's up to date.
Flags: needinfo?(emorley)
I assumed as much - thanks guys!
A note on my patch:

Comparing one of the weekly bug comments - #1331445 (https://bugzilla.mozilla.org/show_bug.cgi?id=1331445) - there's a difference between the comment from the OF commenter and the refactored code (based on the template conditionals). From my understanding, this below comment should be comment that's used ('test has failed more than 150 times...'); please correct me if I'm mistaken. Also pushes are higher as I'm using tree=all for all data and it was run today so numbers are slightly different.


96 failures in 2249 pushes (0.043 failures/push) were associated with this bug in the last 7 days.

This is the #17 most frequent failure this week.

** This test has failed more than 150 times in the last 21 days. It should be disabled until it can be fixed. ** 

Repository breakdown:
* try: 3
* mozilla-central: 7
* autoland: 56
* mozilla-inbound: 30

Platform breakdown:
* android-4-2-x86: 1
* android-4-3-armv7-api16: 95

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1331445&startday=2018-05-15&endday=2018-05-21&tree=all
Attachment #8979741 - Flags: review?(emorley)
Status: NEW → ASSIGNED
Component: OrangeFactor → Intermittent Failures View
Comment on attachment 8979741 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571

Sorry for the delay here - there's been a lot to page in (I'm not very familiar with the OrangeFactor commenter).

I've started looking at this, but run out of time today - will continue tomorrow.

However it would be good to have Geoff have a look (now that I think he's back?) to comment on the overall functionality/business logic.
Attachment #8979741 - Flags: review?(gbrown)
Comment on attachment 8979741 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571

The PR looks good so far -- next steps is perhaps to get Geoff to look over the business logic side, and then we can figure out adding unit tests / how best to get it running in production etc.
Attachment #8979741 - Flags: review?(emorley)
Comment on attachment 8979741 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571

Review of attachment 8979741 [details] [review]:
-----------------------------------------------------------------

I left a couple of minor comments on the PR, but this looks great to me.
Attachment #8979741 - Attachment is patch: true
Attachment #8979741 - Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8979741 - Flags: review?(gbrown) → review+
Comment on attachment 8979741 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571

I've added a test, the celery task and refactored the code to make it a little more readable. Probably worth having Geoff review again before it goes live.
Attachment #8979741 - Flags: review?(gbrown)
Attachment #8979741 - Flags: review?(emorley)
Attachment #8979741 - Flags: review+
Attachment #8969798 - Attachment is obsolete: true
Attachment #8979741 - Flags: review?(gbrown) → review+
Comment on attachment 8979741 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571

(Resetting attachment properties. Geoff - it's best to not use the GitHub splinter review UI when reviewing GitHub link attachments since otherwise it triggers bug 1384310. Using the basic Bugzilla review flag page doesn't trigger the bug.)
Attachment #8979741 - Attachment is patch: false
Attachment #8979741 - Attachment mime type: text/plain → text/x-github-pull-request
(And when the attachment properties get changed, the link stops working and doesn't redirect)
Comment on attachment 8979741 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571

Looking pretty good! I've left some comments on the PR - still haven't had time to test it locally (due to needing to sort out more unrelated prod regressions today), so there may be a few more changes needed on the re-review - but this should be the bulk of it.
Attachment #8979741 - Flags: review?(emorley) → review-
Attachment #8979741 - Flags: review- → review?(emorley)
Fyi, I've requested the OF robot api key from Emma Humphries and access to that account so I change the display name.
Comment on attachment 8979741 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571

Left some comments - very close now :-)

With the change to handle the API key not being set, we can deploy this independently of actually switching it on - which should simplify things a bit.

One other thing I forgot to mention in the review - there is a default celery time limit (see settings.py) - presuming the task is expected to run longer than this, the `@task` definition will need it's own `soft_time_limit` and `time_limit` setting (soft should be slightly shorter than the time_limit to ensure the task has time to gracefully exit - see comment in settings.py).
Attachment #8979741 - Flags: review?(emorley) → review-
Attachment #8979741 - Flags: review- → review?(emorley)
Re updating the environment variables (ref PR comment) - I've set `BUG_FILER_API_KEY` to match the existing `BUGZILLA_API_KEY` value in the places where it was already set (prod, stage; latter since it uses the staging Bugzilla to test filing bugs). Some time after this PR is merged to master and deployed to stage+prod, we can remove the old `BUGZILLA_API_KEY` variable.

As for `BUG_COMMENTER_API_KEY`, we'll want to leave that unset initially, meaning we can merge/deploy this PR safely (since it will not leave any comments) but still get the benefit of seeing if any errors appear on New Relic (since it will do everything but the submission of comments). I suspect this step will show timeouts until the `{soft_,}time_limit` is adjusted per comment 22.

Then to enable it, it will just be a case of setting `BUG_COMMENTER_API_KEY` in production (only) without the need for another deploy.
Comment on attachment 8979741 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571

Resetting review until updated after PR comments from earlier today :-)
Attachment #8979741 - Flags: review?(emorley)
I'm running the commenter in weekly_mode to see how long it takes, so I'll add the soft_time_limit attribute in the task with the next round of updates if necessary. I suspect we'll need a longer buffer than what's in the settings file.
Attachment #8979741 - Flags: feedback?(emorley)
Attachment #8979741 - Flags: feedback?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/5386f7ed40bcacb9eaaca5e7ea21c613aaf3056f
Bug 1453760 - bug commentor (#3571)

* Bug 1453760 - bug commentor
replacement for Orange Factor bug commentor
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/057b17311bcdb711f0f20c9594be106159333be2
Bug 1453760 - Fix intermittents_commenter Celery crontab entries (#3848)

Adds `minute=0` to ensure task only runs once during the specified hour.
Blocks: 1445325
Summary: Convert OF commenter to use treeherder API → Create a replacement for the OrangeFactor intermittent bug commenter
Looking at Papertrail, the comment 27 fix has resolved the duplicate tasks being scheduled :-)
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/6b8cdf6babe6aca0aefb454d5da437192f914703
Bug 1453760 - Bug commenter: Skip bugs when metadata not retrievable (#3856)

Since certain bug metadata is not retrievable (even with an API key)
and returns a 401 Authorization Required (eg security bugs). 

The whiteboard is also now only updated if it has changed.
Sarah disabled the cron job on OrangeFactor, and set the API key environment variable on production Treeherder (which enabled the actual commenting rather than a ~dry run mode) last night. 

This morning's daily job appears to have run successfully - there were no exceptions in New Relic, and here's the task profile:

https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?show_browser=false&tw%5Bend%5D=1533207981&tw%5Bstart%5D=1533121581&type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f696e7465726d697474656e74732d636f6d6d656e746572222c22225d&app_trace_id=1deddd46-9622-11e8-ba42-0242ac110009_15785_19927

The output on Papertrail is:

Aug 02 08:02:31 treeherder-prod app/worker_default.1: [2018-08-02 07:02:31,560] WARNING [treeherder.intermittents_commenter.commenter:188] There were 17 comments for this daily task. 

An example comment left on a bug can be found in bug 1473920 comment 14.

Sarah/Geoff, does 17 comments being left sound right, or too few?
(In reply to Ed Morley [:emorley] from comment #32)
> Sarah/Geoff, does 17 comments being left sound right, or too few?

That seems just right.

https://bugzilla.mozilla.org/page.cgi?id=user_activity.html&action=run&from=-14d&who=orangefactor%40bots.tld

One difference that jumps out at me there is that the time of day has changed: Instead of daily comments at 18:00 PDT, they are at 00:00 PDT now. Is that expected? Does anyone care? (I think it's fine.)

The comments look great. Well done!
In https://bugzilla.mozilla.org/show_bug.cgi?id=1479957#c1, the priority was changed from P5 to --. I thought we only did that when setting whiteboard tag [stockwell needswork:owner] (which did not happen here). I may be forgetting some rule, but this doesn't seem right. Sarah, can you check?
Flags: needinfo?(sclements313)
I took a look and you're right - my bad. It's a quick fix, so I'll get this pushed out asap. Also, the OF crontab showed that the daily commenter only runs the days the weekly commenter doesn't. Did you want this to be apply to the replacement commenter? I did select arbitrary times for the Intermittents Commenter task to run (to answer your comment above) but I can mirror what's on the crontab exactly if you think that's best.
Flags: needinfo?(sclements313) → needinfo?(gbrown)
(In reply to sclements from comment #35)
>  the OF crontab showed that the daily commenter only
> runs the days the weekly commenter doesn't. Did you want this to be apply to
> the replacement commenter? 

Yes: 6 days of the week we want only the daily comment and on the 7th day we want only the weekly comment.

> I did select arbitrary times for the
> Intermittents Commenter task to run (to answer your comment above) but I can
> mirror what's on the crontab exactly if you think that's best.

I think the exact times are not important.
Flags: needinfo?(gbrown)
Geoff/Sarah - once the changes mentioned above are made, how long do you think we need to wait before:
(a) disabling mirroring of data to OrangeFactor (bug 1445325)
(b) then beginning the OrangeFactor decommissioning process (bugs to be filed using notes attached to this bug; with one addition: requesting disabling/removal of the Auth0 client used by OrangeFactor)?

Considerations:
* Is it likely we'll need to switch back to the old commenter?
  -> I think perhaps not, since even if bugs are found, we'll presumably just fix in place each day and then wait to see what happens for the next scheduled job.
* The hard deadline for being out of SCL3 is 1st Sept, and we'll likely be hassled regularly if we're not done several weeks in advance of that.
* I'm on PTO from tomorrow for a week, so was thinking it would be good if we could do (a) and maybe even (b) before I go away.
If we didn't have a deadline, I would suggest waiting a week, or at least until there has been a --weekly run.

In light of the deadline, and given that last night's run was almost-perfect, and with faith in Sarah's ability to respond to the unexpected, I think we can be more aggressive / proceed with disabling right away.

Let's check with jmaher too (he's back).
Flags: needinfo?(jmaher)
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/c783f47e53d30467c493d937a5eb499a56d992b6
Bug 1453760 - Adjust daily intermittent commenter schedule (#3870)

So that it no longer runs on the day that the weekly commenter task
is scheduled to run.
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/8560678013304d7b63e54492844fc86ff2dd7aeb
Bug 1453760 - Fix commenter's check_needswork_owner() (#3874)

change logic so that it always updates whiteboard status,
unless whiteboard_needswork_owner already exists.
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/2c6a3b147129e605c83cf8e670a8cb881e593f60
Bug 1453760 - Fix commenter update_whiteboard() (#3877)

Handle the case where no existing stockwell text was found or whiteboard
is an empty string.
The follow-ups have been deployed to production, along with bug 1445325.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(jmaher)
Resolution: --- → FIXED
Depends on: 1481499
Depends on: 1481137
Now that this has landed and not needed rolling back, I've unset the BUGZILLA_API_KEY environment variable on stage and prod (since it's replaced by BUG_FILER_API_KEY and BUG_COMMENTER_API_KEY).
Depends on: 1484641
Component: Intermittent Failures View → TreeHerder
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: