Closed
Bug 1453760
Opened 6 years ago
Closed 6 years ago
Create a replacement for the OrangeFactor intermittent bug commenter
Categories
(Tree Management :: Treeherder, enhancement, P1)
Tree Management
Treeherder
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gbrown, Assigned: sclements)
References
Details
Attachments
(5 files, 1 obsolete file)
47 bytes,
text/x-github-pull-request
|
gbrown
:
review+
emorley
:
review+
|
Details | Review |
47 bytes,
text/x-github-pull-request
|
Details | Review | |
47 bytes,
text/x-github-pull-request
|
Details | Review | |
47 bytes,
text/x-github-pull-request
|
Details | Review | |
47 bytes,
text/x-github-pull-request
|
Details | Review |
As suggested in https://bugzilla.mozilla.org/show_bug.cgi?id=1367364#c7.
Reporter | ||
Comment 1•6 years ago
|
||
The OF commenter needs to determine the platform and repo breakdown of failures, so it can make comments like: Repository breakdown: * mozilla-central: 33 * mozilla-inbound: 20 Platform breakdown: * linux64: 28 * linux32-stylo-disabled: 11 I think the only way to get platform and repo info from the new api is with /failuresbybug, which returns data for one bug at a time. So we need a /failures query to get the list of bugs, and then a /failuresbybug query for each failing bug. :emorley - Can you verify that's the best way to proceed?
Flags: needinfo?(emorley)
Reporter | ||
Comment 2•6 years ago
|
||
Let's talk about something concrete. I *think* this works, but haven't tested much. It seems quite inefficient - takes me about 30 minutes to query treeherder - but that might be okay. I note 3 issues: - no support for tree=all - I seem to need to query one "page" at a time - as I described earlier, need to use /failures to get a list of bugs and then make individual queries for each bug Is there an easier way, currently? Should we update the api to make this more convenient?
Flags: needinfo?(emorley)
Attachment #8969798 -
Flags: feedback?(emorley)
Comment 3•6 years ago
|
||
Comment on attachment 8969798 [details] [diff] [review] a first attempt, for discussion Hi :-) Thank you for looking at this! I brought up this bug at the Treeherder meeting yesterday. I agree that it would be good to adjust the API to avoid the inefficient queries. However James raised the good point that if we're having to do that, whether it was worth moving this tool into Treeherder instead (after all)? Hopefully we'll be having Sarah back again soon to work on the new intermittents view - she wrote the API so might be a good person to coordinate with on this.
Attachment #8969798 -
Flags: feedback?(emorley)
Reporter | ||
Comment 4•6 years ago
|
||
Great - glad to hear it has been discussed. I'll set this aside for now, in hopes that Sarah or another Treeherderer can finish it off one way or another.
Assignee: gbrown → nobody
Reporter | ||
Updated•6 years ago
|
Priority: -- → P3
Assignee | ||
Updated•6 years ago
|
Assignee: nobody → sclements313
Assignee | ||
Updated•6 years ago
|
Priority: P3 → P1
Assignee | ||
Comment 5•6 years ago
|
||
Ed, as far as integrating this into Treeherder, were you thinking we should use the Heroku scheduler add-on or the custom clock process with celery?
Flags: needinfo?(emorley)
Comment 6•6 years ago
|
||
It's a tough call to say at the moment. I was hoping to explore moving some tasks from the celery beat scheduler to the Heroku addon for the reasons in bug 1176492 comment 0, so my first reaction was to say "Heroku scheduler", but we also need to bear in mind: https://devcenter.heroku.com/articles/scheduler#known-issues-and-alternatives It really depends on: (a) how long running these tasks are, (b) how memory intensive (ie will they interfere with other tasks running on the default worker at present). The implementation difference between the two approaches will be minimal however, so I'd just work on getting the feature working, and initially it can just be called manually from a custom Django command (the command itself should be pretty empty, with the functionality implemented elsewhere): https://docs.djangoproject.com/en/1.11/howto/custom-management-commands/
Flags: needinfo?(emorley)
Assignee | ||
Comment 7•6 years ago
|
||
There's a comment in woo_commenter.py, line 150: ` # For an initial trial period, only bugs in these components will be # marked for triage. with open('owner_triage_components.json') as f: components = json.load(f) print "Found %d owner triaged components" % len(components) ` It uses data from the file as a dictionary for bug component and product info (used on line 225). Is this still needed? Also, is it still preferable to grab bug info (whiteboard, priority, etc) directly from bugzilla rather than using TH bugscache?
Flags: needinfo?(gbrown)
Reporter | ||
Comment 8•6 years ago
|
||
(In reply to sclements from comment #7) > There's a comment in woo_commenter.py, line 150: > ` # For an initial trial period, only bugs in these components will be > # marked for triage. > with open('owner_triage_components.json') as f: > components = json.load(f) > print "Found %d owner triaged components" % len(components) > ` > It uses data from the file as a dictionary for bug component and product > info (used on line 225). Is this still needed? Yes, the "trial" is on-going. > Also, is it still preferable to grab bug info (whiteboard, priority, etc) > directly from bugzilla rather than using TH bugscache? I don't know anything about TH bugscache, so hard for me to say. Does it provide the same information?
Flags: needinfo?(gbrown)
Assignee | ||
Comment 9•6 years ago
|
||
Yes it does, but that might be more of a question for Ed since it only fetches bugzilla data on a schedule.
Flags: needinfo?(emorley)
Comment 10•6 years ago
|
||
For bug metadata that's being updated, I'd query Bugzilla directly to ensure it's up to date.
Flags: needinfo?(emorley)
Assignee | ||
Comment 11•6 years ago
|
||
I assumed as much - thanks guys!
Comment 12•6 years ago
|
||
Assignee | ||
Comment 13•6 years ago
|
||
A note on my patch: Comparing one of the weekly bug comments - #1331445 (https://bugzilla.mozilla.org/show_bug.cgi?id=1331445) - there's a difference between the comment from the OF commenter and the refactored code (based on the template conditionals). From my understanding, this below comment should be comment that's used ('test has failed more than 150 times...'); please correct me if I'm mistaken. Also pushes are higher as I'm using tree=all for all data and it was run today so numbers are slightly different. 96 failures in 2249 pushes (0.043 failures/push) were associated with this bug in the last 7 days. This is the #17 most frequent failure this week. ** This test has failed more than 150 times in the last 21 days. It should be disabled until it can be fixed. ** Repository breakdown: * try: 3 * mozilla-central: 7 * autoland: 56 * mozilla-inbound: 30 Platform breakdown: * android-4-2-x86: 1 * android-4-3-armv7-api16: 95 For more details, see: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1331445&startday=2018-05-15&endday=2018-05-21&tree=all
Assignee | ||
Updated•6 years ago
|
Attachment #8979741 -
Flags: review?(emorley)
Updated•6 years ago
|
Status: NEW → ASSIGNED
Component: OrangeFactor → Intermittent Failures View
Comment 14•6 years ago
|
||
Comment on attachment 8979741 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571 Sorry for the delay here - there's been a lot to page in (I'm not very familiar with the OrangeFactor commenter). I've started looking at this, but run out of time today - will continue tomorrow. However it would be good to have Geoff have a look (now that I think he's back?) to comment on the overall functionality/business logic.
Attachment #8979741 -
Flags: review?(gbrown)
Comment 15•6 years ago
|
||
Comment on attachment 8979741 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571 The PR looks good so far -- next steps is perhaps to get Geoff to look over the business logic side, and then we can figure out adding unit tests / how best to get it running in production etc.
Attachment #8979741 -
Flags: review?(emorley)
Reporter | ||
Comment 16•6 years ago
|
||
Comment on attachment 8979741 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571 Review of attachment 8979741 [details] [review]: ----------------------------------------------------------------- I left a couple of minor comments on the PR, but this looks great to me.
Attachment #8979741 -
Attachment is patch: true
Attachment #8979741 -
Attachment mime type: text/x-github-pull-request → text/plain
Attachment #8979741 -
Flags: review?(gbrown) → review+
Assignee | ||
Comment 17•6 years ago
|
||
Comment on attachment 8979741 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571 I've added a test, the celery task and refactored the code to make it a little more readable. Probably worth having Geoff review again before it goes live.
Attachment #8979741 -
Flags: review?(gbrown)
Attachment #8979741 -
Flags: review?(emorley)
Attachment #8979741 -
Flags: review+
Reporter | ||
Updated•6 years ago
|
Attachment #8969798 -
Attachment is obsolete: true
Reporter | ||
Updated•6 years ago
|
Attachment #8979741 -
Flags: review?(gbrown) → review+
Comment 18•6 years ago
|
||
Comment on attachment 8979741 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571 (Resetting attachment properties. Geoff - it's best to not use the GitHub splinter review UI when reviewing GitHub link attachments since otherwise it triggers bug 1384310. Using the basic Bugzilla review flag page doesn't trigger the bug.)
Attachment #8979741 -
Attachment is patch: false
Attachment #8979741 -
Attachment mime type: text/plain → text/x-github-pull-request
Comment 19•6 years ago
|
||
(And when the attachment properties get changed, the link stops working and doesn't redirect)
Comment 20•6 years ago
|
||
Comment on attachment 8979741 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571 Looking pretty good! I've left some comments on the PR - still haven't had time to test it locally (due to needing to sort out more unrelated prod regressions today), so there may be a few more changes needed on the re-review - but this should be the bulk of it.
Attachment #8979741 -
Flags: review?(emorley) → review-
Assignee | ||
Updated•6 years ago
|
Attachment #8979741 -
Flags: review- → review?(emorley)
Assignee | ||
Comment 21•6 years ago
|
||
Fyi, I've requested the OF robot api key from Emma Humphries and access to that account so I change the display name.
Comment 22•6 years ago
|
||
Comment on attachment 8979741 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571 Left some comments - very close now :-) With the change to handle the API key not being set, we can deploy this independently of actually switching it on - which should simplify things a bit. One other thing I forgot to mention in the review - there is a default celery time limit (see settings.py) - presuming the task is expected to run longer than this, the `@task` definition will need it's own `soft_time_limit` and `time_limit` setting (soft should be slightly shorter than the time_limit to ensure the task has time to gracefully exit - see comment in settings.py).
Attachment #8979741 -
Flags: review?(emorley) → review-
Assignee | ||
Updated•6 years ago
|
Attachment #8979741 -
Flags: review- → review?(emorley)
Comment 23•6 years ago
|
||
Re updating the environment variables (ref PR comment) - I've set `BUG_FILER_API_KEY` to match the existing `BUGZILLA_API_KEY` value in the places where it was already set (prod, stage; latter since it uses the staging Bugzilla to test filing bugs). Some time after this PR is merged to master and deployed to stage+prod, we can remove the old `BUGZILLA_API_KEY` variable. As for `BUG_COMMENTER_API_KEY`, we'll want to leave that unset initially, meaning we can merge/deploy this PR safely (since it will not leave any comments) but still get the benefit of seeing if any errors appear on New Relic (since it will do everything but the submission of comments). I suspect this step will show timeouts until the `{soft_,}time_limit` is adjusted per comment 22. Then to enable it, it will just be a case of setting `BUG_COMMENTER_API_KEY` in production (only) without the need for another deploy.
Comment 24•6 years ago
|
||
Comment on attachment 8979741 [details] [review] Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/3571 Resetting review until updated after PR comments from earlier today :-)
Attachment #8979741 -
Flags: review?(emorley)
Assignee | ||
Comment 25•6 years ago
|
||
I'm running the commenter in weekly_mode to see how long it takes, so I'll add the soft_time_limit attribute in the task with the next round of updates if necessary. I suspect we'll need a longer buffer than what's in the settings file.
Assignee | ||
Updated•6 years ago
|
Attachment #8979741 -
Flags: feedback?(emorley)
Updated•6 years ago
|
Attachment #8979741 -
Flags: feedback?(emorley) → review+
Comment 26•6 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/5386f7ed40bcacb9eaaca5e7ea21c613aaf3056f Bug 1453760 - bug commentor (#3571) * Bug 1453760 - bug commentor replacement for Orange Factor bug commentor
Comment 27•6 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/057b17311bcdb711f0f20c9594be106159333be2 Bug 1453760 - Fix intermittents_commenter Celery crontab entries (#3848) Adds `minute=0` to ensure task only runs once during the specified hour.
Updated•6 years ago
|
Summary: Convert OF commenter to use treeherder API → Create a replacement for the OrangeFactor intermittent bug commenter
Comment 28•6 years ago
|
||
Comment 29•6 years ago
|
||
Looking at Papertrail, the comment 27 fix has resolved the duplicate tasks being scheduled :-)
Comment 30•6 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/6b8cdf6babe6aca0aefb454d5da437192f914703 Bug 1453760 - Bug commenter: Skip bugs when metadata not retrievable (#3856) Since certain bug metadata is not retrievable (even with an API key) and returns a 401 Authorization Required (eg security bugs). The whiteboard is also now only updated if it has changed.
Comment 31•6 years ago
|
||
Comment 32•6 years ago
|
||
Sarah disabled the cron job on OrangeFactor, and set the API key environment variable on production Treeherder (which enabled the actual commenting rather than a ~dry run mode) last night. This morning's daily job appears to have run successfully - there were no exceptions in New Relic, and here's the task profile: https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?show_browser=false&tw%5Bend%5D=1533207981&tw%5Bstart%5D=1533121581&type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f696e7465726d697474656e74732d636f6d6d656e746572222c22225d&app_trace_id=1deddd46-9622-11e8-ba42-0242ac110009_15785_19927 The output on Papertrail is: Aug 02 08:02:31 treeherder-prod app/worker_default.1: [2018-08-02 07:02:31,560] WARNING [treeherder.intermittents_commenter.commenter:188] There were 17 comments for this daily task. An example comment left on a bug can be found in bug 1473920 comment 14. Sarah/Geoff, does 17 comments being left sound right, or too few?
Reporter | ||
Comment 33•6 years ago
|
||
(In reply to Ed Morley [:emorley] from comment #32) > Sarah/Geoff, does 17 comments being left sound right, or too few? That seems just right. https://bugzilla.mozilla.org/page.cgi?id=user_activity.html&action=run&from=-14d&who=orangefactor%40bots.tld One difference that jumps out at me there is that the time of day has changed: Instead of daily comments at 18:00 PDT, they are at 00:00 PDT now. Is that expected? Does anyone care? (I think it's fine.) The comments look great. Well done!
Reporter | ||
Comment 34•6 years ago
|
||
In https://bugzilla.mozilla.org/show_bug.cgi?id=1479957#c1, the priority was changed from P5 to --. I thought we only did that when setting whiteboard tag [stockwell needswork:owner] (which did not happen here). I may be forgetting some rule, but this doesn't seem right. Sarah, can you check?
Flags: needinfo?(sclements313)
Assignee | ||
Comment 35•6 years ago
|
||
I took a look and you're right - my bad. It's a quick fix, so I'll get this pushed out asap. Also, the OF crontab showed that the daily commenter only runs the days the weekly commenter doesn't. Did you want this to be apply to the replacement commenter? I did select arbitrary times for the Intermittents Commenter task to run (to answer your comment above) but I can mirror what's on the crontab exactly if you think that's best.
Flags: needinfo?(sclements313) → needinfo?(gbrown)
Reporter | ||
Comment 36•6 years ago
|
||
(In reply to sclements from comment #35) > the OF crontab showed that the daily commenter only > runs the days the weekly commenter doesn't. Did you want this to be apply to > the replacement commenter? Yes: 6 days of the week we want only the daily comment and on the 7th day we want only the weekly comment. > I did select arbitrary times for the > Intermittents Commenter task to run (to answer your comment above) but I can > mirror what's on the crontab exactly if you think that's best. I think the exact times are not important.
Flags: needinfo?(gbrown)
Comment 37•6 years ago
|
||
Geoff/Sarah - once the changes mentioned above are made, how long do you think we need to wait before: (a) disabling mirroring of data to OrangeFactor (bug 1445325) (b) then beginning the OrangeFactor decommissioning process (bugs to be filed using notes attached to this bug; with one addition: requesting disabling/removal of the Auth0 client used by OrangeFactor)? Considerations: * Is it likely we'll need to switch back to the old commenter? -> I think perhaps not, since even if bugs are found, we'll presumably just fix in place each day and then wait to see what happens for the next scheduled job. * The hard deadline for being out of SCL3 is 1st Sept, and we'll likely be hassled regularly if we're not done several weeks in advance of that. * I'm on PTO from tomorrow for a week, so was thinking it would be good if we could do (a) and maybe even (b) before I go away.
Reporter | ||
Comment 38•6 years ago
|
||
If we didn't have a deadline, I would suggest waiting a week, or at least until there has been a --weekly run. In light of the deadline, and given that last night's run was almost-perfect, and with faith in Sarah's ability to respond to the unexpected, I think we can be more aggressive / proceed with disabling right away. Let's check with jmaher too (he's back).
Flags: needinfo?(jmaher)
Comment 39•6 years ago
|
||
Comment 40•6 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/c783f47e53d30467c493d937a5eb499a56d992b6 Bug 1453760 - Adjust daily intermittent commenter schedule (#3870) So that it no longer runs on the day that the weekly commenter task is scheduled to run.
Comment 41•6 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/8560678013304d7b63e54492844fc86ff2dd7aeb Bug 1453760 - Fix commenter's check_needswork_owner() (#3874) change logic so that it always updates whiteboard status, unless whiteboard_needswork_owner already exists.
Comment 42•6 years ago
|
||
Comment 43•6 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/2c6a3b147129e605c83cf8e670a8cb881e593f60 Bug 1453760 - Fix commenter update_whiteboard() (#3877) Handle the case where no existing stockwell text was found or whiteboard is an empty string.
Comment 44•6 years ago
|
||
The follow-ups have been deployed to production, along with bug 1445325.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(jmaher)
Resolution: --- → FIXED
Comment 45•6 years ago
|
||
Now that this has landed and not needed rolling back, I've unset the BUGZILLA_API_KEY environment variable on stage and prod (since it's replaced by BUG_FILER_API_KEY and BUG_COMMENTER_API_KEY).
Updated•2 years ago
|
Component: Intermittent Failures View → TreeHerder
You need to log in
before you can comment on or make changes to this bug.
Description
•