Closed Bug 1176492 Opened 9 years ago Closed 5 years ago

Consider moving the less frequent periodic tasks on Heroku to use the scheduler addon

Categories: Tree Management :: Treeherder: Infrastructure, defect, P2

Tracking: (Not tracked)

Status: RESOLVED FIXED

People: (Reporter: emorley, Assigned: emorley)

Attachments: (1 file)
We currently have periodic tasks like cycle data running on the "worker_default" dyno:
https://dashboard.heroku.com/apps/treeherder-heroku/resources

This seems problematic for a few reasons:
1) Since dynos are restarted once every 24 hours, we may end up interrupting long-running tasks like cycle-data.
2) The load on that dyno varies considerably depending on which periodic tasks are running at any given time, so we'll either overload the dyno or pay for more capacity than we need 90% of the time.
3) Long-running but low-importance tasks like cycle-data can block more urgent tasks.

It seems like the scheduler addon might be a better fit for things like cycle-data & fetch-bugs:
https://devcenter.heroku.com/articles/scheduler
https://elements.heroku.com/addons/scheduler
Summary: Consider moving the periodic tasks on Heroku to use the scheduler addon → Consider moving the less frequent periodic tasks on Heroku to use the scheduler addon
One limitation of this scheduler addon is that the job frequency has to be one of every {10 minutes, hour, day}.
This can wait until after the main move.
See Also: → 1339093
Fixing this would mean the cycle_data task gets its own dyno, making it less likely to run out of RAM, as seen in bug 1346567.
Assignee: nobody → emorley
Blocks: 1346567
Priority: P3 → P1
Assignee: emorley → nobody
Priority: P1 → P2
Attachment #9008136 - Flags: review?(emorley)
I have added the scheduler add-on to proto, stage and prod already.  I also scheduled the tasks since running more often won't hurt anything, and this ensures we don't forget to do it if/when we merge the PR.  :)
(In reply to Cameron Dawson [:camd] from comment #5)
> I have added the scheduler add-on to proto, stage and prod already.  I also
> scheduled the tasks since running more often won't hurt anything, and this
> ensures we don't forget to do it if/when we merge the PR.  :)

Ah thank you :-)

We'll need to see how the tasks get on -- I suspect the cycle_data task might run out of RAM on the smaller P1 dyno (the default worker currently uses a P2 that has double the RAM) - but might as well start small and work our way up.
I'll leave this open to consider moving ``seta-analyze-failures`` and the intermittents commenter tasks.  I suppose we could even move over the ``fetch-push-logs-every-5-minutes`` if we could change it to every 10, which I imagine we could.
Comment on attachment 9008136 [details] [review]
Link to GitHub pull-request: https://github.com/mozilla/treeherder/pull/4019

(Was reviewed on GitHub and merged already; forgot to sync the r+ back to Bugzilla too)
Attachment #9008136 - Flags: review?(emorley) → review+
This worked great at fixing bug 1484642 :-)

Something we need to keep an eye on, is whether the tasks need larger sizes of dynos (the default worker was a P2 dyno and these new tasks are using a P1) - though now we can fine tune more than before, since tasks are separated out. The dyno specs are listed here:
https://devcenter.heroku.com/articles/dyno-types

Also, since these tasks don't run via a permanent dyno, they don't show up in metrics (https://dashboard.heroku.com/apps/treeherder-prod/metrics), so we'll need to monitor via Papertrail instead.

Logs:
* Prod: https://papertrailapp.com/systems/treeherder-prod/events?q=program%3Ascheduler
* Stage: https://papertrailapp.com/systems/treeherder-stage/events?q=program%3Ascheduler
* Prototype: https://papertrailapp.com/systems/treeherder-prototype/events?q=program%3Ascheduler

For `./manage.py update_bugscache`, the peak memory usage is only 128MB (so the smallest P1 dyno seems fine):

Sep 12 10:47:56 treeherder-prod heroku/scheduler.7655: source=scheduler.7655 dyno=<SNIP> sample#memory_total=128.48MB sample#memory_rss=124.78MB sample#memory_cache=3.70MB sample#memory_swap=0.00MB ...
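These `sample#` metrics follow Heroku's logfmt-style runtime-metrics format. A minimal Python sketch (illustrative only, not part of Treeherder) for extracting the memory figures from such a line:

```python
import re

def parse_memory_samples(line):
    """Extract Heroku sample#memory_* metrics (in MB) from a log line."""
    samples = {}
    # Matches e.g. "sample#memory_total=128.48MB"
    for key, value in re.findall(r"sample#(memory_\w+)=([\d.]+)MB", line):
        samples[key] = float(value)
    return samples

line = ("source=scheduler.7655 dyno=<SNIP> sample#memory_total=128.48MB "
        "sample#memory_rss=124.78MB sample#memory_cache=3.70MB "
        "sample#memory_swap=0.00MB")
print(parse_memory_samples(line)["memory_total"])  # 128.48
```

Feeding the scheduler log lines through something like this makes it easy to chart peak usage per task without a permanent dyno's metrics page.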

For `./manage.py cycle_data`, the task is being killed since it exceeded the 512MB RAM of the P1 dyno (it reached 1271MB usage before it was killed):
https://papertrailapp.com/systems/treeherder-stage/events?centered_on_id=976239624282349590&q=program%3Aheroku%2Fscheduler.7040

I've bumped it to a Performance-M (2.5GB RAM) for now (which I hope will be enough?), but we should see about reducing usage in bug 1346567 to save credits later on.

At the moment the tasks won't appear in New Relic. However we should be able to make that happen by changing the command to `newrelic-admin run-program ./manage.py ...` and adding the relevant management commands to the list here:
https://github.com/mozilla/treeherder/blob/b5a6736f9b26ac7c6441fb5da3a95831933e7dd7/newrelic.ini#L28-L31
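A hedged sketch of what that newrelic.ini change might look like, assuming the agent's `instrumentation.scripts.django_admin` setting (the command names shown are illustrative examples taken from this bug, not the actual list in the repo):

```ini
[newrelic]
# Space-separated list of Django management commands the agent should
# instrument when invoked via `newrelic-admin run-program ./manage.py ...`.
# Command names here are illustrative.
instrumentation.scripts.django_admin = cycle_data update_bugscache
```

Each scheduler job's command would then be prefixed accordingly, e.g. `newrelic-admin run-program ./manage.py cycle_data`.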

Finally, since cycle_data is no longer hogging RAM, I've dropped the "default" worker dyno type down from a P2 to a P1, and reduced the count from 2 to 1 for prototype/stage (but not prod, since the commenter has more to do there as the API key is set, and we don't want it blocking perf alert generation). Across prototype+stage+prod, that saves us another 8 dyno credits, lowering total Treeherder Heroku usage (after bug 1443251 comment 6) from 88 to 80 credits/month.
Blocks: 1484642
(In reply to Ed Morley [:emorley] from comment #10)
> Finally, since cycle_data is no longer hogging RAM, I've dropped the
> "default" worker dyno type down from a P2 to a P1, and reduced the count
> from 2 to 1 for prototype/stage (but not prod, since the commenter has more
> to do there as the API key is set, and we don't want it blocking perf alert
> generation)

I've raised stage's default worker count from `1` back to `2` since there were a few queue spikes causing alerts. It's still a P1 dyno, so it's still using fewer credits than prior to these changes.
The cycle_data task is still exceeding the RAM limits:
Sep 14 06:37:13 treeherder-prod heroku/scheduler.4138: Process running mem=5325M(208.0%) 
Sep 14 06:37:13 treeherder-prod heroku/scheduler.4138: Error R15 (Memory quota vastly exceeded) 
(https://papertrailapp.com/systems/treeherder-prod/events?centered_on_id=977131112423915535&q=program%3Aheroku%2Fscheduler.4138)
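As a sanity check on the figures above, the "208.0%" in the R15 error is consistent with the 2.5GB Performance-M quota the dyno had at the time (a quick illustrative calculation, not Treeherder code):

```python
quota_mb = 2.5 * 1024         # Performance-M dyno quota: 2.5 GB = 2560 MB
used_mb = 5325                # "mem=5325M" from the log line above
overage = used_mb / quota_mb  # ratio of usage to quota
print(f"{overage:.0%}")       # prints "208%", matching "(208.0%)" in the log
```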

I've bumped it to a Performance-L for now (which has 14GB RAM instead of 2.5GB).

Assigning this to me to remind me to check back at cycle_data and also look at moving the remaining tasks at some point.
Assignee: nobody → emorley
I've updated the tasks to use the New Relic wrapper (i.e. prefixed with `newrelic-admin run-program`).
We will also need to update newrelic.ini to add these commands to the ones that are instrumented.
Depends on: 1503576
Depends on: 1508228
Depends on: 1518780
Depends on: 1518782

Most tasks have now been migrated. I've filed dep bugs for the remaining two (that are less urgent).

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED