Closed
Bug 1050766
Opened 11 years ago
Closed 11 years ago
measure delta between creating a task and it completing over time
Categories
(support.mozilla.org :: General, defect, P1)
support.mozilla.org
General
Tracking
(Not tracked)
RESOLVED
FIXED
2014Q3
People
(Reporter: willkg, Assigned: rrosario)
References
Details
(Whiteboard: u=dev c=infra p=1 s=2014.16)
We have no visibility into two important things right now:
1. the rabbitmq queue size
2. the delta in time between when a celery task is created and when it executes
Because we have no visibility into this, we can't track it over time and we have no idea when the numbers suggest imminent danger.
This bug covers using precision tools and alien artifacts we discovered in a dig on the moon to capture a set of statistics that will give us better visibility into these things and also reduce the risk of heart attack due to stress for the sumo engineering team.
| Reporter | ||
Comment 1•11 years ago
|
||
My current "big idea" is this:
Create a cron job.
The cron job kicks off once every 6 hours.
When it kicks off it grabs "now", sticks it in a specially formulated celery task and sends that to the queue.
When the task executes, it does two things:
1. computes the delta of the "new now" and the previous "now" and statsd.gauges that
2. figures out the current rabbitmq queue size and statsd.gauges that
That way we can look at graphite for graphs of what's going on.
We can change the sampling rate if we think it's important, but 6 hours seems like a good number and adding 4 tasks per day doesn't affect the queue much. We can trivially change it later.
Computing the delta between the two "nows" could be problematic because the machines they're running on will most certainly be slightly off relative to one another in regards to "what time is it?" I think we should convert everything to minutes and have a minimum of 0. Pretty sure that will alleviate the problems we will probably have, but still give us obvious cues when things are bad. We might even want to fire off HALP HALP!" emails.
The first thing is easy to write. The second thing requires bug #1040450 to be implemented first.
| Reporter | ||
Comment 2•11 years ago
|
||
I'm going to grab this to do today, too, because the current situation is infuriating. Making it block on bug #1050450. (I typoed the bug number in the previous comment.)
Assignee: nobody → willkg
Depends on: 1050450
| Reporter | ||
Comment 3•11 years ago
|
||
I stuck the second thing into a cron job instead where it makes more sense to have.
Having this is probably interesting, but it's probably not super vital right now.
| Reporter | ||
Comment 4•11 years ago
|
||
Changing the summary and unassigning me.
Ricky had an idea to create a decorator that measured how long each task took to complete. That idea is different than this one. This one measures how fast it takes for a task to get through everything if it doesn't do anything when it executes. It does not measure how fast it takes for a specific task to do all the work it does.
Assignee: willkg → nobody
Summary: measure rabbitmq queue size and time it takes for a task to complete over time → measure delta between creating a task and it completing over time
| Assignee | ||
Comment 5•11 years ago
|
||
grabbing and tossing into the sprint.
Assignee: nobody → rrosario
Priority: -- → P1
Whiteboard: u=dev c=infra p=2 s=2014.15
Target Milestone: --- → 2014Q3
| Assignee | ||
Comment 6•11 years ago
|
||
I landed a commit that measures the time a task is siting on the queue:
https://github.com/mozilla/kitsune/commit/2a2733e315094b90f11d027e4c8aa9b9252af1c0
| Assignee | ||
Comment 7•11 years ago
|
||
This was deployed to prod but I don't see anything. It is possible that we don't have the right network paths open between the celery server and statsd :/
| Reporter | ||
Comment 8•11 years ago
|
||
It's kicking this up:
Task was called with args: (datetime.datetime(2014, 8, 13, 18, 20, 19, 911437),) kwargs: {}.
The contents of the full traceback was:
Traceback (most recent call last):
File "/data/www/support.mozilla.com/kitsune/vendor/src/celery/celery/task/trace.py", line 233, in trace_task
R = retval = fun(*args, **kwargs)
File "/data/www/support.mozilla.com/kitsune/vendor/src/celery/celery/task/trace.py", line 420, in __protected_call__
return self.run(*args, **kwargs)
File "/data/www/support.mozilla.org/kitsune/kitsune/sumo/tasks.py", line 13, in measure_queue_lag
lag = (datetime.now() - queued_time).total_seconds()
AttributeError: 'datetime.timedelta' object has no attribute 'total_seconds'
I screwed up in the review and said to use total_seconds(), but that's new in Python 2.7.
https://docs.python.org/2/library/datetime.html#datetime.timedelta.total_seconds
So, instead do:
lag = datetime.now() - queued_time
lag = (lag.days * 3600 * 24) + lag.seconds
Sorry about that.
| Assignee | ||
Comment 9•11 years ago
|
||
| Assignee | ||
Comment 10•11 years ago
|
||
We are just missing the part where we measure the time a task takes to complete. I think that's 1pt.
Whiteboard: u=dev c=infra p=2 s=2014.15 → u=dev c=infra p=1 s=2014.16
| Assignee | ||
Comment 11•11 years ago
|
||
And the last part for timing tasks is in a pull request:
https://github.com/mozilla/kitsune/pull/2090
| Assignee | ||
Comment 12•11 years ago
|
||
Landed on master:
[bug 1050766] Measure and record time it takes tasks to execute.
| Assignee | ||
Comment 13•11 years ago
|
||
Deployed to prod.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•