Closed Bug 1050766 Opened 11 years ago Closed 11 years ago

measure delta between creating a task and it completing over time

Categories

(support.mozilla.org :: General, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED
2014Q3

People

(Reporter: willkg, Assigned: rrosario)

References

Details

(Whiteboard: u=dev c=infra p=1 s=2014.16)

We have no visibility into two important things right now: 1. the rabbitmq queue size 2. the delta in time between when a celery task is created and when it executes Because we have no visibility into this, we can't track it over time and we have no idea when the numbers suggest imminent danger. This bug covers using precision tools and alien artifacts we discovered in a dig on the moon to capture a set of statistics that will give us better visibility into these things and also reduce the risk of heart attack due to stress for the sumo engineering team.
My current "big idea" is this: Create a cron job. The cron job kicks off once every 6 hours. When it kicks off it grabs "now", sticks it in a specially formulated celery task and sends that to the queue. When the task executes, it does two things: 1. computes the delta of the "new now" and the previous "now" and statsd.gauges that 2. figures out the current rabbitmq queue size and statsd.gauges that That way we can look at graphite for graphs of what's going on. We can change the sampling rate if we think it's important, but 6 hours seems like a good number and adding 4 tasks per day doesn't affect the queue much. We can trivially change it later. Computing the delta between the two "nows" could be problematic because the machines they're running on will most certainly be slightly off relative to one another in regards to "what time is it?" I think we should convert everything to minutes and have a minimum of 0. Pretty sure that will alleviate the problems we will probably have, but still give us obvious cues when things are bad. We might even want to fire off HALP HALP!" emails. The first thing is easy to write. The second thing requires bug #1040450 to be implemented first.
I'm going to grab this to do today, too, because the current situation is infuriating. Making it block on bug #1050450. (I typoed the bug number in the previous comment.)
Assignee: nobody → willkg
Depends on: 1050450
I stuck the second thing into a cron job instead where it makes more sense to have. Having this is probably interesting, but it's probably not super vital right now.
Changing the summary and unassigning me. Ricky had an idea to create a decorator that measured how long each task took to complete. That idea is different than this one. This one measures how fast it takes for a task to get through everything if it doesn't do anything when it executes. It does not measure how fast it takes for a specific task to do all the work it does.
Assignee: willkg → nobody
Summary: measure rabbitmq queue size and time it takes for a task to complete over time → measure delta between creating a task and it completing over time
grabbing and tossing into the sprint.
Assignee: nobody → rrosario
Priority: -- → P1
Whiteboard: u=dev c=infra p=2 s=2014.15
Target Milestone: --- → 2014Q3
I landed a commit that measures the time a task is siting on the queue: https://github.com/mozilla/kitsune/commit/2a2733e315094b90f11d027e4c8aa9b9252af1c0
This was deployed to prod but I don't see anything. It is possible that we don't have the right network paths open between the celery server and statsd :/
It's kicking this up: Task was called with args: (datetime.datetime(2014, 8, 13, 18, 20, 19, 911437),) kwargs: {}. The contents of the full traceback was: Traceback (most recent call last): File "/data/www/support.mozilla.com/kitsune/vendor/src/celery/celery/task/trace.py", line 233, in trace_task R = retval = fun(*args, **kwargs) File "/data/www/support.mozilla.com/kitsune/vendor/src/celery/celery/task/trace.py", line 420, in __protected_call__ return self.run(*args, **kwargs) File "/data/www/support.mozilla.org/kitsune/kitsune/sumo/tasks.py", line 13, in measure_queue_lag lag = (datetime.now() - queued_time).total_seconds() AttributeError: 'datetime.timedelta' object has no attribute 'total_seconds' I screwed up in the review and said to use total_seconds(), but that's new in Python 2.7. https://docs.python.org/2/library/datetime.html#datetime.timedelta.total_seconds So, instead do: lag = datetime.now() - queued_time lag = (lag.days * 3600 * 24) + lag.seconds Sorry about that.
We are just missing the part where we measure the time a task takes to complete. I think that's 1pt.
Whiteboard: u=dev c=infra p=2 s=2014.15 → u=dev c=infra p=1 s=2014.16
And the last part for timing tasks is in a pull request: https://github.com/mozilla/kitsune/pull/2090
Landed on master: [bug 1050766] Measure and record time it takes tasks to execute.
Deployed to prod.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.