Closed Bug 1295182 Opened 8 years ago Closed 8 years ago

look into pending counts/timing - make it accurate and graph it

Categories

(Taskcluster :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Attachments

(1 file)

The idea is to track something like the cumulative pending time of all pending tasks.
I told Brian I would help with this because the initial hack to the stats collector was my doing.  We have the pending times within SignalFx per worker type, and this is what powers the alert we have for pending gecko decision task wait times.  However, I have found that at times things are fragile and the recorded pending times seem to skyrocket even though that's not reality.  We'll look more into that.
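For context, a minimal sketch of how a per-workerType pending-time measurement like this could work (hypothetical code, not the actual tc-stats-collector implementation; the function names are made up):

    // Hypothetical sketch: track currently-pending runs from pulse messages and,
    // every 30 seconds, report the cumulative pending time per workerType.
    const pending = new Map();  // key: `${taskId}/${runId}` -> {workerType, since}

    function onTaskPending(taskId, runId, workerType) {
      pending.set(`${taskId}/${runId}`, {workerType, since: Date.now()});
    }

    function onTaskNoLongerPending(taskId, runId) {
      pending.delete(`${taskId}/${runId}`);
    }

    setInterval(() => {
      const totals = {};  // workerType -> cumulative pending seconds
      for (const {workerType, since} of pending.values()) {
        totals[workerType] = (totals[workerType] || 0) + (Date.now() - since) / 1000;
      }
      // totals would then be reported to the metrics backend (SignalFx here)
    }, 30 * 1000);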
So I did a good bit of work on this, and https://app.signalfx.com/#/dashboard/Cp7oeIXAYDI now looks pretty good, but I still think it's inaccurate.  I added some logging today, so I'll dig more into that -- I suspect that there are tasks with unusual circumstances causing these inaccuracies.
Assignee: bstack → dustin
OK, I'm investigating why these ramps are occurring.  The timezone here is US/Eastern (UTC-4); the timestamps in the logs are US/Central (UTC-5) followed by UTC.  So these spikes occurred right around 1:00am US/Central = 6:00am UTC.

Skimming the logs for measurements with large pending, I see

 Aug 19 01:00:14 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:00:14 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 2271.451 seconds 

which reappears every 30 seconds with an additional 30-second pending, and

 Aug 19 01:00:15 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:00:14 GMT stats-collector:collector measuring task fnV_Uj29R2i5sKJnhgF8sQ/0 (desktop-test-xlarge), pending 1540.643 seconds 

which does the same.  Looking specifically for the former taskId:

Aug 19 00:22:25 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 05:22:24 GMT stats-collector:collector task pending: FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge) 
Aug 19 00:22:32 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 05:22:32 GMT stats-collector:collector task no longer pending: FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge) 
Aug 19 00:22:33 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 05:22:33 GMT stats-collector:collector task pending: FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge) 
Aug 19 00:22:43 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 05:22:43 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 19.767 seconds 
Aug 19 00:23:13 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 05:23:13 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 49.785 seconds 

.. and continued to ramp up at 1 second per second until ..

Aug 19 01:10:45 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:10:45 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 2901.996 seconds 
Aug 19 01:11:15 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:11:15 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 2932.02 seconds 
Aug 19 01:11:26 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:11:26 GMT stats-collector:collector task no longer pending: FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge) 
Aug 19 01:11:45 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:11:45 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 8.801 seconds 

Yet https://tools.taskcluster.net/task-inspector/#FXpAfJLFSOmNkU1O61gydg/0 only shows one run, and indeed the started - scheduled time is about 8 seconds:

https://queue.taskcluster.net/v1/task/FXpAfJLFSOmNkU1O61gydg/status
        "scheduled": "2016-08-19T05:22:23.242Z",
        "started": "2016-08-19T05:22:32.043Z",
        "resolved": "2016-08-19T06:11:25.975Z"

It depends on a docker-image task that was long since finished, and on
https://tools.taskcluster.net/task-inspector/#ZxN4Ej9bTmSkhihml-SSyg

        "scheduled": "2016-08-19T02:09:36.617Z",
        "started": "2016-08-19T04:39:06.694Z",
        "resolved": "2016-08-19T05:22:13.536Z"

So, lining all of those times up in UTC/GMT:

05:22:13 - build task ZxN4Ej9bTmSkhihml-SSyg finished
05:22:23 - test task FXpAfJLFSOmNkU1O61gydg scheduled
05:22:24 - task pending: FXpAfJLFSOmNkU1O61gydg/0 (pulse message on pending exchange)
05:22:32 - test task FXpAfJLFSOmNkU1O61gydg started (8 seconds actually pending)
05:22:32 - task no longer pending: FXpAfJLFSOmNkU1O61gydg/0 (pulse message on running exchange)
05:22:33 - task pending: FXpAfJLFSOmNkU1O61gydg/0 (pulse message on the pending exchange?!)
06:11:25 - test task FXpAfJLFSOmNkU1O61gydg resolved
06:11:26 - task no longer pending: FXpAfJLFSOmNkU1O61gydg/0 (pulse message on completed exchange)

so that pulse message at 05:22:33 is the culprit, as it convinced tc-stats-collector that the task was pending when in fact it was running.

Jonas, can you think of a reason that pulse message might have been sent?
Flags: needinfo?(jopsen)
https://github.com/taskcluster/taskcluster-stats-collector/commit/4a9cee966debeff032aa5dec8769462e33bc2ba2

should help a little.  It means that when a spurious pending message arrives, the task is already in waitingTasks and won't be re-added.  Still, it would be good to know why the spurious message is going out in the first place.
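Conceptually, the guard is something like this (my sketch of the idea, not the literal commit):

    // Sketch: ignore a pending message for a run that is already being tracked,
    // so a spurious re-delivery can't restart or duplicate the pending measurement.
    const waitingTasks = new Map();  // `${taskId}/${runId}` -> {workerType, since}

    function handlePendingMessage(taskId, runId, workerType, scheduled) {
      const key = `${taskId}/${runId}`;
      if (waitingTasks.has(key)) {
        return;  // already in waitingTasks; the duplicate message changes nothing
      }
      waitingTasks.set(key, {workerType, since: new Date(scheduled)});
    }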
Another duplicate message:

Aug 22 04:12:17 taskcluster-stats-collector app/run.1: Mon, 22 Aug 2016 09:12:16 GMT stats-collector:collector task no longer pending: e5GmWXreT-metsZQ9XNYHA/0 (gecko-3-b-linux) 
Aug 22 04:22:02 taskcluster-stats-collector app/run.1: Mon, 22 Aug 2016 09:22:01 GMT stats-collector:collector task no longer pending: e5GmWXreT-metsZQ9XNYHA/0 (gecko-3-b-linux) 

The task actually resolved at 4:12:17 US/Central, per task inspector.  So why was there an additional message about it 10 minutes later?
Another one:

Aug 22 04:12:16 taskcluster-stats-collector app/run.1: Mon, 22 Aug 2016 09:12:15 GMT stats-collector:collector task no longer pending: axL1j4S3Rjird6b8TMjbQA/0 (gecko-3-b-linux) 
Aug 22 04:53:43 taskcluster-stats-collector app/run.1: Mon, 22 Aug 2016 09:53:43 GMT stats-collector:collector task no longer pending: axL1j4S3Rjird6b8TMjbQA/0 (gecko-3-b-linux) 

These cause weird "spikes" in the graph, whose derivative should never be greater than one.

It's interesting to note that the first message in both this case and the previous comment arrived at almost exactly the same time.
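One way to flag these spikes automatically (a hypothetical check, not something the collector does today): reject any sample whose pending value grew by more than the wall-clock time since the previous sample for that run.

    // Sketch: a genuinely pending run's measured pending time can only grow by the
    // elapsed wall-clock time between samples, i.e. its derivative is at most 1.
    const lastSample = new Map();  // `${taskId}/${runId}` -> {at, pendingSeconds}

    function looksSuspicious(key, pendingSeconds) {
      const now = Date.now();
      const prev = lastSample.get(key);
      lastSample.set(key, {at: now, pendingSeconds});
      if (!prev) {
        return false;
      }
      const elapsed = (now - prev.at) / 1000;
      return pendingSeconds - prev.pendingSeconds > elapsed + 1;  // 1s slack for clock skew
    }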
Jonas points out that the tasks in comment 5 and comment 6 were actually running (at 4:12:xx) and then completed later.  So this is normal, but it still causes issues for the stats collector.
Comments 5 and 6 are not duplicate pending messages.

---

So a 10-minute delay could be a failure to delete from the Azure queue; in that case the task-resolved message from the parent task arrives twice, causing two runs of the schedule-task code, which is idempotent and hence sends the message twice.

---

Another explanation is the listening code failing to handle the message, in which case PulseListener will retry it.
If this happens a lot, it might be worth checking out.

---

A fourth explanation for the duplicate is two parent tasks racing to complete and both running the task-resolved code on the queue at the same time.  But then the pending messages should arrive at roughly the same time.

---

None of the above should happen frequently; occasional duplicates are to be expected and perfectly acceptable.  Pulse has at-least-once delivery semantics :)
Flags: needinfo?(jopsen)
Looking at comment 3:
I guess this is either:
 A) worker reporting task completed twice for parent task, due to retry HTTP logic (legal)
 B) Race between two parent tasks to schedule child task (legal)
 C) Retry logic for message handling queue producing two messages (legal)
 D) Listener retry logic (legal):
    https://github.com/taskcluster/taskcluster-client/blob/e5a389ebe04737614c96c700348458c274f804bc/lib/pulselistener.js#L395-L415

All of these are legal and perfectly in line with at-least-once delivery semantics.
If this is something that happens frequently, then we should investigate further.
It's happening a half-dozen times per day.  I doubt it's a listener issue, as there are messages delivered in between:

05:22:24 - pending
05:22:32 - running
05:22:33 - pending

But you're right that these are all "at least once".  So it sounds like I need to set up some sort of de-duplication within the stats collector service.
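A minimal de-duplication sketch along those lines (hypothetical; this isn't what ultimately landed): remember the last known state of each run and drop messages that repeat it or move it backwards.

    // Sketch: ignore pulse messages that repeat a run's current state or move it
    // backwards (e.g. a 'pending' message arriving after 'running').
    const ORDER = {pending: 0, running: 1, completed: 2};
    const runState = new Map();  // `${taskId}/${runId}` -> state name

    function shouldHandle(key, newState) {
      const current = runState.get(key);
      if (current !== undefined && ORDER[newState] <= ORDER[current]) {
        return false;  // duplicate or out-of-order; drop it
      }
      runState.set(key, newState);
      return true;
    }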
OK, I talked with Jonas quite a bit.  What we came up with in the end is a model where tc-stats-collector performs the 'maximum' operation locally, and only measures that.  So the pending numbers become a measure of the pending time of the pending task (run) that has been in that state the longest (a rough sketch follows the list below).  This has a few advantages:

 * after determining which run has been pending the longest based on pulse messages, we can confirm that with a call to the queue, without overwhelming the queue
 * we can safely measure 0 for workerTypes with no pending tasks, so every workerType will always have a measurement
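A rough sketch of what that model could look like (my reading of the plan; `queue.status` is the Taskcluster queue's task-status endpoint, and everything else here is made up for illustration):

    // Sketch of the 'longest-pending' model: per workerType, measure only the run
    // that has been pending the longest, confirming it against the queue first.
    const taskcluster = require('taskcluster-client');
    const queue = new taskcluster.Queue();

    async function measureWorkerType(pendingRuns) {
      // pendingRuns: [{taskId, runId, since}] reconstructed from pulse messages
      if (pendingRuns.length === 0) {
        return 0;  // safe to report zero when nothing is pending
      }
      const oldest = pendingRuns.reduce((a, b) => (a.since < b.since ? a : b));
      // one queue call per measurement, so we don't overwhelm the queue
      const {status} = await queue.status(oldest.taskId);
      const run = status.runs[oldest.runId];
      if (!run || run.state !== 'pending') {
        return 0;  // local state was stale; the queue says it's no longer pending
      }
      return (Date.now() - new Date(run.scheduled)) / 1000;
    }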
OK, that's landed, and with that I think we're done here.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED