Closed Bug 1295182 Opened 8 years ago Closed 8 years ago

look into pending counts/timing - make it accurate and graph it

Categories

(Taskcluster :: General, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Attachments

(1 file)

The idea is to track something like the cumulative pending time of all pending tasks.
I told Brian I would help with this because the initial hack to the stats collector was my doing.  We have the pending times within SignalFx per worker type, and this is what powers the alert we have for pending gecko decision task wait times.  However, I have found that at times things are fragile and the recorded pending times seem to skyrocket even though that's not reality.  We'll look more into that.
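For context, a minimal sketch of how a per-workerType pending-time measurement like this could work (hypothetical code, not the actual tc-stats-collector implementation; the function names are made up):

    // Hypothetical sketch: track currently-pending runs from pulse messages and,
    // every 30 seconds, report the cumulative pending time per workerType.
    const pending = new Map();  // key: `${taskId}/${runId}` -> {workerType, since}

    function onTaskPending(taskId, runId, workerType) {
      pending.set(`${taskId}/${runId}`, {workerType, since: Date.now()});
    }

    function onTaskNoLongerPending(taskId, runId) {
      pending.delete(`${taskId}/${runId}`);
    }

    setInterval(() => {
      const totals = {};  // workerType -> cumulative pending seconds
      for (const {workerType, since} of pending.values()) {
        totals[workerType] = (totals[workerType] || 0) + (Date.now() - since) / 1000;
      }
      // totals would then be reported to the metrics backend (SignalFx here)
    }, 30 * 1000);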
So I did a good bit of work on this, and https://app.signalfx.com/#/dashboard/Cp7oeIXAYDI now looks pretty good, but I still think it's inaccurate.  I added some logging today, so I'll dig more into that -- I suspect that there are tasks with unusual circumstances causing these inaccuracies.
Assignee: bstack → dustin
OK, I'm investigating why these ramps are occurring.  The timezone here is US/Eastern (UTC-4); the timestamps in the logs are US/Central (UTC-5) followed by UTC.  So these spikes occurred right around 1:00am US/Central = 6:00am UTC.

Skimming the logs for measurements with large pending, I see

 Aug 19 01:00:14 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:00:14 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 2271.451 seconds 

which reappears every 30 seconds with an additional 30-second pending, and

 Aug 19 01:00:15 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:00:14 GMT stats-collector:collector measuring task fnV_Uj29R2i5sKJnhgF8sQ/0 (desktop-test-xlarge), pending 1540.643 seconds 

which does the same.  Looking specifically for the former taskId:

Aug 19 00:22:25 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 05:22:24 GMT stats-collector:collector task pending: FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge) 
Aug 19 00:22:32 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 05:22:32 GMT stats-collector:collector task no longer pending: FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge) 
Aug 19 00:22:33 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 05:22:33 GMT stats-collector:collector task pending: FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge) 
Aug 19 00:22:43 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 05:22:43 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 19.767 seconds 
Aug 19 00:23:13 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 05:23:13 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 49.785 seconds 

.. and continued to ramp up at 1 second per second until ..

Aug 19 01:10:45 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:10:45 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 2901.996 seconds 
Aug 19 01:11:15 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:11:15 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 2932.02 seconds 
Aug 19 01:11:26 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:11:26 GMT stats-collector:collector task no longer pending: FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge) 
Aug 19 01:11:45 taskcluster-stats-collector app/run.1: Fri, 19 Aug 2016 06:11:45 GMT stats-collector:collector measuring task FXpAfJLFSOmNkU1O61gydg/0 (desktop-test-xlarge), pending 8.801 seconds 

Yet https://tools.taskcluster.net/task-inspector/#FXpAfJLFSOmNkU1O61gydg/0 only shows one run, and indeed the started - scheduled time is about 8 seconds:

https://queue.taskcluster.net/v1/task/FXpAfJLFSOmNkU1O61gydg/status
        "scheduled": "2016-08-19T05:22:23.242Z",
        "started": "2016-08-19T05:22:32.043Z",
        "resolved": "2016-08-19T06:11:25.975Z"

It depends on a docker-image task that was long since finished, and on
https://tools.taskcluster.net/task-inspector/#ZxN4Ej9bTmSkhihml-SSyg

        "scheduled": "2016-08-19T02:09:36.617Z",
        "started": "2016-08-19T04:39:06.694Z",
        "resolved": "2016-08-19T05:22:13.536Z"

So, lining all of those times up in UTC/GMT:

05:22:13 - build task ZxN4Ej9bTmSkhihml-SSyg finished
05:22:23 - test task FXpAfJLFSOmNkU1O61gydg scheduled
05:22:24 - task pending: FXpAfJLFSOmNkU1O61gydg/0 (pulse message on pending exchange)
05:22:32 - test task FXpAfJLFSOmNkU1O61gydg started (8 seconds actually pending)
05:22:32 - task no longer pending: FXpAfJLFSOmNkU1O61gydg/0 (pulse message on running exchange)
05:22:33 - task pending: FXpAfJLFSOmNkU1O61gydg/0 (pulse message on the pending exchange?!)
06:11:25 - test task FXpAfJLFSOmNkU1O61gydg resolved
06:11:26 - task no longer pending: FXpAfJLFSOmNkU1O61gydg/0 (pulse message on completed exchange)

so that pulse message at 05:22:33 is the culprit, as it convinced tc-stats-collector that the task was pending when in fact it was running.

Jonas, can you think of a reason that pulse message might have been sent?
Flags: needinfo?(jopsen)
https://github.com/taskcluster/taskcluster-stats-collector/commit/4a9cee966debeff032aa5dec8769462e33bc2ba2

should help a little.  It means that when a spurious pending message arrives, the task is already in waitingTasks and won't be re-added.  Still, it would be good to know why the spurious message is going out in the first place.
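Conceptually, the guard is something like this (my sketch of the idea, not the literal commit):

    // Sketch: ignore a pending message for a run that is already being tracked,
    // so a spurious re-delivery can't restart or duplicate the pending measurement.
    const waitingTasks = new Map();  // `${taskId}/${runId}` -> {workerType, since}

    function handlePendingMessage(taskId, runId, workerType, scheduled) {
      const key = `${taskId}/${runId}`;
      if (waitingTasks.has(key)) {
        return;  // already in waitingTasks; the duplicate message changes nothing
      }
      waitingTasks.set(key, {workerType, since: new Date(scheduled)});
    }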
Another duplicate message:

Aug 22 04:12:17 taskcluster-stats-collector app/run.1: Mon, 22 Aug 2016 09:12:16 GMT stats-collector:collector task no longer pending: e5GmWXreT-metsZQ9XNYHA/0 (gecko-3-b-linux) 
Aug 22 04:22:02 taskcluster-stats-collector app/run.1: Mon, 22 Aug 2016 09:22:01 GMT stats-collector:collector task no longer pending: e5GmWXreT-metsZQ9XNYHA/0 (gecko-3-b-linux) 

The task actually resolved at 4:12:17 US/Central, per task inspector.  So why was there an additional message about it 10 minutes later?
Another one:

Aug 22 04:12:16 taskcluster-stats-collector app/run.1: Mon, 22 Aug 2016 09:12:15 GMT stats-collector:collector task no longer pending: axL1j4S3Rjird6b8TMjbQA/0 (gecko-3-b-linux) 
Aug 22 04:53:43 taskcluster-stats-collector app/run.1: Mon, 22 Aug 2016 09:53:43 GMT stats-collector:collector task no longer pending: axL1j4S3Rjird6b8TMjbQA/0 (gecko-3-b-linux) 

These cause weird "spikes" in the graph, whose derivative should never be greater than one.

It's interesting to note that the first message in both this case and the previous comment arrived at almost exactly the same time.
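One way to flag these spikes automatically (a hypothetical check, not something the collector does today): reject any sample whose pending value grew by more than the wall-clock time since the previous sample for that run.

    // Sketch: a genuinely pending run's measured pending time can only grow by the
    // elapsed wall-clock time between samples, i.e. its derivative is at most 1.
    const lastSample = new Map();  // `${taskId}/${runId}` -> {at, pendingSeconds}

    function looksSuspicious(key, pendingSeconds) {
      const now = Date.now();
      const prev = lastSample.get(key);
      lastSample.set(key, {at: now, pendingSeconds});
      if (!prev) {
        return false;
      }
      const elapsed = (now - prev.at) / 1000;
      return pendingSeconds - prev.pendingSeconds > elapsed + 1;  // 1s slack for clock skew
    }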
Jonas points out that the tasks in comment 5 and comment 6 were actually running (at 4:12:xx) and then completed later.  So this is normal, but it still causes issues for the stats collector.
Comments 5 and 6 are not duplicate pending messages.

---

So a 10-minute delay could be a failure to delete from the Azure queue; in that case the task-resolved message from the parent task arrives twice, causing two runs of the schedule-task code, which is idempotent and hence sends the message twice.

---

Another explanation is the listening code failing to handle the message, in which case PulseListener will retry it.
If this happens a lot, it might be worth checking out.

---

A fourth explanation for the duplicate is two parent tasks racing to complete and both running the task-resolved code on the queue at the same time.  But then the pending messages should arrive at roughly the same time.

---

None of the above should happen frequently; occasional duplicates are to be expected and perfectly acceptable.  Pulse has at-least-once delivery semantics :)
Flags: needinfo?(jopsen)
Looking at comment 3:
I guess this is either:
 A) worker reporting task completed twice for parent task, due to retry HTTP logic (legal)
 B) Race between two parent tasks to schedule child task (legal)
 C) Retry logic for message handling queue producing two messages (legal)
 D) Listener retry logic (legal):
    https://github.com/taskcluster/taskcluster-client/blob/e5a389ebe04737614c96c700348458c274f804bc/lib/pulselistener.js#L395-L415

All of these are legal and perfectly in line with at-least-once delivery semantics.
If this is something that happens frequently, then we should investigate further.
It's happening a half-dozen times per day.  I doubt it's a listener issue, as there are messages delivered in between:

05:22:24 - pending
05:22:32 - running
05:22:33 - pending

But you're right that these are all "at least once".  So it sounds like I need to set up some sort of de-duplication within the stats collector service.
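A minimal de-duplication sketch along those lines (hypothetical; this isn't what ultimately landed): remember the last known state of each run and drop messages that repeat it or move it backwards.

    // Sketch: ignore pulse messages that repeat a run's current state or move it
    // backwards (e.g. a 'pending' message arriving after 'running').
    const ORDER = {pending: 0, running: 1, completed: 2};
    const runState = new Map();  // `${taskId}/${runId}` -> state name

    function shouldHandle(key, newState) {
      const current = runState.get(key);
      if (current !== undefined && ORDER[newState] <= ORDER[current]) {
        return false;  // duplicate or out-of-order; drop it
      }
      runState.set(key, newState);
      return true;
    }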
OK, I talked with Jonas quite a bit.  What we came up with in the end is a model where tc-stats-collector performs the 'maximum' operation locally, and only measures that.  So the pending numbers become a measure of the pending time of the pending task (run) that has been in that state the longest (a rough sketch follows the list below).  This has a few advantages:

 * after determining which run has been pending the longest based on pulse messages, we can confirm that with a call to the queue, without overwhelming the queue
 * we can safely measure 0 for workerTypes with no pending tasks, so every workerType will always have a measurement
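A rough sketch of what that model could look like (my reading of the plan; `queue.status` is the Taskcluster queue's task-status endpoint, and everything else here is made up for illustration):

    // Sketch of the 'longest-pending' model: per workerType, measure only the run
    // that has been pending the longest, confirming it against the queue first.
    const taskcluster = require('taskcluster-client');
    const queue = new taskcluster.Queue();

    async function measureWorkerType(pendingRuns) {
      // pendingRuns: [{taskId, runId, since}] reconstructed from pulse messages
      if (pendingRuns.length === 0) {
        return 0;  // safe to report zero when nothing is pending
      }
      const oldest = pendingRuns.reduce((a, b) => (a.since < b.since ? a : b));
      // one queue call per measurement, so we don't overwhelm the queue
      const {status} = await queue.status(oldest.taskId);
      const run = status.runs[oldest.runId];
      if (!run || run.state !== 'pending') {
        return 0;  // local state was stale; the queue says it's no longer pending
      }
      return (Date.now() - new Date(run.scheduled)) / 1000;
    }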
OK, that's landed, and with that I think we're done here.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED