dependencyResolver stopped working

RESOLVED FIXED

Status

Type: task
Priority: P2
Severity: major
Status: RESOLVED FIXED
Opened: 8 months ago
Closed: 3 months ago

People

(Reporter: dustin, Assigned: bstack)

Tracking

Details

Oct 29 23:52:42 taskcluster-queue app/dependencyResolver.4: 2018-10-29T23:52:42.015Z base:entity TIMING: getEntity on QueueTasks took 135.544098 milliseconds. 
Oct 29 23:52:42 taskcluster-queue app/dependencyResolver.3: 2018-10-29T23:52:42.481Z base:entity TIMING: deleteEntity on QueueTaskGroupActiveSets took 94.896785 milliseconds. 
Oct 29 23:52:43 taskcluster-queue app/dependencyResolver.2: 2018-10-29T23:52:43.215Z base:entity TIMING: queryEntities on QueueTaskDependency took 97.040273 milliseconds. 
Oct 29 23:52:43 taskcluster-queue app/dependencyResolver.3: 2018-10-29T23:52:43.595Z base:entity TIMING: deleteEntity on QueueTaskGroupActiveSets took 92.485661 milliseconds. 
Oct 29 23:52:53 taskcluster-queue app/dependencyResolver.4: raven@2.5.0 alert: failed to send exception to sentry: HTTP Error (429): undefined 
Oct 29 23:52:53 taskcluster-queue app/dependencyResolver.4: Failed to log error to Sentry: Error: HTTP Error (429): undefined 
Oct 29 23:52:53 taskcluster-queue app/dependencyResolver.1: raven@2.5.0 alert: failed to send exception to sentry: HTTP Error (429): undefined 
Oct 29 23:52:53 taskcluster-queue app/dependencyResolver.1: Failed to log error to Sentry: Error: HTTP Error (429): undefined 
Oct 29 23:52:54 taskcluster-queue app/dependencyResolver.2: raven@2.5.0 alert: failed to send exception to sentry: HTTP Error (429): undefined 
Oct 29 23:52:54 taskcluster-queue app/dependencyResolver.2: Failed to log error to Sentry: Error: HTTP Error (429): undefined 
Oct 29 23:52:54 taskcluster-queue app/dependencyResolver.3: raven@2.5.0 alert: failed to send exception to sentry: HTTP Error (429): undefined 
Oct 29 23:52:54 taskcluster-queue app/dependencyResolver.3: Failed to log error to Sentry: Error: HTTP Error (429): undefined
Looks like there was an Azure downtime around the same time.

429 is a rate limit, meaning Sentry is tired of hearing from us.  Perhaps from this very error.
Azure was returning OperationTimedOut errors for a brief time around when this failed (23:52:43).  But the error "stuck" because it killed all four dependencyResolver processes.

I restarted all dynos and it's up and running again.

It's really not clear to me why this error caused the resolver to exit and not restart.  John, is that something you could have a look at?
Assignee: dustin → nobody
Flags: needinfo?(jhford)
Looking at the metrics dashboard, there was a Heroku issue at around the same time.  I'm not sure the timelines match up perfectly, but they look close enough to be related.  I also don't see any crashes in the Heroku metrics, which suggests to me that the process didn't actually crash.  Most likely it was waiting forever on a promise that never resolved.

Let's see if this happens again, and also get the new PR from comment 1 landed in the meantime.
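
For illustration only: one common guard against a promise that never settles is to race it against a timer, so a hung storage call turns into a rejection the caller can handle. This is a minimal sketch, not what the queue actually does; `withTimeout`, its parameters, and the `LocalTimeout` error code are hypothetical names made up for this example.

```javascript
// Hypothetical helper (not part of taskcluster-queue): reject if `promise`
// has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => {
      const err = new Error(`${label} timed out after ${ms} ms`);
      err.code = 'LocalTimeout'; // hypothetical code, chosen for this sketch
      reject(err);
    }, ms);
  });
  // Whichever settles first wins. Always clear the timer afterwards so it
  // cannot keep the process alive once the real promise has settled.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage sketch: `neverSettles` stands in for a storage call that hangs.
const neverSettles = new Promise(() => {});
withTimeout(neverSettles, 50, 'getEntity')
  .catch(err => console.log(`caught: ${err.message}`));
```

Note that the underlying call is not cancelled here; the wrapper only stops the caller from waiting on it.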
Flags: needinfo?(jhford)
Assignee

Comment 5

8 months ago
I think this reared its head again today. I believe bug 1503430 was caused by this.
Assignee: nobody → bstack
Status: NEW → ASSIGNED
Priority: -- → P1
See Also: → 1503430
Brian: anything to do here?
Flags: needinfo?(bstack)
Severity: normal → major
Priority: P1 → P2
Assignee

Comment 7

7 months ago
I guess not!
Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Flags: needinfo?(bstack)
Resolution: --- → FIXED
See Also: → 1521453
Duplicate of this bug: 1521453

From bug 1521453, we should probably look at why Azure errors are causing this worker dyno to hang -- that's an issue that will likely follow us into Kubernetes.
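
As a rough sketch of one way to keep a hung worker from sitting idle indefinitely: a watchdog that the work loop "touches" on every iteration, and that exits the process if no progress is seen for too long. This assumes the platform restarts a process that exits non-zero. `createWatchdog`, `touch`, and `onStall` are hypothetical names for this example, not anything in the queue codebase.

```javascript
// Hypothetical watchdog: if touch() is not called for more than `maxIdleMs`,
// call `onStall` once (by default, exit non-zero so the process supervisor
// can restart the worker).
function createWatchdog(maxIdleMs, onStall = () => process.exit(1)) {
  let last = Date.now();
  const interval = setInterval(() => {
    const idle = Date.now() - last;
    if (idle > maxIdleMs) {
      clearInterval(interval); // fire at most once
      onStall(idle);
    }
  }, Math.max(1, Math.floor(maxIdleMs / 4)));
  interval.unref(); // the watchdog alone should not keep the process alive
  return {
    touch() { last = Date.now(); },      // call after each unit of progress
    stop() { clearInterval(interval); }, // call on clean shutdown
  };
}

// Usage sketch, inside a resolver-style loop (names are illustrative):
//   const watchdog = createWatchdog(5 * 60 * 1000);
//   while (running) { await handleNextMessage(); watchdog.touch(); }
```

The threshold has to be longer than the slowest legitimate iteration, otherwise the watchdog will kill healthy workers.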

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Component: Operations → Operations and Service Requests
URL: 1527583
See Also: → 1527583
Assignee

Comment 10

3 months ago

We've added monitoring around this that should email the tc team when this happens again.

Status: REOPENED → RESOLVED
Closed: 7 months ago → 3 months ago
Resolution: --- → FIXED