Closed Bug 1579496 Opened 6 years ago Closed 5 years ago

Azure-related tests timing out

Categories

(Taskcluster :: Services, defect)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: dustin, Assigned: bstack)

References

Details

https://tools.taskcluster.net/groups/dBhsDGc3Skmz_NP2O01YLQ/tasks/D2gxpF4kShOqfTnJBTzXRA/runs/0/logs/public%2Flogs%2Flive_backing.log

  1) services/queue/test/reruntask_test.js (real)
       "before all" hook: withEntity for Artifact for "create, claim, complete and rerun (is idempotent)":
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/queue/test/reruntask_test.js)

https://tools.taskcluster.net/groups/EYx_pnPqT2eaVeTQ2Wmr1g/tasks/BKUOvIKFRues2IN26MfI1w/runs/0/logs/public%2Flogs%2Flive_backing.log

  1) services/queue/test/claimtask_test.js (real)
       "before each" hook: withEntity for TaskGroupActiveSet for "can claimTask":
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/queue/test/claimtask_test.js)

https://tools.taskcluster.net/groups/JyHHVJFaRLCi4P8WzO5t7g/tasks/Z-3zkzxaTwSw0rvc_jHU2A/runs/0/logs/public%2Flogs%2Flive_backing.log

  1) services/queue/test/workerinfo_test.js (real)
       queue.getWorker returns 20 most recent taskIds:
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/queue/test/workerinfo_test.js)

https://tools.taskcluster.net/groups/WX5pnYVDQ7OwCIqKXlbmWQ/tasks/ZMOL_Mn2T2mfqPaim5-thw/runs/0/logs/public%2Flogs%2Flive_backing.log

  1) services/queue/test/claimresolver_test.js (real)
       "before each" hook: withEntity for TaskRequirement for "createTask , claimWork, claim expires, retried":
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/queue/test/claimresolver_test.js)

I managed to repro this locally, running mocha with -b so that it bails out at the first failure.

  azure:table Request: DELETE /WorkerT20190906TS7g3pqWq(PartitionKey='prov1-extended-extended-extended~gecko-b-2-linux-extended-extended',RowKey='my-worker-group-extended-extended~new'), retry: 0 +55ms                                                                                                                                                                                                                                    
  nock.common options.host: jungle.table.core.windows.net +55ms                                                                                                                                                                                                                                                                                                                                                                              
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms                                                                                                                                                                                                                                                                                                                                                              
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms                                                                                                                                                                                                                                                                                                                                                              
  nock.intercept interceptors for "jungle.table.core.windows.net:443" +54ms                                                                                                                                                                                                                                                                                                                                                                  
  nock.intercept filtering interceptors for basepath https://jungle.table.core.windows.net:443 +0ms
  nock.common options.host: jungle.table.core.windows.net:443 +0ms
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms
  nock.intercept Net connect  enabled for jungle.table.core.windows.net:443 +0ms
  azure:table Request: DELETE /WorkerT20190906TS7g3pqWq(PartitionKey='prov1-extended-extended-extended~gecko-b-2-linux-extended-extended',RowKey='my-worker-group-extended-extended~newq'), retry: 0 +0ms
  nock.common options.host: jungle.table.core.windows.net +0ms
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms
  nock.intercept interceptors for "jungle.table.core.windows.net:443" +0ms
  nock.intercept filtering interceptors for basepath https://jungle.table.core.windows.net:443 +0ms
  nock.common options.host: jungle.table.core.windows.net:443 +0ms
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms
  nock.intercept Net connect  enabled for jungle.table.core.windows.net:443 +0ms
  azure:table Request: DELETE /WorkerT20190906TS7g3pqWq(PartitionKey='prov1-extended-extended-extended~gecko-b-2-linux-extended-extended',RowKey='my-worker-group-extended-extended~old'), retry: 0 +0ms
  nock.common options.host: jungle.table.core.windows.net +0ms
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms
  nock.intercept interceptors for "jungle.table.core.windows.net:443" +1ms
  nock.intercept filtering interceptors for basepath https://jungle.table.core.windows.net:443 +0ms
  nock.common options.host: jungle.table.core.windows.net:443 +1ms
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms
  nock.intercept Net connect  enabled for jungle.table.core.windows.net:443 +0ms
  azure:table Request: DELETE /WorkerT20190906TS7g3pqWq(PartitionKey='prov1-extended-extended-extended~gecko-b-2-linux-extended-extended',RowKey='my-worker-group-extended-extended~q'), retry: 0 +1ms
  nock.common options.host: jungle.table.core.windows.net +0ms
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms
  nock.intercept interceptors for "jungle.table.core.windows.net:443" +0ms
  nock.intercept filtering interceptors for basepath https://jungle.table.core.windows.net:443 +1ms
  nock.common options.host: jungle.table.core.windows.net:443 +1ms
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms
  nock.intercept Net connect  enabled for jungle.table.core.windows.net:443 +0ms
  taskcluster.queue.table.worker INFO: monitor.measure:                                                       
  taskcluster.queue.table.worker        key: deleteEntity.success                                             
  taskcluster.queue.table.worker        v: 1                                                                  
  taskcluster.queue.table.worker        val: 31.806672 +0ms
  taskcluster.queue.table.worker INFO: monitor.count:                                                         
  taskcluster.queue.table.worker        key: deleteEntity.success
  taskcluster.queue.table.worker        v: 1                                                                  
  taskcluster.queue.table.worker        val: 1 +0ms
  taskcluster.queue.table.worker INFO: monitor.measure:                                                       
  taskcluster.queue.table.worker        key: deleteEntity.success
  taskcluster.queue.table.worker        v: 1                                                                                                                   
  taskcluster.queue.table.worker        val: 32.791403 +0ms
  taskcluster.queue.table.worker INFO: monitor.count:                                                         
  taskcluster.queue.table.worker        key: deleteEntity.success
  taskcluster.queue.table.worker        v: 1                                                                  
  taskcluster.queue.table.worker        val: 1 +0ms
  taskcluster.queue.table.worker INFO: monitor.measure:                                                       
  taskcluster.queue.table.worker        key: deleteEntity.success
  taskcluster.queue.table.worker        v: 1                                                                  
  taskcluster.queue.table.worker        val: 310.624817 +0ms
  taskcluster.queue.table.worker INFO: monitor.count: 
  taskcluster.queue.table.worker        key: deleteEntity.success                                                                              
  taskcluster.queue.table.worker        v: 1     
  taskcluster.queue.table.worker        val: 1 +0ms       
    1) "before each" hook: withEntity for Worker for "queue.listWorkers returns actions with the right context"
  nock.common options.hostname in the end: undefined +30s    
  nock.common options.host in the end: "localhost:443" +1ms                                                                                                           
  azure:table Request: GET /ArtifactsT20190906TS7g3pqWq(), retry: 0 +30s
  nock.common options.host: jungle.table.core.windows.net +0ms                   
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms

This is in the withEntities tear-down, where it's blasting through the entity types, querying each one and deleting its rows. A few interesting things:

  • +30s - that refers to the time since the last nock.common options.hostname line, which appears just before all the monitor.count/monitor.measure output. So 30 seconds went by with no logging at all.
  • The monitor.measure values are in ms, so the longest operation logged took about 310ms. But there are only three of them.

I don't see any further debug logging from taskcluster.queue.table.worker for a while, until

  azure:table Request: GET /WorkerT20190906TS7g3pqWq(), retry: 0 +33ms         
  nock.common options.host: jungle.table.core.windows.net +32ms                 
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms                    
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms
  nock.intercept interceptors for "jungle.table.core.windows.net:443" +32ms    
  nock.intercept filtering interceptors for basepath https://jungle.table.core.windows.net:443 +0ms
  nock.common options.host: jungle.table.core.windows.net:443 +0ms              
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms                    
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms
  nock.intercept Net connect  enabled for jungle.table.core.windows.net:443 +0ms
  taskcluster.queue.table.worker INFO: monitor.measure:                                            
  taskcluster.queue.table.worker        key: queryEntities.success              
  taskcluster.queue.table.worker        v: 1                                   
  taskcluster.queue.table.worker        val: 34.370085 +0ms                    
  taskcluster.queue.table.worker INFO: monitor.count:                           
  taskcluster.queue.table.worker        key: queryEntities.success                                 
  taskcluster.queue.table.worker        v: 1                      
  taskcluster.queue.table.worker        val: 1 +0ms                            
  azure:table Request: DELETE /WorkerT20190906TS7g3pqWq(PartitionKey='prov1-extended-extended-extended~gecko-b-2-linux-extended-extended',RowKey='my-worker-group-extended-extended~old'), retry: 0 +35ms
  nock.common options.host: jungle.table.core.windows.net +35ms                 
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms                    
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms
  nock.intercept interceptors for "jungle.table.core.windows.net:443" +35ms    
  nock.intercept filtering interceptors for basepath https://jungle.table.core.windows.net:443 +0ms
  nock.common options.host: jungle.table.core.windows.net:443 +0ms              
  nock.common options.hostname in the end: "jungle.table.core.windows.net" +0ms                    
  nock.common options.host in the end: "jungle.table.core.windows.net:443" +0ms
  nock.intercept Net connect  enabled for jungle.table.core.windows.net:443 +0ms
  taskcluster.queue.table.worker INFO: monitor.measure:                                            
  taskcluster.queue.table.worker        key: deleteEntity.success               
  taskcluster.queue.table.worker        v: 1                                   
  taskcluster.queue.table.worker        val: 32.961968 +0ms                    
  taskcluster.queue.table.worker INFO: monitor.count:                           
  taskcluster.queue.table.worker        key: deleteEntity.success                                  
  taskcluster.queue.table.worker        v: 1                      
  taskcluster.queue.table.worker        val: 1 +0ms                            

That this scan-and-delete happens twice makes sense -- we clean up rows after each test and again after the whole suite. But note that the same row (RowKey='my-worker-group-extended-extended~old') turns up again here and is successfully deleted in about 30ms.

So it looks like requests to Azure are just disappearing into the ether and timing out. I suspect this isn't a change in our code, but a change in Azure's behavior. It's really common, though :(

That doesn't explain the distribution of the errors, though -- why so heavily queue?

https://github.com/taskcluster/azure-entities/blob/master/src/entity.js#L77

var AZURE_TABLE_TIMEOUT     = 7 * 1000;
...
  const client = new azure.Table(_.defaults({
    timeout:          AZURE_TABLE_TIMEOUT,
    agent:            options.agent,
  }, options.credentials));

so we do pass a timeout to fast-azure-storage, and it is far less than 30s.

And fast-azure-storage calls req.setTimeout with the appropriate value (7500ms -- it adds half a second so that the server has a chance to abort the connection first). Something's not adding up here. I'll see if I can repro with some console.log calls in fast-azure-storage.
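
For reference, the wiring I expect is roughly the following -- a sketch with illustrative names, not fast-azure-storage's actual code: the 7-second timeout is also sent to the server as a timeout=7 query parameter, while timeout + 500ms goes to req.setTimeout on the client side.

  // Sketch only -- illustrative names, not the library's real code.
  const https = require('https');
  const AZURE_TABLE_TIMEOUT = 7 * 1000;

  function tableRequest({hostname, method, path}) {
    return new Promise((resolve, reject) => {
      // server-side timeout, in seconds, goes into the query string (timeout=7)
      const req = https.request({
        hostname,
        method,
        path: `${path}?timeout=${AZURE_TABLE_TIMEOUT / 1000}`,
      });
      // client-side timeout: half a second longer, so the server gets to abort first
      req.setTimeout(AZURE_TABLE_TIMEOUT + 500, () => {
        req.abort();
        reject(new Error('RequestTimedOut'));
      });
      req.on('response', resolve);
      req.on('error', reject);
      req.end();
    });
  }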

  azure:authorization Refreshing shared-access-signature +29ms                                                        
  taskcluster-client Calling: azureTableSAS, retry: 0 +29ms           
  nock.common options.host: auth.taskcluster.net +30ms                                                                
  nock.common options.hostname in the end: "auth.taskcluster.net" +0ms                                                  
  nock.common options.host in the end: "auth.taskcluster.net:443" +0ms                                                         
  nock.intercept interceptors for "auth.taskcluster.net:443" +30ms                                                                                                         
  nock.intercept filtering interceptors for basepath https://auth.taskcluster.net:443 +0ms                            
  nock.common options.host: auth.taskcluster.net:443 +0ms                                              
  nock.common options.hostname in the end: "auth.taskcluster.net" +0ms                                                                                                                                        
  nock.common options.host in the end: "auth.taskcluster.net:443" +0ms        
  nock.intercept Net connect  enabled for auth.taskcluster.net:443 +0ms                                                                                                                          
  superagent GET https://auth.taskcluster.net/v1/azure/jungle/table/TaskGroupsT20190906TbNxtuZRk/read-write +31ms
    1) "before each" hook: withEntity for TaskGroup for "queue.listProvisioners returns an empty list"                
  taskcluster-client Request error calling azureTableSAS (retrying), err: Error: Aborted, JSON: Error: Aborted +30s
  nock.common options.hostname in the end: undefined +30s                                                             

That's a different error from last time, but certainly something that we could adjust.

  azure:queue Request: DELETE /q89743-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-1/messages/dcd0c4bb-8125-42ed-a24a-6025daa159cb, retry: 0 +2ms
7500
  nock.common options.host: jungle.queue.core.windows.net +2ms
  nock.common options.hostname in the end: "jungle.queue.core.windows.net" +0ms
  nock.common options.host in the end: "jungle.queue.core.windows.net:443" +0ms
  nock.intercept interceptors for "jungle.queue.core.windows.net:443" +3ms
  nock.intercept filtering interceptors for basepath https://jungle.queue.core.windows.net:443 +0ms
  nock.common options.host: jungle.queue.core.windows.net:443 +1ms
  nock.common options.hostname in the end: "jungle.queue.core.windows.net" +0ms
  nock.common options.host in the end: "jungle.queue.core.windows.net:443" +0ms
  nock.intercept Net connect  enabled for jungle.queue.core.windows.net:443 +0ms
    1) queue.getWorker returns 20 most recent taskIds
  nock.common options.hostname in the end: undefined +22s
  nock.common options.host in the end: "localhost:443" +1ms

Comment 2 also showed options.hostname in the end: undefined. What's up with that? :(

nock/lib/common.js

  7 /**
  8  * Normalizes the request options so that it always has `host` property.
  9  *
 10  * @param  {Object} options - a parsed options object of the request
 11  */
 12 function normalizeRequestOptions(options) {
 13   options.proto = options.proto || 'http'
 14   options.port = options.port || (options.proto === 'http' ? 80 : 443)
 15   if (options.host) {
 16     debug('options.host:', options.host)
 17     if (!options.hostname) {
 18       if (options.host.split(':').length === 2) {
 19         options.hostname = options.host.split(':')[0]
 20       } else {
 21         options.hostname = options.host
 22       }
 23     }
 24   }
 25   debug('options.hostname in the end: %j', options.hostname)                                                                                                                                                                                                                                                               
 26   options.host = `${options.hostname || 'localhost'}:${options.port}`
 27   debug('options.host in the end: %j', options.host)
 28 
 29   /// lowercase host names
 30   ;['hostname', 'host'].forEach(function(attr) {
 31     if (options[attr]) {
 32       options[attr] = options[attr].toLowerCase()
 33     }
 34   })
 35 
 36   return options
 37 }

That's some weird almost-semicolon-free JS. But it suggests that somewhere this function is getting called with no host or hostname.
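
For what it's worth, calling the quoted function with proto already set to 'https' (nock sets that for requests made through the https module) but with no host or hostname reproduces exactly the suspicious log lines:

  // illustration only, using the normalizeRequestOptions quoted above
  const options = normalizeRequestOptions({ proto: 'https', path: '/ArtifactsT20190906TS7g3pqWq()' });
  // debug: options.hostname in the end: undefined
  // debug: options.host in the end: "localhost:443"
  console.log(options.hostname);  // undefined
  console.log(options.host);      // "localhost:443"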

Hm, but that seems to happen in success cases, too. Probably a false lead.

I tried annotating nock to log all HTTP requests, hoping I could figure out which request was failing to time out after 7s. But I can't repro the problem anymore.

It takes a while, but all I can see is

     Error: Errors logged to monitor during test run:
{
  "code": "OperationTimedOut",
  "incidentId": "c0d47e16-9615-4d7e-bf45-26b99ee2f036",
  "message": "Operation could not be completed within the specified time.\nRequestId:6de4942d-5003-00fd-69f3-6559e7000000\nTime:2019-09-08T03:15:07.2579762Z",
  "method": "createTask",
  "module": "fast-azure-storage",
  "name": "OperationTimedOutError",
  "note": "Failed to ensure azure queue in queueservice.js",
  "params": {
    "taskId": "_2keykyOTLemcae66rwOYA"
  },
  "payload": {
    "created": "2019-09-08T03:14:59.531Z",
    "deadline": "2019-09-08T03:44:59.531Z",
    "dependencies": [],
    "expires": "2020-09-08T03:44:59.531Z",
    "extra": {},
    "metadata": {
      "description": "Task created during unit tests",
      "name": "Unit testing task",
      "owner": "haali@mozilla.com",
      "source": "https://github.com/taskcluster/taskcluster-queue"
    },
    "payload": {},
    "priority": "lowest",
    "provisionerId": "no-provisioner",
    "requires": "all-completed",
    "retries": 5,
    "routes": [],
    "schedulerId": "-",
    "scopes": [],
    "tags": {},
    "taskGroupId": "_2keykyOTLemcae66rwOYA",
    "workerType": "gecko-b-1-android"
  },
  "retries": 0,
  "stack": "OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:6de4942d-5003-00fd-69f3-6559e7000000\nTime:2019-09-08T03:15:07.2579762Z\n    at /home/dustin/p/taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)",
  "statusCode": 500,
  "v": 1
}

which is the "normal" (handled) Azure timeout. It takes about 4h to generate that error, which matches the rate at which we see it in CI.

OK, got something:

2019-09-08T13:50:24.516Z taskcluster-client Calling: claimWork, retry: 0
req 1114 POST localhost undefined 60401 /api/queue/v1/claim-work/no-provisioner/gecko-b-1-android
...
    1) queue.getWorker returns 20 most recent taskIds
2019-09-08T13:50:39.674Z nock.common options.hostname in the end: undefined
2019-09-08T13:50:39.674Z nock.common options.host in the end: "localhost:443"
2019-09-08T13:50:39.674Z azure:table Request: GET /ArtifactsT20190908TEhVSExR8(), retry: 0
...
2019-09-08T13:50:40.661Z nock.intercept Net connect  enabled for jungle.table.core.windows.net:443
evt 1378 response
2019-09-08T13:50:40.696Z taskcluster.queue.table.worker INFO: monitor.measure: 
        key: deleteEntity.success
        v: 1
        val: 34.355328
2019-09-08T13:50:40.696Z taskcluster.queue.table.worker INFO: monitor.count: 
        key: deleteEntity.success
        v: 1
        val: 1
evt 1114 abort
2019-09-08T13:50:54.526Z taskcluster-client Request error calling claimWork (retrying), err: Error: Aborted, JSON: Error: Aborted

2019-09-08T13:50:54.527Z mocha:runner finished running

  42 passing (1m)
  1 failing

  1) services/queue/test/workerinfo_test.js (real)
       queue.getWorker returns 20 most recent taskIds:
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/home/dustin/p/taskcluster/services/queue/test/workerinfo_test.js)

The 1114 here is a single call to https.request. There are lots of other requests going on at the same time (up through 1378 as seen above). That particular request is aborted 30s after it is made. The test itself times out before that (at 13:50:39.674) and begins cleaning up, but mocha hangs around until req 1114 is done.
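
(For reference, the req / evt lines come from instrumentation along these lines -- a sketch, not the exact patch -- which wraps https.request, assigns each call a sequence number, and logs its lifecycle events.)

  // rough sketch of the instrumentation; names and output format are approximate
  const https = require('https');
  let seq = 0;
  const origRequest = https.request;
  https.request = function (options, callback) {
    const id = ++seq;
    console.log('req', id, options.method || 'GET', options.host, options.hostname, options.port, options.path);
    const req = origRequest.call(this, options, callback);
    ['socket', 'response', 'timeout', 'abort', 'aborted', 'error', 'close'].forEach(evt =>
      req.on(evt, () => console.log('evt', id, evt)));
    return req;
  };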

I can't quite figure this out from https://nodejs.org/docs/latest-v10.x/api/http.html. Per that page, a request is only aborted when req.abort is called, and nothing in the tc-client code calls that method. The docs even call out:

Note that setting the timeout option or using the setTimeout() function will not abort the request or do anything besides add a 'timeout' event.

so this is definitely not a timeout.
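
In other words, under Node 10 semantics the timeout only produces an event; aborting is always the caller's responsibility:

  const https = require('https');
  const req = https.request({ hostname: 'example.com', path: '/', timeout: 7500 });
  req.on('timeout', () => {
    req.abort();  // without this explicit abort, the request just keeps going
  });
  req.on('error', () => {});        // abort() typically surfaces as a 'socket hang up' error
  req.on('response', res => res.resume());
  req.end();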

The request is definitely delivered to the local express server:

Sun, 08 Sep 2019 13:50:24 GMT express:router dispatching POST /api/queue/v1/claim-work/no-provisioner/gecko-b-1-android
...
evt 1115 response
2019-09-08T13:50:24.520Z taskcluster-client Success calling: authenticateHawk, (0 retries)
...
2019-09-08T13:50:24.520Z azure:table Request: GET /WorkerT20190908TEhVSExR8(PartitionKey='no-provisioner~gecko-b-1-android',RowKey='my-worker-group-extended-extended~my-worker-extended-extended'), retry: 0
req 1116 GET jungle.table.core.windows.net undefined undefined /WorkerT20190908TEhVSExR8(PartitionKey='no-provisioner~gecko-b-1-android',RowKey='my-worker-group-extended-extended~my-worker-extended-extended')?timeout=7&sv=2014-02-14&tn=WorkerT20190908TEhVSExR8&se=2019-09-08T14%3A14%3A43Z&sp=raud&sig=AWlEPCztY6O0zEcFQN
aBPDzRu5QpsZKWYT%2BoJuim7BM%3D&st=2019-09-08T13%3A34%3A43Z
...
evt 1116 response                                                                                                                                                                                                                                                                                                             
2019-09-08T13:50:24.548Z azure:table Error code: ResourceNotFound (404) for GET /WorkerT20190908TEhVSExR8(PartitionKey='no-provisioner~gecko-b-1-android',RowKey='my-worker-group-extended-extended~my-worker-extended-extended') on retry: 0                                                                                 
...
2019-09-08T13:50:24.549Z azure:table Request: GET /ProvisionerT20190908TEhVSExR8(PartitionKey='no-provisioner',RowKey='provisioner'), retry: 0                                                                                                                                                                                
req 1117 GET jungle.table.core.windows.net undefined undefined /ProvisionerT20190908TEhVSExR8(PartitionKey='no-provisioner',RowKey='provisioner')?timeout=7&sv=2014-02-14&tn=ProvisionerT20190908TEhVSExR8&se=2019-09-08T14%3A14%3A42Z&sp=raud&sig=4ie9WI4GzKJJu3Td9yRfiZdWMvyVLUj4y3JqDzWjCt4%3D&st=2019-09-08T13%3A34%3A42Z 
...
2019-09-08T13:50:24.549Z azure:table Request: GET /WorkerTypeT20190908TEhVSExR8(PartitionKey='no-provisioner',RowKey='gecko-b-1-android'), retry: 0                                                                                                                                                                           
req 1118 GET jungle.table.core.windows.net undefined undefined /WorkerTypeT20190908TEhVSExR8(PartitionKey='no-provisioner',RowKey='gecko-b-1-android')?timeout=7&sv=2014-02-14&tn=WorkerTypeT20190908TEhVSExR8&se=2019-09-08T14%3A14%3A43Z&sp=raud&sig=5D9iBVd22%2FzpoCBkWyoAzOKlhJb4I3CCbQgR7MahEPE%3D&st=2019-09-08T13%3A34%3A43Z                                                                                                                                                                                                                                                                                                                          
...
2019-09-08T13:50:24.549Z azure:table Request: GET /WorkerT20190908TEhVSExR8(PartitionKey='no-provisioner~gecko-b-1-android',RowKey='my-worker-group-extended-extended~my-worker-extended-extended'), retry: 0                                                                                                                 
req 1119 GET jungle.table.core.windows.net undefined undefined /WorkerT20190908TEhVSExR8(PartitionKey='no-provisioner~gecko-b-1-android',RowKey='my-worker-group-extended-extended~my-worker-extended-extended')?timeout=7&sv=2014-02-14&tn=WorkerT20190908TEhVSExR8&se=2019-09-08T14%3A14%3A43Z&sp=raud&sig=AWlEPCztY6O0zEcFQNaBPDzRu5QpsZKWYT%2BoJuim7BM%3D&st=2019-09-08T13%3A34%3A43Z                                                                                                                                                                                                                                                                    
...
2019-09-08T13:50:24.550Z azure:queue Request: GET /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-7/messages, retry: 0                                                                                                                                                                                              
req 1120 GET jungle.queue.core.windows.net undefined undefined /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-7/messages?numofmessages=30&visibilitytimeout=300&timeout=7                                                                                                                                          
...
evt 1117 response                                                                                                                                                                                                                                                                                                             
2019-09-08T13:50:24.579Z base:entity Return modify trivially, as no changes was made by modifier                                                                                                                                                                                                                              
...
evt 1119 response                                                                                                                                                                                                                                                                                                             
2019-09-08T13:50:24.580Z azure:table Error code: ResourceNotFound (404) for GET /WorkerT20190908TEhVSExR8(PartitionKey='no-provisioner~gecko-b-1-android',RowKey='my-worker-group-extended-extended~my-worker-extended-extended') on retry: 0                                                                                 
...
2019-09-08T13:50:24.580Z azure:table Request: POST /WorkerT20190908TEhVSExR8, retry: 0                                                                                                                                                                                                                                        
req 1121 POST jungle.table.core.windows.net undefined undefined /WorkerT20190908TEhVSExR8?timeout=7&sv=2014-02-14&tn=WorkerT20190908TEhVSExR8&se=2019-09-08T14%3A14%3A43Z&sp=raud&sig=AWlEPCztY6O0zEcFQNaBPDzRu5QpsZKWYT%2BoJuim7BM%3D&st=2019-09-08T13%3A34%3A43Z                                                            
...
evt 1120 response                                                                                                                                                                                                                                                                                                             
...
2019-09-08T13:50:24.582Z azure:queue Request: GET /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-6/messages, retry: 0                                                                                                                                                                                              
req 1122 GET jungle.queue.core.windows.net undefined undefined /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-6/messages?numofmessages=30&visibilitytimeout=300&timeout=7                                                                                                                                          
...
evt 1118 response                                                                                                                                                                                                                                                                                                             
2019-09-08T13:50:24.583Z azure:table Error code: ResourceNotFound (404) for GET /WorkerTypeT20190908TEhVSExR8(PartitionKey='no-provisioner',RowKey='gecko-b-1-android') on retry: 0                                                                                                                                           
...
2019-09-08T13:50:24.583Z azure:table Request: POST /WorkerTypeT20190908TEhVSExR8, retry: 0                                                                                                                                                                                                                                    
req 1123 POST jungle.table.core.windows.net undefined undefined /WorkerTypeT20190908TEhVSExR8?timeout=7&sv=2014-02-14&tn=WorkerTypeT20190908TEhVSExR8&se=2019-09-08T14%3A14%3A43Z&sp=raud&sig=5D9iBVd22%2FzpoCBkWyoAzOKlhJb4I3CCbQgR7MahEPE%3D&st=2019-09-08T13%3A34%3A43Z                                                    
...
evt 1121 response                                                                                                                                                                                                                                                                                                             
...
evt 1122 response                                                                                                                                                                                                                                                                                                             
...
2019-09-08T13:50:24.612Z azure:queue Request: GET /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-5/messages, retry: 0                                                                                                                                                                                              
req 1124 GET jungle.queue.core.windows.net undefined undefined /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-5/messages?numofmessages=30&visibilitytimeout=300&timeout=7                                                                                                                                          
...
evt 1123 response                                                                                                                                                                                                                                                                                                             
...
evt 1124 response                                                                                                                                                                                                                                                                                                             
...
2019-09-08T13:50:24.642Z azure:queue Request: GET /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-4/messages, retry: 0                                                                                                                                                                                              
req 1125 GET jungle.queue.core.windows.net undefined undefined /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-4/messages?numofmessages=30&visibilitytimeout=300&timeout=7                                                                                                                                          
...
evt 1125 response                                                                                                                                                                                                                                                                                                             
...
2019-09-08T13:50:24.671Z azure:queue Request: GET /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-3/messages, retry: 0                                                                                                                                                                                              
req 1126 GET jungle.queue.core.windows.net undefined undefined /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-3/messages?numofmessages=30&visibilitytimeout=300&timeout=7                                                                                                                                          
...
evt 1126 response                                                                                                                                                                                                                                                                                                             
...
2019-09-08T13:50:24.700Z azure:queue Request: GET /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-2/messages, retry: 0                                                                                                                                                                                              
req 1127 GET jungle.queue.core.windows.net undefined undefined /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-2/messages?numofmessages=30&visibilitytimeout=300&timeout=7                                                                                                                                          
...
evt 1127 response                                                                                                                                                                                                                                                                                                             
...
2019-09-08T13:50:24.731Z azure:queue Request: GET /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-1/messages, retry: 0                                                                                                                                                                                              
req 1128 GET jungle.queue.core.windows.net undefined undefined /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-1/messages?numofmessages=30&visibilitytimeout=300&timeout=7                                                                                                                                          
...
evt 1128 response                                                                                                                                                                                                                                                                                                             
...
2019-09-08T13:50:24.818Z azure:table Request: GET /TasksT20190908TEhVSExR8(PartitionKey='MLAiTJgIQECQOHNRCo5vKg',RowKey='task'), retry: 0                                                                                                                                                                                     
req 1129 GET jungle.table.core.windows.net undefined undefined /TasksT20190908TEhVSExR8(PartitionKey='MLAiTJgIQECQOHNRCo5vKg',RowKey='task')?timeout=7&sv=2014-02-14&tn=TasksT20190908TEhVSExR8&se=2019-09-08T14%3A14%3A40Z&sp=raud&sig=xP5d598W2u26%2FHE8KsBIEWFGGB8fyoa1noN7tfkfpLc%3D&st=2019-09-08T13%3A34%3A40Z          
... (reqs 1129-1158, each with a different taskId, for a total of 30 taskIds)
2019-09-08T13:50:24.838Z azure:table Request: GET /TasksT20190908TEhVSExR8(PartitionKey='O0HqF9oLSGycelkuxM5cvg',RowKey='task'), retry: 0                                                                                                                                                                                     
req 1158 GET jungle.table.core.windows.net undefined undefined /TasksT20190908TEhVSExR8(PartitionKey='O0HqF9oLSGycelkuxM5cvg',RowKey='task')?timeout=7&sv=2014-02-14&tn=TasksT20190908TEhVSExR8&se=2019-09-08T14%3A14%3A40Z&sp=raud&sig=xP5d598W2u26%2FHE8KsBIEWFGGB8fyoa1noN7tfkfpLc%3D&st=2019-09-08T13%3A34%3A40Z          
...
evt 1129 response                                                                                                                                                                                                                                                                                                             
evt 1132 response                                                                                                                                                                                                                                                                                                             
evt 1131 response                                                                                                                                                                                                                                                                                                             
evt 1130 response
...
2019-09-08T13:50:24.892Z azure:queue Request: POST /q78939-claim/messages, retry: 0                                                                                                                                                                                                                                           
req 1159 POST jungle.queue.core.windows.net undefined undefined /q78939-claim/messages?visibilitytimeout=1&messagettl=604800&timeout=7                                                                                                                                                                                        
... (a whole bunch of nonsense re-queueing, updating Task entities, etc.)
2019-09-08T13:50:25.253Z azure:queue Request: DELETE /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-1/messages/81ceb841-ba56-4a65-9018-6a12e207ebbc, retry: 0
req 1245 DELETE jungle.queue.core.windows.net undefined undefined /q78939-g3okl6atj5fuope2kwappjoz-3fhc2wyxltjsjxehpb7v7cl3-1/messages/81ceb841-ba56-4a65-9018-6a12e207ebbc?popreceipt=AgAAAAMAAAAAAAAAog8xEE1m1QE%3D&timeout=7

A bit of scripting, however, shows that

2019-09-08T13:50:24.830Z azure:table Request: GET /TasksT20190908TEhVSExR8(PartitionKey='6_26Ni5OQfeFKiSq8XU6zQ',RowKey='task'), retry: 0
req 1147 GET jungle.table.core.windows.net undefined undefined /TasksT20190908TEhVSExR8(PartitionKey='6_26Ni5OQfeFKiSq8XU6zQ',RowKey='task')?timeout=7&sv=2014-02-14&tn=TasksT20190908TEhVSExR8&se=2019-09-08T14%3A14%3A40Z&sp=raud&sig=xP5d598W2u26%2FHE8KsBIEWFGGB8fyoa1noN7tfkfpLc%3D&st=2019-09-08T13%3A34%3A40Z

never completes. In fact, none of the events abort, aborted, response, or timeout ever occurs on that request, even after the test fails -- so it's not even in the node event loop, or mocha wouldn't have exited. It occurs to me that it might have emitted error, which I didn't listen for. But if that had occurred, then workclaimer.js should have logged "claimTask from hint failed", which I don't see. I do see that logged if I throw a test exception in the claimTask method.

      claims = await Promise.all(hints.map(async (hint) => {
        try {
          // Try to claim task from hint
          let result = await this._monitor.timer('claimTask', this.claimTask(                                                                                                                                                                                                                                                  
            hint.taskId, hint.runId, workerGroup, workerId, null, hint.hintId,
          ));
          ....
        } catch (err) {
          // Report error, don't block
          this._monitor.reportError(err, {
            comment: 'claimTask from hint failed',
          });
          ...
        }
        return 'error-claiming';
      }));
...
  async claimTask(taskId, runId, workerGroup, workerId, task = null, hintId = null) {
    // Load task, if not given
    if (!task) {
      task = await this._Task.load({taskId}, true);
      if (!task) {
        return 'task-not-found';
      }
    }
    ...
  }

Also, the claimWork API method implementation sets a done property in the claim method (via the aborted Promise) after 20 seconds. So if the outer loop in that method:

while (claims.length === 0 && !done) {

was still looping, then we would see the API method return after 20 seconds. It really does look like this._Task.load(..) simply never returned.
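
For context, the control flow is roughly this (a sketch with hypothetical names, not the real claimWork implementation):

  // sketch: `aborted` resolves ~20 seconds into the request, flipping `done`,
  // so even an empty claim loop should hand control back within about 20s
  async function claim(hints, aborted) {
    let done = false;
    aborted.then(() => { done = true; });
    let claims = [];
    while (claims.length === 0 && !done) {
      claims = await claimAnyTasks(hints);  // hypothetical stand-in for the hint-claiming code quoted above
    }
    return claims;
  }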

I'll try again, this time logging more events on the request.

2019-09-08T22:04:20.556Z taskcluster-client Calling: claimWork, retry: 0
req 1112 POST localhost undefined 60401 /api/queue/v1/claim-work/no-provisioner/gecko-b-1-android
...
evt 1112 abort
2019-09-08T22:04:50.563Z taskcluster-client Request error calling claimWork (retrying), err: Error: Aborted, JSON: Error: Aborted
evt 1112 error

2019-09-08T22:04:50.566Z mocha:runner finished running

Request 1146 never finishes:

2019-09-08T22:04:20.870Z azure:table Request: GET /TasksT20190908TQPuX6t6T(PartitionKey='WnmWmHdUQUCsA1hrX7NO3w',RowKey='task'), retry: 0
req 1146 GET jungle.table.core.windows.net undefined undefined /TasksT20190908TQPuX6t6T(PartitionKey='WnmWmHdUQUCsA1hrX7NO3w',RowKey='task')?timeout=7&sv=2014-02-14&tn=TasksT20190908TQPuX6t6T&se=2019-09-08T22%3A28%3A46Z&sp=raud&sig=SpisK%2B1GwFRjHEyLO73OwF91jo0bQx%2B0QqmHA%2F8a9KA%3D&st=2019-09-08T21%3A48%3A46Z

So, exactly the same behavior as in the last comment, and this confirms that none of the events ['error', 'timeout', 'abort', 'aborted', 'response'] occurred on that ClientRequest object, and that it was not associated with an open event-loop timer or socket. This looks like a bug in the node http implementation.

Maybe we're missing an event. https://stackoverflow.com/a/18087021 suggests monkey-patching emit. So let's do that.
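
The patch is roughly the following (a sketch; the real one also re-uses the per-request id from the earlier https.request wrapper):

  // rough sketch: wrap ClientRequest.prototype.emit so every event is logged, documented or not;
  // _reqId is the hypothetical id assigned by the request wrapper above
  const http = require('http');
  const origEmit = http.ClientRequest.prototype.emit;
  http.ClientRequest.prototype.emit = function (event, ...args) {
    console.log('evt', this._reqId, event);
    return origEmit.call(this, event, ...args);
  };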

Different error here, but really simple:

req 3 PUT jungle.blob.core.windows.net undefined 443 /?comp=properties&restype=service
2019-09-09T13:15:28.843Z nock.common options.host: jungle.blob.core.windows.net
2019-09-09T13:15:28.844Z nock.common options.hostname in the end: "jungle.blob.core.windows.net"
2019-09-09T13:15:28.844Z nock.common options.host in the end: "jungle.blob.core.windows.net:443"
2019-09-09T13:15:28.844Z nock.intercept interceptors for "jungle.blob.core.windows.net:443"
2019-09-09T13:15:28.844Z nock.intercept filtering interceptors for basepath https://jungle.blob.core.windows.net:443
2019-09-09T13:15:28.844Z nock.common options.host: jungle.blob.core.windows.net:443
2019-09-09T13:15:28.844Z nock.common options.hostname in the end: "jungle.blob.core.windows.net"
2019-09-09T13:15:28.844Z nock.common options.host in the end: "jungle.blob.core.windows.net:443"
2019-09-09T13:15:28.844Z nock.intercept Net connect  enabled for jungle.blob.core.windows.net:443
evt 3 socket
evt 3 prefinish
evt 3 finish
    1) "before all" hook: withEntity for Artifact for "queue.listProvisioners returns an empty list"
2019-09-09T13:15:58.615Z nock.common options.hostname in the end: undefined
2019-09-09T13:15:58.615Z nock.common options.host in the end: "localhost:443"

2019-09-09T13:15:58.616Z mocha:runner finished running

prefinish and finish are not documented. I'll try adding timestamps to the events.

2019-09-09T16:33:57.603Z azure:queue Request: POST /q03554-claim/messages, retry: 0
req 1180 POST jungle.queue.core.windows.net undefined undefined /q03554-claim/messages?visibilitytimeout=1&messagettl=604800&timeout=7
2019-09-09T16:33:57.603Z nock.common options.host: jungle.queue.core.windows.net
2019-09-09T16:33:57.603Z nock.common options.hostname in the end: "jungle.queue.core.windows.net"
2019-09-09T16:33:57.603Z nock.common options.host in the end: "jungle.queue.core.windows.net:443"
2019-09-09T16:33:57.603Z nock.intercept interceptors for "jungle.queue.core.windows.net:443"
2019-09-09T16:33:57.603Z nock.intercept filtering interceptors for basepath https://jungle.queue.core.windows.net:443
2019-09-09T16:33:57.603Z nock.common options.host: jungle.queue.core.windows.net:443
2019-09-09T16:33:57.603Z nock.common options.hostname in the end: "jungle.queue.core.windows.net"
2019-09-09T16:33:57.603Z nock.common options.host in the end: "jungle.queue.core.windows.net:443"
2019-09-09T16:33:57.603Z nock.intercept Net connect  enabled for jungle.queue.core.windows.net:443
evt 1180 socket 2019-09-09T16:33:57.604Z
evt 1180 prefinish 2019-09-09T16:33:57.604Z
...
evt 1114 abort 2019-09-09T16:34:27.140Z
2019-09-09T16:34:27.140Z taskcluster-client Request error calling claimWork (retrying), err: Error: Aborted, JSON: Error: Aborted
evt 1114 error 2019-09-09T16:34:27.141Z
evt 1114 close 2019-09-09T16:34:27.141Z

2019-09-09T16:34:27.142Z mocha:runner finished running

https://github.com/nodejs/node/blob/db706da235d80e5dadaab64328bfad9cb313be39/lib/_http_outgoing.js

    if (this.finished) {
      // This is a queue to the server or client to bring in the next this.
      this._finish();
    } else if (ret && this[kNeedDrain]) {
      this[kNeedDrain] = false;
      this.emit('drain');
    }

where _finish() is

OutgoingMessage.prototype._finish = function _finish() {
  assert(this.socket);
  this.emit('prefinish');
};

so that's where the prefinish is coming from (I think ClientRequest is a subclass of OutgoingMessage). "This is a queue to the server or client to bring in the next this." I suspect that is supposed to be "This is a cue to the server or client to bring in the next message" (when multiple messages/requests are queued on a single HTTP connection).

So I think prefinish means "I've sent all my data to the socket". There's a timeout set on that request (7500ms), so it should have called socket.setTimeout and, if nothing else, produced a timeout event 7500ms later. But it doesn't.

I suspect I'm quite a way down the wrong trail here. Something changed last week to cause this. It wasn't a new node version.

  • Azure?
  • Task environment? Did github-worker change?
  • We started running a lot more TC CI tasks due to Renovate

I'm unassigning myself. If you'd like to give this a try, please do. Alternatively, we can disable some of the worst-offending tests (such as the claim-30-tasks test) to get back to mostly green checkmarks in GitHub until we solve this.

Assignee: dustin → nobody

Not just queue!

https://taskcluster-artifacts.net/Xc-2Z4HiRjO23VK-0VrGCA/0/public/logs/live_backing.log

  1) services/worker-manager/test/worker_scanner_test.js (real)
       "before each" hook: withEntity for Worker for "single worker":
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/worker-manager/test/worker_scanner_test.js)
Assignee: nobody → bstack
Status: NEW → ASSIGNED
Depends on: 1580483
Blocks: 1580483
No longer depends on: 1580483

nock appears to be innocent -- even at v10,

https://tools.taskcluster.net/groups/ffJAYOC7SQ2eaDSJNxs1yg/tasks/ZBMVbQx6SBqjsldImObBGA/runs/0/logs/public%2Flogs%2Flive.log

  1) services/queue/test/taskgroup_test.js (real)
       "before each" hook: withEntity for TaskDependency for "Create two tasks with same taskGroupId":
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/queue/test/taskgroup_test.js)
  

  2) services/queue/test/taskgroup_test.js (real)
       "after all" hook: withEntity for TaskDependency for "task-group membership expiration (doesn't drop table)":
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/queue/test/taskgroup_test.js)
  

  3) services/queue/test/workerinfo_test.js (mock)
       "before all" hook for "queue.listProvisioners returns an empty list":
     TypeError: helper.queueService.client._reset is not a function
      at Context.<anonymous> (test/helper.js:118:34)

That last one is pretty interesting!

116       helper.load.cfg('azure.fake', true);
117       helper.queueService = await helper.load('queueService');
118       helper.queueService.client._reset();  

The QueueService constructor uses a new FakeClient (which has a _reset method) if options.credentials.fake is set. Presumably queueService had already been initialized with a real client at that point. How could that possibly be intermittent?
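
For reference, that constructor logic is roughly (a sketch, not the actual queueservice.js):

  // illustrative sketch of the selection described above
  class QueueService {
    constructor(options) {
      if (options.credentials.fake) {
        this.client = new FakeClient();                       // has a _reset() method
      } else {
        this.client = new azure.Queue(options.credentials);   // real client, no _reset()
      }
    }
  }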

https://tools.taskcluster.net/groups/G4-uy5Q-Sg-173lqjlLt0A/tasks/NSMI7xrITXuu1j1a_ZkSBg/runs/0/logs/public%2Flogs%2Flive_backing.log

  1) services/queue/test/claimwork_test.js (real)
       claimWork, reclaimTask, reportCompleted:
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/queue/test/claimwork_test.js)

https://tools.taskcluster.net/groups/M9emc8x7R_KV50LiDGEO6w/tasks/NLhJzu42Q3iqsiGVVc4eQA/runs/0/logs/public%2Flogs%2Flive_backing.log


  services/queue/test/artifact_test.js (real)
    ✓ Post Public S3 artifact (15298ms)
    1) "after each" hook for "Post Public S3 artifact"


  1 passing (20s)
  1 failing

  1) "after each" hook for "Post Public S3 artifact":
     Error: Errors logged to monitor during test run:
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:2e7df674-f003-0015-0b64-69a41c000000\nTime:2019-09-12T12:22:30.6466616Z","module":"fast-azure-storage","name":"OperationTimedOutError","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:2e7df674-f003-0015-0b64-69a41c000000\nTime:2019-09-12T12:22:30.6466616Z</Message></Error>","provisionerId":"no-provisioner","queue":"q37121-g3okl6atj5fuope2kwappjoz-ewlmcjzxabjkgg5zgktac73r-5","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:2e7df674-f003-0015-0b64-69a41c000000\nTime:2019-09-12T12:22:30.6466616Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker"}
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:2e7df674-f003-0015-0b64-69a41c000000\nTime:2019-09-12T12:22:30.6466616Z","module":"fast-azure-storage","name":"OperationTimedOutError","note":"Failed to ensure azure queue in queueservice.js","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:2e7df674-f003-0015-0b64-69a41c000000\nTime:2019-09-12T12:22:30.6466616Z</Message></Error>","provisionerId":"no-provisioner","queue":"q37121-g3okl6atj5fuope2kwappjoz-ewlmcjzxabjkgg5zgktac73r-5","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:2e7df674-f003-0015-0b64-69a41c000000\nTime:2019-09-12T12:22:30.6466616Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker"}
{"code":"OperationTimedOut","incidentId":"4f67db50-06ba-47d6-b158-751cfc538dbd","message":"Operation could not be completed within the specified time.\nRequestId:2e7df674-f003-0015-0b64-69a41c000000\nTime:2019-09-12T12:22:30.6466616Z","method":"createTask","module":"fast-azure-storage","name":"OperationTimedOutError","note":"Failed to ensure azure queue in queueservice.js","params":{"taskId":"0tVM1HF7T36Z9-kClsxEsQ"},"payload":{"created":"2019-09-12T12:22:16.925Z","deadline":"2019-09-15T12:22:16.926Z","dependencies":[],"expires":"2020-09-15T12:22:16.926Z","extra":{},"metadata":{"description":"Task created during unit tests","name":"Unit testing task","owner":"jonsafj@mozilla.com","source":"https://github.com/taskcluster/taskcluster-queue"},"payload":{},"priority":"lowest","provisionerId":"no-provisioner","requires":"all-completed","retries":5,"routes":[],"schedulerId":"my-scheduler","scopes":[],"tags":{"purpose":"taskcluster-testing"},"taskGroupId":"dSlITZ4yQgmvxxAi4A8fHQ","workerType":"test-worker"},"provisionerId":"no-provisioner","queue":"q37121-g3okl6atj5fuope2kwappjoz-ewlmcjzxabjkgg5zgktac73r-5","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:2e7df674-f003-0015-0b64-69a41c000000\nTime:2019-09-12T12:22:30.6466616Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker"}
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:082870ff-2003-00f2-1b64-69b411000000\nTime:2019-09-12T12:22:30.6644189Z","module":"fast-azure-storage","name":"OperationTimedOutError","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:082870ff-2003-00f2-1b64-69b411000000\nTime:2019-09-12T12:22:30.6644189Z</Message></Error>","provisionerId":"no-provisioner","queue":"q37121-g3okl6atj5fuope2kwappjoz-ewlmcjzxabjkgg5zgktac73r-4","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:082870ff-2003-00f2-1b64-69b411000000\nTime:2019-09-12T12:22:30.6644189Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker"}
      at Context.<anonymous> (/taskcluster/libraries/testing/src/with-monitor.js:22:13)
      at process.topLevelDomainCallback (domain.js:126:23)


I don't know why mocha stopped running tests after that failure -- we're not using --bail. Mysteries upon mysteries.
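For reference, bail is opt-in; a minimal sketch of where it would live if we had enabled it (hypothetical -- the repo's actual mocha config may differ):

  // .mocharc.js (illustrative only, not the repo's actual config)
  module.exports = {
    timeout: 30000,  // matches the 30000ms timeouts reported above
    bail: false,     // the default: keep running tests after a failure
  };

With bail left at its default, a single failing hook shouldn't stop the run, which is what makes the behavior above surprising.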

https://tools.taskcluster.net/groups/SRKHV0xfSwOELYyg8kTm7A/tasks/ZqTRKM4zSrWrtca4Nw4ILg/runs/0/logs/public%2Flogs%2Flive.log

  1) services/queue/test/canceltask_test.js (real)
       defineTask, cancelTask (idempotent):
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/queue/test/canceltask_test.js)
  

  2) "after each" hook for "createTask, cancelTask (idempotent)":
     Error: Errors logged to monitor during test run:
{"code":"OperationTimedOut","incidentId":"17d65757-e93d-45c5-9908-83470453bf75","message":"Operation could not be completed within the specified time.\nRequestId:845276c6-5003-00d4-3c64-692fa5000000\nTime:2019-09-12T12:21:41.4148877Z","method":"cancelTask","module":"fast-azure-storage","name":"OperationTimedOutError","params":{"taskId":"-UIWvp4ATm256TezMDa8Cg"},"payload":{},"retries":1,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:845276c6-5003-00d4-3c64-692fa5000000\nTime:2019-09-12T12:21:41.4148877Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1}
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:2e55c6a2-3003-00a9-1e64-69b36d000000\nTime:2019-09-12T12:21:49.3750895Z","module":"fast-azure-storage","name":"OperationTimedOutError","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:2e55c6a2-3003-00a9-1e64-69b36d000000\nTime:2019-09-12T12:21:49.3750895Z</Message></Error>","provisionerId":"no-provisioner-extended-extended","queue":"q66910-vofimivt2uifu3wkmxzjrhfm-dljkj3oal2nprldd4cf2aqrs-2","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:2e55c6a2-3003-00a9-1e64-69b36d000000\nTime:2019-09-12T12:21:49.3750895Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker-extended-extended"}
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:2e55c6a2-3003-00a9-1e64-69b36d000000\nTime:2019-09-12T12:21:49.3750895Z","module":"fast-azure-storage","name":"OperationTimedOutError","note":"Failed to ensure azure queue in queueservice.js","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:2e55c6a2-3003-00a9-1e64-69b36d000000\nTime:2019-09-12T12:21:49.3750895Z</Message></Error>","provisionerId":"no-provisioner-extended-extended","queue":"q66910-vofimivt2uifu3wkmxzjrhfm-dljkj3oal2nprldd4cf2aqrs-2","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:2e55c6a2-3003-00a9-1e64-69b36d000000\nTime:2019-09-12T12:21:49.3750895Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker-extended-extended"}
{"code":"OperationTimedOut","incidentId":"e4d67ff0-c658-48ca-a3c7-a03d9b96164d","message":"Operation could not be completed within the specified time.\nRequestId:2e55c6a2-3003-00a9-1e64-69b36d000000\nTime:2019-09-12T12:21:49.3750895Z","method":"createTask","module":"fast-azure-storage","name":"OperationTimedOutError","note":"Failed to ensure azure queue in queueservice.js","params":{"taskId":"pVQC0-XjSyKP5kxApWenEA"},"payload":{"created":"2019-09-12T12:19:27.594Z","deadline":"2019-09-15T12:19:27.594Z","dependencies":[],"expires":"2020-09-15T12:19:27.594Z","extra":{},"metadata":{"description":"Task created during unit tests","name":"Unit testing task","owner":"jonsafj@mozilla.com","source":"https://github.com/taskcluster/taskcluster-queue"},"payload":{},"priority":"lowest","provisionerId":"no-provisioner-extended-extended","requires":"all-completed","retries":5,"routes":[],"schedulerId":"my-scheduler-extended-extended","scopes":[],"tags":{"purpose":"taskcluster-testing"},"taskGroupId":"dSlITZ4yQgmvxxAi4A8fHQ","workerType":"test-worker-extended-extended"},"provisionerId":"no-provisioner-extended-extended","queue":"q66910-vofimivt2uifu3wkmxzjrhfm-dljkj3oal2nprldd4cf2aqrs-2","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:2e55c6a2-3003-00a9-1e64-69b36d000000\nTime:2019-09-12T12:21:49.3750895Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker-extended-extended"}
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:84e7d91d-4003-00e9-4a64-699a83000000\nTime:2019-09-12T12:21:49.4011341Z","module":"fast-azure-storage","name":"OperationTimedOutError","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:84e7d91d-4003-00e9-4a64-699a83000000\nTime:2019-09-12T12:21:49.4011341Z</Message></Error>","provisionerId":"no-provisioner-extended-extended","queue":"q66910-vofimivt2uifu3wkmxzjrhfm-dljkj3oal2nprldd4cf2aqrs-6","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:84e7d91d-4003-00e9-4a64-699a83000000\nTime:2019-09-12T12:21:49.4011341Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker-extended-extended"}
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:e6d2e46b-2003-007a-1c64-690cc8000000\nTime:2019-09-12T12:21:57.5851169Z","module":"fast-azure-storage","name":"OperationTimedOutError","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:e6d2e46b-2003-007a-1c64-690cc8000000\nTime:2019-09-12T12:21:57.5851169Z</Message></Error>","provisionerId":"no-provisioner-extended-extended","queue":"q66910-vofimivt2uifu3wkmxzjrhfm-dljkj3oal2nprldd4cf2aqrs-2","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:e6d2e46b-2003-007a-1c64-690cc8000000\nTime:2019-09-12T12:21:57.5851169Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker-extended-extended"}
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:e6d2e46b-2003-007a-1c64-690cc8000000\nTime:2019-09-12T12:21:57.5851169Z","module":"fast-azure-storage","name":"OperationTimedOutError","note":"Failed to ensure azure queue in queueservice.js","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:e6d2e46b-2003-007a-1c64-690cc8000000\nTime:2019-09-12T12:21:57.5851169Z</Message></Error>","provisionerId":"no-provisioner-extended-extended","queue":"q66910-vofimivt2uifu3wkmxzjrhfm-dljkj3oal2nprldd4cf2aqrs-2","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:e6d2e46b-2003-007a-1c64-690cc8000000\nTime:2019-09-12T12:21:57.5851169Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker-extended-extended"}
{"code":"OperationTimedOut","incidentId":"90284588-acb9-4a11-92fa-a9cad8bcf46b","message":"Operation could not be completed within the specified time.\nRequestId:e6d2e46b-2003-007a-1c64-690cc8000000\nTime:2019-09-12T12:21:57.5851169Z","method":"createTask","module":"fast-azure-storage","name":"OperationTimedOutError","note":"Failed to ensure azure queue in queueservice.js","params":{"taskId":"pVQC0-XjSyKP5kxApWenEA"},"payload":{"created":"2019-09-12T12:19:27.594Z","deadline":"2019-09-15T12:19:27.594Z","dependencies":[],"expires":"2020-09-15T12:19:27.594Z","extra":{},"metadata":{"description":"Task created during unit tests","name":"Unit testing task","owner":"jonsafj@mozilla.com","source":"https://github.com/taskcluster/taskcluster-queue"},"payload":{},"priority":"lowest","provisionerId":"no-provisioner-extended-extended","requires":"all-completed","retries":5,"routes":[],"schedulerId":"my-scheduler-extended-extended","scopes":[],"tags":{"purpose":"taskcluster-testing"},"taskGroupId":"dSlITZ4yQgmvxxAi4A8fHQ","workerType":"test-worker-extended-extended"},"provisionerId":"no-provisioner-extended-extended","queue":"q66910-vofimivt2uifu3wkmxzjrhfm-dljkj3oal2nprldd4cf2aqrs-2","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:e6d2e46b-2003-007a-1c64-690cc8000000\nTime:2019-09-12T12:21:57.5851169Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker-extended-extended"}
      at Context.<anonymous> (/taskcluster/libraries/testing/src/with-monitor.js:22:13)
      at process.topLevelDomainCallback (domain.js:126:23)


Similar to comment 22

https://tools.taskcluster.net/groups/ZqefjorETZeKgwKkqyS2Pw/tasks/Zd_0OJQVTBGDgGufQWZhTg/runs/0/logs/public%2Flogs%2Flive.log


  services/queue/test/artifact_test.js (real)
    ✓ Post Public S3 artifact (36924ms)
    1) "after each" hook for "Post Public S3 artifact"


  1 passing (42s)
  1 failing

  1) "after each" hook for "Post Public S3 artifact":
     Error: Errors logged to monitor during test run:
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:87a463c2-b003-00b3-1764-699c02000000\nTime:2019-09-12T12:23:14.1890006Z","module":"fast-azure-storage","name":"OperationTimedOutError","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:87a463c2-b003-00b3-1764-699c02000000\nTime:2019-09-12T12:23:14.1890006Z</Message></Error>","provisionerId":"no-provisioner","queue":"q67094-g3okl6atj5fuope2kwappjoz-ewlmcjzxabjkgg5zgktac73r-7","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:87a463c2-b003-00b3-1764-699c02000000\nTime:2019-09-12T12:23:14.1890006Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker"}
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:87a463c2-b003-00b3-1764-699c02000000\nTime:2019-09-12T12:23:14.1890006Z","module":"fast-azure-storage","name":"OperationTimedOutError","note":"Failed to ensure azure queue in queueservice.js","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:87a463c2-b003-00b3-1764-699c02000000\nTime:2019-09-12T12:23:14.1890006Z</Message></Error>","provisionerId":"no-provisioner","queue":"q67094-g3okl6atj5fuope2kwappjoz-ewlmcjzxabjkgg5zgktac73r-7","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:87a463c2-b003-00b3-1764-699c02000000\nTime:2019-09-12T12:23:14.1890006Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker"}
{"code":"OperationTimedOut","incidentId":"2214646c-9e7f-4bab-b543-4b3523bf3626","message":"Operation could not be completed within the specified time.\nRequestId:87a463c2-b003-00b3-1764-699c02000000\nTime:2019-09-12T12:23:14.1890006Z","method":"createTask","module":"fast-azure-storage","name":"OperationTimedOutError","note":"Failed to ensure azure queue in queueservice.js","params":{"taskId":"HVmvOwF3TyqDq0G2w2dKIw"},"payload":{"created":"2019-09-12T12:22:46.900Z","deadline":"2019-09-15T12:22:46.900Z","dependencies":[],"expires":"2020-09-15T12:22:46.900Z","extra":{},"metadata":{"description":"Task created during unit tests","name":"Unit testing task","owner":"jonsafj@mozilla.com","source":"https://github.com/taskcluster/taskcluster-queue"},"payload":{},"priority":"lowest","provisionerId":"no-provisioner","requires":"all-completed","retries":5,"routes":[],"schedulerId":"my-scheduler","scopes":[],"tags":{"purpose":"taskcluster-testing"},"taskGroupId":"dSlITZ4yQgmvxxAi4A8fHQ","workerType":"test-worker"},"provisionerId":"no-provisioner","queue":"q67094-g3okl6atj5fuope2kwappjoz-ewlmcjzxabjkgg5zgktac73r-7","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:87a463c2-b003-00b3-1764-699c02000000\nTime:2019-09-12T12:23:14.1890006Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker"}
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:79b389a8-5003-0135-5664-698e85000000\nTime:2019-09-12T12:23:14.7239282Z","module":"fast-azure-storage","name":"OperationTimedOutError","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:79b389a8-5003-0135-5664-698e85000000\nTime:2019-09-12T12:23:14.7239282Z</Message></Error>","provisionerId":"no-provisioner","queue":"q67094-g3okl6atj5fuope2kwappjoz-ewlmcjzxabjkgg5zgktac73r-1","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:79b389a8-5003-0135-5664-698e85000000\nTime:2019-09-12T12:23:14.7239282Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker"}
{"code":"OperationTimedOut","message":"Operation could not be completed within the specified time.\nRequestId:234ad7d6-1003-003d-2b64-69d3a3000000\nTime:2019-09-12T12:23:14.7690700Z","module":"fast-azure-storage","name":"OperationTimedOutError","payload":"<?xml version=\"1.0\" encoding=\"utf-8\"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.\nRequestId:234ad7d6-1003-003d-2b64-69d3a3000000\nTime:2019-09-12T12:23:14.7690700Z</Message></Error>","provisionerId":"no-provisioner","queue":"q67094-g3okl6atj5fuope2kwappjoz-ewlmcjzxabjkgg5zgktac73r-6","retries":0,"stack":"OperationTimedOutError: Operation could not be completed within the specified time.\nRequestId:234ad7d6-1003-003d-2b64-69d3a3000000\nTime:2019-09-12T12:23:14.7690700Z\n    at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27\n    at process._tickCallback (internal/process/next_tick.js:68:7)","statusCode":500,"v":1,"workerType":"test-worker"}
      at Context.<anonymous> (/taskcluster/libraries/testing/src/with-monitor.js:22:13)
      at process.topLevelDomainCallback (domain.js:126:23)


WAT

https://taskcluster-artifacts.net/Zd_0OJQVTBGDgGufQWZhTg/1/public/logs/live_backing.log

  1) services/queue/test/queueservice_test.js (real)
       deleteUnusedWorkerQueues (can delete queues):
     QueueBeingDeletedError: The specified queue is being deleted.
RequestId:e31aa8b4-7003-004b-3a66-69571f000000
Time:2019-09-12T12:34:38.6675581Z
      at /taskcluster/node_modules/fast-azure-storage/lib/queue.js:320:27
      at process._tickCallback (internal/process/next_tick.js:68:7)


https://tools.taskcluster.net/groups/U8wTCDtIRNi6BNv8LVuOVQ/tasks/ASsB151GSQKa7FuIMV9gTQ/runs/0/logs/public%2Flogs%2Flive.log

  1) services/queue/test/taskgroup_test.js (real)
       "before all" hook: withEntity for Artifact for "Create two tasks with same taskGroupId":
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/queue/test/taskgroup_test.js)

I tried bisecting, but there didn't look to be much of a pattern. I'm running 30 test runs of each of the merge commits from the last few days:

So, something has changed at Azure. This problem started in earnest around Sept 6, right about the same time that the auth tests started timing out because there were too many containers, and I added a "test:cleanup" test to delete containers.
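For context, that cleanup is roughly of this shape -- a minimal sketch against fast-azure-storage's Blob client; the method names and response fields here are from memory and should be treated as assumptions rather than the actual test:cleanup code:

  const azure = require('fast-azure-storage');

  // Delete all blob containers whose names start with a given test prefix.
  async function deleteTestContainers({accountId, accessKey, prefix}) {
    const blob = new azure.Blob({accountId, accessKey});
    let marker;
    do {
      // listContainers / deleteContainer are assumed to mirror the Azure
      // REST operations of the same names; check the library docs before use.
      const {containers, nextMarker} = await blob.listContainers({prefix, marker});
      for (const {name} of containers) {
        await blob.deleteContainer(name);
      }
      marker = nextMarker;
    } while (marker);
  }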

Flipping through the various metrics in the Azure portal doesn't show anything that changed dramatically during that time. I see elevated numbers over that weekend, but that's because I was running a bunch of tests at that time. And I see a huge spike from today, but that's from comment 27 :)

Depends on: 1580970

OK, let's see if a new (non-classic) Azure account works better.

If so, I don't know if that makes me happy or sad, as it means that someday our production Azure will fall apart, right?

Well, https://tools.taskcluster.net/groups/YGK7A6KqTL6Yiv8OITeq2w/tasks/DXOR8HegTd28po0GZzXBgA/runs/0/logs/public%2Flogs%2Flive_backing.log was on bug 1580970's merge to master.

  1) services/queue/test/createtask_test.js (real)
       "before all" hook: withEntity for Artifact for "createTask":
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/queue/test/createtask_test.js)

and that's still with a bunch of queue tests skipped.

So I think we can conclude that something changed at Azure that's not related to a particular storage account.

It doesn't look like anything changed with github-worker at that time:
https://github.com/taskcluster/mozilla-history/commits/master/AWSWorkerTypes/github-worker
(that would have been a possibility, even though I've reproduced this "locally" -- "local" for me is in EC2 on Ubuntu)

https://tools.taskcluster.net/groups/addgdtyWSNyoLp5XGk6fyw/tasks/VcYd_ECdR6SKvTCZ91MzCg/runs/0/logs/public%2Flogs%2Flive_backing.log

  1) services/worker-manager/test/provider_google_test.js (real)
       "before each" hook: withEntity for WorkerPoolError for "provisioning loop":
     Error: Timeout of 30000ms exceeded. For async tests and hooks, ensure "done()" is called; if returning a Promise, ensure it resolves. (/taskcluster/services/worker-manager/test/provider_google_test.js)
Summary: Queue tests timing out → Azure-related tests timing out

fast-azure-storage does have a custom https.Agent subclass which overrides private methods. That could explain some of the weird behavior of the Node http library seen above. It looks like it's built for an older version of Node.
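To illustrate the failure mode, here is a hypothetical sketch (not fast-azure-storage's actual code) of the kind of subclass in question. Internal Agent methods such as addRequest() are undocumented, and their signatures and socket-pooling behavior have shifted between Node major versions, so an override written against an older Node can quietly mis-handle sockets on a newer runtime:

  const https = require('https');

  class KeepAliveAgent extends https.Agent {
    // Overriding an internal, undocumented method: this is where version
    // skew bites, because the arguments and the pooling logic around this
    // call have changed across Node releases.
    addRequest(req, options) {
      // ...custom socket bookkeeping would go here...
      return super.addRequest(req, options);
    }
  }

That would fit the symptoms above: requests that neither complete nor error, just hang until mocha's 30s timeout fires.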

I think the choices here are:

  • Rewrite our entire Azure interface to use the newer Azure SDKs instead of fast-azure-storage
  • Rewrite everything to use Postgres

I'm going to continue to hold my breath and hope nothing gets worse with Azure.

https://github.com/taskcluster/taskcluster/pull/1994 fixes #1393 to actually skip things.
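The mechanism is roughly this (a hypothetical sketch -- see the PR for the real helper; haveRealAzureCredentials is made up here):

  // In a suite's setup, skip the whole suite when the backing service
  // isn't usable, instead of letting every test hit the 30s timeout.
  suiteSetup(function() {
    if (!haveRealAzureCredentials()) {   // hypothetical helper
      this.skip();                       // mocha marks the suite's tests as pending
    }
  });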

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX

I would argue we FIXED this ;)
