Closed Bug 1637982 Opened 5 years ago Closed 5 years ago

queue.reportCompleted allows workerGroup in its scope expression, but that's not in the input

Categories

(Taskcluster :: Services, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

https://sentry.prod.mozaws.net/operations/taskcluster-community/issues/8431806/activity/?referrer=alert_email

Scope expression template expected parameter 'workerGroup' to be a string or number

Those are required for "legacy" scopes. I'll dig a bit in the history of this API method.

Even looking back to before scope expressions landed in 2018:

/** Report task completed */
api.declare({
  method:     'post',
  route:      '/task/:taskId/runs/:runId/completed',
  name:       'reportCompleted',
  stability:  API.stability.stable,
  scopes: [
    [    
      // Legacy
      'queue:resolve-task',
      'assume:worker-id:<workerGroup>/<workerId>',                                                                                                                                                                                                                                                                                                                                             
    ], [ 
      'queue:resolve-task:<taskId>/<runId>',
    ],   
  ],
  deferAuth:  true,
  input:      undefined,  // No input at this point
  output:     'task-status-response.json#',
  title:      'Report Run Completed',
  description: [
    'Report a task completed, resolving the run as `completed`.',
  ].join('\n'),
}, function(req, res) {
  var taskId = req.params.taskId;
  var runId  = parseInt(req.params.runId, 10); 
  // Backwards compatibility with very old workers, should be dropped in the
  // future
  var target = req.body.success === false ? 'failed' : 'completed';

  return resolveTask.call(this, req, res, taskId, runId, target);
});

that's still referencing parameters that aren't in the request. In fact, I can trace this all the way back to the first days of the queue. The legacy scopes were made "legacy" in 2015.

It looks like the scope-expression stuff only validates that parameters exist in the case where that branch is evaluated -- which makes sense for efficiency reasons. So I think this is just an error that occurs when a reportCompleted call is made with the wrong scopes?

Nope:

rubin ~ $ curl -s -d '' https://community-tc.services.mozilla.com/api/queue/v1/task/ajzXn8BTQgGE_BJVwPX2og/runs/0/completed | jq -r .message
This request requires Taskcluster credentials that satisfy the following scope expression:

```
{
  "AnyOf": [
    "queue:resolve-task:ajzXn8BTQgGE_BJVwPX2og/0",
    {
      "AllOf": [
        "queue:resolve-task",
        "assume:worker-id:random-local-worker/dummy-worker-GYbKKpHVS"
      ]
    }
  ]
}
```

---

* method:     reportCompleted
* errorCode:  InsufficientScopes
* statusCode: 403
* time:       2020-05-14T15:02:00.906Z

that workerId / workerGroup is the fake worker that has claimed the task in the docker-worker integration tests.

Oh, I am totally missing the resolveTask function which calls req.authorize:

let resolveTask = async function(req, res, taskId, runId, target) {
  assert(target === 'completed' ||
         target === 'failed', 'Expected a valid target');

  // Load Task entity
  let task = await this.Task.load({taskId}, true);

  // Handle cases where the task doesn't exist
  if (!task) {
    return res.reportError('ResourceNotFound',
      'Task `{{taskId}}` not found. Are you sure it was created?', {
        taskId,
      },
    );
  }

  // Handle cases where the run doesn't exist
  let run = task.runs[runId];
  if (!run) {
    return res.reportError('ResourceNotFound',
      'Run {{runId}} not found on task `{{taskId}}`.', {
        taskId,
        runId,
      },
    );
  }

  await req.authorize({
    taskId,
    runId,
    workerGroup: run.workerGroup,
    workerId: run.workerId,
  });
...

So this error occurs when somehow there's a run with an undefined workerGroup. But my curl above was with the same taskId / runId, and looked it up just fine.

I created a new task with a nonexistent worker pool, so its run 0 is pending.

rubin ~ $ curl -s -d '' https://community-tc.services.mozilla.com/api/queue/v1/task/WkClZ1CTQRye5dCGSGCJPA/runs/0/completed | jq -r .message
Internal Server Error, incidentId c6f159ea-5e6e-4a74-879a-18019e924c49.

---

* method:     reportCompleted
* errorCode:  InternalServerError
* statusCode: 500
* time:       2020-05-14T15:16:18.649Z

and another instance of the error in sentry. So the issue is trying to complete a run that's not running.

This bug has been around for a long time, and isn't a big deal: it's failing safe as designed. But it's also a pretty easy fix.

Assignee: nobody → dustin
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.