Closed Bug 1481178 Opened 6 years ago Closed 6 years ago

Increased rate of HTTP request timeouts in taskcluster queue

Categories

(Taskcluster :: Operations and Service Requests, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: pmoore, Assigned: bstack)

References

Details

Attachments

(6 files)

TYPO: *lot* not *log*
Hey guys, any idea what might be up here?
Flags: needinfo?(jhford)
Flags: needinfo?(bstack)
Screenshot from Heroku showing request timeout alerts
$ papertrail -s taskcluster-queue --min-time '1 hour ago' 'Request timeout' | wc -l
    2637

which works out to about 44/min at the current rate (2637 timeouts over the past hour). The Heroku front end seems to be inaccurate, since it suggests a much lower rate of failure.
I suspect that something is getting stuck in the claim-work api handler. There's no useful debugging output right now, so I've opened https://github.com/taskcluster/taskcluster-queue/pull/286 in order to add some temporary debugging information to the logs.

This PR removes a single Promise.all usage to make logging easier. It also adds a bunch of logging that will tell us how far the API handler got when it stops making progress. This information is printed after 25 seconds of waiting: since the endpoint is designed to short-circuit at 20 seconds, any invocation that is still running after 25 seconds is likely one that is going to time out.
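
For reference, a minimal sketch of the kind of 25-second watchdog described here. This is an illustration only, not the actual code in the PR; the handler name, the checkpoint labels, and the res.reply call are assumptions:

    // Watchdog that reports how far the handler got if it runs too long.
    // Names (claimWorkHandler, checkpoint, res.reply) are illustrative only.
    const WATCHDOG_MS = 25 * 1000; // endpoint short-circuits at 20s, so 25s means trouble

    async function claimWorkHandler(req, res) {
      let lastCheckpoint = 'start';
      const checkpoint = label => { lastCheckpoint = label; };

      // If we are still running after 25s, log the last checkpoint we reached.
      const watchdog = setTimeout(() => {
        console.log(`claim-work still running after ${WATCHDOG_MS}ms; last checkpoint: ${lastCheckpoint}`);
      }, WATCHDOG_MS);

      try {
        checkpoint('loading worker');
        // ... await Azure table reads ...
        checkpoint('polling azure queues');
        // ... await queue polling, task claiming, etc. ...
        checkpoint('done');
        res.reply({tasks: []});
      } finally {
        clearTimeout(watchdog); // requests that finish in time never log
      }
    }
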
Flags: needinfo?(jhford)
Running the unit tests, I'm seeing a bunch of failures that look like they might be pulse-related.
Assignee: nobody → bstack
Status: NEW → ASSIGNED
Flags: needinfo?(bstack)
Seems like these really picked up starting midday (Pacific time) last Friday.
Looking in the last hour I'm seeing H12 on the following:


     54 PUT  /v1/task
     78 GET  /v1/task
    499 POST /v1/claim-work
   1760 POST /v1/task
Looking in more detail, the majority of these are POSTs to the createArtifact endpoint.
Attached image NZ_hwV4q.png.part
Not seeing anything unusual in Azure at first glance. A lot of the endpoints that are timing out don't have anything to do with pulse, though, afaict, so all signs point to Azure at the moment.
Going from that dashboard, the methods that are having a hard time are...


1. createTask
2. claimWork
3. reclaimTask
4. reportCompleted
5. createArtifact
6. getLatestArtifact

I think the most common attribute of all of these is that they are called a heck of a lot more often than the other methods. I'm bumping the number of dynos for a bit to see what happens.
That does not appear to have helped at all.
This is a pretty good smoking gun for blaming this on Azure. Now the question is:

Did we make this happen or are they doing this to us?

I am really, really hoping this is not the 20,000-requests-per-second limit Jonas was always mumbling about.
I feel like we could get into this sort of situation if we ended up with a lot of MERGE retries, given the "optimistic concurrency" stuff that we do in azure-entities (a sketch of that pattern is below). However, I just grepped around the logs a bit and only found a small handful of retried merges, so off the top of my head I don't think that's the issue here.

There were 32 UpdateConditionNotSatisfied errors in the logs in that time period but 0 MAX_MODIFY_ATTEMPTS issues. I will try consuming a larger portion of the logs.
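
For context, a minimal sketch of the optimistic-concurrency modify loop being described. It illustrates the pattern rather than the actual azure-entities implementation; the table helpers and the attempt limit are assumptions, with the error names taken from the log messages above:

    // Illustration of the optimistic-concurrency modify pattern, not the real azure-entities code.
    const MAX_MODIFY_ATTEMPTS = 10; // assumed limit; the real value lives in azure-entities

    async function modifyEntity(table, partitionKey, rowKey, modifier) {
      for (let attempt = 0; attempt < MAX_MODIFY_ATTEMPTS; attempt++) {
        // Load the current row, remembering its ETag.
        const entity = await table.getEntity(partitionKey, rowKey);
        const modified = modifier({...entity});

        try {
          // MERGE with If-Match: only succeeds if nobody else wrote in between.
          return await table.mergeEntity(modified, {eTag: entity.eTag});
        } catch (err) {
          // Another writer won the race; retry with a fresh read.
          if (err.code !== 'UpdateConditionNotSatisfied') {
            throw err;
          }
        }
      }
      throw new Error('MAX_MODIFY_ATTEMPTS exhausted, check for congestion');
    }
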
I'm just hunting for what could cause a performance cliff in azure tables at the moment. Nothing much more to share.
Thanks for the deep dive Brian!

I think we should open a support case with Microsoft Azure. Do you know how we do that?
We discussed in the team meeting today.

Brian is going to see if he can introduce some data capture in our azure entities library, and next week when coop is back we can look into opening a support case with azure.
See Also: → 1481493
Just received https://sentry.prod.mozaws.net/operations/taskcluster-queue/issues/4534255/ ("MAX_MODIFY_ATTEMPTS exhausted, check for congestion") as well. That could be a contributing factor.
We've reopened the trees because the timeouts have gone away. We're guessing that this is just because the volume of requests dropped off while the trees were closed. I have deployed https://github.com/taskcluster/taskcluster-queue/pull/287 and set the logging params to capture 100% of requests for a while. Please feel free to deploy master again if you want, or tweak the knobs that control the logging.
This seems to have resolved itself for no obvious reason. Please reopen if it causes more failures.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Taskcluster timeouts have reappeared - https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f4bbda249beed3143907a74c0683c87c9a9a692b&group_state=expanded&selectedJob=193582086

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=193582086&repo=mozilla-inbound&lineNumber=789

Task details: https://tools.taskcluster.net/groups/VY2L5IAOSReeifmTiR22OA/tasks/bbE0OIBsS0mgvHcAVY4i-Q/details

[task 2018-08-13T04:38:29.416Z] 04:38:29     INFO -  The error occurred in code that was called by the mach command. This is either
[task 2018-08-13T04:38:29.416Z] 04:38:29     INFO -  a bug in the called code itself or in the way that mach is calling it.
[task 2018-08-13T04:38:29.416Z] 04:38:29     INFO -  You should consider filing a bug for this issue.
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -  If filing a bug, please include the full output of mach, including this error
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -  message.
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -  The details of the failure are as follows:
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -  HTTPError: 500 Server Error: Internal Server Error for url: https://queue.taskcluster.net/v1/task/A17yqz7qRKuodUonoSnJAA/artifacts/public/chainOfTrust.json.asc
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/mach_commands.py", line 1462, in artifact_toolchain
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -      record = ArtifactRecord(task_id, name)
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/mach_commands.py", line 1356, in __init__
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -      cot.raise_for_status()
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -    File "/builds/worker/workspace/build/src/third_party/python/requests/requests/models.py", line 840, in raise_for_status
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -      raise HTTPError(http_error_msg, response=self)
[task 2018-08-13T04:38:29.433Z] 04:38:29    ERROR - Return code: 1
[task 2018-08-13T04:38:29.434Z] 04:38:29    ERROR - 1 not in success codes: [0]
[task 2018-08-13T04:38:29.434Z] 04:38:29  WARNING - setting return code to 2
[task 2018-08-13T04:38:29.434Z] 04:38:29    FATAL - Halting on failure while running ['/usr/bin/python2.7', '-u', '/builds/worker/workspace/build/src/mach', 'artifact', 'toolchain', '-v', '--retry', '4', '--artifact-manifest', '/builds/worker/workspace/build/src/toolchains.json', '--cache-dir', '/builds/worker/tooltool-cache', 'public/build/clang.tar.xz@BIUGWFsNTceXUoxyV_leVQ', 'public/build/gcc.tar.xz@BYmAixHnT-mG2qmATEonbw', 'public/build/rustc.tar.xz@QW0bdcrhSEewusHFyX1iWw', 'public/build/rust-size.tar.xz@FKi9VKRmSZmV7EYJp_VitQ', 'public/build/sccache2.tar.xz@ZhHlZDo7QBi22K6-ruY5Rg', 'public/build/node.tar.xz@A17yqz7qRKuodUonoSnJAA']
[task 2018-08-13T04:38:29.434Z] 04:38:29    FATAL - Running post_fatal callback...
[task 2018-08-13T04:38:29.434Z] 04:38:29    FATAL - Exiting 2
[task 2018-08-13T04:38:29.434Z] 04:38:29     INFO - [mozharness: 2018-08-13 04:38:29.434486Z] Finished build step (failed)
[task 2018-08-13T04:38:29.434Z] 04:38:29     INFO - Running post-run listener: _summarize
[task 2018-08-13T04:38:29.434Z] 04:38:29    ERROR - # TBPL FAILURE #
[task 2018-08-13T04:38:29.434Z] 04:38:29     INFO - [mozharness: 2018-08-13 04:38:29.434717Z] FxDesktopBuild summary:
[task 2018-08-13T04:38:29.434Z] 04:38:29    ERROR - # TBPL FAILURE #
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Interesting datapoint: in a log span from 09:10:04 to 09:24:19, Azure timings of 10 seconds or more only seem to occur in bursts lasting less than a second. They are distributed across different operations and tables with no discernible pattern. I believe the presence of insertEntity and getEntity in this list helps rule out excessive modify retries as the culprit. Showing my work below, followed by a small parsing sketch:

taskcluster-queue|bug-1481493⚡ ⇒ papertrail -s taskcluster-queue --min-time '6 hours ago' > ./tttt      


taskcluster-queue|bug-1481493⚡ ⇒ rg "TIMING.*[0-9]{5,}\." ./tttt | cut -d " " -f 3 | uniq -c
     65 09:13:44
     52 09:14:01
     29 09:15:10
      1 09:17:15

taskcluster-queue|bug-1481493⚡ ⇒ rg ".*TIMING: ([a-zA-Z]+) on ([a-zA-Z]+) took ([0-9]{5,}\.[0-9]+)" -oN ./tttt --replace '$1 $2' | sort | uniq -c
     29 deleteEntity QueueTaskRequirement
     32 getEntity QueueArtifacts
      1 getEntity QueueTasks
     52 getEntity QueueWorker
     31 insertEntity QueueArtifacts
      2 queryEntities QueueArtifacts
     23 updateEntity QueueTaskDependency
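
A rough Node equivalent of the rg/cut/uniq pipeline above, in case anyone wants to iterate on the grouping locally. This is a sketch only; the TIMING line format is inferred from the regex used above:

    // Group slow (>= 10s) TIMING log lines by operation and table.
    const fs = require('fs');

    const lines = fs.readFileSync('./tttt', 'utf8').split('\n');
    const pattern = /TIMING: ([a-zA-Z]+) on ([a-zA-Z]+) took ([0-9]{5,}\.[0-9]+)/;

    const counts = new Map();
    for (const line of lines) {
      const m = pattern.exec(line);
      if (!m) continue;
      const key = `${m[1]} ${m[2]}`;          // e.g. "getEntity QueueArtifacts"
      counts.set(key, (counts.get(key) || 0) + 1);
    }

    for (const [key, count] of [...counts].sort()) {
      console.log(`${String(count).padStart(7)} ${key}`);
    }
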
We've created an Azure support request (id: 118081318779906). We will update this bug further when we know more.
Comment on attachment 9002108 [details]
Bug 1481178: Retry downloading `chainOfTrust.json.asc` in `mach artifact toolchain`; r?gps

Gregory Szorc [:gps] has approved the revision.
Attachment #9002108 - Flags: review+
Keywords: leave-open
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/mozilla-central/rev/3ef1c0555a29
Retry downloading `chainOfTrust.json.asc` in `mach artifact toolchain`; r=gps a=tomprince
Long ago, we noted during some other investigations that performing a DELETE operation on a table would cause subsequent reads to pause for 5 seconds.  I *think* we found this when trying to delete a few million "never expires" tasks from the early days of Taskcluster.  After a little searching, I can't find the bug though.  At any rate, it may be worth looking for correlations between the various services deleting things from tables and these long Azure requests.
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #35)
> Long ago, we noted during some other investigations that performing a DELETE
> operation on a table would cause subsequent reads to pause for 5 seconds.  I
> *think* we found this when trying to delete a few million "never expires"
> tasks from the early days of Taskcluster.  After a little searching, I can't
> find the bug though.  At any rate, it may be worth looking for correlations
> between the various services deleting things from tables and these long
> Azure requests.

Nice thinking! ++

Also, I got an email about recent Azure updates - looks like they have been pretty busy...
  https://azure.microsoft.com/en-us/updates/
So, I tried pulling the `retry` function out of https://github.com/taskcluster/fast-azure-storage/blob/609197d1f5ab0e44fe2da91ca21ff68bce2c3e7a/lib/utils.js and testing it out.  I'm passing the options that azure-entities passes.  Notably, this does not include randomizationFactor.

// Note: `retry` here is the function copy-pasted from fast-azure-storage's
// lib/utils.js (linked above); it is assumed to be defined above this snippet.

const test = async () => {
  const failures = 3;
  const start = new Date().getTime();

  const log = msg => {
    console.log(`${new Date().getTime() - start} ${msg}`);
  };

  // Fails with a transient ServerBusy error until `failures` attempts have been made.
  const f = async retry => {
    log(`start retry ${retry}`);
    await new Promise(resolve => setTimeout(resolve, 100));
    if (retry < failures) {
      log(`failing retry ${retry}`);
      const e = new Error();
      e.code = 'ServerBusy';
      throw e;
    }
    log(`succeeded retry ${retry}`);
  };

  // These are the options azure-entities passes; note that randomizationFactor is absent.
  await retry(f, {
    version: '2014-02-14',
    dataServiceVersion: '3.0',
    clientId: 'fast-azure-storage',
    timeout: 7000,
    clientTimeoutDelay: 500,
    metadata: 'fullmetadata',
    retries: 5,
    delayFactor: 100,
    maxDelay: 30000,
    transientErrorCodes: [
      'InternalErrorWithoutCode',
      'InternalError',
      'ServerBusy',
      'ETIMEDOUT',
      'ECONNRESET',
      'EADDRINUSE',
      'ESOCKETTIMEDOUT',
      'ECONNREFUSED',
      'RequestTimeoutError',
      'RequestAbortedError',
      'RequestContentLengthError',
    ],
    accountId: 'jungle',
    accessKey: undefined,
    minSASAuthExpiry: 900000,
  });
};

test();

The result is

0 start retry 0
104 failing retry 0
106 start retry 1
206 failing retry 1
207 start retry 2
308 failing retry 2
309 start retry 3
410 succeeded retry 3

Printing `delay` shows that it is always NaN, which setTimeout appears to treat as 0, so there's no delay between requests.

My hypothesis was that this retry method was always waiting 30 seconds on error, which would neatly translate any error from Azure into a 30-second timeout.  But that does not appear to be the case.  So https://github.com/taskcluster/fast-azure-storage/pull/32 fixes the bug with retry, but I don't think it fixes the issue we're looking at.
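
For what it's worth, here is a sketch of how the NaN delay mentioned above can arise when randomizationFactor is left out of the options. This is a guess at the general shape of an exponential-backoff-with-jitter calculation, not the actual fast-azure-storage code (the real fix is in PR 32 above):

    // Sketch of a typical exponential-backoff delay calculation.
    // If randomizationFactor is undefined, the arithmetic yields NaN,
    // and setTimeout(fn, NaN) behaves like setTimeout(fn, 0): no delay at all.
    const delayForRetry = (retry, options) => {
      const base = Math.min(options.delayFactor * Math.pow(2, retry), options.maxDelay);
      // With options.randomizationFactor === undefined:
      //   undefined * anything === NaN, and NaN propagates through the rest.
      const jitter = 1 + options.randomizationFactor * (Math.random() * 2 - 1);
      return base * jitter;
    };

    // Example: the options from the test above have no randomizationFactor.
    console.log(delayForRetry(2, {delayFactor: 100, maxDelay: 30000}));                              // NaN
    console.log(delayForRetry(2, {delayFactor: 100, maxDelay: 30000, randomizationFactor: 0.25}));   // ~400 +/- 100
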
I tried a quick hack to see if I could reproduce what I described in comment 35, and I could not.  Maybe I'm making things up!
Well, things seem to have fixed themselves for no apparent reason. We're still keeping an eye on it, but I am bumping down the severity for now.
Severity: blocker → major
I'm not certain it's this bug, but we're seeing some failures downloading CoT artifacts in tasks like https://tools.taskcluster.net/groups/SjAqqSqdQHmft4VBcfbZJw/tasks/YdSIz8cpSISbdEBE9nNTvQ/runs/0

From the scriptworker logs:

Sep 11 06:04:27 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker 2018-09-11T13:04:27    DEBUG - waiting 300 seconds before reclaiming...
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker 2018-09-11T13:04:32    ERROR - SCRIPTWORKER_UNEXPECTED_EXCEPTION task 
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker Traceback (most recent call last):
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/worker.py", line 50, in do_run_task
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     await verify_chain_of_trust(chain)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/cot/verify.py", line 1813, in verify_chain_of_trust
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     await download_cot_artifacts(chain)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/cot/verify.py", line 705, in download_cot_artifacts
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     mandatory_artifacts_paths = await raise_future_exceptions(mandatory_artifact_tasks)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 326, in raise_future_exceptions
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     succeeded_results, _ = await _process_future_exceptions(tasks, raise_at_first_error=True)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 360, in _process_future_exceptions
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     raise exc
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/cot/verify.py", line 660, in download_cot_artifact
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     chain.context, [url], parent_dir=link.cot_dir, valid_artifact_task_ids=[task_id]
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/artifacts.py", line 319, in download_artifacts
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     await raise_future_exceptions(tasks)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 326, in raise_future_exceptions
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     succeeded_results, _ = await _process_future_exceptions(tasks, raise_at_first_error=True)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 360, in _process_future_exceptions
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     raise exc
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 260, in retry_async
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     return await func(*args, **kwargs)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 495, in download_file
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     chunk = await resp.content.read(chunk_size)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/aiohttp/streams.py", line 329, in read
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     await self._wait('read')
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/aiohttp/streams.py", line 260, in _wait
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     await waiter
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/aiohttp/helpers.py", line 673, in __exit__
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     raise asyncio.TimeoutError from None
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker concurrent.futures._base.TimeoutError
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker 2018-09-11T13:04:32 CRITICAL - Fatal exception
Hey Aki, in Ben's absence, do you know if this is still a problem?

Thanks!
Flags: needinfo?(aki)
I see 10 asyncio.TimeoutErrors total from scriptworkers in the past week. Digging deeper into a few, it looks like these happened during artifact upload. So there is still an issue, but not at a super high frequency.
Flags: needinfo?(aki)
How about now, aki? I can dig deeper on this soon if it is still a problem.
Flags: needinfo?(aki)
I see 12 `asyncio.TimeoutError`s total since Oct 10. The various intermittent bot comments above seem to be in the high 20s or mid-30s per week, so it looks like we're still doing better than before.
Flags: needinfo?(aki)
Ok, closing this for now. We can reopen if it causes issues again.
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Component: Operations → Operations and Service Requests