Closed Bug 1481178 Opened 6 years ago Closed 6 years ago

Increased rate of HTTP request timeouts in taskcluster queue

Categories

(Taskcluster :: Operations and Service Requests, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: pmoore, Assigned: bstack)

References

Details

Attachments

(6 files)

TYPO: *lot* not *log*
Hey guys, any idea what might be up here?
Flags: needinfo?(jhford)
Flags: needinfo?(bstack)
Screenshot from Heroku showing request timeout alerts
$ papertrail -s taskcluster-queue --min-time '1 hour ago' 'Request timeout' | wc -l
    2637

which works out to about 44/min at the current rate (2637 timeouts over the past hour). The Heroku front end seems to be inaccurate, since it suggests a much lower rate of failure.
I suspect that something is getting stuck in the claim-work api handler. There's no useful debugging output right now, so I've opened https://github.com/taskcluster/taskcluster-queue/pull/286 in order to add some temporary debugging information to the logs.

This PR removes a single Promise.all usage to make logging easier. It also adds a bunch of logging that will tell us how far the API handler got when it stops making progress. This information is printed after 25 seconds of waiting: since the endpoint is designed to short-circuit at 20 seconds, any invocation that is still running after 25 seconds is likely one that is going to time out.
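
For reference, a minimal sketch of the kind of 25-second watchdog described here. This is an illustration only, not the actual code in the PR; the handler name, the checkpoint labels, and the res.reply call are assumptions:

    // Watchdog that reports how far the handler got if it runs too long.
    // Names (claimWorkHandler, checkpoint, res.reply) are illustrative only.
    const WATCHDOG_MS = 25 * 1000; // endpoint short-circuits at 20s, so 25s means trouble

    async function claimWorkHandler(req, res) {
      let lastCheckpoint = 'start';
      const checkpoint = label => { lastCheckpoint = label; };

      // If we are still running after 25s, log the last checkpoint we reached.
      const watchdog = setTimeout(() => {
        console.log(`claim-work still running after ${WATCHDOG_MS}ms; last checkpoint: ${lastCheckpoint}`);
      }, WATCHDOG_MS);

      try {
        checkpoint('loading worker');
        // ... await Azure table reads ...
        checkpoint('polling azure queues');
        // ... await queue polling, task claiming, etc. ...
        checkpoint('done');
        res.reply({tasks: []});
      } finally {
        clearTimeout(watchdog); // requests that finish in time never log
      }
    }
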
Flags: needinfo?(jhford)
Running the unit tests, I'm seeing a bunch of failures that look like they might be pulse-related.
Assignee: nobody → bstack
Status: NEW → ASSIGNED
Flags: needinfo?(bstack)
Seems like these really picked up starting midday (Pacific time) last Friday.
Looking in the last hour I'm seeing H12 on the following:


     54 PUT  /v1/task
     78 GET  /v1/task
    499 POST /v1/claim-work
   1760 POST /v1/task
Looking in more detail, the majority of these are POSTs to the createArtifact endpoint.
Attached image NZ_hwV4q.png.part
Not seeing anything unusual in Azure at first glance. A lot of the endpoints that are timing out don't have anything to do with pulse, though, afaict, so all signs point to Azure at the moment.
Going from that dashboard, the methods that are having a hard time are...


1. createTask
2. claimWork
3. reclaimTask
4. reportCompleted
5. createArtifact
6. getLatestArtifact

I think the most common attribute of all of these is that they are called a heck of a lot more often than the other methods. I'm bumping the number of dynos for a bit to see what happens.
That does not appear to have helped at all.
This is a pretty good smoking gun for blaming this on Azure. Now the question is:

Did we make this happen or are they doing this to us?

I am really, really hoping this is not the 20,000-requests-per-second limit Jonas was always mumbling about.
I feel like we could get into this sort of situation if we ended up with a lot of MERGE retries, given the "optimistic concurrency" stuff that we do in azure-entities (a sketch of that pattern is below). However, I just grepped around the logs a bit and only found a small handful of retried merges, so off the top of my head I don't think that's the issue here.

There were 32 UpdateConditionNotSatisfied errors in the logs in that time period but 0 MAX_MODIFY_ATTEMPTS issues. I will try consuming a larger portion of the logs.
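
For context, a minimal sketch of the optimistic-concurrency modify loop being described. It illustrates the pattern rather than the actual azure-entities implementation; the table helpers and the attempt limit are assumptions, with the error names taken from the log messages above:

    // Illustration of the optimistic-concurrency modify pattern, not the real azure-entities code.
    const MAX_MODIFY_ATTEMPTS = 10; // assumed limit; the real value lives in azure-entities

    async function modifyEntity(table, partitionKey, rowKey, modifier) {
      for (let attempt = 0; attempt < MAX_MODIFY_ATTEMPTS; attempt++) {
        // Load the current row, remembering its ETag.
        const entity = await table.getEntity(partitionKey, rowKey);
        const modified = modifier({...entity});

        try {
          // MERGE with If-Match: only succeeds if nobody else wrote in between.
          return await table.mergeEntity(modified, {eTag: entity.eTag});
        } catch (err) {
          // Another writer won the race; retry with a fresh read.
          if (err.code !== 'UpdateConditionNotSatisfied') {
            throw err;
          }
        }
      }
      throw new Error('MAX_MODIFY_ATTEMPTS exhausted, check for congestion');
    }
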
I'm just hunting for what could cause a performance cliff in azure tables at the moment. Nothing much more to share.
Thanks for the deep dive Brian!

I think we should open a support case with Microsoft Azure. Do you know how we do that?
We discussed in the team meeting today.

Brian is going to see if he can introduce some data capture in our azure entities library, and next week when coop is back we can look into opening a support case with azure.
See Also: → 1481493
Just received https://sentry.prod.mozaws.net/operations/taskcluster-queue/issues/4534255/ ("MAX_MODIFY_ATTEMPTS exhausted, check for congestion") as well. That could be a contributing factor.
We've reopened the trees because the timeouts have gone away. We're guessing that this is just because the volume of requests dropped off while the trees were closed. I have deployed https://github.com/taskcluster/taskcluster-queue/pull/287 and set the logging params to capture 100% of requests for a while. Please feel free to deploy master again if you want, or tweak the knobs that control the logging.
This seems to have resolved itself for no obvious reason. Please reopen if it causes more failures.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Taskcluster timeouts have reappeared - https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f4bbda249beed3143907a74c0683c87c9a9a692b&group_state=expanded&selectedJob=193582086

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=193582086&repo=mozilla-inbound&lineNumber=789

Task details: https://tools.taskcluster.net/groups/VY2L5IAOSReeifmTiR22OA/tasks/bbE0OIBsS0mgvHcAVY4i-Q/details

[task 2018-08-13T04:38:29.416Z] 04:38:29     INFO -  The error occurred in code that was called by the mach command. This is either
[task 2018-08-13T04:38:29.416Z] 04:38:29     INFO -  a bug in the called code itself or in the way that mach is calling it.
[task 2018-08-13T04:38:29.416Z] 04:38:29     INFO -  You should consider filing a bug for this issue.
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -  If filing a bug, please include the full output of mach, including this error
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -  message.
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -  The details of the failure are as follows:
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -  HTTPError: 500 Server Error: Internal Server Error for url: https://queue.taskcluster.net/v1/task/A17yqz7qRKuodUonoSnJAA/artifacts/public/chainOfTrust.json.asc
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/mach_commands.py", line 1462, in artifact_toolchain
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -      record = ArtifactRecord(task_id, name)
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/mach_commands.py", line 1356, in __init__
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -      cot.raise_for_status()
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -    File "/builds/worker/workspace/build/src/third_party/python/requests/requests/models.py", line 840, in raise_for_status
[task 2018-08-13T04:38:29.417Z] 04:38:29     INFO -      raise HTTPError(http_error_msg, response=self)
[task 2018-08-13T04:38:29.433Z] 04:38:29    ERROR - Return code: 1
[task 2018-08-13T04:38:29.434Z] 04:38:29    ERROR - 1 not in success codes: [0]
[task 2018-08-13T04:38:29.434Z] 04:38:29  WARNING - setting return code to 2
[task 2018-08-13T04:38:29.434Z] 04:38:29    FATAL - Halting on failure while running ['/usr/bin/python2.7', '-u', '/builds/worker/workspace/build/src/mach', 'artifact', 'toolchain', '-v', '--retry', '4', '--artifact-manifest', '/builds/worker/workspace/build/src/toolchains.json', '--cache-dir', '/builds/worker/tooltool-cache', 'public/build/clang.tar.xz@BIUGWFsNTceXUoxyV_leVQ', 'public/build/gcc.tar.xz@BYmAixHnT-mG2qmATEonbw', 'public/build/rustc.tar.xz@QW0bdcrhSEewusHFyX1iWw', 'public/build/rust-size.tar.xz@FKi9VKRmSZmV7EYJp_VitQ', 'public/build/sccache2.tar.xz@ZhHlZDo7QBi22K6-ruY5Rg', 'public/build/node.tar.xz@A17yqz7qRKuodUonoSnJAA']
[task 2018-08-13T04:38:29.434Z] 04:38:29    FATAL - Running post_fatal callback...
[task 2018-08-13T04:38:29.434Z] 04:38:29    FATAL - Exiting 2
[task 2018-08-13T04:38:29.434Z] 04:38:29     INFO - [mozharness: 2018-08-13 04:38:29.434486Z] Finished build step (failed)
[task 2018-08-13T04:38:29.434Z] 04:38:29     INFO - Running post-run listener: _summarize
[task 2018-08-13T04:38:29.434Z] 04:38:29    ERROR - # TBPL FAILURE #
[task 2018-08-13T04:38:29.434Z] 04:38:29     INFO - [mozharness: 2018-08-13 04:38:29.434717Z] FxDesktopBuild summary:
[task 2018-08-13T04:38:29.434Z] 04:38:29    ERROR - # TBPL FAILURE #
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Interesting datapoint: in a log span from 09:10:04 to 09:24:19, Azure timings of 10 seconds or more only seem to occur in bursts lasting less than a second. They are distributed across different operations and tables with no discernible pattern. I believe the presence of insertEntity and getEntity in this list helps rule out excessive modify retries as the culprit. Showing my work below, followed by a small parsing sketch:

taskcluster-queue|bug-1481493⚡ ⇒ papertrail -s taskcluster-queue --min-time '6 hours ago' > ./tttt      


taskcluster-queue|bug-1481493⚡ ⇒ rg "TIMING.*[0-9]{5,}\." ./tttt | cut -d " " -f 3 | uniq -c
     65 09:13:44
     52 09:14:01
     29 09:15:10
      1 09:17:15

taskcluster-queue|bug-1481493⚡ ⇒ rg ".*TIMING: ([a-zA-Z]+) on ([a-zA-Z]+) took ([0-9]{5,}\.[0-9]+)" -oN ./tttt --replace '$1 $2' | sort | uniq -c
     29 deleteEntity QueueTaskRequirement
     32 getEntity QueueArtifacts
      1 getEntity QueueTasks
     52 getEntity QueueWorker
     31 insertEntity QueueArtifacts
      2 queryEntities QueueArtifacts
     23 updateEntity QueueTaskDependency
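
A rough Node equivalent of the rg/cut/uniq pipeline above, in case anyone wants to iterate on the grouping locally. This is a sketch only; the TIMING line format is inferred from the regex used above:

    // Group slow (>= 10s) TIMING log lines by operation and table.
    const fs = require('fs');

    const lines = fs.readFileSync('./tttt', 'utf8').split('\n');
    const pattern = /TIMING: ([a-zA-Z]+) on ([a-zA-Z]+) took ([0-9]{5,}\.[0-9]+)/;

    const counts = new Map();
    for (const line of lines) {
      const m = pattern.exec(line);
      if (!m) continue;
      const key = `${m[1]} ${m[2]}`;          // e.g. "getEntity QueueArtifacts"
      counts.set(key, (counts.get(key) || 0) + 1);
    }

    for (const [key, count] of [...counts].sort()) {
      console.log(`${String(count).padStart(7)} ${key}`);
    }
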
We've created an Azure support request (id: 118081318779906). We will update this bug further when we know more.
Comment on attachment 9002108 [details]
Bug 1481178: Retry downloading `chainOfTrust.json.asc` in `mach artifact toolchain`; r?gps

Gregory Szorc [:gps] has approved the revision.
Attachment #9002108 - Flags: review+
Keywords: leave-open
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/mozilla-central/rev/3ef1c0555a29
Retry downloading `chainOfTrust.json.asc` in `mach artifact toolchain`; r=gps a=tomprince
Long ago, we noted during some other investigations that performing a DELETE operation on a table would cause subsequent reads to pause for 5 seconds.  I *think* we found this when trying to delete a few million "never expires" tasks from the early days of Taskcluster.  After a little searching, I can't find the bug though.  At any rate, it may be worth looking for correlations between the various services deleting things from tables and these long Azure requests.
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #35)
> Long ago, we noted during some other investigations that performing a DELETE
> operation on a table would cause subsequent reads to pause for 5 seconds.  I
> *think* we found this when trying to delete a few million "never expires"
> tasks from the early days of Taskcluster.  After a little searching, I can't
> find the bug though.  At any rate, it may be worth looking for correlations
> between the various services deleting things from tables and these long
> Azure requests.

Nice thinking! ++

Also, I got an email about recent Azure updates - looks like they have been pretty busy...
  https://azure.microsoft.com/en-us/updates/
So, I tried pulling the `retry` function out of https://github.com/taskcluster/fast-azure-storage/blob/609197d1f5ab0e44fe2da91ca21ff68bce2c3e7a/lib/utils.js and testing it out.  I'm passing the options that azure-entities passes.  Notably, this does not include randomizationFactor.

// Note: `retry` here is the function copy-pasted from fast-azure-storage's
// lib/utils.js (linked above); it is assumed to be defined above this snippet.

const test = async () => {
  const failures = 3;
  const start = new Date().getTime();

  const log = msg => {
    console.log(`${new Date().getTime() - start} ${msg}`);
  };

  // Fails with a transient ServerBusy error until `failures` attempts have been made.
  const f = async retry => {
    log(`start retry ${retry}`);
    await new Promise(resolve => setTimeout(resolve, 100));
    if (retry < failures) {
      log(`failing retry ${retry}`);
      const e = new Error();
      e.code = 'ServerBusy';
      throw e;
    }
    log(`succeeded retry ${retry}`);
  };

  // These are the options azure-entities passes; note that randomizationFactor is absent.
  await retry(f, {
    version: '2014-02-14',
    dataServiceVersion: '3.0',
    clientId: 'fast-azure-storage',
    timeout: 7000,
    clientTimeoutDelay: 500,
    metadata: 'fullmetadata',
    retries: 5,
    delayFactor: 100,
    maxDelay: 30000,
    transientErrorCodes: [
      'InternalErrorWithoutCode',
      'InternalError',
      'ServerBusy',
      'ETIMEDOUT',
      'ECONNRESET',
      'EADDRINUSE',
      'ESOCKETTIMEDOUT',
      'ECONNREFUSED',
      'RequestTimeoutError',
      'RequestAbortedError',
      'RequestContentLengthError',
    ],
    accountId: 'jungle',
    accessKey: undefined,
    minSASAuthExpiry: 900000,
  });
};

test();

The result is

0 start retry 0
104 failing retry 0
106 start retry 1
206 failing retry 1
207 start retry 2
308 failing retry 2
309 start retry 3
410 succeeded retry 3

Printing `delay` shows that it is always NaN, which setTimeout appears to treat as 0, so there's no delay between requests.

My hypothesis was that this retry method was always waiting 30 seconds on error, which would neatly translate any error from Azure into a 30-second timeout.  But that does not appear to be the case.  So https://github.com/taskcluster/fast-azure-storage/pull/32 fixes the bug with retry, but I don't think it fixes the issue we're looking at.
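
For what it's worth, here is a sketch of how the NaN delay mentioned above can arise when randomizationFactor is left out of the options. This is a guess at the general shape of an exponential-backoff-with-jitter calculation, not the actual fast-azure-storage code (the real fix is in PR 32 above):

    // Sketch of a typical exponential-backoff delay calculation.
    // If randomizationFactor is undefined, the arithmetic yields NaN,
    // and setTimeout(fn, NaN) behaves like setTimeout(fn, 0): no delay at all.
    const delayForRetry = (retry, options) => {
      const base = Math.min(options.delayFactor * Math.pow(2, retry), options.maxDelay);
      // With options.randomizationFactor === undefined:
      //   undefined * anything === NaN, and NaN propagates through the rest.
      const jitter = 1 + options.randomizationFactor * (Math.random() * 2 - 1);
      return base * jitter;
    };

    // Example: the options from the test above have no randomizationFactor.
    console.log(delayForRetry(2, {delayFactor: 100, maxDelay: 30000}));                              // NaN
    console.log(delayForRetry(2, {delayFactor: 100, maxDelay: 30000, randomizationFactor: 0.25}));   // ~400 +/- 100
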
I tried a quick hack to see if I could reproduce what I described in comment 35, and I could not.  Maybe I'm making things up!
Well, things seem to have fixed themselves for no apparent reason. We're still keeping an eye on it, but I am bumping down the severity for now.
Severity: blocker → major
I'm not certain it's this bug, but we're seeing some failures downloading CoT artifacts in tasks like https://tools.taskcluster.net/groups/SjAqqSqdQHmft4VBcfbZJw/tasks/YdSIz8cpSISbdEBE9nNTvQ/runs/0

From the scriptworker logs:

Sep 11 06:04:27 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker 2018-09-11T13:04:27    DEBUG - waiting 300 seconds before reclaiming...
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker 2018-09-11T13:04:32    ERROR - SCRIPTWORKER_UNEXPECTED_EXCEPTION task 
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker Traceback (most recent call last):
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/worker.py", line 50, in do_run_task
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     await verify_chain_of_trust(chain)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/cot/verify.py", line 1813, in verify_chain_of_trust
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     await download_cot_artifacts(chain)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/cot/verify.py", line 705, in download_cot_artifacts
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     mandatory_artifacts_paths = await raise_future_exceptions(mandatory_artifact_tasks)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 326, in raise_future_exceptions
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     succeeded_results, _ = await _process_future_exceptions(tasks, raise_at_first_error=True)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 360, in _process_future_exceptions
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     raise exc
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/cot/verify.py", line 660, in download_cot_artifact
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     chain.context, [url], parent_dir=link.cot_dir, valid_artifact_task_ids=[task_id]
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/artifacts.py", line 319, in download_artifacts
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     await raise_future_exceptions(tasks)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 326, in raise_future_exceptions
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     succeeded_results, _ = await _process_future_exceptions(tasks, raise_at_first_error=True)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 360, in _process_future_exceptions
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     raise exc
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 260, in retry_async
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     return await func(*args, **kwargs)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/scriptworker/utils.py", line 495, in download_file
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     chunk = await resp.content.read(chunk_size)
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/aiohttp/streams.py", line 329, in read
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     await self._wait('read')
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/aiohttp/streams.py", line 260, in _wait
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     await waiter
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker   File "/builds/scriptworker/lib/python3.6/site-packages/aiohttp/helpers.py", line 673, in __exit__
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker     raise asyncio.TimeoutError from None
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker concurrent.futures._base.TimeoutError
Sep 11 06:04:32 beetmoverworker-14.srv.releng.usw2.mozilla.com python: beetmover_scriptworker 2018-09-11T13:04:32 CRITICAL - Fatal exception
Hey Aki, in Ben's absence, do you know if this is still a problem?

Thanks!
Flags: needinfo?(aki)
I see 10 asyncio.TimeoutErrors total from scriptworkers in the past week. Digging deeper into a few, it looks like these happened during artifact upload. So there is still an issue, but not at a super high frequency.
Flags: needinfo?(aki)
How about now, aki? I can dig deeper on this soon if it is still a problem.
Flags: needinfo?(aki)
I see 12 `asyncio.TimeoutError`s total since Oct 10. The various intermittent bot comments above seem to be in the high 20s or mid-30s per week, so it looks like we're still doing better than before.
Flags: needinfo?(aki)
Ok, closing this for now. We can reopen if it causes issues again.
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Component: Operations → Operations and Service Requests