Intermittent-infra Funsize ValueError: No JSON object could be decoded

Status: RESOLVED INCOMPLETE
Reporter: intermittent-bug-filer (Unassigned)
Keywords: bulk-close-intermittents, intermittent-failure
Whiteboard: [stockwell infra]

It feels like it is a cloud-mirror issue:

---request begin---
GET /https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FM2ix_CuUSOm52G49nE76Jg%2F0%2Fpublic%2Fbuild%2Ftarget.test_packages.json HTTP/1.1
User-Agent: Wget/1.13.4 (linux-gnu)
Accept: */*
Host: cloud-mirror-production-us-east-1.s3.amazonaws.com
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 200 OK
x-amz-id-2: PFuos/Q3ppTiXJKGxRmAFjbzqTHSQm4QlEkSSe1khDzYQ2D8gXvdGmHoeBeL74IeIBGsJnDWMVA=
x-amz-request-id: 266DE2A80F6E01FF
Date: Wed, 11 Jan 2017 11:42:54 GMT
Last-Modified: Wed, 11 Jan 2017 07:40:14 GMT
x-amz-expiration: expiry-date="Fri, 13 Jan 2017 00:00:00 GMT", rule-id="us-east-1-1-day"
ETag: "865588d50a8998d378f5afbf8c4c491f"
x-amz-meta-cloud-mirror-upstream-url: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/M2ix_CuUSOm52G49nE76Jg/0/public/build/target.test_packages.json
x-amz-meta-cloud-mirror-upstream-content-length: <unknown>
x-amz-meta-cloud-mirror-stored: 2017-01-11T07:40:13.119Z
x-amz-meta-cloud-mirror-upstream-etag: <unknown>
x-amz-meta-cloud-mirror-addresses: [{"c":200,"u":"https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/M2ix_CuUSOm52G49nE76Jg/0/public/build/target.test_packages.json","t":"2017-01-11T07:40:07.984Z"}]
Accept-Ranges: bytes
Content-Type: application/xml
Content-Length: 282
Server: AmazonS3
---response end---
200 OK
Disabling further reuse of socket 4.
Closed 4/SSL 0x0000000001005880
Registered socket 3 for persistent reuse.
Length: 282 [application/xml]
Saving to: `target.test_packages.json.2'
100%[==============================================================================================================>] 282         --.-K/s   in 0s      
2017-01-11 11:42:53 (6.86 MB/s) - `target.test_packages.json.2' saved [282/282]

root@taskcluster-worker:~/workspace/build# cat target.test_packages.json
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>InternalError</Code><Message>We encountered an internal error. Please try again.</Message><RequestId>A690174B7BB0423A</RequestId><HostId>+GWa47hi3/ZgD2bJwuRvCrTBi7/8XROTDQ5q9kVe2HpwrIi3DESwoopdIUAnUtQ66epbvon2k6Q=</HostId></Error>
root@taskcluster-worker:~/workspace/build#
Component: General Automation → Platform and Services
Product: Release Engineering → Taskcluster
QA Contact: catlee
This log message suggests that the fetch was attempted before the resource had been uploaded to the original upstream bucket.


<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/QS3KiICIRfiwGZUn-Bxaxw/0/public/env/manifest.json</Key><RequestId>E5BB45D3FECFB3E3</RequestId><HostId>EvAXe70tgw4SWYWOQVC4L6JVE0BPcJxwgxZIOs/VCvKO5CIvDCmKieVvJDLIQJN45WBwqFeBLvY=</HostId></Error>
+ python /home/worker/bin/funsize-balrog-submitter.py --artifacts-url-prefix https://queue.taskcluster.net/v1/task/QS3KiICIRfiwGZUn-Bxaxw/artifacts/public/env --manifest /home/worker/artifacts/manifest.json -a http://balrog/api --signing-cert /home/worker/keys/nightly.pubkey --verbose
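The ValueError in the summary is what happens when one of these S3 error documents gets fed to a JSON parser. A minimal sketch of a defensive check before parsing (the helper name and structure are ours for illustration, not funsize's actual code):

```python
import json


def parse_artifact(body):
    """Parse a downloaded artifact body as JSON, rejecting S3 error XML.

    Hypothetical helper: an S3 error page saved as e.g.
    target.test_packages.json would otherwise raise
    "ValueError: No JSON object could be decoded" from json.loads,
    which is the failure tracked in this bug.
    """
    text = body.decode("utf-8") if isinstance(body, bytes) else body
    if text.lstrip().startswith("<?xml") or "<Error>" in text:
        # S3 returned an error document instead of the artifact;
        # surface that clearly rather than a generic JSON error.
        raise RuntimeError("S3 returned an error document, not the artifact")
    return json.loads(text)
```

A caller could retry the download (with backoff) on the RuntimeError instead of crashing on the misleading ValueError.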

From US-East-1, I get the following for the resource that failed to download:

~ $ curl -L -v -o out https://queue.taskcluster.net/v1/task/QS3KiICIRfiwGZUn-Bxaxw/artifacts/public/env/manifest.json
* Hostname was NOT found in DNS cache
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
*   Trying 54.225.134.170...
* Connected to queue.taskcluster.net (54.225.134.170) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Server hello (2):
{ [data not shown]
* SSLv3, TLS handshake, CERT (11):
{ [data not shown]
* SSLv3, TLS handshake, Server key exchange (12):
{ [data not shown]
* SSLv3, TLS handshake, Server finished (14):
{ [data not shown]
* SSLv3, TLS handshake, Client key exchange (16):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Finished (20):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
{ [data not shown]
* SSLv3, TLS handshake, Finished (20):
{ [data not shown]
* SSL connection using ECDHE-RSA-AES128-GCM-SHA256
* Server certificate:
* 	 subject: C=US; ST=California; L=Mountain View; O=Mozilla Corporation; CN=auth.taskcluster.net
* 	 start date: 2016-03-17 00:00:00 GMT
* 	 expire date: 2019-03-22 12:00:00 GMT
* 	 subjectAltName: queue.taskcluster.net matched
* 	 issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
* 	 SSL certificate verify ok.
> GET /v1/task/QS3KiICIRfiwGZUn-Bxaxw/artifacts/public/env/manifest.json HTTP/1.1
> User-Agent: curl/7.35.0
> Host: queue.taskcluster.net
> Accept: */*
> 
< HTTP/1.1 404 Not Found
* Server Cowboy is not blacklisted
< Server: Cowboy
< Connection: keep-alive
< X-Powered-By: Express
< Strict-Transport-Security: max-age=7776000
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
< Access-Control-Request-Method: *
< Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
< Content-Type: application/json; charset=utf-8
< Content-Length: 37
< Etag: W/"25-c445155e"
< Date: Wed, 11 Jan 2017 12:35:47 GMT
< Via: 1.1 vegur
< 
{ [data not shown]
100    37  100    37    0     0     95      0 --:--:-- --:--:-- --:--:--    95
* Connection #0 to host queue.taskcluster.net left intact
~ $ cat out 
{
  "message": "Artifact not found"
}~ $
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #2)

> From US-East-1, I get the following for the resource that failed to download
> as:
> 
> ~ $ curl -L -v -o out
> https://queue.taskcluster.net/v1/task/QS3KiICIRfiwGZUn-Bxaxw/artifacts/
> public/env/manifest.json

From https://tools.taskcluster.net/task-inspector/#QS3KiICIRfiwGZUn-Bxaxw/0 it looks like this is a signing-worker-v1 worker type (of the signing-provisioner-v1 provisioner) from 3 months ago, that has not yet expired (expires in October 2017) yet has no artifacts, (including no log file).

Aki, do you know more about this? Thanks!
Flags: needinfo?(aki)
The task definition for that task points to the manifest https://queue.taskcluster.net/v1/task/H6hLVYKBSAyZkxJmqwHzLg/artifacts/public/env/manifest.json which also appears not to exist at the moment.
Ah, it looks like that manifest probably used to exist, but expired. In task H6hLVYKBSAyZkxJmqwHzLg:

"public/env": {
    "path": "/home/worker/artifacts/",
    "expires": "2016-10-08T16:05:58.680033Z",
    "type": "directory"
}

So at the time task QS3KiICIRfiwGZUn-Bxaxw ran, it did exist. But it is not clear why there are no artifacts attached to QS3KiICIRfiwGZUn-Bxaxw - it could be that these artifacts were set to expire earlier than the task expiry, but that is not part of the task payload, so we can't see that. This would be my guess though - that the artifact(s) of task QS3KiICIRfiwGZUn-Bxaxw expired recently, causing this problem.
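The expiry reasoning above can be sketched with the timestamps from the task payload (the helper is illustrative, not a Taskcluster API):

```python
from datetime import datetime


def artifact_expired(expires_iso, at_iso):
    """Return True if the artifact's `expires` timestamp precedes `at_iso`.

    Parses the ISO-8601 timestamps Taskcluster uses, e.g.
    "2016-10-08T16:05:58.680033Z" from task H6hLVYKBSAyZkxJmqwHzLg.
    """
    fmt = "%Y-%m-%dT%H:%M:%S.%fZ"
    return datetime.strptime(expires_iso, fmt) < datetime.strptime(at_iso, fmt)


# By January 2017 the manifest from the October 2016 task had expired,
# even though the task itself expires later (October 2017).
artifact_expired("2016-10-08T16:05:58.680033Z", "2017-01-11T12:00:00.000000Z")  # True
```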
I think we can ignore this.

- signing-worker-v1 workers are only just now becoming tier1
- funsize jobs running against signing-worker-v1 workers are only just now becoming tier1
- aiui there was a new release of cloud mirror, though I'm not sure if that happened after oct 1.

Have we seen other instances of this?
Flags: needinfo?(aki)
127 failures in 155 pushes (0.819 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* mozilla-inbound: 127

Platform breakdown:
* android-4-3-armv7-api15: 127

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1306865&startday=2017-01-11&endday=2017-01-11&tree=all
127 failures in 722 pushes (0.176 failures/push) were associated with this bug in the last 7 days. 

This is the #12 most frequent failure this week. 

** This failure happened more than 50 times this week! Resolving this bug is a high priority. **

Repository breakdown:
* mozilla-inbound: 127

Platform breakdown:
* android-4-3-armv7-api15: 127

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1306865&startday=2017-01-09&endday=2017-01-15&tree=all

So far this has only been reported on the 11th, but I'm still not sure why it happened.

Task B was requesting an artifact from task A after artifacts were uploaded for Task A and Task A was marked resolved.  This shouldn't have been a timing issue.  The artifact exists at the time of me writing this comment too.

It's hard to diagnose now that it's a week old (papertrail log searching is only around for 3 days).  John is working on improving how we upload/download artifacts so that tasks are only completed successfully once artifacts are uploaded and content is verified on the s3 side.
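The "content is verified on the s3 side" idea can be sketched as an ETag check after upload (assumption: a single-part upload, where S3's ETag is the hex MD5 of the body; multipart ETags use a different format and are not handled here):

```python
import hashlib


def content_verified(local_bytes, s3_etag):
    """Compare an uploaded object's bytes against its S3 ETag.

    Sketch only: for non-multipart uploads the ETag S3 returns is the
    MD5 digest of the object body, quoted.  A task could refuse to
    resolve as successful until this check passes.
    """
    return hashlib.md5(local_bytes).hexdigest() == s3_etag.strip('"')
```

With a check like this, a task would only be marked completed once its artifacts are both uploaded and byte-verified, closing the window this bug exploited.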
Whiteboard: [stockwell infra]
2 failures in 718 pushes (0.003 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-inbound: 2

Platform breakdown:
* windows7-32-vm: 1
* gecko-decision: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1306865&startday=2017-06-26&endday=2017-07-02&tree=all
1 failure in 822 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1306865&startday=2017-07-17&endday=2017-07-23&tree=all
1 failure in 1008 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 1

Platform breakdown:
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1306865&startday=2017-07-24&endday=2017-07-30&tree=all
1 failure in 901 pushes (0.001 failures/push) was associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 1

Platform breakdown:
* gecko-decision: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1306865&startday=2017-08-07&endday=2017-08-13&tree=all
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Keywords: bulk-close-intermittents
Resolution: --- → INCOMPLETE