Open Bug 1823997 Opened 1 year ago Updated 8 days ago

Intermittent sccache: error: Timed out waiting for server startup

Categories

(Taskcluster :: General, defect, P5)

Tracking

(Not tracked)

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Regression)

Details

(Keywords: intermittent-failure, regression)

Filed by: smolnar [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=409812528&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/BPN1dvcITZCPpoZBBSxlgg/runs/0/artifacts/public/logs/live_backing.log


INFO -      FOUND_MOZCONFIG=/builds/worker/checkouts/gecko/.mozconfig
[task 2023-03-22T18:16:20.293Z] 18:16:20     INFO -      export FOUND_MOZCONFIG
[task 2023-03-22T18:16:20.294Z] 18:16:20     INFO -    Parallelism determined by memory: using 16 jobs for 16 cores based on 70.5 GiB RAM and estimated job size of 1.0 GiB
[task 2023-03-22T18:16:20.294Z] 18:16:20     INFO -  /usr/bin/gmake -f client.mk -j16 -k
[task 2023-03-22T18:16:20.296Z] 18:16:20     INFO -  # Terminate any sccache server that might still be around.
[task 2023-03-22T18:16:20.297Z] 18:16:20     INFO -  /builds/worker/fetches/sccache/sccache --stop-server > /dev/null 2>&1
[task 2023-03-22T18:16:20.300Z] 18:16:20     INFO -  # Start a new server, ensuring it gets the jobserver file descriptors
[task 2023-03-22T18:16:20.301Z] 18:16:20     INFO -  # from make (but don't use the + prefix when make -n is used, so that
[task 2023-03-22T18:16:20.301Z] 18:16:20     INFO -  # the command doesn't run in that case)
[task 2023-03-22T18:16:20.302Z] 18:16:20     INFO -  mkdir -p /builds/worker/artifacts
[task 2023-03-22T18:16:20.303Z] 18:16:20     INFO -  env SCCACHE_LOG=sccache=debug SCCACHE_ERROR_LOG=/builds/worker/artifacts/sccache.log /builds/worker/fetches/sccache/sccache --start-server
[task 2023-03-22T18:16:20.306Z] 18:16:20     INFO -  [2023-03-22T18:16:20Z DEBUG sccache::config] Attempting to read config file at "/builds/worker/.config/sccache/config"
[task 2023-03-22T18:16:20.306Z] 18:16:20     INFO -  [2023-03-22T18:16:20Z DEBUG sccache::config] Couldn't open config file: failed to open file `/builds/worker/.config/sccache/config`
[task 2023-03-22T18:16:20.306Z] 18:16:20     INFO -  sccache: Starting the server...
[task 2023-03-22T18:16:20.309Z] 18:16:20     INFO -  [2023-03-22T18:16:20Z DEBUG sccache::config] Attempting to read config file at "/builds/worker/.config/sccache/config"
[task 2023-03-22T18:16:20.309Z] 18:16:20     INFO -  [2023-03-22T18:16:20Z DEBUG sccache::config] Couldn't open config file: failed to open file `/builds/worker/.config/sccache/config`
[task 2023-03-22T18:16:30.309Z] 18:16:30    ERROR -  sccache: error: Timed out waiting for server startup
[task 2023-03-22T18:16:30.310Z] 18:16:30     INFO -  gmake: *** [client.mk:57: build] Error 2
[task 2023-03-22T18:16:34.490Z] 18:16:34     INFO -  0 compiler warnings present.
[task 2023-03-22T18:16:34.611Z] 18:16:34     INFO - Return code: 2
[task 2023-03-22T18:16:34.611Z] 18:16:34  WARNING - setting return code to 2
[task 2023-03-22T18:16:34.611Z] 18:16:34    FATAL - 'mach build -v' did not run successfully. Please check log for errors.
[task 2023-03-22T18:16:34.611Z] 18:16:34    FATAL - Running post_fatal callback...
[task 2023-03-22T18:16:34.611Z] 18:16:34    FATAL - Exiting -1
[task 2023-03-22T18:16:34.611Z] 18:16:34     INFO - [mozharness: 2023-03-22 18:16:34.611656Z] Finished build step (failed)
[task 2023-03-22T18:16:34.611Z] 18:16:34     INFO - Running post-run listener: _parse_build_tests_ccov
[task 2023-03-22T18:16:34.612Z] 18:16:34     INFO - Running command: ['/builds/worker/workspace/obj-build/_virtualenvs/common/bin/python', 'mach', 'python', 'testing/parse_build_tests_ccov.py'] in /builds/worker/checkouts/gecko
[task 2023-03-22T18:16:34.612Z] 18:16:34     INFO - Copy/paste: /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python mach python testing/parse_build_tests_ccov.py
[task 2023-03-22T18:16:34.613Z] 18:16:34     INFO - Using env: {'ACCEPTED_MAR_CHANNEL_IDS': 'firefox-mozilla-central',
[task 2023-03-22T18:16:34.613Z] 18:16:34     INFO -  'DEBIAN_FRONTEND': 'noninteractive',
[task 2023-03-22T18:16:34.613Z] 18:16:34     INFO -  'EXTRA_MOZHARNESS_CONFIG': '{"mozconfig_variant": "debug", "objdir": '

Sounds like a regression from bug 1823796.

Component: Task Configuration → General
Keywords: regression
Regressed by: 1823796

Set release status flags based on info from the regressing bug 1823796

:Sylvestre, since you are the author of the regressor, bug 1823796, could you take a look?

Here is the error:

[2023-03-22T18:16:20Z DEBUG sccache::cache::cache] Init gcs cache with bucket sccache-l3-us-central1, key_prefix 
storage write check failed: ObjectPermissionDenied (permanent) at write => GcsErrorResponse { error: GcsError { code: 403, message: "sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.", errors: [GcsErrorDetail { domain: "global", location: "", location_type: "", message: "sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.", reason: "forbidden" }] } }

Context:
    response: Parts { status: 403, version: HTTP/1.1, headers: {"x-guploader-uploadid": "ADPycds7myRzdQOsU__b3jrH3SXmnJQc6_gXrbBDbiPG6dVgk2jjgIyUjR_fLdQoW0K64dO6rTxN9tpfZ0ol7GMZ6DhKarFynhDx", "content-type": "application/json; charset=UTF-8", "date": "Wed, 22 Mar 2023 18:16:34 GMT", "vary": "Origin", "vary": "X-Origin", "cache-control": "no-cache, no-store, max-age=0, must-revalidate", "expires": "Mon, 01 Jan 1990 00:00:00 GMT", "pragma": "no-cache", "content-length": "448", "server": "UploadServer"} }
    service: gcs
    path: .sccache_check
Duplicate of this bug: 1824495

I am not sure it is a regression. I am pretty sure it was happening before, but we didn't see it because sccache wasn't showing the error.

Also, it seems to be a configuration issue:
"does not have storage.objects.delete access to the Google Cloud Storage object"
so not a problem with sccache per se?!

Flags: needinfo?(sledru)

It's still a regression in the sense that this didn't cause builds to fail before, and now they do, and if you go back to the older version, they won't fail anymore.

Duplicate of this bug: 1846857

(In reply to Sylvestre Ledru [:Sylvestre] PTO from comment #3)

Here is the error:

[2023-03-22T18:16:20Z DEBUG sccache::cache::cache] Init gcs cache with bucket sccache-l3-us-central1, key_prefix 
storage write check failed: ObjectPermissionDenied (permanent) at write => GcsErrorResponse { error: GcsError { code: 403, message: "sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.", errors: [GcsErrorDetail { domain: "global", location: "", location_type: "", message: "sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.", reason: "forbidden" }] } }

Context:
    response: Parts { status: 403, version: HTTP/1.1, headers: {"x-guploader-uploadid": "ADPycds7myRzdQOsU__b3jrH3SXmnJQc6_gXrbBDbiPG6dVgk2jjgIyUjR_fLdQoW0K64dO6rTxN9tpfZ0ol7GMZ6DhKarFynhDx", "content-type": "application/json; charset=UTF-8", "date": "Wed, 22 Mar 2023 18:16:34 GMT", "vary": "Origin", "vary": "X-Origin", "cache-control": "no-cache, no-store, max-age=0, must-revalidate", "expires": "Mon, 01 Jan 1990 00:00:00 GMT", "pragma": "no-cache", "content-length": "448", "server": "UploadServer"} }
    service: gcs
    path: .sccache_check

That's actually a red herring. It's also in the log of non-failing builds.

Opened https://github.com/mozilla/sccache/issues/1853 for that, but it's not the root cause of the timeout problem. We do have a setup problem (without the delete permission, SCCACHE_RECACHE doesn't work), but that's not this bug.
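The failing check in the 403 error above can be illustrated with a small sketch (a hypothetical in-memory stand-in, not sccache's or GCS's actual code): a storage health check that writes a probe object and then deletes it will report failure on a bucket where writes succeed but deletes are forbidden, even though plain cache reads and writes would have worked.

```python
class ForbiddenError(Exception):
    """Stand-in for a GCS 403 ObjectPermissionDenied response."""

class FakeBucket:
    """In-memory stand-in for a GCS bucket with configurable delete permission."""
    def __init__(self, can_delete):
        self.objects = {}
        self.can_delete = can_delete

    def write(self, key, data):
        self.objects[key] = data

    def delete(self, key):
        if not self.can_delete:
            raise ForbiddenError("storage.objects.delete access denied")
        del self.objects[key]

def write_check(bucket, key=".sccache_check"):
    """Probe the bucket the way a write-then-delete health check would."""
    try:
        bucket.write(key, b"probe")
        bucket.delete(key)
        return True
    except ForbiddenError:
        return False

# A bucket that allows writes but not deletes fails the check,
# even though caching itself only needs read/write.
print(write_check(FakeBucket(can_delete=True)))   # True
print(write_check(FakeBucket(can_delete=False)))  # False
```

This is why the check fails with a `.sccache_check` path while real cache traffic would not be affected.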

Ooooh, the relevant part of the log is the timestamps. Here's from a recent one:

[2023-08-02T21:25:22Z DEBUG sccache::cache::cache] Init gcs cache with bucket sccache-l3-us-central1, key_prefix 
[2023-08-02T21:25:22Z DEBUG sccache::cache::gcs] gcs: start to load token from: http://taskcluster/auth/v1/gcp/credentials/sccache-3/sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com
[2023-08-02T21:27:22Z DEBUG sccache::cache::gcs] gcs: token load succeeded for scope: https://www.googleapis.com/auth/devstorage.read_write

Note how it took exactly two minutes to get the token from http://taskcluster/!
But by then, the sccache server startup had already timed out:

[task 2023-08-02T21:25:22.485Z] 21:25:22     INFO -  sccache: Starting the server...
[task 2023-08-02T21:25:22.487Z] 21:25:22     INFO -  [2023-08-02T21:25:22Z DEBUG sccache::config] Attempting to read config file at "/builds/worker/.config/sccache/config"
[task 2023-08-02T21:25:22.487Z] 21:25:22     INFO -  [2023-08-02T21:25:22Z DEBUG sccache::config] Couldn't open config file: failed to open file `/builds/worker/.config/sccache/config`
[task 2023-08-02T21:25:32.488Z] 21:25:32    ERROR -  sccache: error: Timed out waiting for server startup

(sccache server timeout is 10s, which is much shorter than the two minutes it took to get the token)
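The gap can be confirmed directly by parsing the two token log timestamps quoted above (`Z` rewritten as `+00:00` so `datetime.fromisoformat` accepts it on older Pythons):

```python
from datetime import datetime

# Timestamps taken from the two "gcs:" token log lines above.
requested = datetime.fromisoformat("2023-08-02T21:25:22+00:00")
loaded = datetime.fromisoformat("2023-08-02T21:27:22+00:00")

delta = (loaded - requested).total_seconds()
print(delta)  # 120.0 seconds, twelve times the 10s startup timeout
```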

So there are two problems:

  • the sccache server doesn't respond while it's waiting for the token.
  • getting the token can sometimes take way too long.

Filed https://github.com/mozilla/sccache/issues/1854 for the former, and moving the bug to taskcluster for the latter.
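The interaction between the two problems can be sketched like this (a hypothetical model, not sccache's code, with timings scaled down so it runs quickly: the 10s startup timeout becomes 0.1s and the 2-minute token fetch becomes 0.3s). The client polls for readiness up to a deadline, but the server only signals readiness after its blocking token fetch completes, so the client gives up first:

```python
import threading
import time

STARTUP_TIMEOUT = 0.1   # stands in for sccache's 10s startup timeout
TOKEN_FETCH_TIME = 0.3  # stands in for the 2-minute credential fetch

ready = threading.Event()

def server():
    # Problem 1: the server blocks on the token fetch and only
    # answers readiness checks afterwards.
    time.sleep(TOKEN_FETCH_TIME)  # problem 2: the fetch itself is slow
    ready.set()

def wait_for_startup(timeout):
    """Poll for server readiness until the deadline, then give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if ready.is_set():
            return True
        time.sleep(0.01)
    return False

threading.Thread(target=server, daemon=True).start()
ok = wait_for_startup(STARTUP_TIMEOUT)
print("started" if ok else "error: Timed out waiting for server startup")
```

Fixing either side alone would make the symptom disappear: respond to readiness checks before the token fetch, or make the token fetch reliably faster than the startup timeout.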

Product: Firefox Build System → Taskcluster

I guess the failure rate dropped thanks to:
https://github.com/mozilla/sccache/pull/2077

(In reply to Sylvestre Ledru [:Sylvestre] from comment #61)

I guess the failure rate dropped thanks to:
https://github.com/mozilla/sccache/pull/2077

We're still on 0.7.6, so no.
