Open Bug 1904878 Opened 3 days ago Updated 1 day ago

sccache hit rate is 0

Categories

(Firefox Build System :: Toolchains, defect)

defect

Tracking

(Not tracked)

People

(Reporter: jcristau, Unassigned)

References

(Regression)

Details

(Keywords: regression)

Since the bump to sccache 0.8.1 it looks like our hit rate on CI is down to 0, which slows down most builds significantly.

Example output from a try build:

[task 2024-06-25T18:13:09.656Z] 18:13:09     INFO -  Compile requests                   5575
[task 2024-06-25T18:13:09.656Z] 18:13:09     INFO -  Compile requests executed          4937
[task 2024-06-25T18:13:09.656Z] 18:13:09     INFO -  Cache hits                            0
[task 2024-06-25T18:13:09.656Z] 18:13:09     INFO -  Cache misses                       4929
[task 2024-06-25T18:13:09.656Z] 18:13:09     INFO -  Cache misses (C/C++)               4351
[task 2024-06-25T18:13:09.656Z] 18:13:09     INFO -  Cache misses (Rust)                 578
[task 2024-06-25T18:13:09.657Z] 18:13:09     INFO -  Cache timeouts                        0
[task 2024-06-25T18:13:09.657Z] 18:13:09     INFO -  Cache read errors                     0
[task 2024-06-25T18:13:09.657Z] 18:13:09     INFO -  Forced recaches                       0
[task 2024-06-25T18:13:09.657Z] 18:13:09     INFO -  Cache write errors                 4929
[task 2024-06-25T18:13:09.657Z] 18:13:09     INFO -  Compilation failures                  6
[task 2024-06-25T18:13:09.657Z] 18:13:09     INFO -  Cache errors                          2
[task 2024-06-25T18:13:09.658Z] 18:13:09     INFO -  Cache errors (Rust)                   2
[task 2024-06-25T18:13:09.658Z] 18:13:09     INFO -  Non-cacheable compilations            0
[task 2024-06-25T18:13:09.658Z] 18:13:09     INFO -  Non-cacheable calls                 359
[task 2024-06-25T18:13:09.658Z] 18:13:09     INFO -  Non-compilation calls               279
[task 2024-06-25T18:13:09.658Z] 18:13:09     INFO -  Unsupported compiler calls            0
[task 2024-06-25T18:13:09.658Z] 18:13:09     INFO -  Average cache write               0.000 s
[task 2024-06-25T18:13:09.658Z] 18:13:09     INFO -  Average compiler                  5.962 s
[task 2024-06-25T18:13:09.658Z] 18:13:09     INFO -  Average cache read hit            0.000 s
[task 2024-06-25T18:13:09.658Z] 18:13:09     INFO -  Failed distributed compilations       0
[task 2024-06-25T18:13:09.659Z] 18:13:09     INFO -  Non-cacheable reasons:
[task 2024-06-25T18:13:09.659Z] 18:13:09     INFO -  -P                                  164
[task 2024-06-25T18:13:09.659Z] 18:13:09     INFO -  crate-type                          134
[task 2024-06-25T18:13:09.659Z] 18:13:09     INFO -  -fsyntax-only                        45
[task 2024-06-25T18:13:09.659Z] 18:13:09     INFO -  unknown source language              10
[task 2024-06-25T18:13:09.659Z] 18:13:09     INFO -  -                                     3
[task 2024-06-25T18:13:09.659Z] 18:13:09     INFO -  missing input                         2
[task 2024-06-25T18:13:09.659Z] 18:13:09     INFO -  -E                                    1
[task 2024-06-25T18:13:09.659Z] 18:13:09     INFO -  Cache location                  gcs, name: sccache-l1-us-central1, prefix: /
[task 2024-06-25T18:13:09.659Z] 18:13:09     INFO -  Version (client)                0.8.1

Mike, do you think it can be "Disable preprocessor cache " is too aggressive ?
https://github.com/mozilla/sccache/releases

Flags: needinfo?(mh+mozilla)

from sccache.log:

[2024-06-25T17:34:32Z DEBUG sccache::cache::gcs] gcs: token load succeeded for scope: https://www.googleapis.com/auth/devstorage.read_write
storage write check failed: PermissionDenied (permanent) at Writer::close => GcsErrorResponse { error: GcsError { code: 403, message: "sccache-l1-us-west1@sccache-3.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.", errors: [GcsErrorDetail { domain: "global", location: "", location_type: "", message: "sccache-l1-us-west1@sccache-3.iam.gserviceaccount.com does not have storage.objects.delete access to the Google Cloud Storage object.", reason: "forbidden" }] } }

Context:
   uri: https://storage.googleapis.com/upload/storage/v1/b/sccache-l1-us-west1/o?uploadType=media&name=.sccache_check
   response: Parts { status: 403, version: HTTP/1.1, headers: {"content-type": "application/json; charset=UTF-8", "date": "Tue, 25 Jun 2024 17:34:32 GMT", "vary": "Origin", "vary": "X-Origin", "cache-control": "no-cache, no-store, max-age=0, must-revalidate", "expires": "Mon, 01 Jan 1990 00:00:00 GMT", "pragma": "no-cache", "x-guploader-uploadid": "ACJd0Nogt5ZWdfAo-4R9JZtTNLpB5l-GiTNahZjU2nUfJ2V-7CbLeUxd6lvbCNbWmAFKyWYTIO0Dz4qrsg", "content-length": "442", "server": "UploadServer"} }
   service: gcs
   path: .sccache_check

and then [2024-06-25T17:34:33Z DEBUG sccache::server] Error executing cache write: Cannot write to read-only storage, over and over again.

https://github.com/mozilla/sccache/pull/2091 looks quite suspicious.

That's the one, this PR sets the storage to be read only. It might well be that we can't write to the root or whatever.

Fixed by https://hg.mozilla.org/mozilla-central/rev/272e255a9ef8f9b76b2d394edfb9c4307189e7d9, we need someone with access to the bucket permission to continue here.

Fun fact: https://github.com/mozilla/sccache/pull/2091 comes from me filing https://github.com/mozilla/sccache/issues/1854 while investigating bug 1823997, which, you'll notice, already showed these write error messages more than a year ago.

Flags: needinfo?(mh+mozilla)

BTW, I thought we were getting perf alerts on high numbers of cache write errors... and now I can't even find the sccache graphs on perfherder.

(In reply to Mike Hommey [:glandium] from comment #10)

and there WERE alerts

... both marked invalid.
https://treeherder.mozilla.org/perfherder/alerts?id=719
https://treeherder.mozilla.org/perfherder/alerts?id=834

Hi Mike, thank you for raising the issue, I linked both alerts to this bug and marked them as "fixed". I think this was missed by the sheriffing team because we also have the "instrumented" build alerts, which we set to "invalid" by default (there is a plan to disable alerting for instrumented builds sometime in the future). Also, build_metrics alerts that are generated by noise in the graph, which then return to a hit rate of 0 get set as invalid, since they are not showing a sustained regression. It is possible that these alerts were classified as "noisy" by mistake.

You need to log in before you can comment on or make changes to this bug.