Closed Bug 1595498 Opened 5 years ago Closed 5 years ago

9.22 - 18305% Taskcluster infra change - build times / sccache

Categories

(Taskcluster :: General, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED
mozilla72

People

(Reporter: alexandrui, Assigned: tomprince)

References

(Regression)

Details

(Keywords: perf-alert, regression, Whiteboard: [necko-triaged])

Attachments

(1 file)

We have detected a build metrics regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=c79b90bae420d6b32db7dc58fb263dc4aaa00ef2

You are the author of one of the patches included in that push, and we need your help to address this regression.

Regressions:

18305% sccache cache_write_errors android-5-0-aarch64 opt 3.33 -> 613.50
15167% sccache cache_write_errors android-4-0-armv7-api16 opt 4.00 -> 610.67
14432% sccache cache_write_errors android-4-2-x86 opt 4.17 -> 605.50
14235% sccache cache_write_errors osx-cross-noopt debug 4.50 -> 645.08
12955% sccache cache_write_errors android-5-0-x86_64 debug 4.67 -> 609.25
12929% sccache cache_write_errors osx-cross debug 4.92 -> 640.58
12744% sccache cache_write_errors android-5-0-x86_64 opt 4.75 -> 610.08
10862% sccache cache_write_errors osx-cross asan asan-fuzzing 5.92 -> 648.58
9013% sccache cache_write_errors osx-cross debug fuzzing 7.17 -> 653.08
5798% sccache cache_write_errors linux64-noopt debug 11.25 -> 663.50
5559% sccache cache_write_errors linux64 asan opt 11.50 -> 650.83
5465% sccache cache_write_errors linux64 debug 11.92 -> 663.17
5135% sccache cache_write_errors linux64 opt valgrind 12.50 -> 654.42
4730% sccache cache_write_errors linux64 opt 13.58 -> 656.08
4404% sccache cache_write_errors android-5-0-aarch64 debug 13.50 -> 608.00
3295% sccache cache_write_errors linux64 asan asan-fuzzing 19.50 -> 662.08
2755% sccache cache_write_errors android-4-0-armv7-api16 debug 21.25 -> 606.75
1129% sccache cache_write_errors android-4-2-x86 debug 50.08 -> 615.58
1029% sccache cache_write_errors linux32 debug 59.25 -> 668.67
978% sccache cache_write_errors linux64-aarch64 opt 60.42 -> 651.42
719% sccache cache_write_errors linux64 asan debug 80.29 -> 657.58

You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=23788

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Automated_Performance_Testing_and_Sheriffing/Build_Metrics

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Flags: needinfo?(valentin.gosu)
Component: Performance → Networking: DNS
Product: Testing → Core
Target Milestone: --- → mozilla72
Version: Version 3 → unspecified

The patch was backed out because of hazard failures (I am not sure yet what that means)

I also don't quite understand what sccache cache_write_errors means. Can you explain what the metric measures?

Flags: needinfo?(valentin.gosu) → needinfo?(alexandru.ionescu)
Assignee: nobody → valentin.gosu
Priority: -- → P2
Whiteboard: [necko-triaged]

You're right, it is odd that the patch was backed out and the regression is still there. The sccache tests do react with a bit of delay, but the regression should have been fixed by now.
igoldan, what do you think?

Flags: needinfo?(alexandru.ionescu) → needinfo?(igoldan)

(In reply to Valentin Gosu [:valentin] (he/him) from comment #1)

I also don't quite understand what sccache cache_write_errors means. Can you explain what the metric measures?

Kim, these tests used to be owned by Ted. Could you point us to their current owner so we can get some clarification? Thanks!

Flags: needinfo?(igoldan) → needinfo?(kmoir)

This is indeed weird. Under no circumstances could bug 1552176 have caused these regressions.
These regressions are all over the place; all build_metrics have now changed their baselines and stayed that way, not just sccache. Kim, please bring this heads-up to your team.
Anyway, this actually feels to me like an infra change. But to prove it, we would need to retrigger older jobs, which at the moment cannot be done (more about that in bug 1595359).

Of the considerable infra changes that occurred over the weekend (Nov 9 and 10), after which all the build_metrics baselines changed, I think a pretty likely suspect is bug 1546801.
Dustin, I am not familiar with the work done in this area, so I need your confirmation or clarification here. Tom, I believe you also have the required knowledge to answer this.

Flags: needinfo?(dustin)
No longer regressed by: 1552176
See Also: → 1552176
Regressed by: tc-cloudops
Component: Networking: DNS → Operations and Service Requests
Product: Core → Taskcluster

Valentin, I'd say you can unassign yourself from this bug for now. It's very likely we need to clarify a whole different matter.

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(mozilla)
Summary: 718.99 - 18305% sccache cache_write_errors (android-4-0-armv7-api16, android-4-2-x86, android-5-0-aarch64|x86_64, linux32, linux64|-aarch64|-noopt, osx-cross|-noopt) regression on push c79b90bae420d6b32db7dc58fb263dc4aaa00ef2 (Mon November 11 2019) → Taskcluster infra change
Assignee: valentin.gosu → nobody
No longer regressed by: tc-cloudops
Component: Operations and Service Requests → General
Summary: Taskcluster infra change → 9.22 - 18305% Taskcluster infra change - build times / sccache
Regressed by: tc-cloudops
Flags: needinfo?(valentin.gosu)

I suspect this means that some or all of the sccache credentials aren't in place in the new deployment, so all writes are failing. Wander and Grenade are the most knowledgeable.

Flags: needinfo?(wcosta)
Flags: needinfo?(rthijssen)
Flags: needinfo?(mozilla)
Flags: needinfo?(kmoir)
Flags: needinfo?(dustin)

I created new credentials for Bug 1595567, but did not delete the old ones. Maybe GCP IAM messed something up.

Flags: needinfo?(wcosta)

Looking at the logs, this feels unrelated, as my change only affects GCP; sccache actually can't find AWS credentials at all. From sccache.log:

Could not load AWS creds: Couldn't find AWS credentials in environment, credentials file, or IAM role.

It might be a configuration problem in taskcluster-auth.
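For context, the lookup order described by that error can be mimicked with a short illustrative Python snippet using boto3's default credential chain (this is not sccache's actual code, which is written in Rust, but the search covers the same places: environment variables, the shared credentials file, and an instance IAM role):

```python
# Illustrative only: reproduce the credential search the sccache error describes.
import boto3

session = boto3.session.Session()
creds = session.get_credentials()
if creds is None:
    # Corresponds to "Couldn't find AWS credentials in environment,
    # credentials file, or IAM role."
    print("no AWS credentials found")
else:
    # e.g. "env", "shared-credentials-file", or "iam-role"
    print("credentials found via:", creds.method)
```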

Previously, we applied an IAM role to Windows EC2 instances that gave them access to sccache. We stopped doing that in bug 1562686, when task configuration was modified to use Taskcluster scopes and roles to grant sccache access to Windows workers at the task level.

Workers and infrastructure no longer have anything to do with sccache bucket access permissions. Access is managed at the task level, which is not something I understand now that IAM roles on instances are no longer used. Furthermore, I have no information about where the GCP buckets are hosted or what IAM roles exist to manage their access.

In bug 1570148 comment 65, I asked for information about how the GCP sccache buckets are configured, what roles and IAM permissions are used, and how to deal with issues like this one, but that information has not been forthcoming. For example, I still don't know which GCP projects even host the sccache buckets or how to find out what IAM roles exist. Without that information I am unable to troubleshoot sccache issues.

At the very least, I will need to know which project is used, and I will need access to the GCP project that hosts the sccache buckets and to its IAM configuration. Without that, I'm completely in the dark about sccache and unable to assist with this.

I'm happy to help, since I have done sccache configuration and debugging in the past, but I do need the access and information I requested in bug 1570148 in order to do so.

Flags: needinfo?(rthijssen)

Tom suggested this might be a missing scope?

Flags: needinfo?(mozilla)

:Callek The project:taskcluster:{trust_domain}:level-{level}-sccache-buckets roles need to be ported to the new cluster. Those roles are directly referenced by the in-tree code, which adds them to the tasks (so the in-tree code doesn't need to be able to enumerate the buckets).
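As a rough illustration of how the in-tree code references those roles, a taskgraph-style transform could attach the level-scoped role to a task along these lines (the function name and config values are assumed for the example, not the actual mozilla-central implementation):

```python
# Hypothetical sketch: add the sccache bucket role to a task's scopes.
def add_sccache_scope(task, trust_domain, level):
    # Tasks reference a role by carrying the matching assume:* scope, so the
    # in-tree code never has to enumerate the underlying buckets.
    scope = f"assume:project:taskcluster:{trust_domain}:level-{level}-sccache-buckets"
    task.setdefault("scopes", []).append(scope)
    return task

# A level-3 gecko build task would end up with the scope
# assume:project:taskcluster:gecko:level-3-sccache-buckets
task = add_sccache_scope({"scopes": []}, "gecko", 3)
```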

Flags: needinfo?(mozilla) → needinfo?(bugspam.Callek)

I just tossed a patch up and applied it anyway; let's see what comes out of review.

Flags: needinfo?(bugspam.Callek)

Triaging; assigning to Callek since he attached patches.

Assignee: nobody → bugspam.Callek
Assignee: bugspam.Callek → mozilla

It looks like sccache is successfully getting credentials from Taskcluster, but those credentials are not working. Here is where those credentials are requested, and there is no failure there.

:dustin Can you verify whether the credentials are there? I suspect this should probably belong to relops eventually, but I'm not sure who there I should be talking to (cc: :fubar).
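For reference, the auth service endpoint that hands out these temporary S3 credentials can be exercised with the Python taskcluster client; this is only a hedged illustration (the bucket, prefix, and root URL are placeholders, and the caller needs the corresponding auth:aws-s3:* scope), not the exact request the build makes:

```python
# Illustrative: ask the Taskcluster auth service for temporary S3 credentials.
import taskcluster

auth = taskcluster.Auth({"rootUrl": "https://firefox-ci-tc.services.mozilla.com"})
resp = auth.awsS3Credentials("read-write", "example-sccache-bucket", "example/prefix/")
# resp["credentials"] carries accessKeyId / secretAccessKey / sessionToken.
# The call can succeed (as the logs show) while the issued STS credentials are
# still useless if the access key backing the auth service sits in the wrong
# AWS account.
print(resp["expires"])
```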

Flags: needinfo?(klibby)
Flags: needinfo?(dustin)

The access_key_id currently configured in the auth service is in the cloudops-taskcluster-aws-prod AWS account, not the mozilla-taskcluster account where we run workers. It has access to the Taskcluster backup bucket, but nothing more.

Longer term, I think we'd like to configure these AWS credentials the way we configure the GCP credentials, so they could potentially span multiple accounts. But for the moment, the fix is probably to replace that key with one from the mozilla-taskcluster account. I'll send some credentials along to cloudops.
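One quick way to confirm that kind of account mixup (an assumed debugging step, not something taken from this bug) is to ask STS which account a given access key actually belongs to:

```python
# Check which AWS account an access key pair belongs to; values are placeholders.
import boto3

sts = boto3.client(
    "sts",
    aws_access_key_id="AKIAEXAMPLEKEY",
    aws_secret_access_key="example-secret",
)
identity = sts.get_caller_identity()
# "Account" would distinguish cloudops-taskcluster-aws-prod from the
# mozilla-taskcluster account, making the misconfiguration obvious.
print(identity["Account"], identity["Arn"])
```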

Flags: needinfo?(dustin)
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Flags: needinfo?(klibby)