Closed Bug 1443686 Opened 7 years ago Closed 7 years ago

Windows builds backlog

Categories

(Taskcluster :: Services, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: dluca, Unassigned)

Windows builds are not running. Autoland and Inbound are closed until the issue is resolved.
Earlier today pmoore updated OCC to start using generic-worker v10.6.0. Shortly after, grenade found that we were running into GitHub TLS issues when trying to download the EXEs. He copied them over to tooltool and added the sha512 hashes to the manifests in OCC. Inexplicably, he then changed the sha512 hashes for gecko-1-b-win2012, gecko-2-b-win2012, and gecko-3-b-win2012 with the message "sha512 corrections". Unfortunately, the file that matches that hash on tooltool is zero-length, so builds go boom. I wasn't able to determine WHAT version of generic-worker the "correct" hash referred to, so I simply reverted the commit, which puts us back onto generic-worker v10.6.0. NI on grenade for more info on that commit, and what the intention was.
Flags: needinfo?(klibby) → needinfo?(rthijssen)
If the move from github.com to S3/tooltool was intended as a temporary workaround, my vote is to make it permanent: we want to minimize CI's dependency on third party services - including github.com - because this will minimize points of failure and increase resiliency of Taskcluster.
(In reply to Kendall Libby [:fubar] from comment #2)
> Earlier today pmoore updated OCC to start using generic-worker v10.6.0.
I rolled out generic-worker 10.6.0 to our staging worker types - there was no production change.
It looks like in this commit[1] generic-worker 10.6.0 has been rolled out to gecko-{1,2,3}-b-win2012 rather than generic-worker 10.2.3.
> commit 78eb9e9ce5d6b2cd093b8f602b58427efdcea877
> Author: kendall libby <klibby@mozilla.com>
> Date: Tue Mar 6 21:03:06 2018 -0500
>
> Bug 1443686, Bug 1443595 - revert sha512 corrections
>
> The sha512 corrections made in 75cceef point to a zero-length file on
> tooltool, breaking builds on gecko-2-b-win2012 and gecko-3-b-win2012.
> It's not clear what version of generic-worker that sha relates to, so
> rolling back the commit to what I believe is the correct hash for
> generic-worker 10.6.0.
>
> diff --git a/userdata/Manifest/gecko-1-b-win2012.json b/userdata/Manifest/gecko-1-b-win2012.json
> index dfc578e..9b4a0c6 100644
> --- a/userdata/Manifest/gecko-1-b-win2012.json
> +++ b/userdata/Manifest/gecko-1-b-win2012.json
> @@ -1002,7 +1002,7 @@
> ],
> "Source": "https://github.com/taskcluster/generic-worker/releases/download/v10.2.3/generic-worker-windows-amd64.exe",
> "Target": "C:\\generic-worker\\generic-worker.exe",
> - "sha512": "cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e"
> + "sha512": "dceae809dbb3df6d6c6f8ee7d6222cf09a74019e7238010b01bf71db3a2bb266a99e343de14a11bd45cbf94517d544fe5d5699ba0732bc9d8700794f16a46849"
> },
> {
> "ComponentName": "LiveLogDownload",
> diff --git a/userdata/Manifest/gecko-2-b-win2012.json b/userdata/Manifest/gecko-2-b-win2012.json
> index dfc578e..9b4a0c6 100644
> --- a/userdata/Manifest/gecko-2-b-win2012.json
> +++ b/userdata/Manifest/gecko-2-b-win2012.json
> @@ -1002,7 +1002,7 @@
> ],
> "Source": "https://github.com/taskcluster/generic-worker/releases/download/v10.2.3/generic-worker-windows-amd64.exe",
> "Target": "C:\\generic-worker\\generic-worker.exe",
> - "sha512": "cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e"
> + "sha512": "dceae809dbb3df6d6c6f8ee7d6222cf09a74019e7238010b01bf71db3a2bb266a99e343de14a11bd45cbf94517d544fe5d5699ba0732bc9d8700794f16a46849"
> },
> {
> "ComponentName": "LiveLogDownload",
> diff --git a/userdata/Manifest/gecko-3-b-win2012.json b/userdata/Manifest/gecko-3-b-win2012.json
> index dfc578e..9b4a0c6 100644
> --- a/userdata/Manifest/gecko-3-b-win2012.json
> +++ b/userdata/Manifest/gecko-3-b-win2012.json
> @@ -1002,7 +1002,7 @@
> ],
> "Source": "https://github.com/taskcluster/generic-worker/releases/download/v10.2.3/generic-worker-windows-amd64.exe",
> "Target": "C:\\generic-worker\\generic-worker.exe",
> - "sha512": "cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e"
> + "sha512": "dceae809dbb3df6d6c6f8ee7d6222cf09a74019e7238010b01bf71db3a2bb266a99e343de14a11bd45cbf94517d544fe5d5699ba0732bc9d8700794f16a46849"
> },
> {
> "ComponentName": "LiveLogDownload",
We see below, that this sha512 is for the wrong version (dceae809... = 10.6.0) of generic-worker (should be 9cc47318... = 10.2.3):
> $ for version in 10.2.3 10.6.0; do curl -s -L "https://github.com/taskcluster/generic-worker/releases/download/v${version}/generic-worker-windows-amd64.exe" > "${version}"; shasum -a512 "${version}"; done
> 9cc47318d60119d2d050040fe34046457ef9b1d83f25cc634badf98d811c302b61c4b2cba811ee25ec80a07f9f88f473dba061541585bf72015aa6ff08f0ae16 10.2.3
> dceae809dbb3df6d6c6f8ee7d6222cf09a74019e7238010b01bf71db3a2bb266a99e343de14a11bd45cbf94517d544fe5d5699ba0732bc9d8700794f16a46849 10.6.0
--
[1] https://github.com/mozilla-releng/OpenCloudConfig/commit/78eb9e9ce5d6b2cd093b8f602b58427efdcea877
Depends on: 1443595
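The check described in comment #5 (hash the downloaded binary, then compare it with the manifest entry) could look roughly like this in Python. This is a minimal sketch, not OCC's validation code: `sha512_hex` and `verify_component` are hypothetical helpers, the payload is a stand-in for a downloaded binary, the URL is a placeholder, and the component dict only mirrors the "Source", "Target" and "sha512" keys visible in the diff.

```python
import hashlib


def sha512_hex(data: bytes) -> str:
    """Return the lowercase hex SHA-512 digest of data."""
    return hashlib.sha512(data).hexdigest()


def verify_component(component: dict, payload: bytes) -> bool:
    """True when the payload's SHA-512 equals the component's "sha512" field."""
    return sha512_hex(payload) == component["sha512"].strip().lower()


# Stand-in bytes for a downloaded generic-worker binary (assumption).
payload = b"pretend this is generic-worker-windows-amd64.exe"

# Hypothetical component entry; the URL is a placeholder, not a real download.
component = {
    "Source": "https://example.invalid/generic-worker-windows-amd64.exe",
    "Target": "C:\\generic-worker\\generic-worker.exe",
    "sha512": sha512_hex(payload),
}

matches = verify_component(component, payload)       # binary matches its entry
empty_matches = verify_component(component, b"")     # a zero-length download
print(matches, empty_matches)
```

Run against the real download, a check like this would report a mismatch whenever the "sha512" value and the "Source" URL describe different releases.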
(In reply to Kendall Libby [:fubar] from comment #2)
> Earlier today pmoore updated OCC to start using generic-worker v10.6.0.
> Shortly after, grenade found that we were running into GitHub TLS issues
> when trying to download the EXEs. He copied them over to tooltool and added
> the sha512 hashes to the manifests in OCC. Inexplicably, he then changed the
> sha512 hashes for gecko-1-b-win2012, gecko-2-b-win2012, and
> gecko-3-b-win2012 with the message "sha512 corrections". Unfortunately, the
> file that matches that hash on tooltool is zero-length, so builds go boom.
>
> I wasn't able to determine WHAT version of generic-worker the "correct" hash
> referred to, so I simply reverted the commit, which puts us back onto
> generic-worker v10.6.0.
>
> NI on grenade for more info on that commit, and what the intention was.
Indeed, that hash (cf83e135....) is the correct sha512 hash for an empty file:
> $ touch empty-file
> $ shasum -a 512 empty-file
> cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e empty-file
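The same empty-file check can be reproduced without touching the filesystem. A minimal Python sketch, where `is_empty_file_digest` is a hypothetical helper name:

```python
import hashlib

# SHA-512 of zero bytes of input, i.e. the digest of an empty file.
EMPTY_SHA512 = hashlib.sha512(b"").hexdigest()


def is_empty_file_digest(digest: str) -> bool:
    """True when digest is the SHA-512 of empty input (case-insensitive)."""
    return digest.strip().lower() == EMPTY_SHA512


print(EMPTY_SHA512)
```

Comparing a manifest value against this constant is a cheap way to spot a hash that was generated from a failed or zero-length download.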
(In reply to Pete Moore [:pmoore][:pete] from comment #5)
> It looks like in this commit[1] generic-worker 10.6.0 has been rolled out to
> gecko-{1,2,3}-b-win2012 rather than generic-worker 10.2.3.
>
> We see below, that this sha512 is for the wrong version (dceae809... =
> 10.6.0) of generic-worker (should be 9cc47318... = 10.2.3):
>
> > $ for version in 10.2.3 10.6.0; do curl -s -L "https://github.com/taskcluster/generic-worker/releases/download/v${version}/generic-worker-windows-amd64.exe" > "${version}"; shasum -a512 "${version}"; done
> > 9cc47318d60119d2d050040fe34046457ef9b1d83f25cc634badf98d811c302b61c4b2cba811ee25ec80a07f9f88f473dba061541585bf72015aa6ff08f0ae16 10.2.3
> > dceae809dbb3df6d6c6f8ee7d6222cf09a74019e7238010b01bf71db3a2bb266a99e343de14a11bd45cbf94517d544fe5d5699ba0732bc9d8700794f16a46849 10.6.0
Except that neither of those hashes is what grenade used in commit 75cceef to OCC (i.e. changing it from
dceae809dbb3df6d6c6f8ee7d6222cf09a74019e7238010b01bf71db3a2bb266a99e343de14a11bd45cbf94517d544fe5d5699ba0732bc9d8700794f16a46849 to
cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e).
The commit messages and bug 1443595 lacked information, and I could not determine what version of generic-worker corresponded to the cf83e135... hash (it's NOT 10.2.3), nor which version we should have been on (75cceef led me to believe that it maybe should have been v10.6.0).
Worse, Rob says that OCC should NOT have made the changes when the manifest was updated, yet it did (and I was under the impression from some of the Moonshot work that it *should* make changes on updates). So I have two conflicting stories, and it unexpectedly broke things.
I also feel that making changes to the manifest that would get applied on the next deploy - and not deploying - is setting up a time bomb, especially when that isn't communicated in a commit message or related bug.
Lastly, as :gps noted in #taskcluster and I agree, the time between commit and detection of the issue is too long (~4 hours). Possible bugs in OCC applying manifests aside, we need to figure out how to verify changes more quickly and/or have a better hand off so that folks in other time zones can more quickly pick things up (without having to wake up those that are already asleep).
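One way to shorten the gap between commit and detection would be a check that runs before a manifest change lands. The sketch below is only an illustration of that idea, not an existing OCC hook: `find_empty_digests` is a hypothetical helper, and the top-level "Components" key and the sample entries are assumptions, since only the "ComponentName", "Source", "Target" and "sha512" keys are visible in this bug.

```python
import hashlib
import json

# Digest of empty input; a manifest entry carrying it points at a zero-length file.
EMPTY_SHA512 = hashlib.sha512(b"").hexdigest()


def find_empty_digests(node, path="$"):
    """Yield the JSON path of every "sha512" value equal to the empty-input digest."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "sha512" and isinstance(value, str) and value.lower() == EMPTY_SHA512:
                yield child
            else:
                yield from find_empty_digests(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_empty_digests(item, f"{path}[{index}]")


# Hypothetical manifest text; the layout is assumed for illustration only.
manifest_text = json.dumps({
    "Components": [
        {"ComponentName": "ExampleBroken", "sha512": EMPTY_SHA512},
        {"ComponentName": "ExampleOk", "sha512": hashlib.sha512(b"binary").hexdigest()},
    ]
})

problems = list(find_empty_digests(json.loads(manifest_text)))
for problem in problems:
    print(f"empty-file sha512 at {problem}")
```

Because the walk is recursive, it does not depend on where in the manifest a "sha512" key appears, which matters here since the real manifest layout is not shown in this bug.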
(In reply to Gregory Szorc [:gps] from comment #3)
> If the move from github.com to S3/tooltool was intended as a temporary
> workaround, my vote is to make it permanent: we want to minimize CI's
> dependency on third party services - including github.com - because this
> will minimize points of failure and increase resiliency of Taskcluster.
Concur; I've checked and tooltool is migrating out of SCL3, so it will continue to be available.
(In reply to Kendall Libby [:fubar] from comment #7)
> (In reply to Pete Moore [:pmoore][:pete] from comment #5)
> > It looks like in this commit[1] generic-worker 10.6.0 has been rolled out to
> > gecko-{1,2,3}-b-win2012 rather than generic-worker 10.2.3.
> >
> > We see below, that this sha512 is for the wrong version (dceae809... =
> > 10.6.0) of generic-worker (should be 9cc47318... = 10.2.3):
> >
> > > $ for version in 10.2.3 10.6.0; do curl -s -L "https://github.com/taskcluster/generic-worker/releases/download/v${version}/generic-worker-windows-amd64.exe" > "${version}"; shasum -a512 "${version}"; done
> > > 9cc47318d60119d2d050040fe34046457ef9b1d83f25cc634badf98d811c302b61c4b2cba811ee25ec80a07f9f88f473dba061541585bf72015aa6ff08f0ae16 10.2.3
> > > dceae809dbb3df6d6c6f8ee7d6222cf09a74019e7238010b01bf71db3a2bb266a99e343de14a11bd45cbf94517d544fe5d5699ba0732bc9d8700794f16a46849 10.6.0
>
> Except that neither of those hashes is what grenade used in commit 75cceef
> to OCC (i.e. changing it from
> dceae809dbb3df6d6c6f8ee7d6222cf09a74019e7238010b01bf71db3a2bb266a99e343de14a11bd45cbf94517d544fe5d5699ba0732bc9d8700794f16a46849 to
> cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e
Indeed, this is why it was broken.
> The commit messages and bug 1443595 lacked information, and I could not
> determine what version of generic-worker corresponded to the cf83e135...
> hash (it's NOT 10.2.3),
I was able to determine that cf83e135... is the SHA512 of an empty file (see https://bugzilla.mozilla.org/show_bug.cgi?id=1443686#c6).
> nor which version we should have been on (75cceef led me to believe that it maybe should have been v10.6.0)
The "Source" URL two lines above the sha512 specifies version, and links to the binary, from which the sha512 can be determined:
commit 78eb9e9ce5d6b2cd093b8f602b58427efdcea877
Author: kendall libby <klibby@mozilla.com>
Date: Tue Mar 6 21:03:06 2018 -0500
Bug 1443686, Bug 1443595 - revert sha512 corrections
The sha512 corrections made in 75cceef point to a zero-length file on
tooltool, breaking builds on gecko-2-b-win2012 and gecko-3-b-win2012.
It's not clear what version of generic-worker that sha relates to, so
rolling back the commit to what I believe is the correct hash for
generic-worker 10.6.0.
diff --git a/userdata/Manifest/gecko-1-b-win2012.json b/userdata/Manifest/gecko-1-b-win2012.json
index dfc578e..9b4a0c6 100644
--- a/userdata/Manifest/gecko-1-b-win2012.json
+++ b/userdata/Manifest/gecko-1-b-win2012.json
@@ -1002,7 +1002,7 @@
],
"Source": "https://github.com/taskcluster/generic-worker/releases/download/v10.2.3/generic-worker-windows-amd64.exe",
"Target": "C:\\generic-worker\\generic-worker.exe",
- "sha512": "cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e"
+ "sha512": "dceae809dbb3df6d6c6f8ee7d6222cf09a74019e7238010b01bf71db3a2bb266a99e343de14a11bd45cbf94517d544fe5d5699ba0732bc9d8700794f16a46849"
},
{
"ComponentName": "LiveLogDownload",
> Worse, Rob says that OCC should NOT have made the changes when the manifest
> was updated, yet it did (and I was under the impression from some of the
> Moonshot work that it *should* make changes on updates). So I have two
> conflicting stories, and it unexpectedly broke things.
I've checked, and can confirm that the AMIs were not updated but indeed it seems the change did propagate (otherwise we would have not had an issue).
Worker type definitions of gecko-{1,2,3}-b-win2012 confirm that the worker type definition was not updated, since they refer to the previous deployment from three weeks ago: "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/137c8c1b0e4b3927f15cf38ee4f9771894818221/userdata/Manifest/gecko-1-b-win2012.json"
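The commit a worker type is pinned to can be read straight out of that "manifest" URL and compared with the commit under suspicion. A minimal sketch, assuming the URL always has the `github.com/<owner>/<repo>/blob/<40-hex-commit>/<path>` shape shown above; `pinned_commit` is a hypothetical helper:

```python
import re

# Matches a GitHub blob URL pinned to a full 40-character commit hash.
MANIFEST_URL = re.compile(
    r"^https://github\.com/(?P<repo>[^/]+/[^/]+)/blob/(?P<commit>[0-9a-f]{40})/(?P<path>.+)$"
)


def pinned_commit(url: str):
    """Return the commit hash embedded in a manifest URL, or None if absent."""
    match = MANIFEST_URL.match(url)
    return match.group("commit") if match else None


# URL quoted from the worker type definition in this comment.
manifest_url = (
    "https://github.com/mozilla-releng/OpenCloudConfig/blob/"
    "137c8c1b0e4b3927f15cf38ee4f9771894818221/userdata/Manifest/gecko-1-b-win2012.json"
)
# Revert commit quoted earlier in this bug.
revert_commit = "78eb9e9ce5d6b2cd093b8f602b58427efdcea877"

commit = pinned_commit(manifest_url)
print(commit, commit == revert_commit)
```

A pinned commit that differs from the recent manifest commits supports the observation that the worker type definition itself was not updated, even though the change still reached the workers.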
(In reply to Kendall Libby [:fubar] from comment #8)
> (In reply to Gregory Szorc [:gps] from comment #3)
> > If the move from github.com to S3/tooltool was intended as a temporary
> > workaround, my vote is to make it permanent: we want to minimize CI's
> > dependency on third party services - including github.com - because this
> > will minimize points of failure and increase resiliency of Taskcluster.
>
> Concur; I've checked and tooltool is migrating out of SCL3, so it will
> continue to be available.
I believe when things are working correctly, downloads already get cached in tooltool automatically[1], so this should not be an issue. The root cause of the github download failures was the requirement to use TLS 1.2 with github - bug 1443595 has a fix for this.
--
[1] https://github.com/mozilla-releng/OpenCloudConfig/blob/master/ci/update-tooltool-repo.sh
(In reply to Pete Moore [:pmoore][:pete] from comment #9)
> (In reply to Kendall Libby [:fubar] from comment #7)
> I was able to determine that cf83e135... is the SHA512 of an empty file (see
> https://bugzilla.mozilla.org/show_bug.cgi?id=1443686#c6).
>
> > nor which version we should have been on (75cceef led me to believe that it maybe should have been v10.6.0)
>
> The "Source" URL two lines above the sha512 specifies version, and links to
> the binary, from which the sha512 can be determined:
Agreed, but the prior commit had changed the sha512 hash to v10.6.0 without also changing the Source line, so I wasn't able to infer if 10.2.3 was correct or not. In any case, I think we're agreeing with each other.
> > Worse, Rob says that OCC should NOT have made the changes when the manifest
> > was updated, yet it did (and I was under the impression from some of the
> > Moonshot work that it *should* make changes on updates). So I have two
> > conflicting stories, and it unexpectedly broke things.
>
> I've checked, and can confirm that the AMIs were not updated but indeed it
> seems the change did propagate (otherwise we would have not had an issue).
*nod* And it's not clear what the correct behavior is, so we can sort that out and either fix the bug or correct our understanding (I can see it being both very useful and not). The rest is just developing better processes!
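The ambiguity described here (a sha512 that changed while the "Source" line stayed the same) is something a review-time check could flag. The following is a rough sketch only: `hash_only_changes` is a hypothetical helper, components are matched by "Target" as an assumption, and the digests are placeholder strings rather than real hashes.

```python
def hash_only_changes(old_components, new_components):
    """Return Targets whose sha512 changed while their Source stayed identical."""
    old_by_target = {c["Target"]: c for c in old_components}
    flagged = []
    for component in new_components:
        before = old_by_target.get(component["Target"])
        if before is None:
            continue  # newly added component, nothing to compare against
        same_source = before["Source"] == component["Source"]
        new_hash = before["sha512"] != component["sha512"]
        if same_source and new_hash:
            flagged.append(component["Target"])
    return flagged


source = "https://github.com/taskcluster/generic-worker/releases/download/v10.2.3/generic-worker-windows-amd64.exe"
target = "C:\\generic-worker\\generic-worker.exe"

# Placeholder digests (assumption); real entries would hold 128-character hex strings.
old = [{"Source": source, "Target": target, "sha512": "digest-before"}]
new = [{"Source": source, "Target": target, "sha512": "digest-after"}]

flagged = hash_only_changes(old, new)
print(flagged)
```

A hit is not necessarily an error, since a genuine hash correction would also trigger it, but it marks the change as one that deserves an explanation in the commit message.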
the commit with the sha "correction" was due to the fact that in the previous commit i had accidentally copy pasted the sha hash for gw 10.6 from the beta worker type. this was unintentional. i hadn't been trying to upgrade gw to 10.6 i just wanted tooltool hashes for the existing versions in order to mitigate the tls issue. the problem was caused because i somehow generated an incorrect hash for what i thought was gw 10.2.3.
Flags: needinfo?(rthijssen)
(In reply to Rob Thijssen (:grenade UTC+2) from comment #12)
> the commit with the sha "correction" was due to the fact that in the
> previous commit i had accidentally copy pasted the sha hash for gw 10.6 from
> the beta worker type. this was unintentional. i hadn't been trying to
> upgrade gw to 10.6 i just wanted tooltool hashes for the existing versions
> in order to mitigate the tls issue. the problem was caused because i somehow
> generated an incorrect hash for what i thought was gw 10.2.3.
Interesting. When I generated a sha512 hash for v10.2.3 it didn't match what you had added, which is why I just backed it out. I wonder what went wrong!
Severity: blocker → normal
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
Component: AWS-Provisioner → Services