Closed
Bug 1480368
Opened 7 years ago
Closed 7 years ago
patch occ to run a sha512 checksum validation against downloaded files whose sha hash is known
Categories
(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: pmoore, Assigned: grenade)
Details
Attachments
(1 file)
|
58 bytes,
patch
|
pmoore
:
review+
|
Details | Diff | Splinter Review |
In the following commit, we had a FileDownload specified with a sha512 pointing to a valid file in tooltool:
https://github.com/mozilla-releng/OpenCloudConfig/commit/6e1fdc4f4184116128bdc8e7bc64db8d4b528bb6
We know this was valid, because the correct file was downloaded on all worker types except for gecko-t-win10-64-cu.
On the gecko-t-win10-64-cu worker type, it created a 0 byte file:
> Microsoft Windows [Version 10.0.15063]
> (c) 2017 Microsoft Corporation. All rights reserved.
>
> C:\Users\Administrator>dir C:\Windows\Temp\NSSMInstall.zip
> Volume in drive C has no label.
> Volume Serial Number is C229-AA5B
>
> Directory of C:\Windows\Temp
>
> 08/01/2018 02:20 PM 953,735 NSSMInstall.zip
> 1 File(s) 953,735 bytes
> 0 Dir(s) 103,322,591,232 bytes free
>
> C:\Users\Administrator>
On the other worker types, it created a 953,735 byte file (which is correct). For example, this is on a gecko-t-win10-64-beta worker type:
Microsoft Windows [Version 10.0.15063]
(c) 2017 Microsoft Corporation. All rights reserved.
C:\Users\Administrator>dir C:\Windows\Temp\NSSMInstall.zip
Volume in drive C has no label.
Volume Serial Number is C229-AA5B
Directory of C:\Windows\Temp
08/01/2018 02:19 PM 953,735 NSSMInstall.zip
1 File(s) 953,735 bytes
0 Dir(s) 103,470,784,512 bytes free
C:\Users\Administrator>
Since the same sha512 is used in gecko-t-win10-64-beta and gecko-t-win10-64-cu this should have resulted in the same file being downloaded to C:\Windows\Temp\NSSMInstall.zip
So it seems that sometimes the file download step can fail, and leave a 0 byte file on the file system. This then isn't redownloaded on the next DSC run, presumably because the existence of the file is enough to satisfy the verification check that the file has been downloaded. Probably it should also check the sha512 value of the file, if provided.
I'm not sure where to find the logs to find out what the failure was in the NSSMDownload run, but the generated AMIs may hold the answer. They are:
eu-central-1: ami-53ada0b8
us-east-1: ami-8d1e0ef2
us-east-2: ami-6b84be0e
us-west-1: ami-3b779b58
us-west-2: ami-f572578d
| Reporter | ||
Comment 1•7 years ago
|
||
STOP THE PRESS!
It looks like either I copy/pasted from the wrong shell above, or something else is amiss - since clearly in the first console output we see the file is there, and does have the right file size - let me double check I copied/pasted from the right shell...
| Reporter | ||
Comment 2•7 years ago
|
||
In C:\Users\ADMINI~1\AppData\Local\Temp\atom-11872-7616-p6yrra.8bx4\20180801083907.userdata-run.zip\20180801084355-NSSMInstall-stderr.log I see:
> ERROR: The system cannot find the file specified.
> C:\Windows\Temp\NSSMInstall.zip
>
>
>
> System ERROR:
> The system cannot find the file specified.
However, this seemed to recover in a later userdata run, since folder C:\nssm-2.24-103-gdee49fc\ exists on the file system.
However, the workers are still not taking jobs - which I am investigating.
The first worker I looked at had a 0 byte file in C:\Windows\Temp\NSSMInstall.zip however, this seems to no longer be the case, so it looks like at least some of the workers have the correct file.
| Reporter | ||
Comment 3•7 years ago
|
||
(the first worker I looked at also did not have the folder C:\nssm-2.24-103-gdee49fc\)
| Reporter | ||
Comment 4•7 years ago
|
||
This is what I see in the GenericWorkerInstall step logs:
> 2018/08/01 08:43:56 Making system call GetProfilesDirectoryW with args: [0 C0423C36F8]
> 2018/08/01 08:43:56 Result: 0 0 The data area passed to a system call is too small.
> 2018/08/01 08:43:56 Making system call GetProfilesDirectoryW with args: [C0423E7EC0 C0423C36F8]
> 2018/08/01 08:43:56 Result: 1 0 The operation completed successfully.
> 2018/08/01 08:43:56 Command args: []string{"C:\\generic-worker\\generic-worker.exe", "install", "service", "--nssm", "C:\\nssm-2.24-103-gdee49fc\\win64\\nssm.exe", "--config", "C:\\generic-worker\\generic-worker.config"}
> 2018/08/01 08:43:56 Running command: 'C:\nssm-2.24-103-gdee49fc\win64\nssm.exe' 'install' 'Generic Worker' 'C:\generic-worker\run-generic-worker.bat'
> 2018/08/01 08:43:56 exec: "C:\\nssm-2.24-103-gdee49fc\\win64\\nssm.exe": file does not exist
> 2018/08/01 08:43:56 Error installing generic worker:
> 2018/08/01 08:43:56 &exec.Error{Name:"C:\\nssm-2.24-103-gdee49fc\\win64\\nssm.exe", Err:(*errors.errorString)(0xc042062230)}
So for some reason, the GenericWorkerInstall step tried to run, even though C:\nssm-2.24-103-gdee49fc\win64\nssm.exe did not exist. It managed to create C:\generic-worker\run-generic-worker.bat but got stuck installing the service.
The reason it failed was because NSSM had not been installed, which is because NSSM did not get downloaded (see comment 2).
However, since it partially installed generic-worker (creating the C:\generic-worker\run-generic-worker.bat file), future DSC runs considered that generic-worker had been installed, since the validation checks are:
> "Validate": {
> "PathsExist": [
> "C:\\generic-worker\\run-generic-worker.bat",
> "C:\\generic-worker\\generic-worker.exe"
> ],
> "CommandsReturn": [
> {
> "Command": "C:\\generic-worker\\generic-worker.exe",
> "Arguments": [
> "--version"
> ],
> "Like": "generic-worker 10.11.1 *"
> }
> ]
> }
That meant, the worker could never recover, since it never tried to reinstall generic-worker, believing it already to be successfully installed.
So the core problem here is - why did GenericWorkerInstall step run, when it depended on NSSMInstall step, which failed?
The NSSMInstall step has a validation that checks for the existence of C:\nssm-2.24-103-gdee49fc\win64\nssm.exe and so GenericWorkerInstall step should not have been run.
So to summarize the sequence of events: NSSMDownload step fails, GenericWorkerInstall step runs (despite depending on NSSMInstall which depends on NSSMDownload), and half-installs generic-worker due to missing NSSM dependency. Then GenericWorkerInstall step never runs again, and worker can't recover. At some point later NSSMDownload/NSSMInstall steps do run again and are successful, but since GenericWorkerInstall step never runs again, the worker never recovers.
This bug is to fix that OCC/DSC steps should not run if they depend on OCC/DSC steps whose validation fails.
Summary: FileDownload resulted in 0 byte file on file system, which prevented subsequent DSC runs from succeeding → OCC steps should not run if they depend on OCC steps whose validation fails
| Assignee | ||
Comment 5•7 years ago
|
||
dsc is only able to perform validations that it is told about.
the download step did not have any validations specified. we could probably add a validation that checks for a sha match, but in the case described here, there was no such validation specified so occ and dsc behaved as expected.
i don't know why a zero byte file was downloaded and it sounds like a good idea for the download step to check for this so that's something we'll need to add to occ.
the reason subsequent steps ran is because it wasn't obvious that something had failed. no validation was specified for the download.
occ steps will already only run if the steps they depend on have run successfully. the problem was that the pass criteria for the download was simply for a file to exist after the download had completed. i accept that this criteria should probably also include a check for more than a zero byte file. i just want to point out that the assumption that subsequent steps ran after a failed step, is misleading.
| Reporter | ||
Comment 6•7 years ago
|
||
(In reply to Rob Thijssen (:grenade UTC+2) from comment #5)
> the problem was that the pass criteria for the download was
> simply for a file to exist after the download had completed. i accept that
> this criteria should probably also include a check for more than a zero byte
> file.
Are you saying that when we specify the sha512 of a file to be downloaded in a manifest, it *isn't* used to validate the downloaded content? And that each download file entry should independently implement this functionality with its own custom validation step?
Why wouldn't OCC do this automatically?
Flags: needinfo?(rthijssen)
| Assignee | ||
Comment 7•7 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> (In reply to Rob Thijssen (:grenade UTC+2) from comment #5)
> Are you saying that when we specify the sha512 of a file to be downloaded in
> a manifest, it *isn't* used to validate the downloaded content?
yes. it's only used to construct the url for the file to be downloaded from tooltool which uses sha512 rather than filenames
> And that
> each download file entry should independently implement this functionality
> with its own custom validation step?
actually i said that could (rather than should) be done. i also said that it sounds like a good idea to implement something to perform a check of this nature.
> Why wouldn't OCC do this automatically?
mostly because it's a feature that has never been implemented. generally speaking, we'd have to write code to do that rather than just expect it to automatically happen.
Flags: needinfo?(rthijssen)
| Assignee | ||
Comment 8•7 years ago
|
||
changing bug summary to reflect the patch we'll make to avoid the confusion referenced above
Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
OS: Unspecified → Windows
Summary: OCC steps should not run if they depend on OCC steps whose validation fails → patch occ to run a sha512 checksum validation against downloaded files whose sha hash is known
| Assignee | ||
Comment 9•7 years ago
|
||
this pr implements the following logic: if a sha512 was specified for tooltool download of a FileDownload component, validate the downloaded file has a sha512 hash matching the one in the manifest and use the result for the dsc validation of component run success
Attachment #9005563 -
Flags: review?(pmoore)
| Reporter | ||
Comment 10•7 years ago
|
||
Comment on attachment 9005563 [details] [diff] [review]
https://github.com/mozilla-releng/OpenCloudConfig/pull/172
Review of attachment 9005563 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks Rob!
Attachment #9005563 -
Attachment is patch: true
Attachment #9005563 -
Attachment mime type: text/x-github-pull-request → text/plain
Attachment #9005563 -
Flags: review?(pmoore) → review+
| Assignee | ||
Comment 11•7 years ago
|
||
merged: https://github.com/mozilla-releng/OpenCloudConfig/commit/44caa40a45720ad23dd85addd88b680e7787e049
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•