Bug 1145774 (Closed) - Opened 6 years ago, Closed 5 years ago

figure out how to do release av scans when we move to s3

Categories: Release Engineering :: Release Automation: Other (defect)
Platform: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: bhearsum; Assigned: nthomas

Attachments

(11 files, 6 obsolete files)

19.24 KB, patch (rail: review+)
53 bytes, text/x-github-pull-request (dustin: review+)
487 bytes, patch (nthomas: review+)
1.48 KB, patch (rail: review+; dustin: feedback+; nthomas: checked-in+)
841 bytes, patch (rail: review+; nthomas: checked-in+)
7.29 KB, patch (jlund: review+; nthomas: checked-in+)
1.07 KB, patch (rail: review+; nthomas: checked-in+)
3.50 KB, patch (rail: review+)
9.43 KB, patch (nthomas: review+; nthomas: checked-in+)
1.21 KB, patch (Callek: review+)
53 bytes, text/x-github-pull-request (rail: review+)
We currently run clamd on upload1...that's not going to work with s3. We should also make sure we do something that doesn't process all of the release artifacts serially.
One idea (that I think someone else came up with) is to monitor the release bucket(s) for new files and scan them as they come in. This would take the AV scan out of the critical path and potentially save tons of time during release automation.

Something mshal suggested today was to upload artifacts to the nightly bucket initially, and then have the AV scan process them and put them in the release bucket. This might solve the retention issue at the same time, and could probably be done in combination with the above. Depending on what happens in bug 1147142 this may not be possible. E.g.: if we serve files directly off of s3, we may need to know their final location at build time to have balrog submission work.
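
For illustration only, a minimal sketch of that scan-then-promote flow using boto 2 (the bucket names and the clamdscan invocation here are assumptions, not what release automation eventually did):

import subprocess
import tempfile

import boto

# Illustrative bucket names; the real ones would come from release config.
NIGHTLY_BUCKET = "net-mozaws-stage-delivery-nightly"
RELEASE_BUCKET = "net-mozaws-stage-delivery-firefox"

def scan_and_promote(prefix):
    """Scan every object under `prefix` in the nightly bucket with clamd,
    and copy it into the release bucket only if the scan comes back clean."""
    conn = boto.connect_s3()
    src = conn.get_bucket(NIGHTLY_BUCKET)
    for key in src.list(prefix=prefix):
        with tempfile.NamedTemporaryFile(suffix="-av") as f:
            key.get_contents_to_filename(f.name)
            # clamdscan exits 0 for clean, 1 for infected, 2 on errors.
            rc = subprocess.call(["clamdscan", "--no-summary", f.name])
        if rc != 0:
            raise Exception("AV scan failed for %s (rc=%d)" % (key.name, rc))
        key.copy(RELEASE_BUCKET, key.name)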
Blocks: 1147140
http://aws.amazon.com/lambda/ is a perfect match for this job! :)
(In reply to Rail Aliiev [:rail] from comment #2)
> http://aws.amazon.com/lambda/ is a perfect match for this job! :)

If we go this route we're going to have to think about how to make sure we actually verify that everything has passed AV prior to releasing. The fact that AV is a node in the release DAG right now ensures that. If we're doing something disconnected from the automation we'll need something else. It doesn't necessarily have to be part of the DAG - it could be as simple as something that makes sure the AV lambda jobs aren't stuck, or anything along those lines.
(In reply to Rail Aliiev [:rail] from comment #2)
> http://aws.amazon.com/lambda/ is a perfect match for this job! :)

I'm not so sure Lambda is a good fit after looking at it. AFAICT you can only run JS code, and don't have control over what's installed on the compute platform. Given that, I'm not sure how we'd be able to run clamd or any other AV scan on it. Am I missing something? The docs suggest that Beanstalk might be an alternative. Given that Beanstalk is web-app centric, I suppose we might be able to make use of it if we implement antivirus-as-a-web-service. I'm not sure where the glue that makes requests to such a service after each s3 upload would live, though.
Yeah, Lambda is not ready yet. ATM it's JS/node-only and limited to 60s per task.

We can use TC with custom scheduler listening for S3 PUT events.
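
A hedged sketch of what "listening for S3 PUT events" could look like, assuming boto3 and an SQS queue that is already permitted to receive the bucket's notifications (all names and ARNs below are placeholders):

import json
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/av-scan-events"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:av-scan-events"
BUCKET = "net-mozaws-stage-delivery-firefox"

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# One-time setup: publish ObjectCreated (PUT etc.) events to the SQS queue.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "QueueConfigurations": [{
            "QueueArn": QUEUE_ARN,
            "Events": ["s3:ObjectCreated:*"],
        }]
    },
)

# The "custom scheduler" loop: poll for events, create a scan task per object.
while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20,
                               MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        # s3:TestEvent messages have no "Records" key, hence the .get().
        for record in json.loads(msg["Body"]).get("Records", []):
            key = record["s3"]["object"]["key"]
            print("would create a TaskCluster scan task for", key)
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])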
I was chatting with Pete to get some additional TC understanding and we ended up talking about this. Now that I have a better understanding of things, I think you're right that TC would make a lot of sense. We could do something like:
* Docker Worker w/ image that has clam installed to do scanning (we'd want to make sure that they know how to update their definitions though)
* External process that creates task graphs in response to things being put into the release bucket (I think this is what you mean by "custom scheduler")

The trick is going to be knowing when all of the AV scans are done. Pete suggested that we could keep extending the task graph as deliverables come in, which would let us watch the task queue for completion. I don't think that will work because there's not a constant flow of them - you could have gaps where you've scanned the existing ones but are still waiting for more to be generated, and AFAIK there's no way to have gaps like that in the task graph.

Rail and I talked about this a bit more and I think we can get what we need by setting extra metadata on the objects in s3 when AV completes, and have a Scheduler or job in Buildbot that makes sure all of the objects with a certain prefix have that metadata set before "ready for release" fires.

There's still a bit of trickiness in making sure you don't start scanning the s3 bucket too soon because partial MARs will soon be moved to Funsize - and Buildbot won't know when those jobs complete. We're not 100% sure what to do here yet, but we talked about the idea of another job in Buildbot that is fed the task ids for the partials, and watches those queues. Once all of those complete, and any jobs from Buildbot that generate deliverables, you could start looking for the antivirus metadata.
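
To make the extra-metadata idea above concrete, a sketch in boto 2 (the "av-scanned" metadata key is a made-up name): S3 object metadata can't be edited in place, so marking means copying the object onto itself, and the readiness check is a listing that confirms every object under the release prefix carries the marker.

import boto

AV_FLAG = "av-scanned"  # hypothetical metadata key name

def mark_scanned(bucket, keyname):
    """Record a clean scan. S3 metadata is immutable in place, so we copy
    the object onto itself with the extra metadata merged in."""
    key = bucket.get_key(keyname)  # HEAD request; populates key.metadata
    metadata = dict(key.metadata)
    metadata[AV_FLAG] = "true"
    key.copy(bucket.name, keyname, metadata=metadata, preserve_acl=True)

def ready_for_release(bucket, prefix):
    """The check a scheduler/Buildbot job could run before 'ready for
    release' fires: every object under the prefix has been marked."""
    return all(
        bucket.get_key(key.name).get_metadata(AV_FLAG) == "true"
        for key in bucket.list(prefix=prefix)
    )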
(In reply to Ben Hearsum [:bhearsum] from comment #6)
> There's still a bit of trickiness in making sure you don't start scanning
> the s3 bucket too soon because partial MARs will soon be moved to Funsize -
> and Buildbot won't know when those jobs complete. We're not 100% sure what
> to do here yet, but we talked about the idea of another job in Buildbot that
> is fed the task ids for the partials, and watches those queues. Once all of
> those complete, and any jobs from Buildbot that generate deliverables, you
> could start looking for the antivirus metadata.

The buildbot<->tc bridge from bug 1135192 might be what we need here...
(In reply to Ben Hearsum [:bhearsum] from comment #7)
> (In reply to Ben Hearsum [:bhearsum] from comment #6)
> > There's still a bit of trickiness in making sure you don't start scanning
> > the s3 bucket too soon because partial MARs will soon be moved to Funsize -
> > and Buildbot won't know when those jobs complete. We're not 100% sure what
> > to do here yet, but we talked about the idea of another job in Buildbot that
> > is fed the task ids for the partials, and watches those queues. Once all of
> > those complete, and any jobs from Buildbot that generate deliverables, you
> > could start looking for the antivirus metadata.
> 
> The buildbot<->tc bridge from bug 1135192 might be what we need here...

Talked about this with Catlee a bit yesterday. Turns out that the buildbot bridge only works if ALL of the scheduling is in taskcluster, so it won't help us out unless we block this on moving that there. We want to do that eventually anyway, so I'm going to see how much effort it would take. If we can do that quickly enough, we don't need an external watcher, or even extra metadata on the s3 objects. If we can't do it quickly enough, we should stick with comment #6 and no buildbot bridge for now, unless someone comes up with something better in the meantime.
Depends on: 1150162
See Also: → beetmover
No longer depends on: 1150162
<rail> so we are basically changing http://hg.mozilla.org/build/buildbotcustom/file/2bfdd8a7fea9/process/release.py#l1432 and the script
<rail> also we need to spin up this builder, using a golden AMI for this type of builder which runs freshclam on boot
<nthomas> sounds good. I can work on the script part
<rail> I'll work on puppet (extend existing build.pp, add a cronjob for golden ami), slavealloc (add slaves), cloud-tools (new type configs)
<nthomas> bug 1174145 has a nice template for the download from s3, using multiprocessing
NI myself just in case. :)
Flags: needinfo?(rail)
Attached patch av-linux64-puppet.diff (obsolete) — Splinter Review
* not tested
* this is basically our regular build slave + clamd
* the freshclam run is "optimized" (@reboot cron) for slave-like instances that don't live too long. We could also run it as a daemon, but that would require an initscript and probably a config file
Attachment #8649584 - Flags: feedback?(bugspam.Callek)
Comment on attachment 8649584 [details] [diff] [review]
av-linux64-puppet.diff

Review of attachment 8649584 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/clamav/manifests/daemon.pp
@@ +15,5 @@
> +        require => Class["packages::clamd"];
> +    }
> +    file {
> +      "/etc/crond.d/freshclam":
> +        content => "@reboot root /usr/bin/freshclam --quiet --daemon-notify\n";

should the exec use --quiet? (said as someone who has never manually run freshclam)

::: modules/packages/manifests/clamd.pp
@@ +6,5 @@
> +    case $::operatingsystem {
> +        CentOS: {
> +            package {
> +                "clamd":
> +                    ensure => latest;

I'm curious, how are we trying to ensure clamd is actually "new enough"? It's a program where newer versions tend to matter more, but CentOS is generally "no new versions unless for critical sec reasons, without feature changes". Are we planning on building newer versions on our end?

(I do note, I'm generally against |ensure=>latest;| but it's probably the right idea for clamd)

::: modules/toplevel/manifests/slave/releng/build/av.pp
@@ +3,5 @@
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +class toplevel::slave::releng::build::av inherits toplevel::slave::releng::build::mock {
> +    include clamav::daemon
> +}

If I'm reviewing the final patch, I probably want a chat about the goals of this instance/actual implementation. Because as it stands I can't imagine why we need/want "mock" on here, nor why we need to populate the shared repos.
Attachment #8649584 - Flags: feedback?(bugspam.Callek) → feedback+
(In reply to Justin Wood (:Callek) from comment #13)
> should the exec use --quiet? (said as someone who has never manually run
> freshclam)

It doesn't really matter because it will be run only once when a golden AMI is created.
 
> I'm curious, how are we trying to ensure clamd is actually "new enough"?
> It's a program where newer versions tend to matter more, but CentOS is
> generally "no new versions unless for critical sec reasons, without feature
> changes". Are we planning on building newer versions on our end?

We use whatever version we have in the repos. If we need something fresher, we can add a repo for this.

> If I'm reviewing the final patch, I probably want a chat about the goals of
> this instance/actual implementation. Because as it stands I can't imagine
> why we need/want "mock" on here, nor why we need to populate the shared
> repos.

I thought about "forking" mock.pp, but given the scope of the project (it will be replaced) I didn't want to bother myself. :)
Attached patch av-linux64-puppet-1.diff (obsolete) — Splinter Review
I reworked the initial patch to make freshclam more robust:
* no more cronjobs to update the db
* added an initscript to run freshclam as a daemon. It will update the DB on boot and will be checking for updates hourly (Checks 24)
* modified vanilla freshclam.conf to:
  * use PID
  * notify daemon on update
* "subscribe" the initial freshclam to package installation, so we run it only once, not on every puppet run

Technically freshclam depends on clamav (not clamd), but in this case I didn't want to add extra code.
Attachment #8649584 - Attachment is obsolete: true
Attachment #8651107 - Flags: review?(dustin)
Comment on attachment 8651107 [details] [diff] [review]
av-linux64-puppet-1.diff

Review of attachment 8651107 [details] [diff] [review]:
-----------------------------------------------------------------

Good to go once the toplevel inheritance is straightened out

::: modules/toplevel/manifests/slave/releng/build/av.pp
@@ +1,5 @@
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +class toplevel::slave::releng::build::av inherits toplevel::slave::releng::build::mock {

If this inherits from mock, then it should be `toplevel::slave::releng::build::mock::av`
Attachment #8651107 - Flags: review?(dustin) → review+
Agree, toplevel::slave::releng::build::mock::av makes more sense. Renamed. Carry over r+
Attachment #8651107 - Attachment is obsolete: true
Attachment #8651135 - Flags: review+
Attachment #8651136 - Flags: review?(dustin)
Attachment #8651136 - Flags: review?(dustin) → review+
Nick, let me know when you want to land this.

Next steps:

* Add new slave type to production.py in bb configs
* change release.py to use this slave type
* land puppet and cloud-tools changes
* wait (or force) for a new golden AMI (I added it to DNS and created a new IAM Role (a copy of bld-linux64))
* reconfig
* add new slaves to slavealloc

Anything else?
I was going to land the patches above today, but it's probably safer to wait until we ship this chemspill. There is no rush in any case.
Comment on attachment 8651136 [details] [review]
Add av-linux64 to build-tools

merged
Attachment #8651136 - Flags: checked-in+
A small tweak to ensure that clamd is running, so the initial freshclam's notify doesn't barf.
Attachment #8653105 - Flags: review?(nthomas)
Attachment #8653105 - Flags: review?(nthomas) → review+
I forced golden AMI generation and it worked fine. Let's see how it goes overnight.
Rail, I had a go at spinning up an instance for staging but wondered if we need configs/dev-av-linux64 to do that properly. What would you recommend?
Attached patch [gecko] antivirus script, wip (obsolete) — Splinter Review
I've been testing this on a dev-linux64-ec2 with clamd retrofitted, example call:

python2.7 scripts/release/antivirus.py --product firefox --version 41.0 --build-number 1 --bucket-name net-mozaws-stage-delivery-firefox -d 10 -s 4 --tools-revision FIREFOX_41_0_RELEASE --tools-repo https://hg.mozilla.org/users/stage-ffxbld/tools

Runs for ~2 minutes for a 4 locale subset, ~1 minute of that is scanning. Need to fix the Windows installers though, e.g.
04:33:09     INFO -  Command failed for the following files:
04:33:09     INFO -    /home/cltbld/av-dev/scripts/cache/win32/ja/Firefox Setup Stub 41.0.exe
04:33:09     INFO -    /home/cltbld/av-dev/scripts/cache/win32/de/Firefox Setup Stub 41.0.exe
and some of the scanning is suspiciously fast.

It shares a lot with bug 1181542 and bug 1174145. If this code were going to live a long time, it would make sense to have a stage-tasks.py in mozharness.
(In reply to Nick Thomas [:nthomas] from comment #27)
> Rail, I had a go at spinning up an instance for staging but wondered if we
> need configs/dev-av-linux64 to do that properly. What would you recommend?

I think so. At least you don't need to install/tweak clamd in this case.
Comment on attachment 8654808 [details] [diff] [review]
[gecko] antivirus script, wip

Review of attachment 8654808 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mozharness/scripts/release/antivirus.py
@@ +129,5 @@
> +        so that we can unpack various files for clam to scan them."""
> +        remote_file = "{}/raw-file/{}/stage/extract_and_run_command.py".format(self.config["tools_repo"],
> +                                                                               self.config["tools_revision"])
> +        self.download_file(remote_file, file_name="extract_and_run_command.py")
> +        remote_file = "{}/raw-file/{}/buildfarm/utils/mar.py".format(self.config["tools_repo"],

maybe use https://pypi.python.org/pypi/mar instead?
BTW, I just stumbled upon https://github.com/bloomreach/s4cmd, maybe it's worth a try:

* Multi-threaded/multi-connection implementation for enhanced performance on all commands.
* recursive get
* md5 check of content
* wildcard support
You were right about needing p7zip, which extract_and_run_command.py uses to unpack our windows installers. This is an untested attempt to add it, but I'm not sure if anchors should be used in p7zip.pp.
Attachment #8656964 - Flags: feedback?(rail)
Attached patch [gecko] antivirus script (obsolete) — Splinter Review
This is working on a dev slave (retrofitted with clam & p7zip). Not quite sure how to spin up one of the new instances properly, so that's TODO.

Thanks for the suggestions, rail. I swapped to catlee's mar package, unfortunately it still emits
22:24:47     INFO -  Exception IOError: 'File not open for writing' in <bound method BZ2MarFile.__del__ of <mar.BZ2MarFile instance at 0x1df50e0>> ignored

s4cmd doesn't support exclusions, which is a shame. I think I'll just stick to brute forcing download perf with high request parallelism.
Attachment #8654808 - Attachment is obsolete: true
Comment on attachment 8656964 [details] [diff] [review]
[puppet] Add p7zip package

Anchors are only needed when a custom class calls another custom class, as Puppet doesn't otherwise draw the right arrows in the dependency graph.  `package` is a built-in, so no need for an anchor.
Attachment #8656964 - Flags: feedback?(rail) → feedback+
Rail, could you please verify the p7zip patch? I spun up an instance based on the av AMI, but my puppet-foo for getting happy certs has evaporated.

TODO
> * add new slaves to slavealloc
> * Add new slave type to production.py in bb configs
> * change release.py to use this slave type

* patch for release.py to call the new script (like attachment 8657074 [details] [diff] [review] but different args)
* land p7zip fix and get it into the golden AMI
* test staging release against production VM
NI myself
Flags: needinfo?(rail)
Comment on attachment 8656964 [details] [diff] [review]
[puppet] Add p7zip package

Ship it!

I created an instance by:
  cd /builds/aws_manager/cloud-tools/scripts
  /builds/aws_manager/bin/python aws_create_instance.py -c /builds/aws_manager/cloud-tools/configs/av-linux64 -r us-east-1 -s aws-releng -k /builds/aws_manager/secrets/aws-secrets.json --ssh-key /home/buildduty/.ssh/aws-ssh-key -i /builds/aws_manager/cloud-tools/instance_data/us-east-1.instance_data_prod.json --ignore-subnet-check av-linux64-ec2-golden

Then logged in and ran puppet against my env:
  puppet agent --test --server releng-puppet2.srv.releng.scl3.mozilla.com --environment raliiev

Verified by:

# rpm -q p7zip
p7zip-9.20.1-2.el6.x86_64
# rpm -e p7zip
# rpm -e p7zip
error: package p7zip is not installed
# puppet agent --test --server releng-puppet2.srv.releng.scl3.mozilla.com --environment raliiev                                        
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for av-linux64-ec2-golden.build.releng.use1.mozilla.com
Warning: The package type's allow_virtual parameter will be changing its default value from false to true in a future release. If you do not want to allow virtual packages, please explicitly set allow_virtual to false.
   (at /usr/lib/ruby/site_ruby/1.8/puppet/type/package.rb:430:in `default')
Info: Applying configuration version 'de152ef94d46+'
Notice: /Stage[main]/Packages::P7zip/Package[p7zip]/ensure: created
Notice: Finished catalog run in 29.71 seconds
# rpm -q p7zip
p7zip-9.20.1-2.el6.x86_64
Flags: needinfo?(rail)
Attachment #8656964 - Flags: review+
Comment on attachment 8656967 [details] [diff] [review]
[gecko] antivirus script

Tested with Firefox and Thunderbird, on a modded up build slave.

TODO - buildbotcustom patch to drive this, and verify av-linux64 AMI is good.
Attachment #8656967 - Flags: review?(rail)
Comment on attachment 8656967 [details] [diff] [review]
[gecko] antivirus script

Let me redirect this jlund's way - he is the pro! :)
Attachment #8656967 - Flags: review?(rail) → review?(jlund)
Comment on attachment 8656967 [details] [diff] [review]
[gecko] antivirus script

Review of attachment 8656967 [details] [diff] [review]:
-----------------------------------------------------------------

sweet. so this will download all the release artifacts from s3 and scan them in parallel. will there be another script that pushes them back up or changes some of the s3 metadata? Or am I missing some high level understanding?

::: testing/mozharness/scripts/release/antivirus.py
@@ +88,5 @@
> +        super(AntivirusScan, self)._pre_config_lock(rw_config)
> +
> +        # This default is set here rather in the config because default
> +        # lists cannot be completely overidden, only appended to.
> +        if not self.config.get("excludes"):

wfm.

I suppose the other option is to make excludes a self attr defined in __init__ and put dest_dir there too:
    self.excludes = self.config.get('excludes', DEFAULT_EXCLUDES)
    self.dest_dir = self.config.get('dest_dir', DEFAULT_DEST_DIR)

@@ +154,5 @@
> +        def worker(item):
> +            source, destination = item
> +
> +            self.info("Downloading {} to {}".format(source, destination))
> +            dest_path = os.path.dirname(destination)

looks like dest_path isn't used

@@ +156,5 @@
> +
> +            self.info("Downloading {} to {}".format(source, destination))
> +            dest_path = os.path.dirname(destination)
> +            key = bucket.get_key(source)
> +            return retry(key.get_contents_to_filename,

cool, fyi mh has a self.retry (from BaseScript) that might be good here. probably not worth the hassle this time since you have tested this though

@@ +175,5 @@
> +                    if not os.path.isdir(dest_dir):
> +                        os.makedirs(dest_dir)
> +                    yield (keyname, destination)
> +
> +        pool = ThreadPool(self.config["download_parallelization"])

I am a noob with multiprocessing in python. From what I gather though, shouldn't we be wrapping any pool calls within an `if __name__ == '__main__'` guard, or else importing the worker function and the iterated list from another module, to avoid unwanted side effects like recursive process inits?

I know this is still definitely an issue on windows according to the docs. Not sure if they fixed this for unix-like machines:
https://docs.python.org/2/library/multiprocessing.html#windows
https://pymotw.com/2/multiprocessing/basics.html#importable-target-functions
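
For reference, the shape of the code under discussion, reduced to a standalone sketch (boto 2 plus redo's retry, with a hypothetical `items` list of (source key, destination path) pairs). Note that multiprocessing.pool.ThreadPool uses threads rather than child processes, so the Windows re-import caveat shouldn't bite here, but the `__main__` guard is cheap insurance:

from multiprocessing.pool import ThreadPool

import boto
from redo import retry

def download_all(bucket_name, items, parallelization=6):
    """Download (source_key, destination_path) pairs in parallel."""
    bucket = boto.connect_s3().get_bucket(bucket_name)

    def worker(item):
        source, destination = item
        key = bucket.get_key(source)
        # redo.retry re-runs the download on transient S3 failures.
        return retry(key.get_contents_to_filename, args=(destination,))

    pool = ThreadPool(parallelization)
    pool.map(worker, items)
    pool.close()
    pool.join()

if __name__ == "__main__":
    # Hypothetical example pair; a real run would build this list from a
    # bucket listing, as the script's find_release_files() does.
    download_all("net-mozaws-stage-delivery-firefox",
                 [("pub/firefox/releases/41.0/README.txt", "README.txt")])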
Depends on: 1206585
(In reply to Jordan Lund (:jlund) from comment #40)
> sweet. so this will download all the release artifacts from s3 and scan them
> in parallel. will there be another script that pushes them back up or
> changes some of the s3 metadata? Or am I missing some high level
> understanding?

In this iteration we're just scanning, there's no signature or other upload afterwards. In the event of any failure we'd follow up with an upload to virustotal.com (by hand), to see if it's just ClamAV or more widespread.

> I suppose the other option is to make excludes a self attr defined in
> __init__ and put dest_dir there too:
>     self.excludes = self.config.get('excludes', DEFAULT_EXCLUDES)
>     self.dest_dir = self.config.get('dest_dir', DEFAULT_DEST_DIR)

That might be a little cleaner, I'll try it.
 
> > +            self.info("Downloading {} to {}".format(source, destination))
> > +            dest_path = os.path.dirname(destination)
> 
> looks like dest_path isn't used

Removed. I was going to create the path/to/destination at this point, but doing it in find_release_files avoids races.
 
> > +        pool = ThreadPool(self.config["download_parallelization"])
> 
> I am a noob with multiprocessing in python.

Can't claim much expertise on this either. Will check if that applies, but it looks like it's Windows-only per the main multiprocessing docs. We're only planning to run this on linux.
I contemplated adding an entry to moco-nodes.pp to allow the staging instance to talk to puppet, to make sure freshclam runs on boot. But don't the golden spot images have puppet disabled on the client side?
Attachment #8663589 - Flags: review?(rail)
Also:
Sep 21 03:32:44 dev-av-linux64-ec2-nthomas.dev.releng.use1.mozilla.com puppet-agent[30186]: Could not retrieve catalog from remote server: Error 400 on SERVER: This host used to have trust level core, and cannot be changed to try without a reimage at /etc/puppet/production/modules/slave_secrets/manifests/init.pp:16 on node dev-av-linux64-ec2-nthomas.dev.releng.use1.mozilla.com
Sep 21 03:32:44 dev-av-linux64-ec2-nthomas.dev.releng.use1.mozilla.com puppet-agent[30186]: Not using cache on failed catalog
Sep 21 03:32:44 dev-av-linux64-ec2-nthomas.dev.releng.use1.mozilla.com puppet-agent[30186]: Could not retrieve catalog; skipping run
Attachment #8663589 - Flags: review?(rail) → review+
(In reply to Nick Thomas [:nthomas] from comment #42)
> Created attachment 8663589 [details] [diff] [review]
> [puppet] Add password to buildbot masters
> 
> I contemplated adding an entry to moco-nodes.pp to allow the staging
> instance to talk to puppet, to make sure freshclam runs on boot. But don't
> the golden spot images have puppet disabled on the client side?

Yeah, we run puppet on golden AMIs and then disable it, so the spot instances don't use it. But we still apply puppet changes to these machines.
(In reply to Nick Thomas [:nthomas] from comment #43)
> Also:
> Sep 21 03:32:44 dev-av-linux64-ec2-nthomas.dev.releng.use1.mozilla.com
> puppet-agent[30186]: Could not retrieve catalog from remote server: Error
> 400 on SERVER: This host used to have trust level core, and cannot be
> changed to try without a reimage at
> /etc/puppet/production/modules/slave_secrets/manifests/init.pp:16 on node
> dev-av-linux64-ec2-nthomas.dev.releng.use1.mozilla.com
> Sep 21 03:32:44 dev-av-linux64-ec2-nthomas.dev.releng.use1.mozilla.com
> puppet-agent[30186]: Not using cache on failed catalog
> Sep 21 03:32:44 dev-av-linux64-ec2-nthomas.dev.releng.use1.mozilla.com
> puppet-agent[30186]: Could not retrieve catalog; skipping run

Hmmm, sounds like it tries to use the trust level set in slavealloc, but dev-av-linux64-ec2-nthomas is not there.
Comment on attachment 8656967 [details] [diff] [review]
[gecko] antivirus script

Review of attachment 8656967 [details] [diff] [review]:
-----------------------------------------------------------------

clearing r? as it sounds like there will be a follow up patch
Attachment #8656967 - Flags: review?(jlund)
rail, to scan Firefox 42.0b2 we'd need about 80G to store the files, and about 60G for 41.0.1. Can we do that easily with the existing AMI? Say 120G at / instead of 50G. I had a quick look at shifting the model from 'download all, scan all' to 'for each file: download, scan, discard (with multiple in parallel)', but the logs are very jumbled, so I'd have to slay that dragon.
Flags: needinfo?(rail)
50G is the minimum of what the slaves can get in theory. There is this script, http://hg.mozilla.org/build/puppet/file/4b182b11d6b0/modules/aws/files/manage_instance_storage.py, which is responsible for doing LVM magic similar to the commands in https://github.com/mozilla/build-cloud-tools/pull/119, except that it operates on the instance (local) storage that comes for free with instances.

http://aws.amazon.com/ec2/instance-types/ and https://github.com/mozilla/build-cloud-tools/blob/master/configs/watch_pending.cfg#L128 tell me that the minimum we get is 80G. Add that to the 50G on the AMI and it'll give you 130G in total.

We don't use this approach on loaners because instance storage is ephemeral, and if you use it as part of an LVM volume, the volume will be "corrupt" after a stop/start.
Flags: needinfo?(rail)
This cleans up the exclude default as requested, as well as polishing up some help messages. I've left the multiprocessing as is because I think it's OK after reading the docs. There's a __main__ guard and we're on linux, and I haven't seen anything funky in lots of test runs.

It also deletes the cache of files at the end, just in case we get another av job on the same slave before we recycle the instance. I did look at using PurgeMixin.purge_builds() early in the job, but would have had to work around the default of not deleting release directories. We'll be running on spot instances and I doubt we'll have a chance to investigate any failures before an instance gets the chop from aws-manager.
Attachment #8668207 - Flags: review?(jlund)
Attachment #8656967 - Attachment is obsolete: true
We try to use catlee's mardor first, and fall back to the mar.py nearby in tools if that fails. This can't break current production because there's a copy of extract_and_run_command.py in svn, which gets deployed to stage.m.o by puppet; I don't think I'll bother updating that copy.
Assignee: nobody → nthomas
Status: NEW → ASSIGNED
Attachment #8668213 - Flags: review?(rail)
This is all just to make the slaves available to the antivirus job. It does mean that some other jobs may run on av slaves (if connected), e.g. 'ready for release' and other notifications, or starting uptake checks, but I think that's harmless.
Attachment #8668235 - Flags: review?(rail)
Three hours should be long enough for a whole release, as a 4 locale x 4 platform test case takes 4m 30s.
Attachment #8668236 - Flags: review?(rail)
Attachment #8668213 - Flags: review?(rail) → review+
Attachment #8668235 - Flags: review?(rail) → review+
Comment on attachment 8668236 [details] [diff] [review]
[buildbotcustom] Use new slaves, drive new script

Review of attachment 8668236 [details] [diff] [review]:
-----------------------------------------------------------------

::: process/release.py
@@ +123,5 @@
>          if p == 'b2g':
>              continue
>          platform_slaves = branchConfig['platforms'][p].get('slaves', [])
>          all_slaves.extend(platform_slaves)
> +        if 'av' in p:

Maybe use "av-linux" to be more specific? Otherwise LGTM.
Attachment #8668236 - Flags: review?(rail) → review+
Blocks: 1210538
Comment on attachment 8668207 [details] [diff] [review]
[gecko] antivirus script, v2

Review of attachment 8668207 [details] [diff] [review]:
-----------------------------------------------------------------

::: testing/mozharness/scripts/release/antivirus.py
@@ +34,5 @@
> +            "help": "List of filename patterns to exclude. See script source for default",
> +        }],
> +        [["-d", "--download-parallelization"], {
> +            "dest": "download_parallelization",
> +            "default": 6,

10 was too much power? :)

@@ +82,5 @@
> +            config={
> +                "virtualenv_modules": [
> +                    "boto",
> +                    "redo",
> +                    "mar",

neat
Attachment #8668207 - Flags: review?(jlund) → review+
(In reply to Jordan Lund (:jlund) [pto until oct 5th] from comment #56)
> 10 was too much power? :)

I did some quick timings and there wasn't much speedup beyond 4, presumably because S3 and the connection to it (from an AWS instance) are pretty fast. Just padded a bit to make sure the link is saturated. Thanks for the review.
Keywords: leave-open
Add linux64-av to Buildslaves.py.template too, carrying rail's r+.
Attachment #8668235 - Attachment is obsolete: true
Attachment #8672543 - Flags: review+
Removes the linux64-av platform on try, so test-masters.sh passes. Carrying review over again.

https://hg.mozilla.org/build/buildbot-configs/rev/b3331e7e1413
Attachment #8672543 - Attachment is obsolete: true
Attachment #8675963 - Flags: review+
Attachment #8675963 - Flags: checked-in+
Added av-linux64-ec2-001 through 004 to slavealloc, two each in us-east-1 and us-west-2.
AV didn't work as expected for 42.0b9 for several reasons:

1) the slaves were disabled in slavealloc (fixed)
2) the regexp needs a tweak: https://github.com/mozilla/build-cloud-tools/pull/124 (works with local change)
3) we don't have any "-spot-" slaves in the buildbot configs, so I bumped the on-demand limits in the watch_pending config

TODO:
1) Merge the patch
2) add "-spot-" slaves to slavealloc and bb-configs
Attached patch av.diffSplinter Review
Attachment #8678223 - Flags: review?
Attachment #8678223 - Flags: review? → review+
I added 4 slaves to slavealloc and now one of them has taken a job! \o/
Bug 1218763 is about handling more exceptions for download errors.

There's one outstanding issue. A full Firefox release fills up the disk while caching the files. Rail, how do you feel about removing {c3,m3,r3}.xlarge so that we're guaranteed at least 160G of space? Would that work out of the box? Otherwise we could change the worker to download+scan in one go.
Status: ASSIGNED → NEW
Depends on: 1218763
Flags: needinfo?(rail)
Sounds good to me. The instances will automatically use all existing ephemeral storage (it will be added to the LVM volume used by /).

The only thing that we may need to tweak is the bid price.
Flags: needinfo?(rail)
Another approach could be downloading, scanning and deleting files in batches.
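
For illustration, a rough sketch of that batch idea (the helpers and batch size are assumptions; clamdscan's --multiscan flag and its exit codes are the only ClamAV specifics relied on). Peak disk usage is bounded by the batch size rather than the whole release:

import os
import shutil
import subprocess
import tempfile

def scan_in_batches(keys, batch_size=50):
    """Download `batch_size` objects at a time (`keys` is assumed to be a
    list of boto Key objects), scan each batch with a single clamdscan
    invocation, then delete the local copies before the next batch."""
    for i in range(0, len(keys), batch_size):
        batch_dir = tempfile.mkdtemp(prefix="av-batch-")
        try:
            for key in keys[i:i + batch_size]:
                dest = os.path.join(batch_dir, os.path.basename(key.name))
                key.get_contents_to_filename(dest)
            # Raises CalledProcessError on infections (rc 1) or errors (rc 2).
            subprocess.check_call(["clamdscan", "--multiscan", batch_dir])
        finally:
            shutil.rmtree(batch_dir)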
Attachment #8682872 - Flags: review?(rail) → review+
I merged that PR but we're not launching an instance to do the 43.0b1 build2 av job. There's this in aws_watch_pending.log:

2015-11-04 00:26:04,202 - av-linux64 - started 0 spot instances for slaveset None; need 1
2015-11-04 00:26:04,202 - need 1 ondemand av-linux64 for slaveset None
2015-11-04 00:26:04,221 - 0 av-linux64 ondemand instances running globally
2015-11-04 00:26:04,480 - Using cached slaves.json
2015-11-04 00:26:05,097 - getting all spot requests for us-west-2
2015-11-04 00:26:13,104 - No free IP available in None for subnets [u'subnet-d748dabe', u'subnet-a848dac1', u'subnet-ad48dac4', u'subnet-c74f48b3']
2015-11-04 00:26:13,105 - No free IP available for av-linux64 in None
2015-11-04 00:26:13,472 - av-linux64 - started 0 instances for slaveset None; need 1

Those two 'None' look suspicious but I'm not going to try to fix that on the fly.

Landed
 https://github.com/mozilla/build-cloud-tools/commit/5b1b103dd99ced48f33f2f5e723b62dc840c0a77
as a workaround. It re-enables spot instances, bumps the bid up to 0.40, and renormalises the performance_constant values.
Using the bigger instances has been fine.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open