Closed Bug 699579 Opened 8 years ago Closed 4 years ago

Only virus scan release files we haven't already scanned


(Release Engineering :: Release Automation: Other, defect, P3)



(Not tracked)



(Reporter: nthomas, Unassigned)



(Whiteboard: [kanban:engops:] [release][automation])

When virus scanning releases, hash files after extracting them from dmg/mar/etc, and don't rescan if filename and hash match a previously scanned file. This will make the scanning much quicker and not hammer stage so hard.
Whiteboard: [release][automation]
Just to point out a tradeoff being made here: if updated anti-virus protection data becomes available that can detect a previously undetected virus, we won't catch that virus if we don't rescan the file.

- upload fileA which contains virusXYZ
- scan fileA and don't detect virus since the scanner doesn't yet detect virusXYZ
- update anti-virus definitions so we now detect virusXYZ
- skip scan of fileA, hence don't catch instance of virusXYZ

One workaround could be to only cache the scan result for a certain amount of time.
It's a fair point. To clarify, I'm only talking about within a scanning run, which takes about 3 hours at the moment but would be much shortened by this change. It wouldn't persist over separate jobs.
Or save the hash of the virus definitions along with the file hash ? :)
I'm taking a stab at this here:

My initial tests with only one installer and one mar (of the same local) show that it's slower with caching, I guess because of all the hashing. I'm going to try a larger scale test tomorrow locally, but I'd like to do a full test on surf after things calm down. Maybe over the weekend.
Assignee: nobody → bhearsum
todo: walk directories in a predictable order, probably alphabetical
I'd love to do win32/ and update/win32/ first instead of last.
(In reply to Nick Thomas [:nthomas] from comment #6)
> I'd love to do win32/ and update/win32/ first instead of last.

Hmm, I think we could just adjustment the automation piece to do this. Rather than giving it "/pub/" to scan, we could give it the individual platform + update dirs, in whatever order we think best. I'd rather that than hardcode something in, or try to implement some generic sorting arguments. Like this:
Well, I ran my tests over the weekend and I'm not encouraged by the results of the current patch. Here they are:
j2 | 1h39min | 1h47min
j4 |  54min  | 56min
j6 |  55min  | 54min
j8 |  56min  | 54min

The tests were run on an ix machine, which has 4 logical CPU cores. It looks to me like once you start maxing out the CPUs at j4, that caching with md5 makes no difference.

I'm going to try swapping in a faster hashing algorithm from, probably SuperFastHash.
....and, I can't even get this library to install on the build machine, it requires some very specific Boost library to compile, or something.
I ended up looking at to get some hints on speeding this up and found that it was using sha1. Despite the fact that sha1 is supposed to be slower than md5, it seems to perform faster for this use case. In my tests, it ran at 46min in -j6 and -j8, and 48min in -j4 - which is 8min faster than no caching in both cases. I don't really understand how this can be the case, so if someone else understands, please fill me in!

I also found that has a cache of the hashes of the files themselves...which I bet is pretty hot considering how many guaranteed dupes you have per platform when generating partials. I bet this is a big part of where it gets its speed from. I'm not sure if we can do the same thing in without making it really specific for the task...but we might be able to throw the existing logic into a class and make it subclassable. If we do that, we could do something similar to what does, and cache the sha1 sums by file path + size + locale (we need locale, because helper.exe and probably some non-binary files could easily collide due to the only differences being strings...).

I'm not going to spend anymore time on this at the moment though, I want to see how things are with the new candidates/ directory before I spend more time here.
We're living with the current state of things, and I don't think I'll be touching this anytime soon.
Assignee: bhearsum → nobody
Component: Release Engineering → Release Engineering: Automation
QA Contact: release → catlee
Bulk move of bugs to Release Automation component.
Component: Release Engineering: Automation (General) → Release Engineering: Automation (Release Automation)
Mass move of bugs to Release Automation component.
No longer blocks: hg-automation
Product: → Release Engineering
Whiteboard: [release][automation] → [kanban:engops:] [release][automation]
Possibly something to tackle as part of release promotion/beetmover?
Closed: 4 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1210538
You need to log in before you can comment on or make changes to this bug.