Looking at a snapshot of the safebrowsing data from last week, we found significant inefficiency in the add/sub data: by the time we were done, only 13% of the malware data we processed ended up as relevant data on the system; the rest were subs from the list and the adds that matched those subs. Comparing last week's snapshot to yesterday's data, it appears that 3.8 MB (400,000 entries) have been added to the sub data. This is far more than the expected weekly volume (which should be closer to ~1000 subs per week, as I understand it). When I talked to the Google guys they were surprised by this. I think we need to understand what's going on here quickly, because it's asking users to chew through a whole lot of data unnecessarily.
Adding [external dependency] -- Dave, we wouldn't be changing our processing here, right? It's just a matter of asking the list maintainers to more aggressively find +/- pairs and remove them from the list?
Yeah, shouldn't need code changes on our end.
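To illustrate the +/- pair pruning we're asking the list maintainers to do, here's a rough sketch. This is a hypothetical simplification: real safebrowsing entries key subs to specific add chunk numbers and hash prefixes, but the idea is the same, a sub that matches an add cancels it, and both entries can be dropped from the list before clients ever download them.

```python
# Hypothetical sketch of +/- pair pruning (not the real chunk format).
# Each entry is modeled as a (chunk_number, hash_prefix) tuple.

def prune_pairs(adds, subs):
    """Drop adds that are cancelled by a matching sub, and the
    subs that did the cancelling, keeping everything else."""
    matched = set(adds) & set(subs)  # +/- pairs that cancel out
    kept_adds = [a for a in adds if a not in matched]
    kept_subs = [s for s in subs if s not in matched]
    return kept_adds, kept_subs

adds = [(1, "aaaa"), (2, "bbbb"), (3, "cccc")]
subs = [(2, "bbbb"), (4, "dddd")]
kept_adds, kept_subs = prune_pairs(adds, subs)
# kept_adds -> [(1, "aaaa"), (3, "cccc")]
# kept_subs -> [(4, "dddd")]
```

In the scenario described above, most of the 400,000 new sub entries would fall into `matched` and never reach clients.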
There were a couple of different issues that were causing these problems. We have fixed most of them, so new data coming out should be relatively good. However, older data that has not expired is still going to have way more subs than is necessary. We are working on ways to clean up this old data. We will update here when we start actually editing this data.
Moving off to branch blocking nomination list.
dcamp: is this even wanted anymore? there's no plan and little detail here ...
After discussion, this is wanted. What's the progress? How do we get this moving?
dcamp: whom do I have to bribe for updates? :)
Should we set a closeme status on this bug for a month from now? No one wants to acknowledge beltzner :(
Sorry about the lag, this thread somehow got muted in my e-mail. This issue is fixed. We didn't end up specifically changing the older data, but it has been automatically cleaned up at this point.