Closed Bug 402469 Opened 17 years ago Closed 16 years ago

urlclassifier database takes forever to load

Categories

(Toolkit :: Safe Browsing, defect, P2)

3.0 Branch
defect

Tracking


RESOLVED WORKSFORME
Firefox 3

People

(Reporter: wgianopoulos, Assigned: dcamp)

References

Details

(Keywords: regression, Whiteboard: [external dependency])

Under Firefox 2, if I remove the urlclassifier2.sqlite file and launch Firefox 2, it is rebuilt in seconds. If I try the same with the trunk and the urlclassifier3.sqlite file, after an hour the file is only 212992 bytes long and none of the preferences for table versions are populated.

Same happens with a new profile.
Flags: blocking-firefox3?
Summary: urlclassifer database takes forever to load → urlclassifier database takes forever to load
Blocking for investigation.
Flags: blocking-firefox3? → blocking-firefox3+
Keywords: qawanted, regression
Google feeds us this data in small chunks every half hour.  It can take a little while to build up the complete list.

This is the intended behavior, but maybe we can get them to be more aggressive sending the initial data.
(In reply to comment #2)
> Google feeds us this data in small chunks every half hour.  It can take a
> little while to build up the complete list.
> 
> This is the intended behavior, but maybe we can get them to be more aggressive
> sending the initial data.
> 

Perhaps we could post a recent file on the Mozilla mirror servers we use for release distributions, update it weekly or so, have the browser load that file if there is no database, and then go to Google for updates from there. A rough sketch of that seeding flow is below.
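A minimal sketch of the seeding idea above (Python, purely illustrative; the mirror URL, profile path, and helper are hypothetical, and this is not how the actual urlclassifier code works):

import os
import urllib.request

SEED_URL = "https://download.example.org/urlclassifier3-seed.sqlite"  # hypothetical weekly snapshot
DB_PATH = os.path.expanduser("~/.mozilla/firefox/profile/urlclassifier3.sqlite")  # hypothetical path

def ensure_seed_database():
    """Fetch a recent snapshot only when no local database exists yet."""
    if os.path.exists(DB_PATH) and os.path.getsize(DB_PATH) > 0:
        return  # an existing database will be kept current by incremental updates
    with urllib.request.urlopen(SEED_URL) as resp, open(DB_PATH, "wb") as out:
        out.write(resp.read())

ensure_seed_database()
# ...after which the normal half-hourly incremental updates from Google take over.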
OS: Linux → All
Hardware: PC → All
Based on the amount of data I am downloading per half-hour chunk, and the size of the urlclassifier.sqlite file on a system which appears to be doing anti-phishing protection correctly, it will take over 200 hours for the initial download to complete.

This means that after you upgrade from Firefox 2 to Firefox 3, you will have zero anti-phishing protection for your first 200 hours of usage.

This would seem to be entirely unacceptable.
It appears that once it really gets going the chunk size is more in the 90KB range than the 30KB I used above, so I guess it is more on the order of 75 hours.
One added note: the most recently added entries come down first, and those are generally the active phishes, so it's not incredibly bad, but we should still figure out how to do the initial seeding faster.
Target Milestone: --- → Firefox 3 M10
The final size of the urlclassifier3.sqlite file was 17.728MB, which at 90KB per transfer (180KB per hour) comes out to about 98.5 hours.

None of the examples anyone could give me seemed to trigger the phishing detection until the entire file was loaded.  But, I suppose it is possible those were all old entries. 
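For reference, the arithmetic in the last two comments can be reproduced with a quick sketch (Python; all numbers are the ones reported above, not measured independently):

total_size_kb = 17728      # final urlclassifier3.sqlite size, ~17.728 MB
chunk_kb = 90              # data received per update
updates_per_hour = 2       # one update every 30 minutes

hours = total_size_kb / (chunk_kb * updates_per_hour)
print(f"estimated time to a complete database: {hours:.1f} hours")  # ~98.5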
Assignee: nobody → dcamp
Priority: -- → P2
-rw-r--r-- 1 reed reed  41M 2007-11-17 20:29 urlclassifier3.sqlite

Why in the world is my urlclassifier3.sqlite 41MB large? Do we really make every user download a 41MB file in order to be secure? That seems crazy compared to 9.2M for urlclassifier2.sqlite and 1.7M for urlclassifier.sqlite.

Also, http://rrnryspace.com/index.cfm-fuseaction314Dlogin.process8526MyTokens79843964886883084155.htm shows up as a phish on branch but not on trunk.
(In reply to comment #8)
> -rw-r--r-- 1 reed reed  41M 2007-11-17 20:29 urlclassifier3.sqlite
> 
> Why in the world is my urlclassifier3.sqlite 41MB large? Do we really make
> every user download a 41MB file in order to be secure? That seems crazy
> compared to 9.2M for urlclassifier2.sqlite and 1.7M for urlclassifier.sqlite.

This is kind of the whole point of this bug.  The file will eventually grow to a ridiculous size despite the fact that it still does not have all the data.  I finally got a complete file loaded, and my urlclassifier3.sqlite is under 19MB.

-rw-r--r-- 1 wag wag 18640896 Nov 18 03:17 urlclassifier3.sqlite

> 
> Also,
> http://rrnryspace.com/index.cfm-fuseaction314Dlogin.process8526MyTokens79843964886883084155.htm
> shows up as a phish on branch but not on trunk.
> 

That URL is blocked for me. So, evidently, despite the fact that your urlclassifier3.sqlite file is over twice as large as mine, it is not the complete file.
After looking more closely at this, it is not just the initial load of the database that the current strategy does not work for.

If you only use the browser an hour or two per day, it will soon fall hopelessly behind even if you seed it initially with a completely up-to-date database.

Given the current chunk size, the inter-chunk delay needs to be more on the order of 1 or 2 minutes than 30 minutes in order for there to be any hope of maintaining an up-to-date database.

I think the longer delay needs to be between attempts to initiate an update of a given table.

The way I would envision this working is that you start to load a table, use a 2-minute delay between chunks until that table is up to date, and then wait at least 30 minutes before attempting to refresh that table again. A rough sketch of this scheduling is below.
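A sketch of the scheduling proposed in this comment (Python, purely illustrative; the table argument and the fetch_next_chunk/is_up_to_date helpers are hypothetical, and the next comment points out that the real delay is actually between update connections rather than chunks):

import time

CHUNK_DELAY_SECONDS = 2 * 60       # short delay while a table is still catching up
REFRESH_DELAY_SECONDS = 30 * 60    # long delay once a table is current

def update_table(table, fetch_next_chunk, is_up_to_date):
    # Pull chunks aggressively until this table is current...
    while not is_up_to_date(table):
        fetch_next_chunk(table)
        time.sleep(CHUNK_DELAY_SECONDS)
    # ...then back off before trying to refresh the same table again.
    time.sleep(REFRESH_DELAY_SECONDS)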
Well, a couple of things.  First of all, I have been using bad terminology here based on a misunderstanding of the code I was reading.  It is not an inter-chunk delay that is at issue here; the delay is between update connections to the service.  Each connection can return thousands of chunks, so the chunk size is not an issue.

The second thing is that all of a sudden today, with no code changes on the Mozilla side of things, getting an up-to-date database, which up through yesterday took about 100 hours, now seems to take more on the order of 3 hours.

The thing that seems to have changed is that suddenly in a single connection there are orders of magnitude more entries being added to the database.

My guess is that now that Firefox has gone beta, Google is dedicating more resources to the new format than they were when the browser was in an alpha state.

So, perhaps there is really no issue here at all.
The issue now is that with these bigger data blocks it is almost impossible to create a new profile or update an older one.

Building the urlclassifier3.sqlite file in a new profile freezes Firefox 3.0 beta 1 (and trunk) completely.
(In reply to comment #11)
I just noticed something that at first glance seems alarming in relation to phishing protection (although not exactly this bug). While I was checking the size of my urlclassifier3.sqlite file (it's 14.1 MB), I noticed that the Last Modified stamp on the file is November 15 2007 1013. That is when I closed Firefox and restarted it.

So my question is: is this expected behavior? For places.sqlite, the last modified stamp updates as I visit new pages. If the phishing protection database is being kept up-to-date, shouldn't the last modified stamp on the urlclassifier3.sqlite file be more recent? Or perhaps the updates from Google are cached in memory or something and only written to disk at the end of a session?
The download of the urlclassifier database seems to be completely and utterly broken for me on the trunk. I have had Firefox open for days, which should be ample time to download the whole database, and yet it is stuck at 6KB. :/

By comparison, on the branch the database downloads within about a minute (I'm not sure how complete it is, but it seems mostly complete because after that the file size grows very slowly).

Also, the Target Milestone should be changed to M11 since this hasn't been fixed for Beta 2.
Can anyone reproduce the behavior I experienced in comment 14? In other words, with a new profile, the urlclassifier3.sqlite database grows to 6KB and then gets stuck there.

> Also, the Target Milestone should be changed to M11 since this hasn't been
> fixed for Beta 2.

Could someone please update the Target Milestone? I don't want this to fall off the radar for Beta 3 :)
(In reply to comment #15)
> Could someone please update the Target Milestone? I don't want this to fall off
> the radar for Beta 3 :)

TM doesn't really matter much... priority means more about when something will be fixed.
Target Milestone: Firefox 3 M10 → Firefox 3 M11
(In reply to comment #14)
> The download of the urlclassifier database seems to be completely and utterly
> broken for me on the trunk. I have had Firefox open for days, which should be
> ample time to download the whole database, and yet it is stuck at 6KB. :/
> 

As far as I know, updates for Firefox 3 are currently disabled, as an emergency fix for bug 404645. There have been several improvements for beta 2, so it's possible that the updates will start again soon, after everyone has upgraded to beta 2.

I don't work for either Mozilla or Google, so it's purely speculation, of course.
Note that updates seem to have been started again - my database is back at 827KB.
So I left my computer on and Minefield open (it was fighting cancer at the same time on BOINC so it wasn't a complete waste of energy ;) for most of the time over the holidays, and this is how the size of the urlclassifier3.sqlite file grew:

Size     Time  Date
602KB    1020  Thursday December 20
971KB    2245  Thursday December 20
2492KB   2340  Friday December 21
4221KB   2340  Saturday December 22
5036KB   2335  Sunday December 23
6617KB   2230  Monday December 24
7600KB   2330  Tuesday December 25
10266KB  2345  Wednesday December 26
12971KB  2000  Thursday December 27
17236KB  2340  Friday December 28
19216KB  2110  Saturday December 29
19259KB  1100  Sunday December 30
19334KB  1725  Monday December 31
19388KB  2330  Wednesday January 2

So for me it still seemed to take at least 150 hours of Minefield being open (being conservative, given that my computer was not on 100% of the time) before the database size started leveling off at about 19 MB. A quick check of that estimate is sketched below.

To me this seems like a big regression from Fx 2; very few users keep Firefox open for that long, which means that their phishing protection would likely never be complete.
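As a sanity check on the estimate above, here is a small sketch (Python, using the end-of-day sizes from the table; the 16 hours/day of browser uptime is an assumption, since the machine was not on all the time):

end_of_day_kb = [971, 2492, 4221, 5036, 6617, 7600, 10266, 12971, 17236, 19216]
final_size_kb = 19388     # where the file eventually levels off
hours_per_day = 16        # assumed browser uptime per day

for day, size in enumerate(end_of_day_kb, start=1):
    pct = 100.0 * size / final_size_kb
    print(f"day {day:2d}: {size:6d} KB ({pct:5.1f}% of final size)")

# The size only crosses ~99% of its final value around day 10, i.e. roughly
# 10 * 16 = 160 hours of the browser being open, consistent with the
# "at least 150 hours" estimate above.
print("estimated hours to (near-)complete:", len(end_of_day_kb) * hours_per_day)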
Beltzner: is QA still wanted on this bug?  I don't see what new questions need to be answered here or what more information is needed.  Please let us know, or if you feel the problem is well-understood, please remove the QAWanted keyword.  Thanks.
Dave/Bill: was this fixed by the move to the new protocol in beta 3 and beyond? If so can we get a RESO on it?
This seems much better to me.  Loads in under an hour.
Google has fixed some problems in the list that were wasting some bandwidth, and seem to be feeding us significantly more data per update.  I'd like to keep this bug open a bit longer to keep it on my radar, but I don't think it needs to block release anymore.
(In reply to comment #23)
> Google has fixed some problems in the list that were wasting some bandwidth,
> and seem to be feeding us significantly more data per update.  I'd like to keep
> this bug open a bit longer to keep it on my radar, but I don't think it needs
> to block release anymore.

Re-nom'ng to make sure drivers see it leave the blocker list.
Flags: blocking-firefox3+ → blocking-firefox3?
Dave, the right way to do this is keep it as a blocker and resolve it when you're confident that it's fixed. That way if it becomes an issue again, it will get re-opened and re-inherit blocking status.

I trust that you'll continue to monitor.
Flags: blocking-firefox3? → blocking-firefox3+
Target Milestone: Firefox 3 beta3 → Firefox 3
Dave, if this issue is resolved to your satisfaction, please resolve by April 2nd so it's out of the way before the final push.  Seems to me we can safely resolve it now, and file bugs on any new issues that arise...
Whiteboard: [appears fixed, resolve by 04/02]
I'm happy with the current state, will open bugs on new issues.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
OK, the server seems to have regressed a bit.  It appears that we get some of the list fairly quickly, but it takes way too long to really get the complete list.  I discussed it with the Google guys, and apparently it's related to how often the list is updated.

Google is aware and working on fixing this, I'm reopening this bug to track it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
And to clarify something a bit - the file still grows to roughly its expected size reasonably quickly as it adds the freshest information.  But once you have the freshest information, the older updates have less of an impact on database size, as more of the data is expired.
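A toy illustration of the effect described above (Python, all numbers invented): the freshest updates are applied first and are mostly additions, while the older updates increasingly just expire entries that are already present, so the file size levels off even though updates keep arriving.

entries = 0
for age in range(10):                        # age 0 = freshest update, 9 = oldest
    additions = 1000
    expirations = min(100 * age, additions)  # older updates expire more existing data
    entries += additions - expirations
    print(f"after update of age {age}: ~{entries} entries in the database")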
Whiteboard: [appears fixed, resolve by 04/02]
Whiteboard: [needs status update]
Status Update: We basically have a design fleshed out which should increase the throughput of the redirects. We should have the changes made by the beginning of next week, and might need a few more days for testing.  I'll be around at the meeting tomorrow morning if people are interested in more information.
Whiteboard: [needs status update] → [server side]
Status update here? Last comment was April 14th - have we achieved satisfactory resolution?
Whiteboard: [server side] → [server side][needs status update]
Just wanted to give you an update on the status here.  We are wrapping up the server changes that Garrett mentioned in Comment 30.  There should be a new server available for testing today or tomorrow.

So, this is still an outstanding problem and we have not reached a resolution yet.
Whiteboard: [server side][needs status update] → [server side only][ETA 4/30]
Whiteboard: [server side only][ETA 4/30] → [external dependency][ETA 4/30]
We have a new server ready, but we are waiting on bug 430530 so that we don't totally destroy Linux users.  Should be pushed tomorrow morning.
(In reply to comment #33)
> We have a new server ready, but we are waiting on bug 430530 so that we don't
> totally destroy Linux users.  Should be pushed tomorrow morning.
> 

I like this plan, Garrett-- I commented earlier this afternoon (over on that bug) that I was hoping for a little test window before/after this one was pushed out on Google.

But that code still hasn't actually landed on the Trunk yet, Dave is still working on it. (Dave's comment 5:30 PM, nearly identical time as yours.) If he doesn't do it real early, even us 'Tinderbox' people will have only a few hours-- and tomorrow morning's nightly will definitely be concurrent with your change.

Maybe wait until 5/2 for the Google push, so "Nightly" users have a full day of the 430530 change before your "push"? There's hardly anyone pulling from the Tinderbox builds; I wouldn't be surprised if it turned out that I was the ONLY Linux "non-expert, external to Mozilla" user to actually try it.

OTOH, if you would LIKE them to happen together, then your timing (tomorrow AM) is nearly perfect: It sounds like the 430530 change scraps database "Version 3", implementing "Version 4" (from my reading of https://bugzilla.mozilla.org/show_bug.cgi?id=430530#c26). I haven't read the entire update and wouldn't understand it even if I tried, but I think this means that all the FF3 users are going to be starting over with a new "urlclassifier4.sqlite" database.
No, the filename will not be changed. But the schema is Version 4, and the mismatch will cause the "old version" which we FF3 users have now to be scrapped. (All the DB content will be replaced, using the new schema.)
430530 made it into the nightlies last night (except for the x86_64 builds, which apparently aren't auto-updated anyway), so I think we should go ahead and try this today.  The disk thrashing in 430530 gets worse as you get a bigger file, and this fix will make sure you get to a bigger file more quickly.
I agree with Dave.... Go ahead and do it at your EARLIEST convenience. I'll watch this bug and verify that a Linux update works properly after your changes. If desired, I can also create a new naked profile on Linux and verify that the database pretty quickly fills in and matches my 'main' profile's file, which is up to date and running with 430530 already present.
fyi, the new server has been running since 5/2 at around 4pm.
For both me (on Linux) and Windows users, urlclassifier3.sqlite is being updated with no quantifiable difficulties. (That's good; it was disastrous on Linux before 430530 was done.)

OT: My urlclassifier3.sqlite file is now over 40MB in size. How large is it expected to get? That's a lot of raw data for dial-up users, even if we try to send it as carefully as possible.
Dave: let me know if we can close this out; not sure how to measure/test it.
Depends on: 432490
It appears that on first launch with a fresh profile, the urlclassifier3.sqlite file on Linux (Ubuntu) doesn't get updated, and it stays at 32KB unless you restart the application. See bug #434624.

However, I also tried installing afresh with livehttpheaders and restarting, and at times I could also see that the urlclassifier file did not get updated after one restart. Right now, on my VM installation, I see a GET key request, but hours later there has been no further update to the file.
We have been trying to keep track of these issues on the server side, and the numbers unfortunately don't look as good as they should.  Theoretically people should be updating in a few hours (~4 last time I checked).  However it looks like it's actually taking on the order of 20 hours or so.  We are investigating the reasons for this, but we haven't figured it out yet.  Dave asked for QA to help us investigate, but I haven't heard back from them yet.  We should really get these numbers down before launch.
Whiteboard: [external dependency][ETA 4/30] → [external dependency]
Juan, to clarify, are you saying that you are not seeing a GET downloads request within an hour after startup?  
I'm not getting a GET downloads request after several hours on Linux on a fresh profile, first session. After a restart of the browser (or two), then I start getting data. I'll get some numbers this evening for Linux, but on Mac I observed this (time - malware data % / phishing data %):

30mins - 15% / 18%
1 hour - 22% / 29%
1.5 hour - 31% /  43%
2 hours - 37% /  57%
2.5 hours - 44% /  67%
3 hours - 53% / 80%
3.5 hours (after session resumed) - 52% /  80%
4 hours - 58% / 94%
4.5 hours - 64% /  97%
5 hours - 76% /  98%
5.5 hours -  84% /  99%
6 hours -  92% / 99%
6.5 hours - 100% / 100%
(In reply to comment #45)
> I'm not getting a GET downloads request after several hours on Linux on a fresh
> profile, first session. After a restart of the browser (or two), then I start
> getting data. I'll get some numbers this evening for Linux, but on Mac I
> observed this (time - malware data % / phishing data %):
> 
> 30mins - 15% / 18%
> 1 hour - 22% / 29%
> 1.5 hour - 31% /  43%
> 2 hours - 37% /  57%
> 2.5 hours - 44% /  67%
> 3 hours - 53% / 80%
> 3.5 hours (after session resumed) - 52% /  80%
> 4 hours - 58% / 94%
> 4.5 hours - 64% /  97%
> 5 hours - 76% /  98%
> 5.5 hours -  84% /  99%
> 6 hours -  92% / 99%
> 6.5 hours - 100% / 100%

I'm curious, how do you find out how complete the malware and phishing data are at a given point in time?
Andrew, you can try installing the extension mentioned here https://bugzilla.mozilla.org/show_bug.cgi?id=429263#c3

Then type about:safebrowsing in the location bar to see some numbers.
We're going to need more analysis on this, so I'm keeping it on the branch blocker nomination list.
Flags: blocking-firefox3.1?
Flags: blocking-firefox3-
Flags: blocking-firefox3+
Version: Trunk → 3.0 Branch
Flags: blocking1.9.0.1?
Flags: wanted1.9.0.x+
Flags: blocking1.9.0.1?
Flags: blocking1.9.0.1-
Dave: did we ever finish the analysis loop on this? Can we close this out?
Yeah, I believe that we concluded that stuff is happening at roughly the expected rate.
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
There is no patch around. In such cases we mark bugs as WFM.
Resolution: FIXED → WORKSFORME
Flags: blocking-firefox3.1?
Based on the age of the QAWANTED request on this bug, is QAWANTED still wanted?
Judging by comments #50 and #51, we can remove the qawanted status.
Keywords: qawanted
Product: Firefox → Toolkit