Closed Bug 727370 Opened 12 years ago Closed 12 years ago

Bad(?) files in safebrowsing folder in user profile prevent loading any site after installing update

Categories

(Toolkit :: Safe Browsing, defect)

x86_64
Windows 7
defect
Not set
major

Tracking

()

RESOLVED FIXED
Firefox 18
Tracking Status
firefox17 + fixed

People

(Reporter: write2asitha, Assigned: gcp)

References

Details

(Keywords: regression, Whiteboard: [qa-])

Attachments

(7 files, 7 obsolete files)

165.05 KB, text/plain
Details
12.75 KB, text/plain
Details
1.67 KB, application/octet-stream
Details
33.32 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
8.75 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
4.20 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
7.01 KB, patch
dcamp
: review+
Details | Diff | Splinter Review
User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:11.0a1) Gecko/20111220 Firefox/11.0a1
Build ID: 20111220031159



Actual results:

after installing of the automatic update for 2012.02.12 my nightly browser haven't able to load any web page,it always keep showing connecting and also when i close the application it won't close at all (still can find the firefox.exe on the process and i have to kill the process in order to close FF)
my runnig on windows 7 ultimate x64 and ff version is 13.0a1 x64
Severity: normal → major
Hardware: x86 → x86_64
sorry about my poor english and FYI due to this situation i removed the FF 13 and re instilled the FF 11.0a1
I have the exact same problem.  I've manually downloaded the latest nightly the last few nights but no pages ever load.  I'm on Win7 64-bit.  I've tried both the 32-bit and 64-bit versions and have the same issue.  I've disabled my firewall & virus scanning and the problem persists.  The status bar shows "Loading..." with "..." being whatever URL but it never loads.  Additionally, if you close the window either via the "X" button or File, Exit, the window closes but the process remains running and eats about 50MB of RAM until you kill it.
nothing except for windows firewall but its also in default settings.....
I just see you tested a nightly. So I guess the problem still occurs with a currently nightly version (for the example the nightly from today)? What nightly version did you use before the update, 2012-02-11 or an older version?
2012-02-12 was the last stable version i had,after thas still i keep geting the same problem, include today's one also
btw: Does the update function inside Firefox still work or is it also broken?
yes, there's no problem with it and also i can download updates from there too
The update function does download but I have to manually kill the firefox process for it to actually apply.  If you click the "Apply Update" button, the window shuts down but the process remains running.
@steve:got the exact same problem.even with the today's update
Does opening URLs like about:config work for you? Probably should move this bug to Core.
Yes, going to about:config works.  "about:" works fine, too.
Regression range would be: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9253f058824a&tochange=c90c9f273cad
Just to confirm this: The mozilla-central 2012-02-12 build was fine, 2012-02-13 build was broken?
That date sounds right.  I've tried going back a few versions of Nightly (branch 13)  and went back to the mozilla-central from the 10th, but they all have the same problem now.  I'll keep going through the nightly builds until I can find something that works.
yap,2012-02-13 was the fist broken build......
Weird; I had to go through uninstalls and reinstalls all the way back to the build done on the 3rd of February to get it working again.  I'm going to go through them one day at a time going forward to see when it "breaks" again.
That didn't take much.  Seems the latest build that I can get to work is from the 3rd of February.  I'm certain it did not break that early for me, though.  The 12th definitely sounds like the date it broke.
Moving to networking for now, not sure what's up in this bug. I don't see anything obvious in the suspected regression range in Comment 13.
Component: Untriaged → Networking
Keywords: regression
Product: Firefox → Core
QA Contact: untriaged → networking
Version: 13 Branch → Trunk
probly not a networking thing if the updates are still downloading.. but let's get a http log and see what's happening:

https://developer.mozilla.org/en/HTTP_Logging

(try and load a page with the logging turned on)
here is my log
(In reply to Steve from comment #20)
> Created attachment 598747 [details]
> http log from nightly build

Hi Steve,

you've got 6 transactions in there, all of them complete correctly. Most of them are related to addon version checks, etc.. but there is also one for http://my.yahoo.com/ I'm guessing that's your homepage?

from a http level it completes fine.. 40'ish KB of chunked data. however some time after it completes it receives a cancel event from the UI - which tells me the UI never rendered it for some reason. (and you've now moved on, or are shutting down perhaps).

So it appears we need to find a new category for this bug. I'm just going to make a wild guess and move it to layout to see if they have any ideas.

(In reply to asitha from comment #21)
> Created attachment 598779 [details]
> http log from nightly build (asitha)
> 
> here is my log

you have a newtork failure to www.google.com becuase you have proxy information required from http://wpad/wpad.dat and host wpad does not resolve. That seems like an unrelated error.

now perhaps you should be seeing an error in the browser and are not seeing the generated one for the same reason the above user is not getting the yahoo content.
Component: Networking → Layout
QA Contact: networking → layout
but that don't explain why the firefox.exe process remains running after closing the application from the UI?
Yes, my.yahoo.com is my home page.  There were a few add-on checks and to be clear, I've disabled all of them to see if it helped resolve the problem.  I have no idea what that 'wpad' thing is.   You are correct that I do not see an error.  I've never let it sit for more than a minute or so, but I can try to let it sit longer to see if anything ever happens.  To Asitha's point, why does the process not quit?  Would something hanging in the UI (even though it's not on screen) cause that?
(In reply to Steve from comment #24)
> have no idea what that 'wpad' thing is.   

Sorry for the confusion, that was my reply to asitha's log. I replied to you both in one comment.

> To Asitha's point,
> why does the process not quit?  Would something hanging in the UI (even
> though it's not on screen) cause that?

I don't know what is hanging in the UI - the networking log isn't going to show that.. I'm a little out of my depth at this point which is why I changed the category of the bug to get some fresh eyes on this who might be able to ask further, more helpful, questions of the folks who have a repeatable problem.
The problem does occur in safe mode.  I've not tried it from a different profile as I'm the only user of my PC.  I also tried to "Run as administrator" to no avail.

I can try the other builds once I'm on my home PC later.
Steve: I'm not sure if you're the best person for this to test as both the build from 12th and 13th failed for you (those builds above are builds between those two). But of course you can try it anyway ;)
Fair enough.  I'm wondering if it's not something gummed up in my profile from one of the builds that broke?  I'll build a new user profile later and see what happens.  I'll report back if there's a difference.  If not, I'll quit gumming up the works ;)
It does do the same thing regardless of 32 or 64-bit.  However, the problem may be user profile related in some way.  I am on a brand new user profile, I downloaded the latest nightly via FTP, and it's working without an issue.  Do you guys want me to test anything, regress to a different version, etc?
i also created a new profile and install the latest nightly build and it worked like charm...... are there any way to get it woking on my old profile?
Component: Layout → General
QA Contact: layout → general
Try this as it worked for me.  Go into:

C:\Users\YourUserNameHere\AppData\Local\Mozilla\Firefox\Profiles\

There should be a profile directory under that.  Go into it.  For example mine was named 5ynqsdec.default

Under that directory delete the directory called safebrowsing.

After that nightly started working for me again.
I should've added to this that the folder was indeed dated 2/12/2012.  It had 7 files in it: classifier.hashkey, test-malware-simple.cache, test-malware-simple.pset, test-malware-simple.sbstore, test-phish-simple.cache, test-phish-simple.pset, and test-phish-simple.sbstore.  When Nightly was run after deleting this directory, 7 files with the same name were created.
than you so much steve that worked for me too
To the Phishing protection team (or anyone who knows about it): Do you know of any recent changes that could have caused this bug? Comment 33 contains a workaround for this bug, Comment 13 has a regression range, but I don't see any related changes. Has the server maybe served "bad" data at that time? It looks like the issue does not occur anymore now.

Saving bz some bugspam, I think he's not involved in this bug anymore :)
Component: General → Phishing Protection
Product: Core → Firefox
QA Contact: general → phishing.protection
Summary: won't load any site after installing update → Bad(?) files in safebrowsing folder in user profile prevent loading any site after installing update
I saved the folder & it's contents if anyone would like them for analysis.
Assignee: nobody → gpascutto
The new phishing protection backend landed 2012-02-03. I don't think the Nightly from the 12th is particular, it sounds more like it's possible that you got some update of the SafeBrowsing DB that gets it in a broken state, or somehow corrupt the state, where it can't recover automatically.

Comment 33 is interesting because the directory doesn't have any of the "goog" files in it, which means the broken update didn't come from the network, but was the one which we use for our test pages.

Steve (or anyone else who ran into this problem), can you attach the (zipped) contents of the /safebrowsing subdirectory to this bug? It does *not* contain any personal information.
Cwiiis also ran into this and sent me his profile.

I see the same corruption as in Steve's data: the pset file is entirely zero-filled. In Cwiiis' case, it's exactly 1000000 bytes long, which I suspect isn't a coincidence. Steve's file has the expected size.

I can try to detect this corruption (am a bit surprised that didn't happen automatically), and reset.

But understanding why the file got corrupted would help too. Did any of you have a Firefox crash prior to this starting to happen?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Wonder if there's a correlation to bug 721196.
Heh, there is probably at least some relation: the contents of the file, combined with the correct or rounded sizes, seems to indicate the writing was interrupted in between the fallocate call and the actual writing.
I don't recall a crash but that doesn't mean there wasn't one.
1) Check the return value from the load call, and reset if the file is corrupted.
2) Don't consider the table as active if the lookupcache is empty.
3) Report the file as corrupted if the magic numbers don't match up.
4) Still attempt to load the prefixset if the .cache file is missing.
Attachment #608259 - Flags: review?(dcamp)
One of the disadvantages of the new code is that corruption of one file generally requires a reset of the entire database set. This is something that never happened with the old SQLite backend, which was extremely resistant to this kind of thing. It was bothering me a bit that users that have this kind of crash in the middle an update could unwillingly be without protection until the updates catch up again. You have to crash in the middle of the files being rewritten, but it seems from this bug people do hit that case.

Because the update code reads in all data at the start of an update, it's actually not very difficult to make this work like a transaction: just move the old files away before the update, write the new ones, and delete the backup. If there's a backup when we restart, move it over the (presumably incomplete) new files.

This also cleans up the code to make it more clear how the HashStore is supposed to operate: construct, read in file, close, process data, write out new file, destroy. Reopening the file or even clearing the internal buffers is not necessary, as the HashStore is supposed to be entirely transient during an update.
Attachment #608262 - Flags: review?(dcamp)
Attachment #608263 - Flags: review?(dcamp)
Attached patch Patch 4. Remove clear operations (obsolete) — Splinter Review
HashStore usage is transient, so there's no need to Clear up the internal storage.
Attachment #608264 - Flags: review?(dcamp)
(In reply to Gian-Carlo Pascutto (:gcp) from comment #44)
> Created attachment 608262 [details] [diff] [review]
> Patch 2. Backup the SafeBrowsing files before updating.
> 
> One of the disadvantages of the new code is that corruption of one file
> generally requires a reset of the entire database set. This is something
> that never happened with the old SQLite backend, which was extremely
> resistant to this kind of thing. It was bothering me a bit that users that
> have this kind of crash in the middle an update could unwillingly be without
> protection until the updates catch up again. You have to crash in the middle
> of the files being rewritten, but it seems from this bug people do hit that
> case.

Is this mostly a problem if one of the files is written an not another?  Should our SafeOutputStream be doing an atomic update?
>Is this mostly a problem if one of the files is written an not another?

Exactly.

>Should our SafeOutputStream be doing an atomic update?

It does so, but it is limited to a single file. AddPrefix data is split over two files, and inconsistencies between both will require a database reset.

Note that PrefixSet doesn't use SafeOutputStream because its serialization code predated the rewrite. Even if it was changed to do so, it wouldn't solve the problem entirely because of the above.

Put differently, this code implements SafeOutputStream behavior over a set of files.
Comment on attachment 608262 [details] [diff] [review]
Patch 2. Backup the SafeBrowsing files before updating.

Review of attachment 608262 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/Classifier.cpp
@@ +488,5 @@
>  }
>  
> +nsresult
> +Classifier::TableBackup(const nsACString& aTable)
> +{

If we crash during this loop, we'll be left with some files as backups and some files as primary sources.  But that's ok, because the restore will just move the backups back into place at the next startup, right?
Comment on attachment 608263 [details] [diff] [review]
Patch 3. Restore the files after a failed update

Review of attachment 608263 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/url-classifier/Classifier.cpp
@@ +580,5 @@
> +      newName.Cut(dot, suffix.Length());
> +      // This might overwrite an incomplete new file
> +      rv = file->MoveToNative(nsnull, newName);
> +      NS_ENSURE_SUCCESS(rv, rv);
> +    }

So if we crash during this loop we'll still be OK because the next time we start we'll finish moving the rest of the backups, right?

A comment to that effect would be nice.
Attachment #608264 - Flags: review?(dcamp) → review+
Attachment #608259 - Flags: review?(dcamp) → review+
>If we crash during this loop, we'll be left with some files as backups and some 
>files as primary sources.  But that's ok, because the restore will just move the 
>backups back into place at the next startup, right?

Yes.

>So if we crash during this loop we'll still be OK because the next time we start 
>we'll finish moving the rest of the backups, right?

Yes.

What *can* go wrong is that you finish the update, but crash in the middle of deleting the .backup files. At the next restore, the backup for some files is moved over the new ones, and not for the others.

We can't simply detect that case, because checking that the set of .backup files is incomplete would also trigger if we crash while creating them. 

I have no easy solution to make it waterproof. (It will reset & recover - that's no the problem). But the window for things go wrong is a lot tighter now, though.
Attachment #608262 - Flags: review?(dcamp) → review-
Attachment #608263 - Flags: review?(dcamp) → review-
We think we have a way to handle even that last case correctly. Instead of making .backup files, we make a .backup directory.

Whenever .backup exists, it is used to overwrite the default directory. We write the new files into the default directory. 

Before deleting the .backup dir, we move it (atomic!) to a new .oldbackup dir. We then delete that one (not atomic).

This makes the "deletion" (really: move away) of the .backup dir atomic. It doesn't matter that deleting .oldbackup isn't: we always should delete it first when it exist. When .backup exists, we should always first copy it to the default.
Rebased.
Attachment #608259 - Attachment is obsolete: true
Attachment #653766 - Flags: review+
rebased
Attachment #608262 - Attachment is obsolete: true
Attachment #653768 - Flags: review-
Rebased.
Attachment #653774 - Flags: review-
Attachment #608263 - Attachment is obsolete: true
Had to fold the patches as there's too many dependent changes.

1) Remove HashStore::Clear. It's a transient object dangling off an autoptr, so no need.
2) PrefixSet should pass detected corruption upwards, but an empty set is a legal set ever since the new backend has been in place.
3) Fix some MPL 1.1 -> 2.0 leftovers
4) Remove input stream handle members, open them on-the-fly and pass down as a param. Also avoids the Windows issue that was plaguing that code.
5) Implement the scheme for atomic updates that was devised in the comments above: store->backup->to_delete. Use the fact that directory moves are atomic.
6) Some functions where we can't sensibly deal with failure can be void.

One snag I hit here was that the code holds nsIFile handles to the directories where they want to write, but if you nsIFile->MoveTo a directory, the handle points to the new dir, which is never what we want in this code.
Attachment #608264 - Attachment is obsolete: true
Attachment #653766 - Attachment is obsolete: true
Attachment #653768 - Attachment is obsolete: true
Attachment #653774 - Attachment is obsolete: true
Attachment #664667 - Flags: review?(dcamp)
The checksumming was supposed to use the BufferedInputStream to avoid reading the input file twice. But somewhere along the way this got confused. So just delete a bunch of useless code and use that BufferedInputStream instead.
Attachment #664971 - Flags: review?(dcamp)
If we get a "pleasereset" and delete the stores and all dirs, we should create a new empty storedir at the end so the rest of the update can proceed normally.
Attachment #664972 - Flags: review?(dcamp)
Some minor cleanups/simplifications in HashStore.
Attachment #664974 - Flags: review?(dcamp)
Blocks: 741808
Attachment #664667 - Flags: review?(dcamp) → review+
Attachment #664974 - Flags: review?(dcamp) → review+
Attachment #664972 - Flags: review?(dcamp) → review+
Attachment #664971 - Flags: review?(dcamp) → review+
Comment on attachment 664667 [details] [diff] [review]
Patch 1. Make updates atomic. Fixes and cleanups.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 673470
User impact if declined: Corrupted files and unable to restart Firefox after a crash.
Testing completed (on m-c, etc.): on m-c for 1 week
Risk to taking this patch (and alternatives if risky): If the patch is buggy, about identical to not taking it - corrupted files and unable to restart.
Attachment #664667 - Flags: approval-mozilla-aurora?
Attachment #664971 - Flags: approval-mozilla-aurora?
Attachment #664972 - Flags: approval-mozilla-aurora?
Attachment #664974 - Flags: approval-mozilla-aurora?
Comment on attachment 664667 [details] [diff] [review]
Patch 1. Make updates atomic. Fixes and cleanups.

approving for aurora based on user impact in comment 66 .Please land before monday oct 8th merge.
Attachment #664667 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 664971 [details] [diff] [review]
Patch 2. Improve checksum performance.

approving for aurora based on user impact in comment 66 .Please land before monday oct 8th merge.
Attachment #664971 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #664972 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 664974 [details] [diff] [review]
Patch 4. Minor cleanups in HashStore

approving for aurora & tracking for 17 based on user impact in comment 66 .Please land before monday oct 8th merge.
Attachment #664974 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I got the NSPR removal, but missed the nsCAutoString -> nsAutoCString rename. Pushed a follow-up.

https://hg.mozilla.org/releases/mozilla-aurora/rev/921ab431cc37
Can someone who was experiencing this bug please verify it is now fixed in Firefox 17 (now in beta)? Thanks.
Whiteboard: [qa-]
Anthony, I will test this later tonight & report back.
(In reply to Steve from comment #73)
> Anthony, I will test this later tonight & report back.

Thanks Steven, much appreciated.
Beta 17 is working fine.  Mind you everything's worked since I cleared the safe browsing folders from my user profile and reinstalled Firefox.  I've not been able to reproduce the issue since that time.

If there's anything you can think of that would cause the problem again, I'm more than willing to give it a shot to see if this is a definitive fix.
The original problem required the user to have a Firefox crash in a specific time window of about 15 seconds every 30 minutes. This is small enough that it's going to be very hard to reproduce, but big enough that a few people hit it (...and managed to find that this bug was the cause, which is nontrivial too!).

Not sure we can do much more than keep an eye out for (lack of) similar reports. On Android this bug can trigger a bit more easily.

Programming is hard. Let's go shopping!
Product: Firefox → Toolkit
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: