Open Bug 551427 Opened 14 years ago Updated 2 years ago

Whole browser should not hang while download file is built/scanned

Categories

(Toolkit :: Downloads API, defect)

x86
Windows XP
defect

Tracking

()

People

(Reporter: aaronclawrence, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [Snappy:P2])

Attachments

(1 file)

At the end of a large download, with an antivirus product installed, the whole browser will hang for a while while the file is scanned.
It's understandable that the scanning takes a while, the performance is specific to the AV product.
However, during this time the whole of Firefox should not hang. 
Technically, building or copying the file (whatever FF is doing) should be in a background thread, and the download manager should just be showing something helpful like "writing downloaded file" or "scanning downloaded file" (if it knows).
(Tested FF3.5.7)
Bug 435598 seems to suggest that sometimes the scan never finishes (although never is a long time :)
I see the hang as the whole browser being frozen, but there seems some doubt whether this is always true. Some people say it just "hangs" showing scanning in download manager. 
I see it with Nod32, "advanced heuristics" enabled, scanWhenDone is set to true in FF, yet I don't think I see anything in Download Manager, which suggests Nod32 ignores the scan hook FF calls and does it's own thing on file writes.
In bug 490305  there was a comment that "This is probably because the back-end uses synchronous RPC to update the UI display.". So perhaps it IS multithreaded but because of the synchronous call the UI gets hung anyway.. ?
(In reply to comment #0)
> Technically, building or copying the file (whatever FF is doing) should be in a
> background thread, and the download manager should just be showing something
> helpful like "writing downloaded file" or "scanning downloaded file" (if it
> knows).
It is, in fact, always done in a background thread.
OK. Yet, somehow, the whole browser hangs.
My report is related, not an exact duplicate though.

On my PC it happens with smaller (<1 MB) executables/archives as well.

As suggested in this report, Firefox shouldn't hang while a file is being scanned, yet, somehow, it does.
LeetPirate @ nsane.forums posted:
'The reason FF might take longer than other browsers might have to do with the way it saves files. From what I have noticed, FF writes to a .part file then copies it over to the real file name which is 2 new file writes and multiple file modifications and reads, whereas IE writes 1 file and fills in the bytes into the same file till it's done. Newly created files receive the most attention from most AV software so FF requires twice the time to do the same file download. This is my theory, I have not really done any extensive testing.'
Blocks: 444467
(In reply to Shawn Wilsher :sdwilsh from comment #3)
> (In reply to comment #0)
> > Technically, building or copying the file (whatever FF is doing) should be in a
> > background thread, and the download manager should just be showing something
> > helpful like "writing downloaded file" or "scanning downloaded file" (if it
> > knows).
> It is, in fact, always done in a background thread.

I can reproduce this sometimes, but not always. browser.download.manager.scanWhenDone=true, often when I save an image (~150K), the Ui will be unresponsive for *multiple seconds*. E.g. right-clicking will not display the context menu for 7+ seconds as measured by a stopwatch. If I set the pref to false, then right-clicking after download always immediately displays the context menu. When this happens, it can be reproduced repeatedly. But, it doesn't always happen.

This would be very annoying for a user who is, for example, saving multiple images off of Facebook at once.
Whiteboard: [Snappy]
Brian can you look into this? Marking it a P2 as it's not clear how easy this is to reproduce.
Assignee: nobody → netzen
Whiteboard: [Snappy] → [Snappy:P2]
I've noticed this before.  Yup I'll look into it.
So the biggest bottleneck on the main thread happens in nsDownload::SetState.
After each scan it does about 150ms-200ms on the main thread.  Most the problem in Setstate is with the update to the DB:

> // Before notifying the listener, we must update the database so that calls
> // to it work out properly.
> nsresult rv;
> rv = UpdateDB();

The actual scan happens on a secondary thread, but the antivirus 
scan dispatch happens on the main thread as it should.  This dispatch is what calls SetState.

> nsresult
> nsDownloadScanner::Scan::Run()
> {
> ...
> // Download will be null if we already timed out  
> if (mDownload)
>   (void)mDownload->SetState(downloadState);
(In reply to Brian R. Bondy [:bbondy] from comment #12)
> So the biggest bottleneck on the main thread happens in nsDownload::SetState.
> After each scan it does about 150ms-200ms on the main thread.  Most the
> problem in Setstate is with the update to the DB:

150-200ms is too slow, and should be fixed, but that is way faster than the multi-second UI freezes that this bug is about.
There are actually 4 calls so total it is more like 1 second, but I'll keep looking after this is fixed.
There was also some file API ops that very recently landed that might have caused a bottleneck here as well by the way.
Attached patch Patch v1.Splinter Review
I can't detect any UI freezing after this patch.  Might not be fully solved but it is a major improvement.
Attachment #596496 - Flags: review?(mak77)
Comment on attachment 596496 [details] [diff] [review]
Patch v1.

Review of attachment 596496 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Brian Smith (:bsmith) from comment #13)
> (In reply to Brian R. Bondy [:bbondy] from comment #12)
> > So the biggest bottleneck on the main thread happens in nsDownload::SetState.
> > After each scan it does about 150ms-200ms on the main thread.  Most the
> > problem in Setstate is with the update to the DB:

Hm, that time doesn't make any sense, considered that it's a simple UPDATE statement on primary key, that surely may cause a slower fsync, but not enough to take that much time.  As far as I can tell nobody ever reported a slow query in the DM, and it would be pretty much visible in about:telemetry.
Are you sure it's the statement and not rather the notification? May be the statement is slow just because the disk is kept extremely busy by the antivirus check? can we measure this?

Btw, one issue with this change it's that it's introducing thread contention issues into the download manager, that by itself means solving an hang issue to introduce one potentially worse. Bug 699854 is filed to convert all downloads backend to async Storage to avoid contentions. To clarify, once you start using both synchronous and asynchronous statements on the same connection, you must be aware they share a common mutex, and they can race for it. That may be or not a problem, depending on the database traffic you expect at that time. Are there other synchronous writes on the database during this period apart updateState ones?

::: toolkit/components/downloads/nsDownloadManager.cpp
@@ +1944,5 @@
>        return CancelDownload(id);
>    } else if (strcmp(aTopic, "profile-before-change") == 0) {
>      mGetIdsForURIStatement->Finalize();
>      mUpdateDownloadStatement->Finalize();
> +    mozilla::DebugOnly<nsresult> rv = mDBConn->AsyncClose(NULL);

nsnull please

@@ +3091,5 @@
> +  // Just copy-construct ourselves except for mRefCnt.
> +  // The state of mDownloadState may change after the DB call, but we want to
> +  // notify the observers about the state at this instant.
> +  nsDownload *dl = new nsDownload(*this);
> +  if (!dl) {

new is infallible afaik

@@ +3094,5 @@
> +  nsDownload *dl = new nsDownload(*this);
> +  if (!dl) {
> +    return NS_ERROR_OUT_OF_MEMORY;
> +  }
> +  dl->mRefCnt = 0;

Could you rather make a proper copy constructor for nsDownload rather than changing the RefCnt here? Or could we just pass the current mDownloadState to the callback in addition and use that later? I don't like that much touching the RefCnt in code like this.

@@ +3099,5 @@
> +
> +  nsRefPtr<DownloadStateDBListener> callback = 
> +    new DownloadStateDBListener(dl, aOldState);
> +  nsCOMPtr<mozIStoragePendingStatement> handle;
> +  return stmt->ExecuteAsync(callback, getter_AddRefs(handle));

please assign to rv, check it and return NS_OK per code convention in this module

::: toolkit/components/downloads/nsDownloadManager.h
@@ +465,5 @@
> +  {
> +  }
> +
> +  NS_DECL_ISUPPORTS
> +  NS_IMETHOD HandleCompletion(unsigned short aReason)

add newline after NS_DECL_ISUPPORTS

@@ +467,5 @@
> +
> +  NS_DECL_ISUPPORTS
> +  NS_IMETHOD HandleCompletion(unsigned short aReason)
> +  {
> +    NS_ENSURE_ARG_POINTER(mDownload);

How could this fail?
Attachment #596496 - Flags: review?(mak77)
> Hm, that time doesn't make any sense, considered that it's a simple UPDATE
> statement on primary key, that surely may cause a slower fsync, but not enough
> to take that much time.  As far as I can tell nobody ever reported a slow 
> query in the DM, and it would be pretty much visible in about:telemetry.
>
> Are you sure it's the statement and not rather the notification? 
> May be the statement is slow just because the disk is kept extremely 
> busy by the antivirus check? can we measure this?

I'm not sure how to convince you of the results, but I can assure you I did not make up these numbers.
I would never guess at an optimization because that would be a waste of my time, as well as the time of the reviewer.
I'm doing my profiling on a newer MBP: m620 i7 2.67Ghz, 8GB RAM, Win7x64.

This is a profiling snapshot taken when downloading winrar.  The download itself only takes a few seconds.
I've been using this profiler for several years in the past and have never had it lie to me for an unexplained reason.

The format of the below is X.YYY:ZZ
YYY is the number of milliseconds, X would be seconds. 

nsDownload::SetState 
# of calls: 4
Total time: 0.998:94
Average time: 0.249:73 


If you split apart this function you have the following breakdown:


nsDownload::SetState::1
# of calls: 4
Total time: 0.014:07 
Average time: 0.003:51 

nsDownload::SetState::2
# of calls: 4
Total time: 0.959:17
Average time: 0.239:79 

nsDownload::SetState::3
# of calls: 4
Total time: 0.025:52
Average time: 0.006:38 

nsDownload::SetState::4
# of calls: 4
Total time: 0.000:00
Average time: 0.000:0:0


Note: all 4 calls may not be called during scanning, but this is all done on the main thread.
You can see the breakdown below...

> nsresult
> nsDownload::SetState(DownloadState aState)
> {
>   PROFILE_FUNC();
> 
>   NS_ASSERTION(mDownloadState != aState,
>                "Trying to set the download state to what it already is set to!");
> 
>   PRInt16 oldState = mDownloadState;
>   mDownloadState = aState;
> 
>   // We don't want to lose access to our member variables
>   nsRefPtr<nsDownload> kungFuDeathGrip = this;
> 
>   {
>     PROFILE_STR("nsDownload::SetState::1");
>   // When the state changed listener is dispatched, queries to the database and
>   // the download manager api should reflect what the nsIDownload object would
>   // return. So, if a download is done (finished, canceled, etc.), it should
>   // first be removed from the current downloads.  We will also have to update
>   // the database *before* notifying listeners.  At this point, you can safely
>   // dispatch to the observers as well.
>   switch (aState) {
> ...
> ...
> ...
>   default:
>     break;
>   }
>  }
>
>  // Before notifying the listener, we must update the database so that calls
>  // to it work out properly.
>  nsresult rv;
>  {
>    PROFILE_STR("nsDownload::SetState::2");
>    rv = UpdateDB();
>    NS_ENSURE_SUCCESS(rv, rv);
>  }
>
>  {
>    PROFILE_STR("nsDownload::SetState::3");
>
>  mDownloadManager->NotifyListenersOnDownloadStateChange(oldState, this);
>  }
>
>  {
>    PROFILE_STR("nsDownload::SetState::4");
>
>  switch (mDownloadState) {
> ...
> ...
> ...
>    default:
>      break;
>  }
>  }
>  return NS_OK;
>}
> That may be or not a problem, depending on the database traffic you expect at
> that time. Are there other synchronous writes on the database during this
> period apart updateState ones?

I think there are before but not around the same time. 

> +
> +  NS_DECL_ISUPPORTS
> +  NS_IMETHOD HandleCompletion(unsigned short aReason)
> +  {
> +    NS_ENSURE_ARG_POINTER(mDownload);

It can't currently.  This was added as a protection against future code changes. 

> +  if (!dl) {
> new is infallible afaik

Ya I thought so too but the rest of the file had checking like this. I'll remove it.
oh and by the way I even profiled into UpdateDB the first time I did it and all of the time is spent in:

> instmt->Execute();
(In reply to Brian R. Bondy [:bbondy] from comment #19)
> > +
> > +  NS_DECL_ISUPPORTS
> > +  NS_IMETHOD HandleCompletion(unsigned short aReason)
> > +  {
> > +    NS_ENSURE_ARG_POINTER(mDownload);
> 
> It can't currently.  This was added as a protection against future code
> changes. 

please use MOZ_ASSERT for these kind of protections
> Btw, one issue with this change it's that it's introducing thread contention
> issues into the download manager, that by itself means solving an hang issue
> to introduce one potentially worse.

I of course always verify my proifling results as well by the way. I verified it a second time just now so I could include the stats below.
          
nsDownload::SetState:
# of calls: 4
Total time: 0.014:14
Average time: 0.003:53

nsDownload::SetState::1
# of calls: 4
Total time: 0.012:80
Average time:  0.003:20 

nsDownload::SetState::2
# of calls: 4
Total time: 0.001:28
Average time: 0.000:32 

nsDownload::NotifyStateChange (equivalent ot the old nsDownload::SetState::2 + 3)
# of calls: 4
Total time: 0.015:80 
Average time: 0.003:95 

Definitely I think 14ms is better than the old 1second
30ms since SetState no longer contains nsDownload::NotifyStateChange
I didn't doubt much about the results, rather were about the reason these simple writes take so much time. There must be some high IO traffic that delays the fsync. May you do a similar benchmark setting "PRAGMA synchronous = OFF" after OpenDatabase, so we skip the fsyncs. If that doesn't require too much time.
ah, and I strongly suggest to do a try run, some tests may expect the state to change synchronously.
BTW my computer has nothing open otherwise using the disk extensively. 
But perhaps the high IO traffic is the download itself?

> There must be some high IO traffic that delays the fsync. 
> May you do a similar benchmark setting "PRAGMA synchronous = OFF" 
> after OpenDatabase, so we skip the fsyncs. If that doesn't require too much time.

Sure np, I'll give it a try and let you know the results.
PRAGMA synchronous = OFF does the trick...

nsDownload::SetState
# of calls: 4
Total time: 0.029:29
Average time: 0.007:32 

nsDownload::UpdateDB
# of calls: 4
Total time: 0.004:00
Average time: 0.001:00 

Could I just attach a patch that does that instead since this is not critically important data? It seems a few other places in code already do this (nsPermissionsManager, nsCookieService, nsDiskCacheDeviceSQL)
(In reply to Brian R. Bondy [:bbondy] from comment #26)
> BTW my computer has nothing open otherwise using the disk extensively. 
> But perhaps the high IO traffic is the download itself?
> 

I suspect you are right. I noticed that firefox is a lot less responsive while downloading large files. Windows might be flushing the download along with any other fsync we do..and if the fsync is on the main thread..then we are screwed. Brian, perhaps we should set some write-through flag for downloads to avoid polluting the OS buffer cache.
Not write through but perhaps something, but I think highest priority would be to get code that is waiting on IO like this out of main thread.  I'll look into it separately from this ticket.
(In reply to Brian R. Bondy [:bbondy] from comment #27)
> Could I just attach a patch that does that instead since this is not
> critically important data?

I agree it's not critical data, though it's not an easy decision to take off-hand. Actually once we move to the new download panel this data will be even less important.  The download history is also kept in Places, this database will just retain the current downloads status and we could even to kill this database completely (using some better optimized store, or just moving the single status table to places.sqlite).

There are some alternatives to completely killing data integrity with synchronous off (that mostly could cause corruption on crashes):
1. switch the journal mode to WAL, then the possibility of an fsync should be much rarer (I'm supposing we set auto checkpoints to 128KB or such, my current database is 160KB, so it may probably fsync once per session or such).
2. use the async-io vfs (http://www.sqlite.org/asyncvfs.html), that removes durability from the ACID database, but this database is so simple we may not care.

the former is clearly cheap to implement, the latter requires importing the SQLite vfs, making the needed changes (possibly none!) and expose a way to use it in the API (I was already thinking of adding an optional aFlags param to openDatabase and have options like exclusive, unshared, async, ...)
I do not think we should make the db lossy/async. It feels like fsyncs are important for flushing data to disk(sql on purpose, download due to OS limitations/heuristics). Getting rid of fsyncs in download manager will just make fsyncs elsewhere worse. Moving download manager off the main thread seems like the way to go.
(In reply to Taras Glek (:taras) from comment #28)
> I suspect you are right. I noticed that firefox is a lot less responsive
> while downloading large files. Windows might be flushing the download along
> with any other fsync we do..and if the fsync is on the main thread..then we
> are screwed. Brian, perhaps we should set some write-through flag for
> downloads to avoid polluting the OS buffer cache.

The database update problems seem like a red herring to me. Obviously, it is good to fix, but I don't see how it will solve the actual problem here. In particular, I would not expect a few fsyncs on my Vertex3 SSD on Windows 7 on NTFS to cause the 7 second delays I experienced.

I reproduced this bug multiple times and it is definitely related to the browser.download.manager.scanWhenDone setting. The multi-second hangs do not occur when it is false but they do occur when they are true. I verified this multiple times. So, unless we are writing something different to the database when doing virus scanning, I think the slow database update times are a separate bug. The difficult thing is that these hangs don't always occur even when the setting is true; though, when they start to occur, they tend to occur repeatedly in that browsing session.

I have heard that some anti-virus software does scanning at the time a file is closed, so it might be that we are hanging when we close the file handle, if we are doing so on the main thread. Also, we download files to a ".part" file and then copy/rename it to the final name; I have heard this causes virus scanners to scan the file multiple times, which can also cause slowness, especially if this copy/rename is happening on the main thread and/or the main thread is waiting on a mutex held during such operations.
(In reply to Brian Smith (:bsmith) from comment #32)
> I suspect you are right. I noticed that firefox is a lot less responsive
> while downloading large files.

There is a bug for this.

> Windows might be flushing the download along
> with any other fsync we do..and if the fsync is on the main thread..then we
> are screwed.

This might be possible, but I observed 7 second hangs with downloads that were only ~150KB (images) on a very fast SSD, so the amount of additional I/O generated by the download wouldn't explain the hang.

> Brian, perhaps we should set some write-through flag for
> downloads to avoid polluting the OS buffer cache.

The file is likely to be opened immediately after it is downloaded, so we are usually better off keeping it in the OS buffer cache.
> The database update problems seem like a red herring to me.

This is the cause of any jank that I'm experiencing but I have not experienced any 7 second delays like bsmith mentions.  

I do see that on my system it relieves 1 second of wait time on the main thread, and there may be cases on other systems that this could be significantly longer. 

If you can reproduce this easily bsmith perhaps you can profile it to see where the delay is? Please feel free to take the bug. I cannot profile something I do not experience. I have profiled what I do experience and that's the limit of what I can do for this bug.  

I'll post another bug for my findings and link them up because I'd like to make some progress and move on to make other progress in other bugs.
(In reply to Brian R. Bondy [:bbondy] from comment #34)
> If you can reproduce this easily bsmith perhaps you can profile it to see
> where the delay is? Please feel free to take the bug. I cannot profile
> something I do not experience. I have profiled what I do experience and
> that's the limit of what I can do for this bug.  

I tried to dive into this already but I quickly got lost in the back-and-forth between JS and native code. I even tried to use xperf to figure out what was going on but got nowhere (likely due to my own ignorance). Somebody more familiar with how the download manager works should take this bug.
Unassigned myself. On the bright side, you may soon only have a 6 second delay ;)
Assignee: netzen → nobody
Depends on: 727275
(In reply to Taras Glek (:taras) from comment #31)
> I do not think we should make the db lossy/async. It feels like fsyncs are
> important for flushing data to disk(sql on purpose, download due to OS
> limitations/heuristics). Getting rid of fsyncs in download manager will just
> make fsyncs elsewhere worse. Moving download manager off the main thread
> seems like the way to go.

Both of my suggestions don't create danger for data, both WAL and the async vfs are safe, the former preserves ACID, the latter just ACI, though I don't think we care about durability in downloads.sqlite.
Rewriting all the API to be asynchronous may take more time, I also suspect we'll want to throw away the database and maybe even store downloads metadata in the .part file (that would allow nice things like downloads portability, resume from .part and so on).
Depends on: 789932
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: