Closed Bug 699854 (asyncDownloadMgr) Opened 13 years ago Closed 10 years ago

Refactor Downloads APIs to use async Storage

Categories

(Toolkit :: Downloads API, defect)

defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 825588

People

(Reporter: mak, Unassigned)

References

(Depends on 1 open bug, Blocks 3 open bugs)

Details

Download manager and Downloads component should stop using synchronous Storage APIs.
Alias: asyncDownloadMgr
Whiteboard: [Snappy]
Whiteboard: [Snappy] → [Snappy:P1]
Assignee: nobody → adw
This looks like a lot of work.  (Do we have an idea about how much jank will go away if the download manager is asyncified?)  Looking at both changes to the component and changes to consumers:

Changes to the component:

* These nsIDownloadManager methods would have to be asyncified: addDownload, getDownload, removeDownload, removeDownloadsByTimeframe, retryDownload, attribute canCleanUp, cleanUp.  That doesn't include the component's private methods that would need to be asyncified, too.

* The DBConnection attribute has to be removed, so its users have to be accommodated in the new API.

* It's written in C++.  Not a huge deal but JS makes it much nicer to deal with async consequences like storage callbacks.  A total rewrite in JS would be a big project since there's just a lot of code, including migration paths, and you'd have to figure out how to make some platform-specific (Windows, Android, OS X) C/C++ API calls from JS that the component makes.  Maybe you could do a hybrid two-component approach where you write a bare-minimum new component in JS to implement the new API that delegates to the old C++ component for things like migration and maintaining observers.

Changes to consumers (the the scary part):

* Regardless of whether the component is rewritten, most if not all tests would have to be rewritten to use the new async API.  (Re)writing a test suite takes at least as much time as does writing the component it tests.

* The usual for sync-to-async conversions, you have to change not only your consumers but the consumers of your consumers and so on.  But for the methods that don't return values, like removeDownload, some consumers probably don't need to wait until they complete, so those consumers can treat them as basically sync.

* There are lots of consumers.

* As one example, the downloads window has a XUL command controller that relies on sync behavior of getDownload in its isCommandEnabled implementation [1].  Assuming getDownload becomes async, you'd have to rethink how to update the enabled status of commands.

Just an idea:

Downloads are timestamped.  So you might be able to avoid asyncifying getDownload, removeDownload, and retryDownload if you load chronological batches of downloads into memory and show only those batches in the UI.  (Oh, and hope that they don't get paged out.)  If all the downloads in the UI are in memory, then all interactions with them can remain synchronous.  Maybe canCleanUp and cleanUp, too, depending on what "clean up" means.  You'd have let the user request older downloads somehow (like reaching the end of a listbox), and you'd have to be careful to release downloads from memory once the user is done with them so you're not potentially keeping the entire database in memory for a long time.  Tests would still need to be updated but hopefully many would not need to be entirely rewritten.

[1] http://mxr.mozilla.org/mozilla-central/source/toolkit/mozapps/downloads/content/downloads.js#770
(In reply to Drew Willcoxon :adw from comment #1)
> This looks like a lot of work.  (Do we have an idea about how much jank will
> go away if the download manager is asyncified?)

It's in the top 10 things to fix, at approximately #8 position according to slow sql telemetry. http://people.mozilla.org/~xstevens/telemetry/712354/

I say approximately because we need a better dump from the server and we aren't submitting unprepared queries yet, so there might be some things that bump it down/up. I see this a lot during browsing, we should fix it.
Not doubting, just curious, what do you see and how do you know it's caused by storage in the download manager?  I see a lot of jank in Firefox, but I can't point to the download manager.  But I haven't instrumented it, and I don't often "download" or use any of the download UI.

What's the context of the top 10 stat?  Like, are you saying that over a given period of time, the time spent executing download manager SQL is a top-10 cause of the unresponsiveness experienced by the average user over that period?
(In reply to Drew Willcoxon :adw from comment #3)

> What's the context of the top 10 stat?  Like, are you saying that over a
> given period of time, the time spent executing download manager SQL is a
> top-10 cause of the unresponsiveness experienced by the average user over
> that period?

No... main thread sql is a big part of it visible jank. Download manager is ~#10 in that.
I'm asking what it means exactly to be top 10.  That's not really an answer...
(In reply to Drew Willcoxon :adw from comment #5)
> I'm asking what it means exactly to be top 10.  That's not really an
> answer...

pretty sure it is
I haven't done anything to deserve the attitude.

So it sounds like you're just comparing the times it took to execute all queries in Firefox which you measured and totaled in some manner you won't deign to tell me with no regard to how frequently those queries occur or how they contribute to the user's overall perception of unresponsiveness.  Weak rationale for fixing this bug.
there is a single important reason to fix this, off my head, that is the DM writes to the database happen basically at the same time the download itself does IO, and this ends up hanging the UI till the disk is free enough to handle the request. This hang for some user may last even multiple seconds (I think I saw reports of 7 seconds or such).
That said, I'm not sure that SQLite, and specifically downloads.sqlite is still the right store for this data.
One side would be so damn awesome if we could store the download data in a tail tag in the .part file (so to basically make the download portable across devices and profiles), the other side a whole relational db for just a single simple table makes not much sense.
(In reply to Marco Bonardo [:mak] from comment #8)
> the DM
> writes to the database happen basically at the same time the download itself
> does IO

without forgetting the additional double IO caused by the antivirus when the download is finished, that is the most offending reason.
From Marco and Drew's comments, it appears that this basically amounts to a rewrite, not a refactor.

1. We need bugs on the specific codepaths/queries that are causing problems. We can look at those individually to see if there are any mitigating improvements we can make in the short-term. Taras, can you post the specific queries that you have identified as slow, so we can look at spot-fixing where possible? The data in comment #2 is a haystack.

2. We need to design the replacement system. There's some talk in this bug, but we need to formalize that into a spec so that anyone can pick up the work. Let's tackle that after figuring out what the specific blocking pieces are, per point #1.
(In reply to Dietrich Ayala (:dietrich) from comment #10)
> From Marco and Drew's comments, it appears that this basically amounts to a
> rewrite, not a refactor.

yes, it may end up being a rewrite. The API was just not designed to be async.

> 1. We need bugs on the specific codepaths/queries that are causing problems.

Any write.

> We can look at those individually to see if there are any mitigating
> improvements we can make in the short-term.

I suggested we could use WAL, that will reduce fsyncs that are the major cause of hangs. Reducing means we may still hang sometimes, but potentially not on each write.  Another possibility could be to use the SQLite async vfs, though in the past it caused us corruption headaches.

> 2. We need to design the replacement system. There's some talk in this bug,
> but we need to formalize that into a spec so that anyone can pick up the
> work. Let's tackle that after figuring out what the specific blocking pieces
> are, per point #1.

Yes, my point was not exactly a guide to a good implementation, mostly the fact we should not be afraid of investigating even exotic solutions, since this needs a rewrite regardless.  Some suggested things from the past are download-info-tag in the .part file, merge downloads.sqlite in places.sqlite, don't use SQLite.
If we are doing a rewrite, we should look into piggiebacking on Places.
(In reply to Drew Willcoxon :adw from comment #7)
> So it sounds like you're just comparing the times it took to execute all
> queries in Firefox which you measured and totaled in some manner you won't
> deign to tell me with no regard to how frequently those queries occur or how
> they contribute to the user's overall perception of unresponsiveness.  Weak
> rationale for fixing this bug.

We need to fix all main thread IO. I use slow-sql reports to prioritize work, but there is no main thread io worth leaving in the product because it's tricky. We just spent multiple months rewriting the search service, and have some distance to go there still. Just because something is a pain, is not a good reason to avoid it.

I posted a link to slow sql queries above, here are the 3 downloads queries...Note how they have so far recorded stalls of up to a minute.

UPDATE moz_downloads SET tempPath = :tempPath, startTime = :startTime, endTime = :endTime, state = :state, referrer = :referrer, entityID = :entityID, currBytes = :currBytes, maxBytes = :maxBytes, autoResume = :autoResume WHERE id = :id198921(100.0,131.66666666666666,164.75,207.94444444444446,429.0,1722.5)
INSERT INTO moz_downloads (name, source, target, tempPath, startTime, endTime, state, mimeType, preferredApplication, preferredAction) VALUES (:name, :source, :target, :tempPath, :startTime, :endTime, :state, :mimeType, :preferredApplication, :preferredAction)138253(100.0,123.0,149.0,197.5,431.0,2681.0)
UPDATE moz_downloads SET state = :state WHERE state = :state_cond6262(101.0,143.0,244.0,488.0,1918.0,61720.0)

This is high priority.
Taras, nobody is arguing that we should not kill main thread IO :)
Looks like support got a bunch of requests due to UI hanging when finishing a download and passing it to MS Security essentials.  I think I saw the same problem and the reason are fsyncs happening when the disk is busy with the AV.  If we don't attack this now, we at least need one of the workarounds I pointed out to reduce fsyncs, WAL being the first off my head.

I'm fine with the plan in comment 12, though I don't know if mobile is still using the download manager somewhere (mbrubeck tells me they still do), if so we can't do that, since they explicitly opt-out of Places to reimplement their history and bookmarks backend.

Otherwise, a possibility may be to implement a completely new downloads service in Places, that is async and takes into account the new per-window PB needs, and stop using the old service in Firefox.  Other consumers, included mobile, may keep using the old service.  How would this sound?  From a certain point of view sounds like a waste of resources, but we already start from a waste of resources reality.  Regardless, having 2 different components in the tree sharing the same storage would be crazy so a merge would have to move all stuff to components/places, breaking any consumer rather than just Firefox and increasing the needed work.
Assignee: adw → nobody
Just a thing I've observed under daily usage, Microsoft Security Essentials on Windows XP seems to lock with 100% CPU (most of that taken by the MSE service itself) any time you first open the download manager during a browsing session, resulting in UI jank This keeps going until all downloads are populated in the download manager.

Also, even with "browser.download.manager.scanWhenDone;false", completing a download will result in a 5-10 second lag, more when you have selected to open the download.

The culprit is MSE, since it seems to be strangling disk IO in many circumstances, even when using Explorer to open a folder heavily populated with executables.

Killing the MSE service manually midway during all above operations speeds them up by an order of magnitude. For example, populating the download manager entries starts going a lot faster.

Curiously, this MSE issue does not manifest in Windows 7 (maybe it's using a newly introduced API to scan files?).
(In reply to John Volikas from comment #16)
> Just a thing I've observed under daily usage, Microsoft Security Essentials
> on Windows XP seems to lock with 100% CPU

I can confirm you that MSE has a quite bad IO management, I switched to it from Avast! and only got IO issues from then (included Thunderbird, where it completely blocks incoming mailboxes). We should test future enhancements with it to get a decent idea of the hangs issues.
(In reply to John Volikas from comment #16)
> Also, even with "browser.download.manager.scanWhenDone;false", completing a
> download will result in a 5-10 second lag, more when you have selected to
> open the download.

Are you saying that even with scanWhenDone=false, MSE is causing performance problem?
(In reply to Dietrich Ayala (:dietrich) from comment #18)
> Are you saying that even with scanWhenDone=false, MSE is causing performance
> problem?

Even if we don't explicitly pass the download to the antivirus, it's quite common for a real-time protection AV to check a newly added executable.  In such a case our fsync would still clash against a busy disk.
(In reply to Dietrich Ayala (:dietrich) from comment #18)
> Are you saying that even with scanWhenDone=false, MSE is causing performance
> problem?

Yeah, that's the case.
Related to storage handling: I had downloads.sqlite-journal file, that blocked the download window (also I was not able to download files via Firefox) . Removing file solved the problem. Please take care during the refactor.
(In reply to Kami from comment #21)
> Related to storage handling: I had downloads.sqlite-journal file, that
> blocked the download window

unless the db is corrupt or some huge operation is happening, the presence of the journal file is unrelated to locking, may rather be the opposite (the journal grows due to a blocking operation).
If, as I understand, there are two unrelated issues (anti-virus and actual I/O), we should probably split this in two bugs. Also, is the anti-virus problem something that we can solve or something that we should ask the anti-virus developers to solve?
(In reply to David Rajchenbach Teller (away until early August) [:Yoric] from comment #23)
> If, as I understand, there are two unrelated issues (anti-virus and actual
> I/O), we should probably split this in two bugs. Also, is the anti-virus
> problem something that we can solve or something that we should ask the
> anti-virus developers to solve?

You can use https://bugzilla.mozilla.org/show_bug.cgi?id=551427 for AV issue. It's not yet clear what causes AV slowdown. This bug is for async download manager db refactoring.
A quick look at downloads.sqlite seems to suggest that sqlite is a little bit overpowered for this use, so this might be a good candidate for JSON-ification.
(In reply to David Rajchenbach Teller (away until early August) [:Yoric] from comment #25)
> A quick look at downloads.sqlite seems to suggest that sqlite is a little
> bit overpowered for this use, so this might be a good candidate for
> JSON-ification.

sqlite provides good robustness which is harder with manual IO. We can switch to async SQLite and then look into a different backend if perf is lacking.
(In reply to Taras Glek (:taras) from comment #24)
> It's not yet clear what causes AV slowdown.

From some investigation I did in the past, just looks like AV has no limit on IO, so basically AV makes IO so much IO that we hang on synchronous fsyncs until it's done. Btw, as said it's sort of off-topic for this bug.
Blocks: 803570
Depends on: 809197
Please keep in mind there's also possibility to download files to remote locations using GIO (see bug 682838) so I would probably recommend to shift from using nsIFile to nsIURI (for example there's a lot of s/nsIFile/nsIURI/ in patch https://bugzilla.mozilla.org/attachment.cgi?id=642514&action=diff).
We're building a JavaScript API for downloads, tracked in bug 825588, that will be
asynchronous and with off-main-thread file access. We're using simple JSON file
serialization for persisting in-progress downloads across browsing sessions.

More details in the implementation bug 825591 and the working draft document:

https://wiki.mozilla.org/User:P.A./Download_architecture_improvements
Depends on: jsdownloads
Blocks: 830757
No longer blocks: 830757
Paolo, at this point I think we should just dupe this to bug 825588, we are not going to spend time rewriting the old implementation.
Flags: needinfo?(paolo.mozmail)
Definitely!
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(paolo.mozmail)
Resolution: --- → DUPLICATE
Whiteboard: [Snappy:P1]
You need to log in before you can comment on or make changes to this bug.