Bug 627635 (localStorageIO) - Make remote localStorage not block the UI main thread on disk IO

Status: NEW
Whiteboard: [at-risk][tech-bug]
Keywords: main-thread-io
Product: Core
Classification: Components
Component: DOM
Version: unspecified
Platform: All / All
Importance: -- normal, 13 votes
Assigned To: Nobody; OK to take it and work on it
QA Contact: Andrew Overholt [:overholt]
Duplicates: 568095 701880 792870 823720
Depends on: 736144 600307 687566 704933
Blocks: 698500 StorageJank 701872 b2g-v-next 622112 701880

Reported: 2011-01-20 22:35 PST by Chris Jones [:cjones] inactive; ni?/f?/r? if you need me
Modified: 2014-09-18 08:14 PDT
CC: 46 users


Attachments
experimental1 (11.55 KB, patch) - 2011-01-27 16:35 PST, Honza Bambas (:mayhemer)
experimental1.1 (11.01 KB, patch) - 2011-01-27 16:46 PST, Honza Bambas (:mayhemer)
profiling data (8.08 KB, text/plain) - 2011-02-15 18:00 PST, Alon Zakai (:azakai)

Description Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-01-20 22:35:41 PST
It came out in bug 622112 that our localStorage implementation blocks the content-process main thread on the chrome-process main thread, which is not great but OK.  But, the chrome-process main thread then appears to be blocked on the localStorage request, which involves grabbing the SQLite mutex, which means blocking on disk.  That's not palatable for fennec.
Comment 1 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-01-20 22:41:30 PST
Allowing web content to block fennec's UI thread on disk IO is ... not palatable.  I think we either need to not do this or disable localStorage for 4.0.  We've never shipped a fennec with it enabled, and it's being phased out by IndexedDB, so I wouldn't lose too much sleep over punting it to a 4.x or later.

I don't know what's involved in avoiding UI-thread disk IO.
Comment 2 Josh Matthews [:jdm] (on vacation until Dec 5) 2011-01-20 22:53:27 PST
Honza, the following stack from bug 622112 is just the flushing timer running.  It looks like a prime candidate to move to a background thread, but I don't know anything about whether storage connections can be used off the main thread.

#0  mozilla::storage::SQLiteMutex::lock (this=0x7fffe08c6928) at
/home/cjones/mozilla/mozilla-central/storage/src/SQLiteMutex.h:128
#1  0x00007ffff4a40188 in SQLiteMutexAutoLock (this=0x7fffffffd290, aMutex=...)
at /home/cjones/mozilla/mozilla-central/storage/src/SQLiteMutex.h:183
#2  0x00007ffff4a3eaec in mozilla::storage::Connection::BeginTransactionAs
(this=0x7fffe08c68f0, aTransactionType=0) at
/home/cjones/mozilla/mozilla-central/storage/src/mozStorageConnection.cpp:930
#3  0x00007ffff37db6ee in mozStorageTransaction (this=0x7fffffffd300,
aConnection=0x7fffe08c68f0, aCommitOnComplete=0, aType=0) at
../../dist/include/mozStorageHelper.h:77
#4  0x00007ffff417fa46 in nsDOMStoragePersistentDB::FlushTemporaryTables
(this=0x7fffe08c4810, force=false) at
/home/cjones/mozilla/mozilla-central/dom/src/storage/nsDOMStoragePersistentDB.cpp:535
#5  0x00007ffff4179a6b in nsDOMStorageDBWrapper::FlushAndDeleteTemporaryTables
(this=0x7fffe08c4810, force=false) at
/home/cjones/mozilla/mozilla-central/dom/src/storage/nsDOMStorageDBWrapper.cpp:104
#6  0x00007ffff416dcc9 in nsDOMStorageManager::Observe (this=0x7fffe02cac00,
aSubject=0x0, aTopic=0x7ffff58f2205 "domstorage-flush-timer", aData=0x0) at
/home/cjones/mozilla/mozilla-central/dom/src/storage/nsDOMStorage.cpp:486
#7  0x00007ffff4f0f771 in nsObserverList::NotifyObservers (this=0x9db7c8,
aSubject=0x0, aTopic=0x7ffff58f2205 "domstorage-flush-timer", someData=0x0) at
/home/cjones/mozilla/mozilla-central/xpcom/ds/nsObserverList.cpp:130
#8  0x00007ffff4f1140e in nsObserverService::NotifyObservers
(this=0x7fffe00f5160, aSubject=0x0, aTopic=0x7ffff58f2205
"domstorage-flush-timer", someData=0x0) at
/home/cjones/mozilla/mozilla-central/xpcom/ds/nsObserverService.cpp:182
#9  0x00007ffff416d837 in nsDOMStorageManager::Observe (this=0x7fffe02cac00,
aSubject=0x7fffe0c1de70, aTopic=0x7ffff5b5393d "timer-callback", aData=0x0) at
/home/cjones/mozilla/mozilla-central/dom/src/storage/nsDOMStorage.cpp:449
Comment 3 Honza Bambas (:mayhemer) 2011-01-23 14:14:49 PST
Josh, I commented in bug 622112 comment 30, but those suggestions are just a partial fix.

AFAIK, SQLite is not very thread-safe, or at least wasn't the last time I witnessed attempts to use it on multiple threads.

I would tend more toward using the async API for this; that would accomplish the premise of not having IO on the main thread.

But first I'd like to know why it takes so long to acquire the mutex, when there is just a single connection bound to a single database file on disk, all of it accessed only on the main thread.
Comment 4 Shawn Wilsher :sdwilsh 2011-01-24 10:00:12 PST
SQLite is threadsafe (in that it serializes access to the database).  The mozIStorageConnection object is also threadsafe, but I'd prefer you to use an asynchronous statement to do background thread work (this lets Storage manage the thread for you).

I don't have any comments as to why it takes so long to get the mutex.
Comment 5 Doug Turner (:dougt) 2011-01-25 12:38:01 PST
aza, can you take a look?
Comment 6 Doug Turner (:dougt) 2011-01-25 12:38:28 PST
^azakai, of course.
Comment 7 Honza Bambas (:mayhemer) 2011-01-25 13:57:15 PST
Alon, also check bug 622112 comment 30, though those suggestions are only a partial solution.
Comment 8 Alon Zakai (:azakai) 2011-01-25 16:01:18 PST
Some questions:

1. How common is localStorage on the web?
2. How big of a delay is caused? (hard for me to tell from bug 622112 whether localStorage is the main issue or not)
3. Can we estimate how much of a speedup the fixes in bug 622112 comment 30 would get us?
Comment 9 Honza Bambas (:mayhemer) 2011-01-25 17:29:17 PST
(In reply to comment #8)
> 3. Can we estimate how much of a speedup the fixes in bug 622112 comment 30
> would get us?

Avoiding the call to GetTransactionInProgress in MaybeCommitInsertTransaction will save acquiring the lock for every GetItem not preceded by a SetItem.

The second optimization should actually be: don't open a transaction until a scope whose data has been modified has been found. That will save acquiring the lock when a page only reads localStorage.
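The deferred-transaction idea can be sketched with Python's `sqlite3` standing in for mozStorage; the class and method names below are illustrative, not the actual Gecko code:

```python
import sqlite3

class LazyStore:
    """Key-value wrapper that opens a write transaction only on the first
    set_item; pure reads never touch the transaction machinery (and so
    never need the SQLite mutex for a BEGIN)."""

    def __init__(self, path=":memory:"):
        # isolation_level=None puts the connection in autocommit mode,
        # so we control BEGIN/COMMIT explicitly.
        self.db = sqlite3.connect(path, isolation_level=None)
        self.db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
        self.in_txn = False

    def set_item(self, key, value):
        if not self.in_txn:            # open the transaction lazily
            self.db.execute("BEGIN")
            self.in_txn = True
        self.db.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))

    def get_item(self, key):
        # Read-only path: no BEGIN, no COMMIT.
        row = self.db.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None

    def flush(self):
        if self.in_txn:                # commit only if something was written
            self.db.execute("COMMIT")
            self.in_txn = False
```

A page that only reads localStorage would then never pay for a transaction at all, which is exactly the saving described above.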


In general, I don't think we can do a lot here... the localStorage API is intended to be fully synchronous, so to satisfy all requests for Get and Set we have to block.

I just don't know SQLite well enough to say why there is one global lock for all IO ops that, when accessing one file, prevents accessing other completely separate files. It just doesn't make sense to me.
Comment 10 Doug Turner (:dougt) 2011-01-25 19:22:18 PST
can we do the sets+gets to an in-memory hash, then flush to sql when the tab closes?
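Comment 10's write-back idea could look roughly like this (a hypothetical sketch in Python; the real implementation would be C++ against mozStorage, and comment 14 explains why a plain hash is not enough in practice):

```python
import sqlite3

class WriteBackStorage:
    """Serve gets/sets from an in-memory dict; persist to SQLite only
    when the tab closes. All names here are illustrative."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
        # Warm the cache once, then never touch disk until close.
        self.cache = dict(self.db.execute("SELECT k, v FROM kv"))

    def get_item(self, key):
        return self.cache.get(key)     # no disk IO on the hot path

    def set_item(self, key, value):
        self.cache[key] = value        # no disk IO on the hot path

    def close_tab(self):
        with self.db:                  # one transaction, one fsync
            self.db.executemany(
                "INSERT OR REPLACE INTO kv VALUES (?, ?)", self.cache.items())
```

The trade-off is durability: anything written after the last flush is lost if the process dies before `close_tab` runs.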
Comment 11 Shawn Wilsher :sdwilsh 2011-01-25 19:30:21 PST
(In reply to comment #10)
> can we do the sets+gets to an in-memory hash, then flush to sql when the tab
> closes?
I think that's an awfully big change this late in the 2.0 release cycle.  I sure wouldn't like to see that go in now.
Comment 12 Doug Turner (:dougt) 2011-01-25 19:54:59 PST
cjones how bad is this?  is it worth just pulling the plug on localstore like we did with indexedDB?
Comment 13 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-01-25 22:11:26 PST
What I don't know is, how many sites rely on localstorage.  We know some do, like tbpl.  google.com uses it, but I doubt it's broken without it.  Comment 8 asks this too.

What we're trading for breaking those sites is potentially huge UI latency, O(second) judging by sync-disk-IO problems we've had elsewhere.  This will happen if the disk is busy and/or we're on a file system like the Galaxy S***'s.  The possibility of this happening when loading a site as common and innocuous as google.com concerns me.  This UI latency is also under the control of web content: the more web content uses localStorage, the more lag we get.  That makes fennec look bad.

I don't know what we gain in terms of site compat by leaving localstorage enabled.  That's the key bit of data we'd need to make a good decision.  Short of that, ... I guess we'd have to browse a bunch of sites known to use localstorage and see how bad things are in the common case, and if the experience is acceptable, hope we don't hit bad cases too much in the wild.
Comment 14 Honza Bambas (:mayhemer) 2011-01-26 14:02:13 PST
(In reply to comment #10)
> can we do the sets+gets to an in-memory hash, then flush to sql when the tab
> closes?

Please don't.  There is already a hash table used to store values, mirroring the data, but the final goal is to get rid of it and put all operations on the data fully under SQLite's control.  We need more than just a simple hash table to have all capabilities of localStorage (DOM storage in general) working.

The fix should figure out why we block and try to avoid the block as much as possible.

I can do some research with optimizations I suggested earlier.
Comment 15 Alon Zakai (:azakai) 2011-01-26 15:04:21 PST
I tried to find websites that use localStorage. The following do:

google.com does (but not mobile)
espn (but not mobile)
cnn (but not mobile)
orbitz
lonelyplanet
sfchronicle

but it looks like the majority of popular websites don't use it. And even when they do, their mobile versions don't.

As for how important localStorage actually is to those websites, it's hard to tell. I don't notice any immediate difference when using the websites mentioned above without localStorage - they don't obviously break - but possibly some minor functionality (that is noticed later on) is lost.

Given that, my recommendation is to disable localStorage for Fennec for 4.0. Definitely not worth a perf hit, and probably not worth the risk of any serious work (unless there is something simple that can completely fix this).
Comment 16 Josh Matthews [:jdm] (on vacation until Dec 5) 2011-01-26 15:12:50 PST
Any blog that uses disqus for its commenting engine uses localstorage, but I also don't know how it's used therein.
Comment 17 Honza Bambas (:mayhemer) 2011-01-27 16:35:16 PST
Created attachment 507715 [details] [diff] [review]
experimental1

This is an experimental patch that should lower the chance of hitting the lock.  If anyone can test with it let me know.  I can also provide a try build with this patch if anyone wants.
Comment 18 Honza Bambas (:mayhemer) 2011-01-27 16:46:32 PST
Created attachment 507720 [details] [diff] [review]
experimental1.1

Slightly simplified and a bit fixed from the previous one.
Comment 19 Alon Zakai (:azakai) 2011-01-28 16:27:06 PST
I'm not sure how to test this. I can create a page with localStorage and a page without, but I don't notice any difference in performance.

I tried the websites mentioned in the parent bug as well, but 2 out of the 3 websites mentioned (bugzilla and facebook) don't even use localStorage.

Are we sure that localStorage is actually a significant part of the ~1 second hang mentioned in the parent bug?
Comment 20 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-01-28 16:35:12 PST
(In reply to comment #19)
> I'm not sure how to test this. I can create a page with localStorage and a page
> without, but I don't notice any difference in performance.

On device?  What access pattern does your page with localStorage use?

> Are we sure that localStorage is actually a significant part of the ~1 second
> hang mentioned in the parent bug?

Not the least bit, sorry if I implied otherwise.  It's synchronous, UI-thread disk IO though, and that's guaranteed to bite us.  The question is will it bite us infrequently enough to leave it enabled for 4.0.
Comment 21 Alon Zakai (:azakai) 2011-01-28 17:21:17 PST
(In reply to comment #20)
> (In reply to comment #19)
> > I'm not sure how to test this. I can create a page with localStorage and a page
> > without, but I don't notice any difference in performance.
> 
> On device?  What access pattern does your page with localStorage use?

Galaxy S and Linux desktop.

My test page does a lot of writes and reads to localStorage.

> 
> > Are we sure that localStorage is actually a significant part of the ~1 second
> > hang mentioned in the parent bug?
> 
> Not the least bit, sorry if I implied otherwise.  It's synchronous, UI-thread
> disk IO though, and that's guaranteed to bite us.  The question is will it bite
> us infrequently enough to leave it enabled for 4.0.

Yes, good question.
Comment 22 Alon Zakai (:azakai) 2011-02-09 17:14:36 PST
IPC profiling data from an android device can be seen in bug 629513 comment 6 / bug 629513 comment 8. Web storage does indeed lead to a significant slowdown on google.com (nonmobile).
Comment 23 Honza Bambas (:mayhemer) 2011-02-15 11:15:23 PST
Alon, I'm curious why SetValue and RemoveValue have such short latency compared to GetValue.  Could it be that Google's first call to GetValue triggers the initial DB load of the whole domain into a temp table, and only this first call is therefore that slow?  Or is it just called much more often than SetValue and RemoveValue?

If the former, we could pre-cache the DB in the chrome process somehow, or look for an optimization in a narrower space.
Comment 24 Alon Zakai (:azakai) 2011-02-15 13:43:03 PST
(In reply to comment #23)
> Alon, I'm curious why SetValue and RemoveValue have so short latency compared
> to GetValue.  Could it be, that google first calls to GetValue that leads to
> initial DB load for the whole domain to a temp table and only this first call
> is therefor that slow?  Or is that just called much more often then Set and
> RemoveValue?
> 

The last column shows the number of times each is called. So GetValue was called 11 times, SetValue 2 times and RemoveValue 3 times. However, even accounting for that, GetValue takes much more time, so your first explanation seems most likely.

However, it is also possible there was some noise in that run, so we should get more data. I am working on a better profiling tool now.
Comment 25 Alon Zakai (:azakai) 2011-02-15 18:00:39 PST
Created attachment 512675 [details]
profiling data

Better profiling data, of google.com load (nonmobile).

Again GetValue is quite high, ~0.5 seconds.
Comment 26 Honza Bambas (:mayhemer) 2011-02-16 12:35:26 PST
And what is the distribution of durations across the individual calls?
Comment 27 Honza Bambas (:mayhemer) 2011-02-23 10:14:19 PST
Unfortunately I don't have a test environment for this.  The Android port for my Freerunner needs some fixes to even be able to connect to wifi.  I don't know when I'll get it working, so please give me as much information as you can to move this forward.
Comment 28 Brad Lassey [:blassey] (use needinfo?) 2011-02-23 11:26:37 PST
Honza, we're in the end game for Fennec. What's the plan for this? Also, what do you need to move forward?
Comment 29 Shawn Wilsher :sdwilsh 2011-02-23 11:34:06 PST
I don't think it's wise to take this in 2.0, FWIW.  This is going to be a big enough change that I'd think it would need beta coverage, but we just tagged our last beta.
Comment 30 Brad Lassey [:blassey] (use needinfo?) 2011-02-23 11:41:17 PST
based on Shawn's comment, moving this to 2.0next
Comment 31 Shawn Wilsher :sdwilsh 2011-02-23 11:42:50 PST
(In reply to comment #30)
> based on Shawn's comment, moving this to 2.0next
Although, mobile may want to disable localStorage like in the past.
Comment 32 Alon Zakai (:azakai) 2011-02-23 16:28:38 PST
(In reply to comment #26)
> And what is the distribution of duration among each call?

I am seeing less of a problem now, not sure why. But here is some data from loading google.com (nonmobile):

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00000 seconds, mainThread == 1

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.18600 seconds, mainThread == 1

I/Gecko   ( 9716): |IPC 'RecvGetValue' took 0.00100 seconds, mainThread == 1

So just one of those ended up significantly slow.
Comment 33 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-02-25 17:08:08 PST
FWIW, loading my gmail account on my galaxy s*** is very painful, many several-second UI pauses.  Didn't profile, but localStorage would be my first guess at why UI blockage.  The galaxy s*** is kind of a worst-case though.

It would be interesting for other folks to try gmail on their less s***ty phones.
Comment 34 Doug Turner (:dougt) 2011-03-03 09:43:51 PST
after discussing with blassey and stuart, this is something that is serious for many of the popular devices we are targeting.
Comment 35 Doug Turner (:dougt) 2011-03-03 09:46:12 PST
oh hell.  2.0next.  wrong bug.
Comment 36 Alon Zakai (:azakai) 2011-03-29 12:50:39 PDT
(In reply to comment #33)
> FWIW, loading my gmail account on my galaxy s*** is very painful, many
> several-second UI pauses.  Didn't profile, but localStorage would be my first
> guess at why UI blockage. 

cjones, I am trying gmail on galaxy s to try to figure out this issue. I don't think I see the same degree of pain as you do so I may be doing something different. I tried both mobile and standard HTML views, and both accounts with no emails in the inbox and with 20-30 emails.

The only interesting thing I see in the profiles I've generated is that we spend a lot of time in nsRefreshDriver, especially when there are lots of emails in the account (1.3130 seconds over 4 page loads). But even that doesn't seem so bad, no several-second UI pauses in particular. Are you still seeing those?
Comment 37 Alon Zakai (:azakai) 2011-03-29 12:51:51 PDT
(I am hoping fixing bug 634666 got rid of the UI pauses.)
Comment 38 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2011-03-29 12:55:01 PDT
Should note that I saw the pain on 2.1.  Haven't tried since I upgraded to 2.2; other stuff seems less s***ty with 2.2 so I wouldn't be surprised if gmail is too.
Comment 39 Alon Zakai (:azakai) 2011-05-03 13:09:57 PDT
Are we certain this is still an issue? Or should we collect new data?
Comment 40 Wesley Johnston (:wesj) 2011-05-03 16:54:20 PDT
Ran the profiling code on this and saw:

IPC|RecvGetValue|main 0.0280 total seconds, 0.0031 mean seconds, 9 counted

when loading google.com (desktop site) in Fennec on a Droid X.
Comment 41 Mark Finkle (:mfinkle) (use needinfo?) 2011-05-03 17:45:42 PDT
(In reply to comment #40)
> Ran the profiling code on this and saw:
> 
> IPC|RecvGetValue|main 0.0280 total seconds, 0.0031 mean seconds, 9 counted
> 
> when loading google.com (desktop site) in Fennec on a Droid X.

Seems to not be a problem, if I'm reading that right.
Comment 42 Dietrich Ayala (:dietrich) 2011-10-31 11:38:42 PDT
I'm seeing animation jank on desktop Firefox that correlates with writes to webappsstore.sqlite.

If confirmed, this is a P1 for Firefox. Honza, do you have time to pick this back up again?
Comment 43 Honza Bambas (:mayhemer) 2011-10-31 11:47:43 PDT
(In reply to Dietrich Ayala (:dietrich) from comment #42)
> Honza, do you have time to pick this
> back up again?

First I'd like to investigate if we could move all this stuff to LevelDB and see how that would work (bug 687566).

If that takes too long to finish, I'll try to get to this soon, but not before the Aurora merge.
Comment 44 Dietrich Ayala (:dietrich) 2011-11-02 13:07:26 PDT
Thanks Honza.

How would moving to LevelDB specifically reduce the main thread IO caused by localStorage? Are all writes done off-thread? Or does it mitigate the problem by allowing us to write less and less often?
Comment 45 Honza Bambas (:mayhemer) 2011-11-06 13:26:12 PST
(In reply to Dietrich Ayala (:dietrich) from comment #44)
> Thanks Honza.
> 
> How would moving to LevelDB specifically reduce the main thread IO caused by
> localStorage? Are all writes done off-thread? Or does it mitigate the
> problem by allowing us to write less and less often?

I'm adding Matthew, who's working on LevelDB integration, to answer at least the off-main-thread question.  I personally don't remember whether LevelDB uses background threads, but can't say for sure myself.

According to the tests, it seems that LevelDB should simply be much faster.  I want to investigate first, but unfortunately we still don't even have a prototype of the LevelDB implementation in the mozilla tree to start playing with.

If that takes even longer, I'll have to look for conventional solutions that would go against my future plans, which were to rework the code to be both much simpler and faster.
Comment 46 Shawn Wilsher :sdwilsh 2011-11-06 16:24:12 PST
(In reply to Dietrich Ayala (:dietrich) from comment #44)
> How would moving to LevelDB specifically reduce the main thread IO caused by
> localStorage? Are all writes done off-thread? Or does it mitigate the
> problem by allowing us to write less and less often?
Writing less doesn't really mitigate it; it just makes it harder to notice until you have a really bad case.

(In reply to Honza Bambas (:mayhemer) from comment #45)
> According the tests it seems that LevelDB should simply be much faster.  I
> want to investigate first, but unfortunately we still don't have even a
> prototype of LevelDB implementation in the mozilla tree to start playing
> with it.
LevelDB should be faster because it's designed to be a simple key-value store, which is exactly what `localStorage` is.  SQLite is a bit more complex than we need.
Comment 47 Matthew McPherrin [:mcpherrin] 2011-11-08 14:05:40 PST
(In reply to Honza Bambas (:mayhemer) from comment #45)
> (In reply to Dietrich Ayala (:dietrich) from comment #44)
> > How would moving to LevelDB specifically reduce the main thread IO caused by
> > localStorage? Are all writes done off-thread? Or does it mitigate the
> > problem by allowing us to write less and less often?
> 
> I'm adding Matthew who's working on LevelDB integration to answer at least
> the off-main-thread question.  I personally don't remember LevelDB would use
> background threads, but can't say for sure for my self.

LevelDB does use background threads to improve performance, but writes can still block when buffers fill up and the background thread needs to catch up.  So bad cases can still occur.
Comment 48 Dietrich Ayala (:dietrich) 2011-11-14 20:15:20 PST
*** Bug 568095 has been marked as a duplicate of this bug. ***
Comment 49 Nathan Froyd [:froydnj] 2011-11-23 09:17:22 PST
As stopgaps, is it worthwhile to do any or all of the following:

1) Use a more relaxed PRAGMA synchronous on the localStorage DB;
2) Use SQLite's asynchronous (writes on background threads) VFS or one that we write ourselves;
3) Some other way of moving IO off the main thread that I'm not aware of because I'm new to all of this...

?  At least option (2) sounds quite similar to what LevelDB already does.
Comment 50 Marco Bonardo [::mak] 2011-11-23 09:56:26 PST
(In reply to Nathan Froyd (:froydnj) from comment #49)
> 1) Use a more relaxed PRAGMA synchronous on the localStorage DB;

This is likely doable if we consider data loss a non-important issue for localStorage. The data is set by the page, and I expect any page already has some sort of fallback for when the data is not yet in localStorage. On corruption it could just throw away the db. Cookies similarly use synchronous=OFF.
That way writes would be much faster, since it'd never fsync.
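For reference, option (1) is a one-line change at the SQLite level. Here is a sketch using Python's `sqlite3` module (an in-memory database stands in for webappsstore.sqlite; the real code would go through mozStorage):

```python
import sqlite3

# ":memory:" stands in for webappsstore.sqlite in this sketch.
db = sqlite3.connect(":memory:")

# synchronous = OFF: SQLite hands writes to the OS and never fsyncs,
# so a commit no longer stalls waiting on the disk.  The trade-off is
# that an OS crash (not just an application crash) can corrupt the
# database file, which is why corruption has to be an acceptable loss.
db.execute("PRAGMA synchronous = OFF")

level = db.execute("PRAGMA synchronous").fetchone()[0]  # 0 means OFF
```

Comment 58 below discusses the middle ground of `PRAGMA journal_mode = WAL` with `synchronous = NORMAL`, which keeps some durability while still cutting fsyncs.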

> 2) Use SQLite's asynchronous (writes on background threads) VFS or one that
> we write ourselves;

This is probably too expensive if LevelDB is around the corner.

> 3) Some other way of moving IO off the main thread that I'm not aware of
> because I'm new to all of this...

One could use partitioned tables with a part in memory, and handle all of that with magical queries. Well, we did this in Places for Firefox 3.5 and it was a nightmare.

(In reply to Dietrich Ayala (:dietrich) from comment #44)
> How would moving to LevelDB specifically reduce the main thread IO caused by
> localStorage?

Take these results with care, since obviously the benchmarks are made to be largely in favor of levelDB, but it is really good in what it does (simple hashing with sequential reads/writes)
http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html
Comment 51 Marco Bonardo [::mak] 2011-11-23 10:00:47 PST
It may also use openUnsharedDatabase and exclusive locking, to avoid the shared cache and the concurrent-writer checks.
These changes are simple and known to improve performance.
Comment 52 Marco Bonardo [::mak] 2011-11-23 10:18:03 PST
(In reply to Marco Bonardo [:mak] from comment #50)
> (In reply to Nathan Froyd (:froydnj) from comment #49)
> > 1) Use a more relaxed PRAGMA synchronous on the localStorage DB;
> 
> This is likely doable if we consider dataloss as a nonimportant issue for
> localStorage.

As a side note, levelDB in synchronous mode may perform worse than SQLite, so it's probably something that we'll have to do regardless, when moving to it.
Comment 53 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2011-11-23 11:54:46 PST
IMHO, dataloss is totally ok for localStorage.
Comment 54 Marco Bonardo [::mak] 2011-11-23 12:02:45 PST
I've filed Bug 704933 to investigate these easy improvements, I hope someone from the perf team (or anyone else) can take it.
Comment 55 Dietrich Ayala (:dietrich) 2011-11-23 12:02:55 PST
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #53)
> IMHO, dataloss is totally ok for localStorage.

Why? Unlike sessionStorage, which is understood to be transient and volatile, localStorage is designed to be persistent.
Comment 56 Marco Bonardo [::mak] 2011-11-23 12:08:00 PST
(In reply to Dietrich Ayala (:dietrich) from comment #55)
> Why? Unlike sessionStorage, which is understood to be transient and
> volatile, localStorage is designed to be persistent.

Data loss here means database corruption. All the data in localStorage comes from the web and can be added back again by the web app.
Comment 57 Shawn Wilsher :sdwilsh 2011-11-24 10:22:14 PST
(In reply to Marco Bonardo [:mak] from comment #56)
> Dataloss here is intended as database corruption. All the data in
> localStorage comes from the web and can be added back again by the web app.
That's not true at all.  Think offline apps.  Think of pages that store user preferences in localStorage because cookies can easily be deleted.
Comment 58 Marco Bonardo [::mak] 2011-11-24 15:33:41 PST
(In reply to Shawn Wilsher :sdwilsh from comment #57)
> That's not true at all.  Think offline apps.  Think of pages that store user
> preferences in localStorage because cookies can easily be deleted.

My point is that those pages have a fallback for when we don't have any of their data; otherwise they'd just not work. In case of data loss nothing is really lost, apart from the fact that the page will have to populate the storage again, which, sure, may be an annoyance for offline apps. But even now we are surely not doing everything we can to protect against that; we use synchronous=NORMAL.
If we really don't want to go as far as completely removing fsyncs, we could still move to WAL+NORMAL, with a small journal size.
And if we disallow LevelDB from working on the main thread, I don't see it as a viable solution for localStorage either, so we need an alternative path to reduce fsyncs.
Comment 59 Dietrich Ayala (:dietrich) 2012-02-13 15:54:31 PST
*** Bug 701880 has been marked as a duplicate of this bug. ***
Comment 60 Jason Smith [:jsmith] 2012-09-20 09:45:14 PDT
*** Bug 792870 has been marked as a duplicate of this bug. ***
Comment 61 (dormant account) 2012-10-17 16:33:33 PDT
e10s is not in Firefox desktop, so not covered by snappy.
Comment 62 Honza Bambas (:mayhemer) 2012-12-11 15:11:58 PST
dup of either 807021 or 600307
Comment 63 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2012-12-20 14:37:27 PST
*** Bug 823720 has been marked as a duplicate of this bug. ***
Comment 64 Jonas Sicking (:sicking) No longer reading bugmail consistently 2012-12-20 14:43:14 PST
Honza: Bug 807021 doesn't fully cover this since bug 807021 is only about writes. There is no reason we should ever need to do reads on the main thread if someone in a child process accesses localStorage since we can easily just block the child process until we've asynchronously read the data.

Likewise, I would imagine bug 600307 doesn't fully solve the read problem here, does it?

I do think that waiting for bug 600307 before fixing this is the right thing to do though.
Comment 65 Honza Bambas (:mayhemer) 2013-01-03 09:25:26 PST
(In reply to Jonas Sicking (:sicking) from comment #64)
> Honza: Bug 807021 doesn't fully cover this since bug 807021 is only about
> writes. There is no reason we should ever need to do reads on the main
> thread if someone in a child process accesses localStorage since we can
> easily just block the child process until we've asynchronously read the data.
> 
> Likewise, I would imagine bug 600307 doesn't fully solve the read problem
> here, does it?

Sure it does.  In the parent process we do the load in the background.  It actually behaves the same way as if there were just a single process.  Preloads happen on the background thread; if a sync load happens, though, it blocks (there is no way around it, even for IPC!).

According to existing telemetry, domain data reads are fast.


The first telemetry from desktop Firefox using the patch from bug 600307 shows 75-100% of preloads completing before the localStorage access.  SSD makes it better, HDD a bit worse.  Load times are about 20 ms when blocking.

> 
> I do think that waiting for bug 600307 before fixing this is the right thing
> to do though.

This is a dupe of bug 600307.
Comment 66 Jonas Sicking (:sicking) No longer reading bugmail consistently 2013-01-03 15:02:39 PST
> Preloads happen on the background thread, if sync load happens, though, it
> blocks (there is no way around it, even for IPC!).  

That's the thing. For IPC there is a way around it!

On the child side you make a blocking call to the parent. But when there's a call coming from a child process, there is no need to do anything blocking in the parent! You can asynchronously load the data and while the load is happening the child process remains blocked. Once the data has been loaded, you can send back a response to the child which unblocks it.

So from the child process point of view, it is a blocking call. But in the parent process nothing blocking happens.

Of course, if localStorage is *used* in the parent process we'd have to make a blocking call. That we can't do anything about. But we shouldn't need to block when localStorage is used in a child process.

That's what this bug is about. And that part is not fixed in any of the dependent bugs.
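The pattern Jonas describes can be modeled with plain threads and queues. Python stands in for IPDL here, and every name is illustrative; the point is only that the child blocks on its own call while the parent's main thread never does:

```python
import queue
import threading

requests = queue.Queue()      # stands in for the child->parent IPC channel

def child_get_value(key):
    # Child side: looks synchronous.  Send the request, then block
    # until the parent posts the reply (a per-request queue models
    # the IPC reply message).
    reply = queue.Queue(maxsize=1)
    requests.put((key, reply))
    return reply.get()        # the *child* main thread blocks here

def parent_io_worker(store):
    # Parent side: a background thread drains requests and does the
    # slow "disk" read, so the parent main thread never blocks on
    # storage IO.
    while True:
        key, reply = requests.get()
        if key is None:       # shutdown sentinel
            return
        reply.put(store.get(key))

store = {"theme": "dark"}     # stands in for the SQLite-backed data
worker = threading.Thread(target=parent_io_worker, args=(store,), daemon=True)
worker.start()

value = child_get_value("theme")
requests.put((None, None))    # stop the worker
worker.join()
```

Comments 68-72 below discuss how to get the same effect in Gecko: bind the channel that services these requests to a non-main thread in the parent process.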
Comment 67 Honza Bambas (:mayhemer) 2013-01-03 15:20:21 PST
(In reply to Jonas Sicking (:sicking) from comment #66)
> > Preloads happen on the background thread, if sync load happens, though, it
> > blocks (there is no way around it, even for IPC!).  
> 
> That's the thing. For IPC there is a way around it!

Do you have a code example?  Because I don't think what you suggest is going to work or I simply am not getting it.  How can I not block the parent?  That is not clear to me.

Thanks.
Comment 68 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-03 15:30:02 PST
Content processes should communicate directly with a non-main thread in the parent process.
Comment 69 Honza Bambas (:mayhemer) 2013-01-03 15:53:08 PST
(In reply to Chris Jones [:cjones] [:warhammer] from comment #68)
> Content processes should communicate directly with a non-main thread in the
> parent process.

How?
Comment 70 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-03 16:12:22 PST
Child process main thread creates an IPC channel for PStorage or something on its main thread, and then the parent process opens and binds that channel to a non-main thread that services the requests.
Comment 71 Honza Bambas (:mayhemer) 2013-01-03 16:19:04 PST
StorageParent::RecvSomeSyncRequest(const &in_arg, &out_arg) will be called on the main thread and will block it until the value of |out_arg| is known.

I don't know of a way to have sync requests handled on a non-main thread.
Comment 72 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-03 16:38:35 PST
We would need to move the StorageParent code to run on a non-main thread, with a dedicated IPC channel.  Nothing would be different on the client (child) side.
Comment 73 Jonas Sicking (:sicking) No longer reading bugmail consistently 2013-01-03 17:00:25 PST
Chris: Is there no way in ipdl to send a message from the child to the parent and then block until it gets a response from the parent, while letting the parent produce a response asynchronously? I.e. let the parent send a separate message in order to deliver the response or some such?

If that's not possible, that seems like something we should add to ipdl. Allowing a child to block while the parent runs some slow operation seems like one of the big benefits we could get from process separation.
Comment 74 Olli Pettay [:smaug] 2013-01-03 17:06:23 PST
Parent can spin event loop while creating the response.
That is kind of ok case for event loop spinning.
Comment 75 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-03 17:17:51 PST
Right, spinning the event loop is the only way to do that.  The semantics of half-sync are messy and not at all clear, so I don't want to add it.  Spinning the event loop achieves the same effect.
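A toy illustration of the spin-the-event-loop idea from comments 74-75 (everything here is hypothetical; Gecko's real loop is pumped via nsIThread, not this): instead of sleeping on a lock, the parent keeps servicing queued events in a nested loop until the completion event arrives.

```cpp
#include <deque>
#include <functional>

// Toy single-threaded event queue standing in for the main thread's.
std::deque<std::function<void()>> gEventQueue;

void DispatchToMainThread(std::function<void()> task) {
    gEventQueue.push_back(std::move(task));
}

void ProcessNextEvent() {
    if (gEventQueue.empty()) return;
    auto task = std::move(gEventQueue.front());
    gEventQueue.pop_front();
    task();
}

// The parent keeps servicing events in a nested loop until the preload's
// completion event arrives, rather than blocking on the SQLite mutex.
int WaitForPreload() {
    bool done = false;
    int value = 0;
    // In reality the storage I/O thread would dispatch this runnable when
    // the async SELECT finishes (cf. the preload from bug 600307).
    DispatchToMainThread([&] { value = 42; done = true; });
    while (!done) {
        ProcessNextEvent();  // the nested event-loop spin
    }
    return value;
}
```

The hazard Olli and Chris discuss is exactly this nested spin: any event already in the queue runs re-entrantly inside WaitForPreload().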
Comment 76 Jonas Sicking (:sicking) No longer reading bugmail consistently 2013-01-03 17:26:15 PST
Ah, is that what the message manager does? IIRC the message manager has the ability to let the parent process create a result asynchronously.

So yeah, one "simpler" option here would be to let the parent process, when it gets a localStorage request from the child, call the preload API added in bug 600307, and then spin the event loop until whatever data was requested is available. Then it could fetch that data without ever causing blocking IO in the parent.
Comment 77 Olli Pettay [:smaug] 2013-01-03 17:32:48 PST
Message manager doesn't have such functionality, but scripts in the parent can ofc spin the event loop.
Comment 78 Jonas Sicking (:sicking) No longer reading bugmail consistently 2013-01-03 17:55:42 PST
Ok, spinning the event loop works for me, and should make this bug relatively easy to fix, as long as we have a way to get a callback when a particular origin has been preloaded into localStorage.
Comment 79 Honza Bambas (:mayhemer) 2013-01-04 08:53:27 PST
Thanks for all suggestions guys.  I see two ways here:

- spin the event loop on the parent (super-easy to do); however, it may cause unpredictable behavior when something underlying changes
- have a dedicated channel handled on a non-main thread (no idea how to do it); I am more a fan of this one if it's easy to do
Comment 80 Olli Pettay [:smaug] 2013-01-04 09:22:36 PST
(In reply to Honza Bambas (:mayhemer) from comment #79)
> - spin the event loop on the parent (super-easy to do), however it may cause
> unpredictable behavior when something underlying changes
What underlying?  You'd be spinning the event loop very close to the place where data is received from the
other process, so usually very close to the normal main event loop spin.
And in the case of showModalDialog, nested spinning is legitimate anyway.
As far as I can see, spinning the event loop in this case isn't really a problem.  It doesn't cause
issues similar to alert() or sync XHR.
Comment 81 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-04 11:53:10 PST
(In reply to Honza Bambas (:mayhemer) from comment #79)
> Thanks for all suggestions guys.  I see two ways here:
> 
> - have a dedicated channel handled on a non-main thread (no idea how to do
> it); I am more a fan of this one if it's easy to do

This is actually pretty straightforward and is definitely the better approach, but I don't know what assumptions in the localstorage code might be violated by running on a non-main thread.  I or bent can help with the IPC boilerplate.
Comment 82 Honza Bambas (:mayhemer) 2013-01-04 14:24:19 PST
(In reply to Chris Jones [:cjones] [:warhammer] from comment #81)
> This is actually pretty straightforward and is definitely the better
> approach, but I don't know what assumptions in the localstorage code might
> be violated by running on a non-main thread.  I or bent can help with the
> IPC boilerplate.

Can I let the *child* receive messages on a thread other than the main thread somehow?  It would make the new code much, much simpler and also immediately fix the issue of blocking the parent's main thread.

Thanks for help.
Comment 83 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-04 14:31:45 PST
Yes, absolutely.
Comment 84 Honza Bambas (:mayhemer) 2013-01-04 14:37:55 PST
(In reply to Chris Jones [:cjones] [:warhammer] from comment #83)
> Yes, absolutely.

Can you please reveal how?  (Code example, API I can use to set the thread...)
Comment 85 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-04 15:16:56 PST
Here's the unit test for the feature

http://mxr.mozilla.org/mozilla-central/source/ipc/ipdl/test/cxx/TestOpens.cpp

The test starts in the parent process main thread, then sends a message to the child that initiates
 - in the child, creating a new channel, PTestOpensOpened
 - in the child, creating a new thread to process messages for that channel
 - binding the channel to the off-main thread
 - in the parent, accepting the channel allocation message
 - in the parent, also creating a non-main thread to process messages for the new channel and binding it to that thread

Then there's some IPC chatter on the newly-created channel.
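The shape of that setup, reduced to a portable sketch (BoundChannel and its members are invented names; Gecko binds a real IPC channel via Open(), not a std::thread): requests are serviced on a dedicated non-main thread, so the sender never contends with the receiver's main thread.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Toy "channel" whose receiving end is bound to a dedicated thread,
// mirroring the TestOpens structure: messages are serviced off the
// parent's main thread.
class BoundChannel {
public:
    int mServiced = 0;  // readable after Shutdown() joins the worker

    BoundChannel() : mWorker([this] { Serve(); }) {}
    ~BoundChannel() { Shutdown(); }

    // Called from the "child" side: enqueue a request and return.
    void Send(std::string msg) {
        {
            std::lock_guard<std::mutex> g(mLock);
            mQueue.push(std::move(msg));
        }
        mCond.notify_one();
    }

    // Drain remaining requests, then join the worker thread.
    void Shutdown() {
        {
            std::lock_guard<std::mutex> g(mLock);
            mDone = true;
        }
        mCond.notify_one();
        if (mWorker.joinable()) mWorker.join();
    }

private:
    void Serve() {  // runs on the non-main thread
        std::unique_lock<std::mutex> lock(mLock);
        for (;;) {
            mCond.wait(lock, [this] { return mDone || !mQueue.empty(); });
            while (!mQueue.empty()) {
                mQueue.pop();   // a real parent would do the storage work here
                ++mServiced;
            }
            if (mDone) return;
        }
    }

    std::queue<std::string> mQueue;
    std::mutex mLock;
    std::condition_variable mCond;
    bool mDone = false;
    std::thread mWorker;  // declared last so it starts after the members above
};
```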
Comment 86 Jonas Sicking (:sicking) No longer reading bugmail consistently 2013-01-04 16:13:05 PST
Honza: I don't understand why we would want to do anything off the main thread in the child process though? Can you explain?
Comment 87 Honza Bambas (:mayhemer) 2013-01-04 16:30:39 PST
(In reply to Jonas Sicking (:sicking) from comment #86)
> Honza: I don't understand why we would want to do anything off the main
> thread in the child process though? Can you explain?

Yes.

I will describe the new implementation in the bug 600307 now:

In single-process mode, when a window is opening we tell the storage background I/O thread to do an async preload of the storage cache.  It SELECTs and then for each key+value calls the LoadItem(key, value) method of the cache.  When done it calls LoadDone(), which marks the cache as loaded (can also be understood as "valid").  These calls are made on the background I/O thread.

When a JS script accesses localStorage sooner than the preload has finished, we block the main thread until the LoadDone call (we wait via monitor.wait(); LoadDone calls monitor.notify()).  There is no main thread event looping.

In the IPC case, when we get into the situation where we have to wait for the async preload to finish, there is no way (I didn't find one...) to just loop over the IPC message queue to receive all the async messages we expect from the parent.  I cannot do monitor.wait() on the main thread, since the main thread has to spin the event loop to process the messages from the parent.  And I don't want to spin the loop (at whatever cost; we know why).

I solved this by doing a sync call to parent to do a preload of all remaining unloaded data synchronously.


Thus, if the child process could process messages on a non-main thread, then I could just call monitor.wait() on the main thread to wait for all data to be received from the parent.  I could remove the relatively complicated sync load IPC code as well.
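The single-process design Honza describes can be reduced to a small sketch (StorageCache here is a simplified stand-in following the names in the comment, not the Gecko class): the background I/O thread fills the cache via LoadItem()/LoadDone(), and a consumer that arrives before the preload finishes blocks in GetItem() on the monitor, with no event-loop spinning.

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>
#include <thread>

class StorageCache {
public:
    // Called on the background I/O thread for each SELECTed row.
    void LoadItem(const std::string& key, const std::string& value) {
        std::lock_guard<std::mutex> g(mMonitor);
        mItems[key] = value;
    }

    // Called on the background I/O thread when the SELECT completes;
    // marks the cache loaded ("valid") and wakes any waiter.
    void LoadDone() {
        {
            std::lock_guard<std::mutex> g(mMonitor);
            mLoaded = true;
        }
        mCond.notify_all();  // the monitor.notify() from the comment
    }

    // Called on the main thread: blocks until the preload completes.
    std::string GetItem(const std::string& key) {
        std::unique_lock<std::mutex> lock(mMonitor);
        mCond.wait(lock, [this] { return mLoaded; });  // monitor.wait()
        auto it = mItems.find(key);
        return it != mItems.end() ? it->second : std::string();
    }

private:
    std::map<std::string, std::string> mItems;
    std::mutex mMonitor;
    std::condition_variable mCond;
    bool mLoaded = false;
};
```

This is exactly the wait that the IPC child cannot do today: its main thread must keep spinning the event loop to receive the parent's messages, which is why receiving them on a non-main thread would let the child block on the monitor instead.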
Comment 88 Honza Bambas (:mayhemer) 2013-01-04 16:59:05 PST
(In reply to Chris Jones [:cjones] [:warhammer] from comment #85)
>  - binding the channel to the off-main thread

Thanks for this example.

My situation is backwards with respect to the IPC connection creation: it starts on the child.  So I need to initialize the child first on a non-main thread.

The trick in the test is in these lines I presume:

child opens PTestOpensOpened;

virtual PTestOpensOpenedChild*
    AllocPTestOpensOpened(Transport* transport, ProcessId otherProcess) MOZ_OVERRIDE;

When is AllocPTestOpensOpened called?
Comment 89 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-04 17:09:04 PST
That example does initiate the new IPC connection from the child side.

The channel is allocated from this call

http://mxr.mozilla.org/mozilla-central/source/ipc/ipdl/test/cxx/TestOpens.cpp#157

which is the interface generated from |child opens PTestOpensOpened|.

That function does some IPC magic behind the scenes.  When it's finished, your AllocPFoo*() functions will be called in both the parent and child processes and handed tokens that refer to the new IPC channel.  Those methods, implemented by you, are responsible for Open()ing the channel (which binds it to the thread it was opened on).

For that part you'll just want to copy the boilerplate ;).
Comment 90 Honza Bambas (:mayhemer) 2013-01-16 12:49:05 PST
Chris, thanks very much for the suggestions!

I decided to do this as a followup after bug 600307 landing.
Comment 91 Honza Bambas (:mayhemer) 2013-01-18 15:52:32 PST
Note: after we land bug 600307 the situation will improve a lot.  However, there is still a chance of being blocked on a corner-case synchronous load of the cache (comment 65).  That will be completely removed by the work in this bug.
Comment 92 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-02-25 11:03:06 PST
(This is one of the sources of b2g jank that we prefer not to talk about.)
Comment 93 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-02-26 00:27:36 PST
Just need to fix.
Comment 94 Marco Bonardo [::mak] 2013-06-25 11:25:12 PDT
Honza, what's the current situation here, did anything change from comment 91 after the new implementation landed?  What may we still do here?  Comment 43 is clearly no longer the plan; having a plan, even a future one, would be useful.
Looks like bug 736144, the only dependency, is blocked on getting more telemetry.
Comment 95 Honza Bambas (:mayhemer) 2013-06-25 17:13:03 PDT
I don't work on this bug simply because I'm pretty occupied with network stuff (the HTTP cache rewrite).

Blocking of the main thread on the parent may still happen, however with a much lower probability than before bug 600307.

What we need is to receive IPC messages in the parent process on a non-main thread - that's all, nothing more to do!  I have an idea how to do it, but the IPC API is hard.

If there is anyone who could do it quickly, then please do.  I may not get to this for at least a month.
Comment 97 Marco Bonardo [::mak] 2013-06-26 07:01:53 PDT
Thanks, I was going through the StorageJank bugs, trying to update their status and ensure there's a plan for them; this is good information to have here.
