Closed Bug 1285898 Opened 4 years ago Closed 3 years ago

[e10s-multi] Localstorage "storage" event is not fired with multiple content processes

Categories

(Core :: DOM: Content Processes, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: gkrizsanits, Assigned: asuth)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(8 files, 5 obsolete files)

20.17 KB, patch
asuth
: review+
Details | Diff | Splinter Review
9.42 KB, patch
baku
: review+
Details | Diff | Splinter Review
11.54 KB, patch
baku
: review+
Details | Diff | Splinter Review
5.54 KB, patch
baku
: review+
Details | Diff | Splinter Review
15.86 KB, patch
baku
: review+
Details | Diff | Splinter Review
5.29 KB, patch
baku
: review+
Details | Diff | Splinter Review
3.23 KB, patch
Details | Diff | Splinter Review
7.05 KB, patch
baku
: review+
Details | Diff | Splinter Review
with multiple content processes:
1415 INFO TEST-UNEXPECTED-FAIL | editor/libeditor/tests/test_bug674770-1.html | The click operation worked successfully

Seems like the "storage" event is either not fired or it's just a buggy test because there is already a related intermittent:

relevant intermittent:
972110 Intermittent test_bug674770-1.html | The click operation worked successfully | The click operation shouldn't work in the contenteditable area
Blocks: e10s-multi
Whiteboard: [e10s-multi:M?]
Whiteboard: [e10s-multi:M?] → [e10s-multi:M1]
Blocks: 1301015
Whiteboard: [e10s-multi:M1]
Assignee: nobody → amarchesini
Attached patch storage_e10s.patch (obsolete) — Splinter Review
I'm waiting for a try result. But in the meantime, this is the patch.
I don't think we have to move the cache to the parent process for now, because there are no sync APIs where we can be 'faster' than a notification between 2 processes (I hope I'm not wrong).

This means that 2 processes "could" run out-of-sync. Moving the cache to the parent, doesn't fix race conditions between 2 pages, running on 2 different processes, writing data to the same key.
Attachment #8814386 - Flags: review?(jvarga)
So, if you have two tabs in separate processes, tab A changes a value in LS, now with your patch, tab B will get the storage event, but what if the handler ignores the new value in the event and fetches the value from LS using the same key ?
I understand this bug is just about the storage event not firing, but I'm trying to understand what you wrote about the cache in the parent since the storage event is one problem, and separate (not synced caches) is other problem, no ?
Also, race conditions, that's yet another problem, but we need to fix "the caches are not synced" first, no?
Attached patch storage_e10s.patch (obsolete) — Splinter Review
Attachment #8814386 - Attachment is obsolete: true
Attachment #8814386 - Flags: review?(jvarga)
Attachment #8814672 - Flags: review?(jvarga)
Comment on attachment 8814672 [details] [diff] [review]
storage_e10s.patch

Andrew, can you maybe take a look at this patch?
Attachment #8814672 - Flags: review?(bugmail)
Comment on attachment 8814672 [details] [diff] [review]
storage_e10s.patch

Review of attachment 8814672 [details] [diff] [review]:
-----------------------------------------------------------------

I definitely understand better why :janv was talking about overhauling the whole DOMStorage implementation!  I've provided specific review feedback on this patch, which I think largely works and will work better with the fixes I suggest...

BUT!  I think this patch's flood propagation of *all* localstorage changes across *all* child processes is fundamentally a bad idea.  Ignoring the security/privacy issues which DOMStorage already has under e10s, the flood approach means a badly behaved webpage in one process can affect all the others.  While we already face this issue to some extent with DOMStorage because the parent process still has to process the spam from that badly behaved webpage, assaulting the main threads of all content processes and their IPC channels is bad.

I'd propose we do the following instead:
0) Not land this patch.
1) Do this the right way by having the parent explicitly understand what caches are actively in use by the child, allowing us to only send events/updates to child processes that actually have the cache open.  I see a ton of appeal in :janv's proposal to only store the data in the parent and have the child issue sync read requests for the content.  Especially since it makes it much easier for us to optimize for things like not sending "storage" events to the child at all until they actively subscribe to the event.  (If PBackground has low latency right now, we could use that, otherwise maybe a PLowLatencyBackground thread/protocol could be introduced.)
2) We can mitigate the implementation delay of doing it the right way by having the local storage manager nsIDOMStorageManager expose mOriginsHavingData/ShouldPreloadOrigin() as DoesSiteAppearToUseLocalStorage() so that :mrbkap's (future?) logic that lets us run JS code to decide what process to put a tab in can help us put all sites from any origin that uses localstorage in a single process so they get a coherent view of localstorage and "storage" events.
3) After we do it the right way, if quantum hasn't given us multiple main threads, we can use the :mrbkap tab allocation logic again to try and keep localstorage-using sites from being co-located with innocent non-localstorage-using sites.

::: dom/base/nsGlobalWindow.cpp
@@ +11675,3 @@
>        // Allow event fire only for the same principal storages
>        // XXX We have to use EqualsIgnoreDomain after bug 495337 lands
> +      nsIPrincipal* storagePrincipal = event->GetPrincipal();

While you're touching this redundant line, I'd suggest removing it and the misleading XXX comment above it.  Specifically:
- Before and after your patch here, this line declares a storagePrincipal that shadows the identically initialized storagePrincipal above.  It might make sense to keep the line somehow if the comment that preceded it had special relevance, but...
- The "XXX comment" is very obsolete.  It's referring to using a check that ignores "document.domain".  The check below uses nsIPrincipal::Equals which does that.  (Code would have to use EqualsConsideringDomain to do the wrong thing.)

::: dom/ipc/ContentParent.cpp
@@ +4709,5 @@
> +        nsString(aNewValue), IPC::Principal(aPrincipal));
> +    }
> +  }
> +
> +  return IPC_OK();

It seems like there's either an oversight here or a significant design decision that merits a comment.

LocalStorage writes to an origin in the parent will be broadcast to children and they will see the change.  The opposite is not true.  I'm not sure that in a full-e10s world that's a use-case we care about, and certainly it would be nice to avoid spurious event creation, but it seems like you should either:
- Put a comment here that explicitly calls out we're intentionally not processing the event in the parent.
- Remove the iteration code above and instead just call DOMStorage::DispatchEvent, adding a ContentParent* sourceChild field so we can modify that logic to do the "don't send back to sourceChild" logic.

::: dom/storage/DOMStorage.cpp
@@ +285,5 @@
> +  }
> +}
> +
> +void
> +DOMStorage::ApplyEvent(StorageEvent* aStorageEvent)

Correctness/performance issue: The calls to DOMStorageCache::Clear,RemoveItem,SetItem here will all end up redundantly issuing manipulations back to the parent over IPC to be written to the database.  Besides the potential for waste[1], in the event of a backlogged child process this could result really weird consistency problems.

If using this implementation idiom, these "external" mutations need to not issue their sDatabase->MutateStuff() calls.

ProcessUsageDelta probably also needs to be modified to understand this is an externally imposed change and that it needs to report success and update its counter even if it ends up pushing us over quota in the interest of convergence.  Since BroadcastChangeNotification is only called if the DOMStorageCache call succeeded, this is largely safe since it means at least the writer's view of its storage was not overflowed.

The abuse scenario this enables is that conspiring code could use BroadcastChannel to use up max-quota*occupied-processes storage by clearing storage to zero and then all attempting to write max-quota bytes of data at the same time.  This could be mitigated by saying if the quota gets exceeded by a factor of 1.5 we just issue a clear().  Since an actively malicious website can already largely attempt to fill up the user's storage through using a bunch of different eTLD+1's, I'm not sure how important that threat model is.

1: Although DOMStorageDBThread's PendingOperations implementation is smart enough to coalesce writes if things get backlogged, especially using the SQLite WAL, I'm not sure backlogs will happen that much.  (And lacking DOMStorageCache's understanding of the current state, it can't recognize that duplicate writes are no-ops and perform the NS_SUCCESS_DOM_NO_OPERATION optimizations.)

@@ +287,5 @@
> +
> +void
> +DOMStorage::ApplyEvent(StorageEvent* aStorageEvent)
> +{
> +  MOZ_ASSERT(aStorageEvent);

Separate, but related issue: Because ApplyEvent is called once for every nsGlobalWindow for this origin, it might be appropriate to mark the aStorageEvent processed after the first time and then do an early return at the top here.

(This is another argument for having the synchronization happen at the DOMStorageCache level with its own explicit actor setup.)
Attachment #8814672 - Flags: review?(bugmail) → review-
> I definitely understand better why :janv was talking about overhauling the
> whole DOMStorage implementation!  I've provided specific review feedback on
> this patch, which I think largely works and will work better with the fixes
> I suggest...

In general, I agree with you idea, because it's also my idea since the beginning :)
But in  order to implement this plan we need to do a lot of work:

1. split LocalStorage from SessionStorage (doesn't make any sense to have sessionStorage running in the parent process). Bug 1322316.

2. rewrite localStorage from scratch (or a big subset of it).

This is something that must be done, eventually.

But the purpose of this patch is different: it is to provide a temporary fix for multi-e10s and, in the meantime, we can continue with the original plan.

I think this page is good enough if we add a (important and big) improvement: we don't broadcast messages to any child, but only to child processes who have a localStorage active for a particular origin. If you like this approach, I can write a second patch. I'm thinking about a sync IPC message when a localStorage object is created from the window in order to inform the parent process about which origins are interesting for the current process.
Flags: needinfo?(bugmail)
(In reply to Andrea Marchesini [:baku] from comment #6)
> I think this page is good enough if we add a (important and big)
> improvement: we don't broadcast messages to any child, but only to child
> processes who have a localStorage active for a particular origin. If you
> like this approach, I can write a second patch. I'm thinking about a sync
> IPC message when a localStorage object is created from the window in order
> to inform the parent process about which origins are interesting for the
> current process.

Yeah, filtering the messages based on origin interest addresses my concern.

To make your proposed plan work given the existing patch, I think the following changes will also be needed:

- Have nsGlobalWindow::AddEventListener invoke GetLocalStorage() if a "storage" event listener is registered.  This ensures that the origin will receive events even if localstorage isn't used by the origin when the page is first opened.

- Disable the cache keepalive mechanism by turning DOMStorageCache::KeepAlive into a no-op.  Since you plan to only deliver and propagate changes when there is a window alive, letting a DOMStorageCache stick around when there are no active windows for the origin in the process will result in the cache missing any/all events triggered in other processes.  Alternately, if you move the "interest" tracking and events over to be associated with DOMStorageCache and its lifecycle instead of the window, the keepalive mechanism can be maintained.  (And moving from PContent to its 1:1 PStorage could avoid crufting up ContentParent/Child with the extra state management.)

- Ensure there is no interleaving of preload results and broadcasted changes and that no changes are lost.  (Although DOMStorageCache::LoadItem won't overwrite an added/modified key it heard about via broadcast, if a broadcast removed an item or cleared storage, LoadItem will store the should-no-longer-exist value.)  At a high level, this can be accomplished by buffering any broadcast changes child until the DOMStorageCache's LoadDone is sent/processed.

  Note that the naive approach of bouncing the messages off the DB thread to leverage ordering doesn't really work because of the db thread's custom ordering logic and then I/O may jank propagation.  Something aware of LoadDone like the CacheParentBridge or SyncLoadCacheHelper in the parent, or in the child something in the DOMStorageDBChild::RecvLoadDone path needs to be involved.  Again, this would probably be a good reason to move things more into DOMStorageCache and DOMStorageIPC.
Flags: needinfo?(bugmail)
Comment on attachment 8814672 [details] [diff] [review]
storage_e10s.patch

Review of attachment 8814672 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/base/nsGlobalWindow.cpp
@@ +11809,5 @@
>  
>    RefPtr<DOMStorage> storage;
> +
> +  // If null, this is a localStorage event received by IPC.
> +  if (!storage) {

2 Things:

1) This wants to be a check against `storageArea`, `storage` is guaranteed to be null right now.

Also, because of the hackish nature of how IsFrozen() windows are handled, I think it's appropriate to have a comment around here along the lines of:

Note that CloneStorageEvent will be invoked twice for this window if the window IsFrozen().  During the initial call where the event is coming from IPC, storageArea will be null and the event will be applied.  Our cloned event will have its storage-area set so that when it's stored in mPendingStorageEvents and later re-processed, it will not trigger the ApplyEvent case anymore.


2) While writing that comment I realized the way the IsFrozen() is implemented messes up the event name when the window gets unfrozen and the event gets dispatched.  The logic that sets "fireMozStorageChanged" is basically "hey, if this is coming from our own window as detected by the storage being mLocalStorage, then change the event name from 'storage' to 'MozLocalStorageChanged' and mark this an internal-only event."  Unfortunately, for the cloned `newEvent`, the changingStorage is defined to be mLocalStorage, so by putting the cloned event in mPendingStorageEvents, we ensure we'll fire an internal "MozLocalStorageChanged" event on that regardless of the actual source of the event.

Fix-wise, storing the original `event` instead of `newEvent` potentially keeps some other page's DOMStorage binding alive for a while, which seems undesirable.  It might be best to specially mark the event in the frozen case as already processed and just dispatch it directly rather than running the fireMozStorageChanged-determining path and CloneStorageEvent again.  If you do that, that of course invalidates the comment I suggested above, and you can skip doing that.
(In reply to Andrew Sutherland [:asuth] from comment #7)
> - Ensure there is no interleaving of preload results and broadcasted changes
> and that no changes are lost.  (Although DOMStorageCache::LoadItem won't
> overwrite an added/modified key it heard about via broadcast, if a broadcast
> removed an item or cleared storage, LoadItem will store the
> should-no-longer-exist value.)  At a high level, this can be accomplished by
> buffering any broadcast changes child until the DOMStorageCache's LoadDone
> is sent/processed.

Uh, and of course, the interest wants to be expressed simultaneously with the Preload request being processed by the parent in such a way that sending/buffering of broadcast changes begins happening at the same time.
See Also: → 1331083
Attached patch storage_e10s_2.patch (obsolete) — Splinter Review
This is the interdiff.
About your comments, I don't think I need to do anything for the preload/broadcast sequence of events, because when a event is received from the broadcasting algorithm, I call removeItem/clear/... and those methods force a sync preload (if needed).
Attachment #8827155 - Flags: review?(bugmail)
Comment on attachment 8827155 [details] [diff] [review]
storage_e10s_2.patch

Review of attachment 8827155 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Andrea Marchesini [:baku] from comment #10)
> This is the interdiff.

I got the impression from :overholt you were planning to implement filtering along the lines of what you proposed in comment 6.  It's unclear if this is a WIP or if there was a glitch with the refresh/interdiff, but I would still like to see the filtering and points raised in comments 5 through 8 addressed or discussed.

> About your comments, I don't think I need to do anything for the
> preload/broadcast sequence of events, because when a event is received from
> the broadcasting algorithm, I call removeItem/clear/... and those methods
> force a sync preload (if needed).

Oh yes, very nice, WaitForPreload() does simplify things massively.  As long as the preload request is dispatched to the database thread at the same time the filtering broadcast-interest is registered, things should be good.  This addresses comment 7 bullet 3, thank you.


Note however, that this doesn't eliminate the problem of redundant db mutations mentioned in comment 5 hunk comment 3 about ApplyEvent.  Specifically, if we have process A that does `localStorage["foo"] = 1; localStorage["foo"] = 2;` with a process B with the origin open, B will issue redundant writes, potentially racing process N which is also opening a window on this origin, we can get the following sets of event streams:
process A: [ foo=1, foo=2 ]
process B: [ echo:foo=1, echo:foo=2 ]
process N: [ preload ]

And it's entirely legal for:
db thread sees: [ A:foo=1, A:foo=2, B:echo:foo=1, N:preload, B:echo:foo=2 ]

In that case, process N will perceive a localStorage state of foo=1 when A, B, and any windows in other processes opened after that point will also see foo=2.

With that addressed, I think we're all set in terms of coherency.

::: dom/base/nsGlobalWindow.cpp
@@ +11629,5 @@
>  
>      // Clone the storage event included in the observer notification. We want
>      // to dispatch clones rather than the original event.
>      ErrorResult error;
> +    RefPtr<StorageEvent> clonedEvent =

It looks like you began addressing comment 8 point 2 here, but I don't think you finished.  AFAICT, you just renamed newEvent to clonedEvent but all semantics are the same.

@@ +12930,5 @@
> +
> +  // We need to initialize localStorage in order to receive notifications.
> +  if (aType == nsGkAtoms::onstorage) {
> +    ErrorResult rv;
> +    GetLocalStorage(rv);

This addresses comment 7 bullet 1, thank you!

::: dom/storage/Storage.cpp
@@ -61,5 @@
>  }
>  
>  Storage::~Storage()
>  {
> -  mCache->KeepAlive();

It looks like you began addressing comment 7 bullet 2 here, but there is another existing call to KeepAlive() in StorageCache::LoadDone().  I think making KeepAlive() into a no-op with a comment of why it's the way it's being (temporarily?) disabled might be preferable.
Attachment #8827155 - Flags: review?(bugmail) → feedback+
> Fix-wise, storing the original `event` instead of `newEvent` potentially
> keeps some other page's DOMStorage binding alive for a while, which seems
> undesirable.  It might be best to specially mark the event in the frozen
> case as already processed and just dispatch it directly rather than running
> the fireMozStorageChanged-determining path and CloneStorageEvent again.  If
> you do that, that of course invalidates the comment I suggested above, and
> you can skip doing that.

Without my patch, when a StorageEvent is received, we clone it and we reset the storage to be the local LocalStorage or the local SessionStorage.
Then everything is stored in mPendingStorageEvents in case the window is frozen.

With my patch, we have exactly the same logic, but if the received storageArea is null, we apply the changes to the localStorage too.
I don't see why this changes the behavior or this introduces anything that was not already in place.
> It looks like you began addressing comment 7 bullet 2 here, but there is
> another existing call to KeepAlive() in StorageCache::LoadDone().  I think
> making KeepAlive() into a no-op with a comment of why it's the way it's
> being (temporarily?) disabled might be preferable.

I prefer to keep it out completely instead having this code disabled.
> And it's entirely legal for:
> db thread sees: [ A:foo=1, A:foo=2, B:echo:foo=1, N:preload, B:echo:foo=2 ]

I don't think there is an easy way to fix this. I prefer to keep the current setup. Except if you have a nice idea to suggest, of course :)
Attached patch storage_e10s.patch (obsolete) — Splinter Review
Attachment #8814672 - Attachment is obsolete: true
Attachment #8827155 - Attachment is obsolete: true
Attachment #8814672 - Flags: review?(jvarga)
Attachment #8827415 - Flags: review?(bugmail)
(In reply to Andrea Marchesini [:baku] from comment #12)
> With my patch, we have exactly the same logic, but if the received
> storageArea is null, we apply the changes to the localStorage too.
> I don't see why this changes the behavior or this introduces anything that
> was not already in place.

You're right that you're not changing anything.  The s/newEvent/clonedEvent/ is a good naming change; I just thought you might be trying to fix the pre-existing bug involving frozen events I raised in comment 8 point 2 but hadn't completed it.  Because the decision of whether to emit the "storage" event or the internal-only "MozLocalStorageChanged" is based on the storage area reported, events received while frozen in bfcache have never worked correctly.

(In reply to Andrea Marchesini [:baku] from comment #13)
> I prefer to keep it out completely instead having this code disabled.

That works for me; probably also the better call.

(In reply to Andrea Marchesini [:baku] from comment #14)
> > And it's entirely legal for:
> > db thread sees: [ A:foo=1, A:foo=2, B:echo:foo=1, N:preload, B:echo:foo=2 ]
> 
> I don't think there is an easy way to fix this. I prefer to keep the current
> setup. Except if you have a nice idea to suggest, of course :)

I have a fix in mind for this and the related quota-limit-hitting issue.  Since this patch is basically ready-to-go apart from the filtering issue and I am now able to apply the new patch locally, I'll just provide a patch on top of yours as my proposal.  I should have that for you later tonight.
Apologies for the delay; ETA another 1-1.5 days.  I started writing a multi-e10s test for localstorage to help us both feel more confident about things which is more work than I was originally planning, and a higher priority IDB fix has been troublesome.
I have a happy test and the redundant-write avoidance patches to go on top of :baku's (note: I haven't checked if the recent private browsing change restoration has bit-rotted anything), try run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c73e28a8e26537ad800a408ec73a45ded28bac4e

Unfortunately, while verifying the test did what I wanted I realized that the KeepAlive() removal wastefully defeats the precache mechanism.  That is, the content process will still speculatively perform the precache, but without KeepAlive(), unless content forces the LocalStorage object to be created, then the StorageCache will be released as soon as it is populated.  As a result, a preload will be triggered again when nsGlobalWindow::GetLocalStorage is actually invoked.

There's basically 2 possible fixes:
- Change nsGlobalWindow::PreloadLocalStorage and its calls to PrecacheStorage to create a Storage binding and save it to mLocalStorage like content explicitly touched `localStorage`.  The downside is that rather than having the StorageCache and its storage hang around for only ~20 seconds for an origin that uses localStorage but on a page that does not, we will keep the StorageCache around exactly as long as the page survives.
- A combination of the above and the keepalive mechanism.  We do create and save the Storage binding, but we also schedule a ~20 second timer that will drop the reference from nsGlobalWindow if no content-mandated call to GetLocalStorage occurs before the timer elapses.  This results in more timers than before the patch and with somewhat more state machine stuff going on because of extra situations where cancellation is necessary.

The other option is to disable precaching entirely.  That seems potentially quite bad for page-jank reasons even if it isn't janking the main browser process.

My current inclination would be the first potential fix where precaching saves off mLocalStorage and we avoid the additional timer/state management complexities.  I'd also add another step to the test that verifies that precaching does work as expected.  (Currently the test just verifies that precaching isn't interfering with potential edge cases the tests are trying to exercise.)  Unless I hear otherwise, I'll pursue that for tonight.
:baku, this is just your patch with the minimal, but incorrect, de-bitrotting of the changes from bug 1319951.  I'm going to attach a correct cleanup for you to review momentarily.  r=me on this patch, but preferably you can land this with the other patches I'm attaching or better variations on them.
Attachment #8827415 - Attachment is obsolete: true
Attachment #8827415 - Flags: review?(bugmail)
Attachment #8830694 - Flags: review+
My understanding of bug 1319951 is that we need to track private-browsing separately from the OriginAttributes because for chrome docshells the system principal doesn't end up with a private browsing id.  This fix adds an isPrivate flag to be propagated along with the other IPC data you're pushing around already.

Additionally, we are able to lose the check in the observer handling because the topic checks already ensure that isPrivateBrowsing matches up with the observer topic.  I think the modified check in bug 1319951 was not needed as long as the observer topic was correctly propagating the (separately-tracked) privating browsing state and likewise GetIsPrivate() tracked the private browsing state separately.

This patch should absolutely be folded into your existing patch if you're fine with it.
Attachment #8830698 - Flags: review?(amarchesini)
Attachment #8830699 - Flags: review?(amarchesini)
Attachment #8830700 - Attachment description: Proposed fix to avoid ApplyEvent triggering racey redundant database writes. v1 → Part 3: Proposed fix to avoid ApplyEvent triggering racey redundant database writes. v1
Attachment #8830699 - Attachment description: Multiprocess e10s localstorage test. v1 → Part 2: Multiprocess e10s localstorage test. v1
Attachment #8830698 - Attachment description: Private browsing bitrot fixes where we propagate isPrivate in addition to origin v1 → Part 1b: Private browsing bitrot fixes where we propagate isPrivate in addition to origin v1
Comment on attachment 8830694 [details] [diff] [review]
Part 1: baku's e10s-multi-localstorage patch with my incorrect de-bitrotting

Note that I'm also fine with folding all of these patches into each other.  I've just separated them out for review simplicity for you, or so you can more easily discard them in favor of your own preferred solution.

Try run for this stack of patches:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c9b8c29e21bec6b832a6d5012d395c8312f05aa2
Attachment #8830694 - Attachment description: baku's e10s-multi-localstorage patch with my incorrect de-bitrotting → Part 1: baku's e10s-multi-localstorage patch with my incorrect de-bitrotting
Cleanup test.  Replace the browser_largeAllocation.js cargo-culted getPID ContentTask.spawn usage with immediately-available tab.linkedBrowser.frameLoader.tabParent.osPid.  (Thanks :mrbkap!)
Attachment #8830699 - Attachment is obsolete: true
Attachment #8830699 - Flags: review?(amarchesini)
Attachment #8831619 - Flags: review?(amarchesini)
Attachment #8830698 - Flags: review?(amarchesini) → review+
Attachment #8831619 - Flags: review?(amarchesini) → review+
Attachment #8830700 - Flags: review?(amarchesini) → review+
Attachment #8830701 - Flags: review?(amarchesini) → review+
What are your landing plans for this?  Given we've got multi-e10s enabled on nightly, I'm okay with you landing the current patch-set now with the understanding a filtering follow-up will be filed and worked before 54 hits aurora.  (I can also land, if you want.  I'd fold 1 and 1b and then land the 4 parts.)
Flags: needinfo?(amarchesini)
> aurora.  (I can also land, if you want.  I'd fold 1 and 1b and then land the
> 4 parts.)

Yes, please, land these patches.
Assignee: amarchesini → bugmail
Flags: needinfo?(amarchesini)
Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6681b50c1f6d
[e10s-multi] Localstorage "storage" event is not fired with multiple content processes. r=asuth
https://hg.mozilla.org/integration/mozilla-inbound/rev/6fdb24e1256d
[e10s-multi] LocalStorage e10s Test. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/bd68ebab64fa
[e10s-multi] LocalStorage e10s multiple write avoidance. r=baku
https://hg.mozilla.org/integration/mozilla-inbound/rev/9c6057cde326
[e10s-multi] Make precaching without keepalive work by stashing on the window. r=baku
The test is catching races where "writerTab" has perceived only 2/5 and 3/5 of the mutation events generated by "lateWriteThenListenTab", respectively.  This check is the first state check after the "lateWrite" set of mutations.

By inspection, I think this is a straightforward case of Storage::DispatchStorageEvent's asynchronous NS_DispatchToMainThread(StorageNotifierRunnable) deferring the application of the mutation by a turn of the event loop which creates a race.

This is arguably a PContent-consistency correctness issue in that it would be possible for:
- page A to mutate localStorage then send a PContent-based message
- page B in an alternate process receives the message before the localStorage's StorageNotifierRunnable runs such that the updated localStorage state is not perceived.

However, this is only something that can happen when message-manager-using code such as BrowserTestUtils' ContentTask is involved.  Actual content pages are going to be limited to using PBackground-based messaging like BroadcastChannel or MessagePort for which we cannot expect consistent ordering relative to PContent.  However, since it looks like webextensions code extensively uses the message manager for main-thread-to-main-thread communication[1] and webextensions are allowed to use localStorage and are allowed to touch localStorage-using-content-pages, I think we absolutely want to maximize PContent-consistency.

So I'm going to prepare a minimal patch so that Storage::DispatchStorageEvent supports an immediate-dispatch mode to be used only by ContentChild::RecvDispatchLocalStorageChange.  In-process mutations will still enqueue the runnable for later execution as mandated by run-to-completion semantics and stack safety concerns.  This will not introduce new correctness issues because in-process mutations and other-process mutations are two logically distinct event streams.  (With the current problem being the deferred event jumbles up the PContent event stream of which other-process mutations are a subset.)

:baku, can I get your sign-off on this?  NI'ing for rubber-stamp for now, I'll ask for review if I get the patch working and tested before I hear back.

1: http://searchfox.org/mozilla-central/source/toolkit/components/extensions/MessageChannel.jsm is built on top of message manager and MessageChannel is used by ExtensionChild for the fundamental port mechanism.  I definitely see no use of BroadcastChannel or non-JSM HTML MessageChannel, so even if I'm a little confused, I'm not sure how they'd be bouncing anything off PBackground for comms.
Flags: needinfo?(amarchesini)
This is the fix.  Because the failure was reliably-ish happening on linux64 PGO I attempted to repro there.  Oddly, attempting to use "mach try ... PATH" didn't result in the e10s test getting run, so I had to do a more wasteful try run.  It got 13 happy bc7 runs:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e2f2db58d23a1d81caff1d06171a0b4546ef18e4
Flags: needinfo?(amarchesini)
Attachment #8832067 - Flags: review?(amarchesini)
Attachment #8832067 - Flags: review?(amarchesini) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/e56108122b9088555f29a28e6086d4d117a68c4e
Bug 1285898 - [e10s-multi] Localstorage "storage" event is not fired with multiple content processes. r=asuth

https://hg.mozilla.org/integration/mozilla-inbound/rev/8bbe285cf5c03f6df3b199374310ff5150664290
Bug 1285898 - [e10s-multi] LocalStorage e10s Test. r=baku

https://hg.mozilla.org/integration/mozilla-inbound/rev/bc99844772ee5452c313448bfdb0866428f1bcb1
Bug 1285898 - [e10s-multi] LocalStorage e10s multiple write avoidance. r=baku

https://hg.mozilla.org/integration/mozilla-inbound/rev/1d0bfb02d6c07f62ba21ccfa54a55c2d095ecfa1
Bug 1285898 - [e10s-multi] Make precaching without keepalive work by stashing on the window. r=baku

https://hg.mozilla.org/integration/mozilla-inbound/rev/15b3b9feb54f294171eaf86e7ad4b13c73e11d25
Bug 1285898 - [e10s-multi] fixup PContent ordering via immediate event dispatch. r=baku
Depends on: 1336242
I just can't get enough of backing this out. Backed out in https://hg.mozilla.org/mozilla-central/rev/2aede0a97bc685e163196cc451b947a04ae6a598 for bug 1336242.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla54 → ---
Many thanks to philor for the backout and Toshihiro Yamada and :mccr8 for the triage.

The bfcache case was missed in the "use strings not StorageArea to identify sessionStorage versus localStorage" refactor and I missed it in review.  As I noted in comment 8, this bfcache functionality has never actually worked (spinoff was planned), also turns out to leak memory, and clearly has no tests.  However, the intent does seem laudable, so I'm going to add/improve a test and implement appropriate fixes.
BTW, Ash (currently running with 4 content processes) was also seeing this permafail:
https://treeherder.mozilla.org/logviewer.html#?job_id=73923621&repo=ash

Just in case that's something you haven't seen before :)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #38)
> Just in case that's something you haven't seen before :)

I hadn't.  That's very helpful, thank you.

The problem is that the "lateOpenSeesPreload" tab is being opened in a process that already exists because there's already a bunch of processes loitering all over the place.  This is bad news for the preload mechanism and assertion because SendOriginsHavingData() is only sent from the parent to each child when the process is first created and the StorageDBParent PActor instance is created.  This causes the child's OriginsHavingData to get out-of-sync with the parent and the precaching logic to not know when it's appropriate to precache.  (Another vote for the overhaul!)

The simple testing fix is to terminate all existing processes at the start of the test so that every tab open gets a fresh process and OriginsHavingData state.  I think this may be the right course of action given that:
- We expect the set of localStorage sites to be largely stable.  It's likely that a site using localStorage now will continue to use localStorage in the future.  As such, avoidable misses should be rare.
- When encountering a new site using localStorage (an unavoidable miss), that process will have its local OriginsHavingData updated, so that any other opens of the site of that TabGroup in the future will have accurate information.  Although rel=noopener tabs opened in other processes may encounter an avoidable miss.

Once bkelly's Client Interface bug 1293277 lands, we can potentially move the precache mechanism to operate off of that in the parent process since it will allow us to know when documents are being created without having localstorage add its own IPC traffic.  Then localstorage can just send "hey, I noticed you opened a page on origin A that has localstorage, here's the data to precache."  This would let us avoid having to maintain the potentially large set of OriginsHavingData (I have 4617 distinct origins in my real profile) while providing perfect decision-making without needing to broadcast changes to the origin set.
To address the comment 38 and 39 problem which is reproducible with "--setpref dom.ipc.processCount=4", I tried the "terminate all the extra processes" strategy I proposed, but that quickly turned into a stack of dirty tricks that was still unreliable without knowing exactly how many processes were alive in the pool.

The better fix is just to figure out how many processes are in the keepalive pool via its magic pref and then boost dom.ipc.processCount to compensate for the pool.  The only downside to this is if the extra processes exhaust system memory.

I'll fold this into part 2 for landing; I'm not requesting review on this since it ended up being trivial.


One thing I did notice in testing is that the nsGlobalWindow observer private-browsing-enforcement check that causes us to not enter the handler if the event type does not match the windows' PB type causes warnings about unhandled observer notifications because control flow reaches the bottom of the function.  To avoid that the strcmp's are probably going to want to be cached so their results can be consulted twice so we can early return and enforce PB match.  (Alternately, the conditionals could be altered to be something like if ((strcmp) && (didMatchObserver = true) && !isPrivateBrowsing) so we can check the side-effect after.)
(In reply to Andrew Sutherland [:asuth] from comment #40)
> The better fix is just to figure out how many processes are in the keepalive
> pool via its magic pref and then boost dom.ipc.processCount to compensate
> for the pool.  The only downside to this is if the extra processes exhaust
> system memory.

Would it help in the future if we exposed some helper functions for tests in this area? Like killing all the content processes that do not have any related tabs (so just kept alive artificially).
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #41)
> Would it help in the future if we exposed some helper functions for tests in
> this area? Like killing all the content processes that do not have any
> related tabs (so just kept alive artificially).

Yes.  While I hope most tests/functionality don't feel the need to kill off processes[1], it would arguably be the best for there to be a single helper function that's guaranteed to work for those who decide they need to kill them off.  That way it's easier to identify the tests in question and yell at them, etc. :)  But it would probably be very useful for talos-like tests that provide benchmarks on process spin-up, etc.

I think the most important e10s-testing use-case to enable is helping make it easier to guarantee "these N tabs will open in N separate processes".  Until bug 1333799 is fixed, the only way to efficiently and reliably guarantee that is if we know N more processes are allowed and so fresh processes will be allocated.  Maybe BrowserTestUtils.openTabsInSeparateProcesses(urlsWhichareMaybeObjectsWithSpecificTabOptions, allOptions)?  One all-option could be { freshProcesses: true } which would could automatically kill off processes or not kill them off as appropriate.

1: API-wise, it really seems like only this current incarnation of LocalStorage has a justification, and even that should ideally be short-lived.
(In reply to Andrew Sutherland [:asuth] from comment #37)
> The bfcache case was missed in the "use strings not StorageArea to identify
> sessionStorage versus localStorage" refactor and I missed it in review.  As
> I noted in comment 8, this bfcache functionality has never actually worked
> (spinoff was planned), also turns out to leak memory, and clearly has no
> tests.  However, the intent does seem laudable, so I'm going to add/improve
> a test and implement appropriate fixes.

I went down a bit of a rabbit-hole with tests not failing on this before realizing the localstorage bfcache-support was even more broken than I'd originally realized.

There's an `!AsInner()->IsCurrentInnerWindow()` check at the top of the observer block.  This means the broken bfcache support only gets a chance to run when:
1. It's an iframe whose parent got navigated, resulting in the iframe being frozen.  Because the iframe itself was not navigated, its inner window is still the current inner for its outer window.
2. Maybe for extremely brief periods for the top-level window being navigated if the event loop is yielded between freezing and the new inner-window being assigned?  But in at least the fresh navigation case (not history navigation), freezing should happen concurrently with a change of inner-window because nsDocShell::CreateContentViewer should call nsDocumentViewer::InitInternal which should maybe call SetNewDocument.

I'm going to just provide a patch now to rip out the never-ever-ever-worked-but-did-leak-memory localstorage bfcache code.  This has the nice side effect of avoiding a whole bunch of complexity given how the multi-e10s propagation works here.
This addresses the crash witnessed on bug 1336242 by removing the broken mPendingStorageEvents/IsFrozen() support which wouldn't actually dispatch "storage" events (comment 8), and also apparently wouldn't work for top-level frozen windows (comment 43).

Additionally:
- Additional protection against null-derefs of aData (re: bug 1336242 again) by adding a MOZ_ASSERT to detect mis-use in (debug) tests and an NS_ERROR_INVALID_ARG early return if the assert isn't enabled.
- The conditional in the observer was slightly refactored to ensure that we early-return NS_OK in the case of an event/private browsing mismatch which can result in NS_WARNING being invoked at the bottom of the observer method.

I'll fold this into the part 1 patch when landing so things are bisection-safe.

You may also want to sanity-check part 2b (already folded into part 2 in the try push), but I think it's trivial enough that it doesn't require additional review.
Attachment #8837071 - Flags: review?(amarchesini)
Attachment #8837071 - Flags: review?(amarchesini) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/6910cab30bdd58270b39e7cf041c022c2aba6a2c
Bug 1285898 - [e10s-multi] Localstorage "storage" event is not fired with multiple content processes. r=asuth

https://hg.mozilla.org/integration/mozilla-inbound/rev/044c1bc329e901baa93cab07f400fc2cdd9f7f95
Bug 1285898 - [e10s-multi] LocalStorage e10s Test. r=baku

https://hg.mozilla.org/integration/mozilla-inbound/rev/9f8d50eaa15eaa12af3c2cab5fedadab49bea294
Bug 1285898 - [e10s-multi] LocalStorage e10s multiple write avoidance. r=baku

https://hg.mozilla.org/integration/mozilla-inbound/rev/e6a40909b410af0e307b2df2c57fddb4d5487c4b
Bug 1285898 - [e10s-multi] Make precaching without keepalive work by stashing on the window. r=baku

https://hg.mozilla.org/integration/mozilla-inbound/rev/c5b25450747f8aa25f19e4e819dff3c792a94413
Bug 1285898 - [e10s-multi] fixup PContent ordering via immediate event dispatch. r=baku
Depends on: 1341161
Duplicate of this bug: 1342010
Blocks: 1350071
Blocks: 1405839
(In reply to Andrew Sutherland [:asuth] from comment #18)
> - Change nsGlobalWindow::PreloadLocalStorage and its calls to
> PrecacheStorage to create a Storage binding and save it to mLocalStorage
> like content explicitly touched `localStorage`.  The downside is that rather
> than having the StorageCache and its storage hang around for only ~20
> seconds for an origin that uses localStorage but on a page that does not, we
> will keep the StorageCache around exactly as long as the page survives.

I'm considering preloading possibilities for new local storage implementation and I wonder if keeping the cache around while the page is alive (even when JS doesn't touch .localStorage) is actually better (which is the case in current implementation).
Imagine a case when a local storage access is triggered by clicking on a button which happens after 20 seconds. If the cache is dropped after 20s due to no .localStorage access, we will have to preload it again later when the access happens, this time synchronously. So in the end we would get two preloads and the second would be fully synchronous.

Andrew, do you think we should follow the current approach and keep preloaded data (this time in the parent and only once) for the lifetime of a page ?
Actually, a cache will be shared by multiple pages using the same origin and the cache will be dropped when there are no pages referencing the origin. We could add a timer there, when the last reference is removed to handle for example page reloads.
But this is implementation detail. I'm mostly interested if the keep-alive mechanism was removed with no plans to bring it back in future when we have better internal infrastructure/design.
Flags: needinfo?(bugmail)
(In reply to Jan Varga [:janv] from comment #49)
> But this is implementation detail. I'm mostly interested if the keep-alive
> mechanism was removed with no plans to bring it back in future when we have
> better internal infrastructure/design.

The entire rationale around this change was due to limitations of how we are/were propagating changes to LocalStorage.  Because it was the nsGlobalWindow that was applying the changes rather than the StorageCache, we needed to ensure the StorageCache would be released if there was no nsGlobalWindow to apply the changes.  Otherwise state would de-synchronize.

I thought it definitely made sense to add back the timer-based heuristic once we could safely re-implement it.


> Andrew, do you think we should follow the current approach and keep
> preloaded data (this time in the parent and only once) for the lifetime of a
> page ?

I think we want more telemetry in general.  I'd stick with the current implementation that ties things to page lifetime as you propose and add a few seconds of keepalive even after there are no LocalStorage bindings that want the LocalStorageCache-equivalent to remain alive.

## Questions we want answered by telemetry

1. Do pages that use LocalStorage cause a binding to be created immediately, or are there sites that do things in response to user input, potentially much later?
2. How much disk I/O and memory are we wasting by over-aggressive preloading of LocalStorage?  If we find out we're using a large amount of browser memory for LocalStorage, we may need to consider more active purging of LocalStorage for origins.
3. Can we improve our preloading and retention heuristics by having a site-specific memory for some sites?  For example, using a bloom filter over the page URL, possibly two, to do a blacklist-then-whitelist approach.

## Proposed telemetry.

This is brainstorming based on the above and some general thinking.  If they make sense to you, maybe we should pursue them at some point.

- "time to first local storage", a histogram whose buckets are the number of seconds since navigationStart before the LocalStorage binding was brought into existence by content (rather than a speculative preload on our part).  This helps answer question 1.

- "local storage peak memory use", a scalar of how much memory the parent process is using for all of its LocalStorage caches at "peak" determined by tracking the max and updating the scalar as the max increases.

- "local storage memory use", a histogram whose buckets are memory used, and where we log a new data-point whenever we load/evict and origin from memory.  While this ignores time as a factor, it helps us understand the general distribution of memory usage due to localstorage.

- "local storage preload usage",  a keyed histogram where the buckets are the LocalStorage quota usage of the origin and the keys are:
  - "cache used": We loaded the LocalStorageCache data from disk and it was used by some page (not necessarily the triggering page) before it was evicted.  This is cache-specific, not LocalStorage page-binding specific.  This shows us success from preloading.
  - "cache unused": We loaded the LocalStorageCache data from disk and it was not used before it was evicted.  This shows wasted memory and I/O from over-preloading.
  - "cache reused": A page was triggering preload, but we already had the data around in memory because some other page triggered it AND we still had a live binding somewhere.  This is interesting in relation to "cache used" and "cache unused" (which are both basically cache misses, since we had to hit disk, compared to this which is a cache hit), as well as to observe changes to these histograms over time as the result of changes to our constants or A/B testing.
  - "cache reused keptalive": A page was triggering preload, but we already had the data around in memory, and although there were no live bindings, our keepalive timer had kept it around.  This justifies the keepalive.  If we want to be able to better tune the keepalive, we might create a separate histogram whose buckets are the number of seconds between when a cache entry transitioned to keepalive and when it was resurrected by a binding to be used again because that helps us decide on a lower cut-off.  (To know we need a higher cut-off we'd want some type of tombstone for an evicted origin that has the timestamp of eviction.)
  - "page used": The page triggered preload and instantiated the binding for its use before being closed.  This shows preload success on a per-page basis.
  - "page unused": The page triggered preload but never instantiated the binding for its use.  This helps contextualize "page used".  This is not necessarily wasted I/O, we look at "cache unused" for that.
Flags: needinfo?(bugmail)
No longer blocks: 1405839
You need to log in before you can comment on or make changes to this bug.