Open Bug 1740144 Opened 3 years ago Updated 6 months ago

Race condition between localStorage.getItem and BroadcastChannel.onmessage

Categories

(Core :: Storage: localStorage & sessionStorage, defect, P3)

Firefox 94
defect

Tracking

Tracking Status
firefox-esr91 --- unaffected
firefox94 --- wontfix
firefox95 --- wontfix
firefox96 --- wontfix
firefox97 --- wontfix

People

(Reporter: s+mozb, Unassigned)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0

Steps to reproduce:

This issue was introduced in Firefox 92 and is currently unfixed in 94.0.1. The attached testcase reproduces it, but ONLY when accessed over HTTP/TCP, not as a local file.

Bug trigger: modify localStorage in one tab, then postMessage to another tab and read the localStorage value from inside the onmessage handler.
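A minimal sketch of that trigger (illustrative only; the channel name and value are hypothetical, and the two snippets live in two same-origin tabs):

// Tab A: write, then broadcast.
const sendChan = new BroadcastChannel('test');
localStorage.setItem('msg', String(Date.now()));
sendChan.postMessage(1);

// Tab B: read from inside the onmessage handler.
const recvChan = new BroadcastChannel('test');
recvChan.onmessage = () => {
  // Races: this may still log the previous (stale) value of 'msg'.
  console.log(localStorage.getItem('msg'));
};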

Actual results:

The other tab receives the message, but localStorage still contains stale data.

Expected results:

localStorage should be synchronized before the message is handled on the receiving end.

The Bugbug bot thinks this bug should belong to the 'Core::Storage: localStorage & sessionStorage' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Storage: localStorage & sessionStorage
Product: Firefox → Core

Hi Jan, can you please help to triage this?

Flags: needinfo?(jvarga)

BroadcastChannel uses the DOM manipulation task source and so does Storage (https://html.spec.whatwg.org/multipage/webstorage.html#concept-storage-broadcast), so this is valid.

Was LSNG enabled in 92? I guess so, per https://bugzilla.mozilla.org/show_bug.cgi?id=1599979#c12

Severity: -- → S3
Priority: -- → P3
Regressed by: 1599979
Has Regression Range: --- → yes

Set release status flags based on info from the regressing bug 1599979

(In reply to Olli Pettay [:smaug] from comment #3)

BroadcastChannel uses the DOM manipulation task source and so does Storage (https://html.spec.whatwg.org/multipage/webstorage.html#concept-storage-broadcast), so this is valid.

Was LSNG enabled in 92? I guess so, per https://bugzilla.mozilla.org/show_bug.cgi?id=1599979#c12

Yes, it has been enabled in release since 92. Jari, can you take a look?

Flags: needinfo?(jvarga) → needinfo?(jjalkanen)

On it

Assignee: nobody → jjalkanen
Flags: needinfo?(jjalkanen)

:Jari any update on this?

Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(jjalkanen)
Attachment #9257822 - Attachment description: WIP: Bug 1740144 - Restore instant setItem for localStorage → Bug 1740144 - Restore instant setItem for localStorage. r=#dom-storage-reviewers

The cause was identified and a tentative patch created, but it still needs to be reviewed whether the solution will perform well in other use cases.

Flags: needinfo?(jjalkanen)

(In reply to Olli Pettay [:smaug] from comment #3)

BroadcastChannel uses the DOM manipulation task source and so does Storage (https://html.spec.whatwg.org/multipage/webstorage.html#concept-storage-broadcast), so this is valid.

Note that LocalStorage does not specify/provide any guarantees around multi-process behavior, so the spec doesn't directly cover this. This does raise somewhat of a meta-issue around the scheduling of same-origin JS that's in different processes, and our potential to simply sidestep the complications by, say, delaying the BroadcastChannel/MessageChannel transmissions until LocalStorage checkpoints things anyway. I quasi-propose this in the PR at https://phabricator.services.mozilla.com/D135218#inline-743639.

One nice thing is that the omniscience provided by BrowsingContext does mean we can alter our behavior based on knowing whether the origin is active in other processes or not (and thanks to fission, that should be a rare-ish case, although if we degrade our performance in that case, as checkpointing more aggressively would do, that partially defeats the point of using multiple processes for an origin).

I should probably add that we do also need to be aware of the potential implications of allowing LocalStorage to be a means to provide effectively synchronous communication between different agent clusters. If we checkpoint immediately and synchronously on all writes and we do not key LocalStorage based on COEP, it becomes a potential means of creating a high-ish resolution timer, etc.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #10)

Note that LocalStorage does not specify/provide any guarantees around multi-process behavior, so the spec doesn't directly cover this. This does raise somewhat of a meta-issue around the scheduling of same-origin JS that's in different processes, and our potential to simply sidestep the complications by, say, delaying the BroadcastChannel/MessageChannel transmissions until LocalStorage checkpoints things anyway. I quasi-propose this in the PR at https://phabricator.services.mozilla.com/D135218#inline-743639.

Delaying the BroadcastChannel/MessageChannel transmissions sounds good to me.

And actually, another possibility could be that, given that LocalStorage's design is inherently problematic when multiple processes exist (or rather, single origins exist in multiple processes and have means of communication), we could change the task source used by LocalStorage to explicitly be different from other task sources, so that ordering is not guaranteed relative to async APIs.

Do we know what other browsers do here? Given comment 0 this does seem like the kind of thing websites can depend upon so I doubt changing the task sources is feasible. If we synchronize upon cross-process message delivery it doesn't seem like there would be any Spectre-like implications.

If the delivery of broadcast messages is time-sliced and the local storage performs writes within this deadline, there's a chance that the local storage can't make it when the write is large and slow.

The spec kind of suggests that all activities and messages per origin could be synchronized via one single channel; I'm not sure if this would be feasible here.

Sharing the in-memory layer on top of the database would also be complex unless there already is a generic component for this purpose.

(In reply to Jari Jalkanen from comment #15)

The spec kind of suggests that all activities and messages per origin could be synchronized via one single channel; I'm not sure if this would be feasible here.

The spec around LocalStorage and the original implementation are from the single-process days and are inconsistent with reality, which is what gives us the hand-waving warning at https://html.spec.whatwg.org/multipage/webstorage.html#introduction-15:dom-localstorage-2.

What if we use the BrowsingContext to check when an origin is shared between multiple processes and activate eager checkpointing only in that case? If the performance is still unacceptable, we could then try to profile and fix the checkpoint mechanism without creating a dependency on the broadcast messaging component?

In my view, coupling the message delivery and local storage performance optimization systems seems a bit unexpected and may make it hard to maintain these components in the future. Messaging isn't even responsible for this bug; as far as I can see, this same regression, and other similar inconsistencies, could have been discovered with timestamping, without any broadcast channel.

Apart from the high-precision clock issue, I can't imagine many circumstances where slowed-down message delivery would be desirable. On the other hand, it's not completely unfair if writes to the local storage are somewhat slow; as a reward, we get persistence and consistent views. There are other constructs for interprocess communication and lots of small writes at high speed, and to work around the speed issue, users could perhaps buffer the data the same way as we do internally with the write optimizer.

Well, we probably need to coordinate with Fission plans, especially if the end goal is to have just one content process per origin in most cases.

We moved away from the one-process-per-origin model. Currently there are by default up to 4 processes per origin
(dom.ipc.processCount.webIsolated).
We need more than one process per origin for performance reasons.

We tried really hard in the past to improve the performance of LS; doing checkpoints much more often would negate that effort to some extent. It's not only about sending an async message to the parent process: all existing snapshots for other content processes are marked as dirty after the parent process receives a checkpoint, so snapshots need to be created again later and fetch data from the parent again. So doing a checkpoint during each LS API call is something we should really try to avoid (snapshots would actually become almost useless).

Snapshots were mainly introduced because of the LSNG performance regression on the top sites like www.google.com, www.youtube.com, www.amazon.com, etc.

Given Olli's comment, having multiple content processes for the same origin is rather normal, so we can't base our approach on the number of content processes (just 1 or more).

(In reply to Jari Jalkanen from comment #17)

There are other constructs for interprocess communication and lots of small writes at high speed, and to work around the speed issue, users could perhaps buffer the data the same way as we do internally with the write optimizer.

Yeah, this is exactly what snapshots do; they buffer and optimize data during a task. I'm not sure how we would persuade consumers to do that in their code.

The report in comment 0, and the synchronization model it assumes, is at odds with the spec warning pointed to by anne in comment 14. The spec unfortunately does not give any easy model for ensuring synchronization between tabs, from my quick look.

I suppose with a LOT of work you could share the in-memory data between processes using shared memory (only shared among processes with the same eTLD+1, thus no spectre/security risk). That would be a lot of work, with a good chance of introducing 'interesting' bugs I'd guess.

You could force a synchronization only at the point of a direct message between two processes. That should have minimal performance impacts, especially if you can keep a marker to know if synchronization is needed (extant write data that hasn't been synced yet).

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #22)

I suppose with a LOT of work you could share the in-memory data between processes using shared memory (only shared among processes with the same eTLD+1, thus no spectre/security risk). That would be a lot of work, with a good chance of introducing 'interesting' bugs I'd guess.

Yeah, we were considering shared memory, but it was kind of hard to make it work with the quota manager (I also remember problems with re-allocation of memory, since we couldn't just allocate 5 MB for each origin in advance). So we gave up on that.

You could force a synchronization only at the point of a direct message between two processes. That should have minimal performance impacts, especially if you can keep a marker to know if synchronization is needed (extant write data that hasn't been synced yet).

I suggested something like that here https://phabricator.services.mozilla.com/D135218#inline-743583

But I still think some kind of coordination between LS and BroadcastChannel/MessageChannel would be cleaner.
Like BroadcastChannel/MessageChannel would queue messages only when there's an active LS snapshot.

(In reply to Jari Jalkanen from comment #17)

In my view, coupling the message delivery and local storage performance optimization systems seems a bit unexpected and may make it hard to maintain these components in the future. Messaging isn't even responsible for this bug; as far as I can see, this same regression, and other similar inconsistencies, could have been discovered with timestamping, without any broadcast channel.

I agree that it's a hassle and undesirable coupling, but the reality is that they're already coupled by the spec, which is largely a formalization of how things were previously implemented. The fundamental design problem with LocalStorage is that it is a synchronous view on data across all agents; we continue to have to jump through hoops to deal with and mitigate the effects of that, and obviously multiple processes have worsened the problem.

Apart from the high-precision clock issue, I can't imagine many circumstances where slowed-down message delivery would be desirable. On the other hand, it's not completely unfair if writes to the local storage are somewhat slow; as a reward, we get persistence and consistent views. There are other constructs for interprocess communication and lots of small writes at high speed, and to work around the speed issue, users could perhaps buffer the data the same way as we do internally with the write optimizer.

It's worth noting that there are several tensions at work.

  • From a browser perspective that wants to provide the user with a jank-free experience, LocalStorage is a very unpleasant API because, in order to minimize jank, the browser effectively needs to load up to 5 MiB of data into memory for the origin and keep it around the entire time the origin is alive. It also requires a bunch of potentially blocking sync IPC in our case. The browser really wants users to use IDB. Giving LocalStorage super-powers that allow cross-process communication without yielding control flow only makes people more likely to want to use LocalStorage instead of IDB.
  • From a spec perspective, we want to make sure that we don't create constraints on web compat that effectively require a specific implementation for future browsers to be compatible with the web. Having LocalStorage be synchronously synchronized across all processes would be an example of creating such a problem. Also, again, the security implications are concerning in terms of creating new primitives that clever people can do clever things with.
  • Sites that just use LocalStorage as a configuration store with no strong consistency concerns don't want a lot of overhead on LocalStorage. They just want a write-through cache with eventual consistency. As Jan notes in comment 20, we've seen serious performance regressions when care isn't taken, especially in this case.
  • Sites that are doing complex cross-window things want a greater level of consistency. They probably would also love LocalStorage to be even more powerful without having to use atomics, etc.

One other thing to note about the discussion for effectively synchronous-LocalStorage is that we would likely see new and weird behavior for sites that accidentally interleave their LocalStorage usage across processes.

One of the key design decisions related to snapshots and coherency was that we would adhere to the run-to-completion consistency model. Once a page accesses LocalStorage in a task, it sees a coherent view of LocalStorage until the task completes. Its view of LocalStorage will not interleave with other tasks manipulating LocalStorage in the origin concurrently in other processes. Its changes are then applied to the canonical data store in the parent process, once the task completes, in a single atomic batch.
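A sketch of that run-to-completion guarantee (illustrative; the key name and helper are hypothetical):

// Within one task, reads are served from one coherent snapshot, even if
// another process writes 'counter' concurrently.
const before = localStorage.getItem('counter');
doOtherSynchronousWork();         // hypothetical work within the same task
const after = localStorage.getItem('counter');
console.assert(before === after); // holds until the task completes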

Deferring the cross-agent-cluster messages to be transmitted when the task completes is consistent with this attempt at run-to-completion consistency in the face of multiple processes without regressing existing performance characteristics. And as noted previously, it's likely helpful to sites as to some extent it helps serialize execution in a way that reduces races.

I should also note that it's already the case that we have to worry about the ordering of messages sent via task sources, but in general we've just lucked out because we send everything through the singleton PBackground thread.

We could likely address this specific use case in a somewhat less hacky fashion with a normalized API like queueDomManipulationTaskSourceMessagingRunnable. Or we could throw in CrossAgent or something.

To summarize the information,

  • To save their configurations, users prefer performance
  • We would prefer people to use IDB, but we want localstorage to also be fast
  • We want to keep the performance optimization, which the spec conflicts with
  • We want to change the spec so that localstorage is asynchronous but all IPC is blocked until the inconsistencies are no longer visible to the users
  • For IPC, the performance hit is usually insignificant

One question that remains is: what happens during shutdown? Is it possible that two tabs have inconsistent data and the instance which is behind gets flushed last, overwriting the up-to-date data? The test case attached to this bug could be part of such a scenario.

Thanks for summarizing to help make sure we're all on the same page!

(In reply to Jari Jalkanen from comment #26)

  • We want to change the spec so that localstorage is asynchronous but all IPC is blocked until the inconsistencies are no longer visible to the users

I'm being partially pedantic, but I think the important nuance here is that we want to be consistent with the spec, and in order to do that (without throwing away performance or providing synchronous global memory access) we're deferring messaging-related IPC until a singular point that's consistent with when LocalStorage checkpoints.

Also, I want to note that we do have some precedent for interventions like this in general. In bug 1522316 (PR https://phabricator.services.mozilla.com/D22128) :smaug introduced the *ForMessaging dispatching / nsIEventTarget provision, which is explicitly about runnable ordering/sequencing and builds on top of the throttled event queue that was introduced to avoid Workers spamming the main thread with postMessage, etc.

  • For IPC, the performance hit is usually insignificant

And to go a step further, our choice to defer the sending of the messages will likely help content avoid races.

One question that remains is: what happens during shutdown? Is it possible that two tabs have inconsistent data and the instance which is behind gets flushed last, overwriting the up-to-date data? The test case attached to this bug could be part of such a scenario.

Last writer wins, which is subject to when the task completes and when PBackground processes the received message. A site that wants stronger guarantees from the platform needs to use IDB, which provides a transaction model with explicit sequencing, and/or https://developer.mozilla.org/en-US/docs/Web/API/Web_Locks_API.
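For example, a minimal sketch of stronger cross-tab sequencing with the Web Locks API (the lock and key names are hypothetical; run inside an async function or module):

// Every tab performs its read-modify-write under the same named lock,
// so the updates cannot interleave across tabs.
await navigator.locks.request('settings-lock', async () => {
  const settings = JSON.parse(localStorage.getItem('settings') ?? '{}');
  settings.theme = 'dark';
  localStorage.setItem('settings', JSON.stringify(settings));
});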

In terms of shutdown, there is the interesting case where tabs A and B in 2 different processes both write to "foo": tasks in A and B both race to perform a write, A checkpoints first and B checkpoints second, but because of how browser shutdown occurs, A's write makes it to disk and B's doesn't. This is okay and consistent with what could likewise happen under IDB in analogous situations, except that IDB provides the ability for a site to learn that something most likely durably persisted for its origin, whereas LocalStorage does not provide similar confirmation.

I was curious, talked with Jari and Jan, and just wanted to recap my understanding based on the provided testcase:

The task scope

The "task" mentioned above by Andrew is a piece of synchronously running JavaScript, in this example the send_msg function. At the end of each task we call Checkpoint and thus synchronize the data. This is a deliberate optimization that reduces the synchronization overhead during JavaScript execution.

But in this special case we dispatch an asynchronous postMessage to the BroadcastChannel inside the same task, and on the receiving side this can race in a multi-process scenario:

function send_msg() {
    localStorage.setItem('msg', Date.now() % 100000); // <-- moment in time we create an active snapshot (local to the running process)
    chan.postMessage(1);                              // <-- we dispatch a message to all processes running for the same origin
    show_value();
}                                                     // <-- moment in time we sync the data

...

chan.onmessage = function (msg) { // <-- when we get here in the other process, the data might not yet be synchronized
    if (avoid_bug.checked) {
        setTimeout(show_value, 1);
    } else {
        show_value();
    }
};

General consideration

If two tabs access/write data to localStorage concurrently without additional measures to synchronize, there is no guarantee whatsoever which writes will survive or in which order they will occur, even if this happens in the same process. This is true with or without the immediate synchronization, but the problem might be more easily visible in the multiple-process case and with the current snapshot optimizations in place. But there is a guarantee that active snapshots will be synchronized at the end of every task, making it unlikely to be out of sync for longer periods of time.

The example test case

In this particular case, delaying the dispatch of the postMessage until the task ends (at least in the case where we have an active snapshot to be synchronized) would help. Without looking it up, I assume that there is no guarantee of immediate delivery of broadcast messages, and since postMessage is an asynchronous method anyway, it sounds reasonable to delay its dispatch. On the other hand, it is just a special case of concurrent access to localStorage that could happen for many different reasons, and the broadcast message is just one trigger that can make it happen concurrently.

The options we have seem to be

  1. Leave things as they are and ask developers to be careful with concurrent access to localStorage in general until we have a better spec to follow. Being careful is hard, but it might become easier for developers now that the Web Locks API is implemented.
  2. Find and list synchronously dispatched messages that can cause asynchronous execution of content JavaScript in other processes for the same origin, and defer their execution until after the end of each task (at least in the case of an active snapshot). This would reduce the risk of concurrent execution during an out-of-sync state between processes, but it is no guarantee against all kinds of undefined behavior (think of timer-based execution inside both tabs without any inter-process communication).
  3. Same as 2, but instead of deferring the dispatch, force a Checkpoint() before dispatching.
  4. Synchronize immediately after each write. We know from previous measurements that this comes with quite some loss of performance (the snapshot optimization was introduced for good reasons). We could repeat those measurements just to be sure about the current state, and we could limit the sync to the case where we actually see more than one content process involved for that origin (which is kind of the default case, though). Again, this is no guarantee against all kinds of undefined behavior.

I think that 1. is not really an option: we have here an example of a clear order (first I write a value, then I broadcast a message, and then I read that value) that could already be seen as "being careful with concurrent access", and it is difficult to expect web developers to see the caveat here (whereas, in contrast, it is very easy to see that arbitrary, timer-based concurrent execution is a bad thing, IMHO).
And I kind of see that 2. or 3. would probably reach a state where we had similar guarantees to 4. at a hopefully lower performance penalty. But actually we would probably want to measure both? I know we have performance tests for local storage in general, but I assume it would be difficult to measure the impact of 2. and 3. in a meaningful way with the existing tests.

I would add that the difference between 2. and 3. is not just performance, but also a change of behavior. 3. means that changes would be propagated to other processes before a task is completed. A checkpoint is not only about updating data in the parent process (so new snapshots will be based on that), but also about firing storage events in other processes if there are listeners for that.

Spec notes:

  • (edited for clarity) To clarify for anyone reading along, tasks correspond to runnables. Comment 28's example is ambiguous (because the arrow could make it seem like functions are the unit of tasks), but correct for this test case with the click handler. The task concludes when JS yields control flow due to running out of code to run (end of function, in this case) and (for our purposes) after the micro-task queue is processed for any promises scheduled during the execution. If async functions are involved and they await a promise, control flow is yielded and the task ends (see the sketch after this list). The click handler just runs the function and schedules no promises, so the arrow is right. Here's the spec on tasks, which is within the spec on event loops, which also covers micro-tasks. The whole section, especially the processing model, is worth periodic re-reading.
  • Run-to-completion has some useful non-normative docs in Serializability of script execution and Common pitfalls to avoid when using the scripting APIs. MDN also has some text within its doc on concurrency and the event loop. Note that the text being non-normative doesn't mean that we can ignore it, but more that it's a human-readable explanation of the implications of the spec (and the reality that it's a hard thing to concretely spec, plus there are other ill-conceived APIs like sync XHR that mess with things).
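A sketch of those task boundaries (illustrative; the button and URL are hypothetical, and the comments mark where, under the semantics described above, a task ends and the snapshot checkpoints):

button.onclick = async function () {
  localStorage.setItem('a', '1'); // a snapshot becomes active in this task
  await fetch('/ping');           // control flow yields: this task ends here,
                                  // micro-tasks run, and the snapshot checkpoints
  localStorage.setItem('b', '2'); // execution resumes in a new task, with a new snapshot
};                                // the second task (and its snapshot) ends here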

Yes, as Jan says, at least under our current checkpoint algorithm, performing a checkpoint during an ongoing task, as proposed in option 3 or 4, would allow localStorage values to change while a task is running, which would be a major spec violation in terms of violating run-to-completion semantics. (Also, the security concerns I raised[1].)

That is, given the code:

localStorage.foo = "bar";
chan.postMessage("I did a thing");
localStorage.foo = "baz";

Under single-process execution (and run-to-completion semantics), it should never be possible for a read of localStorage.foo triggered by that message to observe the value "bar", only "baz". But that is exactly what option 3 and option 4 would allow.

Additionally, under our current checkpoint behavior (which could be changed, but which wouldn't correct the previous issue), any localStorage reads performed by the code in the task could potentially have their values change whenever a postMessage is performed, which would be a different violation of run-to-completion.
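A sketch of that second violation (illustrative; per the paragraph above, this is what a checkpoint forced at postMessage time could permit):

const v1 = localStorage.getItem('foo');
chan.postMessage('ping'); // option 3 forces a checkpoint here, after which the
                          // task's view of 'foo' could change out from under us
const v2 = localStorage.getItem('foo');
// v1 !== v2 becomes possible within a single task: a run-to-completion violation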

1: With the disclaimer that it's not clear our current implementation is performant enough for someone to construct a high performance timer, but ironically any improvements we make to performance could potentially then cross such a threshold.

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #22)

I suppose with a LOT of work you could share the in-memory data between processes using shared memory (only shared among processes with the same eTLD+1, thus no spectre/security risk). That would be a lot of work, with a good chance of introducing 'interesting' bugs I'd guess.

I think this would only hold true if we could guarantee that all the processes were COOP/COEP enabled such that atomics/SharedArrayBuffer are allowed to be used in the process (because there are still cross-origin resources that can be fetched, even if we guarantee no other origin's globals will run in the process). This is not a constraint that can be enforced for LocalStorage which has to span both types of processes.

Thanks for the clarification on tasks; I just wanted to make it more concrete with the example. At least at my level of understanding, there is no obvious way for a web developer to easily see the boundaries of a task by just looking at their code. And most of them will not be very familiar with that spec, probably...

Anyhow, so if 3. and 4. violate the spec and 1. makes things behave in an unexpected way, we should probably converge on 2. here without waiting for any performance tests. Jari, would you feel comfortable giving this a try?

Flags: needinfo?(jjalkanen)

:smaug, do you have thoughts on the best place to implement the buffering of the runnables and tracking when they need to be buffered?

Right now we handle things like this in CycleCollectedJSContext potentially via nsContentUtils helpers. Specifically:

Would it make sense to have nsContentUtils gain 2 specific new APIs that then end up calling the real implementations in CycleCollectedJSContext?

  1. RunOrDeferMessagingRunnable: Runs the runnable immediately unless the next method has been called, in which case it buffers the runnables until after the snapshot method gets invoked. I guess this could also potentially take std::function lambdas instead?
  2. ScheduleLocalStorageSnapshotAndDeferMessagingRunnables: This could actually just enable the buffering and put the snapshot runnable as the first entry in the deferral list, and then make it so all subsequent calls to the above buffer the runnables. The buffer would be drained at the stable state. (Maybe it even schedules running the buffered messages via RunInStableState? Or does it get its own explicit call?)
Flags: needinfo?(bugs)

(In reply to Jens Stutte [:jstutte] from comment #32)

Anyhow, so if 3. and 4. violate the spec and 1. makes things behave in an unexpected way, we should probably converge on 2. here without waiting for any performance tests. Jari, would you feel comfortable giving this a try?

Sorry, I still don't get it. Run-to-completion seems to be in the task spec, not in the localStorage spec, and the checkpoints violating the run-to-completion of tasks is due to our algorithm, so why can't we just fix that?

The way I would expect the example of comment #30 to work is that if another tab initiates a "get" of localStorage.foo at a point in time after the first assignment has returned but before the second assignment is initiated, the value of the "get" should be "bar". It seems like we are saying that this is not how we want it. But how exactly do we want it then? Right now, we're not done when setItem returns, and we also don't have a then-callback to ensure that we are safe to proceed, and it's very hard to know when we can proceed without loss of user data when there are (hundreds of) tabs open.

I agree that the standards are a little bit vague and have disclaimers leaving room for interpretation. At least on Windows, the Edge and Chrome browsers also seem to have this same issue, and in order to ensure a consistent web developer experience, it might be better not to make any changes before the standards are revisited?

Second, if we decide to fix something, I think performance tests might be useful. The issue is that we don't have a common language to assess the impact of all the performance optimizations under discussion. Right now, everyone seems to have a different use case in mind and without tests and numbers, I feel that we can't arrive at any useful conclusions.

For example, I don't understand how instant checkpointing could be slower than deferred checkpointing for single values, like when someone is saving preferences from a UI; without buffering, they would be the same operation. Also, sending large values whose processing is expensive one by one to another process or thread could also potentially be faster than sending them all at once, due to better utilization of the overlap time.

However, if we decide to go for deferring all communication between processes until new data in localstorage is published, there is a risk that the scope of work expands greatly.

Before a decision to implement, people who understand the proposal better should estimate the necessary work in sufficient detail to properly understand how many resources should be allocated to this project, and when the project can be assumed to be complete.

Because these changes would impact features and components which are not owned by our team, the effort can be significantly obstructed or blocked by the responsible module owners and other stakeholders. In order not to waste our time, the identified stakeholders should probably sit together and reach a preliminary agreement before the project goes forward. These discussions should probably be conducted by someone who has the authority to negotiate on behalf of the proposal.

I am happy to help once we are ready to start with the implementation.

Flags: needinfo?(jjalkanen)

(In reply to Jari Jalkanen from comment #34)

Sorry, I still don't get it. Run-to-completion seems to be in the task spec, not in the localStorage spec, and the checkpoints violating the run-to-completion of tasks is due to our algorithm, so why can't we just fix that?

All of this is defined in the HTML spec, just in different sections. That said, in general the web platform should be viewed as a consistent whole that reflects the existing content on the web first and the desires of browser implementers much later. Breaking changes need to be very heavily weighed in terms of costs and benefits, and potentially first depend on trying to change the existing content on the web (ex: use of synchronous XHR). New enhancements in terms of specs and implementations need to be compatible with what already exists.

The way I would expect the example of comment #30 to work is that if another tab initiates a "get" of localStorage.foo at a point in time after the first assignment has returned but before the second assignment is initiated, the value of the "get" should be "bar". It seems like we are saying that this is not how we want it.

I'm saying this is wholly in conflict with the run-to-completion model that web content (implicitly) assumes.

But how exactly do we want it then? Right now, we're not done when setItem returns, and we also don't have a then-callback to ensure that we are safe to proceed, and it's very hard to know when we can proceed without loss of user data when there are (hundreds of) tabs open.

Right, LocalStorage doesn't provide any guarantees or support for sites wanting to reason about durability. And it's worth noting that even in a single-process model, sites already would have to deal with the actions of other open tabs and the asynchronous delivery of the storage event for notifications about changes.

In terms of loss of data, the good news is that in general it seems like page authors have used LocalStorage like a key/value store and in general writes are not in conflict. Specifically:

  • For things like site settings, the writes happen when the user changes a setting and reflect the settings the user chose. If the user opens a site with locally configurable themes in multiple tabs and the user doesn't like the theme and changes the setting in both tabs, the writes will be identical and the ordering won't matter. Also, it's unlikely the user could make the changes in both tabs faster than normal propagation would allow, so if the site is making use of the storage event, the second tab could have already updated itself if it cared. (Sites frequently don't bother with this.)
  • For more sophisticated sites with features involving user data, like bugzilla's auto-save feature for comments, sites frequently partition the keyspace (my comment on this bug right now is stored under "bug-modal-saved-comment-1740144:Object") and frequently store the payload as a JSON string within the value, so that LocalStorage's atomic consistency of values helps avoid weird interactions. This does become a problem if the same URL is open in multiple tabs, but that's again something that's a problem in a single process as well. (And the web platform has only recently been able to help sites do something about this via ServiceWorkers and their Clients API, which allows enumerating the existing window Clients and calling focus, so that a site can help the user avoid multiple-tab confusion, which is a real problem for people who aren't entirely tab literate or are too tab literate and have hundreds of open tabs.)
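A sketch of that keyspace-partitioning pattern (hypothetical helpers; the key format is the one quoted above):

// One key per bug, with the payload kept as a single JSON string so that
// LocalStorage's atomic per-value consistency avoids torn state.
function saveDraft(bugId, text) {
  const key = `bug-modal-saved-comment-${bugId}:Object`;
  localStorage.setItem(key, JSON.stringify({ text, savedAt: Date.now() }));
}

function loadDraft(bugId) {
  const raw = localStorage.getItem(`bug-modal-saved-comment-${bugId}:Object`);
  return raw ? JSON.parse(raw) : null;
}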

I agree that the standards are a little bit vague and have disclaimers leaving room for interpretation. At least on Windows, the Edge and Chrome browsers also seem to have this same issue, and in order to ensure a consistent web developer experience, it might be better not to make any changes before the standards are revisited?

Can you clarify what behavior you're seeing from them? You are absolutely right that, given that the spec ignores the multiple-process edge cases, the healthiest thing for the web is to make sure browsers are aligned here and that the alignment takes into consideration what behavior is being observed on the web. That said, again because users can only change between tabs and interact with them so quickly, it's likely that there is a fair bit of wiggle room here, because it's not clear that the behavior is depended on by content the user desires (versus, say, tracking logic doing some kind of side-channel attack to bypass iframe storage partitioning/double-keying).

Second, if we decide to fix something, I think performance tests might be useful. The issue is that we don't have a common language to assess the impact of all the performance optimizations under discussion. Right now, everyone seems to have a different use case in mind and without tests and numbers, I feel that we can't arrive at any useful conclusions.

For performance tests, it's worth noting that the only performance characteristic we care about for LocalStorage is minimizing the "jank" of synchronous blocking of the main thread.

So far, for LocalStorage we've used the page load performance tests as a proxy for this value, since sites frequently check LocalStorage during page load, and the dominating factor there has been the time required for the LocalStorage QuotaManager client to reach a state where it can service the initial checkpoint request; this is captured well in the load time metric.

I think the next metric we'd care about is the cumulative synchronous jank resulting from naive websites that perform uncached localStorage reads during their normal operation, particularly if they perform multiple reads during a single task, and potentially if they interleave writes with those reads. For example, many sites with metrics seem to like to accumulate a journal of the user's actions on the page which they then store in LocalStorage so that if the user is offline or there are connectivity problems, the journal can be persisted and then uploaded in a subsequent visit by enumerating LocalStorage keys with a specific prefix (which had a random id suffixed to it). In this case we would likely want to track the number of synchronous blockages of the main thread rather than the latencies since we'd want to factor in PBackground contention and what that would look like under different levels of (mathematically) simulated contention based on general latencies we've observed in the wild.
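A sketch of that journaling pattern (hypothetical prefix and helpers):

// Append each event under a prefixed, randomly suffixed key; on a later
// visit, enumerate keys by prefix to upload whatever was persisted.
const PREFIX = 'metrics-journal-';

function record(event) {
  localStorage.setItem(PREFIX + crypto.randomUUID(), JSON.stringify(event));
}

function collect() {
  const entries = [];
  for (let i = 0; i < localStorage.length; i++) {
    const key = localStorage.key(i);
    if (key.startsWith(PREFIX)) {
      entries.push(JSON.parse(localStorage.getItem(key)));
    }
  }
  return entries;
}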

For example, I don't understand how instant checkpointing could be slower than deferred checkpointing for single values, like when someone is saving preferences from a UI; without buffering, they would be the same operation. Also, sending large values whose processing is expensive one by one to another process or thread could also potentially be faster than sending them all at once, due to better utilization of the overlap time.

The concern here is mainly about sync jank (where the sync jank happens in the name of consistency). A mechanism like the pre-LSNG LocalStorage, which just sent out notifications of each delta via async IPC (and without any layering on top to make it manifest as sync), has no meaningful performance hit, but it did for example mean that we had real consistency problems: every LocalStorage instance in every process could have its own view of LocalStorage state, none of which would be consistent with what's on disk or with run-to-completion semantics. (That said, slapping a high-resolution timer on things to decide a winner of change events would have been a possibility, but it would have made the semantics of "storage" events extra confusing if they were telling you about a change that happened but was then not applied.)

Before a decision to implement, people who understand the proposal better should estimate the necessary work in sufficient detail to properly understand how many resources should be allocated to this project, and when the project can be assumed to be complete.

Definitely! For those following along, a meeting has now been scheduled to try and delve into this more deeply in a synchronous fashion.

Thank you for the clarifications!

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #35)

(In reply to Jari Jalkanen from comment #34)

I agree that the standards are a little bit vague and have disclaimers leaving room for interpretation. At least on Windows, the Edge and Chrome browsers also seem to have this same issue, and in order to ensure a consistent web developer experience, it might be better not to make any changes before the standards are revisited?

Can you clarify what behavior you're seeing from them? You are absolutely right that, given that the spec ignores the multiple-process edge cases, the healthiest thing for the web is to make sure browsers are aligned here and that the alignment takes into consideration what behavior is being observed on the web. That said, again because users can only change between tabs and interact with them so quickly, it's likely that there is a fair bit of wiggle room here, because it's not clear that the behavior is depended on by content the user desires (versus, say, tracking logic doing some kind of side-channel attack to bypass iframe storage partitioning/double-keying).

After a button press, the test case sets or replaces a value in localStorage and, as soon as the setItem call returns, sends a message to another tab, which then reads the value and sets a visible UI property. All three tested browsers sometimes return the old value and sometimes the new one.

https://github.com/whatwg/html/issues/335 suggests that Chrome has a similar model to ours, but perhaps they address comment 0 in a special way? It seems reasonable to escalate this to get a clearer specification, and that process itself might give us more insight into Chrome's behavior, though I don't think we should block fixing this on that concluding.

We had a video meeting on this on Jan 21st. The general conclusion is:

  • The plan proposed in comment 33 wherein we delay the transmission of BroadcastChannel messages seems like the best course of action if we take any action.
  • Because the hand-waving in the spec around multiple processes makes this unclear, and other browsers seem to have inconsistent / racy behavior here, it's not clear that urgent action is required, but...
  • It would be nice to try and improve the spec here to do some combination of:
    • Better reflect the current reality of browser implementations.
    • Potentially improve what browsers are doing in either the sync process or in the ability to perceive races by BroadcastChannel (or potentially other messaging APIs like MessageChannel and ServiceWorker.postMessage which are defined to use task sources but can still allow for apparent races in synchronization). For example, by doing what we propose in comment 33, as this helps make synchronization less of an issue as it largely side-steps the synchronization details and the potential frustrations for developers running into the synchronization races.

To this end, I wrote the comment https://github.com/whatwg/html/issues/403#issuecomment-1019008112 in the 2015 HTML Issue "Define how localStorage is synchronized between browser tabs/windows" https://github.com/whatwg/html/issues/403.

For now we'll defer this bug while we engage in spec discussion on https://github.com/whatwg/html/issues/403 or related issues.

Assignee: jjalkanen → nobody
Flags: needinfo?(bugs)
Attachment #9257822 - Attachment is obsolete: true
Duplicate of this bug: 1858936