Open Bug 578694 Opened 12 years ago Updated 2 years ago

Clearing history should wipe server and remote clients (desktop)

Categories

(Firefox :: Sync, defect, P5)

defect

Tracking

()

People

(Reporter: philikon, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [Sync Q4 OKRs])

Attachments

(2 files)

If the user clears all history we should wipe the server and have remote clients wipe their history, too. Right now we just seem to be forcing a sync by upping the score by a ridiculous amount (500).

The user might clear just recent history in which case we should delete (= set to null) the relevant history entries. A quick test has revealed, however, that nsINavHistoryObservers don't seem to be notified of such an event.
Bug 568721 might be related.
Target Milestone: --- → 1.6
Flagging in-testsuite for further investigation
Flags: in-testsuite?
Duplicate of this bug: 597524
Duplicate of this bug: 580755
For the clear everything case, this should be simple (wipeRemote("history")) when we get the notification, followed by a sync (maybe just on the engine).
Priority: -- → P2
Target Milestone: 1.6 → ---
Blocks: 600429
Duplicate of this bug: 568721
From duped bug: 

(In reply to Richard Newman [:rnewman] from comment #14)
> Sync's only action in onClearHistory is to bump the score… but it doesn't
> track any of the changed IDs, nor receive any notifications for individual
> items, so it won't upload any deletions.
> 
> We could either:
> 
> * Query for every ID during onClearHistory and track them all, and allow
> sync to occur as normal
> * Act as if we're generating a reset event for the history collection, and
> tell other clients to wipe themselves.
> 
> Which we do depends on whether we think of clearing history as something
> that should affect other clients' history that hasn't yet synced, or
> something that should only affect the records on this machine. (And
> efficiency, too; I'm not super excited about POSTing up 80,000 deleted
> records.)

I think the later option is too drastic.  I believe users will actually want to keep history on the remote client (other than what they just cleared on local).  Blindly wiping remote will result in unintentional data loss.  Performance wise, the former option may not be great, but UX will be much better if we track each record.
Asa: do you have any thoughts on this?
Keywords: productwanted
I think we need some input from UX/UR here. My gut says a user clearing history on one device will be shocked when she opens Firefox on her other device to find it's cleared there as well. That being said, clearing ones tracks only to have them re-appear later would also be problematic. Madhava, Cori, can you help here or ask someone on your team to help figure this out?
Asa's point about multiple devices is important.

A solution could be to Clear local history and disable History from being Sync in that device until the preference is changed again.
FWIW, Sync muddles which devices incurred which history visits. Sync just records that URL X was visited at various times (along with other basic metadata, like the visit type).

The best we could theoretically do is see which URLs no longer have visits and forcefully remove these from the Sync server.

In the Sync 2.0 world, we were planning on drastically changing the data representation of history visits. There are some scenarios that would be technically very difficult to support. I'll push back if Product/UX/UR says they want something that is extremely challenging. For now, just think big and I'll tell you if that is feasible :)
I think it's a good point that a user may want to wipe history on one device, but not another. For example, I may want to clear my browser history for multiple reasons (privacy, etc) because I share my desktop computer with my family, but I don't share my phone, so I may not have the same needs for my phone.

Is there a way to have the user choose where to wipe histories from? For example, they choose to clear history on the desktop, and get some sort of dialog box that says, "Do you want to clear the history from your phone and your tablet as well?" Then the user could choose how they want to handle a synced action.

Is this possible? Would UX be into this type of alert & action?

Either way, this calls to attention that we need to determine a set UX strategies for syncing activities.
(Hmm, need a general uxwanted flag...)
Keywords: uiwanted, urwanted
Summary: Clearing history should wipe server + remote clients → Should clearing history wipe server + remote clients?
Agree that this needs a thorough UX review.

If privacy is an issue a re-engineering of the system may be needed. Wiping out all the data in one computer, while leaving it on the servers will make any of the previous data accessible in one click (i.e. check the component that was unchecked).

Wiping history permanently and removing the ability to synchronize it again doesn't seem correct either.

An in-between solution could be to disable History syncing in that device, remove the history completely from the local machine and password protect the activation of the engine.

This seems like a really niche use-case to me but because Privacy is involved we may want to pay closer attention if the solution doesn't involve a major resource investment.
Time to think about this again a little bit.

If we're converging on something more like a single-source-of-truth future -- where Metro and Classic are connected via Sync (and will have to track clear events separately), and where the cloud purportedly has the true version of your data -- does it make sense to blow away local data and then have it gradually re-converge to other devices as records are downloaded again?

Thinking about this from a different perspective: if we imagine sync running every ten seconds, and each time doing a full sync, the user won't be able to tell the difference. Everything just works; it's just more expensive than what we actually do now.

Everything, that is, apart from Clear History, which will only scrub the last nine seconds of history. The rest will be immediately recovered from the cloud.


You'll also see weird behavior if you browse a bunch of sites, clear history, then pair a new tablet: all of those sites you visited will pop up in the tablet's history, even though you cleared them on your desktop.

The fact that nobody has complained about this is probably due to the infrequency of both adding devices and clearing history.


If a user wants to scrub only the last few minutes of browsing, we support that (and we also support per-window private browsing now). IMO those last few minutes should get erased everywhere! So we should consider:

* Providing both options (delete my data, delete my data on this machine only), with a warning that the latter will eventually be eroded by Sync.

* Propagating clears by default, with a warning that anything you delete will be deleted from your other synced devices, too.

Note that this doesn't affect expiration, which is still per-device.

Asa?
Flags: needinfo?(asa)
I've put a lot more thought into this over the last couple of weeks and I think that the only solution that doesn't over-complicate things is that removals from one machine removes them from all machines. 

I would like to revisit the clear all history feature and tweak usability there to give people better options than exist, rather than trying to figure out how to do a one-off usability correction in sync.

So, in this new world, the profile in the cloud is canonical, and removals, up to and including "all history," should be propagated to all sync'd clients.
Flags: needinfo?(asa)
Thanks, Asa.

See also Bug 735532 for form history.
Keywords: productwanted
Priority: P2 → --
Summary: Should clearing history wipe server + remote clients? → Clearing history should wipe server and remote clients
Blocks: 833044
We need to be extra cautious with the communication and probably, explain in the product how to delete anything from one single machine (i.e. disconnect from sync first and clean history after).
(In reply to ibai from comment #18)
> We need to be extra cautious with the communication and probably, explain in
> the product how to delete anything from one single machine (i.e. disconnect
> from sync first and clean history after).

Agree absolutely.
Depends on: 833902
Keywords: uiwanted, urwanted
Summary: Clearing history should wipe server and remote clients → Clearing history should wipe server and remote clients (desktop)
Blocks: 1226369
Flags: needinfo?(kit)
Edouard and I talked about this on Tuesday.

It doesn't seem like deduping `moz_places` to match history record GUIDs will do what we want, because clearing history for a time range only removes `moz_historyvisits` for that period, not the Places themselves. If you clear history for the last 6 months, but visited Facebook before then, we still want to preserve those earlier visits. Conversely, if you visit Facebook after clearing history, but the Place still has the same GUID, we'll erroneously ignore that visit. Storing tombstone records for potentially thousands of history visits on the server is also expensive.

Another idea we had is to store tombstone records with date ranges: something like `{ id: "...", deleted: true, startDate: 1234, endDate: 5678 }`. Every time you clear history locally, we'll insert a row into a new `moz_historyvisits_deleted(guid TEXT, startDate INT, endDate INT, syncStatus INT)` table, with the cleared date range. When another devices sees a record like this, it permanently stores the tombstone, and clears its own history for that range. For every incoming visit, we check to see if its visit date falls anywhere in our tombstone ranges, and ignore if it does.

The danger here, of course, is clock skew. A client whose clock is far into the future is especially problematic because it might upload a tombstone for visits that don't even exist yet ("doomstones", a la https://aphyr.com/posts/299-the-trouble-with-timestamps). We could reduce this risk by storing the duration (`endDate - startDate`), instead of the start and end dates on the server. When we store the tombstone locally, we calculate the start and end date relative to our clock. If our clock is wrong, we might not clear out all past visits, but at least we won't doom future visits. We can also have a maintenance task periodically check these date ranges and see if they're sensible...so, if we fix our local clock, we won't end up in the same problem.

But the vague plan is to store time slices in tombstones. Richard, I expect you'll have thought about this, too; please comment at your convenience. :-)
Flags: needinfo?(kit) → needinfo?(rnewman)
I think I'm missing a little bit of context here. I imagine you're talking about storing tombstones for history in order to represent deletions of visits.

To switch vocabulary for a moment, you're talking about inventing an 'excision marker' in order to handle deletions of component data.

You don't want to accidentally excise future assertions, but you also don't want to describe the entity you're deleting! ("Delete my visit to deepsyncz.com from yesterday".)

The marker has a particular place in the timeline, and it implicitly rewrites history prior to the marker.

You've done this by referring to the entity by opaque identifier, and by implicitly denoting a range of visits.

This is a more limited form of having an _explicit_ collection of deleted visits, which avoids doomstones and issues of slice representation, at the cost of being more verbose.

Both formulations expose some information about the visits you deleted -- unless you're deleting the entire record along with its URI, the tombstone effectively permanently stores the visits you're deleting!



The bigger issue here is that a client that doesn't understand the extensions to your tombstones will delete the entire record -- if I over-zealously delete a facebook.com visit by asking Firefox to forget the last five minutes, my older devices will forget I ever visited Facebook. If you try to avoid that by writing a new record, you'll have a different set of issues.

You're talking about extending the history format, and that carries with it all the trouble of a seven year old bug :)
Flags: needinfo?(rnewman)
(In reply to Richard Newman [:rnewman] from comment #21)
 
> The marker has a particular place in the timeline, and it implicitly
> rewrites history prior to the marker.
>
> …
>
> unless you're deleting the entire record along with its URI…

I wanted to expand on this a little.

In Sync as it stands today, if you delete "abcdef" on one device, all future devices to sync (those that have synced in the past and thus associate "abcdef" with the same page, anyway) will see that deletion and drop _all_ visits to that history item.

That includes visits made since the remote deletion, potentially days later.

Hooray, a distributed systems problem.

When I talk about an excision marker on a timeline, I'm talking about the obvious solution to this: if we had a way to denote a deletion that was an intrinsic part of the sequence of changes stored on the server, then we'd be able to avoid issues like this, because it would be clear by sequencing what information was being deleted.
Priority: -- → P2
Whiteboard: [Sync Q4 OKRs]
Richard and I chatted more about this over IRC last week, and a vague plan is starting to form in my head. One of the problems with just uploading tombstones is that we'll still keep all the old visits (and URLs, even if all visits for that URL are cleared) on the server. So we'll want to excise visits from records, in addition to uploading tombstones for other clients to clear out visits.

There are three kinds of deletions that we need to think about: "Delete Page" (removes all visits for a Place from history: https://searchfox.org/mozilla-central/rev/d0448c970093f94bd986a21d3a9e8c366b316eb4/browser/components/places/content/controller.js#243-245), "Forget About This Site" (removes all visits for an entire domain, which might have multiple Places: https://searchfox.org/mozilla-central/rev/d0448c970093f94bd986a21d3a9e8c366b316eb4/browser/components/places/content/controller.js#247,254-255), and "Clear Recent History" (removes visits for a date range, for multiple Places: https://searchfox.org/mozilla-central/rev/d0448c970093f94bd986a21d3a9e8c366b316eb4/browser/base/content/sanitize.js#342-350).

* "Delete Page" and "Forget About This Site" can upload classic, one-off tombstone records, in the usual `{ guid: "placeAAAAAAA", deleted: true }` format. When a client gets a record like this, it deletes all visits for that Place, and the Place itself, if its `foreign_count = 0`.

One complication here is that the Place should still exist in `moz_places` if it's referenced anywhere else, like a bookmark or keyword: search for `foreign_count` in https://searchfox.org/mozilla-central/rev/d0448c970093f94bd986a21d3a9e8c366b316eb4/toolkit/components/places/nsPlacesTriggers.h#218. In that case, we should either not store the tombstone for `placeAAAAAAA`, or store the tombstone and change the GUID of `placeAAAAAAA` to something else. If we store the tombstone but keep the GUID, we'll mistakenly ignore new visits to the previously-deleted page.

Another complication is that we'll need to dedupe GUIDs in `moz_places` to incoming history record GUIDs. Without that, a Place for http://example.com might have GUID "placeAAAAAAA" on device 1, and "placeBBBBBBB" on device 2. If you click "Delete Page" on "placeAAAAAAA" on device 1, we won't currently delete "placeBBBBBBB" on device 2, because we match by URL instead of by GUID. One way around this is to include the deleted URL in the tombstone, but now we're storing information that "this specific URL was deleted"...which, as Richard says, exposes information about the entity we're deleting!

"Forget About This Site" will write multiple tombstones, for each subdomain and URL. For example, if you forget "example.com", and your `moz_places` has rows for "http://example.com/1", "http://example.com/b/2", and "http://a.example.com", we'll upload 3 tombstones. That seems OK.

* The "Clear Recent History" case is more interesting, because it's clearing a large range of visits, for many Places. This is where we can use the special range-based tombstone...something like `{ guid: "recordAAAAAA", deleted: true, duration: -86400000000, clearedAt: 1508446534860000 }` to represent "clear everything from the last hour", for example, or `{ duration: -1508446534860000, clearedAt: 1508446534860000 }` for "clear everything". These could be stored permanently in a `moz_historyvisits_deleted` table. In this case, "recordAAAAAA" is only used to identify the Sync record; it's not a Place GUID. Old clients that don't know about these kinds of tombstones will execute `PlacesUtils.history.remove("recordAAAAAA")`, which does nothing because there's no row in `moz_places` with that GUID. (Barring GUID collisions, of course, which would break so many other Sync and Places assumptions that it's not worth worrying about here).

To reduce churn, it probably *doesn't* make sense to reupload history records for *all* affected Places, even though their `visits` arrays might change. However, I think it could make sense to also issue `DELETE`s for all Places that no longer have any visits after clearing. For example, if you clear history from the last day, but you visited Facebook earlier in the week, then we won't reupload another history record for Facebook. Our range-based tombstone should be enough for clients to excise the visits from the past day. However, if you visited (say) WebMD today, and you clear everything from the past day, then we upload the range-based tombstone and hard-DELETE WebMD from the server. If the WebMD visit was already synced to your other devices, they'll delete it when they apply the range-based tombstone. If you clear everything, we wipe the history collection on the server, then upload the range-based tombstone for "clear everything".

Ed and Richard, what do you think about this approach? It's more complicated, and I'm not entirely sure it gets us closer to resolving the issues that Richard pointed out. (We're hacking around the fact that visits in today's Sync record format are component entities, without identifiers).

Could we simplify, or break this down into multiple parts (permanently storing one-off tombstones, deduping local `moz_places` GUIDs to remote history record GUIDs, storing and syncing range-based tombstones, handling deletions for ranges)? How much work can we do to make this "good enough", until we have a storage and Sync system that can represent deletions like this?
Flags: needinfo?(rnewman)
Flags: needinfo?(eoger)
(In reply to Kit Cambridge (he/him) [:kitcambridge] (UTC-7) from comment #23)
> `{ guid: "recordAAAAAA",
> deleted: true, duration: -86400000000, clearedAt: 1508446534860000 }` to
> represent "clear everything from the last hour", for example, or `{
> duration: -1508446534860000, clearedAt: 1508446534860000 }` for "clear
> everything". These could be stored permanently in a
> `moz_historyvisits_deleted` table.

I meant "clear everything from the last day". :-) To avoid doomstones, we can clamp `clearedAt` to the current local time, using something like `WHERE MIN(clearedAt, :currentLocalTime) - duration >= :incomingVisitDate AND :incomingVisitDate <= :currentLocalTime` to decide whether to ignore an incoming visit. The lower bound might be off: we might apply old visits that we should've ignored, but we won't ignore new visits that we should apply.
(In reply to Kit Cambridge (he/him) [:kitcambridge] (UTC-7) from comment #23)
> To reduce churn, it probably *doesn't* make sense to reupload history
> records for *all* affected Places, even though their `visits` arrays might
> change. However, I think it could make sense to also issue `DELETE`s for all
> Places that no longer have any visits after clearing. For example, if you
> clear history from the last day, but you visited Facebook earlier in the
> week, then we won't reupload another history record for Facebook. Our
> range-based tombstone should be enough for clients to excise the visits from
> the past day. However, if you visited (say) WebMD today, and you clear
> everything from the past day, then we upload the range-based tombstone and
> hard-DELETE WebMD from the server. If the WebMD visit was already synced to
> your other devices, they'll delete it when they apply the range-based
> tombstone. If you clear everything, we wipe the history collection on the
> server, then upload the range-based tombstone for "clear everything".

Another idea: just wipe the entire collection, upload a tombstone with the time slice, and don't bother with individual DELETEs. Or check how many Places would be affected: if more than, say, 5000, wipe; if less, issue the DELETEs. The old visits will still remain on the clients; they'll only clear visits back to the start of the range in the tombstone.

With this approach, new clients will only see history that existing clients uploaded since the tombstone, and not the complete history. But that's already true today, since we only sync a subset.
> * "Delete Page" and "Forget About This Site" can upload classic, one-off
> tombstone records, in the usual `{ guid: "placeAAAAAAA", deleted: true }`
> format. When a client gets a record like this, it deletes all visits for
> that Place, and the Place itself, if its `foreign_count = 0`.

Which of course isn't exactly what we want -- from the user's perspective they want to put a big red line across the timelines of all of their devices, and this approach, as I mentioned in Comment 22, is a bit broader. But it's what we do now.

 
> One complication here is that the Place should still exist in `moz_places`
> if it's referenced anywhere else…

There is a mismatch between Places' data model and Sync's, and of course we can't identify individual visits, so this is always going to be a bit awkward.

To summarize for the record:

In Places, a bookmark is a reference to a place, with a stable and customizable title; it doesn't store its own URL.

And Places mostly conflates the idea of a 'place' -- the association of a cheap integer and some metadata with a URL -- with browsing history.

In Sync, _and on every other platform_, bookmarks are independent of history.


> In that case, we should either not store the
> tombstone for `placeAAAAAAA`, or store the tombstone and change the GUID of
> `placeAAAAAAA` to something else. If we store the tombstone but keep the
> GUID, we'll mistakenly ignore new visits to the previously-deleted page.

The most natural thing to do is to end the lifetime of `placeAAAAAAA` with a tombstone, and treat new visits to the same URL as visits to a new record with a new GUID.

However, there's no guarantee that an already-syncing client will see _and process_ the tombstone before it sees the new record. (If they're uploaded in the same batch, they'll have the same timestamp!)

If it does, it'll take the new remote GUID and rename `placeAAAAAAA` locally. The tombstone will be a no-op, and the deletion will fail and will be gradually undone, because the other client still has the visits.

If we keep the old GUID and upload a tombstone, then we need to ensure that the tombstone is processed first, always, and only once. That's really challenging. Furthermore, once a client has deleted visits, if another client ever re-uploads them, they'll simply be re-visited: nobody is keeping track of which ones were deleted!


> These could be stored permanently in a
> `moz_historyvisits_deleted` table.

I probably wouldn't approve that in a privacy review.


> Ed and Richard, what do you think about this approach? It's more
> complicated, and I'm not entirely sure it gets us closer to resolving the
> issues that Richard pointed out. (We're hacking around the fact that visits
> in today's Sync record format are component entities, without identifiers).

I'm not feeling hopeful that we can produce a solution here that's better than using the Sync commands channel. That has a pile of edge cases, but so does a persistent tombstone approach, and the latter has more worrying privacy warts.


> Could we simplify, or break this down into multiple parts (permanently
> storing one-off tombstones, deduping local `moz_places` GUIDs to remote
> history record GUIDs, storing and syncing range-based tombstones, handling
> deletions for ranges)? How much work can we do to make this "good enough",
> until we have a storage and Sync system that can represent deletions like
> this?

I will think more, as I'm sure you will!
Flags: needinfo?(rnewman)
(In reply to Richard Newman [:rnewman] from comment #26)
> In Places, a bookmark is a reference to a place, with a stable and
> customizable title; it doesn't store its own URL.
> 
> And Places mostly conflates the idea of a 'place' -- the association of a
> cheap integer and some metadata with a URL -- with browsing history.
> 
> In Sync, _and on every other platform_, bookmarks are independent of history.

I find it helpful to think of bookmarks and history as different views into Places:

* Overlaying `moz_bookmarks` onto `moz_places` gives you bookmarks.
* Overlaying `moz_historyvisits` onto `moz_places` gives you history.

> The most natural thing to do is to end the lifetime of `placeAAAAAAA` with a
> tombstone, and treat new visits to the same URL as visits to a new record
> with a new GUID.

Agreed. It occurs to me that, if we do this, we can use the same technique as for bookmarks: give every row in `moz_places` a `syncStatus`, and have fresh visits use `NEW`. When we apply a tombstone, only remove visits for Places that aren't `NEW`. Rationale: since they're `NEW`, they won't exist on the server, so clearing your history on another device shouldn't consider them because that other device doesn't have these visits yet...even if they're for a URL that already exists.

> I probably wouldn't approve that in a privacy review.

Hmm, could you elaborate on this? I would think it's no different than storing bookmark tombstones permanently in `moz_bookmarks_deleted`, which Android already does (bug 1388884, bug 1343103). The schema for `moz_bookmarks_deleted` is `(guid, dateRemoved, syncStatus)`; `moz_historyvisits_deleted` would look something like `(guid, duration, dateCleared, syncStatus)`.

Analyzing `places.sqlite`, you'd know *that* history was cleared at this point in time, but not *which* URLs were affected. We would not store URLs for any tombstones; that's why we need deduping for the one-offs.
Flags: needinfo?(rnewman)
(In reply to Kit Cambridge (he/him) [:kitcambridge] (UTC-7) from comment #27)

> > I probably wouldn't approve that in a privacy review.
> 
> Hmm, could you elaborate on this? I would think it's no different than
> storing bookmark tombstones permanently in `moz_bookmarks_deleted`, which
> Android already does (bug 1388884, bug 1343103). The schema for
> `moz_bookmarks_deleted` is `(guid, dateRemoved, syncStatus)`;
> `moz_historyvisits_deleted` would look something like `(guid, duration,
> dateCleared, syncStatus)`.
> 
> Analyzing `places.sqlite`, you'd know *that* history was cleared at this
> point in time, but not *which* URLs were affected. We would not store URLs
> for any tombstones; that's why we need deduping for the one-offs.

My concerns are this:

- If you don't delete all visits for a URL, only some, then 'deleting' visits in this way at most blurs exactly when those visits occurred, because the tombstone stores a record of which ones you deleted, and the site is still in moz_places! Typically you'll leave a trail that's five minutes wide.

Indeed, that would be a super interesting query:

SELECT DISTINCT url FROM moz_historyvisits_deleted, moz_places ON guid;

tells you every URL I tried to hide.

If you're a teenager who's been told to stop visiting a site, or a journalist, your "plausible deniability" history -- there are still old visits in Places -- just made Forget a harmfully misleading feature. You thought you deleted your recent visits, but in fact you did not, you just made them blurry.

- Because you're storing a time range, you're leaking user behavior that the user tried to eliminate, _regardless of URL_.

SELECT duration, dateCleared FROM moz_historyvisits_deleted;

is a really interesting query -- I know exactly when you were sneakily browsing.

- If you do delete all visits, then anyone with an earlier copy of your profile, or an earlier snapshot of your sync account, or a detailed sync log — anything that ever associated a GUID with a URL — can _for all time_ see the GUIDs that you deleted and approximately when you visited them, even if those visits occurred after the snapshot, and cross-reference those GUIDs to the URL. It's harder than if the URL were still in the same DB, of course.


One can generalize these concerns to: storing the date range _permanently_ stores some of the information that the user is trying to eliminate, and storing the GUID _permanently_ stores an association between a time range and a URL, even if the mechanism for finding the URL can be difficult.

The worry, then, is that we might be moving from a situation in which behavior is buggy but transparent to one in which there is a privacy leak that is not obvious to the user.


Compare and contrast to a world in which each visit is given a unique identifier.

A deletion of a visit would persist only the identifier -- it would no longer be linked in any way to the page that was visited. Even if the page were still in history, there'd be no way to find out what the deleted visit represented. Indeed, the only time range that would leak is the longest possible one -- it was earlier than the deletion occurred. It's not possible to find out when you were browsing, what you deleted, or even if you visited a new URL -- only that a certain number of identities were deleted at some point.
Flags: needinfo?(rnewman)
Thanks for explaining, this was helpful! CCing Alex because this has product implications, too.

(In reply to Richard Newman [:rnewman] from comment #28)
> If you don't delete all visits for a URL, only some, then 'deleting' visits
> in this way at most blurs exactly when those visits occurred...

That's how "Clear History" works today. If you don't clear all visits for a URL (unless you select "Everything"), it'll still hang around in your history. Once all visits go away, the URL gets deleted. A Place without any visits, bookmarks, or keywords gets dropped automatically, and we won't sync history records without any visits.

> SELECT DISTINCT url FROM moz_historyvisits_deleted, moz_places ON guid;

Hmm, I still don't follow. That query doesn't make sense to me: a GUID can't exist in both `moz_places` and `moz_historyvisits_deleted`. Either it's in `moz_places`, in which case we have the URL and everything for it, or it's a tombstone and we don't know anything about it. For ranges, the tombstone GUID is completely arbitrary; not related to anything in `moz_places`, and not linked to the page that was visited.

This is a risk with one-off deletions, though. The same is true for bookmarks (and all synced data that's deleted later), except bookmarks are more permanent: you make a conscious decision to star a page, whereas history is saved automatically. We can represent one-offs as multiple range tombstones, one for each visit, with a random GUID, a duration of 0, and the visit time as the cleared time. If you visit a page on another device at the exact same time (or skewed clocks), applying the tombstone will clear those out, too.

IOW, we work around the fact that visits don't have identifiers by using the visit time as the identifier.

> Because you're storing a time range, you're leaking user behavior that the user tried to eliminate, _regardless of URL_.

That makes sense. If you're using Sync, it's possible your other devices won't see the deletions immediately, and so the visits you tried to excise will still exist until the next sync. This is a general problem with tombstones. It's easy to say "don't use Sync" or "use private browsing for sites you don't want synced", but people forget. That's why clearing your history exists as a backup. We should do the best we can to make that work, even when Sync is connected.

> The worry, then, is that we might be moving from a situation in which behavior
> is buggy but transparent to one in which there is a privacy leak that is not
> obvious to the user.

I don't think it's transparent how clearing your history interacts with Sync today. You'll find out when you switch to another device, and see all your visits are...still there. Sending a command has similar risks: now it appears to work, but visits might still be resurrected, or the command might be lost at the wrong time if shutdown is interrupted, or, depending on whether it's processed before or after the history sync, clear too much or not enough. You still need to know the period, too.

ISTM that permanent storage of tombstones can be split out from "how do we sync mass deletions." We should anticipate keeping tombstones around, but we don't need to do this until we work through the privacy implications. Alex, I expect you'll have interesting insights here, too.

> A deletion of a visit would persist only the identifier -- it would no longer
> be linked in any way to the page that was visited. Even if the page were still
> in history, there'd be no way to find out what the deleted visit represented.

I think the problems you pointed out also apply here. With an earlier copy of your profile, or an earlier snapshot of your Sync account, you can still see everything that was deleted. You can look at the timestamps on the deleted visits (especially if we persist them permanently, so that they're not resurrected!), and know when the entities were deleted, even if you can't know when they were created.
Perhaps I'm misunderstanding your proposal…

> > If you don't delete all visits for a URL, only some, then 'deleting' visits
> > in this way at most blurs exactly when those visits occurred...
> 
> That's how "Clear History" works today. If you don't clear all visits for a
> URL (unless you select "Everything"), it'll still hang around in your
> history.

I understand your proposal to be:

* Add ten visits for U1 between T=1 and T=100.
* Choose "Forget" for T=[50,101]
* moz_places still contains U1.
* moz_historyvisits contains visits for U1 between T=1 and T=50.
* moz_historyvisits_deleted contains [U1, 50-101].

The difference is that right now we fail to sync the deletion, but there's no trace of those visits between 50-100.

As I understand your proposal, we would now permanently track that you visited U1 between 50-100. That's what I mean by blurring -- we no longer know you visited U1 five times at precise instances, but we do know that you visited it after T=50 and before T=101.

But…

> For ranges, the tombstone
> GUID is completely arbitrary; not related to anything in `moz_places`, and
> not linked to the page that was visited.

Ah, then yes, I misunderstand your proposal.

In that case, how do we know which URL to delete visits for?

If your proposal is "we don't -- delete everything in that timeframe" then doesn't that have a fundamental problem with identifying which device's visits should be deleted?

After all, if I wipe the last hour's visits on my desktop, I don't intend to wipe the browsing I was doing on my phone…


> I don't think it's transparent how clearing your history interacts with Sync
> today.

By 'transparent', I mean that all of the data we're storing is visible in the UI on your devices. Sync might not move deletions around, but if you look in your history on each of your devices in turn, you'll see everything Firefox knows -- they just might not all agree. That's what I mean by "buggy but transparent".

If we move to a scheme in which detailed tombstones are recorded in the DB, we might reach convergence in the UI, but we'll also have a broader forensic trail than we display in the UI, so it's harder for a user to predict what information a snoop might be able to dig out of the DB.


> I think the problems you pointed out also apply here. With an earlier copy
> of your profile, or an earlier snapshot of your Sync account, you can still
> see everything that was deleted. You can look at the timestamps on the
> deleted visits (especially if we persist them permanently, so that they're
> not resurrected!), and know when the entities were deleted, even if you
> can't know when they were created.

Part of my implication is that if you see an opaque _non-typed_ identifier like "1234567", with no metadata at all other than "deleted at T1234", the only thing you can determine is that Firefox deleted _something_ at that time.

You would need to see a copy of the DB that actually described 1234567 to know it was a visit, and at that point you'd also know what page it represented. Because visits are uniquely identified only by the combination of (timestamp, page, type), approaches to deleting visits in the current Sync scheme are more likely to expose a page.

It's not possible in the general case to hide the fact that something was deleted, but _if every entity has an identifier_, then we can at least hide the nature of what was deleted.

But we're well off into the weeds of alternative data modeling strategies here :)
(In reply to Richard Newman [:rnewman] from comment #30)
> I understand your proposal to be:

Ah, not quite. My proposal is more like:

* Add ten visits for U1 between T=1 and T=100.
* Add five visits for U2 between T=51 and T=100.
* Sync.
* Choose "Forget" for T=[50,101].
* `moz_places` still contains U1.
* `moz_places` does *not* contain U2, since we removed all its visits.
* `moz_historyvisits` contains visits for U1 between T=1 and T=50.
* `moz_historyvisits` has nothing for U2.
* `moz_historyvisits_deleted` contains [RANDOM_GUID(), 50-101].

So, we'll store the fact that you cleared *some* visits between 50-101, but there's nothing linking that tombstone to U1 or U2.

> If your proposal is "we don't -- delete everything in that timeframe"...

Correct. Alex, when you have cycles, I'd like to get your thoughts on this behavior, to make sure this is still what we want.

If you wipe the last hour's visits on your Desktop, we'll also wipe the visits for the last hour's synced Places on your phone, and any other devices connected to your account. We can leave unsynced Places alone (pages that you haven't visited on your Desktop, and that your phone hasn't uploaded to the server yet).

> doesn't that have a fundamental problem with identifying which device's visits
> should be deleted?

It does. I think tracking the `syncStatus` in `moz_places` can mitigate this, though there is an edge case if we store the `syncStatus` in `moz_places` and not `moz_historyvisits`. If you visit a synced site on your phone before you cleared your history on Desktop, then we'll also clear those visits from your phone, because they fall into the cleared timeframe...even though they're new visits. We can be clever and track the `syncStatus` on visits instead, but that makes other queries tricky, and a component entity without a stable ID having a sync status is a strange concept.

Continuing the example from above, let's say you do this on another device:

* Add five visits for U1 between T=75 and T=100.
* Add five visits for U1 between T=110 and T=200.
* Add five visits for X1 between T=0 and T=100.
* Sync.
* `moz_places` still contains U1.
* `moz_historyvisits` contains visits for U1 in the ranges [1-50] and [110-200] *only*. `moz_historyvisits` does not contain the five visits you added in [75-100], even though they're new visits, because the tombstone is for [50-101].
* `moz_places` still contains X1.
* `moz_historyvisits` contains visits for X1 between T=0 and T=100, because the first device didn't know about X1. It only knew about U1 and U2 when you cleared history, so we leave X1 alone.

> Part of my implication is that if you see an opaque _non-typed_ identifier like
> "1234567", with no metadata at all other than "deleted at T1234", the only
> thing you can determine is that Firefox deleted _something_ at that time.

That makes sense. In practice, it's still possible to figure out which data type was deleted by inspecting `moz_historyvisits_deleted` and `moz_bookmarks_deleted`, and by slurping down, decrypting, and inspecting the server collection contents. (You probably don't even need to decrypt; tombstones are so much smaller that you can infer what they are based on the payload size). I also agree with you that alternative data modeling strategies are good to discuss for the long-term storage rework, but not really feasible for incremental improvements to Sync.
Flags: needinfo?(eoger) → needinfo?(adavis)
Alex and I chatted about this yesterday. The approach we discussed is:

1. Deleting a single page (open the Library, right-click a row, and choose "Delete Page") stores and uploads a "classic" tombstone (e.g., `{ id: "placeAAAAAAA", deleted: true }`) for that page. When another device applies this tombstone, it drops all visits for `placeAAAAAAA`. If that page isn't mentioned anywhere else (bookmarks, keywords), the other device will also remove `placeAAAAAAA` from its `moz_places` table.

2. Selecting all visits for a particular view (e.g., select "Last 7 days", then press Cmd+A to select all visits) uploads a single "range" tombstone (e.g., `{ id: "randomGUIDAA", deleted: true, duration: -604800000000, clearedAt: 1509038175804000 }` means "clear the last 7 days of synced history from this date"). This clears all synced visits for that date range on all your devices. We won't prompt here, but we can rethink that if we're worried about accidental data loss. More on that below.

3. Right-clicking a time (e.g., "Today", "Yesterday", "This month") underneath "History," and choosing "Delete," also uploads a range tombstone for the selected duration. This is really the same as (2).

4. Selecting multiple *discontinuous* visits, right-clicking, and choosing "Delete Pages" stores and uploads multiple classic tombstones for all selected pages. We can optimize this to upload a single (or multiple, discontinuous) range tombstone later. I'd prefer to avoid it for the initial implementation because it's tricky, and I'd like to have telemetry about how frequently our users do this before investing more work here.

5. Selecting a single row, right-clicking, and choosing "Forget About Site" uploads classic tombstones for all URLs under that domain. We can't use a range tombstone here, because you might have visited different subdomains or paths under that domain at different times. We can't use a single tombstone and include the cleared domain as a field, because that leaks the URL you're trying to delete. You can only forget sites one at a time, so that should limit the tombstone volume.

6. Clearing recent history, via "History > Clear Recent History" or the "Forget" button, uploads a range tombstone for the selected duration. The options in "Clear Recent History" are last hour, last 2 hours, last 4 hours, today, and everything. The options in the "Forget" button are last 5 minutes, last 2 hours, and today.

For (6), if you opt to clear "Everything", and have history sync turned on, we'll show a modal dialog that asks you something like, "Do you want to clear history on all your devices, or just this one?" (We need to figure out messaging here). If you say "Clear everywhere," we'll store and upload a range tombstone. If you say "Just this device," we'll store a range tombstone *locally*, but not upload it.
Flags: needinfo?(adavis)
All that said...I thought more about what Richard said in comment 30, and there's a fatal flaw in my proposal. What happens if you connect a *new* device to your account, and it has visits in a range that you previously cleared?

1. The new device will see the tombstone on the server, and won't apply it because all its local URLs are still "NEW". It'll then upload the last 30 days of its visits, and set their local sync status to "NORMAL". Good.

2. Other devices will download the new device's visits, and check if they fall into any cleared ranges. They do, so the other devices will ignore all those visits from the new device. Oops.

The problem with (2) is that the existing devices can't tell if something is an old visit from a new device, or a resurrection. Without persistent, stable ways to identify visits, this is hard to solve. The edge case I noted in comment 31 might be forgivable, but I don't think this is. Someone might clear the last week of history on their Desktop, connect their new phone two weeks later, and wonder why they don't see any history from three weeks ago on their Desktop...even though it's right there on their phone!

So, what can we do instead?

1. Go the other direction, and clear that week of history on the phone, too. That's even more surprising, because connecting a new device to your account now becomes a destructive action that might wipe some of its history, long before that new device was part of your account. Indeed, you might disconnect and reconnect because you accidentally cleared history on another device, and want to get it back. Instead, we'll clear it from the new device, too. (Though that's not specific to history; we should also think about this for bookmarks).

2. Punt on handling resurrections, and avoid storing those tombstones locally. At that point, we might as well use the commands channel, as Richard suggests in comment 26. Otherwise, we'll do a whole lot more work (add `syncStatus` and `syncChangeCounter` to `moz_places`, implement deduping for history records, record and store two different types of tombstones), and still not fix the actual issue.

3. Give visits their own GUIDs. This requires a new collection, it duplicates data on the server (every visit will need to carry its URL), and it'll bloat the table sizes on the server and Places (I think we were talking about Mentat using integer IDs to represent component entities like this, because GUIDs take up more bytes). Also, we'll still need to think about how to store mass deletes on the server, because uploading hundreds or thousands of tombstones for every visit when you clear history won't fly.
(In reply to Kit Cambridge (he/him) [:kitcambridge] (UTC-7) from comment #31)

> If you wipe the last hour's visits on your Desktop, we'll also wipe the
> visits for the last hour's synced Places on your phone, and any other
> devices connected to your account. We can leave unsynced Places alone (pages
> that you haven't visited on your Desktop, and that your phone hasn't
> uploaded to the server yet).
 
Speaking as a user, I'd find that surprising. Not just because the action is spooky-at-a-distance, but also because I can't predict what will happen and when:

- When two devices sync is unpredictable and not transparent.
- Which history items will be deleted (those with a shared places record) and those that won't (those that are genuinely new) is an implementation detail to which I'm not privy.


I haven't weighed them up too closely, but I suspect I'd rather keep the status quo than surprisingly delete history at a distance. That's doubly true if we might surprisingly delete history from a device that isn't even yet connected to the account, perhaps weeks in the future.

I'm very leery of storing persistent deletions that aren't tied to specific records -- it's baggage that moves around between accounts and devices, and can have surprising consequences. But tying to specific records is leaky and expensive.


(In reply to Kit Cambridge (he/him) [:kitcambridge] (UTC-7) from comment #33)

> The problem with (2) is that the existing devices can't tell if something is
> an old visit from a new device, or a resurrection.

I think there's a different phrasing for this: it's not easy in Sync to establish a sequencing for deletions and visits. You're essentially saying here that we don't know if the visits on the new device were created (not the same as the date of the visit!) on the same timeline as, and prior to, the deletion.

Tombstones aren't anchored on a timeline, nor do we have any concept of multiple timelines, so they can delete facts from another timeline altogether -- from another device that isn't yet signed in. Indeed, Sync visits and records themselves aren't anchored on a timeline, which appears as resurrection!
(In reply to Kit Cambridge (he/him) [:kitcambridge] (UTC-7) from comment #33)

> 2. Punt on handling resurrections, and avoid storing those tombstones
> locally. At that point, we might as well use the commands channel, as
> Richard suggests in comment 26.

Indeed, using the commands channel is Sync's hack for imposing temporal ordering and exactly-once application…

If our desired product experience is to delete all history in a time window, on all devices, then the commands channel is a reasonable way to do that. (With appropriate attention paid to clock skew, obviously -- perhaps both start and end times should be relative to the server timestamp?)

If it's to delete only local visits, then IMO it's not unacceptable to _explicitly list page GUIDs and visit timestamp ranges_ within a command. Commands are transient, which avoids most of my privacy concerns around persistent tombstones.

Commands are processed before the history stage runs, so we could upload 'cleaned' history records, then send a command to say "delete visits between T1 and T2 for history GUIDs A, B, C", and other clients will end up in the correct state.

The contents of the command could actually be a reference to a 'manifest' temporarily stored elsewhere on the Sync server, too, so the command doesn't have to be ridiculously large: the command itself at that point is simply used to specify which devices should delete visits, and to tie that operation to a moment in time.
Assignee: nobody → kit
Status: NEW → ASSIGNED
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
I've had a few discussions about this with Thom, Mark, and Richard, which I'll summarize here. TL;DR: This is painfully complicated to do in the current Sync model, and it's not clear that it's a good use of our time. We can't have nice things. :-(

Conceptually, the hard part is "clearing a subset of visits for a URL in your history." Deleting all visits for a URL is easy: we just replace the history record with a tombstone. Clearing all your history is also easy: we wipe history on all your devices. (Since we don't sync your full history, it's possible that your devices might disagree on what "everything" means, so it's not completely easy, and we might have to go through the contortions I describe below for all cases...but let's ignore that for now).

There are several ways to clear subsets of visits in the UI. You might use "History > Clear Recent History", the Forget button, or the Library view. In the Library, you might right-click and delete a time in the left pane, or you might search for a site, select everything (or a subset!), and delete. The issue here is that these subsets might span multiple history records (URLs), but we don't have a way to identify the visits in those subsets.

We can persist affected history record GUIDs, along with the cleared timestamps, but now that's a privacy concern, as Richard pointed out in comment 28. Clearing a subset of your history stores the fact that you cleared those visits, for a GUID (and URL) that still exists in `moz_places`. Trying to get by without storing affected GUIDs runs into the "clear history visits from an unrelated timeline" issue from comment 34. When you connect a new device with existing visits, other devices can't tell between resurrections caused by a node reassignment (where it would be surprising to restore previously cleared visits), and visits that happened to occur on the newly-connected device within the cleared range (where it would be surprising *not* to apply those visits). Remember that records don't know their origin device, so other devices can't use that to help decide.

In addition to the resurrection case, I think we'd also need to persist history deletions for the "backfill" case (https://searchfox.org/mozilla-central/rev/bab833ebeef6b2202e71f81d225b968283521fd6/services/sync/modules/engines.js#1183-1186). This is the case where we see a tombstone *before* we see visits that we should ignore. Buffering tombstones in memory and applying them at the end of the sync is not enough here, because, thanks to the 5k limit, we might not fetch records containing those stale visits until the next sync, or the sync after that, and so on.

Another problem with storing date ranges is that tombstones and visits are affected by wrong clocks, and clamping might cause us to miss clearing some visits, but we'll ignore that for now.

Thom convinced me of the futility of using the commands channel last week. We still need to store tombstones to handle the backfill case, a manifest of "affected GUIDs" is still leaky, and needs to be stored locally and chunked across multiple commands to avoid blowing the record size limit. (Which, incidentally, is a problem now; Thom is making this more robust over in bug 1416313).

The key problems at the heart of history deletions are: 1) we don't have a way to identify visits separately from their URLs, and 2) it's possible to remove subsets of visits without removing the URL.

Thom suggested using a scheme like `sha256(moz_places.guid || moz_historyvisits.visit_date || moz_historyvisits.visit_type)` to identify visits. For every history sync, after applying all incoming history records, we'd run a query like `DELETE FROM moz_historyvisits WHERE id IN (SELECT id FROM moz_historyvisits v JOIN moz_places h ON h.id = v.place_id JOIN moz_historyvisits_deleted d ON d.id = SHA256(h.guid || v.visit_date || v.visit_type) WHERE v.syncStatus = NORMAL)`, to catch both resurrected visits, and visits from backfilled records. (Visits would also need to grow a "syncStatus" column, to avoid looking at every visit in the table). This query likely won't be cheap; we're hashing and comparing *all* synced visits against ones we've deleted.

We could also solve this more directly and give every visit its own GUID, which has its own pitfalls that I mentioned in comment 33.

We can wipe the server and other clients, have the client immediately reupload everything, and repopulate from the server on other devices. This is scary because clearing *any* subset of visits now clobbers the server and all clients. Also, the server never has a complete view of your history, so all old visits outside the subset would still be lost. As Mark suggested on IRC, in a perfect world, any clear operation would cause us to wipe the server, force a reupload, and have other devices clear and repopulate themselves, but we aren't anywhere near that.

Finally, we can preventing you from clearing subsets in the UI if Sync is configured. That's a non-starter for several reasons: there are too many ways to do that in the UI today, and we'd need to handle all of them; you can always sign out, clear, and sign back in; and it's leaking an implementation deficiency in the underlying Sync model to the user. A less extreme version of this is to show a modal warning when you clear subsets (e.g., "your history will not be cleared on <list of devices>"), but making that warning comprehensible and actionable is a UX challenge all its own.

Richard summarized this well: deleting history visits goes beyond today's Sync model. Sync can, in fairly common cases, surprise you by resurrecting history that you tried to delete, but the fixes for all these cases are very complex. "Clear Recent" also doesn't play well with other stores (DOM caches and site settings don't support clearing by range, so we clear *everything*)...so, from that perspective, Sync is inconsistently consistent with them. We can ensure we handle the "delete all visits" and "clear everything" cases well, but making subsets work is fraught.
TL;DRTL;DR: We should implement `_findDupe`, so that we can at least delete synced URLs if you clear all visits for them. We should also make sure "clear all history" works sensibly, maybe via tombstones that mentions multiple GUIDs if we're worried about clearing too much. That's it.
Fantastic summary, Kit. I'm bookmarking this!

(In reply to Kit Cambridge (he/him) [:kitcambridge] from comment #37)
> TL;DRTL;DR: We should implement `_findDupe`, so that we can at least delete
> synced URLs if you clear all visits for them. We should also make sure
> "clear all history" works sensibly, maybe via tombstones that mentions
> multiple GUIDs if we're worried about clearing too much. That's it.

This plan sounds sensible to me.
Comment on attachment 8931538 [details]
Bug 578694 - Dedupe and handle deletions for history records.

Mark, Ed: what do you think of this approach? The `deletedAt` hint is completely optional. If we don't think it's worth extending the tombstone record format just to handle the case where we might have new local visits for a page that was deleted on another device, I can drop it.

The tests need some more work, too, in particular to make sure we don't reupload tombstones and records unnecessarily.
Attachment #8931538 - Flags: feedback?(markh)
Attachment #8931538 - Flags: feedback?(eoger)
Comment on attachment 8931538 [details]
Bug 578694 - Dedupe and handle deletions for history records.

https://reviewboard.mozilla.org/r/202662/#review208174

This looks fine to me, that patch is pretty straightforward.
I think we should keep deletedAt since it can avoid some surprises for the user.

::: services/sync/modules/engines/history.js:127
(Diff revision 2)
>    },
> +
> +  async _findDupe(record) {
> +    let guid = null;
> +    try {
> +      guid = await PlacesSyncUtils.history.fetchGuidForURL(record.histUri);

Could we |return await| directly instead of using a variable?

::: services/sync/modules/engines/history.js:129
(Diff revision 2)
> +  async _findDupe(record) {
> +    let guid = null;
> +    try {
> +      guid = await PlacesSyncUtils.history.fetchGuidForURL(record.histUri);
> +    } catch (ex) {
> +      this._log.error("Error fetching GUID for record ${guid}: ${ex}",

I think guid will be null in that case.

::: services/sync/modules/engines/history.js:146
(Diff revision 2)
> +      }
> +    }
> +    return record;
> +  },
> +
> +  async _reconcile(record) {

Could we add a high level comment for this method?
From what I understand we do:
remote deleted - local deleted => nop
remote deleted - local modified => filter visits
remote new visits - local deleted => add visits
remote new visits - local modified => merge visits

Feel free to ignore my comment if you feel it's not valuable (_reconcile in engines.js is pretty similar).
Attachment #8931538 - Flags: feedback?(eoger) → feedback+
Thanks, Ed!

This patch isn't quite complete: we still need to upload tombstones when you clear all history. That's tricky, as we currently only fire an `onHistoryCleared` notification, without any GUIDs. We won't fire any `onDeleteURI` or `onDeleteVisits` notifications then.

I considered firing `onDeleteURI` for pages, too. That's not very efficient; it requires reading lots of rows into memory, and forces UI observers to do a lot of busy work refreshing views for about-to-be-deleted pages. I also thought about using a `BEFORE DELETE` trigger that calls a `NOTIFY_DELETED(url, guid)` SQL function, and adding a `notifyDeletedURIsOnClearHistory` attribute to `nsINavHistoryObserver`. That lets Sync opt in to receiving the extra notification, and solves the UI refresh problem. However, since (I think) the SQL function will be called on the storage thread, it'll still force a lot of main thread dispatches for consumers that largely ignore the events.

Both approaches will also require the history tracker to know that the `onDeleteURI` notifications are part of a "clear" operation, and shouldn't be uploaded individually. We could do that, but a tombstone per visited URI is a *lot* of records to upload.

Another idea is to add `syncChangeCounter` and `syncStatus` columns to `moz_places`. This is a lot simpler than for bookmarks, since history is already async, and everything goes through `History::UpdatePlaces`. I've started working on a patch to do that. We can add a `PlacesSyncUtils.history.fetchTombstones` method, and decide how to handle them: if it's below a threshold (10 sites?), upload individual tombstones; otherwise, upload a bulk tombstone like `{ guid: "dummy-guid", deleted: true, guids: ["place123", "place456", ...] }`.
Priority: P2 → P1
Comment on attachment 8931538 [details]
Bug 578694 - Dedupe and handle deletions for history records.

https://reviewboard.mozilla.org/r/202662/#review208174

> Could we |return await| directly instead of using a variable?

TBH, and just FYI, I prefer the way it is written - it makes it clearer it is returning a guid. I'm happy with whatever Kit decides though.
Comment on attachment 8931538 [details]
Bug 578694 - Dedupe and handle deletions for history records.

https://reviewboard.mozilla.org/r/202662/#review208174

> I think guid will be null in that case.

That looks correct to me - it will log the record's ID
Comment on attachment 8931538 [details]
Bug 578694 - Dedupe and handle deletions for history records.

https://reviewboard.mozilla.org/r/202662/#review208772

Looks great, thanks!

::: services/sync/modules/engines/history.js:206
(Diff revision 2)
> +          this._modified.changeID(localDupeGUID, record.id);
> +        }
> +        await this._switchItemToDupe(localDupeGUID, record);
> +        existsLocally = true;
> +      } else {
> +        this._log.trace("No local dupe for remote record ${id} with ${url}",

IIUC, this could be written as:

`this._log.trace("No local dupe for remote record ${id} with ${histUri}", record);
`

but I'm not too bothered.
Attachment #8931538 - Flags: feedback?(markh) → feedback+
I worked on adding sync status tracking for history last month, and ran into some complications that I'll summarize here.

First, I added an `isSynced BOOLEAN` column to `moz_historyvisits`, and a `lastSyncedVisitDate INTEGER` column to `moz_places`. This matches the model of how Sync works: we only upload records for new visits, not other attributes like page title changes. `SELECT h.guid FROM moz_places h JOIN moz_historyvisits v ON v.place_id = h.id WHERE NOT v.isSynced v.visit_date > h.lastSyncedVisitDate GROUP BY h.guid` gives us pages with new local visits to upload, we track the sync status for visits, and there's no need for a change counter. In theory, this also gives us a way to excise subsets of visits if we decide to do that in the near future, or to backfill visits for pages. There are problems with this approach, though. We currently only ever upload the most recent 20 visits for each page, but don't consider their sync status: they might have come from the server, new local visits, or a mix of the two. Additionally, the server might have visits that are newer than our local visits. If we fast-forward the `lastSyncedDate`, we might miss the older local visits...although, if there are more than 20 newer visits, we'll miss them, anyway.

It seemed like that added indirection with no clear gain, so I decided to add `syncStatus INTEGER` and `syncChangeCounter INTEGER` to `moz_places` instead, using the same approach as `moz_bookmarks`. Since we use `moz_places` to store URLs for keywords and bookmarks, not just visits, `syncChangeCounter` defaults to 0. Unlike bookmarks, we also don't need "UNKNOWN", since there's no way to restore history visits. The only time we bump the change counter is when non-Sync callers pass new visits to `UpdatePlaces` (https://searchfox.org/mozilla-central/rev/b24e6342d744c5a83fab5c15972e11eeb69d68e6/toolkit/components/places/History.cpp#2315). This is simpler, but still has interesting consequences. First, history records have a server TTL of 60 days (https://searchfox.org/mozilla-central/rev/b24e6342d744c5a83fab5c15972e11eeb69d68e6/services/sync/modules/engines/history.js#12), meaning newly connected devices won't have your complete history. (I mentioned possibly backfilling visits above, but, given the TTL, that's only relevant for uploading new local visits from new devices). A "NORMAL" sync status for a page, then, means "uploaded to the server at one time", not "uploaded and still on the server". If the page was last synced more than 60 days ago, it might not need a tombstone...though, if it was synced to at least one device, it will, even though it doesn't exist on the server now. See bug 1422420 for the discussion.

In addition, deduping makes things interesting if there are two records on the server with different GUIDs and the same URL. (The only reason we need deduping, in turn, is so pages have stable GUIDs that we can use for deletion). This can happen today if Places is corrupted and needs to be recovered from a backup. We don't currently back up or restore history visits (bug 431274), meaning the only rows in `moz_places` that exist after a recovery are bookmark URLs. Since the GUIDs for the URLs are also different, visiting one of those URLs will upload a new history record with the new GUID and existing URL. Today, that doesn't matter, because we ignore GUIDs and smush visits by URL. In a world where page GUIDs are meaningful, though, it's possible that two or more records *on the server* might be dupes of one another. We'd need to do what we do for bookmarks: pick one, and tombstone the other. However, we might now lose visits, because we cap at 20.

I chatted with Chris about this before the holiday. He suggested that clearing synced pages from *all* your devices when you clear history could be surprising, especially if the synced page was initially visited on another device. A better way might be to clear *visits* for pages that originated on *this* device. There are two problems with this. First, we don't have a way to identify or excise visits. Second, we don't track our round-trip the originating device for visits, pages, or any synced data type. (This also raises many of the same questions we had in bug 1302797, comment 1 for bookmarks, including what to do if you disconnect). Some people might expect that clearing all history on one device will clear everything on all, others might expect only pages or visits from this device to be cleared on the others, and (to borrow Mark's words) "the majority of people will probably just expect magic".

Many of these issues exist today, and we wouldn't really make things worse. Our existing support for deletion is half-hearted as it is. However, we would be adding a big chunk of complexity for deduping and change tracking, all so that we can remove pages from other devices when you clear history. (Remember that even this is best-effort: if you clear a subset of visits for a page, we won't remove those visits from other devices, since we have no way to identify or remove individual visits). The fact that history records expire means we don't know if a synced page still exists on the server, or any other devices. The URLs of visited *pages* might be the same on all your devices, but the *visit dates* won't be.

I wonder if it's worth trying to continue further with the change tracking approach, or if we should step back and think about how deletions *should* work, keeping in mind the constraints of the existing history record format.
Thanks for the excellent summary Kit!

(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #48)
> Many of these issues exist today, and we wouldn't really make things worse.

That's a relief ;)

> I wonder if it's worth trying to continue further with the change tracking
> approach, or if we should step back and think about how deletions *should*
> work, keeping in mind the constraints of the existing history record format.

TBH, I don't think it is worth pursuing this further at this time, and if we do in the future, we should consider it holistically - history is really just a poor design for syncing and we might be missing the forest for the trees. The lack of complaints about how poor history syncing is makes me believe that most of our users either don't notice or don't care.
> TBH, I don't think it is worth pursuing this further at this time, and if we do in the future, we should consider it holistically - history is really just a poor design for syncing and we might be missing the forest for the trees. The lack of complaints about how poor history syncing is makes me believe that most of our users either don't notice or don't care.

If this is the recommendation, I can align behind it in light of the new direction we are taking with Sync and storage in 2018.

With further thought... Kit, this just doesn't seem to be worth your time anymore and is a distraction from more important efforts.

More importantly, while this enhancement helped support our goal of providing users with more control, since there are few complaints at this time, we risk introducing a regression and causing complaints in a time where we should be focused on a new major project.

Hopefully you've learned a lot about the challenges of deletion that can be applied to "future sync".
(oops - I just realized I didn't hit "Save" on this comment)

To be clear, I don't think we should try and track tombstones for history and try to track partial history clearing. I do think we should look at doing exactly what the literal title of this bug says - if history is completely cleared, we should wipe the server.

Kit, WDYT?
(In reply to Mark Hammond [:markh] from comment #51)
> To be clear, I don't think we should try and track tombstones for history
> and try to track partial history clearing. I do think we should look at
> doing exactly what the literal title of this bug says - if history is
> completely cleared, we should wipe the server.

That sounds like it would work, and it'd be much simpler to implement. Existing devices would retain all their history, but new devices wouldn't see any of the old pages. Comment 0 suggests we should also send a command to have connected devices wipe themselves; I'm thinking, based on comment 34, that's not desirable?
(In reply to Kit Cambridge (they/them) [:kitcambridge] from comment #52)
> Comment 0 suggests we should also send a command
> to have connected devices wipe themselves; I'm thinking, based on comment
> 34, that's not desirable?

I don't have a strong opinion on that, but comment 34 does make a good case, so that sounds reasonable to me.
Thinking about this some more, I wonder if it's worth wiping the server. It won't impact existing devices; only new devices won't see that old history, and there's a TTL of 60 days on old history, anyway. It could also be surprising, for a similar reason as comment 34. Let's say you have three devices. A uploads new visits, B downloads, then you clear history on B. We erase all history on B and wipe the server. A retains all its history, including the new visits. Then C syncs. C doesn't get A's new visits, but still has some subset of all your history.

IOW, I'm starting to think that the most predictable behavior is to not sync deletions at all, and make them firmly per-device. This flies in the face of how Sync works today, where all your devices share one view of the world. (Though the cat's out of the bag in that we'll never sync your full history, anyway).

This might also be easier to explain to the user in a dialog when they clear history (something like, "Your history is currently syncing to 2 other devices. Clearing your browsing history on this device will not remove those pages from your other devices"). We could get fancy and show their device list, and offer to let them select devices to clear, which we'd accomplish through `wipeEngine` commands. But I think explaining this is going to be tricky, and require some UX.
Flags: needinfo?(markh)
Assignee: kit → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(markh)
Priority: P1 → --
Flags: needinfo?(markh)
I think Kit's comment 54 is correct and given there are so many edge-cases and a relative lack of users complaining about how things work, and given our priorities over the rest of this year, I don't think we should take any action here. However, seeing this bug has so much context I'm reluctant to actually close it as WONTFIX - so I think p5 is appropriate as we probably would consider a patch from a suitably motivated contributor.
Flags: needinfo?(markh)
Priority: -- → P5
Duplicate of this bug: 1506057
You need to log in before you can comment on or make changes to this bug.