Open Bug 717136 Opened 14 years ago Updated 2 years ago

Improved representation for history entries and visits

Categories

(Firefox :: Sync, enhancement, P3)

People

(Reporter: rnewman, Unassigned)

References

(Blocks 5 open bugs)

Details

(Whiteboard: [sync:history])

(Filing this in General, because it affects multiple products.)

telliott, atoll, and I discussed how brain-damaged the existing history representation is. We have one record per URI, and each record contains an array of visit objects. Each looks like this:

{
    id:"5qRsgXWRJZXr",
    title:"Foobar Baz",
    histUri:"http://foo.example.com/bar/baz",
    visits:[{type:1, date:1319149012372425}]
}

There are two dumb things about this.

Firstly, adding a visit requires altering the whole record. This is a side-effect of random access, but it means that Twitter with its 4,000 visits -- and frequent re-visits! -- gets re-uploaded all the time. 4,000 visits are 120KB in cleartext. This bug does not address that.

Secondly, the representation of visits is inefficient -- the type is usually redundant, the object representation verbose, and the date unnecessarily precise.

Imagine instead:

visits: {1: [1319149012372425, 12314, 878782, 1232113]}

Store the type (a fixed set of types!) as a key, the first timestamp, and then timestamp offsets in a sorted sequence. This alone would save 20-30 bytes of cleartext per visit, and be vastly more efficient to parse and process.

There is an array of better representations for timelines, of course, but advanced data structures should be balanced against client complexity. It's clear, however, that there's a lot of room for improvement.
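A minimal sketch of the grouped representation proposed above, written in Python for brevity. It assumes that each offset is measured from the first (earliest) timestamp of its type, which is only one reading of "timestamp offsets in a sorted sequence". The names `encode_visits` and `decode_visits` are hypothetical and are not part of Sync.

```python
from collections import defaultdict


def encode_visits(visits):
    """Group visits by type; store the earliest date, then offsets from it.

    Assumption: offsets are relative to the first timestamp of each type.
    """
    by_type = defaultdict(list)
    for visit in visits:
        by_type[visit["type"]].append(visit["date"])

    encoded = {}
    for visit_type, dates in by_type.items():
        dates.sort()
        first = dates[0]
        encoded[visit_type] = [first] + [d - first for d in dates[1:]]
    return encoded


def decode_visits(encoded):
    """Inverse of encode_visits: rebuild the flat list of visit objects."""
    visits = []
    for visit_type, values in encoded.items():
        first = values[0]
        visits.append({"type": visit_type, "date": first})
        for offset in values[1:]:
            visits.append({"type": visit_type, "date": first + offset})
    return sorted(visits, key=lambda v: (v["date"], v["type"]))


# The example record from this comment, expanded into individual visits.
example = [
    {"type": 1, "date": 1319149012372425},
    {"type": 1, "date": 1319149012372425 + 12314},
    {"type": 1, "date": 1319149012372425 + 878782},
    {"type": 1, "date": 1319149012372425 + 1232113},
]
encoded = encode_visits(example)
```

When serialized as JSON, the integer type key becomes the string "1", since JSON object keys are always strings.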
Blocks: 745408
Should we morph this into "devise new history record representation?"
(In reply to Richard Newman [:rnewman] from comment #0)
> Firstly, adding a visit requires altering the whole record. This is a
> side-effect of random access, but it means that Twitter with its 4,000
> visits -- and frequent re-visits! -- gets re-uploaded all the time. 4,000
> visits are 120KB in cleartext. This bug does not address that.

Why is it not possible to do incremental updates? Crypto reasons server side?

> Secondly, the representation of visits is inefficient -- the type is usually
> redundant

So, I couldn't figure out if your proposal is to have visits grouped by type, or to dismiss the type completely, because type is actually useful (*cough* frecency). Moreover, Sync currently doesn't sync from_visit, which kills basically any form of referrer and redirect support, and this proposal makes that issue basically unfixable, ... Then maybe we should just stop syncing redirect sources, 404 pages and so on, since we don't support syncing them properly.
re: dates, fwiw we could even crop them at seconds, we don't need that precision.
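As a rough illustration of cropping to seconds: the sketch below assumes the dates are microseconds since the Unix epoch, which is what the 16-digit example value in comment 0 suggests. `crop_to_seconds` is a hypothetical helper, not an existing Sync function.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def crop_to_seconds(date_us):
    """Drop sub-second precision from a microsecond timestamp."""
    return date_us // MICROSECONDS_PER_SECOND


date_us = 1319149012372425           # example date from comment 0
date_s = crop_to_seconds(date_us)    # whole seconds since the epoch
digits_saved = len(str(date_us)) - len(str(date_s))
```

Under that assumption, each absolute timestamp loses six decimal digits in cleartext.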
The default behavior of Sync is that any single record is a full representation of that record. If we wanted to switch to a model where core "places" are being synced in a different location/record/collection from visits to those places, that would be theoretically doable.
(In reply to Marco Bonardo [:mak] from comment #2)
> So, I couldn't figure out if your proposal is to have visits grouped by
> type, or completely dismiss the type, cause type is actually useful (*cough*
> frecency).

Group by type. (See {1: [...]} in Comment 0.) This saves (asymptotically) 8 chars per visit.
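One way to measure the cleartext difference for a given record is sketched below. It assumes compact JSON serialization and uses synthetic dates (one visit per minute), so the absolute numbers are illustrative only; real records, and whatever a client does before encrypting them, will differ. `compact_size` is a hypothetical helper.

```python
import json


def compact_size(value):
    """Length in characters of the value as JSON with no extra whitespace."""
    return len(json.dumps(value, separators=(",", ":")))


base = 1319149012372425
# Synthetic data: 4,000 visits of type 1, spaced one minute apart.
dates = [base + i * 60_000_000 for i in range(4000)]

as_objects = [{"type": 1, "date": d} for d in dates]           # current form
grouped_absolute = {1: dates}                                  # grouped by type
grouped_offsets = {1: [base] + [d - base for d in dates[1:]]}  # plus offsets

size_objects = compact_size(as_objects)
size_grouped = compact_size(grouped_absolute)
size_offsets = compact_size(grouped_offsets)
```

Comparing `size_objects`, `size_grouped`, and `size_offsets` separates the saving that comes from grouping by type from the saving that comes from storing offsets.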
Blocks: 726049
Whiteboard: [sync:history]
Component: General → Firefox Sync: Cross-client
Bug 1302797 discusses adding the creating/visiting/etc. device to synced data. Bug 1288858 discusses adding container metadata to visits.
Blocks: 1288858, 1302797
Depends on: 623667
Filed Bug 1384685 to cover adding icon references (and presumably other kinds of images) to Sync.
Blocks: 1384685, 623667
No longer depends on: 623667
Summary: Improved representation for history visits → Improved representation for history entries and visits
Component: Firefox Sync: Cross-client → Sync
Product: Cloud Services → Firefox
Blocks: 1516329
Type: defect → enhancement
Priority: -- → P3
Severity: normal → S3

https://bugzilla.mozilla.org/show_bug.cgi?id=717136#c0

bugzilla@twinql.com, I don't agree with what you've stated:

We have one record per URI, and each record contains an array of visit objects. Each looks like this:

{
    id:"5qRsgXWRJZXr",
    title:"Foobar Baz",
    histUri:"http://foo.example.com/bar/baz",
    visits:[{type:1, date:1319149012372425}]
}

There are two dumb things about this:

  1. Adding a visit requires altering the whole record. This is a side-effect of random access, but it means that Twitter with its 4,000 visits -- and frequent re-visits! -- gets re-uploaded all the time. 4,000 visits are 120KB in cleartext. This bug does not address that.
  2. The representation of visits is inefficient -- the type is usually redundant, the object representation verbose, and the date unnecessarily precise.

Imagine instead:

{visits: {1: [1319149012372425, 12314, 878782, 1232113]}}

Store the type (a fixed set of types!) as a key, the first timestamp, and then timestamp offsets in a sorted sequence.

Because, for me:

  1. The more precise the date recorded, the better, because I use my history for auditing purposes. The sole compromise possible here in my eyes is the ability for the user to choose how precise they want their entry records to be.
  2. The key names for the JSON objects provide significantly better backward compatibility when the format updates, because the order doesn't matter (consider the difference in the frequency of breakage when consuming values from an array versus a hashtable API). Additionally, an introspecting user can interpret these values far more easily. Although a dedicated export format could compensate for that, it doesn't fix the issue in terms of privacy, because the user won't understand what these values represent.

Regardless, if you're to store the values numerically, wouldn't it be more storage-efficient to base-36 encode them? Consider the difference between

{visits: {1: [1319149012372425, 12314, 878782, 1232113]}}

and

{visits: {1: [CZLKOUJQO9, 9I2, IU2M, QEPD]}}

I don't know what the effect on the CPU would be when quickly scrolling down the history, though. However, this brings me to the crux of my disagreement: your proposed modifications would save a few MiB in total for most people. Are any of these tradeoffs really worth that? I'm aware that not everyone has the latest 2 x 4 TiB NVMe SSD storage solution, but when would this ever be an issue?
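To put rough numbers on the base-36 comparison, here is a small sketch. `to_base36` is a hypothetical helper. One assumption affects the result: in strict JSON, base-36 values that contain letters have to be quoted strings, which adds two characters per value.

```python
import json

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base36(n):
    """Encode a non-negative integer as an uppercase base-36 string."""
    if n < 0:
        raise ValueError("negative values are not supported")
    if n == 0:
        return "0"
    out = []
    while n:
        n, rem = divmod(n, 36)
        out.append(DIGITS[rem])
    return "".join(reversed(out))


values = [1319149012372425, 12314, 878782, 1232113]
encoded = [to_base36(v) for v in values]

# Compact JSON for both forms; the base-36 values are serialized as strings.
decimal_json = json.dumps({"visits": {"1": values}}, separators=(",", ":"))
base36_json = json.dumps({"visits": {"1": encoded}}, separators=(",", ":"))
```

On this four-value example the quoted base-36 form comes out slightly shorter than the decimal form. The saving shrinks for small offsets because of the quotes: `12314` and `"9I2"` are both five characters.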

Note that the comment you are replying to refers to how Sync copies visits between devices, but your examples talk about how history is used once it is on the device. Sync has special requirements that differ from those on the device. On the device we have a vastly more efficient storage mechanism backed by SQLite, with extensive use of complicated indexes etc.
