Redo collection id database structure

Status: VERIFIED FIXED
Product: Cloud Services
Component: Server: Sync
Reported: 7 years ago
Modified: 6 years ago

People

(Reporter: telliott, Assigned: rfkelly)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [sync 2.0][qa+])

Attachments

(1 attachment, 2 obsolete attachments)

(Reporter)

Description

7 years ago
So we have pretty serious problems with the Python implementation of collection handling (see Bug 713801 and Bug 688623). This means we're going to have to write a bunch of user-fixing scripts. Since that work is happening anyway, I think we should change the fundamental structure of the collections table.

The original impetus for the current design was that it should be easy to treat a user as a self-contained unit. If we needed to move a user from one node to another, we could just select from the wbo and collection tables and push the data into the other server, with no recalculation necessary.

However, it turns out that we never migrate users individually; we just blow away whole nodes. That makes this design a poor fit. Right now, we have a table that looks like:

userid    collectionid  name
--------  ------------  ------------------
telliott  100           fake_collection
telliott  101           another_collection
rkelly    100           my_fake
rkelly    101           fake_collection

Instead, we should have a single shared table that is just:

collectionid  name
------------  ------------------
100           fake_collection
101           another_collection
102           my_fake

That table will be a lot smaller in the long run, to the point where we can probably cache these collections in memory, or at least look them up efficiently. 

We'll call the new table something different, and can branch the collections code based on which table currently exists. We can precalculate the new table (and just look for new additions during the migration), take the nodes down, update the wbo tables with the recalculated numbers, drop the old table, and restart.
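
As a purely illustrative sketch of the proposal, the two layouts might look like this. The column names and types are assumptions for the example, not the actual server-syncstorage schema, and sqlite3 stands in for MySQL so the snippet runs standalone:

    import sqlite3

    db = sqlite3.connect(":memory:")

    # Old layout: every (user, collection) pair gets its own row, so the
    # same collection name is stored once per user.
    db.execute("""
        CREATE TABLE collections (
            userid       INTEGER NOT NULL,
            collectionid INTEGER NOT NULL,
            name         VARCHAR(32) NOT NULL,
            PRIMARY KEY (userid, collectionid)
        )
    """)

    # Proposed layout: one shared row per collection name, independent of
    # user.  In MySQL the id column would be AUTO_INCREMENT (optionally
    # starting at 100); sqlite's INTEGER PRIMARY KEY behaves similarly here.
    db.execute("""
        CREATE TABLE collection_names (
            collectionid INTEGER PRIMARY KEY,
            name         VARCHAR(32) NOT NULL UNIQUE
        )
    """)

    for name in ("fake_collection", "another_collection", "my_fake"):
        db.execute("INSERT INTO collection_names (name) VALUES (?)", (name,))

    print(db.execute("SELECT collectionid, name FROM collection_names").fetchall())
    # [(1, 'fake_collection'), (2, 'another_collection'), (3, 'my_fake')]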
Comment 1

7 years ago
Once you're using a shared table, you can number from 1 instead of 100; it makes no difference to MySQL, and nothing anywhere should be hard-coding collection IDs for any reason, right?
(Reporter)

Comment 2

7 years ago
Given that there are likely to be intermediate stages where we're not fully caching this data, I'd stick with 100+ and keep constants for now. It's also probably a good idea for our major collections to have the same numbers across all nodes.
(Assignee)

Comment 3

7 years ago
We need to be careful about the semantics of custom collections here.  For example, Bug 687299 talks about the difference between a "non-existent collection" and an "empty collection".

Removing the table that maps collection names to users seems like it would remove this distinction.  To tell whether a particular custom collection exists for a particular user, you would have to check whether they've got any items stored in it.

Is this detail important enough to try to preserve it under the new system, e.g. by maintaining it in a separate table?  Or is it something we can just finesse away as unspecified?
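
To make the semantic point concrete, here is a small sketch of why the distinction collapses; this is an illustration under assumed table and column names, not code from the server. With no per-user collections table, the only remaining signal is whether the user has any items stored:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE bso (
            userid       INTEGER NOT NULL,
            collectionid INTEGER NOT NULL,
            id           VARCHAR(64) NOT NULL,
            payload      TEXT,
            PRIMARY KEY (userid, collectionid, id)
        )
    """)

    def collection_exists(userid, collectionid):
        # The only way left to answer "does this collection exist for this
        # user?" is to check whether they have at least one item stored in it.
        row = db.execute(
            "SELECT 1 FROM bso WHERE userid = ? AND collectionid = ? LIMIT 1",
            (userid, collectionid)).fetchone()
        return row is not None

    db.execute("INSERT INTO bso VALUES (1, 100, 'abc', '{}')")
    print(collection_exists(1, 100))  # True
    print(collection_exists(1, 101))  # False: empty and non-existent look identical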
(Assignee)

Comment 4

7 years ago
Created attachment 588616 [details] [diff] [review]
patch redoing collections database

Attached is a first attempt at redoing the collections database.

All tests pass except for test_collections(), which breaks due to the semantic change I highlighted in my previous comment.

I don't particularly like how it handles the choice between using old-style or new-style layout, and think it might work better split into two separate subclasses.  Nevertheless, it's a start.  Thoughts?
Assignee: nobody → rkelly
Attachment #588616 - Flags: feedback?(telliott)
(Reporter)

Updated

7 years ago
Whiteboard: [sync 2.0]
Whiteboard: [sync 2.0] → [sync 2.0][qa+]
(Reporter)

Comment 5

7 years ago
Comment on attachment 588616 [details] [diff] [review]
patch redoing collections database

It's a good start, but some fixes needed.

I think you're making life harder for yourself by trying to maintain much of the old legacy flow. You may be happier with separate libraries, since the flow changes a fair bit here. For example, your USER_COLLECTION_NAMES query is inefficient: if there were, say, 1000 items in the table, you'd be making that query an awful lot. In the new flow, I suspect you're going to want to start by getting the user's records. You then figure out which collection ids you don't recognize, and ask for those specifically (which you then cache in your mapping table along with the constants). If you want to prepopulate the cache on application startup, that would also be reasonable.


COLLECTION_EXISTS: I don't think we need this query at all any more. I suspect the flow is now (sketched in code after the steps below):

   get collection id
   if doesn't exist, insert into collection_names, ignore a duplicate key error
      get collection id again
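
A minimal sketch of that flow, under assumed names, with sqlite's INSERT OR IGNORE standing in for swallowing MySQL's duplicate-key error:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE collection_names (
            collectionid INTEGER PRIMARY KEY,
            name         VARCHAR(32) NOT NULL UNIQUE
        )
    """)

    def get_collection_id(name):
        """Return the id for a collection name, creating the row if needed."""
        row = db.execute(
            "SELECT collectionid FROM collection_names WHERE name = ?",
            (name,)).fetchone()
        if row is None:
            # A concurrent request may insert the same name; the unique key
            # on `name` makes this a no-op in that case, and the re-select
            # picks up whichever row won.
            db.execute(
                "INSERT OR IGNORE INTO collection_names (name) VALUES (?)",
                (name,))
            row = db.execute(
                "SELECT collectionid FROM collection_names WHERE name = ?",
                (name,)).fetchone()
        return row[0]

    print(get_collection_id("bookmarks"))  # 1
    print(get_collection_id("bookmarks"))  # still 1, no duplicate row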

No need for COLLECTION_NEXT_ID or CUSTOM_COLLECTION_ID_OFFSET, since that can now be an autoincrement on the column. You'll need to add autoincrement and start value to the table definition.

LEGACY_COLLECTIONS_TABLE_IN_USE can just be "select 1 from collections". No need to actually read any rows.

How is set_collection currently working? The table structure shouldn't be allowing a userid, but I don't see any update to that function in this diff. This should be a MUCH simpler process in the new structure, and if SQLAlchemy is just silently discarding additional provided columns... well, it's not going to improve my impression of SQLAlchemy.

collection_names.name needs to be a unique key, or all sorts of badness happens. There should be a test for that, too.
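
The kind of test being asked for might look like the following sketch (illustrative only; the MySQL-specific AUTO_INCREMENT details are noted in comments as assumptions):

    import sqlite3

    db = sqlite3.connect(":memory:")
    # In MySQL this would be roughly:
    #   collectionid INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
    #   name VARCHAR(32) NOT NULL, UNIQUE KEY (name)
    # with AUTO_INCREMENT=100 in the table options if ids should start at 100.
    db.execute("""
        CREATE TABLE collection_names (
            collectionid INTEGER PRIMARY KEY,
            name         VARCHAR(32) NOT NULL UNIQUE
        )
    """)

    db.execute("INSERT INTO collection_names (name) VALUES ('history')")
    try:
        db.execute("INSERT INTO collection_names (name) VALUES ('history')")
    except sqlite3.IntegrityError:
        pass  # expected: the unique key rejects the duplicate name
    else:
        raise AssertionError("duplicate collection name was accepted")

    count = db.execute(
        "SELECT COUNT(*) FROM collection_names WHERE name = 'history'"
    ).fetchone()[0]
    assert count == 1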
Attachment #588616 - Flags: feedback?(telliott) → feedback-
(Assignee)

Comment 6

7 years ago
> I think you're making life harder for yourself by trying to maintain
> much of the old legacy flow.

I agree!  Considering the clean break between 1.1 and 2.0, I will now redo this so that it doesn't mix the two different cases.

> No need for COLLECTION_NEXT_ID or CUSTOM_COLLECTION_ID_OFFSET, since that can now be an
> autoincrement on the column.

Just confirming, then: breaking backwards compatibility also means that the concerns raised about this in Bug 579096 don't apply to the new version?

> how is set_collection currently working? The table structure shouldn't be allowing a userid,
> but I don't see any update to that function in this diff.

The diff from line 403 onwards is inside this function, but you're right, it's needlessly complicated because of branching for the old flow.

> if SQLAlchemy is just silently discarding additional provided columns... well, it's not
> going to improve my impression of SQLAlchemy.

Don't worry, I'm pretty sure this is not happening :-)
(Reporter)

Comment 7

7 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #6)

> > No need for COLLECTION_NEXT_ID or CUSTOM_COLLECTION_ID_OFFSET, since that can now be an
> > autoincrement on the column.
> 
> Just confirming, then: breaking backwards compatibility also means that
> the concerns raised about this in Bug 579096 don't apply to the new version?

No. Bug 579096 concerns auto-increment *per user*, which is a relatively new DB feature. This table just uses good old standard autoincrement.
(Reporter)

Comment 8

7 years ago
Atoll points out that the correct syntax for LEGACY_COLLECTIONS_TABLE_IN_USE is "select 1 from collection_names limit 1".

But I think just pulling them apart is better :P
(Reporter)

Comment 9

7 years ago
Hmm, actually, that is or isn't a bug depending on whether we're testing for the presence of the new table or the absence of the old one :P
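
A quick sketch of that distinction (assumed code, not the shipped check): a "select 1 ... limit 1" tells you whether a table has rows, while the query failing outright tells you whether the table exists at all:

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE collection_names "
               "(collectionid INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")

    def table_state(table):
        try:
            row = db.execute("SELECT 1 FROM %s LIMIT 1" % table).fetchone()
        except sqlite3.OperationalError:
            return "missing"                 # the table itself does not exist
        return "empty" if row is None else "has rows"

    print(table_state("collections"))        # missing: old table absent
    print(table_state("collection_names"))   # empty: new table present, no rows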
(Assignee)

Comment 10

6 years ago
Created attachment 601089 [details] [diff] [review]
patch to change the way collections are handled

I have redone the patch for sync2.0, where the new-style table structure is the only supported option.  I've also changed the SyncStorage plugin API to match:

  * set_collection/get_collection/collection_exists no longer take a userid argument
  * delete_collection has been removed (it would now imply deleting all items in the collection for all users, which we don't use and would be ugly to implement with sharded bso tables)

The key methods are now _get_collection_id and _get_collection_name, which implement the mapping in either direction.  They use a local cache of previously-seen collections.

There is also a _load_collection_names method to bulk fetch the collection names for a list of ids, used when generating the list of collection names for a single user.
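
A rough sketch of the approach described here; this is not the attached patch, and the class, method, and table names are assumptions based on the comment:

    import sqlite3

    class CollectionNameCache(object):
        """In-memory cache of the shared collectionid <-> name mapping."""

        def __init__(self, db):
            self.db = db
            self._by_name = {}
            self._by_id = {}

        def _remember(self, collectionid, name):
            self._by_name[name] = collectionid
            self._by_id[collectionid] = name

        def get_collection_id(self, name):
            """Map a name to its id, creating the row on demand."""
            if name in self._by_name:
                return self._by_name[name]
            self.db.execute(
                "INSERT OR IGNORE INTO collection_names (name) VALUES (?)",
                (name,))
            collectionid = self.db.execute(
                "SELECT collectionid FROM collection_names WHERE name = ?",
                (name,)).fetchone()[0]
            self._remember(collectionid, name)
            return collectionid

        def get_collection_name(self, collectionid):
            """Map an id back to its name, or None if it is unknown."""
            if collectionid not in self._by_id:
                self.load_collection_names([collectionid])
            return self._by_id.get(collectionid)

        def load_collection_names(self, collectionids):
            """Bulk-fetch the names for a list of ids into the cache."""
            missing = [cid for cid in collectionids if cid not in self._by_id]
            if not missing:
                return
            marks = ",".join("?" * len(missing))
            query = ("SELECT collectionid, name FROM collection_names "
                     "WHERE collectionid IN (%s)" % marks)
            for collectionid, name in self.db.execute(query, missing):
                self._remember(collectionid, name)

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE collection_names "
               "(collectionid INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
    cache = CollectionNameCache(db)
    cid = cache.get_collection_id("bookmarks")
    print(cid, cache.get_collection_name(cid))  # 1 bookmarks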
Attachment #601089 - Flags: review?(telliott)
(Assignee)

Updated

6 years ago
Blocks: 713801
(Assignee)

Comment 11

6 years ago
Created attachment 602225 [details] [diff] [review]
updated patch to simplify custom collections management

Patch updated based on conversation with telliott. This removes set_collection/get_collection/collection_exists, which were only used for testing purposes, and localizes all the logic for creating collections on demand into the _get_collection_id method.
Attachment #588616 - Attachment is obsolete: true
Attachment #601089 - Attachment is obsolete: true
Attachment #601089 - Flags: review?(telliott)
Attachment #602225 - Flags: review?(telliott)
(Reporter)

Comment 12

6 years ago
Comment on attachment 602225 [details] [diff] [review]
updated patch to simplify custom collections management

Much easier to read. Thumbs up.
Attachment #602225 - Flags: review?(telliott) → review+
(Assignee)

Comment 13

6 years ago
https://github.com/mozilla-services/server-syncstorage/commit/4858e02b91fb49ae4713251e38eeeea17dc57d13
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Now built into aitc.
Edge/corner cases that need testing will be covered separately.
Status: RESOLVED → VERIFIED