Closed Bug 1319485 Opened 6 years ago Closed 1 year ago

Add domains table, with domain_id FK in history/bookmarks (and combined)

Categories

(Firefox for Android Graveyard :: Activity Stream, defect, P5)

defect

Tracking

(firefox53 affected)

RESOLVED INCOMPLETE
Tracking Status
firefox53 --- affected

People

(Reporter: ahunt, Unassigned)

References

Details

(Whiteboard: [MobileAS])

Attachments

(4 files, 2 obsolete files)

We want to be able to group by domain for various parts of ActivityStream.

This requires some form of domain column in history+bookmarks (which are both queried for topsites and highlights). I'm following the iOS approach of having a domains table containing the actual domains, with a foreign key pointing there in history/bookmarks.

(Split out from Bug 1298785)
Assignee: nobody → ahunt
Blocks: 1298785
Iteration: --- → 1.9
Priority: -- → P1
Whiteboard: [MobileAS]
TODO:
- Add a test using a v3 DB (can we even run a version of Fennec that old...?)
- Add a test using a v4 DB (ditto)

(Both of these are to cover the domain_id trickiness upgradeDatabaseFrom30to31, aka the reading_list->bookmarks migration.)
Iteration: 1.9 → 1.10
Comment on attachment 8813339 [details]
Bug 1319485 - Add domain hashes to bookmarks and history

https://reviewboard.mozilla.org/r/94762/#review96872

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserContract.java:305
(Diff revision 2)
>          public static final Uri CONTENT_OLD_URI = Uri.withAppendedPath(AUTHORITY_URI, "history/old");
>          public static final String CONTENT_TYPE = "vnd.android.cursor.dir/browser-history";
>          public static final String CONTENT_ITEM_TYPE = "vnd.android.cursor.item/browser-history";
>      }
>  
> +    public static final class Domains implements CommonColumns {

It's a bit weird that Domains class does not implement DomainColumns... But I can see why.

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:172
(Diff revision 2)
> +        // Domains._ID must be "INTEGER PRIMARY KEY" in order for it to be an alias of
> +        // rowid. This allows for a more efficient implementation of getDomainID() where
> +        // a call to SQLiteDatabase.insert() will return rowid == Domains._ID. See
> +        // http://sqlite.org/lang_createtable.html#rowid for the relevant documentation.
> +        db.execSQL("CREATE TABLE " + Domains.TABLE_NAME + "(" +
> +                Domains._ID + " INTEGER PRIMARY KEY," +

Not autoincrementing?

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:198
(Diff revision 2)
> +        // SharedBrowserDatabaseProvider.cleanUpSomeDeletedRecords().
> +        // We could clear the domain_id during that initial deletion stage, which would however
> +        // require addition triggers for AFTER UPDATE (and would result in a more complicated
> +        // CONSTRAINT on history, whereas we can currently use NOT NULL for its domain_id).
> +        // (We already need to handle AFTER DELETE because of the direct sync deletion scenario.)
> +        // Instead we just leave the domain_id when deleting history, which means that the domains

If I'm reading this correctly, after history is "deleted" (columns are nulled out), we still have the domain_id FK column intact, and we still have a row in the domains table.

If that's the case, this doesn't seem right. If a user deletes a history item for domain FOO.com, I shouldn't be able to figure out they've visited FOO.com at any point by inspecting the domains table, or even simply grepping against their browser.db file.

Perhaps handling of history/bookmark deletion in BrowserProvider is a better place to handle this, since you can target delete operations specifically (as opposed to having to deal with all kinds of updates in a trigger).
Iteration: 1.10 → 1.11
We actually do a little fun stuff like this on iOS:

  CONSTRAINT parentidOrDeleted CHECK (parentid IS NOT NULL OR is_deleted = 1)
(In reply to :Grisha Kruglov from comment #8)
> If I'm reading this correctly, after history is "deleted" (columns are
> nulled out), we still have the domain_id FK column intact, and we still have
> a row in the domains table.
> 
> If that's the case, this doesn't seem right. If a user deletes a history
> item for domain FOO.com, I shouldn't be able to figure out they've visited
> FOO.com at any point by inspecting the domains table, or even simply
> grepping against their browser.db file.
> 
> Perhaps handling of history/bookmark deletion in BrowserProvider is a better
> place to handle this, since you can target delete operations specifically
> (as opposed to having to deal with all kinds of updates in a trigger).

I'm still thinking this over:

*If* we keep the triggers, I would extend them to cover _all_ deletions/updates to address that case (i.e. if history is deleted, don't keep domain_id's around).

Doing explicit deletion in BrowserProvider does seem simpler, but I'm worried that's more likely to break if we ever add new code that fiddles the bookmarks/history tables and don't remember to do the respective domain_id modifications (well written tests would hopefully cover that case, but I'm not sure it's possible to ensure that they cover every single case where bookmarks or history could be modified).
Blocks: 1298786
Comment on attachment 8819091 [details]
Bug 1319485 - Pre: add domain extraction and tests to BrowserProvider

https://reviewboard.mozilla.org/r/98928/#review99474
Attachment #8819091 - Flags: review?(gkruglov) → review+
Comment on attachment 8813339 [details]
Bug 1319485 - Add domain hashes to bookmarks and history

https://reviewboard.mozilla.org/r/94762/#review99476

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:65
(Diff revision 3)
>  public final class BrowserDatabaseHelper extends SQLiteOpenHelper {
>      private static final String LOGTAG = "GeckoBrowserDBHelper";
>  
>      // Replace the Bug number below with your Bug that is conducting a DB upgrade, as to force a merge conflict with any
>      // other patches that require a DB upgrade.
> -    public static final int DATABASE_VERSION = 36; // Bug 1301717
> +    public static final int DATABASE_VERSION = 37; // Bug 1298785

Missing the test db file, as you must be aware :)

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:1417
(Diff revision 3)
> +            return;
> +        }
> +        didMigrateBookmarks = true;
> +
> +        if (!didCreateDomainsTable) {
> +            createDomainsTable(db);

From sqlite docs:
" A CREATE TABLE command operates the same whether or not foreign key constraints are enabled. The parent key definitions of foreign key constraints are not checked when a table is created. There is nothing stopping the user from creating a foreign key definition that refers to a parent table that does not exist, or to parent key columns that do not exist or are not collectively bound by a PRIMARY KEY or UNIQUE constraint."

So it seems that you should be able to simplify some things.

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:1453
(Diff revision 3)
>       * Migrate a history table from some old version to the newest one by creating the new table and
>       * copying all the data over.
>       */
>      private void migrateHistoryTable(SQLiteDatabase db) {
> +        // We only need to run this migration once during a migration session:
> +        // this migration is needed for e.g. 3->4, and 36->37, but in each case the input data is still

6->7 is the migration which runs this.

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:1485
(Diff revision 3)
> -        db.execSQL("INSERT INTO " + TABLE_HISTORY + " SELECT * FROM " + TABLE_HISTORY_TMP);
> +        // This migration may or may not be adding auto-generated columns (e.g. VISITS, LOCAL_VISITS etc).
> +        // At this stage we don't actually know what columns the input (TABLE_HISTORY_TMP) may contain,
> +        // so it's best to let sqlite copy as much as possible while autogenerating any needed values.
> +        // The only column that is definitely being added at this stage, which isn't auto-generated,
> +        // is the (foreign-key) DOMAIN_ID. To handle this we use a fake (temporary) domain ID for
> +        // the initial copying stage, followed by iterating over the new tablet to update DOMAIN_ID

s/tablet/table

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:1491
(Diff revision 3)
> +        // to be correct. (The alternative would be to copy rows one by one, and populating the
> +        // DOMAIN_ID at that stage, however that would require copying each single column, and detecting
> +        // whether or not that column exists before the migration in order to only copy existing columns.
> +        // Moreover that would result in all data being transferred into Java and back to sqlite, whereas
> +        // this approach minimises the amount of data passing from sqlite->Java->sqlite. Whether or not
> +        // that has a significant performance impact is unknown.)

As things stand, I think you're breaking migration path for anyone who'll need to run 6->7 migration. Wouldn't the SELECT below fail to work for them due to a bunch of missing columns?

Furthermore, I think you should be able to simplify this whole thing to just an update. At this point in time, foreign key constraints have not been enabled yet. We enable constraints in `onOpen` (https://dxr.mozilla.org/mozilla-central/source/mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java#2148), which is called after database has been migrated. So migrations can't rely on constraints, but also don't have to work around them.

You should be able to drop all of the "create fake IDs first" logic, and reduce this to "iterate over every history row, generating/looking up domain ID and updating the row".

Do the actual updates using a compiled update statement for a performance boost. See #synthesizeAndInsertVisits for an example of this.

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:1536
(Diff revision 3)
> +                final String url = cursor.getString(historyColumn);
> +
> +                final ContentValues values = new ContentValues();
> +
> +                // We would usually have to handle null URLs, however in this case we've only queried
> +                // non-deleted items, which must have a non-null URL.

I wouldn't base success of this migration on this assumption (history could possibly have some weird stuff!). Ensure you won't fail on encountering a null host.

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java:2087
(Diff revision 3)
> +                // - domain_id will be added during 36->37 (which processes all bookmarks and
> +                //   adds/generates the appropriate domain entries and ids).
> +                // But if instead we are doing 3->37:
> +                // - domain_id is added in the 3->4 bookmarks migration step (that step creates a new bookmarks table
> +                //   using the newest schema, and copies data to it)
> +                // Due to the constraints on domain_id we therefore need to know whether or not domain_id

At this point constraints shouldn't matter.

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserProvider.java:1853
(Diff revision 3)
>  
> +    private static String extractDomainFromURL(@NonNull String url) {
> +        final String host = URI.create(url).getHost();
> +
> +        // Remove leading www. if present
> +        final String prefix = "www.";

There are possibly other potential prefixes you might want to care about, such as `ww2` - much less common in the wild though.

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserProvider.java:1866
(Diff revision 3)
> +    /**
> +     * Obtain the domain_id for a given URL.
> +     * @param db The target database. Must be writable.
> +     * @return The domain id. A new id will be generated if necessary.
> +     */
> +    public static long getDomainID(SQLiteDatabase db, @NonNull String url) {

`getOrCreateDomainID` seems like a better name. It makes the "create if missing" side effect explicit.

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserProvider.java:1874
(Diff revision 3)
> +        // Handle special URLs (including about:*) by using the entire url as the "domain".
> +        // These are an edge case, and this should minimise difficulties handling such URLs
> +        // while not impacting the UX in any significant way. (I.e. we use domain_id for
> +        // grouping purposes in situations where about:* or other oddities are unlikely
> +        // to appear.)
> +        if (domain == null) {

Other than `about:*` urls, when else can a domain be legitemately null?

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserProvider.java:1886
(Diff revision 3)
> +                new String[] {
> +                        domain
> +                }
> +        );
> +        try {
> +            if (existingIDCursor != null && existingIDCursor.getCount() > 0) {

You can drop .getCount check, `if (existingIDCursor != null && existingIDCursor.moveToFirst()) { ... }`
extractDomainFromURL should be much closer to this:

https://github.com/mozilla-mobile/firefox-ios/blob/f65be4372e20d284aa1fc340159597db97194f8f/Utils/Extensions/NSURLExtensions.swift#L185

I'd be surprised if we didn't already have code in the tree that did TLD extraction. Do not reinvent this wheel.
s/TLD/baseDomain
Status: NEW → ASSIGNED
On a sidenote: I'm proposing to remove the pre-v8 migration code in Bug 1324914 (v8 is the first DB schema used in a release version of Fennec).

(I'll still make sure this migration doesn't break pre-v8 DBs, fixing that should also make my patch clearer/more-robust/more-obvous - but scrapping the unused migration code should also make BrowserDBHelper a bit easier to read and reason about.)
Comment on attachment 8813339 [details]
Bug 1319485 - Add domain hashes to bookmarks and history

https://reviewboard.mozilla.org/r/94762/#review99476

> As things stand, I think you're breaking migration path for anyone who'll need to run 6->7 migration. Wouldn't the SELECT below fail to work for them due to a bunch of missing columns?
> 
> Furthermore, I think you should be able to simplify this whole thing to just an update. At this point in time, foreign key constraints have not been enabled yet. We enable constraints in `onOpen` (https://dxr.mozilla.org/mozilla-central/source/mobile/android/base/java/org/mozilla/gecko/db/BrowserDatabaseHelper.java#2148), which is called after database has been migrated. So migrations can't rely on constraints, but also don't have to work around them.
> 
> You should be able to drop all of the "create fake IDs first" logic, and reduce this to "iterate over every history row, generating/looking up domain ID and updating the row".
> 
> Do the actual updates using a compiled update statement for a performance boost. See #synthesizeAndInsertVisits for an example of this.

I've discovered that there's no way to disable the not null constraint (even if foreign keys are disbaled), so it looks like I'll still need a fake domain ID regardless. (I.e. the "domain_id IS NOT NULL OR is_deleted = 1", plus "OR type != TYPE_BOOKMARK" constraints.)

Since we don't care about the foreign key, I could use a completely fake ID (e.g. a negative number), but I still feel slightly more comfortable using a "real" ID (that might be misguided, seeing as we're manually purging that at the end anyways, so all bets are off if we mess up...).

(FWIW the Exceptions that are thrown when constraints aren't upheld are somewhat useless, i.e. "android.database.sqlite.SQLiteConstraintException: Cannot execute for changed row count", without any indiciation of what actually went wrong.)
(In reply to Richard Newman [:rnewman] from comment #17)
> extractDomainFromURL should be much closer to this:
> 
> https://github.com/mozilla-mobile/firefox-ios/blob/
> f65be4372e20d284aa1fc340159597db97194f8f/Utils/Extensions/NSURLExtensions.
> swift#L185
> 
> I'd be surprised if we didn't already have code in the tree that did TLD
> extraction. Do not reinvent this wheel.

We kindof do: StringUtils.stripCommonSubdomains() removes "www.", "m." and "mobile.", however that's still much more basic than the iOS code. (I haven't been able to find anything more advanced that's usable.)

My main motivation for having a DB specific version of the domain extraction is to ensure that the domains-mapping is stable (but I forgot to make it even equivalent to the existing code before submitting for review). AFAICT all our other TLD extraction is only for UI usage, and is therefore more likely to be tweaked more regularly (which would require us to either migrate the DB to cleanup domains, or tolerate inconsistencies in the domain data).

(Tab does know about the baseDomain, but that's supplied by Gecko. AFAICT Gecko uses getBaseDomain(), with an API mirroring iOS's: https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsIEffectiveTLDService#getBaseDomain() )


I'm wondering whether we should port the iOS or Gecko TLD extraction implementations to Java so we can do it properly?
Iteration: 1.11 → 1.12
Iteration: 1.12 → 1.13
(In reply to Richard Newman [:rnewman] from comment #17)
> extractDomainFromURL should be much closer to this:
> 
> https://github.com/mozilla-mobile/firefox-ios/blob/
> f65be4372e20d284aa1fc340159597db97194f8f/Utils/Extensions/NSURLExtensions.
> swift#L185
> 
> I'd be surprised if we didn't already have code in the tree that did TLD
> extraction. Do not reinvent this wheel.

AFAICT iOS is using only the "normalised" host for the domains table, which only does the www/mobile/m stripping ( https://github.com/mozilla-mobile/firefox-ios/blob/master/Utils/Extensions/NSURLExtensions.swift#L239 and https://github.com/mozilla-mobile/firefox-ios/blob/master/Storage/SQL/SQLiteHistory.swift#L219 ).

Do you still think it's worth using the baseDomain for the Android version (I now have an equivalent baseDomain() extraction equivalent implemented)? It seems the normalised domain is usually what's wanted (e.g. mail.google.com and news.google.com should be distinct, but using the baseDomain would result in both being treated as google.com)?
Flags: needinfo?(rnewman)
That's a good question, and one for nchapman, I suspect.

The motivation for the iOS grouping code was to fix some of the stuff linked from Bug 934702 -- that is, having separate top sites tiles for m.facebook.com and www.facebook.com.

That predated Activity Stream by years.


User expectations about grouping might not be what you or I expect:

* Perhaps they don't want Google swamping their top sites?
* Perhaps we want to group by 'owner', in which case baseDomain is also best?
* Perhaps users think of amazon.com/mp3 and amazon.com/Kindle-eBooks as different 'sites', in which case domain grouping is insufficient altogether?
* Or maybe it's enough to remove m/www flavors, as iOS does?
Flags: needinfo?(rnewman)
(In reply to Richard Newman [:rnewman] from comment #23)
> That's a good question, and one for nchapman, I suspect.
> 
> The motivation for the iOS grouping code was to fix some of the stuff linked
> from Bug 934702 -- that is, having separate top sites tiles for
> m.facebook.com and www.facebook.com.
> 
> That predated Activity Stream by years.
> 
> 
> User expectations about grouping might not be what you or I expect:
> 
> * Perhaps they don't want Google swamping their top sites?
> * Perhaps we want to group by 'owner', in which case baseDomain is also best?
> * Perhaps users think of amazon.com/mp3 and amazon.com/Kindle-eBooks as
> different 'sites', in which case domain grouping is insufficient altogether?
> * Or maybe it's enough to remove m/www flavors, as iOS does?

I've spoken to nchapman, and it seems we want to stick with "normalisation" for now.

However they're considering revisiting grouping in future, and possibly having separate grouping for topsites vs highlights. Which makes me wonder if we would have to either have multiple domains tables (fulldomain for foo.mozilla.org, basedomain for mozilla.org, pathed-domain for mozilla.org/videos/ - this gets messy...), or java-side grouping.

Since domains is already storing the full domain, it shouldn't be too hard to add an extra baseDomain column though, which would allow us to group by either as needed in future? I.e.:

_id INTEGER PRIMARY KEY
domain TEXT // "foo.mozilla.org"
baseDomain INTEGER FOREIGN KEY // references domains._id, in this case referencing the "mozilla.org" entry

The code for inserting a new domain gets slightly trickier (but I've already ported the base-domain extraction code into Java so it should be quick to add), everything else stays the same for now, and we can rewrite the queries for baseDomain grouping if needed.

(We could even take this approach a step further: domains can store a pathed-domain, e.g. amazon.com/mp3, along with a domain foreign key (-> amazon.com), and a baseDomain foreign key (also ->amazon.com), but it's probably cleaner to add that as a separate table in future, *if* needed.)
Comment on attachment 8819091 [details]
Bug 1319485 - Pre: add domain extraction and tests to BrowserProvider

https://reviewboard.mozilla.org/r/98928/#review105308

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserProvider.java:1816
(Diff revision 2)
> +        } catch (MalformedURLException e) {
> +            return url;
> +        }

If the URL is malformed we are returning the URL instead of the host? Will this end up in the database as domain?

::: mobile/android/base/java/org/mozilla/gecko/db/BrowserProvider.java:1820
(Diff revision 2)
> +            host = new URL(url).getHost();
> +        } catch (MalformedURLException e) {
> +            return url;
> +        }
> +
> +        return StringUtils.stripCommonSubdomains(host);

Do we really want to strip common subdomains? After all we are storing this in the "domain" column. Isn't that misleading?
I don't think that's what the desktop implementation does - or do they?
Comment on attachment 8826227 [details]
Bug 1319485 - Improve StringUtils.stripCommonSubdomains() and add tests

https://reviewboard.mozilla.org/r/104212/#review105310

Mhm. I'm not 100% sure yet. We can strip either before storing or before displaying. Stripping before displaying has the advantage that we can modify the behavior - if we strip before storing then the information is lost. Although storing a lot of "www" and "m" comes at a cost too.

::: mobile/android/geckoview/src/main/java/org/mozilla/gecko/util/StringUtils.java:143
(Diff revision 1)
> -        if (host.startsWith("www.")) {
> -            start = 4;
> -        } else if (host.startsWith("mobile.")) {
> -            start = 7;
> +        // A host might have multiple prefixes, in undetermined order - sites such as www.mobile.foo.com
> +        // and m.www.foo.com exist. Hence we need to keep iterating over all prefixes (and removing them)
> +        // until no more known prefixes can be found.
> +        // We do this by resetting our iterator every time a prefix is found and removed.

Is this always better though? What about edge cases like www.mobile.org. I don't remember which bug it was but we have been running in problems in different places because we stripped "mobile".
(In reply to Sebastian Kaspari (:sebastian) from comment #31)
> Comment on attachment 8826227 [details]
> 
> Is this always better though? What about edge cases like www.mobile.org. I
> don't remember which bug it was but we have been running in problems in
> different places because we stripped "mobile".

This might be too much complexity, but that specific edge case could maybe be worked around (although there might still be other edge-cases we can't handle well):
1. Strip www/m/mobile (org)
2. Get the baseDomain (mobile.org)
3. if strippedDomain.length() < baseDomain.length(), use baseDomain (mobile.org), otherwise use strippedDomain (org)

(AFAICT desktop does strip www, but doesn't care about mobile/m)

Building off of my more complicated domains-table proposal above, we could store:
1. _id
2. raw-domain (www.foobar.mozilla.org)
3. stripped-domain (foreign key, pointing to foobar.mozilla.org)
4. base-domain (foreign key, pointing to mozilla.org)

That would give us the flexibility to change the grouping in future, although it will cause the DB to grow. Adding these extra columns now makes it easy to remove them again later (once we know what we're actually grouping on), as opposed to changing what we store if we decide to change our requirements in future?
That's potentially tripling the number of rows in the domains table, plus you'll need typically two joins to get from URL to the kind of domain you want, or from a base domain to known URLs. For a very 'broad' browser, you might have more domain rows than history rows! Each new visited URL would add two or three rows to domains… each of which would have a string. And each of those strings would be present in each index.

The point of the domains table on iOS was simply to allow history rows to 'collide' on something we wanted to be considered the same. A top sites query doesn't even really need to query by domain at all; we could just as easily have stored a 64-bit hash of the domain string in each history row.

I encourage you to figure out all of your current use cases, and write up what each needs to know. Then we can try to figure out a data representation that won't balloon in size or fragment the shit out of the DB file.
* Domain stripping *

- Desktop AS seems to dedupe based on www, but only in the addon code (whereas rev_host stores the original domain)
- iOS strips www/m/mobile in the DB

For consistency we should probably follow one or the other - I'm tending towards stripping the m/mobile since those pages tend not to be distinct so we shouldn't group them separately (which nchapman supports doing). It also looks like it would be easy to have desktop follow suit and do the same stripping.

(*If* the stripping schema does need to change, we can do another migration to change the stripping method without any schema changes being needed: we can wipe the domains table, and iterate over history/bookmarks again - that's not very efficient, but also not difficult to implement.)


* Use cases *

* domain based grouping for highlights - and probably topsites too
- www/m/mobile sites are in most cases the same site, so we want to know the stripped domain. Other subdomains are distinct, so we want to keep them distinct (mail.google.com vs google.com vs plus.google.com).
(- grouping on domain_id seems likely to be a bit faster than a hash, but not significantly so (2 byte vs 8 byte comparisons))

* Domain-based blocking
- Again we probably want fine grained control over subdomains, i.e. you might want to block mail.google.com and google.com searches, but keep news.google.com. (Having a mobile/non-mobile distinction might be useful here to always get the right site - this is trickier)
(- domains table requires a subquery to get the relevant blocked domains, which would reduce or possibly negate performance wins from ^^^.)


I'm naively assuming that the domain_id will be within 2 bytes (32k) for most users, and even in extreme cases with 4 bytes we're still using less space than storing a hash of the domain. (Short of writing some benchmarks I don't think there's any way for me to know which will actually be better.) This is based on assuming max 10k history items (no idea what the bookmarks limit is), and extrapolating to users visiting at least a few different pages per domain. I'd be interested in adding telemetry for this, so we can revise estimates if needed.

By storing the full domain instead of a hash (in the domains table), we're probably throwing away any space savings though - but we could store a hash in the domains table instead?

Purely intuitively, that looks like the best solution to me (separate domains table, that stores just the domain-hash). The blocklist subquery should return only a small number of domain-id's, whereas grouping has to be performed for all the matching history items (i.e. we should focus on the grouping performance?); moreover this minimises DB size increases?
I didn't quite think this through, but a simple tree structure will allow for a fairly space-efficient way to model domain information without loosing any information and allowing for flexibility in how we want to structure grouping rules and whatnot.

Consider that the following structure will cover hundreds or thousands of history items:
        com
       /   \
 facebook   google
  /   \     /   \
 www   m  mail  news

If we model this as a simple table, we'll only need 7 rows for the above tree:
domains:
- id
- parentId
- name ("org", or "mozilla", or "www", or "m")

We then need to establish a relationship between history and a domain. An FK column on a history table is probably enough (as opposed to an m2m table), since we have a natural one-to-many relationship from domains to history.

Our history's domainId reference naturally forms groupings of history items:
- first order grouping is by the "full" domain, e.g. all "m.facebook.com/*" history records will have the same domainID.
- second order grouping is via domainID's parent, e.g. all "www.facebook.com/*" and "m.facebook.com/*" history records will have the same parent domainID.
- third order grouping is via parent's parent domainID, e.g. "www.facebook.com" and "mail.google.com" are in the same "com" group.
- etc.

Now we need a way to integrate this information effectively into our queries (highlights, top sites...).

{...insert a neat and fast way to model this in SQL... "group by domainId" for first order, what about second order?...}

If we've succeeded with the above, we may establish a variety of meta rules for grouping and blocking, such as:
- group all second level (from the root) "m", "mobile", and "www" domains by their parent domainId.
-- "www.facebook.com" and "m.facebook.com" will collapse onto "facebook.com"
- "*.*.facebook.com" are blocked
- "www.google.com" blocked, but "news.google.com" allowed

We can also possibly allow user to augment these rule. E.g. "block all *.reddit.com domains".

And as long as we have:
- a way to efficiently query against this structure following our rules
- a way to iterate on the rules

... we can try out multiple approaches to grouping of highlights, top sites, etc., even in an A/B split type of an arrangement (so, based on the same underlying data), measure engagement, see what works better, etc.
(In reply to Andrzej Hunt :ahunt from comment #40)

> * domain based grouping for highlights - and probably topsites too
> - www/m/mobile sites are in most cases the same site, so we want to know the
> stripped domain. Other subdomains are distinct, so we want to keep them
> distinct (mail.google.com vs google.com vs plus.google.com).

You can store as many different hashed values as you like -- after all, they're short.

> (- grouping on domain_id seems likely to be a bit faster than a hash, but
> not significantly so (2 byte vs 8 byte comparisons))

If you index on domain_hash, they're identical: the index simply means that values are kept in sorted order. And otherwise, I'd be very surprised if there was a measurable difference between comparison within a query of a variable-length 1 or 2 byte integer and a 4- or 8-byte integer from a hash.


> I'm naively assuming that the domain_id will be within 2 bytes (32k) for
> most users, and even in extreme cases with 4 bytes we're still using less
> space than storing a hash of the domain.

Nah.

If you have a domain table, you need to store the string of the domain:

domain                 id
foo.com                1
facebook.com           2

site     url               domain_id
1        facebook.com/bar  2
2        facebook.com/baz  2
3        foo.com/bar       1

If you use a domain hash, you don't:

site     url               domain_hash
1        facebook.com/bar  22222222
2        facebook.com/baz  22222222
3        foo.com/bar       13333333


Strings multiplied by indices are where your database space goes. The magic of a hash-based solution is that *either* you're fetching by some other characteristic, in which case you have the URL for each row, or you're querying for a given URL, in which case you can compute the domain hash. You can even use the domain hash for faster lookups for a given URL: Places is doing something a little like this on desktop.


> By storing the full domain instead of a hash (in the domains table), we're
> probably throwing away any space savings though - but we could store a hash
> in the domains table instead?

It's kinda pointless trading a 4- or 8-byte integer per row with really quick lookups for a 4-byte integer in a separate table plus a 2-3 byte ID in that table and the same 2-3 byte ID in the history table. You'd have to manage rows in the domains table, plus do a join on retrieval.

Figure it out. For 10,000 domains across 12,000 history items, assuming one index per table:

- Domain ID:
  IDs: ((255 @ 1 + 9,745 @ 2) + (255 @ 1 + 11,745 @ 2) = 43,490
  Hashes: (10,000 * 8) = 80,000
  Times two (each index) = 246,980
- Domain hash: 2 * 12,000 * 8 = 192,000

So it's more cost-effective to just store the hash in the history table, using the hash of the domain as a 'domain ID'.
Put more briefly: the only point in having the domains table is if you're storing the domain string itself, using it for something -- e.g., to be able to query efficiently for known domains by string, or if you're not able to compute a good hash everywhere you need to.

(There's also value in allocating IDs if you're concerned about hash collisions, but you should do the math before worrying about that.)
Comment on attachment 8826227 [details]
Bug 1319485 - Improve StringUtils.stripCommonSubdomains() and add tests

I'll clear the review flag: From the comments it seems like this is not the final version yet? Just re-flag me if needed.

One use case here is bug 1312016: For the highlights ranking algorithm I need to know:
* How often was a specific domain visited?
* How many domains where visited?
The desktop add-on uses the rev_host column for that (which I think just contains the host and doesn't remove any prefixes). Just storing a hash would be sufficient for that.

Maybe Let's not think too much about "grouping" in this bug. The web is such a mess that it doesn't look like this can be solved by just picking a "good" substring of the host.

For the domain blacklist. Note that this is currently only a static list of a bunch of hosts in the desktop add-on:
https://github.com/mozilla/activity-stream/blob/938e0f9b2d241df519ca2fa51fce409f61543e11/addon/PlacesProvider.js#L35
It could be too early to design a great domain blocking thingy without a use case yet.
Attachment #8826227 - Flags: review?(s.kaspari)
Iteration: 1.13 → 1.14
Attachment #8826227 - Attachment is obsolete: true
Attachment #8819091 - Attachment is obsolete: true
Still a WIP - I need to do a bit more cleanup (and add more tests) before requesting review: the current draft has 3 hashes (stored directly in bookmarks and history): the full/raw domain, the stripped (m/www/mobile) domain, and also the baseDomain.

Since the hashes are cheap, I'd favour landing both the full and stripped domains - that means we can easily switch our grouping in future (which would let us be consistent with either desktop or iOS). I'm less sure about ever wanting the baseDomain, so I might cut that before requesting review.
Blocks: 1335817
No longer depends on: 1335817
Iteration: 1.14 → 1.15
Iteration: 1.15 → 1.16
Iteration: 1.16 → ---
Priority: P1 → P2
Comment on attachment 8827614 [details]
Bug 1319485 - Pre: Add comment explaining missing migrations

https://reviewboard.mozilla.org/r/105232/#review128796
Attachment #8827614 - Flags: review?(gkruglov)
Comment on attachment 8813339 [details]
Bug 1319485 - Add domain hashes to bookmarks and history

https://reviewboard.mozilla.org/r/94762/#review128798
Attachment #8813339 - Flags: review?(gkruglov)
Comment on attachment 8813340 [details]
Bug 1319485 - Test that domain hashes are correctly generated on migration

https://reviewboard.mozilla.org/r/94764/#review128800
Attachment #8813340 - Flags: review?(gkruglov)
Comment on attachment 8813341 [details]
Bug 1319485 - Add domain hash handling tests

https://reviewboard.mozilla.org/r/94766/#review128802
Attachment #8813341 - Flags: review?(gkruglov)
Well, it's been 2 months. Andrzej, I've cleared the review flags. IIUC we don't actually needs this for the current iteration of A-S. Feel free to flag me if that changes!
Rank: 2
Priority: P2 → P3
All open Activity Stream bugs are moving from the whiteboard tag, "[mobileAS]", to the Firefox for Android component, "Activity Stream", so that I can keep better track of these bugs as the new triage owner; I will send out an email shortly with additional details, caveats, etc.
Component: Data Providers → Activity Stream
Unassigning: I don't think this is actively being worked on.
Assignee: andrzej → nobody
Status: ASSIGNED → NEW
Re-triaging per https://bugzilla.mozilla.org/show_bug.cgi?id=1473195

Needinfo :susheel if you think this bug should be re-triaged.
Priority: P3 → P5
We have completed our launch of our new Firefox on Android. The development of the new versions use GitHub for issue tracking. If the bug report still reproduces in a current version of [Firefox on Android nightly](https://play.google.com/store/apps/details?id=org.mozilla.fenix) an issue can be reported at the [Fenix GitHub project](https://github.com/mozilla-mobile/fenix/). If you want to discuss your report please use [Mozilla's chat](https://wiki.mozilla.org/Matrix#Connect_to_Matrix) server https://chat.mozilla.org and join the [#fenix](https://chat.mozilla.org/#/room/#fenix:mozilla.org) channel.
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → INCOMPLETE
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.