1386902 - Change top sites title to "subdomain.domain" (or "domain")

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Description

•

7 years ago

After implementing bug 1382036, we set the Top Sites tiles' titles via: if (pageMetadataProvider != null) { title = pageMetadataProvider.toLowerCase(); } else { title = hostSecondLevelDomain; } However, this does not accurately portray the site: accounts.google.com -> google bugzilla.mozilla.org -> mozilla Whereas iOS, what we're trying to copy, would show: accounts google bugzilla mozilla This problem is amplified by duplicate top sites (bug 1382332). --- From my small sample set, we don't get a PageMetadata provider as often as iOS, hence the worse names. It appears we only grab the provider name from the "og:site_name" tag [1]. On iOS, it's a minified file [2] so I assume they just took the page-metadata-parser from desktop, which uses the following algorithm [3]: function getProvider(url) { return urlparse.parse(url) .hostname .replace(/www[a-zA-Z0-9]*\./, '') .replace('.co.', '.') .split('.') .slice(0, -1) .join(' '); } Maybe we should do og:site_name first, and if not, ^. [1]: https://dxr.mozilla.org/mozilla-central/rev/52285ea5e54c73d3ed824544cef2ee3f195f05e6/mobile/android/modules/WebsiteMetadata.jsm#172 [2]: https://github.com/mozilla-mobile/firefox-ios/blob/00164413843f937dc5f2460fcf0e90764f16e464/Client/Frontend/Metadata%20Parsing/page-metadata-parser.js [3]: https://github.com/mozilla/page-metadata-parser/blob/cee8e669b0f4276ab52be8100f69ef357c4ccf3a/parser.js#L14

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Comment 1

•

7 years ago

bug 1311434 implemented the provider extraction. From the commit summary: ``` There's a pull request [1] and issue [2] on GitHub tracking the work to land this in the page-metadata-parser repository. Multiple potential sources for getting a 'provider name' are considered in the linked issue - however they all have slightly different semantic meanings. Only 'og:site_name' actually has the same meaning as "provider name". Therefore it is pretty safe to land this part. [1] https://github.com/mozilla/page-metadata-parser/pull/81 [2] https://github.com/mozilla/page-metadata-parser/issues/79 ``` From bug 1311434 comment 3: > I wonder how prevalent is use of these open graph tags. Anecdata follows: > out of my top sites and current set of highlights, none had any `og:*` tags, > except for some github links. --- Sebastian, having done a little more research on provider names, do you have an opinion on how we should improve our providers extraction? fwiw, we could wait until bug 1382036 actually lands to get a sense of how well our implementation does on more than just my small sample size.

Flags: needinfo?(s.kaspari)

Comment 2

•

7 years ago

> > I wonder how prevalent is use of these open graph tags. Anecdata follows: > > out of my top sites and current set of highlights, none had any `og:*` tags, > > except for some github links. From a brief survey of my history, all Wordpress sites add og: tags, so that's about 25% of the web right there.

Sebastian Kaspari (:sebastian; :pocmo)

Comment 3

•

7 years ago

My opinion: Whatever algorithm we choose to pick a title from the URL will fail in some cases. URLs just do not follow a fixed schema that relates to the actual content (or what the user thinks the content is). GitHub or Reddit are examples where the interesting part is not in the domain but maybe in the paths. Anyways, whatever we do based on the domain can only be a fallback. We should try to use whatever the site gives us first (like og:site - back when I investigated this was the only reliable one for our purposes; others might be good fallbacks though). > function getProvider(url) { > return urlparse.parse(url) > .hostname > .replace(/www[a-zA-Z0-9]*\./, '') > .replace('.co.', '.') > .split('.') > .slice(0, -1) > .join(' '); > } This looks a bit hacky to be honest. It removes the TLD (A) and common prefixes like www (B). * (A) With working around second level domains like co.uk this works fine for most western URLs I guess. But there are a bunch of other second-level domains that we might want to strip - e.g. https://en.wikipedia.org/wiki/Second-level_domain#Country-code_second-level_domains * I was wondering whether it makes sense to strip the public suffix - but I guess that's too much. * (B) We have similar code that strips other common prefixes like m / mobile / ... - I think we can do better here too? And a question: If iOS uses og:site_name or the getProvider() method - Is the Public Suffix Code from bug 1377287 ever triggered on iOS?

Flags: needinfo?(s.kaspari)

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Updated

•

7 years ago

Comment 4

•

7 years ago

A related bug is bug 1382332: "Both Highlights & Top Sites Have Duplicate Entries".

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Comment 5

•

7 years ago

An example of collision that can't be fixed by bug 1382332: accounts.craigslist.org sfbay.craigslist.org These share a title, "craigslist", and a favicon so they're indistinguishable (note: prior to bug 1382036 landing but I don't see a site_name tag so I don't think it'll make a difference).

bbell

Comment 6

•

7 years ago

Attached image example — Details

The name on top sites should be the domain. If there’s a Subdmain it should be subdomain.domain

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Comment 7

•

7 years ago

It sounds like this replaces bug 1382036 so I'll add it to this sprint.

tracking-fennec: ? → ---

Iteration: --- → 1.27

Summary: TopSites tiles' titles do not accurately indicate site (from PageMetadata provider) → Change top sites title to "subdomain.domain" (or "domain")

Whiteboard: [mobileAS] → [mobileAS] 1.27

bbell

Comment 8

•

7 years ago

(In reply to Michael Comella (:mcomella) from comment #4) > A related bug is bug 1382332: "Both Highlights & Top Sites Have Duplicate > Entries". Ideally Top Sites focuses on top level domains, and highlights is keyed toward pages (non-top-level domain content) that has been viewed in the last few days.

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Updated

•

7 years ago

Assignee: nobody → michael.l.comella

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Comment 9

•

7 years ago

Attached image Screenshot post-patch — Details

Two concerns: 1) I assume we be removing common subdomains like "www." and "m."? 2) Do we have enough room to display "subdomain.domain" on mobile?

Flags: needinfo?(bbell)

bbell

Comment 10

•

7 years ago

(In reply to Michael Comella (:mcomella) from comment #9) > Created attachment 8896029 [details] > Screenshot post-patch > > Two concerns: > > 1) I assume we be removing common subdomains like "www." and "m."? > > 2) Do we have enough room to display "subdomain.domain" on mobile? Yeah for 1, there's not need to so common sub-domains. For 2, its okay to cut off names. The sub-domain is the most important bit of information so most of that will show.

Flags: needinfo?(bbell)

Comment hidden (mozreview-request)

Jenn Chaulk (:jchaulk)

Updated

•

7 years ago

Iteration: 1.27 → 1.28

Chenxia Liu [:liuche]

Comment 13

•

7 years ago

mozreview-review

Comment on attachment 8896031 [details] Bug 1386902: Add URIUtils.getFormattedDomain. https://reviewboard.mozilla.org/r/167298/#review173706 ::: mobile/android/base/java/org/mozilla/gecko/util/URIUtils.java:42 (Diff revision 1) > /** > - * Returns the second level domain (SLD) of a url. It removes any subdomain/TLD. > + * Returns the domain for the given URI, formatted by the other available parameters. > - * e.g. https://m.foo.com/bar/baz?noo=abc#123 => foo > * > - * The return value may still contain a public suffix (e.g. .com) if the suffix does not match any of our > - * known values. If a host cannot be determined from the given url String, the empty String will be returned. > + * A public suffix is a top-level domain. For the input, "https://github.com", you can specify public suffix: > + * - included: "github.com" Nit: Since this is represented as a boolean in the inputs, would be nice if these were true/false values for "included", rather than "included"/"excluded". ::: mobile/android/base/java/org/mozilla/gecko/util/URIUtils.java:76 (Diff revision 1) > - @WorkerThread // PublicSuffix methods can touch the disk. > - public static String getHostSecondLevelDomain(@NonNull final Context context, @NonNull final String uriString) > - throws URISyntaxException { > + @NonNull > + @WorkerThread // calls PublicSuffix methods. > + public static String getFormattedDomain(@NonNull final Context context, @NonNull final URI uri, > + final boolean shouldIncludePublicSuffix, @IntRange(from = 0) final int subdomainCount) { > if (context == null) { throw new NullPointerException("Expected non-null Context argument"); } > - if (uriString == null) { throw new NullPointerException("Expected non-null uri argument"); } > + if (uri == null) { throw new NullPointerException("Expected non-null uri argument"); } Is this redundant with the @NonNull annotation? -- Oh, though from looking at your tests, that's not expected. So I assume this is a runtime test, whereas the annotations are compile-time, makes sense. ::: mobile/android/base/java/org/mozilla/gecko/util/URIUtils.java:77 (Diff revision 1) > - public static String getHostSecondLevelDomain(@NonNull final Context context, @NonNull final String uriString) > - throws URISyntaxException { > + @WorkerThread // calls PublicSuffix methods. > + public static String getFormattedDomain(@NonNull final Context context, @NonNull final URI uri, > + final boolean shouldIncludePublicSuffix, @IntRange(from = 0) final int subdomainCount) { > if (context == null) { throw new NullPointerException("Expected non-null Context argument"); } > - if (uriString == null) { throw new NullPointerException("Expected non-null uri argument"); } > + if (uri == null) { throw new NullPointerException("Expected non-null uri argument"); } > + if (subdomainCount < 0) { throw new IllegalArgumentException("Expected subdomainCount >= 0."); } Same here - @IntRange should guard against this right? Actually, I don't know how exactly these argument annotations work - do they throw as well? -- Runtime vs compile-time, makes sense. ::: mobile/android/base/java/org/mozilla/gecko/util/URIUtils.java:107 (Diff revision 1) > + } > + > + /** Strips any subdomains from the host over the given limit. */ > + private static String stripSubdomains(String host, final int desiredSubdomainCount) { > + int includedSubdomainCount = 0; > + for (int i = host.length() - 1; i >= 0; --i) { Nit: why not i--? I think they compile to the same inside a Java for-loop, and it looks more standard. ::: mobile/android/base/java/org/mozilla/gecko/util/URIUtils.java:107 (Diff revision 1) > + } > + > + /** Strips any subdomains from the host over the given limit. */ > + private static String stripSubdomains(String host, final int desiredSubdomainCount) { > + int includedSubdomainCount = 0; > + for (int i = host.length() - 1; i >= 0; --i) { Maybe a quick optimization here where you start with the index of the last '.' and start from there (because the base subdomain is at 0) but that's just a minor improvement and won't cost much. But since you have the tests anyway.... :)

Attachment #8896031 - Flags: review?(liuche) → review+

Chenxia Liu [:liuche]

Comment 14

•

7 years ago

mozreview-review

Comment on attachment 8896032 [details] Bug 1386902: Use "subdomain.domain" in top sites title. https://reviewboard.mozilla.org/r/167300/#review173712

Attachment #8896032 - Flags: review?(liuche) → review+

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Comment 15

•

7 years ago

mozreview-review-reply

Comment on attachment 8896031 [details] Bug 1386902: Add URIUtils.getFormattedDomain. https://reviewboard.mozilla.org/r/167298/#review173706 > Is this redundant with the @NonNull annotation? > > -- > Oh, though from looking at your tests, that's not expected. So I assume this is a runtime test, whereas the annotations are compile-time, makes sense. Precisely: the annotations do not guarantee runtime accuracy (so it's pretty annoying to have to specify them twice... x_x). > Nit: why not i--? I think they compile to the same inside a Java for-loop, and it looks more standard. `--i` is a micro-optimization that I don't think affects readability so I'm leaving it (though the fact that someone is questioning why I wrote it that way makes me question that logic...) Basically, `i--` is a store, decrement, and return (three operations) whereas `--i` is a decrement and return (two operations). You see this a lot in graphics where inner loops will be called thousands of times a second. Though I bet compilers nowadays are smart enough to fix this too... I did a quick skim for Java-related SO answers and couldn't find any. I don't know! I'll reconsider its use in the future. > Maybe a quick optimization here where you start with the index of the last '.' and start from there (because the base subdomain is at 0) but that's just a minor improvement and won't cost much. But since you have the tests anyway.... :) But under the hood, afaik, `lastIndexOf` is the same set of operations so I don't think it's any more optimized.

Comment hidden (mozreview-request)

Pulsebot

Comment 18

•

7 years ago

Pushed by michael.l.comella@gmail.com: https://hg.mozilla.org/integration/autoland/rev/ce9ca5e8f2d1 Add URIUtils.getFormattedDomain. r=liuche https://hg.mozilla.org/integration/autoland/rev/f98c01a22f7b Use "subdomain.domain" in top sites title. r=liuche

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Updated

•

7 years ago

Blocks: 1382332

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Updated

•

7 years ago

No longer blocks: 1382332

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Updated

•

7 years ago

Blocks: 1390372

Carsten Book [:Tomcat]

Comment 19

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/ce9ca5e8f2d1 https://hg.mozilla.org/mozilla-central/rev/f98c01a22f7b

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox57: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → Firefox 57

Chenxia Liu [:liuche]

Comment 20

•

7 years ago

> > I did a quick skim for Java-related SO answers and couldn't find any. I > don't know! I'll reconsider its use in the future. I found this while reviewing your patch earlier! Sorry, should have included it in my review comment so you didn't have to search (fruitlessly) for it: https://stackoverflow.com/questions/2315705/what-is-the-difference-between-i-i-in-for-loop-java

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Comment 21

•

7 years ago

(In reply to Chenxia Liu [:liuche] from comment #20) > I found this while reviewing your patch earlier! Sorry, should have included > it in my review comment so you didn't have to search (fruitlessly) for it: > https://stackoverflow.com/questions/2315705/what-is-the-difference-between-i- > i-in-for-loop-java Specifically this answer: https://stackoverflow.com/a/2315795/2219998 Where the user generates the byte-code for i++ and ++i in a loop and determines that the byte code is the same. Thanks. :)

Bogdan Surd, QA [:bsurd, NI]

Comment 22

•

7 years ago

Verified this and depending on the site that I visit I either get the domain or subdomain.domain name.

Status: RESOLVED → VERIFIED

status-firefox57: fixed → verified

:Grisha Kruglov

Comment 23

•

7 years ago

Mike, a bunch of URIUtils.getFormattedDomain are failing for me locally on the current m-c. It's a bit odd that they've started to fail now. Could you take a look?

Flags: needinfo?(michael.l.comella)

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Comment 24

•

7 years ago

(In reply to :Grisha Kruglov from comment #23) > Mike, a bunch of URIUtils.getFormattedDomain are failing for me locally on > the current m-c. What do you mean they're failing? What behavior are you seeing?

Flags: needinfo?(gkruglov)

:Grisha Kruglov

Comment 25

•

7 years ago

Straightforward behavioural test failures - e.g. expected to get "bbc", got "com" type of a thing. Nick is also seeing the same in https://bugzilla.mozilla.org/show_bug.cgi?id=1409087#c9

Flags: needinfo?(gkruglov)

Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla]

Assignee

Comment 26

•

7 years ago

(In reply to :Grisha Kruglov from comment #25) > Straightforward behavioural test failures - e.g. expected to get "bbc", got > "com" type of a thing. Nick is also seeing the same in > https://bugzilla.mozilla.org/show_bug.cgi?id=1409087#c9 Responded in https://bugzilla.mozilla.org/show_bug.cgi?id=1409087#c25.

Flags: needinfo?(michael.l.comella)

Nobody; OK to take it and work on it

Updated

•

4 years ago

Product: Firefox for Android → Firefox for Android Graveyard

example 7 years ago bbell 57.76 KB, image/jpeg		Details
Screenshot post-patch 7 years ago Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla] 176.11 KB, image/png		Details
Bug 1386902: Add URIUtils.getFormattedDomain. 7 years ago Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla] 59 bytes, text/x-review-board-request	liuche : review+	Details
Bug 1386902: Use "subdomain.domain" in top sites title. 7 years ago Michael Comella (:mcomella) [NI reported issues only: ex-Mozilla] 59 bytes, text/x-review-board-request	liuche : review+	Details