Closed Bug 1386902 Opened 3 years ago Closed 3 years ago

Change top sites title to "subdomain.domain" (or "domain")

Categories

(Firefox for Android :: General, enhancement)

All
Android
enhancement
Not set

Tracking

()

VERIFIED FIXED
Firefox 57
Iteration:
1.28
Tracking Status
firefox57 --- verified

People

(Reporter: mcomella, Assigned: mcomella)

References

Details

(Whiteboard: [mobileAS] 1.27)

Attachments

(4 files)

After implementing bug 1382036, we set the Top Sites tiles' titles via:
if (pageMetadataProvider != null) {
  title = pageMetadataProvider.toLowerCase();
} else {
  title = hostSecondLevelDomain;
}

However, this does not accurately portray the site:

accounts.google.com -> google
bugzilla.mozilla.org -> mozilla

Whereas iOS, what we're trying to copy, would show:

accounts google
bugzilla mozilla

This problem is amplified by duplicate top sites (bug 1382332).

---

From my small sample set, we don't get a PageMetadata provider as often as iOS, hence the worse names. It appears we only grab the provider name from the "og:site_name" tag [1].

On iOS, it's a minified file [2] so I assume they just took the page-metadata-parser from desktop, which uses the following algorithm [3]:


function getProvider(url) {
  return urlparse.parse(url)
    .hostname
    .replace(/www[a-zA-Z0-9]*\./, '')
    .replace('.co.', '.')
    .split('.')
    .slice(0, -1)
    .join(' ');
}

Maybe we should do og:site_name first, and if not, ^.

[1]: https://dxr.mozilla.org/mozilla-central/rev/52285ea5e54c73d3ed824544cef2ee3f195f05e6/mobile/android/modules/WebsiteMetadata.jsm#172
[2]: https://github.com/mozilla-mobile/firefox-ios/blob/00164413843f937dc5f2460fcf0e90764f16e464/Client/Frontend/Metadata%20Parsing/page-metadata-parser.js
[3]: https://github.com/mozilla/page-metadata-parser/blob/cee8e669b0f4276ab52be8100f69ef357c4ccf3a/parser.js#L14
bug 1311434 implemented the provider extraction. From the commit summary:

```
There's a pull request [1] and issue [2] on GitHub tracking the work to land this in the page-metadata-parser repository. Multiple potential sources for getting a 'provider name' are considered in the linked issue - however they all have slightly different semantic meanings. Only 'og:site_name' actually has the same meaning as "provider name". Therefore it is pretty safe to land this part.

[1] https://github.com/mozilla/page-metadata-parser/pull/81
[2] https://github.com/mozilla/page-metadata-parser/issues/79
```

From bug 1311434 comment 3:
> I wonder how prevalent is use of these open graph tags. Anecdata follows:
> out of my top sites and current set of highlights, none had any `og:*` tags,
> except for some github links.

---

Sebastian, having done a little more research on provider names, do you have an opinion on how we should improve our providers extraction?

fwiw, we could wait until bug 1382036 actually lands to get a sense of how well our implementation does on more than just my small sample size.
Flags: needinfo?(s.kaspari)
See Also: → 1311434
> > I wonder how prevalent is use of these open graph tags. Anecdata follows:
> > out of my top sites and current set of highlights, none had any `og:*` tags,
> > except for some github links.

From a brief survey of my history, all Wordpress sites add og: tags, so that's about 25% of the web right there.
My opinion: Whatever algorithm we choose to pick a title from the URL will fail in some cases. URLs just do not follow a fixed schema that relates to the actual content (or what the user thinks the content is). GitHub or Reddit are examples where the interesting part is not in the domain but maybe in the paths.

Anyways, whatever we do based on the domain can only be a fallback. We should try to use whatever the site gives us first (like og:site - back when I investigated this was the only reliable one for our purposes; others might be good fallbacks though).

> function getProvider(url) {
>  return urlparse.parse(url)
>    .hostname
>    .replace(/www[a-zA-Z0-9]*\./, '')
>    .replace('.co.', '.')
>    .split('.')
>    .slice(0, -1)
>    .join(' ');
> }

This looks a bit hacky to be honest. It removes the TLD (A) and common prefixes like www (B).

* (A) With working around second level domains like co.uk this works fine for most western URLs I guess. But there are a bunch of other second-level domains that we might want to strip - e.g. https://en.wikipedia.org/wiki/Second-level_domain#Country-code_second-level_domains
* I was wondering whether it makes sense to strip the public suffix - but I guess that's too much.

* (B) We have similar code that strips other common prefixes like m / mobile / ... - I think we can do better here too?


And a question: If iOS uses og:site_name or the getProvider() method - Is the Public Suffix Code from bug 1377287 ever triggered on iOS?
Flags: needinfo?(s.kaspari)
A related bug is bug 1382332: "Both Highlights & Top Sites Have Duplicate Entries".
An example of collision that can't be fixed by bug 1382332:

accounts.craigslist.org
sfbay.craigslist.org

These share a title, "craigslist", and a favicon so they're indistinguishable (note: prior to bug 1382036 landing but I don't see a site_name tag so I don't think it'll make a difference).
Attached image example
The name on top sites should be the domain. If there’s a Subdmain it should be subdomain.domain
It sounds like this replaces bug 1382036 so I'll add it to this sprint.
tracking-fennec: ? → ---
Iteration: --- → 1.27
Summary: TopSites tiles' titles do not accurately indicate site (from PageMetadata provider) → Change top sites title to "subdomain.domain" (or "domain")
Whiteboard: [mobileAS] → [mobileAS] 1.27
(In reply to Michael Comella (:mcomella) from comment #4)
> A related bug is bug 1382332: "Both Highlights & Top Sites Have Duplicate
> Entries".

Ideally Top Sites focuses on top level domains, and highlights is keyed toward pages (non-top-level domain content) that has been viewed in the last few days.
Assignee: nobody → michael.l.comella
Attached image Screenshot post-patch
Two concerns:

1) I assume we be removing common subdomains like "www." and "m."?

2) Do we have enough room to display "subdomain.domain" on mobile?
Flags: needinfo?(bbell)
(In reply to Michael Comella (:mcomella) from comment #9)
> Created attachment 8896029 [details]
> Screenshot post-patch
> 
> Two concerns:
> 
> 1) I assume we be removing common subdomains like "www." and "m."?
> 
> 2) Do we have enough room to display "subdomain.domain" on mobile?

Yeah for 1, there's not need to so common sub-domains. 

For 2, its okay to cut off names. The sub-domain is the most important bit of information so most of that will show.
Flags: needinfo?(bbell)
Iteration: 1.27 → 1.28
Comment on attachment 8896031 [details]
Bug 1386902: Add URIUtils.getFormattedDomain.

https://reviewboard.mozilla.org/r/167298/#review173706

::: mobile/android/base/java/org/mozilla/gecko/util/URIUtils.java:42
(Diff revision 1)
>      /**
> -     * Returns the second level domain (SLD) of a url. It removes any subdomain/TLD.
> +     * Returns the domain for the given URI, formatted by the other available parameters.
> -     * e.g. https://m.foo.com/bar/baz?noo=abc#123  => foo
>       *
> -     * The return value may still contain a public suffix (e.g. .com) if the suffix does not match any of our
> -     * known values. If a host cannot be determined from the given url String, the empty String will be returned.
> +     * A public suffix is a top-level domain. For the input, "https://github.com", you can specify public suffix:
> +     * - included: "github.com"

Nit: Since this is represented as a boolean in the inputs, would be nice if these were true/false values for "included", rather than "included"/"excluded".

::: mobile/android/base/java/org/mozilla/gecko/util/URIUtils.java:76
(Diff revision 1)
> -    @WorkerThread // PublicSuffix methods can touch the disk.
> -    public static String getHostSecondLevelDomain(@NonNull final Context context, @NonNull final String uriString)
> -            throws URISyntaxException {
> +    @NonNull
> +    @WorkerThread // calls PublicSuffix methods.
> +    public static String getFormattedDomain(@NonNull final Context context, @NonNull final URI uri,
> +            final boolean shouldIncludePublicSuffix, @IntRange(from = 0) final int subdomainCount) {
>          if (context == null) { throw new NullPointerException("Expected non-null Context argument"); }
> -        if (uriString == null) { throw new NullPointerException("Expected non-null uri argument"); }
> +        if (uri == null) { throw new NullPointerException("Expected non-null uri argument"); }

Is this redundant with the @NonNull annotation?

--
Oh, though from looking at your tests, that's not expected. So I assume this is a runtime test, whereas the annotations are compile-time, makes sense.

::: mobile/android/base/java/org/mozilla/gecko/util/URIUtils.java:77
(Diff revision 1)
> -    public static String getHostSecondLevelDomain(@NonNull final Context context, @NonNull final String uriString)
> -            throws URISyntaxException {
> +    @WorkerThread // calls PublicSuffix methods.
> +    public static String getFormattedDomain(@NonNull final Context context, @NonNull final URI uri,
> +            final boolean shouldIncludePublicSuffix, @IntRange(from = 0) final int subdomainCount) {
>          if (context == null) { throw new NullPointerException("Expected non-null Context argument"); }
> -        if (uriString == null) { throw new NullPointerException("Expected non-null uri argument"); }
> +        if (uri == null) { throw new NullPointerException("Expected non-null uri argument"); }
> +        if (subdomainCount < 0) { throw new IllegalArgumentException("Expected subdomainCount >= 0."); }

Same here - @IntRange should guard against this right? Actually, I don't know how exactly these argument annotations work - do they throw as well?

--
Runtime vs compile-time, makes sense.

::: mobile/android/base/java/org/mozilla/gecko/util/URIUtils.java:107
(Diff revision 1)
> +    }
> +
> +    /** Strips any subdomains from the host over the given limit. */
> +    private static String stripSubdomains(String host, final int desiredSubdomainCount) {
> +        int includedSubdomainCount = 0;
> +        for (int i = host.length() - 1; i >= 0; --i) {

Nit: why not i--? I think they compile to the same inside a Java for-loop, and it looks more standard.

::: mobile/android/base/java/org/mozilla/gecko/util/URIUtils.java:107
(Diff revision 1)
> +    }
> +
> +    /** Strips any subdomains from the host over the given limit. */
> +    private static String stripSubdomains(String host, final int desiredSubdomainCount) {
> +        int includedSubdomainCount = 0;
> +        for (int i = host.length() - 1; i >= 0; --i) {

Maybe a quick optimization here where you start with the index of the last '.' and start from there (because the base subdomain is at 0) but that's just a minor improvement and won't cost much. But since you have the tests anyway.... :)
Attachment #8896031 - Flags: review?(liuche) → review+
Comment on attachment 8896032 [details]
Bug 1386902: Use "subdomain.domain" in top sites title.

https://reviewboard.mozilla.org/r/167300/#review173712
Attachment #8896032 - Flags: review?(liuche) → review+
Comment on attachment 8896031 [details]
Bug 1386902: Add URIUtils.getFormattedDomain.

https://reviewboard.mozilla.org/r/167298/#review173706

> Is this redundant with the @NonNull annotation?
> 
> --
> Oh, though from looking at your tests, that's not expected. So I assume this is a runtime test, whereas the annotations are compile-time, makes sense.

Precisely: the annotations do not guarantee runtime accuracy (so it's pretty annoying to have to specify them twice... x_x).

> Nit: why not i--? I think they compile to the same inside a Java for-loop, and it looks more standard.

`--i` is a micro-optimization that I don't think affects readability so I'm leaving it (though the fact that someone is questioning why I wrote it that way makes me question that logic...)

Basically, `i--` is a store, decrement, and return (three operations) whereas `--i` is a decrement and return (two operations). You see this a lot in graphics where inner loops will be called thousands of times a second. Though I bet compilers nowadays are smart enough to fix this too...

I did a quick skim for Java-related SO answers and couldn't find any. I don't know! I'll reconsider its use in the future.

> Maybe a quick optimization here where you start with the index of the last '.' and start from there (because the base subdomain is at 0) but that's just a minor improvement and won't cost much. But since you have the tests anyway.... :)

But under the hood, afaik, `lastIndexOf` is the same set of operations so I don't think it's any more optimized.
Pushed by michael.l.comella@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/ce9ca5e8f2d1
Add URIUtils.getFormattedDomain. r=liuche
https://hg.mozilla.org/integration/autoland/rev/f98c01a22f7b
Use "subdomain.domain" in top sites title. r=liuche
https://hg.mozilla.org/mozilla-central/rev/ce9ca5e8f2d1
https://hg.mozilla.org/mozilla-central/rev/f98c01a22f7b
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 57
> 
> I did a quick skim for Java-related SO answers and couldn't find any. I
> don't know! I'll reconsider its use in the future.

I found this while reviewing your patch earlier! Sorry, should have included it in my review comment so you didn't have to search (fruitlessly) for it:
https://stackoverflow.com/questions/2315705/what-is-the-difference-between-i-i-in-for-loop-java
(In reply to Chenxia Liu [:liuche] from comment #20)
> I found this while reviewing your patch earlier! Sorry, should have included
> it in my review comment so you didn't have to search (fruitlessly) for it:
> https://stackoverflow.com/questions/2315705/what-is-the-difference-between-i-
> i-in-for-loop-java

Specifically this answer: https://stackoverflow.com/a/2315795/2219998

Where the user generates the byte-code for i++ and ++i in a loop and determines that the byte code is the same.

Thanks. :)
Verified this and depending on the site that I visit I either get the domain or subdomain.domain name.
Status: RESOLVED → VERIFIED
Mike, a bunch of URIUtils.getFormattedDomain are failing for me locally on the current m-c. It's a bit odd that they've started to fail now. Could you take a look?
Flags: needinfo?(michael.l.comella)
(In reply to :Grisha Kruglov from comment #23)
> Mike, a bunch of URIUtils.getFormattedDomain are failing for me locally on
> the current m-c.

What do you mean they're failing? What behavior are you seeing?
Flags: needinfo?(gkruglov)
Straightforward behavioural test failures - e.g. expected to get "bbc", got "com" type of a thing. Nick is also seeing the same in https://bugzilla.mozilla.org/show_bug.cgi?id=1409087#c9
Flags: needinfo?(gkruglov)
(In reply to :Grisha Kruglov from comment #25)
> Straightforward behavioural test failures - e.g. expected to get "bbc", got
> "com" type of a thing. Nick is also seeing the same in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1409087#c9

Responded in https://bugzilla.mozilla.org/show_bug.cgi?id=1409087#c25.
Flags: needinfo?(michael.l.comella)
You need to log in before you can comment on or make changes to this bug.