Closed Bug 1688019 Opened 3 years ago Closed 3 months ago

Google redirect links on search suggestion can be misleading

Categories

(Firefox :: Address Bar, enhancement, P3)

Desktop
All
enhancement
Points:
5

Tracking

()

VERIFIED FIXED
119 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox84 --- wontfix
firefox85 --- wontfix
firefox86 --- wontfix
firefox119 --- verified

People

(Reporter: florencia.diciocco, Assigned: mcheang)

References

(Blocks 1 open bug)

Details

(Keywords: papercut, Whiteboard: [snt-scrubbed][search-papercut])

Attachments

(2 files)

Attached image Screenshot

[Affected platforms:]
mac 10.14.6
Ubuntu 18.04.2 LTS
Windows: will test it by eod

[Affected versions:]
Nightly 86.0a1 (2021-01-21)
Beta 85.0b9
ESR 78.7.0esr
Release 84.0.2

[Description:]
FF is showing redirect links as history suggestions when you search for a visited link. This is a bit misleading since the user might believe that it's opening google but ends up in another site.

[Steps:]

  1. Go to google.com and google something.
  2. Open a result/link in a new tab
  3. Close those tabs
  4. In a new tab, in the url bar type the first letters of the link opened in the new tab.
  5. It appears the google link (the one in grey). Click on it.

[Actual result:]
It opens the link from step 2

[Expected result:]
It should open a google link.

In general we don't hide redirect sources, because sometimes they can be more useful or readable than targets. Often the user also remembers what they typed, rather than the destination.

This is a special case though, it's a redirect where the url contains another embedded url... It sounds like we should be able to do something about these, that may also just mean hiding them (since the destination visit should be in history anyway).

Severity: -- → S3
Points: --- → 5
Keywords: papercut
Priority: -- → P3
Component: Search → Address Bar

The component has been changed since the backlog priority was decided, so we're resetting it.
For more information, please visit auto_nag documentation.

Priority: P3 → --
Priority: -- → P3

We should investigate de-dupe urls containing embedded urls for search results.

For example, there’s a url to the recipe for ice tea, but there’s another result that’s for google to redirect to the same url with the recipe for ice tea. Therefore, this is a duplicate

Assignee: nobody → mcheang
Whiteboard: [snt-triaged][search-papercut]
Whiteboard: [snt-triaged][search-papercut] → [snt-scrubbed][search-papercut]

I was thinking of picking this up, but I feel like I need more info to know how to move forward.

If I'm understanding this correctly, the issue is that target url is also contained in the Google search results URL. So if the user happens to type a string that matches both the target URL as well as something in the source URL, then source link may show up in suggest.

Other providers (i.e. Bing) may also use redirects but they may not be including the target URL in the request URL, so we avoid these duplicate entries.

I feel like I should know how URL's are added to history. There are times when I visit a link that redirects but it doesn't appear in my list of recent history. Thus I think it's worth investigating the process of adding a link to history. I'm going to needsinfo you :mak because you probably have a quick answer to this (and more insight).

I feel into the weeds and stepped through the backend logic of adding URI visits which called VisitURI in History.cpp. Will do more testing to see how the behaviour of this method changes depending on the conditions.

Flags: needinfo?(mak)

(In reply to James Teow [:jteow] from comment #4)

If I'm understanding this correctly, the issue is that target url is also contained in the Google search results URL. So if the user happens to type a string that matches both the target URL as well as something in the source URL, then source link may show up in suggest.

that's correct, both source and target may show.

Other providers (i.e. Bing) may also use redirects but they may not be including the target URL in the request URL, so we avoid these duplicate entries.

I didn't check what others are doing, off-hand it may be interesting and useful to check if Bing, DuckDuckGo, Yahoo, Yandex, Baidu (I think these are still the most used engines today) do something similar.

I feel into the weeds and stepped through the backend logic of adding URI visits which called VisitURI in History.cpp. Will do more testing to see how the behaviour of this method changes depending on the conditions.

When the docshell visits a url, is is indeed invoking VisitURI and SetURITitle (this is not a single call because the title must be extracted later from the DOM, there was some discussion about how to do a single call but it's low priority). The docshell knows the most about of the page was visited, and passes that info through flags
History may decide that some visits are not worth showing in the UI, and then sets the hidden field that can be used later by history views to exclude them. The main reason for that is performance of views showing hundreds/thousands of entries and providing to the user history lists that are more useful (like in the history menu). I think the urlbar, like searching through history, completely ignores hidden, the reason is if you are looking for something you should be able to find it.
The other part of this is frecency, when updating frecency from a new visit, IIRC we pass down the redirect info, otherwise we try to extract it from history itself. If we detect a visit is a redirect source, that is 0.
Sometimes we exclude frecency = 0 entries in the urlbar, but I suspect we're not very consistent with that, some queries may not do it, or frecency of these entries may not be exactly zero.

Flags: needinfo?(mak)

Note ideally you could just dedupe in the Urlbar Muxer, there is already code doing deduplication of other urls there.

Thanks for the background info, Marco. I'll do a deeper dive on how the main search engines code their URLs.

If just de-duping from the Muxer, I feel the simplest way without causing potential side effects would be to check if the URL comes from a search engine domain and it contains a URL in one of the query params.

The key point is making sure the URL could be from a SERP, which isn't trivial since the URL isn't necessarily going to have the same sub domain and path of a SERP, hence making parseSubmissionURL not applicable.

Some ideas:

  • UrlbarSearchUtils.enginesForDomainPrefix(): While I can extract the domain from the result URL, the helper is async and it's overall implementation could add a performance hit to _canAddResult
  • SearchUtils.GENERAL_SEARCH_ENGINE_IDS: Check if the result domain starts with a www and contains a name that starts with any of the general search engines before checking the query params. It's not entirely comprehensive since you can't check engines outside of those defined engines, and if the provider switches or uses subdomains for the redirect this won't work
  • Hard code an exception if this is predominantly an issue with Google
  • Dig deeper to find out why this result is being shown whereas other redirects from other SERPs aren't always shown even if the user types a terms that are similar to the redirect URL: Maybe something could be done prevent the result from appearing in _canAddResult?

I took a stab at conducting searches in the urlbar, opening both non-ads and ads into new tabs and then checked the places.sqlite -> MozPlaces database for changes.

Non-ad links:

  • Google and Yahoo US can include a URL as a param value.
  • Yahoo US will also use a different sub-domain to trigger the redirect (r.search.yahoo.com), so strictly matching against www.yahoo.com won't work.
  • DDG, Yandex don't add hidden entries.
  • Bing and Baidu do add hidden entries but they don't encode the target URLs in a human readable format.

Ad-links:

  • Google and Yahoo can include the target domain in their ad URLs, though the domains are different from the search engine.
  • Bing doesn't include the target domain in the ad url.
  • DDG uses their own URL that does include the target domain, before loading the Bing ad URL.
  • Baidu doesn't include a target domain in the ad URL that I saw. Likewise Yandex.

Also interesting is that other browsers may choose to omit these ad URLs in history even if you paste the ad manually: Perhaps they prevent specific URLs that redirect from being added to the user history? Or maybe they cherry pick a specific set of domains that they know are ad serving.

I have a WIP patch for this: https://phabricator.services.mozilla.com/D186578
It's not exactly ready yet and it's deduping but in some cases it's not. So I'm still testing it out.

A question I had was should we only apply this to the google case? Or should we parse the nested url for other search engines?
I did a manual test and the problem is only with google redirects.

Reproducible in

  • Google

Not Reproducible

  • DDG
  • Bing
  • Qwant
  • Yahoo
  • ask.com
  • Baidu
Attachment #9349636 - Attachment description: WIP: Bug 1688019 - Dedupe google redirect links results. r=mak → Bug 1688019 - Dedupe google redirect links results. r=mak

(In reply to Mandy Cheang [:mcheang][she/her] from comment #10)

A question I had was should we only apply this to the google case? Or should we parse the nested url for other search engines?

The best would ideally be to apply it only to redirecting urls... unfortunately we don't have that info available easily, so for now we could use a short blocklist with partial origins and file a follow-up bug to return redirecting info from history.

Pushed by mcheang@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cacd1aee41f8
Dedupe google redirect links results. r=mak
Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 119 Branch

Confirming enhancement is implemented on Nightly 119.0a1(build ID: 20230925093027) on Ubuntu 22, macOS 12, Windows 10.

Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.