Closed Bug 1038225 Opened 6 years ago Closed 6 years ago

Define how to map search URLs to terms and engine name

Categories

(Toolkit :: Places, defect)

Points:
2

Tracking


RESOLVED FIXED
Iteration:
34.1

People

(Reporter: Paolo, Assigned: Paolo)

References

Details

In bug 1034381 we want to improve how we display history results for past searches, started either from in-product search input fields or from the search engine web pages themselves (for example when a search is refined).

In order to do that, we'll need to map the search result URL to the engine name and the search terms used.

This bug is about defining the logic to use when doing the mapping.
Flags: firefox-backlog+
Blocks: 959582
Points: --- → 2
QA Whiteboard: [qa-]
Added to Iteration 33.3
Assignee: nobody → paolo.mozmail
Status: NEW → ASSIGNED
Iteration: --- → 33.3
I have looked into some of our default search engine definitions:

http://mxr.mozilla.org/mozilla-central/source/browser/locales/en-US/searchplugins/

All of those I've inspected put the search terms into a standalone parameter, using this format:

<Url type="text/html" method="GET" template="https://www.google.com/search">
<Param name="q" value="{searchTerms}"/>

I've seen that the OpenSearch format allows the parameter to be defined in the "template" attribute as well, for example "http://example.com/search?q={searchTerms}":

http://www.opensearch.org/Specifications/OpenSearch/1.1#OpenSearch_URL_template_syntax

If that is correct, we might want to support that syntax as well. We may want to exclude more complex templates such as "http://example.com/search?action=search:{searchTerms}".

I don't know if we want to support the "{searchTerms?}" syntax, but if I understand correctly it's effectively the same as "{searchTerms}", so we might as well support it. The parameter name is case-sensitive.
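The template-embedded form could be detected with a small helper along these lines (a sketch; the function name and the strict equality check are my assumptions, not existing Places code):

```javascript
// Hypothetical helper: find the parameter name bound to {searchTerms} when
// the placeholder appears in the "template" attribute itself, e.g.
// "http://example.com/search?q={searchTerms}". Returns null for templates
// where the placeholder has a prefix or suffix, which we may want to exclude.
function termsParamFromTemplate(template) {
  let url;
  try {
    url = new URL(template);
  } catch (e) {
    return null;
  }
  for (let [name, value] of new URLSearchParams(url.search)) {
    // Accept "{searchTerms}" and the optional "{searchTerms?}" form;
    // the comparison is case-sensitive, matching the OpenSearch spec.
    if (value == "{searchTerms}" || value == "{searchTerms?}") {
      return name;
    }
  }
  return null;
}
```

A template like "http://example.com/search?action=search:{searchTerms}" would yield null here, because the parameter value is not exactly the placeholder.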

We could use this logic for reversing a URL from a known engine:
* Treat http and https schemes as equivalent.
* Match the exact search page domain and path, for example "www.google.com/search".
  - We may want to match the path case-insensitively, as some servers might be case-insensitive.
* Require the parameter name associated with {searchTerms} to be present.
* Parse the parameter value using the <InputEncoding> of the search engine.
* Ignore every other parameter.
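The rules above could be sketched roughly as follows (names and the engine descriptor shape are illustrative assumptions; this also assumes UTF-8 decoding via URLSearchParams rather than honoring <InputEncoding>):

```javascript
// Sketch of the reversing logic: map a visited URL back to an engine name
// and the search terms, or return null if the URL doesn't match.
function reverseSearchURL(urlString, engine) {
  let url;
  try {
    url = new URL(urlString);
  } catch (e) {
    return null;
  }
  // Treat http and https schemes as equivalent; reject anything else.
  if (url.protocol != "http:" && url.protocol != "https:") {
    return null;
  }
  // Match the exact search page domain and path, the path case-insensitively.
  if (url.host != engine.host ||
      url.pathname.toLowerCase() != engine.path.toLowerCase()) {
    return null;
  }
  // Require the parameter associated with {searchTerms} to be present;
  // every other parameter is ignored.
  if (!url.searchParams.has(engine.termsParam)) {
    return null;
  }
  return {
    engine: engine.name,
    terms: url.searchParams.get(engine.termsParam),
  };
}
```

For example, with an engine descriptor like { name: "Google", host: "www.google.com", path: "/search", termsParam: "q" }, the URL "https://www.google.com/Search?q=mozilla+firefox&hl=en" would reverse to the terms "mozilla firefox", while "https://www.google.com/maps?q=x" would not match.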

For domain matching, there is still the issue of "www.google.com" redirecting to localized domains like "www.google.co.uk", as well as searches started from the web from a localized page. I also don't know if some search engines may allow variants of their search URL, like "example.com?q=" and "www.example.com?q=".

We may add a list of supported language TLDs (or full domains?) for our default search engine definitions. However, I'm not sure where; maybe we can just add a custom attribute to the search engine definition files.

Searches from engines that are not in our default list would be treated as normal URLs, with no special styling in location bar autocomplete results. I wonder if we may want to identify searches from a list of search engines that are not in the default list and style them better, but this seems like something we should not do at present because of the additional complexity.

Gavin, do you have an opinion on these choices? Is there anyone else that you know we should consult for this design?
Flags: needinfo?(gavin.sharp)
Also, with regard to language redirection, there might be the case where the user installed new search engine variants for specific locales on purpose, in order to choose which one to use for a search, for example separate "Wikipedia (en)" and "Wikipedia (fr)" entries. In this case I think we should prioritize the language-specific engine over the TLD variants of the main engine when determining which engine was used.
(In reply to :Paolo Amadini from comment #2)
> All of those I've inspected put the search terms into a standalone
> parameter, using this format:
> 
> <Url type="text/html" method="GET" template="https://www.google.com/search">
> <Param name="q" value="{searchTerms}"/>
> We may  want to exclude more complex templates where we have
> "http://example.com/search?action=search:{searchTerms}".

I actually missed eBay:

<Url type="text/html" method="GET" template="http://rover.ebay.com/rover/1/711-47294-18009-3/4" resultdomain="ebay.com">
<Param name="mpre" value="http://www.ebay.com/sch/i.html?_nkw={searchTerms}" />

This means we might need to either support prefixes and suffixes in the parameter containing {searchTerms}, or exclude eBay from search term reversing.
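Supporting the eBay case would mean splitting the parameter value around the placeholder, something like this sketch (function name is hypothetical):

```javascript
// A <Param> value may embed {searchTerms} with a prefix and/or suffix, as in
// eBay's "mpre" parameter. Splitting the declared value around the
// placeholder yields the literal strings a reverse-mapper would have to
// strip from the observed parameter value before extracting the terms.
function splitAroundSearchTerms(paramValue) {
  const placeholder = "{searchTerms}";
  const i = paramValue.indexOf(placeholder);
  if (i < 0) {
    return null;
  }
  return {
    prefix: paramValue.slice(0, i),
    suffix: paramValue.slice(i + placeholder.length),
  };
}
```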
I think we need to reduce scope as much as possible. When we spoke we discussed limiting this to our built-in engines, but perhaps we could go even further and limit to built-in "general purpose" search engines (Bing, Yahoo, Google, maybe Wikipedia), on the basis that they are the most-used in general, and also most-used for frequent searching (as opposed to one-off specialized/specific searches).

I'm not sure I understand entirely what you were envisioning for an implementation strategy - the {searchTerms?} syntax and {searchTerms}-in-template functionality isn't used by our default engines and so we should avoid worrying about them IMO.

It might be interesting to implement this as an "engine result" RegExp specified in the search engine description that matches result URLs from that engine, and captures the parameter value. We could pair that with some telemetry that measures how many times we end up loading a search from a given engine that does not match its "engine result" regexp, so that we could see when the matching was failing in practice (e.g. due to a search engine changing the URLs it uses).
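To illustrate, an "engine result" RegExp could look something like this (the pattern and table are assumptions for illustration, not shipped engine definitions):

```javascript
// Hypothetical per-engine "engine result" patterns: each matches result URLs
// from that engine and captures the terms parameter value.
const engineResultPatterns = {
  Google: /^https?:\/\/www\.google\.com\/search\?(?:.*&)?q=([^&]*)/,
};

function matchEngineResult(engineName, urlString) {
  const re = engineResultPatterns[engineName];
  const m = re && re.exec(urlString);
  if (!m) {
    // A telemetry probe could count these misses per engine, surfacing
    // cases where an engine changed its result URL format.
    return null;
  }
  // Reverse application/x-www-form-urlencoded encoding of the capture.
  return decodeURIComponent(m[1].replace(/\+/g, " "));
}
```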
Flags: needinfo?(gavin.sharp)
(In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #5)
> I think we need to reduce scope as much as possible. When we spoke we
> discussed limiting this to our built-in engines, but perhaps we could go
> even further and limit to built-in "general purpose" search engines (Bing,
> Yahoo, Google, maybe Wikipedia), on the basis that they are the most-used in
> general, and also most-used for frequent searching (as opposed to one-off
> specialized/specific searches).

I'm not sure whether you are suggesting creating a whitelist of engines that initially support this functionality, or providing the functionality to any engine that honors certain restrictions, which would include the engines you mentioned.

Creating a whitelist is more work than the other option.

> I'm not sure I understand entirely what you were envisioning for an
> implementation strategy - the {searchTerms?} syntax and
> {searchTerms}-in-template functionality isn't used by our default engines
> and so we should avoid worrying about them IMO.

We can limit this to engines having a simple {searchTerms} <Param> entry.

> It might be interesting to implement this as an "engine result" RegExp
> specified in the search engine description that matches result URLs from
> that engine, and captures the parameter value.

I think using the existing parameter definitions plus additional rules will be easier to declare, and more maintainable than a regular expression, which could easily go out of sync. We need to reverse the URL encoding anyway.

For an example of what I mean by "additional rule": we could add a new tag with a space-separated list of "result domains" for regional TLDs. This would be very easy to update from Google's list when needed:

http://www.google.com/supported_domains

As I understand it, the Google redirect is based on your current location rather than the browser locale, so we should have the same region list available in all Firefox locales.
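A minimal sketch of such a "result domains" rule, assuming a space-separated list attribute with illustrative values:

```javascript
// Hypothetical "result domains" rule: a space-separated list of regional
// domains for an engine, as could be derived from
// http://www.google.com/supported_domains. The domain values are examples.
function parseResultDomains(attributeValue) {
  return attributeValue.trim().split(/\s+/);
}

function hostMatchesEngine(host, resultDomains) {
  // Host comparison is case-insensitive per RFC 3986.
  return resultDomains.includes(host.toLowerCase());
}
```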

Maybe we should keep the list of regional domains in a separate file? That would be more work for this implementation, but less work for localizers. It may have the downside of making the feature unavailable to user-installed engines; even if we don't care about those for the moment, supporting them could be a future request.

> We could pair that with some
> telemetry that measures how many times we end up loading a search from a
> given engine that does not match its "engine result" regexp, so that we
> could see when the matching was failing in practice (e.g. due to a search
> engine changing the URLs it uses).

This is a very good idea. Will it be able to send distinct data for each search engine? I'll file a separate bug, but I think we're already doing some work on search engine telemetry; maybe you know of current or past work that I can use as a reference?
Flags: needinfo?(gavin.sharp)
Blocks: 1040721
Iteration: 33.3 → 34.1
No longer blocks: 959582
Spoke to Paolo about this over Vidyo; I don't think there is much precedent for this kind of telemetry. The other details in comment 6 have been resolved by the new approach in bug 1040721, I believe.

Can we call this FIXED now?
Flags: needinfo?(gavin.sharp)
(In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #7)
> Spoke to Paolo about this over vidyo - don't think there is much precedent
> for this kind of telemetry. The other details in comment 6 have been
> resolved by the new approach in bug 1040721, I believe.
> 
> Can we call this FIXED now?

I filed bug 1042604 for the telemetry part, so we can now call this fixed!
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED