URIs in places DB and created by nsIIOService.newURI() are not normalized wrt character escaping

RESOLVED FIXED in Firefox 55

Status

()

defect
RESOLVED FIXED
7 years ago
2 years ago

People

(Reporter: jha, Assigned: emk)

Tracking

unspecified
mozilla55
x86
Linux
Points:
---

Firefox Tracking Flags

(firefox55 fixed)

Details

Attachments

(2 attachments)

Reporter

Description

7 years ago
nsIIOService.newURI() produces a nsIURI with special (e.g. UTF-8) characters escaped with uppercase A-F (e.g. %7E). However, the browser history and bookmarks use lowercase escapes (e.g. %7e). This causes a query from the history or bookmarks to not return any matches for the nsIURI, even if the page in question is visited/bookmarked.

Character escaping should be case-consistent in all browser components to prevent this problem.
> However, the browser history and bookmarks use lowercase escapes

How are the managing that?  Necko uses uppercase escapes, as you noted.  encodeURI and encodeURIComponent use uppercase escapes.  So how is Places ending up with lowercase escapes, exactly?  Is it rolling its own escaping?
Component: General → Places
Product: Core → Toolkit
(In reply to jha from comment #0)
> However, the browser history and bookmarks use lowercase escapes (e.g. %7e).

I don't understand what this means either. I don't know of any escaping code specific to history/bookmarks.

jha, do you have steps to reproduce the issue?
Reporter

Comment 3

7 years ago
Sorry, my bad, this may or may not be a bug after all.

I ran into this while trying to figure out why my Link Status Redux add-on does not show some visited links as visited. Turns out that the links in question were already escaped in lowercase. The add-on gets the link URL by replacing XULBrowserWindow.setOverLink. This URL has been unescaped (to produce more readable overlink text), and when it gets escaped again by feeding it to nsIIOService.newURI() the escapes are in uppercase and the URI won't match in a history query. I could work around this by converting the escapes to lowercase so I just assumed (yeah, I know ;) that there was some escape case-inconsistency (when bookmarked, those links that didn't work correctly showed as lowercase-escaped when hovering over the bookmark and worked just fine).

Now the question that remains is this: if a link specifies a href of "foo%7ebar", which when displayed is converted to "foo~bar", which in turn is escaped to "foo%7Ebar", should a places query return the "foo%7ebar" as a result when queried for "foo%7Ebar"? (After all, if I understand correctly, %7E and %7e in URLs mean the same thing.) If not (and I admit this is not a very critical issue), then this bug can be closed as invalid.
Are places history queries case-sensitive by default? How are you querying places exactly?
Reporter

Comment 5

7 years ago
> Are places history queries case-sensitive by default?

Apparently, and they should be, as the path component is case-sensitive (excluding the escapes I guess), /FOO is a different page from /foo. (Whereas the host component is case-insensitive; is it converted to lowercase when storing to places db?).

If the places db is supposed to store "normalized" URLs (is it?), then maybe any existing escapes should be converted to uppercase before storing to the db? For example, my workaround of lowercasing the escapes in the query URI won't work if the original link is only partially escaped or escaped in mixed case i.e. something like "foo%7e%7Ebar" is stored in the db.

> How are you querying places exactly?

Like this (the workaround not shown here):

var iosvc = Components.classes["@mozilla.org/network/io-service;1"].getService(Components.interfaces.nsIIOService);
var histsvc = Components.classes["@mozilla.org/browser/nav-history-service;1"].getService(Components.interfaces.nsIGlobalHistory2);
var uri = iosvc.newURI(link, null, null);
if (histsvc.isVisited(uri)) {
  // to get the visit date
  histsvc.QueryInterface(Components.interfaces.nsINavHistoryService);
  var query = histsvc.getNewQuery();
  query.uri = uri;
  var queryOptions = histsvc.getNewQueryOptions();
  queryOptions.includeHidden = true;
  queryOptions.maxResults = 1;
  var results = histsvc.executeQuery(query, queryOptions).root;
  ...
}

isVisited() will return false for "foo%7Ebar" if the link href linking to the page that was visited was "foo%7ebar"
Reporter

Updated

7 years ago
Summary: nsIIOService.newURI() escapes special characters with uppercase A-F → URIs in places DB and created by nsIIOService.newURI() are not normalized wrt character escaping
Reporter

Comment 6

7 years ago
Reporter

Comment 7

7 years ago
OK, here's the correct description of the problem, also updated the title to reflect this:

URIs stored in the places DB and those produced by nsIIOService.newURI() (which the places DB possibly uses) are not normalized (unambiguous) with regards to escaped characters. This causes problems when querying the places DB for an URI that is originally in unescaped form, which will be escaped with uppercase A-F by nsIIOService.newURI(). There are two aspects of the problem:

1. Escaped special characters already present in the URI are left as-is and not ensured to be uppercase. If a visited link is "%c3%84" (UTF-8), it is stored unchanged into the DB. If you refer to this in the unescaped form "Ä", which is escaped to "%C3%84" by nsIIOService.newURI(), you won't get any matches from a history query. This can be worked around by creating a second URI by lowercasing all escapes in the uri.spec and querying also that URI, but this won't help if the entry in the DB contains mixed-case escapes.

2. Escaped non-special charaters in the URI are not unescaped. For example the tilde is such a character. If a visited link is "%7E", the unescaped form "~" is not escaped by nsIIOService.newURI() and thus won't be found in a history query.

Whether this behavior is a bug or not depends on whether nsIIOService.newURI() is supposed to create normalized URIs and whether URIs stored in the places DB are supposed to be normalized. If so, a solution would be to first unescape and then re-escape the URI before storing it to the DB (I don't known whether there would be any problems with this approach, though). I have attached a HTML file containing test cases for the problem.
Reporter

Updated

7 years ago
Attachment #688173 - Attachment mime type: text/plain → text/html
Assignee

Comment 8

3 years ago
Yeah, we prettify URLs from web content before calling setOverLink:

https://dxr.mozilla.org/mozilla-central/rev/57a8cde3f08ca9d60bcd8bdd698ceec687f0aed2/docshell/base/nsDocShell.cpp#14079-14089

On the other hand, URLs from bookmarks/history will be passed as-is.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment hidden (mozreview-request)
Assignee

Comment 10

3 years ago
We should call UnEscapeURIForUI as late as possible.
Assignee

Updated

3 years ago
Component: Places → Document Navigation
Product: Toolkit → Core
Assignee

Updated

3 years ago
Attachment #8812739 - Flags: review?(jst) → review?(bugs)

Comment 11

3 years ago
mozreview-review
Comment on attachment 8812739 [details]
Bug 817374 - Fix inconsistency of the setOverLink parameter.

https://reviewboard.mozilla.org/r/94378/#review97182

::: docshell/base/nsDocShell.cpp
(Diff revision 1)
> -  if (NS_FAILED(rv)) {
> -    return rv;
> -  }
> -
> -  // use url origin charset to unescape the URL
> -  nsAutoCString charset;

Why the new code doesn't need to take care of originCharset?
Please explain and ask review again or fix.
Attachment #8812739 - Flags: review?(bugs) → review-
Assignee

Comment 12

3 years ago
Comment on attachment 8812739 [details]
Bug 817374 - Fix inconsistency of the setOverLink parameter.

(In reply to Olli Pettay [:smaug] from comment #11)
> Why the new code doesn't need to take care of originCharset?
> Please explain and ask review again or fix.

MozReview did not allow me to request re-review. Here is a copy from the MozReview reply:

I think we should just drop support for unescaping non-UTF8 URLs. Currently we don't support re-escaping blacklisted characters correctly (bug 1248812). But it is impossible to re-escape bidi formatting characters unless the charset happens to be UTF-8 or one of Arabic/Hebrew charsets because other charsets don't contain such characters.
Attachment #8812739 - Flags: review- → review?(bugs)
(In reply to Masatoshi Kimura [:emk] from comment #12)
> MozReview did not allow me to request re-review. 
Uh, want to file a MozReview bug about it? Or perhaps there is already such filed?
I wonder if hsivonen or someone has opinion on this.
Flags: needinfo?(hsivonen)
Comment on attachment 8812739 [details]
Bug 817374 - Fix inconsistency of the setOverLink parameter.

Maybe hsivonen could even review this.
The code looks ok, but I'm not sure about the charset handling change.
Attachment #8812739 - Flags: review?(bugs) → review?(hsivonen)
Assignee

Comment 16

3 years ago
(In reply to Olli Pettay [:smaug] from comment #13)
> (In reply to Masatoshi Kimura [:emk] from comment #12)
> > MozReview did not allow me to request re-review. 
> Uh, want to file a MozReview bug about it? Or perhaps there is already such
> filed?

I have filed bug 1321699.
Assignee

Comment 17

3 years ago
Other reasons why we should drop support for unescaping non-UTF-8 URLs:
1. UTF-8 proliferation
2. No other browsers supports unescaping non-UTF8 URLs.
3. We already treat non-ASCII octets in URLs as UTF-8 (except for the query part).
4. IRI (RFC 3987) only considers UTF-8 (comment #12 is one of the fallout).
(In reply to Masatoshi Kimura [:emk] from comment #17)
> Other reasons why we should drop support for unescaping non-UTF-8 URLs:
> 1. UTF-8 proliferation
> 2. No other browsers supports unescaping non-UTF8 URLs.

I agree that we should stop trying to unescape URLs according to a non-UTF-8 encoding.

> 3. We already treat non-ASCII octets in URLs as UTF-8 (except for the query
> part).
> 4. IRI (RFC 3987) only considers UTF-8 (comment #12 is one of the fallout).

I don't agree with these being good reasons, but I agree with the conclusion.
Flags: needinfo?(hsivonen)

Comment 19

3 years ago
mozreview-review
Comment on attachment 8812739 [details]
Bug 817374 - Fix inconsistency of the setOverLink parameter.

https://reviewboard.mozilla.org/r/94376/#review98304

I think we should make this change, but it's not clear to me that this change is sufficient. Please remove the charset argument to unEscapeURIForUI() and make it use UTF-8, since it seems that we now would want all callers to pass "UTF-8" as the encoding anyway. Other than that, could you, please, explain how this patch fixes the URL bar / history-related aspects of the bug as reported (as opposed to merely fixing the UI on link hover)?
Assignee

Comment 20

3 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #19)
> I think we should make this change, but it's not clear to me that this
> change is sufficient. Please remove the charset argument to
> unEscapeURIForUI() and make it use UTF-8, since it seems that we now would
> want all callers to pass "UTF-8" as the encoding anyway. Other than that,
> could you, please, explain how this patch fixes the URL bar /
> history-related aspects of the bug as reported (as opposed to merely fixing
> the UI on link hover)?

URL bar does not use unEscapeURIForUI because it needs to re-escape the URL (unEscapeURIForUI is lossy).

History did not unescape the URL at all and we should not tamper the URL internally because we can't tell how the server treat the URL equivalence and unEscapeURIForUI does not care about reserved characters. UnEscapeURIForUI is for UI, as the name implies. So we should used it only when we show the URL to the user in our UI (status tooltip, in this case).
(In reply to Masatoshi Kimura [:emk] (use Splinter to ask me for review, see bugzil.la/1321953) from comment #20)
> (In reply to Henri Sivonen (:hsivonen) from comment #19)
> > I think we should make this change, but it's not clear to me that this
> > change is sufficient. Please remove the charset argument to
> > unEscapeURIForUI() and make it use UTF-8, since it seems that we now would
> > want all callers to pass "UTF-8" as the encoding anyway. Other than that,
> > could you, please, explain how this patch fixes the URL bar /
> > history-related aspects of the bug as reported (as opposed to merely fixing
> > the UI on link hover)?
> 
> URL bar does not use unEscapeURIForUI because it needs to re-escape the URL
> (unEscapeURIForUI is lossy).
> 
> History did not unescape the URL at all and we should not tamper the URL
> internally because we can't tell how the server treat the URL equivalence
> and unEscapeURIForUI does not care about reserved characters.
> UnEscapeURIForUI is for UI, as the name implies. So we should used it only
> when we show the URL to the user in our UI (status tooltip, in this case).

OK.

I filed bug 1322874.

I suppose it would be good to get bz's OK for getting rid of the UI prettiness for the legacy-encoded query strings.
Flags: needinfo?(bzbarsky)
That's a question for the UI folks, no?
Flags: needinfo?(bzbarsky)
mossop, what's the right process for deciding if we're OK with no longer trying to make URL query strings originating from non-UTF-8 contexts pretty for UI?
Flags: needinfo?(dtownsend)
Asking the UI folks is thing to do. Madhava, can you forward this to someone for a decision?
Flags: needinfo?(dtownsend) → needinfo?(madhava)
Assignee

Comment 25

3 years ago
By the way, some of front-end code are already hard-coding "UTF-8":
https://dxr.mozilla.org/mozilla-central/search?q=unEscapeURIForUI(%22UTF-8&redirect=false
Comment on attachment 8812739 [details]
Bug 817374 - Fix inconsistency of the setOverLink parameter.

Unsetting review pending reply from madhava.
Attachment #8812739 - Flags: review?(hsivonen)
Assignee

Comment 27

2 years ago
madhava has only one activity in Bugzilla last year:
https://bugzilla.mozilla.org/page.cgi?id=user_activity.html&action=run&from=-1y&who=madhava@mozilla.com
Do you know anyone else to ask?
Flags: needinfo?(madhava) → needinfo?(dtownsend)
Assignee

Comment 28

2 years ago
I didn't mean to clear ni? to madhava...
Flags: needinfo?(madhava)
Shorlander maybe
Flags: needinfo?(dtownsend) → needinfo?(shorlander)
(In reply to Dave Townsend [:mossop] from comment #29)
> Shorlander maybe

I can help. Could someone please clarify what is meant by "getting rid of the UI prettiness for the legacy-encoded query strings." What UI prettiness are we losing?
Flags: needinfo?(shorlander)
Flags: needinfo?(madhava)
Assignee

Comment 31

2 years ago
(In reply to Stephen Horlander [:shorlander] from comment #30)
> (In reply to Dave Townsend [:mossop] from comment #29)
> > Shorlander maybe
> 
> I can help. Could someone please clarify what is meant by "getting rid of
> the UI prettiness for the legacy-encoded query strings." What UI prettiness
> are we losing?

If the query part of the url has a non-ASCII non-UTF-8 characters, it will be displayed as %hh encoded sequences instead of decoded character in the status tip. For example, when a page is encoded in ISO-8859-1 and a user hovers over a link "http://example.com/?name=M\xFCller" ("\xFC" is "ü" in ISO-8859-1) on the page, the status bar will display "http://example.com/?name=M%FCller" instead of "http://example.com/?name=Müller/".

Note that we will always decode the url in UTF-8 regardless of the page encoding. So if the url is "http://example.com/?name=M\xC3\xBCller", we will display "http://example.com/?name=Müller/" in the status bar. Also, we already do not decode non-UTF-8 bytes on the location bar.

Does this clarify the situation?
Flags: needinfo?(shorlander)
Assignee

Updated

2 years ago
Flags: needinfo?(shorlander)
Assignee

Comment 32

2 years ago
Ping? Please tell me if comment #31 is insufficient to answer your question.
Flags: needinfo?(shorlander)
Assignee

Comment 33

2 years ago
Do you know yet another person?

Or can I just go ahead if nobody cares about the status tip UX? I don't think it makes sense to leave the status tip and the location bar inconsistent.
Flags: needinfo?(dtownsend)
Assignee

Comment 34

2 years ago
I'll request a review again if no one respond in two weeks.
(In reply to Masatoshi Kimura [:emk] from comment #31)
> (In reply to Stephen Horlander [:shorlander] from comment #30)
> > (In reply to Dave Townsend [:mossop] from comment #29)
> > > Shorlander maybe
> > 
> > I can help. Could someone please clarify what is meant by "getting rid of
> > the UI prettiness for the legacy-encoded query strings." What UI prettiness
> > are we losing?
> 
> If the query part of the url has a non-ASCII non-UTF-8 characters, it will
> be displayed as %hh encoded sequences instead of decoded character in the
> status tip. For example, when a page is encoded in ISO-8859-1 and a user
> hovers over a link "http://example.com/?name=M\xFCller" ("\xFC" is "ü" in
> ISO-8859-1) on the page, the status bar will display
> "http://example.com/?name=M%FCller" instead of
> "http://example.com/?name=Müller/".

Does this affect all areas of the UI (like the address bar) or just the status bar?
Flags: needinfo?(dtownsend) → needinfo?(VYV03354)
Assignee

Comment 36

2 years ago
(In reply to Dave Townsend [:mossop] from comment #35)
> > If the query part of the url has a non-ASCII non-UTF-8 characters, it will
> > be displayed as %hh encoded sequences instead of decoded character in the
> > status tip. For example, when a page is encoded in ISO-8859-1 and a user
> > hovers over a link "http://example.com/?name=M\xFCller" ("\xFC" is "ü" in
> > ISO-8859-1) on the page, the status bar will display
> > "http://example.com/?name=M%FCller" instead of
> > "http://example.com/?name=Müller/".
> 
> Does this affect all areas of the UI (like the address bar) or just the
> status bar?

Sorry, I overlooked the question. It only affects the status bar. The address bar is already using UTF-8.
Flags: needinfo?(VYV03354)

Comment 37

2 years ago
mozreview-review
Comment on attachment 8812739 [details]
Bug 817374 - Fix inconsistency of the setOverLink parameter.

https://reviewboard.mozilla.org/r/94378/#review143652
Attachment #8812739 - Flags: review+
Comment hidden (mozreview-request)

Comment 39

2 years ago
Pushed by VYV03354@nifty.ne.jp:
https://hg.mozilla.org/integration/autoland/rev/8e61a56fa2e3
Fix inconsistency of the setOverLink parameter. r=mossop
Assignee

Updated

2 years ago
Assignee: nobody → VYV03354
Status: NEW → ASSIGNED

Comment 40

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/8e61a56fa2e3
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in before you can comment on or make changes to this bug.