Closed Bug 493975 Opened 15 years ago Closed 3 years ago

Make nsTextToSubURI threadsafe

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: sdwilsh, Unassigned)

References

Details

Attachments

(5 obsolete files)

Over in bug 455555, I'm moving location bar searches off of the main thread.  We call nsITextToSubURI::unEscapeURIForUI on our data to search through it.  We could proxy our calls back to the main thread, but given how often we'd call that, we'd have a significant slowdown in obtaining results, which is not the goal of this change.

Therefore, we need to make nsTextToSubURI threadsafe.  We have two options here:
1) Add appropriate locking
2) Rewrite this in JS since JS is threadsafe (this was suggested by timeless).

I don't think (1) or (2) are that different in terms of workload, and (2) gives us the advantage of having less code in C++.  I'm interested in hearing what smontagu and bz have to say, since they will likely be the reviewers for this code.

This will also require that nsCharsetConverterManager be threadsafe, which I'll be filing a bug for shortly.
Depends on: 493981
I'm not generally enthusiastic about the trend to reimplement the world in JS, but this seems like a good candidate for it.

As I said on IRC, you probably want to use nsIScriptableUnicodeConverter, but that still requires making nsCharsetConvertManager threadsafe.
In general, we don't support JS components that will be called into from C++ while there's content JS on the stack right now.  So doing (2) means either fixing bug 304048 (aka bug 360207) first or moving all of these callsites into JS as well:

http://mxr.mozilla.org/mozilla-central/search?string=unEscapeURI&find=.cpp&tree=mozilla-central
http://mxr.mozilla.org/mozilla-central/search?string=unEscapenonascii&find=.cpp&tree=mozilla-central
http://mxr.mozilla.org/mozilla-central/search?string=UnEscapeAndConvert&find=.cpp&tree=mozilla-central
http://mxr.mozilla.org/mozilla-central/search?string=ConvertAndEscape&find=.cpp&tree=mozilla-central

Not doing either of those will mean little things like getting location.hash in content JS will throw.
Just to be clear, I'm all in favor of moving code into JS; we just need to fix this old bug about being unable to create really useful JS components first.
(In reply to comment #3)
> Just to be clear, I'm all in favor of moving code into JS; we just need to fix
> this old bug about being unable to create really useful JS components first.
How hard do we think this is going to be?  I've basically got the JS component done, baring one issue (I think just one) that I'm debugging right now.

With that said, I think I've come to the conclusion that I don't actually need to use this code in bug 455555 now that I fully understand what it was doing in the first place.  We are really only using this code to unescape, and then converting to UTF-16.
Attached patch v0.1 (obsolete) — Splinter Review
I'm doing this in two parts.  First, I'm going to rewrite this in JS, with functionality being on par.  Then I'm going to make it threadsafe, which should be a really small patch, in theory.

This doesn't quite work.  I'm hung up on some places tests, but as far as I can tell the code does the same thing right now.  I've pushed this to the try server to see if anything else is broken at the moment.  Feel free to look at this and comment.
Attached patch v0.1 (obsolete) — Splinter Review
It certainly helps to hg add and hg rm stuff...
Attachment #378738 - Attachment is obsolete: true
So, the issue that I'm currently hitting is that calling this._netUtil.unescapeString is dropping the high bits of my string that I'm passing in.  If I pass in the unicode characeter 0x30A8, the function returns 0xA8.

Trying something that dolske suggested...
Attached patch rewrite v0.9 (obsolete) — Splinter Review
This passes xpcshell-tests and make check locally.  I've pushed to the try server to get more data (and performance numbers).

As far as I can tell, this is identical behavior to what we have now.  Tomorrow I want to go through and make sure we have unit tests for these code paths, and add tests if we do not.  Additionally, I'm going to revamp the IDL comments so they are actually useful and a bit more meaningful.  Comments and feedback are very much welcomed at this point.
Attachment #378740 - Attachment is obsolete: true
Attachment #378767 - Flags: superreview?(bzbarsky)
Attachment #378767 - Flags: review?(smontagu)
Forgot to add - after talking this over with dolske, there may be a bug in the existing code (and the reason for the try-catch block in _unEscapeAndConvert).  It's possible in the current code, and the new code, to pass in a valid UTF-8 string that gets turned into an invalid one because NS_UnescapeURL breaks it.  If you are a consumer who uses UTF-16, and then try to convert this unescaped string back to UTF-16, it will not be valid.
I believe that bug came up sometime in a patch of smontagu's that I was reviewing, fwiw.  You might want to dig up that bug and see what was happening there.  I don't have the bug# offhand, sadly, but pretty sure it's assigned to smontagu.
Was the stripping just because of the stupid (imo) signature in the IDL?  ACString is really only safe for ISO-8859-1-range strings...
(In reply to comment #10)
> I believe that bug came up sometime in a patch of smontagu's that I was
> reviewing, fwiw.  You might want to dig up that bug and see what was happening
> there.  I don't have the bug# offhand, sadly, but pretty sure it's assigned to
> smontagu.
Sadly, he's the assignee for every bug in this component, so it makes it hard to search for it.

(In reply to comment #11)
> Was the stripping just because of the stupid (imo) signature in the IDL? 
> ACString is really only safe for ISO-8859-1-range strings...
Are you talking about the signature of unescapeString on nsINetUtil.idl?  I suspect that that may be the case, and I'd be happy to fix that if we want to.  Native could should just be #including nsNetUtil.h and using the native methods anyway, right?  I can spin that off into another bug if you'd like.
> Sadly, he's the assignee for every bug in this component

Bah.

Did some digging through my visited bugs, and it was bug 415491.


> Are you talking about the signature of unescapeString on nsINetUtil.idl? 

Yes.  When converting from JS string to ACString, we convert by just dropping the high byte of the 16-bit units (calling JS_GetStringBytes, without using UTF-8 JS strings).  So if you pass in a string which is not basically byte-inflated ISO-8859-1, then you'll lose by the time it gets into C++.

Spinning this off into another bug might be good.  For one thing, it's not obvious what the right behavior here really is.   Unescaping doesn't have to produce useful Unicode.... :(
I didn't see any unit test failures or performance regressions after pushing to the try server with this patch.
Depends on: 494217
(In reply to comment #13)
> Spinning this off into another bug might be good.  For one thing, it's not
> obvious what the right behavior here really is.   Unescaping doesn't have to
> produce useful Unicode.... :(
Done in bug 494217.  I can get rid of all the converting around the unescape call with the patch in that bug.
Attached patch v1.0 (obsolete) — Splinter Review
I think there might be some slight behavior changes with this code in the rewrite.  I also don't think the IDL docs were completely in sync with the actual code from before.  Not sure if we should change the iid on the interface because of this or not (assuming not).

There are a few changes here from 0.9:
1) Added usable IDL comments to all methods
2) Removed the need to convert to UTF-8 and then back around the call to nsINetUtil::unescapeString
3) Fixed unEscapeURIForUI to not fail and assume UTF-8 if it does fail the first time through per IDL comments.  I'm not sure if we actually want that still, but that's up to the reviewers.

Sadly, there is just about zero test coverage on all of this code, so I'm going to have a write a bunch of tests too.  I'll put that up in another patch though, as I strongly suspect this code is correct.  The places test do hammer on unEscapeURIForUI, and those were the most difficult to pass initially.
Attachment #378767 - Attachment is obsolete: true
Attachment #378925 - Flags: superreview?(bzbarsky)
Attachment #378925 - Flags: review?(smontagu)
Attachment #378767 - Flags: superreview?(bzbarsky)
Attachment #378767 - Flags: review?(smontagu)
Whiteboard: [needs review smontagu][needs sr bz]
(In reply to comment #16)
> 2) Removed the need to convert to UTF-8 and then back around the call to
> nsINetUtil::unescapeString
I may need to add this back in based on discussion in bug 494217
Comment on attachment 378925 [details] [diff] [review]
v1.0

Canceling reviews until we figure out what is going on with bug 494217.
Attachment #378925 - Flags: superreview?(bzbarsky)
Attachment #378925 - Flags: review?(smontagu)
Whiteboard: [needs review smontagu][needs sr bz] → [pending bug 494217]
Attached patch v1.1 (obsolete) — Splinter Review
Well, that was quick.  I'll get the tests up tomorrow or the next day hopefully.
Attachment #378925 - Attachment is obsolete: true
Attachment #379792 - Flags: superreview?(bzbarsky)
Attachment #379792 - Flags: review?(smontagu)
Whiteboard: [pending bug 494217] → [needs review smontagu][needs sr bz]
Comment on attachment 379792 [details] [diff] [review]
v1.1

>+  _unEscapeAndConvert: function TTSU_unEscapeAndConvert(aCharset,
>+                                                        aText,
>+                                                        aFlags)
>+  {
>+    // Un-escape our text, making sure to default to escape all if no flags were
>+    // specified.
>+    let text = this._netUtil.unescapeString(aText, (aFlags || 0));
>+
>+    // unescapeString returns an ACString, so we have to convert this back to
>+    // Unicode (xpconnect doesn't do this for us).  It's possible that we are
>+    // given back an invalid UTF-8 string, however.  If that's the case, we'll
>+    // just return the string that we got back (when conversion fails).
>+    try {
>+      let converter = this._newConverter("UTF-8");
>+      text = converter.ConvertToUnicode(text);
>+    }
>+    catch (e) {
>+    }
I realized last night that I think I want to convert from the specified character set to unicode before I call unescapeString.  XPConnect will then convert it to UTF-8 safely from Unicode for the call, and then we'll get an escaped string that may be UTF-8 back.

I'll need to add a test that passes in something that isn't Unicode here to be sure about this.
That function's signature says |string| in the idl, no?  Passing non-ASCII data through that argument is technically not allowed.  Looking at the consumers, nsIndexedToHTML.cpp always passes through ASCII data, I can't tell for nsDirIndexParser.cpp (but could be checked with a dir with a non-ASCII name and files likewise in it), and the docshell caller seems to violate this (and will probably break the moment we stop escaping UTF-8 in refs).  The docshell caller might need changes in any case, to be compliant with the HTML5 spec for handling refs.

That said, the actual code for converting from C++ to JS strings used in XPConnect here will simply byte-inflate the string.  That is, it doesn't enforce the ASCII-ness.

Perhaps the right thing to do here is to make the argument ACString in the idl?  That will guarantee byte-inflation no matter what, and then your approach of converting before unescaping should be fine...  Or we could simply make it clear that the input must be ASCII, of course.
Well, it's only |string| for UnEscapeAndConvert, but the two other un-escape methods use ACString.  We should probably unify those I guess.  I guess I'm also not doing a simple refactoring anymore - should this work be spun out into a different bug blocking this one?
I thought we were changing at least one of the other unescape methods to use AUTF8String, no?

But yeah, if we start changing the API that should go in a separate bug...
Depends on: 495080
Attachment #379792 - Attachment is obsolete: true
Attachment #379792 - Flags: superreview?(bzbarsky)
Attachment #379792 - Flags: review?(smontagu)
(In reply to comment #23)
> I thought we were changing at least one of the other unescape methods to use
> AUTF8String, no?
That was in nsINetUtil

> But yeah, if we start changing the API that should go in a separate bug...
Spun off into bug 495080.
No longer depends on: 494217
Whiteboard: [needs review smontagu][needs sr bz]
Depends on: 494217
No longer blocks: 455555
Assignee: sdwilsh → smontagu

Still valid?

Flags: needinfo?(hsivonen)

mak, do we still care about using this stuff off-the-main-thread for the URL bar? AFAICT, the internals look rather off-the-main-thread-friendly at this point.

Flags: needinfo?(hsivonen) → needinfo?(mak)
Assignee: smontagu → nobody
Status: ASSIGNED → NEW

No, the urlbar doesn't use this from cpp anymore, I think we can close this bug.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(mak)
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: