Closed Bug 1492938 Opened Last year Closed 10 months ago

Proxy autoconfig scripts should be loaded as UTF-8 if they are valid UTF-8, otherwise as Latin-1 (a byte is a code point)

Categories

(Core :: Networking, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
Tracking Status
firefox64 --- affected

People

(Reporter: Waldo, Assigned: Waldo)

References

()

Details

(Whiteboard: [necko-triaged])

Attachments

(2 files, 1 obsolete file)

PAC scripts are currently compiled as Latin-1 text -- a byte equals that code point.  In the modern world of emojis that I am sure are now required to be used in every PAC script in existence and constitute critical functionality therein, this really ought to be UTF-8.

bz suspects we may not be able to Just Change The Encoding here, and we might need to try to first parse as UTF-8, then if that fails fall back to Latin-1.  (Or more precisely, we ought to attempt a UTF-8 to UTF-16 conversion, then if that fails do a Latin-1 to UTF-16 conversion, then call a pure UTF-16 version.  Or we could attempt UTF-8 validation and if it passes use the UTF-8 API, otherwise translate Latin-1 to UTF-8.  But this latter tack is going to be slower until we have proper UTF-8 compilation support -- coming soon.)

Someone should implement such conversion-attempt code in PAC code.  *touches nose*
Priority: -- → P3
Whiteboard: [necko-triaged]
...said readability not being necessary now, but being quite necessary after the next patch.
Attachment #9029467 - Flags: review?(daniel)
Assignee: nobody → jwalden+bmo
Status: NEW → ASSIGNED
Most of the change here is alpha-renaming.  This could be much more minimal, but I think the clearer-about-encodings names help a ton for understandability and confidence in encoding-safety, which seems especially important where we have dodgy encoding behavior.
Attachment #9029468 - Flags: review?(daniel)
Comment on attachment 9029468 [details] [diff] [review]
Attempt compiling PAC scripts as UTF-8 first if they're valid UTF-8, and only if that fails inflate to UTF-16 and compile that

Review of attachment 9029468 [details] [diff] [review]:
-----------------------------------------------------------------

The code change looks good. I don't like the renamed variables, but can live with them if you insist.

::: netwerk/base/ProxyAutoConfig.cpp
@@ +662,4 @@
>  }
>  
>  nsresult ProxyAutoConfig::Init(const nsCString &aPACURI,
> +                               const nsCString &aUtf8OrLatin1PACScript,

I think this rename of the argument and the member variable makes the code harder to read. Why does the name itself have to say that it holds latin-1 or UTF-8?
Attachment #9029468 - Flags: review?(daniel) → review+
Attachment #9029467 - Flags: review?(daniel) → review+
(In reply to Daniel Stenberg [:bagder] from comment #3)
> >  nsresult ProxyAutoConfig::Init(const nsCString &aPACURI,
> > +                               const nsCString &aUtf8OrLatin1PACScript,
> 
> I think this rename of the argument and the member variable makes the code
> harder to read. Why does the name itself have to say that it holds latin-1
> or UTF-8?

Mm.  I want some sort of clarity of encoding for this because it's quite unclear what the encoding is throughout the process of handling full PAC data.  The comment at the very start (immediately post-download) is nice...but getting from that to where the actual UTF-8 test happens is convoluted.  And going backward -- as users might tend to do, to understand why UTF-8 testing is desirable -- is even harder.  I agree the naming goo is fugly; ideally we would use the type system for this, but nsACString is woefully unclear whether it holds ASCII only, UTF-8 more broadly, or Latin-1 distinctly more broadly (or at least possibly something else).

Here's an idea.  What about if I change stuff so the downloaded data is named as "data" -- which to my mind better connotes bag'o'bytes of uncertain encoding -- and then move the "this is what this data is" discussion to ProxyAutoConfig::{Init,SetupJS} far closer to the actual consumption of the bytes with specific encodings?  Take a gander, let me know what you think.  I'm very not happy with the current naming/approach for clarity, but I also would rather not just bull my way past half-muted reviewer concerns in code I don't regularly touch.  :-)
Attachment #9030382 - Flags: review?(daniel)
Attachment #9029468 - Attachment is obsolete: true
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4933066fd349
Move PAC script compilation (currently as Latin-1 only) into a lambda for readability.  r=bagder
Landed the uncontroversial part to get it out of my tree locally (tho as a practical matter it's not like it was going to bitrot or anything), mostly to keep |hg qser| size and rebase times down.  Leaving open for the actual fix to land...
Keywords: leave-open
Comment on attachment 9030382 [details] [diff] [review]
Take two, with less/less-verbose renaming?

Review of attachment 9030382 [details] [diff] [review]:
-----------------------------------------------------------------

I think this turned out much better. Thanks!
Attachment #9030382 - Flags: review?(daniel) → review+
Pushed by jwalden@mit.edu:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a27d4d17da85
Attempt compiling PAC scripts as UTF-8 first if they're valid UTF-8, and only if that fails inflate to UTF-16 and compile that.  r=bagder
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Keywords: leave-open
Resolution: --- → FIXED
Summary: Proxy autoconfig script loading shouldn't load PAC scripts as Latin-1 → Proxy autoconfig scripts should be loaded as UTF-8 if they are valid UTF-8, otherwise as Latin-1 (a byte is a code point)
You need to log in before you can comment on or make changes to this bug.