Closed Bug 881958 Opened 11 years ago Closed 11 years ago

Parser ignores escaped characters

Categories

(L20n :: JS Library, defect, P2)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 918655

People

(Reporter: zbraniecki, Assigned: zbraniecki)

Details

Attachments

(1 file, 1 obsolete file)

Python parser does a fine job at unescaping escaped characters in string: https://github.com/l20n/python-l20n/blob/master/lib/format/lol/parser.py#L200-L204

but our JS parser doesn't do that in getString. It only does it in getComplexString which makes certain strings like:

<id "foo\\"foo"> return 'foo\"foo' instead of 'foo"foo'.
I also noticed that \\ \" ends up correctly as \ ", but \\\" throws.
Not sure if this is the same bug, but \\{{ ends up escaping the braces somehow.  In tinker,

    <welcome "Welcome to \\{{ l20n }}!">

is shown as 

    Welcome to {{ l20n }}!

Instead of

    Welcome to L20n!
Assignee: nobody → gandalf
Priority: -- → P2
Target Milestone: --- → 1.0
Attached patch patch (obsolete) — Splinter Review
this patch changes the behavior of our escaping.

With the patch we only escape:

 - \" -> " in case string is enclosed in "
 - \' -> ' in case string is enclosed in '
 - \{{ -> {{ and no compilation

The cost of the patch is estimated to be ~18% according to make perf, but if I reduce the number of iterations to 1 and launch it multiple times, the numbers are basically the same, so I assume that it slows down JITted code mostly.

We will have to update the grammar because according to it we should also escape:

 - \\ -> \
 - \" -> " in case of " enclosed string
 - \' -> ' in case of ' enclosed string

One more caveats of that is that simple string keeps \{{ as \{{, only parsing it into complex string does the unescaping.
I believe it to be the right choice - from simple string perspective \{{ is just a string. And it will be parsed into complex string because it contains {{.
Attachment #771009 - Flags: review?(stas)
Attached patch patchSplinter Review
Updated patch.

As per conv. with Stas we decided that we'll only unescape string quotes.
Attachment #771009 - Attachment is obsolete: true
Attachment #771009 - Flags: review?(stas)
Attachment #782013 - Flags: review?(stas)
Comment on attachment 782013 [details] [diff] [review]
patch

Review of attachment 782013 [details] [diff] [review]:
-----------------------------------------------------------------

I've been thinking about this and I had a change of heart.  Maybe we should try to be consistent and make the backslash an escape character.  Always.

Suppose we fix bug 898594 using the \uXXXX syntax.  I'm not sure how we'd then allow to write a literal "\u" which is not to be escaped.

I tried the folliwng, but they break the parser:

  <foo1 "{{ '\' }}u">
  <foo2 """ {{ "\" }}u"""> 

Maybe we should always assume that \ is an escape char, and if what follows isn't any special opchar (like "u" in the example above), we'd just print it?  So, \a gives a.  Just like in JavaScript.

As far as our double-pass parsing goes, I think we should first detect if the string is maybeComplex (let's change the name from isNotComplex).  If it's maybeComplex, don't unescape anything and pass the raw string on to parseString (second pass).  If it's not maybeComplex, than go back and unescape everything from left to right.

Gandalf, what do you think?

::: lib/l20n/parser.js
@@ +140,5 @@
>      }
>  
> +    var escopchars = {
> +      '"': new RegExp('\\\\"', 'g'),
> +      "'": new RegExp("\\\\'", 'g'),

Can you use regexp literals here?

According to https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Creating_a_Regular_Expression:

    Regular expression literals provide compilation of the regular expression when the script is evaluated. When the regular expression will remain constant, use this for better performance.

    Using the constructor function provides runtime compilation of the regular expression. Use the constructor function when you know the regular expression pattern will be changing, or you don't know the pattern and are getting it from another source, such as user input.

Plus, I think you don't need to escape backslashes to many times in literals.

::: tests/lib/parser.js
@@ +63,2 @@
>  
>      //var ast = parser.parse("<id ''''string\\''''>");

Why is this commented out?  This seems to break badly.  Any idea why?  Should we file another bug?
Comment on attachment 782013 [details] [diff] [review]
patch

Review of attachment 782013 [details] [diff] [review]:
-----------------------------------------------------------------

r- based on comment 5; waiting for response from gandalf
Attachment #782013 - Flags: review?(stas) → review-
Ok, I'm taking another approach.

Trying to figure out how much would it cost to remove separation between complex and simple string overall.

So far got to the point where I have:

 - backslash unescaping
 - unicode unescaping
 - placeables

I don't have:
 - multiquote closings
 - max_placeables counter

And I got performance *gain* on a long settings.lol

You can see the current code here: https://github.com/zbraniecki/l20n.js/blob/parser3/lib/l20n/parser.js#L152
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: