Closed Bug 1575947 Opened 1 year ago Closed 1 year ago

UTF-8 and UTF-16 atomization functions don't produce the same atom (was: The page goes blank after loading on


(Core :: JavaScript Engine, defect, P1)

70 Branch



Webcompat Priority ?
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox70 --- fixed


(Reporter: ksenia, Assigned: Waldo)


(Regression, )


(Keywords: regression)


(2 files)

Steps to reproduce:
In Firefox Nightly navigate to and observe the behavior.

Site loads

Site loads, but then goes blank due to an error: TypeError: "_ is not a function" here:

       _ = k.µ,
      E = k.ƒ,
      S = n(4),
      w = S.isEmpty,
      L = S.kebabCase,
      N = S.isFunction,
      x = S.omit;
      var M = function (e) {
        var t = e.props,
        n = e.state;
        return _(f(), 'c-formgroup', t.cssUtility, t.cssModifier, E(n.isFocused) ('-focus'), E(!w(t.txtError)) ('-error'))

When I set dom.script_loader.external_scripts.utf8_parsing.enabled to be false, the error goes away

From mozregression:

 6:24.69 INFO: Narrowed inbound regression window from [5f48ef70, 613d59f3] (3 builds) to [a3e42056, 613d59f3] (2 builds) (~1 steps left)
 6:24.69 INFO: No more inbound revisions, bisection finished.
 6:24.69 INFO: Last good revision: a3e4205698b8711791e4bc97ceaceabaa137c553
 6:24.69 INFO: First bad revision: 613d59f386c7a757825b4ad7055db072075f5670
 6:24.69 INFO: Pushlog:
Flags: needinfo?(jwalden)

Ooh, this looks juicy! (And unlike the other ostensible UTF-8 bug, much more likely to be directly UTF-8-relevant.) I am on this for sure. Debugging nowish...

Flags: needinfo?(jwalden)

Still grunging through this, but right now I don't think this is a UTF-8 tokenizing bug.'s an atomization bug. 😲 YES I AM SERIOUS HERE.

And somehow I just happened to stumble upon it.


Attached file Reduced testcase
[jwalden@find-waldo-now src]$ cat /tmp/fail.js 
var obj1 = { "µ": "PASS" };
print(obj1.µ || "FAIL", "U+00B5 MICRO SIGN");

var obj2 = { "𧊶": "PASS" };
print(obj2.𧊶 || "FAIL", "U+272B6 CJK UNIFIED IDEOGRAPH-272B6");
[jwalden@find-waldo-now src]$ dbg/js/src/js --file /tmp/fail.js 
[jwalden@find-waldo-now src]$ dbg/js/src/js --utf8-file /tmp/fail.js 
(compiling '/tmp/fail.js' as UTF-8 without inflating)

Stepping through in a debugger, the obj1 property lookup uses a jsid containing a JSString* having one bit pattern...but the property in obj1 itself of identical name uses a jsid containing a JSString* having a different bit pattern. And if you check both strings, they both say they're atoms.

In other words, something is going very wrong about atomizing UTF-8-valued identifier names.

In the string-literal case, we construct the atom by appending full code points to a UTF-16 buffer, then we drain the buffer into a JSAtom*.

But in the identifier-name tokenizing case, we instead call this function (in this testcase, directly upon the source units themselves):

template <>
/* static */ MOZ_ALWAYS_INLINE JSAtom*
    JSContext* cx, mozilla::Span<const mozilla::Utf8Unit> units) {
  auto chars = ToCharSpan(units);
  return AtomizeUTF8Chars(cx,, chars.size());

And apparently the two methods produce disagreeing "atoms".

Assignee: nobody → jwalden
Component: DOM: Core & HTML → JavaScript Engine
Summary: The page goes blank after loading on → UTF-8 and UTF-16 atomization functions don't produce the same atom (was: The page goes blank after loading on

Minimal jsapi-test fodder:

BEGIN_TEST(testAtomizeUTF8AndUTF16Agree) {
  // U+00B5 MICRO SIGN ("µ")
  static const char16_t MicroSign16[] = u"\u00B5";
  static_assert(mozilla::ArrayLength(MicroSign16) - 1 == 1,
                "single-code-point UTF-16 has length 1");
  static const char MicroSign8[] = u8"\u00B5";
  static_assert(mozilla::ArrayLength(MicroSign8) - 1 == 2,
                "single-code-point UTF-8 has length 2");

  JS::Rooted<JSAtom*> atom16(cx, js::AtomizeChars(cx, MicroSign16, 1));
  CHECK(atom16->latin1OrTwoByteChar(0) == u'µ');
  CHECK(atom16->length() == 1);

  JS::Rooted<JSAtom*> atom8(cx, js::AtomizeUTF8Chars(cx, MicroSign8, 2));
  CHECK(atom8->latin1OrTwoByteChar(0) == u'µ');
  CHECK(atom8->length() == 1);

  CHECK(atom16 == atom8);
  return true;
Priority: -- → P1
Pushed by
AtomizeUTF8OrWTF8Chars should look up non-ASCII, Latin-1 single-code-point static strings itself.  (Static strings are not recorded as permanent atoms or in the normal atom table, so it's always necessary to specially look them up.)... r=jonco

Comment on attachment 9087672 [details]
Bug 1575947 - AtomizeUTF8OrWTF8Chars should look up non-ASCII, Latin-1 single-code-point static strings itself. (Static strings are not recorded as permanent atoms or in the normal atom table, so it's always necessary to specially look them up.)...

Beta/Release Uplift Approval Request

  • User impact if declined: Property accesses in source code using single-code-point property names in the range [U+0080, U+00FF] will evaluate to undefined, not the actual value of the property. And it'll be very difficult for web developers to figure out, because the string created for the property will function in every respect like it should -- except for pointer identity and therefore property-lookup equivalence. Basically impossible to debug without going in with a C++ debugger.

Possibly other bad symptoms could manifest, beyond just property accesses in source: in principle every caller of js::AtomizeUTF8Chars represents a risk. It appears, for example, global imports in asm.js may be susceptible to errors due to this bug. But only callers that care about JSAtom* pointer equality (e.g. they use the atom to access a property) will actually be buggy. I don't immediately know how many do that. At a quick skim I'm guessing the TokenStream caller that stumbled over this is the most likely caller to do this, but I can't say none of the others do it.

  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The static string store is simple, and other paths to create a string already have code similar to this. The other cases have it simpler -- they don't have to compute the code point to check for a static string of. But the decoding and probing we do here is for exactly one UTF-8 bit encoding and is extremely simple -- and tested by the included test -- so it's very hard to see any room for error.
  • String changes made/needed:
Attachment #9087672 - Flags: approval-mozilla-beta?

While the URL testcase started misbehaving with bug 1554362, the root problem was introduced by bug 1494942.

The comment // Since the static strings are all ascii, we can check them before trying anything else. introduced by that patch is probably most of what led people astray. Had that been true, the fact of disjoint storage of atoms in three distinct places (all of which must be probed when atomizing character sequences to be sure a duplicate atom isn't created) wouldn't have mattered, because "\u00B5" and so on would not have been in static strings and would have had to be created/record in the normal atoms set.

Regressed by: 1494942
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

This isn't a new issue and we're out of betas this cycle. Let's let it ride with Fx70.

Attachment #9087672 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Duplicate of this bug: 1563896
Regressions: 1620290
You need to log in before you can comment on or make changes to this bug.