Open Bug 1231541 Opened 9 years ago Updated 8 months ago

Add a fast API to check for ASCII strings

Categories

(Core :: JavaScript Engine, defect)

defect

Tracking

()

People

(Reporter: jandem, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Having a fast API to check whether a string contains only ASCII chars, would help the Servo people (because, unlike Latin1, ASCII maps directly to UTF8).

To do this we can either:

(1) Restrict Latin1 strings to ASCII characters. A bit unfortunate but if measurements indicate most Latin1 strings only contain ASCII chars, maybe it's not too bad. (I think V8 switched from ASCII to Latin1 strings at some point; we might want to check if that's true and what their reasons were.)

(2) Add a ASCII-only bit to JSString. If set, the string is known to contain only ASCII characters. Because strings are immutable, doing this is fairly easy: ropes and dependent strings can just propagate the bit and some places where we create strings statically know the result only contains ASCII chars (because it's a number or date, for instance).

Question is also whether this flag could be used in SpiderMonkey itself (or in Gecko). ToUpperCase for instance has to check for some special characters outside the ASCII range, but that's not a perf-sensitive code path.
Severity: normal → S3

Bug 998392 has the context for latin1 vs. ascii strings.

See Also: → latin1strings

ni? me to poke a bit at this when time allows:

  • If needed, we have a couple bits here apparently.
  • But probably the easiest thing for now is hacking Latin1 -> Ascii, implementing some of the optimizations it unlocks, and see how benchmarks look.
Flags: needinfo?(emilio)

sfink pointed out in #spidermonkey that we actually use at least one of those bits.

Never mind, we were misreading the code. Those bits live elsewhere.

Ok, again: https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=a705827c1ae4f3d7902655d1ca68a7aac164e8eb&newProject=try&newRevision=7124ff3d6a385437b274af65883a4a42c6c24356&framework=13&page=1

It seems this doesn't meaningfully move the needle for Firefox. It's probably still worth it for Servo / what not since that affects all Rust strings (in Firefox the extra validation only affects UTF8String).

So maybe if someone working on Servo stuff wants to they still can push this forward?

Flags: needinfo?(emilio)

Looking into this.

Assignee: nobody → dazabani
Status: NEW → ASSIGNED

I started implementing approach (2) in D176893 (plus the attached patch for servo/mozjs), and I even found a way to store the flag without using the GC reserved bits, but so far it hasn’t improved Servo’s results in the benchmarks I’ve tried (RoboHornet (benchmarks 1, 2, 8–11, 13, 15, 16), Dromaeo (DOM Core)).

I also haven’t really figured out how to measure the impact rigorously, only running benchmarks manually with profiling and/or code path counters. Servo doesn’t have any benchmarking infrastructure yet, and many of the relevant CSS and DOM benchmarks currently crash due to Servo bugs.

The biggest challenge (imo) with the ASCII-only flag approach is that there are many types of JSString and many code paths for creating them, and not all code paths have an obvious opportunity to check if the contents are ASCII-only and set the flag. So while I’ve done several…

  • NewStringCopyN (the factory that Servo primarily uses) with length < 16
  • JSRope::flatten
  • JSRope ctor(left, right, length)
  • JSDependentString ctor(base, start, length)
  • AllocateNewPermanentAtomNonStaticValidLength (used by ParserAtom and well-known "Symbol.foo")
  • StaticStrings ("\x00" through "\xFF", [A-Za-z0-9$_]{2}, "100" through "255")

…there are many more I haven’t yet done:

  • CanStoreCharsAsLatin1 callers with length < 16 — these should be easy, now that NewStringCopyN is done
  • CanStoreCharsAsLatin1 callers with length ≥ 16 — these use a Rust impl with both SIMD and non-SIMD versions, and I think we would need to make those check for Latin1-only and ASCII-only conditions simultaneously to avoid a perf regression
  • WellKnownAtom — these should be easy
    • CommonPropertyNames (FOR_EACH_COMMON_PROPERTYNAME)
    • Symbol (JS_FOR_EACH_WELL_KNOWN_SYMBOL)
    • ProtoKey (JS_FOR_EACH_PROTOTYPE)
  • JIT string ops — anything that uses ConcatInlineString/CopyStringChars/CopyStringCharsMaybeInflate
    • CodeGenerator::visitSubstr
    • CreateDependentString::generate
    • JitRealm::generateStringConcatStub
  • callers of ctors that just take ownership of the given buffer — neither searchfox nor clangd can find these for some reason
    • JSLinearString ctor(chars, length)
    • JSThinInlineString ctor(length, chars)
    • JSFatInlineString ctor(length, chars)
    • JSExternalString ctor(chars, length, callbacks)
    • NormalAtom ctor(length, chars, hash), ctor(chars, length, hash)
    • FatInlineAtom ctor(length, chars, hash)

Any thoughts? Should we continue with this approach, adding ASCII-only checks to the remaining code paths? Or are we better off trying approach (1), if restricting Latin1 strings to ASCII doesn’t regress Firefox? Either way, how best can we measure the impact on our end?

Flags: needinfo?(jdemooij)

(In reply to Delan Azabani from comment #10)

Any thoughts? Should we continue with this approach, adding ASCII-only checks to the remaining code paths? Or are we better off trying approach (1), if restricting Latin1 strings to ASCII doesn’t regress Firefox? Either way, how best can we measure the impact on our end?

Sorry for the delay. Adding an is-ASCII bit and setting it in all the relevant places is a fair amount of work and overhead, especially considering that we're not using it upstream, and the team is pretty busy at the moment. I think before we do this work, we should evaluate our longer-term options, for example what if we added support for native UTF8/WTF8 strings?

Short-term it would be simplest and fastest to try limiting Latin1 strings to the ASCII range in Servo's fork of SpiderMonkey by changing MAX_LATIN1_CHAR. I don't think we want to make that change for Firefox because it'd have no benefits for our users at this point, and it would risk perf/memory regressions. We could consider adding a configure flag for this if the patch is straight-forward though.

Flags: needinfo?(jdemooij)

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: dazabani → nobody
Status: ASSIGNED → NEW
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: