Closed Bug 998392 (latin1strings) Opened 10 years ago Closed 10 years ago

Store JS strings as Latin1 unless they actually need UTF-16 encoding

Categories

(Core :: JavaScript Engine, defect, P2)


Tracking


RESOLVED FIXED
mozilla33

People

(Reporter: cpeterson, Assigned: jandem)

References

Details

(Whiteboard: [MemShrink:P2])

This optimization was suggested at the 2014 JS work week. It would reduce string memory usage for content that only uses Latin1 text.
Whiteboard: [MemShrink]
Whiteboard: [MemShrink] → [MemShrink:P2]
Just to clarify, this would not include script sources, correct? It would be nice if we could keep sources in ASCII when we have decompression disabled for really large scripts.
> Just to clarify, this would not include script sources, correct? 

That is correct.
I'm interested in doing this. It's one of the things we know will help memory usage and performance (regexps for instance) and the competition is doing it already.

I'll talk to a bunch of people to see what we actually want to do (UTF8 vs ASCII etc), and post the results in this bug.
Depends on: 1008590
Depends on: 1009466
The plan is to go with ASCII (Latin1) strings. UTF8 is more complicated and the benefits don't really justify that complexity. Also, once we have ASCII strings it will be simpler to switch to full UTF8 if we want that in the future.

Last week(end) I spent some time prototyping this by using ASCII chars for atoms where possible and I got it to the point where a fair number of jit-tests pass. The good news is that with ASCII/Latin1 strings, most algorithms working on strings can be templatized and things Just Work without duplicating the logic. With full UTF8 this would be a lot harder in many cases.
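
For example, the pattern looks roughly like this (illustrative sketch, not the actual patch; the function and the accessor names in the comment are made up):

  #include <cstddef>

  // One algorithm body, instantiated once per character width.
  // CharT is either an 8-bit Latin1 char or a 16-bit jschar.
  template <typename CharT>
  size_t CountSpaces(const CharT *chars, size_t length)
  {
      size_t count = 0;
      for (size_t i = 0; i < length; i++) {
          if (chars[i] == ' ')
              count++;
      }
      return count;
  }

  // Hypothetical dispatch on the string's encoding; the compiler generates
  // both specializations, but the logic is written only once:
  //   return str->hasLatin1Chars()
  //          ? CountSpaces(str->latin1Chars(nogc), str->length())
  //          : CountSpaces(str->twoByteChars(nogc), str->length());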

I considered continuing with this huge patch, but I'm not really looking forward to making this patch even bigger and rebasing it for a few weeks. I'd rather add a shell function that turns on ASCII strings and a shell function that returns whether a string is ASCII/Latin1; then we can convert things incrementally and add jit-tests to avoid breaking it. There's also a lot of ancient, crufty jschar code full of gotos and macros (looking at you, jsdate.cpp and jsnum.cpp) that we can hopefully clean up in the process.

Before I start submitting patches I want to have some agreement on at least the following:

(1) Naming. Options are ASCII/Latin1/8Bit/OneByte: in my prototype patch I went with Latin1 and TwoByte, because we are already using those names elsewhere (see js/public/CharacterEncoding.h for instance).

(2) As discussed in bug 1009466, we can add wrapper classes for jschar* and char*. For my prototype patch I went with TwoByteStringChars and Latin1StringChars, based on the stuff in CharacterEncoding.h. This gives us bounds checks in debug builds, and we can teach the static analysis to report a hazard when an instance of this class is live across a GC call (for bug 903519). The latter requirement means we can't use the stuff in CharacterEncoding.h directly, because we don't want the static analysis to complain about those generic classes, as they are probably also used for chars not owned by strings. (A rough sketch of such a wrapper follows further down in this comment.)

Maybe this is overkill though. One of the alternatives is to add an AutoAssertNoGC argument to chars(), getChars etc like this:

AutoAssertNoGC nogc;
jschar *s = str->twoByteChars(nogc);
JS::TwoByteChars r = str->twoByteRange(nogc);

Or make them return an Auto class:

AutoTwoByteChars chars = str->twoByteChars();
jschar *s = chars.get();

Although I like the wrapper classes and using the static analysis, this is simpler. It means we can keep the JSString methods like range() that return classes defined in CharacterEncoding.h, without having to add wrappers for those too...
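
For reference, an option-(1) wrapper would look roughly like this (simplified sketch; names and details are illustrative, not the actual prototype code):

  #include <cstddef>
  #include "mozilla/Assertions.h"

  // Simplified sketch of a chars wrapper with debug-only bounds checks.
  // The static analysis would be taught to flag instances of this class
  // that are live across a call that can GC.
  class Latin1StringChars
  {
      const unsigned char *chars_;
      size_t length_;

    public:
      Latin1StringChars(const unsigned char *chars, size_t length)
        : chars_(chars), length_(length)
      {}

      unsigned char operator[](size_t i) const {
          MOZ_ASSERT(i < length_);  // compiled out in release builds
          return chars_[i];
      }

      const unsigned char *get() const { return chars_; }
      size_t length() const { return length_; }
  };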

Waldo, Luke: your thoughts on this would be appreciated.
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Flags: needinfo?(luke)
Flags: needinfo?(jwalden+bmo)
(In reply to Jan de Mooij [:jandem] from comment #4)
> (1) Naming. Options are ASCII/Latin1/8Bit/OneByte: in my prototype patch I
> went with Latin1 and TwoByte, because we are already using those names
> elsewhere (see js/public/CharacterEncoding.h for instance).

Does the patch use ISO-8859-1 codepoints in [128,256), which have an incompatible encoding overlap with UTF-8? If so, Latin1 seems fine. If not, I suggest using the name ASCII, which makes it explicit that it's just [0,128) and an encoding subset of UTF-8.
(In reply to Dan Gohman [:sunfish] from comment #5)
> Does the patch use ISO-8859-1 codepoints in [128,256), which have an
> incompatible encoding overlap with UTF-8? If so, Latin1 seems fine. If not,
> I suggest using the name ASCII, which makes it explicit that it's just
> [0,128) and an encoding subset of UTF-8.

I'd like to use the full [0,256) range. Some of the extra characters seem fairly common, and it also handles cases where people use strings to store binary data.
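
Concretely, deciding whether a string fits in the 8-bit representation is just a range test against the full [0,256) range; a minimal sketch:

  #include <cstddef>

  // Sketch: a two-byte string can be stored as Latin1 iff every code unit
  // fits in [0,256), i.e. is <= 0xFF -- not just 7-bit ASCII.
  static bool CanStoreAsLatin1(const char16_t *chars, size_t length)
  {
      for (size_t i = 0; i < length; i++) {
          if (chars[i] > 0xFF)
              return false;
      }
      return true;
  }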
(In reply to Jan de Mooij [:jandem] from comment #4)
> I'd rather add a shell function that turns on ASCII strings, and a shell
> function that returns whether a string is ASCII/Latin1, then we can convert
> things incrementally and add jit-tests to avoid breaking it. There's also a
> lot of ancient, crufty jschar code full of gotos and macros (looking at you,
> jsdate.cpp and jsnum.cpp) that we can hopefully clean up in the process.

You are a gentleman and a scholar.

> Waldo, Luke: your thoughts on this would be appreciated.

Let's see, so IIUC there are two wrappers here: Ranges, which perform debug-only range and no-gc asserts, and Handles, which handle moving GC and will have actual runtime overhead.  And the third option was AssertNoGC and using raw jschar*.  Are those the three options here?

Assuming 'yes', then I think we'd want all three options, and default to Ranges unless otherwise required to use Handles (b/c of moving GC) or raw jschar* (b/c of a jschar*-taking API or b/c of a hot code path that isn't fully optimized by Ranges).

As for the syntax of ranges, the:
  AutoAssertNoGC nogc;
  JS::TwoByteChars r = str->twoByteRange(nogc);
syntax seems fine (except the JS:: can be removed inside js/src b/c of NamespaceImports.h).  Although I like terseness, I think it is beneficial that every time we play with chars we have to type "AutoAssertNoGC" which prompts us to think about what can happen in the scope of the guard.
Flags: needinfo?(luke)
(In reply to Luke Wagner [:luke] from comment #7)
> Although I like terseness, I think it is beneficial
> that every time we play with chars we have to type "AutoAssertNoGC" which
> prompts us to think about what can happen in the scope of the guard.

One thought: it might be nice to add new methods:
  jschar twoByteChar(unsigned index);
  char latin1Char(unsigned index);
which avoids the AutoAssertNoGC in various trivial cases.  (I don't know off the top of my head how often this comes up.)
(In reply to Luke Wagner [:luke] from comment #7)
> Let's see, so IIUC there are two wrappers here: Ranges, which perform
> debug-only range and no-gc asserts, Handles, which handle moving GC and will
> have actual runtime overhead.  And the third option was AssertNoGC and using
> raw jschar*.  Are those the three options here?

Sorry for not being clear; the question was more about how to (safely) deal with the many cases where we want a char pointer but can't GC while we're using it. We can either (1) add brand new wrapper classes (TwoByteStringChars etc.) that are checked by the static analysis to not be held live across a GC and use these everywhere, or (2) use the AutoAssertNoGC argument + the good old jschar*/Range stuff that exists today.

It sounds like you're fine with option (2) (+ ranges as the preferred option instead of jschar). Initially I did (1) but now I think (2) may be the cleaner/simpler option. Terrence, what do you think?

(In reply to Luke Wagner [:luke] from comment #8)
> One thought, it might be nice to add a new method:
>   jschar twoByteChar(unsigned index);
>   char latin1Char(unsigned index);
> which avoids the AutoAssertNoGC in various trivial cases.  (I don't know off
> the top of my head how often this comes up.)

Good point! I noticed there are a number of places where we only check the first char and fall back to the full implementation for the rest (JSFlatString::isIndex for instance). This method would be very useful there.
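
For example, a sketch of what that fast path could look like (hasLatin1Chars() and the per-character accessors are the proposed API above, not something in the tree yet):

  // Sketch of a first-character fast path in the spirit of
  // JSFlatString::isIndex: no AutoAssertNoGC is needed for the early-out.
  static bool StartsWithDigit(JSFlatString *str)
  {
      if (str->empty())
          return false;
      jschar c = str->hasLatin1Chars() ? str->latin1Char(0) : str->twoByteChar(0);
      return c >= '0' && c <= '9';
  }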
Flags: needinfo?(terrence)
I had initially thought (1) because I was not sure if (2) would work universally or allow us to easily integrate bounds checking with Range. However, since it seems both of those concerns are unfounded, I agree that we should do it in pure C++ and not complicate things with the static analysis.
Flags: needinfo?(terrence)
Flags: needinfo?(jwalden+bmo)
Depends on: 1013917
Depends on: 1013531
Depends on: 1014114
Depends on: 1015902
Alias: latin1strings
Summary: Store JS strings as ASCII unless they actually need UTF-16 encoding → Store JS strings as Latin1 unless they actually need UTF-16 encoding
Depends on: 1015917
Depends on: 1016379
Depends on: 1017107
Depends on: 1018243
Depends on: 1018311
jandem, one of your patches regressed 2 (maybe 3) SunSpider tests: base-64, validate-input and format-xparb, and PDFJS from Octane. All this only on the Mac machines from AWFY (Win doesn't show regressions and FFOS is broken).
(In reply to Guilherme Lima from comment #11)
> jandem, one of your patches regressed 2 (maybe 3) tests of Sunspider:
> base-64, validate-input and format-xparb, and PDFJS from Octane. All this
> only on the Mac machines from AWFY (Win doesn't show regressions and FFOS is
> broken).

Yes I noticed; bug 1018568 should fix these.
Depends on: 1018893
Depends on: 1019543
Depends on: 1019512
Depends on: 1019585
Depends on: 1020420
Depends on: 1020869
Depends on: 1021209
Depends on: 1021714
Depends on: 1023778
Depends on: 1024038
Depends on: 1024518
With the patches that landed so far + my local patch queue I can run SunSpider in the shell. I'm seeing about a 5-7% win, most of it on regexp-dna (78% faster).

This confirms Brian's analysis that 8-bit strings should fix the regexp-dna regression from irregexp.

  regexp:              1.78x as fast      16.1ms +/- 3.0%      9.1ms +/- 4.7%      significant
    dna:               1.78x as fast      16.1ms +/- 3.0%      9.1ms +/- 4.7%      significant

  string:              1.030x as fast     49.7ms +/- 1.1%     48.2ms +/- 1.3%      significant
    base64:            ??                  4.8ms +/- 1.6%      5.1ms +/- 9.3%      not conclusive: might be *1.078x as slow*
    fasta:             ??                  5.4ms +/- 1.5%      5.6ms +/- 7.7%      not conclusive: might be *1.037x as slow*
    tagcloud:          1.069x as fast     14.6ms +/- 1.3%     13.6ms +/- 1.0%      significant
    unpack-code:       1.034x as fast     18.7ms +/- 1.4%     18.1ms +/- 0.3%      significant
    validate-input:    1.079x as fast      6.2ms +/- 1.1%      5.8ms +/- 1.4%      significant
Wow, impressive! It's great that this is a time win as well as a space win :)

Is the improvement just because the more compact data means we stress the data caches less? It'd be interesting to see some numbers from Cachegrind for regexp-dna...
(In reply to Nicholas Nethercote [:njn] from comment #14)
> Is the improvement just because the more compact data means we stress the
> data caches less? It'd be interesting to see some numbers from Cachegrind
> for regexp-dna...

That could be part of it (regexp-dna creates some huge strings), but irregexp also emits different code (it can load 4 characters at once instead of 2) and also has some ASCII/Latin1-only optimizations.
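
To illustrate the "load 4 characters at once" point (generic sketch, not irregexp's actual code):

  #include <cstdint>
  #include <cstring>

  // With 1-byte chars, one 32-bit load/compare covers 4 characters; with
  // 2-byte chars the same load only covers 2, so fixed-pattern checks need
  // half as many memory operations on Latin1 strings.
  static bool Matches4Latin1Chars(const unsigned char *chars, const char pattern[4])
  {
      std::uint32_t a, b;
      std::memcpy(&a, chars, sizeof(a));    // 4 Latin1 characters in one load
      std::memcpy(&b, pattern, sizeof(b));
      return a == b;
  }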
Depends on: 1025875
Depends on: 1026438
Depends on: 1026680
Depends on: 1027528
Depends on: 1028866
Depends on: 1028867
Depends on: 1029622
The JS engine work is mostly done; we now run jit-tests with --latin1-strings on tbpl.

Today I hacked the browser just enough that I could load some websites, to do some measurements.

I opened Gmail (already logged in), waited 15 seconds after it finished loading and opened about:memory in a second tab. Numbers are as follows for the "zones / strings" subtree under Other measurements (I repeated the measurement 4 times for each configuration to minimize noise):

Before (numbers are in MB):

strings, malloc-heap, gc-heap
10.03, 7.34, 2.69
10.01, 7.32, 2.69
10.22, 7.47, 2.75
10.39, 7.57, 2.82

After (same columns):

6.05, 3.42, 2.62
5.99, 3.36, 2.63
5.95, 3.34, 2.61
5.97, 3.36, 2.61

I also modified JSString::sizeOfExcludingThis to dump the number of Flat and Inline strings, and the number of Latin1 vs non-Latin1 strings before/after:

Before:

Total:      138,056
Inline:      83,458 (60.4%)
Flat:        52,362 (37.9%)

After:

Total:      139,042
Inline:     117,919 (84.8%)
Flat:        20,123 (14.4%)

Latin1:     137,821 (99.1%)
Non-Latin1:   1,221 ( 0.9%)

So almost all strings are Latin1 (I use western languages, so it's a best-case scenario, but still). Also, with Latin1 strings enabled, most strings that used to have malloc'ed chars can now store their chars inline. This should also help reduce heap fragmentation, especially once we have moving/compacting GC for strings.

These numbers should be taken with a grain of salt of course. My browser changes are pretty hackish and the main challenge with the browser is to not regress performance; that'll take more time to get right.
(Oh these numbers are with a 32-bit opt build on OS X.)

Also, it's interesting that there are a lot more inline strings, but the gc-heap size decreases (slightly). Fat inline strings are bigger than (flat) strings, so it wouldn't be unexpected for gc-heap to increase, but it's possible that many of the inline strings are small, so we now use inline instead of fat-inline strings for a lot of them.
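
The arithmetic behind that guess, with a made-up inline-buffer size (not SpiderMonkey's real threshold):

  #include <cstddef>

  // INLINE_BYTES is a made-up figure, not the actual inline-buffer size.
  // The point: halving the character size doubles how many characters fit
  // inline, so strings that used to need a fat-inline (or malloc'ed) buffer
  // can become thin-inline.
  constexpr size_t INLINE_BYTES = 8;
  constexpr size_t MAX_INLINE_TWO_BYTE_CHARS = INLINE_BYTES / sizeof(char16_t); // 4
  constexpr size_t MAX_INLINE_LATIN1_CHARS   = INLINE_BYTES / sizeof(char);     // 8

  static_assert(MAX_INLINE_LATIN1_CHARS == 2 * MAX_INLINE_TWO_BYTE_CHARS,
                "Latin1 doubles the inline character capacity");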
Nice!

It's interesting to see that, while non-Latin1 strings are rare for western-language pages, they are not non-existent.  It'd be interesting to hack about:memory to determine whether these are actually non-Latin1 strings or whether these strings are produced by sites that don't deflate (e.g. external strings).
(In reply to Luke Wagner [:luke] from comment #18)
> It's interesting to see that, while non-Latin1 strings are rare for
> western-language pages, they are not non-existent.  It'd be interesting to
> hack about:memory to determine whether these are actually non-Latin1 strings
> or whether these strings are produced by sites that don't deflate (e.g.
> external strings).

Yeah, I'll try to find out if they are really non-Latin1. I also just noticed there are some functions that still create non-Latin1 strings (js::Int32ToString, js::IndexToString); I'll see how many look like integer strings etc.
If you guys are interested, and can provide a build and test instructions, I can help get it tested by a few l10n community folks in non-Latin1 locales to see how it affects them.
Of the roughly 1% of strings that use a TwoByte representation, about 40% really need to be TwoByte (because they contain non-Latin1 chars). Some of the other strings (that ideally would be stored as Latin1) look like random strings (JSON, URLs). There are a number of places where we could attempt to deflate; I'll look into that as a follow-up.

This is with Gmail; results may be different for other websites.

(In reply to Zibi Braniecki [:gandalf] from comment #20)
> If you guys are interested, and can provide us a build and test
> instructions, I can help you get it run by a few l10n community folks in
> non-latin1 countries to see how it affects them.

Thanks for the offer! I'd like things to be more stable first, so that we can get more reliable numbers, but once we're there that would be appreciated :)
Great measurements! Your analysis sounds plausible to me.

On 64-bit the gc-heap results should be even better, because the threshold where we go from inline to fat-inline is a little larger.
Depends on: 1032238
Blocks: 962603
Depends on: 1032726
Depends on: 1034191
Depends on: 1034627
Depends on: 1034689
Depends on: 1037869
Depends on: 1037871
Depends on: 1037886
Depends on: 1038093
Depends on: 1038095
Depends on: 1038099
Depends on: 1039551
This has been enabled in two nightlies now and I'm pretty sure it will stick. Next week (after the merge) I'll remove the flag (and some dead code), fix some of the follow-up bugs I filed and close this bug.
\o/
Depends on: 1041469
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
No longer depends on: 1033946
See Also: → 1231541