Bug 596229 (OPEN) — Opened 14 years ago, Updated 1 year ago

static global atoms

Categories

(Core :: JavaScript Engine, enhancement, P3)



People

(Reporter: igor, Unassigned)

References

(Depends on 3 open bugs, Blocks 1 open bug)

Details

(Whiteboard: [sp3])

Attachments

(1 file, 1 obsolete file)

Currently a few global atoms are created on startup using js_Atomize calls. These calls create JSString from the corresponding C string chars and hash them before adding into the global atom table. This startup cost can be avoided if the strings will be constructed statically (similar to unit and other static strings) and their hash will be precalculate during compilation.
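For illustration, the compile-time hash precomputation proposed here can be sketched with a modern `constexpr` function (a hypothetical modernization — the original work predates C++11, and `RotHash`/`HashChars`/`kMathHash` are invented names; the rotate-xor recurrence follows the js_HashString loop body discussed in this bug):

```cpp
#include <cstddef>
#include <cstdint>

// One step of the rolling hash, matching the js_HashString loop body:
// hash = ((hash << 4) | (hash >> 28)) ^ ch
constexpr uint32_t RotHash(uint32_t h, uint32_t ch) {
  return ((h << 4) | (h >> 28)) ^ ch;
}

// Hash an ASCII string of known length; usable at compile time (C++14).
constexpr uint32_t HashChars(const char* s, size_t n) {
  uint32_t h = 0;
  for (size_t i = 0; i < n; ++i)
    h = RotHash(h, uint32_t(s[i]));
  return h;
}

// The hash of a common atom such as "Math" becomes a link-time constant,
// so no hashing work remains at startup.
constexpr uint32_t kMathHash = HashChars("Math", 4);
```

This sidesteps the macro-expansion blowup discussed below, since a `constexpr` function evaluates its argument once rather than duplicating it textually.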
callgrind on my iCore7 laptop shows that js_InitCommonAtoms is responsible for 2.5% of execution time when running js -e 'quit()'.
Assignee: general → igor
Attached patch v1 (obsolete) — Splinter Review
The patch is a work in progress. It turns all common atoms into static strings. To make initialization of the static atoms as fast as possible, the patch precomputes the string hashes and adds to the hashtable an addUniqueNoRemovals function that assumes the keys are unique, no entries are ever removed from the table, and the table has enough capacity to accommodate the new entry. With these changes callgrind shows that the execution time of running js -e 'quit()' drops by 2.4%, while the contribution of atom initialization decreases from 2.5% to 0.56%. In principle that could be decreased further by pre-recording the hashtable itself, so it can be constructed with a single malloc and a memcpy from a static template, but the gains would be relatively small compared with the effort.

Another note is that the patch duplicates in jsstaticatomgen.cpp the code for hash calculations from jsstr.cpp and jshashtable.h. Unfortunately it is not possible to include the corresponding headers into the generator, as it runs on the host while the headers depend on the configuration for the target. Still, perhaps the duplication could be avoided via some shared header.

An alternative to running jsstaticatomgen.cpp is to use in jsstaticatom.tbl, instead of ATOMDEF(Math), a set of macros ATOMDEF<N> with a usage like:

  ATOMDEF4(Math, 'M','a','t','h')

Such macros would in principle allow calculating the hash at compile time, but that may push the compilers to the limit. For example, ATOMDEF4 during hash calculation could be defined as:

  #define ATOMDEF4(name, char1, char2, char3, char4) \
      R(R(R(R(uint32(0), char1), char2), char3), char4)

where R implements the body of the loop from js_HashString:

  #define R(hash, new_char) \
      ((((hash) << 4) | ((hash) >> 28)) ^ uint32(new_char))

But since the body of R includes hash twice, the expansion of ATOMDEF(propertyIsEnumerable) would be on the order of 2**length("propertyIsEnumerable"), or 2**20. Clearly a smarter solution is necessary.
So for now I have opted for source code generation. Asking Luke for feedback here.
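The addUniqueNoRemovals fast path described above could look roughly like this (a self-contained sketch with invented names; the real table is SpiderMonkey's js::HashSet, which uses double hashing rather than the linear probing shown here):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a hashtable entry holding a precomputed hash and its key.
struct Entry {
  uint32_t hash = 0;
  const char* key = nullptr;
};

struct Table {
  std::vector<Entry> slots;  // capacity must be a power of two
  explicit Table(size_t capacity) : slots(capacity) {}

  // Preconditions: the key is not already present, no entry was ever
  // removed, and at least one slot is free. Under these assumptions the
  // insert needs no duplicate check, no tombstone handling, and no growth.
  void addUniqueNoRemovals(uint32_t hash, const char* key) {
    size_t mask = slots.size() - 1;
    size_t i = hash & mask;
    while (slots[i].key)          // probe until a free slot is found
      i = (i + 1) & mask;
    slots[i] = {hash, key};
  }
};
```

Skipping the duplicate check and growth test is what makes populating the table with the precomputed common atoms cheap.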
Attachment #475876 - Flags: feedback?(lw)
Before I try to understand the patch, I had a few high-level questions: In the browser, is global atom initialization done per global object or per runtime? I would guess it is done per runtime. In that case, would this be a speedup for browser startup, and about how much time do you think it would save?
(In reply to comment #3)
> In the browser, is global atom initialization done per global object or per
> runtime? I would guess it is done per runtime.

That is right, once per runtime.

> In that case, would this be a speedup for browser startup, and about how
> much time do you think it would save?

There are 3 small wins. First, the runtime initialization on my laptop would be 0.2 ms faster. Second, a few cycles would be spared per GC, as it would not need to mark the common atoms. Third, common atoms would be loaded using a constant pointer rather than two or more indirect loads, as in cx->runtime->atomState.nameAtom.
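The third win can be illustrated with a toy sketch (all types and names here are invented, purely to show the shape of the two access patterns):

```cpp
// Heap-allocated common atoms: reaching the atom costs two dependent
// loads (cx->runtime, then runtime->atomState.nameAtom).
struct AtomState { const char* nameAtom; };
struct Runtime   { AtomState atomState; };
struct Context   { Runtime* runtime; };

const char* GetAtomDynamic(Context* cx) {
  return cx->runtime->atomState.nameAtom;
}

// Static common atoms: the address is a link-time constant, so the
// atom is available without any loads through cx at all.
static const char kNameAtomChars[] = "name";

const char* GetAtomStatic() {
  return kNameAtomChars;
}
```

Both return the same atom; the difference is only in how many memory accesses stand between the caller and the pointer.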
The reason I ask is because the patch seems to make things more complicated, so it seems like it should be justified by some measurable improvement somewhere, as was the case with unit, int, and length-2 strings. Win 1, measured in comment 2, seems very small since browser startup is on the order of 1000s of ms. Win 2 seems like it could show up when marking an empty or close-to-empty heap... perhaps you can demonstrate its best case speedup with a micro-benchmark? As for Win 3, with the tracejit, method-jit and PICs, I don't see us pulling common atoms out of the runtime on any hot paths... but perhaps you can find some with a measurable improvement?
Attached patch v2 — Splinter Review
Updated patch with fixes and more optimizations.
Attachment #475876 - Attachment is obsolete: true
Attachment #476528 - Flags: feedback?(lw)
Attachment #475876 - Flags: feedback?(lw)
(In reply to comment #5)
> Win 1, measured in comment 2, seems very small since browser startup is on
> the order of 1000s of ms.

Callgrind shows that js_AtomizeString takes at least 3% of the browser startup time, with 40% of that time spent on atomization of new atoms. I would like to decrease that as much as possible. This patch is preliminary work showing that pre-computing just the 143 common atoms used by the JS engine can have measurable benefits on startup time.

> Win 2 seems like it could show up when marking an empty or close-to-empty
> heap... perhaps you can demonstrate its best case speedup with a
> micro-benchmark?

A js benchmark that shows clear benefits from the patch is the following:

  var list = [];
  var obj = {};
  var x = 0;
  for (var i = 0; i != 1e6; ++i) {
      var c1 = 10 + Math.floor(i / 1e4);
      var c2 = 10 + Math.floor(i / 100) % 100;
      var c3 = 10 + i % 100;
      var str = String.fromCharCode(c1, c2, c3);
      list.push(str);
      if (obj[str])
          ++x;
  }
  list = null;
  var t = Date.now();
  gc();
  t = Date.now() - t;
  print("gc time:" + t);

It creates 1e6 atoms that turn into garbage during the timed GC. With the patch the time drops from 67 to 51 ms. The win comes from the marking phase, which no longer needs to scan the garbage-full atom hash just to mark the pinned atoms.

> As for Win 3, with the tracejit, method-jit and PICs, I don't see us pulling
> common atoms out of the runtime on any hot paths... but perhaps you can find
> some with a measurable improvement?

The parser often refers to the predefined atoms. The benefit of eliminating the cx->runtime->atomState loads is visible in the following test (it compiles a function containing return arguments[0]+...+arguments[0]):

  var s = "return " + Array(1 << 17).join("arguments[0]+") + "0;";
  Function(s);

callgrind shows a drop in the instruction count from 627 to 625 million, with half of the win coming from the parser and the other half from a more efficient implementation of js_AtomizeString.
(In reply to comment #7)
> Callgrind shows that js_AtomizeString takes at least 3% of the browser
> startup time, with 40% of that time spent on atomization of new atoms.

Atomization may be a significant part of browser startup, but is the atomization of global atoms a significant part of atomization? I don't see any indication of that from the measurements you've quoted.

> It creates 1e6 atoms that turn into garbage during the timed GC. With the
> patch the time drops from 67 to 51 ms. The win comes from the marking phase,
> which no longer needs to scan the garbage-full atom hash just to mark the
> pinned atoms.

Now this looks enticing, but on first glance it seems independent of whether we do any static initialization of global atoms. Is there any simple alternative way of achieving the same benefits?

> callgrind shows a drop in the instruction count from 627 to 625 million,
> with half of the win coming from the parser and the other half from a more
> efficient implementation of js_AtomizeString.

That is a .3% improvement on parsing; still not the killer argument for all the hassle.
(In reply to comment #8)
> Atomization may be a significant part of browser startup, but is the
> atomization of global atoms a significant part of atomization? I don't see
> any indication of that from the measurements you've quoted.

A possible way to speed up browser startup is to add an API that allows quickly setting up the atom table with about 3000 long-term atoms. For that one can add to the hashtable a function that adds unique elements with a pre-calculated hash. This way the browser can provide a memory-mapped file with the char data and hashes for the atoms, and the implementation would just populate the hashtable with new entries, avoiding touching the char data itself until the corresponding atom is referenced. The patch implements such adding of unique elements and uses it for the global atoms from the engine itself.

> Now this looks enticing, but on first glance it seems independent of whether
> we do any static initialization of global atoms. Is there any simple
> alternative way of achieving the same benefits?

An alternative exists, but with heap-allocated global atoms it would not allow simplifying the code by removing the hack of using a hashset to store both the strings and the interned/pinned flags.

> That is a .3% improvement on parsing; still not the killer argument for all
> the hassle.

I am not sure what you mean by the hassle. Do you refer to the extra source generator? If so, then this is the price for simplifying the current code to use just STATIC_ATOM(name) instead of referring to atomState in the runtime. The latter is especially ugly in the parser, with code like:

  tc->parser->context->runtime->atomState.argumentsAtom

Also, for the final patch I plan to merge the atom generator with the keyword generator, so the hassle will be less intrusive.
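The memory-mapped-file idea could be sketched as follows (a hypothetical record layout, not an existing API; in a real implementation the records would come from an mmap'ed image and insertion would use the pre-calculated hashes via the addUniqueNoRemovals-style fast path):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical on-disk record: a precomputed hash plus the location of
// the character data inside the mapped image. Populating the table only
// reads these fixed-size records; the characters themselves stay
// untouched until the atom is actually used.
struct AtomRecord {
  uint32_t hash;    // precomputed with the engine's string hash
  uint32_t offset;  // byte offset of the chars within the image
  uint32_t length;  // number of chars
};

struct LazyAtom {
  uint32_t hash;
  const char* chars;  // points into the mapped image, not copied
  uint32_t length;
};

// Build the long-term atom list from the image without hashing or
// copying any character data.
std::vector<LazyAtom> PopulateFromImage(const char* imageBase,
                                        const AtomRecord* records,
                                        size_t count) {
  std::vector<LazyAtom> atoms;
  atoms.reserve(count);
  for (size_t i = 0; i < count; ++i)
    atoms.push_back({records[i].hash, imageBase + records[i].offset,
                     records[i].length});
  return atoms;
}
```

The point of the layout is that startup cost becomes proportional to the number of records, not to the total length of the atom characters.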
(In reply to comment #9)
> A possible way to speed up browser startup is to add an API that allows
> quickly setting up the atom table with about 3000 long-term atoms. For that
> one can add to the hashtable a function that adds unique elements with a
> pre-calculated hash.

Cool, so would that be most/all of that startup atomization overhead you mentioned earlier? Btw, I think there was a mix-up with the v2 patch posted; it's very short ;-)
Attachment #476528 - Flags: feedback?(lw)
Assignee: igor → general
Assignee: general → nobody
Severity: normal → S3
Blocks: 1803803
Severity: S3 → N/A
Priority: -- → P3
Whiteboard: [sp3]

An update on atomization during startup: the hashes of the well-known atoms are now precomputed in a static table [1], and JSRuntime::initializeAtoms uses them to permanently atomize the well-known atoms at runtime initialization [2].

[1] https://searchfox.org/mozilla-central/rev/e4d8451468be3a0f8a9faa3d37cadf07200821ec/js/src/vm/WellKnownAtom.cpp#21,24-25

js::WellKnownAtomInfo js::wellKnownAtomInfos[] = {
...
   mozilla::HashStringKnownLength(js_##IDPART##_str,              \
                                  sizeof(js_##IDPART##_str) - 1), \

[2] https://searchfox.org/mozilla-central/rev/e4d8451468be3a0f8a9faa3d37cadf07200821ec/js/src/vm/JSAtomUtils.cpp#199,250-252

bool JSRuntime::initializeAtoms(JSContext* cx) {
...
  for (size_t i = 0; i < uint32_t(WellKnownAtomId::Limit); i++) {
    const auto& info = wellKnownAtomInfos[i];
    JSAtom* atom = PermanentlyAtomizeCharsValidLength(
Depends on: 1847672
Depends on: 1847677
Depends on: 1847758
Depends on: 1847862
Depends on: 1848278