Closed Bug 965130 Opened 6 years ago Closed 6 years ago

pdf.js: create less garbage while parsing

Categories

(Firefox :: PDF Viewer, defect, P4)

21 Branch
x86
Windows 7
defect

Tracking

()

RESOLVED FIXED
Firefox 29

People

(Reporter: njn, Assigned: njn)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [pdfjs-c-performance][MemShrink])

If I open http://stacks.math.columbia.edu/download/book.pdf, which is a 4000+ PDF, and wait for the file to finish being read, about:memory tells me this:

> │    ├──820.95 MB (72.97%) --
> workers(localhost)/worker(../src/worker_loader.js, 0x7f232bd41000)
> │    │  ├──799.56 MB (71.07%) -- zone(0x7f232414a800)
> │    │  │  ├──410.71 MB (36.51%) -- strings
> │    │  │  │  ├──285.73 MB (25.40%) -- notable
> │    │  │  │  │  ├──245.05 MB (21.78%) ++ (2593 tiny)
> │    │  │  │  │  └───40.68 MB (03.62%) ── string(length=15, copies=666537,
> "[object Object]")/gc-heap
> │    │  │  │  ├──118.15 MB (10.50%) ── short-gc-heap
> │    │  │  │  └────6.83 MB (00.61%) ++ normal

What a lot of strings. And looking in detail at them, there are *lots* of entries like this:

> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=10, copies=6174, "http://sta")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=11, copies=6174, "http://stac")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=12, copies=6174, "http://stack")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=13, copies=6174, "http://stacks")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=14, copies=6174, "http://stacks.")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=15, copies=6174, "http://stacks.m")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=16, copies=6174, "http://stacks.ma")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=17, copies=6174, "http://stacks.mat")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=18, copies=6174, "http://stacks.math")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=19, copies=6174, "http://stacks.math.")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=20, copies=6174, "http://stacks.math.c")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=21, copies=6174, "http://stacks.math.co")/gc-heap
> │    │  │  │  │  │  ├────0.38 MB (00.03%) ── string(length=22, copies=6174, "http://stacks.math.col")/gc-heap

Goodness! It looks like pdf.js is building up strings one char at a time, and indeed, src/core/parser.js does exactly that in the Lexer class.

But isn't the JS engine supposed to do well in such cases, thanks to the use of ropes? Well, it turns out that thanks to http://dxr.mozilla.org/mozilla-central/source/js/src/vm/String.cpp#398, ropes aren't used if the result of a concatenation can fit into a short string. And on 64-bit platforms, a short string can have up to 23 chars. And this test case causes *many* strings shorter than 23-chars to be built up. (And that's why all the entries above have "gc-heap" values, but don't have "malloc-heap" values, because the chars are inline in the GC thing.)

I have a patch that converts pdf.js to use Array.join instead of += to build up these strings. It reduces the size of the "strings/notable" sub-tree to just over 10 MiB, and reduces the peak RSS encountered while reading the file from ~1130 MiB to ~800. I will submit a GitHub pull request to get that landed in the pdf.js GitHub repo.

But I also I wonder if we should reconsider the JS engine's behaviour here. I tried commenting out the short-string code in js::ConcatStrings(), and sure enough the peak RSS dropped to ~800 MiB, very similar to what I got when using Array.join.
(In reply to Nicholas Nethercote [:njn] from comment #0)
> But I also I wonder if we should reconsider the JS engine's behaviour here.
> I tried commenting out the short-string code in js::ConcatStrings(), and
> sure enough the peak RSS dropped to ~800 MiB, very similar to what I got
> when using Array.join.

I tried this in bug 861251. It improved sunspider a bit but regressed Dromaeo and that's why I reverted that change, see bug 863018. Also note that Ion inlines the short-string path, so you should probably comment out the short-string code there as well, see JitCompartment::generateStringConcatStub.
> I tried this in bug 861251. It improved sunspider a bit but regressed
> Dromaeo and that's why I reverted that change

Fair enough; I can see that the best choice would depend greatly on workload.

One noteworthy thing is that our behaviour is quite different on 32-bit vs 64-bit here. On 32-bit, a JSInlineString can hold 3 jschars (excluding null-termination) and a JSShortString can hold 11 jschars. On 64-bit, the corresponding numbers are 7 and 23. So this particular problem would be much smaller on 32-bit.
Depends on: 965861
Fixed on trunk by bug 965861.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 29
You need to log in before you can comment on or make changes to this bug.