Closed Bug 727615 Opened 13 years ago Closed 6 months ago

Do small dependent strings effectively leak large strings?

Categories

(Core :: JavaScript Engine, defect)


Tracking


RESOLVED FIXED
141 Branch
Tracking Status
firefox141 --- fixed

People

(Reporter: cdleary, Assigned: sfink)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [MemShrink:P3][sp3])

Attachments

(2 files, 1 obsolete file)

I haven't had time to finish investigating this, but there is a possible situation that goes like so:

1. There is a very large string temporary, say from a transient innerHTML or an AJAX request that returned the text of a book.
2. Take a tiny substring of it, making a dependent string that keeps the original alive. Perhaps these dependent strings could result from things like regexp matches as well.
3. Lose all other references to the original large-string temporary, so that it only continues to exist so that that small number of characters can be referred to in the original buffer.

I was originally trying to dump the string GC subheap at mark time and run a postprocessing analysis to determine how often this situation actually occurred during actual web browsing (and things like TBPL). Specifically, we're looking for: a large string which is only held alive by its dependent strings, which refer to a negligible amount of the large string.

There are accounting tricks you can try to play at mark time to detect these sorts of situations. The easiest to detect is: "I'm a large string that is only referred to by one dependent string, so just memcpy the small number of referred-to characters into that dependent string's buffer." One catch is that, even if you detect such a situation, you don't want to spend a lot of time performing memcpy, since that could hurt the perceived responsiveness of the GC.
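The pattern described above can be sketched in a few lines of JavaScript (the sizes, contents, and variable names here are illustrative, not from the original report):

```javascript
// A large transient string, e.g. the text of a book fetched via AJAX.
const book = "All happy families are alike... ".repeat(100000); // 3.2M chars

// A tiny substring. SpiderMonkey may represent this as a *dependent* string:
// a (pointer, length) view into `book`'s character buffer.
const firstWord = book.substring(0, 3);

// If `firstWord` is stored somewhere long-lived and every other reference to
// `book` is dropped, the engine may keep the entire multi-megabyte buffer
// alive just so these 3 characters remain valid.
const longLived = { word: firstWord };
```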
Chris, we can update the MemShrink priority for this when you get some data.
Whiteboard: [MemShrink] → [MemShrink:P3]
I've hit such a case today in cleopatra, the UI for the new built-in profiler [0] [1]. In this web app we parse raw profiles, i.e. strings that can be arbitrarily large (for example 150MB), by splitting the string into lines and using RegExps on the individual lines. After parsing the profile, we only want to keep around a space-optimized representation of the profile and discard the original profile string. However, the space-optimized representation contains substrings from the original profile and thus keeps the original string alive.

I've worked around that by cloning the new representation using JSON.parse(JSON.stringify(obj)) [2]. For the example I tested with, this reduced the "string-chars" memory bucket in the parser worker from 300MB to 2MB.

[0] https://developer.mozilla.org/en/Performance/Profiling_with_the_Built-in_Profiler
[1] http://tests.themasta.com/cleopatra/?usesample
[2] https://github.com/mstange/cleopatra/commit/0d6cf95e7354a1a1659c3693e6a3e346e671ca1f
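The JSON round-trip workaround from this comment can be sketched as follows (the "profile" and its parsed representation here are stand-ins, not cleopatra's real data structures):

```javascript
// Forcing a deep copy allocates fresh strings that do not share the original
// profile's character buffer, so the large string can be garbage collected.
function detach(obj) {
  return JSON.parse(JSON.stringify(obj));
}

// A stand-in for a huge raw profile and its space-optimized representation.
const rawProfile = "functionA;functionB;functionC\n".repeat(1000);
const optimized = { topFrame: rawProfile.slice(0, 9) };

// `optimized.topFrame` may be a dependent string pinning `rawProfile`;
// after detaching, it is an independent copy with the same value.
const safe = detach(optimized);
```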
Assignee: general → nobody
Blocks: 1058584
Blocks: 1238456
This keeps coming up every once in a while, and was also mentioned in a blog post mraleph wrote this week about string representations in V8/Dart. Thinking about it some more, it seems we might be able to fix this by making string->base a weak pointer. Then, we sweep all strings and when IsAboutToBeFinalized(string->base) is true, we know the base string is only used by dependent strings and we undepend it. This requires sweeping all strings except atoms (separate arenas for dependent strings won't work because strings can be turned into dependent strings later) and I'm sure there are other problems to overcome, but it might be worth investigating.
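As a rough model of this idea (the representation and names here are hypothetical, nothing like SpiderMonkey's real data structures): after marking, any dependent string whose base was not otherwise marked copies out just the characters it uses and drops the base pointer.

```javascript
// Toy model: a dependent string is {base, offset, length, chars}; the base
// holds the actual character buffer. "Undepending" copies the referenced
// characters out of the base so the base's buffer can be collected.
function sweepDependents(dependents, isMarked) {
  for (const s of dependents) {
    if (s.base && !isMarked(s.base)) {
      s.chars = s.base.chars.slice(s.offset, s.offset + s.length); // own copy
      s.base = null; // sever the edge; the large buffer is now collectable
    }
  }
}

// Demo: a 1006-char base kept alive only by a 6-char dependent string.
const base = { chars: "x".repeat(1000) + "needle" };
const dep = { base, offset: 1000, length: 6, chars: null };
sweepDependents([dep], () => false); // base is unmarked, so undepend
```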
Severity: normal → S3

I wonder if we could do a variant of comment 4, because sweeping all strings seems unfortunate. This is also an implementation of making string→base a weak pointer.

During regular marking, when a string→base edge is encountered where the base has not yet been marked, effectively push the base onto the bottom of the mark stack (possibly implemented with a separate string-only mark stack). When we finish the rest of marking and are about to process the first of these deferred strings, switch to a marking mode that is exactly the same as regular marking except that it sets a BASE_ONLY bit on all marked strings. Then, during this deferred marking phase, sever dependent strings from their base if the BASE_ONLY bit is set.

MarkDependent():
  if isMarked(base):
    return
  elif base.flags & BASE_ONLY:
    convert dependent string to a regular linear string
    # No need to trace(base), even though it may be kept alive by something else.
  elif inDeferredMarking():
    base.flags |= BASE_ONLY
    trace(base)
  else:
    deferredMark(base)

But also note that I may be making short dependent strings into inline strings, which would reduce the number of these dependent-preserved base strings.

A related alternative would be to not use the BASE_ONLY bit, and instead queue up dependent strings with unmarked bases. In the final phase, if their bases are still unmarked, sever (undepend) them. This is equivalent to Jan's comment 4, it's just recording all strings to sweep instead of scanning all strings. This might require pushing a lot of dependent strings onto the alternative stack, though, especially if the bases are marked late in the regular marking pass. Most strings are eagerly marked, and so currently do not require mark stack space. And in mstange's case at least, there were many dependent strings with a common base.

Another intermediate possibility would be to construct a list of string arenas containing at least one dependent string that needs sweeping.

Unless I'm missing something, the main drawbacks of the BASE_ONLY scheme are: (1) I don't think we have a free bit readily available in string flags; (2) the complexity of pushing onto a separate stack or onto the bottom of the stack; (3) this would probably end up with at least one extra branch in string marking; (4) this trick can only be used once. If we want anything else to use a similar "everything is now marked except for those reachable through edges of type X" trick, it would collide (unless the reachable types are fully disjoint).

Fight me.

Blocks: 1803803

After some further thought and discussion with jonco:

First, there's no need to worry about this at all unless we have evidence it is a problem.

Second, it is possible to push onto the bottom of the stack (unshift) relatively easily: push the bottom entry onto the top, and overwrite the bottom entry with the unshifted value. Our stack doesn't have fixed-length entries, but there are only two sizes, and the size can be determined from the first word, so this can all be fixed up by tacking on some padding when unshifting a small entry into a big entry's place (strings will always be small entries, so the other way around isn't needed). Since I think the large entries are 2x the size of the small entries, this could also be done by simply duplicating the small entry, since I think re-marking an already-marked entry is ignored.
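The unshift trick can be modeled on a plain array (the real mark stack has tagged entries of two sizes; this sketch assumes fixed-size entries):

```javascript
// Push the current bottom entry onto the top, then overwrite the bottom slot
// with the new value. The order of existing entries changes, but marking does
// not depend on order, and re-marking an already-marked entry is ignored.
function unshiftViaSwap(stack, value) {
  if (stack.length === 0) {
    stack.push(value);
  } else {
    stack.push(stack[0]);
    stack[0] = value;
  }
}

const stack = ["a", "b", "c"];
unshiftViaSwap(stack, "base"); // stack is now ["base", "b", "c", "a"]
```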

Third, jonco suggested that a far simpler heuristic would be to copy the chars if the base string is x% larger than the dependent string. No funky extra sweep phase with a bizarro double-ended stack required.

Anyway, I'll let this sit here until we find evidence that this is an issue.

Whiteboard: [MemShrink:P3] → [MemShrink:P3][sp3]

(In reply to Steve Fink [:sfink] [:s:] from comment #6)

Anyway, I'll let this sit here until we find evidence that this is an issue.

Bug 1931252 is a new report about this issue.

Blocks: 1931252

I do also wonder if this might be related to https://bugzilla.mozilla.org/show_bug.cgi?id=1931717 ...

(In reply to Steve Fink [:sfink] [:s:] from comment #6)

...
Third, jonco suggested that a far simpler heuristic would be to copy the chars if the base string is x% larger than the dependent string. No funky extra sweep phase with a bizarro double-ended stack required.

This suggestion seems like a quick and easy improvement. I don't think we need a huge amount of additional evidence to justify implementing it.

The x% heuristic prevents the problem of a short(ish) substring keeping the whole original string alive, at the cost of more memory use and allocation overhead if the original string is being sliced up into lots of substrings that collectively cover all or most of the original string (imagine splitting a whole file into lines, for example). Bug 1931252, among others, provides evidence that the heuristic is a good idea.
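The heuristic's decision function could be as simple as the following sketch (the threshold value is a made-up placeholder; any real tuning belongs in the eventual patch):

```javascript
// Copy the characters out of the base (rather than creating a dependent
// string) when the base is much larger than the slice being taken.
function shouldCopyChars(baseLength, sliceLength, ratio = 8) {
  return baseLength > sliceLength * ratio;
}

shouldCopyChars(1000000, 50); // true: don't pin ~1 MB of chars for 50 of them
shouldCopyChars(120, 100);    // false: the slice covers most of the base
```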

The worst case would be for shingles:

var s = ...1 MB string...;
var shingles = [];
for (let i = 0; i < s.length - 50; i++) {
  shingles.push(s.substr(i, 50));
}

With the current setup, this uses 1MB + 999950 * sizeof(JSString) of string-related memory, about 16MB. With the heuristic, it would release the original string and use 999950 * (sizeof(JSString) + 50 bytes), about 63MB (but actually the allocation overhead would round up each 50 byte allocation to 64 bytes, for a total of 76MB). But this is probably too uncommon to worry about, relative to the situation this bug is about.

See Also: → 1834526
Assignee: nobody → sphink
Status: NEW → ASSIGNED
Attachment #9470624 - Attachment is obsolete: true
Pushed by sfink@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cd0651330d71 Avoid keeping large base strings alive with small dependent strings r=jonco
https://hg.mozilla.org/integration/autoland/rev/439c72e35361 report all byteSize-of-string.js failures, not just first one r=jonco
https://hg.mozilla.org/integration/autoland/rev/df255fb45980 apply code formatting via Lando
Status: ASSIGNED → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Target Milestone: --- → 141 Branch
Regressions: 1970864
QA Whiteboard: [qa-triage-done-c142/b141]