Closed Bug 1678651 Opened 4 years ago Closed 4 years ago

Sampling suddenly stops on the content process

Categories

(Core :: Gecko Profiler, defect, P2)

Tracking


RESOLVED FIXED
86 Branch
Tracking Status
firefox86 --- fixed

People

(Reporter: denispal, Assigned: mozbugz)

References

(Blocks 1 open bug)

Details

Attachments

(9 files)


Here are a couple of profiles recently reported in bug 1664067. The sampling data on the content process seems to suddenly stop, but there is clearly activity going on, based on the screenshots and markers.

https://share.firefox.dev/3nghqaG
https://share.firefox.dev/32zSqTJ

Looking at the stack heights, it seems to have been broken after reaching a really deep stack containing 746 frames. (I got this number by going to the deepest node in the call tree, choosing "Copy stack" and pasting it into a text editor.)

I'm not sure what could break based on the stack depth, but maybe it's a hint.

I suspect something is going on with the sampling thread here, as the markers are still being captured. It may be that the existing hang issues from that bug are affecting profiling in some way.

ni: Gerald, to bring this to your attention.

Flags: needinfo?(gsquelart)

Looking at the first profile, there's a "DOM Worker" thread in the same process, which continues to be sampled. So the sampler loop still runs in the process.
Looking at the code, either the thread was removed from the list of "live" threads (very unlikely), or samples were in fact taken but somehow lost.
This last hypothesis, together with Markus' clue that stack sizes have been growing, would point at the code that handles too-big stack samples.

I managed to get too-big stacks on foxnews.com; they were larger than the current maximum of almost 68 KB.

While investigating, I saw that this code (and supporting code around it) is in fact incorrect!

  • It checks for "cleared" blocks, which won't happen because we don't actually destroy old blocks. So the first test never catches the too-big stack issue, and then an empty stack is silently recorded.
  • More worryingly, I suspect that the ProfileBufferChunkManagerSingle's single chunk is not actually re-used, which could prevent later samples from getting stored. (Though this is contradicted by the DOM Worker samples, I'd still like to review this code and make sure it does its job.)

As a quick solution here, we could bump up the buffer size during profiling. And I should at least make the size checks work.

A better solution would be to dynamically increase the size when needed. But this cannot be done while a thread is suspended, which makes things more complicated.
I'll do what I can here, and file follow-ups as necessary...

Assignee: nobody → gsquelart
Severity: -- → S3
Flags: needinfo?(gsquelart)
Priority: -- → P2

In DEBUG builds, instead of only testing for the expected chunk state(s), there are now separate assertions for each unexpected state, which makes it much easier to track the source of failures.

This bug has revealed some issues when the single chunk gets filled, and there are different paths depending on whether the chunk is filled right to the end, or past it.
Later patches will fix these issues and update these tests accordingly.

Depends on D99975

When a "Put" operation fails (most probably because no chunk was available to store the data), we remember the number of bytes that couldn't be stored.
This can be useful to give an indication of how much more memory would have been needed for successful puts.

Depends on D99976

The profiler sampler was using the "cleared block count" as indication that some data could not be stored.
But this was incorrect: cleared blocks only account for blocks in chunks that have been destroyed or recycled, so this check missed data that could not be stored due to lack of space.
In particular, ProfileBufferChunkManagerSingle only provides one chunk, so when it gets full, a new chunk cannot be allocated, and the data is lost; but the "cleared block count" is left unchanged (the one chunk is only released, but not destroyed nor recycled).

Now the sampler correctly looks at the "failed puts bytes", and can report precisely how many bytes could not be stored.

Depends on D99977

Instead of discarding released chunks, we keep them as "next" chunks, and we make sure there's a valid "current" chunk when possible.

Recycling released chunks means that when using the ProfileBufferChunkManagerSingle, the one chunk, in whichever state it may be, will be kept alive and reused.

Depends on D99978

Some sites do have stacks that require more than 64KB to store in the profiler buffer.
Note that this only affects one semi-permanent buffer per process during profiling, and short-lived buffers when capturing stacks in markers.

Depends on D99980

The "expected maximum stack size" (currently 64KiB) value was present in multiple places.
Now it's accessible from everywhere as ProfileBufferChunkManager::scExpectedMaximumStackSize, so it's easier to modify as needed.

Depends on D99980

Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e6890e92e588 Better debugging assertions in ProfileBufferChunk - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/f52a13d8eba8 Added ProfileChunkedBuffer state tests - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/b597a84c7777 Rewrote TestChunkedBufferSingle() to better test buffer-filling scenarios - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/9bf07ef7bfb5 ProfileChunkedBuffer now keeps track of failed puts - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/ba92771a968a Fixed sampler's detection of too-big stacks - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/07b13d6485bc Better ProfileChunkedBuffer::Clear(), recycles released chunks - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/9116bff9e29b ProfilerChunkedBuffer: Always mark current chunk full when entry doesn't fit - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/dce478a960db Centralize scExpectedMaximumStackSize in ProfileBufferChunkManager - r=canaltinova
https://hg.mozilla.org/integration/autoland/rev/ece3bd1f7094 Increase expected max stack record size to 128KB - r=canaltinova
