Limit the number of chrome-hangs reported to Telemetry

RESOLVED FIXED in Firefox 45, Firefox OS v2.5

Status

()

Toolkit
Telemetry
P1
normal
RESOLVED FIXED
5 years ago
3 years ago

People

(Reporter: vladan, Assigned: Dexter, Mentored)

Tracking

(Blocks: 2 bugs)

Trunk
mozilla45
Points:
1
Dependency tree / graph

Firefox Tracking Flags

(firefox45 fixed, b2g-v2.5 fixed)

Details

(Whiteboard: [unifiedTelemetry] [measurement:client] [lang=c++])

Attachments

(1 attachment, 5 obsolete attachments)

(Reporter)

Description

5 years ago
Some clients report over 1000 chrome-hang stacks from a single session! This causes unnecessary memory use in Firefox and causes Telemetry to spend a long time uploading its ping.

We should impose limits on the size of individual hang stacks (preserving only top X frames) as well as the number of chrome-hangs reported (keeping only the most recent stacks).
(Reporter)

Updated

5 years ago
Blocks: 896744

Comment 1

5 years ago
I am interested for getting involved in this bug. Please tell me how to start and related information.
(Reporter)

Comment 2

5 years ago
(In reply to subh.singh007 from comment #1)
> I am interested for getting involved in this bug. Please tell me how to
> start and related information.

Thanks for helping out!

Here's a list of things you have to do, I hope it's not too much info at once :)

1. Do you have access to a Windows machine? You'll need it to test out the change, because "chrome hangs" are currently a Windows-only feature

2. Check out and build Firefox on Windows: https://developer.mozilla.org/en-US/docs/Simple_Firefox_build

You can use the following sample .mozconfig file: http://pastebin.mozilla.org/2696396

3. Read a bit of background about the chrome-hang reporting feature (the implementation details are likely to be out of date though):
https://blog.mozilla.org/vdjeric/2012/03/21/introducing-chrome-hang-reporting-and-the-symbolication-server/

4. Modify the CombinedStacks::AddStack function in toolkit/components/telemetry/Telemetry.cpp to impose a max on the number of frames in a stack:

http://hg.mozilla.org/mozilla-central/file/5ecd26bc5274/toolkit/components/telemetry/Telemetry.cpp#l146

Make sure the "most recent" stack frames in a stack are preserved and the "older" frames (e.g. "int main()") are truncated

5. Further modify the CombinedStacks::AddStack function, to never hold more than some maximum number of stacks. So, for example, if the max is 50 and we already have 50 stacks when AddStack gets called again, then first drop the oldest stack and then add this latest one.

6. For testing the chrome hangs reporting mechanism, you can trigger a hang in your own Nightly:
- First, enable Telemetry from about:telemetry
- Write a webpage with a tight JavaScript loop that runs for more than 5 seconds
- Visit the webpage, wait for the "Unresponsive script" dialog to pop up
- Reload about:telemetry and verify there is new hang in the "Browser Hangs" section
- You should make sure that about:telemetry never has more than the maximum number of stacks and that none of the stacks have more than the maximum number of frames. You might want to lower the maximums to some small number during testing to make it easier.

You can reach me at my email address if you have any question.
Whiteboard: [mentor=vladan][lang=c++][good first bug] → [mentor=vladan][lang=c++]

Comment 3

5 years ago
> 1. Do you have access to a Windows machine? You'll need it to test out the
> change, because "chrome hangs" are currently a Windows-only feature
I don't have access to window machine. I am using Mac machine. I have build it on my mac machine. Is it not possible to continue on this bug using my mac machine?

Updated

5 years ago
OS: All → Windows 7
(Reporter)

Comment 4

5 years ago
(In reply to subh.singh007 from comment #3)
> I don't have access to window machine. I am using Mac machine. I have build
> it on my mac machine. Is it not possible to continue on this bug using my
> mac machine?

It's a Windows-only feature, so you won't be able to test it on a Mac. You could try hacking together some test inputs, but I wouldn't recommend it.
Do you have a Windows VM?

Comment 5

5 years ago
Since this looks like it's available and I'm currently building Firefox so I can start working on bugs, I'll say I'll take this one. I'm using Windows 8 as my primary OS, but I have a Windows 7 VHD that I can boot from to test this if necessary.

Comment 6

5 years ago
Taking a look at the code now... is ProcessedStack FIFO, FILO, or what? If it's FIFO, that makes dropping the oldest frames easy (pop pop pop), if it's FILO then I'll have to look further into it (which is no problem, I plan on going through and looking at everything once the build completes anyway). It's been a while since I did any programming, especially with vectors, so I welcome any suggestions and I forsee myself looking up many things to find the right function call.

Also, any suggestions for the max number of frames and stacks? I'm thinking 15 stacks could be enough, but I'm not sure either how often the reports get sent or how often people experience hangs in general. For frames I'm thinking 50 is probably good - it should more than cover what's necessary to determine the cause behind most hangs.

Comment 7

5 years ago
Created attachment 793781 [details] [diff] [review]
Patch without switching to deque

mStacks is still a vector as there are problems with changing it to a deque that I'm working on fixing, but this patch is fully functional and the size of the vector (50) means that it shouldn't really be a performance hit to drop the first stack off of mStacks. Once I can get it working with deque I'll submit another patch, this is mainly just because a deque may not be a viable option.
Attachment #793781 - Flags: review?(vdjeric)
(Reporter)

Updated

5 years ago
Assignee: nobody → demize95
(Reporter)

Comment 8

5 years ago
Comment on attachment 793781 [details] [diff] [review]
Patch without switching to deque

Thank you for working on this bug :) Just a couple of comments below.

As we discussed on IRC, I'd prefer to use a circular buffer or a deque (as you suggested) to store the stacks. Either one is fine, but erasing elements from the beginning of a vector is inefficient.

>+  //Limit number of stacks kept to 50
>+  if(mStacks.size() > 50) {
>+    mStacks.erase(mStacks.begin());
>+  }

We shouldn't hard-code "magic numbers" in the code, instead we should use a constant. It's easier to find & change the max if it's stored in one place in a variable, plus we only have to update the value in one place if we decide to change the maximum # of chrome hangs one day. See "kMaxFailedProfileLockFileSize" in the same file for an example.


>   // Copy the information to the return value.
>   ProcessedStack Ret;
>+  // Limit the number of frames to 50
>+  size_t endIndex = (rawStack.size() > 50) ? (50) 
>+	: (rawStack.size());
>   for (std::vector<StackFrame>::iterator i = rawStack.begin(),
>-         e = rawStack.end(); i != e; ++i) {
>+         e = rawStack.begin() + endIndex; i != e; ++i) {
>     const StackFrame &rawFrame = *i;
>     ProcessedStack::Frame frame = { rawFrame.mPC, rawFrame.mModIndex };
>     Ret.AddFrame(frame);
>   }

We can save a bit of time in this function by imposing the max frames limit in the first loop, i.e. when we iterate over aPCs to copy them into rawStack.
Attachment #793781 - Flags: review?(vdjeric)

Comment 9

5 years ago
Created attachment 794848 [details] [diff] [review]
Buggy patch after switching to deque
(Reporter)

Comment 10

5 years ago
Comment on attachment 794848 [details] [diff] [review]
Buggy patch after switching to deque

The new patch got lost in my bugmail, I'll take a look tomorrow
Attachment #794848 - Flags: feedback?(vdjeric)

Comment 11

5 years ago
Going to be without reliable internet for a while (up to a week, I'll say). I'll make any changes you find when I get reliable internet.
(Reporter)

Comment 12

5 years ago
Comment on attachment 794848 [details] [diff] [review]
Buggy patch after switching to deque

>+  
>+  //Limit number of stacks kept

Nit: space after comment

>+  if(mStacks.size() > kMaxStacksKept) {
>+    mStacks.pop_front();
>+  }
> }

You also need to erase the first element in mDurations, I think that was your bug
Also, you need a space after "if"

>+const size_t kMaxFramesKept = 50;
>+

We don't want this const in the Telemetry namespace, so put the declaration next to kMaxStacksKept (in the anonymous namespace). Also, add a comment explaining what these consts represent

>+  // Limit the number of frames
>+  size_t endIndex = (aPCs.size() > kMaxFramesKept) ? (kMaxFramesKept) 
>+	: (aPCs.size());

You can use std::min and figure out the last iterator here:

std::vector<uintptr_t>::const_iterator e = aPCs.begin() + std::min(aPCs.size, kMaxFramesKept);

>   for (std::vector<uintptr_t>::const_iterator i = aPCs.begin(),
>-         e = aPCs.end(); i != e; ++i) {
>+         e = aPCs.begin() + endIndex; i != e; ++i) {

Btw, this implementation means that we'll sometimes be holding onto module info that is no longer needed.. I don't think it's a big deal, so can you just add a comment about it?
Attachment #794848 - Flags: feedback?(vdjeric)
Mentor: vdjeric
Whiteboard: [mentor=vladan][lang=c++] → [lang=c++]
Blocks: 1186986
No longer blocks: 896744
OS: Windows 7 → All
Points: --- → 3
Priority: -- → P2
So, per bug 1187864 there are different things we can do here:
(1) limit the number of hang stat entries
(2) limit the stack depth to a sane number
(3) collapse repeated successive "(chrome script)" to one "(chrome script)" or "(chrome script <N>)"

For (3) i think we should just stay with collapsing into the simple "(chrome script)" (unless there are compelling reasons to make it more complicated).
This looks like it hasn't been worked on for a while, unassigning.
Assignee: demize95 → nobody
Points: 3 → 1
Whiteboard: [lang=c++] → [unifiedTelemetry] [measurement:client] [lang=c++]
Priority: P2 → P1
Assignee: nobody → alessio.placitelli
Created attachment 8678776 [details] [diff] [review]
bug896740.patch

This patch continued where the original contributor left off, by limiting the number of reported chrome hangs to 50 and the depth of the reported stack to 50.

Due to the way annotations are de-duplicated, discarding stacks is a bit more involving than it was before.

Please note that this patch stacks on bug 1213780.
Attachment #793781 - Attachment is obsolete: true
Attachment #794848 - Attachment is obsolete: true
Attachment #8678776 - Flags: review?(vladan.bugzilla)
(Reporter)

Comment 16

3 years ago
Comment on attachment 8678776 [details] [diff] [review]
bug896740.patch

Review of attachment 8678776 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +170,5 @@
>    return mModules[aIndex];
>  }
>  
>  void
> +CombinedStacks::PopFrontStack() {

Let's make this more descriptive: RemoveOldestStack

@@ +381,5 @@
> +        continue;
> +      }
> +
> +      // Shift the index down, to account for the removed stack in the deque.
> +      stackIndices[k] = stackIndices[k] - 1;

instead of rewriting indices, how about an alternate approach where the stacks are stored in a circular buffer? You would only need to change (remove) references to the evicted element.

@@ +402,5 @@
>    size_t n = 0;
>    n += mStacks.SizeOfExcludingThis();
>    // This is a crude approximation. See comment on
>    // CombinedStacks::SizeOfExcludingThis.
> +  n += mHangInfo.size() * sizeof(HangInfo);

why this change?

@@ +3999,5 @@
>  ProcessedStack
>  GetStackAndModules(const std::vector<uintptr_t>& aPCs)
>  {
>    std::vector<StackFrame> rawStack;
> +  std::vector<uintptr_t>::const_iterator e =

nit: e -> stackEnd

@@ +4001,5 @@
>  {
>    std::vector<StackFrame> rawStack;
> +  std::vector<uintptr_t>::const_iterator e =
> +    aPCs.begin() + std::min(aPCs.size(), kMaxChromeStackDepth);
> +  for (std::vector<uintptr_t>::const_iterator i = aPCs.begin(); i != e; ++i) {

you can simplify several lines in this patch by using auto
Attachment #8678776 - Flags: review?(vladan.bugzilla)
Status: NEW → ASSIGNED
Created attachment 8679556 [details] [diff] [review]
bug896740.patch

Thanks for the review. This patch uses a circular queue instead of a deque.
The vectors holding the stacks/hang info grow up to the maximum allowed size (50) and then wraps around.
Attachment #8678776 - Attachment is obsolete: true
Attachment #8679556 - Flags: review?(vladan.bugzilla)
(Reporter)

Comment 18

3 years ago
Comment on attachment 8679556 [details] [diff] [review]
bug896740.patch

Review of attachment 8679556 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +156,5 @@
>  private:
>    std::vector<Telemetry::ProcessedStack::Module> mModules;
>    std::vector<Stack> mStacks;
> +  size_t mLastWrittenIndex;
> +  size_t mNextIndex;

As far as I can tell mNextIndex is just mLastWrittenIndex + 1.

I think you can just keep one variable, mHangCounter, as a static variable in HangReports. It could be monotonically increasing and you could derive vector indeces from it

@@ +176,5 @@
>  void
>  CombinedStacks::AddStack(const Telemetry::ProcessedStack& aStack) {
> +  // Advance the indices of the circular queue holding the stacks.
> +  mLastWrittenIndex = mNextIndex % kMaxChromeStacksKept;
> +  mNextIndex = mLastWrittenIndex + 1;

if you used only one index, this would become:

size_t index = HangReports::GetHangCount() % kMaxChromeStacksKept;

@@ +182,5 @@
> +  if (mStacks.size() < kMaxChromeStacksKept) {
> +    mStacks.resize(mStacks.size() + 1);
> +  }
> +  // Get a reference to the location holding the new stack.
> +  CombinedStacks::Stack& adjustedStack = mStacks[mLastWrittenIndex];

this would become: adjustedStack = mStacks[index]

@@ +335,5 @@
>                       int32_t aFirefoxUptime,
>                       HangAnnotationsPtr aAnnotations) {
> +  // Append the new stack to the stack's circular queue.
> +  mStacks.AddStack(aStack);
> +  size_t hangIndex = mStacks.GetLastWrittenIndex();

this would become: hangIndex = mHangCounter % kMaxChromeStacksKept;

@@ +386,5 @@
> +    for (size_t k = 0; k < stackIndices.Length(); k++) {
> +      // Is this index referencing the removed stack?
> +      if (stackIndices[k] == aRemovedStackIndex) {
> +        toRemove = k;
> +        continue;

are all the indices in iter->mHangIndices distinct? if so, why not break after finding one?
Attachment #8679556 - Flags: review?(vladan.bugzilla)
Created attachment 8680054 [details] [diff] [review]
bug896740.patch

As discussed over IRC, this patch leaves the counters in CombinedStack (as this is used both for lateWrites and chromeHangs).
Attachment #8679556 - Attachment is obsolete: true
Attachment #8680054 - Flags: review?(vladan.bugzilla)
(Reporter)

Comment 20

3 years ago
Comment on attachment 8680054 [details] [diff] [review]
bug896740.patch

Review of attachment 8680054 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +122,5 @@
>    size_t GetStackCount() const;
>    size_t SizeOfExcludingThis() const;
>  private:
>    std::vector<Telemetry::ProcessedStack::Module> mModules;
>    std::vector<Stack> mStacks;

add a comment that this is a circular buffer and that mNextIndex is the next buffer element to write to

@@ +305,5 @@
> +    mHangInfo[hangIndex] = info;
> +  }
> +  // Remove any reference to the stack overwritten in the circular queue
> +  // from the annotations.
> +  PruneStackReferences(hangIndex);

you can put this in the else above, size >= kMaxChromeStacksKept will be rare
Attachment #8680054 - Flags: review?(vladan.bugzilla) → review+
Created attachment 8681124 [details] [diff] [review]
bug896740.patch
Attachment #8680054 - Attachment is obsolete: true
Attachment #8681124 - Flags: review+

Comment 24

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/c48cb2c5ccf0
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox45: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
You need to log in before you can comment on or make changes to this bug.