Open Bug 1009060 Opened 6 years ago Updated 10 months ago

Limit depth of lateWrites stacks in Telemetry payloads

Categories

(Toolkit :: Telemetry, defect, P4)

31 Branch
x86
Windows 7
defect

Tracking

()

REOPENED

People

(Reporter: mreid, Unassigned)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

Attachments

(2 files)

It appears that some of the chromehang reports contain a really huge stack trace (including millions of entries).

We should determine if this is due to a bug in the chromehang code, or if these are legitimate stacks.  If the latter, we should probably truncate them before reporting.

I've attached one example payload, let me know if you need more info.
Correction - it's the lateWrites part (not chromeHangs) that has a lot of large stacks.
Summary: Overly large chromehang stacks in Telemetry payloads → Overly large lateWrites stacks in Telemetry payloads
This might be a bug in a 3rd party security tool.. PGHook.dll is part of "Avecto Privilege Guard" http://www.avecto.com/products/overview/

These are the top & bottom of the hang stack. Unfortunately the symbols for this Nightly (from April 7th) are no longer available so the xul frames can't be symbolicated.

0xca384 (in xul.pdb)
0x797bc (in xul.pdb)
0x79770 (in xul.pdb)
FlushFileBuffers (in wkernelbase.pdb)
FlushFileBuffersImplementation (in wkernel32.pdb)
0x175d7 (in PGHook.pdb)
0x1ffff (in PGHook.pdb)
MoveFileWithProgressTransactedW (in wkernel32.pdb)
MoveFileWithProgressTransactedA (in wkernel32.pdb)
MoveFileWithProgressA (in wkernel32.pdb)
MoveFileA (in wkernel32.pdb)
0x8ed1 (in nss3.pdb)
0xc16dd6 (in xul.pdb)
0xca384 (in xul.pdb)
0x797bc (in xul.pdb)
0x79770 (in xul.pdb)
FlushFileBuffers (in wkernelbase.pdb)
FlushFileBuffersImplementation (in wkernel32.pdb)
0x175d7 (in PGHook.pdb)
0x1ffff (in PGHook.pdb)
MoveFileWithProgressTransactedW (in wkernel32.pdb)
MoveFileWithProgressTransactedA (in wkernel32.pdb)
MoveFileWithProgressA (in wkernel32.pdb)
MoveFileA (in wkernel32.pdb)
0x8ed1 (in nss3.pdb)
0xc16dd6 (in xul.pdb)
0xca384 (in xul.pdb)


.. snip ...


0xca384 (in xul.pdb)
0x797bc (in xul.pdb)
0x79770 (in xul.pdb)
FlushFileBuffers (in wkernelbase.pdb)
FlushFileBuffersImplementation (in wkernel32.pdb)
0x175d7 (in PGHook.pdb)
0x1ffff (in PGHook.pdb)
MoveFileWithProgressTransactedW (in wkernel32.pdb)
MoveFileWithProgressTransactedA (in wkernel32.pdb)
MoveFileWithProgressA (in wkernel32.pdb)
MoveFileA (in wkernel32.pdb)
0x8ed1 (in nss3.pdb)
0xc16dd6 (in xul.pdb)
0xca384 (in xul.pdb)
0x797bc (in xul.pdb)
0x79770 (in xul.pdb)
FlushFileBuffers (in wkernelbase.pdb)
FlushFileBuffersImplementation (in wkernel32.pdb)
0x175d7 (in PGHook.pdb)
0x1ffff (in PGHook.pdb)
MoveFileWithProgressTransactedW (in wkernel32.pdb)
MoveFileWithProgressTransactedA (in wkernel32.pdb)
MoveFileWithProgressA (in wkernel32.pdb)
MoveFileA (in wkernel32.pdb)
0x8ed1 (in nss3.pdb)
0xc16dd6 (in xul.pdb)
0xca384 (in xul.pdb)
0x797bc (in xul.pdb)
0x79770 (in xul.pdb)
FlushFileBuffers (in wkernelbase.pdb)
FlushFileBuffersImplementation (in wkernel32.pdb)
0x175d7 (in PGHook.pdb)
0x1ffff (in PGHook.pdb)
MoveFileWithProgressTransactedW (in wkernel32.pdb)
MoveFileWithProgressTransactedA (in wkernel32.pdb)
MoveFileWithProgressA (in wkernel32.pdb)
MoveFileA (in wkernel32.pdb)
0x8ed1 (in nss3.pdb)
0x913db1 (in xul.pdb)
0x2a058e (in xul.pdb)
0x1753 (in firefox.pdb)
0x1a1b (in firefox.pdb)
0x1b26 (in firefox.pdb)
0x242f (in firefox.pdb)
BaseThreadInitThunk (in wkernel32.pdb)
__RtlUserThreadStart (in wntdll.pdb)
_RtlUserThreadStart (in wntdll.pdb)
Aaron, does the stack above make any sense to you?
Flags: needinfo?(aklotz)
Given the limited information that we have there, it looks like IOInterposer caught the FlushFileBuffers call but something in the handler caused us to repeatedly reenter.

There is a thread-local variable in IOInterposer that is supposed to guard against this kind of reentry. For some reason that didn't catch in this case. Furthermore, I agree that this PGHook module is not helping. It would sure be nice to have symbols. :-(

Mark, do we have any of these that are new enough to symbolicate?
Flags: needinfo?(aklotz) → needinfo?(mreid)
I traced the MoveFileA call to PR_Rename and then to http://dxr.mozilla.org/mozilla-central/source/xpcom/build/LateWriteChecks.cpp#208 ; I think that gives me enough to look further into this.
(In reply to Aaron Klotz [:aklotz] from comment #4)
> There is a thread-local variable in IOInterposer that is supposed to guard
> against this kind of reentry. For some reason that didn't catch in this
> case. Furthermore, I agree that this PGHook module is not helping. It would
> sure be nice to have symbols. :-(

I checked to see when the reentry variable was added; I checked it in on April 8, which would make sense given the timing of the stack in this bug. Mark, I'm amending my needinfo request: Have you seen anything like this appear since April 8?
Flags: needinfo?(mreid)
(In reply to Aaron Klotz [:aklotz] from comment #6)
> Mark, I'm amending my needinfo request: Have you seen anything like this
> appear since April 8?
Flags: needinfo?(mreid)
I see one from build id 20140409030203, but nothing since then.
Flags: needinfo?(mreid)
(In reply to Mark Reid [:mreid] from comment #8)
> I see one from build id 20140409030203, but nothing since then.

I don't think that bug 913653 made it into that build. I doubt that we'll see anything like this again. If we do, of course we can always reopen.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Here's a newer one (from 20140427).
We still have this in the payload, although i don't know if anyone uses it actively.
For now we can just limit the stack depth here to be safe.

ni?me to set this up as a mentored bug.
Flags: needinfo?(gfritzsche)
Summary: Overly large lateWrites stacks in Telemetry payloads → Limit depth of lateWrites stacks in Telemetry payloads
See Also: → 1469146
I filed bug 1469146 to see investigate a bunch of performance data that's historically in Telemetry.
Maybe we can just remove this.
Depends on: 1469146
Flags: needinfo?(gfritzsche)
Priority: -- → P4
You need to log in before you can comment on or make changes to this bug.