Closed Bug 1293501 Opened 4 years ago Closed 2 years ago

Add debug annotation/code for the crash in ReleaseData()

Categories

(Core :: Networking: HTTP, defect, P3, major)

All
Windows
defect

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: ting, Assigned: ting)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 5 obsolete files)

+++ This bug was initially created as a clone of Bug #1145613 +++

I have checked the code which set nsTSubstring_CharT::mFlags with F_SHRED, but I don't see any of them could allow null mData.

Based on crash-stats, nsHttpHeaderArray has the highest chance to meet the crash. And from the minidumps, it's crashed while destructing the nsCString |mValue| of the first nsEntry in nsHttpHeaderArray::mHeaders, so I'd like to see memory content the array's header also the first two elements (which are not existed in the minidump).

This bug may also be used for adding some more debug code.
:mcmanus is taking PTO til 08-15, could help to review this?
Attachment #8779142 - Flags: review?(jduell.mcbugs)
Assignee: nobody → janus926
Whiteboard: [necko-active]
Comment on attachment 8779142 [details] [diff] [review]
part1 v1: add crash annotatios for the nsHttpHeaderArray

Review of attachment 8779142 [details] [diff] [review]:
-----------------------------------------------------------------

Seems ok to me!
Attachment #8779142 - Flags: review?(jduell.mcbugs) → review+
This is for debug purpose, probably more patches to come. But for now there's only one.
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/16abccc347ef
Add debug annotations to track down the crash of bug 1145613. r=jduell
Keywords: checkin-needed
Attachment #8780951 - Attachment is obsolete: true
This time I'd like to know when |mData| is cleared to null, so I put a check in ReentrantMonitorAutoEnter, which will check before/after manipulating nsHttpHeaderArray.
Attachment #8785146 - Flags: review?(mcmanus) → review?(dd.mozilla)
Comment on attachment 8785146 [details]
Bug 1293501 - Add some more crash annotations to dianose bug 1145613.

https://reviewboard.mozilla.org/r/74448/#review72576

Thanks for doing this.
Attachment #8785146 - Flags: review?(dd.mozilla) → review+
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5661cf2d2b85
Add some more crash annotations to dianose bug 1145613. r=mcmanus
Keywords: checkin-needed
I'll see if I can use debug registers to add a watchpoint on |mData| like https://hg.mozilla.org/try/rev/db9d9d4e2dab9cffdfec027316381a0fb775c9f7.
Based on bug 1247982 comment 41, the debug registers would need to apply on multiple threads.
(In reply to Ting-Yu Chou [:ting] from comment #14)
> Based on bug 1247982 comment 41, the debug registers would need to apply on
> multiple threads.

Seems not true in content process, I saw only main thread that is accessing nsHttpRequestHead.
Pushed by tchou@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6e6697a3d251
Remove debug code for not propagating to Aurora. r=me
It'd become https://crash-stats.mozilla.com/search/?reason=~EXCEPTION_ACCESS_VIOLATION_WRITE&product=Firefox if it's captured by this code.
Attachment #8785146 - Attachment is obsolete: true
Attachment #8798718 - Flags: review?(dd.mozilla)
Comment on attachment 8798718 [details] [diff] [review]
v4: use virtualprotect to track down possible corruption

This protects only the |mData| of the |value| of nsEntry, I should protect also the memory after |header|.
Attachment #8798718 - Attachment is obsolete: true
Attachment #8798718 - Flags: review?(dd.mozilla)
Add padding between header and value of nsEntry, set the page (starts from the array header to the first nsEntry value's |mData|) read only when ReentrantMonitorAutoEnter destructs.
Attachment #8798764 - Flags: review?(dd.mozilla)
Attachment #8798764 - Flags: review?(dd.mozilla) → review+
Pushed by tchou@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2eebd44ff2e9
Use VirtualProtect to track down a possible memory corruption. r=dragana
Depends on: 1310113
I am using this search to monitor the possible culprit:
https://crash-stats.mozilla.com/search/?reason=~EXCEPTION_ACCESS_VIOLATION_WRITE&build_id=>%3D20161012030211&product=Firefox&version=52.0a1&platform=Windows
Depends on: 1315285
The crash still happens with the VirtualProtect, see bp-5e073847-8799-4ae1-b6b2-d42f22161105.
(In reply to Ting-Yu Chou [:ting] from comment #26)
> The crash still happens with the VirtualProtect, see
> bp-5e073847-8799-4ae1-b6b2-d42f22161105.

This one is odd, open the dump with vs2015 and it has following in ReleaseData():

  &iter->header = 0x0e03300c
  &iter->padding = 0x0e033010
  &iter->value = 0x0e034000

The array content does not start from 0x...8 (sizeof nsTArrayHeader is 8) but 0x0e03300c, which iter->value.mData is outside of the readonly page.
I was wrong for comment 27, the address is not ended with 8 (0x0e0300c) because it's the 2nd nsEntry in the array. Note the patch in comment 23 protects only the first page of the nsHttpRequestHead.mHeaders:

  |nsTArrayHeader|nsEntry.header|padding|nsEntry.value.mData|
Hi Ting, now that the debugging code is on Aurora too, are we seeing anything useful in newer crash reports?
Flags: needinfo?(janus926)
No, there're only 5 EXCEPTION_ACCESS_VIOLATION_WRITE crashes for 52.0a2 which have the address end with 'c' in the past 2 weeks, and none of them are '0c' or 'fc'. I'd like to wait for few more days.
Flags: needinfo?(janus926)
Between Nov 12, 2016 and Dec 12, 2016 [1], there're 6 EXCEPTION_ACCESS_VIOLATION_WRITE crashes have address ends with 0xfc (the padding shifted offset of the first nsEntry's value.mData in the array):

bp-574f8db7-fa10-4ab3-8d04-fb7532161212 F_767234858____________________________
bp-2ca9629f-c698-49b5-8646-3bdd92161206 ImplCycleCollectionTraverse<T>
bp-a5cbad60-913d-4396-87a1-ece372161130 js::frontend::MatchOrInsertSemicolonHelper
bp-797295af-ff22-4e94-8c77-7ad852161128 kernel32.dll@0x260be
bp-a7825c72-f51f-4c48-92fe-f02282161125 js::irregexp::ExecuteCode<T>
bp-9189e0d4-2334-4735-97e6-66c672161125 igdusc32.dll | igd10iumd32.dll | d3d11.dll@0x63ddb

There aren't many crashes so I am not sure if the root cuase is captured, but bp-2ca9629f-c698-49b5-8646-3bdd92161206 looks suspicious as it has nsTArray_base::ShiftData in the crash stack. I'll take a look at it.

I am going to backout the debug patch.

[1] https://crash-stats.mozilla.com/search/?build_id=%3E%3D20161012030211&reason=~EXCEPTION_ACCESS_VIOLATION_WRITE&address=%24fc&product=Firefox&version=53.0a1&version=52.0a2&version=52.0a1&platform=Windows&date=%3E%3D2016-11-12T17%3A17%3A14.000Z&date=%3C2016-12-12T17%3A17%3A14.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=address&_columns=reason#crash-reports
Pushed by tchou@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/62b7071249cf
Back out the debug patch (changeset 2eebd44ff2e9). r=me
Attached patch v4 backoutSplinter Review
Approval Request Comment
[Feature/Bug causing the regression]: none
[User impact if declined]: perf regression
[Is this code covered by automated tests?]: not sure
[Has the fix been verified in Nightly?]: no
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: just back out a debug patch
[String changes made/needed]: none
Attachment #8798764 - Attachment is obsolete: true
Attachment #8818143 - Flags: approval-mozilla-aurora?
(In reply to Ting-Yu Chou [:ting] from comment #31)
> There aren't many crashes so I am not sure if the root cuase is captured,
> but bp-2ca9629f-c698-49b5-8646-3bdd92161206 looks suspicious as it has
> nsTArray_base::ShiftData in the crash stack. I'll take a look at it.

It doesn't seem relate to the bug here because:

1) From vs2015, the crash address 0x11ba52fc is __vfptr of the argument aCallback for ImplCycleCollectionTraverse() which is not nsHttpHeaderArray.
2) The stack in crash-stats is different from the one when open the crash dump by vs2015, which does not have nsTArray_base::ShiftData():
   xul.dll!ImplCycleCollectionTraverse<nsIBrowserElementAPI>(nsCycleCollectionTraversalCallback & aCallback, nsCOMPtr<nsIBrowserElementAPI> & aField, const char *) Line 1102   C++
   xul.dll!nsPurpleBuffer::PurpleBlock::VisitEntries<RemoveSkippableVisitor>(nsPurpleBuffer & aBuffer, RemoveSkippableVisitor & aVisitor) Line 1035    C++
   xul.dll!SnowWhiteKiller::SnowWhiteKiller(nsCycleCollector * aCollector) Line 2649   C++
   xul.dll!nsTimerImpl::Fire() Line 477    C++
   xul.dll!nsTimerEvent::Run() Line 293    C++
   xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Line 1222 C++
   xul.dll!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate * aDelegate) Line 96 C++
   xul.dll!MessageLoop::RunHandler() Line 226  C++
   xul.dll!MessageLoop::Run() Line 206 C++
   xul.dll!nsBaseAppShell::Run() Line 158  C++
   xul.dll!nsAppShell::Run() Line 262  C++
   xul.dll!nsAppStartup::Run() Line 283    C++
   xul.dll!XREMain::XRE_mainRun() Line 4467    C++
   xul.dll!XREMain::XRE_main(int argc, char * * argv, const nsXREAppData * aAppData) Line 4600 C++
   xul.dll!XRE_main(int argc, char * * argv, const nsXREAppData * aAppData, unsigned int aFlags) Line 4692 C++
   firefox.exe!wmain(int argc, wchar_t * * argv) Line 115  C++
   [External Code] 
3) The assembly around ip (0x10ed62a2) does not write to any address:
   10ED62A0  mov         edi,dword ptr [edx]
   10ED62A2  test        byte ptr [esi+4],1
   10ED62A6  je          ImplCycleCollectionTraverse<nsIBrowserElementAPI>+19h (10ED62B5h)
VirtualProtect() doesn't catch noticeable invalid writes, could it be something in kernel mode? I don't know how to proceed without STR...
Comment on attachment 8818143 [details] [diff] [review]
v4 backout

backout a debugging patch from aurora52
Attachment #8818143 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
No longer depends on: 1315285
Bulk priority update: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Ting,  where are we with this bug?  Should it stay open?
Flags: needinfo?(janus926)
Priority: P1 → P3
Whiteboard: [necko-active]
The patches I tried here didn't catch any culprits, and I don't have any other ideas how to proceed without STR.
Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(janus926)
Resolution: --- → INCOMPLETE
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.