"null signature" in about:crashes Reports - /data/socorro/stackwalk/bin/stackwalk.sh returned no header lines for reportid.

RESOLVED INCOMPLETE

Status

Toolkit
Crash Reporting
--
major
RESOLVED INCOMPLETE
6 years ago
4 years ago

People

(Reporter: Rob, Unassigned)

Tracking

Trunk
x86
Windows XP
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Reporter)

Description

6 years ago
User Agent: Mozilla/5.0 (Windows NT 5.1; rv:10.0a1) Gecko/20111023 Firefox/10.0a1
Build ID: 20111023031047

Steps to reproduce:

Nightly crashed and popped up the Error Reporter. I typed some info and submitted it, but when I check about:crashes it says that there is a null Signature for the Report's Title.

There is a Link in your about:crashes Reports that brings you here when you click it, normally the Title of this Report would include "[@ ????]" but without a Signature there is nothing to include, so THIS Report is NOT so much about the Crash itself but about the presence of "null Signatures" (presumably unreportable crashes).


The Report is: bp-b6d563f3-36f0-47a9-9267-124cf2111023 .


The Processor Info is:

Processor Notes 	
/data/socorro/stackwalk/bin/stackwalk.sh returned no header lines for reportid: 306765798; No thread was identified as the cause of the crash; No signature could be created because we do not know which thread crashed; 
/data/socorro/stackwalk/bin/stackwalk.sh returned no frame lines for reportid: 306765798; 
/data/socorro/stackwalk/bin/stackwalk.sh failed with return code 1 when processing dump b6d563f3-36f0-47a9-9267-124cf2111023


There must be some way for "stackwalk" to 'back out of itself' and know from where it was called.

If the Address has no Title (in Debug Comments) could we at least have the Address and the "Built from" (EG: http://hg.mozilla.org/mozilla-central/rev/1fa31fa85082) from about:buildconfig (or whichever Info the Developers prefer) to identify the location?

Even going back to the 'caller of the caller' (and prefixing "**" to the Signature) would show that 'something' is calling 'something else' and that something else does not identify itself.

This would allow us to add some granularity to the "null Signature" Reports and know which is which; instead of EVERY "null Signature" Report being categorized as the SAME Crash.

Thanks.
(Reporter)

Updated

6 years ago
Severity: normal → major
Component: General → Breakpad Integration
Product: Firefox → Toolkit
QA Contact: general → breakpad.integration

Comment 1

6 years ago
(In reply to Rob from comment #0)
> There must be someway for "stackwalk" to 'back out of itself' and know from
> where it was called. 

The problem here is that we more than likely wound up with an empty minidump file. The minidump file is supposed to contain information about the state of the process when you crashed, but if it's empty or malformed then we can't tell anything about that. The file can be empty or malformed in certain circumstances like out-of-memory crashes, or crashes where heap memory gets corrupted.

It's unfortunate, but there isn't much we can do in terms of getting an actual signature out of it. There is literally no data to work with in these cases.
Component: Breakpad Integration → Socorro
Product: Toolkit → Webtools
QA Contact: breakpad.integration → socorro
Version: 10 Branch → Trunk
I downloaded the minidump file, it is completely empty with a length of zero bytes.  There's nothing Socorro can do with empty minidumps.  I've got no stack info from which to build a signature.  

There is Bug 678865 that proposes changing empty signatures to some string that says something like "can't determine signature", but that really gives you no more information than the lack of a signature.

With no data, there's nothing I can do.  I'd reassign this under "Breakpad Integration", but that's where it came from.
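
The distinction drawn in the replies above between an empty and a malformed minidump can be checked mechanically. The following is a minimal sketch, assuming only that a valid Windows minidump begins with the `MDMP` signature followed by a version, a stream count and a stream directory offset. `classify_dump` is a hypothetical helper for illustration and is not part of Socorro or Breakpad.

```python
import os
import struct

# The first four bytes of a Windows minidump header are the signature "MDMP".
MINIDUMP_MAGIC = b"MDMP"
# Signature, version, stream count, stream directory offset:
# four 32-bit little-endian fields at the start of the header.
HEADER_PREFIX = struct.Struct("<4sIII")


def classify_dump(path):
    """Return 'empty', 'malformed' or 'plausible' for a dump file.

    Hypothetical helper; 'plausible' only means the header looks sane,
    not that a stack can actually be walked from the file.
    """
    size = os.path.getsize(path)
    if size == 0:
        return "empty"
    if size < HEADER_PREFIX.size:
        return "malformed"
    with open(path, "rb") as f:
        magic, _version, stream_count, directory_offset = HEADER_PREFIX.unpack(
            f.read(HEADER_PREFIX.size)
        )
    if magic != MINIDUMP_MAGIC or stream_count == 0 or directory_offset >= size:
        return "malformed"
    return "plausible"
```

A zero-length file such as the one described above would come back as `empty`, which is a different failure from a file that has bytes but no recognizable header.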
Status: UNCONFIRMED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → WONTFIX
(Reporter)

Comment 3

6 years ago
(In reply to Ted Mielczarek [:ted, :luser] from comment #1)
> (In reply to Rob from comment #0)
> > There must be someway for "stackwalk" to 'back out of itself' and know from
> > where it was called. 
> 
> The problem here is that we more than likely wound up with an empty minidump
> file. The minidump file is supposed to contain information about the state
> of the process when you crashed, but if it's empty or malformed then we
> can't tell anything about that. The file can be empty or malformed in
> certain circumstances like out-of-memory crashes, or crashes where heap
> memory gets corrupted.
> 
> It's unfortunate, but there isn't much we can do in terms of getting an
> actual signature out of it. There is literally no data to work with in these
> cases.

I guess the solution would be to ensure that this did not occur.

If Firefox can figure out it crashed (as opposed to simply closing without popping up the "Error Reporter") then "something", "somewhere" must have crashed, otherwise how would the "Error Reporter" 'know' - unless it is falsely being called.

So there IS a Bug in there somewhere.

But if you prefer "WONTFIX" that is unfortunate; but I've no interest in arguing with the Developers about what is best for them; until I am one myself (maybe next year).

Thanks for taking the time to look at this and provide your answers.

YT,
Rob
(Reporter)

Comment 4

6 years ago
Here is a "null Signature" 'startup crash' bp-b73b4b0d-23b6-4331-b809-5f6f02111024 .

Solution:
The mechanism to catch crashes is not being initialized early enough in main() .

Could we re-open for "move the call to that function earlier in the initialization" ?
(Reporter)

Comment 5

6 years ago
As I said in Comment 0: 
"THIS Report is NOT so much about the Crash itself but about the presence of "null Signatures" (presumably unreportable crashes)."

In Comment 4 I proposed:
"Solution:
The mechanism to catch crashes is not being initialized early enough in main() ."

In Comment 4 I included a "bp" for a "startup crash".

I just re-loaded EXACTLY the same Tabs (using TM+) and the same version of Nightly (without updating in-between) and _this_ time Nightly works fine; thus we are missing the ability to catch an intermittent crash that ought to be caught.

I am going to re-start under WinDbg and IF it is possible to catch the culprit I will provide the Log and reopen.

Comments?
(Reporter)

Comment 6

6 years ago
Created attachment 569270 [details]
JavaScript's malloc-ing is calling abort() and bypassing "Error Reporter" Code.

(In reply to Ted Mielczarek [:ted, :luser] from comment #1)
> (In reply to Rob from comment #0)
> ...
> can't tell anything about that. The file can be empty or malformed in
> certain circumstances like out-of-memory crashes, or crashes where heap
> memory gets corrupted.
> ...

As I suspected, JavaScript is allocating memory and malloc is calling abort() .


I managed to get Nightly to run under Windbg and then re-loaded all the Tabs; causing a crash.


From WinDbg.log

-----
STACK_TEXT:  
0258fba4 00b11960 00b120fc 011064f8 00010000 mozalloc!mozalloc_abort+0x2b [e:\builds\moz2_slave\m-cen-w32-ntly\build\memory\mozalloc\mozalloc_abort.cpp @ 77]
0258fbac 011064f8 00010000 467cd470 467cd470 mozalloc!mozalloc_handle_oom+0xa [e:\builds\moz2_slave\m-cen-w32-ntly\build\memory\mozalloc\mozalloc_oom.cpp @ 54]
0258fbd0 011062c3 020182c4 00000002 467cd470 xul!GCGraphBuilder::NoteScriptChild+0x1d8 [e:\builds\moz2_slave\m-cen-w32-ntly\build\xpcom\base\nscyclecollector.cpp @ 1797]
0258fbe4 00dd9b0b 0258fc88 467cd470 00000000 xul!NoteJSChild+0x53 [e:\builds\moz2_slave\m-cen-w32-ntly\build\js\xpconnect\src\nsxpconnect.cpp @ 693]
0258fc00 00dd9d54 0258fc88 467cd470 00000021 mozjs!js::gc::Mark<JSObject>+0x8b [e:\builds\moz2_slave\m-cen-w32-ntly\build\js\src\jsgcmark.cpp @ 144]

...

FOLLOWUP_IP: 
mozalloc!mozalloc_handle_oom+a [e:\builds\moz2_slave\m-cen-w32-ntly\build\memory\mozalloc\mozalloc_oom.cpp @ 54]
00b11960 cc              int     3

FAULTING_SOURCE_CODE:  
No source found for 'e:\builds\moz2_slave\m-cen-w32-ntly\build\memory\mozalloc\mozalloc_oom.cpp'


SYMBOL_STACK_INDEX:  1

SYMBOL_NAME:  mozalloc!mozalloc_handle_oom+a

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: mozalloc

IMAGE_NAME:  mozalloc.dll

DEBUG_FLR_IMAGE_TIMESTAMP:  4ea54d3a

STACK_COMMAND:  ~3s ; kb

BUCKET_ID:  HANG_mozalloc!mozalloc_handle_oom+a

FAILURE_BUCKET_ID:  APPLICATION_HANG_BusyHang_cfffffff_mozalloc.dll!mozalloc_handle_oom

WATSON_STAGEONE_URL:  http://watson.microsoft.com/StageOne/firefox_exe/10_0_0_4314/4ea560bd/mozalloc_dll/10_0_0_4314/4ea54d3a/cfffffff/0000192f.htm?Retriage=1

Followup: MachineOwner
---------

...

-----

See enclosed Log.
(Reporter)

Comment 7

6 years ago
Re-opening because we have something to work with now. See Comment 0 through 6.
Status: RESOLVED → UNCONFIRMED
Resolution: WONTFIX → ---

Comment 8

6 years ago
We definitely can't do anything on the crash-processing side, but ted may wish to comment on the Breakpad side (report generation), so I'm moving it back to that component.
Component: Socorro → Breakpad Integration
Product: Webtools → Toolkit
QA Contact: socorro → breakpad.integration
(Reporter)

Comment 9

6 years ago
(In reply to Laura Thomson :laura from comment #8)
> We definitely can't do anything on the crash-processing side, but ted may
> wish to comment on the Breakpad side (report generation), so I'm moving
> it back to that component.

Can we not move the "crash-processing side" earlier in main() so it can set up its variables and be ready to catch _anything_ that comes after it ?

If the "crash-processing side" occurred earlier it could catch any crash except itself.

I can (at the moment) get the Browser to start and render a Window, complete with Menus, then TM+'s Window pops up asking me if I want to restore the Session. If there are "too many" Tabs (usually over 200, with Nightly, 270 with Aurora, and about 240 with FF "Release") then a couple of dozen will load and the Browser will crash.


The enclosed Windbg Log leads me to believe that JavaScript's Garbage Collection routine is not "instrumented for the Browser's Error Reporter" (the "crash-processing side" would use an atexit() routine or something similar) and when the Garbage Collection routine calls abort() the Program simply exits and execution terminates immediately (as opposed to setting up an 'error stack' to pass to the "Error Reporter" so it can say where 'whatever' has visited).
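
The general behaviour this paragraph relies on, that abort() ends a process without running atexit-style handlers, can be shown in isolation. This is a sketch using Python's `atexit` and `os.abort()` in a child process. It illustrates only that general behaviour and says nothing about how Breakpad hooks crashes, which does not depend on atexit() (a later reply in this thread states that the crash is in fact caught). `run_child` and `CHILD_TEMPLATE` are names made up for the example.

```python
import subprocess
import sys

# Child script: register an exit handler, then end in the requested way.
CHILD_TEMPLATE = """
import atexit, os, sys
atexit.register(lambda: print("exit handler ran"))
{ending}
"""


def run_child(ending):
    """Run the child script with the given final statement.

    Returns (exit_code, stdout) so the caller can see whether the
    exit handler got a chance to print.
    """
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_TEMPLATE.format(ending=ending)],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout


if __name__ == "__main__":
    print(run_child("sys.exit(0)"))  # handler output is present
    print(run_child("os.abort()"))   # handler output is absent, exit code non-zero
```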


There was another Bug Report 539716 Comment 2 where "uxtheme.dll crash doesn't trigger Breakpad" where "Timeless" commented: "uxtheme is *literally* the first userspace module loaded, it happens before we get around to doing any work.".

If we could move the "crash-processing side" earlier in main() and get it working BEFORE we did anything else then it could catch any problem (except its own crashing or Routines that are not "instrumented for the Browser's Error Reporter"; as appears to be happening at Line 2984 of the enclosed Windbg Log).


Let's get "Error Reporting" running early (and working), then after that let's open the Browser's Window and show the Menu; while having each Routine properly register itself in the "we are here" Stack (the regular Stack). Then we know where we are and where we came from.

How else would a "ret" ever be expected to return if we are not maintaining the Stack correctly ? If we write Garbage to the end of the Stack (push when we ought to have popped) we can still run from the (known) head of the Stack (IF the Error Reporter pops up (gets a chance to execute)) until we get to the Garbage which contains 'Addresses' that point to "nothing"; thus we know where we last were that called "somewhere" that broke the Stack - and it is from that point onward that we must look.
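
The idea of walking the Stack until the Addresses stop making sense can be sketched as a toy model. This assumes the classic x86 layout where each frame stores the caller's frame pointer at `[fp]` and the return address at `[fp + 4]`. It is not Breakpad's stackwalker, which relies on more than frame pointers. The first four ChildEBP/RetAddr pairs are taken from the WinDbg STACK_TEXT in Comment 6; the stack bounds and the garbage value `0x41414141` are invented for the example.

```python
def walk_frames(memory, frame_pointer, stack_low, stack_high, max_frames=64):
    """Follow a chain of saved frame pointers and collect return addresses.

    `memory` maps an address to the 32-bit word stored there. Walking
    stops at the first frame pointer that is outside the stack range,
    unreadable, or does not move towards the stack base.
    """
    return_addresses = []
    while len(return_addresses) < max_frames:
        if not (stack_low <= frame_pointer < stack_high):
            break  # pointer left the stack: treat as garbage
        if frame_pointer not in memory or frame_pointer + 4 not in memory:
            break  # unreadable memory
        return_addresses.append(memory[frame_pointer + 4])
        next_fp = memory[frame_pointer]
        if next_fp <= frame_pointer:
            break  # chain must move towards the stack base
        frame_pointer = next_fp
    return return_addresses


# ChildEBP -> (saved frame pointer, return address), from Comment 6.
memory = {
    0x0258FBA4: 0x0258FBAC, 0x0258FBA8: 0x00B11960,
    0x0258FBAC: 0x0258FBD0, 0x0258FBB0: 0x011064F8,
    0x0258FBD0: 0x0258FBE4, 0x0258FBD4: 0x011062C3,
    0x0258FBE4: 0x41414141, 0x0258FBE8: 0x00DD9B0B,  # saved pointer is garbage (invented)
}
frames = walk_frames(memory, 0x0258FBA4, 0x02580000, 0x02590000)
print([hex(a) for a in frames])
```

The walk recovers four return addresses and then stops at the corrupted link. None of this helps when the dump file is empty, because then there is no stack memory to read at all.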
(Reporter)

Comment 10

6 years ago
I re-read my last Post.


Where I said this *"?"*:

'Can we not move the *"crash-processing side"* earlier in main() so it can set up its variables and be ready to catch _anything_ that comes after it ?'

I should have said this *"?"*:

'Can we not move the *"Breakpad side"* earlier in main() so it can set up its variables and be ready to catch _anything_ that comes after it ?'


I mixed up that the "crash-processing side" is on the Server and that the "Breakpad side" is in the Browser. Substituting the correct terminology my point stands.

Comment 11

6 years ago
Your analysis is incorrect. Exception handling is initialized very early in the process startup, way before we start executing any JavaScript. If you get the "Mozilla Crash Reporter" dialog, then we have initialized crash reporting. The problem here is as I suspected earlier, an out-of-memory crash. When that happens, Microsoft's minidump writing library does not function correctly and we get an empty minidump. We are catching the crash, we're just unable to write a useful dump from it.
(Reporter)

Comment 12

6 years ago
Created attachment 569473 [details]
Windbg Log of Nightly's startup and crash before starting.

I am not an expert on Windbg or Firefox but I think there are some troubling things within:

-----

.....
021fffa4 012c198a kernel32!WaitForSingleObject+0x12
021fffb4 7c80b729 xul!google_breakpad::ExceptionHandler::ExceptionHandlerThreadMain(void * lpParameter = 0x00000000)+0x13 [e:\builds\moz2_slave\m-cen-w32-ntly\build\toolkit\crashreporter\google-breakpad\src\client\windows\handler\exception_handler.cc @ 412]
021fffec 00000000 kernel32!BaseThreadStart+0x37
.....

Problem: Should not be 0x0 ? -- lpParameter = 0x00000000

-----

.....
0247fdf4 01283114 kernel32!GetQueuedCompletionStatus+0x29
0247fe20 012830ba xul!base::MessagePumpForIO::GetIOItem(unsigned long timeout = 0xffffffff, struct base::MessagePumpForIO::IOItem * item = 0x0247fe38)+0x37 [e:\builds\moz2_slave\m-cen-w32-ntly\build\ipc\chromium\src\base\message_pump_win.cc @ 536]
0247fe48 0129ac20 xul!base::MessagePumpForIO::WaitForIOCompletion(unsigned long timeout = 0xffffffff, class base::MessagePumpForIO::IOHandler * filter = 0x00000000)+0x24 [e:\builds\moz2_slave\m-cen-w32-ntly\build\ipc\chromium\src\base\message_pump_win.cc @ 507]
0247fe54 0126645d xul!base::MessagePumpForIO::WaitForWork(void)+0x1c [e:\builds\moz2_slave\m-cen-w32-ntly\build\ipc\chromium\src\base\message_pump_win.cc @ 501]
.....

Problem: Infinite Timeout length (in many places) -- timeout = 0xffffffff

-----

.....
0258fef4 010e166e xul!nsCycleCollectorRunner::Run(void)+0x39 [e:\builds\moz2_slave\m-cen-w32-ntly\build\xpcom\base\nscyclecollector.cpp @ 3492]
0258ff1c 012476f6 xul!nsThread::ProcessNextEvent(bool mayWait = <Memory access error>, bool * result = <Memory access error>)+0x15e [e:\builds\moz2_slave\m-cen-w32-ntly\build\xpcom\threads\nsthread.cpp @ 637]
0258ff44 10002470 xul!nsThreadStartupEvent::Run(void)+0x25 [e:\builds\moz2_slave\m-cen-w32-ntly\build\xpcom\threads\nsthread.cpp @ 201]
.....

Problem: (many) Variables set to unallocated Memory locations -- mayWait = <Memory access error>

-----

.....
0a63ff44 10002470 xul!nsHostResolver::ThreadFunc(void * arg = 0x020feaf0)+0x2e [e:\builds\moz2_slave\m-cen-w32-ntly\build\netwerk\dns\nshostresolver.cpp @ 906]
0a63ff6c 10001b4d nspr4!_PR_NativeRunThread(void * arg = 0x082b7d40)+0x120 [e:\builds\moz2_slave\m-cen-w32-ntly\build\nsprpub\pr\src\threads\combined\pruthr.c @ 448]
0a63ff74 781329bb nspr4!pr_root(void * arg = 0x78132a47)+0xd [e:\builds\moz2_slave\m-cen-w32-ntly\build\nsprpub\pr\src\md\windows\w95thred.c @ 122]
.....
0a05ff44 10002470 xul!nsThreadStartupEvent::Run(void)+0x25 [e:\builds\moz2_slave\m-cen-w32-ntly\build\xpcom\threads\nsthread.cpp @ 201]
0a05ff60 78132e24 nspr4!_PR_NativeRunThread(void * arg = 0x00000004)+0x120 [e:\builds\moz2_slave\m-cen-w32-ntly\build\nsprpub\pr\src\threads\combined\pruthr.c @ 448]
0a05ff6c 10001b4d MSVCR80!_getptd_noexit(void)+0x72 [f:\dd\vctools\crt_bld\self_x86\crt\src\tidtable.c @ 633]
.....

Problem: Argument Values suspect:
nspr4!_PR_NativeRunThread(void * arg = 0x082b7d40)+0x120 .....
vs.
nspr4!_PR_NativeRunThread(void * arg = 0x00000004)+0x120 .....

-----

.....
ChildEBP RetAddr  
0013eae8 108d2d0d mozalloc!mozalloc_abort(char * msg = 0x108e39fc "???")+0x2b [e:\builds\moz2_slave\m-cen-w32-ntly\build\memory\mozalloc\mozalloc_abort.cpp @ 77]
0013ef10 108e39fc xul!NS_DebugBreak_P(unsigned int aSeverity = 3, char * aStr = 0x10d00278 "terminating child process", char * aExpr = 0x00000000 "", char * aFile = 0x10d00228 "e:/builds/moz2_slave/m-cen-w32-ntly/build/dom/plugins/ipc/PluginModuleChild.cpp", int aLine = 0n536)+0x1da [e:\builds\moz2_slave\m-cen-w32-ntly\build\xpcom\base\nsdebugimpl.cpp @ 345]
0013ef28 107bc5cb xul!mozilla::plugins::PluginModuleChild::ShouldContinueFromReplyTimeout(void)+0x18 [e:\builds\moz2_slave\m-cen-w32-ntly\build\dom\plugins\ipc\pluginmodulechild.cpp @ 536]
0013ef34 108f73ff xul!mozilla::ipc::SyncChannel::ShouldContinueFromTimeout(void)+0x18 [e:\builds\moz2_slave\m-cen-w32-ntly\build\ipc\glue\syncchannel.cpp @ 265]
.....

Problem: Strings not being passed or a new message is not developed from the old one:
mozalloc_abort(char * msg = 0x108e39fc "???")+0x2b
NS_DebugBreak_P(unsigned int aSeverity = 3, char * aStr = 0x10d00278 "terminating child process" ...

NS_DebugBreak_P has a readable message to pass but mozalloc_abort only has "???"(+0x2b).

-----

..... (Line 3339)
0013cb04 10002a86 logexts!LogHook+0x17
0013cb20 010bc3df nspr4!PR_EnterMonitor(struct PRMonitor * mon = 0x020ee3f0)+0x16 [e:\builds\moz2_slave\m-cen-w32-ntly\build\nsprpub\pr\src\threads\prmon.c @ 96]
0013cb3c 01141446 xul!nsXPCWrappedJS::Release(void)+0x1f [e:\builds\moz2_slave\m-cen-w32-ntly\build\js\xpconnect\src\xpcwrappedjs.cpp @ 202]
0013cc30 010a2ead xul!nsXPCWrappedJSClass::DelegatedQueryInterface(class nsXPCWrappedJS * self = 0x04c20500, struct nsID * aIID = 0x01c1d9e4, void ** aInstancePtr = 0x0013cc7c)+0x536 [e:\builds\moz2_slave\m-cen-w32-ntly\build\js\xpconnect\src\xpcwrappedjsclass.cpp @ 781]
0013cc4c 010ec798 xul!nsXPCWrappedJS::QueryInterface(struct nsID * aIID = <Memory access error>, void ** aInstancePtr = <Memory access error>)+0xcd [e:\builds\moz2_slave\m-cen-w32-ntly\build\js\xpconnect\src\xpcwrappedjs.cpp @ 157]
0013cc64 010faaf7 xul!nsQueryReferent::operator()(struct nsID * aIID = 0x0df2094a, void ** answer = <Memory access error>)+0x38 [e:\builds\moz2_slave\m-cen-w32-ntly\build\obj-firefox\xpcom\build\nsweakreference.cpp @ 88]
0013cc74 01090fb9 xul!nsCOMPtr_base::assign_from_helper(class nsCOMPtr_helper * helper = 0x0df2094a, struct nsID * iid = <Memory access error>)+0x17 [e:\builds\moz2_slave\m-cen-w32-ntly\build\obj-firefox\xpcom\build\nscomptr.cpp @ 150]
0013cca0 01090f82 xul!nsDocLoader::FireOnStatusChange(class nsIWebProgress * aWebProgress = <Memory access error>, class nsIRequest * aRequest = <Memory access error>, unsigned int aStatus = <Memory access error>, wchar_t * aMessage = <Memory access error>)+0xb9 [e:\builds\moz2_slave\m-cen-w32-ntly\build\uriloader\base\nsdocloader.cpp @ 1456]
0013ccc8 01090f82 xul!nsDocLoader::FireOnStatusChange(class nsIWebProgress * aWebProgress = 0x010b5324, class nsIRequest * aRequest = 0x0c3f8404, unsigned int aStatus = 0x1efcec30, wchar_t * aMessage = 0x00000000 "")+0x82 [e:\builds\moz2_slave\m-cen-w32-ntly\build\uriloader\base\nsdocloader.cpp @ 1471]
0013ccf0 010b7339 xul!nsDocLoader::FireOnStatusChange(class nsIWebProgress * aWebProgress = 0x00740069, class nsIRequest * aRequest = 0x006e0069, unsigned int aStatus = 0x200067, wchar_t * aMessage = 0x006f0066 "--- memory read error at address 0x006f0066 ---")+0x82 [e:\builds\moz2_slave\m-cen-w32-ntly\build\uriloader\base\nsdocloader.cpp @ 1471]
0013cd24 010cbfff xul!nsDocLoader::OnStopRequest(class nsIRequest * aRequest = 0x00200067, class nsISupports * aCtxt = 0x006f0066, unsigned int aStatus = 0x200072)+0x2a9 [e:\builds\moz2_slave\m-cen-w32-ntly\build\uriloader\base\nsdocloader.cpp @ 725]
0013cd60 010b5324 xul!nsDocLoader::QueryInterface(struct nsID * aIID = 0x0013cd74, void ** aInstancePtr = 0x01bb99e4)+0xbf [e:\builds\moz2_slave\m-cen-w32-ntly\build\uriloader\base\nsdocloader.cpp @ 288]
0013cd9c 008f3574 xul!nsLoadGroup::RemoveRequest(class nsIRequest * request = 0x01bb99e4, class nsISupports * ctxt = 0x0013cd90, unsigned int aStatus = 0x13cdc4)+0xf4 [e:\builds\moz2_slave\m-cen-w32-ntly\build\netwerk\base\src\nsloadgroup.cpp @ 731]
0013cdb8 011000d7 mozutils!je_free(void * ptr = <Memory access error>)+0xd4 [e:\builds\moz2_slave\m-cen-w32-ntly\build\memory\jemalloc\jemalloc.c @ 6263]
0013cdc4 01057a30 xul!nsCOMPtr_base::assign_with_AddRef(class nsISupports * rawPtr = 0x0df2094a)+0x27 [e:\builds\moz2_slave\m-cen-w32-ntly\build\obj-firefox\xpcom\build\nscomptr.cpp @ 90]
..... a dozen lines .....
0013f124 01216d33 xul!nsAppStartup::Run(void)+0x1e [e:\builds\moz2_slave\m-cen-w32-ntly\build\toolkit\components\startup\nsappstartup.cpp @ 229]
0013f4a8 004017e1 xul!XRE_main(int argc = 0n1, char ** argv = 0x003e2f78, struct nsXREAppData * aAppData = 0x02017140)+0xdf5 [e:\builds\moz2_slave\m-cen-w32-ntly\build\toolkit\xre\nsapprunner.cpp @ 3577]
0013ff7c 00401b10 firefox!wmain(int argc = <Memory access error>, wchar_t ** argv = <Memory access error>)+0x7e1 [e:\builds\moz2_slave\m-cen-w32-ntly\build\toolkit\xre\nswindowswmain.cpp @ 107]
0013ffc0 7c817077 firefox!__tmainCRTStartup(void)+0x10f [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 594]
0013fff0 00000000 kernel32!BaseProcessStart+0x23

Problems:
1. wmain(int argc = <Memory access error>, wchar_t ** argv = <Memory access error>)+0x7e1
2. je_free(void * ptr = <Memory access error>)+0xd4 -- Freeing memory not allocated.
3. QueryInterface(struct nsID * aIID = <Memory access error>, void ** aInstancePtr = <Memory access error>)+0xcd -- Variables aIID and aInstancePtr = <Memory access error>, IE: unallocated.

-----

.....
030ffe88 10001013 nspr4!_PR_MD_PR_POLL(struct PRPollDesc * pds = 0x0209b020, int npds = 0n3, unsigned int timeout = 0x3e7fc18)+0x246 [e:\builds\moz2_slave\m-cen-w32-ntly\build\nsprpub\pr\src\md\windows\w32poll.c @ 279]
030ffe94 011c38a6 nspr4!PR_Poll(struct PRPollDesc * pds = 0x00000000, int npds = 0n0, unsigned int timeout = 0)+0x13 [e:\builds\moz2_slave\m-cen-w32-ntly\build\nsprpub\pr\src\io\prio.c @ 173]
030ffeb8 011c36d2 xul!nsSocketTransportService::Poll(bool wait = false, unsigned int * interval = 0x00000000)+0x56 [e:\builds\moz2_slave\m-cen-w32-ntly\build\netwerk\base\src\nssockettransportservice2.cpp @ 415]
030ffed4 011a80f0 xul!nsSocketTransportService::DoPollIteration(bool wait = <Memory access error>)+0xa2 [e:\builds\moz2_slave\m-cen-w32-ntly\build\netwerk\base\src\nssockettransportservice2.cpp @ 726]
030ffef4 010e166e xul!nsSocketTransportService::Run(void)+0xa0 [e:\builds\moz2_slave\m-cen-w32-ntly\build\netwerk\base\src\nssockettransportservice2.cpp @ 634]
.....

Problem: 
Poll(bool wait = false, unsigned int * interval = 0x00000000 ....
and 
_PR_MD_PR_POLL(struct PRPollDesc * pds = 0x0209b020, int npds = 0n3, unsigned int timeout = 0x3e7fc18 .....

Very short poll interval with a lengthy Timeout -- explains hanging of some Tabs while starting.

-----

(Line 4868)
.....
4:006> |* !analyze -v -f
*******************************************************************************
*                                                                             *
*                        Exception Analysis                                   *
*                                                                             *
*******************************************************************************


FAULTING_IP: 
mozalloc!mozalloc_abort+2e [e:\builds\moz2_slave\m-cen-w32-ntly\build\memory\mozalloc\mozalloc_abort.cpp @ 87]
007c1932 8b08            mov     ecx,dword ptr [eax]

EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)
.exr 0xffffffffffffffff
ExceptionAddress: 007c1932 (mozalloc!mozalloc_abort+0x0000002e)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: 00000000
Attempt to read from address 00000000

FAULTING_THREAD:  00001508

DEFAULT_BUCKET_ID:  NULL_POINTER_READ

PROCESS_NAME:  plugin-container.exe

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
.....

-----

There is a lot more. Ultimately the Browser crashed without popping up the Error Reporter. 

With my old, slow Computer this took over two hours to start and crash (whereas it only takes two minutes without Windbg); many Threads seemed to be waiting on another to clear memory so they could get some, and some freed memory that was not allocated. Lots of Functions used Variables that had not come from allocated memory.

It is as though there are no mutexes, just let the OS, the plugin-container.exe, JavaScript, or the 'Program proper' request Memory, by any method (from the OS, malloc() or an unallocated Variable), and don't check if the call succeeded; then free it, even if you don't own it.


Fingers seem to point at e:\builds\moz2_slave\m-cen-w32-ntly\build\nsprpub\pr\src\threads\prmon.c and e:\builds\moz2_slave\m-cen-w32-ntly\build\memory\mozalloc\mozalloc_abort.cpp but I'll leave it to the experts (and someone who has the Code in front of them).

Thanks,
Rob
(Reporter)

Comment 13

6 years ago
(In reply to Ted Mielczarek [:ted, :luser] from comment #11)
> Your analysis is incorrect. Exception handling is initialized very early in
> the process startup, way before we start executing any JavaScript. If you
> get the "Mozilla Crash Reporter" dialog, then we have initialized crash
> reporting. The problem here is as I suspected earlier, an out-of-memory
> crash. When that happens, Microsoft's minidump writing library does not
> function correctly and we get an empty minidump. We are catching the crash,
> we're just unable to write a useful dump from it.

I do have 4G of real memory and 4G of swap.

I can open the exact same Session using Aurora (and FF "Release"); _sometimes_ (usually) I can simply try again (the same Session) and Nightly will work -- when _if_ _anything_ I would probably have even less Memory (due to the prior attempt).

I can run other Programs (like Internet Explorer or a Disk Defragmenter) and get Nightly to start, but occasionally it will crash, thus the Bug is intermittent -- so something is not as failsafe as it could be (it should run with the same result, go OOM or not, each time).


If the "Breakpad side" could catch this ALWAYS and pop up the "Mozilla Crash Reporter" dialog ALWAYS, and it would save a helpful Stack trace, then Stackwalk.sh could say where the problem was instead of piling it in the null pile (along with other (different) crashes).

I'm sure we all wish this worked better, I'm not blaming you Ted.
(Reporter)

Comment 14

6 years ago
I again had the problem where I started Nightly (without Windbg) and it crashed; I simply restarted it and it worked.

Here is one "Null Signature": bp-f5311f9a-f421-4fa8-af3e-b660e2111025

and here are two 

"[@ mozalloc_abort(char const* const) | NS_DebugBreak_P | mozilla::plugins::PluginModuleChild::ShouldContinueFromReplyTimeout() ]": bp-af631221-5284-4d27-8ec1-74e322111025 and 
bp-3f3901d8-9cf5-43bf-9520-b31762111025



How about a different ('fail-resistant') Error Reporting Mechanism.


In other Bug Reports (EG: Bug 407981) some Users complained that if they try to restart the Browser immediately (and some used to complain about starting a minute later, but that was fixed) the Browser says an instance of itself is still running and to close that first.


How about:
1. When we start the Browser it detects if there are any running instances and if there are none then that instance BECOMES the "Error Reporter".
2. The "Error Reporter" spawns the 'Browser Proper' with a "Start Code" (EG: 1234 or anything not = 0 or 0xffff......, etc.).
3. The Browser runs (correctly, hopefully).
4. The Browser exits correctly and returns Error Code 163840 (A0 K) if it exits without error.
5. IF the Browser exits with an error it returns the Address of the Stack.
6. The Error Reporter saves the Stack the same way it does now and sends it off to Stackwalk.sh _OR_ sends Stackwalk a pre-arranged "Error Code" indicating that it NEVER got control back from the Browser.
7. The Error Reporter exits (and hopefully uses exit(0) to indicate that the Error Reporter did not have an error).

That way the Error Reporter works for sure (because it runs first) and if it gets an error code back its Code is not corrupted so it can work.

If the Error Reporter doesn't get a "deadman switch" message (watchdog call) once every 5 minutes it knows the Browser is off in never-neverland and can report a lost child.
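
The parent-process arrangement proposed above can be sketched as a small supervisor. This is an illustration under stated assumptions and not how Firefox or Breakpad work: `supervise` and the heartbeat file are invented for the example, a clean exit is signalled with the conventional exit code 0 rather than the value suggested in step 4, and step 5 (handing back a Stack address) is not modelled.

```python
import os
import subprocess
import sys
import time


def supervise(command, heartbeat_path, heartbeat_timeout=300.0, poll_interval=0.1):
    """Run `command` as a child process and report how it ended.

    Returns ("clean", 0), ("crashed", exit_code) or ("hung", None).
    The child is expected to touch `heartbeat_path` at least once every
    `heartbeat_timeout` seconds (the "deadman switch").
    """
    child = subprocess.Popen(command)
    last_beat = time.time()
    while True:
        code = child.poll()
        if code is not None:
            return ("clean", code) if code == 0 else ("crashed", code)
        try:
            # A newer modification time counts as a heartbeat.
            last_beat = max(last_beat, os.path.getmtime(heartbeat_path))
        except OSError:
            pass  # no heartbeat written yet
        if time.time() - last_beat > heartbeat_timeout:
            child.kill()
            child.wait()
            return ("hung", None)
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Stand-in for the 'Browser Proper': a child that exits with code 7.
    print(supervise([sys.executable, "-c", "import sys; sys.exit(7)"], "heartbeat.tmp"))
```

The supervisor can say that the child died and with which exit code, but it still has no stack to report unless the child manages to write one before it goes down.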
(Reporter)

Comment 15

6 years ago
This Bug _might_ be caused by Bug 680130 (it is deliberate to abort if Nightly gets too much load, but that mechanism is removed from Aurora and "Release").


PS: Ted, this _was_ closed (INVALID) on my last Post and I did NOT reopen it again - though I do not object to it being reopened. 

There _must_ be some way to fix this - if we were _truly_ "lost in space" we would BSOD and not return to the OS, thus we must have come from somewhere. If we could only preserve (at least) part of the Stack we would at least know a portion of our journey.


Call For: A MIRO Build -- (Mudflap Improved with Referent Objects).
If it helps, here is another crash on Firefox 7.0.1 Win7 with NULL Signature;

https://crash-stats.mozilla.com/report/index/bp-5c0b1fb5-0839-497e-80ed-740c42111106
Also occurs on Native Fennec where we get null signatures:
https://crash-stats.mozilla.com/report/index/25fb4408-7ff6-4c8c-b255-feed32111101
Status: UNCONFIRMED → NEW
Ever confirmed: true
(Reporter)

Comment 18

6 years ago
In Firefox 11.0a1 there _might_ have been a small improvement to this Bug.


In bp-966f09c1-c9b8-4d59-ae4f-9bc162111109 it says:
"
Signature	EMPTY: no crashing thread identified; corrupt dump
"
.

The "Crash Signature:" would therefore be:
[@ EMPTY: no crashing thread identified; corrupt dump ] .

This is slightly better as it differentiates itself from the Cases that do not EVEN go that far (though this _might_ have been done on Socorro's end and not in Firefox 11.0a1 itself).

Avoiding a "Null Signature" altogether (either by adding a handshake (and resend) or trying the Mechanism described in Comment 14) would completely fix this Bug (AFAIC).
The signature change was a Socorro change: bug 678865.
Another crash just after updating firefox 7.0.1 to firefox 8 under Win7:

https://crash-stats.mozilla.com/report/index/bp-c2d495a0-36d2-49b5-8b15-9b5502111111
(Reporter)

Comment 21

6 years ago
Here is a perfect example of where Socorro claims "empty" data but in fact there is some (it did not know where to look) !


See bp-d9ae1f93-06be-44ba-b150-f0bad2111111, here is the "data" it did not find:


Signature 	EMPTY: no crashing thread identified; corrupt dump
UUID 		d9ae1f93-06be-44ba-b150-f0bad2111111
Date Processed 	2011-11-11 17:43:43.36487
.....
Crash Reason	
Crash Address	
User Comments 	Crashed on startup with over 250 Tabs open. Simply using the [Restart Firefox] Button ...
App Notes  	AdapterVendorID: 10de, AdapterDeviceID: 0615, AdapterSubsysID: 210319da, AdapterDriverVersion: 6.14.12.8026
 		Has dual GPUs. GPU #2: AdapterVendorID2: 10de, AdapterDeviceID2: 0241, AdapterSubsysID2: 2a3a103c, AdapterDriverVersion2: 8.2.0.5D3D10 Layers? D3D10 Layers-
 		D3D9 Layers? D3D9 Layers+
 		WebGL? EGL? EGL+
 		GL Context? GL Context+
 		WebGL+
 		Failed to create temporary texture in system memory. Error code: 2147942414
Processor Notes 	/data/socorro/stackwalk/bin/stackwalk.sh returned no header lines for reportid: 313405081; No thread was identified as the cause of the crash; No signature could be created because we do not know which thread crashed; No frame data available; /data/socorro/stackwalk/bin/stackwalk.sh failed with return code 1 when processing dump d9ae1f93-06be-44ba-b150-f0bad2111111
EMCheckCompatibility	False
.....



Note the line "Failed to create temporary texture in system memory. Error code: 2147942414", that is our "Data" and it explains the "Signature" being a Null (which is an error).


We need to fully peruse all Data in the Report and not lump everything to /dev/null if we claim to not understand it.


This _could_ have been given a Signature something like:
"
[@ EXTERNAL: no crashing thread identified in Firefox; Found: "Failed to create temporary texture in system memory. Error code: 2147942414" in "App Notes"] 
"
that would avoid lumping THIS particular identifiable crash with other unidentifiable crashes (or ones that _could_ have been identified but the Socorro Server was offline).
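
The fallback suggested here could look roughly like the following sketch. It is hypothetical and not Socorro code: `fallback_signature` and the `FAILURE_MARKERS` list are invented for illustration, and the output format only approximates the example signature above. Incidentally, 2147942414 is 0x8007000E in hexadecimal, the Windows HRESULT E_OUTOFMEMORY, which is consistent with the out-of-memory explanation given earlier in this thread.

```python
EMPTY_SIGNATURE = "EMPTY: no crashing thread identified; corrupt dump"

# Hypothetical markers; a real list would have to come from the people
# who write App Notes.
FAILURE_MARKERS = ("failed to", "error code")


def fallback_signature(signature, app_notes):
    """Return a more specific signature for an empty-dump report if possible.

    A normal signature is returned unchanged. For an empty one, the first
    App Notes line containing a failure marker is folded into the signature.
    """
    if signature and not signature.startswith("EMPTY"):
        return signature
    for line in app_notes.splitlines():
        text = line.strip()
        if any(marker in text.lower() for marker in FAILURE_MARKERS):
            return 'EXTERNAL: no crashing thread identified; found "%s" in App Notes' % text
    return EMPTY_SIGNATURE


app_notes = """AdapterVendorID: 10de, AdapterDeviceID: 0615
D3D9 Layers? D3D9 Layers+
Failed to create temporary texture in system memory. Error code: 2147942414"""

print(fallback_signature("", app_notes))
print(hex(2147942414))  # the error code from App Notes, in hexadecimal
```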


Thanks,
Rob
The fix from bug 837835 / bug 943051 reduced the frequency of empty dumps. Otherwise this is not very actionable.
Status: NEW → RESOLVED
Last Resolved: 6 years ago → 4 years ago
Resolution: --- → INCOMPLETE