Closed Bug 588873 Opened 9 years ago Closed 6 years ago

Firefox 4.0b Crash [@ NS_AccumulateFastLoadChecksum(unsigned int*, unsigned char const*, unsigned int, int) ]

Categories

(Core :: XPCOM, defect, critical)

x86
Windows XP
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
Tracking Status
blocking2.0 --- -

People

(Reporter: chofmann, Unassigned)

References

Details

(Keywords: crash, regression, Whiteboard: not-ready)

Crash Data

Attachments

(1 file, 3 obsolete files)

we crested 400k user on b4 and starting to see some new signatures turn up.

this one in fast load code.

Correlation to startup or time of session
43 total crashes for NS_AccumulateFastLoadChecksum on 20100818-crashdata.csv
40 startup crashes inside 30 sec.
27 repeated crashes inside 3 min. of last crash

Stacks look like
http://crash-stats.mozilla.com/report/index/aa934437-6d71-4da5-8d60-8e6e82100819

Frame  	Module  	Signature [Expand]  	Source
0 	xul.dll 	NS_AccumulateFastLoadChecksum 	xpcom/io/nsFastLoadFile.cpp:172
1 	xul.dll 	nsFastLoadFileReader::ComputeChecksum 	xpcom/io/nsFastLoadFile.cpp:628
2 	xul.dll 	nsFastLoadFileReader::Open 	xpcom/io/nsFastLoadFile.cpp:908
3 	xul.dll 	NS_NewFastLoadFileReader 	xpcom/io/nsFastLoadFile.cpp:1167
4 	xul.dll 	nsFastLoadService::NewInputStream 	xpcom/io/nsFastLoadService.cpp:171
5 	xul.dll 	mozJSComponentLoader::StartFastLoad 	
6 	xul.dll 	mozJSComponentLoader::GlobalForLocation 	js/src/xpconnect/loader/mozJSComponentLoader.cpp:1274
7 	xul.dll 	mozJSComponentLoader::LoadModuleImpl 	js/src/xpconnect/loader/mozJSComponentLoader.cpp:793
8 	xul.dll 	mozJSComponentLoader::LoadModule 	js/src/xpconnect/loader/mozJSComponentLoader.cpp:728
9 	xul.dll 	nsComponentManagerImpl::KnownModule::Load 	xpcom/components/nsComponentManager.cpp:953
10 	xul.dll 	nsFactoryEntry::GetFactory 	xpcom/components/nsComponentManager.cpp:1942
11 	xul.dll 	nsComponentManagerImpl::CreateInstanceByContractID 	xpcom/components/nsComponentManager.cpp:1304
12 	xul.dll 	nsComponentManagerImpl::GetServiceByContractID 	xpcom/components/nsComponentManager.cpp:1670
13 	xul.dll 	CallGetService 	obj-firefox/xpcom/build/nsComponentManagerUtils.cpp:94
14 	xul.dll 	nsCOMPtr_base::assign_from_gs_contractid_with_error 	obj-firefox/xpcom/build/nsCOMPtr.cpp:141
15 	xul.dll 	nsAppStartupNotifier::Observe 	embedding/components/appstartup/src/nsAppStartupNotifier.cpp:100
16 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp:3465
17 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:120
18 	firefox.exe 	__tmainCRTStartup 	obj-firefox/memory/jemalloc/crtsrc/crtexe.c:591
19 	kernel32.dll 	BaseProcessStart

more at 

http://crash-stats.mozilla.com/report/list?product=Firefox&query_search=signature&query_type=contains&query=NS_AccumulateFastLoadChecksum%28unsigned&date=08%2F19%2F2010%2011%3A21%3A00&range_value=1&range_unit=weeks&hang_type=any&process_type=any&plugin_field=&plugin_query_type=&plugin_query=&do_query=1&admin=&signature=NS_AccumulateFastLoadChecksum%28unsigned%20int*%2C%20unsigned%20char%20const*%2C%20unsigned%20int%2C%20int%29

all     289490  43      0.000148537
4.0b3   15642   31      0.00198184
4.0b2   2871    11      0.00383142
4.0b1   1802    1       0.000554939

os breakdown
NS_AccumulateFastLoadChecksumTotal 43
Win5.1  0.91
Win6.0  0.02
Win6.1  0.07
blocking2.0: --- → ?
Keywords: crash, regression
OS: Mac OS X → Windows XP
Severity: normal → critical
I stared at this for a while, can't see what's causing it. The error means that we are somehow reading outside of the mmapped area, but given the existing bounds checks in there, I don't see how that can happen.
Depends on: 598416
Attached patch fix (obsolete) — Splinter Review
This should make this crash go away
Attachment #477697 - Flags: review?(bhsieh)
Comment on attachment 477697 [details] [diff] [review]
fix

Per IRC discussion, we're worried about allowing ComputeChecksum() to throw an exception. It'd be better to catch the exception inside the method, especially since it's exposed through the .idl
Attachment #477697 - Flags: review?(bhsieh) → review-
Assignee: nobody → tglek
blocking2.0: ? → betaN+
This is still happening in beta7 - lower on the list though ie: #200. I haven't found the signature in any of the latest crash reports on the trunk.
Attached patch use descriptor for checksum (obsolete) — Splinter Review
This adds exception_in_page guards to mmap-io and switches to descriptor io in ComputeChecksum which avoids this exception and provides faster startup performance.
Attachment #477697 - Attachment is obsolete: true
Attachment #494491 - Flags: review?
Attachment #494491 - Flags: review? → review?(ben.hsieh)
>+    if (rem) {
>+        NS_AccumulateFastLoadChecksum(&checksum,
>+                                      reinterpret_cast<PRUint8*>(buf),
>+                                      rem,
>+                                      PR_TRUE);
Is this last block necessary after the loop? Doesn't seems like other callsites worry about remainder left from NS_AFLC().

>+#ifdef XP_WIN
>+    nsAutoString name;
>+    rv = mFile->GetPath(name);
>+    NS_ENSURE_SUCCESS(rv, rv);
>+    HANDLE winFD = ::CreateFileW(name.get(), GENERIC_READ, FILE_SHARE_READ,
>+                                 NULL, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
>+    if (winFD == INVALID_HANDLE_VALUE)
>+        return NS_ERROR_FAILURE;
>+    fd = PR_ImportFile((PROsfd) winFD);
>+#else
How come OpenNSPRFileDesc doesn't work here? Maybe put a comment here.

r+ with above
Attachment #494491 - Flags: review?(ben.hsieh) → review+
(In reply to comment #7)
> >+    if (rem) {
> >+        NS_AccumulateFastLoadChecksum(&checksum,
> >+                                      reinterpret_cast<PRUint8*>(buf),
> >+                                      rem,
> >+                                      PR_TRUE);
> Is this last block necessary after the loop? Doesn't seems like other callsites
> worry about remainder left from NS_AFLC().

I brought back old reading logic here. Rest of the callsites don't have to worry about Read() underfill as they read in big enough chunks.

> 
> >+#ifdef XP_WIN
> >+    nsAutoString name;
> >+    rv = mFile->GetPath(name);
> >+    NS_ENSURE_SUCCESS(rv, rv);
> >+    HANDLE winFD = ::CreateFileW(name.get(), GENERIC_READ, FILE_SHARE_READ,
> >+                                 NULL, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);
> >+    if (winFD == INVALID_HANDLE_VALUE)
> >+        return NS_ERROR_FAILURE;
> >+    fd = PR_ImportFile((PROsfd) winFD);
> >+#else
> How come OpenNSPRFileDesc doesn't work here? Maybe put a comment here.
> 
> r+ with above

Ok, FILE_FLAG_SEQUENTIAL_SCAN is special sauce that turns on aggressive readahead in windows
http://hg.mozilla.org/mozilla-central/rev/9ddbf8ab23a5
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Depends on: 616704
I backed this out to investigate whether it caused bug 617048:
http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=c7b784648b02
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Based on the backouts, either this or bug 605672 caused bug 617048.  (My money is on this one, but not all *that* strongly.)

If you think it's not this one, feel free to try relanding *after* beta8 branches or is tagged, but make sure not to reland on the same day as bug 605672, and watch the crash stats the next day (e.g., on http://dbaron.org/mozilla/crashes-by-build ).
also looking at crash stats its not clear that the fix had impact in solving the original crash bug

sort this query by build id to see b8pre crashes on dec 7 with build id 20101207030325 which should have been inside the range with the patch from comment 9 was landed and wend comment 10 backed out the patch.

http://crash-stats.mozilla.com/report/list?product=Firefox&query_search=signature&query_type=contains&query=NS_AccumulateFastLoadChecksum%28unsigned&hang_type=any&process_type=any&plugin_field=&plugin_query_type=&plugin_query=&do_query=1&admin=&signature=NS_AccumulateFastLoadChecksum%28unsigned%20int*%2C%20unsigned%20char%20const*%2C%20unsigned%20int%2C%20int%29
(In reply to comment #12)
> also looking at crash stats its not clear that the fix had impact in solving
> the original crash bug
> 
> sort this query by build id to see b8pre crashes on dec 7 with build id
> 20101207030325 which should have been inside the range with the patch from
> comment 9 was landed and wend comment 10 backed out the patch.
> 
> http://crash-stats.mozilla.com/report/list?product=Firefox&query_search=signature&query_type=contains&query=NS_AccumulateFastLoadChecksum%28unsigned&hang_type=any&process_type=any&plugin_field=&plugin_query_type=&plugin_query=&do_query=1&admin=&signature=NS_AccumulateFastLoadChecksum%28unsigned%20int*%2C%20unsigned%20char%20const*%2C%20unsigned%20int%2C%20int%29

It did, but looks it did uncover another issue in the code. If look at the new crash, it's a different (and more rare) exception...the same one as present in firefox 3.6(ie before fasl was rewritten). I guess restoring the old algorithm also restored some old bug.

I doubt that this caused bug 617048 because it changed a low level detail in how fasl is read, the resulting structures should be identical, with and without this patch.
The only thing that I can tell is wrong with the old code is using memcpy instead of memmove(memory shouldn't overlap in memcpy). Wonder if that could cause issues.
The code was also storing return value of PR_Read() in a PRUint32
Attachment #494491 - Attachment is obsolete: true
Attachment #497382 - Flags: review+
blocking2.0: betaN+ → -
Attachment #497382 - Flags: approval2.0?
Please don't carry over review flags, it makes bugs confusing.
Attachment #497382 - Flags: approval2.0? → approval2.0+
http://hg.mozilla.org/mozilla-central/rev/30e1cfee3545

hopefully my revised patch takes care of the crash, closing for now.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
http://hg.mozilla.org/mozilla-central/rev/5634bfc97402 no dice
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
had a += typo instead of = in the previous patch :(
Attachment #497382 - Attachment is obsolete: true
Attachment #505204 - Flags: approval2.0?
Comment on attachment 505204 [details] [diff] [review]
use descriptor for checksum(fixed)

This bounced once and I don't think it's critical for the release.
Attachment #505204 - Flags: approval2.0? → approval2.0-
Comment on attachment 497382 [details] [diff] [review]
use descriptor for checksum(fixed)

(bookkeeping)
Attachment #497382 - Flags: approval2.0+
I don't feel comfortable taking this on the cedar branch.
Whiteboard: not-ready
Crash Signature: [@ NS_AccumulateFastLoadChecksum(unsigned int*, unsigned char const*, unsigned int, int) ]
Assignee: tglek → nobody
This seems to not happen in versions past 6.0.x - have we fixed this in some other way?
Crash Signature: [@ NS_AccumulateFastLoadChecksum(unsigned int*, unsigned char const*, unsigned int, int) ] → [@ NS_AccumulateFastLoadChecksum(unsigned int*, unsigned char const*, unsigned int, int) ] [@ NS_AccumulateFastLoadChecksum ]
this code was removed.
Fixed by code removal per comment 24.
Status: REOPENED → RESOLVED
Closed: 9 years ago6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.