Closed Bug 572011 Opened 10 years ago Closed 7 years ago

Crash @ nsDiskCacheStreamIO::FlushBufferToFile

Categories

(Core :: Networking: Cache, defect, critical)

x86
Windows XP
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla21
Tracking Status
firefox17 --- affected
firefox18 --- affected
firefox19 - ---
firefox20 + verified
firefox21 --- verified

People

(Reporter: Ricmacas, Assigned: michal)

References

Details

(Keywords: crash, topcrash, Whiteboard: [tbird crash])

Crash Data

Attachments

(1 file, 1 obsolete file)

Hello, one user intially reported this crash on SUMO Twitter

3.6.3 http://crash-stats.mozilla.com/report/index/219cdbde-e406-4f0f-8c60-00ccc2100612

But we asked the user to upgrade to 3.6.4 beta 7, and the crashes have now this signature (presented in summary):

3.6.4 http://crash-stats.mozilla.com/report/index/c21105ba-e160-4a63-92a1-b19732100613

I'll have more information when possible.
All crash reports are in clean profiles.
0  	ntdll.dll  	strstr  	
1 	xul.dll 	nsDiskCacheStreamIO::FlushBufferToFile 	netwerk/cache/src/nsDiskCacheStreams.cpp:731
2 	xul.dll 	nsDiskCacheStreamIO::Write 	netwerk/cache/src/nsDiskCacheStreams.cpp:587
3 	xul.dll 	nsDiskCacheOutputStream::Write 	netwerk/cache/src/nsDiskCacheStreams.cpp:257
4 	xul.dll 	nsCacheEntryDescriptor::nsOutputStreamWrapper::Write 	netwerk/cache/src/nsCacheEntryDescriptor.cpp:647
5 	xul.dll 	nsInputStreamTee::TeeSegment 	xpcom/io/nsInputStreamTee.cpp:80
6 	xul.dll 	nsInputStreamTee::WriteSegmentFun 	xpcom/io/nsInputStreamTee.cpp:109
7 	xul.dll 	nsPipeInputStream::ReadSegments 	xpcom/io/nsPipe3.cpp:799
8 	xul.dll 	nsInputStreamTee::ReadSegments 	xpcom/io/nsInputStreamTee.cpp:156
9 	xul.dll 	nsStreamLoader::OnDataAvailable 	netwerk/base/src/nsStreamLoader.cpp:180
10 	xul.dll 	nsStreamListenerTee::OnDataAvailable 	netwerk/base/src/nsStreamListenerTee.cpp:108
11 	xul.dll 	nsHttpChannel::OnDataAvailable 	netwerk/protocol/http/src/nsHttpChannel.cpp:5388
12 	xul.dll 	nsInputStreamPump::OnStateTransfer 	netwerk/base/src/nsInputStreamPump.cpp:508
13 	xul.dll 	nsInputStreamPump::OnInputStreamReady 	netwerk/base/src/nsInputStreamPump.cpp:398
14 	xul.dll 	nsInputStreamReadyEvent::Run 	xpcom/io/nsStreamUtils.cpp:191
15 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:527
16 	xul.dll 	mozilla::ipc::MessagePump::Run 	ipc/glue/MessagePump.cpp:142
Keywords: crash
Summary: Crash [@ strstr | nsDiskCacheStreamIO::FlushBufferToFile() ] → Crash [@ strstr][@ nsDiskCacheStreamIO::FlushBufferToFile() ]
Not a result of locked or corrupted files, because clean install didn't fix this issue.
(In reply to comment #2)
> Not a result of locked or corrupted files, because clean install didn't fix
> this issue.

What about a new Firefox profile and manually copying important data over?
Tobbi, 
> All crash reports are in clean profiles.
Besides, if you checked the crash reports, you could clearly see they are clean.
This user has clean profile, in a clean install.
Component: General → Networking: Cache
Product: Firefox → Core
QA Contact: general → networking.cache
Version: 3.6 Branch → 1.9.2 Branch
the 3.6.4 and 3.6.4 crash don't look like the share much common code on the stacks.  they could be two entirely different problems, but tickled by some underlying problem(s).   

The  nsDiskCacheStreamIO::FlushBufferToFile might have 6 different associated signatures

nsDiskCacheStreamIO::FlushBufferToFile signature breakdown
signature distribution
       6
signature list
 475 strstr | nsDiskCacheStreamIO::FlushBufferToFile()
  16 strstr | @0x1318d523 | nsDiskCacheStreamIO::FlushBufferToFile()
   9 nsDiskCacheStreamIO::FlushBufferToFile()
   1 strstr | @0x1310cfd3 | nsDiskCacheStreamIO::FlushBufferToFile()
   1 strlen | @0x7d50d53 | nsDiskCacheStreamIO::FlushBufferToFile()
   1 strlen | @0x4a40d53 | nsDiskCacheStreamIO::FlushBufferToFile()

Correlation to startup or time of session
503 total crashes for nsDiskCacheStreamIO::FlushBufferToFile on 20100614-crashdata.csv
138 startup crashes inside 30 sec.
228 startup crashes inside 3 min.
141 repeated crashes inside 3 min. of last crash

but they all seem to be happening on winXP

os breakdown
nsDiskCacheStreamIO::FlushBufferToFileTotal 9
Win5.1  1.00
Win6.0  0.00
Win6.1  0.00

and the problem stretches back across many releases.

checking --- nsDiskCacheStreamIO::FlushBufferToFile 20100614-crashdata.csv
found in: 3.6.3 3.5.9 3.6.4 3.0.19 3.6 3.6.2 3.0.4 3.5.1 3.0 3.5.7 3.5.5 3.5.3 3.5.2 3.5 3.0.7 3.0.11 3.0.10 3.0.1
release total-crashes
              nsDiskCacheStreamIO::FlushBufferToFile crashes
                         pct.
all     408812  503     0.00123039
3.6.3   270109  369     0.00136612
3.5.9   33612   52      0.00154707
3.6.4   50079   29      0.000579085
3.0.19  12129   16      0.00131915
3.6     11172   10      0.000895095
3.6.2   3238    5       0.00154416
3.0.4   541     5       0.00924214
3.5.1   439     3       0.00683371
3.0     1120    3       0.00267857
...


If the profiles are clean, one thing I'd suggest checking is for disk corruption on the systems where people are hitting this, and we should try to isolate to possible file system problems.
Well, both crash [@ strstr], the rest is unrelated...
User reports crashes in plugincontainer.exe in new profile and clean install, more info in a sec.
User claims he did a disk check and everything's fine.
He has scanned his system using Avira, Malware Bytes and Super Anti-Spyware
seen also in thunderbird - example bp-f1e0c3ae-f635-42db-9372-483252100625. some thunderbird also has stack variations:
strlen | @0x1910ae8 | nsDiskCacheStreamIO::FlushBufferToFile()
strlen | @0x2210ae8 | nsDiskCacheStreamIO::FlushBufferToFile()
strlen | @0x2ae0ae8 | nsDiskCacheStreamIO::FlushBufferToFile()

Odd, no thunderbird crashes with strlstr... sig prior to calendar date   Jun 07, 2010 13:18 for any version, even though there is a 3.0 crash bp-b0d17e78-09c7-480d-9880-db8c12100607 (and it's the only crash for that user - no other versions appear for that OS)

mod summary - socorro prefers  strstr | nsDiskCacheStreamIO::FlushBufferToFile()
Summary: Crash [@ strstr][@ nsDiskCacheStreamIO::FlushBufferToFile() ] → Crash [@ strstr][@ nsDiskCacheStreamIO::FlushBufferToFile() ] [@ strstr | nsDiskCacheStreamIO::FlushBufferToFile()]
Crash Signature: [@ strstr] [@ nsDiskCacheStreamIO::FlushBufferToFile() ] [@ strstr | nsDiskCacheStreamIO::FlushBufferToFile()]
Crash Signature: [@ strstr] [@ nsDiskCacheStreamIO::FlushBufferToFile() ] [@ strstr | nsDiskCacheStreamIO::FlushBufferToFile()] → [@ strstr] [@ nsDiskCacheStreamIO::FlushBufferToFile() ] [@ strstr | nsDiskCacheStreamIO::FlushBufferToFile()]
It's #26 top browser crasher in 10.0b6 while only #137 in 9.0.1.
It's #61 top browser crasher in 10.0.1, #28 in 11.0b2, and #39 in 12.0a2.
It might be related to bug 727097.

More reports at:
https://crash-stats.mozilla.com/report/list?signature=nsDiskCacheStreamIO%3A%3AFlushBufferToFile%28%29
Keywords: topcrash
Summary: Crash [@ strstr][@ nsDiskCacheStreamIO::FlushBufferToFile() ] [@ strstr | nsDiskCacheStreamIO::FlushBufferToFile()] → Crash @ nsDiskCacheStreamIO::FlushBufferToFile
nsDiskCacheStreamIO::FlushBufferToFile() is rare for thunderbird - roughly a dozen in a week
Whiteboard: [tbird crash]
It's now #4 top browser crasher in 10.0.2 and #6 in 11.0.

It is correlated to Yandex Toolbar and Спутник @Mail.Ru:
     38% (345/904) vs.   7% (4472/65399) {37964A3C-4EE8-47b1-8321-34DE2C39BA4D} (Спутник @Mail.Ru)
     38% (348/904) vs.   8% (5036/65399) yasearch@yandex.ru (Yandex.Bar, https://addons.mozilla.org/addon/3495)

It is also correlated to Windows Based Wireless Networks:
     34% (309/904) vs.   6% (3695/65399) wlanapi.dll
This stays constantly at high ranks. #7 in yesterday's data for 12.*

Any chance someone can find out what's going on there?
Now it's at #5 on 12.
And this is #3 on 13 Beta in yesterday's data. Any chance we can do something about it?
Version: 1.9.2 Branch → Trunk
Michal - Can you take a look at this?
Assignee: nobody → michal.novotny
Ted suggested that it might be a problem with LSP's based on the stack trace.
It's possible that these are not all the same crash, but I grabbed the minidump from https://crash-stats.mozilla.com/report/index/90b4f9bf-b59b-4671-af96-d61122120613 . WinDBG gives me the exact same stack as Breakpad. Running minidump_stackwalk locally shows me:
Thread 25 (crashed)
 0  0x16c26f    eip = 0x0016c26f   esp = 0x084ffde0   ebp = 0x084ffe10   ebx = 0xffe92500
    esi = 0x00176b01   edi = 0x00176b00   eax = 0x00000000   ecx = 0x07a6e000
    edx = 0xffffffff   efl = 0x00010246
    Found by: given as instruction pointer in context
1  0x15d30c
    eip = 0x0015d30d   esp = 0x084ffe18   ebp = 0x084ffe30
    Found by: previous frame's frame pointer

 2  xul.dll!nsDiskCacheStreamIO::FlushBufferToFile() [nsDiskCacheStreams.cpp:2b6
43ea8edf9 : 791 + 0xe]
    eip = 0x012a1be6   esp = 0x084ffe38   ebp = 0x084ffe5c
    Found by: previous frame's frame pointer

So it's not as though Breakpad is resorting to stack scanning here, it's using frame pointers. I also ran another tool I have that scans the stack looking for anything that looks like a return address (http://hg.mozilla.org/users/tmielczarek_mozilla.com/dump-lookup/), and it doesn't show anything more useful on the stack above FlushBufferToFile.

I also looked at the list of modules that we ever loaded in the process, and there didn't seem to be anything at the address where we're crashing.
This is #13 top crash on the list for 13.0.1.
This seems to be spiking again. I'm pretty sure it's connected to some third-party software as it's rising on multiple versions at the same time, but I wonder if there's something we can do here.
I have possibly some options to reduce or remove this crasher...
(In reply to Alfred Kayser from comment #25)
> I have possibly some options to reduce or remove this crasher...

Cool !!!
1. Remove 'offset' from nsDiskCacheStream::GetOutputStream (it is always zero). This simplifies the code, particular around the buffer handling.
2. Use buffer if and only aslong as the data fits in cacheblocks, otherwise directly write to the file (skipping the buffer) (the write to the file is buffered itself). This reduces the need for the buffer and chances of crashes.
3. Merge nsDiskCacheOutputStream with nsDiskCacheStreamIO: less malloc's and less code overall: reduces risk of crashing.
Depends on: 813935
Depends on: 814010
No longer depends on: 813935
Blocks: 405407
Here are some correlations from Firefox 17 that implicate Yandex in the crash on the Firefox side:

nsDiskCacheStreamIO::FlushBufferToFile()|EXCEPTION_ACCESS_VIOLATION_READ (228 crashes)
     43% (98/228) vs.   4% (1843/43394) yasearch@yandex.ru (Yandex.Bar, https://addons.mozilla.org/addon/3495)
     42% (96/228) vs.   4% (1728/43394) vb@yandex.ru
     37% (84/228) vs.   2% (947/43394) {37964A3C-4EE8-47b1-8321-34DE2C39BA4D}
     15% (35/228) vs.   0% (106/43394) {6236BA26-C117-4007-928C-DE0716C7FA78} (Chameleon Bob: layouts for Facebook, https://addons.mozilla.org/addon/11584)
     15% (35/228) vs.   0% (108/43394) {6236BA26-C117-4007-928C-DE0716C7FA48}
     15% (35/228) vs.   0% (109/43394) {6236BA26-C117-4007-928C-DE0716C7FA58}
     15% (35/228) vs.   0% (112/43394) {6236BA26-C117-4007-928C-DE0716C7FA68} (Express Tab, https://addons.mozilla.org/addon/14584)
     15% (35/228) vs.   0% (120/43394) {6236BA26-C117-4007-928C-DE0716C7FA38}
     15% (35/228) vs.   0% (153/43394) {4933189D-C7F7-4C6E-834B-A29F087BFD23}
     12% (28/228) vs.   2% (889/43394) jqs@sun.com (Java Quick Starter, http://java.sun.com/javase/downloads/)
     10% (23/228) vs.   0% (30/43394) plusmo@plusmo.ru
      9% (21/228) vs.   0% (25/43394) player@zaycev.fm
      6% (14/228) vs.   0% (18/43394) 501891365d7e4@501891365d81d.info
      6% (14/228) vs.   1% (399/43394) {dfefbe51-ca52-484b-adf0-6b158b05262d}
     98% (223/228) vs.  93% (40195/43394) {972ce4c6-7e08-4474-a285-3208198ce6fd} (Default, https://addons.mozilla.org/addon/8150)
      6% (14/228) vs.   1% (437/43394) {5911488E-9D1E-40ec-8CBB-06B231CC153F}
      6% (14/228) vs.   1% (476/43394) gtffxtbr@GamingWonderland.com
The nsDiskCacheStreamIO::FlushBufferToFile() signature is #21 on 18.0b3 and up to #11 on 17.0.1 release.

As the Yandex Bar add-on is completely JS-based (no binary code), any crash it triggers is very probably a problem in our code and we should investigate that, esp. with how much-used this add-on is in Russia.

sfink, you have done work with debugging crashes with this add-on in bug 770238, can you help here as well?
The fix in bug 814010 will likely fix this bug. It seems risky to uplift it in branches.
Longstanding issue, we'll look into uplifting bug 814010 for FF19.
Still causing crashes (although less than before).

The crash is at PR_Write with access to 0x0. As mFD cannot be zero at that point (see code before), the other suspect is mBuffer:
   if (PR_Write(mFD, mBuffer, mStreamEnd) != (int32_t)mStreamEnd) {

mBuffer could be zero when there is no buffer allocated (either failed to or first chunk is directly greater than 16k). In both cases, mStreamEnd will be zero, as there is nothing in the buffer.

So to prevent PR_Write crashing, a check around it with mBuffer != null and mStreamEnd > 0 will help preventing to crash here.
Attachment #700937 - Flags: review?(michal.novotny)
Comment on attachment 700937 [details] [diff] [review]
Don't crash on write of empty buffer

First, I think that this crash isn't caused by mBuffer being null. See comment #6 in bug #752266. Also see discussion in bug #801394. BTW, this bug is also Windows only problem.

Anyway, I agree that we should make sure that we don't pass null to PR_Write(). But I don't think that we should try to survive malloc() failure, because if we are out of memory the application will crash sooner or later. So I propose to change malloc()/realloc() with infallible moz_xmalloc()/moz_xrealloc() and to place a check before PR_Write() that aborts Firefox (even in release build) in case mBuffer is null, something like this:

  if (!mBuffer) {
    NS_RUNTIMEABORT("Fix me!");
  }
Attachment #700937 - Flags: review?(michal.novotny) → review-
There can be multiple issues that can cause PR_Write to crash:
1. Failed to allocate mBuffer, because of lack of memory...
2. No mBuffer allocated because first part is > 16K, so the buffer allocate stage is skipped.
3. No mBuffr allocated because count = 0.

Number 1 is crisis-mode, but should just return OUT_OF_MEMORY from the write call, so that stream writer can handle it.

Number 2 and 3 should be protected within the code as in patch https://bugzilla.mozilla.org/attachment.cgi?id=700937.
Number 2 will happen when chunks larger than 16K are written in the first call to Write.

So, while number 1 is memory or virus or such related, number 2 and 3 are to be fixed by this patch.
Whiteboard: [tbird crash] → [tbird crash][startupcrash]
Depending on the crash signatures, 43% or 32% of crashes occur within one minute. It doesn't qualify this bug for the startupcrash whiteboard that requires more than 50% of crashes.
Whiteboard: [tbird crash][startupcrash] → [tbird crash]
Hmm, for Firefox Aurora, I did see the startup "rocket" symbol on it, that's what I told Alex in the channel meeting and he marked it that way.
I now see that it barely qualifies there: https://crash-stats.mozilla.com/report/list?range_value=7&range_unit=days&date=2013-01-25&signature=nsDiskCacheStreamIO%3A%3AFlushBufferToFile%28%29&version=Firefox%3A20.0a2
(In reply to Alfred Kayser from comment #35)
> Number 1 is crisis-mode, but should just return OUT_OF_MEMORY from the write
> call, so that stream writer can handle it.

As I wrote, use infallible memory allocators and don't try to handle OOM.

The rest of the patch is OK.
Attachment #700937 - Attachment is obsolete: true
Attachment #709359 - Flags: review?
Attachment #709359 - Flags: review? → review?(michal.novotny)
Attachment #709359 - Flags: review?(michal.novotny) → review+
https://hg.mozilla.org/mozilla-central/rev/d9c36d39f7bc
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla21
This needs to be applied to Aurora/Mozilla20 as it fixes a potential crash issue caused by the patch of bug 572011 which is applied to Mozilla20.
Target Milestone: mozilla21 → mozilla20
If tracking gets denied we may have to ask for approval of the patch. But the target milestone is still 21.
Target Milestone: mozilla20 → mozilla21
(In reply to Alfred Kayser from comment #43)
> This needs to be applied to Aurora/Mozilla20 as it fixes a potential crash
> issue caused by the patch of bug 572011 which is applied to Mozilla20.
Here it's already bug 572011, a long standing issue. Are you sure you posted in the right bug?
This is the right bug with the patch that needs to go into Mozilla20.
I meant to refer to bug 814010. See comment 19 in there.
BTW, "Target Milestone" means when it is supposed to be or was fixed on mozilla-central. If you want to indicate what other versions are affected, that's what the status-* flags are about (setting that for 20), tracking-* means that we should track this for a given train, if it's just about getting a patch uplifted, all that's needed is the relevant approval-* flag(s) on the patch (and switching the fitting status flag to fixed once the uplift lands).
Tracking - go ahead with nominating the patch for aurora uplift.
We'll be merging Aurora to the Beta channel next Tuesday so an uplift nomination and landing would be great to have before then so this crash doesn't hit our larger audience.
Flags: needinfo?(michal.novotny)
Comment on attachment 709359 [details] [diff] [review]
V2: Use moz_xmalloc

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 814010
User impact if declined: crash
Testing completed (on m-c, etc.): on m-c for a week now.
Risk to taking this patch (and alternatives if risky):  very low risk IMO.  Just a couple if (null) checks and switch to use moz_xmalloc from malloc.  Trivial.
String or UUID changes made by this patch: none.
Attachment #709359 - Flags: approval-mozilla-aurora?
Flags: needinfo?(michal.novotny)
Comment on attachment 709359 [details] [diff] [review]
V2: Use moz_xmalloc

Thank you - please go ahead with this low risk uplift.
Attachment #709359 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
No post-fix crashes for [@ nsDiskCacheStreamIO::FlushBufferToFile()] and [@ strstr]:
https://crash-stats.mozilla.com/report/list?product=Firefox&query_search=signature&query_type=contains&query=nsDiskCacheStreamIO%3A%3AFlushBufferToFile%28%29&reason_type=contains&date=03%2F14%2F2013%2009%3A30%3A48&range_value=4&range_unit=weeks&hang_type=any&process_type=any&do_query=1&signature=nsDiskCacheStreamIO%3A%3AFlushBufferToFile%28%29

https://crash-stats.mozilla.com/report/list?product=Firefox&query_search=signature&query_type=contains&query=strstr&reason_type=contains&date=03%2F14%2F2013%2009%3A30%3A49&range_value=4&range_unit=weeks&hang_type=any&process_type=any&do_query=1&signature=strstr

6 post-fix crashes for [@ strstr | nsDiskCacheStreamIO::FlushBufferToFile()]:
https://crash-stats.mozilla.com/report/list?product=Firefox&version=Firefox%3A22.0a1&version=Firefox%3A21.0a2&version=Firefox%3A20.0b4&version=Firefox%3A20.0b5&version=Firefox%3A20.0b3&version=Firefox%3A20.0b2&version=Firefox%3A20.0b1&query_search=signature&query_type=contains&query=strstr%20%7C%20nsDiskCacheStreamIO%3A%3AFlushBufferToFile%28%29&reason_type=contains&date=03%2F14%2F2013%2009%3A25%3A28&range_value=4&range_unit=weeks&hang_type=any&process_type=any&do_query=1&signature=strstr%20%7C%20nsDiskCacheStreamIO%3A%3AFlushBufferToFile%28%29

Definitely not a top crash anymore.
Status: RESOLVED → VERIFIED
I took a quick look at the six post-fix crashes, and I'm guessing that the problem is that mFD is corrupted somehow - at http://hg.mozilla.org/releases/mozilla-beta/annotate/64a66423dbd3/netwerk/cache/nsDiskCacheStreams.cpp#l706, FlushBufferToFile() calls PR_Write, which calls through a function pointer mFD->methods->write. If that's pointing somewhere odd, that would explain why strstr (or other seemingly unrelated symbols) shows up at the top of the stack.

The most likely candidate for corrupt mFD would be use-after-free (most likely in the form of use-after-close), but based on a quick review I'm not seeing an obvious place where that would happen.
You need to log in before you can comment on or make changes to this bug.