Closed Bug 976171 Opened 10 years ago Closed 5 years ago

crash in mozilla::net::CacheIOThread::LoopOneLevel(unsigned int)

Categories

(Core :: Networking: Cache, defect, P3)

All
Windows NT
defect

Tracking

()

RESOLVED WORKSFORME
mozilla32
Tracking Status
firefox47 --- affected
firefox48 --- affected
firefox-esr45 --- affected

People

(Reporter: mayhemer, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash, Whiteboard: [necko-backlog])

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-626494ac-f0c2-4a18-b28c-b1c862140223.
=============================================================

Given how simple the IO thread is, this seems more likely to be heap corruption from outside, or something actually wrong with the event (not the thread).
One valid report [1] in 4 weeks.  Maybe just a null check will do here.

[1] https://crash-stats.mozilla.com/report/index/74ada869-620f-4cc1-84dc-b99562140327
Attached patch v1 (obsolete) — Splinter Review
The bug may already be fixed, but we should ensure that no null runnables are queued and later executed.
Assignee: nobody → honzab.moz
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Attachment #8413837 - Flags: review?(michal.novotny)
Comment on attachment 8413837 [details] [diff] [review]
v1

Review of attachment 8413837 [details] [diff] [review]:
-----------------------------------------------------------------

Passing nullptr is a misuse of the dispatching methods. Instead of returning an error in DispatchInternal(), add an assertion to the methods that call it, i.e. to CacheIOThread::DispatchAfterPendingOpens() and CacheIOThread::Dispatch().
Attachment #8413837 - Flags: review?(michal.novotny) → review-
Attached patch v2 (Splinter Review)
- MOZ_ASSERTS added to the top level methods to catch this when we are in debug
- non-null check left to actually fix/prevent unnecessary crashes in production that won't tell us anything anyway
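The split described above (assert in the public entry point, keep a runtime check underneath) can be sketched roughly as follows. This is not the landed patch: `SketchIOThread`, `DispatchResult` and `RunPending()` are hypothetical stand-ins, plain `assert` stands in for MOZ_ASSERT, and placing the release-build check in `DispatchInternal()` is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-ins for illustration only; these are not the Gecko types.
using Runnable = std::function<void()>;

enum class DispatchResult { Ok, InvalidArg };

class SketchIOThread {
 public:
  // Top-level entry point: assert so that misuse is caught in debug builds.
  DispatchResult Dispatch(Runnable aRunnable) {
    assert(aRunnable && "null runnable passed to Dispatch");
    return DispatchInternal(std::move(aRunnable));
  }

  // Internal helper: the runtime check stays, so a release build (where the
  // assertion compiles away) never queues a null runnable.
  DispatchResult DispatchInternal(Runnable aRunnable) {
    if (!aRunnable) {
      return DispatchResult::InvalidArg;
    }
    mEvents.push_back(std::move(aRunnable));
    return DispatchResult::Ok;
  }

  // Runs everything queued so far and returns how many runnables executed.
  std::size_t RunPending() {
    std::deque<Runnable> events;
    events.swap(mEvents);
    std::size_t ran = 0;
    for (auto& event : events) {
      event();
      ++ran;
    }
    return ran;
  }

  std::size_t Pending() const { return mEvents.size(); }

 private:
  std::deque<Runnable> mEvents;
};
```

With this shape a debug build stops at the caller that passed the null, while a release build rejects the dispatch instead of crashing later in the loop.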
Attachment #8413837 - Attachment is obsolete: true
Attachment #8415835 - Flags: review?(michal.novotny)
Attachment #8415835 - Flags: review?(michal.novotny) → review+
https://hg.mozilla.org/mozilla-central/rev/c69333201bc7
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Looking at [1] I see that there are still some crashes:
- Firefox 32 Beta - 8 crashes ranging from 20140722030201 to 20140811180644
- Firefox 34 Nightly - 2 crashes: 20140722030201 and 20140810030204

Honza, is this acceptable, or does it need more work?
Flags: needinfo?(honzab.moz)
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #7)
> Looking at [1] I see that there are still some crashes:
> - Firefox 32 Beta - 8 crashes ranging from 20140722030201 to 20140811180644
> - Firefox 34 Nightly - 2 crashes: 20140722030201 and 20140810030204
> 
> Honza, is this acceptable, or does it need more work?

[1] https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=mozilla%3A%3Anet%3A%3ACacheIOThread%3A%3ALoopOneLevel%28unsigned+int%29#tab-reports
I suspect more work is needed here.  One thing that comes to mind is that some event has a broken reference count.  It could also be the result of heap corruption from completely unrelated code, but that is hard to track down.
Flags: needinfo?(honzab.moz)
Thanks Honza! I'm reopening this so it gets the needed attention.
Status: RESOLVED → REOPENED
Keywords: verifyme
Resolution: FIXED → ---
Status: REOPENED → NEW
Crash Signature: [@ mozilla::net::CacheIOThread::LoopOneLevel(unsigned int)] → [@ mozilla::net::CacheIOThread::LoopOneLevel(unsigned int)] [@ mozilla::net::CacheIOThread::LoopOneLevel]
Whiteboard: [necko-backlog]
Status:

Only one crash on aurora (47.0a2), 64 on release (45.0.1).  This is a low rate, but I would still like to figure out whether this is an issue in CacheIOThread or in the runnable.

It's not a duplicate of bug 1257611: the report
https://crash-stats.mozilla.com/report/index/354f7ec7-806f-4db3-a584-10ed52160405 comes from a build that already has the fix (https://hg.mozilla.org/releases/mozilla-aurora/log/a4481ccef67e/netwerk/cache2/CacheFileIOManager.cpp).
Depends on: 1277275
Probes from bug 1277275 show that the write queue in particular can get quite long.  Over two weeks on Nightly (50) the HTTP_CACHE_IO_QUEUE_WRITE probe collected 1630k samples overall, of which 334k (20%) show a backlog of more than 30 events and 400k (25%) a backlog of more than 300!

Just before WRITE we process MANAGEMENT, which has some 850k samples, 89k of them with a backlog of more than 300.  But on MANAGEMENT we don't do any IO, and according to the numbers this is just an accumulation of operations happening around opens and reads (the sum of all OPEN*/READ* ops is almost equal to the number of MANAGEMENT operations).

There is no data on how many sessions never go over a backlog of 30 events.

Anyway, this all shows that we can keep a lot of memory allocated to hold these queues alive (the suspected cause of THIS bug).  Possible solutions: more threads, a cache/net race, smaller write-op granularity, or priorities for the write queue as well, with a time limit after which a write operation is bypassed instead.
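To make the last idea concrete, here is a minimal sketch of a write queue with a time limit, under which an operation that has waited too long is bypassed rather than executed. Nothing here exists in the cache code: `BoundedWriteQueue`, `PendingWrite` and the age limit are made up for illustration, and timestamps are passed in explicitly so the behaviour is easy to follow.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical record of a queued write; not a real cache2 type.
struct PendingWrite {
  Clock::time_point queuedAt;
  std::function<void()> op;
};

class BoundedWriteQueue {
 public:
  explicit BoundedWriteQueue(Clock::duration aMaxAge) : mMaxAge(aMaxAge) {}

  // Queues a write together with the time it was queued.
  void Push(std::function<void()> aOp, Clock::time_point aNow) {
    assert(aOp && "empty write operation");
    mQueue.push_back(PendingWrite{aNow, std::move(aOp)});
  }

  // Runs every write still within the age limit and drops (bypasses) the
  // rest, so a long backlog is released instead of being kept alive.
  // Returns the number of bypassed writes.
  std::size_t Drain(Clock::time_point aNow) {
    std::size_t bypassed = 0;
    while (!mQueue.empty()) {
      PendingWrite write = std::move(mQueue.front());
      mQueue.pop_front();
      if (aNow - write.queuedAt > mMaxAge) {
        ++bypassed;
        continue;
      }
      write.op();
    }
    return bypassed;
  }

  std::size_t Size() const { return mQueue.size(); }

 private:
  Clock::duration mMaxAge;
  std::deque<PendingWrite> mQueue;
};
```

Whether dropping a write is acceptable depends on the entry (the content would then have to come from the network again next time), so this only illustrates the queue mechanics, not the policy.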
Crash volume for signature 'mozilla::net::CacheIOThread::LoopOneLevel':
 - nightly (version 50): 0 crashes from 2016-06-06.
 - aurora  (version 49): 0 crashes from 2016-06-07.
 - beta    (version 48): 56 crashes from 2016-06-06.
 - release (version 47): 282 crashes from 2016-05-31.
 - esr     (version 45): 37 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          0          0          0          0          0          0
 - aurora           0          0          0          0          0          0          0
 - beta            13         10          7          7         11          5          2
 - release         37         43         32         46         54         56          5
 - esr              3          1          1          4          3          5          8

Affected platforms: Windows, Mac OS X
Depends on: 1340141
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
See Also: → 1408075
Low rate, not working on this.
Assignee: honzab.moz → nobody
Keywords: stalled

I checked a few of the crash reports and they are only from very old branches. WFM!

Status: NEW → RESOLVED
Closed: 10 years ago → 5 years ago
Resolution: --- → WORKSFORME

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled