Closed Bug 976171 Opened 9 years ago Closed 3 years ago

crash in mozilla::net::CacheIOThread::LoopOneLevel(unsigned int)


(Core :: Networking: Cache, defect, P3)

Windows NT



Tracking Status
firefox47 --- affected
firefox48 --- affected
firefox-esr45 --- affected


(Reporter: mayhemer, Unassigned)


(Depends on 1 open bug)


(Keywords: crash, Whiteboard: [necko-backlog])

Crash Data


(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-626494ac-f0c2-4a18-b28c-b1c862140223.

Given the simplicity of the IO thread, this seems more likely to be heap corruption coming from outside, or something actually wrong with the event (not the thread) itself.
One valid report [1] in 4 weeks.  Maybe just a null check will do here.

Attached patch v1 (obsolete) — Splinter Review
The bug may already be fixed, but we should make sure no null runnables are added to the queue and executed later.
Assignee: nobody → honzab.moz
Ever confirmed: true
Attachment #8413837 - Flags: review?(michal.novotny)
Comment on attachment 8413837 [details] [diff] [review]

Review of attachment 8413837 [details] [diff] [review]:

Passing nullptr is a misuse of the dispatch methods. Instead of returning an error in DispatchInternal(), add an assertion to the methods that call it, i.e. to CacheIOThread::DispatchAfterPendingOpens() and CacheIOThread::Dispatch().
Attachment #8413837 - Flags: review?(michal.novotny) → review-
Attached patch v2Splinter Review
- MOZ_ASSERTs added to the top-level methods to catch this in debug builds
- non-null check kept in place to actually fix/prevent unnecessary production crashes that wouldn't tell us anything anyway
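A minimal sketch of the v2 approach (hypothetical, simplified types and signatures; not the real Gecko code in netwerk/cache2): assert at the public entry point so debug builds catch the bad caller immediately, while the internal method still rejects null so release builds fail gracefully instead of crashing later in LoopOneLevel().

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <memory>
#include <utility>

// Hypothetical stand-ins for Gecko types, for illustration only.
using nsresult = int;
constexpr nsresult NS_OK = 0;
constexpr nsresult NS_ERROR_NULL_POINTER = -1;
struct Runnable { virtual void Run() = 0; virtual ~Runnable() = default; };

class CacheIOThread {
public:
  static constexpr uint32_t kLevels = 2; // e.g. MANAGEMENT, WRITE

  nsresult Dispatch(std::shared_ptr<Runnable> aRunnable, uint32_t aLevel) {
    // Debug builds catch the misbehaving caller right away...
    assert(aRunnable && "Dispatching a null runnable");
    return DispatchInternal(std::move(aRunnable), aLevel);
  }

private:
  nsresult DispatchInternal(std::shared_ptr<Runnable> aRunnable,
                            uint32_t aLevel) {
    // ...while release builds return an error instead of crashing
    // later when the null event would be executed from the queue.
    if (!aRunnable) {
      return NS_ERROR_NULL_POINTER;
    }
    mQueues[aLevel].push_back(std::move(aRunnable));
    return NS_OK;
  }

  std::deque<std::shared_ptr<Runnable>> mQueues[kLevels];
};
```

This belt-and-suspenders pattern (assert in debug, error code in release) is what the two bullet points above describe.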
Attachment #8413837 - Attachment is obsolete: true
Attachment #8415835 - Flags: review?(michal.novotny)
Attachment #8415835 - Flags: review?(michal.novotny) → review+
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Looking at [1] I see that there are still some crashes:
- Firefox 32 Beta - 8 crashes ranging from 20140722030201 to 20140811180644
- Firefox 34 Nightly - 2 crashes: 20140722030201 and 20140810030204

Honza, is this acceptable, or does it need more work?
Flags: needinfo?(honzab.moz)
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #7)
> Looking at [1] I see that there are still some crashes:
> - Firefox 32 Beta - 8 crashes ranging from 20140722030201 to 20140811180644
> - Firefox 34 Nightly - 2 crashes: 20140722030201 and 20140810030204
> Honza, is this acceptable, or does it need more work?

I suspect more work is needed here.  One thing that comes to mind is that some event has a broken reference counter.  It could also be the result of heap corruption from completely different code, but that is hard to track.
Flags: needinfo?(honzab.moz)
Thanks Honza! I'm reopening this so it gets the needed attention.
Keywords: verifyme
Resolution: FIXED → ---
Crash Signature: [@ mozilla::net::CacheIOThread::LoopOneLevel(unsigned int)] → [@ mozilla::net::CacheIOThread::LoopOneLevel(unsigned int)] [@ mozilla::net::CacheIOThread::LoopOneLevel]
Whiteboard: [necko-backlog]

Only one crash on Aurora (47.0a2), 64 on release (45.0.1).  This is a low rate, but I would still like to figure out whether this is a CacheIOThread issue or an issue with the runnable.

It's not a duplicate of bug 1257611, having the fix (
Depends on: 1277275
Probes from bug 1277275 show that the write queue in particular can get pretty long.  In two weeks on Nightly (50) there are 1630k samples overall for the HTTP_CACHE_IO_QUEUE_WRITE probe, of which 334k (20%) hit a backlog of more than 30 events and 400k (25%) a backlog of more than 300!

Just before WRITE we process MANAGEMENT, which has some 850k samples with 89k >300.  But at the MANAGEMENT level we don't do any IO, and according to the numbers this is just an accumulation of operations happening around openings and readings (the sum of all OPEN*/READ* ops is almost equal to the number of MANAGEMENT operations).

There is no baseline for how many sessions never exceed a backlog of 30 events.

Anyway, this all shows we can keep a lot of memory allocated (the suspected cause of THIS bug) holding these queues alive.  Possible solutions: more threads, a cache/network race, smaller write-op granularity, or priorities for the write queue together with a time limit after which a write operation is bypassed.
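To make the backlog discussion concrete, here is a simplified model (hypothetical names; not the real Gecko implementation) of a leveled event loop that records queue depth before draining one level, mirroring what probes like HTTP_CACHE_IO_QUEUE_WRITE measure, and showing why long-lived queues pin memory until they are processed.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical simplified model of a leveled IO-thread event loop.
enum Level : uint32_t { MANAGEMENT = 0, WRITE = 1, LEVEL_COUNT = 2 };

class LeveledLoop {
public:
  void Dispatch(Level aLevel, std::function<void()> aEvent) {
    mQueues[aLevel].push_back(std::move(aEvent));
  }

  // Drain every event currently queued at one level, recording the
  // backlog size first -- this backlog is the number the telemetry
  // probes report (e.g. 20% of WRITE samples exceeded 30 events).
  size_t LoopOneLevel(Level aLevel) {
    size_t backlog = mQueues[aLevel].size();
    RecordBacklogSample(aLevel, backlog);
    while (!mQueues[aLevel].empty()) {
      auto event = std::move(mQueues[aLevel].front());
      mQueues[aLevel].pop_front();
      event();  // a null or corrupted event here is where the crash hit
    }
    return backlog;
  }

  const std::vector<size_t>& Samples(Level aLevel) const {
    return mSamples[aLevel];
  }

private:
  void RecordBacklogSample(Level aLevel, size_t aBacklog) {
    mSamples[aLevel].push_back(aBacklog);
  }

  std::deque<std::function<void()>> mQueues[LEVEL_COUNT];
  std::vector<size_t> mSamples[LEVEL_COUNT];
};
```

Every queued event (and whatever it references) stays allocated until its level is drained, which is how a backlog of hundreds of write events can hold a lot of heap alive.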
Crash volume for signature 'mozilla::net::CacheIOThread::LoopOneLevel':
 - nightly (version 50): 0 crash from 2016-06-06.
 - aurora  (version 49): 0 crash from 2016-06-07.
 - beta    (version 48): 56 crashes from 2016-06-06.
 - release (version 47): 282 crashes from 2016-05-31.
 - esr     (version 45): 37 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          0          0          0          0          0          0
 - aurora           0          0          0          0          0          0          0
 - beta            13         10          7          7         11          5          2
 - release         37         43         32         46         54         56          5
 - esr              3          1          1          4          3          5          8

Affected platforms: Windows, Mac OS X
Depends on: 1340141
Bulk change to priority:
Priority: -- → P1
Bulk change to priority:
Priority: P1 → P3
See Also: → 1408075
Low rate, not working on this.
Assignee: honzab.moz → nobody
Keywords: stalled

I checked a few of the crash reports and they are only from very old branches. WFM!

Closed: 3 years ago
Resolution: --- → WORKSFORME

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled