Bug 1352916 (Closed)
Opened 5 years ago · Closed 4 years ago
Crash in mozilla::net::CacheFileMetadata::GetElement
Categories
(Core :: Networking: Cache, defect, P5)
Tracking

Status: RESOLVED WONTFIX

| Version | Tracking | Status |
| --- | --- | --- |
| firefox52 | --- | unaffected |
| firefox-esr52 | --- | unaffected |
| firefox53 | --- | unaffected |
| firefox54 | --- | unaffected |
| firefox55 | + | fixed |
| firefox56 | --- | fixed |
People
(Reporter: marcia, Assigned: michal)
References
Details
(Keywords: crash, regression, Whiteboard: [necko-triaged])
Crash Data
Attachments
(2 files)
- patch, 1.83 KB — valentin: review+ (Details | Diff | Splinter Review)
- text/plain, 3.82 KB (Details)
This bug was filed from the Socorro interface and is report bp-362bd8c8-9d6e-4eba-a36f-812272170402.

Seen while looking at nightly crash stats: http://bit.ly/2orsg2T

Crashes started with the 20170401030204 build. Possible regression range based on Build ID:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8df9fabf2587b7020889755acb9e75b664fe13cf&tochange=00a166a8640dffa2e0f48650f966d75ca3c1836e

Bug 1325088 is in the range; ni on junior.
Flags: needinfo?(juhsu)
Updated•5 years ago
Flags: needinfo?(juhsu)
Comment 1•5 years ago
Please expect some delay from me due to a national holiday; keeping the ni.
Flags: needinfo?(juhsu)
Assignee
Comment 2•5 years ago
I have no theory about what could be wrong. The metadata is a local instance and is not accessed on any other thread.

It is read and parsed at https://hg.mozilla.org/mozilla-central/annotate/00a166a8640d/netwerk/cache2/CacheIndex.cpp#l3094. The elements are checked at https://hg.mozilla.org/mozilla-central/annotate/00a166a8640d/netwerk/cache2/CacheFileMetadata.cpp#l977. The check ensures that the buffer holding the elements is not corrupted, i.e. that it contains pairs "key\0value\0", and CacheFileMetadata::GetElement() shouldn't IMO crash on it.
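For context, a lookup over an element buffer laid out as consecutive "key\0value\0" pairs works roughly like the sketch below. The function name and signature are hypothetical, not the actual CacheFileMetadata::GetElement(); it assumes the sanity check already verified every key and value is null-terminated inside the buffer.

```cpp
#include <cstring>

// Walk consecutive "key\0value\0" pairs and return a pointer to the
// value belonging to `key`, or nullptr if the key is not present.
// Illustrative only; relies on the buffer having passed the sanity
// check, so every strlen() stops inside the buffer.
const char* FindElement(const char* buf, size_t bufSize, const char* key) {
  const char* p = buf;
  const char* end = buf + bufSize;
  while (p < end) {
    const char* value = p + strlen(p) + 1;  // skip key and its '\0'
    if (strcmp(p, key) == 0) {
      return value;
    }
    p = value + strlen(value) + 1;          // skip value and its '\0'
  }
  return nullptr;  // key not found
}
```

The crash under investigation is surprising precisely because, once the sanity check has passed, a walk like this touches only memory inside the buffer.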
Assignee
Comment 3•5 years ago
It's also worth noting that just a few lines above, GetElement() didn't crash when trying to read alt-data at https://hg.mozilla.org/mozilla-central/annotate/00a166a8640d/netwerk/cache2/CacheIndex.cpp#l2707. Alt-data isn't used yet, so that call had to walk through the whole buffer.
Assignee
Comment 4•5 years ago
Note that this patch won't fix the crash. It changes CacheFileMetadata::GetElement so that it won't access memory outside the buffer even if the data is corrupted. If the problem is that mBuf is rewritten and points to random memory, it will still crash in strnlen(). If the data in the buffer is overwritten, it will crash on one of the release asserts.
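A hardened walk of that kind could look like the sketch below: every scan is capped with strnlen() at the bytes remaining in the buffer, so even corrupted data can't drive a read outside the allocation. Names are illustrative, and the out-parameter stands in for the patch's MOZ_RELEASE_ASSERT so the sketch stays self-contained; this is not the actual patch.

```cpp
#include <cstring>

// Bounds-checked variant of the element lookup. A key or value that
// runs off the end of the buffer is reported as corruption instead of
// being scanned past the allocation.
const char* FindElementChecked(const char* buf, size_t bufSize,
                               const char* key, bool* corrupted) {
  *corrupted = false;
  const char* p = buf;
  size_t remaining = bufSize;
  while (remaining > 0) {
    size_t keyLen = strnlen(p, remaining);
    if (keyLen == remaining) {   // key is not null-terminated in buffer
      *corrupted = true;
      return nullptr;
    }
    const char* value = p + keyLen + 1;
    size_t valueMax = remaining - keyLen - 1;
    size_t valueLen = strnlen(value, valueMax);
    if (valueLen == valueMax) {  // no terminated value for this key
      *corrupted = true;
      return nullptr;
    }
    if (strcmp(p, key) == 0) {
      return value;
    }
    size_t pairLen = keyLen + 1 + valueLen + 1;
    p += pairLen;
    remaining -= pairLen;
  }
  return nullptr;
}
```

As the comment says, this converts silent out-of-bounds reads into a deterministic failure at the corrupted pair, which is diagnostic value rather than a fix.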
Attachment #8854906 -
Flags: review?(valentin.gosu)
Updated•5 years ago
Flags: needinfo?(juhsu)
Attachment #8854906 -
Flags: review?(valentin.gosu) → review+
Assignee
Updated•5 years ago
Assignee: nobody → michal.novotny
Keywords: checkin-needed,
leave-open
Whiteboard: [necko-active]
Updated•5 years ago
status-firefox52:
--- → unaffected
status-firefox53:
--- → unaffected
status-firefox54:
--- → unaffected
status-firefox-esr52:
--- → unaffected
Comment 5•5 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3de4bb69aba
Add release asserts to try to narrow down the source of the reported crashes. r=valentin
Keywords: checkin-needed
Comment 6•5 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/c3de4bb69aba
Reporter
Comment 7•5 years ago
I see a bunch of crashes in [strnlen] http://bit.ly/2qohBXO - is this the assert that was added in this bug?
Comment 8•5 years ago
This isn't an assert, it's an actual crash, but the cause is the same. Based on the crash address, I think it sometimes happens because mBuf is null. The other crashes have addresses such as 0xf88e1a6000, so it seems we are exceeding the bounds of the memory page, which suggests the metadata was indeed corrupted.
Flags: needinfo?(michal.novotny)
Reporter
Updated•5 years ago
Crash Signature: [@ mozilla::net::CacheFileMetadata::GetElement] → [@ mozilla::net::CacheFileMetadata::GetElement]
[@ strnlen]
Assignee
Comment 9•5 years ago
(In reply to Valentin Gosu [:valentin] from comment #8)
> This isn't an assert, it's an actual crash, but the cause is the same. Based
> on the crash address I think it sometimes happens because mBuf is null.
> The other crashes have addresses such as 0xf88e1a6000 ... so it seems we are
> exceeding the bounds of the memory page. So it seems the metadata was indeed
> corrupted.

See my comment #3: mBuf was not null and the data was not corrupted just a few lines above this crash. I have no idea what to do here...
Flags: needinfo?(michal.novotny)
Comment 10•5 years ago
[Tracking Requested - why for this release]: this signature is regressing in numbers with the start of the 55.0b cycle. The [@ strnlen] signature is the #3 browser crash in 55.0b2 for the population on the beta update channel (2.85% of browser crashes).
Comment 11•5 years ago
Tracking for 55. Michal, I'm curious why you think mBuf wasn't null, what am I missing?
Flags: needinfo?(michal.novotny)
Assignee
Comment 12•5 years ago
If it was null, it would crash on the earlier GetElement call.
Flags: needinfo?(michal.novotny)
Comment 13•5 years ago
Any other diagnostics we can do here? The crash volume is concerning.
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(honzab.moz)
Comment 14•5 years ago
FYI, the first crash I find with CacheFile in the proto_signature (i.e. not jit/etc bugs) for strnlen is 55.0a1 buildid 20170429030208
Comment 15•5 years ago
(In reply to Mike Taylor [:miketaylr] (55 Regression Engineering Owner) from comment #13)
> Any other diagnostics we can do here? The crash volume is concerning.

Not a question for me, sorry. Not my code.
Flags: needinfo?(honzab.moz)
Comment 16•5 years ago
I've discussed this with Michal, and it's very difficult to figure out what's going on. Several crash sites are calling GetElement twice, and only crash on the second call. I've spent a bit of time looking over GetElement and SetElement, and so far I haven't found anything that could cause this. This is probably memory corruption caused by some other code. Finding out what causes it isn't going to be easy. Maybe we could add some guard pages around the metadata?
Flags: needinfo?(valentin.gosu)
Comment 17•5 years ago
Hm, not sure if it may be related, but a third of those crashes on beta seem to have an external module from the Indian AV vendor "quickheal" hooking into the Firefox process:

(20.18% in signature vs 00.27% overall) Module "ScDetour.Dll" = true
(17.50% in signature vs 00.41% overall) Module "SCDETOUR.DLL" = true
Comment 18•5 years ago
Crash volume has been pretty low in the past week, ~5-10 crashes per day.
Comment 19•5 years ago
The [@ strnlen] signature is more in the area of 100-150 daily crashes on beta, though.
Comment 20•5 years ago
Michal, I know it's not the perfect solution, but I'm thinking of something we could try.

(1) Remove the lambda in http://searchfox.org/mozilla-central/rev/cef8389c687203085dc6b52de2fbd0260d7495bf/netwerk/cache2/CacheIndex.cpp#2741

Lambdas sometimes come with unknown behavior like this: http://searchfox.org/mozilla-central/diff/59878e7979d63697174d12a40f9b3061d1c976ec/netwerk/dns/mdns/libmdns/MDNSResponderOperator.cpp#624

We have 5-10 crashes per day, so we can get the result pretty soon from nightly.

(2) Avoid CacheFileMetadata::GetElement for net-on-start-time: we could save the time like this http://searchfox.org/mozilla-central/source/netwerk/cache2/CacheFileMetadata.h#51

On the other hand, it could save disk space, since we would save an integer instead of a string.
Flags: needinfo?(michal.novotny)
Assignee
Comment 21•5 years ago
(In reply to Junior[:junior] from comment #20)
> I know it's not the perfect solution, but I'm thinking of something we could try.
> (1) remove lambda in
> http://searchfox.org/mozilla-central/rev/cef8389c687203085dc6b52de2fbd0260d7495bf/netwerk/cache2/CacheIndex.cpp#2741
> lambda sometimes comes with unknown behavior like this
> http://searchfox.org/mozilla-central/diff/59878e7979d63697174d12a40f9b3061d1c976ec/netwerk/dns/mdns/libmdns/MDNSResponderOperator.cpp#624
> We have 5-10 crashes per day. We can get the result pretty soon from nightly.
> (2) avoid CacheFileMetadata::GetElement for net-on-start-time
> we could save the time like this
> http://searchfox.org/mozilla-central/source/netwerk/cache2/CacheFileMetadata.h#51
> on the other hand, it could save disk space since we save integer instead of string.

I would prefer to keep the times in the metadata and not move them into the header. Let's first try removing the lambda...
Flags: needinfo?(michal.novotny)
Comment 22•5 years ago
No crashes happen after Bug 1390909. Will ask for an uplift on Wed-Thu if all good.
Comment 23•5 years ago
(In reply to Junior[:junior] from comment #22)
> No crashes happen after Bug 1390909.
> Will ask for an uplift on Wed-Thu if all good.

Sorry, I meant bug 1380909.
Comment 24•5 years ago
Uplift to beta there.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1380909
Comment 25•5 years ago
Thanks Junior! Mirroring the firefox55/56 status of bug 1380909.
Comment 26•5 years ago
We have a SUMO report [1] indicating continuous crashes on startup (even after removing the profile, weird!). Also in the crash reporter [2]. It hits the assertion with a zero-length value:

MOZ_RELEASE_ASSERT(keyLen + 1 != maxLen, "Metadata elements corrupted. There is no value for the key!");

The crash has become widespread, i.e., not only in the Network Times metadata [2]. Like the previous syndrome, it can assert on the second GetElement [3]. Hope we can get the cache log from [1].

Also, SC found that we have some GetElement calls without locking (e.g., [4]). It would be good to fix. (Note that this won't fix all the crashes, since [3] is in the critical section.)

What do you think, Michal?

[1] https://support.mozilla.org/en-US/questions/1180306
[2] https://crash-stats.mozilla.com/report/index/087ac762-bf70-4a1b-a5fd-581cb0171015
[3] https://crash-stats.mozilla.com/report/index/a6a31f99-2b7f-4695-9784-956990170928
[4] http://searchfox.org/mozilla-central/rev/dca019c94bf3a840ed7ff50261483410cfece24f/netwerk/cache2/CacheFile.cpp#649
Status: RESOLVED → REOPENED
Flags: needinfo?(michal.novotny)
Resolution: DUPLICATE → ---
Assignee
Comment 27•5 years ago
(In reply to Junior[:junior] from comment #26)
> And SC found that we have some GetElement calls without locking (e.g., [4]).
> It would be good to fix.
> (Note that this won't fix all the crashes, since [3] is in the critical section.)

CacheFile isn't used before the metadata is read from the disk, so there is no call into the metadata until CacheFile::OnMetadataRead finishes. The lock shouldn't be needed here.
Flags: needinfo?(michal.novotny)
Assignee
Comment 28•5 years ago
I think this report (mentioned at https://support.mozilla.org/en-US/questions/1180306) is important: https://crash-stats.mozilla.com/report/index/324d62ef-4c15-4410-bbfd-e62801171017

It's metadata which is created in CacheIndex::BuildIndex() and is used only in the scope of this function, so it's obviously not a threading issue. Also, GetElement() at https://hg.mozilla.org/releases/mozilla-release/annotate/8fbf05f4b921/netwerk/cache2/CacheIndex.cpp#l2734 didn't crash, and a few lines below it crashes on another call at https://hg.mozilla.org/releases/mozilla-release/annotate/8fbf05f4b921/netwerk/cache2/CacheIndex.cpp#l2752.

On the metadata in CacheIndex::BuildIndex() we call only GetFrecency(), GetExpirationTime() and GetElement(). IMO either some other code rewrites our memory or the memory is faulty (bad modules, overclocked PC, etc.).
Assignee
Comment 29•5 years ago
This is a small sample from the log which we got from the user who reported the problem at https://support.mozilla.org/en-US/questions/1180306. The log is garbled, so there is something very wrong with the user's PC. Maybe it's a HW problem or some malware, I don't know. But it's not a problem in the cache; we crash in the cache code only because our sanity checks use MOZ_RELEASE_ASSERT.
Comment 30•5 years ago
P5, as this looks like a machine issue; also, no answer from the reporter for a relatively long time. Candidate for WONTFIX.
Priority: -- → P5
Whiteboard: [necko-active] → [necko-triaged]
Comment 31•4 years ago
The leave-open keyword is set and there has been no activity for 6 months. :michal.novotny, maybe it's time to close this bug?
Flags: needinfo?(michal.novotny)
Comment 32•4 years ago
I'm just thinking out loud here: instead of release_assert, could we maybe return null & blow away the cache metadata?
Assignee
Comment 33•4 years ago
(In reply to Valentin Gosu [:valentin] from comment #32)
> I'm just thinking out loud here: instead of release_assert, could we maybe
> return null & blow away the cache metadata?

It looks like it's either a HW problem or the memory is corrupted by some other code. IMO it's not a good idea to try to handle such an unexpected state. The entry is not usable, so just removing the metadata isn't enough; we would need to doom the entry, close all streams, etc., and if the memory is really corrupted, we would crash later somewhere else. Marking as WONTFIX.
Status: REOPENED → RESOLVED
Closed: 5 years ago → 4 years ago
Flags: needinfo?(michal.novotny)
Resolution: --- → WONTFIX
Updated•4 years ago
Keywords: leave-open