Crash in mozilla::net::CacheFileMetadata::GetElement

RESOLVED WONTFIX

Status

Type: defect
Priority: P5
Severity: critical
Resolution: RESOLVED WONTFIX
Opened: 2 years ago
Last modified: 9 months ago

People

(Reporter: marcia, Assigned: michal)

Tracking

({crash, regression})

Version: 55 Branch
Hardware: Unspecified
OS: Windows 10
Points: ---

Firefox Tracking Flags

(firefox52 unaffected, firefox-esr52 unaffected, firefox53 unaffected, firefox54 unaffected, firefox55+ fixed, firefox56 fixed)

Details

(Whiteboard: [necko-triaged], crash signature)

Attachments

(2 attachments)

This bug was filed from the Socorro interface and is report bp-362bd8c8-9d6e-4eba-a36f-812272170402.
=============================================================

Seen while looking at nightly crash stats: http://bit.ly/2orsg2T

Crashes started with the 20170401030204 build.

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8df9fabf2587b7020889755acb9e75b664fe13cf&tochange=00a166a8640dffa2e0f48650f966d75ca3c1836e

Bug 1325088 is in the range; setting needinfo on Junior.
Flags: needinfo?(juhsu)
Please expect some delay from me due to a national holiday; keeping the ni.
Flags: needinfo?(juhsu)
I have no theory about what could be wrong. The metadata is a local instance and is not accessed on any other thread. It is read and parsed at https://hg.mozilla.org/mozilla-central/annotate/00a166a8640d/netwerk/cache2/CacheIndex.cpp#l3094

Elements are checked at https://hg.mozilla.org/mozilla-central/annotate/00a166a8640d/netwerk/cache2/CacheFileMetadata.cpp#l977. The check ensures that the buffer holding the elements is not corrupted, i.e. that it contains pairs of "key \0 value \0", so IMO CacheFileMetadata::GetElement() shouldn't crash on it.
It's also worth noting that GetElement() didn't crash just a few lines above, when trying to read alt-data at https://hg.mozilla.org/mozilla-central/annotate/00a166a8640d/netwerk/cache2/CacheIndex.cpp#l2707. Alt-data isn't used yet, so that call had to walk through the whole buffer.
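
For illustration, a minimal sketch of the invariant that check enforces, assuming only the "key \0 value \0" layout described above; this is not the actual Firefox implementation:

#include <cstdint>

// Sketch: verify that aBuf (aSize bytes) consists solely of
// null-terminated "key\0value\0" pairs. If this holds, a strlen()-style
// walk over the buffer can never run past its end.
static bool ElementsLookValid(const char* aBuf, uint32_t aSize) {
  if (aSize == 0) {
    return true;  // an empty element buffer is valid
  }
  if (aBuf[aSize - 1] != '\0') {
    return false;  // last string is not terminated inside the buffer
  }
  // Each '\0' ends exactly one string; a well-formed buffer holds an
  // even number of strings (one value for every key).
  uint32_t strings = 0;
  for (uint32_t i = 0; i < aSize; ++i) {
    if (aBuf[i] == '\0') {
      ++strings;
    }
  }
  return (strings % 2) == 0;
}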
Posted patch: patch (Splinter Review)
Note that this patch won't fix the crash. It changes CacheFileMetadata::GetElement() so that it won't access memory outside the buffer even if the data is corrupted. If the problem is that mBuf is rewritten and points to random memory, then it will still crash in strnlen(). If the data in the buffer is overwritten, it will crash on one of the release asserts.
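
For reference, a hedged sketch of what such a hardened lookup could look like, reconstructed from this description and the assert message quoted later in this bug; details of the actual patch may differ:

#include <cstring>  // strnlen, strcmp

// Sketch: walk the "key\0value\0" pairs without ever reading past
// mBuf + mElementsSize; deliberately crash via MOZ_RELEASE_ASSERT as
// soon as the buffer turns out to be corrupted.
const char* CacheFileMetadata::GetElement(const char* aKey) {
  const char* data = mBuf;
  const char* limit = mBuf + mElementsSize;

  while (data != limit) {
    size_t maxLen = limit - data;
    size_t keyLen = strnlen(data, maxLen);
    MOZ_RELEASE_ASSERT(keyLen != maxLen,
        "Metadata elements corrupted. Key isn't null terminated!");
    MOZ_RELEASE_ASSERT(keyLen + 1 != maxLen,
        "Metadata elements corrupted. There is no value for the key!");

    const char* value = data + keyLen + 1;
    maxLen = limit - value;
    size_t valueLen = strnlen(value, maxLen);
    MOZ_RELEASE_ASSERT(valueLen != maxLen,
        "Metadata elements corrupted. Value isn't null terminated!");

    if (strcmp(data, aKey) == 0) {
      return value;  // the value stored for aKey
    }
    data = value + valueLen + 1;  // advance to the next pair
  }
  return nullptr;  // key not present
}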
Attachment #8854906 - Flags: review?(valentin.gosu)
Flags: needinfo?(juhsu)
Attachment #8854906 - Flags: review?(valentin.gosu) → review+
Assignee: nobody → michal.novotny
Whiteboard: [necko-active]
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3de4bb69aba
Add release asserts to try to narrow down the source of the reported crashes. r=valentin
Keywords: checkin-needed
I see a bunch of crashes in [strnlen] http://bit.ly/2qohBXO - is this the assert that was added in this bug?
This isn't an assert, it's an actual crash, but the cause is the same. Based on the crash address, I think it sometimes happens because mBuf is null.
The other crashes have addresses such as 0xf88e1a6000 ... so it seems we are exceeding the bounds of the memory page, which suggests the metadata was indeed corrupted.
Flags: needinfo?(michal.novotny)
Crash Signature: [@ mozilla::net::CacheFileMetadata::GetElement] → [@ mozilla::net::CacheFileMetadata::GetElement] [@ strnlen]
(In reply to Valentin Gosu [:valentin] from comment #8)
> This isn't an assert, it's an actual crash, but the cause is the same.
> Based on the crash address, I think it sometimes happens because mBuf is
> null. The other crashes have addresses such as 0xf88e1a6000 ... so it
> seems we are exceeding the bounds of the memory page, which suggests the
> metadata was indeed corrupted.

See my comment #3. mBuf was not null and the data was not corrupted just a few lines above this crash. I have no idea what to do here...
Flags: needinfo?(michal.novotny)
[Tracking Requested - why for this release]:
This signature is regressing in volume with the start of the 55.0b cycle: the [@ strnlen] signature is the #3 browser crash in 55.0b2 for the population on the beta update channel (2.85% of browser crashes).
Tracking for 55.

Michal, I'm curious why you think mBuf wasn't null, what am I missing?
Flags: needinfo?(michal.novotny)
If it was null, it would crash on the earlier GetElement call.
Flags: needinfo?(michal.novotny)
Any other diagnostics we can do here? The crash volume is concerning.
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(honzab.moz)
FYI, the first crash I can find with CacheFile in the proto_signature (i.e. not jit/etc. bugs) for strnlen is in 55.0a1, build ID 20170429030208.
(In reply to Mike Taylor [:miketaylr] (55 Regression Engineering Owner) from comment #13)
> Any other diagnostics we can do here? The crash volume is concerning.

Not a question for me, sorry.  Not my code.
Flags: needinfo?(honzab.moz)
I've discussed this with Michal, and it's very difficult to figure out what's going on.
Several crash sites are calling GetElement twice, and only crash on the second call.
I've spent a bit of time looking over GetElement and SetElement, and so far I haven't found anything that could cause this.
This is probably memory corruption caused by some other code. Finding out what causes it isn't going to be easy.
Maybe we could add some guard pages around the metadata?
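
For reference, a minimal sketch of the guard-page idea on POSIX (a hypothetical helper just to illustrate the technique; the cache would need a cross-platform variant):

#include <sys/mman.h>  // mmap, mprotect
#include <unistd.h>    // sysconf
#include <cstddef>

// Sketch: allocate a buffer that ends flush against an inaccessible
// guard page, so any read or write past the buffer faults immediately
// at the overflowing instruction instead of silently corrupting memory.
static char* AllocWithTrailingGuard(size_t aSize) {
  size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  size_t dataPages = (aSize + page - 1) / page;
  size_t total = (dataPages + 1) * page;  // +1 page for the guard

  void* base = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) {
    return nullptr;
  }
  char* guard = static_cast<char*>(base) + dataPages * page;
  mprotect(guard, page, PROT_NONE);  // revoke all access to the guard
  return guard - aSize;  // buffer ends exactly at the guard page
}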
Flags: needinfo?(valentin.gosu)
Hm, not sure if it's related, but a third of those crashes on beta seem to have an external module from the Indian AV vendor "Quick Heal" hooking into the Firefox process:
(20.18% in signature vs 00.27% overall) Module "ScDetour.Dll" = true
(17.50% in signature vs 00.41% overall) Module "SCDETOUR.DLL" = true
Crash volume has been pretty low in the past week, ~5-10 crashes per day.
The [@ strnlen] signature is more in the area of 100-150 daily crashes on beta, though.
Michal,
I know it's not a perfect solution, but here are two things we could try.
(1) Remove the lambda in http://searchfox.org/mozilla-central/rev/cef8389c687203085dc6b52de2fbd0260d7495bf/netwerk/cache2/CacheIndex.cpp#2741

Lambdas sometimes come with unexpected behavior, like this one:
http://searchfox.org/mozilla-central/diff/59878e7979d63697174d12a40f9b3061d1c976ec/netwerk/dns/mdns/libmdns/MDNSResponderOperator.cpp#624

We have 5-10 crashes per day, so we should get a result from Nightly pretty soon.

(2) Avoid CacheFileMetadata::GetElement() for net-on-start-time.
We could store the time in the header, like the fields at
http://searchfox.org/mozilla-central/source/netwerk/cache2/CacheFileMetadata.h#51

As a bonus, it would also save disk space, since we'd store an integer instead of a string.
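
To make option (2) concrete, a hedged sketch of the idea; the existing fields below follow my reading of CacheFileMetadata.h, and the two new members and their names are hypothetical:

#include <cstdint>

// Sketch: keep the network timing values as fixed integer fields in the
// metadata header instead of "key\0value\0" string elements, so reading
// them never walks the element buffer at all.
struct CacheFileMetadataHeader {
  uint32_t mVersion;
  uint32_t mFetchCount;
  uint32_t mLastFetched;
  uint32_t mLastModified;
  uint32_t mFrecency;
  uint32_t mExpirationTime;
  uint32_t mKeySize;
  uint32_t mFlags;
  uint16_t mOnStartTime;  // hypothetical: ms until OnStartRequest
  uint16_t mOnStopTime;   // hypothetical: ms until OnStopRequest
};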
Flags: needinfo?(michal.novotny)
(In reply to Junior[:junior] from comment #20)
> Michal,
> I know it's not a perfect solution, but here are two things we could try.
> (1) Remove the lambda in
> http://searchfox.org/mozilla-central/rev/cef8389c687203085dc6b52de2fbd0260d7495bf/netwerk/cache2/CacheIndex.cpp#2741
> 
> Lambdas sometimes come with unexpected behavior, like this one:
> http://searchfox.org/mozilla-central/diff/59878e7979d63697174d12a40f9b3061d1c976ec/netwerk/dns/mdns/libmdns/MDNSResponderOperator.cpp#624
> 
> We have 5-10 crashes per day, so we should get a result from Nightly
> pretty soon.
> 
> (2) Avoid CacheFileMetadata::GetElement() for net-on-start-time.
> We could store the time in the header, like the fields at
> http://searchfox.org/mozilla-central/source/netwerk/cache2/CacheFileMetadata.h#51
> 
> As a bonus, it would also save disk space, since we'd store an integer
> instead of a string.

I would prefer to keep the times in the metadata rather than move them into the header. Let's first try removing the lambda...
Flags: needinfo?(michal.novotny)
Depends on: 1380909
No crashes happen after Bug 1390909.
Will ask for an uplift on Wed-Thu if all good.
(In reply to Junior[:junior] from comment #22)
> No crashes happen after Bug 1390909.
> Will ask for an uplift on Wed-Thu if all good.

Sorry, I meant bug 1380909
Uplift to beta there.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1380909
Thanks Junior!

Mirroring the firefox55/56 status of bug 1380909.
We have a SUMO report [1] indicating continuous crashes on startup (even after removing the profile, weirdly!).
It's also showing up in the crash reporter [2].

It hits the assertion with a zero-length value:
MOZ_RELEASE_ASSERT(keyLen + 1 != maxLen, "Metadata elements corrupted. There is no value for the key!");

The crash has become widespread, i.e. it is no longer limited to the network-times metadata [2].
Like the previous syndrome, it can assert on the second GetElement() call [3].

Hopefully we can get the cache log from [1].
Also, SC found that we have some GetElement() calls without locking (e.g. [4]); that would be good to fix.
(Note that this won't fix all the crashes, since [3] is inside the critical section.)

What do you think, michal?

[1] https://support.mozilla.org/en-US/questions/1180306
[2] https://crash-stats.mozilla.com/report/index/087ac762-bf70-4a1b-a5fd-581cb0171015
[3] https://crash-stats.mozilla.com/report/index/a6a31f99-2b7f-4695-9784-956990170928
[4] http://searchfox.org/mozilla-central/rev/dca019c94bf3a840ed7ff50261483410cfece24f/netwerk/cache2/CacheFile.cpp#649
Status: RESOLVED → REOPENED
Flags: needinfo?(michal.novotny)
Resolution: DUPLICATE → ---
(In reply to Junior[:junior] from comment #26)
> Also, SC found that we have some GetElement() calls without locking (e.g.
> [4]); that would be good to fix.
> (Note that this won't fix all the crashes, since [3] is inside the
> critical section.)

CacheFile isn't used before the metadata is read from the disk, so there is no call into the metadata until CacheFile::OnMetadataRead() finishes. The lock shouldn't be needed here.
Flags: needinfo?(michal.novotny)
I think that this report (mentioned at https://support.mozilla.org/en-US/questions/1180306) is important:
https://crash-stats.mozilla.com/report/index/324d62ef-4c15-4410-bbfd-e62801171017

It's metadata that is created in CacheIndex::BuildIndex() and is used only in the scope of this function, so it's clearly not a threading issue. Also, GetElement() at https://hg.mozilla.org/releases/mozilla-release/annotate/8fbf05f4b921/netwerk/cache2/CacheIndex.cpp#l2734 didn't crash, while a few lines below it crashes on another call at https://hg.mozilla.org/releases/mozilla-release/annotate/8fbf05f4b921/netwerk/cache2/CacheIndex.cpp#l2752. On the metadata in CacheIndex::BuildIndex() we only call GetFrecency(), GetExpirationTime() and GetElement(). IMO either some other code overwrites our memory or the memory is faulty (bad modules, an overclocked PC, etc.).
Posted file: log
This is a small sample from the log we got from the user who reported the problem at https://support.mozilla.org/en-US/questions/1180306. The log is garbled, so there is something very wrong with the user's PC. Maybe it's a HW problem or some malware, I don't know. But it's not a problem in the cache; we crash in the cache code only because our sanity checks use MOZ_RELEASE_ASSERT.
P5, as this looks like a machine issue; also, there has been no answer from the reporter for a relatively long time. Candidate for WONTFIX.
Priority: -- → P5
Whiteboard: [necko-active] → [necko-triaged]
The leave-open keyword is set and there has been no activity for 6 months.
:michal.novotny, maybe it's time to close this bug?
Flags: needinfo?(michal.novotny)
I'm just thinking out loud here: instead of release_assert, could we maybe return null & blow away the cache metadata?
(In reply to Valentin Gosu [:valentin] from comment #32)
> I'm just thinking out loud here: instead of release_assert, could we maybe
> return null & blow away the cache metadata?

It looks like it's either a HW problem or the memory is corrupted by some other code. IMO it's not a good idea to try to handle such an unexpected state. The entry is not usable, so just removing the metadata isn't enough: we would need to doom the entry, close all streams, etc., and if the memory is really corrupted, we would crash later somewhere else anyway. Marking as WONTFIX.
Status: REOPENED → RESOLVED
Closed: 2 years ago → 9 months ago
Flags: needinfo?(michal.novotny)
Resolution: --- → WONTFIX