Crash in mozilla::net::CacheFileMetadata::GetElement

RESOLVED WONTFIX

Status

Type: defect
Priority: P5
Severity: critical
Resolution: RESOLVED WONTFIX
Opened: 2 years ago
Last modified: 9 months ago

People

(Reporter: marcia, Assigned: michal)

Tracking

({crash, regression})

Version: 55 Branch
Hardware: Unspecified
OS: Windows 10
Points: ---

Firefox Tracking Flags

(firefox52 unaffected, firefox-esr52 unaffected, firefox53 unaffected, firefox54 unaffected, firefox55+ fixed, firefox56 fixed)

Details

(Whiteboard: [necko-triaged], crash signature)

Attachments

(2 attachments)

This bug was filed from the Socorro interface and is report bp-362bd8c8-9d6e-4eba-a36f-812272170402.
=============================================================

Seen while looking at nightly crash stats: http://bit.ly/2orsg2T

Crashes started with the 20170401030204 build.

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8df9fabf2587b7020889755acb9e75b664fe13cf&tochange=00a166a8640dffa2e0f48650f966d75ca3c1836e

Bug 1325088 is in the range; setting needinfo on Junior.
Flags: needinfo?(juhsu)
Please expect some delay from me due to a national holiday; keeping the ni.
Flags: needinfo?(juhsu)
I have no theory about what could be wrong. The metadata is a local instance and is not accessed on any other thread. It is read and parsed at https://hg.mozilla.org/mozilla-central/annotate/00a166a8640d/netwerk/cache2/CacheIndex.cpp#l3094

Elements are checked at https://hg.mozilla.org/mozilla-central/annotate/00a166a8640d/netwerk/cache2/CacheFileMetadata.cpp#l977. The check ensures that the buffer holding the elements is not corrupted, i.e. that it contains pairs of "key \0 value \0", so IMO CacheFileMetadata::GetElement() shouldn't crash on it.
It's also worth noting that GetElement() didn't crash just a few lines above, when trying to read alt-data at https://hg.mozilla.org/mozilla-central/annotate/00a166a8640d/netwerk/cache2/CacheIndex.cpp#l2707. Alt-data isn't used yet, so that call had to walk through the whole buffer.
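
For illustration, a minimal sketch of the invariant that check enforces, assuming only the "key \0 value \0" layout described above; this is not the actual Firefox implementation:

#include <cstdint>

// Sketch: verify that aBuf (aSize bytes) consists solely of
// null-terminated "key\0value\0" pairs. If this holds, a strlen()-style
// walk over the buffer can never run past its end.
static bool ElementsLookValid(const char* aBuf, uint32_t aSize) {
  if (aSize == 0) {
    return true;  // an empty element buffer is valid
  }
  if (aBuf[aSize - 1] != '\0') {
    return false;  // last string is not terminated inside the buffer
  }
  // Each '\0' ends exactly one string; a well-formed buffer holds an
  // even number of strings (one value for every key).
  uint32_t strings = 0;
  for (uint32_t i = 0; i < aSize; ++i) {
    if (aBuf[i] == '\0') {
      ++strings;
    }
  }
  return (strings % 2) == 0;
}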
Posted patch: patch (Splinter Review)
Note that this patch won't fix the crash. It changes CacheFileMetadata::GetElement() so that it won't access memory outside the buffer even if the data is corrupted. If the problem is that mBuf is rewritten and points to random memory, then it will still crash in strnlen(). If the data in the buffer is overwritten, it will crash on one of the release asserts.
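
For reference, a hedged sketch of what such a hardened lookup could look like, reconstructed from this description and the assert message quoted later in this bug; details of the actual patch may differ:

#include <cstring>  // strnlen, strcmp

// Sketch: walk the "key\0value\0" pairs without ever reading past
// mBuf + mElementsSize; deliberately crash via MOZ_RELEASE_ASSERT as
// soon as the buffer turns out to be corrupted.
const char* CacheFileMetadata::GetElement(const char* aKey) {
  const char* data = mBuf;
  const char* limit = mBuf + mElementsSize;

  while (data != limit) {
    size_t maxLen = limit - data;
    size_t keyLen = strnlen(data, maxLen);
    MOZ_RELEASE_ASSERT(keyLen != maxLen,
        "Metadata elements corrupted. Key isn't null terminated!");
    MOZ_RELEASE_ASSERT(keyLen + 1 != maxLen,
        "Metadata elements corrupted. There is no value for the key!");

    const char* value = data + keyLen + 1;
    maxLen = limit - value;
    size_t valueLen = strnlen(value, maxLen);
    MOZ_RELEASE_ASSERT(valueLen != maxLen,
        "Metadata elements corrupted. Value isn't null terminated!");

    if (strcmp(data, aKey) == 0) {
      return value;  // the value stored for aKey
    }
    data = value + valueLen + 1;  // advance to the next pair
  }
  return nullptr;  // key not present
}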
Attachment #8854906 - Flags: review?(valentin.gosu)
Flags: needinfo?(juhsu)
Attachment #8854906 - Flags: review?(valentin.gosu) → review+
Assignee: nobody → michal.novotny
Whiteboard: [necko-active]
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3de4bb69aba
Add release asserts to try to narrow down the source of the reported crashes. r=valentin
Keywords: checkin-needed
I see a bunch of crashes in [strnlen] http://bit.ly/2qohBXO - is this the assert that was added in this bug?
This isn't an assert, it's an actual crash, but the cause is the same. Based on the crash address, I think it sometimes happens because mBuf is null.
The other crashes have addresses such as 0xf88e1a6000 ... so it seems we are exceeding the bounds of the memory page, which suggests the metadata was indeed corrupted.
Flags: needinfo?(michal.novotny)
Crash Signature: [@ mozilla::net::CacheFileMetadata::GetElement] → [@ mozilla::net::CacheFileMetadata::GetElement] [@ strnlen]
(In reply to Valentin Gosu [:valentin] from comment #8)
> This isn't an assert, it's an actual crash, but the cause is the same.
> Based on the crash address, I think it sometimes happens because mBuf is
> null. The other crashes have addresses such as 0xf88e1a6000 ... so it
> seems we are exceeding the bounds of the memory page, which suggests the
> metadata was indeed corrupted.

See my comment #3. mBuf was not null and the data was not corrupted just a few lines above this crash. I have no idea what to do here...
Flags: needinfo?(michal.novotny)
[Tracking Requested - why for this release]:
This signature is regressing in volume with the start of the 55.0b cycle: the [@ strnlen] signature is the #3 browser crash in 55.0b2 for the population on the beta update channel (2.85% of browser crashes).
Tracking for 55.

Michal, I'm curious why you think mBuf wasn't null, what am I missing?
Flags: needinfo?(michal.novotny)
If it was null, it would crash on the earlier GetElement call.
Flags: needinfo?(michal.novotny)
Any other diagnostics we can do here? The crash volume is concerning.
Flags: needinfo?(valentin.gosu)
Flags: needinfo?(honzab.moz)
FYI, the first crash I can find with CacheFile in the proto_signature (i.e. not jit/etc. bugs) for strnlen is in 55.0a1, build ID 20170429030208.
(In reply to Mike Taylor [:miketaylr] (55 Regression Engineering Owner) from comment #13)
> Any other diagnostics we can do here? The crash volume is concerning.

Not a question for me, sorry.  Not my code.
Flags: needinfo?(honzab.moz)
I've discussed this with Michal, and it's very difficult to figure out what's going on.
Several crash sites are calling GetElement twice, and only crash on the second call.
I've spent a bit of time looking over GetElement and SetElement, and so far I haven't found anything that could cause this.
This is probably memory corruption caused by some other code. Finding out what causes it isn't going to be easy.
Maybe we could add some guard pages around the metadata?
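
For reference, a minimal sketch of the guard-page idea on POSIX (a hypothetical helper just to illustrate the technique; the cache would need a cross-platform variant):

#include <sys/mman.h>  // mmap, mprotect
#include <unistd.h>    // sysconf
#include <cstddef>

// Sketch: allocate a buffer that ends flush against an inaccessible
// guard page, so any read or write past the buffer faults immediately
// at the overflowing instruction instead of silently corrupting memory.
static char* AllocWithTrailingGuard(size_t aSize) {
  size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
  size_t dataPages = (aSize + page - 1) / page;
  size_t total = (dataPages + 1) * page;  // +1 page for the guard

  void* base = mmap(nullptr, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) {
    return nullptr;
  }
  char* guard = static_cast<char*>(base) + dataPages * page;
  mprotect(guard, page, PROT_NONE);  // revoke all access to the guard
  return guard - aSize;  // buffer ends exactly at the guard page
}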
Flags: needinfo?(valentin.gosu)
Hm, not sure if it's related, but a third of those crashes on beta seem to have an external module from the Indian AV vendor "Quick Heal" hooking into the Firefox process:
(20.18% in signature vs 00.27% overall) Module "ScDetour.Dll" = true
(17.50% in signature vs 00.41% overall) Module "SCDETOUR.DLL" = true
Crash volume has been pretty low in the past week, ~5-10 crashes per day.
The [@ strnlen] signature is more in the area of 100-150 daily crashes on beta, though.
Michal,
I know it's not a perfect solution, but here are two things we could try.
(1) Remove the lambda in http://searchfox.org/mozilla-central/rev/cef8389c687203085dc6b52de2fbd0260d7495bf/netwerk/cache2/CacheIndex.cpp#2741

Lambdas sometimes come with unexpected behavior, like this one:
http://searchfox.org/mozilla-central/diff/59878e7979d63697174d12a40f9b3061d1c976ec/netwerk/dns/mdns/libmdns/MDNSResponderOperator.cpp#624

We have 5-10 crashes per day, so we should get a result from Nightly pretty soon.

(2) Avoid CacheFileMetadata::GetElement() for net-on-start-time.
We could store the time in the header, like the fields at
http://searchfox.org/mozilla-central/source/netwerk/cache2/CacheFileMetadata.h#51

As a bonus, it would also save disk space, since we'd store an integer instead of a string.
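
To make option (2) concrete, a hedged sketch of the idea; the existing fields below follow my reading of CacheFileMetadata.h, and the two new members and their names are hypothetical:

#include <cstdint>

// Sketch: keep the network timing values as fixed integer fields in the
// metadata header instead of "key\0value\0" string elements, so reading
// them never walks the element buffer at all.
struct CacheFileMetadataHeader {
  uint32_t mVersion;
  uint32_t mFetchCount;
  uint32_t mLastFetched;
  uint32_t mLastModified;
  uint32_t mFrecency;
  uint32_t mExpirationTime;
  uint32_t mKeySize;
  uint32_t mFlags;
  uint16_t mOnStartTime;  // hypothetical: ms until OnStartRequest
  uint16_t mOnStopTime;   // hypothetical: ms until OnStopRequest
};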
Flags: needinfo?(michal.novotny)
(In reply to Junior[:junior] from comment #20)
> Michal,
> I know it's not a perfect solution, but here are two things we could try.
> (1) Remove the lambda in
> http://searchfox.org/mozilla-central/rev/cef8389c687203085dc6b52de2fbd0260d7495bf/netwerk/cache2/CacheIndex.cpp#2741
> 
> Lambdas sometimes come with unexpected behavior, like this one:
> http://searchfox.org/mozilla-central/diff/59878e7979d63697174d12a40f9b3061d1c976ec/netwerk/dns/mdns/libmdns/MDNSResponderOperator.cpp#624
> 
> We have 5-10 crashes per day, so we should get a result from Nightly
> pretty soon.
> 
> (2) Avoid CacheFileMetadata::GetElement() for net-on-start-time.
> We could store the time in the header, like the fields at
> http://searchfox.org/mozilla-central/source/netwerk/cache2/CacheFileMetadata.h#51
> 
> As a bonus, it would also save disk space, since we'd store an integer
> instead of a string.

I would prefer to keep the times in the metadata rather than move them into the header. Let's first try removing the lambda...
Flags: needinfo?(michal.novotny)
Depends on: 1380909
No crashes happen after Bug 1390909.
Will ask for an uplift on Wed-Thu if all good.
(In reply to Junior[:junior] from comment #22)
> No crashes happen after Bug 1390909.
> Will ask for an uplift on Wed-Thu if all good.

Sorry, I meant bug 1380909
Uplift to beta there.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1380909
Thanks Junior!

Mirroring the firefox55/56 status of bug 1380909.
We have a SUMO report [1] indicating continuous crashes on startup (even after removing the profile, weirdly!).
It's also showing up in the crash reporter [2].

It hits the assertion with a zero-length value:
MOZ_RELEASE_ASSERT(keyLen + 1 != maxLen, "Metadata elements corrupted. There is no value for the key!");

The crash has become widespread, i.e. it is no longer limited to the network-times metadata [2].
Like the previous syndrome, it can assert on the second GetElement() call [3].

Hopefully we can get the cache log from [1].
Also, SC found that we have some GetElement() calls without locking (e.g. [4]); that would be good to fix.
(Note that this won't fix all the crashes, since [3] is inside the critical section.)

What do you think, michal?

[1] https://support.mozilla.org/en-US/questions/1180306
[2] https://crash-stats.mozilla.com/report/index/087ac762-bf70-4a1b-a5fd-581cb0171015
[3] https://crash-stats.mozilla.com/report/index/a6a31f99-2b7f-4695-9784-956990170928
[4] http://searchfox.org/mozilla-central/rev/dca019c94bf3a840ed7ff50261483410cfece24f/netwerk/cache2/CacheFile.cpp#649
Status: RESOLVED → REOPENED
Flags: needinfo?(michal.novotny)
Resolution: DUPLICATE → ---
(In reply to Junior[:junior] from comment #26)
> Also, SC found that we have some GetElement() calls without locking (e.g.
> [4]); that would be good to fix.
> (Note that this won't fix all the crashes, since [3] is inside the
> critical section.)

CacheFile isn't used before the metadata is read from the disk, so there is no call into the metadata until CacheFile::OnMetadataRead() finishes. The lock shouldn't be needed here.
Flags: needinfo?(michal.novotny)
I think that this report (mentioned at https://support.mozilla.org/en-US/questions/1180306) is important:
https://crash-stats.mozilla.com/report/index/324d62ef-4c15-4410-bbfd-e62801171017

It's metadata that is created in CacheIndex::BuildIndex() and is used only in the scope of this function, so it's clearly not a threading issue. Also, GetElement() at https://hg.mozilla.org/releases/mozilla-release/annotate/8fbf05f4b921/netwerk/cache2/CacheIndex.cpp#l2734 didn't crash, while a few lines below it crashes on another call at https://hg.mozilla.org/releases/mozilla-release/annotate/8fbf05f4b921/netwerk/cache2/CacheIndex.cpp#l2752. On the metadata in CacheIndex::BuildIndex() we only call GetFrecency(), GetExpirationTime() and GetElement(). IMO either some other code overwrites our memory or the memory is faulty (bad modules, an overclocked PC, etc.).
Posted file: log
This is a small sample from the log we got from the user who reported the problem at https://support.mozilla.org/en-US/questions/1180306. The log is garbled, so there is something very wrong with the user's PC. Maybe it's a HW problem or some malware, I don't know. But it's not a problem in the cache; we crash in the cache code only because our sanity checks use MOZ_RELEASE_ASSERT.
P5, as this looks like a machine issue; also, there has been no answer from the reporter for a relatively long time. Candidate for WONTFIX.
Priority: -- → P5
Whiteboard: [necko-active] → [necko-triaged]
The leave-open keyword is set and there has been no activity for 6 months.
:michal.novotny, maybe it's time to close this bug?
Flags: needinfo?(michal.novotny)
I'm just thinking out loud here: instead of release_assert, could we maybe return null & blow away the cache metadata?
(In reply to Valentin Gosu [:valentin] from comment #32)
> I'm just thinking out loud here: instead of release_assert, could we maybe
> return null & blow away the cache metadata?

It looks like it's either a HW problem or the memory is corrupted by some other code. IMO it's not a good idea to try to handle such an unexpected state. The entry is not usable, so just removing the metadata isn't enough: we would need to doom the entry, close all streams, etc., and if the memory is really corrupted, we would crash later somewhere else anyway. Marking as WONTFIX.
Status: REOPENED → RESOLVED
Closed: 2 years ago → 9 months ago
Flags: needinfo?(michal.novotny)
Resolution: --- → WONTFIX