Closed Bug 1106264 Opened 5 years ago Closed 4 years ago

Large OOM in nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase

Categories

(Firefox :: Session Restore, defect, critical)

x86
Windows NT
defect
Not set
critical

Tracking


RESOLVED FIXED
Tracking Status
e10s - ---
firefox37 - wontfix
firefox38 --- wontfix
firefox39 + wontfix
firefox40 + wontfix
firefox41 + wontfix
firefox42 + wontfix
firefox43 + wontfix
firefox44 + verified
firefox45 + verified

People

(Reporter: alex_mayorga, Unassigned)

References

Details

(Keywords: crash)

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is 
report bp-ece146fc-dc12-41fa-a8a4-b8b032141129.
=============================================================
This is by no means a new crash (the oldest report points to Firefox 30) but it's reaching a significant volume. There have been 4988 reports of this crash in the last week, and it accounts for nearly 12% of all DOM crashes.

Firefox 40 - #275 with 3 crashes
Firefox 39 - #56 with 42 crashes
Firefox 38 - #26 with 989 crashes
Firefox 37 - #36 with 2923 crashes

[Tracking Requested - why for this release]: By rank this does not yet count as a topcrash but it is our #1 crash for DOM right now.
Component: General → DOM
Product: Firefox → Core
Version: 34 Branch → Trunk
¡Hola Anthony!

Still a problem on 38.0b99

Report ID 	Date Submitted
bp-1f570d4d-a3f3-47aa-967b-b5c162150506
	06/05/2015	05:15 p.m.

Sadly I still don't have STR. The salient symptoms just prior to the crash are heavy usage of the "Physical Memory" at 98% or more and major black artifacting on the pages that are loaded or loading in Firefox Beta.

Please let me know if there's anything worth collecting from the affected system.

Shall tracking be re-requested for 38?

¡Gracias!
Flags: needinfo?(anthony.s.hughes)
Unfortunately there isn't much we can do without STR. Perhaps the next time you encounter these symptoms you could capture a profile using the Gecko Profiler addon. Additionally, an about:memory log would be useful.
Flags: needinfo?(anthony.s.hughes)
Marking Firefox 37 as no for tracking and wontfix for status, as this release will be EOL when Firefox 38 is released on Tuesday.
[Tracking Requested - why for this release]:
This crash signature is now 1.1% of 39.0b2 data so far.
This is OOM. Not sure what we can do about this.
It sounds like a fix is unlikely until we have better steps to reproduce the bug. 

Andrew since this seems to be a crash signature on the rise for 39, can you find someone to look into it a little more? I can also track this and we can keep an eye on the crashes.
Flags: needinfo?(overholt)
This is rather hard to debug based on the crash-stats. Some JS code somewhere calls
sendAsyncMessage(msg, someLargeData); and we don't have enough memory to serialize that large data.
But we're doing some structured clone data copying. Do we really need to? Could we perhaps
just share the data? This is all within one process.
Perhaps baku has some ideas.
Flags: needinfo?(amarchesini)
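The copying described above can be illustrated with the standard structuredClone API, which behaves like the message manager's serialization in the relevant way: the payload is deep-copied rather than shared, so a large message transiently needs memory for both the original and the copy even though sender and receiver live in the same process. This is an illustrative sketch, not Gecko code.

```javascript
// Illustration only (not Gecko code): the serialization described above
// behaves like structuredClone -- the payload is deep-copied, so a large
// message transiently needs memory for both the original and the copy,
// even though sender and receiver live in the same process.
const payload = { entries: new Array(1000).fill("some session data") };

// Deep copy, as the message manager's structured clone step effectively does.
const copy = structuredClone(payload);

// The copy is fully independent: mutating it leaves the original intact.
copy.entries[0] = "mutated";
console.log(payload.entries[0] === "some session data"); // true
console.log(copy !== payload);                           // true
```

The independence of the copy is exactly why the extra memory is needed; simply handing the receiver a reference to the sender's object would let mutations leak back.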
(In reply to Olli Pettay [:smaug] (Reviewing overload, no new review requests before June 7, please) from comment #8)
> This is rather hard to debug based on the crash-stats. Some JS code
> somewhere calls
> sendAsyncMessage(msg, someLargeData); and we don't have enough memory to
> serialize that large data.

Well, for one thing, if it's OOM, I think we should actually fail the right way and annotate allocation size on the crash. For the other, should we really let JS crash us on a large allocation? Shouldn't we make this fallible instead and return an error, catching it on the JS side?
JS passes something large to the C++ side, and then C++ crashes when it copies the serialized form of some data it got from JS.
The issue is that this is in message manager messaging, so supposedly it is some browser functionality which is causing the issue. Is that functionality important enough that we would rather crash than break it in case of OOM? Hard to say without knowing what this is about.

But I believe we should be able to use less memory here.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #10)
> Well, for one thing, if it's OOM, I think we should actually fail the right
> way and annotate allocation size on the crash.

Funny enough I just added that annotation in bug 1167189 comment 8. I'll request uplift so we can get a handle on the size of these allocations.
> Perhaps baku has some ideas.

There is a bug about using shared memory for big strings in postMessage. But nothing else than this.
I'll talk with bent to know if we have any plan to support something similar for IPDL and maybe we can have something shared.
Flags: needinfo?(overholt)
Flags: needinfo?(amarchesini)
well, this has nothing to do with IPDL.
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*)] → [@ mozalloc_abort(char const* const) | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*)] [@ OOM | large | N…
Summary: crash in mozalloc_abort(char const* const) | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) → Large OOM in nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase
Probably too late to take anything in 39; we could still take a patch for 40.
Do we have any idea when this started to happen?
Flags: needinfo?(lhenry)
Basically forever. We have crash reports going back at least as far as Fx17. (The signatures were different before bug 982646)
Flags: needinfo?(lhenry)
But we started seeing it in higher volume during 34, and it continued to climb in 35, 36, 37.

I suspect that's just because we've eliminated many large OOMs that may have hid this earlier. I.e. memory-constrained machines would have previously crashed somewhere else; now that we've fixed those callsites, they're starting to crash here now.
Interesting, and useful data. I was worried that we had added some new message manager usage where some large data packages are sent via the in-process message manager.
Now that we have allocation size numbers... by eyeball most of these are 1-2MB but they occasionally go as high as 10MB.

Looks like this is going to be our top OOM on beta. Any chance for a fix in 40?
Flags: needinfo?(bugs)
Do we have any correlation for add-ons? If FF itself were using the message manager for such large messages, that would clearly indicate a bug.

(but thinking if there is anything safe for beta)
The correlation files don't show any addons at all for that signature. (I assume that means no addons met the threshold for being suspicious)
felipe, can you think of anything in FF which might pass _huge_ messages via the message manager?
Flags: needinfo?(felipc)
I don't know of anything that would get close to an OOM situation unless things are being leaked. I know that the select boxes can send a big amount of data if there's a <select> with many <option>s (e.g. the "Component" dropdown in this bugzilla page: https://bugzilla.mozilla.org/request.cgi?action=queue&requestee=nobody%40mozilla.org&group=type ), but I doubt that would get past 1MB or so.

I don't know if thumbnails make use of the message manager to do anything, but if they do that would be somewhere to look.

I guess it wouldn't be hard to add some logging when a message larger than 1MB is received, right? And see if we spot it during mochitests or in the wild by logging to the browser console.
Flags: needinfo?(felipc)
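The logging idea above could be sketched roughly like this. `loggingSend` and the 1 MB threshold are hypothetical stand-ins (the real hook would live inside the message manager), and serialized size is approximated here with a JSON round-trip for simplicity.

```javascript
// Hypothetical sketch of logging oversized messages; loggingSend is a
// made-up wrapper, not a real message-manager API. Size is approximated
// with JSON.stringify rather than the actual structured clone encoding.
const ONE_MB = 1024 * 1024;
const largeMessages = [];   // in real code we'd log to the browser console

function loggingSend(send, name, data) {
  const size = JSON.stringify(data).length;
  if (size > ONE_MB) {
    largeMessages.push({ name, size });  // record the offender
  }
  return send(name, data);
}

// Usage with a fake transport standing in for sendAsyncMessage:
const sent = [];
const fakeSend = (name, data) => { sent.push(name); return true; };

loggingSend(fakeSend, "small-message", { a: 1 });
loggingSend(fakeSend, "SessionStore:update", { blob: "x".repeat(2 * ONE_MB) });

console.log(largeMessages.length);    // 1
console.log(largeMessages[0].name);   // "SessionStore:update"
```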
Jim: do you know if there's any usage from the Thumbnails service that could be passing huge messages through the message manager?
Flags: needinfo?(jmathies)
(In reply to :Felipe Gomes [away Jul 8 - 21, will reply to needinfos when back] from comment #25)
> Jim: do you know if there's any usage from the Thumbnails service that could
> be passing huge messages through the message manager?

That was my first thought reading this bug. 

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/thumbnails/PageThumbs.jsm#291

We really need to identify the caller here, knowing size alone isn't helpful.
Flags: needinfo?(jmathies)
> We really need to identify the caller here, knowing size alone isn't helpful.

Can you elaborate why you're interested in the caller? Do you suspect that these huge messages shouldn't be happening at all?

If there's reasonable chance that the message size is legit, maybe we can just use a fallible allocator and call it a day?
(In reply to David Major [:dmajor] from comment #27)
> > We really need to identify the caller here, knowing size alone isn't helpful.
> 
> Can you elaborate why you're interested in the caller? Do you suspect that
> these huge messages shouldn't be happening at all?

I have no idea, that's why I'm interested. Sounds like we could optimize whatever this API is doing; serializing large chunks of data can't be good for performance.

> If there's reasonable chance that the message size is legit, maybe we can
> just use a fallible allocator and call it a day?

I don't know how this fixes this problem:

Total Virtual Memory 	
2,147,352,576

Available Virtual Memory 	
266,911,744

System Memory Use Percentage 	
79%

Available Page File 	
1,321,902,080

Available Physical Memory 	
702,009,344

Why would this system OOM trying to serialize a large message? Does a fallible allocator fix the problem?
> Available Virtual Memory 	
> 266,911,744

For large allocations the limiting factor is contiguous address space. We may have 266 megs available in total, but it's fragmented into a bunch of small regions. With our current allocator, once we get below roughly 300 megs, the available regions are usually not larger than 1 megabyte in size. So in general, we should not expect an allocation over 1MB to succeed.
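The fragmentation point above can be made concrete with a toy model: total free address space can be large while the largest contiguous hole, which is what a single big allocation actually needs, stays tiny. All numbers here are illustrative, not real Firefox measurements.

```javascript
// Toy model of address-space fragmentation (illustrative numbers only):
// pretend a 2048 MB address space with a 2 MB allocation every 3 MB,
// leaving a 1 MB hole between each pair of allocations.
const SPACE = 2048;
const allocations = [];
for (let addr = 0; addr < SPACE; addr += 3) {
  allocations.push([addr, 2]);  // [start, size] in MB
}

// Walk the gaps: total free memory vs. the largest contiguous region.
let totalFree = 0, largestHole = 0;
for (let i = 0; i < allocations.length; i++) {
  const end = allocations[i][0] + allocations[i][1];
  const next = i + 1 < allocations.length ? allocations[i + 1][0] : SPACE;
  const hole = next - end;
  totalFree += hole;
  largestHole = Math.max(largestHole, hole);
}

console.log(totalFree);    // 682 -- hundreds of MB "free" in total
console.log(largestHole);  // 1   -- but no single hole exceeds 1 MB
```

In this model any allocation over 1 MB fails even though a third of the address space is free, which is the situation the crash reports show.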
Jim, Olli, we are hitting this crash on Beta channel at about 1.4%. Given the high volume, could you please help find an owner? Thanks!
Flags: needinfo?(jmathies)
(In reply to David Major [:dmajor] from comment #29)
> > Available Virtual Memory 	
> > 266,911,744
> 
> For large allocations the limiting factor is contiguous address space. We
> may have 266 megs available in total, but it's fragmented into a bunch of
> small regions. With our current allocator, once we get below roughly 300
> megs, the available regions are usually not larger than 1 megabyte in size.
> So in general, we should not expect an allocation over 1MB to succeed.

So if we switch to fallible allocation here per your suggestion in comment 27, does that mean we get better crash signatures with caller info? This wouldn't help us on beta but we need to start somewhere.

Also, if these machines are memory starved and our allocator barfs in these situations, what's the proper fix here: try to reduce the allocation size of the caller, whatever that is? This feels like a bug that doesn't have a real chance of getting fixed barring ripping our allocator out and using something better.
Flags: needinfo?(jmathies)
Flags: needinfo?(dmajor)
There are several possible approaches for dealing with Large OOMs. I think in this case we'd be looking at the last one.

- Make the allocation smaller. Data structure rearrangements, lazy construction, general MemShrink-type work.
- For long arrays, use a SegmentedVector to break the allocations into easier-to-swallow pieces.
- Use a fallible allocator. This papers over the problem but in practice it can be good enough. Some users will be hopelessly low on memory and will just go on to crash somewhere else instead. We can't do much about that. But for other users, a fallible allocation might let them get by just enough to finish what they were doing, or close a tab, or until a GC kicks in, etc. Average uptimes get longer. This becomes more and more true with larger allocation sizes (larger proportion of users who can't survive this allocation but could survive subsequent smaller ones).
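The SegmentedVector option in the list above can be sketched as follows. `SegmentedBuffer` is a made-up name for illustration (the real mozilla::SegmentedVector is a C++ template); the point is that one huge contiguous buffer becomes many small fixed-size allocations, each easy to satisfy in a fragmented address space.

```javascript
// Illustrative sketch of the SegmentedVector idea: data is stored in a
// chain of fixed-size segments, so no single allocation ever exceeds
// segmentSize. SegmentedBuffer is a hypothetical name, not a Gecko class.
class SegmentedBuffer {
  constructor(segmentSize = 64 * 1024) {
    this.segmentSize = segmentSize;
    this.segments = [];
    this.length = 0;
  }
  append(bytes) {
    let offset = 0;
    while (offset < bytes.length) {
      let seg = this.segments[this.segments.length - 1];
      if (!seg || seg.used === this.segmentSize) {
        // Each allocation is only segmentSize, never the full payload.
        seg = { data: new Uint8Array(this.segmentSize), used: 0 };
        this.segments.push(seg);
      }
      const n = Math.min(bytes.length - offset, this.segmentSize - seg.used);
      seg.data.set(bytes.subarray(offset, offset + n), seg.used);
      seg.used += n;
      offset += n;
      this.length += n;
    }
  }
}

// Tiny segments to make the behavior visible:
const buf = new SegmentedBuffer(4);
buf.append(new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8, 9]));
console.log(buf.segments.length); // 3 segments of 4 bytes (last partly used)
console.log(buf.length);          // 9
```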
Flags: needinfo?(dmajor)
So based on that, to fix this we need information on the callers. Looks like we have a message string here we could add to annotations. I think adding debugging on mc to get a feel for what calls commonly fail here is the next step.

http://hg.mozilla.org/releases/mozilla-beta/annotate/664b98d33a1b/dom/base/nsFrameMessageManager.cpp#l2099


This is an ipc related abort on machines with limited memory. Moving this into our e10s triage.
tracking-e10s: --- → ?
This bug is not about e10s or ipc. This is about _same_process_ messaging.
Flags: needinfo?(bugs)
(In reply to Olli Pettay [:smaug] (away-ish until July 26, expect slower than usual reviews) from comment #34)
> This bug is not about e10s or ipc. This is about _same_process_ messaging.

Hmm, ok then it's not suitable for e10s triage as the e10s team won't pick it up. I'm at a loss for an owner here, anyone who knows crash reporter and ipc a bit should be able to add some additional annotations to help diagnose.
Something like this?
Attachment #8637304 - Flags: review?(jmathies)
Comment on attachment 8637304 [details] [diff] [review]
Log the message name

Yep that looks good to me.
Attachment #8637304 - Flags: review?(jmathies) → review+
Comment on attachment 8637304 [details] [diff] [review]
Log the message name

Note that this is a diagnostic patch only.

Approval Request Comment
[Feature/regressing bug #]: unknown
[User impact if declined]: The volume of this crash is so low on Nightly that we won't get enough data.
[Describe test coverage new/current, TreeHerder]: nope
[Risks and why]: very low risk as it only affects codepaths where we've already decided to crash
[String/UUID change made/needed]: none
Attachment #8637304 - Flags: approval-mozilla-aurora?
Keywords: leave-open
Comment on attachment 8637304 [details] [diff] [review]
Log the message name

Diagnostic messaging is good. Let's uplift to Aurora.
Attachment #8637304 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Just two so far, but interesting:

2385920, SessionStore:update
2927296, SessionStore:update

Does anyone know who currently works on session store?
(In reply to Jim Mathies [:jimm] from comment #44)
> Does anyone know who currently works on session store?

Bill would know.
Flags: needinfo?(wmccloskey)
Tim, we're OOMing trying to send the SessionStore:update message. I guess we could try to break up the data. The other option would be to have sendAsyncMessage throw in this case.
Flags: needinfo?(wmccloskey) → needinfo?(ttaubert)
Breaking up the data and supporting incremental updates shouldn't be too hard. Can we figure out which data the message includes that makes us OOM? I fear it might be DOMSessionStorage where it might be caused by a single huge entry that we couldn't split. What would we do when sendAsyncMessage() throws? Just retain the data and keep trying?
Flags: needinfo?(ttaubert)
(In reply to Tim Taubert [:ttaubert] from comment #47)
> Breaking up the data and supporting incremental updates shouldn't be too
> hard. Can we figure out which data the message includes that makes us OOM? I
> fear it might be DOMSessionStorage where it might be caused by a single huge
> entry that we couldn't split.

We should probably assume that DOMSessionStorage is possible then. It might be hard to find out too much, though, since we probably don't have enough memory to serialize the message into the minidump.

> What would we do when sendAsyncMessage()
> throws? Just retain the data and keep trying?

Yeah, I'm not sure what else to do. Hopefully we've fired a memory pressure event and things will get better.
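The throw-and-retry approach discussed above might look roughly like this sketch. `sendAsyncMessageOrThrow` is a hypothetical stand-in for a fallible sendAsyncMessage; a real implementation would retry on a timer or after a memory-pressure event rather than in a tight loop.

```javascript
// Hedged sketch of "make sendAsyncMessage throw, retain the data, retry".
// sendAsyncMessageOrThrow is a made-up stand-in that fails a few times
// to simulate OOM; it is not the real Gecko API.
let failuresLeft = 2;
function sendAsyncMessageOrThrow(name, data) {
  if (failuresLeft > 0) {
    failuresLeft--;
    throw new Error("NS_ERROR_OUT_OF_MEMORY");  // simulated OOM
  }
  return true;  // message delivered
}

function sendWithRetry(name, data, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      sendAsyncMessageOrThrow(name, data);
      return attempt;  // how many attempts it took
    } catch (e) {
      // Retain `data` and try again; real code would wait for a
      // memory-pressure notification or a timer here.
    }
  }
  return -1;  // gave up: data is lost, but the process did not crash
}

console.log(sendWithRetry("SessionStore:update", { tabs: [] })); // 3
```

The trade-off is exactly the one raised above: no crash, but if memory never frees up the update is silently dropped, turning a crash into potential data loss.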
smaug stated that this code only executes for in-process messaging (comment 34), so why is session store sending huge blocks of data over a "local" connection?
It uses the message manager regardless of whether we're in e10s or not. Generally that makes for a nicer design.

Avoiding the structured clone is hard in general. If we just pass the object from the sender to the receiver, then the receiver can modify it and the modifications will be visible to the sender. Also, it has very different semantics for things like XPCOM objects, which don't work under structured clone.

One horrible hack we could do here is to check if we're in the chrome process. If we are, then we can send the data using the CPOWs argument rather than the data argument. We'd have to make sure that we don't modify the data in the session store code, but we could probably make it work.

We'll still have the same problem in e10s, but maybe we can put off fixing it until we get multiple content processes. At that point, address space limitations should be much less of an issue. Or maybe people will actually start using 64-bit windows builds.
We aren't seeing a lot of these on nightly and aurora. Of those that have shown up, 99% are session updates. I found one sdk-related message ('sdk/remote/process/message') and that's it.
out of 100 reports on beta now:

{u'vdh': 3, u'sdk/remote/process/message': 4, u'ContextMenu:SetAsDesktopBackground:Result': 2, u'SessionStore:update': 91}
Component: DOM → Session Restore
Product: Core → Firefox
Bill, it looks like this is indeed SessionStore. Can we do anything in time for 41?
Flags: needinfo?(wmccloskey)
njn, this might interest you. Perhaps we could use Segmented Vector somewhere here
Flags: needinfo?(n.nethercote)
mData is JSAutoStructuredCloneBuffer.


The other option is to try to reuse the data and not do extra cloning.
(In reply to Olli Pettay [:smaug] (high review load) from comment #55)
> njn, this might interest you. Perhaps we could use Segmented Vector
> somewhere here

Thanks for the tip. I'll take a look.
Flags: needinfo?(n.nethercote)
Brad, I am trying to follow up on tracked 41 bugs. Given that this is now a top crasher for Nightly, could you please help find an owner? That might be one step in trying to see if we have a speculative fix for FF41/42.
Flags: needinfo?(blassey.bugs)
(In reply to Ritu Kothari (:ritu) from comment #58)
> Brad, I am trying to follow up on tracked 41 bugs. Given that this is now a
> top crasher for Nightly, could you please help find an owner? That might be
> one step in trying to see if we have a speculative fix for FF41/42.

This was very rare a couple weeks ago, do we know what bug regressed this?
This isn't a top crasher on nightly; it's a bad crash in release (I can't get stats off crash-stats for 40.0.2??). 43 isn't listed in the meta report.

https://crash-stats.mozilla.com/report/list?signature=OOM%20|%20large%20|%20NS_ABORT_OOM%28unsigned%20int%29%20|%20nsSameProcessAsyncMessageBase%3A%3AnsSameProcessAsyncMessageBase%28JSContext*%2C%20nsAString_internal%20const%26%2C%20mozilla%3A%3Adom%3A%3AStructuredCloneData%20const%26%2C%20JS%3A%3AHandle%3CT%3E%2C%20nsIPrincipal*%29#tab-sigsummary

Poking at a few of the reports, it looks likely that this is caused by the same session store code. I see really big allocations on memory-starved systems.

This has nothing to do with e10s either.
Flags: needinfo?(wmccloskey)
Flags: needinfo?(ttaubert)
Not sure what more information I can provide... It sounds like making sendAsyncMessage() throw and simply keep trying to send that data might be the best we can do.
Flags: needinfo?(ttaubert)
Ritu, this isn't a top crasher in Nightly and isn't e10s related. Did you NI me on the wrong bug?
Flags: needinfo?(blassey.bugs) → needinfo?(rkothari)
Brad, my bad. I meant to n-i billm.
Flags: needinfo?(rkothari)
It is getting too late to fix this in FF41 given that we don't have a patch ready. Let's try to get this fixed in 42.
(In reply to Nicholas Nethercote [:njn] from comment #57)
> (In reply to Olli Pettay [:smaug] (high review load) from comment #55)
> > njn, this might interest you. Perhaps we could use Segmented Vector
> > somewhere here
> 
> Thanks for the tip. I'll take a look.

(In reply to Tim Taubert [:ttaubert] from comment #61)
> Not sure what more information I can provide... It sounds like making
> sendAsyncMessage() throw and simply keep trying to send that data might be
> the best we can do.

Tim, Nick - it isn't clear which of those paths is going to be taken here. Could you please work out a plan and an owner?
Flags: needinfo?(ttaubert)
Flags: needinfo?(n.nethercote)
Would either of these options risk breaking session store for these systems, if the system is starved for memory and every attempt to update session data fails?

Also, sounds like we need a bug filed and prioritized by a team that aims to break this data up so we can really fix the problem.
(In reply to Jim Mathies [:jimm] from comment #66)
> Would either of these options risk breaking session store for these systems?
> If the system is starved for memory and every attempt to update session data
> fails.

Yeah that would mean we wouldn't update session data for one or multiple tabs. That's somewhat related to bug 581510 and bug 903621 where the same is happening in different places. We wouldn't crash (so that's better) but we could also lose data if we can't find memory before Firefox quits.

> Also, sounds like we need a bug filed and prioritized by a team that aims to
> break this data up so we can really fix the problem.

We don't have a "team" for sessionstore unfortunately; the latest changes we made were mostly to make it work with multiple processes, so some of the e10s folks and I worked on that.

Breaking this data down is not an easy thing to do if it's really caused by one "type" of data. It's a non-negligible amount of work to implement incremental updates from child processes. But if even those incremental updates are huge (what we guess happens here with DOMSessionStorage) then we can't split that data. We could try to compress it, but that operation needs memory too.

The biggest problem though (I think) is that even if we somehow rewrite code and manage to get this data over in chunks without going OOM then we very likely can't save that to disk anyway. We can't create that huge JSON string we want to send to the session worker that handles our I/O.
Flags: needinfo?(ttaubert)
(In reply to Tim Taubert [:ttaubert] from comment #67)
> (In reply to Jim Mathies [:jimm] from comment #66)
> > Would either of these options risk breaking session store for these systems?
> > If the system is starved for memory and every attempt to update session data
> > fails.
> 
> Yeah that would mean we wouldn't update session data for one or multiple
> tabs. That's somewhat related to bug 581510 and bug 903621 where the same is
> happening in different places. We wouldn't crash (so that's better) but we
> could also lose data if we can't find memory before Firefox quits.

My obviously not great suggestion would be to make sendAsyncMessage() throw so that we can catch that in the content script and simply keep trying. That would be in line with the somewhat shitty UX from bug 581510 & bug 903621 but at least we don't crash.
I experienced two crashes in the last two days:
https://crash-stats.mozilla.com/report/index/c98c0575-04f0-49bc-933f-9aedf2150909
https://crash-stats.mozilla.com/report/index/1eec3c5b-2d99-41e2-bda5-1ee8d2150911

Before each occurred I saw black painting on content. Shortly before the second crash I took a memory report.

I haven't read through this entire bug, so if you guys have this issue fully grasped, great, but if you need info from me like the memory report, my graphics info or more, please let me know.

Twice in two days is quite the escalation and I'd love to see it fixed.
Flags: needinfo?(n.nethercote)
As I mentioned previously, I've been experiencing this crash somewhat frequently and so when I saw it was about to crash (because of the black drawing) I was able to generate an anonymized memory report shortly before 2 separate crashes.

I have attached the two memory reports and their associated crashes ( 1eec3c5b-2d99-41e2-bda5-1ee8d2150911 and https://crash-stats.mozilla.com/report/index/00cb32d2-6533-4a1c-869c-6a4bf2150917 ).
Caspy, thanks for the reports, I think the problem is understood at this point.

Your crash reports show a recurring high memory usage so you are likely to have problems even after this bug is fixed. (Fixing this bug will mainly benefit those users who only occasionally spike into high memory territory.) Perhaps you may want to open a separate bug for "I frequently run out of memory" where you can get advice from the MemShrink team.
I am going to untrack it as we have been tracking it for a while...
dmajor thinks that it is actually actionable. Tracking it again.
David, I think I remember you told me that you worked on the session mechanism; would you be able to help here?
Flags: needinfo?(dteller)
For the record I understand that there's concern about the complexity of implementation here. Perhaps this may require a more involved fix that isn't uplift material. If so, that's fine; it's still better than accepting this high-volume crash signature forever. That's why I'd still like to keep it on the tracking radar so that we don't forget about this bug.
I see the following ways forward.

1/ Make sendAsyncMessage throw, as suggested by billm. As mentioned by Tim, this would, however, leave us with users who cannot write their Session Restore data to disk, so this is at best a temporary measure, which should perhaps be accompanied by a UX "Firefox is running low on memory, you should restart now".

2/ Bite the bullet and make Session Restore updates piecewise. This is possible, I had an early prototype a long time ago, but this is lots of delicate work and I am not planning to work on this again. Also, as Tim mentions this still leaves us with the problem that we may not be able to write such large JSON to disk in the first place.

3/ Bite the other bullet and move Session Restore to sqlite. There is a chance that this might have a negative impact on startup time, but this should improve jank and rid us of these insane issues we run into with JSON.

Any thought, Tim?
Flags: needinfo?(dteller) → needinfo?(ttaubert)
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #76)
> 1/ Make sendAsyncMessage throw, as suggested by billm. As mentioned by Tim,
> this would, however, leave us with users who cannot write their Session
> Restore data to disk, so this is at best a temporary measure, which should
> perhaps be accompanied by a UX "Firefox is running low on memory, you should
> restart now".

That might be the easiest fix for now. It doesn't make OOM situations any easier but it doesn't crash Firefox, which is nice.

> 2/ Bite the bullet and make Session Restore updates piecewise. This is
> possible, I had an early prototype a long time ago, but this is lots of
> delicate work and I am not planning to work on this again. Also, as Tim
> mentions this still leaves us with the problem that we may not be able to
> write such large JSON to disk in the first place.

Also, just one piece (a huge DOMSessionStorage entry) might be enough to go OOM. I will certainly not have the time to implement or even mentor large changes like this.

> 3/ Bite the other bullet and move Session Restore to sqlite. There is a
> chance that this might have a negative impact on startup time, but this
> should improve jank and get us rid of these insane issues into which we run
> with JSON.

Same resource/time problem as with (2). Also I'm not convinced that this helps here? We go OOM just collecting data, not even trying to write it to disk. Sure we might go OOM partially because we hold that big JSON string in memory but we're talking about a few MBs (usually not more than 10). Anything could be pushing us over that boundary easily at that point. I think there might be other data formats to consider, but we should be confident they solve the problem at hand before considering any of those. SQLite will probably bring a whole class of different problems too.
Flags: needinfo?(ttaubert)
Well, 3/ will definitely take resources and open a different can of unpredictable worms.

What makes you say that we can crash collecting data, though? What did I miss? Is this an OOM that crashes when we send data from the content not-process-because-we're-not-in-e10s to the parent?
Flags: needinfo?(ttaubert)
Yeah, sendAsyncMessage() crashes the child process, or everything if we don't actually have multiple processes. The nsSameProcessAsyncMessageBase() constructor aborts on OOM.

http://hg.mozilla.org/releases/mozilla-beta/annotate/664b98d33a1b/dom/base/nsFrameMessageManager.cpp#l2099
Flags: needinfo?(ttaubert)
Ah, in that case, yeah, we need both 1/ and a brand-new 4/ break up data from `SessionStore:update`, as you mentioned above. I wonder which part of the data is so large that it cannot be sent. Session Storage, perhaps?
Session storage should use less memory, but I think we can also optimize memory usage in message sending. StructuredCloneData could use nsStringBuffer to store the data so that the data could be shared, not copied.
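The sharing suggestion above can be sketched as follows: once serialized, the clone bytes are immutable, so same-process delivery could hand out a reference to one buffer instead of duplicating it. `SharedCloneData` is a made-up name for illustration, not the actual StructuredCloneData or nsStringBuffer API.

```javascript
// Illustrative sketch of sharing vs. copying serialized clone data.
// SharedCloneData is a hypothetical stand-in, not a Gecko class.
class SharedCloneData {
  constructor(bytes) {
    this.bytes = bytes;  // immutable serialized structured-clone bytes
  }
  // Sharing hands out the same underlying buffer: O(1), no big allocation.
  share() {
    return this;
  }
  // Copying (what the crashing path effectively does) duplicates the bytes,
  // which is the allocation that fails on fragmented 32-bit address spaces.
  copy() {
    return new SharedCloneData(this.bytes.slice());
  }
}

const original = new SharedCloneData(new Uint8Array(8));
console.log(original.share().bytes === original.bytes); // true: shared
console.log(original.copy().bytes === original.bytes);  // false: duplicated
```

Sharing is safe here only because the serialized form is never mutated; the live JS objects still need structured clone semantics, as noted earlier in the bug.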
I am a bit nervous that if the message is large enough to cause an OOM – any kind of OOM – then we have a structural weakness. As mentioned by ttaubert, if the message is large enough to cause an OOM in the child frame/process, it is certainly large enough to cause an OOM in the parent a bit later when we attempt to write it to disk.

So, if we want to improve the efficiency of message-passing (either by improving StructuredCloneData or by cutting the message into chunks), we also need to rewrite how we write data to disk; otherwise, we're most likely just shifting the OOM crash around. This pretty much means sqlite, I believe.

For the time being, we indeed need to throw from sendAsyncMessage, and handle this semi-correctly, but this will turn this crash into a dataloss, which is not really satisfying.
¡Hola!

Just another data point in case it is useful...

¡Gracias!

bp-334794a8-76ba-4397-957a-d7ad92150930
	30/09/2015	06:06 p.m.

Crashing Thread
Frame 	Module 	Signature 	Source
0 	xul.dll 	NS_ABORT_OOM(unsigned int) 	xpcom/base/nsDebugImpl.cpp
1 	xul.dll 	nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
2 	mozglue.dll 	moz_xmalloc 	memory/mozalloc/mozalloc.cpp
3 	xul.dll 	nsFrameMessageManager::DispatchAsyncMessageInternal(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
4 	xul.dll 	nsFrameMessageManager::DispatchAsyncMessage(nsAString_internal const&, JS::Value const&, JS::Value const&, nsIPrincipal*, JSContext*, unsigned char) 	dom/base/nsFrameMessageManager.cpp
5 	xul.dll 	js::SetObjectElement(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::Handle<JS::Value>, bool, JS::Handle<JSScript*>, unsigned char*) 	js/src/vm/Interpreter.cpp
6 	xul.dll 	PRMJ_Now() 	js/src/vm/Time.cpp
7 	xul.dll 	js::jit::CanEnter(JSContext*, js::RunState&) 	js/src/jit/Ion.cpp
8 	xul.dll 	js::RunScript(JSContext*, js::RunState&) 	js/src/vm/Interpreter.cpp
9 	xul.dll 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
10 	xul.dll 	js::fun_apply(JSContext*, unsigned int, JS::Value*) 	js/src/jsfun.cpp
11 	xul.dll 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
12 	xul.dll 	js::DirectProxyHandler::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) 	js/src/proxy/DirectProxyHandler.cpp
13 	xul.dll 	js::CrossCompartmentWrapper::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) 	js/src/proxy/CrossCompartmentWrapper.cpp
14 	xul.dll 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
15 	xul.dll 	js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) 	js/src/vm/Interpreter.cpp
16 	xul.dll 	js::jit::DoCallFallback 	js/src/jit/BaselineIC.cpp
17 		@0x14b94300 	
18 		@0x30a74337 	
19 		@0x14b90967 	
20 	xul.dll 	EnterBaseline 	js/src/jit/BaselineJIT.cpp
21 	xul.dll 	js::jit::EnterBaselineMethod(JSContext*, js::RunState&) 	js/src/jit/BaselineJIT.cpp
22 	xul.dll 	js::RunScript(JSContext*, js::RunState&) 	js/src/vm/Interpreter.cpp
23 	xul.dll 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
24 	xul.dll 	JS_CallFunctionValue(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) 	js/src/jsapi.cpp
25 	xul.dll 	nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) 	js/xpconnect/src/XPCWrappedJSClass.cpp
26 	xul.dll 	nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) 	js/xpconnect/src/XPCWrappedJS.cpp
27 	xul.dll 	PrepareAndDispatch 	xpcom/reflect/xptcall/md/win32/xptcstubs.cpp
28 	xul.dll 	SharedStub 	xpcom/reflect/xptcall/md/win32/xptcstubs.cpp
29 	nss3.dll 	PR_Assert 	nsprpub/pr/src/io/prlog.c
30 	nss3.dll 	md_UnlockAndPostNotifies 	nsprpub/pr/src/md/windows/w95cv.c
31 	user32.dll 	GetInputBits
As a stopgap measure, I'll try to make sending throw (and add Telemetry), and we'll see how that goes.
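To illustrate the stopgap, here is a minimal sketch of the difference between the current infallible path (moz_xmalloc aborts the process on failure, producing the NS_ABORT_OOM signature in the stacks above) and a fallible copy whose error the caller can surface as a JS exception. The function name, the Result enum, and the out-parameter shape are hypothetical illustrations, not the actual Gecko API:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical result codes standing in for nsresult.
enum Result { Ok, ErrOutOfMemory };

// Fallible copy of the serialized message data. Unlike moz_xmalloc,
// which aborts the whole process when allocation fails, plain malloc
// returns nullptr, and we propagate that to the caller so that
// sendAsyncMessage can throw instead of crashing.
Result CopyMessageData(const char* aData, size_t aLen, char** aOut) {
  char* buf = static_cast<char*>(malloc(aLen));
  if (!buf) {
    return ErrOutOfMemory;  // caller turns this into a JS exception
  }
  memcpy(buf, aData, aLen);
  *aOut = buf;
  return Ok;
}
```

The trade-off discussed in this bug is visible here: the process survives, but the message is dropped, which for Session Restore means data loss rather than a crash.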
Depends on: 1213437
Crash Signature: , nsIPrincipal*) ] [@ OOM | large | NS_ABORT_OOM | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase ] → , nsIPrincipal*) ] [@ OOM | large | NS_ABORT_OOM | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase ] [@ mozalloc_abort | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase]
Hello!

Just another data point in case it is useful...

Thanks!

bp-9a3bff39-5edd-47ca-9af8-e0e892151012
	12/10/2015	03:09 p.m.

Crashing Thread
Frame 	Module 	Signature 	Source
0 	xul.dll 	NS_ABORT_OOM(unsigned int) 	xpcom/base/nsDebugImpl.cpp
1 	xul.dll 	nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
2 	mozglue.dll 	moz_xmalloc 	memory/mozalloc/mozalloc.cpp
3 	xul.dll 	nsFrameMessageManager::DispatchAsyncMessageInternal(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
4 	xul.dll 	nsFrameMessageManager::DispatchAsyncMessage(nsAString_internal const&, JS::Value const&, JS::Value const&, nsIPrincipal*, JSContext*, unsigned char) 	dom/base/nsFrameMessageManager.cpp
5 	xul.dll 	mozilla::jsipc::IdToObjectMap::trace(JSTracer*) 	js/ipc/JavaScriptShared.cpp
6 	xul.dll 	js::InitPropertyOperation(JSContext*, JSOp, JS::Handle<JSObject*>, JS::Handle<jsid>, JS::Handle<JS::Value>) 	js/src/vm/Interpreter-inl.h
Looks like this bug won't be fixed during the 42 beta cycle. However, we still have time for 43 and we would be happy to take a patch during this cycle.
For 43 we could try either bug 1212244 or bug 1213437.
Though both of them probably just move the OOM somewhere else, or make the browser behave inconsistently (when sending large messages) when we're close to OOM.
Bug 1213437 seems to rely on a change that is only in 44, so bug 1212244 might be easier and safer to port to 43.
So, bug 1212244 should be safe with respect to OOM but will cause Session Restore data loss instead.
On the other hand, bug 1213437 (which is generally a Good Thing) will probably only move the OOM in this case, until we have bug 1214346... which will also turn the OOM into Session Restore data loss.

I will try to work on bug 1214346 ASAP, but in the meantime I'm not sure we should uplift anything. Executive decision needed.
Actually, the real solution is probably bug 1216250.
See Also: → 511135
Work is ongoing in bug 1216250 for now; wontfixing this for 43.
Hello!

Is this useful?

bp-95ce8741-a64e-4057-ab99-bf9ee2151204
	04/12/2015	12:35 p.m.

Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	NS_ABORT_OOM(unsigned int) 	xpcom/base/nsDebugImpl.cpp
1 	xul.dll 	nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::ipc::StructuredCloneData&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
2 	mozglue.dll 	moz_xmalloc 	memory/mozalloc/mozalloc.cpp
3 	xul.dll 	nsFrameMessageManager::DispatchAsyncMessageInternal(JSContext*, nsAString_internal const&, mozilla::dom::ipc::StructuredCloneData&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
4 	xul.dll 	xul.dll@0x1b83107

Thanks!
(In reply to alex_mayorga from comment #91)
> Hello!
> 
> Is this useful?
> 
> bp-95ce8741-a64e-4057-ab99-bf9ee2151204
> 	04/12/2015	12:35 p.m.

This is with Firefox Beta 43 which will not be getting any further fixes. It'd be useful to see if you can replicate this in Firefox 44 since that's where bug 1216250 landed.
Looks like it is gone in 44 thanks to bug 1216250!
Congrats David! Thank you very much!
Status: NEW → RESOLVED
Closed: 4 years ago
Keywords: leave-open
Resolution: --- → FIXED