Closed Bug 1106264 Opened 5 years ago Closed 4 years ago

Large OOM in nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase

Categories

(Firefox :: Session Restore, defect, critical)

x86
Windows NT
defect
Not set
critical

Tracking


RESOLVED FIXED
Tracking Status
e10s - ---
firefox37 - wontfix
firefox38 --- wontfix
firefox39 + wontfix
firefox40 + wontfix
firefox41 + wontfix
firefox42 + wontfix
firefox43 + wontfix
firefox44 + verified
firefox45 + verified

People

(Reporter: alex_mayorga, Unassigned)

References

Details

(Keywords: crash)

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is 
report bp-ece146fc-dc12-41fa-a8a4-b8b032141129.
=============================================================
This is by no means a new crash (the oldest report points to Firefox 30) but it's reaching a significant volume. There have been 4988 reports of this crash in the last week, and it accounts for nearly 12% of all DOM crashes.

Firefox 40 - #275 with 3 crashes
Firefox 39 - #56 with 42 crashes
Firefox 38 - #26 with 989 crashes
Firefox 37 - #36 with 2923 crashes

[Tracking Requested - why for this release]: By rank this does not yet count as a topcrash but it is our #1 crash for DOM right now.
Component: General → DOM
Product: Firefox → Core
Version: 34 Branch → Trunk
¡Hola Anthony!

Still a problem on 38.0b99

Report ID 	Date Submitted
bp-1f570d4d-a3f3-47aa-967b-b5c162150506
	06/05/2015	05:15 p.m.

Sadly I still don't have STR. The salient symptoms just prior to the crash are heavy usage of the "Physical Memory" at 98% or more and major black artifacting on the pages that are loaded or loading in Firefox Beta.

Please let me know if there's anything worth collecting from the affected system.

Shall tracking be re-requested for 38?

¡Gracias!
Flags: needinfo?(anthony.s.hughes)
Unfortunately there isn't much we can do without STR. Perhaps the next time you encounter these symptoms you could capture a profile using the Gecko Profiler addon. Additionally, an about:memory log would be useful.
Flags: needinfo?(anthony.s.hughes)
Marking Firefox 37 as no for tracking and wontfix for status, as this release will be EOL when Firefox 38 is released on Tuesday.
[Tracking Requested - why for this release]:
This crash signature is now 1.1% of 39.0b2 data so far.
This is OOM. Not sure what we can do about this.
It sounds like a fix is unlikely until we have better steps to reproduce the bug. 

Andrew since this seems to be a crash signature on the rise for 39, can you find someone to look into it a little more? I can also track this and we can keep an eye on the crashes.
Flags: needinfo?(overholt)
This is rather hard to debug based on the crash-stats. Some JS code somewhere calls
sendAsyncMessage(msg, someLargeData); and we don't have enough memory to serialize that large data.
But we're doing some structured clone data copying. Do we really need to? Could we perhaps
just share the data? This is all within one process.
Perhaps baku has some ideas.
Flags: needinfo?(amarchesini)
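The copying described above can be illustrated with the standard structuredClone API, which behaves like the message manager's serialization in the relevant way: the payload is deep-copied rather than shared, so a large message transiently needs memory for both the original and the copy even though sender and receiver live in the same process. This is an illustrative sketch, not Gecko code.

```javascript
// Illustration only (not Gecko code): the serialization described above
// behaves like structuredClone -- the payload is deep-copied, so a large
// message transiently needs memory for both the original and the copy,
// even though sender and receiver live in the same process.
const payload = { entries: new Array(1000).fill("some session data") };

// Deep copy, as the message manager's structured clone step effectively does.
const copy = structuredClone(payload);

// The copy is fully independent: mutating it leaves the original intact.
copy.entries[0] = "mutated";
console.log(payload.entries[0] === "some session data"); // true
console.log(copy !== payload);                           // true
```

The independence of the copy is exactly why the extra memory is needed; simply handing the receiver a reference to the sender's object would let mutations leak back.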
(In reply to Olli Pettay [:smaug] (Reviewing overload, no new review requests before June 7, please) from comment #8)
> This is rather hard to debug based on the crash-stats. Some JS code
> somewhere calls
> sendAsyncMessage(msg, someLargeData); and we don't have enough memory to
> serialize that large data.

Well, for one thing, if it's OOM, I think we should actually fail the right way and annotate allocation size on the crash. For the other, should we really let JS crash us on a large allocation? Shouldn't we make this fallible instead and return an error, catching it on the JS side?
JS passes something large to the C++ side, and then C++ crashes when it copies the serialized form of some data it got from JS.
The issue is that this is in message manager messaging, so supposedly it is some browser functionality which is causing the issue. Is that functionality important enough that we would rather crash than break it in case of OOM? Hard to say without knowing what this is about.

But I believe we should be able to use less memory here.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #10)
> Well, for one thing, if it's OOM, I think we should actually fail the right
> way and annotate allocation size on the crash.

Funny enough I just added that annotation in bug 1167189 comment 8. I'll request uplift so we can get a handle on the size of these allocations.
> Perhaps baku has some ideas.

There is a bug about using shared memory for big strings in postMessage. But nothing else than this.
I'll talk with bent to know if we have any plan to support something similar for IPDL and maybe we can have something shared.
Flags: needinfo?(overholt)
Flags: needinfo?(amarchesini)
well, this has nothing to do with IPDL.
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*)] → [@ mozalloc_abort(char const* const) | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*)] [@ OOM | large | N…
Summary: crash in mozalloc_abort(char const* const) | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) → Large OOM in nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase
Probably too late to take anything in 39; we could still take a patch for 40.
Do we have any idea when this started to happen?
Flags: needinfo?(lhenry)
Basically forever. We have crash reports going back at least as far as Fx17. (The signatures were different before bug 982646)
Flags: needinfo?(lhenry)
But we started seeing it in higher volume during 34, and it continued to climb in 35, 36, 37.

I suspect that's just because we've eliminated many large OOMs that may have hid this earlier. I.e. memory-constrained machines would have previously crashed somewhere else; now that we've fixed those callsites, they're starting to crash here now.
Interesting, and useful data. I was worried that we had added some new message manager usage where some large data packages are sent via the in-process message manager.
Now that we have allocation size numbers... by eyeball most of these are 1-2MB but they occasionally go as high as 10MB.

Looks like this is going to be our top OOM on beta. Any chance for a fix in 40?
Flags: needinfo?(bugs)
Do we have any correlation for add-ons? If FF itself were using the message manager for such large messages, that would clearly indicate a bug.

(but thinking if there is anything safe for beta)
The correlation files don't show any addons at all for that signature. (I assume that means no addons met the threshold for being suspicious)
felipe, can you think of anything in FF which might pass _huge_ messages via the message manager?
Flags: needinfo?(felipc)
I don't know of anything that would get close to an OOM situation unless things are being leaked. I know that the select boxes can send a big amount of data if there's a <select> with many <option>s (e.g. the "Component" dropdown in this bugzilla page: https://bugzilla.mozilla.org/request.cgi?action=queue&requestee=nobody%40mozilla.org&group=type ), but I doubt that would get past 1MB or so.

I don't know if thumbnails make use of the message manager to do anything, but if they do that would be somewhere to look.

I guess it wouldn't be hard to add some logging when a message larger than 1MB is received, right? And see if we spot it during mochitests or in the wild by logging to the browser console.
Flags: needinfo?(felipc)
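The logging idea above could be sketched roughly like this. `loggingSend` and the 1 MB threshold are hypothetical stand-ins (the real hook would live inside the message manager), and serialized size is approximated here with a JSON round-trip for simplicity.

```javascript
// Hypothetical sketch of logging oversized messages; loggingSend is a
// made-up wrapper, not a real message-manager API. Size is approximated
// with JSON.stringify rather than the actual structured clone encoding.
const ONE_MB = 1024 * 1024;
const largeMessages = [];   // in real code we'd log to the browser console

function loggingSend(send, name, data) {
  const size = JSON.stringify(data).length;
  if (size > ONE_MB) {
    largeMessages.push({ name, size });  // record the offender
  }
  return send(name, data);
}

// Usage with a fake transport standing in for sendAsyncMessage:
const sent = [];
const fakeSend = (name, data) => { sent.push(name); return true; };

loggingSend(fakeSend, "small-message", { a: 1 });
loggingSend(fakeSend, "SessionStore:update", { blob: "x".repeat(2 * ONE_MB) });

console.log(largeMessages.length);    // 1
console.log(largeMessages[0].name);   // "SessionStore:update"
```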
Jim: do you know if there's any usage from the Thumbnails service that could be passing huge messages through the message manager?
Flags: needinfo?(jmathies)
(In reply to :Felipe Gomes [away Jul 8 - 21, will reply to needinfos when back] from comment #25)
> Jim: do you know if there's any usage from the Thumbnails service that could
> be passing huge messages through the message manager?

That was my first thought reading this bug. 

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/thumbnails/PageThumbs.jsm#291

We really need to identify the caller here, knowing size alone isn't helpful.
Flags: needinfo?(jmathies)
> We really need to identify the caller here, knowing size alone isn't helpful.

Can you elaborate why you're interested in the caller? Do you suspect that these huge messages shouldn't be happening at all?

If there's reasonable chance that the message size is legit, maybe we can just use a fallible allocator and call it a day?
(In reply to David Major [:dmajor] from comment #27)
> > We really need to identify the caller here, knowing size alone isn't helpful.
> 
> Can you elaborate why you're interested in the caller? Do you suspect that
> these huge messages shouldn't be happening at all?

I have no idea, that's why I'm interested. Sounds like we could optimize whatever this API is doing; serializing large chunks of data can't be good for performance.

> If there's reasonable chance that the message size is legit, maybe we can
> just use a fallible allocator and call it a day?

I don't know how this fixes this problem:

Total Virtual Memory 	
2,147,352,576

Available Virtual Memory 	
266,911,744

System Memory Use Percentage 	
79%

Available Page File 	
1,321,902,080

Available Physical Memory 	
702,009,344

Why would this system OOM trying to serialize a large message? Does a fallible allocator fix the problem?
> Available Virtual Memory 	
> 266,911,744

For large allocations the limiting factor is contiguous address space. We may have 266 megs available in total, but it's fragmented into a bunch of small regions. With our current allocator, once we get below roughly 300 megs, the available regions are usually not larger than 1 megabyte in size. So in general, we should not expect an allocation over 1MB to succeed.
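The fragmentation point above can be made concrete with a toy model: total free address space can be large while the largest contiguous hole, which is what a single big allocation actually needs, stays tiny. All numbers here are illustrative, not real Firefox measurements.

```javascript
// Toy model of address-space fragmentation (illustrative numbers only):
// pretend a 2048 MB address space with a 2 MB allocation every 3 MB,
// leaving a 1 MB hole between each pair of allocations.
const SPACE = 2048;
const allocations = [];
for (let addr = 0; addr < SPACE; addr += 3) {
  allocations.push([addr, 2]);  // [start, size] in MB
}

// Walk the gaps: total free memory vs. the largest contiguous region.
let totalFree = 0, largestHole = 0;
for (let i = 0; i < allocations.length; i++) {
  const end = allocations[i][0] + allocations[i][1];
  const next = i + 1 < allocations.length ? allocations[i + 1][0] : SPACE;
  const hole = next - end;
  totalFree += hole;
  largestHole = Math.max(largestHole, hole);
}

console.log(totalFree);    // 682 -- hundreds of MB "free" in total
console.log(largestHole);  // 1   -- but no single hole exceeds 1 MB
```

In this model any allocation over 1 MB fails even though a third of the address space is free, which is the situation the crash reports show.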
Jim, Olli, we are hitting this crash on Beta channel at about 1.4%. Given the high volume, could you please help find an owner? Thanks!
Flags: needinfo?(jmathies)
(In reply to David Major [:dmajor] from comment #29)
> > Available Virtual Memory 	
> > 266,911,744
> 
> For large allocations the limiting factor is contiguous address space. We
> may have 266 megs available in total, but it's fragmented into a bunch of
> small regions. With our current allocator, once we get below roughly 300
> megs, the available regions are usually not larger than 1 megabyte in size.
> So in general, we should not expect an allocation over 1MB to succeed.

So if we switch to fallible allocation here per your suggestion in comment 27, does that mean we get better crash signatures with caller info? This wouldn't help us on beta but we need to start somewhere.

Also, if these machines are memory starved and our allocator barfs in these situations, what's the proper fix here: try to reduce the allocation size of the caller, whatever that is? This feels like a bug that doesn't have a real chance of getting fixed barring ripping our allocator out and using something better.
Flags: needinfo?(jmathies)
Flags: needinfo?(dmajor)
There are several possible approaches for dealing with Large OOMs. I think in this case we'd be looking at the last one.

- Make the allocation smaller. Data structure rearrangements, lazy construction, general MemShrink-type work.
- For long arrays, use a SegmentedVector to break the allocations into easier-to-swallow pieces.
- Use a fallible allocator. This papers over the problem but in practice it can be good enough. Some users will be hopelessly low on memory and will just go on to crash somewhere else instead. We can't do much about that. But for other users, a fallible allocation might let them get by just enough to finish what they were doing, or close a tab, or until a GC kicks in, etc. Average uptimes get longer. This becomes more and more true with larger allocation sizes (larger proportion of users who can't survive this allocation but could survive subsequent smaller ones).
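The SegmentedVector option in the list above can be sketched as follows. `SegmentedBuffer` is a made-up name for illustration (the real mozilla::SegmentedVector is a C++ template); the point is that one huge contiguous buffer becomes many small fixed-size allocations, each easy to satisfy in a fragmented address space.

```javascript
// Illustrative sketch of the SegmentedVector idea: data is stored in a
// chain of fixed-size segments, so no single allocation ever exceeds
// segmentSize. SegmentedBuffer is a hypothetical name, not a Gecko class.
class SegmentedBuffer {
  constructor(segmentSize = 64 * 1024) {
    this.segmentSize = segmentSize;
    this.segments = [];
    this.length = 0;
  }
  append(bytes) {
    let offset = 0;
    while (offset < bytes.length) {
      let seg = this.segments[this.segments.length - 1];
      if (!seg || seg.used === this.segmentSize) {
        // Each allocation is only segmentSize, never the full payload.
        seg = { data: new Uint8Array(this.segmentSize), used: 0 };
        this.segments.push(seg);
      }
      const n = Math.min(bytes.length - offset, this.segmentSize - seg.used);
      seg.data.set(bytes.subarray(offset, offset + n), seg.used);
      seg.used += n;
      offset += n;
      this.length += n;
    }
  }
}

// Tiny segments to make the behavior visible:
const buf = new SegmentedBuffer(4);
buf.append(new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8, 9]));
console.log(buf.segments.length); // 3 segments of 4 bytes (last partly used)
console.log(buf.length);          // 9
```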
Flags: needinfo?(dmajor)
So based on that, to fix this we need information on the callers. Looks like we have a message string here we could add to annotations. I think adding debugging on mc to get a feel for what calls commonly fail here is the next step.

http://hg.mozilla.org/releases/mozilla-beta/annotate/664b98d33a1b/dom/base/nsFrameMessageManager.cpp#l2099


This is an ipc related abort on machines with limited memory. Moving this into our e10s triage.
tracking-e10s: --- → ?
This bug is not about e10s or ipc. This is about _same_process_ messaging.
Flags: needinfo?(bugs)
(In reply to Olli Pettay [:smaug] (away-ish until July 26, expect slower than usual reviews) from comment #34)
> This bug is not about e10s or ipc. This is about _same_process_ messaging.

Hmm, ok then it's not suitable for e10s triage as the e10s team won't pick it up. I'm at a loss for an owner here, anyone who knows crash reporter and ipc a bit should be able to add some additional annotations to help diagnose.
Something like this?
Attachment #8637304 - Flags: review?(jmathies)
Comment on attachment 8637304 [details] [diff] [review]
Log the message name

Yep that looks good to me.
Attachment #8637304 - Flags: review?(jmathies) → review+
Comment on attachment 8637304 [details] [diff] [review]
Log the message name

Note that this is a diagnostic patch only.

Approval Request Comment
[Feature/regressing bug #]: unknown
[User impact if declined]: The volume of this crash is so low on Nightly that we won't get enough data.
[Describe test coverage new/current, TreeHerder]: nope
[Risks and why]: very low risk as it only affects codepaths where we've already decided to crash
[String/UUID change made/needed]: none
Attachment #8637304 - Flags: approval-mozilla-aurora?
Keywords: leave-open
Comment on attachment 8637304 [details] [diff] [review]
Log the message name

Diagnostic messaging is good. Let's uplift to Aurora.
Attachment #8637304 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Just two so far, but interesting:

2385920, SessionStore:update
2927296, SessionStore:update

Does anyone know who currently works on session store?
(In reply to Jim Mathies [:jimm] from comment #44)
> Does anyone know who currently works on session store?

Bill would know.
Flags: needinfo?(wmccloskey)
Tim, we're OOMing trying to send the SessionStore:update message. I guess we could try to break up the data. The other option would be to have sendAsyncMessage throw in this case.
Flags: needinfo?(wmccloskey) → needinfo?(ttaubert)
Breaking up the data and supporting incremental updates shouldn't be too hard. Can we figure out which data the message includes that makes us OOM? I fear it might be DOMSessionStorage where it might be caused by a single huge entry that we couldn't split. What would we do when sendAsyncMessage() throws? Just retain the data and keep trying?
Flags: needinfo?(ttaubert)
(In reply to Tim Taubert [:ttaubert] from comment #47)
> Breaking up the data and supporting incremental updates shouldn't be too
> hard. Can we figure out which data the message includes that makes us OOM? I
> fear it might be DOMSessionStorage where it might be caused by a single huge
> entry that we couldn't split.

We should probably assume that DOMSessionStorage is possible then. It might be hard to find out too much, though, since we probably don't have enough memory to serialize the message into the minidump.

> What would we do when sendAsyncMessage()
> throws? Just retain the data and keep trying?

Yeah, I'm not sure what else to do. Hopefully we've fired a memory pressure event and things will get better.
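The throw-and-retry approach discussed above might look roughly like this sketch. `sendAsyncMessageOrThrow` is a hypothetical stand-in for a fallible sendAsyncMessage; a real implementation would retry on a timer or after a memory-pressure event rather than in a tight loop.

```javascript
// Hedged sketch of "make sendAsyncMessage throw, retain the data, retry".
// sendAsyncMessageOrThrow is a made-up stand-in that fails a few times
// to simulate OOM; it is not the real Gecko API.
let failuresLeft = 2;
function sendAsyncMessageOrThrow(name, data) {
  if (failuresLeft > 0) {
    failuresLeft--;
    throw new Error("NS_ERROR_OUT_OF_MEMORY");  // simulated OOM
  }
  return true;  // message delivered
}

function sendWithRetry(name, data, maxAttempts = 5) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      sendAsyncMessageOrThrow(name, data);
      return attempt;  // how many attempts it took
    } catch (e) {
      // Retain `data` and try again; real code would wait for a
      // memory-pressure notification or a timer here.
    }
  }
  return -1;  // gave up: data is lost, but the process did not crash
}

console.log(sendWithRetry("SessionStore:update", { tabs: [] })); // 3
```

The trade-off is exactly the one raised above: no crash, but if memory never frees up the update is silently dropped, turning a crash into potential data loss.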
smaug stated that this code only executes for in-process messaging (comment 34), so why is session store sending huge blocks of data over a "local" connection?
It uses the message manager regardless of whether we're in e10s or not. Generally that makes for a nicer design.

Avoiding the structured clone is hard in general. If we just pass the object from the sender to the receiver, then the receiver can modify it and the modifications will be visible to the sender. Also, it has very different semantics for things like XPCOM objects, which don't work under structured clone.

One horrible hack we could do here is to check if we're in the chrome process. If we are, then we can send the data using the CPOWs argument rather than the data argument. We'd have to make sure that we don't modify the data in the session store code, but we could probably make it work.

We'll still have the same problem in e10s, but maybe we can put off fixing it until we get multiple content processes. At that point, address space limitations should be much less of an issue. Or maybe people will actually start using 64-bit windows builds.
We aren't seeing a lot of these on nightly and aurora. Of those that have shown up, 99% are session updates. I found one sdk-related message ('sdk/remote/process/message') and that's it.
out of 100 reports on beta now:

{u'vdh': 3, u'sdk/remote/process/message': 4, u'ContextMenu:SetAsDesktopBackground:Result': 2, u'SessionStore:update': 91}
Component: DOM → Session Restore
Product: Core → Firefox
Bill, it looks like this is indeed SessionStore. Can we do anything in time for 41?
Flags: needinfo?(wmccloskey)
njn, this might interest you. Perhaps we could use Segmented Vector somewhere here
Flags: needinfo?(n.nethercote)
mData is JSAutoStructuredCloneBuffer.


The other option is to try to reuse the data and not do extra cloning.
(In reply to Olli Pettay [:smaug] (high review load) from comment #55)
> njn, this might interest you. Perhaps we could use Segmented Vector
> somewhere here

Thanks for the tip. I'll take a look.
Flags: needinfo?(n.nethercote)
Brad, I am trying to follow up on tracked 41 bugs. Given that this is now a top crasher for Nightly, could you please help find an owner? That might be one step in trying to see if we have a speculative fix for FF41/42.
Flags: needinfo?(blassey.bugs)
(In reply to Ritu Kothari (:ritu) from comment #58)
> Brad, I am trying to follow up on tracked 41 bugs. Given that this is now a
> top crasher for Nightly, could you please help find an owner? That might be
> one step in trying to see if we have a speculative fix for FF41/42.

This was very rare a couple weeks ago, do we know what bug regressed this?
This isn't a top crasher on nightly; it's a bad crash in release (I can't get stats off crash-stats for 40.0.2??). 43 isn't listed in the meta report.

https://crash-stats.mozilla.com/report/list?signature=OOM%20|%20large%20|%20NS_ABORT_OOM%28unsigned%20int%29%20|%20nsSameProcessAsyncMessageBase%3A%3AnsSameProcessAsyncMessageBase%28JSContext*%2C%20nsAString_internal%20const%26%2C%20mozilla%3A%3Adom%3A%3AStructuredCloneData%20const%26%2C%20JS%3A%3AHandle%3CT%3E%2C%20nsIPrincipal*%29#tab-sigsummary

Poking at a few of the reports, it looks likely that this is caused by the same session store code. I see really big allocations on memory-starved systems.

This has nothing to do with e10s either.
Flags: needinfo?(wmccloskey)
Flags: needinfo?(ttaubert)
Not sure what more information I can provide... It sounds like making sendAsyncMessage() throw and simply keep trying to send that data might be the best we can do.
Flags: needinfo?(ttaubert)
Ritu, this isn't a top crasher in Nightly and isn't e10s related. Did you NI me on the wrong bug?
Flags: needinfo?(blassey.bugs) → needinfo?(rkothari)
Brad, my bad. I meant to n-i billm.
Flags: needinfo?(rkothari)
It is getting too late to fix this in FF41 given that we don't have a patch ready. Let's try to get this fixed in 42.
(In reply to Nicholas Nethercote [:njn] from comment #57)
> (In reply to Olli Pettay [:smaug] (high review load) from comment #55)
> > njn, this might interest you. Perhaps we could use Segmented Vector
> > somewhere here
> 
> Thanks for the tip. I'll take a look.

(In reply to Tim Taubert [:ttaubert] from comment #61)
> Not sure what more information I can provide... It sounds like making
> sendAsyncMessage() throw and simply keep trying to send that data might be
> the best we can do.

Tim, Nick - it isn't clear which of those paths is going to be taken here. Could you please work out a plan and an owner?
Flags: needinfo?(ttaubert)
Flags: needinfo?(n.nethercote)
Would either of these options risk breaking session store for these systems, if the system is starved for memory and every attempt to update session data fails?

Also, sounds like we need a bug filed and prioritized by a team that aims to break this data up so we can really fix the problem.
(In reply to Jim Mathies [:jimm] from comment #66)
> Would either of these options risk breaking session store for these systems?
> If the system is starved for memory and every attempt to update session data
> fails.

Yeah that would mean we wouldn't update session data for one or multiple tabs. That's somewhat related to bug 581510 and bug 903621 where the same is happening in different places. We wouldn't crash (so that's better) but we could also lose data if we can't find memory before Firefox quits.

> Also, sounds like we need a bug filed and prioritized by a team that aims to
> break this data up so we can really fix the problem.

We don't have a "team" for sessionstore unfortunately; the latest changes we made were mostly to make it work with multiple processes, so some of the e10s folks and I worked on that.

Breaking this data down is not an easy thing to do if it's really caused by one "type" of data. It's a non-negligible amount of work to implement incremental updates from child processes. But if even those incremental updates are huge (what we guess happens here with DOMSessionStorage) then we can't split that data. We could try to compress it, but that operation needs memory too.

The biggest problem though (I think) is that even if we somehow rewrite code and manage to get this data over in chunks without going OOM then we very likely can't save that to disk anyway. We can't create that huge JSON string we want to send to the session worker that handles our I/O.
Flags: needinfo?(ttaubert)
(In reply to Tim Taubert [:ttaubert] from comment #67)
> (In reply to Jim Mathies [:jimm] from comment #66)
> > Would either of these options risk breaking session store for these systems?
> > If the system is starved for memory and every attempt to update session data
> > fails.
> 
> Yeah that would mean we wouldn't update session data for one or multiple
> tabs. That's somewhat related to bug 581510 and bug 903621 where the same is
> happening in different places. We wouldn't crash (so that's better) but we
> could also lose data if we can't find memory before Firefox quits.

My obviously not great suggestion would be to make sendAsyncMessage() throw so that we can catch that in the content script and simply keep trying. That would be in line with the somewhat shitty UX from bug 581510 & bug 903621 but at least we don't crash.
I experienced two crashes in the last two days:
https://crash-stats.mozilla.com/report/index/c98c0575-04f0-49bc-933f-9aedf2150909
https://crash-stats.mozilla.com/report/index/1eec3c5b-2d99-41e2-bda5-1ee8d2150911

Before each occurred I saw black painting on content. Shortly before the second crash I took a memory report.

I haven't read through this entire bug, so if you guys have this issue fully grasped, great, but if you need info from me like the memory report, my graphics info or more, please let me know.

Twice in two days is quite the escalation and I'd love to see it fixed.
Flags: needinfo?(n.nethercote)
As I mentioned previously, I've been experiencing this crash somewhat frequently and so when I saw it was about to crash (because of the black drawing) I was able to generate an anonymized memory report shortly before 2 separate crashes.

I have attached the two memory reports and their associated crashes ( 1eec3c5b-2d99-41e2-bda5-1ee8d2150911 and https://crash-stats.mozilla.com/report/index/00cb32d2-6533-4a1c-869c-6a4bf2150917 ).
Caspy, thanks for the reports, I think the problem is understood at this point.

Your crash reports show a recurring high memory usage so you are likely to have problems even after this bug is fixed. (Fixing this bug will mainly benefit those users who only occasionally spike into high memory territory.) Perhaps you may want to open a separate bug for "I frequently run out of memory" where you can get advice from the MemShrink team.
I am going to untrack it as we have been tracking it for a while...
dmajor thinks that it is actually actionable. Tracking it again.
David, I think I remember you told me that you worked on the session mechanism; would you be able to help here?
Flags: needinfo?(dteller)
For the record I understand that there's concern about the complexity of implementation here. Perhaps this may require a more involved fix that isn't uplift material. If so, that's fine; it's still better than accepting this high-volume crash signature forever. That's why I'd still like to keep it on the tracking radar so that we don't forget about this bug.
I see the following ways forward.

1/ Make sendAsyncMessage throw, as suggested by billm. As mentioned by Tim, this would, however, leave us with users who cannot write their Session Restore data to disk, so this is at best a temporary measure, which should perhaps be accompanied by a UX "Firefox is running low on memory, you should restart now".

2/ Bite the bullet and make Session Restore updates piecewise. This is possible, I had an early prototype a long time ago, but this is lots of delicate work and I am not planning to work on this again. Also, as Tim mentions this still leaves us with the problem that we may not be able to write such large JSON to disk in the first place.

3/ Bite the other bullet and move Session Restore to sqlite. There is a chance that this might have a negative impact on startup time, but this should improve jank and rid us of these insane issues we run into with JSON.

Any thought, Tim?
Flags: needinfo?(dteller) → needinfo?(ttaubert)
(In reply to David Rajchenbach-Teller [:Yoric] (use "needinfo") from comment #76)
> 1/ Make sendAsyncMessage throw, as suggested by billm. As mentioned by Tim,
> this would, however, leave us with users who cannot write their Session
> Restore data to disk, so this is at best a temporary measure, which should
> perhaps be accompanied by a UX "Firefox is running low on memory, you should
> restart now".

That might be the easiest fix for now. It doesn't make OOM situations any easier but it doesn't crash Firefox, which is nice.

> 2/ Bite the bullet and make Session Restore updates piecewise. This is
> possible, I had an early prototype a long time ago, but this is lots of
> delicate work and I am not planning to work on this again. Also, as Tim
> mentions this still leaves us with the problem that we may not be able to
> write such large JSON to disk in the first place.

Also, just one piece (a huge DOMSessionStorage entry) might be enough to go OOM. I will certainly not have the time to implement or even mentor large changes like this.

> 3/ Bite the other bullet and move Session Restore to sqlite. There is a
> chance that this might have a negative impact on startup time, but this
> should improve jank and get us rid of these insane issues into which we run
> with JSON.

Same resource/time problem as with (2). Also I'm not convinced that this helps here? We go OOM just collecting data, not even trying to write it to disk. Sure we might go OOM partially because we hold that big JSON string in memory but we're talking about a few MBs (usually not more than 10). Anything could be pushing us over that boundary easily at that point. I think there might be other data formats to consider, but we should be confident they solve the problem at hand before considering any of those. SQLite will probably bring a whole class of different problems too.
Flags: needinfo?(ttaubert)
Well, 3/ will definitely take resources and open a different can of unpredictable worms.

What makes you say that we can crash collecting data, though? What did I miss? Is this an OOM that crashes when we send data from the content not-process-because-we're-not-in-e10s to the parent?
Flags: needinfo?(ttaubert)
Yeah, sendAsyncMessage() crashes the child process, or everything if we don't actually have multiple processes. The nsSameProcessAsyncMessageBase() constructor aborts on OOM.

http://hg.mozilla.org/releases/mozilla-beta/annotate/664b98d33a1b/dom/base/nsFrameMessageManager.cpp#l2099
Flags: needinfo?(ttaubert)
Ah, in that case, yeah, we need both 1/ and a brand-new 4/ break up data from `SessionStore:update`, as you mentioned above. I wonder which part of the data is so large that it cannot be sent. Session Storage, perhaps?
Session storage should use less memory, but I think we can also optimize memory usage in message sending. StructuredCloneData could use nsStringBuffer to store the data so that the data could be shared, not copied.
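The sharing suggestion above can be sketched as follows: once serialized, the clone bytes are immutable, so same-process delivery could hand out a reference to one buffer instead of duplicating it. `SharedCloneData` is a made-up name for illustration, not the actual StructuredCloneData or nsStringBuffer API.

```javascript
// Illustrative sketch of sharing vs. copying serialized clone data.
// SharedCloneData is a hypothetical stand-in, not a Gecko class.
class SharedCloneData {
  constructor(bytes) {
    this.bytes = bytes;  // immutable serialized structured-clone bytes
  }
  // Sharing hands out the same underlying buffer: O(1), no big allocation.
  share() {
    return this;
  }
  // Copying (what the crashing path effectively does) duplicates the bytes,
  // which is the allocation that fails on fragmented 32-bit address spaces.
  copy() {
    return new SharedCloneData(this.bytes.slice());
  }
}

const original = new SharedCloneData(new Uint8Array(8));
console.log(original.share().bytes === original.bytes); // true: shared
console.log(original.copy().bytes === original.bytes);  // false: duplicated
```

Sharing is safe here only because the serialized form is never mutated; the live JS objects still need structured clone semantics, as noted earlier in the bug.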
I am a bit nervous that if the message is large enough to cause an OOM – any kind of OOM – then we have a structural weakness. As mentioned by ttaubert, if the message is large enough to cause an OOM in the child frame/process, it is certainly large enough to cause an OOM in the parent a bit later when we attempt to write it to disk.

So, if we want to improve the efficiency of message-passing (either by improving StructuredCloneData or by cutting the message into chunks), we also need to rewrite how we write data to disk; otherwise, we're most likely just shifting the OOM crash around. This pretty much means sqlite, I believe.

For the time being, we indeed need to throw from sendAsyncMessage, and handle this semi-correctly, but this will turn this crash into a dataloss, which is not really satisfying.
¡Hola!

Just another data point in case it is useful...

¡Gracias!

bp-334794a8-76ba-4397-957a-d7ad92150930
	30/09/2015	06:06 p.m.

Crashing Thread
Frame 	Module 	Signature 	Source
0 	xul.dll 	NS_ABORT_OOM(unsigned int) 	xpcom/base/nsDebugImpl.cpp
1 	xul.dll 	nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
2 	mozglue.dll 	moz_xmalloc 	memory/mozalloc/mozalloc.cpp
3 	xul.dll 	nsFrameMessageManager::DispatchAsyncMessageInternal(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
4 	xul.dll 	nsFrameMessageManager::DispatchAsyncMessage(nsAString_internal const&, JS::Value const&, JS::Value const&, nsIPrincipal*, JSContext*, unsigned char) 	dom/base/nsFrameMessageManager.cpp
5 	xul.dll 	js::SetObjectElement(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::Handle<JS::Value>, bool, JS::Handle<JSScript*>, unsigned char*) 	js/src/vm/Interpreter.cpp
6 	xul.dll 	PRMJ_Now() 	js/src/vm/Time.cpp
7 	xul.dll 	js::jit::CanEnter(JSContext*, js::RunState&) 	js/src/jit/Ion.cpp
8 	xul.dll 	js::RunScript(JSContext*, js::RunState&) 	js/src/vm/Interpreter.cpp
9 	xul.dll 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
10 	xul.dll 	js::fun_apply(JSContext*, unsigned int, JS::Value*) 	js/src/jsfun.cpp
11 	xul.dll 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
12 	xul.dll 	js::DirectProxyHandler::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) 	js/src/proxy/DirectProxyHandler.cpp
13 	xul.dll 	js::CrossCompartmentWrapper::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) 	js/src/proxy/CrossCompartmentWrapper.cpp
14 	xul.dll 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
15 	xul.dll 	js::Invoke(JSContext*, JS::Value const&, JS::Value const&, unsigned int, JS::Value const*, JS::MutableHandle<JS::Value>) 	js/src/vm/Interpreter.cpp
16 	xul.dll 	js::jit::DoCallFallback 	js/src/jit/BaselineIC.cpp
17 		@0x14b94300 	
18 		@0x30a74337 	
19 		@0x14b90967 	
20 	xul.dll 	EnterBaseline 	js/src/jit/BaselineJIT.cpp
21 	xul.dll 	js::jit::EnterBaselineMethod(JSContext*, js::RunState&) 	js/src/jit/BaselineJIT.cpp
22 	xul.dll 	js::RunScript(JSContext*, js::RunState&) 	js/src/vm/Interpreter.cpp
23 	xul.dll 	js::Invoke(JSContext*, JS::CallArgs, js::MaybeConstruct) 	js/src/vm/Interpreter.cpp
24 	xul.dll 	JS_CallFunctionValue(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) 	js/src/jsapi.cpp
25 	xul.dll 	nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) 	js/xpconnect/src/XPCWrappedJSClass.cpp
26 	xul.dll 	nsXPCWrappedJS::CallMethod(unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) 	js/xpconnect/src/XPCWrappedJS.cpp
27 	xul.dll 	PrepareAndDispatch 	xpcom/reflect/xptcall/md/win32/xptcstubs.cpp
28 	xul.dll 	SharedStub 	xpcom/reflect/xptcall/md/win32/xptcstubs.cpp
29 	nss3.dll 	PR_Assert 	nsprpub/pr/src/io/prlog.c
30 	nss3.dll 	md_UnlockAndPostNotifies 	nsprpub/pr/src/md/windows/w95cv.c
31 	user32.dll 	GetInputBits
As a stopgap measure, I'll try to make sending throw (and add Telemetry), and we'll see how that goes.
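To illustrate the stopgap, here is a minimal sketch of the difference between the current infallible path (moz_xmalloc aborts the process on failure, producing the NS_ABORT_OOM signature in the stacks above) and a fallible copy whose error the caller can surface as a JS exception. The function name, the Result enum, and the out-parameter shape are hypothetical illustrations, not the actual Gecko API:

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical result codes standing in for nsresult.
enum Result { Ok, ErrOutOfMemory };

// Fallible copy of the serialized message data. Unlike moz_xmalloc,
// which aborts the whole process when allocation fails, plain malloc
// returns nullptr, and we propagate that to the caller so that
// sendAsyncMessage can throw instead of crashing.
Result CopyMessageData(const char* aData, size_t aLen, char** aOut) {
  char* buf = static_cast<char*>(malloc(aLen));
  if (!buf) {
    return ErrOutOfMemory;  // caller turns this into a JS exception
  }
  memcpy(buf, aData, aLen);
  *aOut = buf;
  return Ok;
}
```

The trade-off discussed in this bug is visible here: the process survives, but the message is dropped, which for Session Restore means data loss rather than a crash.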
Depends on: 1213437
Crash Signature: , nsIPrincipal*) ] [@ OOM | large | NS_ABORT_OOM | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase ] → , nsIPrincipal*) ] [@ OOM | large | NS_ABORT_OOM | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase ] [@ mozalloc_abort | NS_DebugBreak | nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase]
Hello!

Just another data point in case it is useful...

Thanks!

bp-9a3bff39-5edd-47ca-9af8-e0e892151012
	12/10/2015	03:09 p.m.

Crashing Thread
Frame 	Module 	Signature 	Source
0 	xul.dll 	NS_ABORT_OOM(unsigned int) 	xpcom/base/nsDebugImpl.cpp
1 	xul.dll 	nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
2 	mozglue.dll 	moz_xmalloc 	memory/mozalloc/mozalloc.cpp
3 	xul.dll 	nsFrameMessageManager::DispatchAsyncMessageInternal(JSContext*, nsAString_internal const&, mozilla::dom::StructuredCloneData const&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
4 	xul.dll 	nsFrameMessageManager::DispatchAsyncMessage(nsAString_internal const&, JS::Value const&, JS::Value const&, nsIPrincipal*, JSContext*, unsigned char) 	dom/base/nsFrameMessageManager.cpp
5 	xul.dll 	mozilla::jsipc::IdToObjectMap::trace(JSTracer*) 	js/ipc/JavaScriptShared.cpp
6 	xul.dll 	js::InitPropertyOperation(JSContext*, JSOp, JS::Handle<JSObject*>, JS::Handle<jsid>, JS::Handle<JS::Value>) 	js/src/vm/Interpreter-inl.h
Looks like this bug won't be fixed during the 42 beta cycle. However, we still have time for 43 and we would be happy to take a patch during this cycle.
For 43 we could try either bug 1212244 or bug 1213437.
Though both of them probably just move the OOM somewhere else, or make the browser behave inconsistently (when sending large messages) when we're close to OOM.
Bug 1213437 seems to rely on a change that is only in 44, so bug 1212244 might be easier and safer to port to 43.
So, bug 1212244 should be safe with respect to OOM but will cause Session Restore data loss instead.
On the other hand, bug 1213437 (which is generally a Good Thing) will probably only move the OOM in this case, until we have bug 1214346... which will also turn the OOM into Session Restore data loss.

I will try to work on bug 1214346 ASAP, but in the meantime I'm not sure we should uplift anything. Executive decision needed.
Actually, the real solution is probably bug 1216250.
See Also: → 511135
Work is ongoing in bug 1216250 for now; wontfixing this for 43.
Hello!

Is this useful?

bp-95ce8741-a64e-4057-ab99-bf9ee2151204
	04/12/2015	12:35 p.m.

Crashing Thread (0)
Frame 	Module 	Signature 	Source
0 	xul.dll 	NS_ABORT_OOM(unsigned int) 	xpcom/base/nsDebugImpl.cpp
1 	xul.dll 	nsSameProcessAsyncMessageBase::nsSameProcessAsyncMessageBase(JSContext*, nsAString_internal const&, mozilla::dom::ipc::StructuredCloneData&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
2 	mozglue.dll 	moz_xmalloc 	memory/mozalloc/mozalloc.cpp
3 	xul.dll 	nsFrameMessageManager::DispatchAsyncMessageInternal(JSContext*, nsAString_internal const&, mozilla::dom::ipc::StructuredCloneData&, JS::Handle<JSObject*>, nsIPrincipal*) 	dom/base/nsFrameMessageManager.cpp
4 	xul.dll 	xul.dll@0x1b83107

Thanks!
(In reply to alex_mayorga from comment #91)
> Hello!
> 
> Is this useful?
> 
> bp-95ce8741-a64e-4057-ab99-bf9ee2151204
> 	04/12/2015	12:35 p.m.

This is with Firefox Beta 43 which will not be getting any further fixes. It'd be useful to see if you can replicate this in Firefox 44 since that's where bug 1216250 landed.
Looks like it is gone in 44 thanks to bug 1216250!
Congrats David! Thank you very much!
Status: NEW → RESOLVED
Closed: 4 years ago
Keywords: leave-open
Resolution: --- → FIXED