Bug 756549 (Closed) - Opened 12 years ago, Closed 12 years ago

Sync updates cause explosive memory usage leading to OOM

Categories

(Firefox :: Sync, defect)

Platform: x86_64
OS: All
Type: defect
Priority: Not set
Severity: major

Tracking

RESOLVED WORKSFORME
Tracking Status
firefox15 + fixed

People

(Reporter: gcp, Assigned: till)

References

(Blocks 1 open bug)

Details

(Keywords: regression, Whiteboard: [MemShrink][js:p0])

Situation: I came back after a workweek in London where I used Android Nightlies with Sync extensively, and had a desktop Firefox Beta on the laptop I used. 

On my desktop machine at home, which has Desktop Nightlies, Firefox started crashing every 2 minutes. When it didn't crash, I got a Sync error, even though status.mozilla.com claims Sync is working fine.

Looking at the Error Console, I see the following errors:
Timestamp: 18/05/2012 20:58:53
Error: out of memory
Source File: resource://services-sync/record.js
Line: 638
Timestamp: 18/05/2012 20:58:53
Error: [Exception... "NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS"  nsresult: "0x80570021 (NS_ERROR_XPC_JAVASCRIPT_ERROR_WITH_DETAILS)"  location: "JS frame :: resource:///modules/services-sync/resource.js :: Channel_onStopRequest :: line 528"  data: no]
Source File: resource:///modules/services-sync/resource.js
Line: 528

I did a test, looking at memory usage in Windows Task Manager, which was stable at 400MB. When I click the "Retry Sync" button (or whatever it is called when you get the sync failed message), memory usage immediately increases very rapidly, after which Firefox crashes due to OOM.
Asa also observed this:

12:37:22 < Asa> rnewman: I'm OOMing repeatedly and that was after a failed sync connection. I think it's sync killing Firefox. What can I do? (nightly, windows 7)

gps is attempting to repro.
Possible dupe of Bug 755641. This was reported on Reddit around the 16th.
I can repro on OS X using official Nightly builds.

From GC logs:

GC(T+626.7) Total Time: 21.4ms, Compartments Collected: 191, Total Compartments: 243, MMU (20ms): 0%, MMU (50ms): 57%, Reason: SET_NEW_DOCUMENT, Nonincremental Reason: GC mode, Allocated: 27MB, +Chunks: 0, -Chunks: 0
    Totals: Purge: 0.1ms, Mark: 11.3ms, Mark Roots: 1.9ms, Mark Other: 0.5ms, Sweep: 9.6ms, Sweep Compartments: 5.8ms, Sweep Object: 0.9ms, Sweep String: 0.1ms, Sweep Script: 0.2ms, Sweep Shape: 1.2ms, Discard Code: 0.3ms, Discard Analysis: 3.3ms, Sweep Types: 0.2ms, Clear Script Analysis: 0.2ms, Finalize End Callback: 0.7ms, End Callback: 0.3ms
GC(T+631.4) Total Time: 97.4ms, Compartments Collected: 252, Total Compartments: 305, MMU (20ms): 0%, MMU (50ms): 0%, Reason: SET_NEW_DOCUMENT, Nonincremental Reason: GC mode, Allocated: 32MB, +Chunks: 0, -Chunks: 0
    Totals: Purge: 0.1ms, Mark: 13.0ms, Mark Roots: 1.8ms, Mark Other: 0.5ms, Finalize Start Callback: 0.2ms, Sweep: 83.5ms, Sweep Compartments: 6.5ms, Sweep Object: 1.1ms, Sweep Script: 0.2ms, Sweep Shape: 1.4ms, Discard Code: 0.4ms, Discard Analysis: 3.9ms, Sweep Types: 0.3ms, Clear Script Analysis: 0.3ms, Finalize End Callback: 0.8ms, End Callback: 0.6ms
GC(T+647.9) Total Time: 27.0ms, Compartments Collected: 269, Total Compartments: 323, MMU (20ms): 0%, MMU (50ms): 45%, Reason: CC_WAITING, Nonincremental Reason: GC mode, Allocated: 34MB, +Chunks: 2, -Chunks: 0
    Totals: Mark: 14.9ms, Mark Roots: 1.6ms, Mark Other: 0.8ms, Finalize Start Callback: 0.1ms, Sweep: 11.0ms, Sweep Compartments: 6.3ms, Sweep Object: 1.3ms, Sweep String: 0.1ms, Sweep Script: 0.2ms, Sweep Shape: 1.5ms, Discard Code: 0.4ms, Discard Analysis: 3.6ms, Sweep Types: 0.3ms, Clear Script Analysis: 0.3ms, Finalize End Callback: 0.9ms, End Callback: 0.9ms

The number of collected compartments is rather alarming, I think. Perhaps we regressed in compartments per global?
OS: Windows 7 → All
On the Nightly from May 3, the number of compartments collected during Sync never goes above 62, and total memory reported by MemChaser stays in the ~250MB range. My MBP's fan kicks in, so the computer is working pretty hard.

On the Nightly from May 6, there are more compartments, but the memory profile remains about the same.

Using the Nightly from May 12, memory goes crazy.

I've also isolated the memory spike to the history engine. Did anything change in Places recently?
That number of compartments is totally "normal" with CPG and a tab or three open.
This is at a low point during sync. string-chars jumped to over 1GB at one point. This is with the May 12 Nightly.

1,047,332,184 B (100.0%) -- explicit
├────871,282,408 B (83.19%) -- js
│    ├──590,434,392 B (56.38%) -- compartment([System Principal])
│    │  ├──476,512,856 B (45.50%) ── string-chars [130]
│    │  ├───80,068,608 B (07.65%) -- gc-heap
│    │  │   ├──42,122,064 B (04.02%) -- objects
│    │  │   │  ├──35,929,072 B (03.43%) ── non-function [181]
│    │  │   │  └───6,192,992 B (00.59%) ── function [177]
│    │  │   ├──14,060,512 B (01.34%) ── strings [130]
│    │  │   ├──12,035,880 B (01.15%) -- arena
│    │  │   │  ├──10,829,072 B (01.03%) ── unused [181]
│    │  │   │  ├─────625,536 B (00.06%) ── headers [181]
│    │  │   │  └─────581,272 B (00.06%) ── padding [181]
│    │  │   ├───8,708,200 B (00.83%) -- shapes
│    │  │   │   ├──4,458,560 B (00.43%) ── tree [181]
│    │  │   │   ├──2,361,880 B (00.23%) ── dict [177]
│    │  │   │   └──1,887,760 B (00.18%) ── base [181]
│    │  │   ├───2,409,760 B (00.23%) ── scripts [177]
│    │  │   ├─────731,232 B (00.07%) ── type-objects [181]
│    │  │   └─────────960 B (00.00%) ── xml [6]
│    │  ├────9,409,344 B (00.90%) -- shapes-extra
│    │  │    ├──6,874,112 B (00.66%) ── compartment-tables [182]
│    │  │    ├──1,019,520 B (00.10%) ── tree-tables [117]
│    │  │    ├────968,448 B (00.09%) ── dict-tables [177]
│    │  │    └────547,264 B (00.05%) ── tree-shape-kids [177]
│    │  ├────9,289,568 B (00.89%) ── analysis-temporary [56]
│    │  ├────9,025,872 B (00.86%) -- objects
│    │  │    ├──5,462,912 B (00.52%) ── elements [134]
│    │  │    ├──3,535,552 B (00.34%) ── slots [181]
│    │  │    └─────27,408 B (00.00%) ── misc [181]
│    │  ├────5,340,096 B (00.51%) ── script-data [177]
│    │  ├──────598,800 B (00.06%) -- type-inference
│    │  │      ├──523,408 B (00.05%) ── script-main [125]
│    │  │      ├───49,888 B (00.00%) ── tables [13]
│    │  │      └───25,504 B (00.00%) ── object-main [12]
│    │  └──────189,248 B (00.02%) -- mjit
│    │         └──189,248 B (00.02%) ── data [5]
│    ├──218,189,824 B (20.83%) ── gc-heap-chunk-dirty-unused
│    ├───26,214,400 B (02.50%) ── gc-heap-chunk-clean-unused
│    ├────7,385,016 B (00.71%) ── xpconnect
│    ├────7,018,592 B (00.67%) -- runtime
│    │    ├──4,194,304 B (00.40%) ── stack-committed
│    │    ├──1,048,576 B (00.10%) ── atoms-table
│    │    ├────850,016 B (00.08%) ── mjit-code
│    │    ├────323,584 B (00.03%) ── temporary
│    │    ├────262,144 B (00.03%) ── gc-marker
│    │    ├────217,088 B (00.02%) ── runtime-object
│    │    ├─────54,952 B (00.01%) ── unused-code-memory
│    │    ├─────51,296 B (00.00%) ── contexts
│    │    ├─────12,536 B (00.00%) ── regexp-code
│    │    └──────4,096 B (00.00%) ── normal
I did not experience this problem until today. I had a failed sync warning bar and then I started crashing regularly. If this was in earlier builds it wasn't affecting me until today. Since about noon I've had over 20 crashes from it.
Bug 754771 would probably help a lot with debugging this, because then you could see which system compartment we're talking about here.
Whiteboard: [memshrink]
With patches from bug 754771 applied, I've isolated the modules causing the explosion:

1,329,651,304 B (100.0%) -- explicit
├──1,230,743,744 B (92.56%) -- js
│  ├────555,525,832 B (41.78%) -- compartment([System Principal], resource://services-sync/record.js)
│  │    ├──553,818,320 B (41.65%) ── string-chars
│  ├────554,112,096 B (41.67%) -- compartment([System Principal], resource://services-sync/resource.js)
│  │    ├──553,733,776 B (41.65%) ── string-chars

record.js contains the logic for dealing with Sync records. It takes raw data objects, does crypto, creates real record objects, then serializes them to JSON. It then passes them over a wall to resource.js for HTTP request foo. Incoming records work the opposite way.

In this process, there is lots of overhead. The crypto uses js-ctypes (WeaveCrypto) to talk to NSS. There are lots of temporary JS strings and objects being created. We plan to replace this with a single C++ function in the next major Sync release. For now, we have to live with it.

I'm surprised that resource.js has as many string-chars bytes as record.js. I would expect most of the memory allocation to be in record.js, with resource.js "inheriting" the subset/final product. Of course, I'm naive as to how the JS internals work.

Is there any way for me to debug what exactly is allocating these string-chars?
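
For reference, here is a rough sketch of the flow described above. This is illustrative only; the function names are simplified stand-ins rather than the actual record.js/resource.js API, and the encryption step is passed in as a parameter.

// record.js side: each raw item becomes a record object, gets encrypted
// (via WeaveCrypto/NSS in the real code), and is serialized to JSON.
// Every step allocates fresh temporary strings and objects.
function serializeOutgoing(items, encryptFn) {
  return items.map(function (item) {
    let cleartext = JSON.stringify(item);                       // temp string #1
    let ciphertext = encryptFn(cleartext);                      // temp string #2
    return JSON.stringify({id: item.id, payload: ciphertext});  // temp string #3
  });
}

// resource.js side: the serialized strings cross the module boundary
// (a separate compartment under CPG) and are joined into one request body.
function buildRequestBody(serializedRecords) {
  return "[" + serializedRecords.join(",") + "]";               // yet another large string
}

Incoming records go through the same steps in reverse, which matches the observation in the next comment that incoming data is where the explosion shows up.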
Interestingly, outgoing-only changes (i.e. the first sync upload) aren't too bad. The problem only appears when processing incoming data.

And history sync seems to be the bit that triggers it. It is unknown whether it is something specific to history sync or whether the problem is proportional to the number of processed records (history typically has the most).
> Is there any way for me to debug what exactly is allocating these string-chars?

Maybe Andrew has a suggestion?

My only suggestion is to look at how data is passed between record.js and resource.js.  The amount allocated is almost identical between the two compartments, which suggests to me that somehow the data is being passed from one compartment to the other.  Perhaps that narrows things down?
Whiteboard: [memshrink] → [MemShrink]
Sorry, I have no idea how JS engine string allocation works.  Can you use some sort of trace-malloc thing to get stack traces from where the allocations are happening?  Of course, if it is all just JS code it won't be very useful.
Regression is somewhere between 93302:835d5fab89dc and 93609:6fe7dd2f8f57, inclusive. That's May 7 to May 10, roughly.

That range includes a services-central merge, which was 95% refactoring. I would be pretty surprised if anything in the merge caused this. Then again, the JS engine is at least half magic, so who knows.

I should have a full bisect within 24 hours.
Severity: normal → major
That date range includes CPG, I believe.
Narrowed it down to between 93548:dd29535bac5f and 93609:6fe7dd2f8f57.

I'm going to take a break for a little. The house is still accepting bets on the commit.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #17)
> /me puts his chips on http://hg.mozilla.org/mozilla-central/rev/e6529138e338

That would have been my guess as well. However, the range has been reduced to:

93543:b8f2a0ff6945 and 93563:075804813c58.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #19)
> http://hg.mozilla.org/mozilla-central/rev/2f63829658ff is next on my betting
> list then.

We have a winner!

Before bug 723350, Sync allocates a lot of memory but nothing gets out of hand. After bug 723350, total compartment size jumps 2-3x to ~1GB range during the peak of history sync (on my test profile, anyway).

My casual observation is that GC seems to be running less frequently or less "completely" after bug 723350.

So, next steps. I'm not a release manager, but I'm pretty sure that one wouldn't like things in the current state. Is backing out bug 723350 a reasonable request?

On a somewhat related topic: with CPG, if I have a 100MB JS string passed between modules, does that string get copied into the new compartment, or does the JS engine "copy by reference"? How would this be reflected in the per-compartment memory reporting from bug 754771?
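
To make that question concrete, here is a minimal hypothetical example: two JSMs, which under CPG live in separate compartments. Whether the string data is copied or shared across the boundary is exactly what is being asked; the near-identical numbers in comment 10 suggest copying.

// producer.js (hypothetical module, lives in its own compartment under CPG)
var EXPORTED_SYMBOLS = ["makeBigString"];
function makeBigString() {
  // ~100MB of character data (50M two-byte jschars), allocated here.
  return new Array(50 * 1024 * 1024 + 1).join("a");
}

// consumer.js (another hypothetical module, i.e. another compartment)
// Components.utils.import("resource://example/producer.js");
// let s = makeBigString();
// If the engine copies string data when the value crosses the compartment
// boundary, the per-compartment reporters from bug 754771 would show
// ~100MB of string-chars in *both* compartments.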
Depends on: 723350
Before bug 723350 we had global and per-compartment malloc accounting. Now we have removed the global one and let the compartments do their own accounting.
On the API side, we only changed the updateMallocCounter logic.

I can think of two scenarios for why this happens:
1) updateMallocCounter gets called without a context.
updateMallocCounter has two arguments: bytes and an optional JSContext.
In the old version we updated the runtime counter and cx->compartment if available.
Now we only update the counter if cx && cx->compartment.

2) CPG increased the number of compartments and the trigger logic is off. 

Comment 7 shows that most memory is allocated in a single compartment, right? So back to 1).
Another thing I noticed is that we delete the mallocCounter for all compartments even if they are not GCed. This was not changed in bug 723350 but it might be part of the problem.
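
To spell out scenario 1, here is a simplified JS model of the accounting change described above. This is only an illustration of the before/after behaviour from the comment; the real code is C++ inside SpiderMonkey, and the trigger details are omitted.

// Old behaviour: the runtime-wide counter is always updated, and the
// per-compartment counter is updated when a context is available.
function updateMallocCounterOld(runtime, cx, bytes) {
  runtime.gcMallocBytes += bytes;
  if (cx && cx.compartment)
    cx.compartment.gcMallocBytes += bytes;
}

// New behaviour (after bug 723350): only per-compartment accounting remains.
function updateMallocCounterNew(runtime, cx, bytes) {
  if (cx && cx.compartment)
    cx.compartment.gcMallocBytes += bytes;
  // Scenario 1: with no context, nothing is counted at all, so the
  // malloc-pressure GC trigger never sees these allocations and the
  // heap can grow unchecked while Sync churns through strings.
}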
> Comment 7 shows that most memory is allocated in a single compartment right?

No, it just looks that way because with the about:memory used in comment 7, we merge all allocations from system compartments together.

Comment 10 has the system compartments split out, and we can see that the memory lives in two compartments.  Indeed, the value in the two compartments is almost identical, which suggests there may be some copying going on here.
I'll look at the scenarios gwagner outlined in comment 21 and will update as soon as I know more.
Assignee: nobody → tschneidereit+bmo
Seems like scenario 1 is the winner: starting the current Nightly with a fresh profile and a single about:blank tab causes over 2200 calls to updateMallocCounter without a context.

I'll next see if there's anything we can do about that. If there isn't, backing out http://hg.mozilla.org/mozilla-central/rev/2f63829658ff might be the best option, I'm afraid.
Is there any chance of getting a solid set of STRs, or maybe even a profile that exhibits the explosive memory usage described here?

I'm trying to get to the bottom of the problem, but I can't really be sure that what I'm doing will help without STRs, I'm afraid.
(In reply to Till Schneidereit [:till] from comment #25)
> Is there any chance of getting a solid set of STRs, or maybe even a profile
> that exhibits the explosive memory usage described here?
> 
> I'm trying to get to the bottom of the problem, but I can't really be sure
> that what I'm doing will help without STRs, I'm afraid.

gps should have STRs, but I speculate that you can hit the problem by syncing a large profile, perhaps choosing "Reset Sync > Replace the contents of this device" to ensure a large download.

QA should be able to provide a fat Sync account if you need one: tracy, tchung?
I've been using a version of our largest account this morning and have only hit this a couple of times. It's not reliably failing here. The account only has 1,000 or so bookmarks.
Just to make sure.. Could you go to about:memory and click "Minimize memory usage" and see if memory goes back to normal?
(In reply to Gregor Wagner [:gwagner] from comment #29)
> Just to make sure.. Could you go to about:memory and click "Minimize memory
> usage" and see if memory goes back to normal?

It wasn't leaking memory for me when I was testing this the other day.
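
(As an aside on the "Minimize memory usage" suggestion in comment 29: for scripted testing, something close to the button's effect can be had from a chrome context. This assumes the button simply fires the standard "memory-pressure" observer notification a few times, which I believe is what it does in current builds.)

// Rough scripted approximation of about:memory's "Minimize memory usage"
// button (assumption: the button sends the "memory-pressure" notification
// with "heap-minimize" a few times).
Components.utils.import("resource://gre/modules/Services.jsm");
for (let i = 0; i < 3; i++)
  Services.obs.notifyObservers(null, "memory-pressure", "heap-minimize");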
(In reply to Richard Newman [:rnewman] from comment #26)
> (In reply to Till Schneidereit [:till] from comment #25)
> > Is there any chance of getting a solid set of STRs, or maybe even a profile
> > that exhibits the explosive memory usage described here?
> > 
> > I'm trying to get to the bottom of the problem, but I can't really be sure
> > that what I'm doing will help without STRs, I'm afraid.

The problem is most pronounced with history sync. I am able to see it when uploading a large number of history records (like setting up Sync from your main profile, where you do all your surfing) or when pulling down synced history records to a fresh profile (a new sync). I did all the bisecting via downloads because I think the problem is slightly more pronounced there. But I could be wrong.

If you don't want to mess with your main profile, create a fresh profile, sync it with your existing Sync account, disconnect the device from Sync (preserving all data), then set up a new Sync account with a separate email (e.g. gps+bug756549@mozilla.com).

Let me see if I can code up a "stress test" that generates a bunch of "random" places data. I'm sure I will be reinventing the wheel here. If someone knows of existing code to save me the headache, please make a comment!
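
A rough sketch of what such a generator might look like (untested; it assumes PlacesUtils.asyncHistory / mozIAsyncHistory.updatePlaces is available in builds of this era, and the exact callback shape may differ):

// Hypothetical "random places data" generator for stress testing history sync.
Components.utils.import("resource://gre/modules/Services.jsm");
Components.utils.import("resource://gre/modules/PlacesUtils.jsm");

function generateFakeHistory(count) {
  let now = Date.now() * 1000; // PRTime is in microseconds
  let places = [];
  for (let i = 0; i < count; i++) {
    places.push({
      uri: Services.io.newURI("http://example.com/stress/" + i + "/" +
                              Math.random().toString(36).slice(2), null, null),
      title: "Stress test page " + i,
      visits: [{
        visitDate: now - i * 60 * 1000000, // one visit per minute, going back in time
        transitionType: Components.interfaces.nsINavHistoryService.TRANSITION_LINK
      }]
    });
  }
  PlacesUtils.asyncHistory.updatePlaces(places, {
    handleResult: function (placeInfo) {},
    handleError: function (resultCode, placeInfo) {
      dump("updatePlaces error: " + resultCode + "\n");
    }
  });
}

// e.g. generateFakeHistory(50000) in a chrome scratchpad, then sync.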
Adding brains for Comment 31.
Blocks: 723350
No longer depends on: 723350
Whiteboard: [MemShrink] → [MemShrink][js:p0]
Can someone re-test this now that bug 723350 has been backed out?
Oops, didn't mean to unset the tracking flag.
I can confirm that backing out bug 723350 fixed the regression: setting up a new profile to sync with my main profile causes RSS to go up to ~3.3GB without the backout, whereas it never gets beyond ~650MB with it applied.

Unfortunately, my attempts at alleviating the regression haven't yet yielded much progress. Nevertheless: with bug 723350 backed out, this should be fixed and closeable.
I can also confirm that backing out bug 723350 fixed things. So, resolving WORKSFORME.

From comment 35, it appears that others are able to reproduce and thus can test changes. If you would like Sync code to generate data for testing, please ping me on bug 723350.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Thanks Gregory - both for confirming the fix and for the Sync code offer. You're right, though: I can reproduce memory usage exploding when syncing. The only reason I don't experience OOMs seems to be that I have enough memory on a 64-bit system.
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox