Closed Bug 799727 Opened 12 years ago Closed 6 years ago

High memory usage on syncstorage gunicorn processes

Categories: Cloud Services Graveyard :: Server: Sync, defect, P4
Platform: x86_64 Linux
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: Reporter: gene; Assigned: rfkelly
Whiteboard: [qa+]

In production, we see high levels of memory utilization by gunicorn processes on the syncstorage systems. We currently spin up processes with the assumption that each one will use 1GB of memory. What is causing this and is it possible to reduce the memory requirements of the app?

I'll add more data as I gather it.

Here is some raw info about current memory utilization to give a flavor of what's going on:

Gunicorn RSS as of 20:50 on 2012-10-08
Host,Mem (MB),,Start Time,CPU Time DD-HH:MM:SS,pid,proc,RSS (KB),RSS (MB),% of total,# cores,# procs,% cores used
sync1.web.scl2.svc.mozilla.com:  19:05:38 up 35 days,,,,,,,,,,,,
sync1.web.scl2.svc.mozilla.com: Mem: 16078,16078,,,,,,,,,8,,
sync1.web.scl2.svc.mozilla.com,,,Sep17,00:02:41,7750,gunicorn,11684,11.41,0.07%,,5,62.50%
sync1.web.scl2.svc.mozilla.com,,,Sep17,00:13:29,7753,gunicorn,346736,338.61,2.11%,,,
sync1.web.scl2.svc.mozilla.com,,,Sep17,00:17:13,7754,gunicorn,346992,338.86,2.11%,,,
sync1.web.scl2.svc.mozilla.com,,,Sep17,00:22:31,7752,gunicorn,346808,338.68,2.11%,,,
sync1.web.scl2.svc.mozilla.com,,,Sep17,00:26:31,7751,gunicorn,346744,338.62,2.11%,,,
Total,,,,,,,,1366.18,8.50%,,,
,,,,,,,,,,,,
sync2.web.scl2.svc.mozilla.com:  19:05:38 up 35 days,,,,,,,,,,,,
sync2.web.scl2.svc.mozilla.com: Mem: 16079,16079,,,,,,,,,8,,
sync2.web.scl2.svc.mozilla.com,,,Oct02,2-17:10:23,12615,gunicorn,2067872,2019.41,12.56%,,5,62.50%
sync2.web.scl2.svc.mozilla.com,,,Oct05,1-12:06:29,18586,gunicorn,1319716,1288.79,8.02%,,,
sync2.web.scl2.svc.mozilla.com,,,Sep17,00:01:05,31544,gunicorn,11024,10.77,0.07%,,,
sync2.web.scl2.svc.mozilla.com,,,Sep30,3-10:30:44,18949,gunicorn,2475680,2417.66,15.04%,,,
sync2.web.scl2.svc.mozilla.com,,,Sep30,3-13:35:16,12940,gunicorn,2442920,2385.66,14.84%,,,
Total,,,,,,,,8122.28,50.51%,,,
,,,,,,,,,,,,
sync3.web.scl2.svc.mozilla.com:  19:05:38 up 134 days,,,,,,,,,,,,
sync3.web.scl2.svc.mozilla.com: Mem: 16079,16079,,,,,,,,,8,,
sync3.web.scl2.svc.mozilla.com,,,07:04,00:19:00,7629,gunicorn,328608,320.91,2.00%,,5,62.50%
sync3.web.scl2.svc.mozilla.com,,,Oct01,2-21:25:12,11639,gunicorn,2042892,1995.01,12.41%,,,
sync3.web.scl2.svc.mozilla.com,,,Oct04,1-17:39:05,27165,gunicorn,1171692,1144.23,7.12%,,,
sync3.web.scl2.svc.mozilla.com,,,Sep13,00:00:49,3251,gunicorn,10368,10.13,0.06%,,,
sync3.web.scl2.svc.mozilla.com,,,Sep30,3-06:04:44,29149,gunicorn,2115116,2065.54,12.85%,,,
Total,,,,,,,,5535.82,34.43%,,,
,,,,,,,,,,,,
sync4.web.scl2.svc.mozilla.com:  19:05:38 up 133 days,,,,,,,,,,,,
sync4.web.scl2.svc.mozilla.com: Mem: 16079,16079,,,,,,,,,8,,
sync4.web.scl2.svc.mozilla.com,,,Oct01,3-04:59:44,26151,gunicorn,2161816,2111.15,13.13%,,5,62.50%
sync4.web.scl2.svc.mozilla.com,,,Oct04,1-16:55:12,7458,gunicorn,1400264,1367.45,8.50%,,,
sync4.web.scl2.svc.mozilla.com,,,Oct06,22:54:27,1948,gunicorn,1338804,1307.43,8.13%,,,
sync4.web.scl2.svc.mozilla.com,,,Sep13,00:01:12,9471,gunicorn,10348,10.11,0.06%,,,
sync4.web.scl2.svc.mozilla.com,,,Sep30,3-08:02:37,22188,gunicorn,2427460,2370.57,14.74%,,,
Total,,,,,,,,7166.69,44.57%,,,
,,,,,,,,,,,,
sync5.web.scl2.svc.mozilla.com:  19:05:38 up 7 days,,,,,,,,,,,,
sync5.web.scl2.svc.mozilla.com: Mem: 32238,,,,,,,,,,,,
,,,,,,,,,,,,
sync6.web.scl2.svc.mozilla.com:  19:05:38 up 574 days,,,,,,,,,,,,
sync6.web.scl2.svc.mozilla.com: Mem: 24159,24159,,,,,,,,,8,,
sync6.web.scl2.svc.mozilla.com,,,Oct01,3-04:17:00,1301,gunicorn,1409080,1376.05,5.70%,,5,62.50%
sync6.web.scl2.svc.mozilla.com,,,Oct01,3-05:09:07,4302,gunicorn,1820152,1777.49,7.36%,,,
sync6.web.scl2.svc.mozilla.com,,,Oct08,07:29:54,8208,gunicorn,591708,577.84,2.39%,,,
sync6.web.scl2.svc.mozilla.com,,,Sep13,00:01:13,26340,gunicorn,11460,11.19,0.05%,,,
sync6.web.scl2.svc.mozilla.com,,,Sep30,3-10:50:41,22355,gunicorn,2463820,2406.07,9.96%,,,
Total,,,,,,,,6148.65,25.45%,,,
,,,,,,,,,,,,
sync7.web.scl2.svc.mozilla.com:  19:05:38 up 574 days,,,,,,,,,,,,
sync7.web.scl2.svc.mozilla.com: Mem: 32239,32239,,,,,,,,,8,,
sync7.web.scl2.svc.mozilla.com,,,Oct01,3-04:53:46,10909,gunicorn,2411760,2355.23,7.31%,,5,62.50%
sync7.web.scl2.svc.mozilla.com,,,Oct04,1-22:36:36,30214,gunicorn,2207924,2156.18,6.69%,,,
sync7.web.scl2.svc.mozilla.com,,,Oct04,1-23:32:36,30474,gunicorn,1434128,1400.52,4.34%,,,
sync7.web.scl2.svc.mozilla.com,,,Sep13,00:01:16,13037,gunicorn,11452,11.18,0.03%,,,
sync7.web.scl2.svc.mozilla.com,,,Sep30,3-09:50:30,19459,gunicorn,2416676,2360.04,7.32%,,,
Total,,,,,,,,8283.14,25.69%,,,
,,,,,,,,,,,,
sync8.web.scl2.svc.mozilla.com:  19:05:38 up 574 days,,,,,,,,,,,,
sync8.web.scl2.svc.mozilla.com: Mem: 32239,32239,,,,,,,,,8,,
sync8.web.scl2.svc.mozilla.com,,,Oct04,1-18:46:15,32316,gunicorn,1360568,1328.68,4.12%,,5,62.50%
sync8.web.scl2.svc.mozilla.com,,,Oct04,1-21:38:06,29375,gunicorn,1396188,1363.46,4.23%,,,
sync8.web.scl2.svc.mozilla.com,,,Oct06,1-03:34:27,8662,gunicorn,1303572,1273.02,3.95%,,,
sync8.web.scl2.svc.mozilla.com,,,Oct07,18:37:01,6402,gunicorn,737060,719.79,2.23%,,,
sync8.web.scl2.svc.mozilla.com,,,Sep13,00:01:15,8298,gunicorn,11476,11.21,0.03%,,,
Total,,,,,,,,4696.16,14.57%,,,
,,,,,,,,,,,,
sync1.web.phx1.svc.mozilla.com:  19:00:34 up 116 days,,,,,,,,,,,,
sync1.web.phx1.svc.mozilla.com:  Mem: 24022,24022,,,,,,,,,8,,
sync1.web.phx1.svc.mozilla.com,,,Sep13,00:05:06,21118,gunicorn,11728,11.45,0.05%,,9,112.50%
sync1.web.phx1.svc.mozilla.com,,,Sep13,00:10:05,21134,gunicorn,91364,89.22,0.37%,,,
sync1.web.phx1.svc.mozilla.com,,,Sep13,00:10:29,21128,gunicorn,91916,89.76,0.37%,,,
sync1.web.phx1.svc.mozilla.com,,,Sep13,00:11:31,21133,gunicorn,92460,90.29,0.38%,,,
sync1.web.phx1.svc.mozilla.com,,,Sep13,00:12:53,21129,gunicorn,91792,89.64,0.37%,,,
sync1.web.phx1.svc.mozilla.com,,,Sep13,00:14:31,21131,gunicorn,92784,90.61,0.38%,,,
sync1.web.phx1.svc.mozilla.com,,,Sep13,00:16:16,21132,gunicorn,92108,89.95,0.37%,,,
sync1.web.phx1.svc.mozilla.com,,,Sep13,00:18:59,21127,gunicorn,93076,90.89,0.38%,,,
sync1.web.phx1.svc.mozilla.com,,,Sep13,00:21:53,21130,gunicorn,92432,90.27,0.38%,,,
Total,,,,,,,,732.09,3.05%,,,
,,,,,,,,,,,,
sync2.web.phx1.svc.mozilla.com:  19:00:34 up 116 days,,,,,,,,,,,,
sync2.web.phx1.svc.mozilla.com:  Mem: 24022,24022,,,,,,,,,8,,
sync2.web.phx1.svc.mozilla.com,,,Sep13,00:05:01,16356,gunicorn,11732,11.46,0.05%,,9,112.50%
sync2.web.phx1.svc.mozilla.com,,,Sep13,00:09:50,16359,gunicorn,92008,89.85,0.37%,,,
sync2.web.phx1.svc.mozilla.com,,,Sep13,00:10:17,16362,gunicorn,92244,90.08,0.37%,,,
sync2.web.phx1.svc.mozilla.com,,,Sep13,00:11:15,16358,gunicorn,92072,89.91,0.37%,,,
sync2.web.phx1.svc.mozilla.com,,,Sep13,00:12:32,16363,gunicorn,92316,90.15,0.38%,,,
sync2.web.phx1.svc.mozilla.com,,,Sep13,00:14:04,16357,gunicorn,92204,90.04,0.37%,,,
sync2.web.phx1.svc.mozilla.com,,,Sep13,00:15:40,16360,gunicorn,92168,90.01,0.37%,,,
sync2.web.phx1.svc.mozilla.com,,,Sep13,00:18:04,16364,gunicorn,92644,90.47,0.38%,,,
sync2.web.phx1.svc.mozilla.com,,,Sep13,00:20:58,16361,gunicorn,92420,90.25,0.38%,,,
Total,,,,,,,,732.23,3.05%,,,
,,,,,,,,,,,,
sync3.web.phx1.svc.mozilla.com:  19:00:34 up 116 days,,,,,,,,,,,,
sync3.web.phx1.svc.mozilla.com:  Mem: 24022,24022,,,,,,,,,8,,
sync3.web.phx1.svc.mozilla.com,,,Sep13,00:03:31,13242,gunicorn,11736,11.46,0.05%,,9,112.50%
sync3.web.phx1.svc.mozilla.com,,,Sep13,00:08:53,13246,gunicorn,92020,89.86,0.37%,,,
sync3.web.phx1.svc.mozilla.com,,,Sep13,00:09:20,13244,gunicorn,92240,90.08,0.37%,,,
sync3.web.phx1.svc.mozilla.com,,,Sep13,00:10:18,13247,gunicorn,92096,89.94,0.37%,,,
sync3.web.phx1.svc.mozilla.com,,,Sep13,00:11:37,13248,gunicorn,92060,89.90,0.37%,,,
sync3.web.phx1.svc.mozilla.com,,,Sep13,00:13:22,13249,gunicorn,92144,89.98,0.37%,,,
sync3.web.phx1.svc.mozilla.com,,,Sep13,00:15:22,13245,gunicorn,92516,90.35,0.38%,,,
sync3.web.phx1.svc.mozilla.com,,,Sep13,00:17:08,13243,gunicorn,92212,90.05,0.37%,,,
sync3.web.phx1.svc.mozilla.com,,,Sep13,00:20:14,13250,gunicorn,92332,90.17,0.38%,,,
Total,,,,,,,,731.79,3.05%,,,
,,,,,,,,,,,,
sync4.web.phx1.svc.mozilla.com:  19:00:34 up 116 days,,,,,,,,,,,,
sync4.web.phx1.svc.mozilla.com:  Mem: 24022,24022,,,,,,,,,8,,
sync4.web.phx1.svc.mozilla.com,,,Sep13,00:04:01,30955,gunicorn,11724,11.45,0.05%,,9,112.50%
sync4.web.phx1.svc.mozilla.com,,,Sep13,00:09:16,30961,gunicorn,92268,90.11,0.38%,,,
sync4.web.phx1.svc.mozilla.com,,,Sep13,00:09:38,30956,gunicorn,92136,89.98,0.37%,,,
sync4.web.phx1.svc.mozilla.com,,,Sep13,00:10:32,30960,gunicorn,92300,90.14,0.38%,,,
sync4.web.phx1.svc.mozilla.com,,,Sep13,00:11:52,30963,gunicorn,92056,89.90,0.37%,,,
sync4.web.phx1.svc.mozilla.com,,,Sep13,00:13:47,30962,gunicorn,92772,90.60,0.38%,,,
sync4.web.phx1.svc.mozilla.com,,,Sep13,00:15:35,30957,gunicorn,92196,90.04,0.37%,,,
sync4.web.phx1.svc.mozilla.com,,,Sep13,00:17:28,30958,gunicorn,92308,90.14,0.38%,,,
sync4.web.phx1.svc.mozilla.com,,,Sep13,00:20:31,30959,gunicorn,92224,90.06,0.37%,,,
Total,,,,,,,,732.41,3.05%,,,
,,,,,,,,,,,,
sync5.web.phx1.svc.mozilla.com:  19:00:34 up 161 days,,,,,,,,,,,,
sync5.web.phx1.svc.mozilla.com:  Mem: 48267,48267,,,,,,,,,24,,
sync5.web.phx1.svc.mozilla.com,,,02:35,03:35:32,12180,gunicorn,268668,262.37,0.54%,,9,37.50%
sync5.web.phx1.svc.mozilla.com,,,Oct05,2-07:58:45,24168,gunicorn,1101596,1075.78,2.23%,,,
sync5.web.phx1.svc.mozilla.com,,,Oct05,2-09:59:54,30835,gunicorn,1438048,1404.34,2.91%,,,
sync5.web.phx1.svc.mozilla.com,,,Oct05,2-10:27:49,26549,gunicorn,1036668,1012.37,2.10%,,,
sync5.web.phx1.svc.mozilla.com,,,Oct05,2-11:23:49,13509,gunicorn,1740004,1699.22,3.52%,,,
sync5.web.phx1.svc.mozilla.com,,,Oct06,1-16:06:58,17125,gunicorn,820376,801.15,1.66%,,,
sync5.web.phx1.svc.mozilla.com,,,Oct08,16:50:24,18735,gunicorn,1071680,1046.56,2.17%,,,
sync5.web.phx1.svc.mozilla.com,,,Oct08,16:56:48,22295,gunicorn,794312,775.70,1.61%,,,
sync5.web.phx1.svc.mozilla.com,,,Sep13,00:07:37,2532,gunicorn,11732,11.46,0.02%,,,
Total,,,,,,,,8088.95,16.76%,,,
,,,,,,,,,,,,
sync6.web.phx1.svc.mozilla.com:  19:00:34 up 161 days,,,,,,,,,,,,
sync6.web.phx1.svc.mozilla.com:  Mem: 48267,48267,,,,,,,,,24,,
sync6.web.phx1.svc.mozilla.com,,,Oct03,3-16:11:37,13788,gunicorn,1439028,1405.30,2.91%,,9,37.50%
sync6.web.phx1.svc.mozilla.com,,,Oct04,3-04:55:06,21673,gunicorn,1647932,1609.31,3.33%,,,
sync6.web.phx1.svc.mozilla.com,,,Oct06,2-01:48:28,489,gunicorn,1102988,1077.14,2.23%,,,
sync6.web.phx1.svc.mozilla.com,,,Oct08,13:05:41,31848,gunicorn,758660,740.88,1.53%,,,
sync6.web.phx1.svc.mozilla.com,,,Oct08,15:57:05,21336,gunicorn,1078356,1053.08,2.18%,,,
sync6.web.phx1.svc.mozilla.com,,,Oct08,20:14:04,8956,gunicorn,955124,932.74,1.93%,,,
sync6.web.phx1.svc.mozilla.com,,,Sep13,00:07:58,22643,gunicorn,11720,11.45,0.02%,,,
sync6.web.phx1.svc.mozilla.com,,,Sep21,10-10:16:56,31149,gunicorn,1548868,1512.57,3.13%,,,
sync6.web.phx1.svc.mozilla.com,,,Sep25,8-16:09:41,2004,gunicorn,1812724,1770.24,3.67%,,,
Total,,,,,,,,10112.70,20.95%,,,
,,,,,,,,,,,,
sync7.web.phx1.svc.mozilla.com:  19:00:34 up 161 days,,,,,,,,,,,,
sync7.web.phx1.svc.mozilla.com:  Mem: 48267,48267,,,,,,,,,24,,
sync7.web.phx1.svc.mozilla.com,,,06:21,00:57:12,2892,gunicorn,237860,232.29,0.48%,,9,37.50%
sync7.web.phx1.svc.mozilla.com,,,Oct01,5-02:47:00,28049,gunicorn,1481744,1447.02,3.00%,,,
sync7.web.phx1.svc.mozilla.com,,,Oct05,2-10:51:13,5682,gunicorn,1124140,1097.79,2.27%,,,
sync7.web.phx1.svc.mozilla.com,,,Oct05,2-11:06:13,31141,gunicorn,1072000,1046.88,2.17%,,,
sync7.web.phx1.svc.mozilla.com,,,Oct05,2-12:05:19,3680,gunicorn,1449248,1415.28,2.93%,,,
sync7.web.phx1.svc.mozilla.com,,,Oct05,2-12:14:41,12691,gunicorn,1499232,1464.09,3.03%,,,
sync7.web.phx1.svc.mozilla.com,,,Oct07,22:53:37,29524,gunicorn,1093056,1067.44,2.21%,,,
sync7.web.phx1.svc.mozilla.com,,,Sep13,00:07:33,18047,gunicorn,11720,11.45,0.02%,,,
sync7.web.phx1.svc.mozilla.com,,,Sep20,11-08:09:04,11072,gunicorn,1487940,1453.07,3.01%,,,
Total,,,,,,,,9235.29,19.13%,,,
Assignee: nobody → rfkelly
Component: Firefox Sync: Backend → Server: Sync
I'm surprised to see some of these machines demonstrating high memory usage and some not.  In particular, these machines seem fine, with low memory usage and processes all having stayed up since they were last kicked:

  sync1.web.scl2.svc.mozilla.com

  sync1.web.phx1.svc.mozilla.com
  sync2.web.phx1.svc.mozilla.com
  sync3.web.phx1.svc.mozilla.com
  sync4.web.phx1.svc.mozilla.com

Do these machines differ enough from the others to provide any clues?  I vaguely recall :atoll mentioning memory problems on one RHEL platform but not another.
(In reply to Ryan Kelly [:rfkelly] from comment #1)
>   sync1.web.scl2.svc.mozilla.com

Oh, pencil suggests that this machine is getting 0 qps, which would explain why it's not showing the same memory use pattern as the others :-)

The others I don't know about.
rfkelly : correct, sync1-4 in PHX1 are configured to only come into play if the cluster is dying, otherwise they get no traffic.

sync1 in SCL2 has been dead for some time, which explains its data (I believe)
rfkelly : here's a summary of what we're seeing. Looks like it's highly variable how long it takes to reach the high memory utilization (some as short as 16 minutes to get to 1GB)

(03:37:26 PM) atoll: so when sync1..4.phx1 were having swap trouble, many weeks ago but this year since couchbase in may, i found that long-running processes had the 1GB plus ram usage
(03:37:46 PM) atoll: i deferred poking at it further until we deployed the new Sync code that bobm pushed a couple weeks ago
(03:38:03 PM) atoll: since analyzing memory issues in ancient stale code is not a very good use of time vs. analyzing it on new code
(03:38:34 PM) atoll: since ckolos reports we're still seeing issues, i *suspect* it's still "growth then plateau around 1.1GB", since it sounds like the new code appears not to have changed that profile

(03:40:19 PM) ckolos: so sync3.web.scl2
(03:40:23 PM) ckolos: pid 29149
(03:40:50 PM) ckolos: Virt is 2223m, rss is 2.0g, shared is 3116, stack (data) is 2.0gb

(03:41:50 PM) ckolos: other than sync1/5 all scl2 sync web heads have at least 1 gunicorn process taking more than 2gb of memory
(03:42:06 PM) ckolos: oop, damn you syn8
(03:42:28 PM) ckolos: okay sync8 doesn't have one over 2gb, but does have 3 over 1.2gb
(03:44:19 PM) ckolos: so... go fish.

(03:45:31 PM) atoll: any correlation between process age?
(03:46:00 PM) ckolos: likely some, but not definitively
(03:46:40 PM) ckolos: there are procs with 1 day of CPU time taking 1.2gb, while others with 3+ days are taking "only" 2.3gb
(03:47:01 PM) ckolos: so if so, it's not direct linear growth

(03:47:32 PM) ckolos: comparing phx and scl2 is even more frustrating
(03:47:51 PM) ckolos: where a proc with 2+ days of cpu time is only using 1.075 gb
(03:48:05 PM) ckolos: and another with 16 mins is using 1.046
(03:47:43 PM) atoll: yeah, i don't know why they're so variant yet :(
(03:47:54 PM) atoll: comparing sync5..7.phx1 to sync1..8.scl2 may help
(03:48:04 PM) atoll: and just ignore 1..4.phx1 since they're not in use most times
(03:48:14 PM) ckolos: this is on sync5.phx

(03:48:31 PM) atoll: maybe the initial memory burden for a worker is stable at 1GB after startup and a request or two

(03:48:55 PM) ckolos: possibly, but then that means that sync1-4 aren't used at *all*
(03:48:58 PM) atoll: correct
(03:49:03 PM) ckolos: b/c they're all around 90mb per proc
(03:49:26 PM) atoll: sync1..4.phx1 are set as "last resort" servers in the zeus pool, since if they're in active use they cause couchbase to swap out
(03:49:45 PM) atoll: once we have a couchbase hardware solution for scl2, it must also go to phx1
(03:49:54 PM) ckolos: really though, there's not enough running to come up with anything other than slightly-better-than-guesses

(03:50:12 PM) atoll: do the sync load tests show the same worker memory usage?
(03:50:22 PM) ckolos: unknown.

(03:50:26 PM) ckolos: where would that data be?
(03:50:36 PM) atoll: sync*.web.scl2.stage graphs and collection, if any
(03:50:55 PM) atoll: rfkelly is online and may be of further use here, in case he's ever observed memory usage previously

(03:56:54 PM) ckolos: none of the stage syncweb servers have gunicorn processes running that hot.
(03:57:13 PM) ckolos: highest use in stage is 94mb
(03:57:25 PM) ckolos: so I'm guessing no loadtests have been done in a while.
(03:57:37 PM) ckolos: most procs are dated aug 29
My first suspect here is the per-node-name connection pool and related data structures.  How many individual [host:nodename] sections do we have configured in the prod settings file?
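To make the suspicion concrete: if each configured [host:nodename] section ended up with its own SQLAlchemy engine and connection pool, per-worker memory would scale with the host count. A rough sketch of that pattern (hypothetical names and URI, not the actual syncstorage wiring):

  # Hypothetical sketch only -- the real syncstorage code may share a single
  # engine across hosts; that is exactly what needs checking.
  from sqlalchemy import create_engine

  SQLURI = "mysql+pymysql://user:pass@db.example.com/weave0"  # placeholder

  _engines = {}

  def get_engine_for_host(hostname):
      # Each engine carries its own connection pool plus dialect and
      # metadata caches; with ~1300 [host:...] sections that adds up.
      if hostname not in _engines:
          _engines[hostname] = create_engine(SQLURI, pool_size=2, pool_recycle=1200)
      return _engines[hostname]
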
I tried about an hour of light load against stage this afternoon, and monitored the memory usage of two gunicorn processes - one which was freshly restarted, and one that had been alive since 29 August.  RSS snapshots at 15-minute intervals:

        New Proc    Old Proc
  
  t=0     41232        75856
  t=15m   53580        75924
  t=30m   54244        75912
  t=45m   54836        75920
  t=60m   55604        75908


So the memory usage does seem to slowly climb to a peak value as requests come in, then stay relatively steady at that level.
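Snapshots like the above can be reproduced with a minimal stdlib sampler along these lines (the pids and interval are placeholders):

  # Poll VmRSS from /proc for a few gunicorn workers at a fixed interval.
  import time

  PIDS = [12345, 23456]       # placeholder worker pids to watch
  INTERVAL = 15 * 60          # 15 minutes, matching the snapshots above

  def rss_kb(pid):
      # Return VmRSS in kB, or None if the process has gone away.
      try:
          with open("/proc/%d/status" % pid) as f:
              for line in f:
                  if line.startswith("VmRSS:"):
                      return int(line.split()[1])
      except IOError:
          return None

  while True:
      stamp = time.strftime("%H:%M:%S")
      print(stamp + "  " + "  ".join("%d=%s" % (p, rss_kb(p)) for p in PIDS))
      time.sleep(INTERVAL)
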
Depends on: 799874
rfkelly, I think this is what you're asking for:
At PHX1 and SCL2 the production.ini file for syncstorage contains these sections:

[server:main]
use = egg:Paste#http
host = 0.0.0.0
port = 5000
use_threadpool = True
threadpool_workers = 60

[app:main]
use = egg:SyncStorage
configuration = file:/etc/sync/sync.conf
Sorry, I see what you mean now. In the sync.conf file we have:

In PHX1 we have 610 lines like this:
[host:phx-sync609.services.mozilla.com]

In SCL2 we have 1320 lines like this:
[host:scl2-sync1320.services.mozilla.com]
As a first step, I'd like to make a new release and push it to stage with the following changes:

  * memory-usage-dumping support from Bug 799874
  * update all our dependencies to the latest versions

In particular I want to update SQLAlchemy, which is a whole minor version behind the current release (0.6.6 vs 0.7.9); the newer release includes some known memory-usage improvements.

We can then throw some load at it and take periodic memory-usage dumps from one of the gunicorn worker processes.  I can then analyse these dumps offline to get an idea of where the memory is being spent.

Will we have the Ops bandwidth for a push to stage sometime in the next few days?

If not then I can run my own tests, but I think memory-usage data from stage under full load will be significantly more useful than what I can simulate locally.
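For reference, and independent of whatever format the Bug 799874 dumps actually use, here is a stdlib-only sketch of the kind of per-type object census that can be diffed between dumps:

  # Illustrative only -- not the Bug 799874 implementation.  Writes a count
  # and rough size of live objects per type so successive dumps can be diffed.
  import gc
  import sys
  from collections import defaultdict

  def dump_object_census(path):
      counts = defaultdict(int)
      sizes = defaultdict(int)
      for obj in gc.get_objects():
          name = type(obj).__name__
          counts[name] += 1
          sizes[name] += sys.getsizeof(obj)
      with open(path, "w") as f:
          for name in sorted(counts, key=counts.get, reverse=True):
              f.write("%s\t%d\t%d\n" % (name, counts[name], sizes[name]))
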
Submit Stage deploy ticket for Sync as per usual, and if Gene is blocked I can push it out.
Filed Bug 800254 for deploying gunicorn changes into stage.
Depends on: 800254
Gene, in the config file you grepped in Comment 8 there should be a [storage] section.  Can you please post (or email me if sensitive) the contents of that section, minus any passwords etc?  I want to check for anything that might explain why memory usage on stage seems to be much better controlled than in production.

Stage has 160 [host:blah] sections vs 1320 in production, but the difference in memory usage between the two doesn't seem to scale with that number.  Perhaps they have slightly different configurations in e.g. number of connections per pool.
Sure, here's that section. I've compared, and it's the same at scl2 and phx1.

[storage]
backend = syncstorage.storage.memcachedsql.MemcachedSQLStorage
sqluri = pymysql://USERNAMEGOESHERE:PASSWORDGOESHERE@sync1.db.scl2.svc.mozilla.com/weave0
standard_collections = true
use_quota = false
quota_size = 25600
pool_size = 2
pool_recycle = 1200
reset_on_return = true
batch_size = 100
cache_servers = localhost:11222
create_tables = false
display_config = false
hosts =
    scl2-sync1.services.mozilla.com
    scl2-sync2.services.mozilla.com
    scl2-sync3.services.mozilla.com
.
.
.
    scl2-sync1318.services.mozilla.com
    scl2-sync1319.services.mozilla.com
    scl2-sync1320.services.mozilla.com

shard = true
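
For scale, a quick sketch that counts the [host:...] sections in a config laid out like the above and shows the worst-case pooled-connection count if each host kept its own pool (the per-host-pool behaviour is the hypothesis being tested here, not an established fact):

  # Back-of-the-envelope only.
  import re

  POOL_SIZE = 2  # from the [storage] section above

  def count_host_sections(path="/etc/sync/sync.conf"):
      with open(path) as f:
          return sum(1 for line in f if re.match(r"\[host:", line))

  n = count_host_sections()
  print("%d host sections -> up to %d pooled connections per worker" % (n, n * POOL_SIZE))
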
Bug 802486 identifies a cache-clearing issue that likely contributes to the high memory usage.

This issue would result in an empty dict being kept in memory for each unique userid ever encountered by the server.  That's only on the order of ~300 bytes of memory per user, but we do serve a lot of users...

Probably not the whole story, but it's a solid start.
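For perspective on that figure, a quick interpreter check of an empty dict's size multiplied by a placeholder user count (the real per-node user count isn't in this bug):

  # Rough per-user cost check; the user count below is a placeholder, not a
  # real figure from this deployment.
  import sys

  per_user = sys.getsizeof({})   # ~280 bytes on the Python builds of this era
  users = 1000000
  print("%d bytes/user x %d users ~= %.0f MB per process" % (per_user, users, per_user * users / 1e6))
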
Depends on: 802486
I'm prepping a deployment to get the above-mentioned fix out into production - Bug 803389.  It will be interesting to see how much of a difference the tweaks so far have made.
Depends on: 803389
Bob, can you confirm whether this is still an issue for current sync?  If so then we should put it on our radar for sync+fxa deployment planning.
Flags: needinfo?(bobm)
Blocks: 907479
Whiteboard: [qa+]
Most gunicorn workers are in the 1GB-and-under range; however, there are a couple of outliers.

sync1.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
21851,654,951968,946532,1113304
21851,3228,1086948,1081628,1248400
21851,11828,634456,629020,795792
28912,21851,11756,7964,107360
21851,21852,1915644,1910268,2077040
21851,21853,940500,935168,1101940
21851,21857,1173020,1167580,1334352
21851,21858,1166896,1161464,1328236
21851,21859,945688,940260,1107032

sync2.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
8948,8863,1167276,1163088,1329732
2188,8948,11536,8076,109440
8948,8949,1252668,1248504,1415148
8948,8951,982400,978372,1145016
8948,8953,1080108,1075988,1242632
8948,8954,1125696,1121600,1288244
8948,8955,1421692,1417824,1584468
8948,19640,744772,740564,907208
8948,20869,919524,915292,1081936

sync3.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
2216,6760,11620,8076,109440
6760,6789,963724,964544,1131188
6760,6790,1057516,1052368,1219012
6760,6794,965408,960260,1126904
6760,6796,1085048,1079872,1246516
6760,6988,1113764,1108416,1275060
6760,17266,1104832,1099612,1266256
6760,21081,784004,778660,945304
6760,32528,1141740,1136392,1303036

sync4.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
5047,709,1037620,1033476,1200120
5047,3629,916064,912004,1078648
5047,4464,781464,779504,946148
5047,4879,1036016,1033880,1200524
2214,5047,11364,8080,109444
5047,5068,1582712,1611836,1782652
5047,11283,1591280,1587504,1754148
5047,22663,953084,949224,1115868
5047,30280,1410736,1406876,1573520

sync5.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
26597,8994,917196,911980,1078616
26597,9896,704416,699440,866076
26597,12619,924384,919592,1086228
26597,17183,920084,915012,1081648
26597,26389,958344,955308,1121944
14334,26597,11588,8080,109436
26597,26600,949552,946316,1112952
26597,26602,961044,955832,1122468
26597,32443,716948,711668,878304

sync6.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
22769,9956,918788,913600,1080236
22769,10205,946784,943588,1110224
22769,12584,949612,944340,1110976
22769,14819,647132,641972,808608
22769,15258,713376,712212,878848
22769,22177,701860,696588,863224
13874,22769,11568,8076,109432
22769,22778,948196,943052,1109688
22769,26315,702680,701496,868132

sync7.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
14816,11169,632620,627328,793964
14816,11370,901852,896564,1063200
14816,14094,704260,699096,865732
14816,14452,629164,623872,790508
13847,14816,11600,8080,109436
14816,16834,1057040,1051872,1218508
14816,22077,934376,929088,1095724
14816,31648,731460,726352,892988
14816,32555,612476,607184,773820

sync8.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
7219,473,13788,12008,132132
473,3442,631280,626672,804308
473,16617,817324,814148,991784
473,19755,644720,656248,833884
473,29223,356380,637560,815196

sync9.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
1632,1649,10152,7656,107052
1649,1974,518120,513300,680032
1649,2557,380420,375724,542456
1649,10650,494068,489600,656332
1649,30300,449664,445356,612088

sync10.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
9616,1379,506736,501860,668592
9423,9616,10844,7660,107056
9616,15202,504008,498784,665516
9616,17566,527428,522808,689540
9616,17675,553944,548804,715536

sync11.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
32495,2941,10796,7652,107048
2941,13538,531512,527492,694224
2941,13553,506328,503116,669848
2941,24705,536032,533092,699824
2941,28657,437740,433640,600372

sync12.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
1656,496,388496,383324,550056
1638,1656,10188,7656,107052
1656,2099,383816,379728,546460
1656,21088,494964,491292,658024
1656,24204,543384,548544,715276

sync13.web.phx1.svc.mozilla.com
ppid,pid,rss,size,vsize
17235,9771,524904,532088,698820
32656,17235,10904,7664,107060
17235,22606,484984,479836,646568
17235,26641,483060,477952,644684
17235,29777,480468,475364,642096
Flags: needinfo?(bobm)
From outlier on sync1.web:

Address           Kbytes     RSS   Dirty Mode   Mapping
0000000000400000       4       4       0 r-x--  python
0000000000600000       8       8       4 rw---  python
0000000001b62000    4276    4264    4264 rw---    [ anon ]
0000000001f8f000 1902352 1902272 1902272 rw---    [ anon ]
...
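
Output in that format typically comes from pmap -x <pid>. To pull out the dominant mappings programmatically, something like the following against /proc/<pid>/smaps would do (the pid below is a placeholder):

  # Rough sketch: list the largest mappings for a pid by RSS, parsing
  # /proc/<pid>/smaps (Linux).
  def largest_mappings(pid, top=5):
      entries = []
      current = None
      with open("/proc/%d/smaps" % pid) as f:
          for line in f:
              first = line[:1]
              if first.isdigit() or first in "abcdef":
                  # Mapping header: "addr-addr perms offset dev inode [name]"
                  parts = line.split(None, 5)
                  name = parts[5].strip() if len(parts) > 5 else "[anon]"
                  current = {"name": name, "rss_kb": 0}
                  entries.append(current)
              elif line.startswith("Rss:") and current is not None:
                  current["rss_kb"] = int(line.split()[1])
      return sorted(entries, key=lambda e: e["rss_kb"], reverse=True)[:top]

  for e in largest_mappings(12345):
      print("%10d kB  %s" % (e["rss_kb"], e["name"]))
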
All dependent bugs have been Resolved.
What is our status here?
Priority: -- → P1
We're currently watching out for this issue on the sync1.5 storage nodes, but I'm hopeful it won't be a problem in the one-box-per-node setup we're currently using.  So let's keep it open but not a blocker.
No longer blocks: 907479
Priority: P1 → P4
> 4 years ago
>
> We're currently watching out for this issue on the sync1.5 storage nodes, but I'm hopeful it won't be a problem in
> the one-box-per-node setup we're currently using.  So let's keep it open but not a blocker.

4 years later, I haven't heard any complaints about this, so I'm going to go ahead and close it out.  :bobm please feel free to open a new bug if there are similar concerns on the sync1.5 server boxes.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Product: Cloud Services → Cloud Services Graveyard