Closed Bug 541873 Opened 15 years ago Closed 15 years ago

build capacity to process more crash reports.

Categories

(mozilla.org Graveyard :: Server Operations, task)

x86
All
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chofmann, Assigned: aravind)

Details

(Whiteboard: [crashkill-metrics] [waiting on cpus])

Attachments

(9 files)

the default setting for submitting crash reports has been shifted to 100% opt-in for firefox 3.6. this is up from a default of 10% in previous releases, so we should start to see increased volume in crash submissions. we still have server-side throttling to deal with load balancing, but overall we need to ramp up capacity. we tried to ramp up 3.6 processing from 15% to 50% of all submissions in the early days of the 3.6 release and the system started to lag during peak periods. looks like we have capacity to process about 13k-14k crash reports per hour, and beyond that the system starts to lag. We should try to figure out how we could double that capacity.
Whiteboard: [crashkill-metrics]
you can see where aravind throttled back to 25% on 3.6 yesterday afternoon. the system was probably on its way to recovery, but at 17k reports per hour it was laboring. aravind might have details on the size of the backlog and the delay in processing reports at that point.
The number of processors was at 5 Friday and is at 5 now. http://crash-stats.mozilla.com/status Did we launch more processors?
I didn't take a snapshot of how the system looked when I throttled it back, I vaguely remember the backlog was trending up and something like 20k reports were in the queue. We can add capacity to the existing system, but doing that would mean hardware (processing power) and storage purchases on our side. We weren't planning for that since all this processing was supposed to be taken over by the metrics team. I am copying Daniel on this thread, he may have more input on when that will be ready etc.
@ozten: nope.. we did not add any new processors, I just throttled the incoming crashes to 25%. The system just caught up on its own. Our sweet spot right now is probably somewhere between that 25 to 50 %.
(In reply to comment #6)
> @ozten: nope.. we did not add any new processors, I just throttled the incoming
> crashes to 25%. The system just caught up on its own.
>
> Our sweet spot right now is probably somewhere between that 25 to 50 %.

but the 25-50% server throttling will change quickly as 3.5.x users (client-throttled, with 90% of users opted out by default) move to 3.6.x (no client throttling). I suspect that we may need to drop below 15% server throttling pretty soon on 3.6 users if we don't ramp up capacity soon.
I still doubt that the client-side throttling has any effect.
playing with the 3.5.x crash-per-100-users numbers where client throttling is turned on, I can make them look roughly like an unthrottled beta by applying the idea that:
- one person out of 10 accepts the default opt-in,
- an additional 1.5 people override the default to opt in,
- 7.5 people keep the default opt-out.
See https://wiki.mozilla.org/CrashKill/Crashr#Release_3.5.6

that means for days where we have 51 million adu's on 3.5.6 we effectively have 1.9 million of those users in the pool of users that might be submitting crashes. from there we apply 15% server-side throttling to get roughly the number of crashes we see in the Socorro reporting.
I still had a window open with server status from yesterday morning. This is what things looked like when we were processing about 15k reports per hour.

Mood: Deathly
Server Time: 2010-01-24 09:39:21
Stats Created At: 2010-01-24 09:35:01.829668
Waiting Jobs: 9815
Processors Running: 5
Average Seconds to Process: 3.24642
Average Wait in Seconds: 1668.05
Recently Completed: 2010-01-24 09:35:01.449064
Oldest Job In Queue: 2010-01-24 08:33:21.29002

The "seconds to process" is still around 3 seconds, so maybe we could add more processors as suggested in comment 5 and not affect cpu load too much. I'm guessing that when we see that number rise we are starting to grind the machine into the ground. maybe another experiment to test this idea is in order.
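as a rough sanity check on that ceiling (my own back-of-envelope arithmetic, not anything from the Socorro code): with roughly 3 seconds per report and the 12 worker threads mentioned later in this bug, the theoretical limit lands very close to the observed 13k-14k per hour.

# back-of-envelope only: assumes each worker thread handles one job at a time
# and spends "Average Seconds to Process" per job, with no other overhead.
threads = 12                   # total worker threads across the processors (per a later comment)
seconds_per_report = 3.0       # roughly what the status page shows
hourly_capacity = threads * 3600 / seconds_per_report
print(round(hourly_capacity))  # ~14400 reports/hour, close to the observed 13k-14k ceiling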
we added about a million 3.6 adu's yesterday, and peak crash reporting increased by about 300 crashes per hour (14853 -> 15221). That might mean we are about a week or two away from being back in the peak-load danger zone if we continue to add the expected million users a day to the 3.6 user base.
https://wiki.mozilla.org/CrashKill/Crashr#3.6_RC1.2C_RC2.2C_Final shows the current ramp rate of adus and crash reports on 3.6
complete with an indication of the system maintenance outage. Also adds the relative volume of processed crash reports for 3.5.7. 3.6 should surpass 3.5.7 volume in a week or two even without any kind of major update push.
I went to look for some of the early firefox 3.6.3 reports as it was going out on the wire tonight and saw a backlog that is quite a bit larger than comment 10 from several weeks ago.

Mood: Deathly
Server Time: 2010-04-01 19:15:01
Stats Created At: 2010-04-01 19:10:01.530213
Waiting Jobs: 115616
Processors Running: 2
Average Seconds to Process: 2.54114
Average Wait in Seconds: 26837.5
Recently Completed: 2010-04-01 19:10:01.182602
Oldest Job In Queue: 2010-03-31 07:11:11.473104

A 7 hour backlog means that it will be many hours until we start processing any significant number of 3.6.3 reports. It also looks like the number of processors is down to 2. Would bumping that back up to 5 help? I think we had some discussion on this a few weeks ago but can't remember the outcome.
so where we were processing at an average hourly rate of 10k per hour back in the last snapshot on 1/28, we are now processing about 14k per hour, which is very close to the backlog-creating limit. That means we are spending more hours of the day creating backlogs and fewer hours of the day reducing those backlogs. Having 3 processors down might be causing this current 7 hour backlog, so we need to get that repaired and see what things look like. The planned system downtime to upgrade the server will also add to the backlog, which will compound the backlog problem a bit. Getting these backlogs generally lower and/or planning outages during no-backlog periods is something we might need to consider in the future.
in that chart in comment 15, it looks like we had 160 hours where we were over 14k reports per hour and 128 hours where we were under 14k reports per hour
Looking at it now I see 5 processors. I'm guessing the processors being down to 2 was caused by bug 556679. The wait time is starting to rise again so we'll have to keep an eye on it today.
one thing we could do to see these trends a bit more easily would be to expand http://crash-stats.mozilla.com/status to show a 3- and/or 7-day window. we would need 10-minute updates on the data (maybe 1/2-hour snapshots would be fine), but then we could see when things are out of bounds compared to the previous couple of days. I also think we should be plotting #submissions and #processed_reports somewhere to get a better understanding of the dynamics of the backlog.
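a minimal sketch of the kind of plot I have in mind, assuming we log the half-hour snapshots to a csv with hypothetical columns timestamp, submitted, processed (no such export exists today as far as I know):

import csv
from datetime import datetime

import matplotlib.pyplot as plt

times, submitted, processed = [], [], []
with open("status-snapshots.csv") as f:     # hypothetical snapshot log
    for row in csv.DictReader(f):
        times.append(datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S"))
        submitted.append(int(row["submitted"]))
        processed.append(int(row["processed"]))

plt.plot(times, submitted, label="#submissions")
plt.plot(times, processed, label="#processed_reports")
plt.legend()
plt.ylabel("reports per snapshot interval")
plt.title("incoming vs. processed, trailing 7 days")
plt.show()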
here is another way to look at this. we have a theoretical peak processing capability of 15,000 reports per hour**.

** assume 2.5-3 seconds per report, twenty such cycles per minute, and 60 minutes per hour.

Whenever we go above 15,000 incoming reports per hour we are creating and/or adding to the backlog. The charts in comment 15 and below show we were adding to the backlog for about 10 hours each day, and have the other 14 hours to recover. We might be using up about 70-80% of that recovery time under recent typical daily loads.

here is what things looked like on 3/31/10:

hour         incoming  beyond 15k  backlog reduction
2010033100      10709
2010033101      11338
2010033102      11876
2010033103      12411
2010033104      13917
2010033105      16291        1291
2010033106      17217        2217
2010033107      18175        3175
2010033108      19293        4293
2010033109      19190        4190
2010033110      19057        4057
2010033111      18787        3787
2010033112      18189        3189
2010033113      17083        2083
2010033114      15375         375
2010033115      13311                    -1689
2010033116      11421                    -3579
2010033117      11236                    -3764
2010033118      11932                    -3068
2010033119      13020                    -1980
2010033120      10909                    -4091
2010033121      10761                    -4239
2010033122      10526                    -4474
2010033123       9456
total          341480       28657       -26884

It took us past 10pm to wipe out the backlog created between 5am and 2pm.

As the load from the removal of client-side throttling in firefox 3.6.x takes effect and more users start using 3.6.x, we continue to add to the backlog, and we also eat away at the hours available for recovery and the spare capacity within each recovery hour.

There is also the case where downtime or bugs add to the backlog. The event on 4/1 added about 100,000 reports to the backlog when we could only get through processing 230,000 reports. It looks like it might take 3-4 days over the lower-volume weekend/holiday days to get rid of that backlog using spare evening cycles and fewer hours in overload.

We are at about 65 million 3.6.x users, and we could probably create a model and project when we might hit some wrap-around point where we aren't able to keep up with the backlog and would continue to add to it day over day, unless we add capacity or reduce the 15% processing rate we have as the current throttle setting.
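the bookkeeping behind that table, as a small sketch; the 15,000/hour ceiling is the assumption, and the hourly counts come from the daily crashdata csv files:

CAPACITY = 15_000   # assumed peak processing rate, reports/hour

def backlog_curve(hourly_incoming):
    """Running backlog for a day, given incoming-report counts per hour.
    Assumes we process at most CAPACITY reports in any hour and the
    backlog never goes below zero."""
    backlog, curve = 0, []
    for incoming in hourly_incoming:
        backlog = max(0, backlog + incoming - CAPACITY)
        curve.append(backlog)
    return curve

# With the 3/31 hourly volumes above, the backlog starts building at
# hour 05 and isn't fully worked off until the last hour of the day.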
so if you take those hourly volume numbers from 3/31 and apply a uniform 6% increase we get over the magic 360,000 reports per day, and we produce the wrap-around effect where we will continue to add to the backlog and never catch up.

hr   #reports   backlog  backlog_reduction
00      11352              -3648
01      12018              -2982
02      12589              -2411
03      13156              -1844
04      14752               -248
05      17268       2268
06      18250       3250
07      19266       4266
08      20451       5451
09      20341       5341
10      20200       5200
11      19914       4914
12      19280       4280
13      18108       3108
14      16298       1298
15      14110               -890
16      12106              -2894
17      11910              -3090
18      12648              -2352
19      13801              -1199
20      11564              -3436
21      11407              -3593
22      11158              -3842
23      10023              -4977
tl     361969      39128  -37160

for the last 7 days of march here is what daily volumes looked like:

332072  20100325-crashdata.csv
323476  20100326-crashdata.csv
308682  20100327-crashdata.csv
329739  20100328-crashdata.csv
347344  20100329-crashdata.csv
338689  20100330-crashdata.csv
341481  20100331-crashdata.csv

331640  average daily volume

this compares to an average daily volume in the last week of feb of 233,203, before the 3.6.x user numbers began to climb as the result of the major update offer.
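to double-check the 6% figure, a quick extension of the backlog_curve sketch above (same assumptions; march31_hours stands for the 24 hourly counts listed in the previous comment): scale each hour and see whether the day ends with reports still in the queue.

def wraps_around(hourly_incoming, growth):
    """True if, after scaling every hour by (1 + growth), the day ends with a
    leftover backlog that would roll into the next day."""
    scaled = [round(n * (1 + growth)) for n in hourly_incoming]
    return backlog_curve(scaled)[-1] > 0

# wraps_around(march31_hours, 0.06) comes out True for the 3/31 numbers above,
# while the unscaled day just barely clears its backlog in the final hour.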
aravind was able to look at the monitor to confirm some of these numbers this morning. we watched the incoming flow of reports as the backlog was beginning to build. between 7:05a and 8:05a the backlog increased by 1572 and the incoming flow of reports registered 276-289/min, or an hourly rate of 16,560-17,340. that helps to confirm that 15,000 reports per hour is the magic capacity number: we start adding to the backlog when incoming reports are over that number, and reducing the backlog is possible when we are under it.
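spelled out, the arithmetic behind that confirmation (the 15k/hour ceiling is still the working assumption):

# Observed intake between 7:05a and 8:05a, from the monitor.
incoming_per_min = (276, 289)
hourly = [r * 60 for r in incoming_per_min]       # 16,560 .. 17,340 reports/hour
expected_growth = [h - 15_000 for h in hourly]    # 1,560 .. 2,340 backlog growth per hour
# The observed backlog growth over that hour was 1572, right at the low end
# of this range, which fits a processing ceiling of roughly 15k/hour.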
Socorro caught up on its backlog at 2010-04-03 21:50:00 pdt
yeah, due to lower active daily users over good friday and the easter weekend we were operating at a lower incoming report rate and had more backlog-reduction ability. On Saturday we had the ability to reduce a backlog of 62k and only created a backlog of 11k. Looks like we were operating at 86% of capacity rather than the 95% of capacity that we saw on 3/31.

hour         incoming  backlog  backlog reduction
2010040300      10868             -4132
2010040301      11314             -3686
2010040302      11819             -3181
2010040303      12100             -2900
2010040304      13220             -1780
2010040305      14337              -663
2010040306      15662      662
2010040307      16875     1875
2010040308      17183     2183
2010040309      17424     2424
2010040310      17347     2347
2010040311      16658     1658
2010040312      15703      703
2010040313      15101      101
2010040314      13809             -1191
2010040315      12654             -2346
2010040316      11231             -3769
2010040317      10255             -4745
2010040318       9964             -5036
2010040319       9676             -5324
2010040320       8722             -6278
2010040321       9225             -5775
2010040322       9048             -5952
2010040323       9466             -5534
total          309661    11953   -62292

pct. of 360k: 0.86
I've been gathering incoming volume from the .csv files and taking snapshots of the backlog for the past few days. this chart might give the clearest picture, showing the backlog increasing when we hit that threshold of around 14-15k incoming reports; the backlog doesn't start to decline until we get back under that level. it also shows the volume increase as more people get on 3.6.x and we come out of the easter/passover weekend. it also shows the narrowing of the time we are at zero backlog. we didn't clear the backlog until after 1am last night, then started creating a backlog again around 5:30a.
Aravind, can we get some extra boxes online as discussed last week ASAP? I think deinspanjer was offering to help too.
Assignee: nobody → aravind
Component: Socorro → Server Operations
Product: Webtools → mozilla.org
QA Contact: socorro → mrz
Version: Trunk → other
I've turned off backlog monitoring, but updated the chart with daily incoming volume for the last few days. on monday we went over 130m adu's and that put us over the 20k incoming-report level for the last few days, with estimated backlogs of 40k and 38k.
Severity: normal → major
We ordered extra CPUs for the existing servers; with this change, we should be able to get close to double our current processing capacity. I don't have an exact date yet, but it should be soon.
Whiteboard: [crashkill-metrics] → [crashkill-metrics] [waiting on cpus]
We almost doubled the processing capacity of our current cluster of processors. To be exact, earlier we had 12 threads; we now have 20. One set of processors (8 threads) is now running on faster CPUs as well. This should hopefully be enough for us to keep up with the incoming load. Thanks to phong and jabba for getting the CPUs ordered and swapped so quickly.
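a rough guess at the new ceiling, assuming capacity scales linearly with thread count and per-report time stays around 3 seconds (the faster CPUs on one box should push the real number somewhat higher):

old_threads, new_threads = 12, 20
old_capacity = 15_000                     # reports/hour, from the earlier comments
new_capacity = old_capacity * new_threads / old_threads
print(round(new_capacity))                # ~25000 reports/hour before counting the CPU speed bump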
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
things are looking good. we had a backlog of 41,661 at 6:00p and by 9:00p that backlog was gone. under the previous capacity limits it would have taken well past midnight to knock out that backlog. tomorrow we might process over 35,000 or 40,000 reports within an hour and we could celebrate with cheap champagne! ;-)
but the server status report has some conflicting info http://crash-stats.mozilla.com/status

Mood: Deathly
Server Time: 2010-04-21 21:54:44
Stats Created At: 2010-04-21 21:50:01.944841
Waiting Jobs: 882
Processors Running: 5
Average Seconds to Process: 2.91204
Average Wait in Seconds: 4330.2
Recently Completed: 2010-04-21 21:49:59.709397
Oldest Job In Queue: 2010-04-21 13:54:47.601878

is there just something that needs to be tweaked in the report, or is there a problem somewhere? I wonder why there is an old job in the queue from 13:54.
(In reply to comment #30)
It's been discussed that we should calculate the mood value differently, so you can generally ignore this subjective value and focus on the other data points and trends.

I'm not seeing the oldest job as 13:54. It is 22:54 currently.
Attached image daily volume trend
the jumps correlate to increases in the number of 3.6 users that have client-side throttling turned off
Product: mozilla.org → mozilla.org Graveyard