Closed
Bug 541873
Opened 15 years ago
Closed 15 years ago
Build capacity to process more crash reports.
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: chofmann, Assigned: aravind)
Details
(Whiteboard: [crashkill-metrics] [waiting on cpus])
Attachments
(9 files)
The default setting for submitting crash reports has been shifted to 100% opt-in for Firefox 3.6. This is up from a default of 10% in previous releases, so we should start to see increased volume in crash submissions.
We still have server-side throttling to deal with load balancing, but overall we need to ramp up capacity.
We tried to ramp up 3.6 processing from 15% to 50% of all submissions in the early days of the 3.6 release, and the system started to lag during peak periods.
It looks like we have capacity to process about 13k-14k crash reports per hour; beyond that the system starts to lag.
We should try to figure out how we could double that capacity.
Comment 1 (Reporter) • 15 years ago
Updated (Reporter) • 15 years ago
Whiteboard: [crashkill-metrics]
Comment 2 (Reporter) • 15 years ago
Comment 3 (Reporter) • 15 years ago
You can see where Aravind throttled back to 25% on 3.6 yesterday afternoon. The system was probably on its way to recovery, but at 17k reports per hour it was laboring. Aravind might have details on the size of the backlog and the delay in processing reports at that point.
Comment 4 • 15 years ago
The number of processors was at 5 Friday and is at 5 now.
http://crash-stats.mozilla.com/status
Did we launch more processors?
Comment 5 (Assignee) • 15 years ago
I didn't take a snapshot of how the system looked when I throttled it back, but I vaguely remember the backlog was trending up, with something like 20k reports in the queue.
We can add capacity to the existing system, but doing that would mean hardware (processing power) and storage purchases on our side. We weren't planning for that, since all this processing was supposed to be taken over by the metrics team.
I am copying Daniel on this thread; he may have more input on when that will be ready.
Comment 6 (Assignee) • 15 years ago
@ozten: Nope, we did not add any new processors; I just throttled the incoming crashes to 25%. The system caught up on its own.
Our sweet spot right now is probably somewhere between 25% and 50%.
Comment 7 (Reporter) • 15 years ago
(In reply to comment #6)
> @ozten: nope.. we did not add any new processors, I just throttled the incoming
> crashes to 25%. The system just caught up on its own.
>
> Our sweet spot right now is probably somewhere between that 25 to 50 %.
But the 25-50% server-throttling sweet spot will change quickly as 3.5.x users (client-throttled so that 90% of users default to opt-out) move to 3.6.x (no client throttling).
I suspect that we may need to drop below 15% server throttling pretty soon for 3.6 users if we don't ramp up capacity.
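A minimal sketch of the arithmetic behind that suspicion (my own illustration; the ~25% effective client-side pass-through is the rough estimate from comment 9 below, and holding per-user processed load constant is an assumption):

def processed_fraction(client_rate, server_rate):
    # Fraction of all crashes that actually reach the processors.
    return client_rate * server_rate

fx35 = processed_fraction(client_rate=0.25, server_rate=0.15)  # 3.5.x: ~0.0375
# Once client throttling is gone (client_rate -> 1.0), the server throttle
# has to absorb the entire cut to keep the processed load the same:
needed = fx35 / 1.0
print(f"server throttle needed on 3.6.x: {needed:.4f}")  # ~3.75%, well below 15%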
Comment 8 • 15 years ago
I still doubt that the client-side throttling has any effect.
Comment 9 (Reporter) • 15 years ago
Playing with the 3.5.x crashes-per-100-users numbers, where client throttling is turned on, I can make them look roughly like an unthrottled beta by applying the idea that:
- one person out of 10 accepts the default opt-in,
- an additional 1.5 people override the default opt-out and turn reporting on,
- the other 7.5 people keep the default opt-out.
See https://wiki.mozilla.org/CrashKill/Crashr#Release_3.5.6
That means for days when we have 51 million ADUs on 3.5.6, about 12.75 million of those users (25%) are in the pool that might be submitting crashes; applying the 15% server-side throttling to that pool leaves an effective 1.9 million users, which yields roughly the number of crashes we see in the Socorro reporting.
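Worked out as a short Python sketch (the rates are the rough estimates above, not measured values):

adu = 51_000_000          # 3.5.6 active daily users
opt_in_default = 0.10     # 1 in 10 accepts the default opt-in
opt_in_override = 0.15    # another 1.5 in 10 override the default opt-out
server_throttle = 0.15    # current server-side throttling on 3.5.x

submitting_pool = adu * (opt_in_default + opt_in_override)  # ~12.75M users
effective_pool = submitting_pool * server_throttle          # ~1.91M users
print(f"{effective_pool / 1e6:.2f}M users in the effective pool")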
Comment 10 (Reporter) • 15 years ago
I still had a window open with server status from yesterday morning. This is what things looked like when we were processing about 15k reports per hour.
Mood Deathly
Server Time 2010-01-24 09:39:21
Stats Created At 2010-01-24 09:35:01.829668
Waiting Jobs 9815
Processors Running 5
Average Seconds to Process 3.24642
Average Wait in Seconds 1668.05
Recently Completed 2010-01-24 09:35:01.449064
Oldest Job In Queue 2010-01-24 08:33:21.29002
The "seconds to process" is still around 3 seconds, so maybe we could add more processors as suggested in comment 5, and not affect cpu load too much. I'm guessing that when we see that number rise we are starting to grind the manchine into the ground.
maybe another experiment to test this idea is in order.
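For a sense of scale, here is a rough drain-time estimate for a backlog like the one above (my sketch; the incoming rate is an assumed figure, not a measurement):

def hours_to_clear(backlog, capacity_per_hr, incoming_per_hr):
    net = capacity_per_hr - incoming_per_hr
    if net <= 0:
        return float("inf")  # backlog grows instead of draining
    return backlog / net

# e.g. 9,815 waiting jobs, ~15k/hr peak processing, ~13k/hr still arriving:
print(f"{hours_to_clear(9815, 15_000, 13_000):.1f} hours")  # ~4.9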
Comment 11 (Reporter) • 15 years ago
We added about a million 3.6 ADUs yesterday, and peak crash reporting increased by a few hundred crashes per hour:
14,853 -> 15,221
That might mean we are about a week or two away from being back in the peak-load danger zone if we continue to add the expected million users a day to the 3.6 user base.
Comment 12 (Reporter) • 15 years ago
https://wiki.mozilla.org/CrashKill/Crashr#3.6_RC1.2C_RC2.2C_Final shows the current ramp rate of ADUs and crash reports on 3.6.
Comment 13 (Reporter) • 15 years ago
Complete with an indication of the system maintenance outage. Also adds the relative volume of processed crash reports for 3.5.7.
3.6 should surpass 3.5.7 volume in a week or two, even without any kind of major update push.
Comment 14 (Reporter) • 15 years ago
I went to look for some of the early Firefox 3.6.3 reports as it was going out on the wire tonight, and saw a backlog that is quite a bit larger than the one in comment 10 from several weeks ago.
Mood Deathly
Server Time 2010-04-01 19:15:01
Stats Created At 2010-04-01 19:10:01.530213
Waiting Jobs 115616
Processors Running 2
Average Seconds to Process 2.54114
Average Wait in Seconds 26837.5
Recently Completed 2010-04-01 19:10:01.182602
Oldest Job In Queue 2010-03-31 07:11:11.473104
A 7-hour backlog means that it will be many hours until we start processing any significant number of 3.6.3 reports. It also looks like the number of processors is down to 2. Would bumping that back up to 5 help? I think we had some discussion on this a few weeks ago, but I can't remember the outcome.
Comment 15 (Reporter) • 15 years ago
Where we were processing at an average rate of 10k reports per hour back in the last snapshot on 1/28, we are now processing about 14k per hour, which is very close to the backlog-creating limit. That means we are spending more hours of the day creating backlogs and fewer hours of the day reducing them.
Having 3 processors down might be causing this current 7-hour backlog, so we need to get that repaired and see what things look like.
The planned system downtime to upgrade the server will also add to the backlog, which will compound the problem a bit. Getting these backlogs generally lower and/or planning outages during no-backlog periods might be something we need to consider in the future.
Comment 16 (Reporter) • 15 years ago
In the chart in comment 15, it looks like we had:
160 hours where we were over 14k reports per hour and
128 hours where we were under 14k reports per hour
Comment 17 • 15 years ago
Looking at it now I see 5 processors. I'm guessing the processors being down to 2 was caused by bug 556679.
The wait time is starting to rise again so we'll have to keep an eye on it today.
Comment 18 (Reporter) • 15 years ago
One thing we could do to see these trends more easily would be to expand http://crash-stats.mozilla.com/status to show a 3- and/or 7-day window. We would need 10-minute updates on the data (maybe half-hour snapshots would be fine), but then we could see when things are out of bounds compared to the previous couple of days.
I also think we should be plotting #submissions and #processed_reports somewhere to better understand the dynamics of the backlog.
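The backlog series such a plot would expose is just the running sum of (submitted - processed), floored at zero. A small sketch (the hourly figures here are made up for illustration):

def backlog_series(submitted, processed):
    backlog, series = 0, []
    for s, p in zip(submitted, processed):
        backlog = max(0, backlog + s - p)
        series.append(backlog)
    return series

print(backlog_series([16_000, 18_000, 12_000, 9_000],
                     [15_000, 15_000, 15_000, 15_000]))
# -> [1000, 4000, 1000, 0]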
Comment 19 (Reporter) • 15 years ago
Here is another way to look at this: we have a theoretical peak processing capability of 15,000 reports per hour.**
** Assume 2.5-3 seconds per report, twenty such cycles per minute, and 60 minutes per hour.
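That arithmetic works out if roughly a dozen reports are being processed concurrently (comment 28 later mentions the pre-upgrade cluster ran 12 threads, which is the assumption in this sketch):

def peak_per_hour(threads, secs_per_report):
    return threads * (60 / secs_per_report) * 60

print(peak_per_hour(12, 3.0))  # 14,400
print(peak_per_hour(12, 2.5))  # 17,280 -- brackets the ~15k figure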
Whenever we receive more than 15,000 reports in an hour, we are creating and/or adding to the backlog. The charts in comment 15 and below show we were adding to the backlog for about 10 hours each day, leaving the other 14 hours to recover, and we might be using up about 70-80% of that recovery time under recent typical daily loads. Here is what things looked like on 3/31/10:
hour          reports   backlog beyond 15k   extra capacity for backlog reduction
2010033100     10709
2010033101     11338
2010033102     11876
2010033103     12411
2010033104     13917
2010033105     16291      1291
2010033106     17217      2217
2010033107     18175      3175
2010033108     19293      4293
2010033109     19190      4190
2010033110     19057      4057
2010033111     18787      3787
2010033112     18189      3189
2010033113     17083      2083
2010033114     15375       375
2010033115     13311                           -1689
2010033116     11421                           -3579
2010033117     11236                           -3764
2010033118     11932                           -3068
2010033119     13020                           -1980
2010033120     10909                           -4091
2010033121     10761                           -4239
2010033122     10526                           -4474
2010033123      9456
total         341480     28657                -26884
It took us until past 10pm to wipe out the backlog created between 5am and 2pm.
As the removal of client-side throttling takes effect and more users move to Firefox 3.6.x, we keep adding to the backlog, and we also eat away at the hours available for recovery and the spare capacity within each recovery hour.
There is also the case where downtime or bugs add to the backlog: the event on 4/1 added about 100,000 reports to the backlog on a day when we could only get through processing 230,000 reports. It looks like it might take 3-4 days, over the lower-volume weekend/holiday days, to get rid of that backlog using spare evening cycles and fewer hours in overload.
We are at about 65 million 3.6.x users. We could probably create a model and project when we might hit the wrap-around point where we aren't able to keep up with the backlog and would continue to add to it day over day, unless we add capacity or reduce the 15% processing rate we have as the current throttle setting.
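A minimal version of that model (my sketch; the capacity ceiling and growth rate are assumptions). The ceiling is 24h x 15k = 360k reports per day; once daily volume stays above it, the backlog compounds day over day:

HOURLY_CAPACITY = 15_000

def days_until_runaway(daily_volume, daily_growth):
    days = 0
    while daily_volume <= HOURLY_CAPACITY * 24:  # 360k/day ceiling
        daily_volume *= 1 + daily_growth
        days += 1
    return days

# e.g. starting from the ~341k/day seen on 3/31, growing 1% a day:
print(days_until_runaway(341_480, 0.01))  # -> 6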
Comment 20 (Reporter) • 15 years ago
So if you take those hourly volume numbers from 3/31 and apply a uniform 6% increase, we get over the magic 360,000 reports per day and we produce the wrap-around effect, where we continue to add to the backlog and never catch up.
hr     #reports   backlog   backlog_reduction
00       11352               -3648
01       12018               -2982
02       12589               -2411
03       13156               -1844
04       14752                -248
05       17268      2268
06       18250      3250
07       19266      4266
08       20451      5451
09       20341      5341
10       20200      5200
11       19914      4914
12       19280      4280
13       18108      3108
14       16298      1298
15       14110                -890
16       12106               -2894
17       11910               -3090
18       12648               -2352
19       13801               -1199
20       11564               -3436
21       11407               -3593
22       11158               -3842
23       10023               -4977
total   361969     39128    -37160
For the last 7 days of March, here is what daily volumes looked like:
332072 20100325-crashdata.csv
323476 20100326-crashdata.csv
308682 20100327-crashdata.csv
329739 20100328-crashdata.csv
347344 20100329-crashdata.csv
338689 20100330-crashdata.csv
341481 20100331-crashdata.csv
331640 average daily volume
This compares to an average daily volume of 233,203 in the last week of February, before the 3.6.x user base began to climb as a result of the major update offer.
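From those two weekly averages, a quick projection (my sketch; it assumes growth simply continues at the same compound rate):

import math

feb_avg, mar_avg = 233_203, 331_640
days_between = 31  # roughly late Feb -> late Mar
daily_growth = (mar_avg / feb_avg) ** (1 / days_between) - 1  # ~1.1%/day

days_to_360k = math.log(360_000 / mar_avg) / math.log(1 + daily_growth)
print(f"{daily_growth:.2%}/day, ~{days_to_360k:.0f} days to 360k/day")  # ~7 days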
Comment 21 (Reporter) • 15 years ago
Aravind was able to look at the monitor to confirm some of these numbers.
This morning we watched the incoming flow of reports as the backlog was beginning to build.
Between 7:05a and 8:05a the backlog increased by 1,572, and the incoming flow of reports registered 276-289/min, an hourly rate of 16,560-17,340. That helps to confirm that ~15,000 reports per hour is the magic capacity number: we add to the backlog when incoming reports are over that number, and reducing the backlog is possible when we are under it.
Comment 22 • 15 years ago
Socorro caught up on its backlog at 2010-04-03 21:50:00 PDT.
Comment 23 (Reporter) • 15 years ago
Yeah, due to lower active daily users over Good Friday and the Easter weekend, we were operating at a lower incoming report rate and had more backlog-reduction capacity. On Saturday we had the capacity to reduce a backlog of 62k and only created a backlog of 11k. It looks like we were operating at 86% of capacity, rather than the 95% of capacity we saw on 3/31.
hour          reports   backlog   backlog reduction
2010040300     10868               -4132
2010040301     11314               -3686
2010040302     11819               -3181
2010040303     12100               -2900
2010040304     13220               -1780
2010040305     14337                -663
2010040306     15662       662
2010040307     16875      1875
2010040308     17183      2183
2010040309     17424      2424
2010040310     17347      2347
2010040311     16658      1658
2010040312     15703       703
2010040313     15101       101
2010040314     13809               -1191
2010040315     12654               -2346
2010040316     11231               -3769
2010040317     10255               -4745
2010040318      9964               -5036
2010040319      9676               -5324
2010040320      8722               -6278
2010040321      9225               -5775
2010040322      9048               -5952
2010040323      9466               -5534
total         309661    11953     -62292
pct. of 360k: 0.86
Comment 24 (Reporter) • 15 years ago
I've been gathering incoming volume from the .csv files and taking snapshots of the backlog for the past few days. This chart might give the clearest picture: it shows the backlog increasing when we hit that threshold of around 14-15k incoming reports per hour, with the backlog not starting to decline until we get back under that level. It also shows the volume increase as more people get onto 3.6.x and we come out of the Easter/Passover weekend, and the narrowing of the time we are at zero backlog. We didn't clear the backlog until after 1am last night, then started creating a backlog again around 5:30a.
Comment 25 • 15 years ago
Aravind, can we get some extra boxes online as discussed last week ASAP? I think deinspanjer was offering to help too.
Assignee: nobody → aravind
Component: Socorro → Server Operations
Product: Webtools → mozilla.org
QA Contact: socorro → mrz
Version: Trunk → other
Comment 26 (Reporter) • 15 years ago
I've turned off backlog monitoring, but updated the chart with daily incoming volume for the last few days.
On Monday we went over 130M ADUs, and that put us over the 20k-incoming-reports-per-hour level for the last few days, with estimated backlogs of 40k and 38k.
Updated • 15 years ago
Severity: normal → major
Comment 27 (Assignee) • 15 years ago
We ordered extra CPUs for the existing servers; with this change, we should be able to get close to double our current processing capacity. I don't have an exact date yet, but it should be soon.
Updated • 15 years ago
Whiteboard: [crashkill-metrics] → [crashkill-metrics] [waiting on cpus]
Comment 28 (Assignee) • 15 years ago
We almost doubled the processing capacity of our current cluster of processors. To be exact, where we had 12 threads earlier, we now have 20. One set of processors (8 threads) is now running on faster CPUs as well. This should hopefully be enough for us to keep up with the incoming load.
Thanks to phong and jabba for getting the CPUs ordered and swapped so quickly.
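Roughly scaling the earlier numbers (my sketch; it assumes capacity scales linearly with thread count and ignores the faster CPUs, so it is a lower bound on the new ceiling):

old_threads, new_threads = 12, 20
old_capacity = 15_000  # reports/hour, per the estimate in comment 19
new_capacity = old_capacity * new_threads / old_threads
print(f"~{new_capacity:,.0f} reports/hour")  # ~25,000 before the CPU bump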
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 29 (Reporter) • 15 years ago
Things are looking good. We had a backlog of 41,661 at 6:00p, and by 9:00p that backlog was gone. Under the previous capacity limits it would have taken well past midnight to knock out that backlog. Tomorrow we might process over 35,000 or 40,000 reports within an hour, and we could celebrate with cheap champagne! ;-)
Comment 30 (Reporter) • 15 years ago
But the server status report has some conflicting info: http://crash-stats.mozilla.com/status
Mood Deathly
Server Time 2010-04-21 21:54:44
Stats Created At 2010-04-21 21:50:01.944841
Waiting Jobs 882
Processors Running 5
Average Seconds to Process 2.91204
Average Wait in Seconds 4330.2
Recently Completed 2010-04-21 21:49:59.709397
Oldest Job In Queue 2010-04-21 13:54:47.601878
Is there just something that needs to be tweaked in the report, or is there a problem somewhere?
I wonder why there is an old job in the queue from 13:54.
Comment 31 • 15 years ago
(In reply to comment #30)
It's been discussed that we should calculate the mood value differently, so you can generally ignore this subjective value and focus on the other data points and trends.
I'm not seeing the oldest job as 13:54. It is 22:54 currently.
Comment 32 (Reporter) • 15 years ago
Comment 33 (Reporter) • 15 years ago
Jumps correlate with the increasing number of 3.6 users that have client-side throttling turned off.
Updated • 10 years ago
Product: mozilla.org → mozilla.org Graveyard