Closed Bug 1057148 Opened 11 years ago Closed 11 years ago

Estimate web cluster size needed to support try repository load

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: hwine)

One possible way to mitigate the impact of issues serving try content is to assign dedicated hardware to handle only the try repository. This bug is to determine the number of nodes needed to support such an operation.
Recommendation:
- start with an allocation of 20% of current web heads to try (2 web heads).
- re-run this analysis after 13 days of problem-free operation.

New logging was added to hg.mozilla.org web heads as part of the diagnosis of the hg.mozilla.org outages in July 2014 (bug 1042210). Using those logs from Aug 7 through Aug 20, the following numbers were gathered. During this time, 10 web heads were in service.

Of note:
- over 50% of web sessions result in no repository data being served
- light load up through the 95th percentile
- the top 1% is well over 100x the 99th percentile on both data transfer & time

Totals for sessions related to any repo:
  session count: 40278673
  octets served: 47173817122523 (~47TB)
  wall time sec: 29779628
  cpu time sec:  18718815

Totals for sessions related to the try repo:
  session count: 561649
  octets served: 755233412396 (~0.75TB)
  wall time sec: 4594855
  cpu time sec:  3923607

= try repository =
percentile   50  75      95        99         100
octets        0   2  134651   2299670  1085078641
wall sec      0   0       2        97       15351
cpu sec       0   0       1        96       11659

= all repositories =
percentile   50  75      95        99         100
octets        0   2  493717  19781745  1564835211
wall sec      0   0       1         8       29953
cpu sec       0   0       1         5       11659

= try as % of all =
  session count :  1.4%
  octets        :  1.6%
  wall time     : 15.4%
  cpu time      : 21.0%

Summary: try repository traffic accounts for ~1.5% of the connections and data transfer, but ~20% of processing time.

Fine Print:
- only web sessions that began and completed during the log interval were counted.
- no cross check of these numbers has yet been done (or may be possible)
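The "try as % of all" figures and the two-node recommendation in the comment above can be re-derived from the posted totals. The following is a minimal sketch, not the analysis that was actually run: the dictionary keys and the `try_share` helper are made-up names, and sizing the dedicated pool by try's share of CPU time is an assumption about how the 20% figure was chosen.

```python
# Totals copied from the comment above (logs from Aug 7 through Aug 20).
ALL = {"sessions": 40278673, "octets": 47173817122523,
       "wall_sec": 29779628, "cpu_sec": 18718815}
TRY = {"sessions": 561649, "octets": 755233412396,
       "wall_sec": 4594855, "cpu_sec": 3923607}
WEB_HEADS = 10  # web heads in service during the log interval


def try_share(metric):
    """Return try's share of the all-repo total for one metric, in percent."""
    return 100.0 * TRY[metric] / ALL[metric]


shares = {metric: round(try_share(metric), 1) for metric in ALL}

# Assumption: allocate web heads in proportion to try's share of CPU time,
# rounded to the nearest whole node, and never fewer than one node.
try_heads = max(1, round(WEB_HEADS * try_share("cpu_sec") / 100.0))

for metric, pct in shares.items():
    print(f"{metric:9s}: {pct:5.1f}%")
print("web heads for try:", try_heads)
```

Sizing by wall time instead of CPU time gives a share of about 15%, which also rounds to 2 of 10 web heads, so the recommendation does not depend on which of the two processing-time measures is used.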
(In reply to Hal Wine [:hwine] (use needinfo) from comment #1)
> - no cross check of these numbers has yet been done (or may be possible)

Poor wording choices -- I mean we haven't cross checked the numbers in the underlying logs. Such a cross check would be challenging due to different values being measured by existing logs. Some "order of magnitude" consistency is the best we could hope for.

All of the analysis done on the log data is repeatable. Happy to show that work.
Did you look at relative CPU usage after the Try reset? I suspect it is significantly lower than before. Also, switching to generaldelta and/or lz4 revlogs will make certain operations much more CPU efficient. However, other operations will slow down drastically. I'd have to analyze which requests are accounting for CPU to tell you for sure.
(In reply to Gregory Szorc [:gps] from comment #3)
> Did you look at relative CPU usage after the Try reset? I suspect it is
> significantly lower than before.

No real change in CPU, just we don't hit the overload points (yet).

> Also, switching to generaldelta and/or lz4 revlogs will make certain
> operations much more CPU efficient. However, other operations will slow down
> drastically. I'd have to analyze which requests are accounting for CPU to
> tell you for sure.

What logs do you want? I can make them available to you.
(In reply to Hal Wine [:hwine] (use needinfo) from comment #4)
> > Also, switching to generaldelta and/or lz4 revlogs will make certain
> > operations much more CPU efficient. However, other operations will slow down
> > drastically. I'd have to analyze which requests are accounting for CPU to
> > tell you for sure.
>
> What logs do you want? I can make them available to you.

I'd need to know how large bundles being fetched by clients are. The problem with generaldelta is that the server and/or client will re-encode the data for the wire transfer. This is very computationally expensive. If all we're doing is e.g. N<10 changesets during pulls, we should be fine. But pulling hundreds or thousands of changesets via generaldelta would burn a lot of cycles for a repo the size of mozilla-central. This problem should go away with Mercurial 3.2 or 3.3 (I hope).
This bug only focuses on web heads. Splitting the push head does not appear to be needed:

push times
              == during issues ==           == post issues ==
         2014-08-10 to 2014-08-15    2014-08-24 to 2014-09-02
                  Try     Non-try             Try     Non-try
count         14082.0     47231.0         36218.0    101945.0
mean             20.2         2.3             5.2         3.3
std              93.9        33.8            40.4        51.5
min               1.0         1.0             1.0         1.0
50%               7.0         1.0             3.0         1.0
75%              23.0         2.0             6.0         2.0
90%              37.0         3.0             7.0         3.0
98%              98.0        15.0            21.0        17.1
99%             125.0        16.0            23.0        19.0
max            7944.0      7221.0          7226.0      9010.0

queue lengths
              == during issues ==           == post issues ==
         2014-08-10 to 2014-08-15    2014-08-24 to 2014-09-02
                  Try     Non-try             Try     Non-try
count         17010.0     94858.0         25651.0    225482.0
mean              5.5         4.2             5.2         4.5
std               3.0         2.5             2.9         2.6
min               1.0         1.0             1.0         1.0
50%               6.0         4.0             5.0         4.0
75%               8.0         6.0             8.0         7.0
90%              10.0         8.0             9.0         8.0
98%              11.0        10.0            10.0        10.0
99%              12.0        10.0            11.0        10.0
max              16.0        16.0            13.0        18.0
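One way to read the push-time figures above is to divide each "during issues" value by its "post issues" counterpart. The sketch below does that for a few rows copied from the table; the `speedup` helper and the choice of rows are illustrative only and were not part of the original analysis.

```python
# (during issues, post issues) push-time values copied from the table above,
# in the table's own units.
TRY_PUSH = {"mean": (20.2, 5.2), "50%": (7.0, 3.0),
            "90%": (37.0, 7.0), "99%": (125.0, 23.0)}
NON_TRY_PUSH = {"mean": (2.3, 3.3), "50%": (1.0, 1.0),
                "90%": (3.0, 3.0), "99%": (16.0, 19.0)}


def speedup(stats):
    """Map each statistic to during/post; values above 1.0 mean pushes got faster."""
    return {name: round(during / post, 1)
            for name, (during, post) in stats.items()}


try_speedup = speedup(TRY_PUSH)
non_try_speedup = speedup(NON_TRY_PUSH)

print("try:    ", try_speedup)
print("non-try:", non_try_speedup)
```

For these rows, every try ratio comes out above 2 while the non-try ratios stay at or below 1, which is consistent with the conclusion that the push head does not need to be split.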
Assignee: nobody → hwine
Nothing more to do here
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Developer Services