Estimate web cluster size needed to support try repository load

Status: RESOLVED FIXED
Product: Developer Services
Component: Mercurial: hg.mozilla.org
Reported: 3 years ago
Last updated: 3 years ago
Reporter: hwine
Assignee: hwine

(Assignee)

Description

3 years ago
One possible way to mitigate the impact of issues serving try content is to assign dedicated hardware to handle only the try repository.

This bug is to determine the number of nodes needed to support such an operation.
(Assignee)

Comment 1

3 years ago
Recommendation:
    - start by allocating 20% of the current web heads (2 nodes) to try.
    - re-run this analysis after 13 days of problem-free operation.

New logging was added to hg.mozilla.org web heads as part of the diagnosis of the hg.mozilla.org outages in July 2014 (bug 1042210). Using those logs from Aug 7 through Aug 20, the following numbers were gathered. During this time, 10 web heads were in service.

Of note:
 - over 50% of web sessions result in no repository data being served
 - load stays light up through the 95th percentile
 - the top 1% is well over 100x the 99th percentile on both data transfer & time

Totals for sessions related to any repo:
    session count: 40278673
    octets served: 47173817122523 (~47TB)
    wall time sec: 29779628
     cpu time sec: 18718815

Totals for sessions related to the try repo:
    session count: 561649
    octets served: 755233412396 (~0.7TB)
    wall time sec: 4594855
     cpu time sec: 3923607

= try repository =

    percentile       50           75           95           99          100
    octets            0            2       134651      2299670   1085078641
      wall            0            0            2           97        15351
       cpu            0            0            1           96        11659

= all repositories =

    percentile       50           75           95           99          100
    octets            0            2       493717     19781745   1564835211
      wall            0            0            1            8        29953
       cpu            0            0            1            5        11659

= try as % of all =

    session count :  1.4%
           octets :  1.6%
        wall time : 15.4%
         cpu time : 21.0%

Summary: try repository traffic accounts for ~1.5% of the connections and data transfer, but ~20% of processing time.

Fine Print:
 - only web sessions that began and completed during the log interval were counted.
 - no cross check of these numbers has yet been done (or may be possible)
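
For reference, the per-session percentiles above can be recomputed along these lines. This is only a minimal sketch: it assumes the new session logging can be exported as a tab-separated file with one row per completed session, and the column names below are hypothetical, not the actual log format.

    import numpy as np

    # Minimal sketch: assumes a TSV export with one row per completed web
    # session and hypothetical columns repo, octets, wall, cpu.
    rows = np.genfromtxt("sessions.tsv", delimiter="\t", dtype=None,
                         encoding="utf-8",
                         names=("repo", "octets", "wall", "cpu"))

    is_try = rows["repo"] == "try"
    groups = (("try repository", is_try),
              ("all repositories", np.ones(len(rows), dtype=bool)))

    for label, mask in groups:
        print("=", label, "=")
        for field in ("octets", "wall", "cpu"):
            pct = np.percentile(rows[field][mask], [50, 75, 95, 99, 100])
            print(field, np.round(pct).astype(int))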
(Assignee)

Comment 2

3 years ago
(In reply to Hal Wine [:hwine] (use needinfo) from comment #1)
>  - no cross check of these numbers has yet been done (or may be possible)

Poor wording choice -- I mean we haven't cross-checked these numbers against the underlying logs. Such a cross check would be challenging because the existing logs measure different values; "order of magnitude" consistency is the best we could hope for.

All of the analysis done on the log data is repeatable. Happy to show that work.

Comment 3

3 years ago
Did you look at relative CPU usage after the Try reset? I suspect it is significantly lower than before.

Also, switching to generaldelta and/or lz4 revlogs will make certain operations much more CPU efficient. However, other operations will slow down drastically. I'd have to analyze which requests are accounting for CPU to tell you for sure.
(Assignee)

Comment 4

3 years ago
(In reply to Gregory Szorc [:gps] from comment #3)
> Did you look at relative CPU usage after the Try reset? I suspect it is
> significantly lower than before.

No real change in CPU; we just don't hit the overload points (yet).

> Also, switching to generaldelta and/or lz4 revlogs will make certain
> operations much more CPU efficient. However, other operations will slow down
> drastically. I'd have to analyze which requests are accounting for CPU to
> tell you for sure.

What logs do you want? I can make them available to you.

Comment 5

3 years ago
(In reply to Hal Wine [:hwine] (use needinfo) from comment #4)
> > Also, switching to generaldelta and/or lz4 revlogs will make certain
> > operations much more CPU efficient. However, other operations will slow down
> > drastically. I'd have to analyze which requests are accounting for CPU to
> > tell you for sure.
> 
> What logs do you want? I can make them available to you.

I'd need to know how large bundles being fetched by clients are. The problem with generaldelta is that the server and/or client will re-encode the data for the wire transfer. This is very computationally expensive. If all we're doing is e.g. N<10 changesets during pulls, we should be fine. But pulling hundreds or thousands of changesets via generaldelta would burn a lot of cycles for a repo the size of mozilla-central.

This problem should go away with Mercurial 3.2 or 3.3 (I hope).
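
As a starting point on the bundle-size question, something like the following could be run on the web head access logs. It is only a rough sketch: it assumes Apache-style access logs and keys on "cmd=getbundle" in the request URL, which is not necessarily how production logging is configured, and the log path is hypothetical.

    import re

    # Rough sketch: assumes Apache-style access logs where the response size
    # follows the status code; the log path is hypothetical.
    getbundle_bytes = []
    pattern = re.compile(r'"GET [^"]*cmd=getbundle[^"]*" \d{3} (\d+)')

    with open("/var/log/httpd/access_log") as fh:
        for line in fh:
            match = pattern.search(line)
            if match:
                getbundle_bytes.append(int(match.group(1)))

    getbundle_bytes.sort()
    if getbundle_bytes:
        print("getbundle responses:", len(getbundle_bytes))
        print("median bytes:", getbundle_bytes[len(getbundle_bytes) // 2])
        print("max bytes:", getbundle_bytes[-1])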
(Assignee)

Comment 6

3 years ago
This bug only focuses on web heads. Splitting the push head does not appear to be needed:

    == during issues ==                     == post issues ==

2014-08-10 to 2014-08-15                  2014-08-24 to 2014-09-02
                                        
     Try              Non-try                 Try           Non-try
  push times        push times            push times       push times
count    14082.0     47231.0            count    36218.0    101945.0
mean        20.2         2.3            mean         5.2         3.3
std         93.9        33.8            std         40.4        51.5
min          1.0         1.0            min          1.0         1.0
50%          7.0         1.0            50%          3.0         1.0
75%         23.0         2.0            75%          6.0         2.0
90%         37.0         3.0            90%          7.0         3.0
98%         98.0        15.0            98%         21.0        17.1
99%        125.0        16.0            99%         23.0        19.0
max       7944.0      7221.0            max       7226.0      9010.0
                                        
      Try             Non-try                 Try            Non-try
  queue lengths     queue lengths        queue lengths     queue lengths
count    17010.0     94858.0            count    25651.0    225482.0
mean         5.5         4.2            mean         5.2         4.5
std          3.0         2.5            std          2.9         2.6
min          1.0         1.0            min          1.0         1.0
50%          6.0         4.0            50%          5.0         4.0
75%          8.0         6.0            75%          8.0         7.0
90%         10.0         8.0            90%          9.0         8.0
98%         11.0        10.0            98%         10.0        10.0
99%         12.0        10.0            99%         11.0        10.0
max         16.0        16.0            max         13.0        18.0
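
The push-time and queue-length summaries above look like pandas describe() output; if the push log can be exported with a repo name and a per-push duration, the during/post comparison could be regenerated roughly as below. The file name and column names are assumptions for illustration only.

    import pandas as pd

    # Sketch only: assumes a CSV export of the push log with hypothetical
    # columns repo and push_seconds.
    pushes = pd.read_csv("push_times.csv")

    labels = pushes["repo"].eq("try").map({True: "Try", False: "Non-try"})
    for name, group in pushes.groupby(labels):
        stats = group["push_seconds"].describe(
            percentiles=[0.5, 0.75, 0.9, 0.98, 0.99])
        print(name)
        print(stats.round(1))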
Assignee: nobody → hwine
(Assignee)

Comment 7

3 years ago
Nothing more to do here
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Product: Release Engineering → Developer Services