Estimate web cluster size needed to support try repository load

Status: RESOLVED FIXED
Product: Developer Services
Component: Mercurial: hg.mozilla.org
Reported: 3 years ago
Last updated: 3 years ago
Reporter: hwine
Assignee: hwine

(Assignee)

Description

3 years ago
One possible way to mitigate the impact of issues serving try content is to assign dedicated hardware to handle only the try repository.

This bug is to determine the number of nodes needed to support such an operation.
(Assignee)

Comment 1

3 years ago
Recommendation:
    - start by allocating 20% of the current web heads (2 nodes) to try.
    - re-run this analysis after 13 days of problem-free operation.

New logging was added to hg.mozilla.org web heads as part of the diagnosis of the hg.mozilla.org outages in July 2014 (bug 1042210). Using those logs from Aug 7 through Aug 20, the following numbers were gathered. During this time, 10 web heads were in service.

Of note:
 - over 50% of web sessions result in no repository data being served
 - load stays light up through the 95th percentile
 - the top 1% is well over 100x the 99th percentile on both data transfer & time

Totals for sessions related to any repo:
    session count: 40278673
    octets served: 47173817122523 (~47TB)
    wall time sec: 29779628
     cpu time sec: 18718815

Totals for sessions related to the try repo:
    session count: 561649
    octets served: 755233412396 (~0.7TB)
    wall time sec: 4594855
     cpu time sec: 3923607

= try repository =

    percentile       50           75           95           99          100
    octets            0            2       134651      2299670   1085078641
      wall            0            0            2           97        15351
       cpu            0            0            1           96        11659

= all repositories =

    percentile       50           75           95           99          100
    octets            0            2       493717     19781745   1564835211
      wall            0            0            1            8        29953
       cpu            0            0            1            5        11659

= try as % of all =

    session count :  1.4%
           octets :  1.6%
        wall time : 15.4%
         cpu time : 21.0%

Summary: try repository traffic accounts for ~1.5% of the connections and data transfer, but ~20% of processing time.

Fine Print:
 - only web sessions that began and completed during the log interval were counted.
 - no cross check of these numbers has yet been done (or may be possible)
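
For reference, the per-session percentiles above can be recomputed along these lines. This is only a minimal sketch: it assumes the new session logging can be exported as a tab-separated file with one row per completed session, and the column names below are hypothetical, not the actual log format.

    import numpy as np

    # Minimal sketch: assumes a TSV export with one row per completed web
    # session and hypothetical columns repo, octets, wall, cpu.
    rows = np.genfromtxt("sessions.tsv", delimiter="\t", dtype=None,
                         encoding="utf-8",
                         names=("repo", "octets", "wall", "cpu"))

    is_try = rows["repo"] == "try"
    groups = (("try repository", is_try),
              ("all repositories", np.ones(len(rows), dtype=bool)))

    for label, mask in groups:
        print("=", label, "=")
        for field in ("octets", "wall", "cpu"):
            pct = np.percentile(rows[field][mask], [50, 75, 95, 99, 100])
            print(field, np.round(pct).astype(int))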
(Assignee)

Comment 2

3 years ago
(In reply to Hal Wine [:hwine] (use needinfo) from comment #1)
>  - no cross check of these numbers has yet been done (or may be possible)

Poor wording choice -- I mean we haven't cross-checked these numbers against the underlying logs. Such a cross check would be challenging because the existing logs measure different values; "order of magnitude" consistency is the best we could hope for.

All of the analysis done on the log data is repeatable. Happy to show that work.

Comment 3

3 years ago
Did you look at relative CPU usage after the Try reset? I suspect it is significantly lower than before.

Also, switching to generaldelta and/or lz4 revlogs will make certain operations much more CPU efficient. However, other operations will slow down drastically. I'd have to analyze which requests are accounting for CPU to tell you for sure.
(Assignee)

Comment 4

3 years ago
(In reply to Gregory Szorc [:gps] from comment #3)
> Did you look at relative CPU usage after the Try reset? I suspect it is
> significantly lower than before.

No real change in CPU; we just don't hit the overload points (yet).

> Also, switching to generaldelta and/or lz4 revlogs will make certain
> operations much more CPU efficient. However, other operations will slow down
> drastically. I'd have to analyze which requests are accounting for CPU to
> tell you for sure.

What logs do you want? I can make them available to you.

Comment 5

3 years ago
(In reply to Hal Wine [:hwine] (use needinfo) from comment #4)
> > Also, switching to generaldelta and/or lz4 revlogs will make certain
> > operations much more CPU efficient. However, other operations will slow down
> > drastically. I'd have to analyze which requests are accounting for CPU to
> > tell you for sure.
> 
> What logs do you want? I can make them available to you.

I'd need to know how large bundles being fetched by clients are. The problem with generaldelta is that the server and/or client will re-encode the data for the wire transfer. This is very computationally expensive. If all we're doing is e.g. N<10 changesets during pulls, we should be fine. But pulling hundreds or thousands of changesets via generaldelta would burn a lot of cycles for a repo the size of mozilla-central.

This problem should go away with Mercurial 3.2 or 3.3 (I hope).
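
As a starting point on the bundle-size question, something like the following could be run on the web head access logs. It is only a rough sketch: it assumes Apache-style access logs and keys on "cmd=getbundle" in the request URL, which is not necessarily how production logging is configured, and the log path is hypothetical.

    import re

    # Rough sketch: assumes Apache-style access logs where the response size
    # follows the status code; the log path is hypothetical.
    getbundle_bytes = []
    pattern = re.compile(r'"GET [^"]*cmd=getbundle[^"]*" \d{3} (\d+)')

    with open("/var/log/httpd/access_log") as fh:
        for line in fh:
            match = pattern.search(line)
            if match:
                getbundle_bytes.append(int(match.group(1)))

    getbundle_bytes.sort()
    if getbundle_bytes:
        print("getbundle responses:", len(getbundle_bytes))
        print("median bytes:", getbundle_bytes[len(getbundle_bytes) // 2])
        print("max bytes:", getbundle_bytes[-1])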
(Assignee)

Comment 6

3 years ago
This bug only focuses on web heads. Splitting the push head does not appear to be needed:

    == during issues ==                     == post issues ==

2014-08-10 to 2014-08-15                  2014-08-24 to 2014-09-02
                                        
     Try              Non-try                 Try           Non-try
  push times        push times            push times       push times
count    14082.0     47231.0            count    36218.0    101945.0
mean        20.2         2.3            mean         5.2         3.3
std         93.9        33.8            std         40.4        51.5
min          1.0         1.0            min          1.0         1.0
50%          7.0         1.0            50%          3.0         1.0
75%         23.0         2.0            75%          6.0         2.0
90%         37.0         3.0            90%          7.0         3.0
98%         98.0        15.0            98%         21.0        17.1
99%        125.0        16.0            99%         23.0        19.0
max       7944.0      7221.0            max       7226.0      9010.0
                                        
      Try             Non-try                 Try            Non-try
  queue lengths     queue lengths        queue lengths     queue lengths
count    17010.0     94858.0            count    25651.0    225482.0
mean         5.5         4.2            mean         5.2         4.5
std          3.0         2.5            std          2.9         2.6
min          1.0         1.0            min          1.0         1.0
50%          6.0         4.0            50%          5.0         4.0
75%          8.0         6.0            75%          8.0         7.0
90%         10.0         8.0            90%          9.0         8.0
98%         11.0        10.0            98%         10.0        10.0
99%         12.0        10.0            99%         11.0        10.0
max         16.0        16.0            max         13.0        18.0
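
The push-time and queue-length summaries above look like pandas describe() output; if the push log can be exported with a repo name and a per-push duration, the during/post comparison could be regenerated roughly as below. The file name and column names are assumptions for illustration only.

    import pandas as pd

    # Sketch only: assumes a CSV export of the push log with hypothetical
    # columns repo and push_seconds.
    pushes = pd.read_csv("push_times.csv")

    labels = pushes["repo"].eq("try").map({True: "Try", False: "Non-try"})
    for name, group in pushes.groupby(labels):
        stats = group["push_seconds"].describe(
            percentiles=[0.5, 0.75, 0.9, 0.98, 0.99])
        print(name)
        print(stats.round(1))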
Assignee: nobody → hwine
(Assignee)

Comment 7

3 years ago
Nothing more to do here
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Product: Release Engineering → Developer Services