Closed Bug 1522935 Opened 6 years ago Closed 2 years ago

Perform load testing on hgweb mirrors

Categories

(Developer Services :: Mercurial: hg.mozilla.org, enhancement, P3)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: sheehan, Assigned: sheehan)

References

(Blocks 1 open bug)

Details

Now that we have mirrors of hgmo's HTTP endpoint running in AWS, I'd like to write and run some load tests against them. For obvious reasons, running load tests against our production on-premises hardware is not something we can do. With the EC2 mirrors, we can create a new host that is independent of the public-facing service but matches its configuration almost exactly, and emulate a high-load event to see how the service holds up. The main objectives of this work are:

  1. Gather data points to help with capacity planning for the CI-private hgmo endpoints.
  2. Identify endpoints that contribute to the recent load spikes seen on the public hgmo service (see bug 1515291 for example, or the recent Nagios alerts in #vcs).
  3. Set up an extensible framework that can be used to load-test future features (such as shallow clones/remotefilelog).

My plan is to use the Locust load testing framework to accomplish these objectives. I have past experience working with Locust from my university engineering capstone project. With Locust we can emulate user behavior using simple Python classes, load them into the tool, and run a test. Locust serves a web UI that can be used to control the test while it runs, and it generates useful graphs and output files that give us additional context.
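As an illustration, a Locust user class for hgweb traffic could look something like the sketch below. The paths and task weights are hypothetical placeholders; a real locustfile would be derived from observed hgweb access patterns.

```python
# Minimal sketch of a Locust user class (Locust 1.0+ API).
# The repository paths and task weights below are hypothetical.
from locust import HttpUser, task, between

class HgwebUser(HttpUser):
    # Wait 1-5 seconds between tasks to approximate real client pacing.
    wait_time = between(1, 5)

    @task(10)
    def shortlog(self):
        # Lightweight HTML page, a common browser-driven request.
        self.client.get("/mozilla-central/shortlog")

    @task(3)
    def raw_file(self):
        # Raw file fetch, representative of automation pulling single files.
        self.client.get("/mozilla-central/raw-file/tip/README.txt")
```

Running `locust -f locustfile.py --host https://<mirror>` starts the web UI, from which the number of simulated users and the spawn rate can be adjusted during the test.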

For the first objective (CI-private mirrors), I'd like to determine how much cloning load the service can sustain without performance degradation. I want to ensure we have sufficient computing power behind the private endpoint before rolling it out to CI: getting that wrong when we switch over production traffic would cause highly visible outages and generally displease engineers.
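For clone load specifically, one option (a sketch, not a committed design) is to drive real `hg clone` processes from a Locust `User` and feed the timings into Locust's request event so clone results show up in the same statistics. The mirror hostname below is a placeholder, and this assumes the Locust 2.x event API.

```python
# Sketch of a clone-load user that shells out to `hg clone --stream` and
# reports timings via Locust's request event (Locust 2.x API). The host
# below is a placeholder for the CI-private mirror.
import shutil
import subprocess
import tempfile
import time

from locust import User, constant, task

class HgCloneUser(User):
    wait_time = constant(0)  # no think time; clones run back to back
    host = "https://hgweb-mirror.example.com"

    @task
    def stream_clone(self):
        dest = tempfile.mkdtemp(prefix="loadtest-clone-")
        start = time.time()
        exc = None
        try:
            subprocess.run(
                ["hg", "clone", "--stream", f"{self.host}/mozilla-unified", dest],
                check=True,
                capture_output=True,
            )
        except subprocess.CalledProcessError as e:
            exc = e
        finally:
            shutil.rmtree(dest, ignore_errors=True)
        # Feed the result into Locust's statistics alongside HTTP requests.
        self.environment.events.request.fire(
            request_type="hg",
            name="clone --stream mozilla-unified",
            response_time=(time.time() - start) * 1000.0,
            response_length=0,
            context={},
            exception=exc,
        )
```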

For the second objective, I'd like to simulate different load events and see if we can find the root cause of our load spikes. I have a hunch that many consumers of the same SQLite pushlog database can cause resource contention and busy loops on the server, slowing down all requests as a result. This work can help us verify that hunch, or find the actual root cause and mitigate the issue.
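To probe that hunch, the test could spawn many users that do nothing but query the pushlog with no think time, maximizing the number of concurrent readers of the SQLite database. The query parameters below are illustrative only; `full=1` is just meant to make each query heavier.

```python
# Sketch of a pushlog-focused scenario: every simulated user hammers
# json-pushes with no wait time, so many readers hit the same SQLite
# pushlog database concurrently. Parameters are illustrative.
from locust import HttpUser, constant, task

class PushlogPoller(HttpUser):
    wait_time = constant(0)  # no think time; maximize concurrency

    @task
    def poll_pushlog(self):
        self.client.get(
            "/mozilla-central/json-pushes",
            params={"version": 2, "full": 1},
            # Group all query variants under one name in the Locust stats.
            name="/mozilla-central/json-pushes",
        )
```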

Finally, having an extensible load testing framework in place will allow us to roll out partial clones in the near future with confidence. Enabling partial clones will mean clients no longer use the clonebundles feature, which is currently responsible for offloading ~97% of all bytes served from hg.mozilla.org. We should be able to add a load test and verify that we can handle the new traffic before rolling out to production, scaling our compute capacity up or down as needed.

Type: defect → enhancement
Priority: -- → P5
Priority: P5 → P3
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX