Closed Bug 1010316 Opened 10 years ago Closed 10 years ago

tweak varnish PoC

Categories: Infrastructure & Operations :: RelOps: General
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
Reporter: taras.mozilla
Assignee: gozer

It doesn't seem the server is reporting its load/cpu/network/etc. usage to graphite/hosted graphite. Can that be added?

We need to use something like varnishncsa to log response time and time to first byte. This seems to require an upgrade to Varnish 3.x.
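For concreteness, a minimal sketch of the kind of logging that could give us both numbers, assuming Varnish 3.x's varnishncsa with custom -F formats (the log path and exact field list are illustrative, not a settled choice):

    # append client IP, request, status, total service time (us),
    # time to first byte, and hit/miss to a log file
    varnishncsa -a -w /var/log/varnish/perf.log \
      -F '%h "%r" %s %D %{Varnish:time_firstbyte}x %{Varnish:hitmiss}x'

The custom format string is what seems to need the 3.x upgrade; as far as I know the 2.x varnishncsa only emits the fixed NCSA combined format.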
There are also some I/O problems on the box: http://vps.glek.net/dmesg.txt . Perhaps switching to the deadline I/O scheduler would help, or maybe the HD is dying (http://vps.glek.net/smart.txt)?
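If we want to try the scheduler change before touching hardware, it's a quick experiment (assuming the data disk is sda; this is non-persistent until it's also set on the kernel command line):

    # show the current scheduler; the one in brackets is active
    cat /sys/block/sda/queue/scheduler
    # switch to deadline at runtime
    echo deadline > /sys/block/sda/queue/scheduler

Making it stick across reboots would mean adding elevator=deadline to the boot parameters.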
Flags: needinfo?(gozer)
Assignee: server-ops-storage → gozer
Status: NEW → ASSIGNED
Flags: needinfo?(gozer)
(In reply to Taras Glek (:taras) from comment #0)
> It doesn't seem the server is reporting its load/cpu/network/etc. usage to
> graphite/hosted graphite. Can that be added?

Normally that would be somewhat easy, but this host is *not* using infra puppet, it's on releng's puppet. So I'm not sure how that could be done quickly.

> We need to use something like varnishncsa to log response time and time to
> first byte. This seems to require an upgrade to Varnish 3.x.

Varnish upgraded to 3.0.5
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Taras Glek (:taras) from comment #0)
> It doesn't seem the server is reporting its load/cpu/network/etc. usage to
> graphite/hosted graphite. Can that be added?

collectd has been added. See https://bugzilla.mozilla.org/show_bug.cgi?id=1001517#c21
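For reference, a rough sketch of the collectd pieces involved (this assumes collectd 5.3+ with the write_graphite plugin; the endpoint and API-key prefix below are placeholders, not the real account values):

    LoadPlugin cpu
    LoadPlugin load
    LoadPlugin interface
    LoadPlugin disk
    LoadPlugin write_graphite

    <Plugin write_graphite>
      <Node "hostedgraphite">
        # placeholder endpoint and prefix -- substitute the real Hosted Graphite values
        Host "carbon.hostedgraphite.com"
        Port "2003"
        Protocol "tcp"
        Prefix "HG-API-KEY."
      </Node>
    </Plugin>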
This PoC is going really well. However, the HD seems to be dying (resulting in crappy perf during peak times). Can we move this PoC to a box with more RAM and either an SSD or a RAID1 or something?
I'm happy to order an SSD if we don't have one.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Note, this means that cache-miss performance moves from 0.01-1s to 1-10s.

Cache hits move from <0.0001-0.001s to 0.001-10s.

This is during periods of heavy usage.
Note that in practice, the server is in production-like mode; it is relied upon. Any change done to it, including migration to another box with more RAM and an SSD, would need a smooth transition of some sort. Also note that changing the hostname under which it's available will require a puppet config change on the build slaves.
Note on Varnish performance today: I'm perfectly happy with it. It is working much better than I expected, given the crappy hardware it's running on.

I started an etherpad with notes of what a production setup would look like: https://etherpad.mozilla.org/ikFCaXUCbK . Feel free to add to it
Great to hear the POC is so successful.

When you say "production-like", do you mean that if, say, we lose a disk, we will experience tree closures? If so, my focus is on how we get this to a better state in the fewest number of steps. Let's discuss.
Given that this host is running on a single drive that is already showing warning signs of disk failure, I am *really* not sure this current setup should be allowed to remain in the critical path of the tree.

I was under the impression that this was a POC, to verify and demonstrate whether this sort of cache could have a positive impact on build time. This has been achieved successfully, right? And there are numbers to prove this conclusively, correct?

In that case, I would suggest disabling this for the time being, and taking the time needed to spec and design a proper solution with some redundancy that would not compromise the tree by introducing another single point of failure.

Ideally, as discussed before, we would also build the smarts into the client side of this equation to gracefully handle cache failure and fall back to no-cache behaviour.

As a side-note, with a single drive like that showing signs of fatigue, we could have a dead system in 20 minutes.
So if the system fails, just make the domain point at an HTTP server that returns 404s for everything (see the sketch below).

We don't have a good way to load test other than pointing all of our builders at it.
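A minimal sketch of such a blackhole, assuming nginx (any webserver that can blanket-return 404s would do):

    server {
        listen 80 default_server;
        server_name _;
        # answer 404 to everything so sccache treats every lookup as a miss
        return 404;
    }

Since the hostname stays the same, this shouldn't need any puppet change on the build slaves.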
Got a transition plan from PoC -> production. TL;DR: we're going to swap configs to point at S3, and we'll pay for S3 bandwidth while we stand up production Varnish (using S3 as a hot spare). The sccache client will be made more resilient too.


https://etherpad.mozilla.org/ikFCaXUCbK
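If the production Varnish keeps S3 as its origin, one reading of the plan is a VCL along these lines. This is a rough VCL 3.x sketch only; the bucket hostname is a placeholder and the timeouts are guesses:

    backend s3 {
        .host = "sccache-example.s3.amazonaws.com";
        .port = "80";
        .connect_timeout = 5s;
        .first_byte_timeout = 60s;
    }

    sub vcl_recv {
        set req.backend = s3;
        # S3 virtual-hosted style expects the bucket hostname, not ours
        set req.http.Host = "sccache-example.s3.amazonaws.com";
    }

The interim step of pointing the build configs straight at S3 needs no Varnish at all, which is what makes S3 workable as the hot spare.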
Moving out of Storage (as that is our NetApp-related component)
Component: Server Operations: Storage → RelOps
Product: mozilla.org → Infrastructure & Operations
QA Contact: dparsons → arich
We'll be turning this PoC off as soon as configs are changed
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Depends on: 1024651