Closed Bug 1010316
Opened 11 years ago, closed 11 years ago
tweak varnish PoC
Categories: Infrastructure & Operations :: RelOps: General (task)
Tracking: Not tracked
Status: RESOLVED WORKSFORME
People: Reporter: taras.mozilla, Assigned: gozer
Description
The server doesn't seem to be reporting its load/CPU/network/etc. usage to Graphite/Hosted Graphite. Can that be added?
We need to use something like varnishncsa to log response time and time to first byte. This seems to require an upgrade to Varnish 3.x.
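To illustrate what I mean by reporting to graphite, a quick sketch of pushing a host metric over the carbon plaintext protocol; the hostname, port, and metric prefix below are placeholders (assumptions), not the PoC's actual configuration:

# Minimal sketch: push the 1-minute load average to a carbon endpoint using
# the Graphite plaintext protocol ("<metric> <value> <timestamp>\n").
# CARBON_HOST, CARBON_PORT and PREFIX are placeholders, not real config.
import os
import socket
import time

CARBON_HOST = "graphite.example.com"  # assumption: replace with the real endpoint
CARBON_PORT = 2003                    # carbon's default plaintext port
PREFIX = "sccache-poc.varnish"        # hypothetical metric namespace

def send_metric(name, value, timestamp=None):
    # One metric per short-lived TCP connection; fine for a handful of metrics.
    timestamp = int(timestamp or time.time())
    line = f"{PREFIX}.{name} {value} {timestamp}\n"
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

if __name__ == "__main__":
    load1, _, _ = os.getloadavg()
    send_metric("loadavg.1min", load1)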
Reporter
Comment 1 • 11 years ago
There are also some I/O problems on the box: http://vps.glek.net/dmesg.txt Perhaps switching to the deadline scheduler would help, or maybe the HD is dying (http://vps.glek.net/smart.txt)?
Flags: needinfo?(gozer)
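For reference, the scheduler switch can be done through sysfs; a quick sketch below. The device name is an assumption, it needs root, and the change doesn't persist across reboots.

# Sketch of the "switch to deadline" idea: the available and active I/O
# schedulers for a block device are exposed under sysfs. Device name, and
# whether this kernel offers "deadline" at all, are assumptions.
from pathlib import Path

DEVICE = "sda"  # assumption: the PoC box's single drive
SCHED = Path(f"/sys/block/{DEVICE}/queue/scheduler")

def current_schedulers():
    # Returns (active, available), e.g. ("cfq", ["noop", "deadline", "cfq"]).
    raw = SCHED.read_text().split()
    active = next(s.strip("[]") for s in raw if s.startswith("["))
    return active, [s.strip("[]") for s in raw]

def set_scheduler(name="deadline"):
    SCHED.write_text(name)  # requires root

if __name__ == "__main__":
    active, available = current_schedulers()
    print(f"active={active} available={available}")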
Assignee
Updated • 11 years ago
Assignee: server-ops-storage → gozer
Status: NEW → ASSIGNED
Flags: needinfo?(gozer)
Assignee
Comment 2 • 11 years ago
(In reply to Taras Glek (:taras) from comment #0)
> It doesn't seem the server is reporting its load/cpu/network/etc. usage to
> graphite/hosted graphite. Can that be added?
Normally this would be somewhat easy, but this host is *not* using infra puppet, it's using releng's puppet. So I'm not sure how that could be done quickly.
> We need to use something like varnishncsa to log response time, time to
> first byte. This seems to require an upgrade to varnish3x
Varnish upgraded to 3.0.5
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 3 • 11 years ago
(In reply to Taras Glek (:taras) from comment #0)
> It doesn't seem the server is reporting its load/cpu/network/etc. usage to
> graphite/hosted graphite. Can that be added?
collectd has been added; see https://bugzilla.mozilla.org/show_bug.cgi?id=1001517#c21
Reporter
Comment 4 • 11 years ago
This PoC is going really well. However, the HD seems to be dying (resulting in poor performance during peak times). Can we move this PoC to a box with more RAM and either an SSD or a RAID1, or something similar?
I'm happy to order an SSD if we don't have one.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 5 • 11 years ago
Reporter
Comment 6 • 11 years ago
Note, this means cache-miss performance moves from 0.01-1s to 1-10s, and cache hits move from <0.0001-0.001s to 0.001-10s.
This is during periods of heavy usage.
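A sketch of how hit/miss latency numbers like these could be pulled out of a varnishncsa-style log. It assumes a hypothetical custom log format whose last two fields are the cache handling ("hit"/"miss") and the total request time in seconds; the PoC's actual log format isn't shown in this bug.

# Split request latency by cache handling and print rough percentiles.
import statistics
import sys
from collections import defaultdict

def summarize(path):
    times = defaultdict(list)
    with open(path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 2:
                continue
            handling, seconds = fields[-2], fields[-1]  # assumed field positions
            try:
                times[handling].append(float(seconds))
            except ValueError:
                continue
    for handling, values in sorted(times.items()):
        values.sort()
        p50 = statistics.median(values)
        p95 = values[int(0.95 * (len(values) - 1))]
        print(f"{handling}: n={len(values)} p50={p50:.4f}s p95={p95:.4f}s max={values[-1]:.4f}s")

if __name__ == "__main__":
    summarize(sys.argv[1] if len(sys.argv) > 1 else "varnishncsa.log")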
Comment 7 • 11 years ago
Note that in practice, the server is in production-like mode: it is relied upon. Any change to it, including migration to another box with more RAM and an SSD, would need a smooth transition of some sort. Also note that changing the hostname under which it's available will require a puppet config change on the build slaves.
Reporter
Comment 8 • 11 years ago
A note on Varnish performance today: I'm perfectly happy with it. It is working much better than I expected, given the crappy hardware it's running on.
I started an etherpad with notes on what a production setup would look like: https://etherpad.mozilla.org/ikFCaXUCbK . Feel free to add to it.
Comment 9 • 11 years ago
Great to hear the PoC is so successful.
When you say "production-like", do you mean that if, say, we lose a disk, we will experience tree closures? If so, my focus is on how we get this to a better state in the fewest steps. Let's discuss.
Assignee
Comment 10 • 11 years ago
Given that this host is running on a single drive and is already showing warning signs of disk failure, I am *really* not sure this current setup should be allowed to remain in the critical path of the tree.
I was under the impression that this was a PoC, to verify and demonstrate whether this sort of cache could have a positive impact on build time. This has been achieved successfully, right? And there are numbers to prove this conclusively, correct?
In that case, I would suggest disabling this for the time being, and taking the time needed to spec and design a proper solution with some redundancy that would not compromise the tree by introducing another single point of failure.
Ideally, as discussed before, we'd also build the smarts into the client side of this equation to gracefully handle cache failure and fall back to no-cache behaviour.
As a side note, with a single drive like that showing signs of fatigue, we could have a dead system in 20 minutes.
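A minimal sketch of that client-side fallback idea: treat the cache as best-effort, use a short timeout, and behave exactly like a miss on any failure. The URL and timeout are placeholders, and this is not the actual sccache implementation.

# Best-effort cache client: any error (cache down, DNS broken, disk dead,
# 404 from a fallback server) is treated as a plain cache miss.
import urllib.error
import urllib.request

CACHE_URL = "http://cache.example.com"  # assumption: the PoC's hostname
TIMEOUT = 2.0                           # seconds; fail fast instead of hanging builds

def cache_get(key):
    # Returns cached bytes, or None on miss *or* any cache failure.
    try:
        with urllib.request.urlopen(f"{CACHE_URL}/{key}", timeout=TIMEOUT) as resp:
            return resp.read()
    except (urllib.error.URLError, OSError):
        return None

def get_or_compute(key, compute):
    data = cache_get(key)
    if data is None:
        data = compute()  # the expensive path the cache normally avoids
    return data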
Reporter
Comment 11 • 11 years ago
So if the system fails, just make the domain point at an HTTP server that returns 404s for everything.
We don't have a good way to load test other than pointing all of our builders at this.
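For concreteness, a sketch of such a "404s for everything" fallback server, so clients treat every lookup as a cache miss; the port is an arbitrary placeholder.

from http.server import BaseHTTPRequestHandler, HTTPServer

class AlwaysMissHandler(BaseHTTPRequestHandler):
    def _respond(self):
        self.send_response(404)
        self.send_header("Content-Length", "0")
        self.end_headers()

    # Answer every method the cache clients might use with a 404.
    do_GET = do_HEAD = do_PUT = do_POST = _respond

    def log_message(self, fmt, *args):
        pass  # keep the fallback box quiet

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlwaysMissHandler).serve_forever()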
Reporter
Comment 12 • 11 years ago
Got a transition plan from PoC to production. TL;DR: we're going to swap configs to point at S3, and we'll pay for S3 bandwidth while we stand up production Varnish (using S3 as a hot spare). The sccache client will be made more resilient too.
https://etherpad.mozilla.org/ikFCaXUCbK
Comment 13 • 11 years ago
Moving out of storage (as that is our netapp-related component)
Component: Server Operations: Storage → RelOps
Product: mozilla.org → Infrastructure & Operations
QA Contact: dparsons → arich
Reporter
Comment 14 • 11 years ago
We'll be turning this PoC off as soon as the configs are changed.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → WORKSFORME