Closed Bug 1010316 Opened 10 years ago Closed 10 years ago

tweak varnish PoC

Categories: Infrastructure & Operations :: RelOps: General
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
Reporter: taras.mozilla
Assignee: gozer

It doesn't seem the server is reporting its load/cpu/network/etc. usage to graphite/hosted graphite. Can that be added?

We need to use something like varnishncsa to log response time and time to first byte. This seems to require an upgrade to Varnish 3.x.
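For concreteness, a minimal sketch of the kind of logging that could give us both numbers, assuming Varnish 3.x's varnishncsa with custom -F formats (the log path and exact field list are illustrative, not a settled choice):

    # append client IP, request, status, total service time (us),
    # time to first byte, and hit/miss to a log file
    varnishncsa -a -w /var/log/varnish/perf.log \
      -F '%h "%r" %s %D %{Varnish:time_firstbyte}x %{Varnish:hitmiss}x'

The custom format string is what seems to need the 3.x upgrade; as far as I know the 2.x varnishncsa only emits the fixed NCSA combined format.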
There are also some I/O problems on the box: http://vps.glek.net/dmesg.txt . Perhaps switching to the deadline I/O scheduler would help, or maybe the HD is dying (http://vps.glek.net/smart.txt)?
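If we want to try the scheduler change before touching hardware, it's a quick experiment (assuming the data disk is sda; this is non-persistent until it's also set on the kernel command line):

    # show the current scheduler; the one in brackets is active
    cat /sys/block/sda/queue/scheduler
    # switch to deadline at runtime
    echo deadline > /sys/block/sda/queue/scheduler

Making it stick across reboots would mean adding elevator=deadline to the boot parameters.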
Flags: needinfo?(gozer)
Assignee: server-ops-storage → gozer
Status: NEW → ASSIGNED
Flags: needinfo?(gozer)
(In reply to Taras Glek (:taras) from comment #0)
> It doesn't seem the server is reporting its load/cpu/network/etc. usage to
> graphite/hosted graphite. Can that be added?

Normally that would be somewhat easy, but this host is *not* using infra puppet, it's on releng's puppet. So I'm not sure how that could be done quickly.

> We need to use something like varnishncsa to log response time and time to
> first byte. This seems to require an upgrade to Varnish 3.x.

Varnish upgraded to 3.0.5
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Taras Glek (:taras) from comment #0)
> It doesn't seem the server is reporting its load/cpu/network/etc. usage to
> graphite/hosted graphite. Can that be added?

collectd has been added. See https://bugzilla.mozilla.org/show_bug.cgi?id=1001517#c21
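For reference, a rough sketch of the collectd pieces involved (this assumes collectd 5.3+ with the write_graphite plugin; the endpoint and API-key prefix below are placeholders, not the real account values):

    LoadPlugin cpu
    LoadPlugin load
    LoadPlugin interface
    LoadPlugin disk
    LoadPlugin write_graphite

    <Plugin write_graphite>
      <Node "hostedgraphite">
        # placeholder endpoint and prefix -- substitute the real Hosted Graphite values
        Host "carbon.hostedgraphite.com"
        Port "2003"
        Protocol "tcp"
        Prefix "HG-API-KEY."
      </Node>
    </Plugin>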
This PoC is going really well. However, the HD seems to be dying (resulting in crappy perf during peak times). Can we move this PoC to a box with more RAM and either an SSD or a RAID1 or something?
I'm happy to order an SSD if we don't have one.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Note, this means that cache-miss performance moves from 0.01-1s to 1-10s.

Cache hits move from <0.0001-0.001s to 0.001-10s.

This is during periods of heavy usage.
Note that in practice, the server is in production-like mode; it is relied upon. Any change done to it, including migration to another box with more RAM and an SSD, would need a smooth transition of some sort. Also note that changing the hostname under which it's available will require a puppet config change on the build slaves.
Note on Varnish performance today: I'm perfectly happy with it. It is working much better than I expected, given the crappy hardware it's running on.

I started an etherpad with notes of what a production setup would look like: https://etherpad.mozilla.org/ikFCaXUCbK . Feel free to add to it
Great to hear the POC is so successful.

When you say "production-like", do you mean that if, say, we lose a disk, we will experience tree closures? If so, my focus is on how we get this to a better state in the fewest number of steps. Let's discuss.
Given that this host is running on a single drive that is already showing warning signs of disk failure, I am *really* not sure this current setup should be allowed to remain in the critical path of the tree.

I was under the impression that this was a POC, to verify and demonstrate whether this sort of cache could have a positive impact on build time. This has been achieved successfully, right? And there are numbers to prove this conclusively, correct?

In that case, I would suggest disabling this for the time being, and taking the time needed to spec and design a proper solution with some redundancy that would not compromise the tree by introducing another single point of failure.

Ideally, as discussed before, we would also build the smarts into the client side of this equation to gracefully handle cache failure and fall back to no-cache behaviour.

As a side-note, with a single drive like that showing signs of fatigue, we could have a dead system in 20 minutes.
So if the system fails, just make the domain point at an HTTP server that returns 404s for everything (see the sketch below).

We don't have a good way to load test other than pointing all of our builders at it.
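A minimal sketch of such a blackhole, assuming nginx (any webserver that can blanket-return 404s would do):

    server {
        listen 80 default_server;
        server_name _;
        # answer 404 to everything so sccache treats every lookup as a miss
        return 404;
    }

Since the hostname stays the same, this shouldn't need any puppet change on the build slaves.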
Got a transition plan from PoC -> production. TL;DR: we're going to swap configs to point at S3, and we'll pay for S3 bandwidth while we stand up production Varnish (using S3 as a hot spare). The sccache client will be made more resilient too.


https://etherpad.mozilla.org/ikFCaXUCbK
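If the production Varnish keeps S3 as its origin, one reading of the plan is a VCL along these lines. This is a rough VCL 3.x sketch only; the bucket hostname is a placeholder and the timeouts are guesses:

    backend s3 {
        .host = "sccache-example.s3.amazonaws.com";
        .port = "80";
        .connect_timeout = 5s;
        .first_byte_timeout = 60s;
    }

    sub vcl_recv {
        set req.backend = s3;
        # S3 virtual-hosted style expects the bucket hostname, not ours
        set req.http.Host = "sccache-example.s3.amazonaws.com";
    }

The interim step of pointing the build configs straight at S3 needs no Varnish at all, which is what makes S3 workable as the hot spare.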
Moving out of Storage (as that is our NetApp-related component)
Component: Server Operations: Storage → RelOps
Product: mozilla.org → Infrastructure & Operations
QA Contact: dparsons → arich
We'll be turning this PoC off as soon as configs are changed
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Depends on: 1024651