Closed Bug 836823 Opened 11 years ago Closed 11 years ago

Zeus logs and configuration in persona staging requirement requested

Categories

(Cloud Services :: Operations: Metrics/Monitoring, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lhilaiel, Assigned: gene)

Details

jrgm and I are trying to understand failure behavior of persona under load which exceeds capacity.  To help us, we'd like the following:

1. zeus logs from our front-line LB over the last 48 hours - (zlb.pub.scl.stage) 
2. a dump of the current configuration of that load balancer (can you export this in a textual format?)
Do you need a sample or the entire logs? 
Are you looking for access logs?
I can back up the configuration but would need to go through and extract secrets. I'm not sure of the format of the backup; I'll find out.
Flags: needinfo?(lhilaiel)
gene: what I really want is a 1 hour window on either side of 19:30 pacific yesterday (01/30)
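
For pulling the matching slice of logs, the requested window can be written out as explicit timestamps. This is a minimal sketch, assuming the year (not stated in this bug; 2013 below is only a placeholder) and assuming late January falls under Pacific Standard Time (UTC-8). The function name `incident_window` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 01/30 is in winter, so Pacific time is PST (UTC-8), not PDT.
PST = timezone(timedelta(hours=-8), "PST")


def incident_window(year, month=1, day=30, hour=19, minute=30, margin_hours=1):
    """Return (start, end) for a window of margin_hours on either side of the given time."""
    center = datetime(year, month, day, hour, minute, tzinfo=PST)
    margin = timedelta(hours=margin_hours)
    return center - margin, center + margin


# The year is a placeholder assumption; substitute the real one.
start, end = incident_window(2013)
print(start.isoformat(), "to", end.isoformat())
print(start.astimezone(timezone.utc).isoformat(), "to",
      end.astimezone(timezone.utc).isoformat())
```

Printing both the local and UTC forms helps if the load balancer logs in a different timezone than the one used in the request.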

in terms of config, we're interested in configuration related to failing health checks, and http request thresholds before nodes are considered errored, as well as failure behavior w.r.t outstanding queries, and the tunable re-introduction parameters (how long to wait, what mechanism to determine health, etc).

So I thought it was easiest just to ask for zeus's equivalent of `sysctl -a` :)
Flags: needinfo?(lhilaiel)
The webheads depend on 2 health checks passing to be considered up and get traffic:

HTTP __heartbeat__
The minimum time between calls to a monitor.
	delay: 3 seconds

The maximum runtime for an individual instance of the monitor.
	timeout: 3 seconds

The number of times in a row that a node must fail execution of the monitor before it is classed as unavailable.
	failures: 3

Should the monitor slowly increase the delay after it has failed?
	back_off: Yes

Whether or not the monitor should emit verbose logging. This is useful for diagnosing problems.
	verbose: No

The maximum amount of data to read back from a server, use 0 for unlimited.
	max_response_len: 2048 bytes

Whether or not the monitor should connect using SSL.
	use_ssl: No

The host header to use in the test HTTP request.
	host_header:

The path to use in the test HTTP request. This must be a string beginning with a / (forward slash).
	path: /__heartbeat__

A regular expression that the HTTP status code must match. If the status code doesn't matter then set this to .* (match anything).
	status_regex: ^200$

 
The other health check is:

HTTP __heartbeat__ deep check

which has the same values except for:

The path to use in the test HTTP request. This must be a string beginning with a / (forward slash).
	path:   	/__heartbeat__?deep=true
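
To make the monitor settings above concrete, here is a rough model of the rules as described: a probe passes when the status code matches `status_regex`, a node is classed unavailable after `failures` consecutive failed probes, and a webhead gets traffic only while both monitors consider it up. This is a sketch of the described semantics, not Zeus's actual implementation; all function names are hypothetical, and `back_off` and recovery behavior are deliberately left out because their details are not given here.

```python
import re


def probe_passes(status_code, status_regex=r"^200$"):
    """A probe passes when the HTTP status code matches the configured regex."""
    return re.search(status_regex, str(status_code)) is not None


def first_unavailable_probe(results, failures=3):
    """Return the 1-based index of the probe at which the node is classed
    unavailable (after `failures` failed probes in a row), or None."""
    streak = 0
    for index, ok in enumerate(results, start=1):
        streak = 0 if ok else streak + 1
        if streak >= failures:
            return index
    return None


def node_gets_traffic(monitor_states):
    """A webhead receives traffic only if every monitor considers it up."""
    return all(monitor_states.values())


# Hypothetical sequence of status codes seen by one monitor.
codes = [200, 503, 503, 200, 503, 503, 503]
results = [probe_passes(code) for code in codes]
print(first_unavailable_probe(results))
print(node_gets_traffic({"heartbeat": True, "deep": False}))
```

With `delay: 3 seconds` and `failures: 3`, a node that starts failing would need at least three probe cycles before being pulled, and longer if probes run up to the 3 second timeout.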
I've got the logs, lloyd can you point me to your gpg public key so I can encrypt and send them to you?
Status: NEW → ASSIGNED
looking at section 10.1 here: http://support.zeus.com/zlb/media/docs/userguide.pdf

I'm curious about tuning parameters related to "Back-end failure".  specifically:

tuning!max_reply_time
tuning!max_connect_time
tuning!dead_time

It may be useful to grep 'SERIOUS' messages out of the zeus logs for the specified time period, e.g. egrep '^SERIOUS'?
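
The same filter as `egrep '^SERIOUS'` can be sketched in Python, in case the lines need further processing afterwards. The sample lines below are made up for illustration and are not real Zeus log output; the helper name `serious_lines` is hypothetical.

```python
import re

# Same anchor as egrep '^SERIOUS': only lines that begin with SERIOUS.
SERIOUS = re.compile(r"^SERIOUS")


def serious_lines(lines):
    """Keep only lines whose first token starts with SERIOUS."""
    return [line for line in lines if SERIOUS.search(line)]


# Invented sample lines; the real log format may differ.
sample = [
    "INFO\texample informational line",
    "SERIOUS\texample failure line",
    "WARN\tmentions SERIOUS mid-line",
]
for line in serious_lines(sample):
    print(line)
```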
-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1.4.12 (Darwin)

mQENBFAkJ84BCADL8wQWVJeFIMFPr44+CCuhMleiajh38RKhrb4Yql6aDRGTIrNZ
UU+J/QkMqWze6jkBaCxcEoyMrrPcqUyXWtZsw3uH00KrRx3vh0ZSpM1XY9y5V1Pv
uoFAtUWfnMUNCB3LmGuozPEEhu/nhHjwTDLo7Mbmj7d14tNB9RVosKQtQ6xW+50V
dpUxRmu5Wvsn9ii8Y3+DrEw3xBAT29DdeilRbSK9AKwdAGZVdllQf6VxjMjQ/9E9
uVbLSU4wFoU2qCn+EuWL6m1MatwyL31V8PI2B404oytm+4Md7AojYwTQ3crDIrLG
64N/7v7tV/dYcE2oyVhidIbtNdJXP73Sjd8LABEBAAG0J0xsb3lkIEhpbGFpZWwg
KFxvLykgPGxsb3lkQG1vemlsbGEuY29tPokBOAQTAQIAIgUCUCQnzgIbAwYLCQgH
AwIGFQgCCQoLBBYCAwECHgECF4AACgkQZaBvZTZiJiIaXQf/btOZA5C+0LTAIaCr
Jm8ockrjK/n2+1bUWPqMIdaQL3dX76fZxYB1JoTfkyCx8YuKLYoIw4syAg9YjgBP
sQgtnQANbTeYZvM7NqA2VcWSenqskw5QiMXny4OnMkFgkhuzhUsnhhrV5SMgpCPY
AzMQ7QQFRHew4hnf0B6X21Y/KUN8/ejLSga6GG6CwGsNHFsiETLKY7h0IanV8iWi
nX4A9DCsDXoq5xw5PtKnNqYMdzzm28od3sDE8Y1iRn3qVLy3uhequBTOdkf/yI/a
NA/rILsj3XFUOnvsKtfcEr1sljCfYjW6Dwfzt11PCiafV4asjhCNM1vVgVqWDRo+
pfRHgLkBDQRQJCfOAQgA3h88SajP1YjT4xnG85kFD+onJDRvVKFxq+GoxAhRLYE3
rLJk1P5dWk7l57jqKiBEj5PPZ3wWDaoMFNDG1Chjh1WsvsQ7YkI4UPF+oL3Nec65
42AQh3IGmc22ki4q/0nDlRCo9NXWKu18s50gE9nro+yqWIz2w5EBFyOvnjZnz4xH
5eCbpwx3iAnqdWAC+v8I3Kz7/H7Um87/OY4VBv3ru4cKEYtUTEejhIfol9OkolNX
vbIRhTNKmC9qONv8t39oVs9J/UuLiS++LtOPZ0B2bQmzPBS9rHm8ygzbHZ94BGu4
72VY3qwvpRkNyzjmtA3xKQYXc94GijPRDjrDUVCG6QARAQABiQEfBBgBAgAJBQJQ
JCfOAhsMAAoJEGWgb2U2YiYiy7AH/AnJK4ZfBAOMyXFa2vDJlx0M2gN4wURpshxs
1e9TsA/775yseo16g/HZDMgzVilpczYd9o1ikmYCrOBSLMzCbavpICRGmgqWuGgE
4okfkeAGAa96B9w+3sNe/cHbrkRQkFSaA+ybdHyYMSo9PMnLZUnPXG8V9n5zE08Y
0GUV3hkz6vghZUvSGLya7Xt63YEf4R3zES38RcSyiTD3eg5e0szx8OMFS57LAecW
Rquko4hJTKvMWX6FuoDtjvIOwn/tM6HTM+YOI+w8Awkqhp1zRsTtCmrFgBCyh3jF
gKG1gXWDQglg6dmiramwKqyRhO3Pf37BMLzEgcScpejYVFID/UM=
=0iXt
-----END PGP PUBLIC KEY BLOCK-----
also, I wonder what version of zeus we're running and where the appropriate docs are?  Now I'll stop conflating distinct information requests with the original.
It looks like that link is to docs from 2004. The version we run ( https://support.riverbed.com/download.htm?filename=public/doc/stingray/trafficmanager/8.1/Stingray_8.1_User_Manual.pdf ) doesn't mention dead_time.

The max_reply_time is 30 seconds
The max_connect_time is 4 seconds

Connections are 'keepalive' enabled
Timeout when connecting to a node: 4 seconds
Timeout when waiting for a reply from a node: 30 seconds
I've emailed you the logs
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED