jrgm and I are trying to understand the failure behavior of persona under load that exceeds capacity. To help us, we'd like the following:
1. zeus logs from our front-line LB (zlb.pub.scl.stage) over the last 48 hours
2. a dump of the current configuration of that load balancer (can you export this in a textual format?)
Do you need a sample or the entire logs? Are you looking for access logs? I can back up the configuration, but I would need to go through it and extract secrets. I'm not sure of the format of the backup; I'll find out.
gene: what I really want is a 1 hour window on either side of 19:30 pacific yesterday (01/30).

In terms of config, we're interested in configuration related to failing health checks and http request thresholds before nodes are considered errored, as well as failure behavior w.r.t. outstanding queries and the tunable re-introduction parameters (how long to wait, what mechanism to determine health, etc). So I thought it was easiest just to ask for zeus's equivalent of `sysctl -a` :)
The webheads depend on 2 health checks passing to be considered up and get traffic.

HTTP __heartbeat__

The minimum time between calls to a monitor.
delay: 3 seconds

The maximum runtime for an individual instance of the monitor.
timeout: 3 seconds

The number of times in a row that a node must fail execution of the monitor before it is classed as unavailable.
failures: 3

Should the monitor slowly increase the delay after it has failed?
back_off: Yes

Whether or not the monitor should emit verbose logging. This is useful for diagnosing problems.
verbose: No

The maximum amount of data to read back from a server, use 0 for unlimited.
max_response_len: 2048 bytes

Whether or not the monitor should connect using SSL.
use_ssl: No

The host header to use in the test HTTP request.
host_header:

The path to use in the test HTTP request. This must be a string beginning with a / (forward slash).
path: /__heartbeat__

A regular expression that the HTTP status code must match. If the status code doesn't matter then set this to .* (match anything).
status_regex: ^200$

The other health check is "HTTP __heartbeat__ deep check", which has the same values except for:

The path to use in the test HTTP request. This must be a string beginning with a / (forward slash).
path: /__heartbeat__?deep=true
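To make the interaction of `failures: 3` and `status_regex: ^200$` concrete, here is a rough Python sketch of how a monitor with these settings might track node state. This is not Zeus code. The names `MonitorState`, `record` and `node_is_up` are made up for illustration, and the recovery rule (a single passing check puts the node back in service) is an assumption, since the settings above do not say how re-introduction works. The `delay`, `timeout` and `back_off` settings are not modeled.

```python
import re
from dataclasses import dataclass


@dataclass
class MonitorState:
    """Hypothetical model of one health monitor watching one node."""
    failures_threshold: int = 3    # "failures: 3" from the settings above
    status_regex: str = r"^200$"   # "status_regex: ^200$" from the settings above
    consecutive_failures: int = 0
    available: bool = True

    def record(self, status_code):
        """Record one monitor run; status_code=None stands for a timeout or connect error."""
        ok = (status_code is not None
              and re.search(self.status_regex, str(status_code)) is not None)
        if ok:
            # ASSUMPTION: one passing check is enough to restore the node.
            self.consecutive_failures = 0
            self.available = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_threshold:
                self.available = False
        return self.available


def node_is_up(monitors):
    """A webhead gets traffic only while every one of its monitors is passing."""
    return all(m.available for m in monitors)


if __name__ == "__main__":
    shallow, deep = MonitorState(), MonitorState()
    for code in (503, None, 500):   # three failed deep checks in a row
        deep.record(code)
    print(node_is_up([shallow, deep]))
```

Under these assumptions, two failures followed by a success leave the node in service, and only three failures in a row take it out. Whether that matches what the LB really does is exactly what the logs should tell us.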
I've got the logs, lloyd can you point me to your gpg public key so I can encrypt and send them to you?
looking at section 10.1 here: http://support.zeus.com/zlb/media/docs/userguide.pdf

I'm curious about tuning parameters related to "Back-end failure". specifically:

tuning!max_reply_time
tuning!max_connect_time
tuning!dead_time

It may be useful to grep out 'SERIOUS' messages from the zeus logs in the specified time period? - egrep '^SERIOUS'?
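If it's easier to post-process the logs in a script than with egrep, the same filter can be sketched in Python. This only reproduces the `^SERIOUS` match; it does not narrow to the time window, because I don't know the timestamp layout of the zeus logs. The function name `serious_lines` and the sample lines are invented for illustration and are not real log output.

```python
import re

# Same pattern as egrep '^SERIOUS': the line must start with SERIOUS.
SERIOUS_RE = re.compile(r"^SERIOUS")


def serious_lines(lines):
    """Return the lines that start with SERIOUS, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if SERIOUS_RE.match(line)]


if __name__ == "__main__":
    # Invented placeholder lines, not real zeus log output.
    sample = [
        "SERIOUS placeholder one\n",
        "INFO placeholder\n",
        " SERIOUS indented, so not matched\n",
        "SERIOUS placeholder two",
    ]
    for line in serious_lines(sample):
        print(line)
```

To run it over a real file, pass an open file object, e.g. `serious_lines(open(path))`, where `path` is wherever the log ends up.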
-----BEGIN PGP PUBLIC KEY BLOCK----- Version: GnuPG v1.4.12 (Darwin) mQENBFAkJ84BCADL8wQWVJeFIMFPr44+CCuhMleiajh38RKhrb4Yql6aDRGTIrNZ UU+J/QkMqWze6jkBaCxcEoyMrrPcqUyXWtZsw3uH00KrRx3vh0ZSpM1XY9y5V1Pv uoFAtUWfnMUNCB3LmGuozPEEhu/nhHjwTDLo7Mbmj7d14tNB9RVosKQtQ6xW+50V dpUxRmu5Wvsn9ii8Y3+DrEw3xBAT29DdeilRbSK9AKwdAGZVdllQf6VxjMjQ/9E9 uVbLSU4wFoU2qCn+EuWL6m1MatwyL31V8PI2B404oytm+4Md7AojYwTQ3crDIrLG 64N/7v7tV/dYcE2oyVhidIbtNdJXP73Sjd8LABEBAAG0J0xsb3lkIEhpbGFpZWwg KFxvLykgPGxsb3lkQG1vemlsbGEuY29tPokBOAQTAQIAIgUCUCQnzgIbAwYLCQgH AwIGFQgCCQoLBBYCAwECHgECF4AACgkQZaBvZTZiJiIaXQf/btOZA5C+0LTAIaCr Jm8ockrjK/n2+1bUWPqMIdaQL3dX76fZxYB1JoTfkyCx8YuKLYoIw4syAg9YjgBP sQgtnQANbTeYZvM7NqA2VcWSenqskw5QiMXny4OnMkFgkhuzhUsnhhrV5SMgpCPY AzMQ7QQFRHew4hnf0B6X21Y/KUN8/ejLSga6GG6CwGsNHFsiETLKY7h0IanV8iWi nX4A9DCsDXoq5xw5PtKnNqYMdzzm28od3sDE8Y1iRn3qVLy3uhequBTOdkf/yI/a NA/rILsj3XFUOnvsKtfcEr1sljCfYjW6Dwfzt11PCiafV4asjhCNM1vVgVqWDRo+ pfRHgLkBDQRQJCfOAQgA3h88SajP1YjT4xnG85kFD+onJDRvVKFxq+GoxAhRLYE3 rLJk1P5dWk7l57jqKiBEj5PPZ3wWDaoMFNDG1Chjh1WsvsQ7YkI4UPF+oL3Nec65 42AQh3IGmc22ki4q/0nDlRCo9NXWKu18s50gE9nro+yqWIz2w5EBFyOvnjZnz4xH 5eCbpwx3iAnqdWAC+v8I3Kz7/H7Um87/OY4VBv3ru4cKEYtUTEejhIfol9OkolNX vbIRhTNKmC9qONv8t39oVs9J/UuLiS++LtOPZ0B2bQmzPBS9rHm8ygzbHZ94BGu4 72VY3qwvpRkNyzjmtA3xKQYXc94GijPRDjrDUVCG6QARAQABiQEfBBgBAgAJBQJQ JCfOAhsMAAoJEGWgb2U2YiYiy7AH/AnJK4ZfBAOMyXFa2vDJlx0M2gN4wURpshxs 1e9TsA/775yseo16g/HZDMgzVilpczYd9o1ikmYCrOBSLMzCbavpICRGmgqWuGgE 4okfkeAGAa96B9w+3sNe/cHbrkRQkFSaA+ybdHyYMSo9PMnLZUnPXG8V9n5zE08Y 0GUV3hkz6vghZUvSGLya7Xt63YEf4R3zES38RcSyiTD3eg5e0szx8OMFS57LAecW Rquko4hJTKvMWX6FuoDtjvIOwn/tM6HTM+YOI+w8Awkqhp1zRsTtCmrFgBCyh3jF gKG1gXWDQglg6dmiramwKqyRhO3Pf37BMLzEgcScpejYVFID/UM= =0iXt -----END PGP PUBLIC KEY BLOCK-----
also, I wonder what version of zeus we're running and where the appropriate docs are? Now I'll stop conflating distinct information requests with the original.
It looks like that link is to docs from 2004. The docs for the version we run ( https://support.riverbed.com/download.htm?filename=public/doc/stingray/trafficmanager/8.1/Stingray_8.1_User_Manual.pdf ) don't mention dead_time.

The max_reply_time is 30 seconds
The max_connect_time is 4 seconds
Connections are 'keepalive' enabled

Timeout when connecting to a node: 4 seconds
Timeout when waiting for a reply from a node: 30 seconds
I've emailed you the logs