Closed Bug 1066145 Opened 10 years ago Closed 10 years ago

Build a proof of concept for log aggregation

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1541] )

Attachments

(6 files)

From the etherpad:

* system logs (daemon logs)
  * build a POC to aggregate syslog data
  * maybe we could give money to opsec to beef up theirs
  * b2g is using papertrail (cheap) for POC; loggly is good but expensive
  * if we're sending things offsite, we need to be careful about data (buildbot and kickstart passwords in the logs, etc.)
  * share with b2g, and maybe all of mozilla (does mmayo/services have something already? opsec has something)
* build logs - large volume; we've done this smaller scale for elmo
* buildbot master logs - do not currently go to rsyslog, but can
* later: http server logs, hg / git server logs
* later: other releng daemons (self-serve, etc.)
* we want structured logging so we don't have to do complex parsing - ateam is uploading them to blobber
* many many other logs (after we get a POC that works, we need to enumerate) - bb masters first
I'm going to get started by putting knobs into PuppetAgain to control where everything systemwide sends its syslog data, and then start working on dumping other data into syslog. I'll initially point that at a test syslog box in relabs, just to see the data coming in. In parallel, I'll talk to services, b2g, and opsec to see what they're doing, and look at loggly and papertrail. I'm hesitant to start sending data to an external service since it may include PII, but I'm also hesitant to host something locally.
OK, I have something simple set up in puppet, running on hp1. It just dumps all the logs in one file and rotates it quickly. I'm also setting up a simple log_aggregator::client class that will send all of syslog to the log aggregation host. That works on CentOS and I'm working on OS X and Ubuntu now. Flows are going to be problematic, though -- nothing has a flow to 514/udp on hp1. I can request a flow, but that's a huge chunk of config for a temporary POC. So I'm not sure how best to proceed there.
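On the Linux side, the client class really just drops a forwarding rule into rsyslog's config. A minimal sketch of the file it would manage (the path is illustrative, not necessarily what the patch uses), assuming plain UDP syslog forwarding:

    # /etc/rsyslog.d/log_aggregator.conf (illustrative path)
    # send everything to the aggregation host over 514/udp; a single @ means UDP
    *.* @hp1.relabs.releng.scl3.mozilla.com

OS X doesn't run rsyslog, so the class has to handle the equivalent through the native syslogd config there.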
Jake, since you've configured rsyslog before, can you have a look at https://github.com/djmitche/releng-puppet/commit/e26fd5bd7122126cb56ed9814e6930757966e3fc and let me know what you think? I don't want to get too deeply into optimizing aggregation just yet -- I mostly just want to be able to see logs pouring in, to know I could aim them somewhere more useful if desired. Still, any easy improvements you suggest would be cool.
Flags: needinfo?(jwatkins)
Depends on: 1068334
Flags: needinfo?(jwatkins)
Attachment #8490750 - Flags: review?(jwatkins)
Attached patch bug1066145-client.patch (Splinter Review)
Attachment #8490751 - Flags: review?(jwatkins)
I tested these across OS X, Ubuntu, and CentOS. OS X, especially, is super-chatty, but we'll see how that turns out. A few things I'd like:
* an easy way to capture messages-per-second metrics, so once this lands I can make predictions about total message rate (not all sources, notably AWS, have flows to hp1)
* always show the fqdn (preferably handled on the client side, rather than by the aggregator doing a lot of DNS)

So if either of those is easy, let me know.
Attachment #8490750 - Flags: review?(jwatkins) → review+
Attachment #8490751 - Flags: review?(jwatkins) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> A few things I'd like:
> * easy way to capture messages per second metrics, so once this lands I can
> make predictions about total message rate (not all sources, notably AWS,
> have flows to hp1)
> * always show fqdn (preferably handled on the client side, rather than by
> the aggregator doing a lot of DNS)
>
> So if either of those is easy, let me know.

This looks really good. I know rsyslog is very picky about conf order; that is always a bum lesson to learn while troubleshooting.

As for message rate metrics, I don't know of anything off the shelf, but I think the best route would be a collectd plugin (might be biased here). You'd probably need to write one. First thought is to have the aggregate server dup all incoming messages to a unix socket where the collectd plugin would listen and count. I'm assuming rsyslog doesn't have a way to query the msg rate; if it does, use that. :-)

And regarding FQDN, this is a common problem. From my experience the %HOSTNAME% var is what gets sent from the client, while %FROMHOST% is the lookup of the incoming server's name. If you can get the rsyslog client to set this to its FQDN, it gets passed along to the server. Also, I would try using '$PreserveFQDN on' in both client and server rsyslog configs to prevent the %HOSTNAME% from having its domain stripped. You can also debug with a format using the raw message property: dump that to a file and examine exactly what is being sent and received in whole.
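To make that concrete, here's roughly what I'd put in the configs (legacy-style directives, untested here; the raw-message property is rawmsg, if memory serves):

    # both client and server rsyslog.conf
    $PreserveFQDN on    # keep the full hostname rather than stripping the domain

    # server-side only: dump exactly what arrives, for debugging
    $template RawFormat,"%rawmsg%\n"
    *.* /var/log/raw-debug.log;RawFormat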
See Also: → 926466
Everything I'm hearing so far says "learn your log volume", and I think at the moment that breaks down into:
* current syslog
  * servers
  * slaves
* auditd
  * servers
  * slaves
* other log sources
  * load-balancer access logs or webhead access logs
  * buildbot master twistd.log
  * buildbot slave twistd.log
  * other services (probably a very small fraction of volume)
    * slaveapi
    * vcs-sync
    * relengapi
    * buildapi
    * etc.

So, when I'm back after next week I'll see if I can come up with some order-of-magnitude estimates for those values. Once that's in place we can talk about sending events to Heka, syslog.*.mozilla.com, and/or our own aggregator; and about dumping some or all of the results into mozdef's ES, or our own ES. And how much "our own" would cost in each case.
Next step is to land the two patches on this bug, but I don't want to do that (especially the client patch) on a Friday.
Attachment #8490750 - Flags: checked-in+
Attachment #8490751 - Flags: checked-in+
Attached patch bug1066145.patch (Splinter Review)
OK, let's turn this thing on and see what happens. I've added comments for the additional load calculations. Assuming hp1 doesn't explode, I'll leave this on for a few days and get an initial estimate, then for the rest of the week to get a more balanced estimate. I'll throw out the first 24h anyway, since it'll take a while to "soak in". In testing this I noted that when hp1 is down or inaccessible, the sending hosts don't care, so there's no risk to production here.
Attachment #8501005 - Flags: review?(jwatkins)
relabs-puppet2.relabs.releng.scl3.mozilla.com isn't doing any *local* logging (to /var/log/messages). So something's funky and will need investigation before deploying attachment 8501005 [details] [diff] [review].
Attached patch bug1066145-p1.patch (Splinter Review)
Attachment #8501279 - Flags: review?(jwatkins)
Attachment #8501005 - Flags: review?(jwatkins) → review+
Attachment #8501279 - Flags: review?(jwatkins) → review+
Attachment #8501279 - Flags: checked-in+
Attachment #8501005 - Flags: checked-in+
I'll keep an eye on this over the next few days
Component: General Automation → RelOps: Puppet
Product: Release Engineering → Infrastructure & Operations
QA Contact: catlee → dustin
Version: unspecified → other
A Pivotal Tracker story has been created for this Bug: https://www.pivotaltracker.com/story/show/80405738
Judging by yesterday's data, we're looking at about 17 million entries per day, which comes out to about 2GB. These figures are *very* loose, though - even within host types I saw differences greater than an order of magnitude, for example:

bld-lion-r5-014   4143
bld-lion-r5-015  91788
bld-lion-r5-016   8923
bld-lion-r5-017   5626

A week's worth of data will hopefully be a bit more stable.
Here are the raw numbers for Oct 9-15, inclusive (604800 seconds):

aws-manager1 60958
bld-lion-r5-014 53685
bld-lion-r5-015 143861
bld-lion-r5-016 75753
bld-lion-r5-017 35181
b-linux64-hp-0018 99742
b-linux64-hp-0019 104900
b-linux64-hp-0020 19675
b-linux64-hp-0021 0
b-linux64-ix-0001 81381
b-linux64-ix-0002 83599
buildbot-master81 33090
buildbot-master82 26301
dev-master1 13082
mac-signing1 0
mac-signing2 62688
mac-signing3 63336
mac-signing4 107008
mac-v2-signing1 55716
mac-v2-signing2 56243
mac-v2-signing3 61954
proxxy1 2985
releng-puppet1 106494
releng-puppet2 1242408
signing4 117182
signing5 5325
signing6 5996
slaveapi1 22907
slaveapi-dev1 18664
talos-linux32-ix-037 8114
talos-linux32-ix-038 11530
talos-mtnlion-r5-033 51171
talos-mtnlion-r5-034 51858
t-snow-r4-0026 284806
t-snow-r4-0027 198190

I'll apply the relevant multipliers to that and come up with an overall volume estimate.
> aws-manager1 60958
> proxxy1 2985
> dev-master1 13082
> slaveapi1 22907
> slaveapi-dev1 18664

total: 118596

> bld-lion-r5-014 53685
> bld-lion-r5-015 143861
> bld-lion-r5-016 75753
> bld-lion-r5-017 35181

average: 77120
multiplier: 90
total: 6940800

> b-linux64-hp-0018 99742
> b-linux64-hp-0019 104900
> b-linux64-hp-0020 19675

average: 74772
multiplier: 34
total: 2542248

> b-linux64-ix-0001 81381
> b-linux64-ix-0002 83599

average: 82490
multiplier: 27
total: 2227230

> buildbot-master81 33090
> buildbot-master82 26301

average: 29695
multiplier: 66
total: 1959870

> mac-signing2 62688
> mac-signing3 63336

average: 63012
multiplier: 4
total: 252048

> mac-v2-signing1 55716
> mac-v2-signing2 56243
> mac-v2-signing3 61954

average: 57971
multiplier: 3
total: 173913

> releng-puppet1 106494
> releng-puppet2 1242408

average: 674451
multiplier: 6
total: 4046706

> signing4 117182
> signing5 5325
> signing6 5996

average: 42834
multiplier: 3
total: 128503

> talos-linux32-ix-037 8114
> talos-linux32-ix-038 11530

average: 29986
multiplier: 201
total: 6027186

> talos-mtnlion-r5-033 51171
> talos-mtnlion-r5-034 51858

average: 51514
multiplier: 99
total: 5099886

> t-snow-r4-0026 284806
> t-snow-r4-0027 198190

average: 241498
multiplier: 165
total: 39847170

Grand total: 69364156
Per day: about 10 million
Per second: about 110 events/sec

As for space, assuming those 7 days are a representative sample of event sizes, the average is 117.5 bytes. For 10 million events, that's about 1.2GB per day.

Honestly, those numbers are *far* smaller than I thought. Even with a healthy 50% buffer for data not currently in syslog, we're still looking at less than 200 events/sec and less than 2GB/day of storage.
Jeff, these numbers are pretty small. Could we just point this at MozDef? As an experiment, could I point the limited sample I used above at it? That sample works out to about a half-million events/day, or 5 per second.
Flags: needinfo?(jbryner)
Yeah, those are pretty small. We do about that in our QA instance alone, so I'd say it's safe to just point it at mozdef, especially if starting with the sample. There are lots of ways to get from syslog to mozdef, including heka, nxlog, and syslog/rsyslog->rabbitmq. Any preferences on a log shipper?
Flags: needinfo?(jbryner)
I want the first level of logging (from clients) to be syslog over UDP so that we don't impact production systems during network events. I expect those packets will be aimed at a server per DC/region, but as far as what runs on that server, what it talks to upstream, and what protocol it uses, I am agnostic and open to suggestions of best practices. If you have an example to point at (in infra puppet?) or docs to read, that'd be great.
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/433]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/433] → [kanban:engops:https://kanbanize.com/ctrl_board/6/438]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/438]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/519]
See Also: → 1091286
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/519] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1527] [kanban:engops:https://kanbanize.com/ctrl_board/6/519]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1527] [kanban:engops:https://kanbanize.com/ctrl_board/6/519] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1528] [kanban:engops:https://kanbanize.com/ctrl_board/6/519]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1528] [kanban:engops:https://kanbanize.com/ctrl_board/6/519] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1535] [kanban:engops:https://kanbanize.com/ctrl_board/6/519]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1535] [kanban:engops:https://kanbanize.com/ctrl_board/6/519] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1541] [kanban:engops:https://kanbanize.com/ctrl_board/6/519]
Hey jeff, still waiting for some info to connect this to MozDef. I'm not in a huge hurry, just want it on your radar :)
Flags: needinfo?(jbryner)
Yeah, sorry for the delay. Been working through updating our config to a recent heka (v7), ran into issues, and am now targeting v8. Good news is the bugs are fixed.

One important factor will be the timezone. Heka assumes/requires UTC; i.e., if your logs don't include timezone-enriched timestamps, it will assume they are UTC. There are some workarounds, but do you know what the timezone/stamp situation looks like currently?
Flags: needinfo?(jbryner)
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1541] [kanban:engops:https://kanbanize.com/ctrl_board/6/519] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1541]
All of our systems operate in US/Pacific and all of our log entries are in that timezone. I think rsyslog can fix that - http://www.rsyslog.com/doc/master/configuration/timezone.html suggests so, anyway.
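I haven't tested it, but from that doc page it looks like you declare the local zone IDs so rsyslog can attach a real offset when parsing timestamps. Something like the following, which needs a fairly recent rsyslog, and the exact syntax is taken from the docs rather than a running config:

    # untested sketch, per the rsyslog timezone docs
    timezone(id="PST" offset="-08:00")
    timezone(id="PDT" offset="-07:00")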
Another option is to log date/time in ISO format and parse the timestamp for timezone offset. Can I get a gander at some sample log files/lines?
Sure! Random sample of two lines from /opt/aggregator/log/aggregate.log on hp1.relabs.releng.scl3.mozilla.com:

Nov 10 08:34:05 buildbot-master82 audit-cef: CEF:0|Unix|auditd|2|EXECVE|Unix Exec|3|end=141563 dhost=buildbot-master82.srv.releng.scl3.mozilla.com msg= pid\=23450 gid\=0 euid\=0 suid\=0 fsuid\=0 egid\=0 sgid\=0 fsgid\=0 ses\=107713 dproc=/bin/bash cs4Label=TTY cs4=(none) duid=0 duser=root cn1Label=auid cn1=0 suser=root cs5Label=ParentProcess cs5=sh cs3Label=AuditKey cs3=exec
Nov 10 08:34:05 buildbot-master82 audit-cef: CEF:0|Unix|auditd|2|EXECVE|Unix Exec|3|end=141563 dhost=buildbot-master82.srv.releng.scl3.mozilla.com msg= pid\=23451 gid\=0 euid\=0 suid\=0 fsuid\=0 egid\=0 sgid\=0 fsgid\=0 ses\=107713 dproc=/usr/bin/getopt cs4Label=TTY cs4=(none) duid=0 duser=root cn1Label=auid cn1=0 suser=root cs5Label=ParentProcess cs5=virt-what cs3Label=AuditKey cs3=exec

I can get you access to that host if it would help -- it's the proof-of-concept aggregator. Let me know if that's useful -- I don't want to get you more involved than you would like!
Ahh, it's CEF -- sorry, I didn't realize that. In that case, heka is out, as we haven't figured out how to parse CEF in lua. We do have python to parse CEF though, and it's easier to manage timezone/stamp differences there. Some starters:

https://github.com/jeffbryner/MozDef/blob/master/lib/cefTail2Mozdef.py
https://github.com/jeffbryner/MozDef/blob/master/lib/cefFile2Mozdef.py
Well, those two log messages happen to be CEF, yeah, maybe that was misleading. If that stuff's useless in Heka, we'd probably just filter it out (sending it to the datacenter syslog servers only). Here are some more representative log entries:

> Nov 10 09:15:02 aws-manager1 postfix/cleanup[369]: 8A04F1619CF: message-id=<20141110171502.8A04F1619CF@aws-manager1.srv.releng.scl3.mozilla.com>
> Nov 10 09:15:02 aws-manager1 postfix/qmgr[31070]: 8A04F1619CF: from=<root@aws-manager1.srv.releng.scl3.mozilla.com>, size=840, nrcpt=1 (queue active)
> Nov 10 09:15:02 aws-manager1 postfix/local[371]: warning: dict_nis_init: NIS domain name not set - NIS lookups disabled
Ahh, yes that's a good plan. The other thing we did to help parsing was to move to tab-delimited logs, so:

timestamp<tab>hostname<tab>program<tab>processid<tab>message

which makes it much easier/faster/cheaper to pick out the fields in heka. Is that possible?
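On an rsyslog box (wherever the file actually gets written) that's just a custom output template, roughly like the following. Untested, and I'd want to double-check that the \t escape is honored in a string template; worst case it becomes a list-type template with constant() entries:

    $template TabDelimited,"%timestamp%\t%hostname%\t%programname%\t%procid%\t%msg%\n"
    *.* /var/log/aggregate-tabs.log;TabDelimited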
I think so -- I believe that would require specific configuration on the sending hosts, so I'll need to futz across operating systems, but it should work.
Actually, that's pretty hard. We're not using rsyslog on OS X (and who knows what we'd use on Windows), so there's not a lot of capability to do arbitrary text processing at that point. From what I understand of syslog, an app opens up /dev/log and writes a formatted message:

    [root@hp1.relabs.releng.scl3.mozilla.com ~]# strace -esendto logger hi
    sendto(1, "<13>Nov 21 12:18:23 root: hi", 28, MSG_NOSIGNAL, NULL, 0) = 28

On the wire, it looks like this:

    12:34:01.401192 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 268)
        buildbot-master82.srv.releng.scl3.mozilla.com.45200 > hp1.relabs.releng.scl3.mozilla.com.syslog: SYSLOG, length: 240
        Facility cron (9), Severity info (6)
        Msg: Nov 21 12:34:01 buildbot-master82 CROND[18154]: (root) CMD (. /usr/local/bin/proxy_reset_env.sh && /usr/bin/puppet agent --enable >/dev/null 2>/dev/null && /usr/bin/puppet agent --onetime --no-daemonize --logdest=syslog >/dev/null 2>&1)
        0x0000:  4500 010c 0000 4000 3f11 a85c 0a1a 3034  E.....@.?..\..04
        0x0010:  0a1a 4e1d b090 0202 00f8 c055 3c37 383e  ..N........U<78>
        0x0020:  4e6f 7620 3231 2031 323a 3334 3a30 3120  Nov.21.12:34:01.
        0x0030:  6275 696c 6462 6f74 2d6d 6173 7465 7238  buildbot-master8
        0x0040:  3220 4352 4f4e 445b 3138 3135 345d 3a20  2.CROND[18154]:.
        0x0050:  2872 6f6f 7429 2043 4d44 2028 2e20 2f75  (root).CMD.(../u
        0x0060:  7372 2f6c 6f63 616c 2f62 696e 2f70 726f  sr/local/bin/pro
        0x0070:  7879 5f72 6573 6574 5f65 6e76 2e73 6820  xy_reset_env.sh.
        0x0080:  2626 202f 7573 722f 6269 6e2f 7075 7070  &&./usr/bin/pupp
        0x0090:  6574 2061 6765 6e74 202d 2d65 6e61 626c  et.agent.--enabl
        0x00a0:  6520 3e2f 6465 762f 6e75 6c6c 2032 3e2f  e.>/dev/null.2>/
        0x00b0:  6465 762f 6e75 6c6c 2026 2620 2f75 7372  dev/null.&&./usr
        0x00c0:  2f62 696e 2f70 7570 7065 7420 6167 656e  /bin/puppet.agen
        0x00d0:  7420 2d2d 6f6e 6574 696d 6520 2d2d 6e6f  t.--onetime.--no
        0x00e0:  2d64 6165 6d6f 6e69 7a65 202d 2d6c 6f67  -daemonize.--log
        0x00f0:  6465 7374 3d73 7973 6c6f 6720 3e2f 6465  dest=syslog.>/de
        0x0100:  762f 6e75 6c6c 2032 3e26 3129            v/null.2>&1)

RFC5424 claims there should be some structured data after the ">", but I certainly don't see any there. So basically we have a priority byte and some text that is uninhibited by the standards process.

Comparing its input (/dev/log) and output (network), it seems that rsyslog is doing some parsing of that text -- stripping the datestamp and inserting the hostname. I guess I just never realized how unstructured log data really is. I don't see any good place to try to tab-delimit things that's not going to be more work than simply parsing the data in the Heka ingestor.

Anyway, we should probably talk about all of this at the workweek and see what the base requirements are, so we can get something working and then improve on it.
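(For what it's worth, that priority byte is just facility * 8 + severity: the logger test is user.notice, 1 * 8 + 5 = 13, hence the <13> in the strace output; and the cron message on the wire is cron.info, 9 * 8 + 6 = 78, matching the <78> in the tcpdump decode.)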
Yeah, let's talk at the work week. Windows stuff is actually pretty straightforward using nxlog and bypassing syslog altogether, since syslog just flattens a nice json data structure.
Syslog data is a formless mush and rsyslogd is an abomination, so we need to get data out of that format as quickly as possible. Heka's a bit weird too, but having read both rsyslog and Heka docs today, sign me up for Heka any day. And it's easy to install (single RPM, no deps.. I love Go!)

One of Heka's quirks is that it doesn't speak syslog at all. The closest it gets is snarfing logfiles as written by tools like rsyslog. So the plan is this:
* Configure rsyslog on the aggregation hosts to write logs (to a ramdisk to avoid spinning rust) in a format containing as much information as possible, including a parsed timestamp, hostname info, severity, and facility (a rough sketch of that config is below).
* Redirect all CEF data from auditd off to the per-datacenter syslog servers, NOT to Heka.
* Snarf those logfiles with Heka.
  * Later we can add other input mechanisms, too, including HTTP POST, to get logs from other sources like masters' twistd.logs.
* Filter appropriately in Heka.
  * I can already see just from tailing the logs that >50% of our logs are useless spam.
* Send the results to MozDef via AMQP.
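Roughly what I have in mind for the aggregator-side rsyslog, covering the first two bullets. This is only a sketch: the filenames are placeholders, the destination for the CEF traffic is illustrative, and the audit-cef match and the discard idiom would need testing:

    # divert auditd CEF (tagged audit-cef in the samples above) to the per-DC
    # syslog servers and drop it from further processing here
    :programname, isequal, "audit-cef" @syslog1.private.scl3.mozilla.com
    & ~

    # everything else: parse-friendly lines on a ramdisk for Heka to pick up
    $template ForHeka,"%timegenerated:::date-rfc3339% %fromhost% %syslogfacility-text%.%syslogseverity-text% %syslogtag%%msg%\n"
    *.* /dev/shm/aggregator/for-heka.log;ForHeka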
OK, we got in a little deep. The new approach is to add *.* @syslog1.private.scl3.mozilla.com on all client hosts. No releng-specific aggregation systems, no new tools, nothing. If there are problems, we'll look for minimal solutions to them.
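For reference, that single line is the whole client-side change; on the rsyslog hosts Puppet just drops it into a config snippet (the path below is illustrative), and the single @ means UDP forwarding (@@ would be TCP), which keeps the fire-and-forget behaviour we want during network events:

    # /etc/rsyslog.d/forward.conf (illustrative path)
    *.* @syslog1.private.scl3.mozilla.com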
Attachment #8533892 - Flags: review?(jwatkins)
Comment on attachment 8533892 [details] [diff] [review]
bug1066145-syslog-private-scl3.patch

++ to simplicity
Attachment #8533892 - Flags: review?(jwatkins) → review+
Attachment #8533892 - Flags: checked-in+
Going to wait a bit on this and then see what's showing up in MozDef.
Well, nothing showed up, but if I had squinted at the syslog1.private.scl3 config I would have seen that -- it was dumping the log entries on disk locally, but not forwarding them to MozDef. That said, it looks like we're going to be using an external service instead, so I'm going to revert the changes for this bug and close it up.
Attached patch bug1066145-revert.patch (Splinter Review)
Note that this leaves the rsyslog improvements in place, but reverts the rest back to where only puppetmasters were sending their syslog upstream.
Attachment #8537829 - Flags: review?(jwatkins)
Comment on attachment 8537829 [details] [diff] [review]
bug1066145-revert.patch

Sorry, I totally missed this before the holiday.
Attachment #8537829 - Flags: review?(jwatkins) → review+
Attachment #8537829 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Blocks: 1118181