Build a proof of concept for log aggregation



5 years ago
4 years ago


(Reporter: dustin, Assigned: dustin)



(Whiteboard: [kanban:engops:] )


(6 attachments)

From the etherpad:

    system logs (daemon logs)

    build a POC to aggregate syslog data

    maybe we could give money to opsec to beef up theirs

    b2g is using papertrail (cheap) for POC, loggly is good but expensive 

    if we're sending things offsite, we need to be careful about data (buildbot and kickstart passwords in the logs, etc)

    share with b2g, and maybe all of mozilla (does mmayo/services have something already, opsec has something) +++

    build logs - large volume, we've done this smaller scale for elmo

    buildbot master logs - do not currently go to rsyslog, but can

    later: http server logs, hg / git server logs

    later: other releng daemons (self-serve, etc.)

    we want structured logging so we don't have to do complex parsing - ateam is uploading them to blobber

    many many other logs (after we get a POC that works, we need to enumerate) - bb masters first
Depends on: 1065677
I'm going to get started by putting knobs into PuppetAgain to control where everything systemwide sends its syslog data, and then start working on dumping other data into syslog.

I'll initially point that at a test syslog box in relabs, just to see the data coming in.

In parallel, I'll talk to services, b2g, and opsec to see what they're doing, and look at loggly and paperchase.  I'm hesitant to start sending data to an external service since it may include PII, but I'm also hesitant to host something locally.
OK, I have something simple set up in puppet, running on hp1.  It just dumps all the logs in one file and rotates it quickly.  I'm also setting up a simple log_aggregator::client class that will send all of syslog to the log aggregation host.  That works on CentOS and I'm working on OS X and Ubuntu now.

Flows are going to be problematic, though -- nothing has a flow to 514/udp on hp1.  I can request a flow, but that's a huge chunk of config for a temporary POC.  So I'm not sure how best to proceed there.
Jake, since you've configured rsyslog before, can you have a look at
and let me know what you think?  I don't want to get too deeply into optimizing aggregation just yet -- I mostly just want to be able to see logs pouring in, to know I could aim them somewhere more useful if desired.  Still, any easy improvements you suggest would be cool.
Flags: needinfo?(jwatkins)
Depends on: 1068334
Flags: needinfo?(jwatkins)
Attachment #8490751 - Flags: review?(jwatkins)
I tested these across OS X, Ubuntu, and CentOS.  OS X, especially, is super-chatty.  But we'll see how that turns out.

A few things I'd like:
 * easy way to capture messages per second metrics, so once this lands I can make predictions about total message rate (not all sources, notably AWS, have flows to hp1)
 * always show fqdn (preferably handled on the client side, rather than by the aggregator doing a lot of DNS)

So if either of those is easy, let me know.
Attachment #8490750 - Flags: review?(jwatkins) → review+
Attachment #8490751 - Flags: review?(jwatkins) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> A few things I'd like:
>  * easy way to capture messages per second metrics, so once this lands I can
> make predictions about total message rate (not all sources, notably AWS,
> have flows to hp1)
>  * always show fqdn (preferably handled on the client side, rather than by
> the aggregator doing a lot of DNS)
> So if either of those is easy, let me know.

This looks really good.  I know rsyslog is very picky about conf order.  That is always a bum lesson to learn while troubleshooting.

As for message rate metrics, I don't know of anything off the shelf but I think the best route would be a collectd plugin.  (might be biased here)  You'd probably need to write one. First thought is to have the aggregate server dup all incoming messages to a unix socket where the collectd plugin would listen and count.  I'm assuming rsyslog doesn't have a way to query the msg rate.  Otherwise if it does, use that. :-)

And regarding FQDN, this is a common problem.  From my experience the %HOSTNAME% var is what gets sent from the client while %FROMHOST% is the lookup of the incoming servers name.  If you can get the rsyslog client to set this to its FQDN, it gets passed along to the server.  Also, I would try using '$PreserveFQDN on' on both client and server rsyslog configs to prevent the %HOSTNAME% from having its domain stripped.

You can also debug with a format using %raw%.  You can dump that to a file and examine exactly what is being sent and received in whole.
See Also: → bug 926466
Everything I'm hearing so far says "learn your log volume", and I think at the moment that breaks down into

 * current syslog
   * servers
   * slaves
 * auditd
   * servers
   * slaves
 * other log sources
   * load-balancer access logs or webhead access logs
   * buildbot master twistd.log
   * buildbot slave twistd.log
   * other services (probably a very small fraction of volume)
     * slaveapi
     * vcs-sync
     * relengapi
     * buildapi
     * etc.

So, when I'm back after next week I'll see if I can come up with some order-of-magnitude estimates for those values.  Once that's in place we can talk about sending events to Heka, syslog.*, and/or our own aggregator; and about dumping some or all of the results into mozdef's ES, or our own ES.  And how much "our own" would cost in each case.
Next step is to land the two patches on this bug, but I don't want to do that (especially the client patch) on a Friday.
Attachment #8490750 - Flags: checked-in+
Attachment #8490751 - Flags: checked-in+
OK, let's turn this thing on and see what happens.  I've added comments for the additional load calculations.  Assuming hp1 doesn't explode, I'll leave this on for a few days and get an initial estimate, then for the rest of the week to get a more balanced estimate.  I'll throw out the first 24h anyway, since it'll take a while to "soak in".

In testing this I noted that when hp1 is down or inaccessible, the sending hosts don't care, so there's no risk to production here.
Attachment #8501005 - Flags: review?(jwatkins) isn't doing any *local* logging (to /var/log/messages).  So something's funky and will need investigation before deploying attachment 8501005 [details] [diff] [review].
Attachment #8501279 - Flags: review?(jwatkins)
Attachment #8501005 - Flags: review?(jwatkins) → review+
Attachment #8501279 - Flags: review?(jwatkins) → review+
Attachment #8501279 - Flags: checked-in+
Attachment #8501005 - Flags: checked-in+
I'll keep an eye on this over the next few days
Component: General Automation → RelOps: Puppet
Product: Release Engineering → Infrastructure & Operations
QA Contact: catlee → dustin
Version: unspecified → other

Comment 14

5 years ago
A Pivotal Tracker story has been created for this Bug:
Judging by yesterday's data, we're looking at about 17 million entries per day, which comes out to about 2GB.  These figures are *very* loose, though - even within host types I saw differences greater than an order of magnitude, for example:

bld-lion-r5-014 4143
bld-lion-r5-015 91788
bld-lion-r5-016 8923
bld-lion-r5-017 5626

A week's worth of data will hopefully be a bit more stable.
Here are the raw numbers for Oct 9-15, inclusive (604800 seconds):

aws-manager1 60958
bld-lion-r5-014 53685
bld-lion-r5-015 143861
bld-lion-r5-016 75753
bld-lion-r5-017 35181
b-linux64-hp-0018 99742
b-linux64-hp-0019 104900
b-linux64-hp-0020 19675
b-linux64-hp-0021 0
b-linux64-ix-0001 81381
b-linux64-ix-0002 83599
buildbot-master81 33090
buildbot-master82 26301
dev-master1 13082
mac-signing1 0
mac-signing2 62688
mac-signing3 63336
mac-signing4 107008
mac-v2-signing1 55716
mac-v2-signing2 56243
mac-v2-signing3 61954
proxxy1 2985
releng-puppet1 106494
releng-puppet2 1242408
signing4 117182
signing5 5325
signing6 5996
slaveapi1 22907
slaveapi-dev1 18664
talos-linux32-ix-037 8114
talos-linux32-ix-038 11530
talos-mtnlion-r5-033 51171
talos-mtnlion-r5-034 51858
t-snow-r4-0026 284806
t-snow-r4-0027 198190

I'll apply the relevant multipliers to that and come up with an overall volume estimate.
> aws-manager1 60958
> proxxy1 2985
> dev-master1 13082
> slaveapi1 22907
> slaveapi-dev1 18664
  total: 118596

> bld-lion-r5-014 53685
> bld-lion-r5-015 143861
> bld-lion-r5-016 75753
> bld-lion-r5-017 35181
  average: 77120
  multiplier: 90
  total: 6940800

> b-linux64-hp-0018 99742
> b-linux64-hp-0019 104900
> b-linux64-hp-0020 19675
  average: 74772
  multiplier: 34
  total: 2542248

> b-linux64-ix-0001 81381
> b-linux64-ix-0002 83599
  average: 82490
  multiplier: 27
  total: 2227230

> buildbot-master81 33090
> buildbot-master82 26301
  average: 29695
  multiplier: 66
  total: 1959870

> mac-signing2 62688
> mac-signing3 63336
  average: 63012
  multiplier: 4
  total: 252048

> mac-v2-signing1 55716
> mac-v2-signing2 56243
> mac-v2-signing3 61954
  average: 57971
  multiplier: 3
  total: 173913

> releng-puppet1 106494
> releng-puppet2 1242408
  average: 674451
  multiplier: 6
  total: 4046706

> signing4 117182
> signing5 5325
> signing6 5996
  average: 42834
  multiplier: 3
  total: 128503

> talos-linux32-ix-037 8114
> talos-linux32-ix-038 11530
  average: 29986
  multiplier: 201
  total: 6027186

> talos-mtnlion-r5-033 51171
> talos-mtnlion-r5-034 51858
  average: 51514
  multiplier: 99
  total: 5099886

> t-snow-r4-0026 284806
> t-snow-r4-0027 198190
  average: 241498
  multiplier: 165
  total: 39847170

Grand total: 69364156
Per day: about 10 million
Per second: about 110 events/sec

As for space, assuming those 7 days are a representative sample of event sizes, the average is 117.5 bytes.  For 10 million events, that's about 1.2GB per day.

Honestly, those numbers are *far* smaller than I thought.  Even with a healthy 50% buffer for data not currently in syslog, we're still looking at less than 200 events/sec and less than 2GB/day of storage.
Jeff, these numbers are pretty small.  Could we just point this at MozDef?

As an experiment, could I point the limited sample I used above?  That sample works out to about a half-million events/day, or 5 per second.
Flags: needinfo?(jbryner)
Yeah, those are pretty small. We do about that in our QA instance alone, so I'd say it's safe to just point it at mozdef, especially if starting with the sample. 

Lots of ways to get from syslog to mozdef including heka, nxlog, syslog/rsyslog->rabbitmq, any preferences on a log shipper?
Flags: needinfo?(jbryner)
I want the first level of logging (from clients0 to be syslog over UDP so that we don't impact production systems during network events.  I expect those packets will be aimed at a server per DC/region, but as far as what runs on that server and what it talks to upstream and what protocol it uses I am agnostic and open to suggestions of best practices.  If you have an example to point at (in infra puppet?) or docs to read, that'd be great.


5 years ago
Whiteboard: [kanban:engops:]


5 years ago
Whiteboard: [kanban:engops:] → [kanban:engops:]


5 years ago
Whiteboard: [kanban:engops:]


5 years ago
Whiteboard: [kanban:engops:]


4 years ago
Whiteboard: [kanban:engops:] → [kanban:engops:] [kanban:engops:]


4 years ago
Whiteboard: [kanban:engops:] [kanban:engops:] → [kanban:engops:] [kanban:engops:]


4 years ago
Whiteboard: [kanban:engops:] [kanban:engops:] → [kanban:engops:] [kanban:engops:]


4 years ago
Whiteboard: [kanban:engops:] [kanban:engops:] → [kanban:engops:] [kanban:engops:]
Hey jeff, still waiting for some info to connect this to MozDef.  I'm not in a huge hurry, just want it on your radar :)
Flags: needinfo?(jbryner)
Yeah, sorry for the delay. Been working through updating our config to recent heka (v7) ran into issues and am now targeting v8. Good news is bugs fixed. 

One important factor will be the timezone. Heka assumes/requires UTC. i.e. If your logs don't include timezone enriched timestamps it will assume they are UTC. 

There are some workarounds, but do you know what the timezone/stamp situation looks like currently?
Flags: needinfo?(jbryner)
Whiteboard: [kanban:engops:] [kanban:engops:] → [kanban:engops:]
All of our systems operate in US/Pacific and all of our log entries are in that timezone.  I think rsyslog can fix that - suggests so, anyway.
Another option is to log date/time in ISO format and parse the timestamp for timezone offset. Can I get a gander at some sample log files/lines?
Sure!  Random sample of two lines from /opt/aggregator/log/aggregate.log on

Nov 10 08:34:05 buildbot-master82 audit-cef: CEF:0|Unix|auditd|2|EXECVE|Unix Exec|3|end=141563 msg= pid\=23450 gid\=0 euid\=0 suid\=0 fsuid\=0 egid\=0 sgid\=0 fsgid\=0 ses\=107713 dproc=/bin/bash cs4Label=TTY cs4=
(none) duid=0 duser=root cn1Label=auid cn1=0 suser=root cs5Label=ParentProcess cs5=sh cs3Label=AuditKey cs3=exec 
Nov 10 08:34:05 buildbot-master82 audit-cef: CEF:0|Unix|auditd|2|EXECVE|Unix Exec|3|end=141563 msg= pid\=23451 gid\=0 euid\=0 suid\=0 fsuid\=0 egid\=0 sgid\=0 fsgid\=0 ses\=107713 dproc=/usr/bin/getopt cs4Label=TT
Y cs4=(none) duid=0 duser=root cn1Label=auid cn1=0 suser=root cs5Label=ParentProcess cs5=virt-what cs3Label=AuditKey cs3=exec

I can get you access to that host if it would help -- it's the proof-of-concept aggregator.  Let me know if that's useful -- I don't want to get you more involved than you would like!
Ahh, it's CEF, sorry I didn't realize that. In that case, heka is out as we haven't figured out how to parse CEF in lua.

We do have python to parse CEF though and it's easier to manage timezone/stamp differences.
are some starters.
Well, those two log messages happen to be CEF, yeah, maybe that was misleading.  If that stuff's useless in Heka, we'd probably just filter it out (sending it to the datacenter syslog servers only).  Here are some more representative log entries:

> Nov 10 09:15:02 aws-manager1 postfix/cleanup[369]: 8A04F1619CF: message-id=<>
> Nov 10 09:15:02 aws-manager1 postfix/qmgr[31070]: 8A04F1619CF: from=<>, size=840, nrcpt=1 (queue active)
> Nov 10 09:15:02 aws-manager1 postfix/local[371]: warning: dict_nis_init: NIS domain name not set - NIS lookups disabled
Ahh, yes that's a good plan. 

The other thing we did to help parsing was to move to tab-delimited logs. So timestamp<tab>hostname<tab>program<tab>processid<tab> message  which makes it much easier/faster/cheaper to pick out the fields in heka. 

Is that possible?
I think so -- I belive that would require specific configuration on the sending hosts, so I'll need to futz across operating systems, but it should work.
Actually, that's pretty hard.  We're not using rsyslog on OS X (and who knows what we'd use on Windows), so there's not a lot of capability to do arbitrary text processing at that point.

From what I understand of syslog, an app opens up /dev/log and writes a formatted message:

[ ~]# strace -esendto logger hi
sendto(1, "<13>Nov 21 12:18:23 root: hi", 28, MSG_NOSIGNAL, NULL, 0) = 28

On the wire, it looks like this:

12:34:01.401192 IP (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto UDP (17), length 268) > SYSLOG, length: 240
        Facility cron (9), Severity info (6)
        Msg: Nov 21 12:34:01 buildbot-master82 CROND[18154]: (root) CMD (. /usr/local/bin/ && /usr/bin/puppet agent --enable >/dev/null 2>/dev/null && /usr/bin/puppet agent --onetime --no-daemonize --logdest=syslog >/dev/null 2>&1)
        0x0000:  4500 010c 0000 4000 3f11 a85c 0a1a 3034  E.....@.?..\..04
        0x0010:  0a1a 4e1d b090 0202 00f8 c055 3c37 383e  ..N........U<78>
        0x0020:  4e6f 7620 3231 2031 323a 3334 3a30 3120  Nov.21.12:34:01.
        0x0030:  6275 696c 6462 6f74 2d6d 6173 7465 7238  buildbot-master8
        0x0040:  3220 4352 4f4e 445b 3138 3135 345d 3a20  2.CROND[18154]:.
        0x0050:  2872 6f6f 7429 2043 4d44 2028 2e20 2f75  (root).CMD.(../u
        0x0060:  7372 2f6c 6f63 616c 2f62 696e 2f70 726f  sr/local/bin/pro
        0x0070:  7879 5f72 6573 6574 5f65 6e76 2e73 6820
        0x0080:  2626 202f 7573 722f 6269 6e2f 7075 7070  &&./usr/bin/pupp
        0x0090:  6574 2061 6765 6e74 202d 2d65 6e61 626c  et.agent.--enabl
        0x00a0:  6520 3e2f 6465 762f 6e75 6c6c 2032 3e2f  e.>/dev/null.2>/
        0x00b0:  6465 762f 6e75 6c6c 2026 2620 2f75 7372  dev/null.&&./usr
        0x00c0:  2f62 696e 2f70 7570 7065 7420 6167 656e  /bin/puppet.agen
        0x00d0:  7420 2d2d 6f6e 6574 696d 6520 2d2d 6e6f  t.--onetime.--no
        0x00e0:  2d64 6165 6d6f 6e69 7a65 202d 2d6c 6f67  -daemonize.--log
        0x00f0:  6465 7374 3d73 7973 6c6f 6720 3e2f 6465  dest=syslog.>/de
        0x0100:  762f 6e75 6c6c 2032 3e26 3129            v/null.2>&1)

RFC5424 claims there should be some structured data after the ">", but I certainly don't see any there.  So basically we have a priority byte and some text that is uninhibited by the standards process.

Comparing its input (/dev/log) and output (network), it seems that rsyslog is doing some parsing of that text -- stripping the datestamp and inserting the hostname.

I guess I just never realized how unstructured log data really is.  I don't see any good place to try to tab-delimit things that's not going to be more work than simply parsing the data in the Heka ingestor.

Anyway, we should probably talk about all of this at the workweek and see what the base requirements are, so we can get something working and then improve on it.
Yeah, lets talk at the work week. Windows stuff is actually pretty straightforward using nxlog and bypassing syslog altogether since syslog just flattens a nice json data structure.
Syslog data is a formless mush and rsyslogd is an abomination, so we need to get data out of that format as quickly as possible.  Heka's a bit weird too, but having read both rsyslog and Heka docs today, sign me up for Heka any day.  And it's easy to install (single RPM, no deps.. I love Go!)

One of Heka's quirks is that it doesn't speak syslog at all.  The closest it gets is snarfing logfiles as written by tools like rsyslog.  So the plan is this:

* Configure rsyslog on the aggregation hosts to write logs (to a ramdisk to avoid spinning rust) in a format containing as much information as possible, including a parsed timestamp, hostname info, severity, and facility.
  * redirect all CEF data from auditd off to the per-datacenter syslog servers, NOT to Heka

* Snarf those logfiles with Heka
  * Later we can add other input mechanisms, too, including HTTP POST, to get logs from other sources like masters' twistd.logs

* Filter appropriately in Heka
  * I can already see just from tailing the logs that >50% of our logs are useless spam

* Send the results to MozDef via AMQP
OK, we got in a little deep.

The new approach is to add


on all client hosts.  No releng-specific aggregation systems, no new tools, nothing.

If there are problems, we'll look for minimal solutions to them.
Comment on attachment 8533892 [details] [diff] [review]

++ to simplicity
Attachment #8533892 - Flags: review?(jwatkins) → review+
Attachment #8533892 - Flags: checked-in+
Going to wait a bit on this and then see what's showing up in MozDef.
Well, nothing showed up, but if I had squinted at the syslog1.private.scl3 config I would have seen that -- it was dumping the log entries on disk locally, but not forwarding them to MozDef.

That said, it looks like we're going to be using an external service instead, so I'm going to revert the changes for this bug and close it up.
Note that this leaves the rsyslog improvements in place, but reverts the rest back to where only puppetmasters were sending their syslog upstream.
Attachment #8537829 - Flags: review?(jwatkins)
Comment on attachment 8537829 [details] [diff] [review]

Sorry, I totally missed this before the holiday.
Attachment #8537829 - Flags: review?(jwatkins) → review+
Attachment #8537829 - Flags: checked-in+
Last Resolved: 4 years ago
Resolution: --- → FIXED
Blocks: 1118181
You need to log in before you can comment on or make changes to this bug.