Closed Bug 1444461 Opened 7 years ago Closed 7 years ago

[Meta] POC GCP logging story

Categories

(Data Platform and Tools Graveyard :: Operations, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwoo, Assigned: hwoo)

Details

(Whiteboard: DataOps)

- GCE containers to stackdriver
- GKE containers to stackdriver
- Stackdriver (30 days) to long term storage (90 days) with limited access
- Stackdriver to Bigquery (30 day retention)
- Stackdriver export custom metrics to stackdriver monitoring for alerts and dashboards
- Automation story
Assignee: nobody → hwoo
Priority: -- → P1
Whiteboard: DataOps
- GCE containers to stackdriver

Requires installation of docker, the logging agent, and a custom config for the logging agent. We should bake this into the image. (https://www.fluentd.org/guides/recipes/docker-logging)

PREFERRED METHOD:

Edit the file at /etc/google-fluentd/config.d/forward.conf (have this baked into the image in the future):

=-=-=
<source>
  @type forward
  port 24224
  bind 0.0.0.0    # <- this is the line that needs to be added
</source>
=-=-=

and restart the logging agent:

sudo service google-fluentd status
sudo service google-fluentd reload
sudo service google-fluentd restart

Run the container on the host with --log-driver=fluentd, e.g. `docker run -d --log-driver=fluentd testlogger2:latest`

This shows logs faster than the second method, at: GCE VM Instance -> instance name -> container hash name (e.g. 3feb1077ba0b)

If you want more prettiness in stackdriver, you will need to drop a custom fluentd config file into /etc/google-fluentd/config.d/ and then run `sudo service google-fluentd reload|restart`. Custom configs can be used to filter log entries out into their own fields, concat multi-line entries, filter on severity, etc.

K8s automatically does some of that filtering/formatting for you. The K8s method also gives you severity in the right field for free. And in the export to cloud storage, k8s names the folders after the pod container name rather than the containerID hash (as the GCE method does). Also, the jsonPayload message is already parsed out into its own field, rather than the message field containing raw json. To get this same behavior in GCE we would have to create custom fluentd confs to filter out fields in GCE container logs; a sketch of such a conf is at the end of this comment.

=-=-=

NOT RECOMMENDED:

Another way to get logs to stackdriver is to set the log-driver and log-opt docker run options, e.g. `docker run -d --log-driver=gcplogs --log-opt gcp-log-cmd=true testlogger2:latest`

This works but takes a while for logs to appear in stackdriver. It also ships all container logs from the same instance to the same location, so if you have multiple containers, they won't be separated by container easily. Logs land at: GCE VM Instance -> instance name -> gcplogs-docker-driver

=-=-=

- GKE containers to stackdriver

Automatically pushes stdout/stderr container logs to stackdriver. Logs can be found in the GCP Portal under Logging -> Logs -> GKE Container, <k8s-cluster-name>, <namespace> (usually 'default' unless you specify a namespace when deploying your resources; the 'kube-system' namespace will have the system pod logs, e.g. fluentd, kubedns, promtosd, etc.)

K8S system logs are at Logging -> Logs -> Kubernetes Cluster, us-west1-a, k8sharold
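Here is a minimal untested sketch of such a custom conf. It makes two assumptions beyond what was tested above: the containers are started with `--log-opt tag=docker.{{.Name}}` so the forwarded records match `docker.**`, and the raw container line arrives in the `log` field (the fluentd log driver's default). The filename is made up.

=-=-=
# Hypothetical /etc/google-fluentd/config.d/docker-json.conf -- untested sketch.
# Assumes containers run with:
#   docker run -d --log-driver=fluentd --log-opt tag=docker.{{.Name}} testlogger2:latest
<filter docker.**>
  @type parser          # fluentd's built-in filter_parser plugin
  format json           # parse the raw line as JSON so fields land in jsonPayload
  key_name log          # the field holding the raw container log line
  reserve_data true     # keep the other fields (container_id, etc.) on the record
</filter>
=-=-=

Reload with `sudo service google-fluentd restart` afterwards, as above.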
- Stackdriver to Bigquery (30 day retention)

From the GCP UI, you can create exports to either bigquery or cloud storage (and then query with bigquery). There are also options for a custom destination or cloud pub/sub.

Querying via bigquery is easy if you export to bigquery directly: the datasource will automatically show up. If you need to query from cloud storage, you need to create a table: https://cloud.google.com/bigquery/external-data-cloud-storage ("Creating a permanent external table: You can create a permanent external table linked to your external data source using the web UI, the CLI, or the API.")

This isn't working out of the box for k8s -> stackdriver -> export to cloudstorage bucket -> bigquery. I get the error:

Failed to create table: Invalid field name "container.googleapis.com/stream". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long

A sample record looks like:

{"insertId":"1lzzp6fgcr7062q","jsonPayload":{"message":"This is an INFO message"},"labels":{"compute.googleapis.com/resource_name":"fluentd-gcp-v2.0.9-rr9ss","container.googleapis.com/namespace_name":"default","container.googleapis.com/pod_name":"logtest","container.googleapis.com/stream":"stdout"},"logName":"projects/mozilla-data-poc-198117/logs/logtest","receiveTimestamp":"2018-03-26T19:00:02.435367282Z","resource":{"labels":{"cluster_name":"harold-k8s","container_name":"logtest","instance_id":"1741766681358874447","namespace_id":"default","pod_id":"logtest","project_id":"mozilla-data-poc-198117","zone":"us-west1-a"},"type":"container"},"severity":"INFO","timestamp":"2018-03-26T18:59:56Z"}

Bigquery doesn't like it when fields start with { or contain non-escaped double quotes. This may require some customization at the fluentd level or tweaking of the bigquery load job: https://cloud.google.com/solutions/customizing-stackdriver-logs-fluentd

The recommendation here could be to simply have stackdriver export to bigquery and to cloud storage separately. If we need to frontload more than 30 days into bigquery initially, this could be a problem since stackdriver only keeps/allows 30 days. If security needs to look at > 30 day old data in cloud storage, they can try using bigquery to create a table, or use spark to look at the data.

The expiration policy for Bigquery is set at the dataset level (Default Table Expiration) as well as the table level (Expiration Time). Expiration will delete the tables.
https://cloud.google.com/bigquery/docs/best-practices-storage
https://cloud.google.com/bigquery/docs/managing-datasets#table-expiration
https://cloud.google.com/bigquery/docs/managing-tables#updating_a_tables_expiration_time
https://cloud.google.com/bigquery/docs/managing-partitioned-tables#partition-expiration

If we need a long-running table, we need to partition by date and delete date partitions. Or we can simply have developers export tables when they need to query logs on a case-by-case basis, and have some external process periodically set a 30 day expiration on all new tables. A rough CLI sketch of these steps follows below.
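For reference, a rough CLI sketch of the export + expiration steps. The sink, dataset, and table names here are made up; only the project id comes from the sample record above.

=-=-=
# Hypothetical sketch -- sink/dataset/table names are made up.
# Create a sink exporting container logs from stackdriver to a bigquery dataset:
gcloud logging sinks create k8s-logs-to-bq \
  bigquery.googleapis.com/projects/mozilla-data-poc-198117/datasets/k8s_logs \
  --log-filter='resource.type="container"'

# Set a 30 day default expiration (in seconds) on all new tables in the dataset:
bq update --default_table_expiration 2592000 k8s_logs

# Or, for a long-running date-partitioned table, expire partitions after 30 days:
bq update --time_partitioning_expiration 2592000 k8s_logs.container_logs
=-=-=

Note that after creating the sink you still have to grant its writer service account edit access on the destination dataset before logs start flowing.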
- Stackdriver (30 days) to long term storage (90 days) with limited access

The Stackdriver UI has an export-to-cloud-storage option. From the cloud storage UI, click Lifecycle to manage the bucket lifecycle (add rules). This can also be done programmatically, see: https://cloud.google.com/storage/docs/managing-lifecycles

From the cloud storage UI, click the ... button on the far right and select 'Edit bucket permissions' to change the bucket's IAM permissions. (https://cloud.google.com/storage/docs/access-control/iam?hl=en_US&_ga=2.22533308.-62984643.1513201187) A command-line sketch of both steps follows below.
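A minimal gsutil sketch of both steps, assuming a bucket named gs://log-archive-poc and a security group address (both made up):

=-=-=
# Hypothetical sketch -- bucket and group names are made up.
# Add a 90 day delete rule to the export bucket:
cat > lifecycle.json <<'EOF'
{"rule": [{"action": {"type": "Delete"}, "condition": {"age": 90}}]}
EOF
gsutil lifecycle set lifecycle.json gs://log-archive-poc

# Limited access: grant read on the bucket to the security group only.
gsutil iam ch group:secops@mozilla.com:roles/storage.objectViewer gs://log-archive-poc
=-=-=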
- Stackdriver export custom metrics to stackdriver monitoring for alerts and dashboards

https://cloud.google.com/logging/docs/logs-based-metrics/

Point developers to this if they plan to use logs-based metrics and you want them to think about what tags/labels they use.

From the logging UI -> Logs-based metrics -> create a new User-defined metric. Here's an example for my GCE instance log container to parse out severity as a metric:

Name: test-metric-export
Description: extract severity
Add Labels ->
  Name: severity
  Description: log severity e.g. info warn error
  Label Type: String
  Field Name: jsonPayload.log
  Regex: ^{.*\"severity\": \"(.*)\"}$
Units: 1
Type: Counter

Then with the metric you can create dashboards/alerts: https://cloud.google.com/logging/docs/logs-based-metrics/charts-and-alerts

View your metric in Metrics Explorer first (click the ... next to your new metric -> View in Metrics Explorer).

For alerting, go to Stackdriver Monitoring -> Alerting -> create a policy:

Basic Conditions -> Target: Log Metrics
If Metric: user/test-metric-export ABOVE X THRESHOLD FOR X minutes
Resource: gce_instance
SEVERITY: ERROR
Notifications -> Add pagerduty service
Documentation/Naming

=-=-=

For dashboarding, go to Stackdriver Monitoring -> Dashboards -> Create Dashboard. Basically use metrics (either user-defined custom metrics, or auto-generated system metrics like byte_count, dropped_log_entry_count, excluded_byte_count, excluded_log_entry_count, log_entry_count). In my example I used:

Resource Type: GKE Container
Metric: Log entries
Filter -> Severity = INFO | WARN | ERROR (3 different metric entries)
container_name = logtest
Aligner = mean

Second dashboard for GCE:

Resource Type: GCE VM Instance
Metric: logging/user/test-metric-export
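The metric itself can also be created from the CLI. A sketch only: the log filter here is an assumption pieced together from the sample record earlier in this bug (not what the UI generated), and label extraction is left to the UI:

=-=-=
# Hypothetical sketch -- the filter is an assumption.
gcloud logging metrics create test-metric-export \
  --description="extract severity" \
  --log-filter='resource.type="gce_instance" AND jsonPayload.log:"severity"'
=-=-=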
TODO:
- Test if nginx logs or any textual logs are converted to json in stackdriver/bigquery
- Test alerting
- Automation story
- Test if nginx logs or any textual logs are converted to json in stackdriver/bigquery

Syslog logs are automatically uploaded to stackdriver if the logging agent is installed; they are json formatted. Nginx logs will also be uploaded automatically, provided the logs are in the default location /var/log/nginx/access|error.log. Some of the Data Ops modules write nginx logs elsewhere (https://github.com/mozilla-services/cloudops-deployment/blob/master/libs/puppet/modules/logging/manifests/nginx.pp), so we either need to change that in GCP or add an additional fluentd config to the host (see the sketch at the end of this comment).

If we want to run everything in containers moving forward, we need to either run the container with the fluentd logdriver and configure the container to write logs to stdout|err, or mount the container log path to the host and ensure that the fluentd logging agent has configuration to scan those paths.

- Test alerting

Done
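A minimal sketch of such an additional host config, assuming the puppet module writes access logs to /data/nginx/access.log (the path and the conf filename are made up):

=-=-=
# Hypothetical /etc/google-fluentd/config.d/nginx-custom.conf -- path is an assumption.
<source>
  @type tail
  format nginx                 # fluentd's built-in nginx access-log parser
  path /data/nginx/access.log  # wherever the puppet module actually writes logs
  pos_file /var/lib/google-fluentd/pos/nginx-custom-access.pos
  tag nginx-access-custom
</source>
=-=-=

Then `sudo service google-fluentd restart` to pick it up.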
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Data Platform and Tools → Data Platform and Tools Graveyard