Closed Bug 1444461 — Opened 7 years ago, Closed 7 years ago
[Meta] POC GCP logging story
Categories: Data Platform and Tools Graveyard :: Operations, enhancement, P1
Tracking: Not tracked
Status: RESOLVED FIXED
People: (Reporter: hwoo, Assigned: hwoo)
Details: (Whiteboard: DataOps)
- GCE containers to stackdriver
- GKE containers to stackdriver
- Stackdriver (30 days) to long term storage (90 days) with limited access
- Stackdriver to Bigquery (30 day retention)
- Stackdriver export custom metrics to stackdriver monitoring for alerts and dashboards
- Automation story
Updated • 7 years ago
Assignee: nobody → hwoo

Updated • 7 years ago
Priority: -- → P1

Updated • 7 years ago
Whiteboard: DataOps
Comment 1 • 7 years ago
- GCE containers to stackdriver
Requires installing Docker, the logging agent, and a custom logging agent config. We should bake this into the image. (https://www.fluentd.org/guides/recipes/docker-logging)
PREFERRED METHOD:
Edit the file at /etc/google-fluentd/config.d/forward.conf (bake this into the image in the future)
=-=-=
<source>
  @type forward
  port 24224
  # this is the line that needs to be added:
  bind 0.0.0.0
</source>
=-=-=
and restart the logging agent
sudo service google-fluentd status
sudo service google-fluentd reload
sudo service google-fluentd restart
Run the container on host with --log-driver=fluentd e.g.
`docker run -d --log-driver=fluentd testlogger2:latest`
Logs show up faster than with the second method, under:
GCE VM Instance -> instance name -> container hash name (e.g. 3feb1077ba0b)
If you want more prettiness in stackdriver, you will need to drop a custom fluentd config file into /etc/google-fluentd/config.d/ and then run `sudo service google-fluentd reload|restart`
Custom configs can be used to parse log entries out into their own fields, concatenate multi-line entries, filter on severity, etc. K8s automatically does some of this filtering/formatting for you.
The K8s method also gives you severity in the right field for free. In the export to Cloud Storage, K8s names the folders after the pod/container name rather than the container ID hash (as with the GCE method). The jsonPayload message is also already parsed out into fields, rather than the message field containing raw JSON. To get the same behavior on GCE we would have to create custom fluentd configs to parse fields out of the GCE container logs.
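A minimal sketch of what such a custom config could look like on a GCE host. The file name is hypothetical, the syntax assumes a fluentd v1-style parser filter, and the docker.** tag pattern assumes the container is started with something like --log-opt tag=docker.{{.Name}}:
=-=-=
# /etc/google-fluentd/config.d/docker-json.conf (hypothetical file name)
# Parse the JSON the container writes to stdout out of the "log" field that the
# docker fluentd log driver produces, so fields like "severity" end up parsed
# in Stackdriver instead of living inside a JSON string.
<filter docker.**>
  @type parser
  key_name log
  reserve_data true
  <parse>
    @type json
  </parse>
</filter>
=-=-=
then `sudo service google-fluentd reload` as above.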
=-=-
NOT RECOMMENDED: Another way to get logs to stackdriver is to set the log-driver and log-opt docker run options e.g.
`docker run -d --log-driver=gcplogs --log-opt gcp-log-cmd=true testlogger2:latest`
This works but takes a while for logs to appear in stackdriver. It also ships all container logs from the same instance to the same location; if you run multiple containers, they won't be easy to separate by container.
GCE VM Instance -> instance name -> gcplogs-docker-driver
=-=-
- GKE containers to stackdriver - Automatically pushes stdout/stderr container logs to stackdriver
Logs can be found in the GCP Portal under Logging -> Logs -> GKE Container, <k8s-cluster-name>, <namespace> (usually 'default' unless you specify a namespace when deploying your resources. 'kube-system' namespace will have the system pod logs e.g. fluentd, kubedns, promtosd, etc.)
K8S system logs are at Logging -> Logs -> Kubernetes Cluster, us-west1-a, k8sharold
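The same GKE container logs can also be pulled from the CLI, roughly like this (cluster name and namespace are placeholders):
=-=-=
gcloud logging read \
  'resource.type="container" AND resource.labels.cluster_name="<k8s-cluster-name>" AND resource.labels.namespace_id="default"' \
  --limit 20 --format json
=-=-=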
Comment 2 • 7 years ago
- Stackdriver to Bigquery (30 day retention)
From the GCP UI, you can create exports (sinks) to either BigQuery or Cloud Storage (and then query with BigQuery). There are also options for a custom destination or Cloud Pub/Sub.
Querying via BigQuery is easy if you export to BigQuery directly; the data source will automatically show up.
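Sinks can also be created from the CLI. A rough sketch (dataset and sink names are made up; the filter is an assumption):
=-=-=
# dataset to receive the export
bq mk --dataset mozilla-data-poc-198117:stackdriver_logs
# sink routing container logs into that dataset
gcloud logging sinks create container-logs-to-bq \
  bigquery.googleapis.com/projects/mozilla-data-poc-198117/datasets/stackdriver_logs \
  --log-filter='resource.type="container"'
# the sink's writer identity (printed by the command) needs BigQuery Data Editor on the dataset
gcloud logging sinks describe container-logs-to-bq
=-=-=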
If you need to query from cloud storage you need to create a table:
https://cloud.google.com/bigquery/external-data-cloud-storage
Creating a permanent external table
You can create a permanent external table linked to your external data source using the web UI, the CLI, or the API.
This isn't working out of the box for k8s -> stackdriver -> export to Cloud Storage bucket -> BigQuery.
I get the error:
Failed to create table: Invalid field name "container.googleapis.com/stream". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long
sample record looks like:
{"insertId":"1lzzp6fgcr7062q","jsonPayload":{"message":"This is an INFO message"},"labels":{"compute.googleapis.com/resource_name":"fluentd-gcp-v2.0.9-rr9ss","container.googleapis.com/namespace_name":"default","container.googleapis.com/pod_name":"logtest","container.googleapis.com/stream":"stdout"},"logName":"projects/mozilla-data-poc-198117/logs/logtest","receiveTimestamp":"2018-03-26T19:00:02.435367282Z","resource":{"labels":{"cluster_name":"harold-k8s","container_name":"logtest","instance_id":"1741766681358874447","namespace_id":"default","pod_id":"logtest","project_id":"mozilla-data-poc-198117","zone":"us-west1-a"},"type":"container"},"severity":"INFO","timestamp":"2018-03-26T18:59:56Z"}
It doesn't like fields that start with { or contain unescaped double quotes.
This may require some customization at the fluentd level or tweaking of the bigquery load job
https://cloud.google.com/solutions/customizing-stackdriver-logs-fluentd
The recommendation here could be to simply have stackdriver export to bigquery and export to cloud storage separately. If we need to frontload more than 30 days to bigquery initially this could be a problem since stackdriver only keeps/allows 30 days.
If security needs to look at > 30 day old data in cloud storage, they can try using bigquery to create a table, or use spark to look at the data.
Expiration policy for Bigquery is set at the Dataset level (Default Table Expiration) as well as the table level (Expiration Time). Expiration will delete the tables.
https://cloud.google.com/bigquery/docs/best-practices-storage
https://cloud.google.com/bigquery/docs/managing-datasets#table-expiration
https://cloud.google.com/bigquery/docs/managing-tables#updating_a_tables_expiration_time
https://cloud.google.com/bigquery/docs/managing-partitioned-tables#partition-expiration
If we need a long-running table, we need to partition by date and delete date partitions. Or we can simply have developers export tables when they need to query logs on a case-by-case basis, and have some external process periodically set a 30-day expiration on all new tables.
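The expiration settings above map to bq commands roughly like this (dataset and table names are made up; expirations are in seconds, 30 days ≈ 2592000):
=-=-=
# default table expiration for every new table in a dataset
bq update --default_table_expiration 2592000 mozilla-data-poc-198117:stackdriver_logs
# expiration on one existing table
bq update --expiration 2592000 mozilla-data-poc-198117:stackdriver_logs.logtest_20180326
# partition expiration on a date-partitioned table (drops old partitions, keeps the table)
bq update --time_partitioning_expiration 2592000 mozilla-data-poc-198117:stackdriver_logs.container_logs
=-=-=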
Comment 3 • 7 years ago
- Stackdriver (30 days) to long term storage (90 days) with limited access
Stackdriver UI has export to cloud storage option.
From the Cloud Storage UI, click Lifecycle to manage the lifecycle (add rules). This can also be done programmatically, see:
https://cloud.google.com/storage/docs/managing-lifecycles
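A minimal sketch using gsutil (bucket name is a placeholder; age is in days). lifecycle.json:
=-=-=
{
  "rule": [
    {"action": {"type": "Delete"}, "condition": {"age": 90}}
  ]
}
=-=-=
gsutil lifecycle set lifecycle.json gs://<log-export-bucket>
gsutil lifecycle get gs://<log-export-bucket>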
From the Cloud Storage UI, click the ... button on the far right and select 'Edit bucket permissions' to change the IAM permissions on the bucket. (https://cloud.google.com/storage/docs/access-control/iam?hl=en_US&_ga=2.22533308.-62984643.1513201187)
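The limited-access part can be scripted too; a sketch, with a made-up group and placeholder bucket:
=-=-=
# give only the security group read access to the long-term bucket
gsutil iam ch group:secops@example.com:roles/storage.objectViewer gs://<log-export-bucket>
gsutil iam get gs://<log-export-bucket>
=-=-=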
Comment 4 • 7 years ago
- Stackdriver export custom metrics to stackdriver monitoring for alerts and dashboards
https://cloud.google.com/logging/docs/logs-based-metrics/
Point developers to this if they plan to use log-based metrics and you want them to think about what tags/labels they use.
From the logging UI -> Logs-based metrics -> create a new User-defined metric
Here's an example for my GCE instance container log, parsing out severity as a metric:
Name: test-metric-export
Description: extract severity
Add Labels ->
Name: severity
Description: log severity e.g. info warn error
Label Type: String
Field Name: jsonPayload.log
Regex: ^{.*\"severity": \"(.*)\"}$
Units: 1
Type: Counter
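Roughly the same metric from the CLI (the label extractor still has to be added via the console/API; the log filter here is an assumption):
=-=-=
gcloud logging metrics create test-metric-export \
  --description="extract severity" \
  --log-filter='resource.type="gce_instance" AND jsonPayload.log:"severity"'
gcloud logging metrics list
=-=-=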
Then with the metric you can create dashboards/alerts:
https://cloud.google.com/logging/docs/logs-based-metrics/charts-and-alerts
View your metric in Metrics Explorer first (click on the ... next to your new metric -> View in Metrics Explorer).
For alerting go to Stackdriver Monitoring -> Alerting -> Create a Policy
Basic Conditions ->
Target: Log Metrics
If Metric: user/test-metric-export ABOVE X THRESHOLD X FOR X minutes
Resource: gce_instance
SEVERITY: ERROR
Notifications ->
Add pagerduty service
Documentation/Naming
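For reference, roughly the same policy as the Monitoring v3 API understands it; this is a sketch only (threshold, duration, and notification channel are placeholders). Save as policy.json; at the time of writing it can be pushed with the alpha gcloud surface:
=-=-=
{
  "displayName": "test-metric-export above threshold",
  "combiner": "AND",
  "conditions": [{
    "displayName": "error log entries too high",
    "conditionThreshold": {
      "filter": "metric.type=\"logging.googleapis.com/user/test-metric-export\" AND resource.type=\"gce_instance\"",
      "comparison": "COMPARISON_GT",
      "thresholdValue": 10,
      "duration": "300s"
    }
  }],
  "notificationChannels": ["projects/<project>/notificationChannels/<pagerduty-channel-id>"]
}
=-=-=
gcloud alpha monitoring policies create --policy-from-file=policy.json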
=-=-
For dashboarding go to Stackdriver Monitoring -> Dashboards -> Create Dashboard
Basically you use metrics (either user-defined custom metrics, or auto-generated system metrics like byte_count, dropped_log_entry_count, excluded_byte_count, excluded_log_entry_count, log_entry_count).
In my example I used:
Resource Type: GKE Container
Metric: Log entries
Filter ->
Severity = INFO | WARN | ERROR (3 different metric entries)
container_name = logtest
Aligner = mean
Second dashboard for GCE:
Resource Type: GCE VM Instance
Metric: logging/user/test-metric-export
Comment 5 • 7 years ago
TODO:
- Test if nginx logs or any textual logs are converted to json in stackdriver/bigquery
- Test alerting
- Automation story
Comment 6 • 7 years ago
- Test if nginx logs or any textual logs are converted to json in stackdriver/bigquery
Syslog logs are automatically uploaded to stackdriver if the logging agent is installed; they are JSON formatted. Nginx logs will also be uploaded automatically, provided the logs are in the default location /var/log/nginx/access|error.log.
Some of the Data Ops modules write nginx logs elsewhere (https://github.com/mozilla-services/cloudops-deployment/blob/master/libs/puppet/modules/logging/manifests/nginx.pp), so we either need to change that in GCP or add an additional fluentd config to the host. If we want to run everything in containers going forward, we need to either run the containers with the fluentd log driver and configure them to write logs to stdout/stderr, or mount the container log path onto the host and make sure the fluentd logging agent is configured to scan those paths.
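A sketch of that additional host config, assuming a non-default path like /data/nginx/ (path, file name, and tag are assumptions; the parser uses fluentd's built-in nginx access log format):
=-=-=
# /etc/google-fluentd/config.d/nginx-custom.conf (hypothetical file name)
<source>
  @type tail
  path /data/nginx/access.log
  pos_file /var/lib/google-fluentd/pos/nginx-custom-access.pos
  tag nginx-access-custom
  format nginx
</source>
=-=-=
then sudo service google-fluentd reload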
- Test alerting
Done
Comment 7 • 7 years ago

Updated • 7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated • 2 years ago
Product: Data Platform and Tools → Data Platform and Tools Graveyard