Closed Bug 1825133 Opened 2 years ago Closed 2 years ago

Telegraf split error in Eliot stage

Categories

(Eliot :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Attachments

(2 files)

In Eliot stage in GCP, there are telegraf errors like this one:

2023-03-27T19:53:44Z E! [inputs.statsd] Splitting '|', unable to parse metric: eliot.diskcache.usage:0|g|c:f1ce8f072aab3dbe86e5dd91dcf83ad7a2a9147f26c45f5549eddd1a6608f763

The eliot.diskcache.usage:0 is coming from Eliot. The g is part of the statsd line protocol saying this is a gauge. The c:f1c... thing is not something I recognize and not something being emitted by Eliot.

I wrote up an infra bug, but after talking with jwhitlock, I looked at the datadog library code and found that this is new behavior in version 0.45.0.

This bug covers looking into it and fixing it.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

What's going on is that the c:... field is the container id, which the datadog library adds. The datadog agent uses the container id to figure out what additional information to attach to the metric before sending it onward. This is part of the datadog protocol v1.2.

https://docs.datadoghq.com/developers/dogstatsd/datagram_shell?tab=metrics#dogstatsd-protocol-v12

This functionality is new in datadog library 0.45.0.

The v1.2 line protocol is:

<METRIC_NAME>:<VALUE>|<TYPE>|#<TAG_KEY_1>:<TAG_VALUE_1>,<TAG_2>|c:<CONTAINER_ID>
                                                               ^^^^^^^^^^^^^^^^^
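
To make the failure mode concrete, here is a minimal Python sketch of a parser for this line format. It is not telegraf's code; the Datagram class, the parse_datagram function, and the accept_container_id flag are hypothetical names made up for illustration. With the flag off, the parser only knows the older "@" (sample rate) and "#" (tags) fields, so the trailing c:... field makes it give up, which is roughly the kind of error we see in the logs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Datagram:
    name: str
    value: str
    metric_type: str
    sample_rate: Optional[str] = None
    tags: List[str] = field(default_factory=list)
    container_id: Optional[str] = None


def parse_datagram(line: str, accept_container_id: bool = True) -> Datagram:
    """Parse one metric line (hypothetical helper, not telegraf's parser)."""
    head, *rest = line.split("|")
    name, sep, value = head.partition(":")
    if not sep or not rest:
        raise ValueError(f"unable to parse metric: {line}")

    # The first field after name:value is the metric type, e.g. "g" for gauge.
    datagram = Datagram(name=name, value=value, metric_type=rest[0])

    # Every remaining field is optional and identified by its prefix.
    for extra in rest[1:]:
        if extra.startswith("@"):
            datagram.sample_rate = extra[1:]
        elif extra.startswith("#"):
            datagram.tags = extra[1:].split(",")
        elif extra.startswith("c:") and accept_container_id:
            datagram.container_id = extra[2:]
        else:
            # A parser that predates v1.2 lands here for "c:<CONTAINER_ID>".
            raise ValueError(f"unable to parse metric: {line}")
    return datagram


if __name__ == "__main__":
    line = "eliot.diskcache.usage:0|g|c:f1ce8f07"  # shortened id for readability
    print(parse_datagram(line))
    try:
        parse_datagram(line, accept_container_id=False)
    except ValueError as exc:
        print(exc)
```

The point of the sketch is only that the extra field is harmless to a v1.2-aware parser and fatal to one that treats unknown fields as errors.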

I tried to reproduce it in my local dev environment. The datadog library looks at /proc/self/cgroup to pick up the container id. When I run Eliot in my local dev environment with Docker and Docker Compose, /proc/self/cgroup contains 0::/, which the datadog library splits into '' (the empty string), so it doesn't add the container id to the metric. Ergo, I can't reproduce the issue in my local dev environment. Even so, we're pretty sure this is what's going on.
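
Here is a rough Python sketch of that lookup, to show why 0::/ yields nothing. It is a simplification, not the datadog library's actual code: the container_id_from_cgroup function is a hypothetical name, and it assumes a container id is the last path segment and is exactly 64 lowercase hex characters.

```python
import re
from typing import Optional

# Assumption: a Docker-style container id is 64 lowercase hex characters.
CONTAINER_ID_RE = re.compile(r"^[0-9a-f]{64}$")


def container_id_from_cgroup(text: str) -> Optional[str]:
    """Return the first container id found in cgroup file contents, else None."""
    for line in text.splitlines():
        # Each line looks like "<hierarchy-id>:<controllers>:<path>".
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        # The last path segment is the candidate id; for "0::/" it is "".
        candidate = parts[2].rsplit("/", 1)[-1]
        if CONTAINER_ID_RE.match(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # What the local dev environment reports: no id to extract.
    print(container_id_from_cgroup("0::/"))  # prints None
```

In an environment where the cgroup path ends in the container's id, the same function returns that id, which would explain why stage emits the c:... field and local dev does not.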

The code that adds the container id:

https://github.com/DataDog/datadogpy/blob/0873f87c96f72c4e4cbec56457e8d9a33e9f38b6/datadog/dogstatsd/base.py#L762-L772

The PR that added the container id code:

https://github.com/DataDog/datadogpy/pull/720

I talked this over with jwhitlock and we think we should do a couple of things:

  1. Add some support to Markus to shut this off by default. Then we all update to the new Markus and we're good to go.
  2. Write up an issue in telegraf to support datadog protocol v1.2. I think this is what we want to do, but I'm not 100% sure. It's unclear how much of the datadog protocol telegraf intends to support.

I'll work on the Markus thing and bring this up with gcp-migration tomorrow.

willkg merged PR #37: "bug 1825133: update to markus 4.2.0" in f768d5b.

When this autodeploys to stage, I'll check the logs and verify the issue is fixed.

This is fixed in stage. I think this was deployed to prod in the last week.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED