Telegraf split error in Eliot stage
Categories
(Eliot :: General, defect, P2)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
Attachments
(2 files)
In Eliot stage in GCP, there are telegraf errors like this one:
2023-03-27T19:53:44Z E! [inputs.statsd] Splitting '|', unable to parse metric: eliot.diskcache.usage:0|g|c:f1ce8f072aab3dbe86e5dd91dcf83ad7a2a9147f26c45f5549eddd1a6608f763
The eliot.diskcache.usage:0 is coming from Eliot. The g is part of the statsd line protocol saying this is a gauge. The c:f1c... thing is not something I recognize and not something being emitted by Eliot.
I wrote up an infra bug, but after talking with jwhitlock, I looked at the datadog code and this is something new in 0.45.0.
This bug covers looking into it and fixing it.
Comment 1•2 years ago
What's going on is that the c:... thing is the container id, which the datadog library adds. The datadog agent uses the container id to figure out what additional information to attach to the metric before sending it onward. This is part of the datadog protocol v1.2.
https://docs.datadoghq.com/developers/dogstatsd/datagram_shell?tab=metrics#dogstatsd-protocol-v12
This functionality is new in datadog library 0.45.0.
The v1.2 line protocol is:
<METRIC_NAME>:<VALUE>|<TYPE>|#<TAG_KEY_1>:<TAG_VALUE_1>,<TAG_2>|c:<CONTAINER_ID>
                                                               ^^^^^^^^^^^^^^^^^
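To make the layout concrete, here is a minimal sketch of splitting a v1.2 datagram into its parts. It is not telegraf's or datadog's parser; the function name parse_dogstatsd and the returned dict keys are made up for illustration, and fields other than the tags and container id (for example a sample rate) are simply ignored.

```python
def parse_dogstatsd(line):
    """Split a DogStatsD metric datagram into its parts (illustrative only)."""
    head, *fields = line.split("|")
    if not fields:
        raise ValueError("missing metric type: %r" % line)

    # <METRIC_NAME>:<VALUE>
    name, _, value = head.rpartition(":")

    metric = {
        "name": name,
        "value": value,
        "type": fields[0],      # e.g. "g" for gauge
        "tags": [],
        "container_id": None,
    }

    # Optional fields are identified by their prefix.
    for field in fields[1:]:
        if field.startswith("#"):
            metric["tags"] = field[1:].split(",")
        elif field.startswith("c:"):
            # Protocol v1.2 container id field
            metric["container_id"] = field[2:]
    return metric


line = (
    "eliot.diskcache.usage:0|g|"
    "c:f1ce8f072aab3dbe86e5dd91dcf83ad7a2a9147f26c45f5549eddd1a6608f763"
)
parsed = parse_dogstatsd(line)
```

A parser that only knows the older fields has no rule for a c: field, which is consistent with the "unable to parse metric" error in the log line above.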
I tried to reproduce it in my local dev environment. The datadog library looks at /proc/self/cgroup to pick up the container id. When I run Eliot in my local dev environment with Docker and Docker compose, /proc/self/cgroup contains 0::/ which the datadog library parses into '' (the empty string), so it doesn't add the container id to the metric. Ergo, I can't reproduce the issue in my local dev environment. Even so, we're pretty sure this is what's going on.
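A rough sketch of why 0::/ yields no container id follows. This is a simplified stand-in and not the datadog library's actual code: the function name container_id_from_cgroup is hypothetical, and it assumes a container id is the last path segment and is 64 lowercase hex characters, like the one in the log line.

```python
import re

# Assumption: container ids look like the 64-hex-character id in the log line.
CONTAINER_ID_RE = re.compile(r"^[0-9a-f]{64}$")


def container_id_from_cgroup(text):
    """Return a container id found in cgroup file contents, or None."""
    for line in text.splitlines():
        # Each line has the form hierarchy-id:controllers:path
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue
        # Take the last path segment; for the path "/" this is ''.
        last_segment = parts[2].rsplit("/", 1)[-1]
        if CONTAINER_ID_RE.match(last_segment):
            return last_segment
    return None


local_dev = container_id_from_cgroup("0::/\n")
in_container = container_id_from_cgroup("12:memory:/docker/" + "a" * 64 + "\n")
```

With the local dev contents, the last path segment is the empty string, so nothing is returned and no c: field would be appended.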
The code that adds the container id:
The PR that added the container id support:
https://github.com/DataDog/datadogpy/pull/720
I talked this over with jwhitlock and we think we should do a couple of things:
- Add some support to Markus to shut this off by default. Then we all update to the new Markus and we're good to go.
- Write up an issue in telegraf to support datadog protocol v1.2. I think this is what we want to do, but I'm not 100% sure. It's unclear how much of the datadog protocol telegraf intends to support.
I'll work on the Markus thing and bring this up with gcp-migration tomorrow.
Comment 5•2 years ago
willkg merged PR #37: "bug 1825133: update to markus 4.2.0" in f768d5b.
When this autodeploys to stage, I'll check the logs and verify the issue is still fixed.
Comment 6•2 years ago
This is fixed in stage. I think this was deployed to prod in the last week.