Closed
Bug 1254770
Opened 10 years ago
Closed 9 years ago
Reporting format for consumption by ActiveData
Categories
(Taskcluster :: Workers, defect)
Taskcluster
Workers
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: selenamarie, Unassigned)
References
Details
(Whiteboard: [docker-worker])
We'd like to start reporting performance data to ActiveData from the docker-worker. Let us know how we should format things!
Comment 1•10 years ago
|
||
Optimally we could format TaskCluster logs just like buildbot logs. Compare this buildbot log:
http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux64-debug/1457433743/mozilla-central_ubuntu64_vm-debug_test-mochitest-e10s-6-bm120-tests1-linux64-build0.txt.gz
with this TaskCluster log:
https://public-artifacts.taskcluster.net/RG2axt5xTsqwL862q0zMWQ/0/public/logs/live_backing.log
If the TaskCluster parts prior to the execution of the mozharness script could be blocked like:
========= Started foo (results: 0, elapsed: 0 secs) (at 2016-03-08 04:09:15.896202) =========
blah blah blah
========= Finished foo (results: 0, elapsed: 0 secs) (at 2016-03-08 04:09:15.896861) =========
then ActiveData could parse without modification. If that's not easy, I'm sure we could use another format.
I'm aware that a lot of the pre-mozharness TC output comes from the task itself (presumably everything after "Task Starting" in the log), so even if we just logged the docker image download and the task execution times above, it would give us something to parse and track in ActiveData. We might tackle logging from the task separately in a later bug.
Comment 2•9 years ago
|
||
Currently we have attempted to not introduce buildbot specific logging headers/footers into the things that workers are doing. Of course a once a task is running it can print those headers since the worker does not control what the task does.
We have typically adopted prefixing lines with "[taskcluster]" for things the worker has injected into the logs vs what the task logged. I would prefer to keep the lines similar to that if possible.
Right now we write one line with "[taskcluster] <message about downloading image>" but we could make that more easily parsed and add one for the loading of the image.
such as:
[taskcluster] Downloaded image 'public/image.tar' from task 'abc'
[taskcluster] Image download time: xxx seconds
[taskcluster] Image load time: xxx seconds
Comment 4•9 years ago
|
||
From bug 1257122 and bug 1257117: these log lines should have times on them as well.
Blocks: 1257117
Comment 5•9 years ago
|
||
(In reply to Greg Arndt [:garndt] from comment #2)
> Currently we have attempted to not introduce buildbot specific logging
> headers/footers into the things that workers are doing. Of course a once a
> task is running it can print those headers since the worker does not control
> what the task does.
>
> We have typically adopted prefixing lines with "[taskcluster]" for things
> the worker has injected into the logs vs what the task logged. I would
> prefer to keep the lines similar to that if possible.
>
> Right now we write one line with "[taskcluster] <message about downloading
> image>" but we could make that more easily parsed and add one for the
> loading of the image.
>
> such as:
> [taskcluster] Downloaded image 'public/image.tar' from task 'abc'
> [taskcluster] Image download time: xxx seconds
> [taskcluster] Image load time: xxx seconds
We aren't married to buildbot formatting. I think we'd prefer timestamping, however, to elapsed time, e.g.
[taskcluster 2016-03-15 08:12:15Z] Downloading image ...
Comment 6•9 years ago
|
||
I am agnostic about the actual format; whether you choose to prefix each line with the
system that generated it, or you use demarcation lines (like buildbot's "========="
prefix). Your choice will probably be guided by how the system is run, or if
log interlacing can happen. For individual properties, choose a format you think is
easy to parse, but also kind to the humans that might look at it.
* Time in GMT - I would prefer all times in GMT; whether you choose unix timestamps
(seconds since epoch), or the humane ISO format. Please no PST/PDT.
* Duration in seconds - to whatever number of decimal points you deem appropriate
* Stable format - You can future-proof your format by being explicit and plain with
properties, and knowing you will make heavy use of line markers.
* Don't add redundancy - Please log a value only once; omitting any derived values.
For example, record the start_time with end_time, *or* record start_time with
duration, but not both. Do not provide summary statistics, unless the human log
readers need it (?and mark the line so the parser ignores it?). The problem with
redundant, or derived, values is that I must add code to ensure they are
consistent, just to ensure I get alerted when my assumptions about the meaning of
the values breaks. For example: What do I do when
duration != end_time - start_time within some error bounds? If redundancy
already exists, leave it. Just do not spend effort adding it.
For example, in your task cluster example:
> # Unzip the mozharness ZIP file created by the build task
> + curl --fail -o mozharness.zip --retry 10 -L https://queue.taskcluster.net/v1/task/KvU9KKOVTtakEv1fzzDweA/artifacts/public/build/mozharness.zip
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
>
> 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
> 100 29 100 29 0 0 56 0 --:--:-- --:--:-- --:--:-- 65
> 100 29 100 29 0 0 56 0 --:--:-- --:--:-- --:--:-- 65
>
> 100 149 100 149 0 0 236 0 --:--:-- --:--:-- --:--:-- 236
>
> 100 657k 100 657k 0 0 935k 0 --:--:-- --:--:-- --:--:-- 935k
> rm -rf mozharness
is missing information, so it is ok to change the format: Prefixing each line with
with a ISO timestamp will let me extract the time it takes to unzip this file. No
need to add a redundant "unzip time" property: I can't use it anyway because I still
must measure the time to the beginning of the next step.
This step is also missing a short label. You can decree a new format so your
comments can be used to label steps with a short name. This does not change the
existing format, just inserts line markers:
> # Step: Unzip mozharness
> # Unzip the mozharness ZIP file created by the build task
> + curl --fail -o mozharness.zip --retry 10 -L https://queue.taskcluster.net/v1/task/KvU9KKOVTtakEv1fzzDweA/artifacts/public/build/mozharness.zip
Comment 7•9 years ago
|
||
There are two entities that can write messages to the same log. The worker which is executing the task, and the task itself. I believe this bug was opened for the the data that the worker (docker-worker) writes and is currently prefixed with "[taskcluster]" or "[taskcluster:error]". The goal if getting this data out of the logs is to determine timings for certain infrastructure related events to support the task, rather than the task itself.
The examples in comment 6 are related to the output created by the task. Messages between "[taskcluster] === Task Starting ===" and "[taskcluster] === Task Finished ===" are unaltered by the worker and streamed directly to the task log. Within a task, however, some have adopted outputting buildbot style headers/footers within the task, but this is independent of anything the worker is concerned with and not an enforced style across all tasks.
Taskcluster as a platform/service does not have the concept of steps that a task executes so that they can be labeled and have associated timings. The workers receive a task with a command to execute in a certain environment without any knowledge of what happens within that command execution. Perhaps that's something to consider in the future.
Comment 8•9 years ago
|
||
One proposal is to have worker related log statements to be written as such:
[taskcluster 2016-03-16 20:43:37.803Z] taskId: 7bfM6qMkSI-2B1xs5JvRcA, workerId: dummy-worker-Ps6qE4CMS
[taskcluster 2016-03-16 20:43:38.863Z] === Task Starting ===
Hello
[taskcluster 2016-03-16 20:43:48.947Z] Task timeout after 10 seconds. Force killing container.
[taskcluster:error] Task timeout after 10 seconds. Force killing container.
[taskcluster 2016-03-16 20:43:49.146Z] === Task Finished ===
[taskcluster 2016-03-16 20:43:49.147Z] Unsuccessful task run with exit code: -1 completed in 11.344 seconds
Note that the error was written without a timestamp to aid sheriffs that will be starring issues and having variable data within those error messages makes it difficult to retrieve bug suggestions.
Comment 9•9 years ago
|
||
I hope my confusion about what is writing what logs does not get in the way of the overall intent. I believe the worker takes a non-trivial amount of time from the moment the task is requested to the time the task begins. There is probably a delay to get a new spot instance, a delay while the instance is re-imaged (or the docker image downloaded), the environment probably requires a few instructions to be run before the task even starts. From my naive perspective, the worker has many "steps", but maybe it is some other entity that is performing those steps, in which case having it log would be nice. Given how slim the [taskcluster] logs are now, maybe these steps all happen before the worker starts? Maybe these steps are logged already elsewhere?
Comment 10•9 years ago
|
||
Thank you for having patience with working through unraveling what it is we would like to capture and how best to do it. I know some of it is not as straightforward as we would hope.
So I think there are two problems that we're trying to solve that might have different solutions depending on priorities and complexity.
1. Record timing data into activedata related to docker image download/loading
* For docker images, downloading and loading are one step with `docker pull`
* For image tarballs this is downloading and importing (two stages that impact over all timing)
2. Standardize timing formats and reporting of events within TaskCluster
* As it stands now, the docker image piece is the main time consuming piece that happens before task execution from a worker point of view. Instance provisioning and wait times are handled outside of the worker so not considered here. In the future, we will be adding more events that happen before the task executes but after a task is claimed, such as preloading caches, where having a standardized format for recording those would be helpful rather than adding any kind of custom rules for each event.
For #1, which is the more immediate need right now, could we get away with logging things such as (open for suggestions):
[taskcluster <timestamp>] Docker image size: XXXX mb
[taskcluster <timestamp>] Docker image download time: XXX seconds
[taskcluster <timestamp>] Docker image load time: XXX seconds
Although the last two I think could be combined into one log message because I'm not sure if developers care about the separation of loading and downloading time, but mostly care that this event take however many seconds before the task started. We will still record each timing as distinct entries in our metrics storage though if we needed to dig in.
| Reporter | ||
Comment 11•9 years ago
|
||
Kyle -- how does Greg's solution for solving #1 sound to you?
Flags: needinfo?(klahnakoski)
Comment 12•9 years ago
|
||
Let me assume this format, and these lines are written after each event completes.
> [taskcluster <timestamp1>] Docker image size: XXXX1 mb
> [taskcluster <timestamp2>] Docker image download time: XXX2 seconds
> [taskcluster <timestamp3>] Docker image load time: XXX3 seconds
I like the timestamps because they are an absolute reference I can use to calculate step durations:
> XXX2 = timestamp2 - timestamp1
> XXX3 = timestamp3 - timestamp2
I will perform this calculation upon ingestion of the log, and confirm they are consistent with the reported duration.
If the logging is asynchronous, then the timestamps might be skewed, and this plan will not work. I would then ask that the `start` and `end` timestamps that were used to calculate the duration be included. Like so,
> [taskcluster <timestamp1>] Docker image size: XXXX1 mb
> [taskcluster <timestamp2>] Docker image download time: XXX2 seconds (from 14xxxxxxxx to 14xxxxxxxx)
> [taskcluster <timestamp3>] Docker image load time: XXX3 seconds (from 14xxxxxxxx to 14xxxxxxxx)
Flags: needinfo?(klahnakoski)
Comment 13•9 years ago
|
||
We do have a start timestamp for this event, here is an example:
[taskcluster 2016-04-22 11:51:21.186Z] Downloading artifact "public/image.tar" from task ID: LgvzhG9dR4uLMCxbXA70yg.
Which was taken from this task log where you can see what we're currently recording:
https://public-artifacts.taskcluster.net/DlwQdOihTv6-Mk8mKSMLHg/0/public/logs/live_backing.log
The logging is synchronous (happens right before/after the action it's performing)
There are a few events here:
1. Total docker image download and load time
2. Image Download time
3. Image Load time
#1 can be achieved by looking at these two lines:
[taskcluster 2016-04-22 11:51:21.186Z] Downloading artifact "public/image.tar" from task ID: LgvzhG9dR4uLMCxbXA70yg.
[taskcluster 2016-04-22 11:55:13.684Z] Image 'public/image.tar' from task 'LgvzhG9dR4uLMCxbXA70yg' loaded. Using image ID sha256:e1ba6d4db86405d7f66f2b2fbe4962dfdf47594d3b4884cdfdb9815816114b15.
#2 from these lines:
[taskcluster 2016-04-22 11:51:21.186Z] Downloading artifact "public/image.tar" from task ID: LgvzhG9dR4uLMCxbXA70yg.
[taskcluster 2016-04-22 11:54:32.565Z] Downloaded artifact successfully.
#3 from these:
[taskcluster 2016-04-22 11:54:32.566Z] Loading docker image from downloaded archive.
[taskcluster 2016-04-22 11:55:13.684Z] Image 'public/image.tar' from task 'LgvzhG9dR4uLMCxbXA70yg' loaded. Using image ID sha256:e1ba6d4db86405d7f66f2b2fbe4962dfdf47594d3b4884cdfdb9815816114b15.
Comment 14•9 years ago
|
||
This looks excellent then!
| Reporter | ||
Updated•9 years ago
|
Whiteboard: [docker-worker]
| Reporter | ||
Updated•9 years ago
|
Component: Docker-Worker → Worker
Comment 15•9 years ago
|
||
Time stamps have been added awhile ago to these messages and deployed to all workers.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
| Assignee | ||
Updated•7 years ago
|
Component: Worker → Workers
You need to log in
before you can comment on or make changes to this bug.
Description
•