Closed Bug 916260 Opened 6 years ago Closed 6 years ago

Specify standard message formats for reporting test results via structured logging

Categories: Testing :: Mozbase, defect

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: cmanchester, Unassigned

References: Blocks 1 open bug

Attachments: 1 file

Structured logging is already being used to report test results in xpcshell, and will be used in new harnesses and in efforts to port existing harnesses to structured messages.

This bug is about specifying a standard set of message actions and the fields that will be expected for those actions. The goal is to allow consumers of structured messages to be able to expect certain messages, and to have a reliable notion of their meaning across test harnesses.

These actions/messages should be defined in a central location, such as mozlog or moztest, although equivalents may need to be specified for use by javascript harnesses.

I think a good start for this would be to review the formats currently in use and their intended meanings, and take a look at what I used for xpcshell, although both of these may be subject to change as needed.
FWIW I am using the following fields for web-platform-tests, based on what I have needed so far (but also designed to be more or less compatible with other harnesses that I examined):

TESTS-START: A set of tests started executing
  number (int):  Number of tests to be executed

TESTS-END: A set of tests finished executing

TEST-START: Executing an individual top-level test started
  test (list): list of test id components e.g. [url] for a test purely identified by url or [url, reftype, refurl] for a reftest

TEST-RESULT: Result from a subtest of a particular test
  test (list): As for TEST-START
  subtest (str|null): Name of the subtest (where applicable)
  status (str, enum[PASS|FAIL|TIMEOUT|NOTRUN|ASSERT?]): Result of the test
  message (str): Optional string indicating the reason for the status
  unexpected (bool): True indicates test result didn't match expectations. False or missing indicates that expectations were matched.

TEST-END: A top-level test finished executing
  test (list): As for TEST-START
  status (str, enum[OK|ERROR|TIMEOUT|CRASH|ASSERT]): Overall status of the test. OK indicates that the test completed, not that its subtests passed; the point of this field is to distinguish conditions that prevent us from getting test results at all. ERROR indicates a harness-detected error, e.g. an uncaught javascript exception that prevented any results from being collected. TIMEOUT indicates that the harness timed out without completing. CRASH is for a whole-program crash, and ASSERT indicates an assertion that wasn't fatal (i.e. that is distinguishable from a crash) but can't be tied to a specific test.
  message (str): As for TEST-RESULT
  unexpected (bool): As for TEST-RESULT

PROCESS-OUTPUT: Output captured from another process
  process (str): An identifier for the process producing output
  data (str): The captured output

LOG: Generic logging that doesn't correspond to actual testing
  level (str, enum(CRITICAL|ERROR|WARNING|INFO|DEBUG)): The severity of the log message
  message (str): The log message
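Concretely, each of these messages would be one JSON object per line. A minimal sketch of emitting them — the `emit` helper and the field values are illustrative, not part of the proposal:

```python
import json
import time

def emit(action, **fields):
    """Serialize one structured message as a single JSON line.
    Every message carries its action name plus a ms-since-epoch time."""
    record = {"action": action, "time": int(time.time() * 1000)}
    record.update(fields)
    return json.dumps(record)

# A TEST-RESULT message for a passing subtest:
line = emit("TEST-RESULT",
            test=["/example/test.html"],
            subtest="first assertion",
            status="PASS")
```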
So we realized we should nail this down before proceeding with converting other harnesses, like mochitest, to use structured logging. Once we all agree on a format, we need to document it clearly and provide examples.

Chris, could you post a brief summary of your thoughts and xpcshell's implementation, and how it compares to jgraham's proposal? Once you do that I'll post this to dev.platform to solicit feedback from a broader audience.
Flags: needinfo?(chmanchester)
First, about the work done for xpcshell. This gives a good idea of how the messages are used in conversion to the expected output format:

http://hg.mozilla.org/mozilla-central/file/11762cace081/testing/xpcshell/runxpcshelltests.py#l48

And this is the way formatting is done:

http://hg.mozilla.org/mozilla-central/file/11762cace081/testing/xpcshell/runxpcshelltests.py#l423

The comment here expresses the invariant required for formatting: message objects either have a “_message” field that will be the output, or the output is made from the string associated with the action together with the file name and a diagnostic message.
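A sketch of that invariant in code — the function and field names here are illustrative, not the actual xpcshell source:

```python
def format_message(msg, action_strings):
    """Format one structured message for human-readable output.
    Invariant: a message either carries a precomposed "_message" field
    that is emitted verbatim, or the output is assembled from the
    display string for its action, the test file name, and a
    diagnostic message."""
    if "_message" in msg:
        return msg["_message"]
    label = action_strings.get(msg["action"], msg["action"])
    return "%s | %s | %s" % (label, msg.get("source_file", "-"),
                             msg.get("diagnostic", ""))
```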

Some feedback on jgraham’s format below. Overall I think it composes a rich set of data that will be capable of expressing and surpassing what is currently reported by test harnesses, and we can probably take much of it verbatim.

I'm not a big fan of using upper-case, '-' delimited strings as general purpose field names. These have the advantage of reflecting the existing output convention, so I think they're fine as a concrete representation, but wherever we treat these names as symbolic constants, we might run into the problem that these aren't valid identifiers in python or javascript.

It may be advantageous to specify a restricted set of fields that will be expected from javascript (an individual test file) and differentiate this from the fields that will be produced by a python test runner with the benefit of the context of a full set of tests.


(In reply to jgraham from comment #1)
> FWIW I am using the following fields for web-platform-tests, based on what I
> have needed so far (but also designed to be more or less compatible with
> other harnesses that I examined):
>
> TESTS-START: A set of tests started executing
>   number (int):  Number of tests to be executed

This is only one character away from "TEST-START", below, which may hurt readability of the output.

>
> TESTS-END: A set of tests finished executing

Ditto.

>
> TEST-START: Executing an individual top-level test started
>   test (list): list of test id components e.g. [url] for a test purely
> identified by url or [url, reftype, refurl] for a reftest

Maybe we can use a dictionary/object here to help our format be more self-describing? Various tests might be identified by a variety of parameters; this seems highly specific to reftests.
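For instance, a self-describing identifier could use named fields instead of positions — the keys below are purely hypothetical:

```python
# A test identified only by its url:
plain_test = {"url": "/dom/test_foo.html"}

# A reftest, where the positional form [url, reftype, refurl]
# becomes named fields:
reftest = {"url": "/layout/a.html",
           "reftype": "==",
           "refurl": "/layout/b.html"}
```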

>
> TEST-RESULT: Result from a subtest of a particular test
>   test (list): As for TEST-START
>   subtest (str|null): Name of the subtest (where applicable)
>   status (str, enum[PASS|FAIL|TIMEOUT|NOTRUN|ASSERT?]): Result of the test
>   message (str): Optional string indicating the reason for the status
>   unexpected (bool): True indicates test result didn't match expectations.
> False or missing indicates that expectations were matched.

I am a big fan of having unexpected as a discrete field like this. This will be something to revisit in the xpcshell output format.

>
> TEST-END: A top-level test finished executing
>   test (list): As for TEST-START
>   status (str, enum[OK|ERROR|TIMEOUT|CRASH|ASSERT]): Overall status of the
> test. OK indicates that the test completed, not that subtests passed; the
> point of this field is to distinguish conditions that prevent us getting test
> results at all. ERROR indicates a harness-detected error e.g. an uncaught
> javascript exception that prevented all/any results being collected. TIMEOUT
> indicates that the harness timed out without completing. CRASH is for a
> whole program crash and ASSERT indicates an assertion that wasn't fatal
> (i.e. that is distinguishable from a crash) but can't be tied to a specific
> test.

We may want to explicitly allow extensions to the status field.

>
>   message (str): As for TEST-RESULT
>   unexpected (bool): As for TEST-RESULT
>
> PROCESS-OUTPUT: Output captured from another process
>   process (str): An identifier for the process producing output
>   data (str): The captured output
>
> LOG: Generic logging that doesn't correspond to actual testing
>   level (str, enum(CRITICAL|ERROR|WARNING|INFO|DEBUG)): The severity of the
> log message
>   message (str): The log message
Flags: needinfo?(chmanchester)
(In reply to Chris Manchester [:chmanchester] from comment #3)
> I'm not a big fan of using upper-case, '-' delimited strings as general
> purpose field names. These have the advantage of reflecting the existing
> output convention, so I think they're fine as a concrete representation, but
> wherever we treat these names as symbolic constants, we might run into the
> problem that these aren't valid identifiers in python or javascript.

That sounds reasonable.

> It may be advantageous to specify a restricted set of fields that will be
> expected from javascript (an individual test file) and differentiate this
> from the fields that will be produced by a python test runner with the
> benefit of the context of a full set of tests.

With the way that the web-platform-tests work all the messages are produced from python; the test file sends its data back over marionette rather than logging directly. I realise this isn't the way that all test types work, but we should at least bear in mind that there isn't a neat division between "things that are produced by js" and "things that are produced by python". It may be that there exist a set of things that cannot be produced by js and a set of things that may either be produced by js or by python.

> >
> > TEST-START: Executing an individual top-level test started
> >   test (list): list of test id components e.g. [url] for a test purely
> > identified by url or [url, reftype, refurl] for a reftest
> 
> Maybe we can use a dictionary/object here to help our format be more
> self-describing? Various tests might be identified by a variety of
> parameters; this seems highly specific to reftests.
> 

My only concern with using a dictionary is that the test parameter gets repeated a lot, both to enable consistency checking and to ease reading result logs (it's on every TEST-RESULT item and on the TEST-END item), so anything that increases its size has a significant impact on the size of the output file when there are many results per test. One could change this by e.g. having a test_id integer field on the TEST-START and repeating only this field on the corresponding -RESULT and -END messages, but this hurts readability and could add implementation complexity in propagating the test_id to the code that writes the TEST-RESULTs.
Keep in mind that the consumers of this event stream are the tools that run and report on test results. A good test of whether the schema is "good" is whether downstream consumers can produce useful near-real-time "dashboards" of the execution state.

For example, my dashboard may wish to report on test progress by listing every test file and changing the color of that test as it executes. The proposal in comment #1 won't enable this because TESTS-START only contains the numeric test count. However, replacing that count with a list of test names should be all that is needed. (This may not be a perfect example because the test runner often knows about the test set).

Something else to keep in mind is buffering. We want the event producer to emit events as soon as there is something to emit. It shouldn't e.g. receive an event (such as a line of output from an invoked process) and wait before releasing that event to downstream consumers. Real-time monitoring is a concern, yes. But the bigger concern is diagnosing hung processes: we don't want important state stuck in the buffer of the tested process. I only need to remind people how frustrating it has historically been to diagnose frozen tests through buildbot logs because important output is buffered. This all means having separate events for each "atomic" action. You may need to invent identifiers (such as UUIDs) so downstream consumers can associate "sub-"events together into one logical data structure.
So the problem with emitting events as soon as they are recorded is that it leads to interleaved output when tests are run in parallel. This makes it much harder for humans to read the logs. AIUI the parallel xpcshell tests buffer the output and emit that from each test atomically for this reason.
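A sketch of that per-test buffering (illustrative, not the actual implementation): messages between a test_start and its test_end are held back and flushed as one atomic group, so parallel runs don't interleave.

```python
class BufferingHandler:
    def __init__(self, write):
        self.write = write     # downstream consumer callback
        self.buffers = {}      # test id (as tuple) -> held messages

    def handle(self, msg):
        test = tuple(msg.get("test", ()))
        if msg["action"] == "test_start":
            self.buffers[test] = [msg]
        elif test in self.buffers:
            self.buffers[test].append(msg)
            if msg["action"] == "test_end":
                # Flush the whole test atomically.
                for held in self.buffers.pop(test):
                    self.write(held)
        else:
            self.write(msg)  # messages outside any test pass through
```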
So, this may be a silly question, but why are we inventing anything? 

The Syslog protocol specifies a structured data format already - https://tools.ietf.org/html/rfc5424 - and just about everything in the world that collects, parses and analyzes logs already speaks it. 

There are some very powerful tools available in that space already. Do we have a really compelling reason to be rolling our own format here?
(In reply to Mike Hoye [:mhoye] from comment #7)
> So, this may be a silly question, but why are we inventing anything? 

It's a fair question and was addressed months ago at the beginning of this project.

> The Syslog protocol specifies a structured data format already -
> https://tools.ietf.org/html/rfc5424 - and just about everything in the world
> that collects, parses and analyzes logs already speaks it. 

For one reason, the parsing overhead of syslog's format is significant. It's straight text with key-value pairs. All our tools are written in higher-level languages (notably Python and JS) and the overhead of a non-C-based parser is significant (mostly in GC pressure). We're doing JSON, which while not compact, is very fast and is "weby." We've already done things like have tools pipe events directly into WebSockets for real-time HTML dashboards. It just works. JSON is good for that.
(In reply to Gregory Szorc [:gps] from comment #8)
> For one reason, the parsing overhead of syslog's format is significant. It's
> straight text with key-value pairs. All our tools are written in
> higher-level languages (notably Python and JS) and the overhead of a
> non-C-based parser is significant (mostly in GC pressure). We're doing JSON,
> which while not compact, is very fast and is "weby." We've already done
> things like have tools pipe events directly into WebSockets for real-time
> HTML dashboards. It just works. JSON is good for that.

As just an FYI (no need to discuss this more), in Fedora they're junking syslog in favor of the systemd journal.  Which is interesting because it also is an attempt to provide structured logging with magic tools for searching/export, etc. and avoid all the syslog parsing stuff.

Informative links:
http://fedoraproject.org/wiki/Changes/NoDefaultSyslog
http://www.freedesktop.org/wiki/Software/systemd/
journal export/network format: http://www.freedesktop.org/wiki/Software/systemd/export/ 
journal append-only-ish on-disk format: http://www.freedesktop.org/wiki/Software/systemd/journal-files/ 
journal JSON format: http://www.freedesktop.org/wiki/Software/systemd/json/
I know Andrew said there's no need to discuss this more, but it's worth stating explicitly that the transport is mostly an implementation detail: it's hidden from external consumers, since we almost certainly have something sitting in front of the process that emits structured events. We /may/ re-expose the raw event stream for things like real-time dashboards. But most consumers will likely be interested in a massaged, more concise version of that event stream (e.g. a single JSON blob encapsulating the entirety of the results). Of course, nothing really prevents us from re-emitting the event stream as syslog, systemd, etc.: we just don't want to be bound to those protocols internally.
I'd like to suggest another datapoint for this format to contain: per-file execution time.

Having this data on a per-test basis is definitely useful, but reporting it on a per-file basis would allow testing infrastructures to analyze test file performance. This in turn would enable intelligent scheduling of parallelized test runs.
I didn't mention it in comment 1, but the format I have been using contains several global fields (largely based on chmanchester's prior art). In particular, what I currently have is:

time (int): A timestamp in ms since the epoch at which the event occurred.
thread (string): A named identifier for the thread responsible for the log message
source (string): A named identifier for the logging object producing the messages

The time field in particular allows you to calculate the test file time as (made up notation) time[TEST-END] - time[TEST-START]. The others probably aren't well designed, but the general idea is that there should be some way to tell which process and which thread produced a particular log message.
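With those global fields, per-file execution time falls out of a single pass over the log. A sketch, using the action names from the proposal in comment 1:

```python
def file_times(messages):
    """Compute per-test wall-clock duration in ms as
    time[TEST-END] - time[TEST-START]."""
    starts = {}
    durations = {}
    for msg in messages:
        test = tuple(msg.get("test", ()))
        if msg["action"] == "TEST-START":
            starts[test] = msg["time"]
        elif msg["action"] == "TEST-END" and test in starts:
            durations[test] = msg["time"] - starts.pop(test)
    return durations
```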
I think this bug is a good forum for discussion, but not for actually keeping track of what the proposed format is. Therefore I took jgraham's proposal as well as the feedback from chmanchester and gps and compiled it into json on an etherpad so it is easier to visualize:

https://etherpad.mozilla.org/structured-logging-format

If you have suggestions or amendments you'd like to make, please modify the etherpad directly and comment here with the reasons for your changes. We can periodically clear authorship colours when a format is agreed upon and iterate if we need to.

James and Chris, please make sure the format in the etherpad actually reflects the proposal and how things currently work.
What's the use case for reporting number of passes/fails at the end of the run? It isn't immediately obvious why this is useful compared to, say, the number of unexpected results, or why it is helpful to have it repeated when anyone can work out exactly the data they require from parsing the test_status entries (or even the test_end entries, if that's what they care about).
Sure, unexpected results works better.

As for being able to calculate the totals, I was thinking the calculation could get confusing if there are multiple suites running in parallel, and potentially impossible if a single test can run in more than one suite. Maybe we can operate under the assumption that this won't be the case. Otherwise, each test event could have an associated suite_id field or something which would make the calculation possible again.
I haven't seen the multiple-suite use case, so if we have that I would be interested to look at an example of how it works today. But in general I am much more predisposed toward adding whatever is needed to the individual events, so that post-processing tools have all the information that they need to do whatever reporting we find interesting in the future rather than baking in specific summary data to the output, which is both less flexible and harder to implement as state needs to be maintained during the test run.

In an entirely unrelated question, how do we store the name of the event type? So far each line is a single Object, with a field named "action" for the event type. But it might make more sense to call this "event", or to make the overall structure a List of the form [event-type, event-data] e.g.

["log", {"level": "debug", "message":"Example log message"}]
Sigh, forgot to enter the description of the attachment, sorry.

That's an example of the output I am currently producing, updated to the latest spec on the wiki page except for the extra stuff in suite_end that I don't like.
(In reply to jgraham from comment #16)
> I haven't seen the multiple-suite use case, so if we have that I would be
> interested to look at an example of how it works today. But in general I am
> much more predisposed toward adding whatever is needed to the individual
> events, so that post-processing tools have all the information that they
> need to do whatever reporting we find interesting in the future rather than
> baking in specific summary data to the output, which is both less flexible
> and harder to implement as state needs to be maintained during the test run.

Sure, it could be a case of YAGNI. I'm ok with omitting it for now.
Was there some conclusion about buffering output from specific tests vs writing immediately? The former is better for humans reading the files in the happy case, the latter is better for debugging when problems occur. I don't think the difference for machines is very substantial.
This would be up to the individual harness and not have much to do with the structured logger, no? Currently xpcshell buffers its test output and only prints it on failure. But I would think that in general the harness should dump everything, and it should be up to the consumer to decide whether or not it cares about it.
Well that isn't how it works in my code at the moment; the structured logger buffers everything between a test_start and test_end so that all tests are output atomically. I don't think this has been very beneficial (one can always write a postprocessor that rearranges the messages if needed). Therefore I am leaning toward the idea that we should expect the messages to be in approximately chronological order rather than grouped per test. If we aren't clear about the expectations here, people will write tools that work with the output from one harness and don't work with the output from others.
Picture our actors:

 <user> <-------> <test harness> <----> <tested process>
              /
 <machines> <-

We *don't* want buffering between <tested process> and <test harness> because if <tested process> crashes, hangs, etc, <test harness> doesn't know what the current state is and accurate reporting is undermined.

Buffering between <test harness> and <user>/<machines> is acceptable for various reasons, including more concise output for humans, sanitized/combined output to machines, etc.
How well would it work for other testsuites if "unexpected": true became "expected": "FOO". For imported tests where the expected results are stored in a manifest file this makes it a little bit easier to see what changed when an unexpected result occurred because you don't have to go and dig the expectation out of the manifest during post-processing. But perhaps for other suites this would be a burden?
That would assume that the tested process has access to the manifest, though I guess the old "unexpected" way has the same problem. I don't think it would be a problem for mochitest, reftest or xpcshell, though I wouldn't be surprised if there exists a harness where the expected/unexpected logic lives in the test harness itself. If there were a problematic harness, we could either just return null, or mark expected pass for everything and let the test harness sort it out.

But yes, I like "expected": <status> better too.
I think this bug can be closed out now. There don't seem to be any objections, and there's a patch in bug 956738 that implements it. That's not to say we can't change the format down the road if we realize something isn't working out.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED