Closed Bug 1113868 Opened 10 years ago Closed 9 years ago

generate a summary file of warnings, errors, and test failures for each structured log

Categories

(Testing :: General, defect)

Platform: x86 macOS
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: camd, Assigned: chmanchester)

References

Details

Attachments

(3 files, 1 obsolete file)

We still need to determine which bits of the structured log to put into the summary.  But these will at least include all "critical", "error" and "warning" lines.  Each line should have a field that includes the line number in the full structured log file.
Blocks: 1113873
I think a reasonable and reasonably low-effort approach for this will be:

Add a "counter" field to the raw JSON formatter.
Write a formatter that works just like the raw JSON formatter with the added counter field except that it only writes lines that are a test failure or log message with a level worse than warning.
Write the summary file and upload to blobber alongside the full raw log that's written today and establish a naming convention so Treeherder can know where to find them.
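The three steps above could be sketched roughly as follows. This is a hypothetical Python sketch, not mozlog's actual formatter API; the action names and the "expected"-key convention for unexpected test results are assumptions based on mozlog's raw JSON output.

```python
import json

# Levels "worse than warning" (per the plan above; the bug summary also
# mentions warnings, so this set may need adjusting).
FAULT_LEVELS = {"CRITICAL", "ERROR"}

class ErrorSummaryFormatter:
    """Like a raw JSON formatter with an added counter field, but it only
    emits lines that are a test failure or a high-severity log message.

    The counter starts at 0 and increments for every message seen, whether
    or not the message is emitted, so it matches line offsets in the full
    raw log written alongside this summary.
    """

    def __init__(self):
        self.counter = 0

    def __call__(self, data):
        data = dict(data, counter=self.counter)
        self.counter += 1
        is_fault = (
            (data.get("action") == "log" and data.get("level") in FAULT_LEVELS)
            # In this sketch, an unexpected result carries an "expected" key.
            or (data.get("action") in ("test_status", "test_end")
                and "expected" in data)
        )
        if is_fault:
            return json.dumps(data) + "\n"
        return None  # suppressed, but it still consumed a counter value
```

A suppressed INFO line still advances the counter, so the serials recorded in the summary line up with positions in the full raw log.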
chmanchester: this sounds good.  And, just to be clear: the "counter" you mention is the "line number" in the main log, then?
Flags: needinfo?(cmanchester)
Generally it would provide the same kind of information: it would start at 0 and increment every time we emit a log message. It won't always map 1:1 if you want to compare with a buildbot log, though, because multiline things like stacktraces are kept in a single JSON object produced by mozlog. Does that work?
Flags: needinfo?(cmanchester)
nah, I think comparing to your structured log is the way I'd go anyway.  So as long as I can load the "surrounding context" lines, then we're good.  This sounds like a great solution.  Thanks!!
Summary: generate a summary file of each structured log → generate a summary file of warnings, errors, and test failures for each structured log
Here's a first cut at this (test failures my own): https://treeherder.mozilla.org/#/jobs?repo=try&revision=05b07872f741

It seems to be working pretty well. There's a bunch of warnings in the log that aren't that useful, but I don't know if those really need to be logged at as warnings in the first place.
chmanchester: After doing some preliminary testing with this structure, it looks like it'll work just fine.  Like you say, we may choose to refine it a bit.  But let's go with this for now.

Thanks!!
(just needinfo'ing you in case your bugmail spam is as hard to wade through as mine...)
Flags: needinfo?(cmanchester)
Awesome, I'll come back to this next week and get something landed. One thing I remembered that I neglected here was logging "crash" actions, I think we'll want those included, but that's an easy fix. I'm usually pretty good at keeping up with arbitrary bugmail, but needinfo reflects this situation pretty well, so let's go with that!
Assignee: nobody → cmanchester
/r/3427 - Bug 1113868 - Add a "serial" field to mozlog's log objects, add a formatter to only include errors/warnings/failures.
/r/3429 - Bug 1113868 - Add --log-faults to in-tree config to upload errors and failures to blobber.

Pull down these commits:

hg pull review -r 544ac94cb1681be8d10aeceddb3efe6763477967
Try is looking as I expect. I realize now the top commit from comment 9 needs updates for all the platforms (as does the corresponding mozharness update).
Comment on attachment 8560196 [details]
MozReview Request: bz://1113868/chmanchester

Attachment #8560196 - Flags: review?(james)
Flags: needinfo?(cmanchester)
Sorry, I haven't been following this so I don't entirely understand the problem space. Why are we generating a new log format that seems to more or less be a subset of the raw logs rather than filtering the raw logs in the backend?
(In reply to James Graham [:jgraham] from comment #12)
> Sorry, I haven't been following this so I don't entirely understand the
> problem space. Why are we generating a new log format that seems to more or
> less be a subset of the raw logs rather than filtering the raw logs in the
> backend?

It is exactly a subset of the raw log. :camd can probably speak to the motivation in more detail than I can, but generally this is an aid to building a UI around the parts of the structured log most likely to be interesting.
Flags: needinfo?(cdawson)
OK. My a priori feeling is that we should consume the whole raw log and use the typical iterator/handler pattern in the consuming code to implement this, rather than pushing the requirements of specific consumers onto the production side.
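The consumer-side iterator/handler pattern described here might look something like this. This is a self-contained sketch that mimics a reader-style API rather than using mozlog itself; all names are illustrative.

```python
import json

class SummaryHandler:
    """Collects 'lines of interest' while iterating a raw structured log.

    Each JSON object is dispatched to an on_<action> method if one exists;
    unhandled actions are simply skipped.
    """

    def __init__(self):
        self.faults = []

    def __call__(self, data):
        method = getattr(self, "on_" + data.get("action", ""), None)
        if method is not None:
            method(data)

    def on_log(self, data):
        if data.get("level") in ("ERROR", "CRITICAL"):
            self.faults.append(data)

    def on_test_status(self, data):
        # In this sketch, an unexpected result carries an "expected" key.
        if "expected" in data:
            self.faults.append(data)

def read_log(lines, handler):
    """Iterate raw-JSON log lines, feeding each parsed object to handler."""
    for line in lines:
        handler(json.loads(line))
```

The point of the pattern is that what counts as "interesting" lives entirely in the consuming code, so changing it never has to ride the trains.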
This subset is the "summary" file we initially load in the UI.  It shows the potentially interesting (broken) parts which the user can click to load the surrounding lines from the raw structured log for viewing.

Treeherder could have a process that runs and iterates over the whole log again to generate this summary (that's what we do with the textual logs now).  But it's slow and requires 3 workers to keep up with it.

The intention here was that, if we can generate this summary "while we're at it" (while creating the main log), it spares us a second iteration over the log to generate it.
Flags: needinfo?(cdawson)
It would be interesting to know how parsing structured logs compares in throughput to the existing log parsing code.

It seems to me that there are definite disadvantages in making this Yet Another Formatter in mozlog. It adds complexity for every user of the library who has to ignore a command line option that is intended for only very specific circumstances. That could perhaps be solved by making it something implemented in mozharness rather than mozlog. Worse than that, however, it will make updating the data that treeherder uses a drawn-out process since any update will have to ride the trains. If we are planning to build a more advanced log viewer UI in the future, it seems likely that more of the log data is going to be needed, so we will end up parsing the full raw log anyway.
I just thought of another issue here; if something sneaks past structured logging and into stdout/stderr we won't put it in the summary (this happens occasionally, e.g. when updating the harness code causes a Python traceback). Obviously we don't have to include this information if the benefits are worth the cost, but when it does happen it's typically the one thing you want to know to debug the error.
(In reply to James Graham [:jgraham] from comment #16)
> It would be interesting to know how parsing structured logs compares in
> throughput to the existing log parsing code.
> 
> It seems to me that there are definite disadvantages in making this Yet
> Another Formatter in mozlog. It adds complexity for every user of the
> library who has to ignore a command line option that is intended for only
> very specific circumstances. That could perhaps be solved by making it
> something implemented in mozharness rather than mozlog. Worse than that,
> however, it will make updating the data that treeherder uses a drawn-out
> process since any update will have to ride the trains. If we are planning to
> build a more advanced log viewer UI in the future, it seems likely that more
> of the log data is going to be needed, so we will end up parsing the full
> raw log anyway.

Any update to the data source (more or less of the log being potentially interesting) will happen in-tree and need to ride the trains anyway. The model provided by mozlog determines what's interesting here, and as long as mozlog is in tree its changes would need to ride the trains. Providing this information doesn't seem that specific to the proposed treeherder UI -- it's akin to the things mach's hand parsing prints out at the end of a run. Doing this filtering in mozharness is probably an option, but in my estimation that implementation would be higher effort.
(In reply to James Graham [:jgraham] from comment #17)
> I just thought of another issue here; if something sneaks past structured
> logging and into stdout/stderr we won't put it in the summary (this happens
> occasionally e.g. when updating the harness code and causing a python
> traceback). Obviously we don't have to include this information if the
> benefits are worth the cost, but when it does happen it's typically the one
> thing you want to know to debug the error.

I don't think it's unreasonable to say that when structured logging is unaware of a problem (such as when mozlog throws an unexpected exception), that problem won't be given first-class treatment in the UI driven by structured logs. This seems like the purview of treeherder's regexes.
Based on our discussion Monday, jgraham wants more information about how this file is going to be used. Can we have that question spelled out here?
Flags: needinfo?(james)
This is a screenshot of the current WIP state of the structured log viewer.  The sample file that's generated is used for the upper right portion.  These are considered "lines of interest" (as in failures, etc).  When the user clicks one, it loads the lines from the raw log in the bottom portion.

disclaimer: this is all sample data and very rough.  The fields shown in the log will almost certainly be refined from this.  But it gives you some idea.
Thanks for the screenshot. Based on that I wonder where the full log data in the bottom panel comes from. Is that something that has to be created on the treeherder side from the raw logs, or is it something that you get from buildbot? If it's the latter, I don't understand how the lines in the raw logs are matched to the lines in the formatted logs, since there isn't a 1:1 relationship. If it's the former, it seems like we are going to have to parse the full log in treeherder anyway, at least when this view is displayed.

However my original question was about the relevant tradeoffs here. Pushing the summarisation into the production side seems like it comes at the cost of an increased maintenance burden in the long term because every time we want to change what's in the summary we have to make the change in mozilla-inbound (or mozharness tip) and wait for it to percolate across all trees and all out-of-tree users of mozlog (who may be using some older version). By contrast putting it on the treeherder side removes a chance for parallelisation, but improves the ability of the treeherder team to control the evolution of the feature in the future. I'm wondering if we have examined this set of tradeoffs and determined that it's worthwhile to forego the ability to easily iterate the UI because e.g. consuming structured logs is simply too slow otherwise and pre-summarisation allows enough cases where we don't have to parse the logs on the treeherder side at all to be a win.

If that *is* the case, we should really consider some followup work on making structured logs faster to process e.g. investigating the benefits of a new "raw" format based on one of the fast binary buffer serialisation formats (protobuf, cap'n proto, etc.).
Flags: needinfo?(james)
James-- Thanks for the info.  In answer to your first question: it's the latter.  When someone clicks a line in the summary, it uses the "serial" value to find the appropriate log slice in the structured "raw" log.  It fetches just the lines around the line in question and displays that to the user.

However, after chatting with Mauro on this and looking a bit at your mozlog docs, I think it's true that we should host this code in Treeherder.  It looks like we can use much of this patch (the custom formatter) and plug that into a process to generate this summary.  As you say, this gives us the most latitude to generate the summary to our custom needs.

Perhaps chmanchester could help mentor me on doing this (I'm going to take a crack at it next week).  But I'm thinking my approach would be to have a celery task that uses the mozlog.structured.reader package to read in the log, apply the formatter and generate my summary.  I'll then save the summary as a job artifact just like we currently do for the non-structured logs.  

I may be a little off on the steps, but perhaps I could get some assistance with getting that right?
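The context-fetch step camd describes, using the "serial" value to pull the surrounding lines out of the raw structured log, could be as simple as a slice. This is a hypothetical sketch assuming serials start at 0 and match line offsets in the raw log, so a serial maps directly to a list index.

```python
def context_slice(raw_log_lines, serial, before=5, after=5):
    """Return the raw-log lines surrounding the line with the given serial.

    Assumes serials are 0-based and correspond 1:1 to positions in
    raw_log_lines, which holds one JSON object per entry.
    """
    start = max(0, serial - before)  # clamp at the top of the log
    return raw_log_lines[start : serial + after + 1]
```

The UI would then render this slice in the bottom panel when a summary line is clicked.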
(In reply to Cameron Dawson [:camd] from comment #23)
> James-- Thanks for the info.  In answer to your first question: it's the
> latter.  When someone clicks a line in the summary, it uses the "serial"
> value to find the appropriate log slice in the structured "raw" log.  It
> fetches just the lines around the line in question and displays that to the
> user.
> 
> However, after chatting with Mauro on this and looking a bit at your mozlog
> docs, I think it's true that we should host this code in Treeherder.  It
> looks like we can use much of this patch (the custom formatter) and plug
> that into a process to generate this summary.  As you say, this gives us the
> most latitude to generate the summary to our custom needs.
> 
> Perhaps chmanchester could help mentor me on doing this (I'm going to take a
> crack at it next week).  But I'm thinking my approach would be to have a
> celery task that uses the mozlog.structured.reader package to read in the
> log, apply the formatter and generate my summary.  I'll then save the
> summary as a job artifact just like we currently do for the non-structured
> logs.  
> 
> I may be a little off on the steps, but perhaps I could get some assistance
> with getting that right?

Sounds like a plan. Just ping me or request f? on a patch, happy to take a look!
(In reply to Cameron Dawson [:camd] from comment #23)

> But I'm thinking my approach would be to have a
> celery task that uses the mozlog.structured.reader package to read in the
> log, apply the formatter and generate my summary.  I'll then save the
> summary as a job artifact just like we currently do for the non-structured
> logs.  

Yes, that sounds like exactly the right approach.

> I may be a little off on the steps, but perhaps I could get some assistance
> with getting that right?

I'm happy to help in any way I can.
Attachment #8560196 - Attachment is obsolete: true
Not pursuing this, it's pretty much covered by the errorsummary formatter.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
See Also: → 1821646