Closed Bug 1372668 Opened 8 years ago Closed 7 years ago

[mozlog] Limit the maximum size of errorsummary.log

Categories

(Testing :: Mozbase, enhancement, P3)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: emorley, Unassigned)

References

Details

Attachments

(1 file)

This error summary was 29MB: https://public-artifacts.taskcluster.net/NKnlpYdfS9qti_DlHtevpQ/0/public/test_info//reftest-stylo_errorsummary.log

Ideally we'd either have a max file size for errorsummary.log, or else a max length for each json key/value.

(The impact of the size of this file is increased due to bug 1347956.)
I wonder if just not storing the stackwalk stdout/stderr would be enough.
See Also: → 1372922
I don't think we can limit the errorsummary.log by size from mozlog; that would probably have to be whatever does the uploading to S3. In mozlog we could limit the number of errors, or what gets printed (e.g. stack traces). I'm not sure we'd want to do either of those things though. I'm more interested in what the hell is going on with that log you linked! I filed bug 1372922 to look into it.
(In reply to Andrew Halberstadt [:ahal] from comment #3)
> I don't think we can limit the errorsummary.log by size from mozlog

It wouldn't fit in a concise way in the bug summary, but I was thinking also:

(In reply to Ed Morley [:emorley] from comment #0)
> ...or else a max length for each json key/value.

This feels a bit more like something that mozlog should enforce? (Either fail loudly if field size >X, or else truncate the fields.)
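As an aside, the per-field limit floated above could be sketched like this. This is not mozlog's actual API; the helper name, the 1000-character cap, and the `<truncated>` marker are all illustrative assumptions.

```python
import json

MAX_FIELD_LEN = 1000  # illustrative cap, not a value from mozlog

def truncate_fields(data, limit=MAX_FIELD_LEN):
    """Recursively truncate oversized string values before a log
    record is serialized, so no single field can bloat the file."""
    if isinstance(data, dict):
        return {k: truncate_fields(v, limit) for k, v in data.items()}
    if isinstance(data, list):
        return [truncate_fields(v, limit) for v in data]
    if isinstance(data, str) and len(data) > limit:
        return data[:limit] + "<truncated>"
    return data

# A record with an oversized field shrinks; small fields pass through.
record = {"stackwalk_stdout": "x" * 5000, "status": "CRASH"}
compact = json.dumps(truncate_fields(record))
```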
I think I see what is going on in bug 1372922 comment 1. The main reason the log is big isn't that a particular field is "too long", but rather those messages get logged over and over again. I believe this is due to bugs in the reftest harness (note the job associated with that log was on try, and likely someone had accidentally made Firefox crash, ergo every test crashes). I think the errorsummary formatter is doing everything right, and I'm not sure we should change it. There is definitely an improvement we can make to mozlog's error handling though that should cut this log down by *a lot*. That coupled with a fix or two to the reftest harness should make these types of logs disappear.
Flags: needinfo?(ahalberstadt)
Ed, the attached patch should fix the main culprit that was causing log bloat in that 236MB log. Though it's still possible for large errorsummary logs if every test crashes and leaves a stack trace.. so feel free to add [leave-open] if you want.
Comment on attachment 8877641 [details]
Bug 1372668 - [mozlog] Limit the size of data logged from error message when suite_start called multiple times

https://reviewboard.mozilla.org/r/149098/#review153496

::: testing/mozbase/mozlog/mozlog/structuredlog.py:258
(Diff revision 1)
>      def _ensure_suite_state(self, action, data):
>          if action == 'suite_start':
>              if self._state.suite_started:
> +                # limit data to reduce unnecessary log bloat
> +                msg = json.dumps(data)

`json.dumps(data)[:100]` should be fine, right?
Attachment #8877641 - Flags: review?(james) → review+
Comment on attachment 8877641 [details]
Bug 1372668 - [mozlog] Limit the size of data logged from error message when suite_start called multiple times

https://reviewboard.mozilla.org/r/149098/#review153496

> `json.dumps(data)[:100]` should be fine, right?

True... not sure why I thought that would fail.
Pushed by ahalberstadt@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f23dc18a0316 [mozlog] Limit the size of data logged from error message when suite_start called multiple times, r=jgraham
(In reply to Andrew Halberstadt [:ahal] from comment #7)
> Though it's still possible for large errorsummary
> logs if every test crashes and leaves a stack trace..

In that case there would be thousands of lines in the file though, surely? This is easier for Treeherder to handle, since it already aborts after 20 lines (we don't yet, but could also start streaming these files to save downloading the whole thing).

The more problematic case is where an individual line in these files contains fields that are enormous. Running some analysis on this file (using https://emorley.pastebin.mozilla.org/9024605), I see there are 64 `message` attributes and 65 `stackwalk_stdout` attributes whose length is over 1000 characters, with the average length being 3430128 and 231227 characters respectively.

I'm not sure 3 million characters in a log message will ever be useful, so I wonder if it should be truncated more generically in mozlog, rather than just in _ensure_suite_state()?

(I know this is a partial repeat of comment 4 - I'm just wondering if you saw that this file only had 300-ish log entries and not thousands?)
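The kind of per-field analysis described above (the pastebin script itself is not reproduced here) can be sketched as follows. Assumptions: errorsummary.log is newline-delimited JSON, and the field names and threshold are illustrative.

```python
import json

def oversized_fields(path, threshold=1000):
    """Scan a newline-delimited JSON log and, for each field whose string
    value exceeds `threshold` characters, report (count, average length)."""
    counts, totals = {}, {}
    with open(path) as f:
        for line in f:
            try:
                record = json.loads(line)
            except ValueError:
                continue  # skip non-JSON lines
            if not isinstance(record, dict):
                continue
            for key, value in record.items():
                if isinstance(value, str) and len(value) > threshold:
                    counts[key] = counts.get(key, 0) + 1
                    totals[key] = totals.get(key, 0) + len(value)
    return {k: (counts[k], totals[k] // counts[k]) for k in counts}
```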
Good points. The patch that just landed will reduce the former from 3430128 characters to 100. Those logs were completely useless and were an order of magnitude larger than the stackwalk_stdout. The stackwalk_stdout isn't *completely useless*, but maybe you're right that it's not worth storing in the errorsummary. We could truncate stacks in the errorsummary formatter if desired (and if jgraham, who is also consuming them, agrees). But I'd be very opposed to a general "truncate logs" mechanism in the core mozlog module. FYI, the reason we are seeing such big logs is because of bug 1344991. That causes reftest to restart and keep running tests in the event of a crash. However, if we run it on a crashy Firefox (e.g. on try), it will produce a crash stack for every test.
wpt will also restart and continue if it crashes, so could have the same problem. That's more or less by design, because otherwise there would need to be some heuristic about how many crashes are allowed. I don't know if stackwalk stderr is ever useful, and I don't have any objections to truncating stdout at some reasonable level (we don't really use that data yet, and we could always reverse the change if we did start to use it).
Whiteboard: [leave-open]
Keywords: leave-open
Whiteboard: [leave-open]
Priority: -- → P3

The leave-open keyword is there and there is no activity for 6 months.
:gbrown, maybe it's time to close this bug?

Flags: needinfo?(gbrown)

:ahal - Did you want to follow-up here? (It's OK with me to close this.)

Flags: needinfo?(gbrown) → needinfo?(ahal)

Yeah, let's close. I think the original condition that caused this was a bug that is now fixed.

Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(ahal)
Resolution: --- → FIXED

The original issue that resulted in large logs was bug 1372922, which was fixed.

This bug was filed separately from that, about setting a limit, so if we're not doing that the resolution should be WONTFIX.

Treeherder will start rejecting large logs at some point, since they still cause backlogs every now and again (bug 1295997). However, it's going to have to do so fairly indiscriminately [1], it won't prevent the cost of S3 uploads/storage, and it isn't great for UX, since jobs won't be marked as failed; Treeherder will just report that it didn't attempt to parse them.

When that happens it may be that the priority of this increases, but I guess the bug could always be reopened at that point?

[1] Since it can only use the Content-Length header prior to streaming the log, which is not an entirely reliable indicator given that some assets are gzipped and others not, and even large logs can compress down to a small size if the log spam happens to be the same line repeated, etc.

Resolution: FIXED → WONTFIX
See Also: → 1295997
