Closed Bug 1343659 Opened 3 years ago Closed 3 years ago

Mozharness should use log level ERROR for rust panics lines

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect
Not set

Tracking

(firefox55 fixed)

VERIFIED FIXED
Tracking Status
firefox55 --- fixed

People

(Reporter: kats, Assigned: ahal)

References

Details

Attachments

(3 files, 1 obsolete file)

I would like lines like [1] to be recognized as errors in treeherder so that they show up in the failure summary and can be used to inform starring of bugs.

I wrote a patch but I don't think I can test it easily.

[1] https://treeherder.mozilla.org/logviewer.html#?job_id=79852537&repo=graphics&lineNumber=9313
Assignee: nobody → bugmail
Thank you for the PR!

Before we try this approach, would you mind seeing if mozharness can be made to output these lines with a log level of ERROR not INFO? (Since Treeherder already recognises mozharness ERROR prefixes)

ie at the moment the line is incorrectly using INFO:

[task 2017-02-24T04:20:20.368227Z] 04:20:20     INFO - ...

We're trying to move away from having to hardcode every variation of error string in the log parser where possible.
Comment on attachment 8842588 [details] [review]
[treeherder] staktrace:rustpanic > mozilla:master

Ed's suggestion would be ideal if it's not too difficult (unfortunately I suspect it is).

Otherwise I think this should be fine to go in with a unit test (see PR for details).
Attachment #8842588 - Flags: review?(wlachance)
Thanks all! I'll try to do the change in mozharness along the lines of what Wes posted in his patches. Although his patch seems to modify the "logcat" error strings which sounds like it's Android-specific?
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #7)
> Thanks all! I'll try to do the change in mozharness along the lines of what
> Wes posted in his patches. Although his patch seems to modify the "logcat"
> error strings which sounds like it's Android-specific?

Yeah that doesn't seem right to me. Maybe we need some kind of new category for these types of error strings.
Well, I tried the patch at [1], and the job at [2] has the rust panic line in the output but it didn't get flagged as an error. Is there a way to test this locally? Trial-and-error with try pushes is going to be quite painful.

[1] https://hg.mozilla.org/try/rev/3a255d30ece138d2bab3072dc9069a51a15f6db7
[2] https://treeherder.mozilla.org/logviewer.html#?job_id=81233456&repo=try&lineNumber=2289
Andrew, can you help Kats with this? See comment 10.
Flags: needinfo?(ahalberstadt)
Component: Treeherder: Log Viewer → Mozharness
Product: Tree Management → Release Engineering
QA Contact: jlund
Summary: TreeHerder should recognize rust panics as errors → Mozharness should use log level ERROR for rust panics lines
Attachment #8842588 - Attachment is obsolete: true
Kats, I think you have it right but it looks like your regex doesn't match because the thread name is single-quoted. Try using:
thread '(\w+)' panicked

Unfortunately there's not a great way of testing this locally. I usually write a script that manually imports and mocks out an OutputParser object, and pipe logs into it from my mach runs.. But it's not really more convenient than pushing to try.
Flags: needinfo?(ahalberstadt)
(In reply to Andrew Halberstadt [:ahal] from comment #12)
> Kats, I think you have it right but it looks like your regex doesn't match
> because the thread name is single-quoted. Try using:
> thread '(\w+)' panicked

That didn't work either. Try push at [1], log at [2]. I verified that the regex matches by running this in my shell:
  python -c "import re; print re.compile(r'''thread '([^']+)' panicked''').match(r'''thread 'RenderBackend' panicked at 'called''').group(0)"

[1] https://hg.mozilla.org/try/rev/f85149064dc88f99baf8e57d322bcc6b20df959f
[2] https://treeherder.mozilla.org/logviewer.html#?job_id=81315988&repo=try&lineNumber=2252
Ok, I see why adding a regex to the errorlist isn't working. That log message is being logged via the 'info' level from the structured logger. For example, check out the raw log:
https://public-artifacts.taskcluster.net/fHFkweGmTOOzETCWHMbJuw/0/public/test_info//browser-chrome-gpu_raw.log

Now ctrl-f for 'panicked' and you'll notice it's a 'log' action with 'level': 'info'. Mozharness/treeherder will automatically highlight anything logged with 'error' or above. The 'errorlist' regexes are only used to catch errors in unstructured output. Do you happen to know where that rust output is being logged from?
(In reply to Andrew Halberstadt [:ahal] from comment #14)
> Do you happen to know where that rust output
> is being logged from?

The rust output is spit out directly from the firefox binary to stderr, as far as I know. I'm not sure what all the different harness layers are doing with the output along the way.
I spy some mochitest stupidity.. I'm going to file a bug that blocks this, though it might take a bit of time to get right. So if you're in a rush to get this landed and wlach/emorley are ok with it, I'd recommend landing your original patch if possible.

Ed, would it be alright to re-open/land the original pull request for now?
Flags: needinfo?(emorley)
I'm not in a particular rush to get this landed. I can wait a bit.
Flags: needinfo?(emorley)
Depends on: 1344346
Now that bug 1344346 is in m-c, I'll try the patch at [1] again. Let me know if there's more things that need to be fixed first.

[1] https://hg.mozilla.org/try/rev/f85149064dc88f99baf8e57d322bcc6b20df959f
Still not working: https://treeherder.mozilla.org/logviewer.html#?job_id=83460775&repo=try&lineNumber=17598

The same log later has TEST-UNEXPECTED and PROCESS-CRASH, both of which are also "INFO" level despite already being listed as ERROR-level substrings at https://hg.mozilla.org/try/rev/9470d7dab8d2cc5914763a5dbd5b5a5f18e45b0d#l1.10
^
Flags: needinfo?(ahalberstadt)
Crap, sorry.. thought that would fix it.

I'm PTO until the end of May, so re-directing to jgraham.
Flags: needinfo?(ahalberstadt) → needinfo?(james)
Right, this is being logged as process_output from the reftest harness. Which seems prety reasonable, but I think you now need to land your changes in https://hg.mozilla.org/try/rev/3a255d30ece138d2bab3072dc9069a51a15f6db7 so that mozharness flags this process_output as an error. let me know if that doesn't work.
Flags: needinfo?(james)
If I understand you correctly, that's exactly what I did in the try push in comment 19, but it still didn't work. Or has more stuff landed in terms of changing the logging since then?

I've kicked off another try push anyway, just to confirm: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf051186a554bdc95d6354907f0ecd014300f5ac
Ah, sorry I forget that people put actual code in the try-cyntax commit (mach try is your friend!). I'll investigate again.
Flags: needinfo?(james)
Any update here, James?
So the StruturedOutputParser wasn't using this code at all for lines with action="log". The branch in https://treeherder.mozilla.org/#/jobs?repo=try&revision=6b689b80021d22680b2f0da99cc36dfcc0d4f0f6 changes that, which I think will fix this case (the old logs were lost so I can't easily work out if this applies to the original problem).

Can you rerun something that is expected to fail on top of the above changes?
Flags: needinfo?(james)
Flags: needinfo?(bugmail)
I kicked off a try push with your patches from the try run above plus a webrender change that should trigger a panic.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d3fc73392c4d94eab0103b82b76178e4ddb29f3a

Leaving needinfo on me until I confirm the panic-inducing change works as intended :)
Oh.. it looks like there's an error in one your patches:

[task 2017-04-27T14:48:19.241204Z] Exception in thread Thread-1:
[task 2017-04-27T14:48:19.241282Z] Traceback (most recent call last):
[task 2017-04-27T14:48:19.241410Z]   File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
[task 2017-04-27T14:48:19.241448Z]     self.run()
[task 2017-04-27T14:48:19.241490Z]   File "/usr/lib/python2.7/threading.py", line 754, in run
[task 2017-04-27T14:48:19.241621Z]     self.__target(*self.__args, **self.__kwargs)
[task 2017-04-27T14:48:19.241681Z]   File "/home/worker/workspace/mozharness/mozprocess/processhandler.py", line 733, in _processOutput
[task 2017-04-27T14:48:19.241720Z]     self.processOutputLine(line.rstrip())
[task 2017-04-27T14:48:19.241805Z]   File "/home/worker/workspace/mozharness/mozprocess/processhandler.py", line 697, in processOutputLine
[task 2017-04-27T14:48:19.241836Z]     handler(line)
[task 2017-04-27T14:48:19.241885Z]   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 1387, in processOutput
[task 2017-04-27T14:48:19.241960Z]     parser.add_lines(line)
[task 2017-04-27T14:48:19.242070Z]   File "/home/worker/workspace/mozharness/mozharness/base/log.py", line 356, in add_lines
[task 2017-04-27T14:48:19.242124Z]     self.parse_single_line(line)
[task 2017-04-27T14:48:19.242182Z]   File "/home/worker/workspace/mozharness/mozharness/mozilla/structuredlog.py", line 90, in parse_single_line
[task 2017-04-27T14:48:19.242366Z]     level = error_check.get(level, getattr(log, data["level"].upper()))
[task 2017-04-27T14:48:19.242417Z] AttributeError: 'NoneType' object has no attribute 'get'
Flags: needinfo?(bugmail)
So as far as I can tell the string you are matching never appears in the logs.

Also note that there's a slight issue with this approach; the error will be found by the treeherder log parser but won't end up  in the errorsummary.json file. If you wanted that you would have to change all the harnesses.
Flags: needinfo?(bugmail)
(In reply to James Graham [:jgraham] from comment #31)
> So as far as I can tell the string you are matching never appears in the
> logs.

So I tried a few different approaches to triggering the panic. All of them work and produce the string when I run the reftests locally, but they don't show up in the log I see from treeherder. Considering that the string is output by the firefox binary itself, I'm pretty convinced that it's getting lost or eaten somewhere along the way.

The way I'm running it locally is `mach reftest layout/reftests/reftest-sanity` (on a build with gfx.webrender.enabled=true), if that makes any difference.
Flags: needinfo?(bugmail)
Ok, I found another way to trigger the panic, and it is showing up in the raw log but not in the TH summary:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=0164e3fed763e3d2f2bdf98f8677c86d6d131c8d&selectedJob=97110590
Flags: needinfo?(james)
I think I see what's going on. In bug 1344346 I started running 'process_output' actions through the error lists, but I made sure the base OutputParser class didn't do any actual printing because we still wanted to go through the proper formatter. Because I skipped this step, the 'process_output' still won't be printed with ERROR at the front. Essentially bug 1344346 succeeded in using 'process_output' to determine the worst_level (used for things like the overall return code of the run), but didn't modify the output at all.

I have a patch I'll run through try before submitting.
Flags: needinfo?(james)
This patch [1] seems to get it working. However, in my try run [2], the panic line reads:
GECKO(4580) | thread '<unnamed>' panicked at 'called `Result::unwrap()`

Which doesn't get matched by the '(?:\w+)' part of the regex. I'm not sure why it doesn't know the thread name. Simply using .* does produce an ERROR though. I think I'll file a new bug to get the patch in [1] landed, then we can continue using this one to get the proper regex added.

[1]: https://hg.mozilla.org/try/rev/5652f5690f71a09009d3f135fb8384838faf6abd
[2]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=50e486301204a7d5a86b401eefa4be6c2909323f&selectedJob=101589582
Depends on: 1367745
Btw, here's a try run using .+ instead, finally with the panicked line highlighted:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=45983434984766aa0f0aa1219a1d62c2da0f5789&selectedJob=102021643

I guess the original '([^']+)' would have worked too.. Either way, we can land whatever regex we want after bug 1367745 is fixed. Should have a patch up over there soon.
Of course now that all the pieces are aligned, I wasn't able to reproduce the kernel panic to verify the fix:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=195c164fceeaa2f362e7224a2164df3e95432058

This was working in comment 36, though the regex there was slightly different. I tested this regex on the command line so I'm pretty confident it'll work as well.
Assignee: bugmail → ahalberstadt
Comment on attachment 8873081 [details]
Bug 1343659 - [mozharness] Flag rust kernel panics as an ERROR,

https://reviewboard.mozilla.org/r/144534/#review148388

Ok. I'm happy with landing this. If I run across a problem with panics not being detected we can always revisit this.
Attachment #8873081 - Flags: review?(bugmail) → review+
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/72dfa41fadc1
[mozharness] Flag rust kernel panics as an ERROR, r=kats
https://hg.mozilla.org/mozilla-central/rev/72dfa41fadc1
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
I have try pushes where this is verified working. Thanks :ahal!
Status: RESOLVED → VERIFIED
Many thanks for persevering with this Andrew!
You need to log in before you can comment on or make changes to this bug.