Log lines containing "TALOSDATA" that are not talos results, cause 10 log parse retries & parser backlogs



4 years ago
4 years ago


(Reporter: wlach, Assigned: wlach)




(1 attachment)

The failures in question were:

[2015-07-27 14:47:35,208: INFO/MainProcess] Task parse-log[c2051f0d-6047-4af5-ac67-bf10f50956f7] retry: Retry in 600s: ValueError('Invalid TALOSDATA: 13:14:19     INFO -                 "comments": "Bug 1187082 - Ensure talos always produces TALOSDATA json structure in logs so perfherder can ingest data. r=jlund", \r\n',)
[2015-07-27 14:47:35,285: INFO/MainProcess] Received task: parse-log[7a59706a-60f2-4fb8-94f9-c16188c94b64] eta:[2015-07-27 14:57:35.283045-07:00]
[2015-07-27 14:47:35,645: INFO/Worker-9] Starting new HTTPS connection (1): treeherder.mozilla.org

so tl;dr we're retrying the log parse if we weren't able to json decode the TALOSDATA line. Now since we do an exponential back-off of log parsing retries (which is a good thing when ftp.m.o is down/slow) this just meant we had lots of log parser jobs backing up, since we do up to 10 retries.

We clearly shouldn't be retrying this case, since the TALOSDATA line content isn't going to change each time, so it's just going to fail 10 times over the course of ~1 hour, slowing everything up.

Will landed a hotfix for this on a branch, but we need to fix on master too. PR is at:

I think we should:
1) Make the TALOSDATA regex more strict
2) Either make the try-except not raise, or else make the task retry exclude this specific case.
Priority: -- → P1
Summary: treeherder breaks if there are lines in the log with TALOSDATA that don't correspond to TALOSDATA → Log lines containing "TALOSDATA" that are not talos results, cause 10 log parse retries & parser backlogs
Depends on: 1188351

Comment 3

4 years ago
Commit pushed to master at https://github.com/mozilla/treeherder

Bug 1188132 - Don't retry log parsing for invalid TALOSDATA log lines

Since if it failed the first time, it's always going to fail.
So I think this bug is partly an example of why blanket try/except is bad: we're retrying on errors we'll see every time. The problem isn't (just) that we're throwing an error, but how we're handling them. I would suggest that we should *only* retry on urllib2 errors (i.e. subclasses of URLError: https://docs.python.org/2/library/urllib2.html#urllib2.URLError): other errors (e.g. ValueError) indicate a permanent unresolvable problem.

Also, the original behaviour was dumb. We obviously shouldn't fail just because TALOSDATA appeared somewhere else in the log. :) The fix already applied is good, we just need to do a few more things to make sure errors like this don't recur in the future.
Yeah agreed; though I think we should also make the TALOSDATA regex more specific, so we don't waste time trying to json decode 400 jobs in a push, if the logs include the phrase. eg match against "TALOSDATA: { .+ }" not just "TALOSDATA".
I've just found the previously filed bug 1125104 which basically covers this case too.
Created attachment 8641867 [details] [review]
Further fixes/changes

Further fixes to make log parsing simpler and more robust in case of failure. I think most of this is self-explanatory, I guess the one exception being the simplification of TALOSDATA parsing: I found out in testing that the regular expression by itself was basically just as fast as the `"TALOSDATA" in foo` logic, so decided to just leave that in.

I think if we wanted to be extra careful here we'd write unit tests to cover this specific failure case. I'm inclined not to just because it would be rather complex for what's pretty simple (esp. now) and going away in the next few months anyway as we move away from log parsing internally.

There's no rush to review this.
Attachment #8641867 - Flags: review?(cdawson)
Comment on attachment 8641867 [details] [review]
Further fixes/changes

This all looks reasonable to me.
Attachment #8641867 - Flags: review?(cdawson) → review+
Duplicate of this bug: 1125104
Commit pushed to master at https://github.com/mozilla/treeherder

Bug 1188132 - Further fixes to make (talos) log parsing more robust

* Simplify logic in talos parser (there was an optimization which didn't
  save anything and just caused confusion before)
* Make it so if log parsing fails for a non-http reason, we don't try
Commit pushed to master at https://github.com/mozilla/treeherder

Bug 1188132 - Don't raise exception if 403/404 when donwloading log

This is somewhat expected (in the case of taskcluster), so we don't
want to hear about it in new relic
I think we're done here. :)
Last Resolved: 4 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.