Closed Bug 1816144 Opened 2 years ago Closed 2 years ago

reduce the number of text_log_error messages we store

Categories

(Tree Management :: Treeherder: Data Ingestion, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

currently it seems that we hit a limit of 100 error lines that can be ingested from a single job. These are stored in the text_log_error table.

I would like to see this reduced to something more like 50. I am not sure what value we get from having even >20 lines in the failure summary tab, and the few use cases that need more could be served by reading the raw log.

I am not sure where we set the limit of 100. we add to TextLogError here:
https://github.com/mozilla/treeherder/blob/9581acd8b5f367158fa12070c47e5a909878b0e7/treeherder/etl/artifact.py#L20

from what I can tell we do more of the raw log parsing here:
https://github.com/mozilla/treeherder/blob/aa078b35371d42e337617408763ddd4bb57a36e8/treeherder/log_parser/tasks.py#L112

and this seems to be more of where the parsing happens:
https://github.com/mozilla/treeherder/blob/cfb765d7b222b8e6a09e08187c0e98c4fb933135/treeherder/log_parser/artifactbuildercollection.py#L15

and the line level parser seems to be here:
https://github.com/mozilla/treeherder/blob/4c5ffec2f7ff06bf6b77f77534586c30bd9061ca/treeherder/log_parser/parsers.py#L128

with a setting here:
https://github.com/mozilla/treeherder/blob/310dbc9c39cf00428e0d689cf0cb81c6d0b194b9/treeherder/config/settings.py#L438

theoretically we could fix this real quick!
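For illustration, here is a minimal sketch of how a single settings value could cap the number of error lines kept per job. The names `MAX_ERROR_LINES` and `ErrorLineCollector` are hypothetical and are not taken from the Treeherder code linked above; the real parser and setting may be structured differently.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for the limit defined in config/settings.py.
# The real setting name is not given in this bug.
MAX_ERROR_LINES = 50


@dataclass
class ErrorLineCollector:
    """Keeps at most `max_lines` error lines for one job and counts the rest."""

    max_lines: int = MAX_ERROR_LINES
    lines: list = field(default_factory=list)
    dropped: int = 0

    def add(self, line_number: int, text: str) -> bool:
        """Store the error line if the cap allows it; return True when stored."""
        if len(self.lines) >= self.max_lines:
            self.dropped += 1
            return False
        self.lines.append((line_number, text))
        return True


# Simulate a job whose log contains 120 error lines.
collector = ErrorLineCollector()
for n in range(1, 121):
    collector.add(n, f"TEST-UNEXPECTED-FAIL | test_{n}.js | simulated failure")

print(len(collector.lines), "stored,", collector.dropped, "dropped")
```

With a cap like this, lowering the limit from 100 to 50 is a one-line change, and only the first 50 error lines of a job would ever be written to storage.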

:aryx, do you have any concerns with reducing the number of lines that we parse/store? today it is 100, and I think 50 would go far to help reduce queries. is there a different number that might be more accurate?

Flags: needinfo?(aryx.bugmail)

nudge

Can we re-evaluate once the wpt output reduction has been deployed? Code sheriffs run daily into situations in which the 100 failure lines are insufficient and they have to open the raw log and find the actual failure. With the current logging, lowering the limit would make the situation worse. Let's see how many tasks will still have >100 (and >50) failure lines for successful runs.

The limit on the number of failure lines gets set in config/settings.py.

Flags: needinfo?(aryx.bugmail)

can you give a few examples (links) of tasks that have many lines (50+) where the sheriffs still need to dig into the log file?

Flags: needinfo?(aryx.bugmail)

I took a look at 2.5 days worth of failures where we had annotated data and the bugid showed up in the bug suggestions. In total there are 96 instances where we had >=20 failures in the failure summary display in treeherder; of these, 2 had annotations at >= line 20:
https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=f60ZEN2PS6GMYyKb7qlDyg&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunning%2Cpending%2Crunnable&revision=523c4bab3f1816469848ddd2fab7d466a69a0eb3&searchStr=wpt13 (line 23)
https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=AIixVhfvSvG1EgBu1kdGXw&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunning%2Cpending%2Crunnable&revision=380a2f111a72a2d78155056877f6c35e99bfeb92&searchStr=wpt14 (line 20)

This works out to about 2% of the annotations needing >20 lines of text.

I will look at 1 month of data.

Attached file jobids
looking at a month of data, I find 1860 instances where >20 lines of text existed and 63 jobs where we annotated >20 lines (see the attached jobids file).

looking at a month of data, I find 1860 instances where >20 lines of text existed and 64 jobs where we annotated after the 20th line. this means that if we set the max lines at 20 instead of 100, 3% of the tasks would miss out on the proper annotation; if we set it at 50, then 15 tasks would be missing the proper annotation, or <1%.
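The percentages quoted in this thread can be reproduced with a few lines of arithmetic. This is only a worked check of the numbers given in the comments; the variable names are made up for the example, and the denominator is the set of jobs with more than 20 failure lines, not all jobs.

```python
def share(missed: int, total: int) -> float:
    """Percentage of jobs whose annotated line would fall past the cap."""
    return 100.0 * missed / total


# 2.5-day sample: 96 jobs with >=20 failure lines, 2 annotated at line 20 or later.
sample_pct = share(2, 96)

# One-month sample: 1860 jobs with >20 failure lines.
month_total = 1860
pct_cap_20 = share(64, month_total)  # annotated after the 20th line
pct_cap_50 = share(15, month_total)  # annotated after the 50th line

print(f"2.5-day sample, cap 20: {sample_pct:.1f}%")
print(f"1-month sample, cap 20: {pct_cap_20:.1f}%")
print(f"1-month sample, cap 50: {pct_cap_50:.1f}%")
```

Both one-month figures are relative to jobs that already have long failure summaries, so the share of all failed jobs affected by either cap would be smaller still.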

here are the tasks annotated >50:
https://firefox-ci-tc.services.mozilla.com/tasks/aA7LIMhLTdSnpQgmySlrXQ (bunch of crash on network access, and bug about leak that was short lived)
https://firefox-ci-tc.services.mozilla.com/tasks/OUv_rqS-R7OleiFyslw3EA (bunch of crash on network access, and bug about leak that was short lived)
https://firefox-ci-tc.services.mozilla.com/tasks/Ekw1hBR5Sxm73MEYEgKxsA (after bug 1815965, 82 reduced lines - would be 12th line)
https://firefox-ci-tc.services.mozilla.com/tasks/BYw2UVUDSvabXWPaC9u1fw (after bug 1815965, 76 reduced lines - would be 2nd line)
https://firefox-ci-tc.services.mozilla.com/tasks/B1jRiMC3Q3OVtw42A0NlvQ (after bug 1815965, 68 reduced lines - would be 2nd line)
https://firefox-ci-tc.services.mozilla.com/tasks/QtpdaGYmSfKaVso22V6Ywg (after bug 1815965, 61 reduced lines - would be 6th line)
https://firefox-ci-tc.services.mozilla.com/tasks/Tr_hlvvuQfWsDoA8KFHKfA (after bug 1815965, 62 reduced lines - would be 5th line)
https://firefox-ci-tc.services.mozilla.com/tasks/f_XBYRD8RC6-M8-PKQ_Nkw (after bug 1815965, 50 reduced lines - would be 11th line)
https://firefox-ci-tc.services.mozilla.com/tasks/S1scRm_cRoigsYYUCToHpA (after bug 1815965, 56 reduced lines - would be 2nd line)
https://firefox-ci-tc.services.mozilla.com/tasks/b4qlqQUdTjCkNZ7pT5HnfA (after bug 1815965, 44 reduced lines - would be 11th line)
https://firefox-ci-tc.services.mozilla.com/tasks/A23AfnfwS-u8GiNI9VuehA (after bug 1815965, 39 reduced lines - would be 16th line)
https://firefox-ci-tc.services.mozilla.com/tasks/Sb-WpvQtTWCpcUbnzdCCUg (after bug 1815965, 49 reduced lines - would be 6th line)
https://firefox-ci-tc.services.mozilla.com/tasks/anAN66QST2WobJwjHraHoQ (after bug 1815965, 44 reduced lines - would be 10th line)
https://firefox-ci-tc.services.mozilla.com/tasks/Gb0ero7_TvWXOwU3ClD0oA (installer error)
https://firefox-ci-tc.services.mozilla.com/tasks/IcWh7v1lQvizciv1vckz2w (after bug 1815965, 41 reduced lines - would be 12th line)

all in all, we had 1 task that had a useful message >50 (after bug 1815965).
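As a rough cross-check of the list above, the "would be Nth line" positions quoted for the tasks re-examined after bug 1815965 can be compared against a candidate cap. The three tasks without a quoted position (the two network-access crashes and the installer error) are left out, and the helper name is hypothetical.

```python
# Annotated line positions after the output reduction from bug 1815965,
# copied from the "would be Nth line" notes in the list above.
positions_after_reduction = [12, 2, 2, 6, 5, 11, 2, 11, 16, 6, 10, 12]


def missed_under_cap(positions, cap):
    """Return the positions that a cap of `cap` stored lines would cut off."""
    return [p for p in positions if p > cap]


for cap in (20, 50):
    missed = missed_under_cap(positions_after_reduction, cap)
    print(f"cap={cap}: {len(missed)} of {len(positions_after_reduction)} "
          "tasks would miss their annotated line")

print("deepest annotated line:", max(positions_after_reduction))
```

For these 12 tasks the annotated line never falls deeper than line 16 once the reduced logging is in place, so none of them would be affected by a cap of 50, or even 20.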

On the flip side there are >300 jobs (in the same batch of 1860) that have annotations that do not show up in the suggestions, some examples:
https://firefox-ci-tc.services.mozilla.com/tasks/L0xCVk5nQxCoH5W2vwFoPg (new bug - 1st line)
https://firefox-ci-tc.services.mozilla.com/tasks/JFk_NGWOT46yxviD4Ea3SA (not related to any bug suggestion or new bug)
https://firefox-ci-tc.services.mozilla.com/tasks/FwMP8TulQQOZwdhmOhM8fA (not related to any bug suggestion or new bug)
https://firefox-ci-tc.services.mozilla.com/tasks/VQEHGcALRvSGMobcbq9Ehg (not related to any bug suggestion or new bug)
https://firefox-ci-tc.services.mozilla.com/tasks/P_4-JHJUR4Saqnk0npH0ng (not related to any bug suggestion or new bug)

I think the risk of reducing to 50 lines or lower is minimal, especially given the sheriffs are already having to look in the log file for 3% of the extreme failures. Going to 20 lines might have more issues that require log searching, but keeping it at 50 lines seems very safe.

this has landed and been pushed to production

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(aryx.bugmail)
Resolution: --- → FIXED