Bug 1076770 (Closed): Profile the log parser to see if performance can be improved
Opened: 10 years ago
Closed: 9 years ago
Category: Tree Management :: Treeherder: Data Ingestion (defect, P4)
Status: RESOLVED INCOMPLETE
Reporter: emorley
Assignee: Unassigned
Broken out of bug 1074927.
There might be some further perf improvements we can make.
It would also be interesting to know how much each additional regex slows parsing down, and therefore whether it's worth spending time figuring out if any are no longer used.
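A per-pattern micro-benchmark would be one way to answer that question. A minimal sketch, where PATTERNS and SAMPLE_LINES are hypothetical stand-ins for the parser's real error patterns and log lines:

    import re
    import timeit

    # Hypothetical stand-ins for the parser's real patterns and log lines.
    PATTERNS = [
        r'TEST-UNEXPECTED-FAIL',
        r'\d+ bytes leaked \((.+)\)$',
    ]
    SAMPLE_LINES = [
        'TEST-UNEXPECTED-FAIL | leakcheck | tab process: 42114 bytes leaked (Foo)',
        'INFO - an ordinary log line that matches nothing',
    ]

    for pattern in PATTERNS:
        regex = re.compile(pattern)
        # Cost of running this one regex over the sample lines, repeated.
        elapsed = timeit.timeit(
            lambda: [regex.search(line) for line in SAMPLE_LINES],
            number=100000)
        print('%-35s %.3fs' % (pattern, elapsed))

Sorting the output by elapsed time should make it obvious which patterns are worth scrutinising first.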
Reporter
Updated•10 years ago
Priority: P2 → P3
Reporter
Updated•10 years ago
Component: Treeherder → Treeherder: Data Ingestion
Reporter
Comment 1•10 years ago
I know that in many places we've intentionally used .match() instead of .search(), since matching from the start of the string is faster.
However, there are places where we've added a '.*' to the start of the regex just so we can use .match(), and interestingly this actually seems to be slower than using .search().
e.g. (from bug 1121670):
>>> print timeit.timeit(stmt="r.match(s)",
... setup="import re; s = 'TEST-UNEXPECTED-FAIL | leakcheck | tab process: 42114 bytes leaked (AsyncLatencyLogger, AsyncTransactionTrackersHolder, AudioOutputObserver, BufferRecycleBin, CipherSuiteChangeObserver, ...)'; r = re.compile(r'.*\d+ bytes leaked \((.+)\)$')",
... number = 10000000)
43.355268762
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'TEST-UNEXPECTED-FAIL | leakcheck | tab process: 42114 bytes leaked (AsyncLatencyLogger, AsyncTransactionTrackersHolder, AudioOutputObserver, BufferRecycleBin, CipherSuiteChangeObserver, ...)'; r = re.compile(r'\d+ bytes leaked \((.+)\)$')",
... number = 10000000)
18.8647965157
-> So just over twice as fast to use .search() and drop the '.*' (using Python 2.7.9)
Using .+ doesn't seem to be a
Reporter
Comment 2•10 years ago
Bah, didn't mean to submit.
I was going to say that using '.+' doesn't seem to be as bad; however, it turns out what actually helped was adding a space between the '.*' or '.+' and the '\d+', e.g.:
>>> print timeit.timeit(stmt="r.match(s)",
... setup="import re; s = 'TEST-UNEXPECTED-FAIL | leakcheck | tab process: 42114 bytes leaked (AsyncLatencyLogger, AsyncTransactionTrackersHolder, AudioOutputObserver, BufferRecycleBin, CipherSuiteChangeObserver, ...)'; r = re.compile(r'.* \d+ bytes leaked \((.+)\)$')",
... number = 10000000)
19.0460573373
Anyway, I guess this shows we need to profile rather than assume, and that significant speedups are possible.
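For reference, the comparison above can be rerun as a self-contained script covering all three variants (same sample line as in the REPL sessions; timings vary by machine and Python version, so none are hardcoded here):

    import re
    import timeit

    LINE = ('TEST-UNEXPECTED-FAIL | leakcheck | tab process: 42114 bytes leaked '
            '(AsyncLatencyLogger, AsyncTransactionTrackersHolder, ...)')

    # The three variants discussed above.
    VARIANTS = [
        ("match '.*\\d+'",  'match',  r'.*\d+ bytes leaked \((.+)\)$'),
        ("match '.* \\d+'", 'match',  r'.* \d+ bytes leaked \((.+)\)$'),
        ("search '\\d+'",   'search', r'\d+ bytes leaked \((.+)\)$'),
    ]

    for label, method, pattern in VARIANTS:
        func = getattr(re.compile(pattern), method)
        elapsed = timeit.timeit(lambda: func(LINE), number=1000000)
        print('%-20s %.3fs' % (label, elapsed))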
Reporter
Updated•10 years ago
Priority: P3 → P4
Reporter
Comment 3•10 years ago
Using the New Relic thread profiler:
https://rpm.newrelic.com/accounts/677903/applications/5585473/profiles/1671082
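For local runs without New Relic, the stdlib profiler gives a comparable breakdown. A minimal sketch, where parse_line is a hypothetical stand-in for the parser's per-line entry point:

    import cProfile
    import pstats
    import re

    # Hypothetical stand-in for the log parser's per-line work.
    LEAK_RE = re.compile(r'\d+ bytes leaked \((.+)\)$')

    def parse_line(line):
        return LEAK_RE.search(line)

    def run():
        for _ in range(100000):
            parse_line('TEST-UNEXPECTED-FAIL | leakcheck | 42114 bytes leaked (Foo)')

    # Profile and print the ten most expensive calls by cumulative time.
    cProfile.run('run()', 'parser.prof')
    pstats.Stats('parser.prof').sort_stats('cumulative').print_stats(10)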
Reporter
Comment 4•9 years ago
Let's not worry about this unless we start getting backlogs, or log parsing tasks start appearing in the slow transaction traces in New Relic.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE