Closed Bug 1281180 Opened 9 years ago Closed 7 years ago

Consider replacing log parser with Rust implementation

Categories

(Tree Management :: Treeherder, defect, P3)


Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: jgraham, Unassigned)

References

Details

Typically the log parser is the slowest task in our data ingestion by some margin (a typical figure is 80% of all non-web transaction time). It is possible to optimise the actual download and processing part by rewriting it in Rust; my implementation, running on heroku with no real attempt at optimisation, shows about a factor of 2 difference for just the parsing, but there is obviously still some overhead from inserting the results into the db and so on:

[with rust]
~ $ ./manage.py test_parse_log http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux-debug/1466423484/mozilla-central_ubuntu32_vm-deb_test-web-platform-tests-e10s-3-bm08-tests1-linux32-build4.txt.gz --profile=10 --rust
Timings: [1.0024049282073975, 0.8330569267272949, 0.7861700057983398, 1.0473411083221436, 1.0301811695098877, 0.8356599807739258, 0.9065229892730713, 1.0190601348876953, 0.8524329662322998, 1.0475420951843262]
Average: 0.936037230492
Total: 9.36037230492

[no rust]
~ $ ./manage.py test_parse_log http://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-linux-debug/1466423484/mozilla-central_ubuntu32_vm-debug_test-web-platform-tests-e10s-3-bm08-tests1-linux32-build4.txt.gz --profile=10
Timings: [1.2806651592254639, 1.4643540382385254, 1.3752620220184326, 1.2319939136505127, 1.8426599502563477, 1.9552280902862549, 2.261673927307129, 2.199648141860962, 3.79407000541687, 3.1280570030212402]
Average: 2.05336122513
Total: 20.5336122513

There are of course a number of totally reasonable concerns with this idea, e.g.:

* It is a rewrite of a major component that is relatively untested, which suggests a high possibility of regressions and possible data loss. This is mitigated somewhat by the fact that the rewrite is almost a 1:1 mapping in the essential parts, but obviously better test coverage would be helpful here.
* It involves a new language that people would have to learn in order to contribute. On the other hand, many other Mozilla projects are starting to use bits of Rust, so it seems like an increasingly useful skill (and this code doesn't have any deep magic).
* The use of a binary blob makes deployment on heroku a bit of a pain. That's nothing unfixable, but there is some work involved (the current setup requires manually compiling the .so file on a heroku node, uploading it somewhere and then downloading it at deploy time; this could, for example, be scripted).

The code I have is at https://github.com/jgraham/th-logparser-rust
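For reference, the kind of repeated-run timing harness that `--profile=10` implies can be sketched in a few lines of Python. This is a stand-in (the `parse_log` function here is invented for illustration, not Treeherder's actual parser, and there is no download step):

```python
import time

def profile(parse, data, runs=10):
    """Time parse(data) over `runs` iterations, mimicking the
    Timings/Average/Total output shown above."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        parse(data)
        timings.append(time.perf_counter() - start)
    total = sum(timings)
    print("Timings:", timings)
    print("Average:", total / runs)
    print("Total:", total)
    return timings

# Stand-in "parser": count error lines in an already-downloaded log.
def parse_log(lines):
    return sum(1 for line in lines if "ERROR" in line)

timings = profile(parse_log, ["INFO ok", "ERROR bad"] * 10000)
```

Averaging over repeated runs like this smooths out dyno noise, which matters when the two implementations are only a factor of 2 apart.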
Oh, and I guess I should mention that I enabled this on heroku and it seems to work more or less OK… it's at least not producing a lot of errors.
Thank you for looking into this! A few thoughts off the top of my head:

* We used to use Cython, but removed it (bug 1169916), since at the time the perf win wasn't worth the build complexity (which was simpler than the proposed Rust parser solution) - however that was an older version, and we weren't using it for all of the log parser, only some of it.
* Before we perform any rewrite, we should probably add some more log parser tests.
* The current log parser implementation is rather crufty - there are things we plain don't need any more / should be simplified, eg the step data duplication + lots of now redundant regex due to the mozharness prefixes.
* I think refactoring to remove this cruft will be quicker/safer if performed against the current log parser, rather than first converting to Rust, since eg I'm much more familiar with the current implementation & very much not familiar with Rust. Plus once we switch to Rust it sounds like the build process may be more of a pain, so it would be great to not have as many changes planned after that.
* We should probably profile the current log parser (if you haven't already) to see if there are quick wins to be had. The refactor mentioned previously may affect the profile, so will need to happen first.

-> It seems like regardless of whether we decide to switch to Rust or not, there is some prep work we can do that will be helpful either way, at which point (a) we'll be able to make a more informed decision, (b) the conversion will be simpler & lower risk.

Some more things we'll likely want answers to:

* How does the overall log parsing speedup compare when looking at the DB ingestion times etc? (ie the 2x speedup above may not translate into a halving of the 80% of total non-web transaction time.)
* Whether it's possible to build the Rust log parser on the fly on {vagrant, travis, heroku} or not. Having to upload a file to iterate during development/deploy increases the hassle-threshold for making changes to the log parser - so in some cases we just won't bother working on it (particularly if we haven't yet had time to learn Rust).
* Whether Cython/PyPy or newer regex libraries (eg https://pypi.python.org/pypi/regex) or pyparsing (http://pyparsing.wikispaces.com/) would give us ballpark-similar wins for less build complexity.
* How much $$/month the extra dynos would cost versus the machine time saving this would give us.
* Whether we still see the future as having people like TaskCluster parse their own logs. If so, having the log parser be a separate repo with a more complex build process doesn't actually affect us as much.
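One cheap pure-Python experiment relevant to the "redundant regex" point above: measuring the per-line cost of testing many small patterns sequentially versus one combined alternation. The patterns below are invented for illustration, not Treeherder's actual ones:

```python
import re
import timeit

# Invented error markers standing in for the parser's real patterns.
patterns = [re.compile(p) for p in (
    r"^TEST-UNEXPECTED", r"^PROCESS-CRASH", r"^ERROR", r"^Automation Error")]
combined = re.compile(r"^(?:TEST-UNEXPECTED|PROCESS-CRASH|ERROR|Automation Error)")

lines = ["INFO something harmless"] * 1000 + ["ERROR boom"] * 10

def many():
    # Try each pattern in turn against every line.
    return sum(1 for line in lines if any(p.match(line) for p in patterns))

def one():
    # Single combined alternation per line.
    return sum(1 for line in lines if combined.match(line))

assert many() == one() == 10  # same matches either way
print("many patterns:", timeit.timeit(many, number=100))
print("one pattern:  ", timeit.timeit(one, number=100))
```

Collapsing patterns like this is exactly the sort of quick win that a cruft-removal refactor could surface before any rewrite decision is made.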
(In reply to Ed Morley [:emorley] from comment #2)
> Thank you for looking into this! A few thoughts off the top of my head:
>
> * We used to use Cython, but removed it (bug 1169916), since at the time the
> perf win wasn't worth the build complexity (which was simpler than the
> proposed Rust parser solution) - however that was an older version, and we
> weren't using it for all of the log parser, only some of it.

I imagine that Cython will help with loop dispatch overhead, but not regex performance, since that's all in C anyway. So I anticipate that the wins would be smaller for a slight reduction in complexity (you still have to learn Cython, and have the same deployment problems).

> * Before we perform any rewrite, we should probably add some more log parser
> tests.

Yes, the current tests aren't very useful; they are mostly testing layers of abstraction that I removed for speed (e.g. the ability to select just one parser) rather than the actual correctness of the data format parsing.

> * The current log parser implementation is rather crufty - there are things
> we plain don't need any more (/should be simplified, eg the step data
> duplication + lots of now redundant regex due to the mozharness prefixes).
> * I think refactoring to remove this cruft will be quicker/safer if
> performed against the current log parser, rather than first converting to
> Rust, since eg I'm much more familiar with the current implementation & very
> much not familiar with Rust. Plus once we switch to Rust it sounds like the
> build process may be more of a pain, so would be great to not have as many
> changes planned after that.

Yes, I agree that if we plan to simplify in the short term we should undoubtedly do that first.

> * We should probably profile the current log parser (if you haven't already)
> to see if there are quick wins that can be had. The refactor mentioned
> previously may affect the profile, so will need to happen first.
So I did profile it with cProfile, and the profile was not very illuminating: it just told you that it spent a lot of time doing regex matching and downloading data, which is hardly surprising. (I also didn't profile the Rust implementation, so I suspect there are some wins to be had there; e.g. if you are happy to allocate a static 50Mb buffer for each process you could experiment with decompressing the log into that directly and then making the rest of the parse zero-allocation. There is also room to make the way that data is passed between Rust and Python more efficient.)

> -> It seems like regardless of whether we decide to switch to Rust or not,
> there is some prep work that we can do that will be helpful either way, at
> which point (a) we'll be able to make a more informed decision, (b) the
> conversion will be simpler & lower risk.
>
> Some more things we'll likely want answers to:
>
> * How does the overall log parsing speedup compare, when looking at the DB
> ingestion times etc. (ie the 2x speedup for the above may not translate into
> a halving of the 80% of total non-web transaction times).

From comparing the average times on heroku-stage and heroku-prototype it may translate to a 25% speedup (i.e. "takes 75% of the time"). Note that this will change when we move away from datasource, because we will be able to control the number of db writes more easily, and the storage structure will also change.

> * Whether it's possible to build the Rust log parser on the fly on {vagrant,
> travis, heroku} or not. Having to upload a file to iterate during
> development/deploy increases the hassle-threshold for making changes to the
> log parser - so in some cases we just won't bother working on it
> (particularly if we haven't yet had time to learn Rust).

This is all very possible, although some of it is a little work. I have a travis script that compiles the parser already. One could certainly set up rust in vagrant and make it possible to rebuild the parser.
On heroku there are a couple of options; probably the best is to create a buildpack that installs the rust tooling onto the machines and does a build during deploy. Needless to say, similar buildpacks already exist, although they are mostly aimed at applications fully written in rust. The other option would be to do the build elsewhere (e.g. using travis, with a statically linked openssl) and download the .so file onto heroku.

> * Whether Cython/PyPy or say newer regex libraries (eg
> https://pypi.python.org/pypi/regex) or pyparsing
> (http://pyparsing.wikispaces.com/) would give us ballpark similar wins but
> for less build complexity.

I would be astonished if pyparsing was faster (and if it is, I would imagine there might be even greater wins from a similar approach in a non-python library). It is unclear to me whether the new regex module is supposed to be faster, but that could well be worth testing.

> * How much $$/month the extra dynos would cost for the machine time saving
> this would give us.
> * Whether we still see the future as having people like TaskCluster parse
> their own logs. If so, having the log parser be a separate repo with more
> complex build process doesn't actually affect us as much.
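For the record, the cProfile run discussed above can be reproduced in a few lines of stdlib Python. This sketch uses an invented stand-in parse function and log format, not Treeherder's actual ones:

```python
import cProfile
import io
import pstats
import re

# Invented log-line pattern standing in for the parser's real regexes.
ERROR_RE = re.compile(r"^\d+:\d+:\d+\s+ERROR - (.*)")

def parse(lines):
    """Stand-in parser: collect the message from each error line."""
    return [m.group(1) for m in map(ERROR_RE.match, lines) if m]

# Synthetic "log": mostly harmless lines, a few errors.
lines = ["00:01:02 INFO - running step"] * 50000 + ["00:01:02 ERROR - badness"] * 50

profiler = cProfile.Profile()
profiler.enable()
errors = parse(lines)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

As the comment above notes, on a workload like this the profile mostly attributes time to the regex C internals, which is why it offers few actionable hints beyond "the regexes are hot".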
Depends on: 1281463
(In reply to James Graham [:jgraham] from comment #3)
> I imagine that Cython will help with loop dispatch overhead, but not regex
> performance, since that's all in C anyway. So I anticipate that the wins
> would be smaller for a slight reduction in complexity (you still have to
> learn Cython, and have the same deployment problems).

There's little new to learn (if you use Cython with plain Python you get a perf win for free, and more if you add the type declarations), plus it doesn't require toolchain changes that mean we have to upload binaries somewhere else. (Not saying I know enough about it to really advocate for it over any other solution, however.)

Adding some deps that will clean up the profiles for both the existing parser and the rust one.
Depends on: 1280910, 1281058
Just as a heads up: I've removed the vendor-binaries buildpack from the treeherder-heroku Heroku app, to bring the configs of all three Heroku apps more in line (helpful when I run the config comparison/validation script for bug 1277304). Personally I prefer just fetching the archive as part of `bin/post_compile` -- to keep everything in one place, and reduce the number of dependencies we have. (The extra buildpack is unnecessary for us, since we're using a buildpack that provides a user-controllable script -- not all buildpacks have it, which is why I guess this extra buildpack exists.)
If we do this, I will create a buildpack that installs the rust compiler, after which we should be able to use the normal |pip install| approach to the log parser code. That has the advantage that we don't have to manually log on to heroku to compile, or link against a static openssl which would need to be kept up to date.
Yeah that would be great (and having a buildpack around for others to use will help Rust adoption too).
For my future reference, https://github.com/Hoverbear/heroku-buildpack-rust seems to be the latest prior art, but is focused on pure-rust web projects.
Also for future reference, https://github.com/jbaiter/python-rust-fst has all the setuptools bits needed for building rust libraries with python wrappers.
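In the same spirit as python-rust-fst, the setuptools wiring for building a Rust library during `pip install` can be sketched roughly as below. This is a hypothetical config fragment, not the actual setup of either project: the package name, version, and cdylib layout are all invented, and it assumes `cargo` is on the PATH (which is what the proposed Rust buildpack would provide):

```python
# Hypothetical setup.py sketch: compile the Rust cdylib via cargo as part
# of the normal Python build, so no prebuilt .so needs to be uploaded.
import subprocess
from setuptools import setup
from setuptools.command.build_py import build_py

class build_rust(build_py):
    """Run cargo before the regular Python build steps."""

    def run(self):
        # Assumes a Cargo.toml alongside setup.py that produces a cdylib.
        subprocess.check_call(["cargo", "build", "--release"])
        super().run()

if __name__ == "__main__":
    setup(
        name="th-logparser-rust",  # illustrative name only
        version="0.1",
        cmdclass={"build_py": build_rust},
    )
```

With something along these lines, `pip install` would trigger the cargo build on heroku/travis/vagrant, which is the "build on the fly" property asked about in comment 2.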
Priority: -- → P3
Component: Treeherder: Data Ingestion → Treeherder: Log Parsing & Classification
I think with the move to structured logging the overall size of the logs being parsed will go down (once we remove those parts from the main log), so the importance of a non-python parser will decrease. Given the complexity/build time this would add to deployments, the local environment, etc, let's close this for now - we can always revisit.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Component: Treeherder: Log Parsing & Classification → TreeHerder