We currently use Cython for some (but not all) of our log parser code, in an attempt to improve performance. I'm curious to know how much of a perf benefit we're getting, and so whether it's worth using Cython with the rest of the log parser files (or any other parts of Treeherder)? Or alternatively if it's not really worth it (especially if we start to parse fewer logs given bug 1080760), we could just stop using Cython, which would: 1) Simplify the travis/vagrant/prod build 2) Reduce the number of steps/custom things someone contributing to Treeherder has to worry about (eg getting caught out by changes not taking affect until the log_parser files are rebuilt after changes are made to them) If we're only seeing a 5-20% perf improvement I'd be inclined to ditch Cython - obviously if it's giving us a 2x+ speedup, that's a different matter (and perhaps we should see if it's also helpful for parts of say builds-4hr handling).
I really wonder whether we're getting any substantial performance benefit from using cython, log parsing is basically just doing a lot of regexes, which is fast in python. Then again, someone decided that this was worth doing at some point so... I dunno. Maybe Cam or Mauro would have more to say. If you're interested in profiling this, you could use my test_parse_log management command: https://github.com/mozilla/treeherder/blob/master/treeherder/etl/management/commands/test_parse_log.py (be careful to control for the speed of downloading the actual logs, which can lead to dramatically different results -- consider perhaps downloading them in advance and log parsing against the local network)
Also the move from python 2.6 to 2.7 might have helped reduce the gap :-)
It appears that we're only speeding up the log parse by ~2% (this is total parse time, including download; I didn't use the log locally since the results were consistent enough it wasn't necessary). Cython... First batch of 50: 1.4435 seconds/log Later batch of 50: 1.4412 seconds/log Without Cython... First batch of 50: 1.4720 seconds/log Later batch of 50: 1.4711 seconds/log For full output, see: https://emorley.pastebin.mozilla.org/8838088 As such, I think we should just ditch Cython for the log parser (IMO anything less than say 20% isn't worth the added complexity).
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Priority: -- → P3
Created attachment 8627462 [details] [review] Stop using Cython to build the log parser
Attachment #8627462 - Flags: review?(wlachance)
Comment on attachment 8627462 [details] [review] Stop using Cython to build the log parser It's nice to reduce complexity where we can!
Attachment #8627462 - Flags: review?(wlachance) → review+
Commit pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/58813b0c51969b9471b8afe1b8624eaf46cde0c5 Bug 1169916 - Stop using Cython to build the log parser Since it only speeds up parsing by a few percent of total runtime, and is therefore not worth the added complexity for deployment and local hack-test-debug cycles when working on the log parser. The .gitignore and update.py entries will be removed in a later commit, once the stage/prod src directories have been cleaned up.
This looks fine to me on prod perf wise: https://rpm.newrelic.com/accounts/677903/applications/4180461/transactions?show_browser=false&tw[end]=1435683371&tw[start]=1435672571&type=other#id=5b224f746865725472616e73616374696f6e2f43656c6572792f70617273652d6c6f67222c22225d I've also cleaned up the build/ directory in the src dir on stage+prod.
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Summary: Evaluate our use of Cython (eg for the log parser) → Stop using Cython for the log parser if the performance gain is not significant
You need to log in before you can comment on or make changes to this bug.