Bug 1192976 - Change perfherder database schema for performance and extensibility
Status: RESOLVED FIXED
Product: Tree Management
Classification: Other
Component: Perfherder
Assigned To: William Lachance (:wlach) (use needinfo!)
Blocks: 1157687 1178641 1200716 1214618
Reported: 2015-08-10 12:28 PDT by William Lachance (:wlach) (use needinfo!)
Modified: 2015-10-14 05:39 PDT


Attachments
PR (46 bytes, text/x-github-pull-request)
2015-08-12 15:56 PDT, William Lachance (:wlach) (use needinfo!)
mdoglio: review+
emorley: feedback+
SQL for new performance artifact tables (3.19 KB, text/plain)
2015-08-31 13:51 PDT, William Lachance (:wlach) (use needinfo!)
scabral: feedback+
Remove legacy code (46 bytes, text/x-github-pull-request)
2015-09-16 06:46 PDT, William Lachance (:wlach) (use needinfo!)
emorley: review+

Description William Lachance (:wlach) (use needinfo!) 2015-08-10 12:28:07 PDT
Currently the way we store data in perfherder is rather suboptimal:

* It uses Datasource, a proprietary database abstraction layer that is difficult to reason about and use. This makes it hard to onboard new contributors and hurts even core developer productivity.
* We store the data redundantly for each time interval we support (1 day, 1 week, 2 weeks, etc.). This impacts scalability, particularly where disk space is a concern like treeherder stage.
* Result set and job information for performance datums can only be looked up by downloading a complete set of performance data covering the time intervals when the jobs for those result sets were run. This can be rather slow, especially for something like the compare view, where we might want to download 20-30 pieces of data at a time.

To solve these problems, I'd like to experiment with using the Django ORM and storing performance datums in their own specialized table, with columns for result set information, job information, project, push timestamp, and performance signature. In addition to being more space efficient and easier to work with, this would also allow the following:

* Look up performance data for a particular job almost instantly (useful for bug 1186191 and probably other things)
* Look up all talos results for each resultset almost instantly (useful for compare view)
* Look up arbitrary intervals of performance data without downloading everything (useful for data mining)

There are probably even more advantages to this approach that I haven't thought of. Let's give it a try.
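[Editor's note: a minimal sketch of the proposed layout. These are hypothetical names, not the actual treeherder Django models; an in-memory dict stands in for the database indexes to illustrate the "almost instant" lookup paths the comment describes.]

```python
from dataclasses import dataclass
from collections import defaultdict

# One row per performance datum, with the lookup keys from the
# proposal (result set, job, project, push timestamp, signature)
# as plain columns instead of opaque blobs.
@dataclass(frozen=True)
class PerformanceDatum:
    project: str
    result_set_id: int
    job_id: int
    signature: str       # identifies a specific type of test run
    push_timestamp: int  # unix time of the push
    value: float

class PerformanceStore:
    """In-memory stand-in for the proposed table plus its indexes."""

    def __init__(self):
        self._by_job = defaultdict(list)
        self._by_result_set = defaultdict(list)

    def add(self, datum):
        self._by_job[datum.job_id].append(datum)
        self._by_result_set[(datum.project, datum.result_set_id)].append(datum)

    def for_job(self, job_id):
        # look up performance data for a particular job almost instantly
        return self._by_job[job_id]

    def for_result_set(self, project, result_set_id):
        # look up all results for a resultset almost instantly (compare view)
        return self._by_result_set[(project, result_set_id)]
```

With columns indexed this way, the compare view can fetch just the datums for two result sets instead of downloading whole time-interval blobs.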
Comment 1 William Lachance (:wlach) (use needinfo!) 2015-08-12 15:56:59 PDT
Created attachment 8647209 [details] [review]
PR

Hey Mauro, this isn't ready for merging yet, but I was wondering if you could take a look and let me know if you have any initial feedback. I'm still relatively new to the Django ORM so there's probably some things I'm doing wrong.

Still todo:

* Support ingestion via ORM
* Various filter options for getting perf data
* I decided to change the web service APIs since the old ones didn't make much sense; to be nice, we might want to add compatibility shims
* Write some kind of migration script to pull / import data from old databases
* More extensive testing to make sure this scales
Comment 2 Mauro Doglio [:mdoglio] 2015-08-13 06:42:49 PDT
Comment on attachment 8647209 [details] [review]
PR

I added a few comments, it looks good so far :-)
Comment 3 William Lachance (:wlach) (use needinfo!) 2015-08-28 15:38:24 PDT
Comment on attachment 8647209 [details] [review]
PR

I think this is ready for a more thorough review.

There is a lingering issue where the performance adapter unit test can be extremely slow if your database is large. Really the problem here is bug 1133273, but after more than a day of trying to solve that using various methods (first trying to switch to pytest-django, then just trying to make the setup of a test db work again), I'm not sure if the fix is easy as long as we're using datasource... it makes setting up a test environment very difficult since you have to do everything by hand that django expects to be able to do for you. We might have an easier time switching away from datasource altogether.

I think most of this is pretty straightforward. In addition to the new storage structure, there are also two new management commands:

* add_perf_framework: Adds a new performance framework (currently we only have talos, but this opens the door to cleanly separating out e.g. autophone from the rest of our test data)
* migrate_perf_data: Intended to migrate everything in the old database to the new schema. This is a one-off, of course -- I expect to delete it in some subsequent commit.
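[Editor's note: the real commands are Django management commands, which are not reproduced here. As a hypothetical sketch of just the argument handling, using the options that appear later in this bug (--project and --interval), it might look like:]

```python
import argparse

# Hypothetical stand-in for the migrate_perf_data command-line interface.
# Option names mirror the invocation shown in comment 12; everything else
# (defaults, help text) is illustrative.
def parse_migrate_args(argv):
    parser = argparse.ArgumentParser(prog="migrate_perf_data")
    parser.add_argument("--project", required=True,
                        help="repository to migrate (e.g. mozilla-inbound)")
    parser.add_argument("--interval", type=int, default=None,
                        help="only migrate data from the last N seconds")
    return parser.parse_args(argv)
```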
Comment 4 Ed Morley [:emorley] 2015-08-31 04:20:52 PDT
(In reply to William Lachance (:wlach) from comment #3)
> Really the problem here is bug
> 1133273, but after more than a day of trying to solve that using various
> methods (first trying to switch to pytest-django, then just trying to make
> the setup of a test db work again), I'm not sure if the fix is easy as long
> as we're using datasource

What about reverting the hunk that camd (presumably accidentally) removed? (See bug 1133273 comment 1)
Comment 5 Ed Morley [:emorley] 2015-08-31 06:24:45 PDT
Comment on attachment 8647209 [details] [review]
PR

My look at this was still quite cursory (whilst I've left a lot of comments, it was still quite a large PR), so Mauro please do still look at this thoroughly :-)

Overall looks great - agree that this is a much better way of structuring the perf data - hopefully will avoid some of the DB issues we've been seeing :-) I also can't wait until even more things are using the ORM; it's so much nicer.
Comment 6 William Lachance (:wlach) (use needinfo!) 2015-08-31 12:47:35 PDT
(In reply to Ed Morley (UK public holiday 31st Aug) [:emorley] from comment #4)
> (In reply to William Lachance (:wlach) from comment #3)
> > Really the problem here is bug
> > 1133273, but after more than a day of trying to solve that using various
> > methods (first trying to switch to pytest-django, then just trying to make
> > the setup of a test db work again), I'm not sure if the fix is easy as long
> > as we're using datasource
> 
> What about reverting the hunk that camd (presumably accidentally) removed?
> (See bug 1133273 comment 1)

Yeah sadly that doesn't help. See my latest comment in that bug for details.
Comment 7 William Lachance (:wlach) (use needinfo!) 2015-08-31 13:51:27 PDT
Created attachment 8655044 [details]
SQL for new performance artifact tables

Hi Sheeri, I updated the treeherder performance schema SQL with your suggestions:

1. We now index performance data properties together with the project properties (since you'll always want to index on project).
2. I set a maximum length of 1024 bytes on the fields that can contain JSON data.

As a refresher:

* PerformanceDatum refers to the output of a single test run
* PerformanceSignature refers to a specific type of test run
* PerformanceFramework is a specific type of framework for running tests (Talos, Autophone, etc.)

Let me know if there is anything else that stands out as needing improvement.
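[Editor's note: a small sketch of the second point above, the 1024-byte cap on JSON-bearing fields. This helper is hypothetical, not from the treeherder codebase; it illustrates validating the blob size up front rather than letting a length-limited column truncate it.]

```python
import json

MAX_JSON_PROPERTY_LENGTH = 1024  # the cap mentioned above, in bytes

def serialize_extra_properties(props):
    """Serialize a dict for a length-limited JSON column, refusing blobs
    that would not fit rather than silently truncating them."""
    blob = json.dumps(props, sort_keys=True)
    if len(blob.encode("utf-8")) > MAX_JSON_PROPERTY_LENGTH:
        raise ValueError(
            "extra properties exceed %d bytes" % MAX_JSON_PROPERTY_LENGTH)
    return blob
```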
Comment 8 Mauro Doglio [:mdoglio] 2015-09-02 06:41:57 PDT
Comment on attachment 8647209 [details] [review]
PR

:wlach please r? me again when it's ready for a final pass
Comment 9 Sheeri Cabral [:sheeri] 2015-09-02 11:18:11 PDT
Comment on attachment 8655044 [details]
SQL for new performance artifact tables

Looks good to me!
Comment 10 William Lachance (:wlach) (use needinfo!) 2015-09-08 15:20:20 PDT
Comment on attachment 8647209 [details] [review]
PR

I believe this is ready for final review. I addressed all the comments from last time (I think), plus:

* No longer try to store a whole datum's worth of data; following from bug 1220716, just store a single float field with the value that the test gave. This is all we ever usefully displayed in the UI anyway.
* Since we're storing the value as a raw float, don't bother rounding it when inserting it into the database. We can do that when the user hits the JSON API.
* Eliminate old duplicates from performance series data when doing the migration (even though the new code was supposed to auto-eliminate these, there were some really old series that had gone under the radar before)

Everything seems in order locally, though of course we'll want to give this a good run through stage before pushing to master and deploying....
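[Editor's note: a tiny sketch of the "store raw, round at the API" idea from the second bullet above. These helpers are hypothetical, and the two-decimal precision is an assumption for illustration; the point is that rounding happens only at the JSON API boundary, never at insert time.]

```python
def store_value(raw_value):
    # stored as-is: the database keeps the full-precision float
    return float(raw_value)

def serialize_value(stored_value, places=2):
    # presentation-level rounding, applied when the user hits the JSON API
    return round(stored_value, places)
```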
Comment 11 Mauro Doglio [:mdoglio] 2015-09-10 11:00:23 PDT
Comment on attachment 8647209 [details] [review]
PR

Left a few comments. Sorry it took so long to review it.
Comment 12 William Lachance (:wlach) (use needinfo!) 2015-09-10 14:13:54 PDT
So the initial deployment to stage looks to be going well, though I did get an error halfway through running the migration script on inbound:

[wlachance@treeherder-rabbitmq1.stage.private.scl3 treeherder-service]$  ../venv/bin/python manage.py migrate_perf_data --project mozilla-inbound --interval 7776000

Traceback (most recent call last):
  File "manage.py", line 11, in <module>
    execute_from_command_line(sys.argv)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 338, in execute_from_command_line
    utility.execute()
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 330, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 393, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 444, in execute
    output = self.handle(*args, **options)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/perf/management/commands/migrate_perf_data.py", line 127, in handle
    signature_props, options['interval'])
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 37, in __exit__
    self.disconnect()
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 142, in disconnect
    self.dhub.disconnect()
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/datasource/bases/SQLHub.py", line 246, in disconnect
    self.connection[host_type]['con_obj'].commit()
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')

This is odd, as it worked fine on my vagrant instance when migrating a much larger dump of the production database. Pythian: Can you look into why the mysql database might have died at around 5:00pm EST today (September 10th)?
Comment 13 Pythian Team73 2015-09-10 20:58:33 PDT
I'm creating the ticket for this at the moment. I will come back with the logs soon.
Comment 14 Pythian Team73 2015-09-10 20:59:24 PDT
I'm creating the ticket for this at the moment and will come back with the logs soon.

Orkan Dere
Comment 15 Rohit Kumar 2015-09-10 23:13:18 PDT
Hi, 

I checked the treeherder stage servers but could not find anything in the error logs for the mentioned timestamp (September 10, 5PM EST). I also checked the settings variables that could be the cause of the issue you faced, and found values that can be increased.

max_allowed_packet is 32M; it can be increased a bit (e.g. to 64M) to allow larger data packets per transaction.
log_warnings is 1; we can set it to 2 so that aborted connection messages (which are warnings) are also logged in the error log.
One other variable, wait_timeout, is 600; it can also be increased a bit.

I also checked the prod servers, they also have the same settings.

mysql> show variables like 'max_allowed_packet';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 33554432 |
+--------------------+----------+
1 row in set (0.00 sec)

mysql> show variables like 'wait_timeout';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wait_timeout  | 600   |
+---------------+-------+
1 row in set (0.00 sec)

mysql> show variables like 'log_warnings';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_warnings  | 1     |
+---------------+-------+
1 row in set (0.00 sec)


Please let us know if any help is required to make the changes, or if you see any issues with changing these.

Thanks,
Rohit Kumar
Comment 16 Treeherder Bugbot 2015-09-14 09:03:05 PDT
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/066f437ca52b8833b5178eb68a7f2d2480f97574
Bug 1192976 - Refactor performance data + store in master db

https://github.com/mozilla/treeherder/commit/6a62082bff30a579f2e1dbafba32e8f387c04d0a
Bug 1192976 - Add initial fixture for talos performance schema
Comment 17 Treeherder Bugbot 2015-09-14 09:35:35 PDT
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/f7aa0330853602fbc192ba46010dbbecced21210
Bug 1192976 - Fix unit tests after adding fixture
Comment 18 Treeherder Bugbot 2015-09-14 09:53:39 PDT
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/655ee074481b4c9e308917bd624f18d26f1cec30
Bug 1192976 - Fix unit tests for reals
Comment 19 William Lachance (:wlach) (use needinfo!) 2015-09-14 14:31:45 PDT
(In reply to Rohit Kumar from comment #15)
> Hi, 
> 
> I checked the treeherder stage servers, but could not find anything in error
> logs for the mentioned timestamps (September 10, 5PM EST) and i checked the
> setting variables which can be the cause for the issue you faced and found
> they have values which can be increased.

My data migration script can be re-run when it fails, so I think I'm just going to do that; it doesn't seem worthwhile to track down the root cause of this.
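[Editor's note: a generic sketch of the "just re-run it when it fails" approach, assuming the migration step is idempotent as stated above. The helper, exception handling, and retry counts are illustrative, not from the actual script.]

```python
import time

def run_with_retries(step, attempts=3, delay=0.0):
    """Run a re-runnable migration step, retrying on failure
    (e.g. OperationalError: 'MySQL server has gone away')."""
    last_error = None
    for _ in range(attempts):
        try:
            return step()
        except Exception as e:  # broad catch for illustration only
            last_error = e
            time.sleep(delay)  # back off before re-running
    raise last_error
```

This only makes sense when each run picks up where the last one left off (or safely redoes work), which is what makes skipping the root-cause hunt defensible here.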
Comment 20 William Lachance (:wlach) (use needinfo!) 2015-09-16 06:46:10 PDT
Created attachment 8661769 [details] [review]
Remove legacy code

I think after applying this and removing the legacy databases, we can mark this as done.
Comment 21 Ed Morley [:emorley] 2015-09-16 07:04:18 PDT
Comment on attachment 8661769 [details] [review]
Remove legacy code

Looks good; thank you :-)
Comment 22 Treeherder Bugbot 2015-09-16 14:12:56 PDT
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/e90d55e2c314cc38e11feedaea716596c5f24b2f
Bug 1192976 - Remove migration script + legacy performance handling code
