Bug 1192976
Opened 10 years ago
Closed 9 years ago
Change perfherder database schema for performance and extensibility
Categories: Tree Management :: Perfherder, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: wlach; Assigned: wlach
Attachments: 3 files
Currently the way we store data in perfherder is rather suboptimal:
* It uses Datasource, a proprietary database abstraction layer that is difficult to reason about and use; this makes it hard to onboard new contributors and hurts even core developer productivity.
* We store the data redundantly for each time interval we support (1 day, 1 week, 2 weeks, etc.). This impacts scalability, particularly where disk space is a concern, such as on treeherder stage.
* Result set and job information for performance datums can only be looked up by downloading a complete series of performance data covering the time intervals when the jobs for those result sets were run. This can be rather slow, especially for something like the compare view, where we might want to download 20-30 pieces of data at a time.
To solve these problems, I'd like to experiment with using the Django ORM and storing performance datums in their own specialized table, with columns for result set information, job information, project, push timestamp, and performance signature. In addition to being more space efficient and easier to work with, this would also allow the following:
* Look up performance data for a particular job almost instantly (useful for bug 1186191 and probably other things)
* Look up all talos results for each resultset almost instantly (useful for compare view)
* Look up arbitrary intervals of performance data without downloading everything (useful for data mining)
There are probably even more advantages to this approach that I haven't thought of. Let's give it a try.
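To make those lookups concrete, here is a rough sketch of the kind of ORM queries such a table would allow. It is purely illustrative: the model and field names (PerformanceDatum, repository, job_id, result_set_id, signature, push_timestamp) are assumptions about the eventual schema, not the final design.

import datetime

from treeherder.perf.models import PerformanceDatum

# all performance data produced by a single job (hypothetical ids):
job_data = PerformanceDatum.objects.filter(
    repository__name='mozilla-inbound', job_id=12345)

# all talos results for a given result set (what the compare view needs):
push_data = PerformanceDatum.objects.filter(
    repository__name='mozilla-inbound', result_set_id=67890)

# an arbitrary time window, without downloading whole pre-aggregated series:
two_weeks_ago = datetime.datetime.now() - datetime.timedelta(days=14)
signature_hash = 'abcdef0123456789'  # hypothetical signature identifier
recent = PerformanceDatum.objects.filter(
    repository__name='mozilla-inbound',
    signature__signature_hash=signature_hash,
    push_timestamp__gt=two_weeks_ago)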
Assignee
Comment 1•10 years ago
Hey Mauro, this isn't ready for merging yet, but I was wondering if you could take a look and let me know if you have any initial feedback. I'm still relatively new to the Django ORM, so there are probably some things I'm doing wrong.
Still todo:
* Support ingestion via ORM
* Various filter options for getting perf data
* I decided to change the web service APIs since the old ones didn't make much sense; to be nice, we might want to add compatibility shims
* Write some kind of migration script to pull / import data from old databases
* More extensive testing to make sure this scales
Attachment #8647209 - Flags: feedback?(mdoglio)
Comment 2•10 years ago
Comment on attachment 8647209 [details] [review]
PR
I added a few comments; it looks good so far :-)
Attachment #8647209 - Flags: feedback?(mdoglio) → feedback+
Assignee
Comment 3•10 years ago
Comment on attachment 8647209 [details] [review]
PR
I think this is ready for a more thorough review.
There is a lingering issue where the performance adapter unit test can be extremely slow if your database is large. Really the problem here is bug 1133273, but after more than a day of trying to solve that using various methods (first trying to switch to pytest-django, then just trying to make the setup of a test db work again), I'm not sure if the fix is easy as long as we're using datasource... it makes setting up a test environment very difficult, since you have to do by hand everything that Django expects to be able to do for you. We might have an easier time switching away from datasource altogether.
I think most of this is pretty straightforward. In addition to the new storage structure, there are also two new management commands:
* add_perf_framework: Adds a new performance framework (currently we only have talos, but this opens the door to cleanly separating out e.g. autophone from the rest of our test data); a rough sketch of what this could look like is below
* migrate_perf_data: Intended to migrate everything in the old database to the new schema. This is a one-off, of course -- I expect to delete it in some subsequent commit.
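For reference, a minimal sketch of what a command like add_perf_framework might look like, assuming it only needs to register a framework by name via the ORM; the real command's arguments and behaviour may differ.

from django.core.management.base import BaseCommand

from treeherder.perf.models import PerformanceFramework

class Command(BaseCommand):
    help = "Register a new performance framework (e.g. talos, autophone)"

    def add_arguments(self, parser):
        parser.add_argument('name', help="name of the framework to add")

    def handle(self, *args, **options):
        # idempotent: adding an existing framework is a no-op
        framework, created = PerformanceFramework.objects.get_or_create(
            name=options['name'])
        if created:
            self.stdout.write("Added framework '%s'" % framework.name)
        else:
            self.stdout.write("Framework '%s' already exists" % framework.name)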
Attachment #8647209 - Flags: review?(mdoglio)
Attachment #8647209 - Flags: feedback?(emorley)
Attachment #8647209 - Flags: feedback+
Comment 4•9 years ago
(In reply to William Lachance (:wlach) from comment #3)
> Really the problem here is bug
> 1133273, but after more than a day of trying to solve that using various
> methods (first trying to switch to pytest-django, then just trying to make
> the setup of a test db work again), I'm not sure if the fix is easy as long
> as we're using datasource
What about reverting the hunk that camd (presumably accidentally) removed? (See bug 1133273 comment 1)
Comment 5•9 years ago
Comment on attachment 8647209 [details] [review]
PR
My look at this was still quite cursory (whilst I've left a lot of comments, it was still quite a large PR), so Mauro, please do still look at this thoroughly :-)
Overall looks great - agree that this is a much better way of structuring the perf data - hopefully it will avoid some of the DB issues we've been seeing :-) I also can't wait until even more things are using the ORM; it's so much nicer.
Attachment #8647209 - Flags: feedback?(emorley) → feedback+
Assignee
Comment 6•9 years ago
(In reply to Ed Morley (UK public holiday 31st Aug) [:emorley] from comment #4)
> (In reply to William Lachance (:wlach) from comment #3)
> > Really the problem here is bug
> > 1133273, but after more than a day of trying to solve that using various
> > methods (first trying to switch to pytest-django, then just trying to make
> > the setup of a test db work again), I'm not sure if the fix is easy as long
> > as we're using datasource
>
> What about reverting the hunk that camd (presumably accidentally) removed?
> (See bug 1133273 comment 1)
Yeah sadly that doesn't help. See my latest comment in that bug for details.
Assignee
Comment 7•9 years ago
Hi Sheeri, I updated the treeherder performance schema SQL with your suggestions:
1. We now index the performance data properties together with the project properties (since queries will always filter on project).
2. I set a maximum length of 1024 bytes on the fields that can contain JSON data.
As a refresher:
* PerformanceDatum refers to the output of a single test run
* PerformanceSignature refers to a specific type of test run
* PerformanceFramework is a specific type of framework for running tests (Talos, Autophone, etc.)
Let me know if there is anything else that stands out as needing improvement.
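For illustration, here is a rough Django-model sketch of how those three classes and the constraints above could hang together. The field names, lengths, and exact index choices are assumptions rather than the final schema; only the 1024-byte cap on the JSON-bearing field and the project-first indexing reflect the points above.

from django.db import models

class PerformanceFramework(models.Model):
    # e.g. "talos", "autophone"
    name = models.SlugField(max_length=255, unique=True)

class PerformanceSignature(models.Model):
    signature_hash = models.CharField(max_length=40, db_index=True)
    framework = models.ForeignKey(PerformanceFramework)
    suite = models.CharField(max_length=80)
    test = models.CharField(max_length=80, blank=True)
    # free-form JSON properties, capped at 1024 bytes as discussed above
    extra_properties = models.CharField(max_length=1024)

class PerformanceDatum(models.Model):
    repository = models.ForeignKey('model.Repository')  # assuming the existing Repository model holds the project
    job_id = models.PositiveIntegerField()
    result_set_id = models.PositiveIntegerField()
    signature = models.ForeignKey(PerformanceSignature)
    value = models.FloatField()
    push_timestamp = models.DateTimeField()

    class Meta:
        # index the performance data properties together with the project,
        # since queries will always filter on project first
        index_together = [('repository', 'signature', 'push_timestamp'),
                          ('repository', 'job_id'),
                          ('repository', 'result_set_id')]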
Attachment #8655044 - Flags: feedback?(scabral)
Comment 8•9 years ago
Comment on attachment 8647209 [details] [review]
PR
:wlach please r? me again when it's ready for a final pass
Attachment #8647209 - Flags: review?(mdoglio)
Comment 9•9 years ago
Comment on attachment 8655044 [details]
SQL for new performance artifact tables
Looks good to me!
Attachment #8655044 - Flags: feedback?(scabral) → feedback+
Assignee
Comment 10•9 years ago
Comment on attachment 8647209 [details] [review]
PR
I believe this is ready for final review. I addressed all the comments from last time (I think), plus:
* Following on from bug 1220716, no longer try to store a whole datum's worth of data; just store a single float field with the value that the test gave.
This is all we ever usefully displayed in the UI anyway.
* Since we're storing the value as a raw float, don't bother rounding it when inserting it into the database. We can do that when the user hits the JSON API.
* Eliminate old duplicates from performance series data when doing the migration (even though the new code was supposed to auto-eliminate these, there were some really old series that had gone under the radar before)
Everything seems in order locally, though of course we'll want to give this a good run through stage before pushing to master and deploying....
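As an aside on the rounding point above, rounding only at the API layer could look roughly like the following, assuming a Django REST framework (3.x) serializer; the actual serializer and field names may differ.

from rest_framework import serializers

class PerformanceDatumSerializer(serializers.Serializer):
    job_id = serializers.IntegerField()
    result_set_id = serializers.IntegerField()
    push_timestamp = serializers.DateTimeField()
    value = serializers.SerializerMethodField()

    def get_value(self, datum):
        # the database keeps the raw float; round only when serving JSON
        return round(datum.value, 2)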
Attachment #8647209 - Attachment description: First pass → PR
Attachment #8647209 - Flags: review?(mdoglio)
Comment 11•9 years ago
Comment on attachment 8647209 [details] [review]
PR
Left a few comments. Sorry it took so long to review it.
Attachment #8647209 - Flags: review?(mdoglio) → review+
Assignee
Comment 12•9 years ago
So the initial deployment to stage looks to be going well, though I did get an error halfway through running the migration script on inbound:
[wlachance@treeherder-rabbitmq1.stage.private.scl3 treeherder-service]$ ../venv/bin/python manage.py migrate_perf_data --project mozilla-inbound --interval 7776000
Traceback (most recent call last):
File "manage.py", line 11, in <module>
execute_from_command_line(sys.argv)
File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 338, in execute_from_command_line
utility.execute()
File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 330, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 393, in run_from_argv
self.execute(*args, **cmd_options)
File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 444, in execute
output = self.handle(*args, **options)
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/perf/management/commands/migrate_perf_data.py", line 127, in handle
signature_props, options['interval'])
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 37, in __exit__
self.disconnect()
File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 142, in disconnect
self.dhub.disconnect()
File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/datasource/bases/SQLHub.py", line 246, in disconnect
self.connection[host_type]['con_obj'].commit()
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')
This is odd, as it worked fine on my Vagrant instance when migrating a much larger dump of the production database. Pythian: can you look into why the MySQL database might have died at around 5:00pm EST today (September 10th)?
Flags: needinfo?(team73)
Comment 13•9 years ago
I'm creating the ticket for this at the moment. I will come back with the logs soon.
Orkan Dere
Comment 15•9 years ago
Hi,
I checked the treeherder stage servers but could not find anything in the error logs for the mentioned timestamp (September 10, 5 PM EST). I also checked the settings variables that could be the cause of the issue you faced, and found values that can be increased.
max_allowed_packet is 32M; it can be increased a bit (to 64M, say) to allow larger data packets per transaction.
log_warnings is 1; we can set it to 2 so that aborted-connection messages (which are warnings) are also logged in the error logs.
One other variable, wait_timeout, is 600; it can also be increased a bit.
I also checked the prod servers; they have the same settings.
mysql> show variables like 'max_allowed_packet';
+--------------------+----------+
| Variable_name | Value |
+--------------------+----------+
| max_allowed_packet | 33554432 |
+--------------------+----------+
1 row in set (0.00 sec)
mysql> show variables like 'wait_timeout';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wait_timeout | 600 |
+---------------+-------+
1 row in set (0.00 sec)
mysql> show variables like 'log_warnings';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_warnings | 1 |
+---------------+-------+
1 row in set (0.00 sec)
Please let us know if any help is required to make the changes, or if you see any issues with changing these.
Thanks,
Rohit Kumar
Comment 16•9 years ago
Commits pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/066f437ca52b8833b5178eb68a7f2d2480f97574
Bug 1192976 - Refactor performance data + store in master db
https://github.com/mozilla/treeherder/commit/6a62082bff30a579f2e1dbafba32e8f387c04d0a
Bug 1192976 - Add initial fixture for talos performance schema
Comment 17•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/f7aa0330853602fbc192ba46010dbbecced21210
Bug 1192976 - Fix unit tests after adding fixture
Comment 18•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/655ee074481b4c9e308917bd624f18d26f1cec30
Bug 1192976 - Fix unit tests for reals
Assignee
Comment 19•9 years ago
(In reply to Rohit Kumar from comment #15)
> Hi,
>
> I checked the treeherder stage servers but could not find anything in the
> error logs for the mentioned timestamp (September 10, 5 PM EST). I also
> checked the settings variables that could be the cause of the issue you
> faced, and found values that can be increased.
My data migration script can be re-run when it fails, so I think I'm just going to do that; it doesn't seem worthwhile to track down the root cause of this.
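For context, re-running is safe as long as the migration inserts idempotently. A sketch of that pattern, assuming it uses get_or_create keyed on the datum's identifying fields (the actual migrate_perf_data code may differ):

from treeherder.perf.models import PerformanceDatum

def store_migrated_datum(repository, job_id, result_set_id, signature,
                         push_timestamp, value):
    # if this row was already migrated before the connection dropped,
    # get_or_create is a no-op rather than a duplicate insert
    PerformanceDatum.objects.get_or_create(
        repository=repository,
        job_id=job_id,
        result_set_id=result_set_id,
        signature=signature,
        push_timestamp=push_timestamp,
        defaults={'value': value})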
Flags: needinfo?(team73)
Assignee
Comment 20•9 years ago
I think after applying this and removing the legacy databases, we can mark this as done.
Attachment #8661769 - Flags: review?(emorley)
Comment 21•9 years ago
Comment on attachment 8661769 [details] [review]
Remove legacy code
Looks good; thank you :-)
Attachment #8661769 - Flags: review?(emorley) → review+
Comment 22•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/e90d55e2c314cc38e11feedaea716596c5f24b2f
Bug 1192976 - Remove migration script + legacy performance handling code
Updated•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED