Bug 1192976 - Change perfherder database schema for performance and extensibility
Status: RESOLVED FIXED
Product: Tree Management
Classification: Other
Component: Perfherder
Assigned To: William Lachance (:wlach) (use needinfo!)
Blocks: 1157687 1178641 1200716 1214618
Reported: 2015-08-10 12:28 PDT by William Lachance (:wlach) (use needinfo!)
Modified: 2015-10-14 05:39 PDT


Attachments
PR (46 bytes, text/x-github-pull-request)
2015-08-12 15:56 PDT, William Lachance (:wlach) (use needinfo!)
mdoglio: review+
emorley: feedback+
SQL for new performance artifact tables (3.19 KB, text/plain)
2015-08-31 13:51 PDT, William Lachance (:wlach) (use needinfo!)
scabral: feedback+
Remove legacy code (46 bytes, text/x-github-pull-request)
2015-09-16 06:46 PDT, William Lachance (:wlach) (use needinfo!)
emorley: review+

Description William Lachance (:wlach) (use needinfo!) 2015-08-10 12:28:07 PDT
Currently the way we store data in perfherder is rather suboptimal:

* It uses Datasource, a proprietary database abstraction layer that is difficult to reason about and use. This makes it hard to onboard new contributors and hurts even core developer productivity.
* We store the data redundantly for each time interval we support (1 day, 1 week, 2 weeks, etc.). This impacts scalability, particularly where disk space is a concern like treeherder stage.
* Result set and job information for performance datums can only be looked up by downloading a complete set of performance data covering the time intervals when the jobs for those result sets were run. This can be rather slow, especially for something like the compare view, where we might want to download 20-30 pieces of data at a time.

To solve these problems, I'd like to experiment with using the Django ORM and storing performance datums in their own specialized table, with columns for result set information, job information, project, push timestamp, and performance signature. In addition to being more space efficient and easier to work with, this would also allow the following:

* Look up performance data for a particular job almost instantly (useful for bug 1186191 and probably other things)
* Look up all talos results for each resultset almost instantly (useful for compare view)
* Look up arbitrary intervals of performance data without downloading everything (useful for data mining)

There are probably even more advantages to this approach that I haven't thought of. Let's give it a try.
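[Editor's note: a minimal sketch of the proposed layout. These are hypothetical names, not the actual treeherder Django models; an in-memory dict stands in for the database indexes to illustrate the "almost instant" lookup paths the comment describes.]

```python
from dataclasses import dataclass
from collections import defaultdict

# One row per performance datum, with the lookup keys from the
# proposal (result set, job, project, push timestamp, signature)
# as plain columns instead of opaque blobs.
@dataclass(frozen=True)
class PerformanceDatum:
    project: str
    result_set_id: int
    job_id: int
    signature: str       # identifies a specific type of test run
    push_timestamp: int  # unix time of the push
    value: float

class PerformanceStore:
    """In-memory stand-in for the proposed table plus its indexes."""

    def __init__(self):
        self._by_job = defaultdict(list)
        self._by_result_set = defaultdict(list)

    def add(self, datum):
        self._by_job[datum.job_id].append(datum)
        self._by_result_set[(datum.project, datum.result_set_id)].append(datum)

    def for_job(self, job_id):
        # look up performance data for a particular job almost instantly
        return self._by_job[job_id]

    def for_result_set(self, project, result_set_id):
        # look up all results for a resultset almost instantly (compare view)
        return self._by_result_set[(project, result_set_id)]
```

With columns indexed this way, the compare view can fetch just the datums for two result sets instead of downloading whole time-interval blobs.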
Comment 1 William Lachance (:wlach) (use needinfo!) 2015-08-12 15:56:59 PDT
Created attachment 8647209 [details] [review]
PR

Hey Mauro, this isn't ready for merging yet, but I was wondering if you could take a look and let me know if you have any initial feedback. I'm still relatively new to the Django ORM so there's probably some things I'm doing wrong.

Still todo:

* Support ingestion via ORM
* Various filter options for getting perf data
* I decided to change the web service APIs since the old ones didn't make much sense; to be nice, we might want to add compatibility shims
* Write some kind of migration script to pull / import data from old databases
* More extensive testing to make sure this scales
Comment 2 Mauro Doglio [:mdoglio] 2015-08-13 06:42:49 PDT
Comment on attachment 8647209 [details] [review]
PR

I added a few comments, it looks good so far :-)
Comment 3 William Lachance (:wlach) (use needinfo!) 2015-08-28 15:38:24 PDT
Comment on attachment 8647209 [details] [review]
PR

I think this is ready for a more thorough review.

There is a lingering issue where the performance adapter unit test can be extremely slow if your database is large. Really the problem here is bug 1133273, but after more than a day of trying to solve that using various methods (first trying to switch to pytest-django, then just trying to make the setup of a test db work again), I'm not sure if the fix is easy as long as we're using datasource... it makes setting up a test environment very difficult since you have to do everything by hand that django expects to be able to do for you. We might have an easier time switching away from datasource altogether.

I think most of this is pretty straightforward. In addition to the new storage structure, there are also two new management commands:

* add_perf_framework: Adds a new performance framework (currently we only have talos, but this opens the door to cleanly separating out e.g. autophone from the rest of our test data)
* migrate_perf_data: Intended to migrate everything in the old database to the new schema. This is a one-off, of course -- I expect to delete it in some subsequent commit.
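[Editor's note: the real commands are Django management commands, which are not reproduced here. As a hypothetical sketch of just the argument handling, using the options that appear later in this bug (--project and --interval), it might look like:]

```python
import argparse

# Hypothetical stand-in for the migrate_perf_data command-line interface.
# Option names mirror the invocation shown in comment 12; everything else
# (defaults, help text) is illustrative.
def parse_migrate_args(argv):
    parser = argparse.ArgumentParser(prog="migrate_perf_data")
    parser.add_argument("--project", required=True,
                        help="repository to migrate (e.g. mozilla-inbound)")
    parser.add_argument("--interval", type=int, default=None,
                        help="only migrate data from the last N seconds")
    return parser.parse_args(argv)
```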
Comment 4 Ed Morley [:emorley] 2015-08-31 04:20:52 PDT
(In reply to William Lachance (:wlach) from comment #3)
> Really the problem here is bug
> 1133273, but after more than a day of trying to solve that using various
> methods (first trying to switch to pytest-django, then just trying to make
> the setup of a test db work again), I'm not sure if the fix is easy as long
> as we're using datasource

What about reverting the hunk that camd (presumably accidentally) removed? (See bug 1133273 comment 1)
Comment 5 Ed Morley [:emorley] 2015-08-31 06:24:45 PDT
Comment on attachment 8647209 [details] [review]
PR

My look at this was still quite cursory (whilst I've left a lot of comments, it was still quite a large PR), so Mauro please do still look at this thoroughly :-)

Overall looks great - agree that this is a much better way of structuring the perf data - hopefully will avoid some of the DB issues we've been seeing :-) I also can't wait until even more things are using the ORM; it's so much nicer.
Comment 6 William Lachance (:wlach) (use needinfo!) 2015-08-31 12:47:35 PDT
(In reply to Ed Morley (UK public holiday 31st Aug) [:emorley] from comment #4)
> (In reply to William Lachance (:wlach) from comment #3)
> > Really the problem here is bug
> > 1133273, but after more than a day of trying to solve that using various
> > methods (first trying to switch to pytest-django, then just trying to make
> > the setup of a test db work again), I'm not sure if the fix is easy as long
> > as we're using datasource
> 
> What about reverting the hunk that camd (presumably accidentally) removed?
> (See bug 1133273 comment 1)

Yeah sadly that doesn't help. See my latest comment in that bug for details.
Comment 7 William Lachance (:wlach) (use needinfo!) 2015-08-31 13:51:27 PDT
Created attachment 8655044 [details]
SQL for new performance artifact tables

Hi Sheeri, I updated the treeherder performance schema SQL with your suggestions:

1. We now index performance data properties together with the project properties (since you'll always want to index on project).
2. I set a maximum length of 1024 bytes on the fields that can contain JSON data.

As a refresher:

* PerformanceDatum refers to the output of a single test run
* PerformanceSignature refers to a specific type of test run
* PerformanceFramework is a specific type of framework for running tests (Talos, Autophone, etc.)

Let me know if there is anything else that stands out as needing improvement.
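[Editor's note: a small sketch of the second point above, the 1024-byte cap on JSON-bearing fields. This helper is hypothetical, not from the treeherder codebase; it illustrates validating the blob size up front rather than letting a length-limited column truncate it.]

```python
import json

MAX_JSON_PROPERTY_LENGTH = 1024  # the cap mentioned above, in bytes

def serialize_extra_properties(props):
    """Serialize a dict for a length-limited JSON column, refusing blobs
    that would not fit rather than silently truncating them."""
    blob = json.dumps(props, sort_keys=True)
    if len(blob.encode("utf-8")) > MAX_JSON_PROPERTY_LENGTH:
        raise ValueError(
            "extra properties exceed %d bytes" % MAX_JSON_PROPERTY_LENGTH)
    return blob
```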
Comment 8 Mauro Doglio [:mdoglio] 2015-09-02 06:41:57 PDT
Comment on attachment 8647209 [details] [review]
PR

:wlach please r? me again when it's ready for a final pass
Comment 9 Sheeri Cabral [:sheeri] 2015-09-02 11:18:11 PDT
Comment on attachment 8655044 [details]
SQL for new performance artifact tables

Looks good to me!
Comment 10 William Lachance (:wlach) (use needinfo!) 2015-09-08 15:20:20 PDT
Comment on attachment 8647209 [details] [review]
PR

I believe this is ready for final review. I addressed all the comments from last time (I think), plus:

* No longer try to store a whole datum's worth of data; following from bug 1220716, just store a single float field with the value that the test gave. This is all we ever usefully displayed in the UI anyway.
* Since we're storing the value as a raw float, don't bother rounding it when inserting it into the database. We can do that when the user hits the JSON API.
* Eliminate old duplicates from performance series data when doing the migration (even though the new code was supposed to auto-eliminate these, there were some really old series that had gone under the radar before)

Everything seems in order locally, though of course we'll want to give this a good run through stage before pushing to master and deploying....
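[Editor's note: a tiny sketch of the "store raw, round at the API" idea from the second bullet above. These helpers are hypothetical, and the two-decimal precision is an assumption for illustration; the point is that rounding happens only at the JSON API boundary, never at insert time.]

```python
def store_value(raw_value):
    # stored as-is: the database keeps the full-precision float
    return float(raw_value)

def serialize_value(stored_value, places=2):
    # presentation-level rounding, applied when the user hits the JSON API
    return round(stored_value, places)
```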
Comment 11 Mauro Doglio [:mdoglio] 2015-09-10 11:00:23 PDT
Comment on attachment 8647209 [details] [review]
PR

Left a few comments. Sorry it took so long to review it.
Comment 12 William Lachance (:wlach) (use needinfo!) 2015-09-10 14:13:54 PDT
So the initial deployment to stage looks to be going well, though I did get an error halfway through running the migration script on inbound:

[wlachance@treeherder-rabbitmq1.stage.private.scl3 treeherder-service]$  ../venv/bin/python manage.py migrate_perf_data --project mozilla-inbound --interval 7776000

Traceback (most recent call last):
  File "manage.py", line 11, in <module>
    execute_from_command_line(sys.argv)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 338, in execute_from_command_line
    utility.execute()
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 330, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 393, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 444, in execute
    output = self.handle(*args, **options)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/perf/management/commands/migrate_perf_data.py", line 127, in handle
    signature_props, options['interval'])
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 37, in __exit__
    self.disconnect()
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 142, in disconnect
    self.dhub.disconnect()
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/datasource/bases/SQLHub.py", line 246, in disconnect
    self.connection[host_type]['con_obj'].commit()
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')

This is odd, as it worked fine on my vagrant instance when migrating a much larger dump of the production database. Pythian: Can you look into why the mysql database might have died at around 5:00pm EST today (September 10th)?
Comment 13 Pythian Team73 2015-09-10 20:58:33 PDT
I'm creating the ticket for this at the moment. I will come back with the logs soon.
Comment 14 Pythian Team73 2015-09-10 20:59:24 PDT
I'm creating the ticket for this at the moment and will come back with the logs soon.

Orkan Dere
Comment 15 Rohit Kumar 2015-09-10 23:13:18 PDT
Hi, 

I checked the treeherder stage servers but could not find anything in the error logs for the mentioned timestamp (September 10, 5PM EST). I also checked the settings variables that could be the cause of the issue you faced, and found values that can be increased.

max_allowed_packet is 32M; it can be increased a bit (e.g. to 64M) to allow larger data packets per transaction.
log_warnings is 1; we can set it to 2 so that aborted connection messages (which are warnings) are also logged in the error log.
One other variable, wait_timeout, is 600; it can also be increased a bit.

I also checked the prod servers, they also have the same settings.

mysql> show variables like 'max_allowed_packet';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 33554432 |
+--------------------+----------+
1 row in set (0.00 sec)

mysql> show variables like 'wait_timeout';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wait_timeout  | 600   |
+---------------+-------+
1 row in set (0.00 sec)

mysql> show variables like 'log_warnings';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_warnings  | 1     |
+---------------+-------+
1 row in set (0.00 sec)


Please let us know if any help is required to make the changes, or if you see any issues with changing these.

Thanks,
Rohit Kumar
Comment 16 Treeherder Bugbot 2015-09-14 09:03:05 PDT
Commits pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/066f437ca52b8833b5178eb68a7f2d2480f97574
Bug 1192976 - Refactor performance data + store in master db

https://github.com/mozilla/treeherder/commit/6a62082bff30a579f2e1dbafba32e8f387c04d0a
Bug 1192976 - Add initial fixture for talos performance schema
Comment 17 Treeherder Bugbot 2015-09-14 09:35:35 PDT
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/f7aa0330853602fbc192ba46010dbbecced21210
Bug 1192976 - Fix unit tests after adding fixture
Comment 18 Treeherder Bugbot 2015-09-14 09:53:39 PDT
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/655ee074481b4c9e308917bd624f18d26f1cec30
Bug 1192976 - Fix unit tests for reals
Comment 19 William Lachance (:wlach) (use needinfo!) 2015-09-14 14:31:45 PDT
(In reply to Rohit Kumar from comment #15)
> Hi, 
> 
> I checked the treeherder stage servers, but could not find anything in error
> logs for the mentioned timestamps (September 10, 5PM EST) and i checked the
> setting variables which can be the cause for the issue you faced and found
> they have values which can be increased.

My data migration script can be re-run when it fails, so I think I'm just going to do that; it doesn't seem worthwhile to track down the root cause of this.
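[Editor's note: a generic sketch of the "just re-run it when it fails" approach, assuming the migration step is idempotent as stated above. The helper, exception handling, and retry counts are illustrative, not from the actual script.]

```python
import time

def run_with_retries(step, attempts=3, delay=0.0):
    """Run a re-runnable migration step, retrying on failure
    (e.g. OperationalError: 'MySQL server has gone away')."""
    last_error = None
    for _ in range(attempts):
        try:
            return step()
        except Exception as e:  # broad catch for illustration only
            last_error = e
            time.sleep(delay)  # back off before re-running
    raise last_error
```

This only makes sense when each run picks up where the last one left off (or safely redoes work), which is what makes skipping the root-cause hunt defensible here.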
Comment 20 William Lachance (:wlach) (use needinfo!) 2015-09-16 06:46:10 PDT
Created attachment 8661769 [details] [review]
Remove legacy code

I think after applying this and removing the legacy databases, we can mark this as done.
Comment 21 Ed Morley [:emorley] 2015-09-16 07:04:18 PDT
Comment on attachment 8661769 [details] [review]
Remove legacy code

Looks good; thank you :-)
Comment 22 Treeherder Bugbot 2015-09-16 14:12:56 PDT
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/e90d55e2c314cc38e11feedaea716596c5f24b2f
Bug 1192976 - Remove migration script + legacy performance handling code
