Closed Bug 1192976 Opened 9 years ago Closed 9 years ago

Change perfherder database schema for performance and extensibility

Categories

(Tree Management :: Perfherder, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wlach, Assigned: wlach)

References

Details

Attachments

(3 files)

Currently the way we store data in perfherder is rather suboptimal:

* It uses Datasource, a proprietary database abstraction layer interface that is difficult to reason about and use, which makes it difficult to onboard new contributors and has impacts on even core developer productivity.
* We store the data redundantly for each time interval we support (1 day, 1 week, 2 weeks, etc.). This impacts scalability, particularly where disk space is a concern like treeherder stage.
* Result set and job information for performance datums can only be looked up by downloading a complete set of performance data that encapsulates the time intervals when the jobs were run for that result sets. This can be rather slow, especially for something like the compare view where we might want to download 20-30 pieces of data at a time.

To solve these problems, I'd like to experiment with using the Django ORM and storing performance datums in their own specialized table, with columns for result set information, job information, project, push timestamp, and performance signature. In addition to being more space efficient and easier to work with, this would also allow the following:

* Look up performance data for a particular job almost instantly (useful for bug 1186191 and probably other things)
* Look up all talos results for each resultset almost instantly (useful for compare view)
* Look up arbitrary intervals of performance data without downloading everything (useful for data mining)

There's probably even more advantages to this approach that I haven't thought of. Let's give it a try.
Attached file PR
Hey Mauro, this isn't ready for merging yet, but I was wondering if you could take a look and let me know if you have any initial feedback. I'm still relatively new to the Django ORM so there's probably some things I'm doing wrong.

Still todo:

* Support ingestion via ORM
* Various filter options for getting perf data
* I decided to change the web service APIs since the old ones didn't make much sense, to be nice we might want to add compatibility shims
* Write some kind of migration script to pull / import data from old databases
* More extensive testing to make sure this scales
Attachment #8647209 - Flags: feedback?(mdoglio)
Blocks: 1157687
Comment on attachment 8647209 [details] [review]
PR

I added a few comments, it looks good so far :-)
Attachment #8647209 - Flags: feedback?(mdoglio) → feedback+
Blocks: 1178641
Comment on attachment 8647209 [details] [review]
PR

I think this is ready for a more thorough review.

There is a lingering issue where the performance adapter unit test can be extremely slow if your database is large. Really the problem here is bug 1133273, but after more than a day of trying to solve that using various methods (first trying to switch to pytest-django, then just trying to make the setup of a test db work again), I'm not sure if the fix is easy as long as we're using datasource... it makes setting up a test environment very difficult since you have to do everything by hand that django expects to be able to do for you. We might have an easier time switching away from datasource altogether.

I think most of this is pretty straightforward. In addition to the new storage structure, there are also two new management commands:

* add_perf_framework: Adds a new performance framework (currently we only have talos, but this opens the door to cleanly seperating out e.g. autophone from the rest of our test data)
* migrate_perf_data: Intended to migrate everything in the old database to the new schema. This is a once off, of course -- I expect to delete it in some subsequent commit.
Attachment #8647209 - Flags: review?(mdoglio)
Attachment #8647209 - Flags: feedback?(emorley)
Attachment #8647209 - Flags: feedback+
(In reply to William Lachance (:wlach) from comment #3)
> Really the problem here is bug
> 1133273, but after more than a day of trying to solve that using various
> methods (first trying to switch to pytest-django, then just trying to make
> the setup of a test db work again), I'm not sure if the fix is easy as long
> as we're using datasource

What about reverting the hunk that camd (presumably accidentally) removed? (See bug 1133273 comment 1)
Comment on attachment 8647209 [details] [review]
PR

My look at this was still quite cursory (whilst I've left a lot of comments, it was still quite a large PR), so Mauro please do still look at this thoroughly :-)

Overall looks great - agree that this is a much better way of structuring the perf data - hopefully will avoid some of the DB issues we've been seeing :-) I also can't wait until even more things are using the ORM, is so much nicer.
Attachment #8647209 - Flags: feedback?(emorley) → feedback+
(In reply to Ed Morley (UK public holiday 31st Aug) [:emorley] from comment #4)
> (In reply to William Lachance (:wlach) from comment #3)
> > Really the problem here is bug
> > 1133273, but after more than a day of trying to solve that using various
> > methods (first trying to switch to pytest-django, then just trying to make
> > the setup of a test db work again), I'm not sure if the fix is easy as long
> > as we're using datasource
> 
> What about reverting the hunk that camd (presumably accidentally) removed?
> (See bug 1133273 comment 1)

Yeah sadly that doesn't help. See my latest comment in that bug for details.
Hi Sheeri, I updated the treeherder performance schema SQL with your suggestions:

1. We now index performance data properties together with the project properties (since you'll always want to index on project).
2. I set a maximum length on the fields that can contain JSON data, to 1024 bytes.

As a refresher:

* PerformanceDatum refers to the output of a single test run
* PerformanceSignature refers to a specific type of test run
* PerformanceFramework is a specific type of framework for running tests (Talos, Autophone, etc.)

Let me know if there is anything else that stands out as needing improvement.
Attachment #8655044 - Flags: feedback?(scabral)
Comment on attachment 8647209 [details] [review]
PR

:wlach please r? me again when it's ready for a final pass
Attachment #8647209 - Flags: review?(mdoglio)
Comment on attachment 8655044 [details]
SQL for new performance artifact tables

Looks good to me!
Attachment #8655044 - Flags: feedback?(scabral) → feedback+
Blocks: 1200716
Comment on attachment 8647209 [details] [review]
PR

I believe this is ready for final review. I addressed all the comments from last time (I think), plus:

* No longer try to store a whole datum's worth of data, following from bug 1220716, just store a single "float" field storing the value that the test gave.
This is all we ever usefully displayed in the UI anyway.
* Since we're storing the value as a raw float, don't bother rounding it when inserting it into the database. We can do that when the user hits the JSON API.
* Eliminate old duplicates from performance series data when doing the migration (even though the new code was supposed to auto-eliminate these, there were some really old series that had gone under the radar before)

Everything seems in order locally, though of course we'll want to give this a good run through stage before pushing to master and deploying....
Attachment #8647209 - Attachment description: First pass → PR
Attachment #8647209 - Flags: review?(mdoglio)
Comment on attachment 8647209 [details] [review]
PR

Left a few comments. Sorry it took so long to review it.
Attachment #8647209 - Flags: review?(mdoglio) → review+
So the initial deployment to stage looks to be going well, though I did get an error halfway through running the migration script on inbound:

[wlachance@treeherder-rabbitmq1.stage.private.scl3 treeherder-service]$  ../venv/bin/python manage.py migrate_perf_data --project mozilla-inbound --interval 7776000

Traceback (most recent call last):
  File "manage.py", line 11, in <module>
    execute_from_command_line(sys.argv)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 338, in execute_from_command_line
    utility.execute()
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/__init__.py", line 330, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 393, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/django/core/management/base.py", line 444, in execute
    output = self.handle(*args, **options)
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/perf/management/commands/migrate_perf_data.py", line 127, in handle
    signature_props, options['interval'])
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 37, in __exit__
    self.disconnect()
  File "/data/www/treeherder.allizom.org/treeherder-service/treeherder/model/derived/base.py", line 142, in disconnect
    self.dhub.disconnect()
  File "/data/www/treeherder.allizom.org/venv/lib/python2.7/site-packages/datasource/bases/SQLHub.py", line 246, in disconnect
    self.connection[host_type]['con_obj'].commit()
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')

This is odd, as it worked fine on my vagrant instance when migrating a much larger dump of the production database. Pythian: Can you look into why the mysql database might have died at around 5:00pm EST today (September 10th)?
Flags: needinfo?(team73)
I'm creating the ticket for this at the moment. I will come back with the logs soon.
I'm creating the ticket for this at the moment and will come back with the logs soon.

Orkan Dere
Hi, 

I checked the treeherder stage servers, but could not find anything in error logs for the mentioned timestamps (September 10, 5PM EST) and i checked the setting variables which can be the cause for the issue you faced and found they have values which can be increased.

max_allowed_packet is 32M, it can be increased a bit (like 64M) to allow larger data packets in transaction.
log_warnings is 1, we can set it to 2, so that it will also log the aborted connection messages(which are warnings) in error logs
one other variable wait_timeout is 600, it can be increased a bit.

I also checked the prod servers, they also have the same settings.

mysql> show variables like 'max_allowed_packet';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 33554432 |
+--------------------+----------+
1 row in set (0.00 sec)

mysql> show variables like 'wait_timeout';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wait_timeout  | 600   |
+---------------+-------+
1 row in set (0.00 sec)

mysql> show variables like 'log_warnings';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_warnings  | 1     |
+---------------+-------+
1 row in set (0.00 sec)


Please let us know if any help required to make the changes and if you see any issues in changing these.

Thanks,
Rohit Kumar
(In reply to Rohit Kumar from comment #15)
> Hi, 
> 
> I checked the treeherder stage servers, but could not find anything in error
> logs for the mentioned timestamps (September 10, 5PM EST) and i checked the
> setting variables which can be the cause for the issue you faced and found
> they have values which can be increased.

My data migration script can be re-run when it fails, I think I'm just going to do that, it doesn't seem worthwhile to track down the root cause of this.
Flags: needinfo?(team73)
Attached file Remove legacy code
I think after applying this and removing the legacy databases, we can make this as done.
Attachment #8661769 - Flags: review?(emorley)
Comment on attachment 8661769 [details] [review]
Remove legacy code

Looks good; thank you :-)
Attachment #8661769 - Flags: review?(emorley) → review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Blocks: 1214618
You need to log in before you can comment on or make changes to this bug.