Closed Bug 387141 Opened 17 years ago Closed 17 years ago

Port Graph Server to mysql

Categories

(Webtools Graveyard :: Graph Server, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mtschrep, Assigned: oremj)

References

Details

Attachments

(2 files, 1 obsolete file)

Currently the Graph Server uses sqlite as the backing store.   For scalability and data back reasons we should port it to use mysql (or postgres) and leverage the MySQL clusters we have setup for backups and performance.

We are definitely seeing slowdowns on the production graphs.mozilla.org.
Assignee: nobody → oremj
Blocks: 386671
Attached patch Allows getdata.cgi to use MySQL (obsolete) — Splinter Review
This patch includes a sqlite to mysql migration script.  We will still need to alter and verity annotate.cgi, bonsaibouncer.cgi,bulk.cgi,and collect.cgi.  I'm not sure where these scripts are receiving input, so I'm not quite sure how I am going to test them.
Attachment #272070 - Flags: review?(vladimir)
Attachment #272070 - Flags: review?(anodelman)
Comment on attachment 272070 [details] [diff] [review]
Allows getdata.cgi to use MySQL


I'm not familiar with the mysql execute interface -- will %s do the same thing as ? did?  That is, do the binding such that the db engine sees a single data argument in that spot (hopefully without needing to explicitly quote it and the like)?  I thought that the python dbapi was portable across databases?

It would be nice if that was resolved so that we could use a single query string for both, since the ability to use sqlite in a quick standalone setup is very helpful.

>diff --git a/getdata.cgi b/getdata.cgi
>index 5deffd8..897fb4c 100755
>--- a/getdata.cgi
>+++ b/getdata.cgi
>@@ -103,7 +100,7 @@ def doListTests(fo, type, datelimit, branch, machine, testname):
>                          "test_type": row[3],
>                          "date": row[4],
>                          "extra_data": row[5],
>-                         "branch": row[6]})
>+                         "branch": float(row[6])})

Do we really want to coerce branch to a float?  It can be a full string; right now we're only using 1.9 and 1.8, but it might be a full symbolic name...

>diff --git a/scripts/migration.py b/scripts/migration.py
>new file mode 100755
>index 0000000..38bb73e
>--- /dev/null
>+++ b/scripts/migration.py
>@@ -0,0 +1,27 @@
>+from pysqlite2 import dbapi2 as sqlite
>+import MySQLdb, sys, os
>+
>+DIRNAME= os.path.dirname(sys.argv[0])
>+if not DIRNAME:
>+	DBPATH="../db/data.sqlite"
>+else:
>+	DBPATH=DIRNAME + "/../db/data.sqlite"
>+
>+sqlite_db = sqlite.connect(DBPATH)
>+
>+mysql_db = MySQLdb.connect("localhost","o","o","o_graphs")
>+mysql_cur = mysql_db.cursor()
>+
>+def migrate_table(table, select, insert):
>+	print "Migrating: " + table
>+	sqlite_cur = sqlite_db.cursor()
>+	res = sqlite_cur.execute(select)
>+	for row in res:
>+		mysql_cur.execute(insert % row)

This is kinda scary -- doing a % format with just %s will cause this to break on at least any ' characters that are embedded in any field.  Assuming that mysql.execute will do the right quoting with %s, would execute(insert, row) work?

>+
>+migrate_table('annotations',"SELECT dataset_id,time,value FROM annotations", "INSERT INTO annotations (`dataset_id`, `time`, `value` ) VALUES ('%s','%s','%s')")
>+migrate_table('dataset_branchinfo',"SELECT `dataset_id`, `time`, `branchid` FROM dataset_branchinfo", "INSERT INTO dataset_branchinfo (`dataset_id`, `time`, `branchid` ) VALUES ('%s','%s','%s')")
>+migrate_table('dataset_extra_data',"SELECT `dataset_id`, `time`, `data`  FROM dataset_extra_data", "INSERT INTO dataset_extra_data (`dataset_id`, `time`, `data` ) VALUES ('%s','%s','%s')")
>+migrate_table('dataset_info',"SELECT `id`, `type`, `machine`,`test`, `test_type`, `extra_data`, `branch`, `date` FROM dataset_info", "INSERT INTO dataset_info (`id`, `type`, `machine`,`test`, `test_type`, `extra_data`, `branch`, `date` ) VALUES ('%s', '%s', '%s','%s', '%s', '%s', '%s', '%s')")
>+migrate_table('dataset_values',"SELECT `dataset_id`, `time`, `value` FROM dataset_values", "INSERT INTO dataset_values (`dataset_id`, `time`, `value` ) VALUES ('%s','%s','%s')")


>diff --git a/sql/schema.sql b/sql/schema.sql
>new file mode 100644
>index 0000000..ff32fe8
>--- /dev/null
>+++ b/sql/schema.sql

Tables look ok; is int(11) enough for a timestamp? I guess that's 11 numeric chars, so that's enough for an unsigned 32-bit number, so that's ok.
Forgot to mention -- annotate.cgi isn't used anywhere yet, it's a feature that never got implemented (the ability to add markers to graphs, so that e.g., you can mark a line on a timestamp and say "new textframe turn-on").

bonsaibouncer.cgi shouldn't rely on the database; it's a helper to avoid running into cross-site xmlhttprequest issues when doing a bonsai query from the graph scripts (which is disabled for now -- this would be a good thing to add, basically "run bonsai query for selected range" that would just take you to bonsai; my code tried to be fancy and graph the results but I never got around to finishing it)

collect.cgi is in theory supposed to be called by the tinderboxes; rhelmer and I kept talking about making tinderbox do two calls or redirecting a call from the old script (for the old graphs) to the new one or somesuch.  But so far nothing calls it live.

I'm not sure what bulk.cgi is; I'd guess it's to retrieve bulk data, not sure who calls it :)
(In reply to comment #2)
> (From update of attachment 272070 [details] [diff] [review])
> 
> I'm not familiar with the mysql execute interface -- will %s do the same thing
> as ? did?  That is, do the binding such that the db engine sees a single data
> argument in that spot (hopefully without needing to explicitly quote it and the
> like)?  I thought that the python dbapi was portable across databases?

Unfortunately, MySQLdb uses %s instead of ?.  Basically, when execute is called with (query,args) it escapes all the args with mysql's escape function and inserts them into %s.   

> 
> It would be nice if that was resolved so that we could use a single query
> string for both, since the ability to use sqlite in a quick standalone setup is
> very helpful.
I'll work on a simple wrapper.

> 
> >diff --git a/getdata.cgi b/getdata.cgi
> >index 5deffd8..897fb4c 100755
> >--- a/getdata.cgi
> >+++ b/getdata.cgi
> >@@ -103,7 +100,7 @@ def doListTests(fo, type, datelimit, branch, machine, testname):
> >                          "test_type": row[3],
> >                          "date": row[4],
> >                          "extra_data": row[5],
> >-                         "branch": row[6]})
> >+                         "branch": float(row[6])})
> 
> Do we really want to coerce branch to a float?  It can be a full string; right
> now we're only using 1.9 and 1.8, but it might be a full symbolic name...
I accidentally left that in from when I was debugging :-(

> >diff --git a/scripts/migration.py b/scripts/migration.py
> >new file mode 100755
> >index 0000000..38bb73e
> >--- /dev/null
> >+++ b/scripts/migration.py
> >@@ -0,0 +1,27 @@
> >+from pysqlite2 import dbapi2 as sqlite
> >+import MySQLdb, sys, os
> >+
> >+DIRNAME= os.path.dirname(sys.argv[0])
> >+if not DIRNAME:
> >+	DBPATH="../db/data.sqlite"
> >+else:
> >+	DBPATH=DIRNAME + "/../db/data.sqlite"
> >+
> >+sqlite_db = sqlite.connect(DBPATH)
> >+
> >+mysql_db = MySQLdb.connect("localhost","o","o","o_graphs")
> >+mysql_cur = mysql_db.cursor()
> >+
> >+def migrate_table(table, select, insert):
> >+	print "Migrating: " + table
> >+	sqlite_cur = sqlite_db.cursor()
> >+	res = sqlite_cur.execute(select)
> >+	for row in res:
> >+		mysql_cur.execute(insert % row)
> 
> This is kinda scary -- doing a % format with just %s will cause this to break
> on at least any ' characters that are embedded in any field.  Assuming that
> mysql.execute will do the right quoting with %s, would execute(insert, row)
> work?
I was running in to some strange problems using bind variables in this case.  For some reason is was converting branches like 1.9 to 1.899999999...  There is probably a better solution for this as well.  Either way, this script should just be a one-off. 

> 
> >+
> >+migrate_table('annotations',"SELECT dataset_id,time,value FROM annotations", "INSERT INTO annotations (`dataset_id`, `time`, `value` ) VALUES ('%s','%s','%s')")
> >+migrate_table('dataset_branchinfo',"SELECT `dataset_id`, `time`, `branchid` FROM dataset_branchinfo", "INSERT INTO dataset_branchinfo (`dataset_id`, `time`, `branchid` ) VALUES ('%s','%s','%s')")
> >+migrate_table('dataset_extra_data',"SELECT `dataset_id`, `time`, `data`  FROM dataset_extra_data", "INSERT INTO dataset_extra_data (`dataset_id`, `time`, `data` ) VALUES ('%s','%s','%s')")
> >+migrate_table('dataset_info',"SELECT `id`, `type`, `machine`,`test`, `test_type`, `extra_data`, `branch`, `date` FROM dataset_info", "INSERT INTO dataset_info (`id`, `type`, `machine`,`test`, `test_type`, `extra_data`, `branch`, `date` ) VALUES ('%s', '%s', '%s','%s', '%s', '%s', '%s', '%s')")
> >+migrate_table('dataset_values',"SELECT `dataset_id`, `time`, `value` FROM dataset_values", "INSERT INTO dataset_values (`dataset_id`, `time`, `value` ) VALUES ('%s','%s','%s')")
> 
> 
> >diff --git a/sql/schema.sql b/sql/schema.sql
> >new file mode 100644
> >index 0000000..ff32fe8
> >--- /dev/null
> >+++ b/sql/schema.sql
> 
> Tables look ok; is int(11) enough for a timestamp? I guess that's 11 numeric
> chars, so that's enough for an unsigned 32-bit number, so that's ok.
> 

I was thinking more about being able to use both MySQL and sqlite.  This is going to be someone hard/limiting unless we use a db abstraction layer.  We could either use a pre-existing abstraction such as http://www.sqlalchemy.org/ or we could do something like make two db libraries and maintain queries for both.  There is also the simple solution of running a translation on the queries for (%s -> ?) and vice-versa, but that sounds very error prone since not all queries will be compatible.


What do you all think?
(In reply to comment #5)
> I was thinking more about being able to use both MySQL and sqlite.  This is
> going to be someone hard/limiting unless we use a db abstraction layer.  We
> could either use a pre-existing abstraction such as http://www.sqlalchemy.org/
> or we could do something like make two db libraries and maintain queries for
> both.  There is also the simple solution of running a translation on the
> queries for (%s -> ?) and vice-versa, but that sounds very error prone since
> not all queries will be compatible.
> 
> 
> What do you all think?
> 

CCing Morgamic who's had experience with sqlalchemy for Breakpad.   I'd advocate it only if you can use it to simply wrap the db connections and easily pass it fairly raw SQL with it just doing the DB specific mappings.  For these type of queries the O/R mapping stuff will generally get in the way.   Morgamic is this easy with sqlalchemy?
I'd say just do a simple sed %s -> ? for now -- it looked like all the queries were directly compatible.  If that changes in the future then we should look into an abstraction layer, but I don't foresee the queries getting too complex.
(In reply to comment #7)
> I'd say just do a simple sed %s -> ? for now -- it looked like all the queries
> were directly compatible.  If that changes in the future then we should look
> into an abstraction layer, but I don't foresee the queries getting too complex.
> 

Yep ok - let's bias towards getting something working then we can improve :-)
This should resolve most of the questionable issues.
Attachment #272070 - Attachment is obsolete: true
Attachment #272507 - Flags: review?(vladimir)
Attachment #272070 - Flags: review?(vladimir)
Attachment #272070 - Flags: review?(anodelman)
Comment on attachment 272507 [details] [diff] [review]
Converts application for use with MySQL

Looks fine, I don't think sqlite does the <=> stuff, but we can deal with that separately
Attachment #272507 - Flags: review?(vladimir) → review+
Checking in bulk.cgi;
/cvsroot/mozilla/webtools/new-graph/bulk.cgi,v  <--  bulk.cgi
new revision: 1.3; previous revision: 1.2
done
Checking in collect.cgi;
/cvsroot/mozilla/webtools/new-graph/collect.cgi,v  <--  collect.cgi
new revision: 1.8; previous revision: 1.7
done
Checking in dumpdata.cgi;
/cvsroot/mozilla/webtools/new-graph/dumpdata.cgi,v  <--  dumpdata.cgi
new revision: 1.2; previous revision: 1.1
done
Checking in getdata.cgi;
/cvsroot/mozilla/webtools/new-graph/getdata.cgi,v  <--  getdata.cgi
new revision: 1.8; previous revision: 1.7
done
RCS file: /cvsroot/mozilla/webtools/new-graph/graphsdb.py,v
done
Checking in graphsdb.py;
/cvsroot/mozilla/webtools/new-graph/graphsdb.py,v  <--  graphsdb.py
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/webtools/new-graph/databases/__init__.py,v
done
Checking in databases/__init__.py;
/cvsroot/mozilla/webtools/new-graph/databases/__init__.py,v  <--  __init__.py
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/webtools/new-graph/databases/mysql.py,v
done
Checking in databases/mysql.py;
/cvsroot/mozilla/webtools/new-graph/databases/mysql.py,v  <--  mysql.py
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/webtools/new-graph/scripts/migration.py,v
done
Checking in scripts/migration.py;
/cvsroot/mozilla/webtools/new-graph/scripts/migration.py,v  <--  migration.py
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/webtools/new-graph/sql/schema.sql,v
done
Checking in sql/schema.sql;
/cvsroot/mozilla/webtools/new-graph/sql/schema.sql,v  <--  schema.sql
initial revision: 1.1
done
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Is this deployed on production?
It is deployed at graphs-stage.mozilla.com.
Looking over the deployed code on graphs-stage.m.o I realized that the contents of utils/ had not been ported.  The code in utils is for grabbing data from the old build-graphs.m.o for plotting on the new graph server - currently the switch to mysql will break this code.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This should fix the problem, but I am not quite sure how to test it.
Attachment #275802 - Flags: review?(anodelman)
That code can be tested by first creating a directory called 'old' in the
utils/ directory.  Then, ensure that the paths in pull.sh and import.py are
correct (pointing at the old/ directory and the appropriate locations in the
graph server code).  Then run ./pull.sh.  You should find at the end of the
pull that there are now graphs in the given database representing the data from
build-graphs. 
what do I put in the 'old' directory?
The old directory will be filled with pulled in build-graphs data - so you just need to provide the empty directory and it will fill itself.

Voila!
After running the script on an empty database I get this http://oremj.khan-vm.mozilla.org/graph.html

Does that look right?

Attachment #275802 - Flags: review?(anodelman) → review+
Checking in utils/import.py;
/cvsroot/mozilla/webtools/new-graph/utils/import.py,v  <--  import.py
new revision: 1.12; previous revision: 1.11
done
Status: REOPENED → RESOLVED
Closed: 17 years ago17 years ago
Resolution: --- → FIXED
No longer blocks: 386669
Product: Webtools → Webtools Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: