Closed
Bug 605539
Opened 14 years ago
Closed 14 years ago
Make our database dumps downloadable via tsv files
Categories
(Input :: General, enhancement, P2)
Tracking
(Not tracked)
VERIFIED
FIXED
3.0
People
(Reporter: aakashd, Assigned: wenzel)
References
Details
Attachments
(1 file, 2 obsolete files)
1.00 KB,
text/plain
|
Details |
I've received requests from a number of folks to have access to our database and they don't necessarily want to get them in an unreadable format. It'd be great to have a feature in the admin screen to grab our entire db (and possibly subsets of each beta) and format it into a readable csv file.
Reporter | ||
Comment 1•14 years ago
|
||
Forgot to mention that something like what's on the top right side would be suggested: http://blocker-reports.brasstacks.mozilla.com/
Reporter | ||
Updated•14 years ago
|
Priority: -- → P2
Target Milestone: --- → 3.0
Assignee | ||
Comment 2•14 years ago
|
||
By "in the admin screen", do you mean the Django admin site (after login)? Or do you want to expose something on the public dashboard? The latter, I guess, is already somewhat covered by the search feeds.
Assignee | ||
Updated•14 years ago
|
Assignee: nobody → mozaakash
Reporter | ||
Comment 3•14 years ago
|
||
Yeah, I meant the Django admin site. In the future, we should offer APIs to make it easy for people to extend their own solutions and create data visualizations and what not though (separate bug).
Reporter | ||
Updated•14 years ago
|
Assignee: mozaakash → fwenzel
Assignee | ||
Comment 4•14 years ago
|
||
Our entire DB is getting pretty big meanwhile. I am not sure if we can (or rather: should) generalize this request like this. Generating a CSV or SQL file on demand is out of the question (it'd take a while, all the time binding the application process). We can probably have a daily SQL dump of the opinions table generated by IT and dropped somewhere, but before we go through this hassle, I think we should find out who wants the data, what for, and if their questions can't be answered more usefully through the search feeds.
Comment 5•14 years ago
|
||
From a metrics perspective (sites processing): * Opinions as TSVs split by version will be most generally useful for the clustering we do. Splits by (version,week) would be ideal as it gives better chunks and allows for incremental processing. But we can do those 2nd splits by week on the metrics side. * A daily update would be minimum, more often even better (4x/day would be great). * Fields needed (from feedback.opinion): id, created, os, product, version, locale, url, type (happy/sad/...), and the actual message
Comment 6•14 years ago
|
||
Followup to Comment 5: CSV format would also work, but TSV is the default format for hadoop map-reduce stream processing.
Reporter | ||
Comment 7•14 years ago
|
||
Here are some ideas that other open data sites use: http://www.factual.com/ http://www.worldvaluessurvey.org/ http://www.data.gov/catalog/raw http://infochimps.com/
Reporter | ||
Comment 8•14 years ago
|
||
Here's another: http://www.flickr.com/services/api/
Comment 9•14 years ago
|
||
Hey Fred, here's a suggestion for an export format. Works without using quotes (always neat for regex and stream processing): * Fields separated by TAB, lines by newline (LF). * These field-internal characters are prefixed by one backslash: - newline (LF) - backslash - TAB - NUL (ASCII \x00) To have "1 record = 1 row", some tool would be used to replace \<LF> with \n * Field types: * text: utf8-encoded unicode Only LF and the separator (TAB) are escaped. * uint: unsigned int written using plain ASCII/utf-8 0-9 Format 1: opinions.tsv * All opinion-fields except the UA are included * Fields: 1. id : uint 2. created: uint (unix time = seconds since 1 Jan 1970 UTC) 3. type : text ("happy"/"sad"/"suggestion") 4. product : text ("desktop"/"mobile") 5. version : text 6. os : text 7. locale : text 8. manufacturer : text 9. device : text 10. feedback : text, list of comma-separated rating values (uints) 11. url : text 12. description : text * Ugly SQL to generate this as one big file from the input database: mysql> SELECT id, unix_timestamp(created) as created, CASE type WHEN 1 THEN "happy" WHEN 2 THEN "sad" WHEN 3 then "suggestion" END as type, CASE product WHEN 1 THEN "desktop" WHEN 2 THEN "mobile" END as product, version, os, locale, manufacturer, device, url, description FROM feedback_opinion where device != "" LIMIT 100;-- INTO OUTFILE '/path/to/file' FIELDS TERMINATED BY '\t' ESCAPED BY '\\' LINES TERMINATED BY '\n'; Format 2: ratings.tsv * All rating-fields except the rating id (aggregate key should be fine) 1. opinion_id : uint 2. type : text ("startup"/"pageload"/"responsive"/"crashy"/"features") 3. rating : uint * One-big-file SQL mysql> SELECT opinion_id, CASE type WHEN 1 THEN "startup" WHEN 2 THEN "pageload" WHEN 3 then "responsive" WHEN 4 then "crashy" WHEN 5 THEN "features" END as type, rating FROM feedback_rating INTO OUTFILE '/path/to/file' FIELDS TERMINATED BY '\t' ESCAPED BY '\\' LINES TERMINATED BY '\n';
Comment 10•14 years ago
|
||
Having a look at Aakash’s Description, I should add that while there are some things that help humans (e.g. urls and descriptions at the end). But the format described in comment 9 mainly targets big-data programs (map-reduce), not so much Excel. For Excel, we would probably want US-style PDT dates (not sure if it can do iso8601 nowadays) and comma as the seperator with optionally enclosed fields, and lines terminated by '\r\n'. All of this can be achieved by slightly modifiying the queries. Can we have both?
Comment 11•14 years ago
|
||
For further reference, the queries from comment 9, formatted, sans the debug stuff, with correct product-ids
Updated•14 years ago
|
Attachment #501236 -
Attachment mime type: application/octet-stream → text/plain
Comment 12•14 years ago
|
||
The originally submitted export format splits existing \r\n line endings (such as in descriptions) so that we get \r\\n. That is somewhat unnatural to work with. Given that the export is targeted at batch processing, I think it won’t hurt to replace \r\n line endings with \n prior to encoding to TSV. This way there is only one type of line-ending, and that can be escaped or not. See updated attachment.
Attachment #501236 -
Attachment is obsolete: true
Assignee | ||
Comment 13•14 years ago
|
||
Comment on attachment 502849 [details]
Amended TSV export (SQL)
Your opinion types do not include the new types "rating" and "broken" (website).
I'm working on a cron job to create this data format for us.
Comment 14•14 years ago
|
||
Added the missing types. Until we have a job in place, I can create the export myself from the database snapshots that we get pushed to metrics. Note on Comment 9: The field 10 for opinions.tsv is useless due to the extra export we have for ratings, and not included in the export query. It only snuck into that comment due to an editing mistake I made.
Attachment #502849 -
Attachment is obsolete: true
Assignee | ||
Comment 15•14 years ago
|
||
Here's a patch for this: https://github.com/fwenzel/reporter/commit/d4e9e3f michaelk or davedash, can you review this? This is both a cron job and a celery task if we so desire. It exports all our opinions into one .tsv.bz2 file, and all ratings into a second such file. When putting this on stage (and prod), we have to agree with IT on where to export these files to, and how to make the files accessible from Apache.
Reporter | ||
Comment 16•14 years ago
|
||
> When putting this on stage (and prod), we have to agree with IT on where to
export these files to, and how to make the files accessible from Apache.
Can we get a bug filed to figure that out, Fred?
Assignee | ||
Comment 17•14 years ago
|
||
Yes, after the code is done, I'll file a bug to get the cron job set up.
Assignee | ||
Updated•14 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 18•14 years ago
|
||
All right, I landed this on master: http://github.com/fwenzel/reporter/commit/ca42a0f Will file an IT bug to get this cron job set up on stage, then close this.
Assignee | ||
Comment 19•14 years ago
|
||
IT bug is filed, this will be testable on stage after it is closed.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 20•13 years ago
|
||
What is the IT bug number?
Comment 21•13 years ago
|
||
To comment 20: Bug 625965 (you’re on CC)
Comment 22•13 years ago
|
||
Verified able to see and click the export TSV file option in /admin, export will not work until bug 625965 has been resolved
Status: RESOLVED → VERIFIED
Reporter | ||
Updated•13 years ago
|
Summary: Make our database dumps downloadable via csv files → Make our database dumps downloadable via tsv files
Updated•13 years ago
|
Component: Input → General
Product: Webtools → Input
You need to log in
before you can comment on or make changes to this bug.
Description
•