Closed Bug 605539 Opened 14 years ago Closed 14 years ago

Make our database dumps downloadable via tsv files

Categories

(Input :: General, enhancement, P2)

enhancement

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: aakashd, Assigned: wenzel)

References

Details

Attachments

(1 file, 2 obsolete files)

I've received requests from a number of folks to have access to our database and they don't necessarily want to get them in an unreadable format. It'd be great to have a feature in the admin screen to grab our entire db (and possibly subsets of each beta) and format it into a readable csv file.
Forgot to mention that something like what's on the top right side would be suggested: http://blocker-reports.brasstacks.mozilla.com/
Priority: -- → P2
Target Milestone: --- → 3.0
By "in the admin screen", do you mean the Django admin site (after login)? Or do you want to expose something on the public dashboard? The latter, I guess, is already somewhat covered by the search feeds.
Assignee: nobody → mozaakash
Yeah, I meant the Django admin site. In the future, we should offer APIs to make it easy for people to extend their own solutions and create data visualizations and what not though (separate bug).
Assignee: mozaakash → fwenzel
Our entire DB is getting pretty big meanwhile. I am not sure if we can (or rather: should) generalize this request like this. Generating a CSV or SQL file on demand is out of the question (it'd take a while, all the time binding the application process). We can probably have a daily SQL dump of the opinions table generated by IT and dropped somewhere, but before we go through this hassle, I think we should find out who wants the data, what for, and if their questions can't be answered more usefully through the search feeds.
From a metrics perspective (sites processing):

* Opinions as TSVs split by version will be most generally useful for the clustering we do. 
  Splits by (version,week) would be ideal as it gives better chunks and allows for
  incremental processing. But we can do those 2nd splits by week on the metrics side.

* A daily update would be minimum, more often even better (4x/day would be great). 

* Fields needed (from feedback.opinion): 
  id, created, os, product, version, locale, 
  url, type (happy/sad/...), and the actual message
Followup to Comment 5: CSV format would also work, but TSV is the default format for hadoop map-reduce stream processing.
Blocks: 621725
Hey Fred, here's a suggestion for an export format. Works without using quotes (always neat for regex and stream processing):

* Fields separated by TAB, lines by newline (LF).
* These field-internal characters are prefixed by one backslash:
  - newline (LF)
  - backslash
  - TAB
  - NUL (ASCII \x00)
  To have "1 record = 1 row", some tool would be used to replace \<LF> with \n

* Field types:
  * text: utf8-encoded unicode
    Only LF and the separator (TAB) are escaped.
  * uint: unsigned int written using plain ASCII/utf-8 0-9

Format 1: opinions.tsv
* All opinion-fields except the UA are included
*  Fields:
   1. id : uint 
   2. created: uint (unix time = seconds since 1 Jan 1970 UTC)
   3. type : text ("happy"/"sad"/"suggestion")
   4. product : text ("desktop"/"mobile")
   5. version : text
   6. os : text
   7. locale : text
   8. manufacturer : text
   9. device : text
  10. feedback : text, list of comma-separated rating values (uints)
  11. url : text
  12. description : text
  

* Ugly SQL to generate this as one big file from the input database:
mysql> SELECT id, unix_timestamp(created) as created, CASE type WHEN 1 THEN "happy" WHEN 2 THEN "sad" WHEN 3 then "suggestion" END as type, CASE product WHEN 1 THEN "desktop" WHEN 2 THEN "mobile" END as product, version, os, locale, manufacturer, device, url, description FROM feedback_opinion where device != "" LIMIT 100;-- INTO OUTFILE '/path/to/file' FIELDS TERMINATED BY '\t' ESCAPED BY '\\' LINES TERMINATED BY '\n';


Format 2: ratings.tsv
* All rating-fields except the rating id (aggregate key should be fine)

1. opinion_id : uint 
2. type : text ("startup"/"pageload"/"responsive"/"crashy"/"features")
3. rating : uint

* One-big-file SQL
mysql> SELECT opinion_id, CASE type WHEN 1 THEN "startup" WHEN 2 THEN "pageload" WHEN 3 then "responsive" WHEN 4 then "crashy" WHEN 5 THEN "features" END as type, rating FROM feedback_rating INTO OUTFILE '/path/to/file' FIELDS TERMINATED BY '\t' ESCAPED BY '\\' LINES TERMINATED BY '\n';
Having a look at Aakash’s Description, I should add that while there are some things that help humans (e.g. urls and descriptions at the end). But the format described in comment 9 mainly targets big-data programs (map-reduce), not so much Excel.

For Excel, we would probably want US-style PDT dates (not sure if it can do iso8601 nowadays) and comma as the seperator with optionally enclosed fields, and lines terminated by '\r\n'. All of this can be achieved by slightly modifiying the queries.

Can we have both?
Attached file Opinion TSV export (as SQL) (obsolete) —
For further reference, the queries from comment 9, formatted, sans the debug stuff, with correct product-ids
Attachment #501236 - Attachment mime type: application/octet-stream → text/plain
Attached file Amended TSV export (SQL) (obsolete) —
The originally submitted export format splits existing \r\n line endings (such as in descriptions) so that we get \r\\n.

That is somewhat unnatural to work with. Given that the export is targeted at batch processing, I think it won’t hurt to replace \r\n line endings with \n prior to encoding to TSV. This way there is only one type of line-ending, and that can be escaped or not.

See updated attachment.
Attachment #501236 - Attachment is obsolete: true
Comment on attachment 502849 [details]
Amended TSV export (SQL)

Your opinion types do not include the new types "rating" and "broken" (website).

I'm working on a cron job to create this data format for us.
Added the missing types. Until we have a job in place, I can create the export myself from the database snapshots that we get pushed to metrics.

Note on Comment 9: 
The field 10 for opinions.tsv is useless due to the extra export we have for ratings, and not included in the export query. It only snuck into that comment due to an editing mistake I made.
Attachment #502849 - Attachment is obsolete: true
Blocks: 624016
Here's a patch for this:

https://github.com/fwenzel/reporter/commit/d4e9e3f

michaelk or davedash, can you review this? This is both a cron job and a celery task if we so desire. It exports all our opinions into one .tsv.bz2 file, and all ratings into a second such file.

When putting this on stage (and prod), we have to agree with IT on where to export these files to, and how to make the files accessible from Apache.
Blocks: 611382
> When putting this on stage (and prod), we have to agree with IT on where to
export these files to, and how to make the files accessible from Apache.

Can we get a bug filed to figure that out, Fred?
Yes, after the code is done, I'll file a bug to get the cron job set up.
Status: NEW → ASSIGNED
All right, I landed this on master:

http://github.com/fwenzel/reporter/commit/ca42a0f

Will file an IT bug to get this cron job set up on stage, then close this.
Blocks: 625965
IT bug is filed, this will be testable on stage after it is closed.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
What is the IT bug number?
To comment 20: Bug 625965 (you’re on CC)
Verified able to see and click the export TSV file option in /admin, export will not work until bug 625965 has been resolved
Status: RESOLVED → VERIFIED
Summary: Make our database dumps downloadable via csv files → Make our database dumps downloadable via tsv files
Component: Input → General
Product: Webtools → Input
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: