Closed Bug 605539 Opened 14 years ago Closed 14 years ago

Make our database dumps downloadable via tsv files

Tracking

(Not tracked)

Status:

VERIFIED FIXED

Milestone:

3.0

People

(Reporter: aakashd, Assigned: wenzel)

References

Details

Attachments

(1 file, 2 obsolete files)

Opinion TSV export (as SQL) 14 years ago Michael Kurze [:michaelk] 927 bytes, text/plain		Details
Amended TSV export (SQL) 14 years ago Michael Kurze [:michaelk] 956 bytes, text/plain		Details
Amended TSV export complete opinion types (SQL) 14 years ago Michael Kurze [:michaelk] 1.00 KB, text/plain		Details

Aakash Desai [:aakashd]

Reporter

Description

•

14 years ago

I've received requests from a number of folks to have access to our database and they don't necessarily want to get them in an unreadable format. It'd be great to have a feature in the admin screen to grab our entire db (and possibly subsets of each beta) and format it into a readable csv file.

Aakash Desai [:aakashd]

Reporter

Comment 1

•

14 years ago

Forgot to mention that something like what's on the top right side would be suggested: http://blocker-reports.brasstacks.mozilla.com/

Aakash Desai [:aakashd]

Reporter

Updated

•

14 years ago

Priority: -- → P2

Target Milestone: --- → 3.0

Fred Wenzel [:wenzel]

Assignee

Comment 2

•

14 years ago

By "in the admin screen", do you mean the Django admin site (after login)? Or do you want to expose something on the public dashboard? The latter, I guess, is already somewhat covered by the search feeds.

Fred Wenzel [:wenzel]

Assignee

Updated

•

14 years ago

Assignee: nobody → mozaakash

Aakash Desai [:aakashd]

Reporter

Comment 3

•

14 years ago

Yeah, I meant the Django admin site. In the future, we should offer APIs to make it easy for people to extend their own solutions and create data visualizations and what not though (separate bug).

Aakash Desai [:aakashd]

Reporter

Updated

•

14 years ago

Assignee: mozaakash → fwenzel

Fred Wenzel [:wenzel]

Assignee

Comment 4

•

14 years ago

Our entire DB is getting pretty big meanwhile. I am not sure if we can (or rather: should) generalize this request like this. Generating a CSV or SQL file on demand is out of the question (it'd take a while, all the time binding the application process). We can probably have a daily SQL dump of the opinions table generated by IT and dropped somewhere, but before we go through this hassle, I think we should find out who wants the data, what for, and if their questions can't be answered more usefully through the search feeds.

Michael Kurze [:michaelk]

Comment 5

•

14 years ago

From a metrics perspective (sites processing):

* Opinions as TSVs split by version will be most generally useful for the clustering we do. 
  Splits by (version,week) would be ideal as it gives better chunks and allows for
  incremental processing. But we can do those 2nd splits by week on the metrics side.

* A daily update would be minimum, more often even better (4x/day would be great). 

* Fields needed (from feedback.opinion): 
  id, created, os, product, version, locale, 
  url, type (happy/sad/...), and the actual message

Michael Kurze [:michaelk]

Comment 6

•

14 years ago

Followup to Comment 5: CSV format would also work, but TSV is the default format for hadoop map-reduce stream processing.

Michael Kurze [:michaelk]

Updated

•

14 years ago

Blocks: 621725

Aakash Desai [:aakashd]

Reporter

Comment 7

•

14 years ago

Here are some ideas that other open data sites use:

http://www.factual.com/
http://www.worldvaluessurvey.org/
http://www.data.gov/catalog/raw
http://infochimps.com/

Aakash Desai [:aakashd]

Reporter

Comment 8

•

14 years ago

Here's another:

http://www.flickr.com/services/api/

Michael Kurze [:michaelk]

Comment 9

•

14 years ago

Hey Fred, here's a suggestion for an export format. Works without using quotes (always neat for regex and stream processing):

* Fields separated by TAB, lines by newline (LF).
* These field-internal characters are prefixed by one backslash:
  - newline (LF)
  - backslash
  - TAB
  - NUL (ASCII \x00)
  To have "1 record = 1 row", some tool would be used to replace \<LF> with \n

* Field types:
  * text: utf8-encoded unicode
    Only LF and the separator (TAB) are escaped.
  * uint: unsigned int written using plain ASCII/utf-8 0-9

Format 1: opinions.tsv
* All opinion-fields except the UA are included
*  Fields:
   1. id : uint 
   2. created: uint (unix time = seconds since 1 Jan 1970 UTC)
   3. type : text ("happy"/"sad"/"suggestion")
   4. product : text ("desktop"/"mobile")
   5. version : text
   6. os : text
   7. locale : text
   8. manufacturer : text
   9. device : text
  10. feedback : text, list of comma-separated rating values (uints)
  11. url : text
  12. description : text
  

* Ugly SQL to generate this as one big file from the input database:
mysql> SELECT id, unix_timestamp(created) as created, CASE type WHEN 1 THEN "happy" WHEN 2 THEN "sad" WHEN 3 then "suggestion" END as type, CASE product WHEN 1 THEN "desktop" WHEN 2 THEN "mobile" END as product, version, os, locale, manufacturer, device, url, description FROM feedback_opinion where device != "" LIMIT 100;-- INTO OUTFILE '/path/to/file' FIELDS TERMINATED BY '\t' ESCAPED BY '\\' LINES TERMINATED BY '\n';


Format 2: ratings.tsv
* All rating-fields except the rating id (aggregate key should be fine)

1. opinion_id : uint 
2. type : text ("startup"/"pageload"/"responsive"/"crashy"/"features")
3. rating : uint

* One-big-file SQL
mysql> SELECT opinion_id, CASE type WHEN 1 THEN "startup" WHEN 2 THEN "pageload" WHEN 3 then "responsive" WHEN 4 then "crashy" WHEN 5 THEN "features" END as type, rating FROM feedback_rating INTO OUTFILE '/path/to/file' FIELDS TERMINATED BY '\t' ESCAPED BY '\\' LINES TERMINATED BY '\n';

Michael Kurze [:michaelk]

Comment 10

•

14 years ago

Having a look at Aakash’s Description, I should add that while there are some things that help humans (e.g. urls and descriptions at the end). But the format described in comment 9 mainly targets big-data programs (map-reduce), not so much Excel.

For Excel, we would probably want US-style PDT dates (not sure if it can do iso8601 nowadays) and comma as the seperator with optionally enclosed fields, and lines terminated by '\r\n'. All of this can be achieved by slightly modifiying the queries.

Can we have both?

Michael Kurze [:michaelk]

Comment 11

•

14 years ago

Attached file Opinion TSV export (as SQL) (obsolete) — Details

For further reference, the queries from comment 9, formatted, sans the debug stuff, with correct product-ids

Michael Kurze [:michaelk]

Updated

•

14 years ago

Attachment #501236 - Attachment mime type: application/octet-stream → text/plain

Michael Kurze [:michaelk]

Comment 12

•

14 years ago

Attached file Amended TSV export (SQL) (obsolete) — Details

The originally submitted export format splits existing \r\n line endings (such as in descriptions) so that we get \r\\n.

That is somewhat unnatural to work with. Given that the export is targeted at batch processing, I think it won’t hurt to replace \r\n line endings with \n prior to encoding to TSV. This way there is only one type of line-ending, and that can be escaped or not.

See updated attachment.

Attachment #501236 - Attachment is obsolete: true

Fred Wenzel [:wenzel]

Assignee

Comment 13

•

14 years ago

Comment on attachment 502849 [details]
Amended TSV export (SQL)

Your opinion types do not include the new types "rating" and "broken" (website).

I'm working on a cron job to create this data format for us.

Michael Kurze [:michaelk]

Comment 14

•

14 years ago

Attached file Amended TSV export complete opinion types (SQL) — Details

Added the missing types. Until we have a job in place, I can create the export myself from the database snapshots that we get pushed to metrics.

Note on Comment 9: 
The field 10 for opinions.tsv is useless due to the extra export we have for ratings, and not included in the export query. It only snuck into that comment due to an editing mistake I made.

Attachment #502849 - Attachment is obsolete: true

Fred Wenzel [:wenzel]

Assignee

Updated

•

14 years ago

Blocks: 624016

Fred Wenzel [:wenzel]

Assignee

Comment 15

•

14 years ago

Here's a patch for this:

https://github.com/fwenzel/reporter/commit/d4e9e3f

michaelk or davedash, can you review this? This is both a cron job and a celery task if we so desire. It exports all our opinions into one .tsv.bz2 file, and all ratings into a second such file.

When putting this on stage (and prod), we have to agree with IT on where to export these files to, and how to make the files accessible from Apache.

Fred Wenzel [:wenzel]

Assignee

Updated

•

14 years ago

Blocks: 611382

Aakash Desai [:aakashd]

Reporter

Comment 16

•

14 years ago

> When putting this on stage (and prod), we have to agree with IT on where to
export these files to, and how to make the files accessible from Apache.

Can we get a bug filed to figure that out, Fred?

Fred Wenzel [:wenzel]

Assignee

Comment 17

•

14 years ago

Yes, after the code is done, I'll file a bug to get the cron job set up.

Fred Wenzel [:wenzel]

Assignee

Updated

•

14 years ago

Status: NEW → ASSIGNED

Fred Wenzel [:wenzel]

Assignee

Comment 18

•

14 years ago

All right, I landed this on master:

http://github.com/fwenzel/reporter/commit/ca42a0f

Will file an IT bug to get this cron job set up on stage, then close this.

Fred Wenzel [:wenzel]

Assignee

Updated

•

14 years ago

Blocks: 625965

Fred Wenzel [:wenzel]

Assignee

Comment 19

•

14 years ago

IT bug is filed, this will be testable on stage after it is closed.

Status: ASSIGNED → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Rebecca Billings [:rbillings]

Comment 20

•

13 years ago

What is the IT bug number?

Michael Kurze [:michaelk]

Comment 21

•

13 years ago

To comment 20: Bug 625965 (you’re on CC)

Rebecca Billings [:rbillings]

Comment 22

•

13 years ago

Verified able to see and click the export TSV file option in /admin, export will not work until bug 625965 has been resolved

Status: RESOLVED → VERIFIED

Aakash Desai [:aakashd]

Reporter

Updated

•

13 years ago

Summary: Make our database dumps downloadable via csv files → Make our database dumps downloadable via tsv files

Nobody; OK to take it and work on it

Updated

•

13 years ago

Component: Input → General

Product: Webtools → Input

You need to log in before you can comment on or make changes to this bug.