629019 - Cluster themes and sites as publicly available and output to JSON

Dave Dash [:davedash, :dd] (assign all bugs to mbrandt)

Reporter

Description

15 years ago

We'd like metrics to handle themes (as they are similar to sites). What we want is clustering that groups opinions for each version of Firefox and Mobile. As new data arrives, it should be added to existing themes (like growing crystals) before a new cluster is generated. Each cluster should have a unique, unchanging ID so that we can have permanent URLs on the input site.

The JSON files for sites should look like this:

    {"youtube.com": [1,2,3,4,..., opinion_ids]}

These opinion_ids should be the same ids that you get from input.

For themes:

    {"UNIQUETHEMEID": [1,2,3,4,..., opinion_ids]}

We may be dealing with large files, so we'll have to be smart about how we create these. Something like:

    /firefox/4.0b10/themes.json
    /firefox/4.0b10/sites.json

Input will load all the data for previous betas in a single fetch. For current betas and current versions we'll poll the public dump periodically.

If these are public, we can also serve them to the community and use them on our own instances of input that we spin up for testing. See bug 629011.

Let us know where this might fit in milestone-wise, or, if you want to work on it as you have time, just let us know where you are at. I can build the consumer side for input.
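As a rough illustration of the dump layout described above, here is a minimal Python sketch. The function names `dump_path` and `serialize_clusters` are hypothetical, and the choice to sort ids and strip whitespace is only an assumption about how one might keep large files small; nothing here is the actual metrics implementation.

```python
import json
from pathlib import PurePosixPath


def dump_path(product, version, kind):
    """Build a dump path such as /firefox/4.0b10/sites.json (hypothetical helper)."""
    if kind not in ("sites", "themes"):
        raise ValueError("kind must be 'sites' or 'themes'")
    return str(PurePosixPath("/") / product / version / f"{kind}.json")


def serialize_clusters(clusters):
    """Serialize {site_or_theme_id: iterable of opinion ids} as compact JSON.

    Assumption: compact separators and sorted ids are used purely to keep
    the output small and stable between runs.
    """
    payload = {key: sorted(ids) for key, ids in clusters.items()}
    return json.dumps(payload, separators=(",", ":"), sort_keys=True)


if __name__ == "__main__":
    print(dump_path("firefox", "4.0b10", "themes"))
    print(serialize_clusters({"youtube.com": [3, 1, 2]}))
```

A consumer on the input side could fetch the file at `dump_path(...)` once for finished betas and poll it for current ones.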

Michael Kurze [:michaelk]

Assignee

Comment 1

15 years ago

We need to keep in mind that for each site, or each beta, there can be multiple clusters generated. To retrieve the clusters, the service could offer bucket URLs like this:

    GET /clusters/corpus_id
    GET /cluster/cluster_id

The corpus_id itself can then have a path-style structure. That makes for nice URLs and is still super-extensible:

    "firefox/beta10/sites/youtube.com/happy/ALL"
    "firefox-mobile/beta10/sites/youtube.com/happy/mac"

and for themes:

    "beta9/themes/mac"

Both URLs will give you a map like this:

    {'cluster_id': [[opinion_id, ...], [opinion_id, ...], ...], ...}

The /cluster/* URLs always give you only one map entry, of course.

To push a message to a corpus:

    POST /corpora/corpus_id (document in the body)

Batch-loading: probably through an admin command not exposed via HTTP.
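To make the path-style corpus_id idea concrete, here is a small Python sketch. The helpers `corpus_id` and `clusters_url` are hypothetical names, and the rule that a segment may not be empty or contain a slash is an assumption, not something the proposal specifies.

```python
from urllib.parse import quote


def corpus_id(*segments):
    """Join segments into a path-style corpus id (hypothetical helper).

    Assumption: segments must be non-empty and must not contain '/',
    so that the id can be split back into its parts unambiguously.
    """
    if not segments or any(not s or "/" in s for s in segments):
        raise ValueError("segments must be non-empty and contain no '/'")
    return "/".join(segments)


def clusters_url(cid):
    """Build the proposed GET /clusters/corpus_id path for a corpus id."""
    return "/clusters/" + quote(cid, safe="/")


if __name__ == "__main__":
    cid = corpus_id("firefox", "beta10", "sites", "youtube.com", "happy", "ALL")
    print(clusters_url(cid))
```

Because the id is just a slash-joined path, new dimensions (platform, sentiment, and so on) can be appended without changing the URL scheme.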

Michael Kurze [:michaelk]

Assignee

Comment 2

15 years ago

Addition: the POST needs to contain the id, of course.

    POST /corpora/corpus_id
    {"id": "4815162342", "text": "Firef0x is rulzor"}
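For illustration, a client could prepare such a push request roughly as follows. This is only a sketch: `BASE_URL`, the function name `build_push_request`, and the JSON content type are assumptions, and the request is built but not sent.

```python
import json
import urllib.request

# Hypothetical service location; the bug does not name a host.
BASE_URL = "http://localhost:8000"


def build_push_request(corpus_id, opinion_id, text, base_url=BASE_URL):
    """Prepare (but do not send) a POST /corpora/corpus_id request.

    The body carries the opinion id alongside the text, as proposed above.
    Assumption: the service accepts a JSON body with a JSON content type.
    """
    body = json.dumps({"id": str(opinion_id), "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/corpora/{corpus_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_push_request("beta9/themes/mac", "4815162342", "Firef0x is rulzor")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the prepared request to `urllib.request.urlopen` would send it, assuming a service is listening at that address.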

Dave Dash [:davedash, :dd] (assign all bugs to mbrandt)

Reporter

Comment 3

15 years ago

Let's keep the keys small, but the paths to the files large. It'll keep the JSON files small.

> Both URLs will give you a map like this:
> "{'cluster_id': [[opinion_id, ...], [opinion_id, ...], ...], ...}

Why would there be a list of lists per cluster?

Michael Kurze [:michaelk]

Assignee

Comment 4

15 years ago

You are totally right. I think at first I had a list of lists in mind, with sequential cluster index numbers. With the map, that doesn't make sense anymore. The map should simply be:

    {'cluster_id': [opinion_id, ...], ...}

Dave Dash [:davedash, :dd] (assign all bugs to mbrandt)

Reporter