Bug 629019 - Cluster themes and sites as publicly available and output to JSON
Product: Input
Classification: Server Software
Component: General
Version: 4.2
Platform: All All
Importance: P1 normal
Target Milestone: Future
Assigned To: Michael Kurze [:michaelk]
QA Contact: input-deleted
Depends on:
Blocks: 583669 623360 623361 626689 635999 663032
Reported: 2011-01-26 09:05 PST by Dave Dash [:davedash, :dd] (assign all bugs to mbrandt)
Modified: 2011-06-23 09:29 PDT


Description Dave Dash [:davedash, :dd] (assign all bugs to mbrandt) 2011-01-26 09:05:49 PST
We'd like metrics to handle themes (as they are similar to sites).

What we want is clustering that clusters opinions for each version of Firefox and Mobile.

As new data arrives, it should be added to existing themes (like growing crystals), before generating a new cluster.

Each cluster should have a unique unchanging ID so that we can have permanent URLs on the input site.
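The "growing crystals" behaviour with permanent IDs could be sketched roughly as follows. This is only an illustration: the word-overlap (Jaccard) similarity, the `threshold` value, and the `assign` helper are assumptions made for the example, not the clustering method the metrics side would actually use.

```python
import uuid


def tokens(text):
    """Lowercased word set; a crude stand-in for real feature extraction."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign(clusters, opinion_id, text, threshold=0.3):
    """Add an opinion to the best-matching existing cluster, or start a new one.

    clusters maps cluster_id -> {"tokens": set, "opinions": [opinion_ids]}.
    Returns the ID of the cluster the opinion ended up in.
    """
    words = tokens(text)
    best_id, best_score = None, 0.0
    for cluster_id, cluster in clusters.items():
        score = jaccard(words, cluster["tokens"])
        if score > best_score:
            best_id, best_score = cluster_id, score
    if best_id is None or best_score < threshold:
        # The ID is generated once and never changes, so URLs stay permanent.
        best_id = uuid.uuid4().hex
        clusters[best_id] = {"tokens": set(), "opinions": []}
    clusters[best_id]["tokens"] |= words
    clusters[best_id]["opinions"].append(opinion_id)
    return best_id
```

Existing clusters are always tried first; a new ID is only minted when nothing is similar enough.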

The JSON files for sites should look like this:

{"": [1,2,3,4,..., opinion_ids]}

These opinion_ids should be the same IDs that you get from Input.

For themes:

{"UNIQUETHEMEID": [1,2,3,4,..., opinion_ids]}
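One way such a file could be produced without building the whole document in memory is to write it one entry at a time. The `write_cluster_map` helper below is hypothetical and only sketches the idea; the input could be a generator reading from a database cursor.

```python
import io
import json


def write_cluster_map(clusters, out):
    """Write {"cluster_id": [opinion_ids]} to a file-like object, entry by entry.

    clusters is any iterable of (cluster_id, opinion_ids) pairs.
    """
    out.write("{")
    for i, (cluster_id, opinion_ids) in enumerate(clusters):
        if i:
            out.write(",")
        out.write(json.dumps(str(cluster_id)))
        out.write(":")
        # Compact separators keep the file small.
        out.write(json.dumps(list(opinion_ids), separators=(",", ":")))
    out.write("}")


buf = io.StringIO()
write_cluster_map([("t1", [1, 2, 3]), ("t2", [4])], buf)
print(buf.getvalue())  # {"t1":[1,2,3],"t2":[4]}
```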

We may be dealing with large files, so we'll have to be smart about how we create these.  Something like:


Input will load all the data for previous betas in a single fetch.  For current betas and current versions we'll poll the public dump periodically.

If these are public we can also serve them to community, and use them on our own instances of input that we spin up for testing:

See bug 629011.

Let us know where this might fit in milestone-wise, or if you want to work on it as you have time, just let us know where you are at.

I can build the consumer side for input.
Comment 1 Michael Kurze [:michaelk] 2011-01-26 13:10:18 PST
We need to keep in mind that for each site or each beta, multiple clusters can be generated.

To retrieve the clusters the service could offer bucket URLs like this:

GET /clusters/corpus_id 
GET /cluster/cluster_id

The corpus_id itself can then have a path-style structure. That makes for nice URLs and is still super-extensible:


and for themes:

Both URLs will give you a map like this:
{'cluster_id': [[opinion_id, ...], [opinion_id, ...], ...], ...}

The /cluster/* URLs always give you only one map entry of course.

To push a message to a corpus:

POST /corpora/corpus_id (document in the body)

Batch-loading: Probably through an admin command not exposed via HTTP.
Comment 2 Michael Kurze [:michaelk] 2011-01-26 13:13:20 PST
Addition: the POST needs to contain the id of course.

POST /corpora/corpus_id

{"id": "4815162342", "text": "Firef0x is rulzor"}
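A client-side sketch of building that request might look like this. The base URL, the `firefox` corpus ID, and the `Content-Type` header are assumptions for illustration; only the path shape and the `id`/`text` body come from the proposal above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # hypothetical host; the real one is not given here


def build_post(corpus_id, doc_id, text):
    """Build (but do not send) a POST of one document to /corpora/<corpus_id>."""
    body = json.dumps({"id": str(doc_id), "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/corpora/{corpus_id}",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )


req = build_post("firefox", "4815162342", "Firef0x is rulzor")
# urllib.request.urlopen(req) would send it; left out so the sketch runs offline.
```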
Comment 3 Dave Dash [:davedash, :dd] (assign all bugs to mbrandt) 2011-01-26 13:23:33 PST
Let's keep the keys small, but the paths to the files large.  It'll keep the json files small.

> Both URLs will give you a map like this:
> {'cluster_id': [[opinion_id, ...], [opinion_id, ...], ...], ...}

Why would there be a list of lists per cluster?
Comment 4 Michael Kurze [:michaelk] 2011-01-26 13:28:44 PST
You are totally right. I think first I had a list of lists in mind, with sequential cluster index numbers.

With the map, that don’t make no sense anymore.

{'cluster_id': [opinion_id, ...], ...}
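A consumer could check for this flat shape and build a reverse lookup along these lines. The helper names are hypothetical, and the sketch assumes the service returns valid JSON with double-quoted keys rather than the single-quoted shorthand used above.

```python
import json


def parse_cluster_map(payload):
    """Parse a response and check it is {cluster_id: [opinion_id, ...]}."""
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object keyed by cluster_id")
    for cluster_id, opinion_ids in data.items():
        if not isinstance(opinion_ids, list):
            raise ValueError(f"cluster {cluster_id!r}: expected a list")
        if any(isinstance(o, list) for o in opinion_ids):
            raise ValueError(f"cluster {cluster_id!r}: nested lists not allowed")
    return data


def opinion_index(cluster_map):
    """Invert the map: opinion_id -> list of cluster_ids containing it."""
    index = {}
    for cluster_id, opinion_ids in cluster_map.items():
        for opinion_id in opinion_ids:
            index.setdefault(opinion_id, []).append(cluster_id)
    return index
```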
Comment 5 Michael Kurze [:michaelk] 2011-02-20 23:01:58 PST
Update: I started writing code and putting it on (plus *-rest, *-worker)

So far mostly a mock REST service and infrastructure for the Java processing part. I’ll be on vacation this week, as mentioned in my mail, but I'll try to set up an instance of the mock on one of the metrics machines during that time. This should allow you to get started writing client code.
Comment 6 Michael Kurze [:michaelk] 2011-03-01 19:31:39 PST

A mock server (should be reachable from MPT-VPN) is running at 

I posted updated API specs in the grunde/ -- please note the new namespace prefix, so multiple clients can maintain collections about "firefox". Also, slashes should not be used in namespace or cluster key (encode to %2F if needed), so we can use them for hierarchical clustering at some point in the future. For input I recklessly suggest the namespace "input".
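On the client side, the slash rule could be handled by percent-encoding each segment before joining. The `collection_path` layout below is a hypothetical illustration, not the path scheme from the specs; only the `%2F` encoding and the "input" namespace come from the comment above.

```python
from urllib.parse import quote, unquote


def encode_segment(value):
    """Percent-encode one path segment; safe="" makes "/" become %2F."""
    return quote(value, safe="")


def collection_path(namespace, key):
    """Join an encoded namespace and key (hypothetical path layout)."""
    return "/" + "/".join(encode_segment(part) for part in (namespace, key))


print(collection_path("input", "firefox/4.0b11"))  # /input/firefox%2F4.0b11
```

Because the slash inside the key is encoded, the unencoded slashes remain free to act as hierarchy separators later.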

The mock service will return the default clusters for most requests. To explicitly get a 404, use "no-such-key" for a collection key, or "no-such-label" for a cluster label. Let me know if you find problems with the mock service or with the specs.

Other than that, we're reviewing the architecture on Thursday. There are several parts that are pretty much accepted though, so that won’t block. I’ll continue to keep you posted.
Comment 7 Fred Wenzel [:wenzel] 2011-03-03 16:07:15 PST
Thanks for your hard work on this, Michael! Pushing this into 3.4 for continued work, but I think your mock server will already prove valuable to get started on the frontend work.
Comment 8 Dave Dash [:davedash, :dd] (assign all bugs to mbrandt) 2011-05-04 11:11:02 PDT
Running into GrouperFish issues.  Moving those tasks out of 3.5.
Comment 9 Dave Dash [:davedash, :dd] (assign all bugs to mbrandt) 2011-05-18 13:01:25 PDT
Moving theme-related stuff to FUTURE. The expectation is that michaelk will work on his end as time permits, and that it'll be a priority on my plate, but I don't want to move it into a milestone because it's blocked by michaelk, who is busy with BZETL.  Once his portion is complete, I'll move the remaining bugs into a milestone.
Comment 10 Michael Kurze [:michaelk] 2011-06-21 08:15:58 PDT
I am marking this as FIXED since the service is up and running. We should file separate bugs for any problems we might find and/or new features (incremental, ...) that we want to add.

Please reopen if there is something in here that GF does not actually do.

Grouperfish docs (including REST api):
Comment 11 Michael Kurze [:michaelk] 2011-06-21 08:17:49 PDT
Ah, forgot to mention that the service interface is running at

and can be tested at:
