Cluster themes and sites as publicly available and output to JSON



8 years ago


(Reporter: davedash, Assigned: michaelk)





We'd like metrics to handle themes (as they are similar to sites).

We want clustering that groups opinions for each version of Firefox and Mobile.

As new data arrives, it should be added to existing themes (like growing crystals), before generating a new cluster.

Each cluster should have a unique unchanging ID so that we can have permanent URLs on the input site.
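A minimal sketch of the "growing crystals" idea with permanent IDs, not the actual clustering method: the bug does not specify an algorithm, so word-overlap (Jaccard) similarity stands in here, and the class name, threshold, and use of UUIDs are all assumptions for illustration.

```python
import uuid


def tokens(text):
    """Lowercased whitespace tokens of an opinion (a deliberately naive tokenizer)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


class IncrementalClusters:
    """Hypothetical helper: grows existing clusters before creating new ones."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed cut-off, would need tuning
        self.members = {}           # cluster_id -> list of opinion ids
        self.vocab = {}             # cluster_id -> set of tokens seen so far

    def add(self, opinion_id, text):
        toks = tokens(text)
        best_id, best_score = None, 0.0
        for cluster_id, vocab in self.vocab.items():
            score = jaccard(toks, vocab)
            if score > best_score:
                best_id, best_score = cluster_id, score
        if best_id is None or best_score < self.threshold:
            # No existing cluster is close enough: create one with an ID
            # that is generated once and never changes afterwards.
            best_id = uuid.uuid4().hex
            self.members[best_id] = []
            self.vocab[best_id] = set()
        self.members[best_id].append(opinion_id)
        self.vocab[best_id] |= toks
        return best_id
```

Because an ID is only generated when a cluster is first created, URLs built from it stay valid as the cluster grows.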

The JSON files for sites should look like this:

{"": [1,2,3,4,..., opinion_ids]}

These opinion_ids should be the same ids that you get from input.

For themes:

{"UNIQUETHEMEID": [1,2,3,4,..., opinion_ids]}
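One way the two maps above could be written, sketched under assumptions: the function name is hypothetical, and the choice to emit one entry at a time (so the whole map never has to sit in memory) is an illustration, not something the bug prescribes.

```python
import io
import json


def write_cluster_map(out, clusters):
    """Write {key: [opinion_id, ...]} to a text stream, one entry at a time.

    `clusters` is any iterable of (key, opinion_ids) pairs, so it can be a
    generator backed by a database cursor.
    """
    out.write("{")
    for i, (key, opinion_ids) in enumerate(clusters):
        if i:
            out.write(",")
        out.write(json.dumps(str(key)))
        out.write(":")
        # Compact separators keep the dump small.
        out.write(json.dumps(list(opinion_ids), separators=(",", ":")))
    out.write("}")


buf = io.StringIO()
write_cluster_map(buf, [("UNIQUETHEMEID", [1, 2, 3, 4])])
```

The same function works for an open file object in place of the `StringIO` buffer.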

We may be dealing with large files, so we'll have to be smart about how we create these.  Something like:


Input will load all the data for previous betas in a single fetch.  For current betas and current versions we'll poll the public dump periodically.

If these are public we can also serve them to community, and use them on our own instances of input that we spin up for testing:

See bug 629011.

Let us know where this might fit milestone-wise. If you would rather work on it as you have time, just let us know where you are.

I can build the consumer side for input.

Comment 1

8 years ago
We need to keep in mind that for each site, or each beta there can be multiple clusters generated.

To retrieve the clusters the service could offer bucket URLs like this:

GET /clusters/corpus_id 
GET /cluster/cluster_id

The corpus_id itself can then have a path-style structure. That makes for nice URLs and is still super-extensible:


and for themes:

Both URLs will give you a map like this:
"{'cluster_id': [[opinion_id, ...], [opinion_id, ...], ...], ...}

The /cluster/* URLs always give you only one map entry of course.
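A small sketch of how a client might build these two URLs. The base URL and the example corpus id are made up for illustration; only the `/clusters/` and `/cluster/` prefixes come from the proposal above.

```python
from urllib.parse import quote


def clusters_url(base, corpus_id):
    """URL listing all clusters of a corpus; slashes in a path-style corpus id are kept."""
    return f"{base.rstrip('/')}/clusters/{quote(corpus_id, safe='/')}"


def cluster_url(base, cluster_id):
    """URL of a single cluster; the id is treated as one opaque path segment."""
    return f"{base.rstrip('/')}/cluster/{quote(cluster_id, safe='')}"


# Hypothetical values, not real endpoints or ids.
BASE = "http://localhost:8080"
print(clusters_url(BASE, "firefox/4.0b10"))
print(cluster_url(BASE, "abc123"))
```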

To push a message to a corpus:

POST /corpora/corpus_id (document in the body)

Batch-loading: Probably through an admin command not exposed via HTTP.

Comment 2

8 years ago
Addition: the POST needs to contain the id of course.

POST /corpora/corpus_id

{"id": "4815162342", "text": "Firef0x is rulzor"}
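For the consumer side, the request could be assembled roughly like this. It is a sketch only: the base URL, the function name, and the `Content-Type` header are assumptions, and the request is built but not sent.

```python
import json
from urllib.request import Request


def build_post(base, corpus_id, opinion_id, text):
    """Build (but do not send) the POST that pushes one opinion to a corpus."""
    body = json.dumps({"id": str(opinion_id), "text": text}).encode("utf-8")
    return Request(
        url=f"{base.rstrip('/')}/corpora/{corpus_id}",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed, not specified in the bug
        method="POST",
    )


# Hypothetical base URL and corpus id.
req = build_post("http://localhost:8080", "firefox", "4815162342", "Firef0x is rulzor")
```

Passing `req` to `urllib.request.urlopen` would send it once a real service address is known.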
Comment 3

Let's keep the keys small but the paths to the files long. That will keep the JSON files small.

> Both URLs will give you a map like this:
> "{'cluster_id': [[opinion_id, ...], [opinion_id, ...], ...], ...}

Why would there be a list of lists per cluster?

Comment 4

8 years ago
You are totally right. I think first I had a list of list in mind, with sequential cluster index numbers.

With the map, that doesn't make sense anymore.

{'cluster_id': [opinion_id, ...], ...}
Priority: -- → P1
Target Milestone: --- → 3.2
Blocks: 623360
Target Milestone: 3.2 → 3.3

Comment 5

8 years ago
Update: I started writing code and putting it on (plus *-rest, *-worker)

So far it is mostly a mock REST service and infrastructure for the Java processing part. I'll be on vacation this week, as mentioned in my mail, but I'll try to set up an instance of the mock on one of the metrics machines during that time. This should allow you to get started writing client code.
Blocks: 623361
Blocks: 635999

Comment 6

8 years ago

A mock server (should be reachable from MPT-VPN) is running at 

I posted updated API specs in the grunde/ -- please note the new namespace prefix, so multiple clients can maintain collections about "firefox". Also, slashes should not be used in namespace or cluster key (encode to %2F if needed), so we can use them for hierarchical clustering at some point in the future. For input I recklessly suggest the namespace "input".
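The slash rule can be handled on the client with standard percent-encoding. A minimal sketch, assuming Python on the consumer side; the helper name and the example key are hypothetical.

```python
from urllib.parse import quote, unquote


def encode_segment(value):
    """Percent-encode a namespace or key so it contains no literal slash."""
    # safe="" makes quote() encode "/" as %2F instead of leaving it alone.
    return quote(value, safe="")


namespace = encode_segment("input")
key = encode_segment("firefox/4.0")  # hypothetical key containing a slash
print(namespace, key)
```

Decoding with `unquote` gives the original value back, so the encoding is lossless.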

The mock service will return the default clusters for most requests. To explicitly get a 404, use "no-such-key" for a collection key, or "no-such-label" for a cluster label. Let me know if you find problems with the mock service or with the specs.

Other than that, we're reviewing the architecture on Thursday. There are several parts that are pretty much accepted though, so that won’t block. I’ll continue to keep you posted.
Thanks for your hard work on this, Michael! Pushing this into 3.4 for continued work, but I think your mock server will already prove valuable to get started on the frontend work.
OS: Mac OS X → All
Hardware: x86 → All
Target Milestone: 3.3 → 3.4
Whiteboard: [3.5]
Target Milestone: 3.4 → 3.5
Whiteboard: [3.5]
Running into GrouperFish issues.  Moving those tasks out of 3.5.
Target Milestone: 3.5 → 4.0
Moving theme-related stuff to FUTURE. The expectation is that michaelk will work on his end as time permits, and that it'll be a priority on my plate. I don't want to move it into a milestone, though, because it's blocked on michaelk, who is busy with BZETL. Once his portion is complete, I'll move the remaining bugs into a milestone.
Target Milestone: 4.0 → Future
Blocks: 663032
No longer depends on: 663032

Comment 10

8 years ago
I am marking this as FIXED since the service is up and running. We should file separate bugs for any problems we might find and/or new features (incremental, ...) that we want to add.

Please reopen if there is something in here that GF does not actually do.

Grouperfish docs (including REST api):
Last Resolved: 8 years ago
Resolution: --- → FIXED

Comment 11

8 years ago
Ah, forgot to mention that the service interface is running at

and can be tested at:
Version: 3.0 → 4.2
Component: Input → General
Product: Webtools → Input