Closed
Bug 982242
Opened 11 years ago
Closed 10 years ago
Have word cloud as an alternate visualization for feedback
Categories
(Input :: General, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: cww, Unassigned)
Details
We'd like to make a word cloud be an alternate view in the dashboard. The idea would be that whatever you're looking at, you can toggle between a graph (what we have) and a word cloud that reflects all the feedback that fits the criteria.
After talking with Will, it looks like the work is going to be writing code (Python) that takes all the feedback in a facet and counts occurrences of each word, then passes those counts to a JS library that generates a word cloud.
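A minimal sketch of the counting half of that idea, assuming the feedback is available as a plain list of description strings. The function name and the tokenizing regex are illustrative only and not taken from the Fjord code base:

```python
import re
from collections import Counter


def count_words(descriptions):
    """Count how often each word occurs across all feedback descriptions."""
    counts = Counter()
    for text in descriptions:
        # Lowercase, then keep runs of letters, digits and apostrophes as words.
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts


# Hypothetical sample data standing in for the feedback in one facet.
feedback = ["Firefox is slow", "slow slow startup"]
word_counts = count_words(feedback)
```

The resulting mapping of word to count is the kind of data a word cloud library would take as input.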
Comment 1•11 years ago
Flagging myself with a needinfo so I don't forget to break this down.
Flags: needinfo?(willkg)
Comment 2•11 years ago
Breakdown could be something like this:
1. Add some code to the incredibly monolithic view function named "dashboard" located in fjord/analytics/views.py that adds an additional terms facet on the description field. It should probably be added on line 330. You'll want to do a terms facet and capture the results, then pass those in the dict created on line 31.
When doing the terms facet, you'll have to use .facet_raw() because you'll need to specify the number of terms you want back. Elasticsearch defaults to 10 and you're probably going to want more words in your word cloud. We do this already with the occurrences report in the function named "analytics_occurrences" in fjord/analytics/analytics_views.py around line 117.
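As a rough sketch of what the raw facet could look like: the helper below only builds the Elasticsearch terms facet body with an explicit size. The helper name, the facet name "wordcloud" and the size of 100 are assumptions for illustration, and the way it would be handed to .facet_raw() is shown only as a comment because it depends on the search object the dashboard view already builds:

```python
def terms_facet_body(field="description", size=100):
    """Build a raw terms facet body that asks for `size` terms.

    Without an explicit size, Elasticsearch returns only 10 terms.
    """
    return {"terms": {"field": field, "size": size}}


body = terms_facet_body()

# Assumed usage with an existing ElasticUtils search object (not run here):
#   search = search.facet_raw(wordcloud=body)
#   results = search.facet_counts()["wordcloud"]
```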
Note: There are a lot of technical terms in there, so you're going to have to ask questions. I'll work through the bits with you. This touches Django, Elasticsearch and ElasticUtils (the library we use for Elasticsearch). Plus it's making changes to the dashboard view code, which is sort of mediocre.
2. After finishing step 1, you'll have the data you need to generate the wordcloud based on the input parameters for the dashboard. Next step is to make sure that data gets passed in the HTML template to the browser.
The file you're going to want to edit is fjord/analytics/templates/analytics/dashboard.html. For now, the UI should start with the existing happy/sad histogram graph and have a toggle to switch to the wordcloud. The existing histogram graph is in a div block around line 125 of that template.
For this step, I think the easiest thing to do is mirror what we're doing with the happy/sad histogram by creating a new div for the wordcloud and sticking the data in a data-wordcloud attribute. We'll add the toggle in step 3; for now the template will have the data, but won't do anything with it. You'll probably want to add a display: none CSS rule for the new div.
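One way the view could prepare that data for the data-wordcloud attribute is sketched below, assuming the facet results have already been reduced to (term, count) pairs. The function name and the JSON shape are assumptions; if the template engine autoescapes values, the explicit escaping step would be left to the template instead:

```python
import html
import json


def wordcloud_attr(terms):
    """Serialize (term, count) pairs to a string safe for an HTML attribute."""
    payload = json.dumps([{"term": term, "count": count} for term, count in terms])
    # Escape quotes and angle brackets so the JSON can sit inside data-wordcloud="...".
    return html.escape(payload, quote=True)


attr_value = wordcloud_attr([("slow", 12), ("crash", 7)])
```

On the browser side, the JavaScript from step 3 would read the attribute and parse it as JSON.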
3. After finishing step 2, you'll have the data in the place you want it. Next you want to add the JavaScript to render the wordcloud and the JavaScript that makes the toggle work. That JavaScript should go in fjord/analytics/static/js/dashboard.js.
I'm on the fence about whether to render this at init or whether to render it only once someone has pressed the toggle.
It probably makes sense to have the wordcloud take up the same amount of space that the graph currently takes up so you can toggle between them without stuff moving around.
I did a search for "js wordcloud" and found a couple of possible javascript implementations like this one:
http://timdream.org/wordcloud2.js/#love
It might be interesting to use a library to generate it, but we'd want to make sure the new dependencies we're adding are all licensed with open source licenses and that it's not a ton of bytes. It seems more fun to write your own than use a library. We're using jquery already, so using that is fine. I'd be game for adding a d3 dependency (http://d3js.org/), too, since I think we'll really want to use that for future dataviz stuff.
Also, here's a list of anti-requirements (i.e. things you shouldn't worry about):
1. Common words. It's likely the terms facet will return words like "is", which aren't particularly useful. I think we can deal with ignoring words in a later pass after we see how things look.
2. Misspellings and l337 speak. There are a lot of misspellings and l337 speak in the data. Let's not think about this for now.
I think that covers it. It's possible step 3 will take a couple of weeks to do depending on whether you decide to write your own wordcloud implementation.
Cheng: Are there any other anti-requirements?
Flags: needinfo?(willkg) → needinfo?(cwwmozilla)
I think that looks excellent.
A few points (helpful or not):
I've historically done counts not by how many times a word shows up but by how many input items contain the word. It's a subtle difference, but for consistency I think we should stick to it. (Unless that's hard given the above.)
We will have WAY too many words tied at the mid-to-low levels. I think we need to figure out a way to decide what to show and what not to.
Stemming/grouping like words. Not a goal, future task.
Log vs linear scaling, we can decide that once we see the data.
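To make the first and last points concrete, here is a small sketch under stated assumptions: the first function counts how many feedback items contain each word (not total occurrences), and the second maps a count to a font size using either log or linear scaling. The function names and pixel bounds are made up for illustration:

```python
import math
import re


def document_counts(descriptions):
    """Count how many feedback items contain each word at least once."""
    counts = {}
    for text in descriptions:
        # set() means a word repeated within one item is counted only once.
        for word in set(re.findall(r"[a-z0-9']+", text.lower())):
            counts[word] = counts.get(word, 0) + 1
    return counts


def font_size(count, max_count, min_px=12, max_px=48, log_scale=True):
    """Map a count in [1, max_count] to a font size in [min_px, max_px]."""
    if max_count <= 1:
        return max_px
    if log_scale:
        ratio = math.log(count) / math.log(max_count)
    else:
        ratio = (count - 1) / (max_count - 1)
    return min_px + ratio * (max_px - min_px)


item_counts = document_counts(["slow slow slow", "slow start"])
```

With log scaling, mid-range counts get noticeably larger sizes than with linear scaling, which is the trade-off to look at once real data is available.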
Flags: needinfo?(cwwmozilla)
Comment 4•11 years ago
Is there a "deadline" for this? This sounds like a fantastic intern project.
Comment 5•11 years ago
If we do a terms facet, then Elasticsearch already stems all the words when they're added to the index. However, it does mean that words like "because" get changed to "becaus"--it'll look the same as our bigram data and a little goofy. I'm not entirely sure what to do about that. We could create a new field in the mapping, do our own parsing and stemming and do a terms facet on that. Then we'll get regular-looking words back. If "becaus" is good enough, then we can start with that and change the analysis done later.
The terms facet will tell you how many responses have that thing. For example, if you look at the dashboard, it has a terms facet for product--that number is the number of items that have that thing. So we should be all set with that.
I don't know what "grouping like words" means--do you mean something like synonyms? Usually that has to be manually defined since synonyms are problem-domain-specific.
Ok, I just meant stemming (and possibly language specific stemming).
I don't think "becaus" is good enough because it's no longer visualization friendly. One way is to pick the most common (or in case of a tie, shortest) "real" word that is available that matches the stem.
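A sketch of that selection rule, assuming unstemmed word counts are available alongside a stemming function. The toy_stem function below is a hypothetical stand-in used only so the example runs; it is not the snowball stemmer that Elasticsearch applies:

```python
def display_words(word_counts, stem):
    """For each stem, pick the most common real word, breaking ties by length."""
    best = {}
    for word, count in word_counts.items():
        key = stem(word)
        # Lower rank wins: higher count first, then shorter word, then alphabetical.
        rank = (-count, len(word), word)
        if key not in best or rank < best[key][0]:
            best[key] = (rank, word)
    return {key: word for key, (rank, word) in best.items()}


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "es", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


counts = {"crashes": 5, "crashing": 5, "crash": 2, "because": 3}
labels = display_words(counts, toy_stem)
```

Here "crashes" and "crashing" tie on count, so the shorter "crashes" is chosen as the label for the stem "crash", and "becaus" is displayed as "because".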
Comment 7•11 years ago
I spent some time figuring out a better solution for the "becaus" problem. I added a new field to the index called "description_terms" that uses the standard analyzer rather than the snowball analyzer. The standard analyzer doesn't do stemming (which is a bummer), but it also doesn't create tokens like "becaus", so it'll make for a better first pass for word clouds. We can figure out what to do about stemming later.
Landed in master in https://github.com/mozilla/fjord/commit/897c570291edfd5130654f1c6d21e8cd0cbd64b2
So, the above steps are still good, but we want to do a terms facet on "description_terms" rather than "description".
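For orientation, the two fields might sit side by side in the mapping roughly as sketched below. Only the field names and analyzer names come from the comment above; the function name and the "string" field type are assumptions, and the linked commit is the authoritative version:

```python
def description_fields():
    """Sketch of the two description fields and their analyzers."""
    return {
        # Stemmed field: produces tokens like "becaus".
        "description": {"type": "string", "analyzer": "snowball"},
        # Unstemmed field intended for the word cloud terms facet.
        "description_terms": {"type": "string", "analyzer": "standard"},
    }


fields = description_fields()
```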
Comment 8•10 years ago
We nixed the chart in the front page dashboard, so I think we can WONTFIX this bug.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX