Closed Bug 927617 Opened 7 years ago Closed 7 years ago

Add version comparison analysis

Categories

(Input Graveyard :: Backend, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Matt_G, Assigned: willkg)

References

Details

(Whiteboard: u=analyzer c=dashboard p=5 s=input.2013q4)

One of the most powerful tools we have for identifying new issues in Input is comparing data sets based on version/date to identify deltas in the feedback. 

We need a way of selecting:

* 2 versions of Firefox (can be the same or different)
* 2 date ranges (can be the same or different)
* n-grams (Results can be a single word or bi-gram, tri-gram, etc. That would allow us to see "Youtube Crash" as opposed to just "crash")

After making these selections, we should calculate the delta based on the criteria and display them in meaningful ways. I'll file a separate bug for display.
Adding whiteboard data so it shows up in my sprint thing.
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: u=analyzer c=dashboard p= s=input.2013q4
Component: Frontend → Backend
Blocks: 928047
Need the analyzers group before I can work on this.
Depends on: 907872
When you say you're looking at bi-grams, tri-grams, etc, are you saying a search for "YouTube Crash" must only select items where "YouTube" is next to "Crash" in the data?

e.g. These two show up:

* opened my firefox and youtube crash
* youtube crash is annoying

but this does not:

* opened my firefox and went to youtube and crash

Is that right? Or are you really asking for "I want all responses with the words 'youtube' AND 'crash'"?
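
To make the two readings concrete, here is a minimal Python sketch. It is not code from Input; the helper names and the naive whitespace tokenizer are assumptions made purely for illustration. It runs both interpretations over the three example responses above.

```python
def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive, hypothetical)."""
    return text.lower().split()


def has_adjacent_pair(text, first, second):
    """True if `first` is immediately followed by `second` in the text."""
    tokens = tokenize(text)
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))


def has_all_words(text, words):
    """True if every word appears somewhere in the text, in any order."""
    return set(words) <= set(tokenize(text))


responses = [
    "opened my firefox and youtube crash",
    "youtube crash is annoying",
    "opened my firefox and went to youtube and crash",
]

# Reading 1: "youtube" must sit directly next to "crash".
adjacent = [has_adjacent_pair(r, "youtube", "crash") for r in responses]

# Reading 2: both words must merely appear in the same response.
together = [has_all_words(r, ["youtube", "crash"]) for r in responses]
```

Under the first reading only the first two responses match; under the second, all three do.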
Hey Will. Looping in Hamilton so he can comment on this. He worked on the original clustering tool. I'm sure between the two of you we can figure out how to make it as accurate and actionable as possible.

This isn't a search, but a report that tells us certain words have changed frequency as compared to a different version, date range, or combination of both. For example, if you select version 23 and 24 of Firefox to compare you would see that Java appears far more frequently in version 24 feedback since there was a Java block. That's the basics of the tool. A single word appears more or less frequently than it did in the other version/time period specified. To take it a step further and provide additional context, we might see that Java and Block both appear much more frequently AND usually in the same piece of feedback. So they might be the bi-gram that has changed frequency the most by comparison.

A good example of a tri-gram would be something like: Flash, Crash, YouTube. This gives us a starting point since we know the crashes seem to be related to flash on youtube. If we only looked at the individual words we might see that these three are individually spiking, but we'd need to read all of the comments for each individual word to figure out that they are all related. 

Early in the release cycle we usually don't have enough feedback to build strong n-grams, so we use the single word spikes. As we get more feedback the larger clusters become more and more accurate. That way we have options for each stage of the release cycle. 

Does that help provide some additional insights? I can walk you through the existing tools the next time we meet.
That doesn't really help. I know what a bi-gram is, but I don't understand what you mean by it enough to figure out how to implement it.

If Hamilton or someone else has code that figures this out already, I'd love to see that.
Hi Will - 

A quick-and-dirty solution would be to focus on two-word and three-word co-occurrences within a piece of feedback itself. This also includes removing stopwords[1] and potentially stemming[2] to change words like 'crashing' to 'crash.' So this means that a phrase like

"Youtube is crashing all the damn time"

Converts to the set (keep in mind NOT the ordered list)

{'youtube', 'crash', 'all', 'damn', 'time'}

Or something to that effect. Then the subset {'youtube', 'crash'} would match with any other co-occurrence in another piece of feedback, regardless of word order. I've found that this tends to be a pretty effective way of getting something interesting out of text.
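
A rough sketch of that conversion in Python, under some loud assumptions: the stop word list is a tiny made-up one, and `crude_stem` is a stand-in that only strips a trailing "ing" (a real Porter stemmer does far more). The function names are hypothetical.

```python
from itertools import combinations

# Hypothetical, tiny stop word list; a real one would be much longer.
STOP_WORDS = {"is", "the", "a", "an", "and", "to", "my"}


def crude_stem(word):
    """Stand-in for a real stemmer: only strips a trailing 'ing'."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    return word


def to_word_set(text):
    """Lowercase, strip punctuation, drop stop words, stem; return a set."""
    words = (w.strip(".,!?\"'") for w in text.lower().split())
    return {crude_stem(w) for w in words if w and w not in STOP_WORDS}


def co_occurrences(text, size=2):
    """All unordered `size`-word co-occurrences within one piece of feedback."""
    return {frozenset(c) for c in combinations(sorted(to_word_set(text)), size)}


word_set = to_word_set("Youtube is crashing all the damn time")
pairs = co_occurrences("Youtube is crashing all the damn time")
other_pairs = co_occurrences("crash on youtube again")
```

Because each co-occurrence is a `frozenset`, `frozenset({"youtube", "crash"})` shows up in both `pairs` and `other_pairs` even though the words appear in a different order in the two pieces of feedback.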

So my understanding is that you select, say, the first two weeks of v24's release and the first two weeks of v25, and the system sends back the largest deltas of the co-occurrences. So if {'youtube', 'crash'} as a co-occurrence has jumped up 25% between these two periods, and that jump is larger than the others, it should be at the top of the list.

I think this means you can just store [date, co-occurrence, count] rows in a db, which should allow quick summing over date ranges. That total would have to be normalized by the number of days selected anyway (as an extreme example, if date range 1 covers only 5 days, its co-occurrence counts might be substantially lower than those of a date range 2 that covers 100 days).
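
As a sketch of that storage idea in Python (an in-memory list stands in for the db table, and every date and count below is invented for illustration), summing over a date range and dividing by the number of days might look like this:

```python
from datetime import date

# Hypothetical rows in the [date, co-occurrence, count] shape.
ROWS = [
    (date(2013, 9, 17), ("crash", "youtube"), 4),
    (date(2013, 9, 18), ("crash", "youtube"), 6),
    (date(2013, 10, 29), ("crash", "youtube"), 9),
    (date(2013, 10, 30), ("crash", "youtube"), 11),
    (date(2013, 10, 31), ("crash", "youtube"), 10),
]


def per_day_rate(rows, pair, start, end):
    """Sum counts for `pair` in [start, end] and divide by the days in range."""
    days = (end - start).days + 1
    total = sum(n for d, p, n in rows if p == pair and start <= d <= end)
    return total / days


pair = ("crash", "youtube")
rate_1 = per_day_rate(ROWS, pair, date(2013, 9, 17), date(2013, 9, 18))
rate_2 = per_day_rate(ROWS, pair, date(2013, 10, 29), date(2013, 10, 31))

# The delta between the two normalized rates is what would be ranked.
delta = rate_2 - rate_1
```

With these made-up rows the first range averages 5 per day and the second 10 per day, even though the ranges differ in length.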

There are open-source toolkits in most languages to do the dirty work of preparing the text (again, apologies if this is all old news). Removing stopwords is a piece of cake if you have a list, and there is Porter stemmer code all over the internet, such as this Python one[3].

I think this is the easiest way to get to 90% of the way there without doing something more sophisticated (that is, not using matrix operations & dealing with statistical models, which shouldn't be necessary).

At any rate, you can send me an email at hulmer@mozilla.com if you have any additional questions or issues - I'd be happy to jump on Vidyo and sort things out as well.

[1] sorry if everything here is known / obvious to you, but I'll not assume your background: http://en.wikipedia.org/wiki/Stop_words
[2] http://en.wikipedia.org/wiki/Stemming
[3] https://pypi.python.org/pypi/stemming/1.0
Thanks Hamilton. Will: If that still doesn't give you what you need let me know. I can set something up for the 3 of us to talk this through over vidyo which might be easier.
Hamilton: That helps a ton!

Pretty sure using the analyze api for ES will cover prepping the text. There's an analysis api that'll give us back the text as a set of tokens, stemmed and sans stop words. Plus we can lowercase and do other things there, too.

We could use the db, but the db is harder to treat as ephemeral. It's much easier to treat ES as ephemeral plus I'm pretty sure we can do a lot of the manipulation using stock ES things like facets.

I'll look into whether that's possible (I haven't used some of those parts of ES before). Probably try indexing the analyzed data in two-/three-word occurrences, then using facets to pull it out for comparison. I'll see whether that gets me most of the way or whether that's not going to work. If it doesn't work at all, I'll look at other possibilities including using the db and doing the computation in Python.
Sounds great :willkg - definitely here if you need any assistance. I am going to be working with you to migrate some of my other weird data tools this quarter at any rate :)
Hey Will. Glad Hamilton was able to offer some guidance. That is definitely beyond my skillset. Do you need anything else from me or Hamilton at this point? It sounds like you've got enough information to at least get started. Let me know and thanks again.
Making this a P1. I'm going to say 3 points for now, but that's likely not in the ballpark. This is the first report we're doing, so it'll probably require writing some scaffolding in addition to writing the actual report. Theoretically, future reports will be easier.
Priority: -- → P1
Whiteboard: u=analyzer c=dashboard p= s=input.2013q4 → u=analyzer c=dashboard p=3 s=input.2013q4
I sent this to Hamilton, but I should have just added it here.

I think I've got a lot of the plumbing working, though it definitely needs tuning. It has a gross ui I threw together just so I could see whether things are "working" or not.

Here's an occurrence comparison for Firefox between version 24.0.0 and 25.0.0 on the first week of launch:

http://bluesock.org/~willkg/images/occurrences.png

Several things I want to point out:

1. It's only got bigrams right now. Trigrams should be more of the same, but I figured I'd focus on issues with bigrams first.

2. I added a search term field. I'm thinking you should provide at least one of version, start/end range or search term. The search term is interesting--it'd let you do occurrences on "australis" for example.

3. It throws things in lists for now.

4. The bigrams are kind of a mess. I think I should add some more stop words including the letters of the alphabet, plus handle urls differently. I think I may convert urls to just the hostname.
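
A possible shape for the cleanup in item 4, sketched in plain Python rather than as ES analyzer configuration. The regex, the function name and the choice to leave trailing punctuation unhandled are all assumptions for illustration, not what the prototype does.

```python
import re
import string
from urllib.parse import urlparse

# Every single letter of the alphabet treated as an extra stop word.
SINGLE_LETTER_STOP_WORDS = set(string.ascii_lowercase)

# Naive URL matcher: http(s):// followed by non-whitespace characters.
URL_RE = re.compile(r"https?://\S+")


def urls_to_hostnames(text):
    """Replace each URL in the text with just its hostname."""

    def repl(match):
        try:
            host = urlparse(match.group(0)).hostname
        except ValueError:
            host = None
        # Fall back to the original text if no hostname could be parsed.
        return host or match.group(0)

    return URL_RE.sub(repl, text)


cleaned = urls_to_hostnames(
    "video at http://www.youtube.com/watch?v=abc keeps crashing"
)
```

This keeps one token per site instead of scattering path and query-string fragments across the bigrams.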


But that's where things are at.

I have a few questions:

1. Are you going to want to click on those bigrams to see the list of responses for that bigram?

2. Are there additional things we want to fix in the parsing other than what I mentioned?

3. What do you think? Is this on the right track?
Matt?: ^^^
Flags: needinfo?(mgrimes)
Hey Will. Cheng mentioned that he had some feedback, so I was waiting on him. I'll just jump in so we don't hold you up anymore.

This looks awesome. Appreciate the work. Here are some thoughts and comments.

Things you point out:

1. That sounds good. Tri-grams can come later. We also look at single words that are spiking. This is really helpful at the beginning of a release cycle when we don't have a lot of feedback to create good bi/tri grams.

2. That sounds great. I'd love to play around with it. So you could run a search for australis and then it would display only bi/tri-grams that contain the keyword?

3. Lists sound perfect for now. We've got some thoughts on how to visualize, but that'll be easier to do over vidyo or in person.

4. These all sound spot on.


Your questions:

1. Yes. In our original vision for this report you would be able to click on any entry and see all of the comments associated with that entry.

2. I'm not sure if you are already doing this, so apologies if you are. In the current script we normalize word frequency (I believe it's occurrences per 1000 words?) so that time frames don't have to be the same. That way you could compare the entire release worth of data from 24 to 1 week worth of 25. Cheng can provide the actual script data and probably explain it much better than I can.

3. This is 100% on the right track. Really excited about making this a more powerful and accessible tool!

Let me know if I've missed anything.
Flags: needinfo?(mgrimes)
I landed a first pass this morning: https://github.com/mozilla/fjord/commit/d98b148

There are some things I need to do with that form so it's clearer how it gets used. At some point, when we've decided it's doing what we want it to do, we can polish up the ui.
(In reply to Matt Grimes [:Matt_G] from comment #14)
> 
> Things you point out:
> 
> 1. That sounds good. Tri-grams can come later. We also look at single words
> that are spiking. This is really helpful at the beginning of a release cycle
> when we don't have a lot of feedback to create good bi/tri grams.

I can add single words, too.


> 2. That sounds great. I'd love to play around with it. So you could run a
> search for australis and then it would display only bi/tri-grams that
> contain the keyword?

Not quite. The "search term" field will give you bigrams for responses that have that term. It's not the case that the bigram has to have that term.


> 1. Yes. In our original vision for this report you would be able to click on
> any entry and see all of the comments associated with that entry.

I'm a little concerned about this. No one has told me about the original vision. All I know about what you want built is what is in this bug. If it's not in this bug, then I don't know about it. The problem there is that if I don't know the full picture, then it's hard for me not to waste time because I have to redo things as I discover new requirements.

I think I can add links.


> 2. I'm not sure if you are already doing this, so apologies if you are. In
> the current script we normalize word frequency (I believe it's occurrences
> per 1000 words?) so that time frames don't have to be the same. That way you
> could compare the entire release worth of data from 24 to 1 week worth of
> 25. Cheng can provide the actual script data and probably explain it much
> better than I can.

Two things:

1. Hamilton said there was no existing script and that this was all new functionality. If there is a script, I really would like to see it. Having that from the beginning would have really helped me understand what you're trying to do.

2. Up until now I haven't heard anything about normalizing against the total number of words. Hamilton did mention normalizing against the total number of days in the period specified. What I've got prototyped so far doesn't do that. I still have to figure out how that'll work.

Do you need both normalizations? If so, someone is going to need to walk me through exactly what steps are in the calculation so I know exactly what you want.
(In reply to Will Kahn-Greene [:willkg] from comment #16)
> > 2. That sounds great. I'd love to play around with it. So you could run a
> > search for australis and then it would display only bi/tri-grams that
> > contain the keyword?
> 
> Not quite. The "search term" field will give you bigrams for responses that
> have that term. It's not the case that the bigram has to have that term.

That sounds great and useful. I'll be interested in checking it out. Let me know.

> > 1. Yes. In our original vision for this report you would be able to click on
> > any entry and see all of the comments associated with that entry.
> 
> I'm a little concerned about this. No one has told me about the original
> vision. All I know about what you want built is what is in this bug. If it's
> not in this bug, then I don't know about it. The problem there is that if I
> don't know the full picture, then it's hard for me not to waste time because
> I have to redo things as I discover new requirements.
> 
> I think I can add links.
> 

That's probably my fault. I filed 2 separate bugs. One for the functionality and one for displaying the data. The links are mentioned in the display bug. Let's meet next week to talk about how to visualize the data in person. Then I can close that bug and we can put the outcome of that meeting here. Sound good?

> > 2. I'm not sure if you are already doing this, so apologies if you are. In
> > the current script we normalize word frequency (I believe it's occurrences
> > per 1000 words?) so that time frames don't have to be the same. That way you
> > could compare the entire release worth of data from 24 to 1 week worth of
> > 25. Cheng can provide the actual script data and probably explain it much
> > better than I can.
> 
> Two things:
> 
> 1. Hamilton said there was no existing script and that this was all new
> functionality. If there is a script, I really would like to see it. Having
> that from the beginning would have really helped me understand what you're
> trying to do.
> 
> 2. Up until now I haven't heard anything about normalizing against the total
> number of words. Hamilton did mention normalizing against the total number
> of days in the period specified. What I've got prototyped so far doesn't do
> that. I still have to figure out how that'll work.
>
> Do you need both normalizations? If so, someone is going to need to walk me
> through exactly what steps are in the calculation so I know exactly what you
> want.

Cheng wrote a script that shows spikes in single words. Hamilton had been working on some other clustering tools (bi-gram, tri-gram) so we tapped him to help add that functionality to this feature. I'll send you a copy of Cheng's existing script via email. I don't know if there is any sensitive information in there, so best to be safe. Apologies that he hasn't sent that to you before now. One caveat, just because this is the way it was done before does not mean that it's necessarily the BEST way to do this. I want you to have freedom to iterate and explore better ways of doing these things.

I'll let Cheng or Hamilton make the call on normalization. That is out of my wheelhouse.
Flags: needinfo?(hulmer)
Flags: needinfo?(cwwmozilla)
Landed some minor ui fixes today so that it's slightly more functional.

Work that is still outstanding:

1. UI has no polish; product should be a dropdown of possible products; version should be a dropdown of possible versions; date fields should be calendar widgets

2. There's no way to coalesce versions. This is a problem with Firefox OS, where we have multiple version strings that are all the "same thing" and that we probably want to treat as "the same". It is also a problem with the other products if you want to look at all the iterations of version 25 (for example) and not just 25.0.0 or 25.0.1.

3. The output form is a couple of lists of unnormalized numbers. I'm waiting to hear back on what the normalization math should be. I'm also waiting to hear back on mockups for what graphs should exist, etc.

4. The report isn't remotely mobile-friendly. None of the dashboard stuff is. I'm continuing to go on the assumption that mobile is not a requirement right now.


Anyhow, I think that's all I'm going to do here until I get feedback on what's working, what's not and consensus on where to go from here.
Sorry for any miscommunication on my end. Didn't realize Cheng had done something like this already.

Regarding the normalization, what Matt & Cheng have been doing (normalizing per k words, where k=1000 in Matt's example) would likely work very well. I proposed my "divide by number of days" hack for its simplicity, but theirs will be much more robust, since it makes the cyclical issues involving days of the week and holidays irrelevant.
Flags: needinfo?(hulmer)
Ack, sorry, I meant to write up a ton of feedback and it totally fell off my radar.

PS, please don't pay any mind to the stemmer, it's a huge kludge because of the way I used to not be able to load libraries on the machine.

The way I did it was as follows: count the number of pieces of feedback that contained a stemmed word (the same word multiple times in the same feedback counts only once) and then normalized by the total number of pieces of feedback for each period under consideration. This helps smooth out spikes around releases and still lets us see if certain issues are starting to take a larger proportion of the feedback.

The ranking system is based on the absolute difference and not the relative difference between the two normalized numbers as an easy way to account for large changes in infrequent words being not as significant as large changes in frequently used words. I'm pretty sure that's not the best way to do it but it's fast and easy to calculate.
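
Roughly, in Python, that calculation could be sketched as follows. This is not the actual script: stemming is left out, the whitespace tokenizer is naive, and the function names and sample feedback are made up for illustration.

```python
from collections import Counter


def doc_frequencies(responses):
    """Fraction of responses containing each word.

    A word repeated within one response still counts only once.
    """
    total = len(responses)
    if total == 0:
        return {}
    counts = Counter()
    for text in responses:
        counts.update(set(text.lower().split()))
    return {word: n / total for word, n in counts.items()}


def rank_by_absolute_change(before, after):
    """Rank words by absolute (not relative) change in normalized frequency."""
    freq_before = doc_frequencies(before)
    freq_after = doc_frequencies(after)
    words = set(freq_before) | set(freq_after)
    deltas = {w: freq_after.get(w, 0.0) - freq_before.get(w, 0.0) for w in words}
    # Largest absolute change first; ties broken alphabetically.
    return sorted(deltas.items(), key=lambda item: (-abs(item[1]), item[0]))


# Invented feedback for two periods.
before = ["java works fine", "slow slow startup", "java is slow", "love it"]
after = ["java blocked", "java plugin blocked", "java java blocked again", "slow"]

ranked = rank_by_absolute_change(before, after)
```

In this toy data "blocked" goes from 0 of 4 responses to 3 of 4 and ranks first, while "java" only moves from 2 of 4 to 3 of 4 despite appearing twice in one response.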

From my understanding, the work that Hamilton did is two-fold: 1) it clusters into a limited number of groups. Like words go together in a more sane way than just how they stem. 2) It does tracking of trends with the clusters, so you can say which clusters are getting better/worse. Hamilton also wrote a tagger that uses a training set and some kind of machine learning to help automatically assign tags to feedback.

I think we probably need both kinds of functionality. The single-word system that I wrote is very basic but does a great job at catching major new issues. Something might blow up on a site we've never heard of or the word "mouse" might trend if we broke mouse navigation and we can pick up on that within 12 hours whereas a cluster-making system might just fold that in with "site issues" which is where most unknown things go.  The cluster system does a far better job at tracking things longer term. Are crashes (and related words) getting better over time? Are we fixing the major issues we need to be fixing? Stuff like that. I'm not sure his needs to be normalized the same way mine was (it serves a totally different purpose) so I won't speak to it.
Flags: needinfo?(cwwmozilla)
Going off of comment #20, I'll add a normalized number that's:

    FACET_COUNT / NUM_RESPONSES

where NUM_RESPONSES is the number of responses that meet the specified criteria. I'll also make sure to show the raw numbers on the report so it's easier for us to iterate over this and find errors.
Added links to bigrams and normalized counts per the equation in comment #21. PR: https://github.com/mozilla/fjord/pull/183

Landed in master in 2a2f832 [bug 927617] Add bigram links and normalized counts

Pushed to production just now.

One thing I noticed is that the bigrams are fux0rd. I'm not sure why. For example, the word "explorer" gets converted unfortunately to "explić". I have no idea why. It happens on -stage and -prod, but not on my local machine. I'm guessing wildly it's a configuration setting or something like that. I need to look into that further.
Hey Will. The new changes look great. I'm also looking forward to the UI polish you suggested above. We discussed this last time we met and I'm 100% in support!

I still owe you some basic mockups for layout of the list view and the proposed wordcloud. I'll try to have those to you by the end of the week.
For clarification, by "mock-up" I mean "anything that illustrates an example of how you need it to work". Using pen on a bar napkin, snapping a picture with your phone and attaching it here or printing it out and sending it by fedex to my secret drop location is totally fine.

I think going forward, I'm going to ask for mockups for any graph displays because that'll speed up the process of going from "created bug" to "landed the report" and reduce the work involved on all of us.
Status: NEW → ASSIGNED
Tagging Matt with a needsinfo for some mockups on what the output should look like.
Flags: needinfo?(mgrimes)
Depends on: 948925
Depends on: 948998
Depends on: 949000
I moved some of the outstanding work in this bug into other bugs to more easily track things. I think I got everything except the work involved in changing what the output should look like. I'll do that when I know more about what's involved.
Cheng sent me a PDF that has a mockup of what will probably be the future direction for this work. It fleshes things out into a more fully-featured set of dashboards. There's still some requirements work that needs to be done on that before we can start working on it.

However, given that, it makes sense to push off more work on the Occurrences Report stuff.

Going back to the original description, what we have in production does that, though it's not wildly exciting. The exciting parts are covered in the follow-up bug #928047.

Given that, I'm going to clear the needsinfo flag for Matt and close this out. Future work on this will be in new bugs.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: needinfo?(mgrimes)
Resolution: --- → FIXED
Whiteboard: u=analyzer c=dashboard p=3 s=input.2013q4 → u=analyzer c=dashboard p=5 s=input.2013q4
Unblocking and cleaning this up.
No longer depends on: 949000
Assignee: nobody → willkg
Product: Input → Input Graveyard