Closed Bug 898511 Opened 11 years ago Closed 8 years ago

Expire reviews to incentivize developers.

Categories

(Marketplace Graveyard :: Consumer Pages, enhancement, P4)

x86
macOS
enhancement

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: mhoye, Assigned: dbialer)

Details

(Keywords: productwanted, Whiteboard: [marketplace-transition])

The IEEE being what it is, I'm unable to find the paper that backs this up. Bear with me. 

The idea is to expire old reviews in order to incentivize good developer behavior: after a certain amount of time has passed and a new release has shipped, start deprecating old reviews of previous releases. This has two major benefits: it keeps developers who've had one really good release from coasting on past five-star ratings, and it keeps developers from being permanently buried by one bad or too-early release.

There are a few ways of doing this, as you might imagine, and as soon as I can find the relevant paper I'll attach it to this bug.
Component: Reviewer Tools → Consumer Pages
This also has the downside of expiring useful bad reviews as well. If a really crappy or spammy app is submitted and gets a lot of very negative reviews right now, the app gets buried by the search algorithm. If we start expiring those reviews, these bad apps will start to resurface every few months as their bad reviews expire.

We could use a Bayesian weighting algorithm, though, which (instead of expiring reviews) would slowly adjust each review's rating toward the average review rating across all apps. This way, a batch of five-star reviews at an app's submission would become three-star ratings after six months or so (assuming three stars is the average or median rating given across all apps). This has the added benefit of giving inertia to the app's rating: an app that gets one five-star rating after all of its one-star ratings have expired won't suddenly be rated five stars, and an app that had a lot of five-star ratings won't suddenly be destroyed by a single one-star rating.

Of course, we already use Bayesian averages to prevent newly-submitted apps with only one or two five-star reviews from showing up as "top rated", so this would function in addition to that algorithm.
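
To illustrate the decay-toward-the-mean idea, here is a rough sketch (nothing from our actual codebase; the 180-day window, the 3.0 global average and all the names are placeholder assumptions):

  from datetime import datetime, timedelta

  DECAY_PERIOD = timedelta(days=180)   # assumed decay window
  GLOBAL_AVERAGE = 3.0                 # assumed catalog-wide mean rating

  def decayed_value(original_rating, created_at, now=None):
      # Blend the original rating toward the global average as it ages;
      # progress runs from 0.0 (brand new) to 1.0 (fully decayed).
      now = now or datetime.utcnow()
      progress = min(max((now - created_at) / DECAY_PERIOD, 0.0), 1.0)
      return original_rating * (1 - progress) + GLOBAL_AVERAGE * progress

  def app_rating(reviews, now=None):
      # reviews: iterable of (rating, created_at) pairs
      values = [decayed_value(r, ts, now) for r, ts in reviews]
      return sum(values) / len(values) if values else None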
Assignee: nobody → dbialer
Severity: normal → enhancement
Keywords: productwanted
Priority: -- → P4
If this feature is to be implemented, it is important that it works for hosted apps as well: a change in the manifest file's version attribute must work the same way as a new version upload of a packaged app.
(In reply to Matt Basta [:basta] from comment #1)
> This also has the downside of expiring useful bad reviews as well.

There are ways to fix this, including keeping reviews flagged as being useful around longer, but to a large extent you don't want to; if an app stays bad, new this-app-is-bad reviews will emerge. The point of the exercise is to demonstrate to developers that improving their product will result in improved uptake of their product, and of course the converse.
Talking with David Bialer, we think that a good basic system would be to slowly migrate each review's value towards the average rating of apps over one year. After one year, the value of a review would be equal to the average.

Rob Hudson: what do you think about this? Should we do this algorithmically as part of a big SQL query or should we have a CRON that figures this out?
(In reply to Matt Basta [:basta] from comment #5)
> Talking with David Bialer, we think that a good basic system would be to
> slowly migrate each review's value towards the average rating of apps over
> one year. After one year, the value of a review would be equal to the
> average.

That seems odd to me and unrepresentative of what the original review was. I think I would feel manipulated if I found my original 2 star review is now a 4 star review a year later with the same comment. It'd feel like Facebook showing I liked a product in other people's timelines when I actually didn't.

What about a weighted average based on time of review approach? A new rating has the full weight applied to the average while a rating 1 year old may have none of its weight applied to the average. So a rating's weight degrades over a year until it has zero impact? The rating would still be in our system and could be found if paging through but have no impact on the current app rating.

> Rob Hudson: what do you think about this? Should we do this algorithmically
> as part of a big SQL query or should we have a CRON that figures this out?

Regardless of my dislike of this idea I'd go toward the cron approach so the rating is updated incrementally over time.

If we do like the weighted-average approach, I'm sure there's some formula we could come up with that recalculates the current average at the time of a new rating and weights all ratings.
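
Something along these lines, maybe (a sketch only; the linear falloff over exactly one year is an arbitrary choice):

  from datetime import datetime, timedelta

  WINDOW = timedelta(days=365)

  def weight(created_at, now):
      # 1.0 for a brand-new rating, falling linearly to 0.0 after a year
      return max(0.0, 1.0 - (now - created_at) / WINDOW)

  def weighted_rating(reviews, now=None):
      # reviews: iterable of (rating, created_at); the stored ratings are
      # never modified, only their contribution to the displayed average.
      now = now or datetime.utcnow()
      pairs = [(r, weight(ts, now)) for r, ts in reviews]
      total = sum(w for _, w in pairs)
      if total == 0:
          return None   # nothing recent enough to count
      return sum(r * w for r, w in pairs) / total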
(In reply to Rob Hudson [:robhudson] from comment #6)
> 
> That seems odd to me and unrepresentative of what the original review was. I
> think I would feel manipulated if I found my original 2 star review is now a
> 4 star review a year later with the same comment. 

Yeah, there's no way we should be changing a rating that somebody else has given. We can choose to include it in our calculations or not, and make that decision with some degree of transparency, but to actually modify it would be really dishonest.

> What about a weighted average based on time of review approach? A new rating
> has the full weight applied to the average while a rating 1 year old may
> have none of its weight applied to the average.

Ultimately my ask here is that if we have enough new reviews, old reviews just go away. The 4.0 version of a product should simply not be burdened by comments about bugs that were acknowledged and fixed in version 2.1 last year, and likewise crashes introduced in the current version shouldn't be hidden from view by past successes.

There are a couple of approaches to this, obviously, and my gut instinct tells me that this is something that we will have to iterate on substantially as data comes in and we can see trends and people try to game the process and whatever else.

I'd like to see something like "most recent X% of the last year's reviews, or Y most recent reviews for products with fewer than Z reviews overall" as a starting point, but my guesses for what X, Y and Z should be are utterly uninformed. Any suggestion that's divisible by two or five, for example, should be viewed with suspicion, and we should consult somebody who actually knows something about stats before committing this to code.
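
As a strawman, the selection rule could look something like this (the X, Y and Z values here are arbitrary stand-ins, per the caveat above):

  X_PERCENT = 0.40    # placeholder
  Y_RECENT = 25       # placeholder
  Z_THRESHOLD = 50    # placeholder

  def reviews_that_count(reviews_newest_first, last_year_count):
      # reviews_newest_first: all reviews for the app, newest first
      # last_year_count: how many of them were left in the past year
      if len(reviews_newest_first) < Z_THRESHOLD:
          return reviews_newest_first[:Y_RECENT]
      keep = max(1, int(last_year_count * X_PERCENT))
      return reviews_newest_first[:keep]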
(In reply to Rob Hudson [:robhudson] from comment #6)
> (In reply to Matt Basta [:basta] from comment #5)
> > Talking with David Bialer, we think that a good basic system would be to
> > slowly migrate each review's value towards the average rating of apps over
> > one year. After one year, the value of a review would be equal to the
> > average.
> 
> That seems odd to me and unrepresentative of what the original review was. I
> think I would feel manipulated if I found my original 2 star review is now a
> 4 star review a year later with the same comment. It'd feel like Facebook
> showing I liked a product in other people's timelines when I actually didn't.

The stars would still show your original rating, but it would be computed as having the "decayed" value.

> What about a weighted average based on time of review approach? A new rating
> has the full weight applied to the average while a rating 1 year old may
> have none of its weight applied to the average. So a rating's weight
> degrades over a year until it has zero impact? The rating would still be in
> our system and could be found if paging through but have no impact on the
> current app rating.

That's basically exactly what we're looking for. Numerically I think it would have virtually the same results as what I described. My suggestion was to retain the "weight" of the review while decreasing its value, rather than decreasing the weight and retaining the value.

(In reply to Mike Hoye [:mhoye] from comment #8)
> Ultimately my ask here is that if we have enough new reviews, old reviews
> just go away.

Deleting old reviews is not an option. We do, however, show when a review was left on a previous version of a packaged app.

In talking with David, we discussed only using the most recent 500 to 1000 reviews to compute the average rating, but that's not something to think about now (very few apps have that many reviews). David is in favor of reviews within a window of time, but I'm afraid that would burden developers (or help bad developers) that only have a small number of reviews that were left a long time ago.
> Deleting old reviews is not an option.

Why not? If they don't provide relevant information to a user about the current release of the software, they're actively hurting the developer and misinforming the customer. This is precisely the point made in the medium article you linked to above.
(In reply to Mike Hoye [:mhoye] from comment #10)
> > Deleting old reviews is not an option.
> 
> Why not? If they don't provide relevant information to a user about the
> current release of the software, they're actively hurting the developer and
> misinforming the customer. This is precisely the point made in the medium
> article you linked to above.

Because a year-old review might still be about the most recent version of the app. There's nothing wrong with old reviews, and as I said, we do show when reviews are left on older versions of the current app (as Google Play does).
I think there are a few factors, which I will attempt to summarize:
1. The time value of a review. I like the notion that the value (or weighting) of a review decays.
2. The version of the app. As Mike said, apps shouldn't be burdened or enhanced by past versions, or at least not completely. A developer should have the opportunity to repair an app's tarnished reputation. Perhaps a new version's ratings get weighed in with past versions' until there is enough data for that version to stand on its own ratings, i.e. become 'credible'.
3. The number of reviews as a representation of credibility, or confidence in the data: the more data, the more confidence in the accuracy that data conveys. A five-star app with one rating should probably not rank as highly as an app with 1000 reviews at four stars. Perhaps an app needs to have X (100?) reviews over a period of time in order to be eligible to be considered popular.

Of course, none of this takes into account rating gaming systems, which I think we should keep out of this scope until we have a good measurement system.
Thinking about this more, here is an idea:
1. We have a decay curve on reviews that affects their weighting. Maybe a straight line, maybe a decay curve. Perhaps we apply this to, at most, the last 1000 reviews, so on a straight-line decay the review 1000 reviews back is weighted at 1/1000 of the latest review. These weighted ratings are summed up and divided by the total weight to get an average. So if there are 5 ratings total, the last rating would be weighted at 1, the next-to-last at 4/5, then 3/5, 2/5, 1/5. Perhaps the decay rate is a factor of total time. (A sketch follows after this list.)
2. A new version needs at least 50 reviews to stand on its own rating. So, if a new version is released, after 50 ratings only those 50 ratings and later will count towards the rating.
3. Perhaps an app needs a minimum number of downloads (500?) and a minimum number of ratings (100?) to show up in popularity.
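
A minimal sketch of the straight-line decay in idea 1 (the 1000-review cap and the linear weights are the numbers floated above, not a final design):

  MAX_REVIEWS = 1000

  def positional_weighted_rating(ratings):
      # ratings: list ordered oldest -> newest; only the last MAX_REVIEWS count
      recent = ratings[-MAX_REVIEWS:]
      n = len(recent)
      if n == 0:
          return None
      # newest review gets weight 1, the one before it (n-1)/n, ..., oldest 1/n
      weights = [(i + 1.0) / n for i in range(n)]
      return sum(r * w for r, w in zip(recent, weights)) / sum(weights)

  # With 5 ratings the weights come out to 1/5, 2/5, 3/5, 4/5, 1
  # (oldest -> newest), matching the example in idea 1.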
(In reply to David Bialer [:dbialer] from comment #13)
> Thinking about this more, here is an idea:
> 1. We have a decay curve on reviews that affects their weighting. Maybe a
> straight line, maybe a decay curve. Perhaps we apply this to, at most, the
> last 1000 reviews, so on a straight-line decay the review 1000 reviews back
> is weighted at 1/1000 of the latest review. These weighted ratings are
> summed up and divided by the total weight to get an average. So if there
> are 5 ratings total, the last rating would be weighted at 1, the
> next-to-last at 4/5, then 3/5, 2/5, 1/5. Perhaps the decay rate is a factor
> of total time.

I'd be worried that this would be easy to game. If a bad/malicious app gets 150 negative reviews, the dev could wait a short time and then purchase 1000 good reviews on a seedy website. Inside of a week, that app could have a very high rating. The converse is also obviously true; a good app could have a smear campaign run against it by a competitor.

> 2. A new version needs at least 50 reviews to stand on its own rating. So,
> if a new version is released, after 50 ratings only those 50 ratings and
> later will count towards the rating.

This would be much more resource intensive to compute. If we're doing a general decay, I'm not sure it will provide much benefit either.

For apps that have very fast release cycles, this could also be a detriment.

> 3. Perhaps an app needs a minimum number of downloads (500?) and a minimum
> number of ratings (100?) to show up in popularity.

I'm not sure this would be solving an issue that exists yet. It might be something to consider when we have more users and apps.
Flags: needinfo?(dbialer)
Rob - I looked into this issue and read some good papers on time decay and credibility weighting in rating-based ranking. Here is a good article: http://www.informatica.si/PDF/35-4/14_Wang%20-%20Improving%20Amazon-like%20Review%20Systems%20by%20Considering%20the%20Credibility%20and.pdf

Basically, minus all the math involved, the optimal approach gives each reviewer's credibility a weighting, and time decays the weighting of a rating as well.

Some other app stores may also tie ratings to versions, but not completely, so just releasing a new version doesn't wipe the bad reviews or good reviews clean.

I looked a bit at the Zamboni code and see we have a simple Bayesian rating which factors in the quantity and quality of ratings, but it is pretty basic, and I think we can improve on it by giving reviewers weighting factors and by decaying ratings. Possibly also tie the rating partially to released versions (there are debates on the merits of this).

To do this we would need to:
1. Ask users whether a review/rating was helpful.
2. Implement time decay and credibility weighting in the computed rating (possibly computationally heavy).
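
Roughly what that could look like (a sketch of the general shape only; the half-life, the helpful-vote credibility and every name here are assumptions, not the paper's exact formulas):

  from datetime import datetime

  HALF_LIFE_DAYS = 180.0   # assumed decay half-life

  def time_decay(created_at, now):
      # halve a rating's weight every HALF_LIFE_DAYS
      return 0.5 ** ((now - created_at).days / HALF_LIFE_DAYS)

  def credibility(helpful_votes, total_votes):
      # Laplace-smoothed so reviewers with no votes get a neutral 0.5
      return (helpful_votes + 1.0) / (total_votes + 2.0)

  def credible_rating(reviews, now=None):
      # reviews: iterable of (rating, created_at, helpful_votes, total_votes)
      now = now or datetime.utcnow()
      num = den = 0.0
      for rating, created_at, helpful, total in reviews:
          w = time_decay(created_at, now) * credibility(helpful, total)
          num += rating * w
          den += w
      return num / den if den else None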

Any thoughts on this?
Flags: needinfo?(dbialer)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Whiteboard: [marketplace-transition]