641644 - Reject duplicate personas automatically

Reporter

Description

•

14 years ago

A recurring problem on getpersonas is that duplicate personas are uploaded and unless you have someone that has seen every persona ever, it's hard to flag these. We should be tracking unique values for personas (a hash in the db is fine) and comparing on upload and rejecting if they are the same.

Christopher Van Wiemeersch [:cvan]

Updated

•

13 years ago

Blocks: greater-percona

Christopher Van Wiemeersch [:cvan]

Comment 1

•

12 years ago

(In reply to Wil Clouser [:clouserw] from comment #0)
> A recurring problem on getpersonas is that duplicate personas are uploaded
> and unless you have someone that has seen every persona ever, it's hard to
> flag these.
>
> We should be tracking unique values for personas (a hash in the db is fine)
> and comparing on upload and rejecting if they are the same.

If the same image is uploaded twice, the hash will probably be different, since we always use PIL to resize the images. We can store the hash upon upload, before resizing, though.

Severity: normal → enhancement

Priority: P2 → P4

Target Milestone: Q3 2011 → ---

Christopher Van Wiemeersch [:cvan]

Comment 2

•

12 years ago

(In reply to Chris Van Wiemeersch [:cvan] from comment #1)
> (In reply to Wil Clouser [:clouserw] from comment #0)
> > A recurring problem on getpersonas is that duplicate personas are uploaded
> > and unless you have someone that has seen every persona ever, it's hard to
> > flag these.
> >
> > We should be tracking unique values for personas (a hash in the db is fine)
> > and comparing on upload and rejecting if they are the same.
>
> If the same image is uploaded twice, the hash will probably be
> different since we use PIL to resize the images always. We can store the
> hash upon upload before resizing though.

Wil, how would you suggest we do this?

Wil Clouser [:clouserw]

Reporter

Comment 3

•

12 years ago

Once I think about it, I'm not sure the PIL resizing is the problem. Hypothetically, resizing the same image twice is going to deliver the exact same image both times, right? So, I'd say:

- Calculate hashes of final, full sized images (both header and footer) for all themes
- On upload, we calculate the hashes like normal and put them in the db. New themes go in the review queue as normal.
- We alter the review queue with a place for a message to the reviewer, and if the hash is the same as one already in the db (please add an index to that column!), then a short message to the reviewer is all that is needed. Something like "This theme might be a duplicate of _$themename_".

If it is a dupe, they should reject/delete it like any other disallowed material.

[:Aleksej]

Comment 4

•

12 years ago

What if somebody uploads somebody else's persona that has been published elsewhere, and then the author decides to upload it to AMO?

[:Aleksej]

Comment 5

•

12 years ago

(In reply to Aleksej [:Aleksej] from comment #4)

[Never mind, I should have read comment #3 first.]

Kevin Ngo [:ngoke]

Assignee

Updated

•

12 years ago

Assignee: nobody → ngoke

Kevin Ngo [:ngoke]

Assignee

Comment 6

•

12 years ago

Should we calculate these hashes for all currently existing 360k+ personas? With the images being hosted on a static server, it would take 720k+ HTTP requests to pull the images (header and footer). I suggest we just start calculating hashes for personas from here on out? I'd assume most duplicate personas come from an artist accidentally submitting a theme twice in a row.

Wil Clouser [:clouserw]

Reporter

Comment 7

•

12 years ago

We have the personas on disk, we wouldn't need to pull over http. They are all in the files/ dir organized by add-on ID. We can handle the huge load of back processing with celery. Off the top of my head:

> def:
>     themes = select all themes where hashes="" limit 1000
>     for theme in themes:
>         theme.hash = calculate_hash() or null

We just call that 360 times in a bash loop (sleeping 60s in between) and eventually we have our hashes - doesn't matter if it takes days or weeks. Null would be different than "" so those rows wouldn't be returned again. If we do have a null (problem calculating the hash) it should be logged though so we know what's up.

Actually, there are a bunch of themes which aren't themes (.doc files, pdfs, exes, etc.) back from the days before we checked those things. It'd be awesome if you made this script open each image and make sure it was a real image. I think that's just a matter of x = Image.open(). If x has a width and height, it's an image.

We have code in AMO for celery tasks and image processing which can be copied for this. Let me know if you've got concerns about it but I think it's totally possible and happy to help.
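The batch idea above could look roughly like the following. This is a sketch only: it uses a plain function and `sqlite3` rather than celery and the real AMO models, and the `themes` table, its `path` and `hash` columns, and the names `calculate_hash` and `backfill_batch` are assumptions for illustration. SHA-256 is likewise an assumed hash.

```python
import hashlib
import logging
import sqlite3

log = logging.getLogger("theme_hash_backfill")


def calculate_hash(path):
    """Return a hex digest of the file at `path`, or None if it cannot be read."""
    try:
        with open(path, "rb") as fh:
            return hashlib.sha256(fh.read()).hexdigest()
    except OSError:
        return None


def backfill_batch(conn, limit=1000, hasher=calculate_hash):
    """Hash up to `limit` themes whose hash is still ''. Returns rows processed."""
    rows = conn.execute(
        "SELECT id, path FROM themes WHERE hash = '' ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    for theme_id, path in rows:
        digest = hasher(path)
        if digest is None:
            # Log failures so there is a record of which themes need attention.
            log.warning("could not hash theme %s (%s)", theme_id, path)
        # None is stored as NULL, which never matches hash = '' again,
        # so a failed row is not picked up by later batches.
        conn.execute(
            "UPDATE themes SET hash = ? WHERE id = ?", (digest, theme_id)
        )
    conn.commit()
    return len(rows)
```

Calling `backfill_batch` repeatedly until it returns 0 plays the role of the bash loop; with 360k+ rows and a limit of 1000 that is on the order of 360 calls.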

Wil Clouser [:clouserw]

Reporter

Comment 8

•

12 years ago

> Actually, there are a bunch of themes which aren't themes (.doc files, pdfs,
> exes, etc.) back from the days before we checked those things. It'd be
> awesome if you made this script open each image and make sure it was a real
> image. I think that's just a matter of x = Image.open(). If x has a width
> and height, it's an image.

There is also potential for a file to simply be missing (see 861234). It's all part of the same handling/reporting but another case to make sure we hit. Again, I'm happy to help out here.
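One way to cover both cases (missing file, and a file that is not really an image) in the same pass is sketched below. The function names and the three status strings are assumptions for illustration. The default size check uses Pillow's `Image.open`, imported lazily so the rest of the module loads without Pillow; the `get_size` parameter is there so the check can be swapped out.

```python
import os


def _pil_size(path):
    """Return (width, height) via Pillow; raises OSError for unreadable images."""
    from PIL import Image  # third-party (Pillow), imported only when needed

    with Image.open(path) as img:
        return img.size


def check_theme_file(path, get_size=_pil_size):
    """Classify a theme file as 'missing', 'not_image', or 'ok'."""
    if not os.path.isfile(path):
        return "missing"
    try:
        width, height = get_size(path)
    except OSError:
        # Pillow signals "cannot identify image file" with an OSError subclass.
        # Other exceptions (e.g. Pillow not installed) are left to propagate.
        return "not_image"
    return "ok" if width > 0 and height > 0 else "not_image"
```

A backfill script could call `check_theme_file` before hashing and log anything that is not `'ok'`, so missing files and non-image uploads end up in the same report.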

Kevin Ngo [:ngoke]

Assignee

Comment 9

•

12 years ago

https://github.com/mozilla/zamboni/commit/4c578488cd79b7f2df9496674ddf32dc7e25c2b7

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Wil Clouser [:clouserw]

Reporter

Updated

•

12 years ago

Target Milestone: --- → 2013-05-16

Kevin Ngo [:ngoke]

Assignee

Comment 10

•

11 years ago

Still need to get that migration script working.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Kevin Ngo [:ngoke]

Assignee

Comment 11

•

11 years ago

It's rollin' https://github.com/mozilla/zamboni/commit/c9dcf6898c709ff53bb704048cf616ecdd72f6af

Status: REOPENED → RESOLVED

Closed: 12 years ago → 11 years ago

Resolution: --- → FIXED

Wil Clouser [:clouserw]

Reporter

Comment 12

•

11 years ago

This is a 2 year old bug. Awesome to get it closed! Thanks.

Target Milestone: 2013-05-16 → 2013-06-06

Nobody; OK to take it and work on it

Updated

•

9 years ago

Product: addons.mozilla.org → addons.mozilla.org Graveyard

Categories

(addons.mozilla.org Graveyard :: Developer Pages, enhancement, P4)

Tracking

(Not tracked)

People

(Reporter: clouserw, Assigned: kngo)

References

Details

(Whiteboard: [monarch])

Crash Data

Security

(public)
