Closed Bug 641644 Opened 14 years ago Closed 11 years ago

Reject duplicate personas automatically

Categories

(addons.mozilla.org Graveyard :: Developer Pages, enhancement, P4)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED
2013-06-06

People

(Reporter: clouserw, Assigned: kngo)

References

Details

(Whiteboard: [monarch])

A recurring problem on getpersonas is that duplicate personas are uploaded and unless you have someone that has seen every persona ever, it's hard to flag these. We should be tracking unique values for personas (a hash in the db is fine) and comparing on upload and rejecting if they are the same.
(In reply to Wil Clouser [:clouserw] from comment #0)
> A recurring problem on getpersonas is that duplicate personas are uploaded
> and unless you have someone that has seen every persona ever, it's hard to
> flag these.
>
> We should be tracking unique values for personas (a hash in the db is fine)
> and comparing on upload and rejecting if they are the same.

If the same image is uploaded twice, the hash will probably be different, since we always use PIL to resize the images. We can store the hash upon upload, before resizing, though.
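The point above (hash before PIL touches the file) can be sketched in a few lines. This is an illustration, not AMO's actual code; the helper name and the choice of SHA-256 are assumptions.

```python
import hashlib


def hash_upload(raw_bytes):
    """Hash the original uploaded bytes, before any PIL resizing,
    so two identical uploads always produce the same digest."""
    return hashlib.sha256(raw_bytes).hexdigest()


# Identical uploads hash identically only if we hash the raw bytes;
# re-encoded/resized output is not guaranteed to be byte-identical.
assert hash_upload(b"fake-image-bytes") == hash_upload(b"fake-image-bytes")
```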
Severity: normal → enhancement
Priority: P2 → P4
Target Milestone: Q3 2011 → ---
(In reply to Chris Van Wiemeersch [:cvan] from comment #1)
> (In reply to Wil Clouser [:clouserw] from comment #0)
> > A recurring problem on getpersonas is that duplicate personas are uploaded
> > and unless you have someone that has seen every persona ever, it's hard to
> > flag these.
> >
> > We should be tracking unique values for personas (a hash in the db is fine)
> > and comparing on upload and rejecting if they are the same.
>
> If the same image is uploaded twice, the hash will probably will be
> different since we use PIL to resize the images always. We can store the
> hash upon upload before resizing though.

Wil, how would you suggest we do this?
Now that I think about it, I'm not sure the PIL resizing is the problem. Hypothetically, resizing the same image twice is going to deliver the exact same image both times, right? So, I'd say:

- Calculate hashes of the final, full-sized images (both header and footer) for all themes.
- On upload, we calculate the hashes like normal and put them in the db. New themes go in the review queue as normal.
- We alter the review queue with a place for a message to the reviewer, and if the hash is the same as one already in the db (please add an index to that column!), a short message to the reviewer is all that is needed. Something like "This theme might be a duplicate of _$themename_". If it is a dupe, they should reject/delete it like any other disallowed material.
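The review-queue check described above boils down to a hash lookup against already-stored themes. A minimal sketch, assuming SHA-256 and an in-memory stand-in for the indexed db column (names here are invented for illustration):

```python
import hashlib


def image_hash(image_bytes):
    # SHA-256 of the final image bytes; the exact hash function is an assumption.
    return hashlib.sha256(image_bytes).hexdigest()


def duplicate_warning(new_header, new_footer, existing):
    """`existing` maps (header_hash, footer_hash) -> theme name for themes
    already in the db (in real life, an indexed column, not a dict).
    Returns the reviewer-facing note when both hashes match, else None."""
    key = (image_hash(new_header), image_hash(new_footer))
    match = existing.get(key)
    if match:
        return 'This theme might be a duplicate of "%s".' % match
    return None
```

Note this only warns the reviewer rather than auto-rejecting, which matches the plan above: a hash collision flags a *possible* duplicate, and a human makes the call.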
What if somebody uploads somebody else's persona that has been published elsewhere, and then the author decides to upload it to AMO?
(In reply to Aleksej [:Aleksej] from comment #4) [nm, I should have read comment #3 first]
Assignee: nobody → ngoke
Should we calculate these hashes for all currently existing 360k+ personas? With the images being hosted on a static server, it would take 720k+ HTTP requests to pull the images (header and footer). I suggest we just start calculating hashes for personas from here on out? I'd assume most duplicate personas come from an artist accidentally submitting a theme twice in a row.
We have the personas on disk; we wouldn't need to pull them over HTTP. They are all in the files/ dir, organized by add-on ID. We can handle the huge load of back-processing with celery. Off the top of my head:

> def backfill():
>     themes = select all themes where hash = "" limit 1000
>     for theme in themes:
>         theme.hash = calculate_hash() or null

We just call that 360 times in a bash loop (sleeping 60s in between) and eventually we have our hashes - doesn't matter if it takes days or weeks. Null is different from "", so those rows wouldn't be returned again. If we do get a null (a problem calculating the hash), it should be logged, though, so we know what's up.

Actually, there are a bunch of themes which aren't themes (.doc files, PDFs, exes, etc.) from back in the days before we checked those things. It'd be awesome if you made this script open each image and make sure it was a real image. I think that's just a matter of x = Image.open(). If x has a width and height, it's an image. We have code in AMO for celery tasks and image processing which can be copied for this.

Let me know if you've got concerns about it, but I think it's totally possible, and I'm happy to help.
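The batched backfill sketched above could look roughly like this as a plain function. The ORM query, the real celery wiring, and the PIL image check are all left out; `themes` and `read_file` are hypothetical stand-ins:

```python
import hashlib
import logging

log = logging.getLogger(__name__)


def backfill_batch(themes, read_file):
    """One batch of the backfill: hash each theme's image file, or set
    the hash to None (and log it) when the file is missing or unreadable.
    None is distinct from "", so failed rows aren't selected again.
    `themes` are objects with .path and .hash; `read_file` returns the
    file's bytes or raises IOError/OSError."""
    for theme in themes:
        try:
            data = read_file(theme.path)
            # Real code would also verify it's a genuine image, e.g. with
            # PIL's Image.open() and a width/height check, per the comment.
            theme.hash = hashlib.sha256(data).hexdigest()
        except (IOError, OSError):
            log.warning("Could not hash %s", theme.path)
            theme.hash = None
```

A driver (celery task or the suggested bash loop) would repeatedly select up to 1000 rows where hash = "" and call this until no rows remain.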
> Actually, there are a bunch of themes which aren't themes (.doc files, pdfs,
> exes, etc.) back from the days before we checked those things. It'd be
> awesome if you made this script open each image and make sure it was a real
> image. I think that's just a matter of x = Image.open(). If x has a width
> and height, it's an image.

There is also the potential for a file to simply be missing (see bug 861234). It's all part of the same handling/reporting, but another case to make sure we hit. Again, I'm happy to help out here.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2013-05-16
Still need to get that migration script working.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
This is a 2 year old bug. Awesome to get it closed! Thanks.
Target Milestone: 2013-05-16 → 2013-06-06
Product: addons.mozilla.org → addons.mozilla.org Graveyard