If you think a bug might affect users in the 57 release, please set the correct tracking and status flags for Release Management.

Help with a query for getpersonas.com

RESOLVED FIXED

Status

Data & BI Services Team
DB: MySQL
RESOLVED FIXED
5 years ago
3 years ago

People

(Reporter: clouserw, Unassigned)

Tracking

Details

(Reporter)

Description

5 years ago
Goal: Get email addresses for every author (not every user) in getpersonas.  
Secondary goal:  Get the total number of email addresses (just because this is less bulky and we can use it for estimating some things)
Size: 1.5M total users, I'm not sure how many authors.  3/4ths of that?
Indexes: If only.  I don't think we hit any useful indexes here.  users.username has one because it is the PK.  Wait, what?  Oh yes.  personas.author does not have one.

My query for goal #1:
> SELECT username, email 
> FROM users 
> WHERE username IN (
> 	SELECT DISTINCT author 
> 	FROM personas
> );

My query for goal #2:
> SELECT count(email) 
> FROM users 
> WHERE username IN (
> 	SELECT DISTINCT author 
> 	FROM personas
> );

Problem: Queries sit at "copying to tmp table" for hours on end.  Would this be faster with a JOIN instead of a sub-select?  I'm not sure how to do that with DISTINCT.
Timeline: This is a run-once thing so we don't need to optimize the tables past getting this information out of them.  Maybe it would be worth adding an index to authors though?
SELECT DISTINCT username, email FROM users AS u INNER JOIN personas AS p ON p.author = u.username;

SELECT COUNT(DISTINCT email) FROM users AS u INNER JOIN personas AS p ON p.author = u.username;
(Reporter)

Comment 2

5 years ago
(In reply to Kris Maglione [:kmag] from comment #1)
> SELECT DISTINCT username, email FROM users AS u INNER JOIN personas AS p ON
> p.author = u.username;
+----------------------------------+------------------------------------------------------------------+
<snip>
+----------------------------------+------------------------------------------------------------------+
208585 rows in set (12.84 sec)

> 
> SELECT COUNT(DISTINCT email) FROM users AS u INNER JOIN personas AS p ON
> p.author = u.username;

+-----------------------+
| COUNT(DISTINCT email) |
+-----------------------+
|                204945 |
+-----------------------+
1 row in set (14.30 sec)


The numbers are off by 4k, but the timing is awesome! :)  Any ideas why the total count wouldn't be the same?
Definitely faster with a JOIN instead of a subselect, in both cases. I don't think that in this case you have to worry about distinct, because you just want the usernames that also appear as authors in the "personas" table. Sometimes you have to step back and figure out what you're getting, as opposed to trying to mechanically convert.

So the goal for #1 of "Get email addresses for every author (not every user) in getpersonas."

Query 1 turns into:
SELECT username, email 
FROM users
 INNER JOIN personas ON (users.username=personas.author);


Query 2 turns into:
SELECT COUNT(email)
FROM users
 INNER JOIN personas ON (users.username=personas.author);


Although for query 2 do you want "SELECT COUNT(DISTINCT email)" so you're not counting any duplicates?

As for changing the schema - these tables are all quite small, users is <350Mb (and MyISAM) and personas is 188M (InnoDB), so making schema changes is pretty trivial.
Wil - in the original query you're not looking for DISTINCT email. Here's the difference:

mysql> SELECT COUNT(email)
    -> FROM users
    ->  INNER JOIN personas ON (users.username=personas.author);
+--------------+
| COUNT(email) |
+--------------+
|       495983 |
+--------------+
1 row in set (2.81 sec)

mysql> SELECT COUNT(distinct email) FROM users  INNER JOIN personas ON (users.username=personas.author);
+-----------------------+
| COUNT(distinct email) |
+-----------------------+
|                205020 |
+-----------------------+
1 row in set (3.26 sec)
(I did this on production, so the numbers might be a few off from the export you have).
(Reporter)

Comment 6

5 years ago
Sheeri: awesome.  I changed your query 1 to be this (added the distinct email):

> SELECT distinct email 
> FROM users
>  INNER JOIN personas ON (users.username=personas.author)

which gives 204945 rows and looks like what I want.  I think we're done.  Thanks to you both - JOIN is crazy faster than sub select.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
(Reporter)

Comment 7

5 years ago
We collided at the end but I think we're good.  Thanks. :)
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.