Closed
Bug 601967
Opened 15 years ago
Closed 14 years ago
pentaho active_users data incomplete every day since 2010-09-13
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rags, Assigned: Atoll)
The count of *total* registered users went down today. Yesterday it was above 1 million, but now it shows 667K. Pretty sure we haven't deleted 300K users from Sync overnight.
The same is true for actives. The data is inconsistent between the Summary view, the Requests view, and the User view.
Can someone figure out what's going on please?
Comment 1 • 15 years ago
It is a holiday in Portugal today so they might not be able to look at this until tomorrow AM, but it sounds to me like there was a segment of the user database servers that didn't report in yesterday. We'll need to pull in some Ops people to try to figure this out.
Comment 2 • 15 years ago
We had an error in Vertica in the nightly process - rerunning
Comment 3 (Reporter) • 15 years ago
(In reply to comment #2)
> We had an error in Vertica in the nightly process - rerunning
Thanks for the quick response, Pedro.
Comment 4 • 15 years ago
Hmm, after the reprocess the numbers are still the same. We'll look into this in more detail first thing tomorrow!
Assignee: nobody → nelson.sousa
Status: NEW → ASSIGNED
Comment 5 • 15 years ago
It went back to normal without any intervention. It could be that the sync job that populates the weave DB didn't finish in time for the data extraction, so only partial data was loaded for yesterday.
Can anyone confirm this? What time did the weave DB sync job finish on 10/04, and what time does the data export script from Metrics kick off? (That job isn't run by us.)
Updated • 15 years ago
Assignee: nelson.sousa → server-ops
Group: metrics-private
Component: Frontend Reports → Server Operations
Product: Mozilla Metrics → mozilla.org
QA Contact: frontend-reports → mrz
Version: unspecified → other
Updated • 15 years ago
Assignee: server-ops → rsoderberg
Comment 6 • 15 years ago
The timestamps on wp-adm01 for October 4th indicate that the metrics collection took an extra 80 minutes to finish, compared to the surrounding days.
Do we need to push the completed users_2010-10-04 file to you or otherwise make it available somehow?
Can this process be modified to add a "this file is complete" marker, so that Pentaho can flag incomplete files?
wp-adm01# ls -al /opt/weave_2/output/
...
-rw-r--r-- 1 root root 13246103 Oct 3 07:31 users_2010-10-03
-rw-r--r-- 1 root root 13414544 Oct 4 08:54 users_2010-10-04
-rw-r--r-- 1 root root 13613678 Oct 5 07:31 users_2010-10-05
-rw-r--r-- 1 root root 13716413 Oct 6 07:32 users_2010-10-06
(as discovered via aravind and wp-adm01:/etc/cron.d/kitchen)
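The "this file is complete" marker proposed above could be implemented with a sibling marker file written only after the export finishes — a minimal sketch, with hypothetical function names; the real export is the kitchen job shown in the listing:

```python
import os

def write_export(path, data):
    # Write the day's export file, then create a sibling ".complete"
    # marker to signal that the file is fully written.
    with open(path, "w") as f:
        f.write(data)
    open(path + ".complete", "w").close()

def is_complete(path):
    # The importer (Pentaho, in this case) would skip any file whose
    # marker is missing, instead of loading partial data.
    return os.path.exists(path + ".complete")
```

A consumer that checks `is_complete()` before importing would have flagged the short `users_2010-10-04` file instead of loading it.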
Summary: Errors in Sync data → 2010-10-04 sync data collection incomplete
Comment 7 • 15 years ago
There's no need to resend us the files; if something is wrong one day, the scripts will recover the following day.
You're currently running a script from us that grabs all data from the weave DB and extracts it into files for insertion into the metrics DB. If you could add a check that only starts that job when the DB sync is complete, it would prevent this situation from happening again.
Comment 8 • 15 years ago
(In reply to comment #7)
> There's no need to resend us the files; if something is wrong one day, the
> scripts will recover the following day.
Great, thanks!
> You're currently running a script from us that grabs all data from the weave
> DB and extracts it into files for insertion into the metrics DB. If you could
> add a check that only starts that job when the DB sync is complete, it would
> prevent this situation from happening again.
Hmm. The script that we run once a day does the following:
* cd /opt/pentaho/kettle; ./kitchen.sh -file j_writeWeaveDataToFile.kjb > ${DATE_YMD}.log
* cd /opt/weave_2/output; scp *_${DATE_YMD} root@somewhere:/data/logs/weave-etl/production/
So, it looks like j_writeWeaveDataToFile.kjb has to finish before anything is copied to what appears to be the location you pull the files from.
You said something about a "DB sync" - I don't understand; is that part of the kettle/kitchen.sh script?
Comment 9 • 15 years ago
Those scripts read from the weave DB, where all usage data is available, and export it to a location the Metrics team can access to import it into one of Metrics' databases.
I was under the impression that the weave DB was populated once a day with that day's data. If for some reason the database isn't fully populated by the time the j_writeWeaveDataToFile.kjb job kicks off, then the exported files will have only partial data.
Is this what happens, or am I missing something?
On our part, the output files are read from that location and imported. But if the files aren't there, our job has nothing to do, so no data is inserted.
Comment 10 (Assignee) • 15 years ago
(In reply to comment #9)
> Is this what happens, or am I missing something?
That sounds reasonable to me, I'll look into it and let you know if I find anything.
Comment 11 (Assignee) • 15 years ago
okay, i found how we write the data to active_users. basically, each day there should be 400 entries in active_users. if there aren't, then something is wrong, and we need to skip collection to avoid reporting incorrect data.
active_users cannot be fully reconstructed for the dates between 7/28 and 8/2, on 8/21, or on 9/28; the data is unrecoverable unless we have a backup from the affected day and consider it worthwhile to restore it and reconstruct the data.
specific to pentaho, active_users was not fully populated prior to the 7:30am (-0700) pentaho collection nearly every day since 2010-09-13. the collection is completed somewhere between 8:30am and 10:30am most days; on 2010-08-30, 11:15pm; on 2010-09-29, 2pm; on 2010-10-04, 3pm.
so, what should we do next? the specific list of days i can prove are affected is below. if we can re-run the kitchen process for this list of days, you can pull the corrected data and that will repair the past few weeks of data.
i will implement something right now that delays pentaho collection until all known servers report in or 24 hours have passed, so that either you get complete data or no data at all.
(if it helps, all times in this specific comment are GMT -0700)
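The delay described above — hold collection until all known servers report in, or give up after 24 hours — could be gated with a simple polling loop. A sketch only; `count_reported` stands in for the real per-node check (e.g. the row count in active_users for today):

```python
import time

def wait_for_collection(count_reported, expected_nodes,
                        timeout_s=24 * 3600, poll_s=300):
    # Poll until every known server has reported in, or the timeout
    # expires, so the pipeline gets either complete data or no data
    # at all for the day.
    deadline = time.monotonic() + timeout_s
    while True:
        if count_reported() >= expected_nodes:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

The collection job would only run when this returns True; a False return means the day is skipped rather than published with partial counts.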
Summary: 2010-10-04 sync data collection incomplete → pentaho active_users data incomplete every day since 2010-09-13
Comment 12 (Assignee) • 15 years ago
i am reasonably certain that pentaho collection was incomplete for the 29 days listed below. each of these days has an active user report for all 400 servers; the third column indicates how many servers had NOT reported in by the pentaho 7:30am (-0700) collection time.
mysql> select date_format(recorded, "%Y-%m-%d") as the_day, max(recorded), count(recorded) from active_users where date_format(recorded, "%k%i") >= 0730 group by the_day having the_day >= '2010-07-18' and (select count(*) from active_users where date_format(recorded, "%Y-%m-%d") = the_day) = (select count(*) from available_nodes where downed=0);
+------------+---------------------+-----------------+
| the_day | max(recorded) | count(recorded) |
+------------+---------------------+-----------------+
| 2010-07-20 | 2010-07-20 11:53:46 | 4 |
| 2010-07-23 | 2010-07-23 14:20:29 | 10 |
| 2010-08-30 | 2010-08-30 23:15:13 | 145 |
| 2010-08-31 | 2010-08-31 08:48:26 | 2 |
| 2010-09-01 | 2010-09-01 08:47:45 | 2 |
| 2010-09-06 | 2010-09-06 10:44:29 | 5 |
| 2010-09-13 | 2010-09-13 07:45:09 | 1 |
| 2010-09-14 | 2010-09-14 07:55:38 | 1 |
| 2010-09-15 | 2010-09-15 07:59:59 | 1 |
| 2010-09-16 | 2010-09-16 07:58:13 | 1 |
| 2010-09-17 | 2010-09-17 07:42:18 | 1 |
| 2010-09-20 | 2010-09-20 08:23:47 | 1 |
| 2010-09-21 | 2010-09-21 08:32:52 | 1 |
| 2010-09-22 | 2010-09-22 09:06:27 | 2 |
| 2010-09-23 | 2010-09-23 09:13:13 | 2 |
| 2010-09-24 | 2010-09-24 09:08:53 | 2 |
| 2010-09-25 | 2010-09-25 08:24:22 | 2 |
| 2010-09-26 | 2010-09-26 08:19:13 | 1 |
| 2010-09-27 | 2010-09-27 09:48:34 | 3 |
| 2010-09-29 | 2010-09-29 13:58:37 | 5 |
| 2010-09-30 | 2010-09-30 09:40:35 | 3 |
| 2010-10-01 | 2010-10-01 09:09:45 | 2 |
| 2010-10-02 | 2010-10-02 08:39:55 | 2 |
| 2010-10-03 | 2010-10-03 08:33:16 | 2 |
| 2010-10-04 | 2010-10-04 14:51:51 | 132 |
| 2010-10-05 | 2010-10-05 09:25:33 | 3 |
| 2010-10-06 | 2010-10-06 10:21:59 | 4 |
| 2010-10-07 | 2010-10-07 16:24:37 | 3 |
| 2010-10-08 | 2010-10-08 10:03:20 | 3 |
+------------+---------------------+-----------------+
29 rows in set (2.29 sec)
Comment 13 (Assignee) • 15 years ago
the new 4th column is how many users were undercounted on each of the days listed, to offer some idea of significance.
mysql> select date_format(recorded, "%Y-%m-%d") as the_day, max(recorded), count(recorded), sum(total) from active_users where date_format(recorded, "%k%i") >= 0730 group by the_day having the_day >= '2010-07-18' and (select count(*) from active_users where date_format(recorded, "%Y-%m-%d") = the_day) = (select count(*) from available_nodes where downed=0);
+------------+---------------------+-----------------+------------+
| the_day | max(recorded) | count(recorded) | sum(total) |
+------------+---------------------+-----------------+------------+
| 2010-07-20 | 2010-07-20 11:53:46 | 4 | 21406 |
| 2010-07-23 | 2010-07-23 14:20:29 | 10 | 53351 |
| 2010-08-30 | 2010-08-30 23:15:13 | 145 | 341182 |
| 2010-08-31 | 2010-08-31 08:48:26 | 2 | 32081 |
| 2010-09-01 | 2010-09-01 08:47:45 | 2 | 32072 |
| 2010-09-06 | 2010-09-06 10:44:29 | 5 | 85840 |
| 2010-09-13 | 2010-09-13 07:45:09 | 1 | 13787 |
| 2010-09-14 | 2010-09-14 07:55:38 | 1 | 13786 |
| 2010-09-15 | 2010-09-15 07:59:59 | 1 | 13786 |
| 2010-09-16 | 2010-09-16 07:58:13 | 1 | 13785 |
| 2010-09-17 | 2010-09-17 07:42:18 | 1 | 13785 |
| 2010-09-20 | 2010-09-20 08:23:47 | 1 | 13785 |
| 2010-09-21 | 2010-09-21 08:32:52 | 1 | 13784 |
| 2010-09-22 | 2010-09-22 09:06:27 | 2 | 31926 |
| 2010-09-23 | 2010-09-23 09:13:13 | 2 | 31920 |
| 2010-09-24 | 2010-09-24 09:08:53 | 2 | 31914 |
| 2010-09-25 | 2010-09-25 08:24:22 | 2 | 31912 |
| 2010-09-26 | 2010-09-26 08:19:13 | 1 | 13782 |
| 2010-09-27 | 2010-09-27 09:48:34 | 3 | 50471 |
| 2010-09-29 | 2010-09-29 13:58:37 | 5 | 84482 |
| 2010-09-30 | 2010-09-30 09:40:35 | 3 | 49895 |
| 2010-10-01 | 2010-10-01 09:09:45 | 2 | 31552 |
| 2010-10-02 | 2010-10-02 08:39:55 | 2 | 31546 |
| 2010-10-03 | 2010-10-03 08:33:16 | 2 | 31538 |
| 2010-10-04 | 2010-10-04 14:51:51 | 132 | 416339 |
| 2010-10-05 | 2010-10-05 09:25:33 | 3 | 49785 |
| 2010-10-06 | 2010-10-06 10:21:59 | 4 | 67015 |
| 2010-10-07 | 2010-10-07 16:24:37 | 3 | 49758 |
| 2010-10-08 | 2010-10-08 10:03:20 | 3 | 49747 |
+------------+---------------------+-----------------+------------+
29 rows in set (2.25 sec)
Comment 14 (Assignee) • 15 years ago
this monitoring query returns "1" if collection is complete, "0" if it's not ready yet.
select (select count(*) from available_nodes) = (select count(*) from active_users where date_format(recorded, "%Y-%m-%d") = date_format(@now, "%Y-%m-%d")) as ready;
i'd like to implement this but i need to know that we can reprocess statistics for a given day before i put this into place.
Comment 15 (Assignee) • 15 years ago
update_totalAndActiveUsers.ktr does a complete dump of the active_users data each day using the query. this means that each day's dump of the totalAndActiveUsers data contains the correct data for all prior days (and specifically those days above).
so, metrics folks, i have two requests:
how is the data import being handled? if a partial count for 2010-10-08 is published on 2010-10-08, and then a full count for 2010-10-08 is published along with other data on 2010-10-09, will the full count for 2010-10-08 be imported into metrics?
could you please verify that the counts for the 29 dates above match those published in the latest totalAndActiveUsers_2010-10-13?
ps. the query mentioned above:
SELECT
CAST( LEFT(recorded,10) as CHAR) as date,
CAST(node as char) AS node,
total,
actives
FROM active_users
Comment 16 (Assignee) • 15 years ago
we identified a new query that appears to solve both issues; full writeup coming in the next few minutes
select date_format(from_unixtime(left(last_active, 10)), "%Y-%m-%d") as the_day, count(distinct username) from usersummary group by the_day with rollup;
Comment 17 (Assignee) • 15 years ago
pulling counts directly from the active_users table is effective only when all nodes report in. theoretically that works okay, but in practice it doesn't scale.
i found that the script that writes data to active_users also writes data in a different way to usersummary. there's one key difference: active_users is a count of data *per day*, while usersummary is a count of data *per user/collection*. here's why that matters.
in usersummary, the most recent data about a user - uid, username, collection, last_active - overwrites any prior data for that user. each (uid, username) has a single row. so as new data is being written, existing data is left alone and can safely be pulled for metrics. (with the usual caveat that if the data is pulled while the table is being updated, the most recent day of data will be partial.)
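the overwrite behavior described above can be modeled as a keyed upsert — a sketch only; the field names follow this comment, not the actual table schema:

```python
def upsert(summary, uid, username, collection, last_active):
    # Each (uid, username) keeps exactly one row; newer data replaces
    # older data wholesale, so existing rows are never left partially
    # updated and can be read safely mid-scan.
    key = (uid, username)
    row = summary.get(key)
    if row is None or last_active >= row["last_active"]:
        summary[key] = {"collection": collection, "last_active": last_active}
```

because each key maps to one row, counting distinct usernames over `summary` never double-counts a user, which is the property the usersummary-based counts rely on.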
the data in usersummary is generated by scanning every database for a user's data and then overwriting any existing data about that user when it's found. the current method is at risk of overcounting if we have users stranded on a second database somehow, and differences in the totals we've encountered suggest that there could be 25k users overcounted in that manner. pulling counts from usersummary ensures that each user (specifically, username) is counted precisely once, at the risk of not counting them as an active user down the line.
we can extract two pieces of data from usersummary that are of specific relevance to this bug.
for each username, we know at what time they were most recently active, so we can count how many users were active since a certain time. since we run the usersummary collection between 1am and 6am (-0700) each morning, i suggest defining "active" as "active since N days ago at 00:00 (-0700)", where N=1 means "since yesterday".
as a side effect of producing the list of dates and how many users were active no more recently than a given date, we can sum that count to get the total number of usernames known to the system.
this query provides both counts:
SELECT DATE_FORMAT(IF(LEFT(last_active, 10)>0,FROM_UNIXTIME(LEFT(last_active, 10)),'0000-00-00'), "%Y-%m-%d") AS active_on, COUNT(username) FROM (SELECT username, MAX(last_active) AS last_active FROM usersummary GROUP BY username ORDER BY NULL) AS t1 GROUP BY active_on WITH ROLLUP;
the result set looks like:
| active_on | count(username) |
| 0000-00-00 | 39787 |
| 2009-11-12 | 1 |
| 2009-11-13 | 7 |
...
| 2010-10-12 | 55008 |
| 2010-10-13 | 382858 |
| 2010-10-14 | 156430 |
| NULL | 1121359 |
so, presuming today is 2010-10-14:
to find out how many usernames were active since yesterday (N=1), add up the user count for every day >= 2010-10-13 WHERE active_on IS NOT NULL, which is 539288 users.
to find out how many usernames exist somewhere in our databases, take the user count where active_on IS NULL (this is the WITH ROLLUP clause), which is 1121359 users.
to find out how many users have not been active since we started recording last_active data, take the user count where active_on = '0000-00-00', which is 39787 users.
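the three derivations above can be checked in one pass over the rollup rows — a sketch using the sample numbers from this comment (`None` stands for the SQL NULL super-aggregate row, so the total is read from it rather than summed, since the day rows shown here are abbreviated):

```python
def summarize(rows, since):
    # rows: (active_on, count) pairs from the WITH ROLLUP query.
    # active_on is None for the rollup total and '0000-00-00' for
    # users with no recorded last_active.
    total = sum(c for d, c in rows if d is None)           # ROLLUP row
    never = sum(c for d, c in rows if d == '0000-00-00')   # never active
    active = sum(c for d, c in rows                        # active since `since`
                 if d is not None and d != '0000-00-00' and d >= since)
    return total, never, active

rows = [('0000-00-00', 39787), ('2009-11-12', 1), ('2009-11-13', 7),
        ('2010-10-12', 55008), ('2010-10-13', 382858),
        ('2010-10-14', 156430), (None, 1121359)]
```

with `since='2010-10-13'` this reproduces the figures above: 1121359 total usernames, 39787 never active, and 382858 + 156430 = 539288 active since yesterday.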
with some post-processing in kettle, this should be sufficient to fix the existing weave/sync metrics.
one bonus is that this provides enough data for metrics to create a chart of how many stale users we have over time (last seen more than 7/30/60 days ago).
the next step is to implement a new kettle collection script, push it to the server, and verify/switch everything. not sure who's going to pick that up; if any of the metrics team have bandwidth, please let me know.
Comment 18 • 15 years ago
A new test script was just made and will be run today. For the first few days we'll keep counting users the old way, until we can assess the reliability of this method.
Comment 19 (Reporter) • 15 years ago
(In reply to comment #17)
>
> one bonus is that this provides enough data for metrics to create a chart of
> how many stale users we have over time (last seen more than 7/30/60 days ago).
I would really like to see this data on the dashboard.
Comment 20 (Assignee) • 15 years ago
(In reply to comment #18)
> A new test script was just made and will be run today. For the first few days
> we'll keep counting users the old way, until we can assess the reliability of
> this method.
Turns out there's a conflict with assigning last_seen '0000-00-00' to users who have no last_seen; pinged Nelson to see which fix is best.
Comment 21 • 15 years ago
Bug fixed (for some reason the script didn't complain about parsing '0000-00-00' as a date on my computer).
Comment 22 (Assignee) • 15 years ago
Updated job runs fine, pushed the data manually to im-log03 in the usual place, updated wp-adm01:/opt/etl-run.sh to add the new collection after the existing ones. Next update will be in a few days - once data has been collected over time, and Metrics switches over to the new data source.
Comment 23 (Assignee) • 15 years ago
Data files are present for inspection:
./count_Users.ktr: <name>${OUTPUT_DIR}/userCount_${DATE}</name>
# ls -al userCount_*
-rw-r--r-- 1 root root 3684 Oct 21 09:32 userCount_2010-10-21
-rw-r--r-- 1 root root 3677 Oct 22 09:06 userCount_2010-10-22
-rw-r--r-- 1 root root 3699 Oct 23 07:37 userCount_2010-10-23
-rw-r--r-- 1 root root 3690 Oct 24 07:37 userCount_2010-10-24
-rw-r--r-- 1 root root 3702 Oct 25 07:38 userCount_2010-10-25
-rw-r--r-- 1 root root 3721 Oct 26 07:37 userCount_2010-10-26
-rw-r--r-- 1 root root 3738 Oct 27 07:37 userCount_2010-10-27
Comment 24 • 15 years ago
Believe this is finished now.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment 26 (Assignee) • 14 years ago
Nelson confirmed that there is still work to do on this to stabilize user counts.
Comment 27 • 14 years ago
New user-count ETL is in place; results will show on the dashboard shortly.
Status: REOPENED → RESOLVED
Closed: 15 years ago → 14 years ago
Resolution: --- → FIXED
Updated • 10 years ago
Product: mozilla.org → mozilla.org Graveyard