Closed
Bug 879835
Opened 12 years ago
Closed 11 years ago
[Bugs ES] Script to mirror data from Bugzilla
Categories
(Testing :: General, defect)
Testing
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mcote, Assigned: ekyle)
References
Details
Attachments
(5 files)
We need a script that (a) adds data for *public* bugs that have changed since the last time it was run and (b) removes data for bugs that were public but have since had access restrictions placed on them. We want this script to run at least every hour.
We will also have to determine to which machine this should be deployed.
Reporter | ||
Comment 1•12 years ago
|
||
Note that metrics already has a script that does this; we would just need to modify it.
Reporter | ||
Comment 2•12 years ago
|
||
A further note that the schema for tracking flags is changing (bug 750742). This will probably require an update to the script.
(In reply to Mark Côté ( :mcote ) from comment #0)
> We need a script that (a) adds data for *public* bugs that have changed
> since the last time it was run and (b) removes data for bugs that were
> public but have since had access restrictions placed on them.
we need to remove non-public bugs, comments, and attachments.
we also need to populate the ES cluster when a bug/comment/attachment is changed from private to public.
Assignee | ||
Updated•12 years ago
|
Assignee: nobody → klahnakoski
Status: NEW → ASSIGNED
Comment 4•12 years ago
|
||
Note, I think current bugzilla->ES approach is wrong. Instead of munging data into ES and dealing with synchronization headaches, we should fix bugzilla to show review info(is there anything else we care about?)
Assignee | ||
Comment 5•12 years ago
|
||
The most important thing we care about is historical and time series data. The "munging" that is done during ETL is actually making the data explicit and easier to query in the time dimension.
https://metrics.mozilla.com/bugzilla-analysis/Bug-Ages.html
https://metrics.mozilla.com/bugzilla-analysis/Security_Q1_Goal.html
and perform some simple calculations on those time series
https://metrics.mozilla.com/bugzilla-analysis/Predictive_Simple.html#dueDate=2013-08-11&sampleMin=2013-06-21&titleName=&programFilter=B2G+1.1.0+%28Leo%29
Overall, ElasticSearch can calculate aggregates very fast. So fast, that we no longer need to pre-compute or cache those aggregates and still get relatively quick charts.
Bugzilla ETL, in my opinion, the most complicated. I have seen telemetry and talos test data, and both can be dumped directly into ES with minimal transformation.
Reporter | ||
Comment 6•12 years ago
|
||
Yeah, to summarize what Kyle said, an ES database is incredibly useful for external dashboards (particularly burn-down rates). It is also useful for analysis of Bugzilla trends in general, since ES is much better at some types of analysis than MySQL will ever be.
Assignee | ||
Comment 7•12 years ago
|
||
Checking if bug is public:
bug_group_map
longdescs.isprivate (boolean)
attachments.isprivate
Assignee | ||
Comment 8•11 years ago
|
||
This patch is a single changeset containing the net changes from the initial commit. It is now part of master. The old master branch was renamed to first_version_for_review for the sake of keeping change history.
Attachment #816332 -
Flags: review?(mcote)
i was driving by, and noticed the following:
get_recent_private_attachments() only checks for attachments which are changed from public to private, and doesn't catch attachments which were created as private. you need to also check the attachments table for recently created private attachments.
get_recent_private_comments() only checks for comments which are private at the time they are created. you also need to check the bugs_activity table for comments changed to private.
i'm probably missing it, but i don't see anything that deals with adding comments or attachments when they are changed from private to public.
i have some concerns about your main polling query:
> SELECT
> bug_id
> FROM
> bugs
> WHERE
> delta_ts >= CONVERT_TZ(FROM_UNIXTIME({{start_time}}/1000), 'UTC', 'US/Pacific') AND
> ({{min}} <= bug_id AND bug_id < {{max}}) AND
> bug_id not in {{private_bugs}}
we have 63314 private bugs in the database currently, so that's a rather IN clause you'll be sending to mysql every time. it would probably be more efficient to join with the bug_group_map table and skip rows where exist there.
Comment 10•11 years ago
|
||
also, functions void index usage, resulting in a full table scan. it would be better to calculate the value for delta_ts in python and pass a constant in to the query instead. (thanks cyborgshadow!)
Comment 11•11 years ago
|
||
Even though it was a bit past midnight, glad I could help. :)
Assignee | ||
Comment 12•11 years ago
|
||
Thanks for you time looking at this. I am working on the fixes now. In the meantime I meant to ask you to confirm HOW things get marked private/public. Here is the code I use to simulate those actions:
https://github.com/klahnakoski/Bugzilla-ETL/blob/master/tests/util/database.py#L49
Specifically, I would like to know if the longdescs(bug_when) is updated when the isprivate is changed.
Your comments above already reveal I have no test for when attachments/comments go back public, so I will add those tests and code too.
On the issue of functions and index usage: I believe MySQL is smart enough to recognize the pure functional nature of CONVERT_TZ() and FROM_UNIXTIME() and evaluates the right-hand-side of the comparison once before using the index to complete the lookup:
delta_ts >= CONVERT_TZ(FROM_UNIXTIME({{start_time}}/1000), 'UTC', 'US/Pacific')
This is different than the inverse form, which must operate on every timestamp value and results in a full table scan:
UNIX_TIMESTAMP(CONVERT_TZ(delta_ts, 'US/Pacific', 'UTC'))*1000 >= {{start_time}}
I confirmed this with a DESCRIBE on the two possible query phrasings:
describe SELECT
b.bug_id
FROM
bugs b
LEFT JOIN
bug_group_map m ON m.bug_id=b.bug_id
WHERE
delta_ts >= CONVERT_TZ(FROM_UNIXTIME(0/1000), 'UTC', 'US/Pacific') AND
m.bug_id IS NULL
;
describe SELECT
b.bug_id
FROM
bugs b
LEFT JOIN
bug_group_map m ON m.bug_id=b.bug_id
WHERE
UNIX_TIMESTAMP(CONVERT_TZ(delta_ts, 'US/Pacific', 'UTC'))*1000 >= 0 AND
m.bug_id IS NULL
;
I would like to keep the SQL timezone conversion code because I trust it to do the "right thing" when it comes to daylight saving time, including pre 2007. Moving to Python time conversion will probably involve another library, that slightly disagrees with one used by the database, and introducing timezone bugs which I find especially hard to stomp out.
Comment 13•11 years ago
|
||
(In reply to Kyle Lahnakoski [:ekyle] from comment #12)
> Specifically, I would like to know if the longdescs(bug_when) is updated
> when the isprivate is changed.
it isn't not updated -- that date reflects when the comment was made, not when it was last updated.
> On the issue of functions and index usage: I believe MySQL is smart enough
> to recognize the pure functional nature of CONVERT_TZ() and FROM_UNIXTIME()
> and evaluates the right-hand-side of the comparison once before using the
> index to complete the lookup:
if you use EXPLAIN EXTENDED you can see it isn't using an index --- "extra" just has 'Using where' instead of 'Using where; Using index'.
> I would like to keep the SQL timezone conversion code because I trust it to
> do the "right thing" when it comes to daylight saving time
you could do the conversion once in a different sql query, then pass the result as a string into the main query..
SELECT CONVERT_TZ(FROM_UNIXTIME({{start_time}}/1000), 'UTC', 'US/Pacific')
Comment 14•11 years ago
|
||
(In reply to Byron Jones ‹:glob› from comment #13)
> it isn't not updated -- that date reflects when the comment was made, not
> when it was last updated.
err, "it isn't updated" :)
Assignee | ||
Comment 15•11 years ago
|
||
On longdescs(bug_when): Thank you! I had totally missed the bugs_activity(longdesc.isprivate) field :(
On timezone conversion: Hmmm, my BZ instance shows extra=""Using where; Using index; Not exists". Just to be safe, I will calculate the datetime in a separate query as you suggest.
Reporter | ||
Comment 16•11 years ago
|
||
Comment on attachment 816332 [details] [diff] [review]
First review
r+ with all the changes listed in the github commit.
Attachment #816332 -
Flags: review?(mcote) → review+
Assignee | ||
Comment 17•11 years ago
|
||
The bzAliases branch (https://github.com/klahnakoski/Bugzilla-ETL/tree/bzAliases) has the latest version of the code. This is the log it generates (only the first and last 1000's block is ETLed). I hope the literal SQL may expose more problems.
Attachment #827043 -
Flags: review?(glob)
Assignee | ||
Comment 18•11 years ago
|
||
When looking at the patch ( https://bugzilla.mozilla.org/attachment.cgi?id=827043 ) there is a long block of logs with no SQL. Jump to "ETL of first 1000 bugs done" to get to the next(last) block.
Assignee | ||
Comment 19•11 years ago
|
||
This is a log generated during functional testing. This reveals the SQL used by test to simulate production (which I am not sure about) and also reveals the SQL used in the incremental ETL
Attachment #827388 -
Flags: review?(glob)
Comment 20•11 years ago
|
||
Comment on attachment 827043 [details]
Review of SQL log
i don't feel confident enough to do a full review, due to a lack of full context of the operations, and i don't know enough of the ETL outside of the sql to determine if the data from are correct.
i have looked over the sql and it looks sane to me.
Attachment #827043 -
Flags: review?(glob) → feedback+
Comment 21•11 years ago
|
||
Comment on attachment 827388 [details]
Review Test Log
things look sane over here too.
Attachment #827388 -
Flags: review?(glob) → feedback+
Assignee | ||
Comment 22•11 years ago
|
||
Mark,
How should the review for the bzAliases branch be done? Do I merge --squash it into master and you comment on the net changes? Do you review all patches in that branch?
I believe the best course of action is for you to review the diff between master and bzAliases in the test code: https://github.com/klahnakoski/Bugzilla-ETL/compare/master...bzAliases?w=1 starting at "tests/test_etl.py". This will show you what has changed in the tests, which is effectively very little.
This branch is about improving the email aliases matching: It can tease out the the various emails an individual has used over history, and consider them all aliases of each other. This has cleaned up the CC lists significantly (and reviews counts). The tests have not changed, but the reference JSON used to verify the tests have changed to reflect these improvements. I am glad to see that the early CC lists are small to begin, and grow over time as expected. Previously the CC lists erroneously started large, and shrunk (or not) over time.
tests/resources/*reference_es.json files have also changed to reflect the new alias matching algorithm. Github does not want to show the large file diffs, so they must be done locally.
Flags: needinfo?(mcote)
Reporter | ||
Comment 23•11 years ago
|
||
Hm, well, that's what we effectively did last time, so we could do that. Or you can upload a diff from master to the bzAliases branch here and we can review it in Bugzilla. Up to you.
Flags: needinfo?(mcote)
Assignee | ||
Comment 24•11 years ago
|
||
All changes can be seen in this squashed changeset:
https://github.com/klahnakoski/Bugzilla-ETL/commit/711810f08951a731dc543c10a0973fc34ed17c6b
The better_leak_check branch has the breakdown:
https://github.com/klahnakoski/Bugzilla-ETL/commits/better_leak_check
The real meat is in three smaller changesets. The first removes the comments that were leaking, and the tests to confirm this situation is caught:
https://github.com/klahnakoski/Bugzilla-ETL/commit/31a8ae54d9c8e4e1689a77c3a0c5a8464a8f78ac?w=1
The second is a date conversion bug, also unique to production (only because I am using a 6 month old snapshot):
https://github.com/klahnakoski/Bugzilla-ETL/commit/7edbe1711dcb44e75c90ec046dbfed7cfad9cb34
Third is better leak checker (not run in production, but responsible for reviewing the ETL results):
https://github.com/klahnakoski/Bugzilla-ETL/commit/691dd47859fb1050b9c6182590457a121b59a8c1
The rest is PEP8, more logging, better UTF8 coding practices, me fighting with git
Attachment #8357535 -
Flags: review?(mcote)
Reporter | ||
Comment 25•11 years ago
|
||
Uh this attachment is in binary...
Attachment #8357535 -
Attachment is patch: false
Attachment #8357535 -
Attachment mime type: text/plain → application/zip
Reporter | ||
Comment 26•11 years ago
|
||
Comments from github:
7edbe17 bzETL/util/db.py:L626
Is the double colon necessary here?
7edbe17 bzETL/util/queries/windows.py:L152
Why the extra line?
691dd47 tests/resources/python/look_for_leaks.py:L115
Spacing.
691dd47 tests/resources/python/look_for_leaks.py:L126
Spacing.
Reporter | ||
Updated•11 years ago
|
Attachment #8357535 -
Flags: review?(mcote) → review+
Assignee | ||
Comment 27•11 years ago
|
||
All changes from Jan 12th to (and including) Jan 28. It is probably better to review the changes in GitHub.
Here's what's important:
This is a bad attempt as fixing the whiteboard leakage (see https://bugzilla.mozilla.org/show_bug.cgi?id=952677 for links to correct fix)
https://github.com/klahnakoski/Bugzilla-ETL/commit/32ee1022d84fc54f6abaadc9bfcf9ed73dc6732d?w=1
Testing for the leaked whiteboard:
https://github.com/klahnakoski/Bugzilla-ETL/commit/32ee1022d84fc54f6abaadc9bfcf9ed73dc6732d?w=1#diff-469bf8513f3c4ac87f66514572ce704fR417
the look_for_leaks.py is now a daemon that I run constantly to check for leaks:
https://github.com/klahnakoski/Bugzilla-ETL/commit/64bd0e4242e648faf1272453ae4198de40e2bf51#diff-2464b031c4302f5ea65c026afea8e467L2
Log the bugs that are net new deletes:
https://github.com/klahnakoski/Bugzilla-ETL/commit/2ee6ce8f852511da5f40be19e83ccd3f9964a798#diff-1bf47e2bbc2d093719a4eacfa1645ef4L183
All history on changed bug is needed, unless I also probe the ES cluster for current state (which is more complicated and prone to error):
https://github.com/klahnakoski/Bugzilla-ETL/commit/50775d2d3f5b2fd53ff5d6813863aab943384519#diff-e7f5c575eeda495b16d7432bbe35b033L610
Which fixes the expirers_on problems seen in production:
https://github.com/klahnakoski/Bugzilla-ETL/commit/50775d2d3f5b2fd53ff5d6813863aab943384519#diff-469bf8513f3c4ac87f66514572ce704fR455
The remain list are the bug groups' where whiteboard must be screened:
https://github.com/klahnakoski/Bugzilla-ETL/commit/27eec045ecfeedac9651583fc1803da315d3aa52
private bugs leaked into public cluster because delete command had too many terms (66K) to delete. Batch the deletes:
https://github.com/klahnakoski/Bugzilla-ETL/commit/bd520f1423d9bed618471795515678cf628afcdf#diff-1bf47e2bbc2d093719a4eacfa1645ef4L186
The whiteboard was leaking, here is the fix:
https://github.com/klahnakoski/Bugzilla-ETL/commit/bd520f1423d9bed618471795515678cf628afcdf#diff-e7f5c575eeda495b16d7432bbe35b033L563
Added crypto library to encrypt the alias file:
https://github.com/klahnakoski/Bugzilla-ETL/commit/bd520f1423d9bed618471795515678cf628afcdf#diff-ce9ae7df3c225ca1f0fe799f7fbc0c02R1
Added test to ensure whiteboard is [screened]:
https://github.com/klahnakoski/Bugzilla-ETL/commit/bd520f1423d9bed618471795515678cf628afcdf#diff-469bf8513f3c4ac87f66514572ce704fR454
ES is a little unstable: Add redundancy:
https://github.com/klahnakoski/Bugzilla-ETL/commit/6aaa23377aeb11fbc2e8b76102ee3462df69f216#diff-c95ca9c9b3a2347a2cd9f631f4dff000R3
But not too much redundancy:
https://github.com/klahnakoski/Bugzilla-ETL/commit/0f9cdc80f130df261822d43bc8a2cab690b24d35#diff-1bf47e2bbc2d093719a4eacfa1645ef4R159
Attachment #8367361 -
Flags: review?(mcote)
Reporter | ||
Comment 28•11 years ago
|
||
Comment on attachment 8367361 [details] [diff] [review]
diff.txt
Review of attachment 8367361 [details] [diff] [review]:
-----------------------------------------------------------------
This seems fine. As mentioned in channel I don't think I'm adding much value with these reviews, so unless you have particular questions, I don't think we should bother going forward.
::: README.md
@@ +33,5 @@
>
> +Installation with PyPy
> +----------------------
> +
> +PyPy will execute 4 to 5 times faster then CPython. PyPy maintains its own environment, and it's own version of the module binaries. This means running SetupTools is just a little different. After
Wrong "its".
::: bzETL/bz_etl.py
@@ +508,5 @@
> +is a summary of log categories:
> +
> +PROBLEM - Indicates data inconsistency. There is nothing we can do, and the
> + ETL will deal with it as best it can. No need to take action.
> +WARN1NG - There is an error in the ETL logic. The code will try it's best to
Wrong "its" again. :)
::: bzETL/extract_bugzilla.py
@@ +560,5 @@
> #TODO: CF_LAST_RESOLVED IS IN PDT, FIX IT
> param.bug_filter = db.esfilter2sqlwhere({"terms": {"a.bug_id": param.bug_list}})
> param.mixed_case_fields = SQL(MIXED_CASE)
> + param.screened_whiteboard = db.esfilter2sqlwhere({"terms": {"m.group_id": SCREENED_BUG_GROUP_IDS}})
> + param.whiteboard_field=STATUS_WHITEBOARD_FIELD_ID
Spacing.
::: bzETL/parse_bug_history.py
@@ +437,5 @@
> changes[c] = Null
> continue
>
> + if DEBUG_CHANGES:
> + ("Processing change: " + CNV.object2JSON(change))
Hm is this actually doing anything? Don't you need a Log method call?
::: resources/json/bug_version.json
@@ +267,5 @@
> "type": "string"
> }
> },
> + "type": "nested",
> + "dynamic": "strict"
Weird spacing.
Attachment #8367361 -
Flags: review?(mcote) → review+
Assignee | ||
Comment 29•11 years ago
|
||
Working, and with no leaks, for a few days now.
Assignee | ||
Updated•11 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•