Closed Bug 879835 Opened 11 years ago Closed 10 years ago

[Bugs ES] Script to mirror data from Bugzilla

Categories

(Testing :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mcote, Assigned: ekyle)

References

Details

Attachments

(5 files)

We need a script that (a) adds data for *public* bugs that have changed since the last time it was run and (b) removes data for bugs that were public but have since had access restrictions placed on them.  We want this script to run at least every hour.

We will also have to determine to which machine this should be deployed.
Blocks: 879822
Note that metrics already has a script that does this; we would just need to modify it.
A further note that the schema for tracking flags is changing (bug 750742).  This will probably require an update to the script.
(In reply to Mark Côté ( :mcote ) from comment #0)
> We need a script that (a) adds data for *public* bugs that have changed
> since the last time it was run and (b) removes data for bugs that were
> public but have since had access restrictions placed on them.

we need to remove non-public bugs, comments, and attachments.

we also need to populate the ES cluster when a bug/comment/attachment is changed from private to public.
Assignee: nobody → klahnakoski
Status: NEW → ASSIGNED
Note, I think current bugzilla->ES approach is wrong. Instead of munging data into ES and dealing with synchronization headaches, we should fix bugzilla to show review info(is there anything else we care about?)
The most important thing we care about is historical and time series data.  The "munging" that is done during ETL is actually making the data explicit and easier to query in the time dimension.  

https://metrics.mozilla.com/bugzilla-analysis/Bug-Ages.html
https://metrics.mozilla.com/bugzilla-analysis/Security_Q1_Goal.html

and perform some simple calculations on those time series

https://metrics.mozilla.com/bugzilla-analysis/Predictive_Simple.html#dueDate=2013-08-11&sampleMin=2013-06-21&titleName=&programFilter=B2G+1.1.0+%28Leo%29

Overall, ElasticSearch can calculate aggregates very fast.  So fast, that we no longer need to pre-compute or cache those aggregates and still get relatively quick charts.

Bugzilla ETL, in my opinion, the most complicated.  I have seen telemetry and talos test data, and both can be dumped directly into ES with minimal transformation.
Yeah, to summarize what Kyle said, an ES database is incredibly useful for external dashboards (particularly burn-down rates).  It is also useful for analysis of Bugzilla trends in general, since ES is much better at some types of analysis than MySQL will ever be.
Checking if bug is public:
bug_group_map 
longdescs.isprivate (boolean)
attachments.isprivate
Depends on: 922905
Attached patch First reviewSplinter Review
This patch is a single changeset containing the net changes from the initial commit.  It is now part of master.   The old master branch was renamed to first_version_for_review for the sake of keeping change history.
Attachment #816332 - Flags: review?(mcote)
Depends on: 926887
Depends on: 926889
Depends on: 927494
i was driving by, and noticed the following:

get_recent_private_attachments() only checks for attachments which are changed from public to private, and doesn't catch attachments which were created as private.  you need to also check the attachments table for recently created private attachments.

get_recent_private_comments() only checks for comments which are private at the time they are created.  you also need to check the bugs_activity table for comments changed to private.

i'm probably missing it, but i don't see anything that deals with adding comments or attachments when they are changed from private to public.


i have some concerns about your main polling query:

> SELECT
>     bug_id
> FROM
>     bugs
> WHERE
>     delta_ts >= CONVERT_TZ(FROM_UNIXTIME({{start_time}}/1000), 'UTC', 'US/Pacific') AND
>     ({{min}} <= bug_id AND bug_id < {{max}}) AND
>     bug_id not in {{private_bugs}}

we have 63314 private bugs in the database currently, so that's a rather IN clause you'll be sending to mysql every time.  it would probably be more efficient to join with the bug_group_map table and skip rows where exist there.
also, functions void index usage, resulting in a full table scan.  it would be better to calculate the value for delta_ts in python and pass a constant in to the query instead.  (thanks cyborgshadow!)
Even though it was a bit past midnight, glad I could help. :)
Thanks for you time looking at this.  I am working on the fixes now.  In the meantime I meant to ask you to confirm HOW things get marked private/public.   Here is the code I use to simulate those actions:

https://github.com/klahnakoski/Bugzilla-ETL/blob/master/tests/util/database.py#L49

Specifically, I would like to know if the longdescs(bug_when) is updated when the isprivate is changed.

Your comments above already reveal I have no test for when attachments/comments go back public, so I will add those tests and code too.  


On the issue of functions and index usage:   I believe MySQL is smart enough to recognize the pure functional nature of CONVERT_TZ() and FROM_UNIXTIME() and evaluates the right-hand-side of the comparison once before using the index to complete the lookup:

delta_ts >= CONVERT_TZ(FROM_UNIXTIME({{start_time}}/1000), 'UTC', 'US/Pacific')

This is different than the inverse form, which must operate on every timestamp value and results in a full table scan:

UNIX_TIMESTAMP(CONVERT_TZ(delta_ts, 'US/Pacific', 'UTC'))*1000 >= {{start_time}}

I confirmed this with a DESCRIBE on the two possible query phrasings:

describe SELECT
        b.bug_id
    FROM
        bugs b
    LEFT JOIN
        bug_group_map m ON m.bug_id=b.bug_id
    WHERE
        delta_ts >= CONVERT_TZ(FROM_UNIXTIME(0/1000), 'UTC', 'US/Pacific') AND
        m.bug_id IS NULL
    ;

describe SELECT
        b.bug_id
    FROM
        bugs b
    LEFT JOIN
        bug_group_map m ON m.bug_id=b.bug_id
    WHERE
        UNIX_TIMESTAMP(CONVERT_TZ(delta_ts, 'US/Pacific', 'UTC'))*1000 >= 0 AND
        m.bug_id IS NULL
;

I would like to keep the SQL timezone conversion code because I trust it to do the "right thing" when it comes to daylight saving time, including pre 2007.  Moving to Python time conversion will probably involve another library, that slightly disagrees with one used by the database, and introducing timezone bugs which I find especially hard to stomp out.
(In reply to Kyle Lahnakoski [:ekyle] from comment #12)
> Specifically, I would like to know if the longdescs(bug_when) is updated
> when the isprivate is changed.

it isn't not updated -- that date reflects when the comment was made, not when it was last updated.

> On the issue of functions and index usage:   I believe MySQL is smart enough
> to recognize the pure functional nature of CONVERT_TZ() and FROM_UNIXTIME()
> and evaluates the right-hand-side of the comparison once before using the
> index to complete the lookup:

if you use EXPLAIN EXTENDED you can see it isn't using an index --- "extra" just has 'Using where' instead of 'Using where; Using index'.

> I would like to keep the SQL timezone conversion code because I trust it to
> do the "right thing" when it comes to daylight saving time

you could do the conversion once in a different sql query, then pass the result as a string into the main query..

  SELECT CONVERT_TZ(FROM_UNIXTIME({{start_time}}/1000), 'UTC', 'US/Pacific')
(In reply to Byron Jones ‹:glob› from comment #13)
> it isn't not updated -- that date reflects when the comment was made, not
> when it was last updated.

err, "it isn't updated" :)

On longdescs(bug_when):  Thank you! I had totally missed the bugs_activity(longdesc.isprivate) field :(

On timezone conversion:  Hmmm, my BZ instance shows extra=""Using where; Using index; Not exists".  Just to be safe, I will calculate the datetime in a separate query as you suggest.
Depends on: 930065
Comment on attachment 816332 [details] [diff] [review]
First review

r+ with all the changes listed in the github commit.
Attachment #816332 - Flags: review?(mcote) → review+
Depends on: 932346
Attached file Review of SQL log
The bzAliases branch (https://github.com/klahnakoski/Bugzilla-ETL/tree/bzAliases) has the latest version of the code.  This is the log it generates (only the first and last 1000's block is ETLed).  I hope the literal SQL may expose more problems.
Attachment #827043 - Flags: review?(glob)
When looking at the patch ( https://bugzilla.mozilla.org/attachment.cgi?id=827043 ) there is a long block of logs with no SQL.  Jump to "ETL of first 1000 bugs done" to get to the next(last) block.
Attached file Review Test Log
This is a log generated during functional testing.  This reveals the SQL used by test to simulate production (which I am not sure about) and also reveals the SQL used in the incremental ETL
Attachment #827388 - Flags: review?(glob)
Comment on attachment 827043 [details]
Review of SQL log

i don't feel confident enough to do a full review, due to a lack of full context of the operations, and i don't know enough of the ETL outside of the sql to determine if the data from are correct.

i have looked over the sql and it looks sane to me.
Attachment #827043 - Flags: review?(glob) → feedback+
Comment on attachment 827388 [details]
Review Test Log

things look sane over here too.
Attachment #827388 - Flags: review?(glob) → feedback+
Mark, 

How should the review for the bzAliases branch be done?  Do I merge --squash it into master and you comment on the net changes?  Do you review all patches in that branch? 

I believe the best course of action is for you to review the diff between master and bzAliases in the test code: https://github.com/klahnakoski/Bugzilla-ETL/compare/master...bzAliases?w=1  starting at "tests/test_etl.py".  This will show you what has changed in the tests, which is effectively very little. 

This branch is about improving the email aliases matching:  It can tease out the the various emails an individual has used over history, and consider them all aliases of each other.  This has cleaned up the CC lists significantly (and reviews counts).  The tests have not changed, but the reference JSON used to verify the tests have changed to reflect these improvements.  I am glad to see that the early CC lists are small to begin, and grow over time as expected.  Previously the CC lists erroneously started large, and shrunk (or not) over time.

tests/resources/*reference_es.json files have also changed to reflect the new alias matching algorithm.  Github does not want to show the large file diffs, so they must be done locally.
Flags: needinfo?(mcote)
Hm, well, that's what we effectively did last time, so we could do that.  Or you can upload a diff from master to the bzAliases branch here and we can review it in Bugzilla.  Up to you.
Flags: needinfo?(mcote)
Depends on: 945850
Depends on: 945854
Depends on: 945862
Depends on: 952101
Depends on: 952677
All changes can be seen in this squashed changeset:
https://github.com/klahnakoski/Bugzilla-ETL/commit/711810f08951a731dc543c10a0973fc34ed17c6b

The better_leak_check branch has the breakdown:
https://github.com/klahnakoski/Bugzilla-ETL/commits/better_leak_check

The real meat is in three smaller changesets.  The first removes the comments that were leaking, and the tests to confirm this situation is caught:
https://github.com/klahnakoski/Bugzilla-ETL/commit/31a8ae54d9c8e4e1689a77c3a0c5a8464a8f78ac?w=1

The second is a date conversion bug, also unique to production (only because I am using a 6 month old snapshot):
https://github.com/klahnakoski/Bugzilla-ETL/commit/7edbe1711dcb44e75c90ec046dbfed7cfad9cb34

Third is better leak checker (not run in production, but responsible for reviewing the ETL results):
https://github.com/klahnakoski/Bugzilla-ETL/commit/691dd47859fb1050b9c6182590457a121b59a8c1

The rest is PEP8, more logging, better UTF8 coding practices, me fighting with git
Attachment #8357535 - Flags: review?(mcote)
Uh this attachment is in binary...
Attachment #8357535 - Attachment is patch: false
Attachment #8357535 - Attachment mime type: text/plain → application/zip
Comments from github:

7edbe17 bzETL/util/db.py:L626

Is the double colon necessary here?

7edbe17 bzETL/util/queries/windows.py:L152

Why the extra line?

691dd47 tests/resources/python/look_for_leaks.py:L115

Spacing.

691dd47 tests/resources/python/look_for_leaks.py:L126

Spacing.
Attachment #8357535 - Flags: review?(mcote) → review+
Depends on: 959668
No longer depends on: 945862
No longer depends on: 959668
No longer depends on: 922905
No longer depends on: 927494
No longer depends on: 945854
Attached patch diff.txtSplinter Review
All changes from Jan 12th to (and including) Jan 28.  It is probably better to review the changes in GitHub.

Here's what's important:

This is a bad attempt as fixing the whiteboard leakage (see https://bugzilla.mozilla.org/show_bug.cgi?id=952677 for links to correct fix)
https://github.com/klahnakoski/Bugzilla-ETL/commit/32ee1022d84fc54f6abaadc9bfcf9ed73dc6732d?w=1

Testing for the leaked whiteboard:
https://github.com/klahnakoski/Bugzilla-ETL/commit/32ee1022d84fc54f6abaadc9bfcf9ed73dc6732d?w=1#diff-469bf8513f3c4ac87f66514572ce704fR417

the look_for_leaks.py is now a daemon that I run constantly to check for leaks:
https://github.com/klahnakoski/Bugzilla-ETL/commit/64bd0e4242e648faf1272453ae4198de40e2bf51#diff-2464b031c4302f5ea65c026afea8e467L2

Log the bugs that are net new deletes:
https://github.com/klahnakoski/Bugzilla-ETL/commit/2ee6ce8f852511da5f40be19e83ccd3f9964a798#diff-1bf47e2bbc2d093719a4eacfa1645ef4L183

All history on changed bug is needed, unless I also probe the ES cluster for current state (which is more complicated and prone to error):
https://github.com/klahnakoski/Bugzilla-ETL/commit/50775d2d3f5b2fd53ff5d6813863aab943384519#diff-e7f5c575eeda495b16d7432bbe35b033L610

Which fixes the expirers_on problems seen in production:
https://github.com/klahnakoski/Bugzilla-ETL/commit/50775d2d3f5b2fd53ff5d6813863aab943384519#diff-469bf8513f3c4ac87f66514572ce704fR455

The remain list are the bug groups' where whiteboard must be screened:
https://github.com/klahnakoski/Bugzilla-ETL/commit/27eec045ecfeedac9651583fc1803da315d3aa52

private bugs leaked into public cluster because delete command had too many terms (66K) to delete.  Batch the deletes:
https://github.com/klahnakoski/Bugzilla-ETL/commit/bd520f1423d9bed618471795515678cf628afcdf#diff-1bf47e2bbc2d093719a4eacfa1645ef4L186

The whiteboard was leaking, here is the fix:
https://github.com/klahnakoski/Bugzilla-ETL/commit/bd520f1423d9bed618471795515678cf628afcdf#diff-e7f5c575eeda495b16d7432bbe35b033L563

Added crypto library to encrypt the alias file:
https://github.com/klahnakoski/Bugzilla-ETL/commit/bd520f1423d9bed618471795515678cf628afcdf#diff-ce9ae7df3c225ca1f0fe799f7fbc0c02R1

Added test to ensure whiteboard is [screened]:
https://github.com/klahnakoski/Bugzilla-ETL/commit/bd520f1423d9bed618471795515678cf628afcdf#diff-469bf8513f3c4ac87f66514572ce704fR454

ES is a little unstable: Add redundancy:
https://github.com/klahnakoski/Bugzilla-ETL/commit/6aaa23377aeb11fbc2e8b76102ee3462df69f216#diff-c95ca9c9b3a2347a2cd9f631f4dff000R3

But not too much redundancy:
https://github.com/klahnakoski/Bugzilla-ETL/commit/0f9cdc80f130df261822d43bc8a2cab690b24d35#diff-1bf47e2bbc2d093719a4eacfa1645ef4R159
Attachment #8367361 - Flags: review?(mcote)
Comment on attachment 8367361 [details] [diff] [review]
diff.txt

Review of attachment 8367361 [details] [diff] [review]:
-----------------------------------------------------------------

This seems fine.  As mentioned in channel I don't think I'm adding much value with these reviews, so unless you have particular questions, I don't think we should bother going forward.

::: README.md
@@ +33,5 @@
>  
> +Installation with PyPy
> +----------------------
> +
> +PyPy will execute 4 to 5 times faster then CPython.  PyPy maintains its own environment, and it's own version of the module binaries.  This means running SetupTools is just a little different.  After

Wrong "its".

::: bzETL/bz_etl.py
@@ +508,5 @@
> +is a summary of log categories:
> +
> +PROBLEM - Indicates data inconsistency.  There is nothing we can do, and the
> +          ETL will deal with it as best it can.  No need to take action.
> +WARN1NG - There is an error in the ETL logic.  The code will try it's best to

Wrong "its" again. :)

::: bzETL/extract_bugzilla.py
@@ +560,5 @@
>      #TODO: CF_LAST_RESOLVED IS IN PDT, FIX IT
>      param.bug_filter = db.esfilter2sqlwhere({"terms": {"a.bug_id": param.bug_list}})
>      param.mixed_case_fields = SQL(MIXED_CASE)
> +    param.screened_whiteboard = db.esfilter2sqlwhere({"terms": {"m.group_id": SCREENED_BUG_GROUP_IDS}})
> +    param.whiteboard_field=STATUS_WHITEBOARD_FIELD_ID

Spacing.

::: bzETL/parse_bug_history.py
@@ +437,5 @@
>                              changes[c] = Null
>                              continue
>  
> +                    if DEBUG_CHANGES:
> +                        ("Processing change: " + CNV.object2JSON(change))

Hm is this actually doing anything?  Don't you need a Log method call?

::: resources/json/bug_version.json
@@ +267,5 @@
>  							"type": "string"
>  						}
>  					},
> +					"type": "nested",
> +										"dynamic": "strict"

Weird spacing.
Attachment #8367361 - Flags: review?(mcote) → review+
Working, and with no leaks, for a few days now.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: