Closed Bug 922877 Opened 11 years ago Closed 10 years ago

Machine(s) for running Bugzilla MySQL -> ES ETL

Categories

(bugzilla.mozilla.org :: Infrastructure, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: mcote, Unassigned)

References

Details

This is not directly related to Bugzilla, but it will need access to the Bugzilla database and so is fairly tightly linked.

We will need at least one machine for running the ETL developed for bug  	879835.  We will probably run two instances of the ETL application, one for writing to (and regularly scrubbing) the public-only database (bug 872363), and one for writing to the full database (bug 922477).  One machine would probably suffice if it has enough processing power, unless we want to isolate them for security reasons.
What's the primary bottleneck here... processing power, amount of RAM, disk I/O, disk size, or network bandwidth? Do we have any rough estimates of how much of any of these we might need?

I'm guessing this won't need anything in terms of disk space or I/O, but I'm not sure about the others.

Is this something that needs lots of power continuously, or more like a once-a-day cron?

I'm wondering if we can get this into a VM, or if we need to go bare metal. We can do kind of a lot on VMs these days, and we're trying to go that way as much as we can. But of course not everything is a good fit. :)
The first run will require lots of memory (2G) because there appears to be a memory leak that reveals itself over hours.  The incremental runs, every 10min or less, will require minimal memory (200M) and only one core.  The IO connection to the Bugzilla database will be a limiting factor, as well as the single-core execution speed.
Pinging this bug... the security review of the ETL is underway, and the security team would like to be able to see it in motion, which requires this box.  Also we'd like to run it for some time to verify the accumulated data before using it in apps, so we'd like to get started on that as soon as we can.
Submitted a VM request; if we find it's insufficient to the task, we can either try to beef it up or request a blade from dcops.
etl1.bugs.scl3.mozilla.com has been created; what's next?
* PyPy must be installed (with pip too)
* The machine will require permissions to access bugzilla, the public es cluster, and the private es cluster.  
* The connection information must be added to settings files.  Here is an example of a setting file: https://github.com/klahnakoski/Bugzilla-ETL/blob/bzAliases/resources/settings/bz_etl_settings_example.json  one will be required for the public bugs instance, and one required for the private bugs instance.  Please look this over, specifically regarding the bugzilla password.
* Please ensure the  "first_run_time" and "last_run_time" of the two settings files point to different files so they do not crosstalk.

The install instructions are found in the readme https://github.com/klahnakoski/Bugzilla-ETL/tree/bzAliases (bzAlias branch).  If you wait, I can get you a PIP installable for ease of install and upgrade.
Installation is now "pip install Bugzilla-ETL".  The settings.json files must still be created of course.  Instructions are now on master: https://github.com/klahnakoski/Bugzilla-ETL#bugzilla-etl

There are also instructions for a quick run, so you can verify the long run will work.

There are two cron jobs set to run every 10min:
1) incremental public etl
2) incremental private etl
Blocks: 944131
Flags: needinfo?(klibby)
We need this running to perform the security review.  Do you need something from me to complete the install?
(In reply to Kyle Lahnakoski [:ekyle] from comment #6)
> * PyPy must be installed (with pip too)

Ugh, no RHEL rpms available. Hopefully it won't have a dependency rat hole.

> * The machine will require permissions to access bugzilla, the public es
> cluster, and the private es cluster. 

Define access (e.g. bguzilla.mozilla.org, port 443). I *think* it probably has access to ES, but I'm uncertain. Can it use the DC proxies to access bugzilla? 
 
> If you wait, I can get you a PIP installable for ease of install and upgrade.

Awesome; thanks for that. I'll start on the puppet config and rpm; if we don't need NetOps, then I can probably have it for you tomorrow.
Flags: needinfo?(klibby)
Actually, it looks like we do have an rpm but the host doesn't want to see it. So, hurrah.
> > * The machine will require permissions to access bugzilla, the public es
> > cluster, and the private es cluster. 
> 
> Define access (e.g. bguzilla.mozilla.org, port 443). I *think* it probably has
> access to ES, but I'm uncertain. Can it use the DC proxies to access bugzilla? 

It requires access to the bugzilla database (port 3306?).  I do not know what a "dc proxy" is.
May I get an update on this?  I found out my security reviewer will be gone for 10 weeks, starting next week!  

Thank you
Flags: needinfo?(klibby)
I'm wrapping up the puppet config now; I'll poke NetOps on 945745 since it's been almost a week. 

One question: for DB access, does it need read-write access or is read-only sufficient? I presume we don't have a dedicated mysql acct for this access, so I'll have the DBAs add one.
Flags: needinfo?(klibby)
Read-only access is perfect.
Currently running for public-only, and set to go for private bugs. 

Todo: pypy for python27 and commit the cron jobs after the above two finish. And fix any bugs that pop up. :)
Did it finish?  I can not tell if it is almost done, or done but broken.  :/  Thanks
AFAICT the public run finished and is now running from cron every 10 minutes. Still working on the private run.
The private run finished overnight and is also now running out of cron every 10 minutes. Any errors should me mailed to you and me; let me know if anyone else should receive them.

Is there anything else that needs doing?
Nothing else for right now.  Please keep this bug open for a couple days while I review the content and look for bugs.


Thank You!!!  :)
Depends on: 952108
Depends on: 958039

Unhandled exception in thread started by 
sys.excepthook is missing
lost sys.stderr


I *think* this error is caused by streaming/piping problems:

http://stackoverflow.com/questions/12790328/how-to-silence-sys-excepthook-is-missing-error

What is the "debug" value in the settings.json file?  If there is nothing digesting stdout then there should be no

   {
     "class": "bzETL.util.logs.Log_usingStream",
     "stream": "sys.stdout"
   }
    

Here is my debug value, but I use windows and it always consumes stdout.

    "debug": {
        "log": [
            {
                "class": "logging.handlers.RotatingFileHandler",
                "filename": "./results/logs/bz_etl.log",
                "maxBytes": 10000000,
                "backupCount": 200,
                "encoding": "utf8"
            },
            {
                "class": "bzETL.util.logs.Log_usingStream",
                "stream": "sys.stdout"
            }
        ]

    }
960111 has leaked into the public cluster. 

In the logs there should be mention of "Ensure the following private bugs are deleted" followed by a list, which should include 960111.  Is that true?
Flags: needinfo?(klibby)
public_results/logs/bugzilla_etl.log:

15:50:03 - Ensure the following private bugs are deleted:
[
    0,
 (snipped 44,717 lines)
    960111,
Flags: needinfo?(klibby)
From Kyle:

>>
>>> Unhandled exception in thread started by
>>> sys.excepthook is missing
>>> lost sys.stderr
>>
>>
>> I *think* the error (above) is caused by streaming/piping problems:
>>
>> http://stackoverflow.com/questions/12790328/how-to-silence-sys-excepthook-is-missing-error
>>
>> What is the "debug" value in the settings.json file?  If there is
>> nothing digesting stdout then there should be no
>>
>>     {
>> "class": "bzETL.util.logs.Log_usingStream",
>>       "stream": "sys.stdout"
>>     }
>>
>>
>> Here is my debug value, but I use windows and it always consumes stdout.
>>
>>      "debug": {
>>          "log": [
>>              {
>>                  "class": "logging.handlers.RotatingFileHandler",
>>                  "filename": "./results/logs/bz_etl.log",
>>                  "maxBytes": 10000000,
>>                  "backupCount": 200,
>>                  "encoding": "utf8"
>>              },
>>              {
>>                  "class": "bzETL.util.logs.Log_usingStream",
>>                  "stream": "sys.stdout"
>>              }
>>          ]
>>
>>      }
>>

On 16/01/2014 10:56 AM, Kendall Libby wrote:
>
>     "debug":{
>         "log":[{
>             "class": "logging.handlers.RotatingFileHandler",
>             "filename": "/data/www/public_results/logs/bugzilla_etl.log",
>             "maxBytes": 10000000,
>             "backupCount": 100,
>             "encoding": "utf8"
>         },{
>             "class":"bzETL.util.logs.Log_usingStream",
>             "stream":"sys.stdout"
>         }]
>
>     }
>
>
> stdout's currently going to /dev/null; so remove the sys.stdout section above?
>


/dev/null seems fine, and explains why this is not an issue every ten minutes.

I am working on a patch that should catch-and-report this error in more detail.  It looks like problems with stdout, but more likely my own bad programming.


Thanks
Depends on: 961211
Byron, Sheeri

I seem to be loosing my connection during this query.  I guess it is simply pulling too much?  (about 500K records in my public bugzila db)


ERROR: alias_analysis ERROR: Problem executing SQL:
	# from https://github.com/klahnakoski/Bugzilla-ETL
	
	    SELECT
	        bug_id,
	        CAST(9999999999000 AS signed) AS modified_ts,
	        CAST(null AS char(255)) AS new_value,
	        lower(CAST(p.login_name AS CHAR(255) CHARACTER SET utf8)) AS old_value
	    FROM
	        cc
	    LEFT JOIN
	        profiles p ON cc.who = p.userid
	    WHERE
	        `bug_id` BETWEEN 200000 AND 299999
	UNION ALL
	    SELECT
	        a.bug_id,
	        UNIX_TIMESTAMP(CONVERT_TZ(bug_when, 'US/Pacific','UTC'))*1000 AS modified_ts,
	        lower(CAST(trim(added) AS CHAR CHARACTER SET utf8)) AS new_value,
	        lower(CAST(trim(removed) AS CHAR CHARACTER SET utf8)) AS old_value
	    FROM
	        bugs_activity a
	    WHERE
	        a.fieldid = 37 AND
	        `bug_id` BETWEEN 200000 AND 299999
	at File "/data/www/Bugzilla-ETL/bzETL/extract_bugzilla.py", line 435, in get_all_cc_changes
	at File "/data/www/Bugzilla-ETL/bzETL/alias_analysis.py", line 48, in main
	at File "/data/www/Bugzilla-ETL/bzETL/util/threads.py", line 223, in _run

caused by
	ERROR: (2013, 'Lost connection to MySQL server during query')
	at File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 240, in __init__
	at File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 825, in _read_packet
	at File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1049, in read
	at File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 845, in _read_query_result
	at File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 734, in query
	at File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 202, in _query
	at File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 102, in execute
	at File "/data/www/Bugzilla-ETL/bzETL/util/db.py", line 220, in query
Flags: needinfo?(scabral)
Flags: needinfo?(glob)
> I seem to be loosing my connection during this query.  I guess it is simply
> pulling too much?  (about 500K records in my public bugzila db)

yup, that's a large amount of data to return and is most likely the cause of the issue.

fwiw on production the first part returns 237,777 and the second 239,903 resulting in 477,679 rows.
Flags: needinfo?(glob)
Made new bug for this issue

https://bugzilla.mozilla.org/show_bug.cgi?id=961588
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Flags: needinfo?(scabral)
Oops, I will leave this for WebOps to close
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
If you're happy to close it, I'm happy to close it. :)
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 963883
Depends on: 963884
Depends on: 964978
No longer depends on: 963883
No longer depends on: 963884
Today there will be changes, and some will be changes to private bugs, and the incremental ETL must deal with them.  I already see a bug moving from public to private, and the ETL is properly removing them from the ES cluster.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Component: WebOps: Bugzilla → Infrastructure
Product: Infrastructure & Operations → bugzilla.mozilla.org
You need to log in before you can comment on or make changes to this bug.