Closed Bug 1440745 Opened 6 years ago Closed 6 years ago

fix fetch-data-from-hive to look at http 304

Categories

(Socorro :: General, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Details

Attachments

(2 files)

Per bug #1440637, ADI data in Socorro is hosed for the nightly channel. In bug #1399864, they changed the blocklist system so that it could also return an HTTP 304. That landed in early February:

https://bugzilla.mozilla.org/show_bug.cgi?id=1399864#c16

Because of that, we need to update the fetch-data-from-hive script to look at HTTP 200 as well as HTTP 304:

https://bugzilla.mozilla.org/show_bug.cgi?id=1440637#c6

This bug covers doing that.
Making this a P1 and grabbing it.

I have to get access to the box that runs the script. Last I checked, I didn't have access to it.

The code on that server is way out of date and I don't dare update it. We have no dev/stage environment for this script, so I'm going to have to edit it live on the server.

Because of how the ADI data flow in Socorro works, I think we won't be able to observe changes until after a day has passed unless I make extensive changes to the script. Even then, I'm not sure how to tell if it's fixed or not. I'll have to figure out how to verify correctness.

All that suggests this could take a week to fix.

Assuming we are able to fix it, I have no idea what to do with the bad data. Bug #1399864 suggests that we have bad data for all of February. I don't know how to fix that. It may not be viable to fix given the present ADI data flow situation. I plan to leave it as is for now unless someone makes a compelling case for figuring it out.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P1
Some facts:

1. The job runs at 8:00 UTC daily.
2. There's a log in /var/log/socorro/crontabber.log . It's HUGE!
3. Configuration is in /etc/socorro/crontabber.ini.
4. There's a shell script for running the job in /data/socorro/application/scripts/cron/crontabber.sh .
5. The code that's in the Socorro repo is not the same as the code that's running on the box.
6. The FetchADIFromHiveCronApp does not have a "dry-run" mode.

I manually rotated the crontabber.log file so I can more easily see logs for runs over the weekend.

Beyond that, since this situation is super fragile, I'm going to wait until Monday to do anything else other than skulk around. A fire over the weekend is worse than the current situation.

On Monday, I'll do the following:

1. Implement a dry-run mode and write a bash script that runs the job in dry-run mode spitting out what it would have sent to the db. Looking at the code, I'm pretty confident I can do that safely with minimal changes.

2. Let the job run overnight to make sure I haven't messed anything up.

3. On Tuesday, make the change to also look at the 304 status code. I can use the dry-run mode to run the script without affecting -prod. The ADI counts for Firefox nightly should go up because the query will count requests with status 200 and also 304. I think I can rely on that to verify the fix.

4. Let the job run overnight.

5. On Wednesday, check ADI counts in -prod and see where things are at.
Lonnen, Rob, Peter: Does the plan in comment #2 look sane?
Flags: needinfo?(rhelmer)
Flags: needinfo?(peterbe)
Flags: needinfo?(chris.lonnen)
I talked with Peter and he pointed out a ton of really helpful stuff:

1. Even if I dry-run-ify FetchADIFromHiveCronApp, it will still change the crontabber state in -prod and that'll cause problems.

2. Since it's a backfill cron app, I can't run it from the command line with --force and expect it to run--it'll just ignore me.

3. If bug #1399864 changed anything in the request other than an ETag, then we'll have to make more changes to the hive query.


So, given that, I will:

1. Double-check the changes in bug #1399864 on Monday before I do anything.

2. Subclass FetchADIFromHiveCronApp and use a different app_name and do the dry-run stuff there. That way the dry-run version of the app has its own state in the crontabber table in -prod and won't affect anything else. Plus the subclass can override the changes the decorator makes here:

https://github.com/mozilla/crontabber/blob/c56c6a06e5152ae032cf99ef0a6065aaefa88825/crontabber/mixins.py#L14

and then --force should work.
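
A minimal sketch of the subclassing idea, not the code that landed. The base class below is a stand-in so the example runs on its own; its fetch_rows/persist/run methods and the dry-run app_name value are assumptions made up for illustration and are not Socorro's or crontabber's real API.

```python
class FetchADIFromHiveCronApp(object):
    """Stand-in for the real cron app (hypothetical structure)."""

    app_name = "fetch-adi-from-hive"

    def fetch_rows(self, date):
        # The real app would query Hive here; these rows are made up.
        return [("Firefox", "nightly", 10), ("Firefox", "release", 20)]

    def persist(self, rows):
        # The real app would write to the database here.
        raise RuntimeError("would write to the database")

    def run(self, date):
        rows = self.fetch_rows(date)
        self.persist(rows)
        return len(rows)


class DryRunFetchADIFromHiveCronApp(FetchADIFromHiveCronApp):
    # A different app_name (hypothetical value) keeps this app's
    # bookkeeping separate from the real app's.
    app_name = "fetch-adi-from-hive-dry-run"

    def persist(self, rows):
        # Report what would have been written instead of writing it.
        for row in rows:
            print("DRY RUN, would insert: %r" % (row,))


app = DryRunFetchADIFromHiveCronApp()
count = app.run("2018-02-26")
print("%s fetched %d rows" % (app.app_name, count))
```

Only the persistence step is overridden, so the fetch path stays identical to the parent class and any query change is exercised the same way in both apps.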
Flags: needinfo?(peterbe)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #3)
> Lonnen, Rob, Peter: Does the plan in comment #2 look sane?

Pretty sure the way I tested this was to read from Hive via VPN with a local postgres server (if that helps)

This is really hard to test in a decent way unfortunately :/ setting up a local Hive instance is pretty complex.
Flags: needinfo?(rhelmer)
Another option is to ssh into the SCL3 server, copy the relevant Hive Python code into a little deletemewheniamdone.py script. Activate the same virtualenv as used by crontabber.sh. Then just run it locally on the command line and see what it spits out. Then add the `... OR http_status='304')` and run it again.
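
A rough sketch of the before/after comparison for the status filter. Only the `http_status` column name comes from the fragment above; the helper name and the idea of building the clause from a list are assumptions for illustration, and the rest of the Hive query is not shown here.

```python
def status_filter(statuses, column="http_status"):
    """Build a parenthesized OR clause matching any of the given statuses."""
    if not statuses:
        raise ValueError("at least one status code is required")
    for status in statuses:
        # Only digits are allowed so nothing unexpected ends up in the query.
        if not str(status).isdigit():
            raise ValueError("not a status code: %r" % (status,))
    clauses = ["%s='%s'" % (column, status) for status in statuses]
    return "(" + " OR ".join(clauses) + ")"


before = status_filter(["200"])
after = status_filter(["200", "304"])
print(before)
print(after)
```

Running the query once with each clause and comparing row counts is the comparison described above.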
Commits pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/48a3ebb07310652a36bbfc6b08f9f6d72ab426e9
bug 1440745 - move alt-adi apps to separate module

Maintenance for FetchADIFromHiveCronApp is a nightmare right now. It
behooves us to keep it completely isolated from the rest of the codebase.
This moves things around to isolate it.

https://github.com/mozilla-services/socorro/commit/c17d78e377d28e8a0d00eea4f8bfeea89085db2a
Merge pull request #4354 from willkg/1440745-hive

bug 1440745 - move alt-adi apps to separate module
Another fun fact:

* The actual code that's being run is:

  /data/socorro/socorro-virtualenv/lib/python2.6/site-packages/socorro-master-py2.6.egg/socorro/cron/jobs/fetch_adi_from_hive.py

  The code in /data/socorro/application/socorro/cron/jobs/fetch_adi_from_hive.py is **not** run.

* There are 6 different copies of that file of which 5 are impersonators.


On Friday, Lonnen said "This is fun. We're having fun." Given that, clearing the needinfo for Lonnen.
Flags: needinfo?(chris.lonnen)
Today, I wrote a DryRun version of the app and implemented and tested a run script for it. It doesn't affect fetch-adi-from-hive bookkeeping in -prod. It has its own bookkeeping. I can run it multiple times. It tells me what would have happened without actually persisting anything to a db and with minimal differences to the fetch-adi-from-hive app.

I'm going to land that in the repo and let it run for the night to make sure I didn't hose anything.

Assuming everything is good tomorrow, next steps:

1. Double-check the changes in bug #1399864 and verify we just need to add a check for 304.

2. Make the changes required. Use the dry-run mode to run the script without affecting -prod. The ADI counts for Firefox nightly should go up because the query will count requests with status 200 and also 304. I think I can rely on that to verify the fix.

3. Let the job run overnight.

4. On Wednesday, check ADI counts in -prod and see where things are at.
I threw together a script that graphs ADI for the nightly channel over the last month from Socorro ADI data. Seems like the blocklist request change landed on or just after 2018-02-01 (February 1st), so I'm going to call before that "normal" and after that "busted".

Normal:

2018-01-23     ========================
2018-01-24     ========================================================
2018-01-25     ================================================================
2018-01-26     =================================================================
2018-01-27 **  ========================================================
2018-01-28 **  ===========================================================
2018-01-29     ==========================================================================
2018-01-30     =============================================================================
2018-01-31     =============================================================================
2018-02-01     ==============================================================================

Busted:

2018-02-02     ============================================================================
2018-02-03 **  ==================================================
2018-02-04 **  ==================================
2018-02-05     ====================================
2018-02-06     ==========================
2018-02-07     =====================
2018-02-08     =======================
2018-02-09     ===========================================================================
2018-02-10 **  ========================
2018-02-11 **  =================
2018-02-12     ==================
2018-02-13     =========================
2018-02-14     ===============================================================================
2018-02-15     ==============================================================
2018-02-16     ==========================
2018-02-17 **  ===============
2018-02-18 **  ============
2018-02-19     ================
2018-02-20     ================================================================================
2018-02-21     =====================================
2018-02-22     =========================================================================
2018-02-23     ==========================================
2018-02-24 **  ==================================================================
2018-02-25 **  =========================
2018-02-26     ============================
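
A chart like the ones above can be produced with something along these lines. This is a sketch and not the script that was actually used: the function name, the 80-column width, scaling against the largest value, and the sample counts are all assumptions. `**` marks Saturdays and Sundays.

```python
import datetime


def render_bars(counts, width=80):
    """Render (date, value) pairs as text bars scaled to the largest value."""
    peak = max(value for _, value in counts) if counts else 0
    lines = []
    for day, value in counts:
        # weekday() is 5 for Saturday and 6 for Sunday.
        marker = "**" if day.weekday() >= 5 else "  "
        length = int(round(float(value) / peak * width)) if peak else 0
        lines.append("%s %s  %s" % (day.isoformat(), marker, "=" * length))
    return lines


# Made-up sample values, not real ADI counts.
sample = [
    (datetime.date(2018, 1, 26), 50),
    (datetime.date(2018, 1, 27), 100),
]
for line in render_bars(sample, width=10):
    print(line)
```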

I ran the dry-run app and the total number of rows it got back from hive was 283844. After the fix, it got back 286926. That's an increase of 3082 rows (about 1.1%), so it seems like the fix is good. Note that the number is the total number of rows for all channels--not just nightly, hence the small increase.

I'll let this run tonight. Tomorrow, I'll check the nightly adi counts and see whether things look more like "normal" or more like "busted".
I checked the ADI data for 2018-02-27 and it's a long bar, so that seems good and it's probably fixed, but I don't know for sure. I'm going to let it run for the rest of the week and on Monday check it again and then update the bug accordingly.
Commits pushed to master at https://github.com/mozilla-services/socorro

https://github.com/mozilla-services/socorro/commit/52676dfd4c7c4ceed685dcf3509779b89fe2f6e2
bug 1440745 - add DryRunFetchADIFromHiveCronApp

This adds a dry-run app that allows us to test hive query changes to
FetchADIFromHiveCronApp.

https://github.com/mozilla-services/socorro/commit/16c389c7b30c7c21013e22ce4c9c615eac90b2ce
fixes bug 1440745 - fix ADI fetch to include http 304 responses

https://github.com/mozilla-services/socorro/commit/bfcf809b2ef963129191ce20388345814ecabff1
Merge pull request #4356 from willkg/1440745-fetch-adi-fixes

bug 1440745 - add DryRunFetchADIFromHiveCronApp
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Here's the last bunch of days:

2018-02-24 **  ===============================================================
2018-02-25 **  ========================
2018-02-26     ===========================
2018-02-27     ================================================================================
2018-02-28     ===============================================================================
2018-03-01     ==============================================================================
2018-03-02     ===========================================================================
2018-03-03 **  ================================================================


I claim we're good now.