Closed Bug 1226345 Opened 10 years ago Closed 10 years ago

Fail fetch-adi-from-hive if no rows are retrieved

Categories

(Socorro :: General, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: peterbe)

Details

Attachments

(1 file)

On the 12th of November, the fetch-adi-from-hive crontabber app ran "perfectly": it started, it connected, it queried, and it finished up. However, because of a problem with Hive (and deeper problems only sheeri can explain) there simply was no data on the 12th. Because the run counted as a success, the crontabber app never re-attempted it. We should raise an error if the number of rows written from the Hive query is == 0.
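A minimal sketch of the proposed guard. All names here (`ZeroRowsError`, `check_rows_written`) are hypothetical, for illustration only, not Socorro's actual code:

```python
# Hypothetical sketch: fail the crontabber job when the Hive query
# writes zero rows, so crontabber re-attempts later instead of
# silently recording an empty day. Names are made up for illustration.

class ZeroRowsError(Exception):
    """Raised when the Hive query returns no rows for the target date."""

def check_rows_written(rows_written):
    # A successful run that wrote nothing should count as a failure.
    if rows_written == 0:
        raise ZeroRowsError("fetch-adi-from-hive wrote 0 rows")
    return rows_written
```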
Sheeri, suppose that we implement this bug in the job. That means it will re-attempt the job indefinitely, every 5 minutes, until it gets >0 rows from the Hive query. However, suppose we're so unlucky that this happens in the middle of Hive being fixed, and suppose it takes 15 minutes to fill/fix Hive. Is there a chance those writes are not atomic and that we thus might accidentally pick up roughly 33% of the data for that day (5 min ~= 33% of 15 min)?
Flags: needinfo?(scabral)
Indeed, there's not only the chance, but the very distinct possibility. Hive is part of Hadoop, a NoSQL solution, so transactionality and atomicity are not guaranteed. I would raise an error if the # of rows written from the hive query is more than 25% off the average...(of the last x days?). Not sure how easy that is to do, but we have a Nagios query that does it with Vertica tables, and of course we can share our code.
Flags: needinfo?(scabral)
Thanks Sheeri. I think this would be relatively straightforward to do. You simply make a query `select count(*) from raw_adi where date=%s - interval '2 days'` and then use that number to evaluate how many rows got written and decide whether to roll back and raise an exception.
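An illustrative sketch of that check. The SQL is quoted from the comment above; the function name and the 25% margin are assumptions (the margin is still being discussed below), not Socorro's actual code:

```python
# The reference query quoted in the comment above (psycopg2-style
# %s placeholder); comparing against the same weekday 2 days back.
REFERENCE_QUERY = (
    "select count(*) from raw_adi where date = %s - interval '2 days'"
)

def acceptable(rows_written, reference_count, max_drop=0.25):
    # Accept the new day's count only if it hasn't dropped more than
    # max_drop below the reference day's count; otherwise the caller
    # should roll back and raise. The 25% default is an assumption.
    return rows_written >= reference_count * (1 - max_drop)
```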
Just make it high enough that weekend days still comfortably fit - oh and the days around Christmas and New Year can get pretty low as well.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #4)
> Just make it high enough that weekend days still comfortably fit - oh and
> the days around Christmas and New Year can get pretty low as well.

25% still good?
(In reply to Peter Bengtsson [:peterbe] from comment #5)
> 25% still good?

Not sure, you'd need to look at the existing data to find out.
Matt, is there a way you can figure out how long it takes to fill up whatever store Hive reads from? My concern is what happens if we change the logic so that getting, say, 26% of yesterday's number counts as good enough. The interval for retries is every 5 min. So if we make a query at 08:00 and get 20% of yesterday's number, we then try again 5 minutes later, and at 08:05 the Hive database might be in the midst of being filled up, so on the second attempt it might fetch 26% of yesterday's number and call that good enough. A solution would be to put in a sleep whenever we get too few rows, so that we forcibly don't re-attempt too soon. But to decide on that, we need to know how long it takes to fill up the Hive database with 1 day's worth of logs.
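A hedged sketch of the "sleep before re-attempting" idea. `fetch` stands in for the real Hive query; all names and the one-hour cool-off are assumptions, not what Socorro actually does:

```python
import time

def fetch_with_cool_off(fetch, threshold, max_attempts=3, cool_off=60 * 60):
    # fetch() is a hypothetical callable returning the row count written.
    for attempt in range(max_attempts):
        count = fetch()
        if count >= threshold:
            return count
        # Too few rows: Hive is probably still being filled, so wait
        # long enough for the daily fill to finish instead of
        # re-attempting 5 minutes later against half-written data.
        time.sleep(cool_off)
    raise RuntimeError("row count still below threshold after retries")
```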
Assignee: nobody → peterbe
Flags: needinfo?(mpressman)
It is possible to check with the cluster to determine when the web log processing has completed. IIRC, you want to know when the processing for addons.mozilla.org and blocklist.addons.mozilla.org is completed. For this, you can poll the JobTracker service to determine when those jobs have finished. Each job is named for the domain it is processing and prefixed with "Logs:". So for addons you will want to poll for the job named "Logs: addons.mozilla.org" and for blocklist the job is named "Logs: blocklist.addons.mozilla.org".

FWIW - it doesn't look like they completed today until 08:13:33 UTC, which I believe is after your job would have kicked off.

Let me know if this didn't answer what you were looking for or if I can expand on anything.
Flags: needinfo?(mpressman)
(In reply to Matt Pressman [:mpressman] from comment #8)
> It is possible to check with the cluster to determine when the web log
> processing has completed. iirc, you want to know when the processing for
> addons.mozilla.org and blocklist.addons.mozilla.org are completed. For this,
> you can poll the JobTracker service to determine when those jobs have
> finished. Each job is named for the domain it is processing and prefixed
> with Logs:. So for addons you will want to poll for the job named Logs:
> addons.mozilla.org and for blocklist the job is named Logs:
> blocklist.addons.mozilla.org
>
> fwiw - it doesn't look like they completed today until 08:13:33 UTC which I
> believe is after your job would have kicked off
>
> Let me know if this didn't answer what you were looking for or if I can
> expand on anything

If we can ask "Are you ready?" that would be much better. Can you elaborate on "you can poll the JobTracker"? Is there an HTTP API? Do I use Hive?
Kairo, when we talked about the fluctuations, I took a snapshot of the number of rows written for the last 15 days. Here are the counts of rows per day and their percentage difference from the rolling 15-day average:

211751  13.4%
248572   3.4%
253832   5.4%
239170   0.4%
258437   7.1%
254267   5.6%
221165   8.6%
217950  10.2%
251584   4.6%
253277   5.2%
248482   3.4%
255385   6.0%
251991   4.7%
218689   9.8%
216871  10.7%

I.e. the biggest deviation is 13.4% from one day to another. (I'm guessing that's the difference between a Sunday and a Monday.) With that in mind, I think we should expect 75% of yesterday's count. What do you think?
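For reference, a sketch of one way to compute such a deviation. The comment doesn't show the exact formula behind the 13.4% figure (a true rolling average would give slightly different numbers than the plain window mean used here), so treat this as an approximation:

```python
def max_deviation_pct(counts):
    # Largest percentage difference of any single day's count from the
    # mean of the whole window. This approximates, but may not exactly
    # reproduce, the "rolling 15-day average" figures quoted above.
    avg = sum(counts) / len(counts)
    return max(abs(c - avg) / avg * 100 for c in counts)
```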
Matt, can you elaborate on how, technically, I can poll the JobTracker and what that means?
Flags: needinfo?(mpressman)
Using the JobTracker web UI on the NameNode in the cluster will show activity for the MapReduce jobs on the cluster. It can be accessed at: http://node2.admin.peach.metrics.scl3.mozilla.com:50030/jobtracker.jsp

You can also get JSON output using http://node2.admin.peach.metrics.scl3.mozilla.com:50030/metrics?format=json - however, I believe that only shows the running jobs.

Since you need the web log processing jobs for addons.mozilla.org and blocklist.addons.mozilla.org to be completed before your job runs, you can check for the jobs named "Logs: addons.mozilla.org" and "Logs: blocklist.addons.mozilla.org".
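A hedged sketch of that check. The metrics endpoint's JSON schema isn't documented in this bug, so this only shows the completion predicate, assuming the caller has already extracted the list of currently running job names; all helper names are hypothetical:

```python
# The JSON endpoint mentioned above; only the URL comes from this bug,
# its response schema is not described here.
JOBTRACKER_METRICS_URL = (
    "http://node2.admin.peach.metrics.scl3.mozilla.com:50030/metrics?format=json"
)

# The two web-log jobs that must finish before the ADI fetch runs.
REQUIRED_JOBS = (
    "Logs: addons.mozilla.org",
    "Logs: blocklist.addons.mozilla.org",
)

def log_jobs_still_running(running_job_names):
    # True if either web-log job is still in the running list,
    # meaning the ADI fetch should wait before querying Hive.
    return any(name in running_job_names for name in REQUIRED_JOBS)
```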
Flags: needinfo?(mpressman)
I can access those pages from my laptop once I'm on the VPN but... our crontabber app runs on socorro-adi1.metrics.scl3:

[pbengtsson@socorro-adi1.metrics.scl3 ~]$ curl http://node2.admin.peach.metrics.scl3.mozilla.com:50030/jobtracker.jsp
^C
[pbengtsson@socorro-adi1.metrics.scl3 ~]$ curl http://node2.admin.peach.metrics.scl3.mozilla.com:50030/metrics?format=json
^C

(The ^C is me cancelling after waiting many many seconds.) Any ideas?
Sorry for being a pain, but that webpage is opaque to me. It's the key to making the ADI counts predictable and not having to manually firefight them once a month. Can you point at what I should be looking for to tell if the Hive job is ready to be queried? Are we looking for something that only appears sometimes, or is there a value I can scrape that's always reliably there?
I see a row like this (attached). That's under "Retired Jobs". Am I supposed to scrape that and parse the "State" and "Finish time" columns?
Another interesting thing: today's ADI failed again. Our job ran at around 8 AM UTC and yielded 0 rows, i.e. Hive returned nothing. But the Hive job FINISHED at 08:13 AM (started at 05:05 AM), which is really close to when we read from it. So if our job ran a few minutes after 8 AM but before 08:13 AM, that means Hive returned nothing UNTIL the whole job was finished. That would indicate that this all works atomically, which means we don't run the risk of accidentally reading before it has finished writing and thus getting <100% of that day's records.
My only guess about not being able to access from socorro-adi1.metrics.scl3 is some ACL - I'm sure it could be opened up. The MapReduce jobs with those names run every day, so you can be certain they will be there. As long as a job is not in the list of running jobs by the time you start your checking, you can be pretty certain it has already completed. The web log processing script kicks off at 5 AM UTC. As such, I don't think it matters much that you scrape the date; just ensure the job is not in the list of running jobs. I don't know what the difference between completed jobs and retired jobs is. I *believe* completed jobs are moved to archived jobs as new ones come in, but I will try to find out.
Note-to-self: It's still bad if we read from Hive *whilst* it's being filled up. However, this PR [0] is going to give us some really good mileage, because at least 3 times now what's happened is that we've picked up exactly 0 rows. I'll leave it like that for a couple of months, and if we start seeing ADI fetches of like hundreds or lower thousands, that means we're fetching prematurely and need to punch holes in ACLs and start scraping those URLs that Matt showed above. [0] https://github.com/mozilla/socorro/pull/3121
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/953f9d02ab22b4c4462da9101a636b7163c9ff58
bug 1226345 - fail fetch-adi-from-hive if 0 rows

https://github.com/mozilla/socorro/commit/af11937b898212ae7e2eb2c5f8aa0df551a56348
Merge pull request #3121 from peterbe/bug-1226345-fail-fetch-adi-from-hive-if-0-rows
bug 1226345 - fail fetch-adi-from-hive if 0 rows
That new fetch-adi-from-hive has been deployed on scl3. Let's hope it works from now on.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Peter Bengtsson [:peterbe] from comment #10)
> With that in mind, I think we should expect 75% of yesterday's count. What do
> you think?

If we ever go back to this: I've seen ~71% for Dec 25 when it was during the week (probably worse when it's a Saturday) and ~74% for a Sat vs. Fri in summer, so we'd need a bigger margin (or get more intelligent about the usual weekday-related and seasonal changes). But for now, we settled on 0 vs. something in any case.
And we were able to settle on exactly 0 thanks to Matt working out how atomicity works on that database server.