Now that I'm on UTC time and talking to the tip of Hbase data, I'm noticing fetch failures at :00. At 8:01:30 I ask Hbase for all the data for 8:00. This returns no data which makes my script crash (it recovers). It's only happening on the hour so I suspect an hourly job. Hbase should have written 8:00 data six times by 8:01, and I always wait an extra 30 seconds to give it extra processing time. I can recover so this isn't blocking me.
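For reference, the timing rule described above (wait for the minute to end, then wait 30 more seconds before asking for it) can be sketched roughly as follows. This is only an illustration, not the actual script; the function name `latest_complete_minute` and the `lag_seconds` parameter are made up for the example.

```python
from datetime import datetime, timedelta, timezone


def latest_complete_minute(now, lag_seconds=30):
    """Return the start of the newest minute that ended at least lag_seconds ago.

    Hypothetical helper: the name and the default lag are assumptions taken
    from the description in this comment, not from the real script.
    """
    # Anything that ended after this cutoff is still considered "too fresh".
    cutoff = now - timedelta(seconds=lag_seconds)
    # Floor the cutoff to a minute boundary; the minute that ended at that
    # boundary is the newest one we are allowed to request.
    boundary = cutoff.replace(second=0, microsecond=0)
    return boundary - timedelta(minutes=1)


if __name__ == "__main__":
    now = datetime(2011, 1, 1, 8, 1, 30, tzinfo=timezone.utc)  # arbitrary example date
    print(latest_complete_minute(now).strftime("%H:%M"))
```

With a 30-second lag, a request made at 8:01:30 targets the 8:00 bucket, while one made at 8:01:29 would still target 7:59.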
This complaint is directed toward the research cluster. Do we have an idea of what might specifically be causing it? I am not happy that this is running on research instead of something more dedicated, but we have to make do with what we have for the time being.
Seeing that this happens every hour, I don't think Bugzilla ETL can have anything to do with it: it runs every 5 minutes, usually finishes in less than a minute, and puts very little load on the cluster (we could run up to 16 of those simultaneously). Also, that will be gone from research soon. Looking at the jobtracker, there are several regular Hadoop jobs (aphadke, fligtar) as well. Last thing I can think of: when are the major compactions scheduled?
fligtar and I have quite a few MR jobs that run between 2am and 10am PST, but nothing touches HBase....
But if they put enough load on the cluster, it might cause this type of delay. Okay, thanks for the input. Michael, major compactions are not actually scheduled. They default to 24 hours since the last one, and the first one runs 24 hours after HBase startup. That said, it is always possible they could be playing a part in this. We need to look at the logs and see if there is a correlation.
I left glow on last night and it died at 12:02am because it still couldn't get the data for 12:00. Besides the hourly crashes, I couldn't find data at 4:34, 4:35, 4:36, and 4:42 (usually it's just on the hour). If we can't figure out what's preventing Hbase from writing for more than a minute I'll introduce more lag on the frontend so Hbase has time to catch up.
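One way the frontend could tolerate a late minute instead of dying is to retry the fetch a few times and then skip the bucket. A minimal sketch, assuming a caller-supplied `fetch(minute)` function that returns a possibly empty list of rows; `fetch_with_retry`, `retries`, and `wait_seconds` are hypothetical names, not anything in glow or the HBase client.

```python
import time


def fetch_with_retry(fetch, minute, retries=3, wait_seconds=5.0, sleep=time.sleep):
    """Call fetch(minute) until it returns rows or the retries run out.

    Returns the rows, or None if every attempt came back empty, so the
    caller can skip the minute instead of crashing. All names here are
    assumptions made for illustration.
    """
    for attempt in range(retries + 1):
        rows = fetch(minute)
        if rows:
            return rows
        # Only wait if another attempt is still coming.
        if attempt < retries:
            sleep(wait_seconds)
    return None


if __name__ == "__main__":
    # Fake backend that has no data for the first two requests.
    responses = [[], [], [{"minute": "12:00", "count": 42}]]
    rows = fetch_with_retry(lambda m: responses.pop(0), "12:00", sleep=lambda s: None)
    print(rows)
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays; the total extra wait is bounded by `retries * wait_seconds`.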
The lag was due to the timeouts configured in SQLStream before it looks for a new log file at the top of every hour. Those were lowered to levels that should be appropriate to prevent any top of the hour disruptions (one retry, 5 second wait = 10 seconds lag at top of every hour). We should be able to validate this in 30 minutes or any top of hour thereafter.
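The 10-second figure above is consistent with the initial attempt and the single retry each waiting the full 5 seconds before the next log file is picked up. That reading is an assumption, and the helper below is purely illustrative; it does not reflect SQLStream's actual configuration keys or internals.

```python
def worst_case_lag_seconds(retries, wait_seconds):
    """Upper bound on the top-of-hour lag.

    Assumes (not confirmed by SQLStream documentation) that the first
    attempt and every retry each wait the full wait_seconds.
    """
    attempts = retries + 1  # the initial attempt plus each retry
    return attempts * wait_seconds


if __name__ == "__main__":
    print(worst_case_lag_seconds(retries=1, wait_seconds=5))
```

Under that assumption the new settings stay well inside the 30 seconds of slack the frontend already allows.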
This didn't fail at 4:00, calling it.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED