Closed
Bug 1229366
Opened 9 years ago
Closed 8 years ago
sg and input db don't agree on hb scores
Categories
(Input :: General, defect, P1)
Input
General
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: willkg, Assigned: glind)
References
Details
(Whiteboard: [2016-GBT-Y])
Attachments
(1 file)
99.75 KB,
image/png
Gregg has found one or more instances where SG has an entry for someone who voted, but when we look at the Input db, we have no score for that person. This bug covers looking into that and fixing it.
Reporter
Comment 1•9 years ago
Grabbing this to work on now and making it a P1 fire.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P1
Reporter
Comment 2•9 years ago
I looked at the db data and I see heartbeat answers with scores over the last month, so I think we can rule out "we *never* save the score". I can't tell if there's a point in time where the percent of hb data with scores dips. I didn't spend much time on this, though.

I checked the error log and there's nothing related here. However, the error log only seems to be capturing empty packet errors and invalid data errors. It's not capturing anything related to updated_ts checks.

I re-read the code and related tests. The updated_ts check itself seems fine. However, the code isn't pinned to the master db when it pulls the record, checks it and then saves. We've got multiple database servers and multiple webheads, so it's possible for the check to run against stale data read from a replica. Further, we get so much HB data, and packets for a flow arrive in such rapid succession, that the likelihood of this happening may be high enough to matter.

Having said that, I'm having serious deja vu right now so either I looked at all this before or I is crazy crazy.

I'm going to tweak the logging and add some statsd calls so I can see frequencies of things over time. Further, I'll look into reworking the pinning and transaction handling.
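To make the suspected race concrete, here is a minimal simulation. It is not the Fjord code: the `Answer` class, `handle_packet`, `simulate`, and the flow id are all hypothetical names, and the replica is modeled as a dict that never catches up. The only thing it tries to show is that an updated_ts check is only as good as the data it reads.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Answer:
    # Hypothetical stand-in for a heartbeat answer row.
    updated_ts: int
    score: Optional[int]  # None means "no score recorded"


def handle_packet(packet: Answer, read_db: Dict[str, Answer],
                  write_db: Dict[str, Answer], flow_id: str) -> str:
    """Save the packet only if it is newer than the row we *read*."""
    current = read_db.get(flow_id)
    if current is not None and packet.updated_ts <= current.updated_ts:
        return "oldtimestamp"
    write_db[flow_id] = packet  # writes always go to the master
    return "success"


def simulate(read_from_master: bool) -> Optional[int]:
    """Deliver a newer packet, then a late older one, for the same flow."""
    master: Dict[str, Answer] = {}
    replica: Dict[str, Answer] = {}  # assumption: replication lags forever
    read_db = master if read_from_master else replica

    newer = Answer(updated_ts=2, score=5)
    older = Answer(updated_ts=1, score=None)

    handle_packet(newer, read_db, master, "flow-1")
    handle_packet(older, read_db, master, "flow-1")
    return master["flow-1"].score
```

With `simulate(read_from_master=False)` the late packet passes the check against the stale replica and the stored score ends up as `None`. With `simulate(read_from_master=True)` the late packet is rejected and the score stays `5`.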
Reporter
Comment 3•9 years ago
I added some db pinning to make sure the code executes on the master database. Further, I added some statsd calls so that we can measure some scenarios better over time. In a PR: https://github.com/mozilla/fjord/pull/716
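For readers who don't want to open the PR, the general shape of "pin to master and count outcomes" might look like the sketch below. This is not the code from the PR: `pinned_to_master`, `db_for_read`, `save_answer`, the stat key names, and the in-memory `stats` counter (a stand-in for a statsd client) are all assumptions made for illustration.

```python
import threading
from collections import Counter
from contextlib import contextmanager

_local = threading.local()
stats = Counter()  # stand-in for statsd; key names are hypothetical


@contextmanager
def pinned_to_master():
    """Mark the current thread so reads are routed to the master db."""
    previous = getattr(_local, "pinned", False)
    _local.pinned = True
    try:
        yield
    finally:
        # Restore the earlier state so nested use behaves sensibly.
        _local.pinned = previous


def db_for_read() -> str:
    """Router hook: pick the database alias that serves a read."""
    return "master" if getattr(_local, "pinned", False) else "replica"


def save_answer(store: dict, flow_id: str, updated_ts: int, score) -> bool:
    """Read, check updated_ts and write, all while pinned to master.

    `store` is a dict standing in for the master database table.
    """
    with pinned_to_master():
        current = store.get(flow_id)
        if current is not None and updated_ts <= current[0]:
            stats["heartbeat.oldtimestamp"] += 1
            return False
        store[flow_id] = (updated_ts, score)
        stats["heartbeat.success"] += 1
        return True
```

Counting both outcomes is what makes it possible to graph successes against oldtimestamp rejections over time.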
Reporter
Comment 4•9 years ago
Landed in https://github.com/mozilla/fjord/commit/e9f5b9961be87af1c74ac8dc6f1cdd0386877ae8

I can't test this anywhere except for production, so we're going to have to test it live.

The hypothesis here is that because we weren't pinning the thread to the master db, it was possible for the thread executing the code to look at one of the db replicas and possibly be looking at stale data. Coupled with multiple webheads handling packets simultaneously, I think it's possible that older data could stomp on newer data.

If the hypothesis is true, then it's always been a problem and all our heartbeat scores are under-reported. I have no idea how to estimate by how much. It's probably the case that the under-reporting varies depending on the server load and other environmental factors.

Pushed to prod. It looks like it didn't create new issues, but I can't see data from graphite, so I can't see any of the statsd stuff I did. :(

I'll wait on this until I can see data in graphite. Further, this is almost certainly not the cause of the problem where Input isn't getting scores for any Firefox version 42. So keeping this open for a while.
Reporter
Comment 5•8 years ago
I can see graphite data again. On Monday, I'll check whether anything is looking weird and resolve this accordingly.
Reporter
Comment 6•8 years ago
Attached graph: success, empty packet, and oldtimestamp data for the month.
Reporter
Comment 7•8 years ago
Gregg: Check out the graph in comment #6. That's a lot of oldtimestamp errors. I can't tell if that's expected or fishy, though.
Flags: needinfo?(glind)
Reporter
Comment 8•8 years ago
I think there isn't anything more I can do here. I'm unassigning myself.
Assignee: willkg → nobody
Status: ASSIGNED → NEW
Assignee
Comment 9•8 years ago
1. We understand the no votes for 42, which is a Firefox bug.
2. Re: old packet timestamps: 10% (see comment #6) seems like 'a lot'. No clear hypothesis for why. Possible that `updated_ts` is being retained.

WONTFIX because this won't be an issue in unified telemetry.
Assignee: nobody → glind
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(glind)
Resolution: --- → WONTFIX
Whiteboard: [2016-GBT-Y]
Assignee
Comment 10•8 years ago
As near as I can tell, this is actually fixed. Running

[glind@sumotools1.webapp.phx1 ~]$ mysql --login-path=input -D input_mozilla_org_new -e "select count(*) from heartbeat_answer"

gives consistent answers.
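A row count alone doesn't say whether the rows carry scores. A slightly more targeted check could count rows with and without a score, as sketched below. This runs against an in-memory SQLite table so it is self-contained. The `heartbeat_answer` table name comes from the command above, but the `flow_id` and `score` columns, the sample rows, and the `score_coverage` helper are assumptions, not the real Input schema.

```python
import sqlite3


def score_coverage(conn):
    """Return (total rows, rows with a score, rows missing a score)."""
    # COUNT(score) skips NULLs, so the difference is the missing-score count.
    total, with_score = conn.execute(
        "SELECT COUNT(*), COUNT(score) FROM heartbeat_answer"
    ).fetchone()
    return total, with_score, total - with_score


# Hypothetical sample data standing in for the production table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heartbeat_answer (flow_id TEXT, score REAL)")
conn.executemany(
    "INSERT INTO heartbeat_answer (flow_id, score) VALUES (?, ?)",
    [("a", 5.0), ("b", None), ("c", 3.0)],
)
```

Calling `score_coverage(conn)` on this sample data gives a total of 3 rows, 2 with a score and 1 without. The same `COUNT(*)` / `COUNT(column)` pattern works in MySQL.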
Resolution: WONTFIX → WORKSFORME