Closed Bug 1276690 Opened 9 years ago Closed 9 years ago

Data returned from Elasticsearch is inconsistent

Categories

(Socorro :: Infra, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: adrian, Assigned: dmaher)

We are currently seeing numbers that do not make sense in Socorro, on pages pulling data from Elasticsearch. The magic number is 20%: everything is off by a margin of about 20%. Let's look, for example, at crash reports for Firefox 49.0a1 on May 29, 2016.

# Super Search
https://crash-stats.mozilla.com/search/?product=Firefox&version=49.0a1&date=%3E%3D2016-05-29&date=%3C2016-05-30 says there are 457 Results, but only shows 358 actual results (go to page 8 to verify).

# Crashes per day
https://crash-stats.mozilla.com/crashes-per-day/?p=Firefox says there have been 358 crash reports, which is consistent with the previous number. But if you look at the PSQL-based version...

# Daily crashes
https://crash-stats.mozilla.com/daily?p=Firefox says there have been 449, which matches the theoretical results number of Super Search (ES is expected to have slightly more results than PSQL).

There is something fishy happening here, and I am failing to find the culprit. I believe the data is in Elasticsearch, because neither datadog nor our processor logs show any sign that we are losing data. That would also explain why Super Search is able to give a correct total number even though it cannot return all the results.

My guess so far is that something is wrong in our ES cluster, and some parts of it are failing to return data for requests. Looking at our AWS console, I have been unable to see anything wrong in our hosts or in the load balancer.

I am bouncing this off to you, JP, in the hope that you'll have some clues about what's wrong. CC'ing phrawzty too, because he has knowledge of both ES and our setup.
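One generic way to test the "some parts of the cluster are failing to return data" guess is to look at the `_shards` section that Elasticsearch includes in search responses, which reports `total`, `successful` and `failed` shard counts. The following is only a sketch: nothing in this bug confirms that shard failures were involved, and the response dict in the usage note is made up for illustration.

```python
def shard_report(es_response):
    """Summarise the `_shards` section of a raw Elasticsearch search response.

    `es_response` is the decoded JSON body of a search request. The result is
    flagged as partial when not every shard answered successfully.
    """
    shards = es_response.get("_shards", {})
    total = shards.get("total", 0)
    successful = shards.get("successful", 0)
    failed = shards.get("failed", 0)
    return {
        "total": total,
        "successful": successful,
        "failed": failed,
        "partial": failed > 0 or successful < total,
    }
```

For example, `shard_report({"_shards": {"total": 5, "successful": 4, "failed": 1}})` flags the response as partial, which would be a hint that the hits cover only part of the index.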
the situation seems to have improved in the meantime.
sorry, i spoke too soon - today's data at https://crash-stats.mozilla.com/crashes-per-day/?p=Firefox is way off again.
Assignee: nobody → dmaher
Status: NEW → ASSIGNED
I wrote a script that crawls the SuperSearch API with pagination. It then compares how many rows it gets in total compared to what the "total" is supposed to be, for one day at a time (with product=Firefox, version=49.0a1). https://gist.github.com/peterbe/5d66e31c56992c3205115260580e625e The conclusion is that the problem started appearing on the 23rd of May.
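For readers who do not want to open the gist, the idea can be sketched roughly as below. This is not the script from the gist. It assumes the SuperSearch API returns JSON with `total` and `hits` keys and accepts `_results_number` and `_results_offset` for pagination; `count_rows`, `fetch_page` and `page_size` are names invented for this sketch.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://crash-stats.mozilla.com/api/SuperSearch/"


def fetch_page(params):
    """Fetch one page of Super Search results and return the decoded JSON."""
    query = urllib.parse.urlencode(params, doseq=True)
    with urllib.request.urlopen(API_URL + "?" + query) as response:
        return json.load(response)


def count_rows(day, next_day, fetch=fetch_page, page_size=100, **filters):
    """Page through one day of results.

    Returns (rows_received, reported_total). If the two differ, the API is
    advertising more results than it actually hands out.
    """
    received = 0
    total = 0
    while True:
        params = dict(filters)
        params["date"] = [">=" + day, "<" + next_day]
        params["_results_number"] = page_size   # assumed pagination parameter
        params["_results_offset"] = received    # assumed pagination parameter
        page = fetch(params)
        total = page["total"]
        hits = page["hits"]
        if not hits:
            break  # nothing more to fetch, even if total says otherwise
        received += len(hits)
        if received >= total:
            break
    return received, total
```

Calling `count_rows("2016-05-29", "2016-05-30", product="Firefox", version="49.0a1")` once per day and comparing the two numbers gives the per-day discrepancy. The `fetch` argument exists only so the loop can be exercised without hitting the live API.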
Since we have weekly indices, it is possible that the problem started appearing at any time during last week but impacted the entire index, which would explain why we see lower numbers all the way back to the 23rd. That would be consistent with what our users told us about the problem starting last Saturday.

It seems this week's index is even more impacted, as I see even more worrying discrepancies in yesterday's data.

Firefox 46.0.1, 2016-05-30: Super Search -> 100,056 documents | 60,032 results in facet -- https://crash-stats.mozilla.com/api/SuperSearch/?product=Firefox&version=46.0.1&date=%3E%3D2016-05-30&date=%3C2016-05-31&_facets=product&_results_number=0

This is ~40% off, twice as much as for the previous week.
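The percentages quoted in this bug can be reproduced with a few lines of arithmetic. The sketch below assumes the API response carries a `total` key and a `facets` dict mapping each field to a list of `{"term", "count"}` buckets; the function names are made up for illustration, and the sample response only mirrors the numbers quoted above.

```python
def shortfall(reported_total, returned):
    """Fraction of the reported total that is missing from what was returned."""
    if reported_total == 0:
        return 0.0
    return (reported_total - returned) / reported_total


def facet_count(response, field):
    """Sum the bucket counts of one facet in a Super Search style response."""
    return sum(bucket["count"] for bucket in response["facets"].get(field, []))


# Sample response shaped after the numbers quoted for Firefox 46.0.1, 2016-05-30.
sample = {
    "total": 100056,
    "facets": {"product": [{"term": "Firefox", "count": 60032}]},
}

print(round(shortfall(sample["total"], facet_count(sample, "product")) * 100, 1))  # 40.0
print(round(shortfall(457, 358) * 100, 1))  # 21.7 (Firefox 49.0a1, 2016-05-29)
```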
The problem is resolved for now. It was caused by our cluster holding too much data in memory. Tomorrow we will work on configuring the cluster to prevent this problem from happening again.
Blocks: 1277239
No longer blocks: 1277239
Depends on: 1277239
the data is acting weird again today - when i use this query:
https://crash-stats.mozilla.com/search/?product=Firefox&version=47.0b9&process_type=browser&_facets=signature&_facets=user_comments&_facets=version&_facets=adapter_vendor_id&_facets=build_id&_facets=platform_pretty_version&_facets=install_time&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
it results in the top signature OOM | small 3845 5.51 %

when i strip all the other search parameters and just use
https://crash-stats.mozilla.com/search/?product=Firefox&version=47.0b9&process_type=browser
the result is OOM | small 4947 7.08 %

(its share would have been > 8% consistently the last time that things worked correctly)
(In reply to [:philipp] from comment #6)
> the data is acting weird again today - when i use this query:
> https://crash-stats.mozilla.com/search/?product=Firefox&version=47.0b9&process_type=browser&_facets=signature&_facets=user_comments&_facets=version&_facets=adapter_vendor_id&_facets=build_id&_facets=platform_pretty_version&_facets=install_time&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
> it results in the top signature OOM | small 3845 5.51 %
>
> when i strip all the other search parameters and just use
> https://crash-stats.mozilla.com/search/?product=Firefox&version=47.0b9&process_type=browser
> the result is OOM | small 4947 7.08 %
>
> (its share would have been > 8% consistently the last time that things
> worked correctly)

I ran the same script as I used in comment #3 and it's not showing any discrepancy. I.e. the same amount of data as the totals is returned.
I believe this is fixed now. We have filed follow-up bugs to ensure it won't happen again.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
See Also: → 1282140
See Also: → 1288179