Closed Bug 738394 Opened 12 years ago Closed 8 years ago

crash ratio views do not match UI due to queries by operating system

Categories

(Socorro :: Database, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: pclemente, Unassigned)

References

Details

What differences are you seeing? One thing could be that we specifically see to not double-count hangs in the results on crash-stats (i.e. include only one hang-pair half) but other than that, count all different types of crashes and hangs for all process types. If you tell us a few numbers to compare, it's probably easier to get a hold on how the differences come to exist.

Unfortunately I haven't got IT to install postgresql support on the machine I use for my scripts, so I can't run the queries myself and see what's going on.
Well, Paula's query isn't filtering by OS, which the daily graph does, so it excludes "Unknown" OS.  However, even when I filter one which should match the source exactly, I get a higher total from the database than I do from the graph, by about 5%.

I suspect that the graph is filtering on some other criteria, such as excluding plugin crashes.  So I need to know what query it's generating ...
In the code, for the ADUs:
      select
          adu_date as date,
          substring(os_name, 1, 3) as product_os_platform,
          sum(adu_count)::BIGINT
      from
          product_adu pa
      join product_info pi using (product_version_id)
      where
          %%(start_date)s <= adu_date
          and adu_date <= %%(end_date)s
          and pi.product_name = %%(product)s
          and pi.version_string = %%(version)s
          %(os_phrase)s
      group by
          date,
          product_os_platform
      order by
          1
and for the crashes:
      SELECT adu_day::DATE, os_short_name, SUM(count)
      FROM daily_crashes
      WHERE timestamp with time zone %%(start_date)s <= adu_day AND
            adu_day <= timestamp with time zone %%(end_date)s AND
            productdims_id = %%(productdims_id)s AND
             %(os_phrase)s AND
             %(report_type_phrase)s
      GROUP BY adu_day, os_short_name
      order by
          1, 2

(Obviously these are parameterised, ask me if you have further questions or need more detail)
yeah, can you look into what "report_type_phrase" is if report_type=any in the URL?
Ok, and re plugins, it depends on the parameters, but the default is to include all types of reports (to wit: crash, oopp, hang_browser, and hang_plugin).
specifically, with report_type any:
      # anything other than 'crash' or 'hang' will return all crashes
      # hang normalized are avoided so as not to count some hang ids multiple times
      parameters.report_type_phrase = "report_type IN ('%s', '%s', '%s', '%s')" % (
        adu_codes.CRASH_BROWSER,
        adu_codes.HANG_PLUGIN,
        adu_codes.CONTENT,
        adu_codes.OOP_PLUGIN,
      )
Unfortunately, I need the exact SQL syntax.  Because the UI is filtering out a few of the crashes which the database view includes.
(In reply to Laura Thomson :laura from comment #5)
> Ok, and re plugins, it depends on the parameters, but the default is to
> include all types of reports (to wit: crash, oopp, hang_browser, and
> hang_plugin).

As the snippet you pasted shows (and as I mentioned in comment #1), this is not completely true, and we *don't* include hang_browser. That is to not count both sides of hang pairs but only one side (and for some reason, we are using the plugin side - in practice, it shouldn't matter which one we use, as it's always pairs).
Hmmm.  This may need to wait for the new DevDB to be ready so that I can do an isolated test and log it.
Hi,

Any ETA for this?

Thanks.
No, we're still hung up on our lack of dev infrastructure so I can log what query the UI is running instead of what query it's *supposed* to be running.
(In reply to [:jberkus] Josh Berkus from comment #11)
> No, we're still hung up on our lack of dev infrastructure so I can log what
> query the UI is running instead of what query it's *supposed* to be running.

I just logged the query that the aduByDay service runs from my dev instance, pointing to the staging DB - does this help?:

SELECT adu_date as date,          substring(os_name, 1, 3) as product_os_platform,          sum(adu_count)::BIGINT       from      
          product_adu pa      join product_info pi using (product_version_id)          where          '2012-03-21T00:00:00+00:00' <= adu_date          and adu_date <= '2012-04-04T00:00:00+00:00'  
          and pi.product_name = E'Firefox'
          and pi.version_string = E'11.0'
          and os_name in ('Windows','Mac OS X','Linux')
      group by
          date,
          product_os_platform
      order by
          1

SELECT adu_day::DATE, os_short_name, SUM(count)
      FROM daily_crashes
      WHERE timestamp with time zone '2012-03-21T00:00:00+00:00' <= adu_day AND
            adu_day <= timestamp with time zone '2012-04-04T00:00:00+00:00' AND
            productdims_id = 900 AND
             os_short_name in ('Win','Mac','Lin') AND
             report_type IN ('C', 'p', 'T', 'P')
      GROUP BY adu_day, os_short_name
      order by
          1, 2
Yes, that's what I needed.
paula,

So the problem is that the UI is not counting hang crashes, whereas the query I've given you *does* count them.  So the question is: should we be counting hang crashes, or not?
Josh, the UI *is* counting hangs, but as I said twice in here already, only one half of the hang pairs. And exactly that is the right thing to do.
I can certainly omit hang-broswer from the metrics view.  Note, however, that this gives a different total of crashes than TCBS does, since TCBS does count browser hangs.

Also, note that "hang pairs" are not always pairs.  Sometimes they're triads (or more), and sometimes they're singletons.
(In reply to [:jberkus] Josh Berkus from comment #16)
> I can certainly omit hang-broswer from the metrics view.  Note, however,
> that this gives a different total of crashes than TCBS does, since TCBS does
> count browser hangs.

Good point. Still, this is intended in this case. The metrics view should only count one hang side.

> Also, note that "hang pairs" are not always pairs.  Sometimes they're triads
> (or more), and sometimes they're singletons.

I doubt that there are triads or more, and if they are, I want to know because those are surely bugs and we should file bugs against the plugin hang mechanism. Singletons happen, esp. through throttling, but rarely where not throttled, and there they are bugs as well. In any case, it's quite rare to have non-pairs by something else than throttling, and in the throttling case, we *statistically* still get the same amount of reports for both plugin and browser halves, so we are safe with counting only one side. Counting both halves of a pair is inflation we don't want to have in metrics.
Paula,

Ok, the views have been updated on Stage.  Give them a try now.
Oh.  What should I be doing with crashes with "unknown" OS?  I'd prefer to count them in the totals, since that's accurate, but I don't really know what Metrics wants.
Hi jberkus,

I can't see any of the views on Stage: product_crash_ratio, product_os_crash_ratio
(In reply to [:jberkus] Josh Berkus from comment #19)
> Oh.  What should I be doing with crashes with "unknown" OS?  I'd prefer to
> count them in the totals, since that's accurate, but I don't really know
> what Metrics wants.

Please, count them in the totals.
(In reply to Paula Clemente [:paula] from comment #20)
> Hi jberkus,
> 
> I can't see any of the views on Stage: product_crash_ratio,
> product_os_crash_ratio

Are you connecting to the new StageDB?  soccorro1.stage.db.phx1?  Your connection *should* work with that server.  If it doesn't, please check with mpressman & Jason.
(In reply to [:jberkus] Josh Berkus from comment #22)
> (In reply to Paula Clemente [:paula] from comment #20)
> > Hi jberkus,
> > 
> > I can't see any of the views on Stage: product_crash_ratio,
> > product_os_crash_ratio
> 
> Are you connecting to the new StageDB?  soccorro1.stage.db.phx1?  Your
> connection *should* work with that server.  If it doesn't, please check with
> mpressman & Jason.


mpressman: 
Can you help me?
We need the same as in: https://bugzilla.mozilla.org/show_bug.cgi?id=687906#c13

[@cm-metricsapp01]$ nc -zv socorro1.zlb.db.sjc1.mozilla.com 6432
Connection to socorro1.zlb.db.sjc1.mozilla.com 6432 port [tcp/*] succeeded!

[@cm-metricsapp01]$ nc -zv socorro1.stage.db.phx1.mozilla.com 6432
nc: connect to socorro1.stage.db.phx1.mozilla.com port 6432 (tcp) failed: Connection timed out
Comparing bug 733489 query numbers with /daily numbers, I see there's still a slight difference in that queries still have slightly higher numbers.

I now found out where those differences come from, and the key is listed right here in the bug:

(In reply to Robert Helmer [:rhelmer] from comment #12)
>              os_short_name in ('Win','Mac','Lin') AND

The queries include all OSes, the /daily view only the 3 major ones. Now, that small difference should usually not matter for us, but it's good to know where it comes from.
IMHO, the error is in /daily now, it should include crashes and ADUs from OSes other than the 3 major ones. Metrics total numbers should as well. When we slit OS numbers, we might need to drag those along as "Other" or so.
Depends on: 755610
Kairo -- is this still an issue?
Flags: needinfo?(kairo)
Comment #24 is still true.

The issue around hangs is gone due to not having pairs any more, though.
Flags: needinfo?(kairo)
Summary: crash ratio views do not match UI → crash ratio views do not match UI due to queries by operating system
Assignee: nobody → sdeckelmann
Assignee: sdeckelmann → nobody
We are not going to fix the issues from comment 24
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.