Open Bug 1861789 (bugbot-auto-crash) Opened 7 months ago Updated 5 hours ago

[meta](BugBot) Automatic filing of new crashes

Categories

(Developer Infrastructure :: Source Code Analysis, enhancement)

Tracking

(Not tracked)

People

(Reporter: suhaib, Unassigned)

References

(Depends on 134 open bugs)

Details

(Keywords: meta)

This meta bug is used to track the bugs that are automatically filed by BugBot.

How is it determined that these crashes are "actionable"?

(In reply to Timothy Nikkel (:tnikkel) from comment #1)

How is it determined that these crashes are "actionable"?

There is a set of rules the bot follows to filter out crashes, most of them listed in the "Identify new actionable crashes" section of https://docs.google.com/document/d/1iZiM0eSSEXg0q_pLaUiHilzJaSc6KnG-CcM9-BC2g_Y/edit. Crashes matching the following are not automatically filed:

  • Hardware failure crashes
  • Out-of-memory crashes
  • Crashes reported from the same person
  • Shutdown hangs
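
As a rough illustration of the exclusion list above, here is a minimal sketch. The `CrashSummary` class, its field names, and `exclusion_reason` are hypothetical and are not taken from BugBot's actual code; how the bot detects each condition is not described in this bug.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrashSummary:
    """Hypothetical per-signature summary; field names are assumptions."""
    signature: str
    hardware_failure: bool = False
    out_of_memory: bool = False
    shutdown_hang: bool = False
    distinct_reporters: int = 0


def exclusion_reason(crash: CrashSummary) -> Optional[str]:
    """Return why a crash would be skipped, or None if no listed rule applies."""
    if crash.hardware_failure:
        return "hardware failure"
    if crash.out_of_memory:
        return "out of memory"
    if crash.distinct_reporters <= 1:
        return "reported from the same person"
    if crash.shutdown_hang:
        return "shutdown hang"
    return None
```

A crash for which `exclusion_reason` returns `None` would only be a candidate for filing; any further volume or signal checks are a separate step.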

We are planning to adjust the rules after the initial rollout, depending on what we see. We have already tuned them on Bugzilla Dev with the help of some engineers, but with a wider rollout I'm sure we'll find more things.
If you already have some suggestions based on the bugs you have seen so far, we are happy to add more rules or tune the existing ones! Also let us know if you see any bug that shouldn't have been filed, and we can try to tune the rules ourselves to make sure similar bugs don't get autofiled in the future.

Most of the bugs I'm seeing filed are very low volume. Engineers aren't going to spend time on them; it just wastes resources having to triage them.

There's only about a dozen bugs depending on this, pretty easy to skim through them to see.

(In reply to Timothy Nikkel (:tnikkel) from comment #4)

There's only about a dozen bugs depending on this, pretty easy to skim through them to see.

Suhaib does not regularly triage platform bugs, so more specific feedback from somebody like you who does would probably be more actionable. I don't think there's been any feedback from anybody who does a lot of graphics triage, so I'm sure that would be a welcome perspective.

I looked over all of the bugs reported so far, and the reporting threshold for bugs without anything to make them particularly actionable seems rather low. Specifically, there are a bunch of crashes in graphics drivers (bug 1862276, bug 1862279, bug 1862280, bug 1862283) that have only happened a couple of times in the last week on Nightly, so I don't think they should be filed. A few of them only happened once. Bug 1862347 is another crash in graphics drivers, though that one is a bit odd because it is a null pointer crash, but it is still happening in graphics code with very low volume, so the odds of us being able to take action to fix it are very low.

Bug 1862348 has only happened twice in the last week on Nightly, and is not a null crash or an assert, so it also seems like it would have been better not to file it.

Bug 1862284 and bug 1862349 also have only happened a few times each, but they are release asserts in our code, so there might be something actionable; it looks reasonable to me (as a non-graphics person) that bugs were filed for them.

See Also: → 1863189

(In reply to Marco Castelluccio [:marco] from comment #2)

There is a set of rules the bot follows to filter out crashes, most of them listed in the "Identify new actionable crashes" [...] Crashes matching the following are not automatically filed:

  • Hardware failure crashes
  • Crashes reported from the same person

Counterexamples where the bot might not be applying these properly:

  • In bug 1863178, two out of the four crashes look to be from the same person, and potentially due to hardware issues (per the main paragraph in bug 1863178 comment 3)
  • In bug 1863177, eight out of the ten crashes look to be from the same person who was apparently testing a local web service on 127.0.0.1. (see bug 1863177 comment 2.)

(I'm not sure if BugBot mistakenly interpreted this^ crash volume as contributing to the "signal" here; it looks superficially like it did, but I suppose it might have been aware of these same-user/possible-hardware-issue factors and still filed because it decided that the remaining (small) crash volume was still enough to nudge it to file a bug.)

(In reply to Daniel Holbert [:dholbert] from comment #6)

Thank you for the feedback, this will help us improve the bot.

Counterexamples where the bot might not be applying these properly:

  • In bug 1863178, two out of the four crashes look to be from the same person, and potentially due to hardware issues (per the main paragraph in bug 1863178 comment 3)

The bot ignores crashes that happen on only one installation. In other cases, where the crash does not show other interesting signals (e.g., security-related, crash on null, has potential regressor), the threshold is increased to crashes > 25 and installations > 5 (we are working on tuning these thresholds; they will probably be increased). In the case of bug 1863178, it is a null crash and has a potential regressor (the regressor is a false positive in this case). By checking the number of installations, it seems that the crash happened on more than one installation.
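
To make the decision described above easier to follow, here is a minimal sketch of it. The `SignatureStats` class, its field names, and `should_file` are hypothetical illustrations, not BugBot's real implementation; only the numbers 25 and 5 and the three example signals come from this comment.

```python
from dataclasses import dataclass


@dataclass
class SignatureStats:
    """Hypothetical per-signature statistics; field names are assumptions."""
    crashes: int
    installations: int
    security_related: bool = False
    null_or_near_null: bool = False
    has_potential_regressor: bool = False


def should_file(stats: SignatureStats,
                min_crashes: int = 25,
                min_installations: int = 5) -> bool:
    """Sketch of the filing decision as described in this comment."""
    # A crash seen on only one installation is never filed.
    if stats.installations <= 1:
        return False
    # An interesting signal lets the crash skip the volume threshold.
    if (stats.security_related
            or stats.null_or_near_null
            or stats.has_potential_regressor):
        return True
    # Otherwise both volume thresholds must be exceeded.
    return stats.crashes > min_crashes and stats.installations > min_installations
```

For example, a signature with four crashes on an assumed two installations, a null crash, and a potential regressor passes this check, which matches the outcome described for bug 1863178 (the exact installation count is not stated here).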

To ignore bad hardware related crashes, there are multiple criteria; the bit flip is one of them. These criteria focus on reports seen in the last two weeks. In the case of bug 1863178, none of the crash reports in the last two weeks have "Possible bit flips max confidence" (the mentioned reports are a bit older).
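
The two-week window is what matters in the bug 1863178 case. Here is a minimal sketch of that windowing, assuming a hypothetical `CrashReport` record with a `bit_flip_confidence` field. How the bot actually reads the annotation, and how it combines it with the other hardware criteria, is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class CrashReport:
    """Hypothetical crash report record; field names are assumptions."""
    crash_time: datetime
    # None means the report carries no bit-flip annotation.
    bit_flip_confidence: Optional[int] = None


def recent_bit_flip_reports(reports: List[CrashReport],
                            now: datetime,
                            window: timedelta = timedelta(days=14)) -> List[CrashReport]:
    """Return reports inside the window that carry a bit-flip annotation."""
    cutoff = now - window
    return [
        r for r in reports
        if r.crash_time >= cutoff and r.bit_flip_confidence is not None
    ]
```

With this windowing, an annotated report older than two weeks does not count toward the hardware criteria, which is why older bit-flip reports would not have stopped the bug from being filed.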

  • In bug 1863177, eight out of the ten crashes look to be from the same person who was apparently testing a local web service on 127.0.0.1. (see bug 1863177 comment 2.)

(I'm not sure if BugBot mistakenly interpreted this^ crash volume as contributing to the "signal" here; it looks superficially like it did, but I suppose it might have been aware of these same-user/possible-hardware-issue factors and still filed because it decided that the remaining (small) crash volume was still enough to nudge it to file a bug.)

The fact that 8 out of 10 crashes happened on null or near null qualifies the crash to skip the volume threshold. However, in any case, no bug will be filed for a crash that happens on only one installation. In the case of bug 1863177, the crash happened on 8 different installations in the last two weeks.

Thanks for the clarifications.

(In reply to Suhaib Mujahid [:suhaib] from comment #7)

The fact that 8 out of 10 crashes happened on null or near null qualifies the crash to skip the volume threshold.

FWIW the "crash near null" here is in fact just a MOZ_RELEASE_ASSERT failing, in that particular case. (Maybe it makes sense to consider those as the same thing.)

In the case of bug 1863177, the crash happened on 8 different installations in the last two weeks.

I'm not sure how we identify installations, but maybe it's too strict? It's pretty clear to me that those 8 different crashes were all on the same installation, for some definition of installation -- they all have an identical "install time" field (2023-10-31 19:20:56), identical hardware and OS metadata fields, and near-identical URLs (all 127.0.0.1; maybe a fuzzer or something else with local testing).

(Edit: I'm aware that the URL is a protected/hidden field in the crash report, and the bot might not have access to it; I just mention it since [for those with access] it's just a point of additional confirmation, on top of what is already pretty-clear-evidence IMO that those 8 crashes look like the same installation.)
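
To illustrate the looser definition of "installation" suggested above, here is a minimal sketch that groups reports by install time plus hardware and OS metadata. The `CrashReport` fields and the choice of key are assumptions made for illustration; this is not how the bot or crash-stats currently identifies installations, and the protected URL field is deliberately left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CrashReport:
    """Hypothetical crash report record; field names are assumptions."""
    crash_id: str
    install_time: str
    os_version: str
    cpu_info: str
    adapter_device_id: str


def group_by_probable_installation(
        reports: List[CrashReport]) -> Dict[Tuple[str, str, str, str], List[str]]:
    """Group crash IDs whose install time and hardware/OS metadata all match."""
    groups: Dict[Tuple[str, str, str, str], List[str]] = defaultdict(list)
    for r in reports:
        key = (r.install_time, r.os_version, r.cpu_info, r.adapter_device_id)
        groups[key].append(r.crash_id)
    return dict(groups)
```

Under this heuristic, eight reports sharing one install time and identical metadata collapse into a single group, so the installation count would be the number of groups rather than the number of reports. Two genuinely different machines with identical metadata and install time would be merged too, so this is only an approximation.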

No longer depends on: 1869140
Depends on: 1869140
No longer depends on: 1882933
Depends on: 1882933