Closed Bug 1309112 Opened 3 years ago Closed 3 years ago

Detect and linkify GitHub issue in comment

Categories

(bugzilla.mozilla.org :: General, defect)

Production
defect
Not set

Tracking

()

VERIFIED FIXED

People

(Reporter: xidorn, Assigned: seban)

Details

Attachments

(1 file)

44 bytes, text/x-github-pull-request
dylan
: review+
Details | Review
Nowadays many Mozilla projects are developed on GitHub. It would be helpful if Bugzilla can linkfy GitHub issues.

On GitHub, issues are referred in the form of "owner/repo#number", e.g. servo/servo#13134. Bugzilla can recognize this pattern and create a link to github.com/owner/repo/issues/number.

Pull requests share the number with issues, but if the target is a pull request, GitHub will automatically redirect to pull/number, so we don't need to distinguish between them from our side.

If there could be too many false positives of this patten (which I doubt), the owner part can be restricted to some known mozilla and other web-related orgs.
Assignee: nobody → sebastinssanty
Attached file patch
Attachment #8799737 - Flags: review?(dylan)
Comment on attachment 8799737 [details] [review]
patch

Because we have over 11 million comments in BMO, we have to be careful about regexes that require continuous look-ahead.

Take a look at running this regex against the reporter's comment (this is using the lovely CPAN module Regexp::Debugger)
This ends up taking at least 1200 steps and a lot of back-tracking
and that's just to get to the first match.

https://www.youtube.com/watch?v=WzzHvtozzMg

However I think this is a useful feature. We should have a list of known github projects,
and change the first part of the RE to be  m{\b((?:servo|mozilla|etc)\/[A-Za-z0-9_\.-]+\#[0-9]+)\b}g;

To be efficient, we should use Data::Munge's list2re, which takes a list of static strings and produces an efficient regex.

I don't think we need all of Data::Munge, so for now you can just copy the function into the BMO extension:
https://metacpan.org/source/MAUKE/Data-Munge-0.096/lib/Data/Munge.pm#L59-64

Usage:

my $projects = list2re('servo', 'mozilla', 'mozilla-bteam', 'bugzilla');
m{\b($projects\/[A-Za-z0-9_\.-]+#[0-9]+)\b}g;

For now the list of projects should be a constant at the top of the BMO extension, but I can picture
adding "associated github repos" as metadata to Bugzilla components.

Probably this comment is not enough to go on, but we can chat about it in #bteam at your time allows.

Also, thank you, sebastin for taking on this bug! Good initiative.
Attachment #8799737 - Flags: review?(dylan) → review-
Thanks Dylan,

Yeah, I didn't take the aspect of backtracking over all those comments. I'll have a look at the list2re, I think it will be a better idea too.

Xidorn: I guess the idea mentioned by dylan, would be much more feasible for BMO. Any opinion on this?
Flags: needinfo?(xidorn+moz)
I think that is fine, but I can imagine there may be requests for adding more orgs / users into that list.

If performance issue is mainly because it starts with a rather arbitrary string which takes too much time on trying matching unrelated text, I have a suggestion that, we can start the regex with "/", and once we find a pattern "/[A-Za-z0-9_\.-]+#\d+\b", we do a manual backtrack to find the owner part. That should be faster I guess.
Flags: needinfo?(xidorn+moz)
I just rethought about that regexp, and watched the video again. Then I realized that it is just that the regexp debugger is too threatening, and the regexp itself isn't that bad.

As a comparison, let's pick an innocent-looking regexp listed in the same function: qr/(^|\s)r(\d{4,})\b/. If you replace the match in comment 0 with a match of this regexp, how many steps do you think it would take before the first match? 1400+.

IIUC, most of the url matching regexp would take something about half of the steps of that regexp, and that's all.

In majority of cases, because of the existence of "\b" at the beginning, the naive regexp should just take 1x more steps than others, which isn't really harmful. And actually, the regexp you suggested in comment 2 (via list2re, which is something like /\b((?:mozilla\-bteam|bugzilla|mozilla|servo)\/[A-Za-z0-9_\.-]+\#[0-9]+)\b/) takes 1400+ steps to get to the first match.

There is one case that the naive regexp would be a bit worse than normal cases, which is long url. In that case, it could take 2x more steps because of the rematching of "/". But that is still O(n), so not too bad.

The worst case of that regexp is a O(n^2) regression when the string includes lots of "-" or ".", because those characters generates word boundary, but is matched by [A-Za-z0-9_\.-]. I don't think that's something very common, but if you want to avoid this worst case, you can just replace the \b with something like [^A-Za-z0-9_\.-], then it would take O(n) for any case. And if you include "/" in that, it would have the same performance in long urls as well.
Given comment 5, please reconsider the original patch.
Flags: needinfo?(dylan)
Comment on attachment 8799737 [details] [review]
patch

Based on Xidorn's astute observations in comment 5, I revised this:

So what I did was run this over every comment in my copy of the BMO db (which is 10 million comments),
with and without this patch, and I benchmarked it. There is no significant difference in timings. We still might get some false positives but we can fix those.
Flags: needinfo?(dylan)
Attachment #8799737 - Flags: review- → review+
I think the regexp should just be /\s([A-Za-z0-9_\.-]+\/[A-Za-z0-9_\.-]+\#[0-9]+)\b/, and you should probably replace the svn matching to just use \s as prefix as well, since it seems a grouping "or" takes lots of extra steps in Perl's regexp engine, and it is just not worth for a "^".

If you want to make it exactly correct, it might be more performant to prepend a space to every comment before matching than applying "(^|\s)" to them. But I guess that issue is unimportant and isn't really worth this trouble.
Sebastin: do you want to change the regexps per comment 8? Thanks!


Also, 
Xidorn: Thank you very much for your insight into this bug. It is very much appreciated.
Flags: needinfo?(sebastinssanty)
Dylan, Xidorn: I have changed the regex expression to include whitespace. But I am a bit unsure about it, because, as Xidorn had earlier mentioned about spaces, it won't detect 'See this:mozilla-bteam/bmo.'. Though I guess, these cases would be very less.
Flags: needinfo?(sebastinssanty)
Did I say that?

I'm not very worried about that case. I'm more concerned about the ending, which should be a "\b" rather than "\s", because people may add punctuation after that. The beginning shouldn't be a big issue.
Right, Earlier you had mentioned about the space in the ending part. But I was thinking about the beginning, but that won't be a big issue, as you mentioned above.
It seems the GitHub PR is pending for your review. Could you have a look at that?
Flags: needinfo?(dylan)
Yes, I'll be taking a look at this now. I didn't notice as the review flag was still set, it should pretty good to go now.
Flags: needinfo?(dylan)
Attachment #8799737 - Flags: review+ → review?(dylan)
The PR review is long pending, can you have a look at it? :)
Flags: needinfo?(dylan)
Comment on attachment 8799737 [details] [review]
patch

r=dylan

Finally got to this.
Flags: needinfo?(dylan)
Attachment #8799737 - Flags: review?(dylan) → review+
To git@github.com:mozilla-bteam/bmo.git
   4da65f2..d7dacb9  master -> master
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
This is now live, mozilla-bteam/bmo#26 as an example. Xidorn's servo example also works.
Status: RESOLVED → VERIFIED
Great! Thanks for working on this.
You need to log in before you can comment on or make changes to this bug.