Closed Bug 1872242 Opened 5 months ago Closed 18 days ago

milestone_winversion should be unique

Categories

(Firefox Build System :: General, defect, P3)

Desktop
Windows
defect

Tracking

(firefox128 fixed)

RESOLVED FIXED
128 Branch
Tracking Status
firefox128 --- fixed

People

(Reporter: yannis, Assigned: ahochheiden)

References

(Regressed 1 open bug)

Details

Attachments

(1 file)

In bug 1843977, we had a massive crash spike caused by a third-party antivirus. The antivirus was wrongly applying instrumentation to our 115 ESR xul.dll as if it were our 115 Release xul.dll. It was doing this because the two binaries that we shipped embedded the exact same version number - which the antivirus assumed would be a unique way to identify our 115 Release xul.dll. While we could argue that the antivirus should not have made this assumption, it would still be better if we did ship binaries with a unique version number to avoid this kind of trouble in the future.

As figured out by [:RyanVM], our current code for generating version information for our binaries is as follows, and the logic for that was added in bug 386740:

    milestone_winversion = ",".join(
        split_and_normalize_version(milestone, 3)
        + [str(days_from_2000_to_buildid(buildid))]
    )

Because we shipped the same 115 and 115 ESR minor versions on the same day with the same milestone number, the corresponding binaries ended up with the exact same value for milestone_winversion there.

As discussed with [:RyanVM] and [:rkraesig], we should at least guarantee that building with the same milestone but for a different channel leads to a different version number in the end - independently of the day and hour of build. We probably also want more granularity than days. We should also check if we have the same issue on other platforms.

The severity field is not set for this bug.
:ahochheiden, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(ahochheiden)

(In reply to Yannis Juglaret [:yannis] from comment #0)

but for a different channel leads to a different version number in the end - independently of the day and hour of build. We probably also want more granularity than days. We should also check if we have the same issue on other platforms.

So you just want days_from_2000_to_buildid(buildid): to be seconds instead?

eg:

def seconds_from_2000_to_buildid(buildid):
    start = datetime(2000, 1, 1, 0, 0, 0)
    buildid_time = datetime.strptime(buildid, "%Y%m%d%H%M%S")
    time_delta = (buildid_time - start)
    return time_delta.total_seconds()
Severity: -- → S3
Flags: needinfo?(ahochheiden) → needinfo?(yjuglaret)
Priority: -- → P3

I dug into this with Glandium and we figured out why it's using 'days since 2000' in the first place (and why seconds won't work nicely, but can sort of still work).

https://learn.microsoft.com/en-us/windows/win32/menurc/versioninfo-resource

The 'version' that we care about is FILEVERSION:

FILEVERSION version Binary version number for the file. The version consists of two 32-bit integers, defined by four 16-bit integers. For example, > "FILEVERSION 3,10,0,61" is translated into two doublewords: 0x0003000a and 0x0000003d, in that order. Therefore, if version is defined by the > DWORD values dw1 and dw2, they need to appear in the FILEVERSION statement as follows: HIWORD(dw1), LOWORD(dw1), HIWORD(dw2), > LOWORD(dw2).

Basically we have 4 "words" and currently we use the first 3 "words" for major minor patch semantic versioning, then the 4th word for the days since 2000 part of the FILEVERSION. Since they're "words", they're only 16 bits, which can have a maximum value of 65535 (which is why we're using days, since that's currently ~8700). Seconds will be too large to fit in that range (this would have been a helpful comment).

One way we can 'solve' this is by encoding the major minor and patch versions into only the first two words, and using the last two words for the seconds (that'll give us until the year ~2136 before we overflow). It'll kinda be ugly to read though, but it will work.

Example:

Current FILEVERSION 123.2.13.8780 (arbitrary example)

would break down as follows:

1st word: major (123)
2nd word: minor * 100 + patch : 2 * 100 + 13 (213)

seconds_since_2000: 758660440
3rd word (Upper 16 bits of seconds_since_2000): 11576
4th word (Lower 16 bits of seconds_since_2000): 15704

New FILEVERSION: 123.213.11576.15704

One caveat here being that this still doesn't fully resolve the issue, since if different channels build at the same second, the original error will still occur (but it's extremely unlikely).

We could use the same scheme to encode a number representing the channel in the decimal one's place of the 2nd "word" and and shift the other two by 10 each (eg: minor * 1000 + patch * 10 + encoded_channel_digit : 2 * 1000 + 13 * 10 + 1). Yielding 123.2131.11576.15704.

Note: There are limitations for both approaches. The second (also encoding a channel digit) is more restrictive and assumes the following:

  • There will not be more than 10 channels (0-9)
  • There will not be more than 100 patch versions (0-99)
  • There will not be more than 60 minor versions (0-59)

If you omit the channel digit, the patch and minor restrictions could be:

  • There will not be more than 1000 patch versions (0-999)
  • There will not be more than 65 minor versions (0-64)

That's assuming you want to keep the minor and patch numbers somewhat human readable. We could also use 2/16 bits for the channel encoding ((0-3 range) 4 values for 4 channels), and 7 bits each (0-127 range) for patch and minor, but that would in turn would just look like some nonsense. To be fair, the days_since_2000 as the 4th word currently can also be classified as nonsense.

We could also include the 1st "word" in the encoding scheme, but then we open up the possibility of collisions with older versions. I'm assuming here that the antivirus keeps a list of all past versions, so if we allow the 1st "word" to potentially be a number we've already used (eg: 120) it's possible to collide with a version sequence used previously (though, very unlikely, since by my estimate we've had in the range of 10s of thousands of releases, so let's say upper bound 100,000. 100000/2^64 is pretty low odds).

(In reply to Alex Hochheiden [:ahochheiden] (needinfo? me) from comment #2)

So you just want days_from_2000_to_buildid(buildid): to be seconds instead?

The current problem we identified with using days is that we ship two nightlies per day, so even if we split over release channels, using days would still lead to duplicate version IDs. I don't think we necessarily need seconds granularity though. The granularity should be decided by release managers.

Flags: needinfo?(yjuglaret) → needinfo?(ryanvm)

I honestly don't have a super strong opinion on what to do here. This isn't a version we use anywhere in our tooling. I am a bit worried about fallout/unintended consequences from the proposal in comment 3 to change the second and third digit representations as well, but only because I don't know what might depend on that in the wild.

Flags: needinfo?(ryanvm)

Possibly-more-straightforward suggestion for the encoding: since a separate timestamp is only useful on the Nightly channel(?), carve out a niche for the channel-identifier by means of addition rather than multiplication.

  • 0: release.
  • 1: beta.
  • 2: esr.
  • 3+n: nightly, with some arbitrary counter n.

This is mostly compatible with our existing versioning scheme, puts no more constraints on major/minor/patch version-numbers than already exist, and has an obvious course of action if we want to reinstitute the Aurora channel (i.e., make aurora 3 (or whatever), and nightly 4+n).

I have no opinion on whether the counter granularity should be increased to ensure uniqueness of Nightly builds' FILEVERSIONs — but if it is, and if it remains clock-based, I'd suggest counting from the time of the last major-version-number bump rather than from a fixed date.

Encoding hours would allow up to 7 years in 16-bits, which is plenty enough if we go with last major-version-number bump as a reference point. The downside, obviously, is that we store that information nowhere in the tree currently.

carve out a niche for the channel-identifier by means of addition rather than multiplication.

Addition to which number? If it's any of the "normal" numbers in the version, the result wouldn't allow to distinguish between that version and an older version on another channel.

We could also encode hours since beginning of the milestone year, which requires no additional information, would still allow to distinguish two nightlies from the same day, and leaves us 2 bits where we could encode the channel.

(In reply to Mike Hommey [:glandium] from comment #7)

Addition to which number? If it's any of the "normal" numbers in the version, the result wouldn't allow to distinguish between that version and an older version on another channel.

Not quite. I'm not suggesting "add 1 to the value to represent "beta"', I'm suggesting "add 3 to the value to carve out some unused values solely for the channel-identifier".

(So if you used it for any of the first three numbers, you wouldn't be able to distinguish between that version and any other version on the same channel.)

I do like the hours since current milestone year idea (we could even make it a more granular chunk: value = seconds since start of milestone year / 2^bits available This wouldn't work when going back to much older revisions and building, disregard).

One thought, doesn't a beta channel release candidate become the full release? Should we even encode the channel at all with that in mind?

RC builds get shipped to the Beta channel, yes. But they're built off mozilla-release if that matters.

The previous implementation used days since Jan 1 2000 for the last
16-bit segment. This was not unique enough and caused issues with
Antivirus software if two different channels were built on the same day.

The new approach uses hours since the last milestone bump and uses the
VCS to determine how long ago that was relative to the build time. This
means it will always reset when a new cycle begins, but still be unique
since the digits in the first 3 segments have incremented.

We also now use two of the 16-bits to encode the channel (nightly, beta,
ESR, and release). So two channels built within the same hour will still
be unique.

Using only 14-bits to store the 'hours since version bump', we have
about ~682 days from a version bump before we reach the maximum value we
can store. If a build is done after that point, the segment value will
always be the maximum value for that channel.

Assignee: nobody → ahochheiden
Status: NEW → ASSIGNED
Pushed by ahochheiden@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/344524bd7c75
Change how the last segment of `milestone_winversion` is generated to improve uniqueness r=firefox-build-system-reviewers,RyanVM,glandium
Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → 124 Branch
Regressions: 1879519
Status: RESOLVED → REOPENED
Flags: needinfo?(ahochheiden)
Resolution: FIXED → ---
Target Milestone: 124 Branch → ---

A Try push as a central-as-beta simulation with this patch applied succeeds - unknown why it passes now.

Alex, feel encouraged to reland as-is.

The problem still exists. It comes from the timestamp of milestone.txt ending up being "newer" than the buildid, making the delta between them negative. And the reason this happens is that the timezone is not taken into account. In the build from bug 1879519, the full timestamp from milestone.txt is 2024-02-09 12:51:24 +0200. The buildid is 20240209105204, which in a more readable form is 2024-02-09 10:52:04. The buildid itself is UTC, so translated in the same timezone as the milestone.txt timestamp, it's actually 2024-02-09 12:52:04 +0200, which is properly greater.

I'd say the timezone should be accounted for, but also, we should make sure the value can't be negative.

The timezone should be accounted for in the current implementation. The value can also no longer be negative (It will raise an exception in CI if it is, and will set to MAX_VALUE outside of CI).

Flags: needinfo?(ahochheiden)

There is an r+ patch which didn't land and no activity in this bug for 2 weeks.
:ahochheiden, could you have a look please?
If you still have some work to do, you can add an action "Plan Changes" in Phabricator.
For more information, please visit BugBot documentation.

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(ahochheiden)
Flags: needinfo?(mh+mozilla)

Just a friendly reminder that 128 will be the next ESR release and we're ~3 weeks from the start of that Nightly cycle.

Pushed by ahochheiden@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b685d0db4248
Change how the last segment of `milestone_winversion` is generated to improve uniqueness r=firefox-build-system-reviewers,glandium

Backed out for causing py3 failures in test_commit.py

Pushed by ahochheiden@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9cc387be4b6c
Change how the last segment of `milestone_winversion` is generated to improve uniqueness r=firefox-build-system-reviewers,glandium
Status: REOPENED → RESOLVED
Closed: 4 months ago18 days ago
Resolution: --- → FIXED
Target Milestone: --- → 128 Branch
Flags: needinfo?(ahochheiden)
Regressions: 1899569
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: