Closed Bug 1814790 Opened 1 year ago Closed 1 year ago

firefox-bin: /builds/worker/checkouts/gecko/third_party/dav1d/src/refmvs.c:182: union mv mv_projection(const union mv, const int, const int): Assertion `den > 0 && den < 32' failed.

Categories

(Core :: Graphics: ImageLib, defect)

defect

Tracking

()

RESOLVED FIXED
113 Branch
Tracking Status
firefox-esr102 --- wontfix
firefox111 --- wontfix
firefox112 --- wontfix
firefox113 --- fixed

People

(Reporter: tsmith, Assigned: chunmin, NeedInfo)

References

(Blocks 1 open bug)

Details

(4 keywords, Whiteboard: [adv-main113-])

Attachments

(4 files, 1 obsolete file)

Attached image testcase.avif

Found while fuzzing m-c 20230201-b7f075124503 (--enable-debug --enable-fuzzing)

The test case is not 100% reliable it will likely require multiple reloads (press F5). Seems to reproduce about 3/100 attempts.

Requires pref image.avif.sequence.enabled=true

firefox-bin: /builds/worker/checkouts/gecko/third_party/dav1d/src/refmvs.c:182: union mv mv_projection(const union mv, const int, const int): Assertion `den > 0 && den < 32' failed.

#0 0x7fb24156400b in raise /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:51:1
#1 0x7fb241543858 in abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79:7
#2 0x7fb241543728 in __assert_fail_base /build/glibc-SzIz7B/glibc-2.31/assert/assert.c:92:3
#3 0x7fb241554fd5 in __assert_fail /build/glibc-SzIz7B/glibc-2.31/assert/assert.c:101:3
#4 0x7fb233bd4a72 in mv_projection /builds/worker/checkouts/gecko/third_party/dav1d/src/refmvs.c:182:5
#5 0x7fb233bd4a72 in dav1d_refmvs_load_tmvs /builds/worker/checkouts/gecko/third_party/dav1d/src/refmvs.c:729:35
#6 0x7fb233b93908 in dav1d_decode_tile_sbrow /builds/worker/checkouts/gecko/third_party/dav1d/src/decode.c:2832:9
#7 0x7fb233bd8447 in dav1d_worker_task /builds/worker/checkouts/gecko/third_party/dav1d/src/thread_task.c:761:33
#8 0x7fb241a95608 in start_thread /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477:8
#9 0x7fb241640132 in __clone /build/glibc-SzIz7B/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Flags: in-testsuite?
Severity: -- → S3

I was not able to reproduce this with or without image.avif.sequence.enabled

I forgot to mention this here, but I also couldn't reproduce this, even in a debug build with a conditional breakpoint.

A Pernosco session is available here: https://pernos.co/debug/9MxxNHZosXtAA31lgQosBg/index.html

Here is a Valgrind log

Thread 31 dav1d-worker:
Use of uninitialised value of size 8
   at 0xFA9ACFB: ??? (src/third_party/dav1d/src/x86/looprestoration_avx2.asm:1881)
   by 0xFA9A78B: ??? (src/third_party/dav1d/src/x86/looprestoration_avx2.asm:1238)
   by 0x15923DBF: ???
   by 0x145014501450144: ???
   by 0x145014501450144: ???
   by 0x145014501450144: ???
   by 0x145014501450144: ???
   by 0x145014501450144: ???
   by 0x145014501450144: ???
   by 0x145014501450144: ???
   by 0x145014501450144: ???
   by 0x145014501450144: ???
 Uninitialised value was created by a heap allocation
   at 0x4841818: memalign (vg_replace_malloc.c:1531)
   by 0x4841936: posix_memalign (vg_replace_malloc.c:1703)
   by 0xF98C80B: dav1d_alloc_aligned (third_party/dav1d/src/mem.h:66)
   by 0xF98C80B: dav1d_decode_frame_init (third_party/dav1d/src/decode.c:3144)
   by 0xF98DD2E: dav1d_decode_frame (third_party/dav1d/src/decode.c:3457)
   by 0xF98EF06: dav1d_submit_frame (third_party/dav1d/src/decode.c:3861)
   by 0xF9A58A9: dav1d_parse_obus (third_party/dav1d/src/obu.c:1653)
   by 0xF9A26DA: gen_picture (third_party/dav1d/src/lib.c:459)
   by 0xF9A25D5: dav1d_send_data (third_party/dav1d/src/lib.c:489)
   by 0xCE873FC: mozilla::image::Dav1dDecoder::GetPicture(Dav1dContext&, mozilla::MediaRawData const&, Dav1dPicture*, bool) (image/decoders/nsAVIFDecoder.cpp:619)
   by 0xCE870B5: mozilla::image::Dav1dDecoder::Decode(bool, Mp4parseAvifInfo const&, mozilla::image::AVIFImage const&) (image/decoders/nsAVIFDecoder.cpp:531)
   by 0xCE70E0D: mozilla::image::nsAVIFDecoder::Decode(mozilla::image::SourceBufferIterator&, mozilla::image::IResumable*) (image/decoders/nsAVIFDecoder.cpp:1509)
   by 0xCE705CF: mozilla::image::nsAVIFDecoder::DoDecode(mozilla::image::SourceBufferIterator&, mozilla::image::IResumable*) (image/decoders/nsAVIFDecoder.cpp:1195)
Keywords: sec-moderate

There's also a pernosco accessible to those with mozilla.com emails addresses which contains a full recording that can be inspected (memory, variables) at any moment with gdb that I can get more information from if the stacks and testcase aren't enough.

I believe :tnikkel was going to bring this one up with dav1d?

(In reply to Zaggy1024 from comment #6)

I believe :tnikkel was going to bring this one up with dav1d?

I brought it up and cc'd a dav1d developer here so they can take a look as a result of that.

I'm wondering how to best go about debugging this. I've tried to convert the .avif to .ivf using "ffmpeg -i testcase.avif -c:v copy testcase.ivf". However, this results in a file that is much smaller than the original: 10755 vs 4191 bytes. I'm not sure this is because FFmpeg's bitstream filters reject some of the frame data, or because the MP4 overhead is missing. With the resulting testcase.ivf file, I've tried to run "tools/dav1d -i testcase.ivf --muxer=null [options]" using MSAN in debug mode (so asserts are enabled), but get nothing: no MSAN errors and no asserts triggered. I ran it a couple hundred times in various threading configurations (--framedelay=1 or not, various --threads=X values, including 1).

It would be nice to directly use tools/dav1d for debugging. But if FFmpeg's BSF rejects invalid data, then this workflow isn't going to work. I'm open to other suggestions on going about this. I understand in theory it's possible for me to build Firefox, but I don't think it's reasonable to expect me to do that. Can you trigger this failure with only libavif? That would be a significant code coverage reduction also and simplify my workflow while looking into this.

FWIW I'd be happy to test any dav1d patches in a Firefox builds. This includes debugging patches that might output more info.

(In reply to Ronald S. Bultje from comment #8)

I'm wondering how to best go about debugging this. I've tried to convert the .avif to .ivf using "ffmpeg -i testcase.avif -c:v copy testcase.ivf". However, this results in a file that is much smaller than the original: 10755 vs 4191 bytes. I'm not sure this is because FFmpeg's bitstream filters reject some of the frame data, or because the MP4 overhead is missing. With the resulting testcase.ivf file, I've tried to run "tools/dav1d -i testcase.ivf --muxer=null [options]" using MSAN in debug mode (so asserts are enabled), but get nothing: no MSAN errors and no asserts triggered. I ran it a couple hundred times in various threading configurations (--framedelay=1 or not, various --threads=X values, including 1).

The difference in size will be largely due to the presence of an alpha track with ID 2 in the AVIF file, but unfortunately according to the stack trace in comment #4 which references these lines it looks like this issue should be occurring in the color track with ID 1, so I would've expected your steps to be able to reproduce it. I believe that ffmpeg normally chooses the color track when converting avif to ivf, although it might be worth checking the monochrome flag to make sure it extracted the right one.

monochrome is not set according to "ffmpeg -i testcase.ivf -bsf trace_headers -c:v copy -f null - 2>&1 | grep mono_chrome".

See Also: → 1819796

I looked a bit into this using the pernos.co thing. An invalid (possibly uninitialized?) value is assigned in line 724 of dav1d_refmvs_load_tmvs(): const int b_ref = rb->ref; - this value should be in the [0,7] range, but the value in the debugger is -13. Anything after that is a consequence of this invalid value. This is for x=0,y=48 (in 8x8 block units) for frame_offset=2.

Looking further, I understand now why we can't reproduce in tools/dav1d: in pernos.co, it's quite obvious that the pixfmt of the sequence where the issue occurs is I400, so grayscale. The picture decoded by dav1d is I420. So the (potential) issue is in the alpha plane, which FFmpeg discards when converting to .ivf. At this point it's difficult for me to work further on this until I have a way of inspecting the file locally in tools/dav1d. Any ideas of converting the alpha plane (instead of the main grayscale/color planes) to .ivf would be appreciated.

Oh, and in terms of origin of the race in that code block: ref=5,pocdiff=2,refpoc=4. This is the frame before poc=2 (standard hierarchical coding: 0,12,4,2,1,..) so this should be easy to debug. Also, n_fc=1 and n_tc=5, so no frame threading which is a bit strange given that we appear to be looking for a potential race condition here.

I can give you the raw binary data that we pass to dav1d_data_wrap/dav1d_send_data, would that do the trick?

The avif decoder in Firefox's use of dav1d is quite simple:

Dav1dSettings settings;
dav1d_default_settings(&settings);
settings.all_layers = 0;
settings.max_frame_delay = 1;

Dav1dContext* colorContext = NULL;
int r = dav1d_open(&colorContext, &settings);

Dav1dData dav1dData;
r = dav1d_data_wrap(&dav1dData, thedata, thelen, Dav1dFreeCallback_s, NULL);
r = dav1d_send_data(colorContext, &dav1dData);

Dav1dPicture picture = { 0 };
r = dav1d_get_picture(colorContext, &picture);

dav1d_close(&colorContext);

where thedata/thelen vary and is either color or alpha from one frame.

As for threading, I can tell you that the testcase that reproduces this reloads every 30 ms, so it's submitting a lot of work to the dav1d threadpool, so there is likely other frames being decoded in parallel, not sure if that helps explain anything on your side.

I'd need all the raw data for the preceeding frames also, not just the one frame where the issue occurs. If you could write a simple testcase using dav1d's API and some const data hardcoded in the example, that would do the trick, yes.

Is the threading required to reproduce this issue? If that's the case, there could be a race condition around c->flush. Although in the pernos.co log, *c->flush=0, and I don't think decoding can ever continue after c->flush was set.

(In reply to Ronald S. Bultje from comment #17)

Is the threading required to reproduce this issue? If that's the case, there could be a race condition around c->flush. Although in the pernos.co log, *c->flush=0, and I don't think decoding can ever continue after c->flush was set.

Threading might be required, I can't say for certain, but it sort of seems like it. It's pretty hard to reproduce this bug. I wasn't able to do it myself, only Tyson is able to reproduce it so far. Thankfully he was able to make the pernosco recording that should let us fully debug the issue. I'm not sure how familiar you are with pernosco, so forgive me if I'm repeating things you know. Pernosco has a full recording of every instruction and memory contents of a full Firefox session in which the bug was reproduced. You can go backwards and forwards in time to any instruction execution and inspect any program state. You can open a gdb terminal from the toolbox, set breakpoints and watch memory and then use rc to reverse continue which will go back in time and stop at the previous break/watch point. So you should be able to fully debug the issue without having to reproduce locally. You can read more about it on the website https://pernos.co/

The code I pasted above should be a working program, I just wrote it in examples/dav1dplay.c and it seemed to work how I expected. However I don't think it will reproduce this bug given how hard it is to reproduce, maybe with a lot of tweaking and extra threads. pernosco should hopefully be easier than this.

I'll also upload a file with the raw binary data we pass for each color/alpha frame to dav1d in case it's helpful. If there's anything else I can do please let me know.

Thanks for the explanation. I'm not familiar with pernos.co. I've (long, long ago) looked into Microsoft Chess, which seemed somewhat overlapping in intent. If reproducing this locally is going to be difficult, I should probably explain the functionality I'm looking for. Video (or image) decoding is a waterfall-style process with dependent tasks. If a particular task fails because it's dependency was not fully resolved (as is the case here: cand_b->b_ref is invalid), the obvious thing is to go back in time to the place where that should have been written. How do I do that in pernos.co? Normally I'd go into refmvs.c:dav1d_refmvs_save_tmvs() and add printf()s there and re-run the application. Can I set a (conditional) breakpoint in pernos.co's debugger in a function that was called earlier and "re-play" the execution until that breakpoint hits?

Alternatively, being able to set a watchpoint on that memory location would be useful also, but again a historical watchpoint so I can see what wrote to it in the past. Normally I'd set a watchpoint at the beginning of the application execution (right after allocation) and run until it hits.

About that data: something seems off there. I see 4 frames being decoded: frame_offst=0,12,4 and 2. But there's only data dumps for 3. Where's the fourth?

Attached file testcase.ivf

I've converted the alpha track to IVF using the following command to select the correct stream in FFmpeg, hopefully it will work to reproduce the issue:

ffmpeg -i testcase.avif -map 0:1 -c:v copy testcase.ivf

(In reply to Ronald S. Bultje from comment #20)

Normally I'd go into refmvs.c:dav1d_refmvs_save_tmvs() and add printf()s there and re-run the application. Can I set a (conditional) breakpoint in pernos.co's debugger in a function that was called earlier and "re-play" the execution until that breakpoint hits?

Yes, you can set conditional breakpoints in the gdb window just like in gdb, and then there are reverse versions of all the usual gdb commands, next, step, continue etc, to go backwards in time.

(In reply to Ronald S. Bultje from comment #21)

Alternatively, being able to set a watchpoint on that memory location would be useful also, but again a historical watchpoint so I can see what wrote to it in the past. Normally I'd set a watchpoint at the beginning of the application execution (right after allocation) and run until it hits.

Yes! You can set a watchpoint on a memory location and then do rc (reverse continue) to find out when it was modified before.

(In reply to Ronald S. Bultje from comment #22)

About that data: something seems off there. I see 4 frames being decoded: frame_offst=0,12,4 and 2. But there's only data dumps for 3. Where's the fourth?

Hmm, not sure, I will take a lot and see what's going on.

(In reply to Ronald S. Bultje from comment #22)

About that data: something seems off there. I see 4 frames being decoded: frame_offst=0,12,4 and 2. But there's only data dumps for 3. Where's the fourth?

Looking at the pernosco I made a table of the values of a few useful values

nsAVIFDecoder.cpp mFrameNum
hdr->frame_offset set in parse_frame_hdr in obu.c
bytes sent to dav1d_send_data
color or alpha

0 0 245 color
0 0 66 alpha
1 12 281 color
1 4 281 color
1 2 281 color
1 1 281 color
1 12 148 alpha
1 4 148 alpha
1 2 148 alpha

So each dav1d_send_data call is resulting in multiple frame_offset values in dav1d. Does this mean that the parsing that Firefox has done to chop up this avif file into frames was not correct, or is that expected?

The pernosco consists of many loads of the testcase. I looked at three loads of the testcase, and the last line (1 2 148 alpha) only appeared right before he hit failed assert.

(In reply to Timothy Nikkel (:tnikkel) from comment #25)

The pernosco consists of many loads of the testcase. I looked at three loads of the testcase, and the last line (1 2 148 alpha) only appeared right before he hit failed assert.

Hmm, I'm beginning to suspect that this is key. Let's call the loads of the testcase where we don't hit the assert the normal loads. In the normal loads we hit this line

https://searchfox.org/mozilla-central/rev/cd2121e7d83af1b421c95e8c923db70e692dab5f/third_party/dav1d/src/decode.c#3474

where we wait for dav1d worker threads to complete and take the retval from them.

And I've just confirmed that https://code.videolan.org/videolan/dav1d/-/issues/419 (aka bug 1814560) is happening here. We should have error'ed out at that point, but because of that bug we report success and proceed. That probably violates some assumptions in the code and we hit this problem.

Bug 1816484 landed which should fix this. Tyson can you confirm? Thanks!

Depends on: 1816484
Flags: needinfo?(twsmith)

I am no longer able to reproduce this issue.

Flags: needinfo?(twsmith)

Thanks. Let's call it fixed by bug 1816484 then.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Assignee: nobody → cchang
Group: gfx-core-security → core-security-release
Target Milestone: --- → 113 Branch
Whiteboard: [adv-main113-]
QA Whiteboard: [post-critsmash-triage]
Flags: qe-verify-

I believe we will need to include this issue in our security advisories, and because it is in a third party paroject, will not be able to issue a CVE for it. Ronald, are you the correct contact person to assign a CVE for this issue?

Flags: needinfo?(rsbultje)
Attached file advisory.txt

Fixed typo.

Attachment #9331914 - Attachment is obsolete: true
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: