Open Bug 942078 Opened 6 years ago Updated Last year

Video thumbnail generation rule

Categories

(Core :: Audio/Video: Playback, defect, P2)

ARM
Gonk (Firefox OS)
defect

Tracking

()

People

(Reporter: bwu, Assigned: bwu)

References

Details

(Whiteboard: [ucid:multimedia10, 1.5, ft:multimedia-platform])

Current video's thumbnail is generated by seeking to 1/4 video duration and getting that I-frame decoded. This frame may be not good since it could be black screen or some meaningless frame to users. 
Another way is to decode the largest I-frame which more likely contains rich information or has high quality.
Hi Chris & David,

According to Gecko media & Gaia video app domain, may we know your opinion on this improvement of generating video thumbnail? 

Thanks.
Flags: needinfo?(dflanagan)
Flags: needinfo?(chris.double)
(In reply to Marco Chen [:mchen] from comment #1)
> 
> According to Gecko media & Gaia video app domain, may we know your opinion
> on this improvement of generating video thumbnail? 

I think this is a good approach. A while back we looked at doing a thumbnail specific API in the media code and that was one of the approaches suggested (bug 873959).
Flags: needinfo?(chris.double)
The current approach comes from very early UX specs for the video app. The idea is that many professional videos begin with a blank frame, so it doesn't make sense to start at the first frame.  IIRC, we don't do 1/4 of the duration. We do 10 seconds or 1/4, which ever is smaller.

But there is no real logic to it, we just hope that that is a better frame than the first one is.

If you want to do something better, that would be great.  Just picking the largest frame seems like it might get you an image that was full of text or something else that is hard to compress.  It seems like it would be better if you could do something like start at the beginning and look for a jump in frame size to find the first non-trivial frame. 

But I'm completely out of my expertese here.  If you can do something smart here, we'll certainly use it for the video app.

Needinfo John to bring this to his attention.
Flags: needinfo?(dflanagan) → needinfo?(johu)
Thanks for bringing this to me, David.

I don't think the frame at 1/4 will be totally useless. We uses canvas to draw the thumbnail and assume that the Video element helps to decode to the frame of specified time. That may or may not be a I-frame but at least a frame which is constructed from it's I-frame and applying some P/B-frames to have the correct image. From my knowledge of H.264, an I-frame contains full image/information and all frames after that I-frame are constructed from that I-frame. It shouldn't be a black or partial black image if we constructed a frame from I-frame and applying all frame changes after that I-frame to the specified time.

Another issue is every video may have different I-frame interval. Some of them may use constant time interval but some of them are not. And the creator of that video can change the strategy of creating I-frame.

I feel if we can find a frame whose color histogram is as a bell shape, that may be a good frame to be the video thumbnail.
Flags: needinfo?(johu)
Usually when seeking in video playback, it seeks to the nearest I-frame around the seeking time. The reason behind this is if it seeks to the frame with exact seeking time, that frame may not be a I-frame. If it is not I-frame, media extractor still needs to find the I-frame before it and decode I-frame with the following P/B frames until reaching the seeking time. This may take some time and user expericen may not be good. 

The reason to pick up the largest size of I-frames is if the raw data of that frame is black or plain, the encoded size would be small. From my understanding of H.264, the more information it contains, the bigger size it has. So it is a pretty simple way to filter out some "bad" frames.

For checking color histogram, it is also a better way. However, it may need to take some time to decode and analyze each I-frame. Not sure how to do it in a quick way.
(In reply to Blake Wu from comment #5)
> If it is not
> I-frame, media extractor still needs to find the I-frame before it and
> decode I-frame with the following P/B frames until reaching the seeking
> time. This may take some time and user expericen may not be good. 

This is what the HTML video element does. It seeks to the nearest key frame then decodes up to the requested seek frame. Implementing the faster, but less accurate, method of seeking to key frame only is bug 778077.
Wow! This's a pretty old bug. I never found that we are going to implement the fast seek. That may be very useful to deal this kind of . There is an extreme case for this kind of API. We may have a key frame per minute or more which causes the fast seek absolute out of sync. And I know that's the trade-off.

I think Blake is trying to fix two things at one shot: 1. fast seek and 2. the significant thumbnail for this video. For the first one, that is exactly the bug 778077. For the second one, that 's largest key-frame.

From the view of gaia, we don't have the API to do "fast seek". If we have, I feel that's good to use it to generate thumbnail. But I still wondering that what's the difference if we use hardware codec. That should be very fast and user may not be aware of the difference. That does provide a large improvement when we fallback to software codec.


(In reply to Chris Double (:doublec) from comment #6)
> This is what the HTML video element does. It seeks to the nearest key frame
> then decodes up to the requested seek frame. Implementing the faster, but
> less accurate, method of seeking to key frame only is bug 778077.
bug 873959 seems easy way to implement this. In android stagefright, mp4 file's thumbnail is determined biggest sample in 20 sync samples from front.

SampleTable::findThumbnailSample()
http://androidxref.com/4.4_r1/xref/frameworks/av/media/libstagefright/SampleTable.cpp#727
(In reply to Sotaro Ikeda [:sotaro] from comment #8)
> bug 873959 seems easy way to implement this. In android stagefright, mp4
> file's thumbnail is determined biggest sample in 20 sync samples from front.
> 
> SampleTable::findThumbnailSample()
> http://androidxref.com/4.4_r1/xref/frameworks/av/media/libstagefright/
> SampleTable.cpp#727

Thumbnail time can be checked also by gaia side.
Hi Sotaro,

We don't have libstagefright in Gecko, see bug 778052. So, We can't have thumbnail time in metadata. We may only have it in ogg file format, since the landing of bug 763010.
Summary: Video thumbnail generation rule → [fugu]Video thumbnail generation rule
Assignee: nobody → bwu
Hi PeiPei,

May I know why you change the summary to include [fugu]?
This is a general improvement for all platform not for fugu only.

Thanks.
Flags: needinfo?(pcheng)
(In reply to Marco Chen [:mchen] from comment #11)
> Hi PeiPei,
> 
> May I know why you change the summary to include [fugu]?
> This is a general improvement for all platform not for fugu only.
> 
> Thanks.

This is to keep track of fugu must have issues. James, could you please clarify?
Flags: needinfo?(pcheng) → needinfo?(james.zhang)
All platform has this issue, not fugu only. You can remove fugu from summary.
Flags: needinfo?(james.zhang)
remove [fugu]
Summary: [fugu]Video thumbnail generation rule → Video thumbnail generation rule
Whiteboard: [ucid:Device14, 1.4:p2, ft:devices]
Hi All, 
I list two possible designs for thumbnail generation in the below link. Please have a look and advice. 
https://docs.google.com/file/d/0B98EFeZYRJgjdGQxUER4cG9IVTg/edit    
The ideas of these two designs are 

Design A 
1. Get thumbnail time from loaded metadata but this is not ready and blocks on bug 778052 
2. Seek to this time to get thumbnail  

Design B
1. Introduce a new class, MediaDataRetriever, to get the decoded frame via media extractor and omx decoder (for example) without going through media decoders, media state machine, and media reader. 
2. Introduce a new value,thumbnail, in preload. 
3. Introduce a new Web API(mozSetThumbnail) or property(thumbnailTime) and a event(ongetthumbnail) to notify that thumbnail is ready.
   API or property is optional which is used flexibly for APP to decide the thumbnail time.    
   3.1 mozGetThumbnail(double time) 
       Ex. 
          var video = document.createElement('video'); 
          video.preload = 'thumbnail';//new 
          video.mozSetThumbnailTime(time); //new and optional
          video.src = url;   
          video.ongetthumbnail = function() { //new 
          :
            captureFrame(video, metadata,callback);
          :
          }         
   3.2 t 
      Ex. 
         var video = document.createElement('video'); 
         video.preload = 'thumbnail';//new 
         video.thumbnailTime = time; //new and optional 
         video.src = url;  
         video.ongetthumbnail = function() { //new 
          :
            captureFrame(video, metadata,callback);
          :
         }        

For better flexibility and efficiency, Design B would be better. So currently I am working on 3.2 as above. 
Should you have other ideas, please share with us.
From the view of gaia, I don't think those two designs have difference. But we may need to know which one has better performance..
(In reply to John Hu [:johnhu] from comment #16)
> From the view of gaia, I don't think those two designs have difference. But
> we may need to know which one has better performance..

The drawback of design A is pointed out in the link at comment 15.
I just help Blake to post it here.

  1. Pros: The thumbnail time reported from metadata will be a I-Frame so we avoid to decode multiple frames for final frame.
  2. Cons: The seek operation will perform into each video and audio codec and it wasted resources on seeking audio which we don't need it.
Hi Chris and David,

May I have your opinions on these two designs?   

Thanks!
Flags: needinfo?(dflanagan)
Flags: needinfo?(chris.double)
I'm not a fan of adding the 'thumbnail' option to 'preload'. What is the exact meaning of 'thumbnail' compared to the existing 'preload' attributes? Does thumbnail operate like 'auto', 'metadata' or 'none' after the thumbnail is loaded for example.
Flags: needinfo?(chris.double)
The original idea is 'thumbnail' could be like 'metadata' which requires some time to process. 
For metadata, 
gaia set preload = 'metadata' and get the metadata via onloadedmetadata. 
So that is why to use the same way to do it.  
set preload = 'thumbnail' and get it via ongetthumbnail.
(In reply to Blake Wu [:bwu] from comment #18)
> Hi Chris and David,
> 
> May I have your opinions on these two designs?   
> 
> Thanks!

I feel that I only barely understand the techincal issues here.  For example, I've never heard of an I-frame before.

However, here are my thoughts:

1) adding a new thumbnail attribute value, a new method or property and a new event handler to the video element  is a lot of new standardization. I would not go down that path unless you've already initiated discussions on the standards mailing lists.

2) I agree with Chris that preload="thumbnail" is underspecified, and think it would be difficult to distinguish this case from preload="metadata".

3) In regular desktop Firefox, what is the default behavior going to be?  If preload is not "none", will the video element automatically start loading the URL and find a thumbnail frame to display? If so, then it certainly makes sense to have an API that exposes either the thumbnail or the fact that the thumbnail is ready.  If you are going to be fetching the thumbnail for display with preload set to "metadata" then you can just fire onthumbnailready when you have it and don't need a new value for preload.

4) Do some videos have embedded metadata that specifies a thumbnail time?  Are you proposing to honor that metadata when specified, or to just allow the user to specify the time explicitly if they choose? I think it would be better to have a simple API that honors the metadata when possible and otherwise makes a best guess.  If the user wants to specify a time explicitly, they can do that with currentTime and onseeked.  They don't need a different property for thumbnail time.

5) Instead of requiring the developer to create a canvas and capture the thumbnail frame with copyImage(), have you considered just defining a getThumbnail() function?  It could take a callback or could return a Promise.  I've just learned about ImageBitmap, and I know that Shih-Chiang is working on implementing ImageBitmap as part of his work on 854795, so maybe you could return the thumbnail as an ImageBitmap (assuming that would be efficient).  I don't feel strongly about that.

I think these various points are probably contradictory, and my advice is uninformed, so please don't take anything I say too seriously!
Flags: needinfo?(dflanagan)
Hi Chris, 
Thanks for your great feedback. My comment is inline. 

(In reply to David Flanagan [:djf] from comment #21)
> (In reply to Blake Wu [:bwu] from comment #18)
> > Hi Chris and David,
> > 
> > May I have your opinions on these two designs?   
> > 
> > Thanks!
> 
> I feel that I only barely understand the techincal issues here.  For
> example, I've never heard of an I-frame before.
> 
> However, here are my thoughts:
> 
> 1) adding a new thumbnail attribute value, a new method or property and a
> new event handler to the video element  is a lot of new standardization. I
> would not go down that path unless you've already initiated discussions on
> the standards mailing lists.
  Got it. To have a efficient discussion in the standard mailing list, I would like to sync with you and Chris first.  

> 2) I agree with Chris that preload="thumbnail" is underspecified, and think
> it would be difficult to distinguish this case from preload="metadata".
  My original purpose is to distinguish meatadata from thumbnail. But it looks like not work well :)
  More reasons are explained in my comment in 3). 

> 3) In regular desktop Firefox, what is the default behavior going to be?  If
> preload is not "none", will the video element automatically start loading
> the URL and find a thumbnail frame to display? If so, then it certainly
> makes sense to have an API that exposes either the thumbnail or the fact
> that the thumbnail is ready.  If you are going to be fetching the thumbnail
> for display with preload set to "metadata" then you can just fire
> onthumbnailready when you have it and don't need a new value for preload.
When preload is none, nothing will be retrieved from video file including metadata.     
When preoload is metadata, only metadata will be retrieved. No any video frame level will be processed. That's why I think a new attribute value may be needed. Besides if we put thumbnail fetch in "meatadata", that would create unnecessary actions/processes if user just want to get metadata only. 


> 4) Do some videos have embedded metadata that specifies a thumbnail time? 
  Maybe but it is not common. 
> Are you proposing to honor that metadata when specified, or to just allow
> the user to specify the time explicitly if they choose? I think it would be
> better to have a simple API that honors the metadata when possible and
> otherwise makes a best guess.  If the user wants to specify a time
> explicitly, they can do that with currentTime and onseeked.  They don't need
> a different property for thumbnail time.
  Just allow the user to specify the time if they want. 
  Using currentTime will seek the unnecessary audio part, so that's why we don't prefer it. 
> 5) Instead of requiring the developer to create a canvas and capture the
> thumbnail frame with copyImage(), have you considered just defining a
> getThumbnail() function?  It could take a callback or could return a
> Promise.  I've just learned about ImageBitmap, and I know that Shih-Chiang
> is working on implementing ImageBitmap as part of his work on 854795, so
> maybe you could return the thumbnail as an ImageBitmap (assuming that would
> be efficient).  I don't feel strongly about that.
  I have thought this as well, but don't see how ImageBitmap can benefit. I will discuss with Shin-Chiang further. Thanks for your information.   
> I think these various points are probably contradictory, and my advice is
> uninformed, so please don't take anything I say too seriously!
Hi All,

With some considerations for future flexibility today, now I also think it will be better to have one dedicated api instead of new attribute value in preload :) I recap the information in Background, Design, Advantages, and Changes as below. May I have your opinions again? 

Background: 
A original reason to put thumbnail in the "preload" is to avoid metadata loading and thumbnail fetch running at the same time (call preload = "metadata" and mozGetThumbnail() ) which may cause codec resource conflict. However, for future flexibility and multi-core support we should let these two things be able to run in parallel. Currently, if gaia set those two at the same time, the later will be returned a error. In current gaia design, createThumbnail is in onloadedmetadata event, so there is no big impact for gaia part.  

Design:  
Introduce a new Web API, mozGetThumbnail (double thumbnailTime), and a event(ongetthumbnail) to notify that thumbnail is ready.
thumbnailTime is optional. If it is negative, a default thumbnail generation mechanism in the gecko will apply. Otherwise, an I-frame close to that time will be chosen as thumbnail.     

Advantages: 
For this new design, 
1. Remove unnecessary audio seeking in current design. 
2. Better to improve, like apply software codec and current mechanism in gecko is complicated and   unnecessary, like media decoder state machine.  
3. Easy and clear api to use for gaia developer.      

Change: 
For Gecko change, 
Introduce a new class, MediaDataRetriever, to get the decoded frame via media extractor and omx decoder (for example) without going through media decoders, media state machine, and media reader. 
For gaia change, 
No much changes. Two lines in metadata.js are required to be changed, line#249 and line#264 as shown below.  

249     //offscreenVideo.currentTime = Math.min(5, offscreenVideo.duration / 10);
        offscreenVideo.mozGetThumbnail(time);  
251     var failed = false;                      // Did seeking fail?
252     var timeout = setTimeout(fail, 10000);   // Fail after 10 seconds
253     offscreenVideo.onerror = fail;           // Or if we get an error event
254     function fail() {                        // Seeking failure case
255       console.warn('Seek failed while creating thumbnail for', videofile.name,
256                    '. Ignoring corrupt file.');
257       failed = true;
258       clearTimeout(timeout);
259       offscreenVideo.onerror = null;
260       metadata.isVideo = false;
261       unload();
262       callback(metadata);
263     }
264     // offscreenVideo.onseeked = function() {   // Seeking success case
        offscreenVideo.ongetthumbnail = function() {    
265       if (failed) // Avoid race condition: if we already failed, stop now
266         return;
267       clearTimeout(timeout);
268       captureFrame(offscreenVideo, metadata, function(poster) {
269         if (poster === null) {
270           // If something goes wrong in captureFrame, it probably means that
271           // this is not a valid video. In any case, if we don't have a
272           // thumbnail image we shouldn't try to display it to the user.
273           // XXX: See bug 869289: maybe we should not fail here.
274           fail();
275         }
276         else {
277           metadata.poster = poster;
278           unload();
279           callback(metadata); // We've got all the metadata we need now.
280         }
281       });
282     };
283   }
The simpler API looks good but I'd suggest asking for feedback on mozilla.dev.webapi for feedback on the Web based interface. They'll be able to advise on the current standard for using events vs other async mechanisms. There will also need to be an implementation for non-OMX backends.
Whiteboard: [ucid:Device14, 1.4:p2, ft:devices] → [ucid:multimedia10, 1.4:p2, ft:multimedia-platform]
Add one more advantage. 
(In reply to Blake Wu [:bwu] from comment #23)

> Advantages: 
> For this new design, 
> 1. Remove unnecessary audio seeking in current design. 
> 2. Better to improve, like apply software codec and current mechanism in
> gecko is complicated and   unnecessary, like media decoder state machine.  
> 3. Easy and clear api to use for gaia developer. 
  4. Improve thumbnail generating time (current seeking waits the seekTime frame from the previous keyframe as comment #6)
blocking-b2g: --- → 1.3?
We are no longer taking features for 1.3, so this is not a blocker.
blocking-b2g: 1.3? → -
According to the user story doc, this is no longer being tracked as part of 1.4.
No longer blocks: 1.4-multimedia
One another idea is 
Do both finding the proper key-frame and decoding (SW decoder) it via JavaScript to be a self-contained module which should be able to save memory (normally HW decoder requires to allocate many memory buffers even there is only one frame for decoding) and could be reused cross-platform.  But the performance (parsing and decoding time) of this module would be needed to check.
Add Bug 971645 as a blocking bug since that can improve the performance of seek which video thumbnail generation requires as well.
Blocks: 971645
blocking-b2g: - → ---
Whiteboard: [ucid:multimedia10, 1.4:p2, ft:multimedia-platform] → [ucid:multimedia10, 1.5, ft:multimedia-platform]
Component: General → Video/Audio
Product: Firefox OS → Core
Component: Audio/Video → Audio/Video: Playback
Rank: 15
Priority: -- → P2
You need to log in before you can comment on or make changes to this bug.