SpeechRecognition only captures 1-2 seconds of audio
Categories
(Core :: Web Speech, defect, P3)
People
(Reporter: guest271314, Unassigned)
Attachments
(1 file, 1 obsolete file: 3.70 MB, text/html)
User Agent: Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/80.0.3987.7 Chrome/80.0.3987.7 Safari/537.36
Steps to reproduce:
- Set media.webspeech.recognition.enable and media.webspeech.recognition.force_enable to true in about:config
- Execute SpeechRecognition start() and play back 33 seconds of recorded speech through an <audio> element (a minimal sketch follows below).
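A minimal sketch of the reproduction, assuming the prefs above are set, SpeechRecognition is exposed unprefixed, and an <audio> element holds roughly 33 seconds of recorded speech; the element id and file name are placeholders:

// Minimal repro sketch; "speech.wav" and the element id are placeholders
const audio = document.getElementById("speech"); // <audio id="speech" src="speech.wav" controls>
const recognition = new SpeechRecognition();

// Log each lifecycle event against the playback position of the <audio> element
for (const type of ["speechstart", "speechend", "audiostart", "audioend", "result", "end"]) {
  recognition.addEventListener(type, (event) => {
    console.log(type, "event.timeStamp", event.timeStamp, "audio.currentTime", audio.currentTime);
  });
}

recognition.start();
audio.play();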
Actual results:
speechend event.timeStamp 70529 audio.currentTime 1.101208
audioend event.timeStamp 70531 audio.currentTime 1.101208
end event.timeStamp 71561 audio.currentTime 2.111604
Only 1-2 seconds of voice audio are recognized before the end event is dispatched. Only occasionally does the result event fire.
Expected results:
speechend and audioend events should not be dispatched until at least 33 seconds of voice audio have played back.
Comment 1•4 years ago
Bugbug thinks this bug should belong to this component, but please revert this change in case of error.
Comment hidden (admin-reviewed)
Reporter
Comment 3•4 years ago
Does MAX_LISTENING_TIME commence immediately when SpeechRecognition() is executed?
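One way to probe that empirically (a sketch only; the 5 second delay before start() is arbitrary) would be to delay start() after construction and compare how long the session lasts relative to each point:

// If the listening limit runs from construction, delaying start() should
// shorten the observed window; if it runs from start(), it should not.
const recognition = new SpeechRecognition();
const constructedAt = performance.now();
let startedAt;

recognition.addEventListener("end", () => {
  const now = performance.now();
  console.log("end fired", now - constructedAt, "ms after construction,",
              now - startedAt, "ms after start()");
});

setTimeout(() => {
  startedAt = performance.now();
  recognition.start();
}, 5000); // arbitrary 5 s delay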
Comment 4•4 years ago
We don't ship this API yet, so I don't think we can give much support on it. In particular, the voice activity detector is going to be replaced.
There's no need to needinfo so many people on a minor issue like this. It can go through normal triage procedures like everything else.
Also: TypeError: document.body is null
Reporter
Comment 5•4 years ago
See updated attachment.
It is currently virtually impossible to test in the field when input is clipped to 1-2 seconds.
How can OGG files be sent to the endpoint directly for testing?
For minimal input using speechSynthesis.speak() with the default voice and the input text "Speech synthesis at Nightly" (a sketch follows the outputs below), the output for 3 runs was:
"THC"
"peach seed"
"peach seed"
For the attached file the output was
"can I watch"
"can I watch"
"can I watch"
Same input (WAV file) at https://speech-to-text-demo.ng.bluemix.net/:
Speaker 1: Now watch.
Speaker 1:
Speaker 2: This a science works.
Speaker 2: One.
Speaker 1: Researcher comes up with a result.
Speaker 1: At that is not the truth.
Speaker 3: No no.
Speaker 2: A scientific emergent.
Speaker 1: Truth is not the result of any.
Speaker 3: One.
Speaker 2: Experiment.
Speaker 1: What has to happen is.
Speaker 2: Somebody else has to.
Speaker 2: Verified.
Speaker 2: Prefer a bleak.
Speaker 1: A competitor.
Speaker 2: The food police someone who doesn't want you to be correct.
https://cloud.google.com/speech-to-text/
// Extract the recognized words from the Google Cloud Speech-to-Text demo page.
// temp2 is a DevTools console variable (presumably stored via "Store as global
// variable") referencing the transcript container element on the demo page.
[...temp2.querySelector('div').children]
  .filter(el => el.tagName === "DIV")
  .forEach(el => {
    [...el.children]
      .filter(e => e.tagName === "SP-WORD")
      .forEach(e => {
        console.log(e.shadowRoot.querySelector('.word').textContent);
      });
  });
Can I watch?
I'm just have science works.
One researcher comes up with a result.
At that is not the truth.
No, no a scientific emergent truth is not
the result of any one experiment what
has to happen is somebody else has to verify it.
Preferably a competitor.
Preferably someone who doesn't want you to be correct?
By hand
Now watch. Um, this how science works.
One researcher comes up with a result.
And that is not the truth. No, no.
A scientific emergent truth is not the
result of one experiment. What has to
happen is somebody else has to verify
it. Preferably a competitor. Preferably
someone who doesn't want you to be correct.
92nd Street Y, May 3, 2017: Neil deGrasse Tyson in conversation with Robert Krulwich
Reporter
Comment 6•4 years ago
(In reply to guest271314 from comment #5)
> How can OGG files be sent to the endpoint directly for testing?
https://wiki.mozilla.org/Web_Speech_API_-_Speech_Recognition#How_can_I_test_with_Deep_Speech.3F
Expected file type: "Body should be an Opus or Webm audio file".
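A sketch of posting a file directly to such an endpoint; the endpoint URL, file name, and Content-Type header below are placeholders, not the documented values (see the wiki page linked above for the actual endpoint):

// Placeholder endpoint; substitute the URL documented on the wiki page
const ENDPOINT = "https://example.org/speech-to-text";

fetch("speech.opus")                         // placeholder Opus file to test
  .then((response) => response.blob())
  .then((blob) => fetch(ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "audio/opus" }, // assumed content type
    body: blob
  }))
  .then((response) => response.text())
  .then((text) => console.log(text))
  .catch(console.error);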
Output for 3 runs of the same file as audio/webm;codecs=opus:
"i watch im this a science works
one researcher comes up with a result
and that is not the truth no no a scientific
emergent truth is not the result of any
one experiment what has to happen is
somebody else has to verify preferably
a competitor referred ly some one who
doesn't want you to be correct"
"i watch im this a science works
one researcher comes up with a result
and that is not the truth no no a scientific
emergent truth is not the result of any
one experiment what has to happen is
somebody else has to verify preferably
a competitor referred ly some one who
doesn't want you to be correct"
"i watch im this a science works
one researcher comes up with a result
and that is not the truth no no a scientific
emergent truth is not the result of any
one experiment what has to happen is
somebody else has to verify preferably
a competitor referred ly some one who
doesn't want you to be correct"
4th run
"now watch im this a science works
one researcher comes up with a result
and that is not the troop no no a sientific
emergent truth is not the result of any
one experiment what has to happen is
somebody else has to verify it preferably
a competitor referred ly some one who
doesn't want you to be correct"
This is the first result where "now" is the first word, reflecting the input, with the relevant headers set to false. Did the machine learn nonetheless?
Can the endpoint input time be extended to at least 30 seconds for testing?
(In reply to Andreas Pehrson [:pehrsons] (On leave; back Aug 1st 2020) from comment #4)
> In particular, the voice activity detector is going to be replaced.
Do tracking bugs exist for the removal of the voice activity detector and for the maximum input time?
Reporter
Comment 7•4 years ago
The 4th run was a .opus file.
Comment 8•4 years ago
The priority flag is not set for this bug.
:anatal, could you have a look please?
For more information, please visit auto_nag documentation.