incorrect timestamps #4

Open
sloganking opened this issue Jan 2, 2023 · 3 comments

Comments

@sloganking

sloganking commented Jan 2, 2023

Output files such as .vtt and .srt contain the correct transcription text, but the timestamps are wrong. A 3-second sentence is marked as lasting 3 minutes instead. It doesn't appear to be an exact seconds-to-minutes conversion, though.

What we're outputting

00:00.000 --> 03:36.000
- Things you prioritize, like the most useful
03:36.000 --> 07:44.000
and interesting conversations, it goes parties,
07:44.000 --> 10:52.000
then workshops, then conference session.
10:52.000 --> 15:28.000
- I'm sorry, but you asked to be roasted, I will roast.
15:28.000 --> 17:56.000
- I'm talking about the level of intelligence
17:56.000 --> 20:52.000
of the cat or dog, okay?

What yt-whisper outputs, and what we should be outputting instead

00:00.000 --> 00:06.500
Things you prioritize, like the most useful and interesting conversations, it goes parties, then workshops, then conference sessions.

00:06.500 --> 00:09.200
I'm sorry, but you asked to be roasted, I will roast.

00:09.200 --> 00:12.500
I'm talking about the level of intelligence of a cat or a dog, okay?
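
As a rough check on the two listings above, the end times can be compared at the points where the segments line up (end of the first, second, and third sentence). This is only a sketch: `parse_ms` and `ratio` are hypothetical helpers written for this comparison, not part of the project. The ratio comes out close to 100 in each case, which would be consistent with a unit mismatch by a factor of 100 rather than a seconds-to-minutes mix-up, though it does not prove where the factor comes from.

```rust
/// Parse a "MM:SS.mmm" timestamp into milliseconds (hypothetical helper).
fn parse_ms(ts: &str) -> Option<u64> {
    let (min, rest) = ts.split_once(':')?;
    let (sec, ms) = rest.split_once('.')?;
    let min: u64 = min.parse().ok()?;
    let sec: u64 = sec.parse().ok()?;
    let ms: u64 = ms.parse().ok()?;
    Some(min * 60_000 + sec * 1_000 + ms)
}

/// Ratio between a reported timestamp and the expected one (hypothetical helper).
fn ratio(reported: &str, expected: &str) -> Option<f64> {
    let r = parse_ms(reported)? as f64;
    let e = parse_ms(expected)? as f64;
    if e == 0.0 {
        return None; // avoid dividing by a zero timestamp
    }
    Some(r / e)
}

fn main() {
    // (end time we output, end time yt-whisper outputs) for the same sentence
    let pairs = [
        ("10:52.000", "00:06.500"),
        ("15:28.000", "00:09.200"),
        ("20:52.000", "00:12.500"),
    ];
    for (reported, expected) in pairs {
        match ratio(reported, expected) {
            Some(r) => println!("{reported} vs {expected}: ratio {r:.1}"),
            None => println!("{reported} vs {expected}: could not compare"),
        }
    }
}
```
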
@m1guelpf
Owner

m1guelpf commented Jan 9, 2023

@sloganking Could you check if 3f7558a fixes this?

@sloganking
Author

@m1guelpf That did not fix it. A 4-second sentence is still registered as 10 minutes long.

@sloganking
Author

sloganking commented Jan 23, 2023

I printed out fragment.start and fragment.stop in as_vtt(). They don't seem to correlate with milliseconds at all.

fragment: Utternace { start: 0, stop: 624, text: " Hello world, this is my second test." }
fragment: Utternace { start: 624, stop: 696, text: " Can you hear me very well?" }
fragment: Utternace { start: 696, stop: 1696, text: " [BLANK_AUDIO]" }

This is the output from an 8.3-second audio clip. The first sentence finishes in just under 4 seconds, yet I don't see how 624 relates to 4 s at all.
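
One possibility worth ruling out: whisper.cpp reports segment start and end times in units of 10 ms, not milliseconds. Assuming (and this is only an assumption about this project) that start and stop are passed through unchanged, a formatter would need to multiply by 10 before building the timestamp. A minimal sketch, where `vtt_timestamp` is a hypothetical helper and not the project's as_vtt():

```rust
/// Format a time given in 10 ms units as a WebVTT "MM:SS.mmm" timestamp.
/// Assumes the input is in 10 ms units; hypothetical helper.
fn vtt_timestamp(centis: i64) -> String {
    let ms = centis * 10; // 10 ms units -> milliseconds
    let minutes = ms / 60_000;
    let seconds = (ms % 60_000) / 1_000;
    let millis = ms % 1_000;
    format!("{minutes:02}:{seconds:02}.{millis:03}")
}

fn main() {
    // start/stop values taken from the debug output above
    for (start, stop) in [(0, 624), (624, 696), (696, 1696)] {
        println!("{} --> {}", vtt_timestamp(start), vtt_timestamp(stop));
    }
}
```

Under that assumption 624 becomes 6.24 s and 1696 becomes 16.96 s, which is far closer than minutes but still about twice the length of the 8.3-second clip, so the unit alone may not explain everything.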
