General form
Each service may return different information for the same request, so the output of Hooho can vary from one service to another.
The general form of an output is:
[
{
"transcript": string,
"start_time": number,
"end_time": number,
"confidence": number,
"phrases": [
{
"content": string,
"type": string,
"start_time": number,
"end_time": number,
"confidence": number,
"speaker_tags": array[integer],
"items": [
{
"content": string,
"type": string,
"start_time": number,
"end_time": number,
"confidence": number,
"speaker_tag": integer
}
]
}
],
"wer_data": [
{
"reference": string,
"wer": number
}
],
"wer_mean": number
}
]
For example,
[
{
"transcript": "I use who oh. It makes things easier!",
"start_time": 0.0,
"end_time": 5.38,
"phrases": [
{
"content": "I use who oh.",
"type": "phrase",
"start_time": 0.0,
"end_time": 2.51,
"items": [
{
"content": "I",
"type": "word",
"start_time": 0,
"end_time": 0.39,
"confidence": 1
},
{
"content": "use",
"type": "word",
"start_time": 0.39,
"end_time": 1.26,
"confidence": 1
},
{
"content": "who",
"type": "word",
"start_time": 1.26,
"end_time": 1.96,
"confidence": 1
},
{
"content": "oh",
"type": "word",
"start_time": 1.96,
"end_time": 2.51,
"confidence": 1
},
{
"content": ".",
"type": "punctuation",
"start_time": 2.51,
"confidence": 1
}
]
},
{
"content": "It makes things easier!",
"type": "phrase",
"start_time": 2.51,
"end_time": 5.38,
"items": [
{
"content": "It",
"type": "word",
"start_time": 2.51,
"end_time": 3.12,
"confidence": 1
},
{
"content": "makes",
"type": "word",
"start_time": 3.12,
"end_time": 3.81,
"confidence": 1
},
{
"content": "things",
"type": "word",
"start_time": 3.81,
"end_time": 4.33,
"confidence": 1
},
{
"content": "easier",
"type": "word",
"start_time": 4.33,
"end_time": 5.38,
"confidence": 1
},
{
"content": "!",
"type": "punctuation",
"start_time": 5.38,
"confidence": 1
}
]
}
],
"wer_data": [
{
"reference": "I use Hooho. It makes things easier!",
"wer": 0.2857142857142857
}
],
"wer_mean": 0.2857142857142857
}
]
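As a rough illustration of how this structure can be consumed, the sketch below walks a shortened copy of the example output and collects every item of type "word" with its timestamps. The function name `list_words` is hypothetical and not part of Hooho; the sketch only assumes the output is valid JSON in the general form shown above.

```python
import json

# A shortened response in the general form shown above (illustrative values).
raw = """
[
  {
    "transcript": "I use who oh.",
    "start_time": 0.0,
    "end_time": 2.51,
    "phrases": [
      {
        "content": "I use who oh.",
        "type": "phrase",
        "start_time": 0.0,
        "end_time": 2.51,
        "items": [
          {"content": "I", "type": "word", "start_time": 0, "end_time": 0.39, "confidence": 1},
          {"content": "use", "type": "word", "start_time": 0.39, "end_time": 1.26, "confidence": 1},
          {"content": "who", "type": "word", "start_time": 1.26, "end_time": 1.96, "confidence": 1},
          {"content": "oh", "type": "word", "start_time": 1.96, "end_time": 2.51, "confidence": 1},
          {"content": ".", "type": "punctuation", "start_time": 2.51, "confidence": 1}
        ]
      }
    ]
  }
]
"""


def list_words(output):
    """Return (content, start_time, end_time) for every item of type "word".

    Hypothetical helper: missing fields are read with .get() because not
    every service returns every field.
    """
    words = []
    for result in output:
        for phrase in result.get("phrases", []):
            for item in phrase.get("items", []):
                if item.get("type") == "word":
                    words.append(
                        (item["content"], item.get("start_time"), item.get("end_time"))
                    )
    return words


words = list_words(json.loads(raw))
for content, start, end in words:
    print(f"{start} - {end}  {content}")
```

Punctuation items are skipped here because, in the example, they carry a start_time but no end_time.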
Parameter description
transcript
string The transcription of your audio file. Depending on the service, this transcript may or may not be complete. If it is not complete, it is an alternative result.
For example,
"transcript": "I use who oh. It makes things easier!"
"transcript": "I use Hooho"
start_time
number The beginning of the transcript, in seconds. The timestamps of the transcript are derived from the start_time and end_time of each item.
For example,
"start_time": 0.37
end_time
number The end of the transcript, in seconds. The timestamps of the transcript are derived from the start_time and end_time of each item.
For example,
"end_time": 49.63
confidence
number A confidence score for the transcription, between 0 and 1. The higher it is, the more reliable the transcription is.
For example,
"confidence": 0.96
"confidence": 0.12
phrases
array[object] A list of the recognized phrases in the transcript. Phrases are delimited by punctuation. Without punctuation, the whole transcript will likely be returned as a single phrase.
Each object of this array contains
content
string The content of the phrase.
"content": "I use Hooho."
"content": "I use Hooho it makes things easier"
type
string The type of this object ("phrase" in this case).
"type": "phrase"
start_time
number The beginning of the phrase, in seconds.
"start_time": 0.0
end_time
number The end of the phrase, in seconds.
"end_time": 2.51
confidence
number A confidence score for the phrase, between 0 and 1.
"confidence": 1
speaker_tags
array[integer] If you enabled speaker diarization, the list of the speaker tags assigned to the items of the phrase.
"speaker_tags": [0, 1]
items
array[object] The list of words in the phrase. The details are given below.
items
array[object] The list of words in a phrase. Words are delimited by spaces. A text without spaces would result in a single long word (but that should not happen).
Each object of this array contains
content
string The content of the item.
"content": "use"
"content": "."
type
string The type of this object ("word" or "punctuation").
"type": "word"
"type": "punctuation"
start_time
number The beginning of the item, in seconds.
"start_time": 1.12
end_time
number The end of the item, in seconds.
"end_time": 2.33
confidence
number A confidence score for the item, between 0 and 1.
"confidence": 0
speaker_tag
integer If you enabled speaker diarization, an integer identifying the speaker associated with the word.
"speaker_tag": 1
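To show how speaker_tag on items relates to speaker_tags on the phrase, here is a minimal sketch that groups the words of one phrase by speaker. The helper `words_by_speaker` and the sample phrase are hypothetical; the only assumption taken from this page is that each item carries a speaker_tag when speaker diarization is enabled.

```python
from collections import defaultdict


def words_by_speaker(phrase):
    """Map each speaker_tag to the words attributed to it in one phrase."""
    grouped = defaultdict(list)
    for item in phrase.get("items", []):
        # speaker_tag is only present when speaker diarization is enabled.
        tag = item.get("speaker_tag")
        if tag is not None and item.get("type") == "word":
            grouped[tag].append(item["content"])
    return dict(grouped)


# Hypothetical phrase with two speakers (illustrative values only).
phrase = {
    "content": "Hello hi there",
    "type": "phrase",
    "speaker_tags": [0, 1],
    "items": [
        {"content": "Hello", "type": "word", "speaker_tag": 0},
        {"content": "hi", "type": "word", "speaker_tag": 1},
        {"content": "there", "type": "word", "speaker_tag": 1},
    ],
}

grouped = words_by_speaker(phrase)
print(grouped)  # {0: ['Hello'], 1: ['hi', 'there']}
```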
wer_data
array[object] The results of the WER (Word Error Rate) calculation. A WER is computed between the transcription result and each reference transcript you provide.
Each object of this array contains
reference
string The content of the reference transcript.
wer
number The WER obtained with the transcript and the given reference, between 0 and 1. The lower it is, the closer the transcripts are.
"wer_data": [
{
"reference": "I use Hooho. It makes things easier!",
"wer": 0.2857142857142857
}
]
wer_mean
number The mean value of the WERs obtained with all the references.
For example,
"wer_mean": 0.2857142857142857
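For readers unfamiliar with the metric, the sketch below computes a WER using the usual definition: the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference. This is not Hooho's implementation. The whitespace tokenization, the case-sensitive comparison, and the function name `word_error_rate` are assumptions made for illustration; the service may normalize text differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace and compared exactly
    (case and attached punctuation included).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


hypothesis = "I use who oh. It makes things easier!"
references = ["I use Hooho. It makes things easier!"]

wers = [word_error_rate(ref, hypothesis) for ref in references]
wer_mean = sum(wers) / len(wers)
print(wers, wer_mean)  # [0.2857142857142857] 0.2857142857142857
```

With this reference, "Hooho." is replaced by "who" and "oh." is inserted, which gives 2 errors over 7 reference words.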
Occurrence conditions
Some fields must be enabled using the associated parameter in the "config" part of your request.
This table summarizes the possible output fields for each service and the parameter responsible for their appearance. (Note that this table was created for the en-US language, and it might vary from one language to another.)
Field | Amazon | Microsoft | Rev.ai | Speechmatics | |
---|---|---|---|---|---|
transcript | Always | Always | Always | Always | Always |
start_time | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
end_time | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
confidence | Never | Always | Never | Never | Always |
phrases | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
phrases {content} | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
phrases {type} | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
phrases {start_time} | Always | enable_word_time_offsets | enable_word_time_offsets | Always | Always |
phrases {end_time} | Always | enable_word_time_offsets | enable_word_time_offsets | Always | Always |
phrases {confidence} | Never | Never | Never | Never | Never |
phrases {speaker_tags} | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization |
phrases {items} | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
phrases {items {content}} | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
phrases {items {start_time}} | Always | enable_word_time_offsets | enable_word_time_offsets | Always | Always |
phrases {items {end_time}} | Always | enable_word_time_offsets | enable_word_time_offsets | Always | Always |
phrases {items {confidence}} | Always | Never | enable_word_time_offsets | Always | Always |
phrases {items {speaker_tag}} | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization |
wer_data | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts |
wer_data {reference} | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts |
wer_data {wer} | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts |
wer_mean | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts |
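Because most fields depend on the service or on a config parameter, client code should not assume they are present. The following sketch reads one result defensively; `summarize` and the keys of the dictionary it returns are hypothetical names chosen for illustration, and only the field names come from the table above.

```python
def summarize(result):
    """Collect the optional fields of one result, tolerating absent ones."""
    phrases = result.get("phrases", [])
    items = [item for phrase in phrases for item in phrase.get("items", [])]

    has_times = "start_time" in result and "end_time" in result
    return {
        # Listed as "Always" for every service in the table.
        "transcript": result["transcript"],
        # Service dependent, so it may be missing.
        "confidence": result.get("confidence"),
        "duration": result["end_time"] - result["start_time"] if has_times else None,
        "has_word_timings": any(
            "start_time" in item and "end_time" in item for item in items
        ),
        # Only present when speaker diarization is enabled.
        "has_speakers": any("speaker_tag" in item for item in items),
        # Only present when get_wer and transcripts are provided.
        "wer_mean": result.get("wer_mean"),
    }


minimal = {"transcript": "I use Hooho"}
summary = summarize(minimal)
print(summary)
```

For the minimal result above, every optional entry comes back as None or False instead of raising a KeyError.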