  1. General form
  2. Parameter description
  3. Occurrence conditions

General form

Each service may return different information for the same request, so the output of Hooho can vary from one service to another.

The general form of an output is:

[
	{
		"transcript": string,
		"start_time": number,
		"end_time": number,
		"confidence": number,
		"phrases": [
			{
				"content": string,
				"type": string,
				"start_time": number,
				"end_time": number,
				"confidence": number,
				"speaker_tags": array[integer],
				"items": [
					{
						"content": string,
						"type": string,
						"start_time": number,
						"end_time": number,
						"confidence": number,
						"speaker_tag": integer
					}
				]
			}
		],
		"wer_data": [
			{
				"reference": string,
				"wer": number
			}
		],
		"wer_mean": number
	}
]

For example,

[
  {
    "transcript": "I use who oh. It makes things easier!",
    "start_time": 0.0,
    "end_time": 5.38,
    "phrases": [
      {
        "content": "I use who oh.",
        "type": "phrase",
        "start_time": 0.0,
        "end_time": 2.51,
        "items": [
          {
            "content": "I",
            "type": "word",
            "start_time": 0,
            "end_time": 0.39,
            "confidence": 1
          },
          {
            "content": "use",
            "type": "word",
            "start_time": 0.39,
            "end_time": 1.26,
            "confidence": 1
          },
          {
            "content": "who",
            "type": "word",
            "start_time": 1.26,
            "end_time": 1.96,
            "confidence": 1
          },
          {
            "content": "oh",
            "type": "word",
            "start_time": 1.96,
            "end_time": 2.51,
            "confidence": 1
          },
          {
            "content": ".",
            "type": "punctuation",
            "start_time": 2.51,
            "confidence": 1
          }
        ]
      },
      {
        "content": "It makes things easier!",
        "type": "phrase",
        "start_time": 2.51,
        "end_time": 5.38,
        "items": [
          {
            "content": "It",
            "type": "word",
            "start_time": 2.51,
            "end_time": 3.12,
            "confidence": 1
          },
          {
            "content": "makes",
            "type": "word",
            "start_time": 3.12,
            "end_time": 3.81,
            "confidence": 1
          },
          {
            "content": "things",
            "type": "word",
            "start_time": 3.81,
            "end_time": 4.33,
            "confidence": 1
          },
          {
            "content": "easier",
            "type": "word",
            "start_time": 4.33,
            "end_time": 5.38,
            "confidence": 1
          },
          {
            "content": "!",
            "type": "punctuation",
            "start_time": 5.38,
            "confidence": 1
          }
        ]
      }
    ],
    "wer_data": [
      {
        "reference": "I use Hooho. It makes things easier!",
        "wer": 0.2857142857142857
      }
    ],
    "wer_mean": 0.2857142857142857
  }
]
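
Because several fields are optional, client code should not assume that every key is present. The sketch below shows one way to walk an output and collect the word items. It is only an illustration: the helper name iter_words and the shortened sample are assumptions, not part of Hooho.

```python
import json

# Shortened copy of the example output above, as a JSON string.
RAW = """
[{"transcript": "I use who oh.",
  "phrases": [{"content": "I use who oh.", "type": "phrase",
               "items": [
                 {"content": "I", "type": "word",
                  "start_time": 0, "end_time": 0.39},
                 {"content": ".", "type": "punctuation",
                  "start_time": 2.51}
               ]}]}]
"""


def iter_words(results):
    """Yield (content, start_time, end_time) for every item of type "word".

    Hypothetical helper: .get() is used throughout because a service may
    omit "phrases", "items" or the timestamps.
    """
    for result in results:
        for phrase in result.get("phrases", []):
            for item in phrase.get("items", []):
                if item.get("type") == "word":
                    yield (item.get("content"),
                           item.get("start_time"),
                           item.get("end_time"))


words = list(iter_words(json.loads(RAW)))
```

Punctuation items are skipped here, and a result without "phrases" simply yields nothing.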

Parameter description

transcript string

The transcription of your audio file. Depending on the service, this transcript may or may not be complete. If it is not complete, it should be an alternative result.

For example,

"transcript": "I use who oh. It makes things easier!"
"transcript": "I use Hooho"

start_time number

The beginning of the transcript, in seconds. The timestamps of the transcript are determined using the start_time and end_time of each item.

For example,

"start_time": 0.37

end_time number

The end of the transcript, in seconds. The timestamps of the transcript are determined using the start_time and end_time of each item.

For example,

"end_time": 49.63
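
As an illustration of how transcript-level timestamps relate to item timestamps, the sketch below takes the earliest start_time and the latest end_time over all items. This is an assumption about the relationship described above, not Hooho's actual implementation, and transcript_span is a hypothetical helper name.

```python
def transcript_span(result):
    """Return (start, end) in seconds from the items, or None if unavailable.

    Assumption: the transcript starts at the earliest item start_time and
    ends at the latest item end_time. Items without a timestamp (for
    example punctuation without "end_time") are ignored.
    """
    starts, ends = [], []
    for phrase in result.get("phrases", []):
        for item in phrase.get("items", []):
            if "start_time" in item:
                starts.append(item["start_time"])
            if "end_time" in item:
                ends.append(item["end_time"])
    if not starts or not ends:
        return None
    return min(starts), max(ends)


result = {
    "phrases": [
        {"items": [
            {"content": "I", "type": "word", "start_time": 0, "end_time": 0.39},
            {"content": ".", "type": "punctuation", "start_time": 2.51},
        ]},
        {"items": [
            {"content": "easier", "type": "word",
             "start_time": 4.33, "end_time": 5.38},
        ]},
    ]
}
span = transcript_span(result)
```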

confidence number

A confidence score for the transcription, between 0 and 1. The higher it is, the more reliable the transcription is.

For example,

"confidence": 0.96
"confidence": 0.12

phrases array[object]

A list of the recognized phrases in the transcript. Phrases are determined by punctuation. Without punctuation, the whole transcript will likely be returned as a single phrase.

Each object of this array contains

  • content string The content of the phrase.
"content": "I use Hooho."
"content": "I use Hooho it makes things easier"
  • type string The type of this object ("phrase" in this case).
"type": "phrase"
  • start_time number The beginning of the phrase, in seconds.
"start_time": 0.0
  • end_time number The end of the phrase, in seconds.
"end_time": 2.51
  • confidence number A confidence score for the phrase, between 0 and 1.
"confidence": 1
  • speaker_tags array[integer] If you enabled speaker diarization, a list of the speaker tags assigned to the phrase's items.
"speaker_tags": [0, 1]
  • items array[object] The list of words in the phrase. The details are given below.

items array[object]

The list of words in a phrase. Words are determined by spaces. Text without spaces would result in a single large word, although this should not happen.

Each object of this array contains

  • content string The content of the item.
"content": "use"
"content": "."
  • type string The type of this object ("word" or "punctuation").
"type": "word"
"type": "punctuation"
  • start_time number The beginning of the item, in seconds.
"start_time": 1.12
  • end_time number The end of the item, in seconds.
"end_time": 2.33
  • confidence number A confidence score for the item, between 0 and 1.
"confidence": 0
  • speaker_tag integer If you enabled speaker diarization, an integer representing the speaker associated with the word.
"speaker_tag": 1
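
The relationship between a phrase's speaker_tags and its items' speaker_tag values can be sketched as follows. The helper phrase_speaker_tags is hypothetical, and returning the tags sorted and without duplicates is an assumption; the ordering Hooho actually uses is not stated here.

```python
def phrase_speaker_tags(phrase):
    """Collect the distinct speaker_tag values found in a phrase's items.

    Assumption: tags are returned sorted and de-duplicated. Items without
    a "speaker_tag" key (diarization disabled) are skipped.
    """
    tags = {
        item["speaker_tag"]
        for item in phrase.get("items", [])
        if "speaker_tag" in item
    }
    return sorted(tags)


phrase = {
    "content": "I use Hooho.",
    "type": "phrase",
    "items": [
        {"content": "I", "type": "word", "speaker_tag": 1},
        {"content": "use", "type": "word", "speaker_tag": 0},
        {"content": "Hooho", "type": "word", "speaker_tag": 1},
        {"content": ".", "type": "punctuation"},
    ],
}
tags = phrase_speaker_tags(phrase)
```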

wer_data array[object]

Information about the WER (Word Error Rate) calculation. A WER is calculated between our result and each reference transcript.

Each object of this array contains

  • reference string The content of the reference transcript.
  • wer number The WER obtained with the transcript and the given reference, between 0 and 1. The lower it is, the closer the transcripts are.
"wer_data": [
  {
    "reference": "I use Hooho. It makes things easier!",
    "wer": 0.2857142857142857
  }
]

wer_mean number

The mean value of the WERs obtained with all the references.

For example,

"wer_mean": 0.2857142857142857
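
For intuition, the sketch below computes a WER as the word-level edit distance (substitutions, insertions and deletions) divided by the number of words in the reference, then averages over the references. It assumes plain whitespace tokenization with no case or punctuation normalization; the normalization Hooho applies is not documented here, so its values may differ. The function name word_error_rate is hypothetical. Note that this raw ratio can exceed 1 when the hypothesis contains many extra words.

```python
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace and compared verbatim.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


hypothesis = "I use who oh. It makes things easier!"
references = ["I use Hooho. It makes things easier!"]
wers = [word_error_rate(ref, hypothesis) for ref in references]
wer_mean = mean(wers)
```

With this reference, "Hooho." is replaced by "who" and "oh." is inserted, which gives 2 errors over 7 reference words.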

Occurrence conditions

Some fields must be enabled with the associated parameter in the "config" part of your request.

This table summarizes the possible output fields for each service and the parameter responsible for each field's appearance. (Note that this table was created for the en-US language and might vary from one language to another.)

| Field | Amazon | Google | Microsoft | Rev.ai | Speechmatics |
|---|---|---|---|---|---|
| transcript | Always | Always | Always | Always | Always |
| start_time | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
| end_time | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
| confidence | Never | Always | Never | Never | Always |
| phrases | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
| phrases {content} | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
| phrases {type} | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
| phrases {start_time} | Always | enable_word_time_offsets | enable_word_time_offsets | Always | Always |
| phrases {end_time} | Always | enable_word_time_offsets | enable_word_time_offsets | Always | Always |
| phrases {confidence} | Never | Never | Never | Never | Never |
| phrases {speaker_tags} | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization |
| phrases {items} | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
| phrases {items {content}} | Always | enable_word_time_offsets OR enable_speaker_diarization | enable_word_time_offsets OR enable_speaker_diarization | Always | Always |
| phrases {items {start_time}} | Always | enable_word_time_offsets | enable_word_time_offsets | Always | Always |
| phrases {items {end_time}} | Always | enable_word_time_offsets | enable_word_time_offsets | Always | Always |
| phrases {items {confidence}} | Always | Never | enable_word_time_offsets | Always | Always |
| phrases {items {speaker_tag}} | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization | enable_speaker_diarization |
| wer_data | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts |
| wer_data {reference} | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts |
| wer_data {wer} | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts |
| wer_mean | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts | get_wer + transcripts |
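
A client can encode rows of this table to decide which fields to expect before reading a response. The sketch below covers only the top-level start_time and end_time rows. The lowercase service identifiers, the helper name and the assumption that the config parameters are booleans are all hypothetical; only the parameter names enable_word_time_offsets and enable_speaker_diarization come from the table.

```python
# Services that always return top-level start_time / end_time (per the table).
ALWAYS_TIMED = {"amazon", "rev.ai", "speechmatics"}
# Services that need one of the two config parameters to be enabled.
FLAG_TIMED = {"google", "microsoft"}


def expects_transcript_times(service: str, config: dict) -> bool:
    """Tell whether start_time and end_time should appear in the output.

    Hypothetical helper: service names are matched case-insensitively and
    config values are assumed to be booleans.
    """
    name = service.lower()
    if name in ALWAYS_TIMED:
        return True
    if name in FLAG_TIMED:
        return bool(config.get("enable_word_time_offsets")
                    or config.get("enable_speaker_diarization"))
    raise ValueError(f"unknown service: {service!r}")


google_default = expects_transcript_times("Google", {})
google_offsets = expects_transcript_times(
    "Google", {"enable_word_time_offsets": True})
amazon_default = expects_transcript_times("Amazon", {})
```

The same pattern extends to the other rows, although the table applies to en-US and may not hold for other languages.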