.TH "GCLOUD_ALPHA_ML_SPEECH_RECOGNIZE" 1



.SH "NAME"
.HP
gcloud alpha ml speech recognize \- get transcripts of short (less\ than\ 60\ seconds) audio from an audio file



.SH "SYNOPSIS"
.HP
\f5gcloud alpha ml speech recognize\fR \fIAUDIO\fR \fB\-\-language\-code\fR=\fILANGUAGE_CODE\fR [\fB\-\-additional\-language\-codes\fR=[\fILANGUAGE_CODE\fR,...]] [\fB\-\-enable\-automatic\-punctuation\fR] [\fB\-\-encoding\fR=\fIENCODING\fR;\ default="encoding\-unspecified"] [\fB\-\-filter\-profanity\fR] [\fB\-\-hints\fR=[\fIHINT\fR,...]] [\fB\-\-include\-word\-confidence\fR] [\fB\-\-include\-word\-time\-offsets\fR] [\fB\-\-max\-alternatives\fR=\fIMAX_ALTERNATIVES\fR;\ default=1] [\fB\-\-model\fR=\fIMODEL\fR] [\fB\-\-sample\-rate\fR=\fISAMPLE_RATE\fR] [\fB\-\-audio\-channel\-count\fR=\fIAUDIO_CHANNEL_COUNT\fR\ \fB\-\-separate\-channel\-recognition\fR] [\fB\-\-audio\-topic\fR=\fIAUDIO_TOPIC\fR\ \fB\-\-interaction\-type\fR=\fIINTERACTION_TYPE\fR\ \fB\-\-microphone\-distance\fR=\fIMICROPHONE_DISTANCE\fR\ \fB\-\-naics\-code\fR=\fINAICS_CODE\fR\ \fB\-\-original\-media\-type\fR=\fIORIGINAL_MEDIA_TYPE\fR\ \fB\-\-original\-mime\-type\fR=\fIORIGINAL_MIME_TYPE\fR\ \fB\-\-recording\-device\-name\fR=\fIRECORDING_DEVICE_NAME\fR\ \fB\-\-recording\-device\-type\fR=\fIRECORDING_DEVICE_TYPE\fR] [\fB\-\-enable\-speaker\-diarization\fR\ :\ \fB\-\-max\-diarization\-speaker\-count\fR=\fIMAX_DIARIZATION_SPEAKER_COUNT\fR\ \fB\-\-min\-diarization\-speaker\-count\fR=\fIMIN_DIARIZATION_SPEAKER_COUNT\fR] [\fIGCLOUD_WIDE_FLAG\ ...\fR]



.SH "DESCRIPTION"

\fB(ALPHA)\fR Get a transcript of an audio file that is less than 60 seconds
long. You can use an audio file that is on your local drive or a Google Cloud
Storage URL.

If the audio is longer than 60 seconds, the command returns an error. Use
\f5gcloud alpha ml speech recognize\-long\-running\fR instead.



.SH "EXAMPLES"

To get a transcript of an audio file 'my\-recording.wav':

.RS 2m
$ gcloud alpha ml speech recognize 'my\-recording.wav' \e
  \-\-language\-code=en\-US
.RE

To get a transcript of the audio file at 'gs://bucket/myaudio', specifying a
custom sample rate and encoding, supplying a hint, and filtering profanity:

.RS 2m
$ gcloud alpha ml speech recognize 'gs://bucket/myaudio' \e
  \-\-language\-code=es\-ES \-\-sample\-rate=2200 \-\-hints=Bueno \e
  \-\-encoding=OGG_OPUS \-\-filter\-profanity
.RE



.SH "POSITIONAL ARGUMENTS"

.RS 2m
.TP 2m
\fIAUDIO\fR

The location of the audio file to transcribe. Must be a local path or a Google
Cloud Storage URL (in the format gs://bucket/object).


.RE
.sp

.SH "REQUIRED FLAGS"

.RS 2m
.TP 2m
\fB\-\-language\-code\fR=\fILANGUAGE_CODE\fR

The language of the supplied audio as a BCP\-47
(https://www.rfc\-editor.org/rfc/bcp/bcp47.txt) language tag. Example: "en\-US".
See https://cloud.google.com/speech/docs/languages for a list of the currently
supported language codes.


.RE
.sp

.SH "OPTIONAL FLAGS"

.RS 2m
.TP 2m
\fB\-\-additional\-language\-codes\fR=[\fILANGUAGE_CODE\fR,...]

The BCP\-47 language tags of other languages that the speech may be in. Up to 3
can be provided.

If additional languages are listed, the recognition result is returned in the
most likely language detected among them and the main language code.

.TP 2m
\fB\-\-enable\-automatic\-punctuation\fR

Adds punctuation to recognition result hypotheses.

.TP 2m
\fB\-\-encoding\fR=\fIENCODING\fR; default="encoding\-unspecified"

The type of encoding of the file. Required if the file format is not WAV or
FLAC. \fIENCODING\fR must be one of: \fBalaw\fR, \fBamr\fR, \fBamr\-wb\fR,
\fBencoding\-unspecified\fR, \fBflac\fR, \fBlinear16\fR, \fBmp3\fR, \fBmulaw\fR,
\fBogg\-opus\fR, \fBspeex\-with\-header\-byte\fR, \fBwebm\-opus\fR.

.TP 2m
\fB\-\-filter\-profanity\fR

If True, the server will attempt to filter out profanities, replacing all but
the initial character in each filtered word with asterisks, e.g. \f5f***\fR.

.TP 2m
\fB\-\-hints\fR=[\fIHINT\fR,...]

A list of strings containing word and phrase "hints" so that the speech
recognition is more likely to recognize them. This can be used to improve the
accuracy for specific words and phrases, for example, if specific commands are
typically spoken by the user. This can also be used to add additional words to
the vocabulary of the recognizer. See
https://cloud.google.com/speech/limits#content.

.TP 2m
\fB\-\-include\-word\-confidence\fR

Include a list of words and the confidence for those words in the top result.

.TP 2m
\fB\-\-include\-word\-time\-offsets\fR

If True, the top result includes a list of words with the start and end time
offsets (timestamps) for those words. If False, no word\-level time offset
information is returned.

.TP 2m
\fB\-\-max\-alternatives\fR=\fIMAX_ALTERNATIVES\fR; default=1

Maximum number of recognition hypotheses to be returned. The server may return
fewer than max_alternatives. Valid values are 0\-30. A value of 0 or 1 will
return a maximum of one.

.TP 2m
\fB\-\-model\fR=\fIMODEL\fR

Select the model best suited to your domain to get best results. If you do not
explicitly specify a model, Speech\-to\-Text will auto\-select a model based on
your other specified parameters. Some models are premium and cost more than
standard models, although you can reduce the price by opting into data logging
(https://cloud.google.com/speech\-to\-text/docs/data\-logging). \fIMODEL\fR must
be one of:

.RS 2m
.TP 2m
\fBcommand_and_search\fR
Short queries such as voice commands or voice search.
.TP 2m
\fBdefault\fR
Audio that does not fit one of the specific audio models. For example,
long\-form audio. Ideally the audio is high\-fidelity, recorded at a 16kHz or
greater sampling rate.
.TP 2m
\fBlatest_long\fR
Use this model for any kind of long form content such as media or spontaneous
speech and conversations. Consider using this model in place of the video model,
especially if the video model is not available in your target language. You can
also use this in place of the default model.
.TP 2m
\fBlatest_short\fR
Use this model for short utterances that are a few seconds in length. It is
useful for trying to capture commands or other single shot directed speech use
cases. Consider using this model instead of the command and search model.
.TP 2m
\fBmedical_conversation\fR
Best for audio that originated from a conversation between a medical provider
and patient.
.TP 2m
\fBmedical_dictation\fR
Best for audio that originated from dictation notes by a medical provider.
.TP 2m
\fBphone_call\fR
Audio that originated from a phone call (typically recorded at an 8kHz sampling
rate).
.TP 2m
\fBphone_call_enhanced\fR
Audio that originated from a phone call (typically recorded at an 8kHz sampling
rate). This is a premium model and can produce better results but costs more
than the standard rate.
.TP 2m
\fBtelephony\fR
Improved version of the "phone_call" model, best for audio that originated from
a phone call, typically recorded at an 8kHz sampling rate.
.TP 2m
\fBtelephony_short\fR
Dedicated version of the modern "telephony" model for short or even single\-word
utterances for audio that originated from a phone call, typically recorded at an
8kHz sampling rate.
.TP 2m
\fBvideo_enhanced\fR
Audio that originated from video or includes multiple speakers. Ideally the
audio is recorded at a 16kHz or greater sampling rate. This is a premium model
that costs more than the standard rate.
.RE
.sp


.TP 2m
\fB\-\-sample\-rate\fR=\fISAMPLE_RATE\fR

The sample rate in Hertz. For best results, set the sampling rate of the audio
source to 16000 Hz. If that's not possible, use the native sample rate of the
audio source (instead of re\-sampling).

.TP 2m

Audio channel settings.


.RS 2m
.TP 2m
\fB\-\-audio\-channel\-count\fR=\fIAUDIO_CHANNEL_COUNT\fR

The number of channels in the input audio data. Set this for
separate\-channel\-recognition. Valid values depend on the encoding: 1\-8 for
LINEAR16 and FLAC; 1\-254 for OGG_OPUS; only \f51\fR for MULAW, AMR, AMR_WB and
SPEEX_WITH_HEADER_BYTE.

This flag argument must be specified if any of the other arguments in this group
are specified.

.TP 2m
\fB\-\-separate\-channel\-recognition\fR

The recognition result will contain a \f5channel_tag\fR field stating which
channel that result belongs to. If this flag is not set, only the first channel
will be recognized.

This flag argument must be specified if any of the other arguments in this group
are specified.

.RE
.sp
.TP 2m

Description of audio data to be recognized. Note that the Google Cloud
Speech\-to\-Text API does not use this information; it only passes it back in
the response.


.RS 2m
.TP 2m
\fB\-\-audio\-topic\fR=\fIAUDIO_TOPIC\fR

(DEPRECATED) Description of the content, e.g. "Recordings of federal supreme
court hearings from 2012".

The \f5audio\-topic\fR flag is deprecated and will be removed. The Google Cloud
Speech\-to\-Text API does not use it; it only passes it back in the response.

.TP 2m
\fB\-\-interaction\-type\fR=\fIINTERACTION_TYPE\fR

(DEPRECATED) The interaction type of the conversation.

The \f5interaction\-type\fR flag is deprecated and will be removed. The Google
Cloud Speech\-to\-Text API does not use it; it only passes it back in the
response. \fIINTERACTION_TYPE\fR must be one of:

.RS 2m
.TP 2m
\fBdictation\fR
Transcribe speech to text to create a written document, such as a text\-message,
email or report.
.TP 2m
\fBdiscussion\fR
Multiple people in a conversation or discussion.
.TP 2m
\fBphone\-call\fR
A phone\-call or video\-conference in which two or more people, who are not in
the same room, are actively participating.
.TP 2m
\fBpresentation\fR
One or more persons lecturing or presenting to others, mostly uninterrupted.
.TP 2m
\fBprofessionally\-produced\fR
Professionally produced audio (e.g. TV show, podcast).
.TP 2m
\fBvoicemail\fR
A recorded message intended for another person to listen to.
.TP 2m
\fBvoice\-command\fR
Transcribe voice commands, such as for controlling a device.
.TP 2m
\fBvoice\-search\fR
Transcribe spoken questions and queries into text.
.RE
.sp


.TP 2m
\fB\-\-microphone\-distance\fR=\fIMICROPHONE_DISTANCE\fR

(DEPRECATED) The distance at which the audio device is placed to record the
conversation.

The \f5microphone\-distance\fR flag is deprecated and will be removed. The
Google Cloud Speech\-to\-Text API does not use it; it only passes it back in the
response. \fIMICROPHONE_DISTANCE\fR must be one of:

.RS 2m
.TP 2m
\fBfarfield\fR
The speaker is more than 3 meters away from the microphone.
.TP 2m
\fBmidfield\fR
The speaker is within 3 meters of the microphone.
.TP 2m
\fBnearfield\fR
The audio was captured from a microphone close to the speaker, generally within
1 meter. Examples include a phone, dictaphone, or handheld microphone.
.RE
.sp


.TP 2m
\fB\-\-naics\-code\fR=\fINAICS_CODE\fR

(DEPRECATED) The industry vertical to which this speech recognition request most
closely applies.

The \f5naics\-code\fR flag is deprecated and will be removed. The Google Cloud
Speech\-to\-Text API does not use it; it only passes it back in the response.

.TP 2m
\fB\-\-original\-media\-type\fR=\fIORIGINAL_MEDIA_TYPE\fR

(DEPRECATED) The media type of the original audio conversation.

The \f5original\-media\-type\fR flag is deprecated and will be removed. The
Google Cloud Speech\-to\-Text API does not use it; it only passes it back in the
response. \fIORIGINAL_MEDIA_TYPE\fR must be one of:

.RS 2m
.TP 2m
\fBaudio\fR
The speech data is an audio recording.
.TP 2m
\fBvideo\fR
The speech data was originally recorded on a video.
.RE
.sp


.TP 2m
\fB\-\-original\-mime\-type\fR=\fIORIGINAL_MIME_TYPE\fR

(DEPRECATED) Mime type of the original audio file. Examples: \f5audio/m4a\fR,
\f5audio/mp3\fR.

The \f5original\-mime\-type\fR flag is deprecated and will be removed. The
Google Cloud Speech\-to\-Text API does not use it; it only passes it back in the
response.

.TP 2m
\fB\-\-recording\-device\-name\fR=\fIRECORDING_DEVICE_NAME\fR

(DEPRECATED) The device used to make the recording. Examples: \f5Nexus 5X\fR,
\f5Polycom SoundStation IP 6000\fR.

The \f5recording\-device\-name\fR flag is deprecated and will be removed. The
Google Cloud Speech\-to\-Text API does not use it; it only passes it back in the
response.

.TP 2m
\fB\-\-recording\-device\-type\fR=\fIRECORDING_DEVICE_TYPE\fR

(DEPRECATED) The type of device on which the original audio was recorded.

The \f5recording\-device\-type\fR flag is deprecated and will be removed. The
Google Cloud Speech\-to\-Text API does not use it; it only passes it back in the
response. \fIRECORDING_DEVICE_TYPE\fR must be one of:

.RS 2m
.TP 2m
\fBindoor\fR
Speech was recorded indoors.
.TP 2m
\fBoutdoor\fR
Speech was recorded outdoors.
.TP 2m
\fBpc\fR
Speech was recorded using a personal computer or tablet.
.TP 2m
\fBphone\-line\fR
Speech was recorded over a phone line.
.TP 2m
\fBsmartphone\fR
Speech was recorded on a smartphone.
.TP 2m
\fBvehicle\fR
Speech was recorded in a vehicle.
.RE
.sp


.RE
.sp
.TP 2m
\fB\-\-enable\-speaker\-diarization\fR

Enable speaker detection for each recognized word in the top alternative of the
recognition result using an integer speaker_tag provided in the WordInfo.

.TP 2m
\fB\-\-max\-diarization\-speaker\-count\fR=\fIMAX_DIARIZATION_SPEAKER_COUNT\fR

Maximum estimated number of speakers in the conversation being recognized.

.TP 2m
\fB\-\-min\-diarization\-speaker\-count\fR=\fIMIN_DIARIZATION_SPEAKER_COUNT\fR

Minimum estimated number of speakers in the conversation being recognized.


.RE
.sp
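
The channel\-count ranges listed under \fB\-\-audio\-channel\-count\fR above can
be checked locally before a request is sent. The following is a minimal sketch
and is not part of gcloud: the function name \f5valid_channel_count\fR is
hypothetical, and it assumes the encoding is written either in the flag form
(for example \f5ogg\-opus\fR) or in the upper\-case form (\f5OGG_OPUS\fR).
Rejecting encodings for which no range is documented is also an assumption.

.RS 2m
.nf
```shell
# Hypothetical helper, not part of gcloud: succeeds (exit status 0) when
# COUNT is an allowed channel count for ENCODING, using the ranges listed
# under --audio-channel-count.
valid_channel_count() {
  # Normalize OGG_OPUS-style names to the ogg-opus flag form.
  encoding=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr '_' '-')
  count=$2
  # Reject anything that is not a plain decimal integer.
  case $count in
    ''|*[!0-9]*) return 1 ;;
  esac
  case $encoding in
    linear16|flac) max=8 ;;
    ogg-opus) max=254 ;;
    mulaw|amr|amr-wb|speex-with-header-byte) max=1 ;;
    *) return 1 ;;  # assumption: no documented range, so reject
  esac
  [ "$count" -ge 1 ] && [ "$count" -le "$max" ]
}
```
.fi
.RE

Because the function succeeds only for an allowed combination, it can guard a
call such as the following (the file name is a placeholder):

.RS 2m
$ valid_channel_count flac 2 && gcloud alpha ml speech recognize \e
  'my\-recording.flac' \-\-language\-code=en\-US \e
  \-\-audio\-channel\-count=2 \-\-separate\-channel\-recognition
.RE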

.SH "GCLOUD WIDE FLAGS"

These flags are available to all commands: \-\-access\-token\-file, \-\-account,
\-\-billing\-project, \-\-configuration, \-\-flags\-file, \-\-flatten,
\-\-format, \-\-help, \-\-impersonate\-service\-account, \-\-log\-http,
\-\-project, \-\-quiet, \-\-trace\-token, \-\-user\-output\-enabled,
\-\-verbosity.

Run \fB$ gcloud help\fR for details.



.SH "API REFERENCE"

This command uses the speech/v1p1beta1 API. The full documentation for this API
can be found at:
https://cloud.google.com/speech\-to\-text/docs/quickstart\-protocol



.SH "NOTES"

This command is currently in alpha and might change without notice. If this
command fails with API permission errors despite specifying the correct project,
you might be trying to access an API with an invitation\-only early access
allowlist. These variants are also available:

.RS 2m
$ gcloud ml speech recognize
.RE

.RS 2m
$ gcloud beta ml speech recognize
.RE