.TH "GCLOUD_AI\-PLATFORM_JOBS_SUBMIT_TRAINING" 1
.SH "NAME"
.HP
gcloud ai\-platform jobs submit training \- submit an AI Platform training job
.SH "SYNOPSIS"
.HP
\f5gcloud ai\-platform jobs submit training\fR \fIJOB\fR [\fB\-\-config\fR=\fICONFIG\fR] [\fB\-\-enable\-web\-access\fR] [\fB\-\-job\-dir\fR=\fIJOB_DIR\fR] [\fB\-\-labels\fR=[\fIKEY\fR=\fIVALUE\fR,...]] [\fB\-\-master\-accelerator\fR=[\fIcount\fR=\fICOUNT\fR],[\fItype\fR=\fITYPE\fR]] [\fB\-\-master\-image\-uri\fR=\fIMASTER_IMAGE_URI\fR] [\fB\-\-master\-machine\-type\fR=\fIMASTER_MACHINE_TYPE\fR] [\fB\-\-module\-name\fR=\fIMODULE_NAME\fR] [\fB\-\-package\-path\fR=\fIPACKAGE_PATH\fR] [\fB\-\-packages\fR=[\fIPACKAGE\fR,...]] [\fB\-\-parameter\-server\-accelerator\fR=[\fIcount\fR=\fICOUNT\fR],[\fItype\fR=\fITYPE\fR]] [\fB\-\-parameter\-server\-image\-uri\fR=\fIPARAMETER_SERVER_IMAGE_URI\fR] [\fB\-\-python\-version\fR=\fIPYTHON_VERSION\fR] [\fB\-\-region\fR=\fIREGION\fR] [\fB\-\-runtime\-version\fR=\fIRUNTIME_VERSION\fR] [\fB\-\-scale\-tier\fR=\fISCALE_TIER\fR] [\fB\-\-service\-account\fR=\fISERVICE_ACCOUNT\fR] [\fB\-\-staging\-bucket\fR=\fISTAGING_BUCKET\fR] [\fB\-\-use\-chief\-in\-tf\-config\fR=\fIUSE_CHIEF_IN_TF_CONFIG\fR] [\fB\-\-worker\-accelerator\fR=[\fIcount\fR=\fICOUNT\fR],[\fItype\fR=\fITYPE\fR]] [\fB\-\-worker\-image\-uri\fR=\fIWORKER_IMAGE_URI\fR] [\fB\-\-async\fR\ |\ \fB\-\-stream\-logs\fR] [\fB\-\-kms\-key\fR=\fIKMS_KEY\fR\ :\ \fB\-\-kms\-keyring\fR=\fIKMS_KEYRING\fR\ \fB\-\-kms\-location\fR=\fIKMS_LOCATION\fR\ \fB\-\-kms\-project\fR=\fIKMS_PROJECT\fR] [\fB\-\-parameter\-server\-count\fR=\fIPARAMETER_SERVER_COUNT\fR\ \fB\-\-parameter\-server\-machine\-type\fR=\fIPARAMETER_SERVER_MACHINE_TYPE\fR] [\fB\-\-worker\-count\fR=\fIWORKER_COUNT\fR\ \fB\-\-worker\-machine\-type\fR=\fIWORKER_MACHINE_TYPE\fR] [\fIGCLOUD_WIDE_FLAG\ ...\fR] [\-\-\ \fIUSER_ARGS\fR\ ...]
.SH "DESCRIPTION"
Submit an AI Platform training job.
This creates temporary files and executes Python code staged by a user on Cloud
Storage. Model code can either be specified with a path, e.g.:
.RS 2m
$ gcloud ai\-platform jobs submit training my_job \e
\-\-module\-name trainer.task \e
\-\-staging\-bucket gs://my\-bucket \e
\-\-package\-path /my/code/path/trainer \e
\-\-packages additional\-dep1.tar.gz,dep2.whl
.RE
Or by specifying an already built package:
.RS 2m
$ gcloud ai\-platform jobs submit training my_job \e
\-\-module\-name trainer.task \e
\-\-staging\-bucket gs://my\-bucket \e
\-\-packages trainer\-0.0.1.tar.gz,additional\-dep1.tar.gz,dep2.whl
.RE
If \f5\-\-package\-path=/my/code/path/trainer\fR is specified and there is a
\f5setup.py\fR file at \f5/my/code/path/setup.py\fR, the setup file will be
invoked with \f5sdist\fR and the generated tar files will be uploaded to Cloud
Storage. Otherwise, a temporary \f5setup.py\fR file will be generated for the
build.
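The lookup described above can be checked locally before submitting. The
following is a minimal sketch, not the logic gcloud itself runs: the function
name \f5setup_py_for\fR is hypothetical, and it only tests whether a
\f5setup.py\fR file exists in the parent directory of the package path.

```shell
# Print the setup.py that sits next to the package directory, if any.
# Returns non-zero when no such file exists, in which case the build
# falls back to a generated setup.py.
setup_py_for() {
  candidate="$(dirname "$1")/setup.py"
  if [ -f "$candidate" ]; then
    printf '%s\n' "$candidate"
  else
    return 1
  fi
}

# Example with the path used in this section.
if found=$(setup_py_for /my/code/path/trainer); then
  echo "build will use $found"
else
  echo "no setup.py found; a temporary one will be generated"
fi
```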
By default, this command runs asynchronously; it exits once the job is
successfully submitted.
To follow the progress of your job, pass the \f5\-\-stream\-logs\fR flag. Even
with \f5\-\-stream\-logs\fR, the job keeps running if this command is halted;
to stop the job, run \f5gcloud ai\-platform jobs cancel JOB_ID\fR.
For more information, see:
https://cloud.google.com/ai\-platform/training/docs/overview
.SH "POSITIONAL ARGUMENTS"
.RS 2m
.TP 2m
\fIJOB\fR
Name of the job.
.TP 2m
[\-\- \fIUSER_ARGS\fR ...]
Additional user arguments to be forwarded to user code.
The '\-\-' argument must be specified between gcloud\-specific arguments on the
left and USER_ARGS on the right.
.RE
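As a sketch of how the '\-\-' separator splits the command line, the following
builds a submission whose trailing arguments are forwarded to the training
module. The trainer flags \f5\-\-epochs\fR and \f5\-\-learning\-rate\fR are
hypothetical; your own code defines which arguments it accepts. The command is
printed for review rather than executed.

```shell
JOB="my_job"
BUCKET="gs://my-bucket"

# Everything before "--" is parsed by gcloud; everything after it is
# forwarded unchanged to the trainer.task module.
set -- gcloud ai-platform jobs submit training "$JOB" \
  --module-name trainer.task \
  --staging-bucket "$BUCKET" \
  --package-path /my/code/path/trainer \
  -- \
  --epochs 10 --learning-rate 0.01

# Print the command for review; run "$@" to submit it.
printf '%s\n' "$*"
```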
.sp
.SH "FLAGS"
.RS 2m
.TP 2m
\fB\-\-config\fR=\fICONFIG\fR
Path to the job configuration file. This file should be a YAML document (JSON
also accepted) containing a Job resource as defined in the API (all fields are
optional): https://cloud.google.com/ml/reference/rest/v1/projects.jobs
EXAMPLES:
JSON:
.RS 2m
{
"jobId": "my_job",
"labels": {
"type": "prod",
"owner": "alice"
},
"trainingInput": {
"scaleTier": "BASIC",
"packageUris": [
"gs://my/package/path"
],
"region": "us\-east1"
}
}
.RE
YAML:
.RS 2m
jobId: my_job
labels:
type: prod
owner: alice
trainingInput:
scaleTier: BASIC
packageUris:
\- gs://my/package/path
region: us\-east1
.RE
If an option is specified both in the configuration file \fBand\fR via command
line arguments, the command line arguments override the configuration file.
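As a sketch of that override rule, the following writes a small YAML
configuration to a temporary file and builds a command whose \f5\-\-region\fR
flag takes precedence over the \f5region\fR value in the file. The job name,
module name, region values, runtime version, and the temporary directory are
illustrative assumptions. The command is printed for review rather than
executed.

```shell
# Write a job configuration using fields from the examples above.
config_dir=$(mktemp -d)
config_file="$config_dir/config.yaml"
cat > "$config_file" <<'EOF'
labels:
  type: prod
  owner: alice
trainingInput:
  scaleTier: BASIC
  region: us-east1
EOF

# --region on the command line overrides trainingInput.region in the file.
set -- gcloud ai-platform jobs submit training my_job \
  --config "$config_file" \
  --region us-central1 \
  --module-name trainer.task \
  --packages gs://my/package/path \
  --runtime-version 1.15

# Print the command for review; run "$@" to submit it.
printf '%s\n' "$*"
```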
.TP 2m
\fB\-\-enable\-web\-access\fR
Whether you want AI Platform Training to enable interactive shell access
(https://cloud.google.com/ai\-platform/training/docs/monitor\-debug\-interactive\-shell)
to training containers. If set to \f5true\fR, you can access interactive shells
at the URIs given by TrainingOutput.web_access_uris or
HyperparameterOutput.web_access_uris (within TrainingOutput.trials).
.TP 2m
\fB\-\-job\-dir\fR=\fIJOB_DIR\fR
Cloud Storage path in which to store training outputs and other data needed for
training.
This path will be passed to your TensorFlow program as the \f5\-\-job\-dir\fR
command\-line arg. The benefit of specifying this field is that AI Platform will
validate the path for use in training. However, note that your training program
will need to parse the provided \f5\-\-job\-dir\fR argument.
If packages must be uploaded and \f5\-\-staging\-bucket\fR is not provided, this
path will be used instead.
.TP 2m
\fB\-\-labels\fR=[\fIKEY\fR=\fIVALUE\fR,...]
List of label KEY=VALUE pairs to add.
Keys must start with a lowercase character and contain only hyphens (\f5\-\fR),
underscores (\f5_\fR), lowercase characters, and numbers. Values must contain
only hyphens (\f5\-\fR), underscores (\f5_\fR), lowercase characters, and
numbers.
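The key and value rules above can be checked before submitting. This is a
minimal sketch with hypothetical function names. It assumes ASCII\-only labels,
so "lowercase characters" is approximated by \f5a\-z\fR, and it does not
enforce any length limits.

```shell
# Succeeds when the key starts with a lowercase letter and otherwise
# contains only lowercase letters, digits, underscores, and hyphens.
valid_label_key() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z][a-z0-9_-]*$'
}

# Succeeds when the value contains only lowercase letters, digits,
# underscores, and hyphens.
valid_label_value() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9_-]*$'
}

# Check each KEY=VALUE pair before passing the list to --labels.
for pair in type=prod owner=alice; do
  key=${pair%%=*}
  value=${pair#*=}
  if valid_label_key "$key" && valid_label_value "$value"; then
    echo "ok: $pair"
  else
    echo "invalid: $pair"
  fi
done
```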
.TP 2m
\fB\-\-master\-accelerator\fR=[\fIcount\fR=\fICOUNT\fR],[\fItype\fR=\fITYPE\fR]
Hardware accelerator config for the master worker. Must specify both the
accelerator type (TYPE) for each server and the number of accelerators to attach
to each server (COUNT).
.RS 2m
.TP 2m
\fBtype\fR
Type of the accelerator. Choices are
nvidia\-tesla\-a100,nvidia\-tesla\-k80,nvidia\-tesla\-p100,nvidia\-tesla\-p4,nvidia\-tesla\-t4,nvidia\-tesla\-v100,tpu\-v2,tpu\-v2\-pod,tpu\-v3,tpu\-v3\-pod,tpu\-v4\-pod
.TP 2m
\fBcount\fR
Number of accelerators to attach to each machine running the job. Must be
greater than 0.
.RE
.sp
.TP 2m
\fB\-\-master\-image\-uri\fR=\fIMASTER_IMAGE_URI\fR
Docker image to run on each master worker. This image must be in Container
Registry. Exactly one of \f5\-\-master\-image\-uri\fR and
\f5\-\-runtime\-version\fR must be specified.
.TP 2m
\fB\-\-master\-machine\-type\fR=\fIMASTER_MACHINE_TYPE\fR
Specifies the type of virtual machine to use for the training job's master
worker. You must set this value when \f5\-\-scale\-tier\fR is set to
\f5CUSTOM\fR.
.TP 2m
\fB\-\-module\-name\fR=\fIMODULE_NAME\fR
Name of the module to run.
.TP 2m
\fB\-\-package\-path\fR=\fIPACKAGE_PATH\fR
Path to a Python package to build. This should point to a \fBlocal\fR directory
containing the Python source for the job. It will be built using
\fBsetuptools\fR (which must be installed) using its \fBparent\fR directory as
context. If the parent directory contains a \f5setup.py\fR file, the build will
use that; otherwise, it will use a simple built\-in one.
.TP 2m
\fB\-\-packages\fR=[\fIPACKAGE\fR,...]
Path to Python archives used for training. These can be local paths (absolute or
relative), in which case they will be uploaded to the Cloud Storage bucket given
by \f5\-\-staging\-bucket\fR, or Cloud Storage URLs
('gs://bucket\-name/path/to/package.tar.gz').
.TP 2m
\fB\-\-parameter\-server\-accelerator\fR=[\fIcount\fR=\fICOUNT\fR],[\fItype\fR=\fITYPE\fR]
Hardware accelerator config for the parameter servers. Must specify both the
accelerator type (TYPE) for each server and the number of accelerators to attach
to each server (COUNT).
.RS 2m
.TP 2m
\fBtype\fR
Type of the accelerator. Choices are
nvidia\-tesla\-a100,nvidia\-tesla\-k80,nvidia\-tesla\-p100,nvidia\-tesla\-p4,nvidia\-tesla\-t4,nvidia\-tesla\-v100,tpu\-v2,tpu\-v2\-pod,tpu\-v3,tpu\-v3\-pod,tpu\-v4\-pod
.TP 2m
\fBcount\fR
Number of accelerators to attach to each machine running the job. Must be
greater than 0.
.RE
.sp
.TP 2m
\fB\-\-parameter\-server\-image\-uri\fR=\fIPARAMETER_SERVER_IMAGE_URI\fR
Docker image to run on each parameter server. This image must be in Container
Registry. If not specified, the value of \f5\-\-master\-image\-uri\fR is used.
.TP 2m
\fB\-\-python\-version\fR=\fIPYTHON_VERSION\fR
Version of Python used during training. Choices are 3.7, 3.5, and 2.7. The value
must be compatible with the chosen runtime version for the job:
.RS 2m
.IP "\(em" 2m
3.7 is compatible with runtime versions 1.15 and later.
.IP "\(em" 2m
3.5 is compatible with runtime versions 1.4 through 1.14.
.IP "\(em" 2m
2.7 is compatible with runtime versions 1.15 and earlier.
.RE
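The compatibility list above can be expressed as a small pre\-submission check.
This is a sketch with hypothetical function names; it assumes runtime versions
are written as MAJOR.MINOR without leading zeros.

```shell
# Encode MAJOR.MINOR as one integer so versions compare numerically
# (1.4 becomes 1004, 1.15 becomes 1015).
version_key() {
  echo $(( ${1%%.*} * 1000 + ${1#*.} ))
}

# Succeeds when the Python version ($1) is compatible with the
# runtime version ($2) according to the list above.
python_runtime_compatible() {
  py=$1
  rt=$(version_key "$2")
  case $py in
    3.7) [ "$rt" -ge 1015 ] ;;
    3.5) [ "$rt" -ge 1004 ] && [ "$rt" -le 1014 ] ;;
    2.7) [ "$rt" -le 1015 ] ;;
    *) return 1 ;;
  esac
}

if python_runtime_compatible 3.7 1.15; then
  echo "3.7 can be used with runtime 1.15"
fi
```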
.sp
.TP 2m
\fB\-\-region\fR=\fIREGION\fR
Region of the machine learning training job to submit. If not specified, you
might be prompted to select a region (interactive mode only).
To avoid prompting when this flag is omitted, you can set the
\f5\fIcompute/region\fR\fR property:
.RS 2m
$ gcloud config set compute/region REGION
.RE
A list of regions can be fetched by running:
.RS 2m
$ gcloud compute regions list
.RE
To unset the property, run:
.RS 2m
$ gcloud config unset compute/region
.RE
Alternatively, the region can be stored in the environment variable
\f5\fICLOUDSDK_COMPUTE_REGION\fR\fR.
.TP 2m
\fB\-\-runtime\-version\fR=\fIRUNTIME_VERSION\fR
AI Platform runtime version for this job. Must be specified unless
\-\-master\-image\-uri is specified instead. It is defined in documentation
along with the list of supported versions:
https://cloud.google.com/ai\-platform/prediction/docs/runtime\-version\-list
.TP 2m
\fB\-\-scale\-tier\fR=\fISCALE_TIER\fR
Specify the machine types, the number of replicas for workers, and parameter
servers. \fISCALE_TIER\fR must be one of:
.RS 2m
.TP 2m
\fBbasic\fR
Single worker instance. This tier is suitable for learning how to use AI
Platform, and for experimenting with new models using small datasets.
.TP 2m
\fBbasic\-gpu\fR
Single worker instance with a GPU.
.TP 2m
\fBbasic\-tpu\fR
Single worker instance with a Cloud TPU.
.TP 2m
\fBcustom\fR
CUSTOM tier is not a set tier, but rather enables you to use your own cluster
specification. When you use this tier, set values to configure your processing
cluster according to these guidelines (using the \f5\-\-config\fR flag):
.RS 2m
.IP "\(bu" 2m
You \fImust\fR set \f5TrainingInput.masterType\fR to specify the type of machine
to use for your master node. This is the only required setting.
.IP "\(bu" 2m
You \fImay\fR set \f5TrainingInput.workerCount\fR to specify the number of
workers to use. If you specify one or more workers, you \fImust\fR also set
\f5TrainingInput.workerType\fR to specify the type of machine to use for your
worker nodes.
.IP "\(bu" 2m
You \fImay\fR set \f5TrainingInput.parameterServerCount\fR to specify the number
of parameter servers to use. If you specify one or more parameter servers, you
\fImust\fR also set \f5TrainingInput.parameterServerType\fR to specify the type
of machine to use for your parameter servers. Note that all of your workers must
use the same machine type, which can be different from your parameter server
type and master type. Your parameter servers must likewise use the same machine
type, which can be different from your worker type and master type.
.RE
.sp
.TP 2m
\fBpremium\-1\fR
Large number of workers with many parameter servers.
.TP 2m
\fBstandard\-1\fR
Many workers and a few parameter servers.
.RE
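The \f5custom\fR tier guidelines above can also be followed with the
command\-line flags documented on this page instead of a configuration file.
The sketch below is illustrative only: the machine type names
(\f5n1\-standard\-8\fR, \f5n1\-standard\-4\fR), replica counts, region, and
runtime version are assumptions, so check the supported values for your
project. The command is printed for review rather than executed.

```shell
# A master, two workers, and one parameter server. All workers share one
# machine type; the parameter server uses a different one.
set -- gcloud ai-platform jobs submit training my_job \
  --scale-tier custom \
  --master-machine-type n1-standard-8 \
  --worker-count 2 \
  --worker-machine-type n1-standard-8 \
  --parameter-server-count 1 \
  --parameter-server-machine-type n1-standard-4 \
  --module-name trainer.task \
  --packages trainer-0.0.1.tar.gz \
  --staging-bucket gs://my-bucket \
  --region us-east1 \
  --runtime-version 1.15 \
  --python-version 3.7

# Print the command for review; run "$@" to submit it.
printf '%s\n' "$*"
```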
.sp
.TP 2m
\fB\-\-service\-account\fR=\fISERVICE_ACCOUNT\fR
The email address of a service account to use when running the training
application. You must have the \f5iam.serviceAccounts.actAs\fR permission for
the specified service account. In addition, the AI Platform Training
Google\-managed service account must have the
\f5roles/iam.serviceAccountAdmin\fR role for the specified service account.
Learn more about configuring a service account:
https://cloud.google.com/ai\-platform/training/docs/custom\-service\-account
If not specified, the AI Platform Training Google\-managed service account is
used by default.
.TP 2m
\fB\-\-staging\-bucket\fR=\fISTAGING_BUCKET\fR
Bucket in which to stage training archives.
Required only if a file upload is necessary (that is, other flags include local
paths) and no other flags implicitly specify an upload path.
.TP 2m
\fB\-\-use\-chief\-in\-tf\-config\fR=\fIUSE_CHIEF_IN_TF_CONFIG\fR
Use the "chief" role in the cluster instead of "master". This is required for
TensorFlow 2.0 and newer versions. Unlike the "master" node, the "chief" node
does not run evaluation.
.TP 2m
\fB\-\-worker\-accelerator\fR=[\fIcount\fR=\fICOUNT\fR],[\fItype\fR=\fITYPE\fR]
Hardware accelerator config for the worker nodes. Must specify both the
accelerator type (TYPE) for each server and the number of accelerators to attach
to each server (COUNT).
.RS 2m
.TP 2m
\fBtype\fR
Type of the accelerator. Choices are
nvidia\-tesla\-a100,nvidia\-tesla\-k80,nvidia\-tesla\-p100,nvidia\-tesla\-p4,nvidia\-tesla\-t4,nvidia\-tesla\-v100,tpu\-v2,tpu\-v2\-pod,tpu\-v3,tpu\-v3\-pod,tpu\-v4\-pod
.TP 2m
\fBcount\fR
Number of accelerators to attach to each machine running the job. Must be
greater than 0.
.RE
.sp
.TP 2m
\fB\-\-worker\-image\-uri\fR=\fIWORKER_IMAGE_URI\fR
Docker image to run on each worker node. This image must be in Container
Registry. If not specified, the value of \f5\-\-master\-image\-uri\fR is used.
.TP 2m
At most one of these can be specified:
.RS 2m
.TP 2m
\fB\-\-async\fR
(DEPRECATED) Display information about the operation in progress without waiting
for the operation to complete. Enabled by default and can be omitted; use
\f5\-\-stream\-logs\fR to run synchronously.
.TP 2m
\fB\-\-stream\-logs\fR
Block until job completion and stream the logs while the job runs.
Note that even if command execution is halted, the job will still run until
cancelled with
.RS 2m
$ gcloud ai\-platform jobs cancel JOB_ID
.RE
.RE
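Because halting \f5\-\-stream\-logs\fR does not stop the job, a wrapper script
can cancel the job when it is interrupted. The following is one possible
sketch, not behavior built into gcloud. The function name
\f5submit_and_follow\fR is hypothetical, and cancelling on interrupt is a
design choice that may not suit long\-running jobs you want to keep alive.

```shell
# Submit a job with log streaming and cancel it if this shell is
# interrupted. Arguments after the job name are passed to the submit command.
submit_and_follow() {
  job=$1
  shift
  # Interrupting the log stream does not stop the job, so cancel it explicitly.
  trap 'gcloud ai-platform jobs cancel "$job"; exit 130' INT TERM
  gcloud ai-platform jobs submit training "$job" --stream-logs "$@"
  status=$?
  trap - INT TERM
  return "$status"
}

# Usage (not run here):
#   submit_and_follow my_job --module-name trainer.task \
#     --staging-bucket gs://my-bucket --package-path /my/code/path/trainer
```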
.sp
.TP 2m
Key resource \- The Cloud KMS (Key Management Service) cryptokey that will be
used to protect the job. The 'AI Platform Service Agent' service account must
hold permission 'Cloud KMS CryptoKey Encrypter/Decrypter'. The arguments in this
group can be used to specify the attributes of this resource.
.RS 2m
.TP 2m
\fB\-\-kms\-key\fR=\fIKMS_KEY\fR
ID of the key or fully qualified identifier for the key.
To set the \f5kms\-key\fR attribute:
.RS 2m
.IP "\(bu" 2m
provide the argument \f5\-\-kms\-key\fR on the command line.
.RE
.sp
This flag argument must be specified if any of the other arguments in this group
are specified.
.TP 2m
\fB\-\-kms\-keyring\fR=\fIKMS_KEYRING\fR
The KMS keyring of the key.
To set the \f5kms\-keyring\fR attribute:
.RS 2m
.IP "\(bu" 2m
provide the argument \f5\-\-kms\-key\fR on the command line with a fully
specified name;
.IP "\(bu" 2m
provide the argument \f5\-\-kms\-keyring\fR on the command line.
.RE
.sp
.TP 2m
\fB\-\-kms\-location\fR=\fIKMS_LOCATION\fR
The Google Cloud location for the key.
To set the \f5kms\-location\fR attribute:
.RS 2m
.IP "\(bu" 2m
provide the argument \f5\-\-kms\-key\fR on the command line with a fully
specified name;
.IP "\(bu" 2m
provide the argument \f5\-\-kms\-location\fR on the command line.
.RE
.sp
.TP 2m
\fB\-\-kms\-project\fR=\fIKMS_PROJECT\fR
The Google Cloud project for the key.
To set the \f5kms\-project\fR attribute:
.RS 2m
.IP "\(bu" 2m
provide the argument \f5\-\-kms\-key\fR on the command line with a fully
specified name;
.IP "\(bu" 2m
provide the argument \f5\-\-kms\-project\fR on the command line;
.IP "\(bu" 2m
set the property \f5core/project\fR.
.RE
.sp
.RE
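The two ways of identifying the key can be sketched as follows. The resource
name layout used here is the standard Cloud KMS format; it is not spelled out
on this page. The helper name \f5kms_key_name\fR and the project, location,
keyring, and key IDs are hypothetical.

```shell
# Compose a fully qualified Cloud KMS key name from its four parts:
# project, location, keyring, and key ID.
kms_key_name() {
  printf 'projects/%s/locations/%s/keyRings/%s/cryptoKeys/%s\n' \
    "$1" "$2" "$3" "$4"
}

key=$(kms_key_name my-project us-east1 my-keyring my-key)

# Form 1: a single fully qualified --kms-key.
printf '%s\n' "--kms-key $key"

# Form 2: a short key ID plus the other attributes as separate flags.
printf '%s\n' "--kms-key my-key --kms-keyring my-keyring --kms-location us-east1 --kms-project my-project"
```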
.sp
.TP 2m
Configure parameter server machine type settings.
.RS 2m
.TP 2m
\fB\-\-parameter\-server\-count\fR=\fIPARAMETER_SERVER_COUNT\fR
Number of parameter servers to use for the training job.
This flag argument must be specified if any of the other arguments in this group
are specified.
.TP 2m
\fB\-\-parameter\-server\-machine\-type\fR=\fIPARAMETER_SERVER_MACHINE_TYPE\fR
Type of virtual machine to use for the training job's parameter servers.
This flag argument must be specified if any of the other arguments in this group
are specified.
.RE
.sp
.TP 2m
Configure worker node machine type settings.
.RS 2m
.TP 2m
\fB\-\-worker\-count\fR=\fIWORKER_COUNT\fR
Number of worker nodes to use for the training job.
This flag argument must be specified if any of the other arguments in this group
are specified.
.TP 2m
\fB\-\-worker\-machine\-type\fR=\fIWORKER_MACHINE_TYPE\fR
Type of virtual machine to use for training job's worker nodes.
This flag argument must be specified if any of the other arguments in this group
are specified.
.RE
.RE
.sp
.SH "GCLOUD WIDE FLAGS"
These flags are available to all commands: \-\-access\-token\-file, \-\-account,
\-\-billing\-project, \-\-configuration, \-\-flags\-file, \-\-flatten,
\-\-format, \-\-help, \-\-impersonate\-service\-account, \-\-log\-http,
\-\-project, \-\-quiet, \-\-trace\-token, \-\-user\-output\-enabled,
\-\-verbosity.
Run \fB$ gcloud help\fR for details.
.SH "NOTES"
These variants are also available:
.RS 2m
$ gcloud alpha ai\-platform jobs submit training
.RE
.RS 2m
$ gcloud beta ai\-platform jobs submit training
.RE