Preprocess
preprocess.py
usage: preprocess.py [-h] [-config CONFIG] [-save_config SAVE_CONFIG]
[--data_type DATA_TYPE] --train_src TRAIN_SRC --train_tgt
TRAIN_TGT [--valid_src VALID_SRC] [--valid_tgt VALID_TGT]
[--src_dir SRC_DIR] --save_data SAVE_DATA
[--max_shard_size MAX_SHARD_SIZE]
[--shard_size SHARD_SIZE] [--src_vocab SRC_VOCAB]
[--tgt_vocab TGT_VOCAB]
[--features_vocabs_prefix FEATURES_VOCABS_PREFIX]
[--src_vocab_size SRC_VOCAB_SIZE]
[--tgt_vocab_size TGT_VOCAB_SIZE]
[--vocab_size_multiple VOCAB_SIZE_MULTIPLE]
[--src_words_min_frequency SRC_WORDS_MIN_FREQUENCY]
[--tgt_words_min_frequency TGT_WORDS_MIN_FREQUENCY]
[--dynamic_dict] [--share_vocab]
[--src_seq_length SRC_SEQ_LENGTH]
[--src_seq_length_trunc SRC_SEQ_LENGTH_TRUNC]
[--tgt_seq_length TGT_SEQ_LENGTH]
[--tgt_seq_length_trunc TGT_SEQ_LENGTH_TRUNC] [--lower]
[--filter_valid] [--shuffle SHUFFLE] [--seed SEED]
[--report_every REPORT_EVERY] [--log_file LOG_FILE]
[--log_file_level {INFO,CRITICAL,WARNING,ERROR,NOTSET,DEBUG,20,50,30,40,0,10}]
[--sample_rate SAMPLE_RATE] [--window_size WINDOW_SIZE]
[--window_stride WINDOW_STRIDE] [--window WINDOW]
[--image_channel_size {3,1}]
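As an illustration of the usage line above, the following sketch assembles a command line that sets the three options shown without brackets (--train_src, --train_tgt, --save_data) plus the optional validation files. All file paths are hypothetical placeholders, and the sketch only builds and prints the command; it does not run it.

```python
import shlex

# Hypothetical file paths, used only for illustration.
options = {
    "train_src": "data/src-train.txt",
    "train_tgt": "data/tgt-train.txt",
    "valid_src": "data/src-val.txt",
    "valid_tgt": "data/tgt-val.txt",
    "save_data": "data/demo",
}

# Start from the script name and append each option as "--name value".
command = ["python", "preprocess.py"]
for name, value in options.items():
    command += [f"--{name}", value]

# shlex.join quotes arguments where needed, giving a copy-pasteable line.
command_line = shlex.join(command)
print(command_line)
```

The printed line can be pasted into a shell once the placeholder paths are replaced with real files.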
Named Arguments
- -config, --config
Config file path
- -save_config, --save_config
Config file save path
Data
- --data_type, -data_type
Type of the source input. Options are [text|img|audio].
Default: “text”
- --train_src, -train_src
Path to the training source data
- --train_tgt, -train_tgt
Path to the training target data
- --valid_src, -valid_src
Path to the validation source data
- --valid_tgt, -valid_tgt
Path to the validation target data
- --src_dir, -src_dir
Source directory for image or audio files.
Default: “”
- --save_data, -save_data
Output file for the prepared data
- --max_shard_size, -max_shard_size
Deprecated; use shard_size instead.
Default: 0
- --shard_size, -shard_size
Split the source and target corpora into multiple smaller corpus files, then build shards from them. Each shard holds shard_size samples, except possibly the last one. shard_size=0 disables segmentation; shard_size>0 splits the dataset into shards of shard_size samples each.
Default: 1000000
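The effect of shard_size on the number and size of shards can be sketched as follows. This is a minimal illustration of the arithmetic described above, not the script's own code; the function name shard_sizes is hypothetical, and a non-negative shard_size is assumed.

```python
def shard_sizes(num_samples, shard_size):
    """Return the number of samples in each shard (illustrative only)."""
    if shard_size == 0:
        # shard_size=0 means no segmentation: a single shard holds everything.
        return [num_samples]
    # Full shards hold shard_size samples; the remainder goes in the last shard.
    full, rest = divmod(num_samples, shard_size)
    return [shard_size] * full + ([rest] if rest else [])

# 2.5M samples with the default shard_size of 1,000,000.
default_shards = shard_sizes(2_500_000, 1_000_000)
print(default_shards)  # [1000000, 1000000, 500000]

# The same corpus with sharding disabled.
single_shard = shard_sizes(2_500_000, 0)
print(single_shard)  # [2500000]
```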
Vocab
- --src_vocab, -src_vocab
Path to an existing source vocabulary. Format: one word per line.
Default: “”
- --tgt_vocab, -tgt_vocab
Path to an existing target vocabulary. Format: one word per line.
Default: “”
- --features_vocabs_prefix, -features_vocabs_prefix
Path prefix to existing features vocabularies
Default: “”
- --src_vocab_size, -src_vocab_size
Size of the source vocabulary
Default: 50000
- --tgt_vocab_size, -tgt_vocab_size
Size of the target vocabulary
Default: 50000
- --vocab_size_multiple, -vocab_size_multiple
Make the vocabulary size a multiple of this value
Default: 1
- --src_words_min_frequency, -src_words_min_frequency
Default: 0
- --tgt_words_min_frequency, -tgt_words_min_frequency
Default: 0
- --dynamic_dict, -dynamic_dict
Create dynamic dictionaries
Default: False
- --share_vocab, -share_vocab
Share source and target vocabulary
Default: False
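How the size, minimum-frequency, and size-multiple options might interact is sketched below. This is a simplified sketch, not the script's implementation: it ignores special tokens, assumes that the minimum-frequency options drop words seen fewer times than the threshold, and assumes that padding to a multiple is done by appending placeholder entries. The function name build_vocab and the placeholder token names are hypothetical.

```python
from collections import Counter

def build_vocab(tokens, max_size, min_frequency=0, size_multiple=1):
    """Illustrative vocabulary builder (assumed semantics, see lead-in)."""
    counts = Counter(tokens)
    # Keep the most frequent words that meet the frequency threshold,
    # capped at max_size entries.
    kept = [w for w, c in counts.most_common() if c >= min_frequency][:max_size]
    # Assumption: pad with placeholder entries until the size is a
    # multiple of size_multiple.
    i = 0
    while len(kept) % size_multiple != 0:
        kept.append(f"<placeholder_{i}>")
        i += 1
    return kept

corpus = "a a a b b c".split()
vocab = build_vocab(corpus, max_size=50000, min_frequency=2, size_multiple=8)
print(vocab[:2], len(vocab))  # ['a', 'b'] 8
```

With the default multiple of 1 no padding is added, so the vocabulary size is determined by the size cap and the frequency threshold alone.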
Pruning
- --src_seq_length, -src_seq_length
Maximum source sequence length
Default: 50
- --src_seq_length_trunc, -src_seq_length_trunc
Truncate source sequence length.
- --tgt_seq_length, -tgt_seq_length
Maximum target sequence length to keep.
Default: 50
- --tgt_seq_length_trunc, -tgt_seq_length_trunc
Truncate target sequence length.
- --lower, -lower
Lowercase data
Default: False
- --filter_valid, -filter_valid
Filter validation data by src and/or tgt length
Default: False
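The difference between the maximum-length options and the truncation options can be sketched as follows. This is a minimal illustration under stated assumptions: truncation is applied before the length filter, and the length limit is inclusive. Neither detail is confirmed by the text above, and the function name prune is hypothetical.

```python
def prune(pairs, src_seq_length=50, tgt_seq_length=50,
          src_seq_length_trunc=None, tgt_seq_length_trunc=None):
    """Illustrative pruning of (source tokens, target tokens) pairs."""
    kept = []
    for src, tgt in pairs:
        # Assumption: truncation, if requested, happens before filtering.
        if src_seq_length_trunc:
            src = src[:src_seq_length_trunc]
        if tgt_seq_length_trunc:
            tgt = tgt[:tgt_seq_length_trunc]
        # Assumption: pairs exceeding either maximum length are dropped.
        if len(src) <= src_seq_length and len(tgt) <= tgt_seq_length:
            kept.append((src, tgt))
    return kept

pairs = [(["a"] * 3, ["b"] * 3), (["a"] * 60, ["b"] * 5)]

# Without truncation the 60-token source exceeds the default limit of 50.
filtered = prune(pairs)
print(len(filtered))  # 1

# Truncating sources to 40 tokens lets the second pair pass the filter.
truncated = prune(pairs, src_seq_length_trunc=40)
print(len(truncated), len(truncated[1][0]))  # 2 40
```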
Random
- --shuffle, -shuffle
Shuffle data
Default: 0
- --seed, -seed
Random seed
Default: 3435
Logging
- --report_every, -report_every
Report status every this many sentences
Default: 100000
- --log_file, -log_file
Output logs to a file under this path.
Default: “”
- --log_file_level, -log_file_level
Possible choices: INFO, CRITICAL, WARNING, ERROR, NOTSET, DEBUG, 20, 50, 30, 40, 0, 10
Default: “0”
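The numeric choices for log_file_level line up with the named ones in the same order, and match the level constants of Python's standard logging module. The short check below shows the correspondence; it assumes only the standard library.

```python
import logging

# Named choices, in the order listed above.
names = ["INFO", "CRITICAL", "WARNING", "ERROR", "NOTSET", "DEBUG"]

# Look up the numeric value of each level constant in the logging module.
levels = {name: getattr(logging, name) for name in names}
print(levels)
# {'INFO': 20, 'CRITICAL': 50, 'WARNING': 30, 'ERROR': 40, 'NOTSET': 0, 'DEBUG': 10}
```

The default of "0" therefore corresponds to NOTSET.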
Speech
- --sample_rate, -sample_rate
Sample rate.
Default: 16000
- --window_size, -window_size
Window size for spectrogram in seconds.
Default: 0.02
- --window_stride, -window_stride
Window stride for spectrogram in seconds.
Default: 0.01
- --window, -window
Window type for spectrogram generation.
Default: “hamming”
- --image_channel_size, -image_channel_size
Possible choices: 3, 1
Using grayscale images can make training faster and the model smaller.
Default: 3
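Because window_size and window_stride are given in seconds, they are typically converted to a number of samples using sample_rate before a spectrogram is computed. The sketch below shows that conversion for the defaults above. The function name and the names n_fft and hop_length are illustrative choices, and rounding to the nearest integer is an assumption.

```python
def window_in_samples(sample_rate=16000, window_size=0.02, window_stride=0.01):
    """Convert window size and stride from seconds to samples (illustrative)."""
    # samples = samples per second * seconds
    n_fft = round(sample_rate * window_size)
    hop_length = round(sample_rate * window_stride)
    return n_fft, hop_length

defaults = window_in_samples()
print(defaults)  # (320, 160)
```

With the defaults, each analysis window covers 320 samples and consecutive windows start 160 samples apart, so adjacent windows overlap by half.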