Data Loaders
Data Readers
- class onmt.inputters.DataReaderBase
Bases: object
Read data from file system and yield as dicts.
- Raises
onmt.inputters.datareader_base.MissingDependencyException – A number of DataReaders need specific additional packages. If any are missing, this will be raised.
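The optional-dependency check described above can be sketched as follows. This is a minimal illustration, not the library's code: the exception class is re-declared locally as a stand-in, and `require_modules` is a hypothetical helper name.

```python
import importlib


class MissingDependencyException(Exception):
    """Local stand-in for the exception named above (re-declared for this sketch)."""


def require_modules(*names):
    """Raise MissingDependencyException if any named module cannot be imported.

    `require_modules` is a hypothetical helper, not part of onmt.
    """
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    if missing:
        raise MissingDependencyException(
            "Missing optional packages: " + ", ".join(missing)
        )


# A reader that needs extra packages could call this once, at construction time.
require_modules("json")  # stdlib module, so this passes silently
```

Checking at construction time means a missing package is reported before any data is read, not part-way through a run.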
- class onmt.inputters.TextDataReader
Bases: onmt.inputters.datareader_base.DataReaderBase
- read(sequences, side, _dir=None)
Read text data from disk.
- Parameters
sequences (str or Iterable[str]) – Path to a text file, or an iterable of the actual text data.
side (str) – Prefix used in return dict. Usually "src" or "tgt".
_dir (NoneType) – Leave as None. This parameter exists to conform with the DataReaderBase.read() signature.
- Yields
dictionaries whose keys are the names of fields and whose values are more or less the result of tokenizing with those fields.
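As a rough illustration of the `read()` contract described above, the sketch below accepts either a file path or an iterable of strings and yields one dict per line. `read_text` is a hypothetical stand-in rather than the library method, and the `"indices"` key and the newline stripping are assumptions made for this example.

```python
def read_text(sequences, side):
    """Yield one dict per line, keyed by `side` (hypothetical stand-in).

    A plain `str` is treated as a path to a text file; anything else is
    treated as an iterable of the text itself.
    """
    if isinstance(sequences, str):
        with open(sequences, encoding="utf-8") as handle:
            sequences = [line.rstrip("\n") for line in handle]
    for index, seq in enumerate(sequences):
        if isinstance(seq, bytes):
            seq = seq.decode("utf-8")
        # "indices" is an assumed key name recording the line position.
        yield {side: seq, "indices": index}


for record in read_text(["hello world", "goodbye"], "src"):
    print(record)
```

Because a bare string is interpreted as a path, a single sentence should be wrapped in a list before being passed in.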
- class onmt.inputters.ImageDataReader(truncate=None, channel_size=3)
Bases: onmt.inputters.datareader_base.DataReaderBase
Read image data from disk.
- Parameters
truncate (tuple[int] or NoneType) – Maximum image size. Use (0,0) or None for unlimited.
channel_size (int) – Number of channels per image.
- Raises
onmt.inputters.datareader_base.MissingDependencyException – If importing any of PIL, torchvision, or cv2 fails.
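The `truncate` convention above, where `(0, 0)` and `None` both mean "unlimited", can be sketched as a small size check. `within_limit` is a hypothetical helper, and the `(height, width)` ordering is an assumption. The documentation does not say what happens to an oversized image, so this sketch only reports whether the limit is respected.

```python
def within_limit(size, truncate=None):
    """Return True if `size` fits inside `truncate` (hypothetical helper).

    size:     (height, width) of an image; the ordering is an assumption.
    truncate: (max_height, max_width), or (0, 0) / None for unlimited.
    """
    if truncate is None or tuple(truncate) == (0, 0):
        return True
    height, width = size
    max_height, max_width = truncate
    return height <= max_height and width <= max_width


print(within_limit((100, 200), None))
print(within_limit((100, 200), (150, 150)))
```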
- classmethod from_opt(opt)
Alternative constructor.
- Parameters
opt (argparse.Namespace) – The parsed arguments.
- read(images, side, img_dir=None)
Read data into dicts.
- Parameters
images (str or Iterable[str]) – Sequence of image paths or path to file containing image paths. In either case, the filenames may be relative to img_dir (default behavior) or absolute.
side (str) – Prefix used in return dict. Usually "src" or "tgt".
img_dir (str) – Location of source image files. See images.
- Yields
a dictionary containing image data, path and index for each line.
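The "relative to the directory, or absolute" rule for filenames can be sketched as below. Both function names are hypothetical, the dict keys are assumed for illustration, and no image is actually loaded. Only the path and index bookkeeping is shown.

```python
import os


def resolve_path(filename, base_dir=None):
    """Resolve `filename` against `base_dir` unless it is already absolute."""
    if base_dir is None or os.path.isabs(filename):
        return filename
    return os.path.join(base_dir, filename)


def iter_image_records(images, side, img_dir=None):
    """Yield a path/index dict per entry (key names are assumptions)."""
    for index, filename in enumerate(images):
        yield {
            side + "_path": resolve_path(filename.strip(), img_dir),
            "indices": index,
        }


for record in iter_image_records(["a.png", "b.png"], "src", img_dir="images"):
    print(record)
```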
- class onmt.inputters.AudioDataReader(sample_rate=0, window_size=0, window_stride=0, window=None, normalize_audio=True, truncate=None)
Bases: onmt.inputters.datareader_base.DataReaderBase
Read audio data from disk.
- Parameters
sample_rate (int) – Sample rate of the audio (in Hz).
window_size (float) – Window size for spectrogram in seconds.
window_stride (float) – Window stride for spectrogram in seconds.
window (str) – Window type for spectrogram generation. See the window argument of librosa.stft() for more details.
normalize_audio (bool) – Whether to subtract the mean from the spectrogram and divide by the standard deviation.
truncate (int or NoneType) – Maximum audio length (0 or None for unlimited).
- Raises
onmt.inputters.datareader_base.MissingDependencyException – If importing any of torchaudio, librosa, or numpy fails.
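Two of the parameters above lend themselves to a small worked example: converting window sizes given in seconds into sample counts, and mean/std normalization. This is a pure-Python sketch under stated assumptions. The helper names are hypothetical, rounding to the nearest sample is an assumption, and the population standard deviation is used here, which may differ from what the reader actually computes.

```python
import statistics


def window_to_samples(sample_rate, window_size, window_stride):
    """Convert window size and stride from seconds to whole samples."""
    n_fft = int(round(sample_rate * window_size))
    hop_length = int(round(sample_rate * window_stride))
    return n_fft, hop_length


def normalize(values):
    """Subtract the mean and divide by the standard deviation.

    Assumes `values` is not constant; a zero standard deviation would
    raise ZeroDivisionError.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# 20 ms windows with a 10 ms stride at 16 kHz.
print(window_to_samples(16000, 0.02, 0.01))
print(normalize([1.0, 2.0, 3.0]))
```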
- classmethod from_opt(opt)
Alternative constructor.
- Parameters
opt (argparse.Namespace) – The parsed arguments.
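The alternative-constructor pattern can be sketched as below. `SpectrogramConfig` is a hypothetical class that mirrors a few of the constructor arguments listed above. The attribute names read from `opt` are assumptions and may not match the real command-line options.

```python
import argparse


class SpectrogramConfig:
    """Hypothetical class illustrating a from_opt alternative constructor."""

    def __init__(self, sample_rate=0, window_size=0, window_stride=0,
                 window=None):
        self.sample_rate = sample_rate
        self.window_size = window_size
        self.window_stride = window_stride
        self.window = window

    @classmethod
    def from_opt(cls, opt):
        """Build an instance from parsed arguments (attribute names assumed)."""
        return cls(
            sample_rate=opt.sample_rate,
            window_size=opt.window_size,
            window_stride=opt.window_stride,
            window=opt.window,
        )


opt = argparse.Namespace(
    sample_rate=16000, window_size=0.02, window_stride=0.01, window="hamming"
)
config = SpectrogramConfig.from_opt(opt)
print(config.sample_rate, config.window)
```

Keeping `__init__` free of any `opt` dependency lets the class be constructed directly in tests, while `from_opt` stays a thin adapter for command-line use.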
- read(data, side, src_dir=None)
Read data into dicts.
- Parameters
data (str or Iterable[str]) – Sequence of audio paths or path to file containing audio paths. In either case, the filenames may be relative to src_dir (default behavior) or absolute.
side (str) – Prefix used in return dict. Usually "src" or "tgt".
src_dir (str) – Location of source audio files. See data.
- Yields
A dictionary containing audio data for each line.
Dataset
- class onmt.inputters.Dataset(fields, readers, data, dirs, sort_key, filter_pred=None)
Bases: torchtext.data.dataset.Dataset
Contain data and process it.
A dataset is an object that accepts sequences of raw data (sentence pairs in the case of machine translation) and fields which describe how this raw data should be processed to produce tensors. When a dataset is instantiated, it applies the fields’ preprocessing pipeline (but not the steps that numericalize it or turn it into batch tensors) to the raw data, producing a list of torchtext.data.Example objects. torchtext’s iterators then know how to use these examples to make batches.
- Parameters
fields (dict[str, Field]) – A dict with the structure returned by onmt.inputters.get_fields(). Usually the keys are the dataset sides, "src" or "tgt". Keys match the keys of items yielded by the readers, while values are lists of (name, Field) pairs. An attribute with this name will be created for each torchtext.data.Example object and its value will be the result of applying the Field to the data that matches the key. The advantage of having sequences of fields for each piece of raw input is that it allows the dataset to store multiple “views” of each input, which allows for easy implementation of token-level features, mixed word- and character-level models, and so on. (See also onmt.inputters.TextMultiField.)
readers (Iterable[onmt.inputters.DataReaderBase]) – Reader objects for disk-to-dict. The yielded dicts are then processed according to fields.
data (Iterable[Tuple[str, Any]]) – (name, data_arg) pairs where data_arg is passed to the read() method of the reader in readers at that position. (See the reader object for details on the Any type.)
dirs (Iterable[str or NoneType]) – A list of directories where data is contained. See the reader object for more details.
sort_key (Callable[[torchtext.data.Example], Any]) – A function for determining the value on which data is sorted (e.g. length).
filter_pred (Callable[[torchtext.data.Example], bool]) – A function that accepts Example objects and returns a boolean value indicating whether to include that example in the dataset.
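How the positional pairing of readers, data and dirs might fit together can be sketched without torchtext. Everything below is a simplified stand-in: `toy_reader`, `join_dicts` and `build_examples` are hypothetical names, plain dicts take the place of Example objects, and field preprocessing is omitted.

```python
def toy_reader(lines, side, _dir=None):
    """Hypothetical reader: yields one dict per line, keyed by `side`."""
    for index, line in enumerate(lines):
        yield {side: line, "indices": index}


def join_dicts(*dicts):
    """Merge the per-reader dicts for one example into a single dict."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged


def build_examples(readers, data, dirs, filter_pred=None):
    """Pair each reader with its (name, data_arg) and directory by position."""
    read_iters = [
        reader(data_arg, name, dir_)
        for reader, (name, data_arg), dir_ in zip(readers, data, dirs)
    ]
    examples = [join_dicts(*parts) for parts in zip(*read_iters)]
    if filter_pred is not None:
        examples = [ex for ex in examples if filter_pred(ex)]
    return examples


data = [("src", ["a b c", "d"]), ("tgt", ["x y", "z w v u"])]
examples = build_examples(
    [toy_reader, toy_reader], data, [None, None],
    filter_pred=lambda ex: len(ex["tgt"].split()) <= 3,
)
# A sort key maps an example to a sortable value, here the source length.
print(sorted(examples, key=lambda ex: len(ex["src"].split())))
```

In this sketch the filter is applied eagerly and the sort key only when asked. In torchtext the sort key is typically stored on the dataset and used later by the iterators.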
- Variables
src_vocabs (List[torchtext.data.Vocab]) – Used with dynamic dict/copy attention. There is a very short vocab for each src example. It contains just that example's source words, so that the generator can, for instance, predict copying them.
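The idea of one very short vocabulary per source example can be sketched as follows. The helper names, the special tokens `<unk>` and `<blank>`, and the first-occurrence ordering are all assumptions made for illustration. The library builds `torchtext.data.Vocab` objects instead of plain lists.

```python
def build_src_vocab(src_tokens, unk="<unk>", pad="<blank>"):
    """Build a tiny per-example vocab from one source sentence's tokens."""
    itos = [unk, pad]  # assumed special tokens, placed first
    for token in src_tokens:
        if token not in itos:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


def map_to_src_vocab(src_tokens, stoi):
    """Map each source token to its index in the per-example vocab."""
    return [stoi[token] for token in src_tokens]


tokens = "the cat saw the dog".split()
itos, stoi = build_src_vocab(tokens)
print(itos)
print(map_to_src_vocab(tokens, stoi))
```

Because each vocab covers only one sentence, a copy mechanism can point at a source token by a small index even when that token is missing from the global target vocabulary.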