Modules¶
Core Modules¶
class onmt.modules.Embeddings(word_vec_size, word_vocab_size, word_padding_idx, position_encoding=False, feat_merge='concat', feat_vec_exponent=0.7, feat_vec_size=-1, feat_padding_idx=[], feat_vocab_sizes=[], dropout=0, sparse=False, fix_word_vecs=False)[source]¶
Bases: torch.nn.modules.module.Module
Word embeddings for encoder/decoder.
Additionally includes the ability to add sparse input features based on “Linguistic Input Features Improve Neural Machine Translation” [SH16].
[Diagram: the input indices feed a word lookup and one lookup per feature; the lookup outputs are merged (MLP or concat) into the output embedding.]
- Parameters
word_vec_size (int) – size (dimension) of the word embedding vectors.
word_padding_idx (int) – padding index for words in the embeddings.
feat_padding_idx (List[int]) – padding index for a list of features in the embeddings.
word_vocab_size (int) – size of dictionary of embeddings for words.
feat_vocab_sizes (List[int], optional) – list of size of dictionary of embeddings for each feature.
position_encoding (bool) – see
PositionalEncoding
feat_merge (string) – merge action for the features embeddings: concat, sum or mlp.
feat_vec_exponent (float) – when using -feat_merge concat, feature embedding size is N^feat_vec_exponent, where N is the number of values the feature takes.
feat_vec_size (int) – embedding dimension for features when using -feat_merge mlp
dropout (float) – dropout probability.
emb_luts¶
Embedding look-up table.
forward(source, step=None)[source]¶
Computes the embeddings for words and features.
- Parameters
source (LongTensor) – index tensor
(len, batch, nfeat)
- Returns
Word embeddings
(len, batch, embedding_size)
- Return type
FloatTensor
load_pretrained_vectors(emb_file)[source]¶
Load in pretrained embeddings.
- Parameters
emb_file (str) – path to torch serialized embeddings
word_lut¶
Word look-up table.
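A minimal usage sketch based on the signatures documented above; the vocabulary sizes, padding indices and tensor contents are made up for illustration:

    import torch
    from onmt.modules import Embeddings

    # One word vocabulary of 10000 tokens plus a single feature with 6 values;
    # padding indices (1 for words, 0 for the feature) are illustrative choices.
    emb = Embeddings(
        word_vec_size=500,
        word_vocab_size=10000,
        word_padding_idx=1,
        feat_merge="concat",
        feat_padding_idx=[0],
        feat_vocab_sizes=[6],
        dropout=0.1,
    )

    # source holds (len, batch, nfeat) indices; column 0 is assumed to be the
    # word index and the remaining columns the feature indices.
    source = torch.randint(0, 6, (7, 3, 2))
    source[:, :, 0] = torch.randint(0, 10000, (7, 3))
    out = emb(source)  # (len, batch, embedding_size)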
Encoders¶
class onmt.encoders.EncoderBase[source]¶
Bases: torch.nn.modules.module.Module
Base encoder class. Specifies the interface used by different encoder types and required by onmt.Models.NMTModel.
[Diagram: each input position feeds the RNN; the per-position states form the memory bank, and the last state gives the final encoder state.]
forward
(src, lengths=None)[source]¶ - Parameters
src (LongTensor) – padded sequences of sparse indices
(src_len, batch, nfeat)
lengths (LongTensor) – length of each sequence
(batch,)
- Returns
final encoder state, used to initialize decoder
memory bank for attention,
(src_len, batch, hidden)
- Return type
(FloatTensor, FloatTensor)
class onmt.encoders.MeanEncoder(num_layers, embeddings)[source]¶
Bases: onmt.encoders.encoder.EncoderBase
A trivial non-recurrent encoder. Simply applies mean pooling.
- Parameters
num_layers (int) – number of replicated layers
embeddings (onmt.modules.Embeddings) – embedding module to use
class onmt.encoders.RNNEncoder(rnn_type, bidirectional, num_layers, hidden_size, dropout=0.0, embeddings=None, use_bridge=False)[source]¶
Bases: onmt.encoders.encoder.EncoderBase
A generic recurrent neural network encoder.
- Parameters
rnn_type (str) – style of recurrent unit to use, one of [RNN, LSTM, GRU, SRU]
bidirectional (bool) – use a bidirectional RNN
num_layers (int) – number of stacked layers
hidden_size (int) – hidden size of each layer
dropout (float) – dropout value for
torch.nn.Dropout
embeddings (onmt.modules.Embeddings) – embedding module to use
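A minimal sketch that wires an Embeddings module into an RNNEncoder, using the signatures above. The number of values returned by forward differs between OpenNMT-py versions (some also return the lengths), so only the documented pair is unpacked here:

    import torch
    from onmt.modules import Embeddings
    from onmt.encoders import RNNEncoder

    emb = Embeddings(word_vec_size=500, word_vocab_size=10000, word_padding_idx=1)
    encoder = RNNEncoder(
        rnn_type="LSTM",
        bidirectional=True,
        num_layers=2,
        hidden_size=500,  # split across the two directions
        dropout=0.1,
        embeddings=emb,
    )

    src = torch.randint(0, 10000, (9, 3, 1))  # (src_len, batch, nfeat)
    lengths = torch.tensor([9, 7, 5])         # (batch,), sorted longest first
    enc_state, memory_bank = encoder(src, lengths)[:2]
    # memory_bank: (src_len, batch, hidden); enc_state initializes the decoder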
Decoders¶
class onmt.decoders.DecoderBase(attentional=True)[source]¶
Bases: torch.nn.modules.module.Module
Abstract class for decoders.
- Parameters
attentional (bool) – whether the decoder returns non-empty attention.
class onmt.decoders.decoder.RNNDecoderBase(rnn_type, bidirectional_encoder, num_layers, hidden_size, attn_type='general', attn_func='softmax', coverage_attn=False, context_gate=None, copy_attn=False, dropout=0.0, embeddings=None, reuse_copy_attn=False, copy_attn_type='general')[source]¶
Bases: onmt.decoders.decoder.DecoderBase
Base recurrent attention-based decoder class.
Specifies the interface used by different decoder types and required by NMTModel.
[Diagram: embedded inputs and the previous decoder state drive the RNN; each position attends over the memory bank to produce the outputs and the next decoder state.]
- Parameters
rnn_type (str) – style of recurrent unit to use, one of [RNN, LSTM, GRU, SRU]
bidirectional_encoder (bool) – use with a bidirectional encoder
num_layers (int) – number of stacked layers
hidden_size (int) – hidden size of each layer
attn_type (str) – see
GlobalAttention
attn_func (str) – see
GlobalAttention
coverage_attn (str) – see
GlobalAttention
context_gate (str) – see
ContextGate
copy_attn (bool) – setup a separate copy attention mechanism
dropout (float) – dropout value for
torch.nn.Dropout
embeddings (onmt.modules.Embeddings) – embedding module to use
reuse_copy_attn (bool) – reuse the attention for copying
copy_attn_type (str) – The copy attention style. See GlobalAttention.
forward(tgt, memory_bank, memory_lengths=None, step=None)[source]¶
- Parameters
tgt (LongTensor) – sequences of padded tokens
(tgt_len, batch, nfeats).
memory_bank (FloatTensor) – vectors from the encoder
(src_len, batch, hidden).
memory_lengths (LongTensor) – the padded source lengths
(batch,).
- Returns
dec_outs: output from the decoder (after attn)
(tgt_len, batch, hidden).
attns: distribution over src at each tgt
(tgt_len, batch, src_len).
- Return type
(FloatTensor, dict[str, FloatTensor])
class onmt.decoders.StdRNNDecoder(rnn_type, bidirectional_encoder, num_layers, hidden_size, attn_type='general', attn_func='softmax', coverage_attn=False, context_gate=None, copy_attn=False, dropout=0.0, embeddings=None, reuse_copy_attn=False, copy_attn_type='general')[source]¶
Bases: onmt.decoders.decoder.RNNDecoderBase
Standard fully batched RNN decoder with attention.
Faster implementation that uses CuDNN. See RNNDecoderBase for options.
Based on the approach from “Neural Machine Translation By Jointly Learning To Align and Translate” [BCB14].
Implemented without input_feeding and currently with no coverage_attn or copy_attn support.
class onmt.decoders.InputFeedRNNDecoder(rnn_type, bidirectional_encoder, num_layers, hidden_size, attn_type='general', attn_func='softmax', coverage_attn=False, context_gate=None, copy_attn=False, dropout=0.0, embeddings=None, reuse_copy_attn=False, copy_attn_type='general')[source]¶
Bases: onmt.decoders.decoder.RNNDecoderBase
Input feeding based decoder.
See RNNDecoderBase for options.
Based on the input feeding approach from “Effective Approaches to Attention-based Neural Machine Translation” [LPM15].
[Diagram: at each step the current input and the previous position's output feed the RNN, whose state is combined with the encoder memory bank.]
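A minimal construction sketch for the input-feed decoder, using the signature above. The decoder state has to be initialized from the final encoder state before forward is called; that initialization call is version-specific and is therefore only indicated in comments:

    from onmt.modules import Embeddings
    from onmt.decoders import InputFeedRNNDecoder

    tgt_emb = Embeddings(word_vec_size=500, word_vocab_size=8000, word_padding_idx=1)
    decoder = InputFeedRNNDecoder(
        rnn_type="LSTM",
        bidirectional_encoder=True,  # must match the encoder that produced memory_bank
        num_layers=2,
        hidden_size=500,
        attn_type="general",
        dropout=0.1,
        embeddings=tgt_emb,
    )

    # After initializing the decoder state from the encoder's final state:
    # dec_outs, attns = decoder(tgt, memory_bank, memory_lengths=lengths)
    # dec_outs: (tgt_len, batch, hidden); attns: dict of (tgt_len, batch, src_len) distributions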
Attention¶
class onmt.modules.AverageAttention(model_dim, dropout=0.1)[source]¶
Bases: torch.nn.modules.module.Module
Average Attention module from “Accelerating Neural Transformer via an Average Attention Network” [ZXS18].
- Parameters
model_dim (int) – the dimension of keys/values/queries, must be divisible by head_count
dropout (float) – dropout parameter
cumulative_average(inputs, mask_or_step, layer_cache=None, step=None)[source]¶
Computes the cumulative average as described in [ZXS18] – Equations (1), (5) and (6).
- Parameters
inputs (FloatTensor) – sequence to average
(batch_size, input_len, dimension)
mask_or_step – if cache is set, this is assumed to be the current step of the dynamic decoding. Otherwise, it is the mask matrix used to compute the cumulative average.
layer_cache – a dictionary containing the cumulative average of the previous step.
- Returns
A tensor of the same shape and type as inputs.
cumulative_average_mask(batch_size, inputs_len)[source]¶
Builds the mask to compute the cumulative average as described in [ZXS18] – Figure 3.
- Parameters
batch_size (int) – batch size
inputs_len (int) – length of the inputs
- Returns
A Tensor of shape
(batch_size, input_len, input_len)
- Return type
(FloatTensor)
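The masking scheme can be illustrated in plain PyTorch: a lower-triangular matrix whose row j is filled with 1/j turns a batched matrix product into the running mean of the inputs. This is only a sketch of the idea behind cumulative_average_mask, not the module's own code:

    import torch

    batch_size, input_len, dim = 2, 5, 8
    inputs = torch.randn(batch_size, input_len, dim)

    # Row j of the mask averages positions 1..j, as in Figure 3 of [ZXS18].
    tri = torch.tril(torch.ones(input_len, input_len))
    mask = tri / tri.sum(dim=1, keepdim=True)            # (input_len, input_len)
    mask = mask.unsqueeze(0).expand(batch_size, -1, -1)  # (batch_size, input_len, input_len)

    cum_avg = torch.bmm(mask, inputs)  # same shape as inputs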
class onmt.modules.GlobalAttention(dim, coverage=False, attn_type='dot', attn_func='softmax')[source]¶
Bases: torch.nn.modules.module.Module
Global attention takes a matrix and a query vector. It then computes a parameterized convex combination of the matrix based on the input query.
Constructs a unit mapping a query q of size dim and a source matrix H of size n x dim to an output of size dim.
[Diagram: the query is scored against each source state H_1..H_N; the resulting attention weights combine the states into the output.]
All models compute the output as \(c = \sum_{j=1}^{\text{SeqLength}} a_j H_j\), where \(a_j\) is the softmax of a score function. They then apply a projection layer to \([q, c]\).
However, they differ in how they compute the attention score.
- Luong Attention (dot, general):
dot: \(\text{score}(H_j,q) = H_j^T q\)
general: \(\text{score}(H_j, q) = H_j^T W_a q\)
- Bahdanau Attention (mlp):
\(\text{score}(H_j, q) = v_a^T \text{tanh}(W_a q + U_a H_j)\)
- Parameters
dim (int) – dimensionality of query and key
coverage (bool) – use coverage term
attn_type (str) – type of attention to use, options [dot,general,mlp]
attn_func (str) – attention function to use, options [softmax,sparsemax]
forward(source, memory_bank, memory_lengths=None, coverage=None)[source]¶
- Parameters
source (FloatTensor) – query vectors
(batch, tgt_len, dim)
memory_bank (FloatTensor) – source vectors
(batch, src_len, dim)
memory_lengths (LongTensor) – the source context lengths
(batch,)
coverage (FloatTensor) – None (not supported yet)
- Returns
Computed vector
(tgt_len, batch, dim)
Attention distributions for each query
(tgt_len, batch, src_len)
- Return type
(FloatTensor, FloatTensor)
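A minimal sketch using the forward signature above, with made-up sizes; the inputs follow the documented (batch, len, dim) layout and the outputs the documented (tgt_len, batch, ...) layout:

    import torch
    from onmt.modules import GlobalAttention

    attn = GlobalAttention(dim=256, attn_type="general", attn_func="softmax")

    batch, tgt_len, src_len, dim = 3, 4, 9, 256
    query = torch.randn(batch, tgt_len, dim)        # decoder states
    memory_bank = torch.randn(batch, src_len, dim)  # encoder states
    lengths = torch.tensor([9, 7, 5])               # (batch,)

    context, align = attn(query, memory_bank, memory_lengths=lengths)
    # context: (tgt_len, batch, dim); align: (tgt_len, batch, src_len)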
Architecture: Transformer¶
class onmt.modules.PositionalEncoding(dropout, dim, max_len=5000)[source]¶
Bases: torch.nn.modules.module.Module
Sinusoidal positional encoding for non-recurrent neural networks.
Implementation based on “Attention Is All You Need” [VSP+17]
- Parameters
dropout (float) – dropout parameter
dim (int) – embedding size
class onmt.modules.position_ffn.PositionwiseFeedForward(d_model, d_ff, dropout=0.1)[source]¶
Bases: torch.nn.modules.module.Module
A two-layer Feed-Forward-Network with residual layer norm.
- Parameters
d_model (int) – the size of the input for the first layer of the FFN.
d_ff (int) – the hidden layer size of the second layer of the FFN.
dropout (float) – dropout probability in \([0, 1)\).
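A minimal sketch from the constructor above. The (batch, seq_len, d_model) input layout is an assumption (the layer is applied position-wise, so only the last dimension has to equal d_model):

    import torch
    from onmt.modules.position_ffn import PositionwiseFeedForward

    ffn = PositionwiseFeedForward(d_model=512, d_ff=2048, dropout=0.1)

    x = torch.randn(3, 10, 512)  # assumed (batch, seq_len, d_model)
    y = ffn(x)                   # same shape as x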
class onmt.encoders.TransformerEncoder(num_layers, d_model, heads, d_ff, dropout, embeddings, max_relative_positions)[source]¶
Bases: onmt.encoders.encoder.EncoderBase
The Transformer encoder from “Attention is All You Need” [VSP+17]
[Diagram: input -> multi-head self-attn -> feed forward -> output.]
- Parameters
num_layers (int) – number of encoder layers
d_model (int) – size of the model
heads (int) – number of heads
d_ff (int) – size of the inner FF layer
dropout (float) – dropout parameters
embeddings (onmt.modules.Embeddings) – embeddings to use, should have positional encodings
- Returns
embeddings
(src_len, batch_size, model_dim)
memory_bank
(src_len, batch_size, model_dim)
- Return type
(torch.FloatTensor, torch.FloatTensor)
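A minimal sketch using the signatures above. The embeddings are built with position_encoding=True, as required by the parameter note, and only the two documented return values are unpacked in case a given version also returns the lengths:

    import torch
    from onmt.modules import Embeddings
    from onmt.encoders import TransformerEncoder

    emb = Embeddings(word_vec_size=512, word_vocab_size=10000,
                     word_padding_idx=1, position_encoding=True)
    encoder = TransformerEncoder(
        num_layers=6, d_model=512, heads=8, d_ff=2048,
        dropout=0.1, embeddings=emb, max_relative_positions=0,
    )

    src = torch.randint(0, 10000, (9, 3, 1))  # (src_len, batch, nfeat)
    lengths = torch.tensor([9, 7, 5])
    emb_out, memory_bank = encoder(src, lengths)[:2]
    # both (src_len, batch_size, model_dim)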
class onmt.decoders.TransformerDecoder(num_layers, d_model, heads, d_ff, copy_attn, self_attn_type, dropout, embeddings, max_relative_positions)[source]¶
Bases: onmt.decoders.decoder.DecoderBase
The Transformer decoder from “Attention is All You Need” [VSP+17].
[Diagram: input -> multi-head self-attn -> multi-head src-attn -> feed forward -> output.]
- Parameters
num_layers (int) – number of decoder layers.
d_model (int) – size of the model
heads (int) – number of heads
d_ff (int) – size of the inner FF layer
copy_attn (bool) – if using a separate copy attention
self_attn_type (str) – type of self-attention: scaled-dot or average
dropout (float) – dropout parameters
embeddings (onmt.modules.Embeddings) – embeddings to use, should have positional encodings
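A minimal construction sketch using the documented parameters. As with the RNN decoders, the decoder state has to be initialized from the encoder output before forward is called; that step is not covered on this page and is omitted here:

    from onmt.modules import Embeddings
    from onmt.decoders import TransformerDecoder

    tgt_emb = Embeddings(word_vec_size=512, word_vocab_size=8000,
                         word_padding_idx=1, position_encoding=True)
    decoder = TransformerDecoder(
        num_layers=6, d_model=512, heads=8, d_ff=2048,
        copy_attn=False, self_attn_type="scaled-dot",
        dropout=0.1, embeddings=tgt_emb, max_relative_positions=0,
    )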
class onmt.modules.MultiHeadedAttention(head_count, model_dim, dropout=0.1, max_relative_positions=0)[source]¶
Bases: torch.nn.modules.module.Module
Multi-Head Attention module from “Attention is All You Need” [VSP+17].
Similar to standard dot attention but uses multiple attention distributions simultaneously to select relevant items.
[Diagram: the key, value and query feed N parallel attention heads whose results are combined into the output.]
Also includes several additional tricks.
- Parameters
head_count (int) – number of parallel heads
model_dim (int) – the dimension of keys/values/queries, must be divisible by head_count
dropout (float) – dropout parameter
forward(key, value, query, mask=None, layer_cache=None, type=None)[source]¶
Compute the context vector and the attention vectors.
- Parameters
key (FloatTensor) – set of key_len key vectors
(batch, key_len, dim)
value (FloatTensor) – set of key_len value vectors
(batch, key_len, dim)
query (FloatTensor) – set of query_len query vectors
(batch, query_len, dim)
mask – binary mask indicating which keys have non-zero attention
(batch, query_len, key_len)
- Returns
output context vectors
(batch, query_len, dim)
one of the attention vectors
(batch, query_len, key_len)
- Return type
(FloatTensor, FloatTensor)
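A minimal sketch using the forward signature above, with made-up sizes and no mask:

    import torch
    from onmt.modules import MultiHeadedAttention

    mha = MultiHeadedAttention(head_count=8, model_dim=512, dropout=0.1)

    batch, key_len, query_len, dim = 3, 9, 4, 512
    key = torch.randn(batch, key_len, dim)
    value = torch.randn(batch, key_len, dim)
    query = torch.randn(batch, query_len, dim)

    output, attn = mha(key, value, query)
    # output: (batch, query_len, dim); attn: (batch, query_len, key_len)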
Architecture: Conv2Conv¶
(These methods are from a user contribution and have not been thoroughly tested.)
class onmt.encoders.CNNEncoder(num_layers, hidden_size, cnn_kernel_width, dropout, embeddings)[source]¶
Bases: onmt.encoders.encoder.EncoderBase
Encoder based on “Convolutional Sequence to Sequence Learning” [GAG+17].
class onmt.decoders.CNNDecoder(num_layers, hidden_size, attn_type, copy_attn, cnn_kernel_width, dropout, embeddings, copy_attn_type)[source]¶
Bases: onmt.decoders.decoder.DecoderBase
Decoder based on “Convolutional Sequence to Sequence Learning” [GAG+17].
Consists of residual convolutional layers, with ConvMultiStepAttention.
class onmt.modules.ConvMultiStepAttention(input_size)[source]¶
Bases: torch.nn.modules.module.Module
Conv attention takes a key matrix, a value matrix and a query vector. The attention weights are computed from the key matrix and the query vector and used in a weighted sum over the value matrix. The same operation is applied in each decoder conv layer.
forward(base_target_emb, input_from_dec, encoder_out_top, encoder_out_combine)[source]¶
- Parameters
base_target_emb – target emb tensor
input_from_dec – output of the decoder conv layer
encoder_out_top – the key matrix for computing the attention weights, i.e. the top output of the encoder conv layers
encoder_out_combine – the value matrix for the attention-weighted sum, i.e. the combination of the base embeddings and the top output of the encoder conv layers
class onmt.modules.WeightNormConv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, init_scale=1.0, polyak_decay=0.9995)[source]¶
Bases: torch.nn.modules.conv.Conv2d
forward(x, init=False)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
Architecture: SRU¶
class onmt.models.sru.SRU(input_size, hidden_size, num_layers=2, dropout=0, rnn_dropout=0, bidirectional=False, use_tanh=1, use_relu=0)[source]¶
Bases: torch.nn.modules.module.Module
Implementation of “Training RNNs as Fast as CNNs” [LZA17]
TODO: switch to PyTorch's implementation when it becomes available.
This implementation is adapted from the author of the paper: https://github.com/taolei87/sru/blob/master/cuda_functional.py.
- Parameters
input_size (int) – input to model
hidden_size (int) – hidden dimension
num_layers (int) – number of layers
dropout (float) – dropout to use (stacked)
rnn_dropout (float) – dropout to use (recurrent)
bidirectional (bool) – bidirectional
use_tanh (bool) – activation
use_relu (bool) – activation
forward(input, c0=None, return_hidden=True)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
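A minimal construction sketch from the documented signature. Since this implementation is built on the author's custom CUDA kernels (see the link above), a GPU is assumed to be required, and the (length, batch, input_size) input layout is an assumption based on the usual PyTorch RNN convention:

    import torch
    from onmt.models.sru import SRU

    sru = SRU(input_size=256, hidden_size=256, num_layers=2,
              dropout=0.1, bidirectional=False).cuda()

    x = torch.randn(9, 3, 256, device="cuda")  # assumed (length, batch, input_size)
    output, hidden = sru(x)                    # return_hidden=True by default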
Alternative Encoders¶
onmt.modules.AudioEncoder
class onmt.encoders.AudioEncoder(rnn_type, enc_layers, dec_layers, brnn, enc_rnn_size, dec_rnn_size, enc_pooling, dropout, sample_rate, window_size)[source]¶
Bases: onmt.encoders.encoder.EncoderBase
A simple encoder CNN -> RNN for audio input.
- Parameters
rnn_type (str) – Type of RNN (e.g. GRU, LSTM, etc).
enc_layers (int) – Number of encoder layers.
dec_layers (int) – Number of decoder layers.
brnn (bool) – Bidirectional encoder.
enc_rnn_size (int) – Size of hidden states of the rnn.
dec_rnn_size (int) – Size of the decoder hidden states.
enc_pooling (str) – A comma separated list, either of length 1 or of length enc_layers, specifying the pooling amount.
dropout (float) – dropout probability.
sample_rate (float) – input spec
window_size (int) – input spec
onmt.modules.ImageEncoder
class onmt.encoders.ImageEncoder(num_layers, bidirectional, rnn_size, dropout, image_chanel_size=3)[source]¶
Bases: onmt.encoders.encoder.EncoderBase
A simple encoder CNN -> RNN for image src.
- Parameters
num_layers (int) – number of encoder layers.
bidirectional (bool) – bidirectional encoder.
rnn_size (int) – size of hidden states of the rnn.
dropout (float) – dropout probability.
Copy Attention¶
class onmt.modules.CopyGenerator(input_size, output_size, pad_idx)[source]¶
Bases: torch.nn.modules.module.Module
An implementation of pointer-generator networks [SLM17].
These networks consider copying words directly from the source sequence.
The copy generator is an extended version of the standard generator that computes three values.
\(p_{softmax}\) the standard softmax over tgt_dict
\(p(z)\) the probability of copying a word from the source
\(p_{copy}\) the probability of copying a particular word, taken directly from the attention distribution.
The model returns a distribution over the extended dictionary, computed as
\(p(w) = p(z=1) p_{copy}(w) + p(z=0) p_{softmax}(w)\)
[Diagram: the input produces the softmax distribution and the copy switch; the attention and src_map produce the copy distribution; the switch mixes the two into the output.]
- Parameters
input_size (int) – size of input representation
output_size (int) – size of output vocabulary
pad_idx (int) – index of the padding token in the output vocabulary.
forward(hidden, attn, src_map)[source]¶
Compute a distribution over the target dictionary extended by the dynamic dictionary implied by copying source words.
- Parameters
hidden (FloatTensor) – hidden outputs
(batch x tlen, input_size)
attn (FloatTensor) – attention over the source for each target position
(batch x tlen, src_len)
src_map (FloatTensor) – A sparse indicator matrix mapping each source word to its index in the “extended” vocabulary.
(src_len, batch, extra_words)
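A minimal sketch with made-up sizes, following the shapes documented above; extra_words is the number of source-specific tokens in the extended vocabulary:

    import torch
    from onmt.modules import CopyGenerator

    input_size, output_size, pad_idx = 256, 5000, 1
    copy_gen = CopyGenerator(input_size, output_size, pad_idx)

    batch, tlen, src_len, extra_words = 3, 4, 9, 7
    hidden = torch.randn(batch * tlen, input_size)
    attn = torch.softmax(torch.randn(batch * tlen, src_len), dim=-1)
    src_map = torch.zeros(src_len, batch, extra_words)

    # Distribution over the extended vocabulary: (batch x tlen, output_size + extra_words)
    scores = copy_gen(hidden, attn, src_map)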
Structured Attention¶
class onmt.modules.structured_attention.MatrixTree(eps=1e-05)[source]¶
Bases: torch.nn.modules.module.Module
Implementation of the matrix-tree theorem for computing marginals of non-projective dependency parsing. This attention layer is used in the paper “Learning Structured Text Representations” [LL17].
forward(input)[source]¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
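A minimal sketch, assuming the layer consumes a batch of square pairwise score matrices and returns marginals of the same shape (an assumption based on the matrix-tree description above, not a documented contract):

    import torch
    from onmt.modules.structured_attention import MatrixTree

    mt = MatrixTree()
    scores = torch.randn(2, 5, 5)  # assumed (batch, n, n) pairwise scores
    marginals = mt(scores)         # same shape; marginals over non-projective trees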