MultiHead attention
Allows the model to jointly attend to information from different representation subspaces. See the reference: Attention Is All You Need (Vaswani et al., 2017).
nn_multihead_attention(
  embed_dim,
  num_heads,
  dropout = 0,
  bias = TRUE,
  add_bias_kv = FALSE,
  add_zero_attn = FALSE,
  kdim = NULL,
  vdim = NULL
)
Arguments:

embed_dim: total dimension of the model.
num_heads: number of parallel attention heads.
dropout: a dropout probability applied to attn_output_weights. Default: 0.0.
bias: add bias as a module parameter. Default: TRUE.
add_bias_kv: add bias to the key and value sequences at dim=0. Default: FALSE.
add_zero_attn: add a new batch of zeros to the key and value sequences at dim=1. Default: FALSE.
kdim: total number of features in key. Default: NULL (i.e. kdim = embed_dim).
vdim: total number of features in value. Default: NULL (i.e. vdim = embed_dim).
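When kdim or vdim differ from embed_dim, keys and values carry a different number of features than the queries. The sketch below assumes the R module mirrors the PyTorch behaviour of projecting them back to embed_dim internally; all sizes are illustrative.

library(torch)

# Keys and values use a different feature size than the queries.
mha <- nn_multihead_attention(embed_dim = 16, num_heads = 4, kdim = 8, vdim = 12)

query <- torch_randn(10, 2, 16)   # (L, N, embed_dim)
key   <- torch_randn(20, 2, 8)    # (S, N, kdim)
value <- torch_randn(20, 2, 12)   # (S, N, vdim)

out <- mha(query, key, value)
out[[1]]$shape                    # (L, N, embed_dim): 10 2 16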
\mbox{MultiHead}(Q, K, V) = \mbox{Concat}(\mbox{head}_1, \dots, \mbox{head}_h) W^O \quad \mbox{where} \quad \mbox{head}_i = \mbox{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
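As a rough illustration of this formula, the sketch below re-implements the multi-head computation with plain tensor operations. The weight matrices w_q, w_k, w_v, w_o and all sizes are illustrative placeholders, not the module's actual parameters; the head-splitting layout simply follows the shapes described under Inputs below.

library(torch)

embed_dim <- 16; num_heads <- 4
head_dim <- embed_dim / num_heads
L <- 10; S <- 20; N <- 2            # target length, source length, batch size

q <- torch_randn(L, N, embed_dim)
k <- torch_randn(S, N, embed_dim)
v <- torch_randn(S, N, embed_dim)

# Illustrative (untrained) projection matrices standing in for W^Q, W^K, W^V, W^O.
w_q <- torch_randn(embed_dim, embed_dim)
w_k <- torch_randn(embed_dim, embed_dim)
w_v <- torch_randn(embed_dim, embed_dim)
w_o <- torch_randn(embed_dim, embed_dim)

# Project, then split the embedding into num_heads heads:
# (len, N, E) -> (N * num_heads, len, head_dim)
split_heads <- function(x, w, len) {
  torch_matmul(x, w)$view(c(len, N * num_heads, head_dim))$transpose(1, 2)
}
qh <- split_heads(q, w_q, L)
kh <- split_heads(k, w_k, S)
vh <- split_heads(v, w_v, S)

# Scaled dot-product attention for every head in parallel.
scores  <- torch_matmul(qh, kh$transpose(2, 3)) / sqrt(head_dim)
weights <- nnf_softmax(scores, dim = 3)            # (N * num_heads, L, S)
heads   <- torch_matmul(weights, vh)               # (N * num_heads, L, head_dim)

# Concat(head_1, ..., head_h) W^O
concat <- heads$transpose(1, 2)$contiguous()$view(c(L, N, embed_dim))
multihead_out <- torch_matmul(concat, w_o)         # (L, N, E)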
Inputs:
query: (L, N, E) where L is the target sequence length, N is the batch size, E is the embedding dimension.
key: (S, N, E), where S is the source sequence length, N is the batch size, E is the embedding dimension.
value: (S, N, E) where S is the source sequence length, N is the batch size, E is the embedding dimension.
key_padding_mask: (N, S) where N is the batch size, S is the source sequence length.
If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions will be unchanged. If a BoolTensor is provided, the positions with the value of TRUE will be ignored while the positions with the value of FALSE will be unchanged.
attn_mask: 2D mask (L, S) where L is the target sequence length and S is the source sequence length, or 3D mask (N*num_heads, L, S) where N is the batch size, L is the target sequence length, and S is the source sequence length. attn_mask ensures that position i is only allowed to attend the unmasked positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with TRUE are not allowed to attend while positions with FALSE will be unchanged. If a FloatTensor is provided, it will be added to the attention weight. A sketch that builds both masks follows this list.
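The sketch below shows one way to build both masks for the shapes above and pass them to the module. It assumes the forward method accepts arguments named key_padding_mask and attn_mask (mirroring the input names in this list); the concrete sizes and the choice of a causal-style float mask are illustrative.

library(torch)

embed_dim <- 16; num_heads <- 4
L <- 5; S <- 7; N <- 2
mha <- nn_multihead_attention(embed_dim, num_heads)

query <- torch_randn(L, N, embed_dim)
key   <- torch_randn(S, N, embed_dim)
value <- torch_randn(S, N, embed_dim)

# key_padding_mask: (N, S) boolean; TRUE marks key positions to be ignored.
# Here the last two source positions of every batch element are treated as padding.
pad <- matrix(FALSE, nrow = N, ncol = S)
pad[, (S - 1):S] <- TRUE
key_padding_mask <- torch_tensor(pad)

# attn_mask: (L, S) float mask added to the attention weights; -Inf blocks a position.
# Here query position i may only attend source positions j <= i (causal-style).
m <- matrix(0, nrow = L, ncol = S)
m[upper.tri(m)] <- -Inf
attn_mask <- torch_tensor(m, dtype = torch_float())

out <- mha(query, key, value,
           key_padding_mask = key_padding_mask,
           attn_mask = attn_mask)
attn_output <- out[[1]]            # (L, N, E)
attn_output_weights <- out[[2]]    # (N, L, S)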
Outputs:
attn_output: (L, N, E) where L is the target sequence length, N is the batch size, E is the embedding dimension.
attn_output_weights: (N, L, S) where N is the batch size, L is the target sequence length, S is the source sequence length.
if (torch_is_installed()) {
  embed_dim <- 16
  num_heads <- 4
  multihead_attn <- nn_multihead_attention(embed_dim, num_heads)

  query <- torch_randn(10, 32, embed_dim)   # (L, N, E)
  key <- torch_randn(20, 32, embed_dim)     # (S, N, E)
  value <- torch_randn(20, 32, embed_dim)   # (S, N, E)

  out <- multihead_attn(query, key, value)
  attn_output <- out[[1]]            # (L, N, E)
  attn_output_weights <- out[[2]]    # (N, L, S)
}