MultiHead attention
Allows the model to jointly attend to information from different representation subspaces. See the reference: Attention Is All You Need (Vaswani et al., 2017).
nn_multihead_attention(
  embed_dim,
  num_heads,
  dropout = 0,
  bias = TRUE,
  add_bias_kv = FALSE,
  add_zero_attn = FALSE,
  kdim = NULL,
  vdim = NULL
)
Arguments:

embed_dim: total dimension of the model.
num_heads: number of parallel attention heads.
dropout: a dropout probability applied to attn_output_weights. Default: 0.0.
bias: add bias as a module parameter. Default: TRUE.
add_bias_kv: add bias to the key and value sequences at dim=0. Default: FALSE.
add_zero_attn: add a new batch of zeros to the key and value sequences at dim=1. Default: FALSE.
kdim: total number of features in key. Default: NULL (i.e. kdim = embed_dim).
vdim: total number of features in value. Default: NULL (i.e. vdim = embed_dim).
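When kdim or vdim differ from embed_dim, keys and values carry a different number of features than the queries. The sketch below assumes the R module mirrors the PyTorch behaviour of projecting them back to embed_dim internally; all sizes are illustrative.

library(torch)

# Keys and values use a different feature size than the queries.
mha <- nn_multihead_attention(embed_dim = 16, num_heads = 4, kdim = 8, vdim = 12)

query <- torch_randn(10, 2, 16)   # (L, N, embed_dim)
key   <- torch_randn(20, 2, 8)    # (S, N, kdim)
value <- torch_randn(20, 2, 12)   # (S, N, vdim)

out <- mha(query, key, value)
out[[1]]$shape                    # (L, N, embed_dim): 10 2 16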
\mbox{MultiHead}(Q, K, V) = \mbox{Concat}(\mbox{head}_1, \dots, \mbox{head}_h) W^O \quad \mbox{where} \quad \mbox{head}_i = \mbox{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
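As a rough illustration of this formula, the sketch below re-implements the multi-head computation with plain tensor operations. The weight matrices w_q, w_k, w_v, w_o and all sizes are illustrative placeholders, not the module's actual parameters; the head-splitting layout simply follows the shapes described under Inputs below.

library(torch)

embed_dim <- 16; num_heads <- 4
head_dim <- embed_dim / num_heads
L <- 10; S <- 20; N <- 2            # target length, source length, batch size

q <- torch_randn(L, N, embed_dim)
k <- torch_randn(S, N, embed_dim)
v <- torch_randn(S, N, embed_dim)

# Illustrative (untrained) projection matrices standing in for W^Q, W^K, W^V, W^O.
w_q <- torch_randn(embed_dim, embed_dim)
w_k <- torch_randn(embed_dim, embed_dim)
w_v <- torch_randn(embed_dim, embed_dim)
w_o <- torch_randn(embed_dim, embed_dim)

# Project, then split the embedding into num_heads heads:
# (len, N, E) -> (N * num_heads, len, head_dim)
split_heads <- function(x, w, len) {
  torch_matmul(x, w)$view(c(len, N * num_heads, head_dim))$transpose(1, 2)
}
qh <- split_heads(q, w_q, L)
kh <- split_heads(k, w_k, S)
vh <- split_heads(v, w_v, S)

# Scaled dot-product attention for every head in parallel.
scores  <- torch_matmul(qh, kh$transpose(2, 3)) / sqrt(head_dim)
weights <- nnf_softmax(scores, dim = 3)            # (N * num_heads, L, S)
heads   <- torch_matmul(weights, vh)               # (N * num_heads, L, head_dim)

# Concat(head_1, ..., head_h) W^O
concat <- heads$transpose(1, 2)$contiguous()$view(c(L, N, embed_dim))
multihead_out <- torch_matmul(concat, w_o)         # (L, N, E)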
Inputs:
query: (L, N, E) where L is the target sequence length, N is the batch size, E is the embedding dimension.
key: (S, N, E), where S is the source sequence length, N is the batch size, E is the embedding dimension.
value: (S, N, E) where S is the source sequence length, N is the batch size, E is the embedding dimension.
key_padding_mask: (N, S) where N is the batch size, S is the source sequence length.
If a ByteTensor is provided, the non-zero positions will be ignored while the zero positions will be unchanged. If a BoolTensor is provided, the positions with the value of TRUE will be ignored while the positions with the value of FALSE will be unchanged.
attn_mask: 2D mask (L, S) where L is the target sequence length and S is the source sequence length, or 3D mask (N*num_heads, L, S) where N is the batch size, L is the target sequence length, and S is the source sequence length. attn_mask ensures that position i is only allowed to attend the unmasked positions. If a ByteTensor is provided, the non-zero positions are not allowed to attend while the zero positions will be unchanged. If a BoolTensor is provided, positions with TRUE are not allowed to attend while positions with FALSE will be unchanged. If a FloatTensor is provided, it will be added to the attention weight. A sketch that builds both masks follows this list.
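The sketch below shows one way to build both masks for the shapes above and pass them to the module. It assumes the forward method accepts arguments named key_padding_mask and attn_mask (mirroring the input names in this list); the concrete sizes and the choice of a causal-style float mask are illustrative.

library(torch)

embed_dim <- 16; num_heads <- 4
L <- 5; S <- 7; N <- 2
mha <- nn_multihead_attention(embed_dim, num_heads)

query <- torch_randn(L, N, embed_dim)
key   <- torch_randn(S, N, embed_dim)
value <- torch_randn(S, N, embed_dim)

# key_padding_mask: (N, S) boolean; TRUE marks key positions to be ignored.
# Here the last two source positions of every batch element are treated as padding.
pad <- matrix(FALSE, nrow = N, ncol = S)
pad[, (S - 1):S] <- TRUE
key_padding_mask <- torch_tensor(pad)

# attn_mask: (L, S) float mask added to the attention weights; -Inf blocks a position.
# Here query position i may only attend source positions j <= i (causal-style).
m <- matrix(0, nrow = L, ncol = S)
m[upper.tri(m)] <- -Inf
attn_mask <- torch_tensor(m, dtype = torch_float())

out <- mha(query, key, value,
           key_padding_mask = key_padding_mask,
           attn_mask = attn_mask)
attn_output <- out[[1]]            # (L, N, E)
attn_output_weights <- out[[2]]    # (N, L, S)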
Outputs:
attn_output: (L, N, E) where L is the target sequence length, N is the batch size, E is the embedding dimension.
attn_output_weights: (N, L, S) where N is the batch size, L is the target sequence length, S is the source sequence length.
if (torch_is_installed()) {
  embed_dim <- 16
  num_heads <- 4
  multihead_attn <- nn_multihead_attention(embed_dim, num_heads)

  query <- torch_randn(10, 32, embed_dim)   # (L, N, E)
  key <- torch_randn(20, 32, embed_dim)     # (S, N, E)
  value <- torch_randn(20, 32, embed_dim)   # (S, N, E)

  out <- multihead_attn(query, key, value)
  attn_output <- out[[1]]            # (L, N, E)
  attn_output_weights <- out[[2]]    # (N, L, S)
}