This repo contains a simple demonstration of a PyTorch implementation of the transformer architecture. A more detailed post is Annotating the Annotated Transformer, originally inspired by the Annotated Transformer.
The dimensions used in the multi-head attention are as follows:
```python
import torch.nn as nn

class MultiHeadedAttention(nn.Module):
    def __init__(self, d_in, D_q, D_k, D_v, d_out, h, dropout):
        super(MultiHeadedAttention, self).__init__()
        # Project the d_in-dimensional inputs into the query, key, and value spaces.
        self.linear_Q = nn.Linear(d_in, D_q, bias=False)
        self.linear_K = nn.Linear(d_in, D_k, bias=False)
        self.linear_V = nn.Linear(d_in, D_v, bias=False)
        # Output projection from the (concatenated) value dimension D_v back to d_out.
        self.linear_Wo = nn.Linear(D_v, d_out)
        {...}

    def forward(self, query, key, value, mask=None):
        {...}
```
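To make the dimension names concrete, here is a minimal, self-contained sketch of how these projections are typically wired into scaled dot-product attention. It is illustrative only, not the repo's actual forward pass, and it assumes D_q == D_k and that D_q, D_k, D_v are all divisible by h; the class name and the dimension values in the usage check are hypothetical.

```python
import torch
import torch.nn as nn

class TinyMultiHeadedAttention(nn.Module):
    """Illustrative multi-head attention using the dimension names above.

    Assumes D_q == D_k and that D_q, D_k, D_v are divisible by h.
    Not the repo's exact implementation.
    """

    def __init__(self, d_in, D_q, D_k, D_v, d_out, h, dropout=0.1):
        super().__init__()
        assert D_q == D_k and D_q % h == 0 and D_v % h == 0
        self.h = h
        self.linear_Q = nn.Linear(d_in, D_q, bias=False)
        self.linear_K = nn.Linear(d_in, D_k, bias=False)
        self.linear_V = nn.Linear(d_in, D_v, bias=False)
        self.linear_Wo = nn.Linear(D_v, d_out)
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        B = query.size(0)
        # Project, then split the last dimension into h heads: (B, h, seq, dim_per_head).
        Q = self.linear_Q(query).view(B, -1, self.h, self.linear_Q.out_features // self.h).transpose(1, 2)
        K = self.linear_K(key).view(B, -1, self.h, self.linear_K.out_features // self.h).transpose(1, 2)
        V = self.linear_V(value).view(B, -1, self.h, self.linear_V.out_features // self.h).transpose(1, 2)
        # Scaled dot-product attention within each head.
        scores = Q @ K.transpose(-2, -1) / (Q.size(-1) ** 0.5)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = self.dropout(scores.softmax(dim=-1))
        out = attn @ V                                   # (B, h, seq, D_v // h)
        # Concatenate the heads, then apply the output projection D_v -> d_out.
        out = out.transpose(1, 2).contiguous().view(B, -1, self.h * V.size(-1))
        return self.linear_Wo(out)

# Hypothetical dimensions for a quick shape check (not taken from the repo).
attn = TinyMultiHeadedAttention(d_in=512, D_q=512, D_k=512, D_v=512, d_out=512, h=8)
x = torch.randn(2, 10, 512)                              # (batch, seq_len, d_in)
print(attn(x, x, x).shape)                               # torch.Size([2, 10, 512])
```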
For example: