Attention Is All You Need: Reading Notes


These are my reading notes on Vaswani A, Shazeer N, Parmar N, et al.’s Attention Is All You Need. I briefly summarize the logical structure of the paper and list its main ideas. If you need more detail, please read the paper itself. If you have any questions you would like to discuss with me, feel free to contact me at haroldliuj@gmail.com. Have a nice trip!

Model Architecture

Okay! Let’s skip the background, which may be the most fun part of the paper, and look at the model directly!

[Figure: the model architecture (screenshot from the paper)]

The model is shown above. It is still an encoder-decoder structure, but the inner blocks are all built from the attention mechanism. To fully understand the model, we should first know what the attention mechanism is.

1. Attention Mechanism

According to the paper, an attention function can be described as mapping a query and a set of key-value pairs to an output, where the query ($Q$), keys ($K$), values ($V$), and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

My understanding is that the attention mechanism is very similar to human attention. Think about why we pay more attention to some parts of a TV series or a book than to others. We pay more attention to Harry Potter fighting Lord Voldemort than to Harry Potter wandering the corridors of Hogwarts, and we always know when a battle is coming. That is because we can infer what is going on based on the previous story. The attention mechanism does the same thing.

Let’s say $Q$ is the previous story, and $K$ and $V$ are the current story we are reading. First, we need to know the relationship between the previous story and the current story, so we use a similarity function to measure how similar they are ($S=\operatorname{similarity}(Q, K)$). Then we build our attention from this similarity ($\operatorname{Softmax}(S)$). Finally, we assign this attention to the current story by using it to weight $V$.

So the expression of attention is:

$$\operatorname{Attention}(Q, K, V) = \operatorname{Softmax}(\operatorname{similarity}(Q, K))\,V$$
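
The weighted-sum description above can be sketched in a few lines of NumPy. This is only a minimal illustration for a single query vector, not the paper's implementation: the function names `softmax` and `attention`, the toy data, and the choice of a plain dot product as the similarity function are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(query, keys, values, similarity):
    """Weighted sum of values; weights come from similarity(query, key)."""
    scores = np.array([similarity(query, k) for k in keys])  # S
    weights = softmax(scores)                                 # Softmax(S)
    return weights @ values, weights

# Toy data (hypothetical): 3 key-value pairs with 2-dimensional vectors.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]])
query = np.array([1.0, 0.0])

# A plain dot product is used as the similarity function here.
output, weights = attention(query, keys, values, similarity=np.dot)
print(weights)  # the second key is least similar, so it gets the smallest weight
print(output)
```

Any function that scores a query against a key could be passed as `similarity`; the softmax then turns those scores into weights that sum to 1.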

2. Scaled Dot-Product Attention

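The authors introduce this variant to counteract the vanishing gradient of the softmax function.

It defines the similarity function as a scaled dot product, that is:

$$\operatorname{similarity}(Q, K) = \frac{QK^T}{\sqrt{d_k}}, \qquad \operatorname{Attention}(Q, K, V) = \operatorname{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This expression is the same as dot-product similarity except for the $\sqrt{d_k}$ term, where $d_k$ is the dimension of the keys. For large $d_k$ the dot products grow large in magnitude and push the softmax into regions with extremely small gradients. Dividing by $\sqrt{d_k}$ counteracts this.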
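
A small NumPy sketch can make the role of the scaling term concrete. It assumes 2-D arrays with one row per query or key; the names `scaled_dot_product_attention`, `raw_std`, and `scaled_std`, the random inputs, and the sizes are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Row-wise softmax, shifted by the row max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # scaled dot-product similarity
    weights = softmax(scores, axis=-1)    # one weight distribution per query
    return weights @ V, weights

# Random unit-variance inputs (an assumption made only for this demo).
rng = np.random.default_rng(0)
d_k = 512
Q = rng.standard_normal((4, d_k))
K = rng.standard_normal((6, d_k))
V = rng.standard_normal((6, 64))

out, weights = scaled_dot_product_attention(Q, K, V)

# Without scaling, the spread of the scores grows with sqrt(d_k);
# dividing by sqrt(d_k) brings it back to roughly unit scale.
raw_std = (Q @ K.T).std()
scaled_std = (Q @ K.T / np.sqrt(d_k)).std()
print(raw_std, scaled_std)
```

With unscaled scores of that magnitude, the softmax tends to put almost all of its weight on a single key, which is the saturated regime the scaling is meant to avoid.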

3. Multi-Head Attention

This approach first applies $h$ different learned linear projections to $Q$, $K$, and $V$. We then run the attention operation on each projected triple in parallel. Finally, we concatenate the $h$ results (the paper follows this with one more linear projection) to form the final output. This allows the model to jointly attend to information from different representation subspaces at different positions. The whole process is shown in the following picture:

[Figure: multi-head attention (screenshot from the paper)]
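
The project, attend, and concatenate steps can be sketched as follows. This is a simplified illustration under a few assumptions: the $h$ per-head projections are packed into single `d_model x d_model` matrices (`W_q`, `W_k`, `W_v`) and split afterwards, `d_model` is divisible by `h`, and the weights are random rather than learned. The names `multi_head_attention` and `split_heads` are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, h):
    """Q: (n_q, d_model); K, V: (n_k, d_model); every W: (d_model, d_model)."""
    d_model = Q.shape[-1]
    d_k = d_model // h  # per-head dimension (assumes d_model % h == 0)

    def split_heads(x):
        # (n, d_model) -> (h, n, d_k): one slice of the projection per head
        return x.reshape(x.shape[0], h, d_k).transpose(1, 0, 2)

    q = split_heads(Q @ W_q)
    k = split_heads(K @ W_k)
    v = split_heads(V @ W_v)

    # Scaled dot-product attention, run for all heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n_q, n_k)
    heads = softmax(scores, axis=-1) @ v               # (h, n_q, d_k)

    # Concatenate the heads, then apply the final output projection.
    concat = heads.transpose(1, 0, 2).reshape(Q.shape[0], d_model)
    return concat @ W_o

# Toy self-attention call: Q, K, V are all the same sequence x.
rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
x = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o, h)
print(out.shape)
```

Packing the projections into one matrix per input is mathematically the same as keeping $h$ separate smaller matrices; it just keeps the sketch short.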

4. Model Structure

This model stacks $N$ encoder layers and $N$ decoder layers ($N = 6$ in the paper). Each encoder layer has 2 sublayers and each decoder layer has 3 sublayers.

  • Encoder:

    • Multi-Head Attention: this is the multi-head attention discussed above. $Q, K, V$ all come from the output of the previous encoder, so this is self-attention.
    • Feed Forward Layer:
      • $H(x) = \operatorname{ReLU}(xW_1 + b_1)W_2 + b_2$
    • Each sublayer is wrapped in a residual connection and followed by layer normalization.
  • Decoder:

    • Has the same structure as the encoder, except for an additional masked multi-head attention:
      • This attention layer ensures that the decoder cannot attend to later time steps: we cannot know what happens in future unread chapters, let alone assign our attention based on them. $Q, K, V$ come from the output of the previous decoder, so this is also self-attention.
    • Multi-Head Attention: this is the multi-head attention discussed above. $Q$ comes from the previous decoder output, while $K, V$ come from the output of the encoder.
  • Position Encoding

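    • Since this model abandons RNNs, it cannot obtain position information on its own. The authors add a positional encoding to the input embeddings to keep position information. The expression is: $PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$ and $PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$. I won’t talk too much about this expression here.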
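
The remaining pieces in the list above (the paper's sinusoidal positional encoding, the mask used by the decoder's self-attention, and the feed-forward sublayer with its residual connection and layer normalization) can be sketched in NumPy as follows. All function names here are hypothetical, the weights are random, `d_model` is assumed to be even, and the layer normalization omits the learned gain and bias, so treat this as a rough illustration rather than the paper's implementation.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def causal_mask(n):
    """True where position i may attend to position j, i.e. j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Masked-out positions get -inf, so their softmax weight is exactly 0.
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Simplified: normalizes each row, without learned gain and bias.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward_sublayer(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # H(x) with ReLU
    return layer_norm(x + hidden)                      # residual, then norm

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 8, 32
pe = positional_encoding(n, d_model)
x = rng.standard_normal((n, d_model)) + pe             # embeddings + positions

weights = masked_softmax(rng.standard_normal((n, n)), causal_mask(n))
print(np.round(weights, 2))  # upper triangle is 0: no attention to the future

y = feed_forward_sublayer(
    x,
    rng.standard_normal((d_model, d_ff)), np.zeros(d_ff),
    rng.standard_normal((d_ff, d_model)), np.zeros(d_model),
)
print(y.shape)
```

Setting the masked scores to negative infinity before the softmax is one common way to realize the mask; the future positions then receive zero weight while each row still sums to 1.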

So that is all for this paper. I skipped the experiment part. If you want to know more details about the model, please check here.


  • Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in neural information processing systems. 2017: 5998-6008.