Attention Is All You Need: The Transformer


Model

  • A Transformer is an encoder-decoder model architecture that uses the attention mechanism.
  • Massive advantage over RNN-based encoder-decoder architectures, since the Transformer can:
    • benefit from GPU/TPU parallelization;
    • process a larger amount of data in the same amount of time;
    • process all tokens at once (see the sketch after this list)!
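
To make the parallelization point concrete, here is a minimal numpy sketch contrasting an RNN's sequential recurrence with the Transformer-style whole-sequence matrix multiply. The dimensions and random weights are toy assumptions for illustration, not values from the paper:

    import numpy as np

    # Toy sizes (assumptions): 5 tokens, 8-dimensional embeddings.
    seq_len, d_model = 5, 8
    rng = np.random.default_rng(0)
    X = rng.normal(size=(seq_len, d_model))   # embeddings for the whole sequence
    W = rng.normal(size=(d_model, d_model))   # one projection matrix (random here)

    # RNN-style: one step at a time. Each step depends on the previous hidden
    # state, so the loop over time cannot be parallelized.
    h = np.zeros(d_model)
    for t in range(seq_len):
        h = np.tanh(X[t] @ W + h)

    # Transformer-style: every token is transformed in a single matrix
    # multiply, which maps directly onto GPU/TPU parallel hardware.
    H = np.tanh(X @ W)   # shape (seq_len, d_model), all tokens at once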
  • The Encoder and Decoder components are each a stack of identical encoder and decoder layers.
    • The paper that introduced the Transformer stacked 6 layers in each component, as shown in the sketch below.
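
A minimal sketch of the 6-layer stacking using PyTorch's built-in Transformer modules. The sizes d_model=512 and nhead=8 match the paper's base model; the random tensors are placeholder inputs:

    import torch
    import torch.nn as nn

    # One layer each, then stacked 6 deep, as in the original paper.
    encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
    encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

    decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
    decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

    src = torch.rand(10, 32, 512)   # (source length, batch, d_model)
    tgt = torch.rand(20, 32, 512)   # (target length, batch, d_model)
    memory = encoder(src)           # output of the encoder stack
    out = decoder(tgt, memory)      # decoder attends to the encoder output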
  • The model is built with the attention mechanism at its core.
    • Attention relates every token to every other token regardless of their distance in the sequence, which is what improves performance on machine translation tasks (see the sketch below).
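
The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, as defined in the paper. A minimal numpy sketch follows; the toy sizes are assumptions, and in the real model Q, K, and V come from learned linear projections of the token embeddings:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)   # pairwise query-key similarities
        # Softmax over the key positions (numerically stabilized).
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                # weighted sum of the values

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 8))   # 5 tokens, d_model = 8 (toy assumption)
    # Self-attention: queries, keys, and values all come from the same sequence.
    out = scaled_dot_product_attention(X, X, X)   # shape (5, 8)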