
      Attention Is All You Need

      journal-article


          Abstract

          The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
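
          The core operation the abstract refers to is scaled dot-product attention, which the paper defines as Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of that formula; it is an illustration, not the authors' implementation, and the function and variable names are our own.

          import numpy as np

          def scaled_dot_product_attention(q, k, v):
              # q: (n_q, d_k) queries, k: (n_k, d_k) keys, v: (n_k, d_v) values.
              d_k = q.shape[-1]
              scores = q @ k.T / np.sqrt(d_k)  # scaled pairwise similarities
              # Numerically stable row-wise softmax over the keys.
              weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
              weights /= weights.sum(axis=-1, keepdims=True)
              return weights @ v  # attention-weighted combination of the values

          # Example: 2 queries attending over 5 key/value pairs.
          rng = np.random.default_rng(0)
          out = scaled_dot_product_attention(rng.normal(size=(2, 8)),
                                             rng.normal(size=(5, 8)),
                                             rng.normal(size=(5, 16)))
          print(out.shape)  # (2, 16)

          The division by sqrt(d_k) keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with extremely small gradients.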

          Comments

          15 pages, 5 figures


          Author and article information

          Journal
          arXiv
          June 2017
          Article
          DOI: 10.48550/ARXIV.1706.03762
          35895330
          3f4233f3-765b-4222-bba8-e00a7c457ede

          arXiv.org perpetual, non-exclusive license

          History
          12 June 2017
          13 June 2017
          19 June 2017
          20 June 2017
          20 June 2017
          21 June 2017
          30 June 2017
          03 July 2017
          06 December 2017
          07 December 2017

          Categories
          Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences
