A deep learning model architecture designed to process sequential input data, like Recurrent Neural Networks (RNNs). Unlike RNNs, however, Transformers use self-attention to process long input sequences in parallel rather than one element at a time. The architecture was introduced in the paper "Attention Is All You Need" and builds on the sequence-to-sequence ("Seq2Seq") neural network architecture, which transforms one sequence (such as the words of a sentence) into another sequence (a sentence in another language, for example).

Like autoencoders, Seq2Seq models consist of an encoder and a decoder. The encoder maps the input sequence to an intermediate representation, and the decoder maps that intermediate representation to the desired output sequence. An "attention" mechanism determines which parts of the input sequence are most relevant at each step. Traditional Seq2Seq models used LSTMs for the encoder and decoder; Transformers drop the LSTMs entirely and rely on the attention component alone, using a mechanism called multi-head attention.

Transformer models have proven especially powerful for Natural Language Processing and image generation, and have shown versatility across a wide range of applications, including computer vision. Popularized by models such as GPT-3 and BERT, Transformers are quickly becoming the model of choice for NLP problems and for other tasks that transform one sequence into another, such as text-to-image generation.
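The self-attention mechanism at the heart of the Transformer can be sketched in a few lines. The snippet below is a minimal NumPy illustration of scaled dot-product attention (the building block that multi-head attention runs several times in parallel over learned projections); the function names, toy dimensions, and random inputs are illustrative choices, not part of any particular library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1: "where to look"
    return weights @ V, weights

# Toy self-attention: a sequence of 4 token embeddings of dimension 8.
# Using X as queries, keys, and values lets every token attend to every
# other token in a single parallel step -- no sequential recurrence.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out, w = scaled_dot_product_attention(X, X, X)
print(out.shape)  # one updated representation per token: (4, 8)
```

In a full multi-head attention layer, the model would first project the inputs into several independent query/key/value subspaces, run this same computation in each "head", and concatenate the results, letting different heads specialize in different relationships between tokens.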