Working of Encoders in Transformers

An encoder is a neural network component that transforms an input sequence (such as text) into numerical representations, often called embeddings. In a transformer, the encoder processes the entire input sequence at once to capture relationships between all positions. It maps a variable-length input sequence to a sequence of vectors of the same length, where each vector has a fixed dimension. A common use case is encoding a sentence for classification or question answering.

Figure: Encoder-Decoder Architecture in Transformers

Encoders in Transformers

The encoder is the first half of the transformer model and builds the internal representation of the input. It does not merely compress the input into a vector space; it encodes dependencies between tokens through self-attention, which is parallel (all positions are processed at once) and non-local (any position can attend to any other). Because it uses neither recurrence nor convolution, the encoder relies on positional encodings to make its features position-aware. Its key characteristics are:

Captures global context while retaining order information
Uses layer normalization and residual connections for stability
Stacks multiple identical layers for deeper understanding
Attends to both past and future tokens simultaneously

Role of Encoders

The encoder is a core component of the transformer architecture. Its role can be summarized as follows:

Acts as the first major block in the transformer model
Takes input embeddings and generates representations
Each encoder layer applies multi-head self-attention and feed-forward networks
In machine translation, the encoder processes the source language sentence (e.g., "Hello world") and creates rich representations that capture the meaning, context, and relationships between words, which can then be used by a decoder to generate the target language translation.

Working Principle of Encoders

The encoder follows an encode-and-represent approach:

Input embedding: convert tokens to vector representations
Positional encoding: add position information to the input embeddings
Multi-layer processing: apply N identical layers sequentially
Within each layer: apply multi-head self-attention, then a non-linear feed-forward network
Output representation: return one context-aware vector per input token

Figure: Architecture of Encoders in Transformers

Working of Encoders in Transformer

1. Importing Dependencies

The implementation uses PyTorch (torch), its neural network module (torch.nn), and Python's built-in math module. PyTorch must be installed first, for example with pip install torch.

Python

import torch
import torch.nn as nn
import math

2. Positional Encoding

Transformers don’t have recurrence or convolution, so they need positional information to understand the order of tokens.

Python

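class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Precompute encodings for every position up to max_len (assumes an even d_model)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)  # (max_len, 1)
        # Frequencies 1 / 10000^(2i / d_model), computed in log space
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0)  # (1, max_len, d_model), broadcasts over the batch
        # A buffer moves with the module across devices but is not a trainable parameter
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add the encodings for the first seq_len positions
        x = x + self.pe[:, :x.size(1)]
        return x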

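This class adds sinusoidal positional encodings to token embeddings. The encodings are deterministic (not learned): each position gets its own pattern of sine values in the even dimensions and cosine values in the odd dimensions, with the wavelength growing as the dimension index increases.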
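To make the formula behind the class concrete, here is a minimal pure-Python sketch that computes the encoding for a single position. The function name sinusoidal_encoding and the tiny d_model are illustrative assumptions, not part of the implementation above; the sketch is only meant to show which value lands in which dimension.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position.

    Hypothetical helper for illustration: dimension j uses the frequency
    index i = j // 2, sine for even j and cosine for odd j.
    """
    vec = []
    for j in range(d_model):
        i = j // 2
        angle = pos / (10000 ** (2 * i / d_model))
        vec.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return vec

# Position 0 always encodes to sin(0) = 0 and cos(0) = 1, alternating.
print(sinusoidal_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]

# Later positions differ most in the low (fast-changing) dimensions.
print(sinusoidal_encoding(1, 4))
print(sinusoidal_encoding(2, 4))
```

The term 10000 ** (2 * i / d_model) here is the reciprocal of div_term in the class, which computes the same quantity in log space.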

3. Multi-Head Self-Attention

This module allows the model to attend to different parts of the sequence simultaneously. It splits the input into multiple "heads", computes scaled dot-product attention for each, and then concatenates the results. This helps capture diverse relationships between tokens more effectively than single-head attention.

Linear projections for Q, K, V
Scaled Dot-Product Attention
Softmax to get attention weights
Concatenate heads

Python

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        # Validate before deriving head_dim so a bad configuration fails early
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_linear = nn.Linear(embed_dim, embed_dim)
        self.k_linear = nn.Linear(embed_dim, embed_dim)
        self.v_linear = nn.Linear(embed_dim, embed_dim)
        self.out_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Project, then split into heads:
        # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
        Q = self.q_linear(query).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        K = self.k_linear(key).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        V = self.v_linear(value).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product scores: (batch, num_heads, query_len, key_len)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            # Positions where mask == 0 get -inf, so softmax gives them zero weight
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = torch.softmax(scores, dim=-1)
        attended_output = torch.matmul(attention_weights, V)
        # Merge the heads back: (batch, seq_len, embed_dim)
        attended_output = attended_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.head_dim)
        output = self.out_linear(attended_output)
        return output
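
The attention step inside the module can be looked at in isolation. The following is a minimal single-head sketch on a tiny random tensor; the function name scaled_dot_product_attention, the tensor sizes, and the hand-written mask are assumptions chosen for illustration. It shows that a masked key receives zero attention weight and that each row of weights still sums to 1.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Single-head attention: softmax(q k^T / sqrt(d)) v (illustrative helper)."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # Keys where mask == 0 are excluded from the softmax
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
q = k = v = torch.randn(1, 4, 8)        # (batch, seq_len, head_dim)
mask = torch.tensor([[[1, 1, 1, 0]]])   # (batch, 1, seq_len): last key is padding

out, weights = scaled_dot_product_attention(q, k, v, mask)
print(out.shape)           # torch.Size([1, 4, 8])
print(weights[0, :, 3])    # weight on the masked key, zero for every query
print(weights.sum(dim=-1)) # every row of weights sums to 1
```

The multi-head version runs this same computation once per head on a head_dim-sized slice of the projections, then concatenates the results.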

4. Position-wise Feed-Forward Network

Each token's representation is passed through a two-layer MLP with ReLU activation, applied independently. This enhances the model's ability to transform and abstract the attended features, enabling richer representations beyond just attention-based mixing.

Python

class PositionwiseFeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim):
        super().__init__()
        self.linear1 = nn.Linear(embed_dim, ff_dim)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(ff_dim, embed_dim)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))
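
"Applied independently" means that the output for one token depends only on that token's own vector. A small sketch can check this; it uses nn.Sequential as a stand-in for the class above and arbitrary small sizes (8 and 32), both of which are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the feed-forward block: Linear -> ReLU -> Linear
ffn = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))

x = torch.randn(2, 5, 8)          # (batch, seq_len, embed_dim)

full = ffn(x)                     # run on the whole sequence
single = ffn(x[:, 2])             # run only on the token at position 2

# The result at position 2 matches, up to floating-point error,
# whether or not the other tokens were present.
same = torch.allclose(full[:, 2], single, atol=1e-5)
print(full.shape)                 # torch.Size([2, 5, 8])
print(same)
```

Mixing information across positions is therefore left entirely to the attention sub-layer.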

5. Encoder Layer

This is a single layer of the Transformer encoder. It combines multi-head self-attention and feed-forward sub-layers, each followed by residual connections and layer normalization.

Python

class EncoderLayer(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout_rate):
        super().__init__()
        self.self_attn = MultiHeadAttention(embed_dim, num_heads)
        self.feed_forward = PositionwiseFeedForward(embed_dim, ff_dim)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x, mask=None):
        # Sub-layer 1: self-attention (query, key and value are all x),
        # followed by a residual connection and layer normalization
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Sub-layer 2: feed-forward network, again with residual + normalization
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

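The residual connections let gradients flow directly through the stack, and layer normalization keeps activations in a stable range, which makes deep stacks of these layers easier to train.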
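
The "add, then normalize" pattern used after each sub-layer can be sketched on its own. In the snippet below the random tensors standing in for the layer input and the sub-layer output are assumptions for illustration; it shows that nn.LayerNorm, with its default initialization, rescales each token vector to roughly zero mean and unit variance regardless of the input scale.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

norm = nn.LayerNorm(8)

x = torch.randn(2, 5, 8) * 10 + 3      # layer input with a large scale and offset
sublayer_out = torch.randn(2, 5, 8)    # stand-in for attention or feed-forward output

y = norm(x + sublayer_out)             # residual connection, then layer normalization

# Statistics are computed per token, over the last (feature) dimension.
token_mean = y.mean(dim=-1)
token_var = y.var(dim=-1, unbiased=False)
print(token_mean.abs().max())          # close to 0
print(token_var.min(), token_var.max())  # both close to 1
```

Normalizing per token rather than per batch means the result does not depend on the other sequences in the batch.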

6. Full Encoder (Stack of Encoder Layers)

This stacks multiple EncoderLayer modules to form the full encoder. It starts with token and positional embeddings, applies dropout, and passes the result through each encoder layer in turn.

Token Embedding
Add Positional Encoding
Pass through N encoder layers

Python

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, ff_dim, dropout_rate, max_len=5000):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.positional_encoding = PositionalEncoding(embed_dim, max_len)
        # nn.ModuleList registers every layer so its parameters are trained
        self.layers = nn.ModuleList([EncoderLayer(embed_dim, num_heads, ff_dim, dropout_rate) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, src, src_mask):
        # src: (batch, seq_len) token ids -> x: (batch, seq_len, embed_dim)
        x = self.token_embedding(src)
        x = self.dropout(self.positional_encoding(x))
        # Each layer refines the representation; the shape stays the same throughout
        for layer in self.layers:
            x = layer(x, src_mask)
        return x

The output is a context-rich representation of the input sequence suitable for downstream tasks like translation or classification.
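
For a task such as sentence classification, the per-token output usually has to be reduced to one vector per sequence. One common option is mean pooling over the non-padding tokens. The sketch below is an illustration under stated assumptions: a random tensor stands in for the encoder output, token id 0 is assumed to mean padding, and the pooling choice itself is not part of the encoder described here.

```python
import torch

torch.manual_seed(0)

# Stand-in for encoder output: (batch, seq_len, embed_dim)
encoded = torch.randn(2, 5, 8)

# Token ids for the same batch; 0 is assumed to be the padding id
token_ids = torch.tensor([[5, 7, 9, 0, 0],
                          [3, 4, 0, 0, 0]])

# 1.0 for real tokens, 0.0 for padding, shaped to broadcast over features
keep = (token_ids != 0).unsqueeze(-1).float()      # (batch, seq_len, 1)

# Average only over the real tokens of each sequence
sentence_vec = (encoded * keep).sum(dim=1) / keep.sum(dim=1)
print(sentence_vec.shape)  # torch.Size([2, 8])
```

A classifier head (for example a single linear layer) could then be applied to sentence_vec.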

7. Example Usage

Python

vocab_size = 10000
embed_dim = 512
num_layers = 6
num_heads = 8
ff_dim = 2048
dropout_rate = 0.1

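encoder = Encoder(vocab_size, embed_dim, num_layers, num_heads, ff_dim, dropout_rate)
# Random batch: 32 sequences of 50 token ids each
src_tokens = torch.randint(0, vocab_size, (32, 50))
# Treat token id 0 as padding; shape (32, 1, 1, 50) broadcasts over heads and query positions
src_mask = (src_tokens != 0).unsqueeze(1).unsqueeze(2)

encoded_representation = encoder(src_tokens, src_mask)
print(encoded_representation.shape)  # torch.Size([32, 50, 512])
print(encoded_representation)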

In this example, the encoder is initialized with its hyperparameters (embedding size, number of layers and heads, and so on). A random batch of 32 sequences, each 50 tokens long, is passed through along with a mask that hides padded tokens (token id 0) during attention. The output has shape (32, 50, 512): one 512-dimensional encoded vector for every token of every sequence.
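
The way the padding mask is built and applied can be seen more easily on a tiny hand-written batch. In this sketch the token ids, the PAD_ID constant, and the all-zero scores tensor are assumptions for illustration; the mask construction is the same as in the example above.

```python
import torch

PAD_ID = 0  # assumed padding id, matching the `!= 0` test in the example

src_tokens = torch.tensor([[5, 7, 9, 0, 0],
                           [3, 4, 0, 0, 0]])

# (batch, seq_len) -> (batch, 1, 1, seq_len)
src_mask = (src_tokens != PAD_ID).unsqueeze(1).unsqueeze(2)
print(src_mask.shape)  # torch.Size([2, 1, 1, 5])

# Dummy attention scores: (batch, num_heads, query_len, key_len)
scores = torch.zeros(2, 4, 5, 5)

# The mask broadcasts over heads and query positions, hiding padded keys
masked = scores.masked_fill(src_mask == 0, float("-inf"))
weights = torch.softmax(masked, dim=-1)

print(weights[0, 0, 0])  # one third on each of the 3 real tokens, 0 on padding
print(weights[1, 0, 0])  # one half on each of the 2 real tokens, 0 on padding
```

Because the mask only has a key dimension, padded positions are never attended to, although they still produce output vectors of their own.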


Applications of Transformer Encoders

Sentence classification
Named Entity Recognition (NER)
Question Answering Systems
Document Embeddings
Machine translation
