An encoder is a neural network component that transforms an input sequence (such as text) into numerical representations called embeddings. In a transformer, the encoder processes the entire input sequence at once to capture relationships between all positions. It maps a variable-length sequence to one fixed-dimensional vector per token, so the output length follows the input length while the feature size stays constant. A common use case is encoding a sentence for classification or question answering.

Encoders in Transformers
The encoder is the first half of the transformer model and builds the internal representation of the input elements. It does not merely compress the input into a vector space; it encodes dependencies between tokens using self-attention, which runs in parallel over all positions and can relate tokens regardless of how far apart they are. The encoder learns position-aware features without relying on recurrence or convolution.
- Captures global context while retaining order information
- Layer normalization and residual connections for stability
- Stacking of multiple identical layers for deeper understanding
- Can attend to both past and future tokens simultaneously
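The last point can be illustrated with a small sketch. It assumes toy random token vectors and a single attention head with no learned projections, so it only shows the masking behavior, not a full attention layer. The causal (decoder-style) variant is included purely for contrast.

```python
import torch

torch.manual_seed(0)
seq_len, dim = 4, 8
x = torch.randn(seq_len, dim)  # toy token vectors (hypothetical values)

scores = x @ x.T / dim ** 0.5  # (seq_len, seq_len) similarity scores

# Encoder-style attention: no causal mask, every position sees all others.
encoder_weights = torch.softmax(scores, dim=-1)

# Decoder-style attention, for contrast: later positions are hidden.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
decoder_weights = torch.softmax(scores.masked_fill(~causal, float("-inf")), dim=-1)

# Every encoder weight is non-zero, including weights on future tokens.
print((encoder_weights > 0).all())
# The causal weights above the diagonal (future tokens) sum to zero.
print(decoder_weights.triu(diagonal=1).sum())
```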
Role of Encoders
Within the transformer architecture, the encoder has the following role:
- Acts as the first major block in the transformer model
- Takes input embeddings and generates representations
- Each encoder layer applies multi-head self-attention and feed-forward networks
- In machine translation, the encoder processes the source language sentence (e.g., "Hello world") and creates rich representations that capture the meaning, context, and relationships between words, which can then be used by a decoder to generate the target language translation.
Working Principle of Encoders
The encoder turns tokens into contextual representations in the following steps:
- Input embedding: convert tokens to vector representations
- Positional encoding: add position information to the input embeddings
- Multi-layer processing: apply N identical layers sequentially
- Within each layer, apply multi-head self-attention followed by a non-linear feed-forward network
- Output representation: one contextual vector per input token
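The steps above can be sketched compactly with PyTorch's built-in nn.TransformerEncoderLayer and nn.TransformerEncoder modules. This is only a minimal illustration under assumed toy sizes (vocab_size, d_model, and the other values are hypothetical), not the implementation developed in the rest of this article.

```python
import math
import torch
import torch.nn as nn

vocab_size, d_model, n_layers, seq_len = 100, 32, 2, 10  # hypothetical toy sizes

# Step 1: token embedding.
embedding = nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (2, seq_len))  # (batch, seq_len)
x = embedding(tokens)                                # (batch, seq_len, d_model)

# Step 2: sinusoidal positional encoding added to the embeddings.
positions = torch.arange(seq_len).unsqueeze(1)       # (seq_len, 1)
freqs = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(positions * freqs)
pe[:, 1::2] = torch.cos(positions * freqs)
x = x + pe                                           # broadcasts over the batch

# Steps 3 and 4: N layers, each with self-attention and a feed-forward network.
layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=4, dim_feedforward=64, dropout=0.0, batch_first=True
)
encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

# Step 5: one contextual vector per input token.
out = encoder(x)
print(out.shape)
```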

Working of Encoders in Transformer
1. Importing Dependencies
The implementation uses PyTorch (torch and torch.nn) and Python's built-in math module.
import torch
import torch.nn as nn
import math
2. Positional Encoding
Transformers don't have recurrence or convolution, so they need positional information to understand the order of tokens.
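class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Precompute the encoding table once: one row per position.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Frequencies 1 / 10000^(2i / d_model), computed in log space.
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        pe = pe.unsqueeze(0)  # (1, max_len, d_model), broadcasts over the batch
        # A buffer moves with the module across devices but is not trained.
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model); add encodings for the first seq_len positions.
        x = x + self.pe[:, :x.size(1)]
        return x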
This class adds sinusoidal positional encodings to the token embeddings. The encodings are deterministic rather than learned: each position receives a distinct pattern of sine and cosine values, and the frequency of each wave depends on the index of the embedding dimension.
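To make the formula concrete, the following sketch computes individual encoding values in plain Python, using PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)). The helper name positional_encoding_value is hypothetical and exists only for this illustration.

```python
import math

def positional_encoding_value(pos, dim_index, d_model):
    """Return the sinusoidal encoding for one position and one dimension."""
    pair = dim_index // 2  # dimensions 2i and 2i+1 share one frequency
    angle = pos / (10000 ** (2 * pair / d_model))
    # Even dimensions use sine, odd dimensions use cosine.
    return math.sin(angle) if dim_index % 2 == 0 else math.cos(angle)

d_model = 8
row0 = [positional_encoding_value(0, j, d_model) for j in range(d_model)]
row1 = [positional_encoding_value(1, j, d_model) for j in range(d_model)]

# Position 0 gives sin(0) = 0 on even dimensions and cos(0) = 1 on odd ones.
print(row0)
# Position 1 gives a different pattern, which is how positions are told apart.
print([round(v, 4) for v in row1])
```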
3. Multi-Head Self-Attention
This module allows the model to attend to different parts of the sequence simultaneously. It splits the input into multiple "heads", computes scaled dot-product attention for each, and then concatenates the results. This helps capture diverse relationships between tokens more effectively than single-head attention.
- Linear projections for Q, K, V
- Scaled Dot-Product Attention
- Softmax to get attention weights
- Concatenate heads
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        # Validate before dividing so a bad configuration fails with a clear message.
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_linear = nn.Linear(embed_dim, embed_dim)
        self.k_linear = nn.Linear(embed_dim, embed_dim)
        self.v_linear = nn.Linear(embed_dim, embed_dim)
        self.out_linear = nn.Linear(embed_dim, embed_dim)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Project, then split into heads: (batch, num_heads, seq_len, head_dim).
        Q = self.q_linear(query).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        K = self.k_linear(key).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        V = self.v_linear(value).view(batch_size, -1, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product scores: (batch, num_heads, seq_len, seq_len).
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.head_dim)
        if mask is not None:
            # Positions where mask == 0 get zero weight after the softmax.
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = torch.softmax(scores, dim=-1)
        attended_output = torch.matmul(attention_weights, V)
        # Merge the heads back into (batch, seq_len, embed_dim).
        attended_output = attended_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.head_dim)
        output = self.out_linear(attended_output)
        return output
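The core computation inside each head can be looked at in isolation. The sketch below is a minimal single-head version with no learned projections and no mask, using assumed toy shapes. It also shows how the view and transpose calls split an embedding into heads.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Single-head attention: softmax(q k^T / sqrt(d)) v."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(1, 5, 16)  # (batch, seq_len, embed_dim), toy values

out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape)            # same shape as the input
print(weights.sum(dim=-1))  # each row of attention weights sums to 1

# Splitting 16 features into 4 heads of size 4, as the class above does.
heads = x.view(1, -1, 4, 4).transpose(1, 2)
print(heads.shape)          # (batch, num_heads, seq_len, head_dim)
```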
4. Position-wise Feed-Forward Network
Each token's representation is passed through a two-layer MLP with ReLU activation, applied independently. This enhances the model's ability to transform and abstract the attended features, enabling richer representations beyond just attention-based mixing.
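class PositionwiseFeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim):
        super().__init__()
        self.linear1 = nn.Linear(embed_dim, ff_dim)  # expand to the hidden size
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(ff_dim, embed_dim)  # project back to embed_dim

    def forward(self, x):
        # nn.Linear acts on the last dimension, so each token is transformed independently.
        return self.linear2(self.relu(self.linear1(x)))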
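"Applied independently" means that the result for one token does not depend on any other token in the sequence. The following sketch checks this with a stand-in network built from nn.Sequential (an assumption made to keep the example self-contained; it mirrors the two-layer structure above with toy sizes).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for the feed-forward block: Linear -> ReLU -> Linear.
ffn = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

x = torch.randn(2, 5, 16)   # (batch, seq_len, embed_dim)
full = ffn(x)               # run on the whole sequence at once
one_token = ffn(x[:, 3])    # run on the token at position 3 alone

# The two results match, so no information is mixed across positions.
print(torch.allclose(full[:, 3], one_token, atol=1e-5))
```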
5. Encoder Layer
This is a single layer of the Transformer encoder. It combines multi-head self-attention and feed-forward sub-layers, each followed by residual connections and layer normalization.
class EncoderLayer(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout_rate):
        super().__init__()
        self.self_attn = MultiHeadAttention(embed_dim, num_heads)
        self.feed_forward = PositionwiseFeedForward(embed_dim, ff_dim)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x, mask=None):
        # Sub-layer 1: self-attention (query, key, and value are all x).
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))  # residual, then layer norm
        # Sub-layer 2: position-wise feed-forward network.
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))  # residual, then layer norm
        return x
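The residual connections give gradients a direct path through a deep stack of layers, and layer normalization keeps each token's activations at a consistent scale. Together they help the model learn stable and expressive representations of sequences.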
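The effect of the "add and normalize" step can be seen on a small example. This sketch uses random tensors as stand-ins for the layer input and the sub-layer output (both are assumptions for illustration), and a freshly initialized nn.LayerNorm, whose scale and shift start at 1 and 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 16
norm = nn.LayerNorm(embed_dim)

x = torch.randn(2, 5, embed_dim) * 3 + 2     # layer input with an arbitrary scale
sublayer_out = torch.randn(2, 5, embed_dim)  # stands in for attention or FFN output

y = norm(x + sublayer_out)                   # residual connection, then layer norm

# Statistics are computed per token, over the embedding dimension.
mean = y.mean(dim=-1)
var = ((y - mean.unsqueeze(-1)) ** 2).mean(dim=-1)
print(mean)  # close to 0 for every token
print(var)   # close to 1 for every token
```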
6. Full Encoder (Stack of Encoder Layers)
This stacks multiple EncoderLayer modules to form the full encoder. It starts with token embeddings, adds positional encodings, applies dropout, and passes the result through each encoder layer in turn.
- Token Embedding
- Add Positional Encoding
- Pass through N encoder layers
class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_layers, num_heads, ff_dim, dropout_rate, max_len=5000):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        self.positional_encoding = PositionalEncoding(embed_dim, max_len)
        # nn.ModuleList registers every layer so its parameters are trained.
        self.layers = nn.ModuleList([EncoderLayer(embed_dim, num_heads, ff_dim, dropout_rate) for _ in range(num_layers)])
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, src, src_mask):
        # src: (batch, seq_len) token IDs -> x: (batch, seq_len, embed_dim).
        x = self.token_embedding(src)
        x = self.dropout(self.positional_encoding(x))
        # Each layer refines the representation produced by the previous one.
        for layer in self.layers:
            x = layer(x, src_mask)
        return x
The output is a context-rich representation of the input sequence suitable for downstream tasks like translation or classification.
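For classification, the per-token output usually has to be reduced to one vector per sequence. One common option, sketched here as an assumption rather than as part of the encoder above, is to average the token vectors while skipping padded positions and then apply a linear layer. The tensor encoded is a random stand-in for the encoder output, and token ID 0 is assumed to mark padding.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, embed_dim, num_classes = 2, 4, 8, 3  # hypothetical toy sizes

encoded = torch.randn(batch, seq_len, embed_dim)     # stand-in for encoder output
tokens = torch.tensor([[5, 7, 2, 0], [4, 0, 0, 0]])  # 0 is assumed to be padding

# 1.0 for real tokens, 0.0 for padding: (batch, seq_len, 1).
keep = (tokens != 0).unsqueeze(-1).float()

# Masked mean: sum the real token vectors and divide by their count.
sentence_vec = (encoded * keep).sum(dim=1) / keep.sum(dim=1)

classifier = nn.Linear(embed_dim, num_classes)       # hypothetical classifier head
logits = classifier(sentence_vec)

print(sentence_vec.shape)  # one vector per sequence
print(logits.shape)        # one score per class for each sequence
```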
7. Example Usage
vocab_size = 10000
embed_dim = 512
num_layers = 6
num_heads = 8
ff_dim = 2048
dropout_rate = 0.1
encoder = Encoder(vocab_size, embed_dim, num_layers, num_heads, ff_dim, dropout_rate)
src_tokens = torch.randint(0, vocab_size, (32, 50))
src_mask = (src_tokens != 0).unsqueeze(1).unsqueeze(2)
encoded_representation = encoder(src_tokens, src_mask)
print(encoded_representation.shape)
print(encoded_representation)
In this example, the encoder is initialized with its hyperparameters (embedding size, number of layers and heads, and so on). A random batch of 32 sequences of 50 token IDs is passed through, along with a mask that treats token ID 0 as padding so that those positions are ignored during attention. The output has shape (32, 50, 512): one 512-dimensional vector for each token in each sequence.
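The effect of the padding mask is easier to see on a short sequence that actually contains padding. The sketch below builds the mask the same way as the example, then applies it to random attention scores (toy values standing in for the real Q and K products), assuming token ID 0 is the padding ID.

```python
import torch

tokens = torch.tensor([[5, 7, 2, 0, 0]])        # one sequence; 0 is assumed to be padding
mask = (tokens != 0).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, seq_len)

torch.manual_seed(0)
scores = torch.randn(1, 2, 5, 5)                # (batch, heads, query, key), toy values

# The mask broadcasts over heads and queries, hiding the padded key positions.
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(mask.shape)
print(weights[0, 0])  # the last two columns (padded keys) are exactly 0
```

Note that this mask hides padded positions as keys only. The padded positions still produce output vectors of their own, which downstream code typically ignores.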
Applications of Transformer Encoders
- Sentence classification
- Named Entity Recognition (NER)
- Question Answering Systems
- Document Embeddings
- Machine translation