Reduce and Refine: 6 Equals 28
Do All Layers in an LLM Actually Matter?
Large Language Models process text by passing it through a stack of layers, each intended to transform the latent representation in some meaningful way. But this leads us to a question:
Does every layer actually contribute to the final representation?
In theory, each layer should introduce new structure, refine semantic understanding, or reshape intermediate representations. But are we sure that each layer in an LLM contributes equally to the final output? To answer this question, we ran a simple experiment on Qwen3-Embedding-0.6B, measuring how much each layer alters the latent representation of the same input text. Throughout this blog we refer to this measurement as the representation shift, which reflects how much the latent representation changes from one layer to the next.
Experiment Setup
We prepared a small, diverse set of short text snippets covering casual messages, reviews, technical statements, error logs, code, fiction, and everyday instructions. These examples were used consistently across all tests:
```python
# texts used for the experiment
texts = [
    "Hey, is the package arriving today?",
    "This product was amazing, totally worth it!",
    "LOL that’s hilarious",
    "In quantum mechanics, the wavefunction describes the probability amplitude of a particle’s state.",
    "The industrial revolution significantly transformed manufacturing processes and labor systems.",
    "User: John Doe, Age: 29, Status: Active",
    "{'error': 'Invalid token', 'code': 401}",
    "Policy: All employees must adhere to the security guidelines listed below.",
    "The dragon circled above the valley, its scales shimmering like molten gold.",
    "She opened the old journal, revealing pages filled with cryptic symbols.",
    "def add(a, b): return a + b  # simple addition function",
    "API rate limit exceeded. Retry in 30 seconds.",
    "Step 1: Preheat the oven to 350°F.",
    "To reset your router, hold the button for 10 seconds until the LED blinks.",
    "A small brown dog jumps over a wooden fence.",
    "Three people are sitting around a table discussing a project.",
]
```
For the above texts, we captured the hidden representations at every layer and computed how much the embedding shifted from one layer to the next. The code and visualization for this are shown below.
We first extracted the hidden representations from every decoder layer. For each input sentence, we collected the normalized hidden states across layers and then computed the cosine similarity between consecutive layers. This gives us a direct measure of how much each layer changes the representation: layers with high similarity values make very small updates, while layers with lower similarity values introduce larger transformations. This analysis produces the visualization shown below.
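A minimal sketch of this measurement, assuming the Hugging Face transformers interface and mean pooling over tokens (the exact script and pooling details in our experiments may differ):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load the teacher embedding model and expose all intermediate hidden states.
model_name = "Qwen/Qwen3-Embedding-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def layer_similarities(text: str) -> list[float]:
    """Cosine similarity between mean-pooled, normalized hidden states of
    consecutive layers. Values near 1 mean the layer barely changes the
    representation; lower values mean a larger representation shift."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    mask = inputs["attention_mask"].unsqueeze(-1)              # [1, seq, 1]
    pooled = [
        F.normalize((h * mask).sum(dim=1) / mask.sum(dim=1), dim=-1)
        for h in out.hidden_states                              # embeddings + each decoder layer
    ]
    return [
        F.cosine_similarity(pooled[i], pooled[i + 1]).item()
        for i in range(len(pooled) - 1)
    ]

# One row of similarities per text; plotting these rows gives the visualization below.
shifts = [layer_similarities(t) for t in texts]
```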
From the visualization, it’s clear that the first few layers and the final layers produce noticeable changes in the latent representation. The middle layers, however, show consecutive-layer similarities extremely close to 1, i.e., almost no representation shift, indicating that they barely alter the representation. This doesn’t necessarily mean these layers contribute nothing, but it does suggest that, for this embedding task, the model is not fully utilizing its depth.
For example, layer 3 alone drops the consecutive-layer similarity to roughly 0.5, showing that a single layer is capable of making a substantial transformation. Yet layers 5 through 20 produce almost no change at all. If so many consecutive layers introduce minimal movement, it raises the possibility that this part of the model is redundant or underutilized for the task.
This leads to our working hypothesis: if a single layer can achieve a large shift, we may be able to remove most of the middle layers and replace them with a single decoder layer that captures their combined effect. In theory, this compressed architecture could perform comparably to the full 28-layer model while being much smaller and faster.
Model Compression and Distillation Framework
Stage 1: Constructing the Student Model
A few notations used in this blog:
Student Model: the text-embedding model being trained, tasked with learning to produce effective vector representations.
Stage-1 Student Model: the text-embedding model obtained after the first stage of training. In some places we also call it Tarka-Embedding-300M-V1-Preview; both names refer to the same model.
Stage-2 Student Model: for Stage 2, we further pruned several layers from the Stage-1 Student Model and used that reduced version for the next training phase. The notation Stage-2 Student Model* refers to the pruned-but-not-yet-trained version; the fully trained Stage-2 model is referred to as Stage-2 Student Model or Tarka-Embedding-250M-V1.
Teacher Model: the embedding model serving as a teacher, guiding the student model toward effective vectors. This model is not trained.
s_x: the normalized vector representation of a text x produced by the student model.
t_x: the normalized vector representation of a text x produced by the teacher model.
S_X: a matrix of normalized vector representations for a batch of texts X produced by the student model.
T_X: a matrix of normalized vector representations for a batch of texts X produced by the teacher model.
Layer Pruning Based on Representation Shift
To build the student model for Stage 1, we remove the decoder layers whose representation shift is closest to zero, i.e., whose consecutive-layer similarity is closest to 1. In practice, this corresponds to removing layers 4 through 22 and replacing them with a single layer whose weights are set to the average of the removed layers, allowing the student model to approximate their combined effect with a much smaller architecture. The resulting student model contains 10 decoder layers with roughly 300M parameters.
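A rough sketch of this pruning-and-averaging step, assuming the Hugging Face Qwen3 module layout and 0-indexed layers (the exact indices and details in our training code may differ):

```python
import copy
import torch
from transformers import AutoModel

teacher = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
layers = teacher.layers                       # ModuleList of 28 decoder layers

# Remove the low-shift middle block and keep the rest (indices are an assumption).
keep_front, keep_back = 4, 23                 # layers[4:23] are removed
removed = list(layers[keep_front:keep_back])

# Replacement layer: average the weights of all removed layers.
avg_layer = copy.deepcopy(removed[0])
avg_state = {
    name: torch.stack([layer.state_dict()[name] for layer in removed]).mean(dim=0)
    for name in removed[0].state_dict()
}
avg_layer.load_state_dict(avg_state)

# Early layers + averaged layer + late layers -> 10 decoder layers in total.
teacher.layers = torch.nn.ModuleList(
    list(layers[:keep_front]) + [avg_layer] + list(layers[keep_back:])
)
teacher.config.num_hidden_layers = len(teacher.layers)
```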
A few architectural adjustments further tailor the model to the embedding task:
Bidirectional Attention: Qwen Embedding originally uses causal attention, which is unnecessary for embeddings and restricts information flow. We replace it with fully bidirectional attention.
Pooling Strategy: The original model relies on last-token pooling—a natural choice for causal models. With bidirectional attention, mean pooling provides a more balanced representation and is therefore adopted.
The objective of the first stage is to enable the student model to learn text representations effectively from multiple teacher models by aligning its output vectors with the corresponding teacher vectors, and to observe how the remaining hidden layers adapt. To achieve this, we drew inspiration from the Jasper and Stella distillation work, which designed three loss functions that progress from a specific to a broader perspective. The first loss function is the cosine loss.
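A plausible formulation, in the spirit of the Jasper/Stella cosine distillation loss (the exact form used in training may differ slightly), is:

$$\mathcal{L}_{\text{cosine}} = \frac{1}{m}\sum_{x \in X}\left(1 - s_x \cdot t_x\right)$$

where m is the batch size; since both vectors are normalized, $s_x \cdot t_x$ is exactly their cosine similarity.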
The second loss function, the similarity loss, models the semantic-matching differences between the student and teacher models from a text-pair perspective. It encourages the student and teacher models to make relatively consistent similarity judgments, without enforcing an absolute fit between the student and the teacher.
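One plausible way to write this, treating it as a match between the in-batch similarity matrices of student and teacher (our sketch, using the batch matrices $S_X$ and $T_X$ defined above; the exact form may differ):

$$\mathcal{L}_{\text{sim}} = \frac{1}{m^2}\left\lVert S_X S_X^{\top} - T_X T_X^{\top} \right\rVert_F^2$$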
The third loss function leverages relative comparison signals, inspired by the CoSENT loss. For each batch of text data, we use the teacher models to automatically generate soft labels for all text pairs, thereby identifying potential positive and negative samples. The student model is then trained so that the similarity between positive pairs exceeds that between negative pairs, with a margin hyperparameter controlling the degree of this difference. If the batch size is m, the total number of text pairs (i.e., N) is given by N = m(m−1)/2.
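A plausible CoSENT-style instantiation (a sketch; the exact soft-labeling details may differ) penalizes the student whenever a pair the teacher scores as less similar is ranked above a pair the teacher scores as more similar, beyond the margin:

$$\mathcal{L}_{\text{resim}} = \log\left(1 + \sum_{\operatorname{sim}_T(i,j)\,>\,\operatorname{sim}_T(k,l)\,+\,\text{margin}} \exp\!\big(\lambda\left[\operatorname{sim}_S(k,l) - \operatorname{sim}_S(i,j)\right]\big)\right)$$

where $\operatorname{sim}_T$ and $\operatorname{sim}_S$ denote teacher and student cosine similarities over the N text pairs, and $\lambda$ is a temperature-like scaling factor (an assumption).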
The final loss is a weighted sum of the three losses above, with hyperparameters λ1, λ2, and λ3.
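In symbols, using the loss names introduced above:

$$\mathcal{L} = \lambda_1\,\mathcal{L}_{\text{cosine}} + \lambda_2\,\mathcal{L}_{\text{sim}} + \lambda_3\,\mathcal{L}_{\text{resim}}$$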
Training Details
We used infgrad/jasper_text_distill_dataset along with a few other open-source datasets as our training data. We stripped all labels and structure and extracted only the text, which makes this unsupervised / data-free knowledge distillation. For the first stage, we trained on roughly 30% of the data with a batch size of 64 and a fixed learning rate; for the hyperparameters, margin = 0.015.
Results
| Model | Params | Mean (Task) | Mean (Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS | Summ. |
|---|---|---|---|---|---|---|---|---|---|---|
| multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| Qwen3-Embedding-0.6B (Teacher) | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Stage-1 Student Model | 0.3B | 66.27 | 61.42 | 83.43 | 52.23 | 82.06 | 45.27 | 51.8 | 82.75 | 32.41 |
The training results look promising: even using only about 30% of the data, the model performs well. But what about our main question?
The expected behavior is that, unlike the original Qwen3-Embedding, all layers in this student model should show a significant representation shift. Let us see how it behaves by running the same experiment again.
Stage 2: No Mercy for Redundant Layers
Is the Stage-1 model good? It is not bad in terms of performance, but we still observe four layers whose representation shift is close to zero. This indicates that these layers contribute minimally, which leaves one thing to do: remove those layers as well. Therefore, we repeat the procedure once more, this time removing decoder layers 3, 4, 5, and 6. This reduction yields a student model with only six layers, corresponding to approximately 250M parameters.
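The same kind of layer surgery applies here. Roughly, assuming 0-indexed layers on the Stage-1 checkpoint loaded as before (both assumptions):

```python
import torch

# student: the 10-layer Stage-1 model, loaded the same way as the teacher above.
drop = {3, 4, 5, 6}                            # low-shift layers to remove (assumed indices)
student.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(student.layers) if i not in drop
)
student.config.num_hidden_layers = len(student.layers)   # now 6 decoder layers
```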
Before initiating Stage-2 training, we wanted to measure the performance degradation introduced by pruning layers 3, 4, 5, and 6 from the Stage-1 student model. To evaluate performance, we used a subset of the MTEB benchmark.
| Model | Average Score Across 30 MTEB Tasks |
|---|---|
| Stage-1 Student Model | 70.49 |
| Stage-2 Student Model* (pruned, not yet trained) | 60.88 |
Training Details
For Stage-2 training, we used the full training dataset with a batch size of 128 and a fixed learning rate; for the hyperparameters, margin = 0.015.
Results
| Model | Params | Mean (Task) | Mean (Type) | Class. | Clust. | Pair Class. | Rerank. | Retr. | STS | Summ. |
|---|---|---|---|---|---|---|---|---|---|---|
| multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| Qwen3-Embedding-0.6B | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Tarka-Embedding-250M-V1 | 0.25B | 67.57 | 62.38 | 84.91 | 53.0 | 83.57 | 46.1 | 54.25 | 83.38 | 31.42 |
Latency
The results are not bad. But what about the core question: do all layers truly contribute to the final output? To answer this, we repeat the same experiment once more:
We observe that all layers now provide meaningful refinement to the latent representations, which is encouraging. Although the student model shows some performance degradation compared to its teacher, this is expected given the limited amount of training data. The main goal of this work is not to match the teacher's performance, which could be approached with more data and longer training, but rather to demonstrate the underlying idea: deeper decoder stacks often contain layers that contribute very little, and therefore deeper models are not always necessary to achieve strong performance. Our experiments clearly highlight this phenomenon and show that shallower models can also solve the task effectively.
The sad part is that, in the final 250M model, the embedding matrix still consumes a surprisingly large portion of the parameters. In our previous Tarka model series, we attempted to reduce the embedding parameters, but this came at the cost of supporting fewer languages. The question now is whether we can achieve this reduction without sacrificing anything. We believe it is possible, and exploring this will be a key focus of the next Tarka model series. There are many avenues to explore in LLM model compression, and we aim to push the limits to see how much we can shrink the model without any loss in performance.
Experimental Insights
This experiment reveals a significant gap in how LLMs are typically trained. We observe that usually only the first and last few layers contribute meaningfully to the output, leaving the majority of layers underutilized. This suggests that if we can train models such that all layers contribute substantially, a 0.6B-parameter model could potentially perform on par with an 8B model.
We believe this principle applies across model scales. If we first train a deeper model (say, 25+ layers) and then gradually remove redundant layers, it should be possible to substantially reduce model size without sacrificing performance. There are many ways to reduce the number of layers; here we took an aggressive approach, removing a large block of layers at once, but one could also try continuous, gradual layer reduction during training so that layers contributing little to the final output are removed over time.
Conclusion
The experiments confirm that not all layers in deep LLMs are equally important, and that strategic pruning based on layer contribution can yield smaller, more efficient models. This insight opens the door to training compact models that retain the performance of much larger counterparts, offering a practical framework for model compression and optimization in future LLM designs.
We would like to express our gratitude to the Jasper and Stella team for open-sourcing their dataset and their work, which helped greatly in this project.
For more discussions, let’s continue the conversation here.
Future Work
Explore whether embedding token weights can be reduced without sacrificing language support.
Explore if a latent reasoning embedding model can outperform a standard model of the same size when trained under identical conditions.
Explore the possibility of making an embedding model non-deterministic, and see if it provides any practical benefits.