Reduce and Refine: 6 ≈ 28


Do All Layers in an LLM Actually Matter?

Large Language Models process text by passing it through a stack of layers, each intended to transform the latent representation in some meaningful way. But this leads us to a question:

Does every layer actually contribute to the final representation?

In theory, each layer should introduce new structure, refine semantic understanding, or reshape intermediate representations. But are we sure that each layer in an LLM contributes equally to the final output? To answer this question, we ran a simple experiment on Qwen3-Embedding-0.6B, measuring how much each layer alters the latent representation of the same input text. Throughout this blog we call this measurement the representation shift; it reflects how much the latent representation changes from one layer to the next.

Experiment Setup

We prepared a small, diverse set of short text snippets covering casual messages, reviews, technical statements, error logs, code, fiction, and everyday instructions. These examples were used consistently across all tests:

```python
# Texts used for the experiment
texts = [
    "Hey, is the package arriving today?",
    "This product was amazing, totally worth it!",
    "LOL that’s hilarious",
    "In quantum mechanics, the wavefunction describes the probability amplitude of a particle’s state.",
    "The industrial revolution significantly transformed manufacturing processes and labor systems.",
    "User: John Doe, Age: 29, Status: Active",
    "{'error': 'Invalid token', 'code': 401}",
    "Policy: All employees must adhere to the security guidelines listed below.",
    "The dragon circled above the valley, its scales shimmering like molten gold.",
    "She opened the old journal, revealing pages filled with cryptic symbols.",
    "def add(a, b): return a + b  # simple addition function",
    "API rate limit exceeded. Retry in 30 seconds.",
    "Step 1: Preheat the oven to 350°F.",
    "To reset your router, hold the button for 10 seconds until the LED blinks.",
    "A small brown dog jumps over a wooden fence.",
    "Three people are sitting around a table discussing a project."
]
```

For the above texts, we captured the hidden representations at every layer and computed how much the embedding shifted from one layer to the next. The code and visualization for this are shown below.
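Here is a minimal sketch of this measurement using the Hugging Face transformers API; the plotting code is omitted, and the checkpoint name, pooling choice, and helper names are illustrative rather than our exact script.

```python
# Minimal sketch: cosine similarity between hidden states of consecutive layers.
# Assumes the Hugging Face checkpoint "Qwen/Qwen3-Embedding-0.6B"; pooling and
# helper names are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "Qwen/Qwen3-Embedding-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

@torch.no_grad()
def layerwise_similarity(text):
    """Cosine similarity between mean-pooled hidden states of consecutive layers."""
    inputs = tokenizer(text, return_tensors="pt")
    hidden_states = model(**inputs).hidden_states  # embeddings + one entry per layer
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = [(h * mask).sum(1) / mask.sum(1) for h in hidden_states]
    pooled = [F.normalize(p, dim=-1) for p in pooled]
    # Similarity between layer i and layer i+1; values near 1 mean a tiny shift.
    return [(pooled[i] * pooled[i + 1]).sum(-1).item() for i in range(len(pooled) - 1)]

sims = torch.tensor([layerwise_similarity(t) for t in texts]).mean(0)
for i, s in enumerate(sims):
    print(f"layer {i + 1:2d}: cosine similarity to previous layer = {s:.3f}")
```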

We first extracted the hidden representations from every decoder layer. For each input sentence, we collected the normalized hidden states across layers and then computed the cosine similarity between consecutive layers. This gives us a direct measure of how much each layer changes the representation: layers with high similarity values make very small updates, while layers with lower similarity values introduce larger transformations. This analysis produces the visualization shown below.

From the visualization, it’s clear that the first few layers and the final layers produce noticeable changes in the latent representation. But the middle layers show cosine similarities extremely close to 1, i.e. a representation shift of nearly zero, indicating that they barely alter the representation. This doesn’t necessarily mean these layers contribute nothing, but it does suggest that, for this embedding task, the model is not fully utilizing its depth.

For example, layer 3 alone produces a representation shift of roughly 0.5, showing that a single layer is capable of making a substantial transformation. Yet layers 5 through 20 produce almost no change at all. If so many consecutive layers introduce minimal movement, it raises the possibility that this part of the model is redundant or underutilized for the task.

This leads to our working hypothesis: if a single layer can achieve a large shift, we may be able to remove most of the middle layers and replace them with a single decoder layer that captures their combined effect. In theory, this compressed architecture could perform comparably to the full 28-layer model while being much smaller and faster.

Model Compression and Distillation Framework

Stage 1: Constructing the Student Model

A few notations used in this blog:

  • Student Model: The text embedding model that is the subject of training, tasked with learning to produce effective vector representations.

  • Stage-1 Student Model: The text-embedding model obtained after the first stage of training. In some places we refer to it as Tarka-Embedding-300M-V1-Preview; both names refer to the same model.

  • Stage-2 Student Model: For Stage 2, we further pruned several layers from the Stage-1 Student Model and used that reduced version for the next training phase. The notation Stage-2 Student Model* refers to the pruned-but-not-yet-trained version. The fully trained Stage-2 model is referred to as Stage-2 Student Model or Tarka-Embedding-250M-V1.

  • Teacher Model: The embedding model serving as a teacher, guiding the student model in generating effective vectors. This model is not trained.

  • s_x: The normalized vector representation of a text x produced by the student model.

  • t_x: The normalized vector representation of a text x produced by the teacher model.

  • S_X: A matrix of normalized vector representations for a batch of texts X produced by the student model.

  • T_X: A matrix of normalized vector representations for a batch of texts X produced by the teacher model.

Layer Pruning Based on Representation Shift

To build the student model for Stage 1, we remove the decoder layers whose representation shift scores are closest to zero. In practice, this means removing layers 4 through 22 and replacing them with a single layer whose weights are set to the average of the removed layers, allowing the student model to approximate their combined effect with a much smaller architecture. The resulting student model contains 10 decoder layers and roughly 300M parameters.
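The layer surgery itself is straightforward. Below is a minimal sketch, assuming a Hugging Face Qwen3-style backbone that exposes its decoder stack as `model.layers`; the attribute names and merging code are illustrative, not our exact script.

```python
# Sketch: replace decoder layers 4-22 (1-based) with one layer whose weights are
# their element-wise average. Assumes a Qwen3-style backbone with `model.layers`.
import copy
import torch
from torch import nn

def average_layers(layers):
    """Return a copy of the first layer with parameters set to the average of all layers."""
    merged = copy.deepcopy(layers[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            stacked = torch.stack([dict(l.named_parameters())[name] for l in layers])
            param.copy_(stacked.mean(dim=0))
    return merged

def prune_and_merge(decoder_layers, start=3, end=22):
    """Drop 0-based indices [start, end) (i.e. layers 4-22 in 1-based counting)
    and insert a single averaged layer in their place."""
    kept_before = list(decoder_layers[:start])
    kept_after = list(decoder_layers[end:])
    merged = average_layers(list(decoder_layers[start:end]))
    return nn.ModuleList(kept_before + [merged] + kept_after)

# e.g. backbone = AutoModel.from_pretrained("Qwen/Qwen3-Embedding-0.6B")
# backbone.layers = prune_and_merge(backbone.layers)          # 3 + 1 + 6 = 10 layers
# backbone.config.num_hidden_layers = len(backbone.layers)
```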

A few architectural adjustments further tailor the model to the embedding task:

  • Bidirectional Attention: Qwen Embedding originally uses causal attention, which is unnecessary for embeddings and restricts information flow. We replace it with fully bidirectional attention.

  • Pooling Strategy: The original model relies on last-token pooling, a natural choice for causal models. With bidirectional attention, mean pooling provides a more balanced representation and is therefore adopted (a minimal pooling sketch follows this list).
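For reference, here is a minimal sketch of mask-aware mean pooling; the function and variable names are illustrative.

```python
# Sketch: mask-aware mean pooling over token hidden states, then L2-normalization.
import torch
import torch.nn.functional as F

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """last_hidden_state: [batch, seq_len, hidden]
    attention_mask:    [batch, seq_len], 1 for real tokens, 0 for padding."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return F.normalize(summed / counts, dim=-1)
```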

The objective of the first stage is to enable the student model to learn effective text representations from multiple teacher models by aligning its output vectors with the corresponding teacher vectors, and to observe how the hidden layers adapt. To achieve this, we drew inspiration from the Jasper and Stella distillation work and designed three loss functions that progress from a specific to a broader perspective. The first is the cosine loss, formulated as follows:

L_{\text{cosine}} = \sum_{x} \left( 1 - s_x \cdot t_x \right)

The second loss function, the similarity loss, models the semantic matching differences between the student and teacher models from a text-pair perspective. It encourages the student to make similarity judgments consistent with the teacher models, without enforcing an absolute fit between the student and teacher vectors.

L_{\text{sim}} = \mathrm{MSE}\left( S_X S_X^{\top},\ T_X T_X^{\top} \right)

The third loss function leverages relative comparison signals, inspired by the CoSENT loss. For each batch of text data, we employ the teacher models to automatically generate soft labels for all text pairs, thereby identifying potential positive and negative samples. The student model is then trained so that the similarity of positive pairs exceeds that of negative pairs, with a margin hyperparameter controlling the size of this gap. If the batch size is m, the total number of pair-of-pairs comparisons N is given by C^2_{C^2_m} (for example, m = 64 gives C^2_{64} = 2016 text pairs and N = C^2_{2016} = 2,031,120 comparisons).

L_{\text{resim}} = \frac{1}{N} \sum_{(t_i \cdot t_j \,>\, t_m \cdot t_n)} \max\left(0,\; s_m \cdot s_n - s_i \cdot s_j + \text{margin} \right)

The final loss L_{\text{final}} is a weighted sum of the three losses above, with hyperparameters \lambda_1, \lambda_2, and \lambda_3.

L_{\text{final}} = \lambda_{1} L_{\text{cosine}} + \lambda_{2} L_{\text{sim}} + \lambda_{3} L_{\text{resim}}
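For concreteness, here is a compact PyTorch sketch of the three losses, assuming S and T are the L2-normalized student and teacher embedding matrices for a batch. This is a simplified illustration rather than our exact training code.

```python
# Sketch of the three distillation losses (simplified; not the exact training code).
# S, T: [batch, dim] L2-normalized student / teacher embeddings for the same texts.
import torch
import torch.nn.functional as F

def cosine_loss(S, T):
    # L_cosine = sum_x (1 - s_x . t_x)
    return (1.0 - (S * T).sum(dim=-1)).sum()

def similarity_loss(S, T):
    # L_sim = MSE(S S^T, T T^T): match the in-batch similarity structure.
    return F.mse_loss(S @ S.T, T @ T.T)

def resim_loss(S, T, margin=0.015):
    # CoSENT-style relative loss: whenever the teacher ranks one text pair above
    # another, the student should do the same by at least `margin`.
    # For clarity this uses all entries of the similarity matrix; in practice one
    # would restrict to the C^2_m unique text pairs to match N in the formula.
    s_sim = (S @ S.T).flatten()   # student similarities for all text pairs
    t_sim = (T @ T.T).flatten()   # teacher similarities (soft labels)
    mask = (t_sim.unsqueeze(1) > t_sim.unsqueeze(0)).float()   # pair p ranked above pair q
    diff = s_sim.unsqueeze(0) - s_sim.unsqueeze(1) + margin    # s_q - s_p + margin
    return (F.relu(diff) * mask).sum() / mask.sum().clamp(min=1.0)

def total_loss(S, T, l1=10.0, l2=200.0, l3=20.0, margin=0.015):
    return l1 * cosine_loss(S, T) + l2 * similarity_loss(S, T) + l3 * resim_loss(S, T, margin)
```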

Training Details

We used infgrad/jasper_text_distill_dataset along with a few other open-source datasets as our training data. We stripped all labels and structure and extracted only the raw text, which makes this an unsupervised / data-free knowledge distillation setup. For the first stage we trained on roughly 30% of the data with a batch size of 64 and a learning rate of 1e-4, using the hyperparameters λ1 = 10, λ2 = 200, λ3 = 20, and margin = 0.015.

Results

| MTEB English / Models | Param | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|---|---|---|---|---|---|---|---|---|---|---|
| multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| Qwen3-Embedding-0.6B (Teacher) | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Stage-1 Student Model | 0.3B | 66.27 | 61.42 | 83.43 | 52.23 | 82.06 | 45.27 | 51.8 | 82.75 | 32.41 |

The training results look promising: even with only 30% of the data, the model performs well. But what about our main question?

The expected behavior is that, unlike the original Qwen3-Embedding, all layers in this student model should show a significant representation shift. Let us see how it behaves by running the same experiment again.

Stage 2: No Mercy for Redundant Layers

Is the Stage-1 model good? It is not bad in terms of performance, but we still observe four layers whose representation shift is close to zero. This indicates that these layers contribute minimally, which tells us exactly what to do next: remove those layers as well. Therefore, we repeat the procedure once more, this time removing decoder layers 3, 4, 5, and 6. This reduction yields a student model with only six layers, corresponding to approximately 250M parameters.
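The pruning step here is just a slice of the decoder stack; a minimal sketch, again assuming a `model.layers` ModuleList and the 1-based layer numbering used above, looks like this:

```python
# Sketch: drop decoder layers 3-6 (assumed 1-based) from the Stage-1 student model.
from torch import nn

def drop_layers(decoder_layers, to_remove=(3, 4, 5, 6)):
    """Return a ModuleList without the given 1-based layer indices."""
    removed = {i - 1 for i in to_remove}  # convert to 0-based indices
    return nn.ModuleList(l for i, l in enumerate(decoder_layers) if i not in removed)

# stage1.layers = drop_layers(stage1.layers)               # 10 - 4 = 6 layers
# stage1.config.num_hidden_layers = len(stage1.layers)
```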

Before initiating Stage 2 training, we wanted to quantify the performance degradation introduced by pruning layers 3, 4, 5, and 6 from the Stage-1 student model. To evaluate this, we used a subset of the MTEB benchmark.
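We ran the evaluation with the open-source mteb package; a minimal sketch of such a run is shown below. The task list and paths are illustrative, not the exact 30-task subset used here.

```python
# Sketch: evaluate a SentenceTransformer-compatible model on a subset of MTEB tasks.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("path/to/stage2-student-model")  # pruned checkpoint (illustrative path)
evaluation = MTEB(tasks=["STSBenchmark", "Banking77Classification", "ArguAna"])
results = evaluation.run(model, output_folder="results/stage2_student")
```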

| Model | Score |
|---|---|
| Stage-1 Student Model | 70.49 |
| Stage-2 Student Model* | 60.88 |

Average Score Across 30 MTEB Tasks

Detailed Results by Task

| Task | Stage-1 Student Model | Stage-2 Student Model* |
|---|---|---|
| AmazonCounterfactualClassification | 0.905523 | 0.665993 |
| ArXivHierarchicalClusteringP2P | 0.610444 | 0.580408 |
| ArXivHierarchicalClusteringS2S | 0.602545 | 0.5434 |
| ArguAna | 0.60957 | 0.51911 |
| AskUbuntuDupQuestions | 0.593 | 0.53835 |
| BIOSSES | 0.851736 | 0.801151 |
| Banking77Classification | 0.795617 | 0.681948 |
| BiorxivClusteringP2P.v2 | 0.454291 | 0.380074 |
| MTOPDomainClassification | 0.95232 | 0.895539 |
| MassiveIntentClassification | 0.744809 | 0.580872 |
| MassiveScenarioClassification | 0.791238 | 0.683278 |
| MedrxivClusteringP2P.v2 | 0.408814 | 0.3592 |
| MedrxivClusteringS2S.v2 | 0.396702 | 0.321167 |
| SICK-R | 0.80128 | 0.659112 |
| STS12 | 0.81324 | 0.756822 |
| STS13 | 0.855147 | 0.724923 |
| STS14 | 0.824201 | 0.716758 |
| STS15 | 0.888463 | 0.822951 |
| STS17 | 0.664754 | 0.506013 |
| STS22.v2 | 0.646292 | 0.681692 |
| STSBenchmark | 0.863012 | 0.740534 |
| SprintDuplicateQuestions | 0.968126 | 0.952018 |
| StackExchangeClustering.v2 | 0.694506 | 0.539584 |
| StackExchangeClusteringP2P.v2 | 0.508366 | 0.421614 |
| SummEvalSummarization.v2 | 0.323219 | 0.272467 |
| ToxicConversationsClassification | 0.826807 | 0.588525 |
| TweetSentimentExtractionClassification | 0.757187 | 0.583956 |
| TwentyNewsgroupsClustering.v2 | 0.504307 | 0.391558 |
| TwitterSemEval2015 | 0.64779 | 0.530892 |
| TwitterURLCorpus | 0.84599 | 0.824594 |

Training Details

For Stage 2 training we used the full training dataset with a batch size of 128 and a learning rate of 1e-4, using the hyperparameters λ1 = 20, λ2 = 400, λ3 = 40, and margin = 0.015.

Results

| MTEB English / Models | Param | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
|---|---|---|---|---|---|---|---|---|---|---|
| multilingual-e5-large-instruct | 0.6B | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| NV-Embed-v2 | 7.8B | 69.81 | 65.00 | 87.19 | 47.66 | 88.69 | 49.61 | 62.84 | 83.82 | 35.21 |
| GritLM-7B | 7.2B | 67.07 | 63.22 | 81.25 | 50.82 | 87.29 | 49.59 | 54.95 | 83.03 | 35.65 |
| gte-Qwen2-1.5B-instruct | 1.5B | 67.20 | 63.26 | 85.84 | 53.54 | 87.52 | 49.25 | 50.25 | 82.51 | 33.94 |
| stella_en_1.5B_v5 | 1.5B | 69.43 | 65.32 | 89.38 | 57.06 | 88.02 | 50.19 | 52.42 | 83.27 | 36.91 |
| gte-Qwen2-7B-instruct | 7.6B | 70.72 | 65.77 | 88.52 | 58.97 | 85.9 | 50.47 | 58.09 | 82.69 | 35.74 |
| gemini-embedding-exp-03-07 | - | 73.3 | 67.67 | 90.05 | 59.39 | 87.7 | 48.59 | 64.35 | 85.29 | 38.28 |
| Qwen3-Embedding-0.6B | 0.6B | 70.70 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Tarka-Embedding-250M-V1 | 0.25B | 67.57 | 62.38 | 84.91 | 53.0 | 83.57 | 46.1 | 54.25 | 83.38 | 31.42 |

Detailed Results by Task

| Task | Qwen3-Embedding-0.6B | Tarka-Embedding-250M-V1 |
|---|---|---|
| AmazonCounterfactualClassification | 0.904223 | 0.90806 |
| ArXivHierarchicalClusteringP2P | 0.637187 | 0.62438 |
| ArXivHierarchicalClusteringS2S | 0.638185 | 0.605246 |
| ArguAna | 0.70965 | 0.62528 |
| AskUbuntuDupQuestions | 0.651285 | 0.61015 |
| BIOSSES | 0.854893 | 0.831374 |
| Banking77Classification | 0.810097 | 0.791526 |
| BiorxivClusteringP2P.v2 | 0.472655 | 0.468922 |
| CQADupstackGamingRetrieval | 0.64142 | 0.60043 |
| CQADupstackUnixRetrieval | 0.51494 | 0.44402 |
| ClimateFEVERHardNegatives | 0.4362 | 0.35896 |
| FEVERHardNegatives | 0.88942 | 0.85964 |
| FiQA2018 | 0.46612 | 0.35588 |
| HotpotQAHardNegatives | 0.67689 | 0.62462 |
| ImdbClassification | 0.954392 | 0.94156 |
| MTOPDomainClassification | 0.959576 | 0.955495 |
| MassiveIntentClassification | 0.614375 | 0.777539 |
| MassiveScenarioClassification | 0.835878 | 0.825017 |
| MedrxivClusteringP2P.v2 | 0.421841 | 0.422138 |
| MedrxivClusteringS2S.v2 | 0.404147 | 0.403866 |
| MindSmallReranking | 0.312339 | 0.31179 |
| SCIDOCS | 0.24407 | 0.20684 |
| SICK-R | 0.848283 | 0.80648 |
| STS12 | 0.829953 | 0.801848 |
| STS13 | 0.917553 | 0.879699 |
| STS14 | 0.871015 | 0.836702 |
| STS15 | 0.914472 | 0.895512 |
| STS17 | 0.855039 | 0.894983 |
| STS22.v2 | 0.718334 | 0.679928 |
| STSBenchmark | 0.911297 | 0.877868 |
| SprintDuplicateQuestions | 0.941094 | 0.972459 |
| StackExchangeClustering.v2 | 0.711618 | 0.691267 |
| StackExchangeClusteringP2P.v2 | 0.521263 | 0.510298 |
| SummEvalSummarization.v2 | 0.334321 | 0.314192 |
| TRECCOVID | 0.90518 | 0.815 |
| Touche2020Retrieval.v3 | 0.69896 | 0.53421 |
| ToxicConversationsClassification | 0.821289 | 0.828418 |
| TweetSentimentExtractionClassification | 0.760498 | 0.764941 |
| TwentyNewsgroupsClustering.v2 | 0.517341 | 0.514238 |
| TwitterSemEval2015 | 0.722557 | 0.677604 |
| TwitterURLCorpus | 0.867482 | 0.857082 |
| AVG | 0.7004227317 | 0.6757429756 |

Latency

The results are not bad, but what about the core question: do all layers truly contribute to the final output? To answer this, we repeat the same experiment once more:

We observe that all layers now provide meaningful refinement to the latent representations, which is encouraging. Although the student model shows some performance degradation compared to its teacher, this is expected given the limited amount of training data. The main goal of this work is not to match the teacher’s performance, which could be achieved with more data and longer training, but to demonstrate the underlying point: deeper decoder stacks often contain layers that contribute very little, so deeper models are not always necessary for strong performance. Our experiments clearly highlight this phenomenon and confirm that shallower models can solve the task effectively.

The downside: in the final 250M model, the embedding matrix still consumes a surprisingly large portion of the parameters. In our previous Tarka model series, we attempted to reduce the embedding parameters, but this came at the cost of supporting fewer languages. The question now is whether we can achieve this reduction without sacrificing anything. We believe it is possible, and exploring this will be a key focus of the next Tarka model series. There are many avenues to explore in LLM compression, and we aim to push the limits to see how much we can shrink the model without any loss in performance.

Experimental Insights

  • This experiment reveals a significant gap in how LLMs are trained. We observe that typically only the first and last few layers contribute meaningfully to the output, leaving the majority of layers underutilized. This suggests that if we can train models such that all layers contribute substantially, a 0.6B-parameter model could potentially perform on par with an 8B model.

  • We believe this principle applies across model scales. If we first train a deeper model (25+ layers) and then gradually remove redundant layers, it should be possible to substantially reduce model size without sacrificing performance. There are many ways to reduce the layer count; here we were aggressive and removed a large number of layers at once, but one could also perform continuous, gradual layer reduction during training so that layers contributing little to the final output are removed over time.

Conclusion

The experiments confirm that not all layers in deep LLMs are equally important, and that strategic pruning based on layer contribution can yield smaller, more efficient models. This insight opens the door to training compact models that retain the performance of much larger counterparts, offering a practical framework for model compression and optimization in future LLM designs.

We would like to express our gratitude to the Jasper and Stella teams for open-sourcing their dataset and their work, which helped a lot in this project.

For more discussions, let’s continue the conversation here.


Future Work

  • Explore whether embedding token weights can be reduced without sacrificing language support.

  • Explore if a latent reasoning embedding model can outperform a standard model of the same size when trained under identical conditions.

  • Explore the possibility of making an embedding model non-deterministic, and see if it provides any practical benefits.