Tarka Embedding 30M V1
Compressed the model by 20x.
Recovered approx. 86% of the teacher's performance on the MTEB(eng, v2) benchmark.
In our last blog post we reduced the model size by reducing the depth of the model; this time we use factorization. Matrix factorization provides a straightforward framework for compressing models: it aims to reconstruct an original (typically larger) matrix by combining smaller, often lower-rank, matrices (e.g. via a product), with techniques ranging from SVD to other low-rank factorization methods. Most of the layers in an LLM are linear layers, i.e. 2D matrices, so what we did is replace each weight matrix with its decomposed factors and reconstruct the original matrix at inference time. This way we can reduce the model size significantly while keeping the computation in the higher-dimensional space.
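For intuition, here is a generic low-rank (SVD) factorization sketch. This is not the scheme Tarka uses (that is the Kronecker product, introduced next); the shapes and rank below are chosen purely for illustration.

```python
import torch

# Generic low-rank (SVD) factorization example, not Tarka's Kronecker scheme.
m, n, k = 1024, 2048, 64
W = torch.randn(m, n)                      # original weight: 1024*2048 = 2,097,152 params

U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :k] * S[:k]                       # m x k factor
B = Vh[:k, :]                              # k x n factor

W_approx = A @ B                           # reconstructed at inference time
print(A.numel() + B.numel())               # 1024*64 + 64*2048 = 196,608 stored params
```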
Brief intro to Kronecker Product
The Kronecker product is an operation that transforms two matrices into a larger matrix that contains all the possible products of the entries of the two matrices.
Definition: Let $A$ be an $m \times n$ matrix and $B$ be a $p \times q$ matrix. Then the Kronecker product $A \otimes B$ is the $(mp) \times (nq)$ block matrix

$$
A \otimes B =
\begin{bmatrix}
a_{11}B & \cdots & a_{1n}B \\
\vdots & \ddots & \vdots \\
a_{m1}B & \cdots & a_{mn}B
\end{bmatrix}
$$
Let us see this with a small example.
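For illustration, take two arbitrary $2 \times 2$ matrices (the values here are just for demonstration):

$$
A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad
B = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\quad\Longrightarrow\quad
A \otimes B =
\begin{bmatrix}
1 \cdot B & 2 \cdot B \\
3 \cdot B & 4 \cdot B
\end{bmatrix}
=
\begin{bmatrix}
0 & 1 & 0 & 2 \\
1 & 0 & 2 & 0 \\
0 & 3 & 0 & 4 \\
3 & 0 & 4 & 0
\end{bmatrix}
$$

Two $2 \times 2$ factors thus produce a $4 \times 4$ matrix; more generally, two small factors can reconstruct a much larger matrix, which is exactly the property we exploit below.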
Methodology
Model Architecture
We used the same architecture as the Qwen3-Embedding-0.6B model, but all linear layers are replaced with KronLinear layers, which are upscaled (reconstructed to their full size) during inference.
To understand how this benefits us, let us take a simple linear layer of size 1024 x 2048. If we use the fp32 data type, it will take
1024 * 2048 * 4 = 8,388,608 bytes = 8 MB.
But if we use KronLinear, we store two small factor matrices: a left factor of shape 1024 x 64 and a right factor of shape 1 x 32. The total memory these take in fp32 is
1024 * 64 * 4 + 1 * 32 * 4 = 262,272 bytes ≈ 256.1 KB.
We are still effectively applying the same 1024 x 2048 matrix, but with the KronLinear layer the stored weights are compressed by about 32 times, which is huge: at that ratio a 640B model could be stored in roughly 20B worth of parameters.
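Below is a minimal PyTorch sketch of what such a KronLinear layer could look like. The real implementation details, such as initialization, bias handling, and how the factor shapes are chosen per layer, are assumptions here; the shapes match the example above.

```python
import torch
import torch.nn as nn

class KronLinear(nn.Module):
    """Sketch of a Kronecker-factorized linear layer.

    Stores two small factors L (m x n) and R (p x q) and reconstructs the
    full weight W = L kron R of shape (m*p, n*q) at inference time.
    """
    def __init__(self, m, n, p, q, bias=True):
        super().__init__()
        self.left = nn.Parameter(torch.randn(m, n) * 0.02)   # e.g. 1024 x 64
        self.right = nn.Parameter(torch.randn(p, q) * 0.02)  # e.g. 1 x 32
        self.bias = nn.Parameter(torch.zeros(n * q)) if bias else None

    def forward(self, x):
        # Reconstruct ("upscale") the full (m*p) x (n*q) weight on the fly.
        w = torch.kron(self.left, self.right)                # e.g. 1024 x 2048
        y = x @ w
        return y + self.bias if self.bias is not None else y

layer = KronLinear(1024, 64, 1, 32)                 # ~65.6K stored weights vs ~2.1M dense
out = layer(torch.randn(8, 1024))                   # -> shape (8, 2048)
print(sum(p.numel() for p in layer.parameters()))   # 65536 + 32 + 2048 = 67616
```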
In our case we compressed the 0.6B model down to a 0.03B model. Not only that, we also integrated an elastic property into the model, so the 0.03B model can be upscaled into multiple model sizes depending on our computation requirements. There are some nice works that have already implemented elasticity [MatFormer, Nemotron Elastic], but there is one key difference between our approach and theirs: although they use only a portion of the architecture at inference time, they still need to store and load the whole set of model parameters. Say the model has 10B parameters and we use 80% of it for inference; we still need to load the entire 10B model onto the GPU, and only the computation cost decreases because we run just a portion of the model.
With our method we store the compressed version of the model and only upscale it during inference based on the requirement, so we save both memory and computation, which matters when you are working with large models.
| Scale | Total params (upscaled) | Non-embedding params | Speedup (vs. l) | Relative speed gain |
|---|---|---|---|---|
| l | 595M | 440M | 1x | - |
| m | 463M | 308M | 1.13x | +12.9% |
| s | 397M | 242M | 1.22x | +21.5% |

Model scales that Tarka-Embedding-30M-V1 supports
Training
As this is a new model, we don't have any pretrained weights to start from, so we first train the model on text and then finetune it on embedding datasets.
Stage 1 : Pretraining
Datasets used:
- mixedbread-ai/wikipedia-embed-en-2023-11
- sentence-transformers/parallel-sentences-wikimatrix
- sentence-transformers/parallel-sentences-talks
First we perform knowledge distillation on mixedbread-ai/wikipedia-embed-en-2023-11 with max_tokens=256 and Qwen3-Embedding-0.6B as the teacher; then, to inject cross-lingual consistency, we train on the two parallel-sentence datasets mentioned above. Training to support all languages is challenging, as it requires significantly more time and computational resources, so our main focus here is English.
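As a hedged sketch of the distillation setup, teacher embeddings can be produced with the public sentence-transformers API; the actual data pipeline, batching, and truncation handling are assumptions.

```python
from sentence_transformers import SentenceTransformer

# Teacher model used for distillation; the truncation setting mirrors max_tokens=256.
teacher = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
teacher.max_seq_length = 256

texts = [
    "The Eiffel Tower is located in Paris.",
    "Python is a popular programming language.",
]

# Teacher embeddings serve as regression targets for the 30M student.
teacher_embeddings = teacher.encode(texts, normalize_embeddings=True)
print(teacher_embeddings.shape)  # (2, 1024) for Qwen3-Embedding-0.6B
```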
Stage 2 : Finetuning
Dataset used:
infgrad/jasper_text_distill_dataset
Once the model was pretrained, we finetuned it on infgrad/jasper_text_distill_dataset with max_tokens=1024.
Loss Functions
We used the same losses as in our last blog post for both stage 1 and stage 2. The only difference is that we added dynamic weights to the samples: instead of giving every sample in a batch the same importance, we weight samples that are not yet learned more heavily and already-learned samples less.
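The exact weighting scheme is not spelled out here; the sketch below shows one way such dynamic per-sample weights could sit on top of a distillation-style loss (the cosine loss form and the softmax weighting are assumptions, not the exact losses from the previous post).

```python
import torch
import torch.nn.functional as F

def dynamically_weighted_loss(student_emb, teacher_emb, temperature=1.0):
    """Sketch: distillation-style loss with dynamic per-sample weights.

    Samples whose embeddings are still far from the teacher's ("not learned
    yet") get larger weights; well-learned samples are down-weighted.
    """
    # Per-sample loss: 1 - cosine similarity between student and teacher.
    per_sample = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)

    # Dynamic weights from the current per-sample losses (detached so the
    # weights themselves do not receive gradients).
    weights = torch.softmax(per_sample.detach() / temperature, dim=0)

    return (weights * per_sample).sum()

# Example usage with random embeddings standing in for model outputs.
student = torch.randn(16, 1024, requires_grad=True)
teacher = torch.randn(16, 1024)
loss = dynamically_weighted_loss(student, teacher)
loss.backward()
```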
Results
| Model | Params (B) | Mean (Task) | Mean (TaskType) | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
|---|---|---|---|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 0.023 | 59.03 | 55.93 | 69.25 | 44.9 | 82.37 | 47.14 | 42.92 | 78.95 | 25.96 |
| gte-micro-v4 | 0.019 | 58.9 | 56.04 | 73.04 | 43.89 | 82.67 | 44.78 | 39.51 | 79.78 | 28.59 |
| snowflake-arctic-embed-xs | 0.023 | 59.77 | 56.12 | 67 | 42.44 | 81.33 | 45.26 | 52.65 | 76.21 | 27.96 |
| gte-micro | 0.017 | 53.89 | 52.5 | 67.47 | 41.86 | 80.76 | 43.16 | 27.66 | 77.86 | 28.76 |
| Qwen3 Embedding 0.6B | 0.6 | 70.7 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Tarka Embedding 30M v1 (L) | 0.03 | 60.43 | 56.69 | 79.2 | 46.99 | 78.24 | 43.32 | 42.5 | 76.92 | 29.63 |
| Tarka Embedding 30M v1 (M) | 0.03 | 51.96 | 49.88 | 66.52 | 43.47 | 70.66 | 40.12 | 30.15 | 69.81 | 28.42 |
| Tarka Embedding 30M v1 (S) | 0.03 | 46.07 | 45.22 | 60.37 | 41.37 | 66.29 | 38.34 | 19.56 | 64.15 | 26.44 |
Conclusion
We observe a noticeable performance degradation compared to the teacher model, despite using comparable computational resources at inference. This gap may come from several factors. First, our model is trained from scratch, whereas Qwen's embedding model benefits from significantly longer and more extensive training as well as large-scale curated data. It may also be a limit of the student model's expressive capacity, which may not be as efficient as we think. Or the training procedure itself, such as how we implemented the loss functions, the optimization strategy, and other choices, may not yet be optimal.
Importantly, this work should be viewed as an exploratory study into model compression rather than a fully optimized replacement for the teacher. We hope this experiment sheds light on the challenges and opportunities of such structured compression techniques and inspires further work on improved architectures, loss formulations, and training strategies for compressed embedding models.