Tarka Embedding 30M V1
Compressed the model by 20x.
Recovered approx. 86% of the teacher's performance on the MTEB(eng, v2) benchmark.
In our last blog post we reduced the model size by reducing the depth of the model; this time we use factorization. Matrix factorization provides a straightforward framework for compressing models: it aims to reconstruct an original (typically larger) matrix by combining smaller, often lower-rank, matrices (e.g. via a product), with techniques ranging from SVD to other low-rank factorization methods. Most of the layers in an LLM are linear layers, i.e. 2D matrices, so what we did is replace each weight matrix with its decomposed factors and reconstruct the original matrix at inference time. This way we can reduce the model size significantly while keeping the computation in the higher-dimensional space.
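For intuition, here is a generic low-rank (SVD) factorization sketch. This is not the scheme Tarka uses (that is the Kronecker product, introduced next); the shapes and rank below are chosen purely for illustration.

```python
import torch

# Generic low-rank (SVD) factorization example, not Tarka's Kronecker scheme.
m, n, k = 1024, 2048, 64
W = torch.randn(m, n)                      # original weight: 1024*2048 = 2,097,152 params

U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :k] * S[:k]                       # m x k factor
B = Vh[:k, :]                              # k x n factor

W_approx = A @ B                           # reconstructed at inference time
print(A.numel() + B.numel())               # 1024*64 + 64*2048 = 196,608 stored params
```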
Brief intro to Kronecker Product
The Kronecker product is an operation that transforms two matrices into a larger matrix that contains all the possible products of the entries of the two matrices.
Definition: Let $A$ be an $m \times n$ matrix and $B$ be a $p \times q$ matrix. Then the Kronecker product $A \otimes B$ is the $(mp) \times (nq)$ block matrix

$$
A \otimes B =
\begin{bmatrix}
a_{11}B & \cdots & a_{1n}B \\
\vdots & \ddots & \vdots \\
a_{m1}B & \cdots & a_{mn}B
\end{bmatrix}
$$
Let us see this with a small example.
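For illustration, take two arbitrary $2 \times 2$ matrices (the values here are just for demonstration):

$$
A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad
B = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\quad\Longrightarrow\quad
A \otimes B =
\begin{bmatrix}
1 \cdot B & 2 \cdot B \\
3 \cdot B & 4 \cdot B
\end{bmatrix}
=
\begin{bmatrix}
0 & 1 & 0 & 2 \\
1 & 0 & 2 & 0 \\
0 & 3 & 0 & 4 \\
3 & 0 & 4 & 0
\end{bmatrix}
$$

Two $2 \times 2$ factors thus produce a $4 \times 4$ matrix; more generally, two small factors can reconstruct a much larger matrix, which is exactly the property we exploit below.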
Methodology
Model Architecture
We used the same architecture as the Qwen3-Embedding-0.6B model, but all linear layers are replaced with KronLinear layers, which are upscaled (reconstructed to their full size) during inference.
To understand how this benefits us, let us take a simple linear layer of size 1024 x 2048. If we use the fp32 data type, it will take
1024 * 2048 * 4 = 8,388,608 bytes = 8 MB.
But if we use KronLinear, we store two small factor matrices: a left factor of shape 1024 x 64 and a right factor of shape 1 x 32. The total memory these take in fp32 is
1024 * 64 * 4 + 1 * 32 * 4 = 262,272 bytes ≈ 256.1 KB.
We are still effectively applying the same 1024 x 2048 matrix, but with the KronLinear layer the stored weights are compressed by about 32 times, which is huge: at that ratio a 640B model could be stored in roughly 20B worth of parameters.
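Below is a minimal PyTorch sketch of what such a KronLinear layer could look like. The real implementation details, such as initialization, bias handling, and how the factor shapes are chosen per layer, are assumptions here; the shapes match the example above.

```python
import torch
import torch.nn as nn

class KronLinear(nn.Module):
    """Sketch of a Kronecker-factorized linear layer.

    Stores two small factors L (m x n) and R (p x q) and reconstructs the
    full weight W = L kron R of shape (m*p, n*q) at inference time.
    """
    def __init__(self, m, n, p, q, bias=True):
        super().__init__()
        self.left = nn.Parameter(torch.randn(m, n) * 0.02)   # e.g. 1024 x 64
        self.right = nn.Parameter(torch.randn(p, q) * 0.02)  # e.g. 1 x 32
        self.bias = nn.Parameter(torch.zeros(n * q)) if bias else None

    def forward(self, x):
        # Reconstruct ("upscale") the full (m*p) x (n*q) weight on the fly.
        w = torch.kron(self.left, self.right)                # e.g. 1024 x 2048
        y = x @ w
        return y + self.bias if self.bias is not None else y

layer = KronLinear(1024, 64, 1, 32)                 # ~65.6K stored weights vs ~2.1M dense
out = layer(torch.randn(8, 1024))                   # -> shape (8, 2048)
print(sum(p.numel() for p in layer.parameters()))   # 65536 + 32 + 2048 = 67616
```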
In our case we compressed the 0.6B model down to a 0.03B model. Not only that, we also integrated an elastic property into the model, so the 0.03B model can be upscaled into multiple model sizes depending on our computation requirements. There are some nice works that have already implemented elasticity [MatFormer, Nemotron Elastic], but there is one key difference between our approach and theirs: although they use only a portion of the architecture at inference time, they still need to store and load the whole set of model parameters. Say the model has 10B parameters and we use 80% of it for inference; we still need to load the entire 10B model onto the GPU, and only the computation cost decreases because we run just a portion of the model.
With our method we store the compressed version of the model and only upscale it during inference based on the requirement, so we save both memory and computation, which matters when you are working with large models.
| Scale | Total params (upscaled) | Non-embedding params | Speedup (vs. l) | Relative speed gain |
|---|---|---|---|---|
| l | 595M | 440M | 1x | - |
| m | 463M | 308M | 1.13x | +12.9% |
| s | 397M | 242M | 1.22x | +21.5% |

Model scales that Tarka-Embedding-30M-V1 supports
Training
As this is a new model, we don't have any pretrained weights to start from, so we first train the model on text and then finetune it on embedding datasets.
Stage 1 : Pretraining
Datasets used:
- mixedbread-ai/wikipedia-embed-en-2023-11
- sentence-transformers/parallel-sentences-wikimatrix
- sentence-transformers/parallel-sentences-talks
First we perform knowledge distillation on mixedbread-ai/wikipedia-embed-en-2023-11 with max_tokens=256 and Qwen3-Embedding-0.6B as the teacher; then, to inject cross-lingual consistency, we train on the two parallel-sentence datasets mentioned above. Training to support all languages is challenging, as it requires significantly more time and computational resources, so our main focus here is English.
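As a hedged sketch of the distillation setup, teacher embeddings can be produced with the public sentence-transformers API; the actual data pipeline, batching, and truncation handling are assumptions.

```python
from sentence_transformers import SentenceTransformer

# Teacher model used for distillation; the truncation setting mirrors max_tokens=256.
teacher = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
teacher.max_seq_length = 256

texts = [
    "The Eiffel Tower is located in Paris.",
    "Python is a popular programming language.",
]

# Teacher embeddings serve as regression targets for the 30M student.
teacher_embeddings = teacher.encode(texts, normalize_embeddings=True)
print(teacher_embeddings.shape)  # (2, 1024) for Qwen3-Embedding-0.6B
```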
Stage 2 : Finetuning
Dataset used:
infgrad/jasper_text_distill_dataset
Once the model was pretrained, we finetuned it on infgrad/jasper_text_distill_dataset with max_tokens=1024.
Loss Functions
We used the same losses as in our last blog post for both stage 1 and stage 2. The only difference is that we added dynamic weights to the samples: instead of giving every sample in a batch the same importance, we weight samples that are not yet learned more heavily and already-learned samples less.
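The exact weighting scheme is not spelled out here; the sketch below shows one way such dynamic per-sample weights could sit on top of a distillation-style loss (the cosine loss form and the softmax weighting are assumptions, not the exact losses from the previous post).

```python
import torch
import torch.nn.functional as F

def dynamically_weighted_loss(student_emb, teacher_emb, temperature=1.0):
    """Sketch: distillation-style loss with dynamic per-sample weights.

    Samples whose embeddings are still far from the teacher's ("not learned
    yet") get larger weights; well-learned samples are down-weighted.
    """
    # Per-sample loss: 1 - cosine similarity between student and teacher.
    per_sample = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)

    # Dynamic weights from the current per-sample losses (detached so the
    # weights themselves do not receive gradients).
    weights = torch.softmax(per_sample.detach() / temperature, dim=0)

    return (weights * per_sample).sum()

# Example usage with random embeddings standing in for model outputs.
student = torch.randn(16, 1024, requires_grad=True)
teacher = torch.randn(16, 1024)
loss = dynamically_weighted_loss(student, teacher)
loss.backward()
```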
Results
| Model | Params (B) | Mean (Task) | Mean (TaskType) | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
|---|---|---|---|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 0.023 | 59.03 | 55.93 | 69.25 | 44.9 | 82.37 | 47.14 | 42.92 | 78.95 | 25.96 |
| gte-micro-v4 | 0.019 | 58.9 | 56.04 | 73.04 | 43.89 | 82.67 | 44.78 | 39.51 | 79.78 | 28.59 |
| snowflake-arctic-embed-xs | 0.023 | 59.77 | 56.12 | 67 | 42.44 | 81.33 | 45.26 | 52.65 | 76.21 | 27.96 |
| gte-micro | 0.017 | 53.89 | 52.5 | 67.47 | 41.86 | 80.76 | 43.16 | 27.66 | 77.86 | 28.76 |
| Qwen3 Embedding 0.6B | 0.6 | 70.7 | 64.88 | 85.76 | 54.05 | 84.37 | 48.18 | 61.83 | 86.57 | 33.43 |
| Tarka Embedding 30M v1 (L) | 0.03 | 60.43 | 56.69 | 79.2 | 46.99 | 78.24 | 43.32 | 42.5 | 76.92 | 29.63 |
| Tarka Embedding 30M v1 (M) | 0.03 | 51.96 | 49.88 | 66.52 | 43.47 | 70.66 | 40.12 | 30.15 | 69.81 | 28.42 |
| Tarka Embedding 30M v1 (S) | 0.03 | 46.07 | 45.22 | 60.37 | 41.37 | 66.29 | 38.34 | 19.56 | 64.15 | 26.44 |
Conclusion
We observe a noticeable performance degradation compared to the teacher model, despite using comparable computational resources at inference. This gap may come from several factors. First, our model is trained from scratch, whereas Qwen's embedding model benefits from significantly longer and more extensive training as well as large-scale curated data. It may also be a limit of the student model's expressive capacity, which may not be as efficient as we think. Or the training procedure itself, such as how we implemented the loss functions, the optimization strategy, and other choices, may not yet be optimal.
Importantly, this work should be viewed as an exploratory study into model compression rather than a fully optimized replacement for the teacher. We hope this experiment sheds light on the challenges and opportunities of such structured compression techniques and inspires further work on improved architectures, loss formulations, and training strategies for compressed embedding models.