Tarka Embedding V1
The Tarka Embedding V1 series is a compact and efficient text embedding model family developed to explore the capabilities of knowledge distillation, coreset selection, and model compression techniques. These models are distilled from high-performing teacher architectures, transferring their representational and reasoning abilities into smaller, faster student models with minimal performance degradation.
Keywords
feature extraction, knowledge distillation, text embeddings, representation learning
What models does V1 series cover?
We were strict about model selection and dataset size so that these models can be trained on as low a compute budget as possible. The V1 series covers two compute budgets. The larger model, Tarka-Embedding-350M-V1, has 350M parameters with an embedding size of 1024 and is built on a 16-layer hybrid of convolution and attention blocks. The smaller model, Tarka-Embedding-150M-V1, is a 24-layer hybrid attention model that combines sliding-window attention and full attention.
|  | Tarka-Embedding-150M-V1 | Tarka-Embedding-350M-V1 |
| --- | --- | --- |
| Vocab size | 64,400 | 65,536 |
| Layers | 24 | 16 |
| Embedding size | 768 | 1024 |
| Intermediate size | 1152 | 6656 |
| Bidirectional attention | True | True |
| Pooling | mean | mean |
| Instruction aware | Yes | Yes |
| Training data | 2 billion tokens | 1 billion tokens |
| On-air dynamic sampling | No | Yes |
These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs.
Approach


In this work, we introduce our Tarka Embedding V1, developed using high-quality open-source multilingual datasets covering a wide range of domains. The model is built upon a strong multilingual foundation and trained through an innovative adaptive teacher–student learning framework designed to maximize efficiency and generalization.
Our training pipeline employs an on-air dynamic sampling mechanism, where a teacher model continuously evaluates the student’s responses to identify which concepts or examples the student has not yet mastered. Instead of training on the entire dataset, the teacher selectively samples only the most informative data points the student struggles with. This dynamic feedback loop enables the student to focus on harder or under-learned examples, resulting in faster convergence and better generalization. Despite having access to over 2 billion text tokens, the model achieves competitive performance while training on fewer than 1 billion dynamically sampled tokens, thereby reducing training costs and computational overhead.
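To make the mechanism concrete, the sketch below shows one way such a teacher-guided selection step could look in PyTorch. The function name `select_hard_examples`, the use of cosine distance as the difficulty score, and the `keep_ratio` parameter are illustrative assumptions, not the actual Tarka training code.

```python
import torch
import torch.nn.functional as F

def select_hard_examples(teacher, student, batch, keep_ratio=0.25):
    """Illustrative sketch: rank examples by student-teacher divergence and
    keep only the hardest fraction for the next gradient step."""
    with torch.no_grad():
        t_emb = F.normalize(teacher(batch), dim=-1)   # teacher embeddings
        s_emb = F.normalize(student(batch), dim=-1)   # current student embeddings
        # Per-example cosine distance as a proxy for "not yet mastered".
        difficulty = 1.0 - (t_emb * s_emb).sum(dim=-1)
    k = max(1, int(keep_ratio * difficulty.numel()))
    return difficulty.topk(k).indices                  # indices of the hardest examples
```

In a full pipeline, only the returned subset would be used for the optimiser step, so easy examples that the student already handles well do not consume compute.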
Additionally, we introduce a dynamic weighted MSE–Cosine loss that adaptively penalizes samples where the student’s output diverges significantly from the teacher’s output distribution. This hybrid loss balances magnitude alignment (via MSE) and directional similarity (via cosine distance), ensuring that the student not only learns accurate representations but also preserves semantic consistency with the teacher.
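As a concrete illustration, here is a minimal sketch of a weighted MSE-cosine distillation loss of the kind described. The specific weighting rule (scaling each sample by its cosine distance to the teacher) and the `alpha` mixing coefficient are assumptions for illustration; the exact formulation used for Tarka is not spelled out here.

```python
import torch
import torch.nn.functional as F

def weighted_mse_cosine_loss(student_emb, teacher_emb, alpha=0.5):
    """Sketch of a dynamic weighted MSE-cosine distillation loss.

    Samples whose direction diverges more from the teacher receive a larger
    weight, so the student focuses on poorly aligned examples.
    """
    cos_sim = F.cosine_similarity(student_emb, teacher_emb, dim=-1)   # (B,)
    cos_dist = 1.0 - cos_sim                                          # directional error
    mse = (student_emb - teacher_emb).pow(2).mean(dim=-1)             # magnitude error

    # Dynamic per-sample weight: emphasise samples that diverge from the teacher.
    weight = 1.0 + cos_dist.detach()

    per_sample = alpha * mse + (1.0 - alpha) * cos_dist
    return (weight * per_sample).mean()
```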
Combined with multi-stage fine-tuning and model merging strategies, our approach yields a family of models that are robust, efficient, and highly adaptable. The final models support flexible embedding dimensions and customizable instruction prompts for both embedding and reranking tasks, facilitating seamless integration into diverse downstream applications.
Tarka Embedding 150M V1
This model is built on EmbeddingGemma as the base model. EmbeddingGemma covers more than 100 languages with a vocabulary of 262,144 tokens, but that vocabulary consumes roughly half of the total parameters just for the embedding weights, leaving the model less capacity to learn elsewhere. To optimise this, we replaced the existing tokenizer and embedding weights with the tokenizer from LFM2-350M, which has a 65,536-token vocabulary. This saves more than 100M parameters at the cost of language coverage, as the LFM2 tokenizer covers only 8 languages.
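A rough sketch of how such a tokenizer and embedding swap could be performed with `transformers` is shown below, assuming access to both checkpoints. The model and tokenizer IDs are the ones named in this post, but the row-copying strategy for shared tokens (leaving unmatched rows randomly initialised for distillation to fix) is an assumption about the procedure, not the exact recipe used.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Model/tokenizer IDs taken from this post; the swap procedure is an illustrative guess.
base = AutoModel.from_pretrained("google/embeddinggemma-300m")
old_tok = AutoTokenizer.from_pretrained("google/embeddinggemma-300m")
new_tok = AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M")

old_emb = base.get_input_embeddings()                       # 262,144 x 768 in EmbeddingGemma
new_emb = torch.nn.Embedding(len(new_tok), old_emb.embedding_dim)

old_vocab = old_tok.get_vocab()
with torch.no_grad():
    for token, new_id in new_tok.get_vocab().items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Reuse the pretrained row when the token string exists in both vocabs.
            new_emb.weight[new_id] = old_emb.weight[old_id]

base.set_input_embeddings(new_emb)  # unmatched rows stay randomly initialised and are learned later
```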
Training Details
Teacher model: google/embeddinggemma-300m
Context Length: 2048
Supported Languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.
Training Data: 2 billion text tokens curated from multiple open source datasets.
Compute Resources: 40 GPU hours on NVIDIA A100
Tarka Embedding 350M V1
This model is built on LFM2-350M as the base architecture and supports over eight languages with a vocabulary of 65,536 tokens. LFM employs Liquid Time-constant Networks (LTCs), a new class of continuous-time recurrent neural networks (RNNs) based on linear dynamical systems modulated by nonlinear, interlinked input gates. These gates serve as a continuous-time generalization of input- and state-dependent gating mechanisms in traditional RNNs, enabling finer temporal control over system evolution and allowing the model to learn complex, adaptive “liquid” dynamics directly from data.
However, the original LFM design relies on causal attention, which restricts each token to attend only to preceding tokens. While suitable for generative tasks, this limitation is suboptimal for embedding models, where it is essential to leverage information from the entire input sequence simultaneously.
To address this, we introduce key architectural modifications by converting the causal attention into a bidirectional attention mechanism, allowing the model to access and integrate context from all tokens at once. This change enhances semantic understanding and representation quality. Additionally, we replace the final token representation with an average pooling layer across all token embeddings, producing a more stable, context-aware sentence representation. These enhancements make the model significantly more efficient and effective for embedding and retrieval tasks.
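The pooling change is straightforward to express; the sketch below shows a mask-aware mean pooling over the final hidden states. The causal-to-bidirectional switch itself depends on LFM2 implementation internals (essentially dropping the causal mask in each attention block), so it is only indicated in a comment rather than implemented here.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mask-aware mean pooling: average only over real (non-padding) tokens."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)   # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                    # (B, H)
    counts = mask.sum(dim=1).clamp(min=1e-6)                          # avoid division by zero
    return summed / counts

# Conceptually, the bidirectional change amounts to removing the causal mask in
# each attention block so every token can attend to every other token; the
# concrete edit is specific to the LFM2 code (an assumption about the implementation).
```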


Training Details
Teacher model: Qwen/Qwen3-Embedding-4B
Context Length: 32k
Supported Languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.
Training Data: The dataset comprises 2 billion text tokens, with approximately 1 billion tokens utilized during training via dynamic sampling.
Compute Resources: 40 GPU hours on NVIDIA A100
How Do They Perform on Benchmarks?
| Model | Params | Mean (Task) | Mean (TaskType) | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GIST-large-Embedding-v0 | 335M | 66.25 | 61.96 | 78.91 | 48.84 | 86.7 | 48.76 | 54.52 | 84.44 | 31.52 |
| mxbai-embed-large-v1 | 335M | 66.26 | 62.04 | 79.1 | 47.48 | 87.2 | 48.05 | 55.4 | 84.42 | 32.63 |
| UAE-Large-V1 | 335M | 66.4 | 61.85 | 79.08 | 47.86 | 87.25 | 48.35 | 55.91 | 84.37 | 30.13 |
| GIST-Embedding-v0 | 109M | 65.5 | 61.4 | 78.16 | 48.5 | 86.33 | 47.52 | 53.59 | 83.35 | 32.32 |
| bge-large-en-v1.5 | 335M | 65.89 | 61.87 | 78.34 | 48.01 | 87.13 | 48.26 | 55.44 | 82.79 | 33.13 |
| multilingual-e5-large-instruct | 560M | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| gte-large | 335M | 64.77 | 60.86 | 75.47 | 48.2 | 85.08 | 47.84 | 53.29 | 83.27 | 32.9 |
| bilingual-embedding-large | 559M | 63.77 | 60.2 | 77.17 | 46.53 | 85.62 | 46.25 | 46.86 | 86 | 32.95 |
| mmlw-roberta-large | 434M | 61.8 | 59.45 | 79.66 | 47.89 | 85.2 | 47.56 | 39.69 | 81.2 | 34.97 |
| e5-large | 335M | 63.13 | 59.68 | 75.61 | 45.88 | 85.94 | 45.43 | 49.64 | 82 | 33.26 |
| mmlw-e5-base | 278M | 61.43 | 58.61 | 77.88 | 47.11 | 84.88 | 46.4 | 40.21 | 81.92 | 31.87 |
| e5-large-v2 | 335M | 62.79 | 59.4 | 76.44 | 45.23 | 86.06 | 45.72 | 49.31 | 80.67 | 32.34 |
| Tarka-Embedding-150M-V1 | 150M | 66.40 | 61.35 | 86.28 | 51.94 | 81.65 | 45.66 | 51.62 | 81.48 | 30.86 |
| Tarka-Embedding-350M-V1 | 350M | 69.29 | 63.29 | 88.43 | 55.73 | 83.96 | 47.77 | 55.14 | 84.59 | 27.43 |
Results on MTEB(eng, v2)
Conclusion
In short, our Tarka Embedding V1 models are compressed and distilled variants of high-quality teacher models, designed to retain most of the teachers’ semantic strength while being far more efficient. Although they exhibit a slight reduction in accuracy compared to their teachers, they outperform models of similar or even larger sizes in accuracy, computational efficiency, and inference speed. This makes them exceptionally well-suited for large-scale applications such as retrieval-augmented generation (RAG), semantic search, and knowledge-intensive reasoning, where speed, scalability, and quality must coexist seamlessly.