Tarka Embedding V1

The Tarka Embedding V1 series is a compact and efficient text embedding model family developed to explore the capabilities of knowledge distillation, coreset selection, and model compression techniques. These models are distilled from high-performing teacher architectures, transferring their representational and reasoning abilities into smaller, faster student models with minimal performance degradation.

Keywords

feature extraction, knowledge distillation, text embeddings, representation learning

What models does the V1 series cover?

We were strict about model selection and dataset size, since these models should be trainable on as low a budget as possible. The V1 series covers two compute budgets. The larger model, Tarka-Embedding-350M-V1, has 350M parameters with an embedding size of 1024 and is built on a 16-layer hybrid of convolution and attention blocks. The smaller model, Tarka-Embedding-150M-V1, is a 24-layer hybrid attention model that combines sliding-window attention and full attention.

| | Tarka-Embedding-150M-V1 | Tarka-Embedding-350M-V1 |
| --- | --- | --- |
| Vocab size | 64,400 | 65,536 |
| Layers | 24 | 16 |
| Embedding size | 768 | 1024 |
| Intermediate size | 1152 | 6656 |
| Bidirectional attention | True | True |
| Pooling | mean | mean |
| Instruction aware | Yes | Yes |
| Training data | 2 billion tokens | 1 billion tokens |
| On-air dynamic sampling | No | Yes |

These models are optimized for use cases such as semantic retrieval, classification, RAG, sentiment analysis, and code search—providing a strong alternative to existing solutions like Gemini Embedding and OpenAI’s embedding APIs.

Approach

In this work, we introduce our Tarka Embedding V1, developed using high-quality open-source multilingual datasets covering a wide range of domains. The model is built upon a strong multilingual foundation and trained through an innovative adaptive teacher–student learning framework designed to maximize efficiency and generalization.

Our training pipeline employs an on-air (on-the-fly) dynamic sampling mechanism, where a teacher model continuously evaluates the student's responses to identify which concepts or examples the student has not yet mastered. Instead of training on the entire dataset, the teacher selectively samples only the most informative data points the student struggles with. This dynamic feedback loop lets the student focus on harder or under-learned examples, resulting in faster convergence and better generalization. Despite having access to over 2 billion text tokens, the model achieves competitive performance by training on fewer than 1 billion dynamically sampled tokens, thereby reducing training costs and computational overhead.
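The exact selection rule is not spelled out above, so the following is only a minimal sketch of what such a teacher-driven selection step could look like, assuming `student` and `teacher` are callables that map a batch of inputs to embedding matrices; the cosine-distance difficulty score and the keep ratio are illustrative assumptions, not the published recipe.

```python
import torch
import torch.nn.functional as F

def select_informative_batch(student, teacher, candidate_batch, keep_ratio=0.5):
    """Keep only the candidates the student has not yet mastered.

    Candidates are scored by how far the student's embedding is from the
    teacher's (1 - cosine similarity); the hardest `keep_ratio` fraction is
    kept for the actual gradient step. Scoring rule and ratio are assumptions.
    """
    with torch.no_grad():
        t_emb = teacher(candidate_batch)   # (B, D) teacher embeddings
        s_emb = student(candidate_batch)   # (B, D) student embeddings
        difficulty = 1.0 - F.cosine_similarity(s_emb, t_emb, dim=-1)  # (B,)

    k = max(1, int(keep_ratio * difficulty.numel()))
    hard_idx = difficulty.topk(k).indices  # indices of the hardest examples
    return candidate_batch[hard_idx], t_emb[hard_idx]
```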

Additionally, we introduce a dynamic weighted MSE–Cosine loss that adaptively penalizes samples where the student’s output diverges significantly from the teacher’s output distribution. This hybrid loss balances magnitude alignment (via MSE) and directional similarity (via cosine distance), ensuring that the student not only learns accurate representations but also preserves semantic consistency with the teacher.
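As a rough illustration of the idea, a per-sample weighted combination of MSE and cosine distance might look like the sketch below; the specific weighting function and the mixing coefficient `alpha` are assumptions, since the exact formulation is not given here.

```python
import torch.nn.functional as F

def weighted_mse_cosine_loss(student_emb, teacher_emb, alpha=0.5):
    """Hybrid distillation loss: magnitude alignment (MSE) + direction (cosine).

    Samples whose student output diverges more from the teacher's receive a
    larger weight. The (1 + cosine distance) weight and the mixing coefficient
    `alpha` are illustrative assumptions.
    """
    cos_dist = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1)     # (B,)
    mse = F.mse_loss(student_emb, teacher_emb, reduction="none").mean(dim=-1)  # (B,)

    weights = (1.0 + cos_dist).detach()          # divergent samples are penalized more
    per_sample = alpha * mse + (1.0 - alpha) * cos_dist
    return (weights * per_sample).mean()
```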

Combined with multi-stage fine-tuning and model merging strategies, our approach yields a family of models that are robust, efficient, and highly adaptable. The final models support flexible embedding dimensions and customizable instruction prompts for both embedding and reranking tasks, facilitating seamless integration into diverse downstream applications.
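To give a sense of how such instruction prompts might be used, here is a hedged usage sketch assuming the checkpoints are published in a Sentence Transformers compatible format; the repository id and the instruction wording are placeholder assumptions, not the official interface.

```python
from sentence_transformers import SentenceTransformer

# "Tarka-Embedding-350M-V1" is a placeholder repository id; the actual hub path
# and the instruction wording below are assumptions for illustration.
model = SentenceTransformer("Tarka-Embedding-350M-V1")

query = (
    "Instruct: Given a web search query, retrieve relevant passages\n"
    "Query: how does knowledge distillation work?"
)
docs = [
    "Knowledge distillation transfers knowledge from a large teacher to a smaller student model.",
    "Mean pooling averages token embeddings into a single sentence representation.",
]

q_emb = model.encode([query])
d_emb = model.encode(docs)
print(model.similarity(q_emb, d_emb))  # cosine similarity scores, shape (1, 2)
```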

Tarka Embedding 150M V1

This model is built on EmbeddingGemma as the base model. EmbeddingGemma covers more than 100 languages with a vocabulary of 262,144 tokens, but that vocabulary consumes roughly half of the total parameters just for the embedding weights, leaving the model less capacity to learn. To optimise this, we replaced the existing tokenizer and embedding weights with the more compact tokenizer from LFM2-350M, which has a 65,536-token vocabulary. This saves more than 100M parameters at the cost of language coverage, as the new tokenizer covers only 8 languages.
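Below is a minimal sketch of that kind of embedding-table surgery, assuming both checkpoints are loadable with Hugging Face transformers; the repository ids are assumptions, and here the smaller table is simply re-initialised rather than transplanted from LFM2.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Repository ids below are assumptions for illustration.
base = AutoModel.from_pretrained("google/embeddinggemma-300m")   # EmbeddingGemma backbone
new_tok = AutoTokenizer.from_pretrained("LiquidAI/LFM2-350M")    # 65,536-token tokenizer

hidden_size = base.get_input_embeddings().weight.shape[1]

# Swap the 262,144-row embedding table for a 65,536-row one and update the config,
# trimming well over 100M parameters from the embedding layer.
base.set_input_embeddings(torch.nn.Embedding(len(new_tok), hidden_size))
base.config.vocab_size = len(new_tok)

print(f"embedding parameters now: {len(new_tok) * hidden_size / 1e6:.1f}M")
```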

Training Details

  • Context Length: 2048

  • Supported Languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.

  • Training Data: 2 billion text tokens curated from multiple open source datasets.

  • Compute Resources: 40 GPU hours on NVIDIA A100

Tarka Embedding 350M V1

This model is built on LFM2-350M as the base architecture and supports eight languages with a vocabulary of 65,536 tokens. LFM employs Liquid Time-constant Networks (LTCs), a new class of continuous-time recurrent neural networks (RNNs) based on linear dynamical systems modulated by nonlinear, interlinked input gates. These gates serve as a continuous-time generalization of input- and state-dependent gating mechanisms in traditional RNNs, enabling finer temporal control over system evolution and allowing the model to learn complex, adaptive "liquid" dynamics directly from data.

However, the original LFM design relies on causal attention, which restricts each token to attend only to preceding tokens. While suitable for generative tasks, this limitation is suboptimal for embedding models, where it is essential to leverage information from the entire input sequence simultaneously.

To address this, we introduce key architectural modifications by converting the causal attention into a bidirectional attention mechanism, allowing the model to access and integrate context from all tokens at once. This change enhances semantic understanding and representation quality. Additionally, we replace the final token representation with an average pooling layer across all token embeddings, producing a more stable, context-aware sentence representation. These enhancements make the model significantly more efficient and effective for embedding and retrieval tasks.
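For concreteness, here is a minimal sketch of the two changes, assuming Hugging Face-style tensors: a mean-pooling step over valid (non-padding) tokens, and a mask that lets every valid token attend to every other valid token instead of only to its predecessors. Variable names and the exact mask layout are illustrative.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()         # (B, T, 1)
    summed = (last_hidden_state * mask).sum(dim=1)      # (B, D)
    counts = mask.sum(dim=1).clamp(min=1e-6)            # (B, 1)
    return summed / counts                              # (B, D)

def bidirectional_attention_mask(attention_mask):
    """Causal -> bidirectional: every valid token may attend to every other
    valid token, instead of only to the tokens that precede it."""
    # Boolean (B, 1, T, T) layout; True marks allowed attention pairs.
    return (attention_mask[:, None, None, :] * attention_mask[:, None, :, None]).bool()
```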

Training Details

  • Teacher model: Qwen/Qwen3-Embedding-4B

  • Context Length: 32k

  • Supported Languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish.

  • Training Data: The dataset comprises 2 billion text tokens, with approximately 1 billion tokens utilized during training via dynamic sampling.

  • Compute Resources: 40 GPU hours on NVIDIA A100

How Do They Perform on Benchmarks?

| Models | Param | Mean(Task) | Mean(Type) | Class. | Clust. | Pair Class. | Rerank. | Retri. | STS | Summ. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GIST-large-Embedding-v0 | 335M | 66.25 | 61.96 | 78.91 | 48.84 | 86.7 | 48.76 | 54.52 | 84.44 | 31.52 |
| mxbai-embed-large-v1 | 335M | 66.26 | 62.04 | 79.1 | 47.48 | 87.2 | 48.05 | 55.4 | 84.42 | 32.63 |
| UAE-Large-V1 | 335M | 66.4 | 61.85 | 79.08 | 47.86 | 87.25 | 48.35 | 55.91 | 84.37 | 30.13 |
| GIST-Embedding-v0 | 109M | 65.5 | 61.4 | 78.16 | 48.5 | 86.33 | 47.52 | 53.59 | 83.35 | 32.32 |
| bge-large-en-v1.5 | 335M | 65.89 | 61.87 | 78.34 | 48.01 | 87.13 | 48.26 | 55.44 | 82.79 | 33.13 |
| multilingual-e5-large-instruct | 560M | 65.53 | 61.21 | 75.54 | 49.89 | 86.24 | 48.74 | 53.47 | 84.72 | 29.89 |
| gte-large | 335M | 64.77 | 60.86 | 75.47 | 48.2 | 85.08 | 47.84 | 53.29 | 83.27 | 32.9 |
| bilingual-embedding-large | 559M | 63.77 | 60.2 | 77.17 | 46.53 | 85.62 | 46.25 | 46.86 | 86 | 32.95 |
| mmlw-roberta-large | 434M | 61.8 | 59.45 | 79.66 | 47.89 | 85.2 | 47.56 | 39.69 | 81.2 | 34.97 |
| e5-large | 335M | 63.13 | 59.68 | 75.61 | 45.88 | 85.94 | 45.43 | 49.64 | 82 | 33.26 |
| mmlw-e5-base | 278M | 61.43 | 58.61 | 77.88 | 47.11 | 84.88 | 46.4 | 40.21 | 81.92 | 31.87 |
| e5-large-v2 | 335M | 62.79 | 59.4 | 76.44 | 45.23 | 86.06 | 45.72 | 49.31 | 80.67 | 32.34 |
| Tarka-Embedding-150M-V1 | 150M | 66.40 | 61.35 | 86.28 | 51.94 | 81.65 | 45.66 | 51.62 | 81.48 | 30.86 |
| Tarka-Embedding-350M-V1 | 350M | 69.29 | 63.29 | 88.43 | 55.73 | 83.96 | 47.77 | 55.14 | 84.59 | 27.43 |

Results on MTEB(eng, v2)

Conclusion

In short, the Tarka Embedding V1 models are compressed, distilled variants of a high-quality teacher model, designed to retain most of its semantic strength while being far more efficient. Although they show a slight drop in accuracy relative to the teacher, they outperform models of similar or even larger size in accuracy, computational efficiency, and inference speed. This makes them exceptionally well suited for large-scale applications such as retrieval-augmented generation (RAG), semantic search, and knowledge-intensive reasoning, where speed, scalability, and quality must coexist seamlessly.