Datasets

Dharmamitra develops and hosts large-scale datasets that are essential for training our models and can be used by other researchers for philological studies, machine translation, and semantic analysis.


MITRA-parallel

MITRA-parallel is a large-scale, sentence-aligned parallel corpus for Sanskrit, Buddhist Chinese, and Tibetan. It contains 1.74 million parallel sentence pairs and is designed to support research in machine translation and semantic retrieval.

  • Publication (currently in preparation): MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan
  • License: CC BY-SA 4.0
  • Repository: dharmamitra/mitra-parallel

SansTib

SansTib is a Sanskrit-Classical Tibetan parallel corpus that was automatically aligned at the sentence level. The corpus contains approximately 317,000 sentence pairs and has been a foundational resource for developing bilingual sentence embedding models.