Datasets

Dharmamitra develops and hosts large-scale datasets. These datasets are essential for training our models, and other researchers can use them for philological studies, machine translation, and semantic analysis.

MITRA-parallel

MITRA-parallel is a large-scale, sentence-aligned parallel corpus for Sanskrit, Buddhist Chinese, and Tibetan. It contains 1.74 million parallel sentence pairs and is designed to support research in machine translation and semantic retrieval.

Publication (in preparation): MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan
License: CC BY-SA 4.0
Repository: dharmamitra/mitra-parallel
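
For readers new to sentence-aligned corpora, the following minimal sketch shows how parallel sentence pairs could be represented and filtered by language pair. It is an illustration only: the tab-separated layout, the column names, the language codes (sa, bo, zh), and the sample sentences are assumptions made for this example and do not describe the actual file format or contents of MITRA-parallel. Consult the repository for the real schema.

```python
import csv
import io
from typing import NamedTuple


class SentencePair(NamedTuple):
    """One aligned sentence pair (hypothetical schema, not the dataset's own)."""
    source_lang: str
    target_lang: str
    source: str
    target: str


# Illustrative sample in an ASSUMED tab-separated layout.
# These rows are not taken from MITRA-parallel.
SAMPLE_TSV = (
    "source_lang\ttarget_lang\tsource\ttarget\n"
    "sa\tbo\tevaṃ mayā śrutam\t'di skad bdag gis thos pa\n"
    "sa\tzh\tevaṃ mayā śrutam\t如是我聞\n"
)


def read_pairs(handle):
    """Parse tab-separated rows with a header line into SentencePair records."""
    # QUOTE_NONE keeps quote characters inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [SentencePair(**row) for row in reader]


def pairs_for(pairs, source_lang, target_lang):
    """Keep only the pairs for one translation direction."""
    return [
        p for p in pairs
        if (p.source_lang, p.target_lang) == (source_lang, target_lang)
    ]


pairs = read_pairs(io.StringIO(SAMPLE_TSV))
sa_bo = pairs_for(pairs, "sa", "bo")
print(len(pairs), "pairs in total,", len(sa_bo), "Sanskrit-Tibetan")
```

To read from a downloaded file instead of the inline sample, the same function could be called with an open file handle, for example `open(path, encoding="utf-8", newline="")`, provided the file actually follows the assumed layout.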

SansTib

SansTib is a parallel corpus of Sanskrit and Classical Tibetan, automatically aligned at the sentence level. The corpus contains approximately 317,000 sentence pairs and has been a foundational resource for developing bilingual sentence embedding models.

Publication: SansTib, a Sanskrit - Tibetan Parallel Corpus and Bilingual Sentence Embedding Model (LREC 2022)
License: CC BY-SA 4.0
Repository: sebastian-nehrdich/sanstib