Datasets
Dharmamitra develops and hosts large-scale datasets that are essential for training our models and can be used by other researchers for philological studies, machine translation, and semantic analysis.
MITRA-parallel
MITRA-parallel is a large-scale, sentence-aligned parallel corpus for Sanskrit, Buddhist Chinese, and Tibetan. It contains 1.74 million parallel sentence pairs and is designed to support research in machine translation and semantic retrieval.
- Publication (currently in preparation): MITRA: A Large-Scale Parallel Corpus and Multilingual Pretrained Language Model for Machine Translation and Semantic Retrieval for Pāli, Sanskrit, Buddhist Chinese, and Tibetan
- License: CC BY-SA 4.0
- Repository: dharmamitra/mitra-parallel
SansTib
SansTib is a Sanskrit-Classical Tibetan parallel corpus that was automatically aligned at the sentence level. The corpus contains approximately 317,000 sentence pairs and has been a foundational resource for developing bilingual sentence embedding models.
- Publication: SansTib, a Sanskrit - Tibetan Parallel Corpus and Bilingual Sentence Embedding Model (LREC 2022)
- License: CC BY-SA 4.0
- Repository: sebastian-nehrdich/sanstib
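Working with sentence-aligned data

Both corpora above consist of sentence-aligned pairs. As a rough illustration of how such data can be consumed, the following Python sketch reads pairs from a tab-separated stream with one pair per line. The file layout, the `SentencePair` type, and the `read_pairs` helper are assumptions made for this example and are not taken from either repository, so check the actual file format there before adapting it. The two sample lines are illustrative and not drawn from the corpora.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class SentencePair(NamedTuple):
    """One aligned pair (hypothetical structure, for illustration only)."""
    source: str
    target: str


def read_pairs(handle: TextIO) -> Iterator[SentencePair]:
    """Yield aligned pairs from a tab-separated stream.

    Assumes one pair per line in the form source<TAB>target.
    Rows without exactly two fields, or with an empty side, are skipped.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue
        source, target = row[0].strip(), row[1].strip()
        if source and target:
            yield SentencePair(source, target)


# Illustrative stand-in for a real corpus file (Sanskrit, Tibetan in Wylie).
sample = io.StringIO("dharma\tchos\nbuddha\tsangs rgyas\n")
pairs = list(read_pairs(sample))
```

For a file on disk, the same helper can be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` points to wherever the corpus has been downloaded.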