
Hello, I'm
Rajvee Sheth
Junior Research Fellow, IIT Gandhinagar
I am a Junior Research Fellow at Lingo Labs, Indian Institute of Technology Gandhinagar, specializing in Natural Language Processing and code-mixed text processing. My work focuses on developing tools and datasets for Hindi-English code-mixing, aiming to enhance multilingual NLP applications. My research is guided by Prof. Mayank Singh.
I am passionate about advancing the field of NLP through innovative research and practical applications. My current projects include the development of the COMI-LINGUA dataset and the COMMENTATOR annotation framework.
Research Interests
Natural Language Processing
Developing robust computational frameworks for multilingual text understanding, with emphasis on advanced annotation methodologies and scalable evaluation systems that enable interpretable and secure NLP solutions across diverse linguistic contexts.
Code-Mixing
Advancing code-mixed NLP through curated Hindi-English datasets, robust annotation frameworks, and comprehensive evaluation of existing state-of-the-art LLMs to improve language understanding.
Technical Skills
Programming Languages
Web Technologies
Databases & Tools
Libraries
Projects
Curating and constructing benchmarks and development of ML models for low-level NLP tasks in Hindi-English code-mixing
Funded by ANRF. This project covers the development of benchmarks and an annotation framework for Hindi-English code-mixing research. Key deliverables include:
COMI-LINGUA Dataset: A large-scale, expert-annotated dataset for multitask NLP in Hindi-English code-mixing, covering tasks like Matrix Language Identification, POS Tagging, and Named Entity Recognition. Available at: Hugging Face.
COMMENTATOR Portal: A code-mixed multilingual text annotation framework designed to facilitate the annotation and analysis of Hindi-English texts. This tool supports the creation and management of annotated datasets for research purposes. Available at: GitHub Repo.
More about the work: Project Website
Publications
COMMENTATOR: A Code-mixed Multilingual Text Annotation Framework
R. Sheth, S. Nisar, H. Prajapati, H. Beniwal, M. Singh, "COMMENTATOR: A Code-mixed Multilingual Text Annotation Framework," Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2024. [PDF]
COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing
R. Sheth, H. Beniwal, M. Singh, "COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing," Findings of the Association for Computational Linguistics: EMNLP 2025, 2025. [PDF]
Eka-Eval: A Comprehensive Evaluation Framework for Large Language Models in Indian Languages
S. R. Sinha, R. Sheth, A. Upperwal, M. Singh, "Eka-Eval: A Comprehensive Evaluation Framework for Large Language Models in Indian Languages," arXiv preprint arXiv:2507.01853, 2025. [PDF]
News and Updates
August 2025: Paper on the "COMI-LINGUA" dataset accepted at EMNLP Findings 2025.
February 2025: Showcased research at CoLab 2025, an IITGN Industry Open House. CoLab is IITGN's flagship event connecting industry and academia to explore research partnerships and foster collaboration on innovation.
January 2025: Participated in the Curiosity Carnival for School Children at IIT Gandhinagar, engaging with young minds and promoting STEM education.
October 2024: Paper on the "COMMENTATOR" framework accepted at EMNLP 2024.
June - July 2024: Volunteered at the ACM India Summer School 2024 on GenAI for Text (June 24 - July 5), contributing to educational initiatives in artificial intelligence.
April 2024: Served as a Programming Technical Assistant for a one-day workshop on Python Programming and AI Applications, supporting hands-on learning experiences.
February 2024: Participated in Science Day and CoLab 2024, engaging in collaborative research and academic discussions.
November 2023: Joined Lingo Labs at IIT Gandhinagar as a Junior Research Fellow.