Large language models (LLMs) have been widely adopted due to their remarkable performance across various applications, driving the accelerated development of a large number of diverse models. However, individual LLMs show limitations in generalization and performance on complex tasks due to inherent training biases, model size constraints, and the quality or diversity of pre-training datasets. A promising direction is to efficiently harness the diverse capabilities of multiple LLMs to overcome these individual limitations. To this end, we introduce a novel LLM selection algorithm called SelectLLM, which efficiently directs input queries to the most suitable subset of LLMs from a large pool, ensuring that the selected models collectively provide accurate responses. SelectLLM employs a multi-label classifier, together with a policy based on the classifier’s predictions and confidence scores, to select an optimal, query-aware, and lightweight subset of LLMs. Our findings indicate that the proposed model outperforms existing ensemble-based baselines and achieves competitive performance with similarly sized top-performing LLMs while maintaining efficiency. Specifically, it substantially reduces inference latency on two challenging reasoning benchmarks: by 13% on GSM8K and 70% on MMLU, compared to the top-performing baseline. We also establish a theoretical upper bound with an Oracle over the LLM pool and perform an in-depth linguistic analysis to understand the performance gap between the Oracle and SelectLLM.
@inproceedings{maurya2025selectllm,author={Maurya, Kaushal Kumar and Srivatsa, KV Aditya and Kochmar, Ekaterina},title={{SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models}},booktitle={Findings of the Association for Computational Linguistics: ACL 2025},year={2025},address={Vienna, Austria},publisher={Association for Computational Linguistics},}
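A minimal sketch of the query-aware selection policy described above, assuming a trained multi-label classifier that outputs one success probability per candidate LLM. The classifier interface, the confidence threshold, and the fallback rule are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

class DummyScorer:
    """Hypothetical stand-in for a trained multi-label classifier: returns one
    success probability per candidate LLM for each query."""
    def predict_proba(self, X):
        rng = np.random.default_rng(0)
        return rng.random((X.shape[0], 5))  # 5 candidate LLMs in the pool

def select_llms(query_features, classifier, threshold=0.7, max_subset=3):
    """Pick a small, query-aware subset of LLMs using the classifier's scores."""
    scores = classifier.predict_proba(query_features.reshape(1, -1))[0]
    confident = np.where(scores >= threshold)[0]
    if confident.size == 0:
        return [int(np.argmax(scores))]  # fallback: single highest-scoring LLM
    ranked = confident[np.argsort(scores[confident])[::-1]]
    return [int(i) for i in ranked[:max_subset]]

print(select_llms(np.ones(16), DummyScorer()))
```

Only the LLMs in the returned subset are queried at inference time, which is where the latency savings over full-ensemble baselines come from.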
Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback
Large language models (LLMs) have demonstrated the ability to generate formative feedback and instructional hints in English, making them increasingly relevant for AI-assisted education. However, their ability to provide effective instructional support across different languages, especially for mathematically grounded reasoning tasks, remains largely unexamined. In this work, we present the first large-scale simulation of multilingual tutor–student interactions using LLMs. A stronger model plays the role of the tutor, generating feedback in the form of hints, while a weaker model simulates the student. We explore 352 experimental settings across 11 typologically diverse languages, four state-of-the-art LLMs, and multiple prompting strategies to assess whether language-specific feedback leads to measurable learning gains. Our study examines how student input language, teacher feedback language, model choice, and language resource level jointly influence performance. Results show that multilingual hints can significantly improve learning outcomes, particularly in low-resource languages when feedback is aligned with the student’s native language. These findings offer practical insights for developing multilingual, LLM-based educational tools that are both effective and inclusive.
@unpublished{tonga2025simulating,author={Tonga, Junior Cedric and Srivatsa, KV Aditya and Maurya, Kaushal Kumar and Koto, Fajri and Kochmar, Ekaterina},title={{Simulating LLM-to-LLM Tutoring for Multilingual Math Feedback}},year={2025},}
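An illustrative version of the tutor–student simulation loop described above. The two `ask_*` helpers are hypothetical stand-ins for calls to a stronger tutor model and a weaker student model, and the toy behaviour and learning-gain check are assumptions for illustration, not the paper's exact protocol:

```python
def ask_student(problem: str, hint: str | None = None) -> str:
    """Placeholder for a call to the weaker (student) LLM."""
    return "42" if hint else "17"  # toy behaviour: improves once hinted

def ask_tutor(problem: str, attempt: str, feedback_language: str) -> str:
    """Placeholder for a call to the stronger (tutor) LLM."""
    return f"[{feedback_language}] Check the addition in your last step."

def simulate_interaction(problem: str, answer: str, feedback_language: str) -> dict:
    first = ask_student(problem)                          # pre-feedback attempt
    hint = ask_tutor(problem, first, feedback_language)   # hint in target language
    second = ask_student(problem, hint=hint)              # post-feedback attempt
    return {"learned": first != answer and second == answer}

print(simulate_interaction("What is 20 + 22?", "42", "Swahili"))
```

Aggregating the `learned` flag over many problems per (student language, feedback language, model) setting yields the learning-gain comparisons reported in the study.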
AITutor-AssessmentKit: Open-Source Library to Measure Pedagogical Ability of AI Tutors in Educational Dialogues
We introduce the design and functionality of AITutor-AssessmentKit, the first open-source library to enable the assessment of the pedagogical abilities of AI tutors in educational dialogues. The library comprises three modular components: autoeval for automated evaluation, llmeval for Large Language Model (LLM)-based evaluation, and visualizer for the visualization and interpretation of evaluation scores. This unified framework: (i) evaluates AI tutor responses across eight comprehensive dimensions in the context of the student error remediation task in mathematics, and (ii) offers a pluggable and customizable interface for integrating models and LLM releases from the community. By providing an efficient, scalable alternative to costly and subjective human evaluations, AITutor-AssessmentKit facilitates on-the-fly assessment of AI tutors. It is open-source, available as a pip-installable package, and comes with comprehensive tutorials.
@unpublished{maurya2025aitutor,author={Maurya, Kaushal Kumar and Kochmar, Ekaterina},title={AITutor-AssessmentKit: Open-Source Library to Measure Pedagogical Ability of AI Tutors in Educational Dialogues},year={2025},}
Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors
@inproceedings{kochmar2025bea,author={Kochmar, Ekaterina and Maurya, Kaushal Kumar and Petukhova, Kseniia and Srivatsa, KV Aditya and Tack, Anaïs and Vasselli, Justin},title={Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors},booktitle={Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)},year={2025},address={Vienna, Austria},publisher={Association for Computational Linguistics},}
Can LLMs Reliably Simulate Real Students’ Abilities in Mathematics and Reading Comprehension?
@inproceedings{srivatsa2025llms,author={Srivatsa, KV Aditya and Maurya, Kaushal Kumar and Kochmar, Ekaterina},title={Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?},booktitle={Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)},year={2025},address={Vienna, Austria},publisher={Association for Computational Linguistics},}
Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems
The recent advancements in large language models (LLMs) have led to the development of intelligent tutoring systems (ITSs) that can provide personalized learning experiences. However, the evaluation of these systems remains a challenge, particularly in terms of their pedagogical effectiveness. In this paper, we propose a pedagogy-driven evaluation framework for generative AI-powered ITSs, focusing on the pedagogical dimensions of feedback and guidance provided by these systems. We introduce a set of metrics that assess the quality of feedback and guidance based on established pedagogical principles. Our framework is designed to be applicable across different domains and educational contexts, providing a comprehensive evaluation of the pedagogical value of ITSs. We demonstrate the effectiveness of our framework through a case study involving an LLM-powered ITS, highlighting its potential to enhance the evaluation process and inform the design of more effective educational technologies.
@inproceedings{maurya2025pedagogy,author={Maurya, Kaushal Kumar and Kochmar, Ekaterina},title={{Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems}},booktitle={Proceedings of the 26th International Conference on Artificial Intelligence in Education (AIED 2025), Blue Sky Track},year={2025},}
Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors
In this paper, we investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors and whether they demonstrate pedagogical abilities necessary for good AI tutoring in educational dialogues. Previous efforts towards evaluation have been limited to subjective protocols and benchmarks. To bridge this gap, we propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles, which is designed to assess the pedagogical value of LLM-powered AI tutor responses grounded in student mistakes or confusions in the mathematical domain. We release MRBench – a new evaluation benchmark containing 192 conversations and 1,596 responses from seven state-of-the-art LLM-based and human tutors, providing gold annotations for eight pedagogical dimensions. We assess the reliability of the popular Prometheus2 and Llama-3.1-8B LLMs as evaluators and analyze each tutor’s pedagogical abilities, highlighting which LLMs are good tutors and which ones are more suitable as question-answering systems. We believe that the presented taxonomy, benchmark, and human-annotated labels will streamline the evaluation process and help track the progress in AI tutors’ development.
@inproceedings{maurya-etal-2025-unifying,title={Unifying {AI} Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of {LLM}-Powered {AI} Tutors},author={Maurya, Kaushal Kumar and Srivatsa, Kv Aditya and Petukhova, Kseniia and Kochmar, Ekaterina},editor={Chiruzzo, Luis and Ritter, Alan and Wang, Lu},booktitle={Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)},month=apr,year={2025},address={Albuquerque, New Mexico},publisher={Association for Computational Linguistics},pages={1234--1251},isbn={979-8-89176-189-6}}
LLMs in Education: Novel Perspectives, Challenges, and Opportunities
The role of large language models (LLMs) in education is an increasing area of interest today, considering the new opportunities they offer for teaching, learning, and assessment. This cutting-edge tutorial provides an overview of the educational applications of NLP and the impact that the recent advances in LLMs have had on this field. We will discuss the key challenges and opportunities presented by LLMs, grounding them in the context of four major educational applications: reading, writing, and speaking skills, and intelligent tutoring systems (ITS). This COLING 2025 tutorial is designed for researchers and practitioners interested in the educational applications of NLP and the role LLMs have to play in this area. It is the first of its kind to address this timely topic.
@inproceedings{alhafni-etal-2025-llms,title={LLMs in Education: Novel Perspectives, Challenges, and Opportunities},author={Alhafni, Bashar and Vajjala, Sowmya and Bannò, Stefano and Maurya, Kaushal Kumar and Kochmar, Ekaterina},booktitle={Proceedings of the 31st International Conference on Computational Linguistics},month=jan,year={2025},address={Abu Dhabi, UAE},publisher={Association for Computational Linguistics},}
2024
Extending Generative NLP: Incorporating Diversity, Context, and Inclusivity in Neural Text Generation
Supervisor: Dr. Maunendra Sankar Desarkar, External Examiners: Prof. Ganesh Ramakrishnan (IITB) and Prof. Monojit Choudhury (MBZUAI), and Doctoral Review Committee Members: Prof. Vineeth Balasubramanian (IITH), Prof. J. Balasubramaniam (IITH), and Dr. P. K. Srijith (IITH)
Advancements in deep learning have yielded remarkable success in Natural Language Generation (NLG), driven by advancements in neural architectures and the availability of large datasets. However, the wide adoption of these NLG models for downstream tasks is often challenging, especially in scenarios such as applications requiring diverse text generation, limited context in data, and limited volume of task-specific labeled data. Diverse text generation necessitates a one-to-many setup, where the model generates multiple outputs that are semantically similar yet lexically diverse, all derived from a single input. In the limited context scenario, the model often generates unexpected output due to the lack of relevant context in the input text. The limited data scenario is a frequent and more challenging problem, particularly for low-resource languages (LRLs). Current NLP research has primarily focused on high-resource languages (HRLs), e.g., English, which benefit from computationally accessible large training data. Despite the exciting progress in HRLs, there are over 7,000 languages globally, and the majority lack the necessary resources to train modern deep neural networks. In fact, collecting labeled data for these LRLs is often prohibitively expensive or infeasible. The scarcity of task-specific labeled data is more pronounced for NLG tasks, which limits the extension of NLG technology to LRLs. In this thesis, we address the aforementioned challenges and extend NLG modeling to diverse text generation, limited context, and limited data (i.e., low-resource languages) scenarios. This thesis contains two parts. The first part addresses the diverse text generation and limited context issues. In particular, we have designed a semantic decoupling and multi-decoder-based approach to guide diverse text generation. Further, we explore the retrieval-augmented generation (RAG) type of modeling approach to augment relevant external context in deep neural networks to address limited context issues. The second part of the thesis is dedicated to extending NLG modeling to LRLs. Here, we focus on cross-lingual modeling - transferring supervision from HRLs to LRLs. Our primary focus is on zero-shot modeling for scalability. In particular, we first focus on well-formed zero-shot text generation in LRLs by mitigating the catastrophic forgetting problem. We achieve this through unsupervised adaptive training. Next, we propose a novel meta-learning-based approach to transfer more uniform cross-lingual supervision across multiple LRLs and NLG tasks. Finally, we extend NLG modeling for extremely low-resource languages (ELRLs) that lack parallel data, have no or limited monolingual data, and are absent in modern large multilingual pre-trained language models. To achieve this, we propose noise augmentation techniques inspired by surface-level lexical similarity between closely-related HRLs and ELRLs. These proposed modeling approaches successfully overcome the mentioned limitations and extend NLG modeling to benefit a wider population.
@phdthesis{maurya-phdthesis-2024,title={Extending Generative NLP: Incorporating Diversity, Context, and Inclusivity in Neural Text Generation},author={Maurya, Kaushal Kumar},month=feb,school={Indian Institute of Technology Hyderabad},year={2024},type={Ph.D. Dissertation},address={Hyderabad, India},}
CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages
We address the task of machine translation (MT) from an extremely low-resource language (ELRL) to English by leveraging cross-lingual transfer from a *closely-related* high-resource language (HRL). The development of an MT system for an ELRL is challenging because these languages typically lack parallel corpora and monolingual corpora, and their representations are absent from large multilingual language models. Many ELRLs share lexical similarities with some HRLs, which presents a novel modeling opportunity. However, existing subword-based neural MT models do not explicitly harness this lexical similarity, as they only implicitly align HRL and ELRL latent embedding spaces. To overcome this limitation, we propose a novel approach, CharSpan, based on character-span noise augmentation into the training data of the HRL. This serves as a regularization technique, making the model more robust to *lexical divergences* between the HRL and ELRL, thus facilitating effective cross-lingual transfer. Our method significantly outperformed strong baselines in zero-shot settings on closely related HRL and ELRL pairs from three diverse language families, emerging as the state-of-the-art model for ELRLs.
@inproceedings{maurya-etal-2024-charspan,title={CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages},author={Maurya, Kaushal Kumar and Kejriwal, Rahul and Desarkar, Maunendra and Kunchukuttan, Anoop},editor={Graham, Yvette and Purver, Matthew},booktitle={Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)},month=mar,year={2024},address={St. Julian{'}s, Malta},publisher={Association for Computational Linguistics},pages={294--310},}
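A minimal sketch of character-span noise augmentation on HRL training text, assuming spans are chosen uniformly at random and perturbed with simple delete/swap/repeat operations; the paper's actual span selection and noise operations may differ:

```python
import random

def charspan_noise(text: str, span_len: int = 3, n_spans: int = 2,
                   seed: int = 0) -> str:
    """Corrupt a few random character spans to mimic ELRL lexical variation."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(n_spans):
        if len(chars) <= span_len:
            break
        start = rng.randrange(0, len(chars) - span_len)
        op = rng.choice(["delete", "swap", "repeat"])
        if op == "delete":
            del chars[start:start + span_len]
        elif op == "swap":
            chars[start:start + span_len] = reversed(chars[start:start + span_len])
        else:  # repeat: duplicate the span in place
            chars[start:start] = chars[start:start + span_len]
    return "".join(chars)

print(charspan_noise("the cat sat on the mat"))
```

Training the HRL-to-English model on such noised source text acts as the regularizer described above, so lexically divergent ELRL inputs look less out-of-distribution at test time.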
DAC: Quantized Optimal Transport Reward-based Reinforcement Learning Approach to Detoxify Query Auto-Completion
Modern Query Auto-Completion (QAC) systems utilize natural language generation (NLG) using large language models (LLMs) to achieve remarkable performance. However, these systems are prone to generating biased and toxic completions due to inherent learning biases. Existing detoxification approaches exhibit two key limitations: (1) They primarily focus on mitigating toxicity for grammatically well-formed long sentences but struggle to adapt to the QAC task, where queries are short and structurally different (they include spelling errors, do not follow grammatical rules, and have relatively flexible word order). (2) These approaches often view detoxification through a binary lens where all text labeled as toxic is undesirable, and non-toxic is considered desirable. To address these limitations, we propose DAC, an intuitive and efficient reinforcement learning-based model to detoxify QAC. With DAC, we introduce an additional perspective of considering a third query class of addressable toxicity. These queries can encompass implicit toxicity, subjective toxicity, or non-toxic queries containing toxic words. We incorporate this three-class query behavior perspective into the proposed model through quantized optimal transport to learn distinctions and generate truly non-toxic completions. We evaluate toxicity levels in the completions generated by DAC across two real-world QAC datasets (Bing and AOL) using two classifiers: a publicly available generic classifier (Detoxify) and a search query-specific classifier, which we develop (TClassify). We find that DAC consistently outperforms all existing baselines on the Bing dataset and achieves competitive performance on the AOL dataset for query detoxification.
@inproceedings{maheswaran2024dac,title={DAC: Quantized Optimal Transport Reward-based Reinforcement Learning Approach to Detoxify Query Auto-Completion},author={Maheswaran, Aishwarya and Maurya, Kaushal Kumar and Gupta, Manish and Desarkar, Maunendra Sankar},booktitle={Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval},pages={608--618},year={2024},acmisbn={979-8-4007-0431-4/24/07},acmdoi={10.1145/3626772.3657779},}
DQAC: Detoxifying Query Auto-completion with Adapters
Recent Query Auto-completion (QAC) systems leverage natural language generation or pre-trained language models (PLMs) to demonstrate remarkable performance. However, these systems also suffer from biased and toxic completions. Efforts have been made to address language detoxification within PLMs using controllable text generation (CTG) techniques, involving training with non-toxic data and employing decoding-time approaches. As the completions for QAC systems are usually short, these existing CTG methods based on decoding and training are not directly transferable. Towards these concerns, we propose the first public QAC detoxification model, Detoxifying Query Auto-Completion (or DQAC), which utilizes adapters in a CTG framework. DQAC operates on latent representations with no additional overhead. It leverages two adapters for toxic and non-toxic cases. During inference, we fuse these representations in a controlled manner that guides the generation of query completions towards non-toxicity. We evaluate toxicity levels in the generated completions across two real-world datasets using two classifiers: a publicly available classifier (Detoxify) and a search query-specific classifier which we develop (QDETOXIFY). DQAC consistently outperforms all existing baselines and emerges as a state-of-the-art model providing high quality and low toxicity.
@inproceedings{maheswaran2024dqac,title={DQAC: Detoxifying Query Auto-completion with Adapters},author={Maheswaran, Aishwarya and Maurya, Kaushal Kumar and Gupta, Manish and Desarkar, Maunendra Sankar},booktitle={Pacific-Asia Conference on Knowledge Discovery and Data Mining},pages={108--120},year={2024},organization={Springer},}
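One plausible way to realize the controlled fusion of the two adapters' latent representations mentioned above; the weighting scheme, shapes, and `alpha` value here are purely illustrative assumptions, and the actual fusion mechanism in DQAC may differ:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = torch.randn(1, 8, 256)          # PLM latent states for a query prefix
adapter_toxic = nn.Linear(256, 256)      # adapter trained on toxic completions
adapter_nontoxic = nn.Linear(256, 256)   # adapter trained on non-toxic completions

# Controlled fusion: weight the two adapter outputs so that decoding is steered
# toward the non-toxic direction (alpha close to 1 favours non-toxicity).
alpha = 0.9
fused = hidden + alpha * adapter_nontoxic(hidden) - (1 - alpha) * adapter_toxic(hidden)
print(fused.shape)  # the fused states would replace `hidden` before decoding
```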
Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing
With the rapid development of LLMs, it is natural to ask how to harness their capabilities efficiently. In this paper, we explore whether it is feasible to direct each input query to a single most suitable LLM. To this end, we propose LLM routing for challenging reasoning tasks. Our extensive experiments suggest that such routing shows promise but is not feasible in all scenarios, so more robust approaches should be investigated to fill this gap.
@inproceedings{srivatsa-etal-2024-harnessing,title={Harnessing the Power of Multiple Minds: Lessons Learned from {LLM} Routing},author={Srivatsa, KV Aditya and Maurya, Kaushal Kumar and Kochmar, Ekaterina},editor={Tafreshi, Shabnam and Akula, Arjun and Sedoc, Jo{\~a}o and Drozd, Aleksandr and Rogers, Anna and Rumshisky, Anna},booktitle={Proceedings of the Fifth Workshop on Insights from Negative Results in NLP},month=jun,year={2024},address={Mexico City, Mexico},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2024.insights-1.15/},doi={10.18653/v1/2024.insights-1.15},pages={124--134},}
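A small sketch of query-to-LLM routing framed as supervised classification, assuming TF-IDF query features and a linear router; the paper explores richer features and models, and the toy queries and LLM names below are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: each query is labelled with the LLM that answered it best.
queries = ["integrate x^2 dx", "capital of France",
           "prove sqrt(2) is irrational", "translate hello to German"]
best_llm = ["math-llm", "general-llm", "math-llm", "general-llm"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(queries, best_llm)

print(router.predict(["derivative of sin(x)"]))  # route to the predicted best LLM
```

The routing question is exactly when such a classifier can be trusted to beat always calling the single strongest model, which is what the negative-results analysis above examines.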
2023
SelectNoise: Unsupervised Noise Injection to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages
In this work, we focus on the task of machine translation (MT) from extremely low-resource languages (ELRLs) to English. The unavailability of parallel data, lack of representation in large multilingual pre-trained models, and limited monolingual data hinder the development of MT systems for ELRLs. However, many ELRLs often share lexical similarities with high-resource languages (HRLs) due to factors such as dialectical variations, geographical proximity, and language structure. We utilize this property to improve cross-lingual signals from a closely related HRL to enable MT for ELRLs. Specifically, we propose a novel unsupervised approach, SelectNoise, based on selective candidate extraction and noise injection to generate noisy HRL training data. The noise injection acts as a regularizer, and the model trained with noisy data learns to handle lexical variations such as spelling, grammar, and vocabulary changes, leading to improved cross-lingual transfer to ELRLs. The selective candidates are extracted using BPE merge operations and edit operations, and noise injection is performed using greedy, top-p, and top-k sampling strategies. We evaluate the proposed model on 12 ELRLs from the FLORES-200 benchmark in a zero-shot setting across two language families. The proposed model outperformed all the strong baselines, demonstrating its efficacy. It has comparable performance with the supervised noise injection model. Our code and model are publicly available.
@inproceedings{brahma-etal-2023-selectnoise,title={{S}elect{N}oise: Unsupervised Noise Injection to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages},author={Brahma, Maharaj and Maurya, Kaushal Kumar and Desarkar, Maunendra},editor={Bouamor, Houda and Pino, Juan and Bali, Kalika},booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},month=dec,year={2023},address={Singapore},publisher={Association for Computational Linguistics},doi={10.18653/v1/2023.findings-emnlp.109},pages={1615--1629},}
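A sketch of the selective noise-injection step: scored replacement candidates (as if extracted from BPE merge and edit statistics) are sampled with a top-k strategy, one of the greedy/top-p/top-k variants mentioned above. The candidate table and scores are made-up assumptions:

```python
import random

# Hypothetical substitution candidates between the HRL and ELRL:
# source character -> [(replacement, score), ...]
candidates = {"v": [("w", 0.6), ("b", 0.3), ("f", 0.1)],
              "s": [("sh", 0.7), ("z", 0.3)]}

def topk_sample(options, k, rng):
    """Sample a replacement from the k highest-scoring candidates."""
    top = sorted(options, key=lambda x: -x[1])[:k]
    repls, scores = zip(*top)
    return rng.choices(repls, weights=scores)[0]

def inject_noise(word, p=0.5, k=2, seed=0):
    """Stochastically replace characters to produce noisy HRL training text."""
    rng = random.Random(seed)
    return "".join(topk_sample(candidates[ch], k, rng)
                   if ch in candidates and rng.random() < p else ch
                   for ch in word)

print(inject_noise("seven vowels"))
```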
Towards Low-resource Language Generation with Limited Supervision
We present a research narrative aimed at enabling language technology for multiple natural language generation (NLG) tasks in low-resource languages (LRLs). With approximately 7,000 languages spoken globally, many lack the resources required for model training. NLG applications for LRLs present two additional key challenges: (i) the scarcity of training data is more pronounced, and (ii) zero-shot modeling is a viable research direction for scalability; however, generating well-formed zero-shot text in target LRLs is challenging. Addressing these concerns, this narrative introduces three promising research explorations that serve as a step toward enabling language technology for many LRLs. These approaches make effective use of transfer learning and limited supervision techniques for modeling. Evaluations were conducted mostly in the zero-shot setting, enabling scalability. This research narrative is an ongoing doctoral thesis.
@inproceedings{maurya-desarkar-2023-towards,title={Towards Low-resource Language Generation with Limited Supervision},author={Maurya, Kaushal Kumar and Desarkar, Maunendra},editor={Elazar, Yanai and Ettinger, Allyson and Kassner, Nora and Ruder, Sebastian and A. Smith, Noah},booktitle={Proceedings of the Big Picture Workshop},month=dec,year={2023},address={Singapore},publisher={Association for Computational Linguistics},doi={10.18653/v1/2023.bigpicture-1.7},pages={80--92},}
Trie-NLG: Trie Context Augmentation to Improve Personalized Query Auto-completion for Short and Unseen Prefixes
Query auto-completion (QAC) aims at suggesting plausible completions for a given query prefix. Traditionally, QAC systems have leveraged tries curated from historical query logs to suggest the most popular completions. In this context, there are two specific scenarios that are difficult to handle for any QAC system: short prefixes (which are inherently ambiguous) and unseen prefixes. Recently, personalized Natural Language Generation (NLG) models have been proposed to leverage previous session queries as context for addressing these two challenges. However, such NLG models suffer from two drawbacks: (1) some of the previous session queries could be noisy and irrelevant to the user intent for the current prefix, and (2) NLG models cannot directly incorporate historical query popularity. This motivates us to propose a novel NLG model for QAC, Trie-NLG, which jointly leverages popularity signals from the trie and personalization signals from previous session queries. We train the Trie-NLG model by augmenting the prefix with rich context comprising recent session queries and top trie completions. This simple modeling approach overcomes the limitations of trie-based and NLG-based approaches and leads to state-of-the-art performance. We evaluate the Trie-NLG model using two large QAC datasets. On average, our model achieves substantial boosts of 57% and 14% in MRR over the popular trie-based lookup and the strong BART-based baseline methods, respectively.
@article{maurya2023trie,title={Trie-NLG: Trie Context Augmentation to Improve Personalized Query Auto-completion for Short and Unseen Prefixes},author={Maurya, Kaushal Kumar and Desarkar, Maunendra Sankar and Gupta, Manish and Agrawal, Puneet},journal={Data Mining and Knowledge Discovery (ECML-PKDD Journal Track)},volume={37},number={6},pages={2306--2329},year={2023},publisher={Springer},}
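A sketch of the context-augmented input construction described above: the prefix is concatenated with recent session queries and top trie completions before being fed to the NLG model. The separator token and the toy "trie" (a sorted query log with prefix lookup, ignoring popularity ordering) are assumptions for illustration:

```python
from bisect import bisect_left

query_log = sorted(["weather today", "weather tomorrow", "web mail login",
                    "weather radar", "webcam test"])

def top_trie_completions(prefix: str, k: int = 2) -> list[str]:
    """Return up to k logged queries starting with `prefix`
    (in practice these would be ranked by historical popularity)."""
    i = bisect_left(query_log, prefix)
    out = []
    while i < len(query_log) and query_log[i].startswith(prefix):
        out.append(query_log[i])
        i += 1
    return out[:k]

def build_model_input(prefix: str, session_queries: list[str]) -> str:
    completions = top_trie_completions(prefix)
    return " [SEP] ".join(session_queries + completions + [prefix])

print(build_model_input("wea", ["forecast this week"]))
```

For unseen prefixes the trie slot is simply empty and the model falls back on session context, which is how the single architecture covers both hard cases.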
DivHSK: Diverse Headline Generation using Self-Attention based Keyword Selection
Diverse headline generation is an NLP task where given a news article, the goal is to generate multiple headlines that are true to the content of the article but are different among themselves. This task aims to exhibit and exploit semantically similar one-to-many relationships between a source news article and multiple target headlines. Toward this, we propose a novel model called DIVHSK. It has two components: KEYSELECT for selecting the important keywords, and SEQGEN for finally generating the multiple diverse headlines. In KEYSELECT, we cluster the self-attention heads of the last layer of the pre-trained encoder and select the most-attentive theme and general keywords from the source article. Then, cluster-specific keyword sets guide SEQGEN, a pre-trained encoder-decoder model, to generate diverse yet semantically similar headlines. The proposed model consistently outperformed existing literature and our strong baselines and emerged as a state-of-the-art model. We have also created a high-quality multi-reference headline dataset from news articles.
@inproceedings{e-etal-2023-divhsk,title={DivHSK: Diverse Headline Generation using Self-Attention based Keyword Selection},author={E, Venkatesh and Maurya, Kaushal Kumar and Kumar, Deepak and Desarkar, Maunendra Sankar},editor={Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki},booktitle={Findings of the Association for Computational Linguistics: ACL 2023},month=jul,year={2023},address={Toronto, Canada},publisher={Association for Computational Linguistics},doi={10.18653/v1/2023.findings-acl.118},pages={1879--1891},}
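A sketch of the KEYSELECT idea: cluster last-layer self-attention heads by their attention distribution over the article tokens, then take each cluster's most-attended tokens as a keyword set. The shapes, the random attention matrix, and the number of clusters are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
tokens = ["storm", "hits", "coast", "power", "cuts", "rescue", "teams", "deployed"]
# Toy per-head attention mass over tokens: (n_heads, seq_len), rows sum to 1.
attn = rng.random((12, len(tokens)))
attn /= attn.sum(axis=1, keepdims=True)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(attn)
for c in range(2):
    mean_attn = attn[kmeans.labels_ == c].mean(axis=0)
    top = [tokens[i] for i in np.argsort(mean_attn)[::-1][:3]]
    print(f"keyword set {c}: {top}")  # each set guides one diverse headline
```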
2022
Meta-XNLG: A Meta-Learning Approach Based on Language Clustering for Zero-Shot Cross-Lingual Transfer and Generation
Recently, the NLP community has witnessed a rapid advancement in multilingual and cross-lingual transfer research where the supervision is transferred from high-resource languages (HRLs) to low-resource languages (LRLs). However, the cross-lingual transfer is not uniform across languages, particularly in the zero-shot setting. Towards this goal, one promising research direction is to learn shareable structures across multiple tasks with limited annotated data. The downstream multilingual applications may benefit from such a learning setup as most of the languages across the globe are low-resource and share some structures with other languages. In this paper, we propose a novel meta-learning framework (called Meta-XNLG) to learn shareable structures from typologically diverse languages based on meta-learning and language clustering. This is a step towards uniform cross-lingual transfer for unseen languages. We first cluster the languages based on language representations and identify the centroid language of each cluster. Then, a meta-learning algorithm is trained with all centroid languages and evaluated on the other languages in the zero-shot setting. We demonstrate the effectiveness of this modeling on two NLG tasks (Abstractive Text Summarization and Question Generation), 5 popular datasets and 30 typologically diverse languages. Consistent improvements over strong baselines demonstrate the efficacy of the proposed framework. The careful design of the model makes this end-to-end NLG setup less vulnerable to the accidental translation problem, which is a prominent concern in zero-shot cross-lingual NLG tasks.
@inproceedings{maurya-desarkar-2022-meta,title={Meta-XNLG: A Meta-Learning Approach Based on Language Clustering for Zero-Shot Cross-Lingual Transfer and Generation},author={Maurya, Kaushal Kumar and Desarkar, Maunendra},booktitle={Findings of the Association for Computational Linguistics: ACL 2022},month=may,year={2022},address={Dublin, Ireland},publisher={Association for Computational Linguistics},url={https://aclanthology.org/2022.findings-acl.24},doi={10.18653/v1/2022.findings-acl.24},pages={269--284},}
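A sketch of the language-clustering step: cluster language representations and take the language nearest each cluster centre as that cluster's centroid language, the ones used for meta-training. The random vectors below are stand-ins for real typological or learned language embeddings, and the cluster count is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

langs = ["hi", "bn", "ta", "ja", "ko", "es", "pt", "fr"]
rng = np.random.default_rng(0)
embeddings = rng.random((len(langs), 16))  # placeholder language representations

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings)
for c, centre in enumerate(kmeans.cluster_centers_):
    members = [i for i in range(len(langs)) if kmeans.labels_[i] == c]
    centroid = min(members, key=lambda i: np.linalg.norm(embeddings[i] - centre))
    others = [langs[i] for i in members if i != centroid]
    print(f"cluster {c}: meta-train on {langs[centroid]!r}, "
          f"evaluate zero-shot on {others}")
```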
Hostility Detection in Online Hindi-English Code-Mixed Conversations
Aditi Bagora, Kamal Shrestha, Kaushal Kumar Maurya and Maunendra Sankar Desarkar
With the rise in accessibility and popularity of various social media platforms, people have started expressing and communicating their ideas, opinions, and interests online. While these platforms are active sources of entertainment and idea-sharing, they also attract hostile and offensive content equally. Identification of hostile posts is an essential and challenging task. In particular, Hindi-English Code-Mixed online posts of conversational nature (which have a hierarchy of posts, comments, and replies) have escalated the challenges. There are two major challenges: (1) the complex structure of Code-Mixed text and (2) filtering the relevant previous context for a given utterance. To overcome these challenges, in this paper, we propose a novel hierarchical neural network architecture to identify hostile posts/comments/replies in online Hindi-English Code-Mixed conversations. We leverage large multilingual pre-trained (mLPT) models like mBERT, XLMR, and MuRIL. The mLPT models provide a rich representation of code-mix text and hierarchical modeling leads to a natural abstraction and selection of the relevant context. The proposed model consistently outperformed all the baselines and emerged as a state-of-the-art performing model. We conducted multiple analyses and ablation studies to prove the robustness of the proposed model.
@inproceedings{10.1145/3501247.3531579,author={Bagora, Aditi and Shrestha, Kamal and Maurya, Kaushal Kumar and Desarkar, Maunendra Sankar},title={Hostility Detection in Online Hindi-English Code-Mixed Conversations},year={2022},isbn={9781450391917},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3501247.3531579},doi={10.1145/3501247.3531579},booktitle={14th ACM Web Science Conference 2022},pages={390–400},numpages={11},location={Barcelona, Spain},series={WebSci '22},}
2021
ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation
Despite the recent advancement in NLP research, cross-lingual transfer for natural language generation is relatively understudied. In this work, we transfer supervision from a high-resource language (HRL) to multiple low-resource languages (LRLs) for natural language generation (NLG). We consider four NLG tasks (text summarization, question generation, news headline generation, and distractor generation) and three syntactically diverse languages, i.e., English, Hindi, and Japanese. We propose an unsupervised cross-lingual language generation framework (called ZmBART) that does not use any parallel or pseudo-parallel/back-translated data. In this framework, we further pre-train the mBART sequence-to-sequence denoising auto-encoder model with an auxiliary task using monolingual data of three languages. The objective function of the auxiliary task is close to the target tasks, which enriches the multilingual latent representation of mBART and provides good initialization for target tasks. Then, this model is fine-tuned with task-specific supervised English data and directly evaluated with low-resource languages in the zero-shot setting. To overcome catastrophic forgetting and spurious correlation issues, we applied model-component freezing and data augmentation approaches, respectively. This simple modeling approach gave us promising results. We experimented with few-shot training (with 1000 supervised data points), which boosted the model performance further. We performed several ablations and cross-lingual transferability analyses to demonstrate the robustness of ZmBART.
@inproceedings{maurya-etal-2021-zmbart,title={ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation},author={Maurya, Kaushal Kumar and Desarkar, Maunendra Sankar and Kano, Yoshinobu and Deepshikha, Kumari},booktitle={Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021},month=aug,year={2021},address={Online},publisher={Association for Computational Linguistics},doi={10.18653/v1/2021.findings-acl.248},pages={2804--2818},}
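A sketch of the component-freezing step used to curb catastrophic forgetting: selected parameter groups are frozen by name before fine-tuning on the supervised English data. The toy model and the choice of which prefixes to freeze are illustrative assumptions; ZmBART applies this idea to mBART's components:

```python
import torch.nn as nn

model = nn.ModuleDict({
    "embed": nn.Embedding(1000, 64),
    "encoder": nn.LSTM(64, 64, batch_first=True),
    "decoder": nn.LSTM(64, 64, batch_first=True),
})

def freeze_by_prefix(model: nn.Module, prefixes: tuple[str, ...]) -> None:
    """Exclude matching parameter groups from fine-tuning updates."""
    for name, param in model.named_parameters():
        if name.startswith(prefixes):
            param.requires_grad = False

freeze_by_prefix(model, ("embed", "decoder"))
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only encoder parameters remain trainable here
```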
Coarse and Fine-grained Hostility Detection in Hindi Posts using Fine-tuned Multilingual Embeddings
Arkadipta De, Venkatesh Elangovan, Kaushal Kumar Maurya and Maunendra Sankar Desarkar
Constraint workshop at AAAI 2021 (Shared Task Paper)
Due to the wide adoption of social media platforms like Facebook, Twitter, etc., there is an emerging need of detecting online posts that can go against the community acceptance standards. The hostility detection task has been well explored for resource-rich languages like English, but is unexplored for resource-constrained languages like Hindi due to the unavailability of large suitable data. We view this hostility detection as a multi-label multi-class classification problem. We propose an effective neural network-based technique for hostility detection in Hindi posts. We leverage pre-trained multilingual Bidirectional Encoder Representations from Transformers (mBERT) to obtain the contextual representations of Hindi posts. We have performed extensive experiments including different pre-processing techniques, pre-trained models, neural architectures, hybrid strategies, etc. Our best performing neural classifier model includes a One-vs-the-Rest approach where we obtained 92.60%, 81.14%, 69.59%, 75.29% and 73.01% F1 scores for hostile, fake, hate, offensive, and defamation labels respectively. The proposed model (https://github.com/Arko98/Hostility-Detection-in-Hindi-Constraint-2021) outperformed the existing baseline models and emerged as the state-of-the-art model for detecting hostility in the Hindi posts.
@inproceedings{de2021coarse,title={Coarse and Fine-grained Hostility Detection in Hindi Posts using Fine-tuned Multilingual Embeddings},author={De, Arkadipta and Elangovan, Venkatesh and Maurya, Kaushal Kumar and Desarkar, Maunendra Sankar},booktitle={Combating Online Hostile Posts in Regional Languages during Emergency Situation: First International Workshop, CONSTRAINT 2021, Collocated with AAAI 2021, Virtual Event, February 8, 2021, Revised Selected Papers 1},pages={201--212},year={2021},organization={Springer},}
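A sketch of the One-vs-the-Rest setup over post representations: one binary classifier per hostility label. Random vectors stand in for mBERT embeddings of Hindi posts, and the linear classifier is an illustrative simplification of the neural heads used in the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

labels = ["hostile", "fake", "hate", "offensive", "defamation"]
rng = np.random.default_rng(0)
X = rng.random((40, 32))                        # placeholder mBERT embeddings
Y = rng.integers(0, 2, size=(40, len(labels)))  # multi-label binary targets

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(rng.random((1, 32)))
print(dict(zip(labels, pred[0])))  # per-label hostility decisions for one post
```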
A Neural Approach for Detecting Inline Mathematical Expressions from Scientific Documents
Scientific documents generally contain multiple mathematical expressions in them. Detecting inline mathematical expressions is one of the most important and challenging tasks in scientific text mining. Recent works that detect inline mathematical expressions in scientific documents have looked at the problem from an image processing perspective. There is little work that has targeted the problem from an NLP perspective. Towards this, we define a set of features and apply Conditional Random Fields (CRF) to detect inline mathematical expressions in scientific documents. Apart from this feature-based approach, we also propose a hybrid algorithm that combines Bidirectional Long Short-Term Memory networks (Bi-LSTM) with the feature-based approach for this task. Experimental results suggest that this proposed hybrid method outperforms several baselines in the literature and also the individual methods in the hybrid approach.
@article{DBLP:journals/es/MadisettyMAD21,author={Madisetty, Sreekanth and Maurya, Kaushal Kumar and Aizawa, Akiko and Desarkar, Maunendra Sankar},title={A Neural Approach for Detecting Inline Mathematical Expressions from Scientific Documents},journal={Expert Syst. J. Knowl. Eng.},volume={38},number={4},year={2021},url={https://doi.org/10.1111/exsy.12576},doi={10.1111/exsy.12576},}
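A sketch of token-level features of the kind that could feed a CRF tagger for inline math detection, framed as BIO tagging over tokens. The exact feature set in the paper differs; these are illustrative, and the downstream CRF (e.g., sklearn-crfsuite) is not fitted here:

```python
import re

def token_features(tokens: list[str], i: int) -> dict:
    """Surface features hinting that a token belongs to a math expression."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "has_operator": bool(re.search(r"[=+\-*/^<>]", tok)),
        "is_single_letter": len(tok) == 1 and tok.isalpha(),  # likely a variable
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sent = "Let x + y = 7 where x is prime .".split()
feats = [token_features(sent, i) for i in range(len(sent))]
print(feats[1])  # features for 'x', to be paired with a B/I/O tag for training
```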
2020
Learning to Distract: A Hierarchical Multi-Decoder Network for Automated Generation of Long Distractors for Multiple-Choice Questions for Reading Comprehension
The task of generating incorrect options for multiple-choice questions is termed the distractor generation problem. The task requires high cognitive skills and is extremely challenging to automate. Existing neural approaches for the task leverage encoder-decoder architecture to generate long distractors. However, in this process two critical points are ignored: firstly, many methods use Jaccard similarity over a pool of candidate distractors to sample the distractors. This often makes the generated distractors too obvious or not relevant to the question context. Secondly, some approaches do not consider the answer in the model, which causes the generated distractors to be either answer-revealing or semantically equivalent to the answer. In this paper, we propose a novel Hierarchical Multi-Decoder Network (HMD-Net) consisting of one encoder and three decoders, where each decoder generates a single distractor. To overcome the first problem mentioned above, we include multiple decoders with a dissimilarity loss in the loss function. To address the second problem, we exploit richer interaction between the article, question, and answer with a SoftSel operation and a Gated Mechanism. This enables the generation of distractors that are in context with questions but semantically not equivalent to the answers. The proposed model outperformed all the previous approaches significantly in both automatic and manual evaluations. In addition, we also consider linguistic features and BERT contextual embeddings with our base model, which further push the model performance.
@inproceedings{10.1145/3340531.3411997,author={Maurya, Kaushal Kumar and Desarkar, Maunendra Sankar},year={2020},isbn={9781450368599},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3340531.3411997},doi={10.1145/3340531.3411997},booktitle={Proceedings of the 29th ACM International Conference on Information & Knowledge Management},pages={1115–1124},numpages={10},keywords={distractor generation, natural language generation, question-answering},location={Virtual Event, Ireland},series={CIKM '20},}
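A sketch of a dissimilarity term over the three decoders' outputs: pairwise similarity between decoder representations is penalised so that the generated distractors diverge from one another. The mean-pooled random states and the cosine form of the penalty are assumptions; the paper's exact loss may differ:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# One pooled representation per decoder: (n_decoders, hidden_dim)
decoder_states = torch.randn(3, 128)

def dissimilarity_loss(states: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity; minimising it pushes decoders apart."""
    loss = states.new_zeros(())
    n = states.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            loss = loss + F.cosine_similarity(states[i], states[j], dim=0)
    return loss / (n * (n - 1) / 2)

total = dissimilarity_loss(decoder_states)  # added to the generation losses
print(total.item())
```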
Machine Translation Evaluation: Manual versus Automatic—a Comparative Study
Kaushal Kumar Maurya, Renjith P Ravindran, Ch Ram Anirudh and Kavi Narayana Murthy
The quality of machine translation (MT) is best judged by humans well versed in both source and target languages. However, automatic techniques are often used as these are much faster, cheaper, and language independent. The goal of this paper is to check for correlation between manual and automatic evaluation, specifically in the context of Indian languages. To the extent automatic evaluation methods correlate with the manual evaluations, we can get the best of both worlds. In this paper, we perform a comparative study of automatic evaluation metrics (BLEU, NIST, METEOR, TER, and WER) against the manual evaluation metric of adequacy, for English-Hindi translation. We also attempt to estimate the manual evaluation score of a given MT output from its automatic evaluation score. The data for the study was sourced from the Workshop on Statistical Machine Translation (WMT14).
@inproceedings{maurya2020machine,title={Machine Translation Evaluation: Manual versus Automatic—a Comparative Study},author={Maurya, Kaushal Kumar and Ravindran, Renjith P and Anirudh, Ch Ram and Murthy, Kavi Narayana},booktitle={Data Engineering and Communication Technology: Proceedings of 3rd ICDECT-2K19},pages={541--553},year={2020},organization={Springer},}
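A minimal sketch of the correlation analysis: compare segment-level automatic scores with manual adequacy ratings using Pearson and Spearman coefficients. The scores below are made up for illustration; the study computes this over WMT14 English-Hindi data for each metric:

```python
from scipy.stats import pearsonr, spearmanr

bleu = [0.12, 0.34, 0.28, 0.45, 0.09, 0.51]  # toy segment-level BLEU scores
adequacy = [2, 4, 3, 4, 1, 5]                # manual adequacy ratings (1-5)

print("Pearson: ", pearsonr(bleu, adequacy)[0])
print("Spearman:", spearmanr(bleu, adequacy)[0])
```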