In this work, we focus on the task of machine translation (MT) from extremely low-resource languages (ELRLs) to English. The unavailability of parallel data, lack of representation in large multilingual pre-trained models, and limited monolingual data hinder the development of MT systems for ELRLs. However, many ELRLs share lexical similarities with high-resource languages (HRLs) due to factors such as dialectal variation, geographical proximity, and shared language structure. We utilize this property to transfer cross-lingual signals from closely related HRLs to enable MT for ELRLs. Specifically, we propose a novel unsupervised approach based on selective candidate extraction and noise injection to generate noisy HRL training data. The noise injection acts as a regularizer, and the trained model learns to handle lexical variations such as spelling, grammar, and vocabulary changes, leading to improved cross-lingual transfer to ELRLs. The selective candidates are extracted using BPE merge operations and edit operations, and noise injection is performed using greedy, top-p, and top-k sampling strategies. We evaluate the proposed model on 12 ELRLs from the FLORES-200 benchmark in a zero-shot setting, covering two families of closely related HRLs and LRLs. The proposed model significantly outperforms all the strong baselines, demonstrating its efficacy, and performs comparably to a supervised noise-injection model.
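For illustration only (not the paper's implementation): a minimal Python sketch of how noisy HRL data might be generated by replacing selected tokens with scored variants under greedy or top-k sampling. The candidate table, scores, and example tokens below are hypothetical; the actual system derives candidates from BPE merge and edit operations.

```python
import random

def greedy_sample(candidates, scores):
    """Always pick the single highest-scoring variant (greedy)."""
    return max(zip(candidates, scores), key=lambda x: x[1])[0]

def topk_sample(candidates, scores, k=3):
    """Pick a variant uniformly from the k highest-scoring candidates (top-k sampling)."""
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])[:k]
    return random.choice([c for c, _ in ranked])

def inject_noise(tokens, candidate_table, noise_prob=0.2, strategy="topk"):
    """Replace selected HRL tokens with noisy variants to mimic LRL lexical drift."""
    noisy = []
    for tok in tokens:
        variants = candidate_table.get(tok)
        if variants and random.random() < noise_prob:
            cands, scores = zip(*variants)
            tok = greedy_sample(cands, scores) if strategy == "greedy" else topk_sample(cands, scores)
        noisy.append(tok)
    return noisy

# Hypothetical candidate table: HRL token -> scored spelling variants (e.g. from edit operations)
table = {"paani": [("pani", 0.9), ("paanee", 0.6)], "ghar": [("ghor", 0.7)]}
print(inject_noise("paani ghar jao".split(), table, noise_prob=1.0))
```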
Unsupervised, Transfer-Learning, Meta-Learning
Towards Low-resource Language Generation with Limited Supervision
We present a research narrative aimed at enabling language technology for multiple natural language generation (NLG) tasks in low-resource languages (LRLs). With approximately 7000 languages spoken globally, many lack the resources required for model training. For NLG tasks, resource scarcity is even more pronounced, and modeling is more challenging because well-formed text needs to be generated in LRLs. Addressing these concerns, this narrative introduces three promising research explorations that serve as a step toward enabling language technology for multiple LRLs. These approaches make effective use of transfer learning and limited supervision techniques for model training. Evaluations were conducted mostly in the zero-shot setting, enabling scalability. This research narrative is an ongoing doctoral thesis.
Auto-Completion, Trie QAC, NLG Augmentation
Trie-NLG: Trie Context Augmentation to Improve Personalized Query Auto-Completion for Short and Unseen Prefixes
Kaushal Kumar Maurya, Maunendra Sankar Desarkar, Manish Gupta, and 1 more author
Query auto-completion (QAC) aims at suggesting plausible completions for a given query prefix. Traditionally, QAC systems have leveraged tries curated from historical query logs to suggest the most popular completions. In this context, there are two specific scenarios that are difficult to handle for any QAC system: short prefixes (which are inherently ambiguous) and unseen prefixes. Recently, personalized Natural Language Generation (NLG) models have been proposed to leverage previous session queries as context for addressing these two challenges. However, such NLG models suffer from two drawbacks: (1) some of the previous session queries could be noisy and irrelevant to the user intent for the current prefix, and (2) NLG models cannot directly incorporate historical query popularity. This motivates us to propose a novel NLG model for QAC, Trie-NLG, which jointly leverages popularity signals from the trie and personalization signals from previous session queries. We train the Trie-NLG model by augmenting the prefix with rich context comprising recent session queries and top trie completions. This simple modeling approach overcomes the limitations of trie-based and NLG-based approaches and leads to state-of-the-art performance. We evaluate the Trie-NLG model on two large QAC datasets. On average, our model achieves substantial boosts of 57% and 14% in MRR over the popular trie-based lookup and the strong BART-based baseline, respectively.
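A minimal sketch, not the authors' code, of the kind of prefix augmentation Trie-NLG performs: the prefix is concatenated with recent session queries and top trie completions before being passed to the seq2seq model. The separator token, limits, and example queries are assumptions.

```python
def augment_prefix(prefix, session_queries, trie_completions,
                   max_session=3, max_trie=5, sep=" [SEP] "):
    """Build the augmented input: recent session queries + top trie completions + the prefix."""
    context = session_queries[-max_session:] + trie_completions[:max_trie]
    return sep.join(context + [prefix])

# Hypothetical session and trie signals for the short prefix "goa w"
session = ["cheap flights to goa", "goa beach resorts"]
trie_top = ["goa weather", "goa water sports"]
print(augment_prefix("goa w", session, trie_top))
```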
Extremely LRLs, Neural MT, Lexical Similarity
Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages
Kaushal Kumar Maurya, Rahul Kejriwal, Maunendra Sankar Desarkar, and 1 more author
We address the task of machine translation from an extremely low-resource language (LRL) to English using cross-lingual transfer from a closely related high-resource language (HRL). For many of these languages, no parallel corpora are available, monolingual corpora are limited, and representations in pre-trained sequence-to-sequence models are absent. These factors limit the benefits of cross-lingual transfer from shared embedding spaces in multilingual models. However, many extremely low-resource languages have a high level of lexical similarity with related HRLs. We utilize this property by injecting character and character-span noise into the training data of the HRL prior to learning the vocabulary. This serves as a regularizer that makes the model more robust to lexical divergences between the HRL and LRL and better facilitates cross-lingual transfer. On closely related HRL and LRL pairs from multiple language families, we observe that our method significantly outperforms the baseline MT system as well as approaches proposed previously to address cross-lingual transfer between closely related languages. We also show that the proposed character-span noise injection performs better than unigram-character noise injection.
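The following is an illustrative sketch, not the released implementation, of unigram-character versus character-span noise injection applied to HRL text before vocabulary learning; the noise probabilities, operations, and example sentence are assumptions.

```python
import random

VOWELS = "aeiou"

def char_noise(word, p=0.15):
    """Unigram-character noise: randomly delete, duplicate, or substitute single characters."""
    out = []
    for ch in word:
        r = random.random()
        if r < p / 3:
            continue                           # deletion
        elif r < 2 * p / 3:
            out.append(ch * 2)                 # duplication
        elif r < p and ch in VOWELS:
            out.append(random.choice(VOWELS))  # vowel substitution
        else:
            out.append(ch)
    return "".join(out)

def span_noise(word, p=0.15, max_span=3):
    """Character-span noise: drop a short contiguous span instead of a single character."""
    if len(word) > max_span and random.random() < p:
        start = random.randrange(len(word) - max_span)
        length = random.randint(1, max_span)
        word = word[:start] + word[start + length:]
    return word

sentence = "the river flows towards the village"
print(" ".join(span_noise(char_noise(w)) for w in sentence.split()))
```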
NLG, Diverse Headlines, Self-attention
DIVHSK: Diverse Headline Generation using Self-Attention based Keyword Selection
Diverse headline generation is an NLP task where, given a news article, the goal is to generate multiple headlines that are true to the content of the article but different from one another. The task aims to exhibit and exploit the semantically similar one-to-many relationship between a source news article and multiple target headlines. Towards this, we propose a novel model called DivHSK. It has two components: KeySelect, for selecting the important keywords, and SeqGen, for finally generating the multiple diverse headlines. In KeySelect, we cluster the self-attention heads of the last layer of the pre-trained encoder and select the most-attended theme and general keywords from the source article. Then, cluster-specific keyword sets guide SeqGen, a pre-trained encoder-decoder model, to generate diverse yet semantically similar headlines. The proposed model consistently outperformed models from the existing literature as well as our strong baselines and emerged as a state-of-the-art model. Additionally, we have created a high-quality multi-reference headline dataset from news articles.
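A rough sketch, under assumed shapes and synthetic data, of the KeySelect idea: cluster the last-layer self-attention heads by their token-attention profiles and pick the most-attended tokens from each cluster. The clustering choice (KMeans), pooling, and toy attention tensor are illustrative, not DivHSK's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_keywords(attn, tokens, n_clusters=2, k=3):
    """attn: (num_heads, seq_len, seq_len) last-layer self-attention weights;
    tokens: the seq_len source tokens. Cluster heads by the attention each token
    receives under them, then return the top-k tokens per cluster."""
    head_profiles = attn.mean(axis=1)  # (num_heads, seq_len): attention received per token
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(head_profiles)
    keyword_sets = []
    for c in range(n_clusters):
        profile = head_profiles[labels == c].mean(axis=0)
        top = np.argsort(-profile)[:k]
        keyword_sets.append([tokens[i] for i in top])
    return keyword_sets

# Toy example with random "attention" over a 6-token article
rng = np.random.default_rng(0)
toks = ["storm", "hits", "coast", "schools", "shut", "today"]
print(select_keywords(rng.random((8, 6, 6)), toks))
```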
2022
Cross-Lingual, Meta-Learning, Typology
Meta-X_NLG: A Meta-Learning Approach Based on Language Clustering for Zero-Shot Cross-Lingual Transfer and Generation
Recently, the NLP community has witnessed rapid advancement in multilingual and cross-lingual transfer research, where supervision is transferred from high-resource languages (HRLs) to low-resource languages (LRLs). However, cross-lingual transfer is not uniform across languages, particularly in the zero-shot setting. To bridge this gap, one promising research direction is to learn shareable structures across multiple tasks with limited annotated data. Downstream multilingual applications may benefit from such a learning setup, as most of the languages across the globe are low-resource and share some structure with other languages. In this paper, we propose a novel meta-learning framework (called Meta-X_NLG) to learn shareable structures from typologically diverse languages based on meta-learning and language clustering. This is a step towards uniform cross-lingual transfer for unseen languages. We first cluster the languages based on language representations and identify the centroid language of each cluster. Then, a meta-learning algorithm is trained with all centroid languages and evaluated on the other languages in the zero-shot setting. We demonstrate the effectiveness of this modeling on two NLG tasks (abstractive text summarization and question generation), 5 popular datasets and 30 typologically diverse languages. Consistent improvements over strong baselines demonstrate the efficacy of the proposed framework. The careful design of the model makes this end-to-end NLG setup less vulnerable to the accidental translation problem, which is a prominent concern in zero-shot cross-lingual NLG tasks.
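A minimal sketch, assuming hypothetical language representation vectors, of the clustering-and-centroid step: languages are clustered and the language closest to each cluster centroid is returned as a meta-training language. The vectors, language codes, and number of clusters below are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def centroid_languages(lang_vecs, n_clusters=3):
    """lang_vecs: language code -> representation vector. Cluster the languages and
    return, per cluster, the member language closest to the cluster centroid."""
    codes = list(lang_vecs)
    X = np.stack([lang_vecs[c] for c in codes])
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    picks = []
    for c in range(n_clusters):
        members = [i for i, lab in enumerate(km.labels_) if lab == c]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        picks.append(codes[members[int(np.argmin(dists))]])
    return picks  # meta-train on these; evaluate the remaining languages zero-shot

# Hypothetical 4-dimensional language representations for six languages
rng = np.random.default_rng(1)
langs = {code: rng.random(4) for code in ["hi", "bn", "ta", "es", "pt", "ja"]}
print(centroid_languages(langs))
```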
Hostility, Code-mix NLP
Hostility Detection in Online Hindi-English Code-Mixed Conversations
With the rise in accessibility and popularity of various social media platforms, people have started expressing and communicating their ideas, opinions, and interests online. While these platforms are active sources of entertainment and idea-sharing, they equally attract hostile and offensive content. Identification of hostile posts is an essential and challenging task. In particular, Hindi-English code-mixed online posts of a conversational nature (which have a hierarchy of posts, comments, and replies) escalate the challenges. There are two major challenges: (1) the complex structure of code-mixed text and (2) filtering the relevant previous context for a given utterance. To overcome these challenges, in this paper, we propose a novel hierarchical neural network architecture to identify hostile posts/comments/replies in online Hindi-English code-mixed conversations. We leverage large multilingual pre-trained (mLPT) models like mBERT, XLM-R, and MuRIL. The mLPT models provide a rich representation of code-mixed text, and hierarchical modeling leads to a natural abstraction and selection of the relevant context. The proposed model consistently outperformed all the baselines and emerged as the state-of-the-art performing model. We conducted multiple analyses and ablation studies to demonstrate the robustness of the proposed model.
2021
Cross-Lingual, Unsupervised, Transfer-Learning
ZmBART: An Unsupervised Cross-lingual Transfer Framework for Language Generation
Kaushal Kumar Maurya, Maunendra Sankar Desarkar, Yoshinobu Kano, and 1 more author
In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Aug 2021
Despite the recent advancement in NLP research, cross-lingual transfer for natural language generation is relatively understudied. In this work, we transfer supervision from a high-resource language (HRL) to multiple low-resource languages (LRLs) for natural language generation (NLG). We consider four NLG tasks (text summarization, question generation, news headline generation, and distractor generation) and three syntactically diverse languages, i.e., English, Hindi, and Japanese. We propose an unsupervised cross-lingual language generation framework (called ZmBART) that does not use any parallel or pseudo-parallel/back-translated data. In this framework, we further pre-train the mBART sequence-to-sequence denoising auto-encoder model with an auxiliary task using monolingual data of the three languages. The objective function of the auxiliary task is close to the target tasks, which enriches the multilingual latent representation of mBART and provides good initialization for the target tasks. Then, this model is fine-tuned with task-specific supervised English data and directly evaluated on the low-resource languages in the zero-shot setting. To overcome catastrophic forgetting and spurious correlation issues, we applied model-component freezing and data augmentation approaches, respectively. This simple modeling approach gave us promising results. We experimented with few-shot training (with 1000 supervised data points), which boosted the model performance further. We performed several ablations and a cross-lingual transferability analysis to demonstrate the robustness of ZmBART.
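An illustrative sketch of the component-freezing idea using a Hugging Face mBART checkpoint; which components ZmBART actually freezes is not restated here, so the frozen parts below (shared embeddings and lower encoder layers) are assumptions, not the paper's recipe.

```python
from transformers import MBartForConditionalGeneration

# Checkpoint name and the choice of frozen components are illustrative assumptions,
# not necessarily those used in ZmBART.
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

# Freeze the shared token embeddings so English-only task fine-tuning cannot
# overwrite multilingual lexical knowledge (one way to limit catastrophic forgetting).
for p in model.model.shared.parameters():
    p.requires_grad = False

# Optionally freeze the lower encoder layers as well.
for layer in model.model.encoder.layers[:6]:
    for p in layer.parameters():
        p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```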
Hostility, Multilingual
Coarse and Fine-Grained Hostility Detection in Hindi Posts using Fine-Tuned Multilingual Embeddings
Due to the wide adoption of social media platforms like Facebook, Twitter, etc., there is an emerging need to detect online posts that go against community acceptance standards. The hostility detection task has been well explored for resource-rich languages like English, but is largely unexplored for resource-constrained languages like Hindi due to the unavailability of large suitable datasets. We view hostility detection as a multi-label multi-class classification problem. We propose an effective neural network-based technique for hostility detection in Hindi posts. We leverage pre-trained multilingual Bidirectional Encoder Representations from Transformers (mBERT) to obtain contextual representations of Hindi posts. We performed extensive experiments covering different pre-processing techniques, pre-trained models, neural architectures, hybrid strategies, etc. Our best performing neural classifier uses a One-vs-the-Rest approach and obtains 92.60%, 81.14%, 69.59%, 75.29% and 73.01% F1 scores for the hostile, fake, hate, offensive, and defamation labels, respectively. The proposed model (https://github.com/Arko98/Hostility-Detection-in-Hindi-Constraint-2021) outperformed the existing baseline models and emerged as the state-of-the-art model for detecting hostility in Hindi posts.
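A minimal sketch, with synthetic features standing in for pooled mBERT representations, of a One-vs-the-Rest multi-label classifier over the five hostility labels; the feature dimensionality and base classifier are assumptions.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# X stands in for pooled mBERT post representations; Y is a multi-hot matrix over
# the labels {hostile, fake, hate, offensive, defamation}. Both are synthetic here.
rng = np.random.default_rng(0)
X = rng.random((200, 768))
Y = rng.integers(0, 2, size=(200, 5))

# One-vs-the-Rest: an independent binary classifier per label.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(clf.predict(X[:3]))
```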
Token-Classification, Math Document
A neural approach for detecting inline mathematical expressions from scientific documents
Scientific documents generally contain multiple mathematical expressions. Detecting inline mathematical expressions is one of the most important and challenging tasks in scientific text mining. Recent works that detect inline mathematical expressions in scientific documents have looked at the problem from an image processing perspective, and little work has targeted the problem from an NLP perspective. Towards this, we define a set of features and apply Conditional Random Fields (CRF) to detect inline mathematical expressions in scientific documents. Apart from this feature-based approach, we also propose a hybrid algorithm that combines Bidirectional Long Short-Term Memory networks (Bi-LSTM) with the feature-based approach for this task. Experimental results suggest that the proposed hybrid method outperforms several baselines in the literature as well as the individual methods that compose the hybrid approach.
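For illustration, a sketch of hand-crafted token features of the kind a CRF-based inline-math detector might use; the exact feature set, tokenization, and example sentence are assumptions rather than the paper's features.

```python
import re

def token_features(tokens, i):
    """Hand-crafted features for token i; an illustrative feature set, not the paper's."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "has_operator": bool(re.search(r"[=+\-*/^<>]", tok)),
        "is_single_letter": len(tok) == 1 and tok.isalpha(),
        "has_backslash": "\\" in tok,  # crude cue for residual LaTeX commands
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

sent = "the error e = y - f(x) decreases with n".split()
features = [token_features(sent, i) for i in range(len(sent))]
# These per-token dicts can be fed to a linear-chain CRF, or concatenated with
# Bi-LSTM outputs in a hybrid variant.
print(features[2])
```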
2020
Q&A, MCQMulti-DecoderLSTM
Learning to Distract: A Hierarchical Multi-Decoder Network for Automated Generation of Long Distractors for Multiple-Choice Questions for Reading Comprehension
The task of generating incorrect options for multiple-choice questions is termed the distractor generation problem. The task requires high cognitive skill and is extremely challenging to automate. Existing neural approaches for the task leverage encoder-decoder architectures to generate long distractors. However, in this process two critical points are ignored. First, many methods use Jaccard similarity over a pool of candidate distractors to sample the distractors; this often makes the generated distractors too obvious or not relevant to the question context. Second, some approaches do not consider the answer in the model, which causes the generated distractors to be either answer-revealing or semantically equivalent to the answer. In this paper, we propose a novel Hierarchical Multi-Decoder Network (HMD-Net) consisting of one encoder and three decoders, where each decoder generates a single distractor. To overcome the first problem, we include multiple decoders with a dissimilarity loss in the loss function. To address the second problem, we exploit richer interaction between the article, question, and answer with a SoftSel operation and a Gated Mechanism. This enables the generation of distractors that are in context with the question but semantically not equivalent to the answer. The proposed model outperformed all previous approaches significantly in both automatic and manual evaluations. In addition, we also combine linguistic features and BERT contextual embeddings with our base model, which further pushes the model performance.
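A minimal PyTorch sketch of one plausible form of a pairwise dissimilarity term over the three decoders' pooled representations; the cosine formulation, pooling, and weighting are assumptions, not necessarily HMD-Net's exact loss.

```python
import torch
import torch.nn.functional as F

def dissimilarity_loss(decoder_states):
    """Average pairwise cosine similarity between pooled decoder outputs;
    adding it to the generation loss pushes the three decoders apart.
    decoder_states: list of (batch, hidden) tensors."""
    total, pairs = 0.0, 0
    n = len(decoder_states)
    for i in range(n):
        for j in range(i + 1, n):
            total = total + F.cosine_similarity(decoder_states[i], decoder_states[j], dim=-1).mean()
            pairs += 1
    return total / pairs

# Toy check with three random "decoder" representations
states = [torch.randn(4, 256) for _ in range(3)]
print(dissimilarity_loss(states).item())
```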
MT Evaluation, Correlation study
Machine translation evaluation: Manual versus automatic—a comparative study
The quality of machine translation (MT) is best judged by humans well versed in both the source and target languages. However, automatic techniques are often used as these are much faster, cheaper and language independent. The goal of this paper is to check for correlation between manual and automatic evaluation, specifically in the context of Indian languages. To the extent that automatic evaluation methods correlate with manual evaluations, we can get the best of both worlds. In this paper, we perform a comparative study of automatic evaluation metrics (BLEU, NIST, METEOR, TER and WER) against the manual evaluation metric (adequacy), for English-Hindi translation. We also attempt to estimate the manual evaluation score of a given MT output from its automatic evaluation score. The data for the study was sourced from the Workshop on Statistical Machine Translation, WMT14.
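As a worked illustration (with made-up scores, not the WMT14 data), sentence-level correlation between an automatic metric and manual adequacy can be computed as follows.

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-sentence scores: an automatic metric (e.g. BLEU) vs. manual adequacy (1-5)
bleu = [0.12, 0.30, 0.25, 0.41, 0.08, 0.55, 0.33]
adequacy = [2, 4, 3, 4, 1, 5, 3]

r, p = pearsonr(bleu, adequacy)
rho, p_s = spearmanr(bleu, adequacy)
print(f"Pearson r = {r:.3f} (p = {p:.3f}); Spearman rho = {rho:.3f} (p = {p_s:.3f})")
```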