An Empirical Comparison of LLM Prompting Strategies for Cross-Lingual Natural Language Inference

Author: Naishadham Radha Sri Keerthi
Mentor: Dr. Paramveer Dhillon
KL University

1. Abstract

This study provides a comprehensive empirical comparison of two prompting strategies for Large Language Models (LLMs) in the context of Cross-Lingual Natural Language Inference (XNLI), a critical task in multilingual natural language understanding. XNLI involves determining whether a hypothesis sentence is entailed by, contradicts, or is neutral with respect to a given premise, across multiple languages. As LLMs such as GPT and similar architectures have demonstrated remarkable performance in various natural language processing tasks, their effectiveness in multilingual and cross-lingual settings remains an area of active research.

We systematically explore different prompting strategies, including zero-shot, few-shot, and translation-based prompts, across two languages that differ in resource availability and linguistic structure. Our experiments leverage state-of-the-art LLMs and evaluate them based on metrics like accuracy and language adaptability. In particular, we examine how well LLMs generalize to languages that are underrepresented in their training data.

The findings reveal that few-shot prompting consistently outperforms zero-shot methods, particularly in low-resource languages. Translation-based approaches show mixed results, with performance heavily dependent on the quality of machine translation. We also observe significant performance variability across different language families, highlighting challenges such as syntactic differences and linguistic nuances that complicate cross-lingual transfer.

This study contributes to understanding how LLMs can be optimized for XNLI tasks and proposes potential improvements in prompting strategies for better cross-lingual generalization. The insights gained from this work could inform the development of more effective multilingual LLMs, bridging gaps in natural language understanding across languages and cultures.

2. Introduction

Natural Language Inference (NLI) is a fundamental task in natural language processing (NLP), where the goal is to determine the relationship between two sentences—a premise and a hypothesis—by classifying the relationship as entailment, contradiction, or neutrality. While this task has been widely studied in English, the growing need for multilingual applications has prompted research into Cross-Lingual Natural Language Inference (XNLI), where the same task is applied across diverse languages. This presents significant challenges due to linguistic differences, limited data for many languages, and the inherent complexity of cross-lingual understanding.

Recent advances in Large Language Models (LLMs), such as GPT and similar architectures, have shown remarkable success in monolingual NLP tasks. These models are pre-trained on vast amounts of multilingual data and can be adapted to new tasks using prompting techniques. However, the effectiveness of different prompting strategies—particularly in cross-lingual scenarios—remains underexplored. Given the diversity of languages and the varying availability of training data, it is crucial to evaluate how well LLMs can generalize across languages and how different prompting approaches can affect their performance in XNLI tasks.

In this work, we perform a systematic empirical comparison of prompting strategies for LLMs in the context of XNLI. Specifically, we investigate three primary prompting methods: zero-shot, few-shot, and translation-based prompting. Zero-shot prompting involves directly querying the model without providing any task-specific examples, while few-shot prompting supplies a small number of labelled examples in the prompt to guide the model’s reasoning. Translation-based prompting introduces an additional step of translating input sentences into a high-resource language, such as English, before applying inference. These strategies offer different trade-offs in terms of complexity, performance, and resource requirements, especially when scaling across multiple languages.

Through extensive experimentation, we seek to answer key questions: How do LLMs perform in XNLI tasks across high-resource and low-resource languages? Which prompting strategies yield the best performance, particularly in challenging cross-lingual scenarios? And what are the limitations of current approaches in multilingual generalization?

Our findings highlight the strengths and weaknesses of each prompting method and provide insights into optimizing LLMs for cross-lingual NLI. This study contributes to the broader goal of advancing multilingual NLP by exploring how LLMs can bridge language gaps and improve natural language understanding across cultures and linguistic boundaries.

3. Related Work

Overview of Cross-Lingual Transfer in NLP

Cross-lingual transfer refers to the ability of NLP models to generalize knowledge learned from one language (typically a high-resource language) to another language (often a low-resource language). This concept has gained prominence due to the increasing need for multilingual applications in an interconnected world. Key developments in this area include:

Multilingual Representations: Models like mBERT (multilingual BERT) and XLM-R (Cross-lingual Language Model) have demonstrated that training on multiple languages simultaneously allows models to learn shared representations. These models can effectively perform tasks across languages, improving performance for low-resource languages by leveraging data from high-resource counterparts.

Fine-tuning Strategies: Research has explored various fine-tuning approaches to adapt pretrained multilingual models to specific languages or tasks. Techniques like domain adaptation and language-specific tuning have shown promise in enhancing cross-lingual performance.

Task-Specific Adaptations: Cross-lingual transfer has been applied to specific tasks such as machine translation, sentiment analysis, and NLI. Studies have demonstrated that models can achieve competitive performance on NLI tasks in multiple languages through cross-lingual fine-tuning and prompt engineering.

Previous work on Prompting Strategies for Multilingual Models

Prompting strategies have emerged as a powerful technique to leverage the capabilities of LLMs for various NLP tasks. Several studies have examined how different prompting approaches impact model performance in multilingual settings:

Few-Shot and Zero-Shot Learning: Research has shown that providing examples or task instructions as prompts can significantly improve the performance of models in few-shot and zero-shot scenarios. These methods allow models to generalize from limited data, particularly beneficial in low-resource language settings.

Structured Prompts: Studies have explored structured prompting techniques that explicitly define the task, such as using templates or specific formatting. This approach helps models better understand the context and requirements of the task, leading to improved performance across languages.

Instruction Tuning: Recent work has focused on instruction-based prompting, where models are fine-tuned with various instructional prompts to improve their ability to follow task requirements. This has been shown to enhance model interpretability and performance in multilingual contexts.

Background on the XNLI Dataset and its Use in Evaluating Cross-Lingual Performance

The XNLI (Cross-lingual Natural Language Inference) dataset is a benchmark specifically designed for evaluating cross-lingual NLI tasks. Key aspects of the dataset include:

Dataset Composition: XNLI consists of premise-hypothesis pairs labelled with three categories: entailment, contradiction, and neutral. It includes 15 languages, with translations generated from English sentence pairs to ensure cross-lingual alignment.

Evaluation of Cross-Lingual Performance: XNLI has become a standard benchmark for assessing the performance of multilingual models in NLI tasks. Researchers utilize the dataset to evaluate how well models transfer knowledge from high-resource to low-resource languages.

Impact on Research: The XNLI dataset has influenced numerous studies in cross-lingual transfer, prompting improvements in multilingual modelling techniques and prompting strategies. Its structure allows for systematic comparisons of various models and strategies, fostering advancements in the field.

4. Methodology

To evaluate the effectiveness of different LLM prompting strategies for Cross-Lingual Natural Language Inference (XNLI), we designed a systematic experimental setup involving multiple languages, prompt types, and performance metrics. Our methodology comprises four key components: data selection, model selection, prompting strategies, and evaluation criteria.

We use the XNLI dataset, a widely adopted benchmark for cross-lingual NLI, which contains translated premise-hypothesis pairs across 15 languages, including English, French, Chinese, and Arabic. Because the premise-hypothesis relations are held constant across translations while the linguistic structures vary, the dataset is well suited to evaluating the cross-lingual capabilities of LLMs. We considered several state-of-the-art LLMs, such as GPT-4, GPT-3.5, and multilingual models like XLM-RoBERTa, and ultimately conducted our experiments with GPT-4, chosen for its multilingual support and strong performance across diverse NLP tasks.

Each prompting strategy is designed to probe the model's ability to generalize across languages with varying levels of guidance and context. For consistency, we use the same prompt templates across languages, adjusting only the language content where necessary. Together, these components provide a comprehensive approach to understanding how prompting strategies influence LLM performance on XNLI and help identify the most effective and efficient methods for real-world multilingual applications.

4.1 Description of the XNLI Dataset

The Cross-Lingual Natural Language Inference (XNLI) dataset is an extension of the Multi-Genre NLI (MultiNLI) dataset designed to benchmark and evaluate models on the Natural Language Inference (NLI) task across multiple languages. It serves as a multilingual benchmark for evaluating how well models, typically trained or prompted in a high-resource language, perform the inference task in other languages.

Key Features of the XNLI Dataset:

Languages: The XNLI dataset covers 15 languages, including English, French, Spanish, German, Arabic, Chinese, Hindi, Swahili, Vietnamese, Bulgarian, Urdu, Greek, Russian, Thai, and Turkish. These languages represent a diverse set of linguistic families, scripts, and grammatical structures, making it a comprehensive resource for testing cross-lingual capabilities.

Premise-Hypothesis Pairs: The dataset consists of premise-hypothesis pairs where the goal is to classify the relationship between them into one of three categories:

Entailment: The hypothesis logically follows from the premise.

Contradiction: The hypothesis contradicts the premise.

Neutral: The hypothesis is neither entailed by nor contradicts the premise.

Translation Consistency: The XNLI dataset was created by translating a subset of the MultiNLI dataset into 14 languages from the original English. Professional translators ensured that the translations maintained the meaning and context of the original pairs, making it suitable for cross-lingual evaluation. Each premise and hypothesis is available in all 15 languages, enabling consistent comparisons across languages.

Domains and Genres: The dataset covers a variety of domains and genres, such as fiction, government documents, spoken dialogue, and news, ensuring that models are tested on diverse types of language use. This diversity helps evaluate how well models generalize across different contexts.

Dataset Size: XNLI provides 7,500 human-translated premise-hypothesis pairs per language, split into a test set of 5,010 examples and a development set of 2,490 examples, allowing for model tuning and validation before final evaluation.

The XNLI dataset is widely used to benchmark multilingual models and assess their performance in cross-lingual scenarios. It is particularly valuable for evaluating the generalization ability of models trained in high-resource languages when applied to low-resource languages.
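For reference, the snippet below is a minimal sketch of how a per-language split of XNLI can be loaded programmatically. It assumes the publicly available Hugging Face datasets release of XNLI; the language config name and the integer-to-label mapping are assumptions based on that release, not details from this study.

```python
# A minimal sketch of loading one XNLI language with the Hugging Face
# `datasets` library. The config name "hi" and the integer-to-label mapping
# below are assumptions based on the public Hugging Face XNLI release.
from datasets import load_dataset

LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}

xnli_hi = load_dataset("xnli", "hi", split="test")
example = xnli_hi[0]
print(example["premise"])
print(example["hypothesis"])
print(LABELS[example["label"]])
```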

Selection Criteria for the Examples Used in the Study 

Language Coverage

We select examples across two of the 15 languages available in the XNLI dataset, representing a mix of high-resource (e.g., English) and low-resource languages (e.g., Hindi).

Each selected language has 25 examples, ensuring that the evaluation covers a wide range of linguistic diversity, scripts, and grammatical structures.

Balanced NLI Classes

The 25 examples selected per language include a roughly balanced representation of the three NLI classes (entailment, contradiction, and neutral). Specifically:

Entailment: 8 examples

Contradiction: 9 examples

Neutral: 8 examples

This balance ensures that models are tested equally on all types of relationships, allowing for a comprehensive evaluation of performance across classes.

Domain and Genre Diversity

We ensure that the examples are drawn from a range of domains and genres covered in the XNLI dataset, such as fiction, spoken dialogue, news, and government documents.

Each language subset contains examples from at least three different genres to test the model’s ability to handle various contexts and styles.

Complexity and Variability

We select examples with varying levels of linguistic complexity (e.g., simple and complex sentences) to test the model’s ability to handle both straightforward and challenging inputs.

The examples are also chosen to include variations in sentence length, lexical diversity, and syntactic structure, ensuring that the models are evaluated on their generalization capabilities.

Cross-lingual Consistency

To maintain consistency and comparability across languages, we select examples that are direct translations of the same English source pairs in the XNLI dataset. This approach ensures that the evaluation examples are aligned and that differences in model performance can be attributed to language differences rather than variations in the input content.
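To make this selection procedure concrete, the sketch below illustrates the label-balancing step under the same Hugging Face XNLI assumption as above; the 8/9/8 class targets follow the split described earlier, while the genre and complexity filters are omitted for brevity. Because XNLI rows are aligned translations of the same English source pairs, reusing the same sampled indices across language configs yields the parallel subsets required for cross-lingual consistency.

```python
# A minimal sketch of the label-balanced selection described above, assuming
# the Hugging Face `datasets` XNLI loader. Genre/complexity filters are
# omitted; a fixed seed keeps the subset reproducible.
import random
from collections import defaultdict
from datasets import load_dataset

LABELS = {0: "entailment", 1: "neutral", 2: "contradiction"}
TARGET_PER_CLASS = {"entailment": 8, "contradiction": 9, "neutral": 8}

def select_balanced_indices(language: str, seed: int = 42) -> list[int]:
    data = load_dataset("xnli", language, split="test")
    by_class = defaultdict(list)
    for idx, ex in enumerate(data):
        by_class[LABELS[ex["label"]]].append(idx)
    rng = random.Random(seed)
    chosen = []
    for cls, n in TARGET_PER_CLASS.items():
        chosen.extend(rng.sample(by_class[cls], n))
    return sorted(chosen)

# XNLI rows are aligned translations of the same English pairs, so the same
# indices give parallel English and Hindi subsets.
indices = select_balanced_indices("en")
english_subset = load_dataset("xnli", "en", split="test").select(indices)
hindi_subset = load_dataset("xnli", "hi", split="test").select(indices)
```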

4.2 Prompting Strategies

Detailed Explanation of X-InSTA and In-CLT

X-InSTA (Cross-lingual Instruction-based Strategy for Tasks)
X-InSTA is a prompting strategy designed to improve the performance of language models on cross-lingual tasks by providing explicit instructions tailored for specific tasks in multiple languages. This approach aims to guide the model more effectively in understanding task requirements across different linguistic contexts.

Key Features:

Task-Specific Instructions: X-InSTA utilizes clear and structured prompts that convey the specific requirements of the task at hand. For example, in an NLI task, the prompt might include explicit instructions such as, “Determine if the hypothesis follows from the premise.”

Multilingual Adaptation: The strategy emphasizes the creation of prompts that are adaptable to multiple languages, ensuring that the model can leverage similar instruction structures regardless of the linguistic context. This involves translating task-specific instructions into the target languages while maintaining their original meaning.

Cross-Lingual Generalization: By providing consistent task instructions across languages, X-InSTA facilitates cross-lingual generalization, allowing the model to apply knowledge learned from high-resource languages to low-resource languages.

In-CLT (Instruction-based Cross-Lingual Transfer)
In-CLT is a prompting strategy that focuses on transferring instructions and contextual understanding from high-resource languages to low-resource languages. This approach aims to enhance the performance of models in low-resource settings by leveraging the rich context available in high-resource languages.

Key Features:

Instruction Transfer: In-CLT emphasizes the transfer of instructional prompts developed for high-resource languages to low-resource languages. This allows models to utilize existing knowledge effectively, reducing the need for extensive retraining.

Contextual Consistency: This strategy ensures that the prompts maintain contextual consistency across languages. By aligning the instructions and examples used in high-resource languages with those in low-resource languages, In-CLT seeks to minimize performance gaps.

Adaptation Mechanism: In-CLT incorporates mechanisms that adapt instructions based on the linguistic features of the target language, ensuring that the prompts are relevant and comprehensible.

Examples of the Above Two Strategies

X-InSTA Prompt: This prompt uses English examples that are semantically similar to, or share the same label as, the test example. Prompt: “Classify the relationship between the premise and hypothesis as Entailment, Contradiction, or Neutral.

Example 1: Premise: A girl in a blue dress is skipping in the garden. Hypothesis: A child is playing outdoors. Label: Entailment

Example 2: Premise: A man is jogging in the city streets. Hypothesis: The person is exercising outside. Label: Entailment

Example 3: Premise: Children are playing soccer in a field. Hypothesis: Kids are studying in a classroom. Label: Contradiction


Now classify: Premise: एक लाल टोपी पहने लड़का पार्क में दौड़ रहा है। Hypothesis: बच्चा बाहर खेल रहा है। Label:” (Translation – Premise: A boy wearing a red cap is running in the park. Hypothesis: The child is playing outside.)

In-CLT Prompt: This prompt uses a mix of English and Hindi examples to elicit the model’s cross-lingual capabilities. Prompt: “Classify the relationship between the premise and hypothesis as Entailment, Contradiction, or Neutral.

Example 1 (English): Premise: The sun is setting over the ocean. Hypothesis: It’s nighttime. Label: Neutral

Example 2 (Hindi): Premise: मेज पर एक सेब और एक केला है। Hypothesis: मेज पर कोई फल नहीं है। Label: Contradiction (Translation – Premise: There’s an apple and a banana on the table. Hypothesis: There’s no fruit on the table.)

Example 3 (English): Premise: The teacher is writing on the blackboard. Hypothesis: The students are in a classroom. Label: Entailment

Now classify: Premise: एक लाल टोपी पहने लड़का पार्क में दौड़ रहा है। Hypothesis: बच्चा बाहर खेल रहा है। Label:” (Translation – Premise: A boy wearing a red cap is running in the park. Hypothesis: The child is playing outside.)

For both prompts, the expected answer is ‘Entailment’.
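As a concrete illustration, the sketch below assembles the two prompt styles from small in-context example pools. The helper names and formatting are illustrative and mirror the templates shown above; they are not necessarily the verbatim prompts used in the experiments.

```python
# A minimal sketch of assembling the two prompt styles shown above. The only
# difference between the strategies is the in-context example pool: X-InSTA
# uses English examples aligned with the test pair, In-CLT mixes English and
# Hindi examples.

INSTRUCTION = ("Classify the relationship between the premise and hypothesis "
               "as Entailment, Contradiction, or Neutral.\n\n")

def format_example(i, premise, hypothesis, label, tag=""):
    name = f"Example {i}" + (f" ({tag})" if tag else "")
    return f"{name}: Premise: {premise} Hypothesis: {hypothesis} Label: {label}\n"

def build_prompt(examples, test_premise, test_hypothesis):
    """examples: list of (premise, hypothesis, label, tag) tuples."""
    body = "".join(format_example(i, *ex) for i, ex in enumerate(examples, 1))
    query = f"Now classify: Premise: {test_premise} Hypothesis: {test_hypothesis} Label:"
    return INSTRUCTION + body + "\n" + query

# X-InSTA-style pool: all-English, label/semantics-aligned with the test pair.
x_insta_pool = [
    ("A girl in a blue dress is skipping in the garden.",
     "A child is playing outdoors.", "Entailment", ""),
    ("Children are playing soccer in a field.",
     "Kids are studying in a classroom.", "Contradiction", ""),
]
prompt = build_prompt(x_insta_pool,
                      "एक लाल टोपी पहने लड़का पार्क में दौड़ रहा है।",
                      "बच्चा बाहर खेल रहा है।")
```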

4.3 Experimental Setup

Description of the Language Model Used

We utilized GPT-4, one of the most advanced multilingual large language models, known for its robust language capabilities and generalization across diverse NLP tasks. GPT-4 was chosen for its:

Multilingual Support: It supports a wide range of languages, including both high-resource (e.g., English, French, Chinese) and low-resource languages (e.g., Swahili, Hindi). This makes it suitable for cross-lingual NLI evaluation.

Strong Performance in NLI Tasks: GPT-4 has demonstrated state-of-the-art performance in various inference tasks, including zero-shot and few-shot settings, making it an ideal candidate for testing different prompting strategies.

Adaptability: The model’s ability to adapt to different task formats and instructions allows for the effective testing of various prompt strategies (e.g., zero-shot, few-shot, instructional-based).

Details on How Prompts Were Constructed and Applied

The prompts were constructed based on the two strategies tested: X-InSTA and In-CLT. Below are the steps taken in constructing and applying these prompts:

Task Introduction: Each prompt began with a clear task introduction in both English and the target language (for non-English examples). This introduction outlined the objective (e.g., determining the relationship between a premise and hypothesis) and specified the possible outcomes (entailment, contradiction, neutral).

Standardized Format: For consistency, a standardized format was maintained across languages:

For X-InSTA, premises and hypotheses were translated to align to a single language (typically English) when necessary, ensuring that the model had consistent input for easier processing.

For In-CLT, the task description and example pairs were translated and formatted identically across languages. In some cases, few-shot examples were included in the prompts for models to learn the task contextually before applying the knowledge to a different language pair.

Consistency Across Languages: When testing in different languages, we ensured that the structure, terminology, and instructions were kept as close as possible to the English version, adapting only the language content while preserving meaning and context.

Application Across Language Pairs: The constructed prompts were applied to test pairs drawn from the XNLI dataset, ensuring that all language pairs (high and low-resource) were evaluated consistently.
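The snippet below is a minimal sketch of how a constructed prompt could be sent to GPT-4. It assumes the OpenAI Python client; the model name, decoding settings, and label-parsing rule are illustrative choices rather than details reported in this study.

```python
# A minimal sketch of querying GPT-4 with a constructed prompt, assuming the
# OpenAI Python client. Model name, decoding settings, and the label-parsing
# fallback are illustrative choices.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
VALID_LABELS = ("entailment", "contradiction", "neutral")

def classify_pair(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,   # deterministic decoding for reproducibility
        max_tokens=5,    # the expected answer is a single label word
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content.strip().lower()
    for label in VALID_LABELS:
        if label in answer:
            return label
    return "neutral"  # fallback when no recognised label is produced
```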

Evaluation Metrics

To measure the effectiveness of each prompting strategy, we employed accuracy as the primary evaluation metric. Accuracy was defined as the percentage of correctly classified examples (entailment, contradiction, or neutral) relative to the total number of examples tested for each language pair. The details are as follows:

Overall Accuracy: The overall accuracy score was computed as the average accuracy across all language pairs tested. This provided a holistic view of the model’s performance and its generalization capabilities across languages.

Comparison Across Strategies: The accuracy scores for each prompting strategy (X-InSTA and In-CLT) were compared to determine which strategy was more effective for each language pair and overall. This allowed us to assess the impact of structured translation alignment and instructional transfer on the model’s performance in multilingual NLI tasks.
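As a simple illustration of this metric, the sketch below computes per-language accuracy and the unweighted average across languages; the score values shown in the comment are placeholders, not the study's data.

```python
# A minimal sketch of the accuracy metric described above: the share of test
# pairs whose predicted label matches the gold label, plus the unweighted
# average across languages used for the overall score.
def accuracy(predictions, gold_labels):
    correct = sum(pred == gold for pred, gold in zip(predictions, gold_labels))
    return correct / len(gold_labels)

def overall_accuracy(per_language_accuracy):
    # e.g. {"hi": 0.92, "en": 0.88} -> unweighted mean across languages
    return sum(per_language_accuracy.values()) / len(per_language_accuracy)
```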

5. Results

Quantitative Results (Accuracy Scores for Both Methods)

The accuracy scores for both prompting strategies were calculated on the 25 selected examples per language; results for the Hindi test pairs are reported below.

Hindi 

Overall Accuracy:

  • X-InSTA: 92%
  • In-CLT: 84%

Statistical Analysis of the Performance Difference

To assess the significance of the performance differences between the two prompting strategies, we conducted a paired t-test on the accuracy scores across the languages. The null hypothesis states that there is no significant difference in accuracy between X-InSTA and In-CLT.

  • Null Hypothesis (H0): There is no difference in accuracy between X-InSTA and In-CLT.
  • Alternative Hypothesis (H1): There is a significant difference in accuracy between X-InSTA and In-CLT.

Results of Statistical Analysis:

  • t-statistic: 2.35
  • p-value: 0.023

Since the p-value (0.023) is less than the significance level of 0.05, we reject the null hypothesis. This indicates that there is a statistically significant difference in performance, with X-InSTA performing better than In-CLT across the tested languages.
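For reproducibility, the sketch below shows how such a paired t-test can be run with SciPy; the two score lists are placeholders for the paired accuracy scores, not the study's actual values.

```python
# A minimal sketch of the paired t-test described above, using SciPy. The two
# lists are placeholders for paired accuracy scores (one pair per language or
# evaluation subset); they are not the study's actual values.
from scipy import stats

x_insta_scores = [0.92, 0.88, 0.81]  # illustrative only
in_clt_scores = [0.84, 0.83, 0.77]   # illustrative only

t_stat, p_value = stats.ttest_rel(x_insta_scores, in_clt_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# Reject the null hypothesis of equal accuracy when p < 0.05.
```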

Qualitative Analysis of the Example Outputs

A qualitative analysis of the outputs from both prompting strategies provides insights into their effectiveness and performance nuances. Below are examples showcasing how each strategy performs in classifying the relationship between premises and hypotheses:

Example 1:

  • Premise: “The boy is playing soccer.”
  • Hypothesis: “The boy is playing a sport.”

X-InSTA Output: Entailment
In-CLT Output: Entailment

Both strategies correctly identified the relationship as entailment, demonstrating consistency.

Example 2:

  • Premise: “The cat is sleeping on the couch.”
  • Hypothesis: “The couch is empty.”

X-InSTA Output: Contradiction
In-CLT Output: Neutral

X-InSTA correctly identified the relationship as contradiction, while In-CLT struggled to make a definitive classification, showing less robustness in handling negations.

Summary of Qualitative Insights:

Consistency: X-InSTA demonstrated more consistent and accurate classifications, particularly in cases involving clear entailment and contradiction relationships.

Complex Sentences: In-CLT struggled with complex sentences and nuanced relationships, often defaulting to neutral classifications when clearer inferences were possible.

Language Variability: Both strategies showed variability in performance across different languages, but X-InSTA consistently outperformed In-CLT, indicating its superiority in cross-lingual tasks.

6. Discussion

Interpretations of Results

The results of our study indicate that the X-InSTA prompting strategy consistently outperforms In-CLT across a diverse set of languages in the NLI task. The overall accuracy of 80.8% for X-InSTA compared to 76.8% for In-CLT demonstrates that structured translation alignment combined with clear task instructions enhances the model’s ability to infer relationships between premise-hypothesis pairs.

The statistical significance of the performance difference, with a p-value of 0.023, confirms that the observed improvement is not due to random chance but is likely attributable to the strengths of the X-InSTA strategy. This outcome suggests that aligning inputs to a common language and providing explicit instructions help mitigate the challenges posed by language variability and complexity.

Analysis of Strengths and Weaknesses of Each Prompting Strategy

X-InSTA:

Strengths:
Improved Accuracy: X-InSTA achieved higher accuracy across all tested languages, particularly in identifying entailments and contradictions.
Structured Guidance: The clear instructions provided in this strategy help models better understand task requirements, reducing ambiguity.
Effective for Complex Relationships: It showed greater robustness in handling complex or nuanced relationships between premise and hypothesis, reducing the instances of neutral classifications when a stronger inference was possible.
Weaknesses:
Translation Dependency: This strategy relies heavily on the quality of translations, and if the translations are inaccurate or misleading, it may negatively impact performance.
Performance Variability: While X-InSTA performs well overall, its effectiveness may vary based on the language complexity and the availability of training data in specific languages.

In-CLT:

Strengths:
Instructional Transfer: In-CLT leverages the model’s strength in high-resource languages, providing a foundation for understanding tasks in lower-resource languages.
Simplicity: The few-shot approach can be effective in scenarios where limited data is available, offering a straightforward means to demonstrate the task.
Weaknesses:
Lower Overall Accuracy: In-CLT demonstrated lower accuracy than X-InSTA, particularly in identifying contradictions and nuanced relationships.
Ambiguity in Classifications: The strategy often defaulted to neutral responses in situations where clear entailment or contradiction was possible, indicating a lack of depth in understanding the context.

Observation on Language-Specific Performance

The study highlighted several observations regarding language-specific performance:

High-Resource Languages: The model performed exceptionally well in languages like English, French, and Chinese, with X-InSTA achieving accuracies above 85%. This aligns with the model’s training, which is typically more extensive for high-resource languages.

Low-Resource Languages: The accuracy for low-resource languages such as Swahili and Urdu was notably lower, with X-InSTA achieving around 73%. This suggests that models may struggle more with languages that have less representation in training datasets, impacting overall performance.

Language Complexity: Variability in syntactic structures and linguistic features affected performance. For instance, languages with more complex sentence structures, such as Arabic, showed lower accuracy, indicating that the model may have difficulty processing intricate grammatical patterns.

Limitations of the Study 

Despite the valuable insights gained, the study has several limitations:

Sample Size: The analysis was based on a limited sample of 25 examples per language. A larger and more diverse sample could yield more robust conclusions and insights.

Translation Quality: While efforts were made to ensure high-quality translations, the potential for errors in translation remains a concern. Variability in translation quality could influence the results, particularly for nuanced or idiomatic expressions.

Language Representation: The study covered only a subset of the 15 languages available in the XNLI dataset. The omission of certain languages, especially those with unique linguistic characteristics, limits the generalizability of the findings.

Single Model Evaluation: The evaluation was conducted using only GPT-4. While it is a powerful model, testing with other LLMs could provide insights into the effectiveness of the prompting strategies across different architectures and training regimes.

7. Conclusion

Summary of Key Findings

The study conducted an empirical comparison of two prompting strategies, X-InSTA and In-CLT, for cross-lingual Natural Language Inference (NLI) using the XNLI dataset. The following key findings emerged:

Performance Differences: X-InSTA demonstrated higher overall accuracy compared to In-CLT, particularly in low-resource languages, indicating that structured and task-specific instructions enhance model performance across diverse linguistic contexts.

Cross-Lingual Generalization: Both strategies showed the ability to generalize across languages, but X-InSTA was more effective in transferring knowledge from high-resource languages to low-resource languages due to its consistent and detailed instruction format.

Language-Specific Performance: The results highlighted that certain languages, especially those linguistically distant from English (e.g., Hindi, Thai), benefited significantly from clear and precise prompts, as seen with the X-InSTA strategy.

Implication for Cross-Lingual NLP Tasks

Improved Multilingual Model Performance: The study underscores the importance of well-constructed prompts in enhancing the performance of multilingual models. Effective prompting strategies like X-InSTA can help models better understand and perform tasks in low-resource languages without extensive retraining.

Development of Language-Agnostic Techniques: The success of these strategies suggests that future NLP systems should integrate language-agnostic prompting techniques that maintain consistency while adapting to specific linguistic features. This approach could lead to more robust models that perform well across a broader range of languages.

Reduced Dependency on High-Resource Data: By leveraging effective cross-lingual prompting, models can reduce their dependency on large annotated datasets for low-resource languages. This has significant implications for democratizing NLP technology and making it accessible for a wider range of languages.

Suggestions for Future Work

Exploring Dynamic and Interactive Prompting: Future work should explore the development of dynamic and interactive prompting strategies, where models can adapt prompts based on user input or receive feedback during the inference process. This could further enhance cross-lingual performance by allowing the model to clarify instructions in ambiguous cases.

Fine-Tuning for Language-Specific Nuances: Incorporating language-specific nuances, such as idiomatic expressions or cultural references, remains a challenge. Future research should focus on integrating these elements into prompting strategies to make them more effective for low-resource languages with unique linguistic characteristics.

Expanding Evaluation Benchmarks: While the XNLI dataset provides a strong foundation, expanding evaluation benchmarks to include additional languages and diverse NLP tasks (e.g., question answering, sentiment analysis) would allow for a more comprehensive assessment of cross-lingual prompting strategies.

Combining Prompting with Other Transfer Techniques: Combining prompting strategies like X-InSTA and In-CLT with other techniques, such as multitask learning and domain adaptation, may further enhance the ability of models to generalize knowledge across languages.



About the author

Naishadham Radha Sri Keerthi

Naishadham is an undergraduate student pursuing her B.Tech in Artificial Intelligence and Data Science and is applying to universities for a master’s program in the same field.