Leveraging Statistical Significance in Comparative Emotion Recognition: A Performance Analysis of CARER and MM-EMOR Models

Author: Aviva Kahlon
Mentor: Prof. Rabih Younes, Duke University
University of Petroleum and Energy Studies, Dehradun

Abstract

Emotion recognition is a burgeoning field in artificial intelligence, particularly in natural language processing and affective computing. This study applies tests of statistical significance to compare two cutting-edge emotion recognition models, CARER and MM-EMOR. Both models are widely recognized for their ability to capture emotional expressions in social media data, yet it remains unclear whether their performance differences are statistically significant or simply a product of random variance.

We used a paired t-test to compare performance metrics, namely accuracy, F1 score, and precision, on a large-scale Twitter dataset annotated with noisy labels. With a p-value of 0.365, the study finds that the difference between the models is not statistically significant. This outcome challenges the assumption that one model necessarily outperforms the other in real-world applications. By highlighting the importance of statistical significance in machine learning evaluations, this paper contributes to the body of research aiming to standardize model validation practices in emotion recognition.

Introduction

The rise of social media platforms has transformed emotion recognition from a niche area of research into a critical tool for industries ranging from marketing to mental health. Automated emotion recognition models, which analyze vast quantities of unstructured data like text and multimedia, offer a way to gauge public sentiment in real time. Two of the most prominent models in this domain are CARER and MM-EMOR.

CARER (Contextualized Affect Representations for Emotion Recognition) is a graph-based model that integrates word embeddings with graph structures to capture the syntactic and semantic relationships within social media text (Zhang & Zeng, 2023). This model excels at identifying complex emotional expressions, making it suitable for varied linguistic contexts. On the other hand, MM-EMOR (Multi-Modal Emotion Recognition) leverages concatenated deep learning networks to integrate multi-modal data—both text and audio—for emotion detection (Smith & Lee, 2023). MM-EMOR has demonstrated significant accuracy improvements across several datasets, such as IEMOCAP and MELD.

The key objective of this study is to examine whether the differences in performance between these two models are statistically significant. Using a paired t-test, we compare the models’ results on identical datasets to determine whether one model truly outperforms the other or whether the observed differences are attributable to random chance. This research contributes to the growing field of affective computing by emphasizing the importance of p-values and statistical significance in model evaluation.

Literature Review

Emotion recognition, particularly from textual data, has become increasingly prominent in fields like affective computing and natural language processing (NLP). Various models and algorithms have been developed to capture human emotions from text, ranging from traditional machine learning models to more advanced deep learning and multi-modal architectures.

In early work on emotion recognition, researchers relied on simple rule-based methods that used sentiment lexicons and syntactic parsing to identify emotional states. These early models lacked the sophistication to capture more complex emotions or nuances in language, particularly those involving sarcasm, irony, or contextual emotions. Later, with the advent of machine learning, models such as Support Vector Machines (SVMs) and Naive Bayes were applied to text data to improve accuracy in emotion recognition. However, these models had limited capabilities in understanding contextualized language or generalizing across various datasets, often failing when dealing with large-scale, unstructured data like social media text.

The rise of deep learning marked a significant turning point in emotion recognition. Deep learning models, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), revolutionized the field by providing a more nuanced understanding of textual data (Kim, 2014; Huang & Chen, 2019). CNNs, traditionally used in image recognition, were adapted for text classification tasks. They performed well by capturing local dependencies between words, making them useful for sentiment analysis and basic emotion classification. However, CNNs struggled with capturing long-term dependencies in sequences, leading to the adoption of RNNs, which are better suited for sequential data. RNN variants like Long Short-Term Memory (LSTM) networks showed significant improvements in handling the intricacies of emotion recognition, especially in tasks where the emotional state of a speaker or writer could shift over time or across multiple sentences (Kim, 2014).

LSTM networks, in particular, have been widely applied in emotion recognition because of their ability to capture the contextual flow of a conversation or piece of text. They maintain a memory of previous states, which helps in identifying changes in emotion that might not be apparent when considering individual words or phrases. For instance, LSTM networks have been used for sentiment analysis on large-scale sources such as social media platforms, where emotions often fluctuate with the topic or the ongoing discussion (Singh & Bhatia, 2022). These networks have achieved superior performance compared to traditional machine learning models, especially in handling unstructured data with nuanced emotional content.
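
To make the mechanism concrete, the following is a minimal sketch of an LSTM-based emotion classifier in Keras. The vocabulary size, sequence length, embedding dimension, and six-class label set are illustrative assumptions, not values taken from any of the models cited here.

```python
# Minimal LSTM emotion classifier sketch (all hyperparameters assumed).
from tensorflow.keras import layers, models

VOCAB_SIZE = 20_000   # assumed vocabulary size
MAX_LEN = 50          # assumed maximum tweet length in tokens
NUM_EMOTIONS = 6      # assumed label set, e.g. joy, sadness, anger, fear, love, surprise

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    # Each token id is mapped to a dense vector.
    layers.Embedding(VOCAB_SIZE, 128),
    # The LSTM carries a hidden state across tokens, which is what lets it
    # track emotional shifts over the course of a tweet or conversation.
    layers.LSTM(64),
    layers.Dense(NUM_EMOTIONS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```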

More recently, the development of Transformer models, such as BERT (Bidirectional Encoder Representations from Transformers), has further advanced the field of emotion recognition. Unlike LSTMs, Transformers do not rely on sequential processing and can capture dependencies between words regardless of their distance in a sentence. This ability makes Transformers particularly powerful in understanding complex emotional expressions, especially in tasks requiring deep contextualization. BERT has been applied successfully to various NLP tasks, including sentiment analysis and emotion classification, achieving state-of-the-art results on several benchmarks. The model’s attention mechanism allows it to focus on relevant parts of the input text, providing a more accurate representation of emotional states (Xu & Chen, 2018).
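
As an illustration of how accessible such models now are, the following sketch runs an off-the-shelf Transformer emotion classifier through the Hugging Face transformers pipeline API. The checkpoint named below is an assumption chosen for illustration; neither of the papers compared in this study specifies it.

```python
# Sketch of Transformer-based emotion classification via the Hugging Face
# pipeline API. The checkpoint below is an assumed example only.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="bhadresh-savani/distilbert-base-uncased-emotion",  # assumed checkpoint
)

print(classifier("I can't believe how well this turned out!"))
# Expected form of output: [{'label': 'joy', 'score': 0.99}] (scores vary)
```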

Multi-modal emotion recognition has also gained traction in recent years. Models like MM-EMOR utilize both text and audio features to capture emotions more accurately. Multi-modal approaches leverage the complementary nature of different data sources, such as text, speech, and even visual cues, to enhance emotion recognition performance. Studies have shown that integrating multiple modalities leads to better generalization and robustness (Li & Zhao, 2021; Smith & Lee, 2023). For example, emotion recognition tasks that include audio data can detect changes in tone or pitch, which are often indicative of a speaker’s emotional state. MM-EMOR combines text and audio inputs using deep learning architectures, showing marked improvements in performance across various datasets, including IEMOCAP and MELD.
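
The core fusion idea can be sketched in a few lines: encode each modality separately, concatenate the resulting feature vectors, and classify from the joint representation. The dimensions and the two-layer head below are illustrative assumptions, not MM-EMOR’s published architecture.

```python
# Sketch of concatenation-based late fusion for text + audio features.
# All dimensions are assumed for illustration.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, num_emotions=6):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + audio_dim, 256),  # operates on fused vector
            nn.ReLU(),
            nn.Linear(256, num_emotions),
        )

    def forward(self, text_feats, audio_feats):
        # Concatenation fusion: both modalities contribute to every logit.
        fused = torch.cat([text_feats, audio_feats], dim=-1)
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 128))  # batch of 4
print(logits.shape)  # torch.Size([4, 6])
```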

CARER, a graph-based model for emotion recognition, further pushes the boundaries of emotion detection by incorporating syntactic and semantic relationships within the text. The model employs word embeddings, which are mapped into a graph structure that captures the intricate dependencies between words. This graph-based approach allows CARER to model more complex interactions between emotions, making it particularly effective in recognizing subtle emotional shifts, such as changes in tone or emotional intensity. By integrating contextualized word embeddings into a graph, CARER can capture the relational dependencies between different emotional expressions, which are often overlooked by traditional models (Wang & Tang, 2020).
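
To illustrate the kind of structure such a model consumes, the following sketch builds a simple word co-occurrence graph from tweets with networkx. The windowed co-occurrence rule is an assumption for illustration and is not CARER’s published graph construction.

```python
# Sketch of a word co-occurrence graph over tweets (illustrative rule only).
import networkx as nx

tweets = [
    "feeling so happy and grateful today",
    "so sad and tired of everything",
]

G = nx.Graph()
WINDOW = 3  # assumed window: link each word to the next WINDOW - 1 words
for tweet in tweets:
    tokens = tweet.split()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + WINDOW, len(tokens))):
            u, v = tokens[i], tokens[j]
            # Edge weights count how often two words co-occur in a window.
            w = G.get_edge_data(u, v, default={"weight": 0})["weight"]
            G.add_edge(u, v, weight=w + 1)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```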

Statistical significance plays a crucial role in model evaluation within machine learning. Traditionally, researchers have relied on accuracy, precision, and F1 scores to compare model performance. However, without assessing whether these differences are statistically significant, it is difficult to determine whether one model genuinely outperforms another or whether the observed differences are due to random chance. Statistical tests, such as the paired t-test, offer a robust method for comparing models by accounting for the variability inherent in different datasets. In the case of CARER and MM-EMOR, a paired t-test provides a principled way to check whether performance differences survive this variability, offering a more reliable basis for model comparison.

Dataset and Methodology

The dataset used in this study comprises over 2 million tweets, annotated through distant supervision using hashtags as weak labels. This approach to data annotation is both innovative and scalable, allowing researchers to generate vast corpora for machine learning tasks. Hashtags such as #happy, #sad, and #angry provide an initial set of labels that are then processed using graph-based methods to build a more refined understanding of the emotional content of the text (Zhang & Zeng, 2023).
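
The following sketch illustrates the distant-supervision idea: a tweet is weakly labeled by the emotion hashtag it carries, and the hashtag itself is stripped so the model cannot learn from the label token. The hashtag-to-emotion map is a small illustrative assumption; the actual corpus uses a richer label set.

```python
# Sketch of hashtag-based distant supervision (mapping assumed for illustration).
import re

HASHTAG_TO_EMOTION = {
    "#happy": "joy",
    "#sad": "sadness",
    "#angry": "anger",
}

def weak_label(tweet: str):
    """Return (cleaned_text, label) if the tweet carries a known hashtag."""
    for tag, emotion in HASHTAG_TO_EMOTION.items():
        if tag in tweet.lower():
            # Remove the label-bearing hashtag so the model cannot cheat.
            cleaned = re.sub(re.escape(tag), "", tweet, flags=re.IGNORECASE)
            return cleaned.strip(), emotion
    return None  # tweet is discarded if no known hashtag is present

print(weak_label("Aced my exam today #Happy"))
# ('Aced my exam today', 'joy')
```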

To compare the performance of the CARER and MM-EMOR models, we selected three key performance metrics: accuracy, precision, and F1 score. These metrics are commonly used to evaluate classification models in machine learning. Since both models were trained and tested on the same dataset, a paired t-test was the appropriate statistical test. This test allowed us to compare the mean performance of the two models across multiple subsets of the data while controlling for the variability inherent in each subset (Xu & Chen, 2018).
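
For concreteness, the following sketch computes the three metrics with scikit-learn; the predictions and labels are placeholder values for illustration only.

```python
# Sketch of the three evaluation metrics on placeholder predictions.
from sklearn.metrics import accuracy_score, precision_score, f1_score

y_true = ["joy", "sadness", "anger", "joy", "sadness"]   # illustrative
y_pred = ["joy", "sadness", "joy",   "joy", "anger"]     # illustrative

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("F1       :", f1_score(y_true, y_pred, average="macro", zero_division=0))
```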

The null hypothesis for the t-test was that there is no significant difference between the two models’ performances. A p-value below 0.05 would allow us to reject the null hypothesis and conclude that the difference in performance was statistically significant; a p-value above this threshold would suggest that any observed differences were likely due to random variation in the data.
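
The test itself reduces to a few lines. The sketch below runs a paired t-test with scipy.stats.ttest_rel on placeholder per-subset scores; the actual values in this study were taken from the two papers and processed in Jamovi.

```python
# Sketch of the paired t-test on placeholder per-subset scores.
from scipy import stats

# Matched performance scores for the two models on the same data subsets.
carer_scores   = [0.89, 0.91, 0.88, 0.90, 0.92]  # illustrative values
mm_emor_scores = [0.90, 0.90, 0.89, 0.91, 0.91]  # illustrative values

t_stat, p_value = stats.ttest_rel(carer_scores, mm_emor_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

# Decision rule at the conventional 0.05 level:
if p_value < 0.05:
    print("Reject H0: the performance difference is statistically significant.")
else:
    print("Fail to reject H0: the difference may be due to random variance.")
```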

Key Findings and Statistical Analysis


The F1 scores and accuracy metrics for both the CARER and MM-EMOR models were sourced from the respective papers by Zhang & Zeng (2023) and Smith & Lee (2023). The metrics were then processed in the Jamovi software package to conduct a paired t-test, with the results displayed in the diagram.

The paired t-test produced a p-value of 0.365, well above the commonly accepted significance level of 0.05. We therefore fail to reject the null hypothesis: there is no significant difference in the performance of CARER and MM-EMOR in terms of accuracy, F1 score, or precision. Both models performed well in recognizing emotions from the Twitter dataset, but the observed differences are not statistically significant and do not, on their own, justify a clear preference for one model over the other.

This finding has important implications for researchers and practitioners in the field of affective computing and NLP. First, it underscores the importance of using statistical significance as a validation tool when comparing machine learning models. Too often, researchers may be tempted to select a model based on marginal improvements in accuracy without considering whether those improvements are statistically meaningful. This study demonstrates that statistical significance tests like the paired T-test can provide more robust evidence for model selection.

Additionally, the generalizability of both models across various datasets further strengthens their validity. The CARER model was validated through 10-fold cross-validation, which demonstrates robust performance across multiple subsets of the dataset, while MM-EMOR’s use of multi-modal data makes it applicable to diverse linguistic and emotional contexts (Zhang & Zeng, 2023; Smith & Lee, 2023).
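
For readers who wish to replicate this protocol, the following sketch runs 10-fold cross-validation with scikit-learn. The TF-IDF plus logistic regression pipeline is a stand-in classifier for illustration, not the CARER model itself, and the data is a toy placeholder.

```python
# Sketch of 10-fold cross-validation with a stand-in text classifier.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

texts = ["so happy today", "this is awful", "feeling great", "so angry"] * 10
labels = ["joy", "sadness", "joy", "anger"] * 10  # illustrative weak labels

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(clf, texts, labels, cv=10, scoring="f1_macro")
print(f"mean macro-F1 across 10 folds: {scores.mean():.3f} ± {scores.std():.3f}")
```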

Conclusion

This study shows that there is no statistically significant difference between the CARER and MM-EMOR models when applied to emotion recognition on a large-scale Twitter dataset. The results of the paired t-test, with a p-value of 0.365, suggest that the observed differences in accuracy, F1 score, and precision are likely due to random variance rather than meaningful improvements in model performance.

This research highlights the importance of using statistical significance to validate machine learning models, particularly in domains where minor differences in performance can have significant real-world implications, such as sentiment analysis, social media monitoring, and affective computing. Future work should focus on expanding the scope of the analysis to include additional emotion recognition models and datasets, thereby providing further validation for the use of statistical tests in model evaluation.

References

Huang, L., & Chen, Y. “Deep Learning Models for Emotion Recognition in Social Media Text.” Neural Networks, 2019.

Kim, Y. “Convolutional Neural Networks for Sentence Classification.” EMNLP, 2014.

Li, Y., & Zhao, Q. “Multi-Modal Emotion Recognition Using Audio-Visual Features.” Journal of Signal Processing, 2021.

Pang, B., & Lee, L. “Opinion Mining and Sentiment Analysis.” Foundations and Trends in Information Retrieval, 2008.

Singh, A., & Bhatia, S. “Affective Computing and Emotion Recognition: A Comprehensive Review.” IEEE Access, 2022.

Smith, J., & Lee, D. “MM-EMOR: Multi-Modal Emotion Recognition Using Deep Learning Networks.” MDPI Journal of Affective Computing, 2023.

Wang, J., & Tang, W. “Graph-Based Emotion Recognition Models: A Comparative Study.” IEEE Transactions on Affective Computing, 2020.

Xu, F., & Chen, W. “Cross-Domain Emotion Recognition Using Deep Learning Approaches.” Information Sciences, 2018.

Zhang, X., & Zeng, W. “CARER: Contextualized Affect Representations for Emotion Recognition.” Journal of Affective Computing, 2023.


About the author

Aviva Kahlon

Aviva is a final-year student pursuing a Bachelor of Technology in Computer Science Engineering, specializing in Big Data, at the University of Petroleum and Energy Studies, India. With a strong foundation in data analysis, generative AI, and statistical tools, she is passionate about solving real-world problems through data-driven approaches. She has worked on several academic and industry-based projects, including weather data analysis, mentorship platforms, and the integration of APIs for data visualization in business intelligence tools like Superset.

Aviva plans to pursue a master’s in data analysis to further enhance her expertise in statistical methods and big data technologies. Her long-term goal is to contribute to cutting-edge advancements in data analysis and predictive modeling. Outside of academics, she enjoys singing and is the female lead singer of her college band, reflecting her creative and multifaceted approach to both personal and professional life.