machine learning Archives - Exploratio Journal

Implementing a ResNet50 Architecture to Decrease Social Anxiety in Autistic Children by Detecting Emotions Portrayed Through Body Language Micro Gestures

Kedaar Rentachintala — Sun, 27 Aug 2023 15:54:22 +0000

Author: Kedaar Rentachintala
BASIS Independent Silicon Valley

Abstract

Autism is one of the most prominent developmental disorders in the world today, affecting communication, social cues, learning, and other common tasks many people often take for granted. A lack of emotional awareness often leads to a more challenging time building relationships, achieving success in jobs and job interviews, and comprehending lessons (Ohl et al. 2017). Although researchers have conducted many experiments and analyses regarding facial emotions, few have delved deep into body language and the associated cues that form the basis of social and emotional awareness. To address this gap, we used the iMiGUE dataset, a collection of over 350 post-match tennis interviews annotated with player emotional states and micro gestures, the minuscule actions players perform over a span of frames. After creating specialized images for each frame of a video and analyzing them through a ResNet50 architecture, we found that our model obtained favorable results for the surprise and sadness-correlated images. At the same time, it produced less accurate results for happy, fearful, uncomfortable, and focused images. To improve these results and make our product a universally effective tool, we plan on adding more training data and implementing various model architectures.

1. Introduction

Society has consistently maintained a stigma towards people with disabilities. However, for those on the more severe end of the autism spectrum, that stigma tends to be higher, primarily because autistic students and adults have poor social awareness and decision-making. A core component of autism is decreased social awareness and aversion to standard communication measures like eye contact and speaking face to face. Those social impediments have immense consequences throughout life. Children have difficult experiences making friends and building lasting relationships, while older students have trouble understanding lecture material and asking for help when needed. Adults with autism spectrum disorder (ASD) experience high unemployment and underemployment rates compared to adults with other disabilities or no disabilities (Ohl et al. 2017).

Even for families with access to them, therapy and general medical aid are expensive, with the most cost-effective form of cognitive behavioral therapy ranging from $7,300 to $9,330 per year (Crow et al. 2009). Furthermore, expensive therapists are tasked with holding therapy sessions for a small subset of the autistic population in the current treatment space. Meanwhile, underprivileged households rarely have access to accredited facilities with licensed therapists. The cost, travel time to and from a therapist’s office, and parental commitment in and out of the home that are associated with traditional behavioral therapy mechanisms are tremendous and avoidable.

In order to close this socioeconomic gap and allow those with autism and other social awareness-impeding conditions to achieve their goals, we developed a machine learning-based tool with a clear goal in mind: take in various common body language micro gestures (crossing hands, fidgeting) and identify the emotions (happy, sad, mad, disgusted) to which those gestures correspond. We chose body language as an emotion prediction mechanism because little research exists regarding emotion detection through gestures rather than the face. We also decided to focus on younger children, given that younger students have an easier time overcoming their autistic tendencies and becoming more socially aware, a phenomenon commonly referred to as early intervention (Landa, 2018). The general approach to this project was to implement a ResNet50 architecture, training it with various images obtained from post-match tennis interviews. We started from an existing Kaggle FER-2013 emotion detection model (Shakir, 2022) and modified it to suit our needs. We obtained our initial data of YouTube video links and frame durations from the iMiGUE dataset, as described by Liu et al. (2021). Our model obtained favorable results for the surprise and sadness-correlated images, while it produced less accurate results for happy, fearful, uncomfortable, and focused images.

To improve the accuracy of our ResNet50 model, we will train our model with more data from various settings and subjects and use different model frameworks like VGG-16. New settings would allow our project to become more commercially usable, given that new environments for training images would be more representative of a household or user’s location.

2. Background and Significance

Autism spectrum disorder (ASD) is a complex brain condition caused by many factors ranging from the environment to genetics. The Centers for Disease Control and Prevention (CDC) stated that 1 in 68 children are born with autism, which is more common in males by a four-to-one ratio. Because the condition has no diagnostic criteria, behavioral therapists and doctors examine patient actions and general behavioral trends to identify a potential cause (Amaral 2017). Characterized by poor social interaction and repetitive behavior, ASD has a complex and intriguing genetic makeup, with a convoluted familial inheritance pattern and nearly 1,000 genes contributing to its rise (Ramaswami et al., 2018). As a neurodevelopmental disorder often referred to as a pervasive developmental disorder (PDD), ASD can also restrict and stereotype patterns of behaviors or interests (Faras et al. 2010). Using recent genomic techniques like microarray and next-generation sequencing (NGS), researchers have analyzed the genes in question to identify variance and clarify potential causes. Furthermore, many institutions have conducted exome sequencing analyses and general genome sequencing to identify mutations and create genomic datasets (Choi et al., 2021). Recently, researchers discovered over 36 de novo variants in the genetic makeup of individuals with autism, using Ingenuity Pathway Analysis software to construct networks, identify those autism-related genes, and find relationships among them, especially among six genetic networks (Kim et al. 2020).

Although researchers are still looking into official environmental causes, some researchers have found correlations between autism and parental age, assisted reproductive technology, nutrition, maternal infections and diseases, toxicants and general environmental chemicals, and medications (Gialloretti et al. 2019). German measles and similar diseases are related to the rise of ASD. However, influenza and other widely infectious diseases do not correlate. Furthermore, drugs used during the first and second trimesters of pregnancy, specifically serotonin reuptake inhibitors for depression treatment, are directly linked to autism. Researchers have investigated automobile pollution and cigarette smoke, among other harmful chemicals. However, more research must be conducted before making an official conclusion (Amaral 2017). As big data and molecular biology develop, scientists are continuously coming up with new ways to understand the genetic makeup of autistic patients to make better predictions about the origins of autism and its subsequent development (Gialloretti et al., 2019).

Figure 1: State-Level Weighted Prevalence of Autism Spectrum Disorder (Xu et al. 2019)

Autism may be chronic, but some treatments can weaken its impact. The most effective therapy method is intensive behavioral intervention that improves the functioning of the child in question. These therapy sessions focus on language practice, social responsiveness, imitation, and etiquette. Examples of these therapy routines are the ABA (Applied Behavioral Analysis) and TEACCH (Treatment and Education of Autistic and Related Communication Handicapped Children) approaches (Faras et al. 2010). Furthermore, the UCLA Lovaas model and Early Start Denver Model (ESDM) have been shown to improve children’s cognitive performance, language, and socially adaptive behavior. The drugs risperidone and aripiprazole also improve behaviors like distress, aggression, self-injury, and hyperactivity, but they cause more harm than benefit. Intervention appears beneficial for children under 2, but more studies must occur before a conclusion is deemed legitimate (Warren et al., 2011).

2.1 Definitions

2.1.1 Data Augmentation – Data Augmentation is a critical tool for developing comprehensive deep learning classification-oriented models, especially when there is little data (Aryan and Unver, 2018). As its name suggests, data augmentation creates various copies of existing data by rotating, shifting, flipping, and zooming images to specified requirements. These copies form a more extensive dataset and help create a more robust model.
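The transformations named above can be sketched in a few lines. This is a minimal, dependency-free illustration on a tiny grayscale image stored as a list of rows; the function names are hypothetical and this is not the augmentation script used in the project. Zooming is omitted for brevity.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2D grayscale image (a list of lists)."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Shift every row right by `pixels`, padding the left edge with `fill`."""
    if pixels <= 0:
        return [row[:] for row in image]
    return [[fill] * min(pixels, len(row)) + row[:max(len(row) - pixels, 0)]
            for row in image]

def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

def augment(image, copies, seed=0):
    """Create `copies` randomly transformed variants of one image."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    transforms = [horizontal_flip, rotate_90, lambda img: shift_right(img, 1)]
    return [rng.choice(transforms)(image) for _ in range(copies)]

if __name__ == "__main__":
    tiny = [[1, 2], [3, 4]]
    for variant in augment(tiny, copies=3):
        print(variant)
```

Each variant keeps the label of the original image, which is how a small labeled set grows into a larger one.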
2.1.2 Gamify – Gamification refers to creating an application or tool that incorporates elements of gameplay. Examples of such elements include competition, a point system, and universal rules or regulations.
2.1.3 Micro Gesture – Per Liu et al. (2021), a micro gesture is a gesture that is unknowingly or subconsciously emitted. Rather than an intentional wave, symbolizing greeting or departure, a micro gesture would be covering the face in sadness or crossing the arms to portray a feeling of anger. Viewers can gather hidden emotions from the subjects that are revealed through these micro gestures, allowing that subject to give off genuine feelings and appear vulnerable.

Figure 2: Sample micro gesture during a post-match conference: crossing arms (Liu et al. 2021)

2.1.4 ResNet50 Architecture – As stated by Liu et al. (2021), deep neural networks fail to achieve their full potential due to vanishing gradient and saddle point problems. ResNet50 addresses those issues with shortcut (skip) connections that let deep models train efficiently. As Wen et al. (2019) state, ResCNN skips several blocks of convolutional layers and implements shortcuts to overcome vanishing or exploding gradients. Zaeemzadeh et al. (2021) show that networks of nearly 1,000 layers increased accuracy using a ResNet architecture.
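The effect of a shortcut can be written out for a single residual block. The notation below is the standard residual-learning form rather than the specific formulation of any paper cited here: x is the block input, F the stacked convolutional layers inside the block, and L the loss.

```latex
% One residual block and the gradient that flows back through it.
y = x + F(x)
\qquad\Longrightarrow\qquad
\frac{\partial L}{\partial x}
  = \frac{\partial L}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)
```

The identity term I passes the incoming gradient back unchanged, so some signal reaches earlier layers even when the derivative of F is close to zero.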

Figure 3: Residual modules before and after improvement (Wang et al. 2022)

2.2 Relevant Project Resources

2.2.1 iMiGUE Dataset – Introduced by Liu et al. (2021), the iMiGUE dataset is an identity-free dataset that focuses its work on micro gestures. iMiGUE is unique because it focuses on nonverbal cues without facial emotions while existing emotion detection focuses on the face. The dataset contains over 350 YouTube tennis post-match links and various start and end frame stats labeled by a particular class. For instance, the dataset would label frames 250 to 300 as certain behaviors. We then classified those behaviors into the certain emotions they best represented. For instance, we consider crossing arms an angry pose, while covering the face would be sad.

Figure 4: iMiGUE characteristics per Liu et al. (2021)

2.2.2 Kaggle FER-2013 Model – The model we implemented was developed for a Kaggle dataset named FER-2013. The FER-2013 dataset contains seven emotions (angry, disgusted, fearful, happy, neutral, sad, surprised), with each image saved as a 48×48 pixel grayscale item. The dataset contains over 35,685 images. Our model was an adaptation of the detection solution by Shakir (2022). The model fed in the data, implemented data augmentation on a training and validation set, froze the last four layers, created the model, compiled it, trained it on the data, and displayed the output, basing its results on the accuracy, precision, recall, AUC, and F1 score. The output is displayed below.

Figure 5: Accuracy, Loss, AUC, Precision, and f1 score of the Kaggle ResNet (Shakir, 2022)

3. Related Work

Researchers have begun to focus on body language recognition by creating custom datasets and implementing tools like Laban Movement Analysis to identify specific gestures. Laban Movement Analysis (LMA), as summarized by Tsachor et al. (2019), is an internationally used system for observing movement. LMA identifies various movement variables often used for statistical analysis and general therapeutics. Tsachor et al. analyzed dance/movement therapy (DMT) and its effects on psychological wellbeing, a niche but relevant example of the analysis technique in a real-world scenario. Dance is one of the many applications of body language detection, but understanding dance emotion creates a new experience for viewers replicated in other industries and hobbies.

Accordingly, therapy relies on emotion, but facial features are not the only parts of the body that show emotion. In order to get an accurate, holistic view of any human emotion, AI must analyze spontaneous gestures, also known as micro gestures. Chen et al. (2018) prove this through their project, implementing spontaneous gestures and deep learning to analyze stress levels. As of mid-2022, there are limited resources regarding the study of gestures and subsequent identification by AI/ML models. This study uses a Spontaneous Micro Gesture (SMG) dataset of nearly 4,000 labeled gestures. Chen et al. (2018) found that many individuals use micro gestures to add a new layer to their expression that further emphasizes their feelings.

The group proposes a framework to encode gestures to a Bayesian network and help predict emotion states. Their results show that most participants naturally perform micro gestures to relieve their mental fatigue. Additionally, over 40 participants were interviewed extensively and put through a story-telling game with two emotional states. In order to contrast these story-telling game participants with ordinary individuals, the group also conducted a short test with ordinary people and trained orators. As of its publication, it was the first gesture dataset and demonstrated immense progress as a gesture prediction tool, especially for hidden emotions.

Figure 6: All of the micro gestures in the Chen et al. (2018) dataset

Adding onto the findings of Chen et al. (2018), Automated Recognition of Bodily Expression of Emotion (ARBEE) identifies the effectiveness of Laban Movement Analysis (LMA) in identifying bodily expressions. It also compares other methods and studies and analyzes two data sets, body skeletons, and a raw image. After analyzing the data, the system intertwines emotional detection and body language and forms a holistic model of what the character in question is feeling. The study also cites a variety of applications, ranging from personal assistants to social and police robots, all cases that require an immense understanding of the human form and human feelings. The ARBEE technology is a pioneering mechanism through which modern tools relying on social and emotional recognition can operate (Luo et al., 2020).

Schindler et al. (2008) tackle the lack of study of body gestures. Most people perceive emotion as a smile, frown, or other facial features rather than gestures or body movements. This group created a computational model of the visual cortex, a hierarchy of neural detectors that can discriminate seven emotional states from static views of body poses. The researchers also evaluated the model on human test subjects.

Similarly, Aristidou et al. (2015) found that with the increased availability of motion databases and advancements in analyzing that motion, utilizing data has been more accessible than ever before. Their paper analyzes Russell’s circumplex model and Laban Movement Analysis (LMA) to extract LMA components and index and classify dance movements and the related emotions those dancers convey. The results of their experiments show that, with LMA extracting and indexing various movements and classifying them by emotion, researchers can learn how and why people express emotion, all while working in parallel with other motion tracking applications and classification tools.

Figure 7: A subject performing actions for Laban Movement Analysis (Tsachor et al. 2019)

Liu et al. (2021) extend this line of emotion classification work through the iMiGUE dataset, a micro gesture dataset for emotion analysis. This dataset is unique because it focuses entirely on nonverbal body cues rather than the face, the more common focus of emotion research. Each video in the dataset has specified video frames during which a subject expresses a particular emotion and performs a specific action with their hands and head. Additionally, to limit imbalance in the dataset samples, the project uses unsupervised learning to capture and investigate the micro gestures, enhancing the accessibility of emotion AI. By identifying the location of those gestures and classifying them into one of 32 classes of emotion and action, the researchers could create a holistic perspective of body language during interviews and press conferences of significant athletes.

Analyzing all the components of a successful video and how viewers react to a creator may be used to identify how viral a video might become (Biel et al., 2022). Numerous researchers have picked an industry or a situation in which someone shows emotion nonverbally, but Biel et al. (2022) take their analysis to a different dimension: that of social media. Their project examines social media, particularly vlogging, to identify the personality and general characteristics of vloggers, mainly relying on nonverbal cues and audiovisual analysis. The tool will significantly impact consumers because the personality impressions are crowdsourced and do not rely on a computer model. The model can make educated predictions regarding future content and viewer reactions by analyzing these cues.

Dael et al. (2012) take a more general approach. Instead of tackling the face and its movements, the group tackles body posture by implementing the Body Action and Posture (BAP) coding system to examine the types and frequency of various body movements. The project asked ten actors to recreate 12 emotions and investigated whether these patterns support emotion theory, bidimensional theory, and componential appraisal theory. The study revealed that body movement patterns occur when creating various emotions, leading to a clear cut path for further research and a potential AI model. The study also found that although a few emotions were expressed by just one pattern, the heavy majority of emotions were represented in various forms, allowing for a wide variety of emotion differentiation and potential research in the minutia of expression, primarily action readiness, and eagerness. Although much additional work is required in the field, using the frameworks listed in this group’s research will allow for a more calculated approach to experimentation and hypothesis testing.

Yin et al. (2008) take emotion detection to the next level through in-depth analysis in multimodal sensing. With most facial emotion analysis taking place with a static, 2D photograph, research leaves out many details from the classification process. In order to improve analysis results and produce a holistic view of the head, multimodal sensing and the BU-3DFE dataset produced from comprehensive data scans can be implemented, pioneering a wholly revamped study of emotion and expression in humans. The approach showed an 83% correct recognition rate in classifying six expressions (happiness, disgust, fear, anger, surprise, sadness). To reduce bias, the database also contains over 100 subjects ranging from 18 to 70 years old and from various backgrounds (White, Black, East-Asian, Middle-East Asian, Indian, Hispanic).

Figure 8: Sample multimodal sensing model for a man

On emotion detection, Benitez-Quiroz et al. (2016) propose a new algorithm to analyze and annotate millions of facial expressions. In order to accurately identify the face and the emotions it displays, there must be annotated databases. However, those databases take far too long to create manually. Their project, titled EmotioNet, identifies facial expressions by implementing a unique computer vision system that annotates a database of over 1 million images from the internet. Furthermore, the program downloads millions of images with emotional keywords and annotates them accordingly. This project will be instrumental in any analysis of facial emotions in the future because of its sheer scale and accuracy. By annotating images, users of their dataset can get a good view of the face and train models more accurately than ever before.

Along with EmotioNet, the ImageNet training set is a massive image database that researchers can use for object detection and image classification on an enormous scale. Krizhevsky et al. (2012) trained a convolutional neural network to classify 1.3 million images in the LSVRC-2010 ImageNet training data set into 1000 classes. They reported top-1 and top-5 error rates of 39.7% and 18.9%, better than other models publicized at the time. The network used 60 million parameters and 500,000 neurons with five convolutional layers, some of which were followed by max-pooling layers, and two globally connected layers with a final 1000-way softmax. They used non-saturating neurons and an efficient GPU implementation of the networks. To reduce overfitting, they used a regularization method. The whole system proved highly effective at classification and is one of many deep learning models with immense success at image classification, the results of which can extend to facial emotion and body language emotion.

Many researchers have written papers detailing their deep learning endeavors in body language prediction and the creation of vast datasets. However, Stanford University recently funded an Autism Glass project to see the effect of all this facial emotion analysis in the real world. Various researchers developed a gamified mobile application to improve emotion recognition and social awareness in autistic children. The project implemented Google Glass, a hardware tool that uses computer vision and other algorithms to create immersive user experiences. Parents would switch on the camera and point at various people, all displaying different emotions. The child would then guess which emotion was correct. If they were right, they received a prize and verbal appreciation. If they did not get the answer right, they kept trying until they did. Parents considered the project an immense success, leading the researchers toward more studies, development, and public outreach. Although more research is needed before researchers release a full-scale version in the mainstream market, initial success has led to a large market and target audience for the tool.

Figure 9: Autism Glass Project sample application screen

4. Methods and Analysis

We must understand the vanishing gradient problem to understand our use of a Residual Neural Network or ResNet instead of a classic deep learning network. As stated by Liu et al. (2021), deep neural networks achieve poor results or even training failure due to vanishing gradient and saddle point problems. Jha et al. (2021) elaborate on these issues by pointing out that advanced deep learning techniques were limited in their use because large datasets were relatively rare. However, with the increasing availability of these datasets, many people decided to implement deeper neural networks to boost their model performance. To their dismay, the performance worsened due to the vanishing gradient problem. As more layers join the system, gradients of the loss function approach zero, making the model increasingly difficult to train (Wang, 2019). We implemented a residual neural network to achieve our deep learning goals and maximize our accuracy. As Wen et al. (2019) state, a residual neural network or ResCNN solves the problems posed by general deep learning. Classic deep learning models face the vanishing gradient problem paired with backpropagation, but ResCNN skips several blocks of convolutional layers and implements shortcuts to overcome exploding gradients. Zaeemzadeh et al. (2021) discuss what makes a ResNet model so accurate. Networks of more than 1,000 layers saw significant gains in their overall accuracy through residual neural networks.

Furthermore, ResNet/ResCNN leads to the preservation of the norm of the gradient and stable backpropagation, optimizing accuracy. They analyze ResNet through skip connections and create new theoretical results on the advantages of those connections in a modern ResNet system. Although more research is continuously revising the highest-performing Residual Neural Networks, ResNet has opened a door for deep learning to become accurate and feasible for even the most extensive datasets.
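The way depth compounds the vanishing gradient problem, and the way skip connections counteract it, can be illustrated numerically. The sketch below is a scalar toy model under simplifying assumptions (every layer has the same small local derivative), not a measurement from our network; the function names are hypothetical.

```python
import math

def plain_chain_gradient(local_derivs):
    """Gradient through a stack of plain layers: the product of the
    per-layer derivatives (chain rule)."""
    return math.prod(local_derivs)

def residual_chain_gradient(local_derivs):
    """Gradient through a stack of residual layers y = x + f(x):
    each layer contributes (1 + f'(x)) instead of f'(x)."""
    return math.prod(1.0 + d for d in local_derivs)

if __name__ == "__main__":
    # Assumption for illustration: 50 layers, each with local derivative 0.01.
    derivs = [0.01] * 50
    print("plain   :", plain_chain_gradient(derivs))     # collapses toward zero
    print("residual:", residual_chain_gradient(derivs))  # stays on the order of 1
```

The identity path keeps the product from collapsing toward zero. It does not by itself bound the gradient from above, which is one reason residual networks are usually paired with normalization layers.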

Figure 10: Sample ResNet 50 architecture from Ji et al. (2019)

Figure 11: General CNN Architecture (O’Shea & Nash, 2015)

Because of the ResNet/ResCNN’s efficacy and efficiency, researchers have already implemented it in various endeavors. McAllister et al. (2018) combined supervised machine learning and deep residual neural networks to log food intake and maintain sustainable, healthy lifestyles for the obese, a population more likely to endure chronic conditions like diabetes, heart disease, sleep apnea, and cancer. Their research concluded that deep CNN’s were remarkably accurate and that the Food-101 dataset they had created would work in a wide variety of food image classification tasks. Panahi et al. (2022) took inspiration from a current event and implemented their version of a residual neural network on chest x-rays to identify COVID-19 patients. They substantiated a claim that machine learning models paired with radiography imagery can reliably predict lung-related diseases, from pneumonia to COVID-19.

With the benefits of ResNet in mind, we began exploring suitable images to train our model. The iMiGUE dataset developed by Liu et al. (2021) best suited our needs. iMiGUE, an acronym for “identity-free video dataset for micro gesture understanding and emotion analysis,” focuses on nonverbal gestures prompted by unintentional behaviors rather than gestures given purposefully as communication cues. Accurate emotional detection algorithms must pick up these unintentional actions to effectively gauge a person’s inner feelings and holistically identify emotions. Liu et al. (2021) provide more context to understand these two types of gestures better. Existing studies and datasets focus on illustrative gestures (like waving hands to symbolize leaving or greeting) rather than gestures like covering the head to indicate disappointment or sadness.

Additionally, similar studies ended up inadvertently causing participants in the studies to suppress their emotions, particularly negative ones, during their interactions with researchers, as many opted not to express genuine emotion because of existing social norms or a lack of comfort with the study. Furthermore, existing data provides a general overview of behaviors rather than an in-depth analysis of those behaviors to identify hidden emotions. iMiGUE provides revolutionary new data to combat existing data’s flaws.

Figure 12: List of the various classes of behaviors the iMiGUE dataset contains (Liu et al. 2021)

The figure above shows that the iMiGUE dataset covers over 350 tennis post-match interviews and labels them according to 32 classes. Each class demonstrates a different action that can correspond to one of a list of emotions: angry, fear, focused, happy, neutral, sad, surprised, and uncomfortable. See the figure below for an example.

Figure 13: A sample “angry” image from class 17

The figure above shows that a tennis player has crossed her hands, a clear symbol of class 17. We then use that information to classify the image as an example of anger, primarily because a critical characteristic of being upset or angry is a pair of visibly crossed hands.

However, the iMiGUE dataset is not an image dataset. Instead, it relies on video classification and analysis from YouTube post-match links, meaning that in order to fit the dataset for our image classification needs, we had to create a custom script to decode each frame of each video and use iMiGUE’s in-built frame labels to name each frame according to what class it symbolized. We implemented PyTube to download each video and immediately renamed it to its video id as stated by iMiGUE’s dataset. We then looked through iMiGUE’s frame database sheet and read each video id column, start frame, and end frame. The frames in between the start and end frames were the frames we would write to an emotion’s folder. We iterated through each set of frames and designated classes to create the 48,000 converted grayscale images we used to fit our model. The figures below show sample iMiGUE data.
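The labeling step described above can be sketched as follows. This is a minimal illustration, not our actual script: the column names, the file naming pattern, and the folder layout are hypothetical, the end frame is assumed to be inclusive, and the video download and frame decoding are left out. Only the pairing of class 17 with "angry" is taken from the text.

```python
import csv
import io
from pathlib import PurePosixPath

# Hypothetical lookup; only class 17 -> "angry" comes from the text.
CLASS_TO_EMOTION = {17: "angry"}

def plan_frame_outputs(label_rows, class_to_emotion, root="frames"):
    """Yield (video_id, frame_index, output_path) for every labeled frame.

    `label_rows` is an iterable of dicts with the assumed keys
    video_id, start_frame, end_frame, and class_id.
    """
    for row in label_rows:
        emotion = class_to_emotion.get(int(row["class_id"]))
        if emotion is None:
            continue  # no emotion assigned to this behavior class
        start, end = int(row["start_frame"]), int(row["end_frame"])
        for frame in range(start, end + 1):  # end frame assumed inclusive
            name = f"{row['video_id']}_{frame}.png"
            yield row["video_id"], frame, str(PurePosixPath(root) / emotion / name)

if __name__ == "__main__":
    sheet = io.StringIO(
        "video_id,start_frame,end_frame,class_id\n"
        "0001,250,252,17\n"
    )
    for video_id, frame, path in plan_frame_outputs(csv.DictReader(sheet), CLASS_TO_EMOTION):
        print(video_id, frame, path)
```

Separating the planning of output paths from the decoding of frames makes the labeling logic easy to check before any video is downloaded.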

Figure 14: Samples of the iMiGUE dataset

Our program labels each image through a particular naming convention. The frame number and video ID are displayed, all while the algorithm sorts the images into the emotion folders (happy, sad, angry, focused, uncomfortable, neutral, fear, and surprise.) See a snapshot of the data below.

Figure 15: The naming convention we used to store images and videos

With the data identified, we began the implementation of our model.

1. We first generated new image data through data augmentation. We ran the augmentation script on the training and validation data (splitting the data into 20% validation and 80% training). Each image was fed as a 100 by 100 NumPy array.
2. We then imported the model from TensorFlow. A detailed overview of our weights exists below.

Figure 16: Model summary of our ResNet50 architecture

3. After building the model, we compiled it using the Adam optimizer, a gradient-based optimization system based on adaptive estimation of lower-order moments per Kingma et al. (2014).
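Two of the steps above, the 80/20 split and the Adam update, can be sketched without any framework. This is an illustration of the ideas only: in the project the optimizer was presumably the framework's built-in one, the function names here are hypothetical, and the hyperparameter defaults are the ones suggested in the Adam paper.

```python
import math
import random

def train_val_split(items, val_fraction=0.2, seed=0):
    """Shuffle a copy of `items` and split it into (train, validation)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; `t` starts at 1."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

if __name__ == "__main__":
    train, val = train_val_split(range(100))
    print(len(train), len(val))

    # Minimize f(x) = x**2 (gradient 2x) as a toy objective.
    x, m, v = 1.0, 0.0, 0.0
    for t in range(1, 201):
        x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.1)
    print(x)
```

Dividing by the square root of the second-moment estimate gives each parameter its own effective step size, which is the adaptive behavior the optimizer is named for.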

Figure 17: A high-level diagram of our data collection.

5. Analysis and Results

Figure 18: Results with 10,000 images of data

Figure 19: Results with 49,000 images of data

We evaluated the model using accuracy, precision, recall, AUC, and F1 score over 10 epochs (as more epochs led to only minor improvement). After multiple trials, it appeared clear the model was overfitting, as validation loss increased as validation accuracy decreased. The model appeared to classify training data fairly accurately while struggling to identify emotions and correctly predict information from the validation data. A particular flaw was apparent from the beginning of the model-building and data collection. Even though the 32 classes effectively identified various behaviors a particular player exhibited during an interview, the mapping from those behaviors to emotions could not have been entirely accurate. The above example described crossing hands as a symbol of anger or unhappiness. However, it could also mean fear or even discomfort if the situation was in an unfavorable environment. Assigning an emotion to a behavior will not lead to entirely accurate predictions but rather an educated guess. A case study is examined below.
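For reference, the count-based metrics named above can be computed directly from true and predicted labels. The sketch below uses the standard definitions with one emotion treated as the positive class; the labels in the usage example are made up for illustration and are not our results. AUC is omitted because it needs predicted scores rather than hard labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 with `positive` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

if __name__ == "__main__":
    # Made-up labels for illustration only.
    y_true = ["sad", "sad", "angry", "fear"]
    y_pred = ["sad", "angry", "angry", "angry"]
    print(accuracy(y_true, y_pred))
    print(per_class_metrics(y_true, y_pred, positive="angry"))
```

A class that absorbs many wrong predictions, like "angry" in this toy example, shows high recall but low precision, a pattern that overall accuracy alone would hide.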

Figure 20: An image of a tennis player with interlocking fingers. Marked as “fearful.”

Figure 21: An image of a tennis player with interlocking fingers. Marked as “angry.”

The two preceding images are virtually identical. They were taken on different days at different zoom levels, but they capture the inherent flaw of this experiment: the lack of context. In both, a tennis player interlocks her fingers and stares straight into a reporter’s eyes with no emotion. However, one image is classified as “fearful” while the other is classified as “angry.” Body language is harder to identify and draw conclusions from than facial emotions. A happy face is characterized by a smile or similar action, while body language can represent a happy person in various ways. These ways often overlap with other emotions, leading to further confusion. This inability to distinguish one such image from another lowered the accuracy of the results, although the model still reached a peak validation accuracy of 87.7% and 87.8% for 10,000 and 49,000 images, respectively.
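One way to surface this kind of ambiguity before training is to look for behaviors that appear under more than one emotion. The sketch below is a hypothetical check, not part of our pipeline; the behavior names are illustrative, and the pairing of interlocking fingers with both labels mirrors the case study above.

```python
from collections import defaultdict

def ambiguous_behaviors(annotations):
    """Return behaviors that were assigned more than one emotion.

    `annotations` is an iterable of (behavior, emotion) pairs.
    """
    emotions_by_behavior = defaultdict(set)
    for behavior, emotion in annotations:
        emotions_by_behavior[behavior].add(emotion)
    return {behavior: sorted(emotions)
            for behavior, emotions in emotions_by_behavior.items()
            if len(emotions) > 1}

if __name__ == "__main__":
    samples = [
        ("interlocking_fingers", "fearful"),
        ("interlocking_fingers", "angry"),
        ("covering_face", "sad"),
    ]
    print(ambiguous_behaviors(samples))
```

Behaviors flagged this way are candidates for merging into a shared class or for relabeling with additional context.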

Furthermore, happy, fearful, uncomfortable, and focused images were more challenging to classify than surprise and sadness because they contained overlapping emotional gestures like interlocking fingers or crossed hands. In contrast, surprise and sadness often had clear signs like covering the face with hands. More data based on real-world scenarios will make this model run more effectively in scenarios besides a still frame, all while preventing overfitting and increasing the likelihood of commercial use.

6. Future Work

So far, this project offers limited insight into analyzing complex real-world scenarios, given that we only trained our model with still images from an unchanging setting. Tennis players giving post-match interviews usually sit at a podium rather than moving around or appearing in a natural setting. We will incorporate more data sources from various scenes and subjects to solve those issues and improve this project. The more data our model is trained on, the more extensive its prediction mechanisms can be. Additionally, our data had the flaw of not matching emotions to gestures accurately. Many behaviors corresponded to numerous emotions and required context to reach an effective conclusion. Therefore, combining facial emotions and body language can help provide a holistic view of a person’s emotional state, helping highly neurodivergent individuals learn how to identify those hidden states accurately rather than learn incorrect correlations between a micro gesture and the emotion it may express.

Furthermore, ResNet50 is undoubtedly not the only model we can implement for our classification needs. VGG-16, InceptionV3, and EfficientNet all have unique benefits that may allow them to outperform ResNet50. However, more experimentation is needed before we can reach a clear conclusion. On the topic of model architectures, we must discuss video classification. The iMiGUE dataset was a video dataset that required extensive image slicing per frame and tedious labeling. Instead of splitting the video into frames, we can experiment with analyzing the videos themselves and conducting video classification. Video classification may make emotion recognition more accurate because it provides context and the developments leading up to a potential micro gesture, including potential facial signals and emotional expressions.

After more testing and analysis, we plan to make this product a commercial tool by developing a web application. Like the Autism Glass Project, we plan on making this product a gamified application that allows highly neurodivergent individuals, particularly children, to guess which emotions their parents are making. By making the experience a fun and exciting activity, children are more likely to play the game and learn the emotions at hand, enhancing their understanding of social cues and better preparing them for communication in the real world. Gamification has proven effective in real-world settings, particularly in medical endeavors. Sardi et al. (2017) write that by applying game mechanisms to non-game contexts, individuals are more likely to experience cognitive and motivational benefits. Although more research is required to make gamification a long-lasting beneficial technique, it has yielded immense promise.

Figure 22: Autism Glass Project backend use; this example identifies the pupil (Wall-Lab, 2019)

7. Conclusion

Although researchers are making technological breakthroughs in autism, there is a clear need for a tool that can enhance the emotional development of younger individuals. There has been some progress in body language sensing and movement analysis, as shown by Schindler et al. (2008) and others, but not enough research has been done linking emotions to micro gestures. Autistic individuals lack emotional recognition capabilities and have a hard time analyzing social cues, which suggests that a tool that can take people’s movements and teach what emotions those movements indicate would be life-changing for users. Autistic children who learn emotion detection early will perform better in job interviews, relationship building, school, and numerous other situations previously hindered by autism’s characteristic symptoms (Morgan et al. 2014). Furthermore, this tool will be instrumental in treating autism in underserved communities. Rather than paying for extensive therapy that can exceed tens of thousands of dollars, autistic children can improve their social skills in the comfort of their families and home (Horlin et al. 2014).

We wanted our tool to be convenient for autistic users, so we focused on identifying common emotions in a casual, seated setting. For autistic individuals, identifying emotions relies heavily on facial signals and body language. Thus, we incorporated a developed and public ResNet50 model and trained it on over 49,000 images, each of which was a frame of over 490 post-match interviews. Our algorithm sorted each image into an associated folder, with each folder named after an emotion. To make our project feasible and practical, we picked eight common emotions: happy, angry, sad, surprised, focused, uncomfortable, neutral, and fearful. Our experiments proved that happy, fearful, uncomfortable, and focused images were more challenging to classify than surprise and sadness due to the overlapping nature of expression. Interlocking hands, for instance, may show fear, focus, or anger. Hence, the lack of new data with different behavioral expressions for each emotion tainted our results and lowered our accuracy. To solve our low accuracy issue, we are training our model on more data from different vantage points and settings, all while ensuring that the behaviors expressed by the subjects in the data are distinguishable.

The ultimate goal of our project, given resources, time, and additional knowledge, would be to build a production-ready application that implements a game experience for autistic children. Each child guesses emotions and earns points depending on their score, winning prizes along the way. The prizes, paired with the fun of guessing the emotions of a family member or friend, will motivate the children to keep building their knowledge base. Ultimately, our project attempts to contribute to the world and cultivate a positive community in which computer vision can effectively recognize emotions and improve the long-term and short-term lives of autistic children, opening doors for them that were previously immensely difficult to reach. By detecting emotions and teaching autistic children about social intelligence and the importance of communication, our project will help those children reach their fullest potential.

References

Ohl, A., Grice Sheff, M., Small, S., Nguyen, J., Paskor, K., & Zanjirian, A. (2017). Predictors of employment status among adults with Autism Spectrum Disorder. Work (Reading, Mass.), 56(2), 345–355. https://doi.org/10.3233/WOR-172492

Crow, S. J., Mitchell, J. E., Crosby, R. D., Swanson, S. A., Wonderlich, S., & Lancanster, K. (2009). The cost effectiveness of cognitive behavioral therapy for bulimia nervosa delivered via telemedicine versus face-to-face. Behaviour research and therapy, 47(6), 451–453. https://doi.org/10.1016/j.brat.2009.02.006

Landa R. J. (2018). Efficacy of early interventions for infants and young children with, and at risk for, autism spectrum disorders. International review of psychiatry (Abingdon, England), 30(1), 25–39. https://doi.org/10.1080/09540261.2018.1432574

Sambare, M. (2020). FER-2013, Version 1. Retrieved September 1, 2022 from https://www.kaggle.com/datasets/msambare/fer2013

Shakir, Y. (2021). Emotion Recognition with ResNet50, Version 1, Retrieved September 1, 2022 from https://www.kaggle.com/code/yasserhessein/emotion-recognition-with-resnet50

Amaral D. G. (2017). Examining the Causes of Autism. Cerebrum : the Dana forum on brain science, 2017, cer-01-17.

Ramaswami, G., & Geschwind, D. H. (2018). Genetics of autism spectrum disorder. Handbook of clinical neurology, 147, 321–329. https://doi.org/10.1016/B978-0-444-63233-3.00021-X

Faras, H., Al Ateeqi, N., & Tidmarsh, L. (2010). Autism spectrum disorders. Annals of Saudi medicine, 30(4), 295–300. https://doi.org/10.4103/0256-4947.65261

Choi, L., & An, J. Y. (2021). Genetic architecture of autism spectrum disorder: Lessons from large-scale genomic studies. Neuroscience and biobehavioral reviews, 128, 244–257. https://doi.org/10.1016/j.neubiorev.2021.06.028

Kim, N., Kim, K. H., Lim, W. J., Kim, J., Kim, S. A., & Yoo, H. J. (2020). Whole Exome Sequencing Identifies Novel De Novo Variants Interacting with Six Gene Networks in Autism Spectrum Disorder. Genes, 12(1), 1. https://doi.org/10.3390/genes12010001

Emberti Gialloreti, L., Mazzone, L., Benvenuto, A., Fasano, A., Alcon, A. G., Kraneveld, A., Moavero, R., Raz, R., Riccio, M. P., Siracusano, M., Zachor, D. A., Marini, M., & Curatolo, P. (2019). Risk and Protective Environmental Factors Associated with Autism Spectrum Disorder: Evidence-Based Principles and Recommendations. Journal of clinical medicine, 8(2), 217. https://doi.org/10.3390/jcm8020217

Xu, G., Strathearn, L., Liu, B., O’Brien, M., Kopelman, T. G., Zhu, J., Snetselaar, L. G., & Bao, W. (2019). Prevalence and Treatment Patterns of Autism Spectrum Disorder in the United States, 2016. JAMA pediatrics, 173(2), 153–159. https://doi.org/10.1001/jamapediatrics.2018.4208

Warren, Z., Veenstra-VanderWeele, J., Stone, W., Bruzek, J. L., Nahmias, A. S., Foss-Feig, J. H., Jerome, R. N., Krishnaswami, S., Sathe, N. A., Glasser, A. M., Surawicz, T., & McPheeters, M. L. (2011). Therapies for Children With Autism Spectrum Disorders. Agency for Healthcare Research and Quality (US).

Liu, M., Chen, L., Du, X., Jin, L., & Shang, M. (2021). Activated Gradients for Deep Neural Networks. IEEE transactions on neural networks and learning systems, PP, 10.1109/TNNLS.2021.3106044. Advance online publication. https://doi.org/10.1109/TNNLS.2021.3106044

Wen, L., Dong, Y., & Gao, L. (2019). A new ensemble residual convolutional neural network for remaining useful life estimation. Mathematical biosciences and engineering : MBE, 16(2), 862–880. https://doi.org/10.3934/mbe.2019040

Zaeemzadeh, A., Rahnavard, N., & Shah, M. (2021). Norm-Preservation: Why Residual Networks Can Become Extremely Deep?. IEEE transactions on pattern analysis and machine intelligence, 43(11), 3980–3990. https://doi.org/10.1109/TPAMI.2020.2990339

Wang, H., Li, K., & Xu, C. (2022). A New Generation of ResNet Model Based on Artificial Intelligence and Few Data Driven and Its Construction in Image Recognition Model. Computational intelligence and neuroscience, 2022, 5976155. https://doi.org/10.1155/2022/5976155

Tsachor, R. P., & Shafir, T. (2019). How Shall I Count the Ways? A Method for Quantifying the Qualitative Aspects of Unscripted Movement With Laban Movement Analysis. Frontiers in psychology, 10, 572. https://doi.org/10.3389/fpsyg.2019.00572

H. Chen, X. Liu, X. Li, H. Shi and G. Zhao, “Analyze Spontaneous Gestures for Emotional Stress State Recognition: A Micro-gesture Dataset and Analysis with Deep Learning,” 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), 2019, pp. 1-8, doi: 10.1109/FG.2019.8756513.

Luo, Y., Ye, J., Adams, R. B., Jr, Li, J., Newman, M. G., & Wang, J. Z. (2020). ARBEE: Towards Automated Recognition of Bodily Expression of Emotion in the Wild. International journal of computer vision, 128(1), 1–25. https://doi.org/10.1007/s11263-019-01215-y

Schindler, K., Van Gool, L., & de Gelder, B. (2008). Recognizing emotions expressed by body pose: a biologically inspired neural model. Neural networks : the official journal of the International Neural Network Society, 21(9), 1238–1246. https://doi.org/10.1016/j.neunet.2008.05.003

Aristidou, Andreas & Charalambous, Panayiotis & Chrysanthou, Yiorgos. (2015). Emotion Analysis and Classification: Understanding the Performers’ Emotions Using the LMA Entities. Computer Graphics Forum. 34. 10.1111/cgf.12598.

J. -I. Biel and D. Gatica-Perez, “The YouTube Lens: Crowdsourced Personality Impressions and Audiovisual Analysis of Vlogs,” in IEEE Transactions on Multimedia, vol. 15, no. 1, pp. 41-55, Jan. 2013, doi: 10.1109/TMM.2012.2225032.

Dael, N., Mortillaro, M., & Scherer, K. R. (2012). Emotion expression in body action and posture. Emotion (Washington, D.C.), 12(5), 1085–1101. https://doi.org/10.1037/a0025737

“A 3D Facial Expression Database For Facial Behavior Research” by Lijun Yin; Xiaozhou Wei; Yi Sun; Jun Wang; Matthew J. Rosato, 7th International Conference on Automatic Face and Gesture Recognition, 10-12 April 2006 P:211 – 216

Jha, D., Gupta, V., Ward, L., Yang, Z., Wolverton, C., Foster, I., Liao, W. K., Choudhary, A., & Agrawal, A. (2021). Enabling deeper learning on big data for materials informatics applications. Scientific reports, 11(1), 4244. https://doi.org/10.1038/s41598-021-83193-1

McAllister, P., Zheng, H., Bond, R., & Moorhead, A. (2018). Combining deep residual neural network features with supervised machine learning algorithms to classify diverse food image datasets. Computers in biology and medicine, 95, 217–233. https://doi.org/10.1016/j.compbiomed.2018.02.008

Panahi, A., Askari Moghadam, R., Akrami, M., & Madani, K. (2022). Deep Residual Neural Network for COVID-19 Detection from Chest X-ray Images. SN computer science, 3(2), 169. https://doi.org/10.1007/s42979-022-01067-3

Sardi, L., Idri, A., & Fernández-Alemán, J. L. (2017). A systematic review of gamification in e-Health. Journal of biomedical informatics, 71, 31–48. https://doi.org/10.1016/j.jbi.2017.05.011

Morgan, L., Leatzow, A., Clark, S., & Siller, M. (2014). Interview skills for adults with autism spectrum disorder: a pilot randomized controlled trial. Journal of autism and developmental disorders, 44(9), 2290–2300. https://doi.org/10.1007/s10803-014-2100-3

Horlin, C., Falkmer, M., Parsons, R., Albrecht, M. A., & Falkmer, T. (2014). The cost of autism spectrum disorders. PloS one, 9(9), e106552. https://doi.org/10.1371/journal.pone.0106552

Ji, Qingge & Huang, Jie & He, Wenjie & Sun, Yankui. (2019). Optimized Deep Convolutional Neural Networks for Identification of Macular Diseases from Optical Coherence Tomography Images. Algorithms. 12. 51. 10.3390/a12030051.

O’Shea, K., & Nash, R. (2015). An Introduction to Convolutional Neural Networks. https://arxiv.org/pdf/1511.08458.pdf

Liu, X., Shi, H., Chen, H., Yu, Z., Li, X., & Zhao, G. (n.d.). iMiGUE: An Identity-free Video Dataset for Micro-Gesture Understanding and Emotion Analysis. Retrieved September 8, 2022, from https://openaccess.thecvf.com/content/CVPR2021/papers/Liu_iMiGUE_An_Identity-Free_Video_Dataset_for_Micro-Gesture_Understanding_and_Emotion_CVPR_2021_paper.pdf

Kingma, D., & Lei Ba, J. (2017). ADAM: A METHOD FOR STOCHASTIC OPTIMIZATION. https://arxiv.org/pdf/1412.6980.pdf

The Wall Lab | Autism Therapy on Glass. (2019). Stanford.edu. https://wall-lab.stanford.edu/projects/autism-therapy-on-glass/

Fabian Benitez-Quiroz, C., Srinivasan, R., & Martinez, A. M. (2016). EmotioNet: An Accurate, Real-Time Algorithm for the Automatic Annotation of a Million Facial Expressions in the Wild. Www.cv-Foundation.org. https://www.cv-foundation.org/openaccess/content_cvpr_2016/html/Benitez-Quiroz_EmotioNet_An_Accurate_CVPR_2016_paper.html

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 25. https://proceedings.neurips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html

Chi-Feng Wang. (2019, January 8). The Vanishing Gradient Problem. Medium; Towards Data Science. https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484

Stanford University. (n.d.). Autism Glass Project. Autismglass.stanford.edu. Retrieved September 8, 2022, from https://autismglass.stanford.edu/

Ayan, E., & Ünver, H. M. (2018, April). Data augmentation importance for classification of skin lesions via deep learning. In 2018 Electric Electronics, Computer Science, Biomedical Engineerings’ Meeting (EBBT) (pp. 1-4). IEEE.

About the author

Kedaar Rentachintala

Kedaar is a senior at BASIS Independent Silicon Valley in San José, CA. He is interested in machine learning, data science, and artificial intelligence applications for those on the autism spectrum to improve their motor function, social awareness, and emotional intelligence.

The post Implementing a ResNet50 Architecture to Decrease Social Anxiety in Autistic Children by Detecting Emotions Portrayed Through Body Language Micro Gestures appeared first on Exploratio Journal.

Multi-Armed Bandit: Study of Exploration vs. Exploitation

Avi Shukla — Sun, 02 Oct 2022 12:44:00 +0000

Author: Avi Shukla
Mentor: Dr. Osman Yağan
Leland High School

Abstract

Imagine being given several slot machines [bandits], each rigged to provide a variable amount of money, and being asked to play a total of T times across all machines. The problem leaves one in a puzzling dilemma: people naturally want to test every available option to confidently identify the best bandit, yet they do not wish to lose potential profit by playing substandard bandits. This fundamental trade-off of exploration vs. exploitation is the essence of a widely relevant issue we dub the “multi-arm bandit problem.” Multi-armed bandits are powerful and dynamic algorithms used to make decisions under uncertainty. Bandit algorithms come in a wide variety, from which one picks based on the situation and the desired approach. Throughout this paper, we will explore the various algorithms one can utilize to address the multi-arm bandit problem and the potential benefits and drawbacks of each method. This paper will discuss the evolution of multiple bandit methods and algorithms in a simple, digestible, and educational format.

Keywords: Bandit, Multi-armed bandits, Stochastic Bandits, explore then commit, ETC, Upper Confidence Bound, UCB, UCB Minimax Optimality, MOSS, Adversarial Bandits, Contextual Bandits, machine learning

Introduction

A bandit problem is a sequential game played between the learner and the environment over many rounds. The positive natural number n, termed the horizon, denotes the number of rounds that will be played. The learner’s goal is to maximize cumulative rewards within the set horizon. The challenge is that the environment is hidden from the learner, who must rely on the history of previous actions and the rewards they produced. A policy is a mapping from these histories to actions, and it guides the learner’s interactions with the environment.

To gauge the quality of a policy, we can use regret, which measures the learner’s performance relative to the best policy in a competitor class and so gives tangible feedback about the learner. The main goal is simply to minimize regret as much as possible over all possible environments, resulting in a policy that performs nearly as well as the best policy in the competitor class. Whereas regret is accumulated over the rounds played, we can also measure the difference in mean reward between the best arm and an arm “i” through the sub-optimality gap, often represented with ∆.

As per regret theory, the optimization problem for the decision-maker is usually to minimize regret, which is different from maximizing utility. One strange aspect of this measure of “regret” is that it can be negative (i.e., you can gain more than expected). Hence, this measure of “regret” does not correspond to the usual definition in regret theory nor the expected regret in regret theory. It is a hybrid of these and does not obey the property of non-negativity.

Let R_{a,t} denote the return at time t for action a = 1, …, K

Let a = (a₁, …, a_t) denote the vector of actions, and

Let R_t = R_{a_t,t} denote the return of the chosen action at the time t

then the usual form for regret would be the random variable:
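As a hedged sketch, one form consistent with these definitions compares, at each time step, the best return available with the return actually obtained. The symbol T for the final time step and the exact shape of the sum are assumptions, since the definitions above do not fix them.

```latex
\mathrm{Regret}(\mathbf{a}) \;=\; \sum_{t=1}^{T} \left( \max_{a \in \{1,\dots,K\}} R_{a,t} \;-\; R_t \right)
```

Under this reading each summand is non-negative, which is exactly the property that the hybrid measure discussed above does not obey.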

There is a multitude of bandit algorithms, some of which are described below.

In stochastic bandits, the rewards of each arm are drawn independently and identically from a fixed probability distribution specific to that arm. Different stochastic bandit algorithms are explored below.

Explore then Commit Algorithm (ETC)

One of the simplest kinds of stochastic bandit algorithms is dubbed the “explore then commit” algorithm (ETC), also sometimes known as the greedy algorithm. The algorithm explores each of the k arms “m” times, for a total of mk exploration rounds, before committing to the arm with the highest empirical mean reward. The formal equation is as follows:

A_t = (t mod k) + 1, if t ≤ mk; or
A_t = argmax_i μ̂_i(mk), if t > mk

It is important to note that if “m” is too large, the policy explores for longer than it should, whereas if “m” is too small, the probability that the algorithm commits to the wrong arm grows. Choosing “m” is therefore integral to the algorithm’s performance, and it is also one of the algorithm’s most significant drawbacks, as “m” tends to be chosen blindly. Moreover, because “m” is fixed in advance and does not adapt to the observed rewards, finding a good value can be challenging.

A major benefit of ETC is the ease with which one can work with it, making it an ideal candidate when a simple algorithm is enough to get the job done. ETC generally behaves best with two arms (also known as A/B testing) but loses effectiveness as more arms are added. Another issue is that the algorithm is not anytime; i.e., it requires knowing the horizon “n.” However, this limitation can generally be overcome with a method known as the doubling trick, which restarts the algorithm over successively longer horizons so that the true horizon does not need to be known in advance.
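The explore-then-commit rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the `bernoulli_arm` helper, and the convention that each arm is a callable taking a random generator are all hypothetical, and rounds and arms are indexed from 0 rather than from 1 as in the text.

```python
import random


def explore_then_commit(arms, m, n, rng):
    """Run ETC for n rounds and return the list of arms played.

    arms: list of callables; arms[i](rng) returns one reward for arm i.
    m:    number of exploration pulls per arm (must be at least 1).
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    k = len(arms)
    counts = [0] * k      # pulls per arm
    totals = [0.0] * k    # summed reward per arm
    committed = None
    played = []
    for t in range(n):    # 0-based round index; the text uses 1-based t
        if t < m * k:
            arm = t % k   # round-robin exploration phase
        else:
            if committed is None:
                # Commit once, to the arm with the highest empirical mean.
                committed = max(range(k), key=lambda i: totals[i] / counts[i])
            arm = committed
        reward = arms[arm](rng)
        counts[arm] += 1
        totals[arm] += reward
        played.append(arm)
    return played


def bernoulli_arm(p):
    """Hypothetical helper: an arm paying 1 with probability p, else 0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    arms = [bernoulli_arm(0.4), bernoulli_arm(0.6)]
    played = explore_then_commit(arms, m=50, n=1000, rng=rng)
    print("arm committed to:", played[-1])
```

Rerunning the usage example with a very small m (say 2) and different seeds makes the drawback discussed above visible: the policy commits to the worse arm noticeably more often.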

The Optimistic-Greedy Algorithm

A simple way to modify the Greedy Algorithm, to make it explore the set of available actions in search of the optimal action, is to set the initial action estimates to very high values.

By allowing exploration time “m” to become a data-dependent variable, it is possible to substantially reduce the regret without knowledge of the sub-optimality gap. This method is known as successive elimination and works with increasingly sensitive hypothesis tests, which gradually eliminate arms. Specifically, it alternates between arms until some arm “a” is, with high probability, worse than another arm, at which point arm “a” is discarded and the selection of arms narrows.

The major drawback of the Optimistic-Greedy algorithm is that a poorly chosen initial value can result in a sub-optimal total return. Selecting the best initial values can be challenging without prior knowledge of the range of possible rewards.
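The effect of optimistic initial values can be sketched as follows. This is a hedged illustration: the function name and the arm-as-callable convention are hypothetical, and the update rule treats the initial value as a single prior pseudo-observation, which is one of several possible choices rather than a detail given in the text.

```python
import random


def optimistic_greedy(arms, n, initial_value, rng):
    """Always play the arm with the highest estimate, starting from high estimates.

    arms: list of callables; arms[i](rng) returns one reward for arm i.
    Returns the arms played and the final estimates.
    """
    k = len(arms)
    estimates = [float(initial_value)] * k
    counts = [0] * k
    played = []
    for _ in range(n):
        arm = max(range(k), key=lambda i: estimates[i])  # purely greedy choice
        reward = arms[arm](rng)
        counts[arm] += 1
        # Running mean that counts the initial value as one pseudo-observation
        # (an assumption of this sketch), so optimism fades as pulls accumulate.
        estimates[arm] += (reward - estimates[arm]) / (counts[arm] + 1)
        played.append(arm)
    return played, estimates


if __name__ == "__main__":
    rng = random.Random(0)
    arms = [lambda r: 0.2, lambda r: 0.9]  # deterministic rewards for clarity
    played, estimates = optimistic_greedy(arms, n=200, initial_value=5.0, rng=rng)
    print("pulls of the worse arm:", played.count(0))
```

Because every estimate starts above any achievable reward, each arm keeps being tried until its estimate falls below that of a better arm. With an initial value below the true rewards, the same code reduces to pure greedy and can lock onto the first arm it tries.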

The Epsilon-Greedy Algorithm (ε-Greedy)

A pure Greedy strategy has a very high risk of selecting a sub-optimal arm and sticking with it.

ETC algorithms also have a randomized relative known as the ε-Greedy algorithm. At round “t”, the algorithm either plays the empirically best arm with probability 1 − ε_t or explores by choosing an arm at random. The Epsilon-Greedy strategy is an easy way to add exploration to the basic Greedy algorithm. Due to the random sampling of actions, the estimated reward values of all actions will converge on their true values.

The downside of the ε-Greedy algorithm is that non-optimal actions continue to be chosen, and their reward estimates are refined long after they have been identified as non-optimal.
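A minimal sketch of the ε-Greedy rule follows. The function name, the arm-as-callable convention, and the choice to pass ε_t as a function of the round index are assumptions made for illustration; exploration is taken to be uniform over the arms.

```python
import random


def epsilon_greedy(arms, n, epsilon_at, rng):
    """Play n rounds of epsilon-greedy.

    arms:       list of callables; arms[i](rng) returns one reward for arm i.
    epsilon_at: callable mapping the 0-based round index t to epsilon_t.
    Returns the arms played and the final empirical means.
    """
    k = len(arms)
    counts = [0] * k
    means = [0.0] * k
    played = []
    for t in range(n):
        if rng.random() < epsilon_at(t):
            arm = rng.randrange(k)                       # explore uniformly
        else:
            arm = max(range(k), key=lambda i: means[i])  # exploit best estimate
        reward = arms[arm](rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        played.append(arm)
    return played, means


if __name__ == "__main__":
    arms = [lambda r: 0.2, lambda r: 0.9]  # deterministic rewards for clarity
    greedy, _ = epsilon_greedy(arms, 200, lambda t: 0.0, random.Random(0))
    mixed, _ = epsilon_greedy(arms, 2000, lambda t: 0.1, random.Random(0))
    print("pure greedy pulls of the better arm:", greedy.count(1))
    print("epsilon = 0.1 pulls of the better arm:", mixed.count(1))
```

With ε fixed at 0 the sketch never leaves the first arm it tries, which is the failure mode of pure Greedy noted above. With a constant ε = 0.1 it finds the better arm but keeps spending roughly a tenth of its rounds on random exploration, which is the downside just described.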

The Upper Confidence Bound (UCB) Algorithm

The ETC algorithm and its variations do not fare well as a general-purpose solution. The pure Greedy algorithm does not fare much better than a simple random selection method. However, with some slight modifications, such as Optimistic-Greedy’s use of large initial values or Epsilon Greedy’s approach of introducing random exploration, the selection performance can be significantly improved. Optimistic-Greedy’s performance depends on the values selected for its initial rewards. Epsilon Greedy continues to explore the set of all actions long after it has gained sufficient knowledge to know which actions are sub-optimal.

To improve the performance, what if we could take ETC but make the algorithm more dynamic? The answer to this question comes with a new algorithm known as the Upper Confidence Bound bandit, or UCB. The algorithm is called UCB because we are only concerned with the upper bound, given that we are trying to find the arm with the highest reward rate. UCB has several advantages over ETC: it does not depend on advance knowledge of the sub-optimality gap, and it behaves well with more than two arms. The version presented here still depends on the horizon “n,” but this dependence can also be eliminated with the aforementioned doubling trick.

The upper confidence bound on the unknown mean of the i-th arm can be defined as

UCB_i(t − 1, δ) = ∞, if T_i(t − 1) = 0

UCB_i(t − 1, δ) = μ̂_i(t − 1) + √(2 log(1/δ) / T_i(t − 1)), otherwise

Here δ is the confidence level, the rewards are assumed to be subgaussian random variables, and t is the round number: before round t, the learner has observed T_i(t − 1) samples from arm i and received rewards from that arm with an empirical mean of μ̂_i(t − 1).

Something to note about UCB is that it functions on the principle of optimism under uncertainty, which states that one should act as if the environment is as nice as possible. It achieves optimism by adding the exploration bonus √(2 log(1/δ) / T_i(t − 1)), which, unlike ETC, allows underrepresented arms to be favored so they don’t fall behind.

The algorithm for UCB(δ) can be given as

UCB functions as an index algorithm, meaning it tries to maximize a quality metric called the index, which can be used to compare different arms. In the case of UCB, this index is the sum of the empirical mean of the rewards experienced and the aforementioned exploration bonus, also called the confidence width.

Note that choosing the correct value of the confidence level “δ” is not easy. It must be small enough to ensure optimism with high probability, yet not so small that suboptimal arms are explored excessively.
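The index and the resulting policy can be sketched directly from the definition of UCB_i(t − 1, δ) given earlier. The function names and the arm-as-callable convention are hypothetical, and the value of δ in the usage example is an arbitrary illustrative choice rather than a recommendation.

```python
import math
import random


def ucb_index(mean, pulls, delta):
    """Upper confidence bound: empirical mean plus exploration bonus."""
    if pulls == 0:
        return math.inf  # unplayed arms are tried first
    return mean + math.sqrt(2.0 * math.log(1.0 / delta) / pulls)


def ucb(arms, n, delta, rng):
    """Play n rounds, always choosing the arm with the largest index.

    arms: list of callables; arms[i](rng) returns one reward for arm i.
    Returns the arms played and the number of pulls per arm.
    """
    k = len(arms)
    counts = [0] * k
    means = [0.0] * k
    played = []
    for _ in range(n):
        arm = max(range(k), key=lambda i: ucb_index(means[i], counts[i], delta))
        reward = arms[arm](rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        played.append(arm)
    return played, counts


if __name__ == "__main__":
    rng = random.Random(0)
    arms = [lambda r: 1.0 if r.random() < 0.4 else 0.0,
            lambda r: 1.0 if r.random() < 0.6 else 0.0]
    _, counts = ucb(arms, n=2000, delta=1e-6, rng=rng)
    print("pulls per arm:", counts)
```

Shrinking δ enlarges the bonus for every arm, so the policy stays optimistic for longer and spends more rounds on arms that are in fact suboptimal; enlarging δ has the opposite effect.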

If ETC is tuned with the optimal choice of commitment time for each value of ∆, it can outperform the parameter-free UCB; otherwise, it will not. We will next explore a variant of UCB that outperforms even optimally tuned ETC.

UCB Algorithm: Asymptotic Optimality

This algorithm is asymptotically optimal, meaning that no algorithm can perform better than it in the limit as the horizon n goes to infinity. Put differently, if the algorithm is to be used for a long time, then the algorithm presented next is optimal. In the UCB described earlier, choosing the right value of the confidence level “δ” is not easy, and if it is not chosen well, the results are suboptimal. Asymptotically optimal UCB can be described by

The regret bound for the above algorithm is much more complicated than the one obtained by choosing a static value of δ for the UCB algorithm. The important thing is that the dominant terms have the same order, but the constant multiplying the dominant term is smaller. The significance of this is that the long-term behavior of the algorithm is controlled by this constant. The worst-case regret for the above algorithm is

R_n = O(√(kn log(n)))
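For reference, a common way to write the arm-selection rule of an asymptotically optimal UCB policy is sketched below. The exploration function f(t) = 1 + t log²(t) is an assumption borrowed from standard textbook treatments, not a detail stated in the text above; it plays the role of 1/δ in the earlier index, so the confidence level now shrinks as the round number grows.

```latex
A_t \;=\; \operatorname*{arg\,max}_{i \in \{1,\dots,k\}}
\left( \hat{\mu}_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}} \right),
\qquad f(t) = 1 + t \log^2(t)
```

Here μ̂_i(t − 1) and T_i(t − 1) have the same meaning as in the UCB index: the empirical mean of arm i and the number of times it has been played before round t.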

UCB Algorithm: Minimax Optimality (MOSS)

The UCB Minimax Optimality (MOSS) algorithm is a variant of UCB designed to be minimax optimal. The worst-case regret of the preceding algorithm is R_n = O(√(kn log(n))). By modifying the confidence levels of the algorithm, it is possible to remove the log factor entirely. Building on UCB, the directly named ‘minimax optimal strategy in the stochastic case’ (MOSS) algorithm was the first to make this modification. MOSS again depends on prior knowledge of the horizon, although this requirement can be relaxed. The term minimax is used because, except for constant factors, the worst-case bound for MOSS cannot be improved on by any algorithm.

One of the algorithm’s main novelties is that its confidence level is chosen based on the number of plays of each individual arm, as well as on “n” and “k.” The significance of this algorithm is that, unlike UCB, it tries to optimize the worst-case regret.

The UCB MOSS algorithm can be given as the following algorithm.

While being nearly asymptotically optimal is a huge advantage, the algorithm also has two significant drawbacks. First, good asymptotic behavior does not guarantee good finite-time regret: for a shorter horizon, the algorithm can perform worse than UCB, matching it only as the horizon moves toward infinity. Second, the algorithm pushes the expected regret down so hard that the distribution of the regret becomes unstable and much less well-behaved.
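The way MOSS ties the confidence level to the number of plays of each arm, as well as to n and k, can be illustrated with its index. The sketch below uses the form commonly given in the literature, which is an assumption here since the text does not spell it out; the function name is hypothetical.

```python
import math


def moss_index(mean, pulls, n, k):
    """MOSS-style index for one arm (textbook form, assumed for illustration).

    mean:  empirical mean reward of the arm so far.
    pulls: number of times the arm has been played.
    n:     horizon; k: number of arms.
    """
    if pulls == 0:
        return math.inf  # unplayed arms are tried first
    # log-plus: the bonus is switched off once the arm has had at least
    # its "fair share" n / k of the plays.
    log_plus = math.log(max(1.0, n / (k * pulls)))
    return mean + math.sqrt(4.0 * log_plus / pulls)


if __name__ == "__main__":
    n, k = 10_000, 10
    for pulls in (1, 10, 100, 1000, 5000):
        print(pulls, round(moss_index(0.5, pulls, n, k), 4))
```

The printed values show the exploration bonus shrinking with the number of pulls and vanishing entirely once an arm has been played n / k times, which is how the extra log factor in the worst-case regret is avoided.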

UCB: Bernoulli Noise (KL-UCB)

In previous algorithms, we assumed that the noise of the rewards was σ-subgaussian for some known σ > 0. This has the advantage of simplicity and relative generality, but stronger assumptions are sometimes justified and often lead to more robust results. Suppose the rewards are assumed to be Bernoulli, which means that X_t ∈ {0, 1}. This is a fundamental setting found in many applications. For example, in click-through prediction, the user either clicks on the link or not. A Bernoulli bandit is characterized by the mean pay-off vector μ ∈ [0, 1]^k, and the reward observed in round t is X_t ∼ B(μ_{A_t}).

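The difference between KL-UCB and UCB lies in the bound used to define the upper confidence bound: KL-UCB uses a tighter bound based on the Kullback–Leibler (KL) divergence between Bernoulli distributions. The algorithm is given as follows: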

This algorithm works best when rewards take only the two values 0 and 1, or when the mean rewards are close to 0 or 1. It could be used for many practical applications that need a binary selection; as the range of possible reward values grows, it becomes less and less suitable.
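A KL-based upper confidence bound for Bernoulli rewards can be computed by bisection, as in the hedged sketch below. The function names are hypothetical, and the exploration function log(1 + t log²(t)) is an assumption taken from common textbook presentations rather than from the text above.

```python
import math


def bernoulli_kl(p, q):
    """Relative entropy between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_ucb_index(mean, pulls, t, tol=1e-9):
    """Largest q in [mean, 1] whose KL distance from the empirical mean fits the budget.

    mean:  empirical mean of the arm; pulls: times it was played; t: round number (>= 1).
    """
    if pulls == 0:
        return 1.0  # no data yet: the most optimistic Bernoulli mean
    budget = math.log(1.0 + t * math.log(t) ** 2) / pulls
    lo, hi = mean, 1.0
    # The divergence grows as q moves above the mean, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    for pulls in (10, 100, 1000):
        print(pulls, round(kl_ucb_index(0.9, pulls, t=1000), 4))
```

Because the index can never exceed 1, the bound is tighter than the subgaussian bonus for arms whose empirical means are close to 0 or 1, which is the regime where KL-UCB gains the most.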

Adversarial Bandits

Adversarial bandits force the user to drop all preexisting notions of how rewards are generated, except that they lie in a bounded set and are chosen without knowing the learner’s actions. This forces us to redefine our existing idea of regret from performing worse than an optimal policy to simply performing worse than the best of a set of constant policies. Unlike with stochastic bandits, there is no fixed distribution to learn, as we can no longer assume that rewards follow a fixed distribution or that there is any specific rule for how they change. While harder to work with, the benefit of such a setting is that algorithms will generally be more robust than their stochastic counterparts.

Adversarial Bandits – The Exp3 Algorithm

The EXPonential weight algorithm for EXPloration and EXPloitation (EXP3) is one such algorithm that functions in the adversarial setting. The EXP3 algorithm functions by utilizing exponential weighting, where it maintains a set of weights for each action to decide the next step randomly. Based on whether the resulting payoff is good or bad, these weights will either increase or decrease in value. The formal equations are as follows:

It is important to note that η is the learning rate. When the learning rate is large, P_t concentrates on the arm with the largest estimated reward, and the algorithm exploits aggressively. As P_t concentrates, the variance of the importance-weighted reward estimates for poorly performing arms increases dramatically, making those estimates unreliable. Conversely, P_t is more uniform for small learning rates, and the algorithm explores more frequently.
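The exponential-weighting scheme can be sketched as follows, assuming rewards in [0, 1]. The function names and the `reward_fn(t, arm)` interface are hypothetical, and the loss-based importance-weighted estimator used here is one standard variant, assumed for illustration rather than taken from the text.

```python
import math
import random


def exp3_probabilities(scores, eta):
    """Softmax of eta * scores, shifted by the maximum for numerical stability."""
    top = max(scores)
    weights = [math.exp(eta * (s - top)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def exp3(reward_fn, k, n, eta, rng):
    """Play n rounds of an EXP3-style policy on k arms.

    reward_fn(t, arm) returns the reward in [0, 1] of playing `arm` in round t.
    Returns the arms played and the final sampling distribution.
    """
    scores = [0.0] * k  # cumulative estimated rewards
    played = []
    for t in range(n):
        probs = exp3_probabilities(scores, eta)
        arm = rng.choices(range(k), weights=probs)[0]
        reward = reward_fn(t, arm)
        # Importance-weighted estimate: every arm is credited 1, and the
        # played arm is charged its loss divided by its sampling probability.
        for i in range(k):
            scores[i] += 1.0
        scores[arm] -= (1.0 - reward) / probs[arm]
        played.append(arm)
    return played, exp3_probabilities(scores, eta)


if __name__ == "__main__":
    rng = random.Random(0)
    played, probs = exp3(lambda t, arm: 1.0 if arm == 1 else 0.0,
                         k=3, n=2000, eta=0.05, rng=rng)
    print("final sampling distribution:", [round(p, 3) for p in probs])
```

Dividing by `probs[arm]` is what makes the estimate unbiased, and it is also the source of the variance problem described above: a rarely sampled arm receives a very large correction on the few rounds in which it is played.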

Adversarial Bandits – The Exp3-IX Algorithm

The objective of EXP3-IX is to modify EXP3 so that the regret stays small in expectation and is simultaneously well concentrated about its mean. Such results are called high-probability bounds. The poor behavior of EXP3 occurs because the variance of the importance-weighted estimators can become very large.

EXP3-IX, where IX stands for implicit exploration, adds a chosen constant γ to the divisor of the importance-weighted estimator, which smooths its variance and hence yields a better estimator. γ could be constant or calculated after each arm is played.
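In symbols, the modification can be sketched as below. The notation is an assumption made for illustration: Y_t = 1 − X_t is the loss of the arm played in round t, P_{ti} is the probability of playing arm i in round t, and Ŷ_{ti} is the resulting loss estimate for arm i.

```latex
\hat{Y}_{ti} \;=\; \frac{\mathbb{1}\{A_t = i\}\, Y_t}{P_{ti} + \gamma},
\qquad \gamma > 0
```

Setting γ = 0 recovers the unbiased estimator used by EXP3. A positive γ caps the estimate at 1/γ, trading a small downward bias in the loss estimates for a much lower variance.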

Contextual Bandits

The algorithms introduced so far work well in stationary environments with only a few actions. Unfortunately, real-world problems are seldom this simple. For example, a bandit algorithm designed for targeted advertising may have thousands of actions. Even more troubling, the algorithm can access contextual information about the user and the advertisement. Ignoring this information would make the problem highly non-stationary, but the earlier algorithms cannot use these contexts.

While everything previously discussed can be helpful, when moving into the algorithms implemented into real-life scenarios, many more external factors termed “contexts” were previously not present. This is additional information that may help predict the quality of action. When including contexts in the algorithms, we bias our regret to align with the expert opinion. This context weight for each arm can dynamically change with the determination of each trial.

These contexts can directly influence any of the algorithms discussed earlier to get better results. It is vital to have the proper context. Else, we may end up getting the wrong conclusions. For example, if somebody is trying to choose a movie rating from Russia, we can conclude that they would prefer a Russian movie, and we can give higher weight to Russian-language movies. But an American traveling to Russia might get recommendations for Russian-language movies unless we keep the contexts that he is American, which overweighs the other recommendation.

Contextual Bandits – Bandits with Expert Advice

When the context set C is extensive, using one bandit algorithm per context will almost always be a poor choice because the additional precision is wasted unless the amount of data is enormous. Fortunately, however, it is seldom the case that the context set is both large and unstructured.

For example, a person’s demographics might be used to partition the large context set into a smaller number of groups, each of which is then treated as a single context. Another option is to use a similarity function, so that what is learned in one context informs predictions in similar contexts. Yet another option is to run a supervised learning method, training on a batch of data to find better predictions.

Contextual Bandits – Exp4

The number “4” in Exp4 is not just an increased version number; rather, it indicates the four e’s in the official name of the algorithm (exponential weighting for exploration and exploitation with experts). The idea of the algorithm is straightforward. Since exponential weighting worked so well in the standard bandit problem, we aim to adapt it to the problem at hand. However, since the goal is to compete with the best expert in hindsight, it is not the actions we will score but the experts. Exp4 thus maintains a probability distribution Q_t over experts and uses it to come up with the next action in the obvious way: it first chooses an expert M_t at random from Q_t and then follows the chosen expert’s advice to select A_t ∼ E^(t)_{M_t}, where E^(t) is the matrix whose rows are the experts’ action distributions in round t. This is the same as sampling A_t from P_t = Q_t E^(t), where Q_t is treated as a row vector. Once the action is chosen, one can use one’s favorite reward estimation procedure to estimate the rewards of all the actions. These estimates are then used to estimate how much total reward each individual expert would have made so far, and Q_t is updated from these totals using exponential weighting.
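One round of this update can be sketched as follows. This is a minimal illustration that assumes rewards in [0, 1] and the same importance-weighted reward estimate used for EXP3, with an optional γ in the divisor. The function name `exp4_round` and the two-expert example are hypothetical.

```python
import numpy as np

def exp4_round(q, advice, arm, reward, eta, gamma=0.0):
    """One Exp4 update given the arm played and its reward in [0, 1].

    q:      distribution over experts, shape (M,)
    advice: matrix E^(t), shape (M, k); row m is expert m's
            distribution over the k actions
    """
    p = q @ advice                      # P_t = Q_t E^(t)
    # Estimate the reward of every action from the one observed reward.
    x_hat = np.ones(advice.shape[1])
    x_hat[arm] -= (1.0 - reward) / (p[arm] + gamma)
    expert_reward = advice @ x_hat      # estimated reward of each expert
    new_q = q * np.exp(eta * expert_reward)
    return new_q / new_q.sum()

# Two experts, each fully committed to a different action.
q = np.array([0.5, 0.5])
advice = np.array([[1.0, 0.0], [0.0, 1.0]])
# Action 0 was played and paid nothing, so expert 0 should lose weight.
new_q = exp4_round(q, advice, arm=0, reward=0.0, eta=0.5)
```

In a full implementation the arm would be sampled from `p` in each round; it is passed in here so that the update itself is easy to inspect.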

Real Life Uses

Bandit algorithms have a wide variety of use cases. One of the most basic applications is A/B testing, which functions like the ETC algorithm but with only two arms. Another common application is finding the best route, whether in network routing or in transportation systems. Bandits are also used for dynamic pricing, where the algorithm adjusts prices by classifying them as either “too high” or “too low” until it finds an optimal price range.

Contextual bandit algorithms are commonly used in advert placement, where contexts are used to decide what a user is interested in. For example, when a user has recently searched for pet food, one might want to recommend pet-related items. This kind of advertising, based directly on a user’s history, is known as targeted advertising in the marketing world and is common practice at many companies. However, with real-world problems like advertising, finding the best result is not as simple as “choose an algorithm and use it.” Real-world scenarios involve many more considerations than maximizing clicks, such as freshness, fairness, and satisfaction, to name a few. These extra factors make it challenging to deploy bandits in the real world, as they often require a lot of extra work to address.

Another common use of contextual bandit algorithms is in recommendation systems, such as those used by companies like Netflix for personalized recommendations. Building such a system can be challenging because the set of all arms is extremely large, while the movies a user picks to watch are relatively few. The reward for a movie will typically be calculated by combining (1) the movie watch time and (2) the movie rating. This kind of reward measurement can lead to another problem, where the algorithm recommends only certain movies. Only those movies then accumulate data, while those with no data are left out of the loop. There are ways around this, such as adding a “new/upcoming series” section to promote shows and movies with less data. Implementing a contextual bandit algorithm for the real world requires much more consideration and creative usage than stochastic bandits, as the problems faced are inherently more complex and multi-layered.
Below are provided links to examples of places bandit algorithms can be found.

Google Analytics
Washington Post
Netflix presentation on their bandit algorithms
Network routing with BGP
StitchFix Experimentation Platform

Conclusion

Multi-armed bandits have been studied for nearly a century. New and more robust algorithms are constantly introduced, typically as combinations or variations of the algorithms discussed here. In practice, applications often combine several of these algorithms with other reinforcement learning methods. None of the above algorithms is suited to all applications, and careful consideration of the specific application and the available data is needed to decide which algorithm best fits your use. For further research, one can turn to GitHub, which hosts many implementations of the algorithms discussed in this paper.

References

Lattimore, Tor, and Szepesvári, Csaba, Bandit Algorithms (1st edition), Cambridge University Press (

Slivkins, Aleksandrs: Introduction to Multi-Armed Bandits, Foundations and Trends in Machine Learning (2019)

Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)

Watkins, C.: Learning from Delayed Rewards. Ph.D. thesis, University of Cambridge, Cambridge, England (1989)

Zhao, Qing, and Srikant, R.: Multi-Armed Bandits: Theory and Applications to Online Learning in Networks (Synthesis Lectures on Communication Networks) (2019)

About the author

Avi Shukla

Avi is a senior at Leland High School, San Jose, CA. He is interested in Data Science, machine learning, and artificial intelligence. Avi is fascinated by how neural networks can emulate the thinking and learning process of the human mind. He loves reading all kinds of books, and his current interests include reading and learning human psychology. Avi has been trained in classical piano since childhood and loves arranging music. He loves to go on long biking trips and hikes whenever possible.

The post Multi-Armed Bandit: Study of Exploration vs. Exploitation appeared first on Exploratio Journal.

Data Science Analysis of Stroke Prediction

Charisse Yeung — Sun, 11 Sep 2022 16:35:22 +0000

Author: Charisse Yeung
Mentor: Dr. Gino Del Ferraro
Carlmont High School

1. Introduction

Today’s market is constantly being reshaped by the rising popularity of AI and Machine Learning. Data science uses these technologies to solve modern problems and to link similar data for future use. It is used extensively in numerous domains, such as marketing, healthcare, finance, banking, and policy. For my research project, I applied data science to healthcare, specifically to stroke prediction. Stroke is the fifth leading cause of death in the United States and a leading cause of severe long-term disability worldwide. Given its costly treatment and prolonged effects, prevention efforts and the early identification of stroke risk would benefit a significant part of the population, especially the disadvantaged. My goal is to help society use technology to predict strokes. The paper is structured as follows: Section 2 introduces the cause and problem of stroke in the US population; Section 3 discusses the steps of a data science project; Section 4 introduces Machine Learning as a tool to make predictions; finally, Section 5 applies all these analyses to a data set of stroke patients to make predictions.

2. Stroke Prediction

Every year, about 800,000 people in the United States are directly affected by stroke. The two major types of stroke are ischemic and hemorrhagic (Figure 2.1). Ischemic stroke results from a blocked artery that cuts off blood flow to an area of the brain. North African, Middle Eastern, sub-Saharan African, North American, and Southeast Asian countries have the highest rates of ischemic stroke. Hemorrhagic stroke results from a broken or leaking blood vessel that spills blood into the brain.

Figure 2.1 Ischemic vs. Hemorrhagic stroke

In both cases, the brain does not receive enough oxygen and nutrients, and brain cells begin to die. Risk factors for stroke are old age, overweight, physical inactivity, heavy alcohol consumption, drug consumption, smoking, hypertension, diabetes, and heart disease (Figure 2.2). One in 3 American adults has at least one of these conditions or habits: high blood pressure, high cholesterol, smoking, obesity, and diabetes. In my project, I investigated risk factors in stroke patients to find a correlation and make stroke predictions. Furthermore, I chose to focus my research on American patients since stroke risk factors are much more prevalent in the United States than in other countries.

Figure 2.2 Risk factors of stroke

3. Process of a Data Science Project

In problem-solving, one must follow a particular series of steps and a deliberate plan to reach a resolution. The same technique applies to a data science project. A dataset isn’t enough to solve a problem; one needs an approach or a method that will give the most accurate results. A data science process is a guideline defining how to execute a project. The general steps in the data science process include: defining the topic of research, obtaining the data, organizing the data, exploring the data, modeling the data, and finally communicating the results.

Before starting any data science project, the research topic must be defined. It is critical to brainstorm numerous relevant research ideas and then narrow the focus to one worth pursuing. Relevance helps both the data scientist and the reader develop confidence in the investigation’s findings and outcome. Relevant research topics can be social, economic, intellectual, environmental, etc., as long as they are up to date. For example, gun control would be a relevant social issue for research, and stroke prediction would be a relevant medical research idea. To gain deeper insight, one should research the topic thoroughly, for example by reading articles on the internet or talking to an expert. Once a solid understanding has been developed, there should be a general idea of the ultimate purpose and goal of the project. One should ask: “What problem am I trying to solve?” In my case, the problem is the large number of deaths caused by stroke each year in the US. The purpose of this project is to use data science to make stroke predictions and thereby limit the effects of stroke on the population by identifying its early stages through factors correlated with stroke. Understanding and framing the problem will help build an effective model that has a positive impact.

3.1 Data Acquisition

Next, one must find the data to be analyzed in the project. When searching for data, one should look for high-quality, targeted datasets. Not only the research topic but also the data must be relevant. Data from different sources can be extracted and sorted into categories to form a particular dataset. This process is also known as data scraping. One can find sources on the internet from research centers, government organizations, and websites aimed at data scientists, such as Kaggle (Figure 3.1). The data must be accessible, and the most convenient formats for data science are CSV, JSON, and Excel files. Once the datasets have been downloaded, they must be imported into an environment that can read data from these sources directly into data science programs. In most cases, data scientists import the data into the Python or R programming languages. In my case, I downloaded from Kaggle a CSV file of stroke data consisting of US patients and their conditions, and then I imported the data into a Jupyter Notebook in Python.

Figure 3.1 Stroke patient data downloaded from Kaggle

3.2 Data Cleaning

The data acquired and imported is not perfect on its own. Thus, the data must be organized and “cleaned” to ensure the best quality. Duplicate and unnecessary data are removed, and missing data are replaced. Unnecessary data could be infinities, outliers, or data that does not belong in the sample. For my project on stroke predictions, I removed particular patients from the set if their BMI was infinite (Figure 3.2) or if they lived outside the United States, since the study is limited to US patients.

Figure 3.2 Example of infinities in a data set

Some irrelevant data are less obvious and can only be found by analyzing the correlation between a parameter and the target. If the correlation is very low, the parameter is irrelevant and should be removed. If a value is missing from the dataset, either locate the correct value and fill it in, or delete the patient from the dataset. The data is then consolidated by splitting, merging, and extracting columns to organize it and make it more efficient to work with. The efficiency and accuracy of the analysis depend considerably on the quality of the data, especially when it is used for making predictions.

3.3 Data Exploration

A critical part of exploring and analyzing the data is finding correlations, as mentioned earlier. Different types of data, such as numerical, categorical, and ordinal, require different treatments. Numerical data is a measurement or a count. Categorical data is a characteristic such as a person’s gender, marital status, hometown, or the types of movies they like. Categorical data can take numerical values, such as “0” indicating no and “1” indicating yes, but those numbers don’t have mathematical meaning. In my case, I used numerical data (age, average glucose level, and BMI) and categorical data (gender, hypertension, heart disease, marriage status, work, residence, smoking status, and stroke). I detected patterns and trends in the data using visualization features in Python with Numpy, Matplotlib, Pandas, and Scipy. With Numpy and Matplotlib, I could plot linear regressions, bar charts, and a heat map of the correlations between selected parameters and the target. Using insights gained by observing the visualizations and finding correlations, one can start to make conjectures about the problem being solved. This step is crucial for data modeling.

3.4 Data Modeling

Data modeling is the climax of the data science process. The pre-processed data is used to build the model, that is, to train algorithms and perform a multi-component analysis. At this stage, a model is created to reach the goal and solve the problem. In my case, I used a Machine Learning algorithm as the model, which can be trained and tested using the dataset. Machine Learning is the use and development of computer systems that can learn and adapt without following explicit instructions by using algorithms and statistical models to analyze and draw inferences from patterns in data. The first step in data modeling with Machine Learning is data splitting (Figure 3.3), where the entire data set is divided into two parts: training data and testing data. Generally, data scientists use 80% of their data for training and the remaining 20% for testing. The Machine Learning model is fed the training input data in order to train the model. The data is tagged according to the defined criteria so that the Machine Learning model can produce the desired output. During this operation, the model recognizes the patterns between the parameters and the target of the training data. Algorithms are trained to associate certain features with tags based on manually tagged samples and then learn to make predictions when processing unseen data. The model is then tested for accuracy with the remaining 20% of the data. Since the correct labels for the individuals in the testing set are already known, running the model on the testing data shows whether its predictions are accurate.

Figure 3.3 Diagram of the Training-Testing cycle

The goal is to maximize the model’s accuracy by making final edits and testing it. One may encounter issues during testing and must fix them before deploying the model into production. This stage builds a model that best solves the problem.
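The 80/20 split described in this section can be sketched with NumPy alone. The helper name `split_train_test` and the toy arrays are assumptions for illustration; scikit-learn's `train_test_split` provides the same functionality.

```python
import numpy as np

def split_train_test(features, labels, test_fraction=0.2, seed=0):
    """Shuffle the rows, then hold out `test_fraction` of them for testing."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(test_fraction * len(features)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (features[train_idx], features[test_idx],
            labels[train_idx], labels[test_idx])

# Ten made-up patients with two features each and a 0/1 label.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
X_train, X_test, y_train, y_test = split_train_test(X, y)
```

Shuffling before splitting matters because datasets are often stored in a meaningful order, and an unshuffled split could leave one class out of the test set entirely.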

3.5 Data Interpretation

The concluding step of the data science process is to communicate the results obtained from the model. At this point the project is complete and its goal accomplished, so one must present the results to an audience through a research paper or a presentation. The presentation should be comprehensible to a non-technical audience. The findings can be visualized with graphs, scatterplots, heat maps, or other suitable visualizations. Useful data visualization tools include Matplotlib, ggplot, Seaborn, Tableau, and d3js. To visualize the correlation between stroke and its primary causes, I used Matplotlib and Seaborn to create a heatmap. During the presentation, one should report the results and carefully explain the reasoning behind them and their meaning. My ultimate goal is to make stroke predictions from given patient data, and I hope my research paper will raise awareness of this technology and its global benefits for stroke patients. A successful presentation will prompt the audience to take action in response to the purpose.

4. Machine Learning

The popularity of Machine Learning, particularly its subset Deep Learning, has grown rapidly in the past decade with skyrocketing interest in Artificial Intelligence. However, the history of Machine Learning dates back to the mid-twentieth century. Machine Learning is a subset of Artificial Intelligence that imitates human behavior and cognition. The “learning” in Machine Learning expresses how the algorithm automatically learns from the data and improves with experience by constantly tuning its parameters to find the best solution. The data set trains a mathematical model so that it knows what to output when it sees similar data in the future. Machine Learning can be classified into three algorithm types: Supervised Learning, Unsupervised Learning, and Reinforcement Learning (Figure 4.1). While Supervised and Unsupervised Learning are presented with a given set of data, a Reinforcement Learning system, known as an agent, learns by interacting with its environment. The agent makes observations and selects an action. When it takes the action, it receives feedback in the form of a reward or a punishment. Its goal is to maximize rewards and minimize penalties; thus, it learns and tunes its knowledge to take the actions that lead to reward and avoid those that lead to punishment.

Figure 4.1 Web diagram of Machine Learning

4.1 Supervised & Unsupervised Problems

The significant distinction between Supervised and Unsupervised Learning is the labeling status of the given data set. In Supervised Learning, the machine is given pre-labeled data. For my project, I used Supervised Learning and already had data from researchers who labeled each patient with or without stroke. I used a portion of this labeled data to train the model to distinguish which patients have or do not have a stroke based on their given conditions. The system would make a mapping function that uses the pre-existing data to create the best-fit curve or line and make estimations. Subsequently, I used the remaining portion of my labeled data to test the model for its accuracy. The goal is to maximize the accuracy of the model’s approximations when given new input data. In Unsupervised Learning, the machine is given unlabeled and uncategorized data, so it uses statistical methods on the data without prior training. For example, I would be using Unsupervised Learning if I were to predict which of the given patients have diabetes without previous data on diabetes. To form a model, I must analyze the data distribution and separate it based on similar patterns. Without any labeling, I would divide the patients into two groups based on their similar characteristics and behavior. Unsupervised Learning is split into two types: clustering and dimensionality reduction. In clustering, the goal is to find the inherent groupings and reveal the structure of the data. Some examples of clustering would be my previous example of predicting a patient with diabetes, targeted marketing, recommender systems, and customer segmentation. In dimensionality reduction, the goal is to reduce the number of dimensions rather than examples.

4.2 Classification & Regression

Supervised Learning is divided into two types: classification and regression. The goal of classification is to determine which labeled group a given input belongs to. The output variable is a discrete category or class. For my project, the only possibilities are “stroke” or “no stroke.” The given data on the patients trains the model to relate various parameters (their conditions and behavior) to the corresponding output of “stroke” or “no stroke.” The output could also be a defined set of numbers, such as “0” representing no stroke and “1” representing stroke. A classification algorithm is evaluated by the accuracy of its categorization. Once trained, the model can predict whether a new patient would have a stroke. For regression, the outputs are continuous and have an infinite set of possibilities, generally real numbers. For instance, the machine could estimate a house’s cost based on its location, size, and age. Standard regression algorithms are linear regression, logistic regression, and polynomial regression.

In the following sections, I will discuss two regression models: linear and logistic regression. The former serves as an introduction to the regression problem, whereas the latter is the algorithm that I used to perform stroke predictions.

4.3 Linear Regression

Linear regression uses the relationship between the points or outputs of the data to draw a straight line, known as the line of best fit, through all of them. This line of best fit is then used to predict output values. A linear function has a constant change or slope and is usually written in the mathematical form:

y = θ1x + θ0 (Equation 4.1)

where θ1 is the constant slope and θ0 is the y-intercept. When finding the line of best fit, there are infinitely many possible straight lines through the values (Figure 4.2), and the θ1-values (slopes) and θ0-values (y-intercepts) are adjusted to choose among them. “θ0” and “θ1” are the two parameters of the function. Regression means predicting the exact numeric values these parameters should take to give the line of best fit. When given a data set, there exist various x-variables (features or input) and a y-variable (label or output). In my case, the features included gender, age, multiple diseases, and smoking status. The label is no stroke or stroke, listed as “0” and “1.” When using actual data, there will always be a distance between the actual and predicted y-values. This distance, known as the error, is minimized as much as possible to form the line of best fit.

Figure 4.2 Possible lines of best fit for a given dataset

The error is often represented by a cost function, which is the mean of the squared differences between the actual and predicted outputs:

C(θ) = (1/n) Σ_{i=1}^{n} (y_i - g(x_i))^2 (Equation 4.2)

where n is the number of data points, y_i is the real label output, g(x_i) is the approximation of the output, and (y_i - g(x_i)) is the error. The error is squared to ensure that the cost function is a sum of non-negative values. The line of best fit is the one for which the mean square error is as small as possible. In Machine Learning, the model is trained to find the line of best fit using Gradient Descent, an optimization algorithm that finds a local minimum of a differentiable function. Gradient Descent can be represented with the formula:

θ_n = θ_{n-1} - α C′(θ_{n-1})

where α is the learning rate and C′(θ_{n-1}) is the instantaneous rate of change of the cost function at θ_{n-1}. The learning rate determines the size of each step. Data scientists often choose 0.001 < α < 0.01 because an α that is too large will never converge to the minimum and an α that is too small will take too long to reach it. Moving down the function C(θ), θ_n and θ_{n-1} approach each other. Once the difference is very small, for example |θ_n - θ_{n-1}| < 0.001, the line of best fit is found. One example of linear regression would be the number of sales based on the product’s price. There would be a set of data with various products at different prices (the inputs) and each of their sales (the outputs). Assuming the relationship between the prices and the sales is linear, one would be able to find the linear model with the smallest mean square error. Thus, one can predict the number of sales at a new price. When two inputs or independent variables exist, the function becomes three-dimensional (Figure 4.3), and the model becomes a plane of best fit.

Figure 4.3 Plane of best fit on a three-dimensional graph
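As a worked illustration of Gradient Descent on the mean square error, the sketch below fits the line y = θ1x + θ0 to a handful of made-up points. The function name `fit_line`, the data, and the stopping tolerance are assumptions chosen for the example; the tolerance is much tighter than the 0.001 mentioned in the text so that the recovered parameters are easy to check.

```python
def fit_line(xs, ys, alpha=0.01, tol=1e-9, max_steps=100_000):
    """Fit y = theta1 * x + theta0 by gradient descent on the MSE."""
    theta0, theta1 = 0.0, 0.0
    n = len(xs)
    for _ in range(max_steps):
        # Signed errors: prediction minus actual output.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to each parameter.
        grad0 = 2.0 / n * sum(errors)
        grad1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        new0 = theta0 - alpha * grad0
        new1 = theta1 - alpha * grad1
        # Stop once consecutive parameter values barely differ.
        converged = abs(new0 - theta0) < tol and abs(new1 - theta1) < tol
        theta0, theta1 = new0, new1
        if converged:
            break
    return theta0, theta1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # points lying exactly on y = 2x + 1
theta0, theta1 = fit_line(xs, ys)
```

Because the points lie exactly on a line, the procedure should recover an intercept close to 1 and a slope close to 2.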

4.4 Logistic Regression

The data may not always fit a linear model. For my data set on stroke predictions, the only two possible labels are no stroke and stroke, or “0” and “1,” which is an example of binary classification. Linear regression is not well suited to binary classification.

Figure 4.4 Linear regression used in binary classification

The line of best fit would exceed the 0 to 1 range and would not be a good representation of the data, as seen in Figure 4.4. That’s why we will use a logistic function to model the data. A logistic function, also known as a sigmoid curve, is an “S”-shaped curve (Figure 4.5) that can be represented by the function:

f(x) = L / (1 + e^(-(θ0 + θ1x)))

where L is the curve’s maximum value and θ0 + θ1x = g(x) is the linear regression function.

Figure 4.5 Logistic regression used in binary classification

In the case of the common sigmoid function, the output ranges between 0 and 1, so L is 1. There is a threshold at 0.5: outputs less than 0.5 are set to 0, while outputs greater than or equal to 0.5 are set to 1. Logistic regression finds the curve of best fit, or the best sigmoid function, for the given data set. For linear regression, we found the line of best fit by minimizing the mean square error with Gradient Descent. For logistic regression, we use the Cross-Entropy Loss Function to determine the curve of best fit. Cross-entropy loss is the sum of the negative logarithms of the probabilities the model predicts for the correct classes. In my case, I had only two labels and used Binary Cross-Entropy Loss, which can be represented by the formula:

BCE = -Σ_i [t_i log(f(s_i)) + (1 - t_i) log(1 - f(s_i))]

where s_i is the input, f is the sigmoid function, and t_i is the target label (0 or 1). The goal is to minimize the loss; thus, the smaller the loss, the better the model. When the best sigmoid function is found, the Binary Cross-Entropy should be as close to 0 as the data allows. The machine completes most of the logistic regression process internally, so it will find the best function, which can then be applied to make predictions.
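The pieces of this section can be sketched in a few lines: the sigmoid with L = 1, the 0.5 threshold, and the Binary Cross-Entropy. The function names and example numbers are assumptions for illustration, and the loss is averaged over the samples here, a common convention that does not change which model minimizes it.

```python
import math

def sigmoid(z):
    """Logistic function with maximum value L = 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(theta0, theta1, x, threshold=0.5):
    """Map the sigmoid output to 1 if it reaches the threshold, else 0."""
    return 1 if sigmoid(theta0 + theta1 * x) >= threshold else 0

def binary_cross_entropy(targets, probs):
    """Average negative log-probability assigned to the correct labels."""
    n = len(targets)
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(targets, probs)) / n

targets = [1, 0, 1]
confident = binary_cross_entropy(targets, [0.9, 0.1, 0.8])  # good predictions
unsure = binary_cross_entropy(targets, [0.6, 0.4, 0.5])     # hesitant predictions
```

The confident predictions receive the smaller loss, which is the sense in which a lower Binary Cross-Entropy indicates a better-fitting sigmoid.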

5. Process of Stroke Prediction Project

In the following section, I will apply the machine learning methods described above, specifically the logistic regression algorithm, to the case of stroke predictions. The data set introduced in Section 3.1 and the data science process discussed in Section 3 will be used. I will describe the process of my project in detail and explain the analysis involved in interpreting the accuracy and efficiency of my model.

5.1 Data Acquisition

Before I started the data science research project, I researched various topics and current events and chose to do my project on stroke prediction. I obtained my organized data from the Kaggle website, which allowed me to download the file as a CSV file conveniently. I used the Jupyter Notebook application via Anaconda as my environment for this project. I imported my downloaded CSV file to the notebook (Figure 5.1).

Figure 5.1 First 15 lines of the imported dataset

As seen in the top row of Figure 5.1, there are various parameters or features: gender, age, hypertension, heart disease, marriage status, work type, residence type, average glucose level, BMI, and smoking status. The output or target I investigated was whether or not the patient had a stroke. The variables hypertension, heart disease, and stroke are defined by “0” being no and “1” being yes.

5.2 Data Cleaning

During the data cleaning process, I removed redundant data for clarity by deleting the “Other” values in gender, the never_worked values from work_type, and the id column (Figure 5.2 & Figure 5.4). In addition, I labeled all categorical features, or non-numerical columns, as ‘category’ and then converted them into numerical values for analysis (Figure 5.2 & 5.3). Since the age values were non-integers, I converted them into integers in the last line of my code (Figure 5.2).

Figure 5.2 Code for removal and revision of dataset

Figure 5.3 Conversion of categorical to numerical

Figure 5.4 Histograms before and after removal of unnecessary data
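The cleaning steps described above might look roughly like the following pandas sketch. The rows are made up, and the exact column names and category spellings (“Other”, “Never_worked”) are assumptions, since the code in Figure 5.2 is not reproduced in the text.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumed from the text.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "gender": ["Male", "Female", "Other", "Female"],
    "age": [67.0, 1.32, 45.0, 80.0],
    "work_type": ["Private", "children", "Private", "Never_worked"],
})

df = df[df["gender"] != "Other"]             # drop the rare "Other" gender rows
df = df[df["work_type"] != "Never_worked"]   # drop the "Never_worked" rows
df = df.drop(columns=["id"]).copy()          # the id carries no predictive information

for column in ["gender", "work_type"]:
    # Label the column as categorical, then replace each category
    # with its integer code.
    df[column] = df[column].astype("category").cat.codes

df["age"] = df["age"].astype(int)            # truncate fractional ages
```

The integer codes are only labels: a larger code does not mean a larger quantity, which is the point made earlier about categorical data.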

The next part of data cleaning is handling missing values, which I treated as outliers. I identified them by counting the “null” or nonexistent values in each column (Figure 5.5), which are labeled as NaN in the data, as seen previously in Figure 3.2. Any non-zero count means that the column contains missing values.

Figure 5.5 Identification of outlier

In my dataset, the only column with missing values was BMI. I therefore replaced those missing values with the mean BMI value using the code in Figure 5.6. Afterwards, I was confident that no null values remained in my data since all counts were zero.

Figure 5.6 Removal of outlier
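A minimal pandas sketch of this check-and-fill step is shown below. The values are made up and the column name `bmi` is an assumption; the code in Figures 5.5 and 5.6 is not reproduced in the text.

```python
import numpy as np
import pandas as pd

# Made-up rows: two of the four BMI values are missing.
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

missing_before = df.isnull().sum()               # per-column count of NaN values
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())   # the mean ignores NaN by default
missing_after = df.isnull().sum()
```

Filling with the mean keeps every patient in the dataset, at the cost of slightly reducing the spread of the BMI column.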

5.3 Data Balancing

Even after data cleaning, my dataset was not yet ready for use because it was imbalanced. Imbalanced data refers to the issue in classification when the classes or targets are not equally represented. The number of patients with stroke was much lower than the number without stroke (left plot in Figure 5.8). To create a fair model, the number of patients in the stroke and no stroke classes must be equal. I could have resampled the data by undersampling (downsizing the larger class) or oversampling (upsizing the smaller class). I chose to oversample with the SMOTE algorithm (Figure 5.7) because the number of patients in the stroke class was so small that undersampling would have left too little data and lowered the accuracy.

Figure 5.7 Code for resampling

Figure 5.8 Histogram of gender to stroke before and after balancing

As a result of the oversampling, the ratio of stroke to no stroke should be 1:1 and thus balanced (Figure 5.7 & right plot in Figure 5.8).
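The core idea behind SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The NumPy sketch below illustrates only that idea on made-up data; it is a simplified stand-in, not the library implementation used in Figure 5.7, and the function name `smote_like_oversample` is hypothetical.

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows between minority samples and their neighbours.

    Assumes the minority rows are distinct and that k < len(minority).
    """
    rng = np.random.default_rng(seed)
    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))
        # Distances from sample i to every minority sample.
        dist = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]   # position 0 is the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # position along the segment
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic

# Made-up data: 6 "no stroke" rows and 3 "stroke" rows with 2 features each.
majority = np.array([[50, 25], [41, 22], [35, 28],
                     [62, 30], [29, 21], [44, 26]], dtype=float)
minority = np.array([[70, 31], [75, 33], [68, 29]], dtype=float)
new_rows = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = np.vstack([minority, new_rows])
```

After oversampling, both classes have the same number of rows, which corresponds to the 1:1 ratio described above.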

5.4 Data Modeling

After dividing the resampled data into 80% training and 20% testing, I created a logistic regression model with the training data (Figure 5.9).

Figure 5.9 Code for logistic regression

The logistic regression algorithm was imported from sklearn.linear_model and automatically found the best fit curve representing the dataset.
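To make the split-then-fit workflow concrete, here is a dependency-free sketch with a single feature. It fits the logistic curve by plain gradient descent, which is an assumption made for readability; the library model in Figure 5.9 uses its own solver. The helper names `split_train_test` and `fit_logistic` and the toy data are hypothetical.

```python
import math
import random

def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and cut them into training and testing parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(train, learning_rate=0.1, epochs=2000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in train) / len(train)
        grad_b = sum(sigmoid(w * x + b) - y for x, y in train) / len(train)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Hypothetical feature: age rescaled as (age - 50) / 10; label 1 means stroke.
samples = [(-2.5, 0), (-2.0, 0), (-1.5, 0), (-1.0, 0), (-0.5, 0),
           (0.5, 1), (1.0, 1), (1.5, 1), (2.0, 1), (2.5, 1)]

train, test = split_train_test(samples)   # 80% training, 20% testing
w, b = fit_logistic(train)
predictions = [int(sigmoid(w * x + b) >= 0.5) for x, _ in test]
```

The fitted curve is an S-shaped probability, and a patient is assigned to the stroke class when that probability reaches 0.5.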

5.5 Data Performance

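In order to determine the accuracy of my model, I found the mean square error or MSE (from Equation 4.2). Since every label and every prediction is either 0 or 1, the MSE is simply the fraction of patients the model classified incorrectly, so the accuracy is 1 − MSE. These values could be found with three methods: the score method of the model, sklearn.metrics, and the equation itself (Figure 5.10).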

Figure 5.10 Three methods of finding MSE
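The link between MSE and accuracy for 0/1 labels can be checked numerically. The sketch below uses made-up labels and predictions, not the project's data, and the two helper functions are written out by hand rather than imported.

```python
def mean_squared_error(expected, predicted):
    """Average of the squared differences between predictions and labels."""
    return sum((p - e) ** 2 for e, p in zip(expected, predicted)) / len(expected)

def accuracy(expected, predicted):
    """Fraction of predictions that match the labels exactly."""
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)

# Hypothetical labels (1 = stroke) and model predictions for ten patients.
expected  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
predicted = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

mse = mean_squared_error(expected, predicted)  # two mistakes out of ten
acc = accuracy(expected, predicted)            # eight correct out of ten
```

Each wrong 0/1 prediction contributes a squared error of exactly 1, which is why the two numbers always add up to 1 in this setting.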

As a result, my model had approximately 91.1% accuracy. For a more detailed understanding of the model’s performance, I used a confusion matrix, which is a 2×2 table dividing the accuracy of the data into four categories (Figure 5.11).

Figure 5.11 Confusion matrix plot

The four categories, as shown in Figure 5.11, are true positive (bottom right), true negative (top left), false positive (top right), and false negative (bottom left). The accuracy of the model is high as long as most of the results are in the true positive and true negative categories because there the predicted values are equal to the actual values. Using the confusion matrix, I further analyzed the performance of the model by calculating the F-1 score (Equation 5.1 & Figure 5.12). The F-1 score combines precision (the share of predicted strokes that were actual strokes) and recall (the share of actual strokes that were predicted) into a single number. I used the sklearn.metrics module to calculate my F-1 score (Figure 5.12), but I also could have used the equation.

Figure 5.12 Code for F-1 score

As a result, my model had an F-1 score of approximately 90.8%. Both my MSE and F-1 score were above 90.0%, and thus my model had high accuracy and precision.
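For readers who want to see how the four confusion-matrix cells lead to the F-1 score, the calculation can be written out by hand. The labels and predictions below are invented for the example; in practice, `confusion_matrix` and `f1_score` from sklearn.metrics do the same work.

```python
def confusion_counts(expected, predicted):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    pairs = list(zip(expected, predicted))
    tn = sum(e == 0 and p == 0 for e, p in pairs)
    fp = sum(e == 0 and p == 1 for e, p in pairs)
    fn = sum(e == 1 and p == 0 for e, p in pairs)
    tp = sum(e == 1 and p == 1 for e, p in pairs)
    return tn, fp, fn, tp

def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # predicted strokes that were real
    recall = tp / (tp + fn)      # real strokes that were predicted
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels (1 = stroke) and predictions for ten patients.
expected  = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
predicted = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_counts(expected, predicted)
f1 = f1_from_counts(tp, fp, fn)
```

Laid out as a 2×2 table in the order used in Figure 5.11, the counts read `[[tn, fp], [fn, tp]]`.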

5.6 Features Selection

Although my model already had high performance, I attempted to further increase it by removing certain features from my data. I hypothesized that the accuracy would improve if I removed the unimportant features or features with little correlation to the presence of stroke. On the other hand, the accuracy would drastically decrease when I removed important features. I determined the important and unimportant features with a correlation matrix plot (Figure 5.13).

Figure 5.13 Correlation matrix plot

The labeled bar on the right of Figure 5.13 shows the correlation between the features and output. The algorithm found the correlation with the following equation:

$$r_{xy} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}$$

where cov is the covariance, σ_x is the variability of x with respect to the mean (the standard deviation), x_i is an individual value of x, x̄ is the mean of x, and the y-variables have the same meanings using the y data set. To find the correlation between each feature and stroke, I focused on the right-most column of the map. A correlation of 1.0 means the trends of the feature and output are equivalent, while a correlation of -1.0 means the trends of the feature and output are completely opposite. Both strongly positive and strongly negative correlations are considered crucial when creating the logistic regression model. On the other hand, the feature and output are entirely unrelated if the correlation is 0. Therefore, I considered the features with a correlation close to 0 (gender, residence type, children, and unknown smoking status) unimportant and removed them from my dataset (Figure 5.14).
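The correlation described above can be computed by hand for a pair of columns. The sketch below assumes the standard Pearson definition, and the `age` and `stroke` values are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical columns: patient age and whether a stroke occurred.
age    = [30, 45, 50, 62, 71, 80]
stroke = [0, 0, 0, 1, 1, 1]

same_trend = pearson(age, age)                     # identical trends
opposite_trend = pearson(age, [-a for a in age])   # completely opposite trends
age_vs_stroke = pearson(age, stroke)               # somewhere in between
```

The common 1/n factors of the covariance and the standard deviations cancel, so the sums can be used directly.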

Figure 5.14 Code for removal of unimportant features

After the removal, I repeated the processes of splitting the data, training the data, creating the logistic regression model, and calculating its accuracy with MSE and F-1 scores. Surprisingly, the accuracy and F-1 score lowered to approximately 86.6%; hence, removing those features left the model with less information to learn from and thus made it less accurate and precise. I further tested this theory by removing the important features, that is, only keeping the features deemed unimportant, and then repeated the data modeling process. Understandably, the accuracy lowered to 66.2%, and the F-1 score reduced to 71.9%. In conclusion, I kept my original model with all the features because it had the highest accuracy and precision.

6. Conclusion

In this data science project, I applied Machine Learning algorithms to predicting the likelihood of a patient in the United States having a stroke. The goal of making such predictions is to prevent the consequences of stroke, which impacts a large population of Americans today. Throughout the project, I closely followed each step of the data science project process: data acquisition, data cleaning, data exploration, data modeling, and data interpretation. I discussed that the difference between Supervised and Unsupervised Learning is whether the given data is labeled. Within Supervised Learning, there is Classification, using categorical data, and Regression, using numerical data. These data sets can be modeled with linear and logistic regression. In my project, I used a logistic regression algorithm to train and test my data. I then evaluated my model with MSE and F-1 scores, and my model had an accuracy of approximately 90%, which is a very promising outcome. To check whether the highest accuracy had been reached, I removed features with low correlation deemed unimportant and, separately, features with high correlation deemed important. The removal of important features led to a drastic drop in accuracy, and thus those features of the dataset should continue to be collected and studied for stroke predictions. Meanwhile, the removal of the unimportant features led to a small drop in accuracy, so those features are still of good use and are to be collected with the important features in this study. There may be other factors that play a role in the risk of stroke; however, the factors I have mentioned are of greatest significance based on the accuracy of my model.

Works Cited

Yeung, C. (2022, August 11). Stroke_Predictions_Project_Charisse_Yeung.ipynb. GitHub. Retrieved September 3, 2022, from https://github.com/honyeung21/data_science/blob/main/Stroke_Predictions_Project_Charisse_Yeung.ipynb

Medlock, B. (2022). Stroke. Headway. Retrieved September 3, 2022, from https://www.headway.org.uk/about-brain-injury/individuals/types-of-brain-injury/stroke/

Initiatives, C. H. (n.d.). Stroke prevention. CHI Health. Retrieved September 3, 2022, from https://www.chihealth.com/en/services/neuro/neurological-conditions/stroke/stroke-prevention.html

Fedesoriano. (2021, January 26). Stroke prediction dataset. Kaggle. Retrieved September 3, 2022, from https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

Wolff, R. (2020, November 2). What is training data in machine learning? MonkeyLearn Blog. Retrieved September 3, 2022, from https://monkeylearn.com/blog/training-data/

Pant, A. (2019, January 22). Introduction to machine learning for beginners. Medium. Retrieved September 3, 2022, from https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08

V.Kumar, V. (2020, May 28). ML 101: Linear regression. Medium. Retrieved September 3, 2022, from https://towardsdatascience.com/ml-101-linear-regression-bea0f489cf54

Gupta, S. (2020, July 17). What makes logistic regression a classification algorithm? Medium. Retrieved September 3, 2022, from https://towardsdatascience.com/what-makes-logistic-regression-a-classification-algorithm-35018497b63f

About the author

Charisse Yeung

Charisse is currently a 12th grader at the Carlmont High School in California. Her academic interests are data science, computer science, healthcare, and mathematics.

The post Data Science Analysis of Stroke Prediction appeared first on Exploratio Journal.

The Theory and Implementation of Common Machine Learning Algorithms

Amanbir Behniwal — Mon, 02 May 2022 14:53:58 +0000

Author: Amanbir Behniwal
Mentor: Dr. Gino Del Ferraro
Vincent Massey Secondary School

1. Introduction

Machine Learning jobs are growing to become some of the most in-demand jobs in the world. The idea of machine learning first started to grow in the 1940s; it was something that would emulate human thinking and learning. Machine Learning has since grown to become a big part of our daily lives. For example, speech recognition software maps the different tones and nuances when someone speaks and tries to match them to a specific person. Another example is a translator, which tries to understand the accents of people speaking a language and then translates it to another language. Many applications that we use today, such as Alexa, Siri, and Google Translate, use these machine learning algorithms. Furthermore, we are trying to integrate machine learning into our vehicles. Cars like the Tesla use unsupervised learning algorithms to self-drive in traffic and detect any danger. The future holds many possibilities due to machine learning.

In theory, we input great amounts of data into machine-learning programs, which, using statistics, categorize or predict outcomes by finding and applying patterns in the data. We can further categorize the different types of algorithms used in Machine Learning into supervised learning, unsupervised learning, and reinforcement learning. Supervised learning consists of regression and classification, while unsupervised learning consists of clustering and association.

In this report, we will first discuss important terminology needed to understand the contents of the report. We will then begin to discuss the theory behind some of the machine learning algorithms. The algorithms implemented in this report are all regression algorithms; however, we will also discuss the theory behind other algorithms. Finally, we will see how to implement the code. There are GitHub links provided with the actual code.

2. Terminology

Before we can get started with all the theory, we must develop an understanding of some key terminology that we will use quite often when working with machine learning programs. These are some basic terms that we should be familiar with:

2.1 Features

When we are trying to extrapolate from data using a linear model such as a line of best fit, we want the line to have an equation that best fits the data. In general, a line has an equation of h = θ₀ + θ₁x₁ + θ₂x₂ + ··· + θ_n x_n. Here we consider x₁, x₂, …, x_{n−1}, x_n the features. We will go more in depth about this later on in the report.

2.2 Inputs

When we run a Python program, we must somehow store the data so that our program knows what we want it to work with. We then take ‘input’ of the data in a way that is convenient for us to work with. For example, let’s say we had a document that contained a few coordinates. We may want our program to take input of this data so that the x-coordinates and y-coordinates are stored separately. The process of reading the data in this way is called ‘taking input’. This process is explained in greater detail in the code.
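As a small illustration of taking input, the sketch below reads coordinates from a document and stores the x-coordinates and y-coordinates in separate lists. The "x,y" line format and the in-memory `document` are assumptions made so that the example runs on its own; a real program would open a file instead.

```python
import io

# Stand-in for a document on disk; each line holds one "x,y" coordinate.
document = io.StringIO("1,2\n2,4.5\n3,5\n")

x_coordinates = []
y_coordinates = []
for line in document:
    x_text, y_text = line.strip().split(",")
    x_coordinates.append(float(x_text))   # store the x-values separately
    y_coordinates.append(float(y_text))   # from the y-values
```

Replacing `io.StringIO(...)` with `open("some_file.csv")` (a hypothetical file name) would read the same format from disk.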

2.3 Outputs

After our code has calculated what we wanted it to, we want to see this information in an organized manner so that we can study it. We then make our program ’output’ this information. Outputs can consist of words, integers, etc.

2.4 Predicted Values

Let us say that we received input of many coordinates and we wanted our program to calculate the line of best fit. When we are testing different equations to see if they best fit the data, we input the same x-coordinates as the ones in our input data. However, our y-coordinates may not always be the exact same as that of the input data. We thus call our y-coordinates predicted values, since they are what our program predicted the coordinate lies at based on the equation that we came up with.

2.5 Expected Values

The values that we get from the inputted data are our expected values since they are the original values that we are comparing the predicted values to.

3. Supervised Learning

Supervised learning is the most commonly used type of algorithm in Machine Learning and it is also the simplest to implement. When using supervised learning, we must train the algorithm by pairing labelled inputs with outputs. The program in this stage is trained to look for patterns that correlate the input to the output. When we have provided the algorithm with a good amount of example pairings, the algorithm will be able to apply this to new inputs it receives. We can further split supervised learning into classification and regression.

3.1 Classification

Classification is a type of supervised learning. In classification, our output will always be a category that the algorithm has mapped the input to. An example of this would be our program receiving input of pictures of animals and then outputting what animal they are (their category). We first have to train the program by inputting many pictures of dogs and cats in their respective categories so that the program will be able to establish patterns between the images of the dogs and the images of the cats. After we have inputted a sufficient number of images, the program will get accurate in determining if an animal is a cat or dog when it receives an input that it has not seen before.

3.2 Regression

Regression is another type of supervised learning. In regression, our output is not a category but rather a value such as money or age. We can take for example the price of houses and the total square footage of the house. Using regression, we identify the function that best fits between these values where we have reduced the amount of error as much as we can. We can then use the equation of this line to predict how much a house with a certain square footage will cost.

Figure 1: https://medium.com/machine-learning-in-practice/a-gentle-introduction-to-machine-learning-concepts-cfe710910eb

3.2.1 Linear Regression

When performing linear regression, the program will take input of data and plot it on a graph. It will then find a line of best fit and be able to make predictions based on this line of best fit. For example, we can graph the number of hours a student watches TV rather than studying compared to their test scores.

Figure 2: onlinemath4all.com/scatter-plots-and-trend-lines.html

As we can see, the graph looks fairly linear and it only has one feature: the amount of time spent watching TV rather than studying. This makes it a perfect model for linear regression. We want our program to come up with an approximate equation with which we can estimate a student’s test score based on how long they spent watching TV instead of studying. Really, we are looking for our program to find the line of best fit, since this line would be best for extrapolating the data and providing as accurate an estimate as possible of a test score based on the number of hours that were spent watching TV. Our program would then test many different lines until it reaches one line that fits the data better than any other line.

Figure 3: onlinemath4all.com/scatter-plots-and-trend-lines.html

As we can deduce, when calculating the equation of the line of best fit, our slope and y-intercept variables matter a lot. In fact, we are just making changes to these variables to try to find the line of best fit. Machine learning algorithms rely on these parameters (y-intercept/bias, slope, etc.) to run. When we want to find the best model for our data, we need to keep adjusting these parameters so that the direction of our line better fits the data and our predicted values are closer to the expected values. We must then introduce a function that guides how we change these parameters by determining the amount of error that we are getting with the current parameters. This function is called the cost function.

4. Cost Function

The cost function essentially helps our program minimize the error it produces compared to the actual data set. When we are doing linear regression, it is very rare that we will get a data set where the data fits precisely on a line. Therefore, when we are computing the line of best fit, we want to find a line such that it has the least possible difference (error) between the actual coordinates and the coordinates our line gives (predicted values). There are multiple ways of defining the cost function; some examples are explained further in the following sections.

4.1 Mean Absolute Error

When we take the mean absolute error, we are taking the absolute value of the difference between the predicted y-value and the expected y-value. The reasoning for this is that, since we are adding up all the error for each data point, we want to keep track of how much error we are accumulating.

Figure 4: https://gist.github.com/FisherKK/86f400f6d88facbf5375286db7029ca2

In this graph, the blue points are the original points of the data set, while the orange points are the ‘predicted’ points that our program is currently testing for the line of best fit. As we can see, each d_i represents the amount of ‘error’ our model/line produces for each point in the data set.

However, if we add negative numbers (our predicted point is below the original point), our program actually thinks it’s producing less error. To deal with this we take the absolute value, which is always non-negative, so that our program does not add negative error. Then our program can plug this into the formula which is defined as

$$\mathrm{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| \hat{y}^{(i)} - y^{(i)} \right|$$

where m is the number of training examples, ŷ^(i) is the predicted value, y^(i) is the expected value, and i is the index of the data point, since we want to sum the error of all the data points.
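A short sketch shows why the absolute value matters. With the made-up values below, the signed errors cancel out to zero even though every prediction is off by one, while the mean absolute error reports the true average miss. The function name `mean_absolute_error` is our own.

```python
def mean_absolute_error(expected, predicted):
    """Average of |predicted - expected| over all m data points."""
    m = len(expected)
    return sum(abs(p - e) for e, p in zip(expected, predicted)) / m

expected  = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 3.0, 7.0, 7.0]   # errors: +1, -1, +1, -1

# Without the absolute value, positive and negative errors cancel each other.
signed_mean = sum(p - e for e, p in zip(expected, predicted)) / len(expected)
mae = mean_absolute_error(expected, predicted)
```

Here `signed_mean` wrongly suggests a perfect fit, whereas `mae` shows that each prediction misses by one unit on average.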

4.2 Mean Squared Error

When we take the mean squared error, instead of taking the absolute value of the difference between the predicted and expected value, we take their square. In this way, we still don’t add up negative error since any real number squared is non-negative. The equation is defined as:

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

When using mean absolute error, we took the absolute value of the distance between the predicted value and the expected value. We are now taking the area of the square whose side length is the distance between the predicted value and the expected value. All these areas are summed and averaged.
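One consequence of squaring is that a single large error costs more than several small ones. The sketch below compares two made-up sets of predictions whose absolute errors add up to the same total of 4; the function name is our own.

```python
def mean_squared_error(expected, predicted):
    """Average of (predicted - expected)^2 over all m data points."""
    m = len(expected)
    return sum((p - e) ** 2 for e, p in zip(expected, predicted)) / m

expected      = [2.0, 4.0, 6.0, 8.0]
small_errors  = [3.0, 3.0, 7.0, 7.0]    # four errors of size 1
one_big_error = [2.0, 4.0, 6.0, 12.0]   # a single error of size 4

mse_small = mean_squared_error(expected, small_errors)
mse_big = mean_squared_error(expected, one_big_error)
```

The mean absolute error would rate both sets of predictions equally, while the mean squared error penalises the single large miss four times as heavily.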

Now that we have discussed how our program will calculate the error that our model/line is producing, we must find a way to minimize the value our cost function is returning. The gradient descent algorithm is one of the most effective ways of doing so.

Figure 5: https://gist.github.com/FisherKK/86f400f6d88facbf5375286db7029ca2

5. Gradient Descent

For linear regression models, we assume that our data has a linear dependence and therefore can be modelled by using a linear equation as follows:

h_θ(x) = θ^T x = θ₀ + θ₁x,

where θ₀ is our bias (y-intercept) and θ₁ is our slope. Then, we want to change our parameters θ₀ and θ₁ in such a way that our line better fits the data and the cost function produces less error. In batch gradient descent, we update our theta values continuously with the following equation:

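$$\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Here, θ_j is the value that we are updating, and x_j^(i) is the j-th feature of the i-th data point (with x_0^(i) = 1, so that the same rule covers the bias θ₀). The constant factor of 2 that comes from differentiating the squared error is absorbed into the learning rate. Again, m is the size of the data (how many points there are). Alpha here represents the learning rate of our algorithm. If alpha is too big, our program may be a lot faster, but it will not be nearly as accurate in determining the equation of a line of best fit as it would be with a smaller value of alpha, and it may even overshoot the minimum and fail to converge. However, when we use too small a value for alpha, our program will be incredibly slow. It is best to find a good middle ground between these two extremes.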
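The update rule can be sketched for the one-feature case as follows. This is a minimal illustration under our own assumptions: the data is generated from the line y = 1 + 2x, the cost is the mean squared error with the constant factor folded into alpha, and the function name and default values are hypothetical.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Batch gradient descent for the line h(x) = theta0 + theta1 * x."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Error of the current line on every data point (the whole batch).
        errors = [(theta0 + theta1 * x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Both parameters are updated together, using the same errors.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # points on the line y = 1 + 2x

theta0, theta1 = gradient_descent(xs, ys)

# With a learning rate that is too large, each step overshoots and the
# estimate moves further away from the true slope instead of closer.
_, slope_large_alpha = gradient_descent(xs, ys, alpha=0.5, iterations=20)
```

With the smaller learning rate the parameters settle at the intercept and slope of the generating line, while the larger one makes the estimate drift away from them.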

6. Multi-Linear Regression

Now that we have discussed how to optimize our program so that it can calculate the line of best fit with equation h = θ₀ + θ₁x₁, we think of what we would do when we have multiple features. So far we have only been working with one feature, which in the example presented was the number of hours spent watching TV rather than studying. Let’s take another example: the price of a house. When determining the price of a house, we must consider its area, how many rooms it has, and how old it is, among other things. In this instance our data, when plotted, still looks linear; however, we cannot use the exact same technique as linear regression, since we have more than one feature. We use multi-linear regression in this situation because it is suited to dealing with more than one feature.

Multi-linear regression can be used with as many features as we’d like. Our equation is now

h = θ₀ + θ₁·x₁ + θ₂·x₂ + ··· + θ_n·x_n,

where all x_i represent the different features. When we now implement gradient descent, we must use it to update all θ_i so that our line better fits the data. The cost function can be implemented in much the same way.
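The same ideas carry over to several features. The sketch below, written under our own assumptions, predicts with one θ per feature plus a bias and applies one gradient step to all of them at once. The house data (area and number of rooms) and the helper names are invented for the example.

```python
def predict(theta, features):
    """h = theta_0 + theta_1 * x_1 + ... + theta_n * x_n for one example."""
    return theta[0] + sum(t * x for t, x in zip(theta[1:], features))

def cost(theta, examples, targets):
    """Mean squared error of the current parameters."""
    m = len(examples)
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(examples, targets)) / m

def gradient_step(theta, examples, targets, alpha):
    """Update every theta_j once, using the errors of the whole data set."""
    m = len(examples)
    errors = [predict(theta, x) - y for x, y in zip(examples, targets)]
    new_theta = [theta[0] - alpha * sum(errors) / m]   # bias term
    for j in range(len(theta) - 1):
        grad_j = sum(e * x[j] for e, x in zip(errors, examples)) / m
        new_theta.append(theta[j + 1] - alpha * grad_j)
    return new_theta

# Hypothetical houses: (area in 100 m^2, number of rooms) -> price in $1000s,
# generated from price = 50 + 100 * area + 20 * rooms.
examples = [(1.0, 2), (1.5, 3), (2.0, 3), (2.5, 5), (1.2, 4)]
targets = [190.0, 260.0, 310.0, 400.0, 250.0]

theta = [0.0, 0.0, 0.0]
initial_cost = cost(theta, examples, targets)
for _ in range(1000):
    theta = gradient_step(theta, examples, targets, alpha=0.05)
final_cost = cost(theta, examples, targets)
```

Nothing in the functions depends on there being exactly two features, so the same code works for any number of them.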

The interesting thing to note about multi-linear regression is that we need an n-D graph to plot all the points. If we take a 3-D graph as an example (two features and one output), our program is essentially finding the plane of best fit that best suits all the points.

Figure 6: https://aegis4048.github.io/mutiple_linear_regression_and_visualization_in_python

7. Unsupervised Learning

Unlike supervised learning, in unsupervised learning, we do not train the program with inputs and corresponding outputs. Rather, the program uses its built-in algorithms to try to find patterns in the unlabelled data and produce an output. For example, if we give input of shapes with different sizes, the algorithm can separate these based on how many sides there are in each shape. In general, unsupervised learning requires much less data than supervised learning. We can further split unsupervised learning into clustering and association.

7.1 Clustering

As discussed earlier, in unsupervised learning, we input unlabelled data into our program. Graphing our data, it may look like the following:

Figure 7: https://www.analyticsvidhya.com/blog/2021/04/k-means-clustering-simplified-in-python/

Once our program has graphed the data, we want our program to try to find patterns in the data. Specifically, clustering algorithms will try to look for clusters of points that seem to be together. The graph could then be divided into the following clusters:

Figure 8: https://www.analyticsvidhya.com/blog/2021/04/k-means-clustering-simplified-in-python/
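One common way to find such clusters is k-means, which repeatedly assigns each point to its nearest centre and then moves each centre to the average of its points. The sketch below is a minimal version under our own assumptions: the points and starting centres are made up, and the function name `k_means` is hypothetical.

```python
def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of the points assigned to it."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c   # keep an empty cluster's centroid where it was
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabelled data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

# Start with one centroid in each region (the starting choice is an assumption).
centroids, clusters = k_means(points, centroids=[points[0], points[3]])
```

Note that the program is never told which group a point belongs to; the groups emerge only from the distances between points.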

Among the many applications of clustering, we can use the example of social networks. We may want to find which people seem to be very close friends on their social networks so our algorithm would make clusters of people that appear to be close friends.

Figure 9: https://www.mghassany.com/MLcourse/introduction.html

A more common example in our daily lives would be our spam filter. Our email uses clustering algorithms to try to group spam emails, update emails, advertisement emails, etc. together.

Furthermore, we can classify clustering as hard clustering and soft clustering. In hard clustering, a data point either belongs to a cluster completely or does not belong to it at all. This type of clustering is useful in binary situations such as whether a movie is good or not. On the contrary, when using soft clustering, a data point can belong to many clusters at once. This is more useful when we may want to determine which books are similar.

7.2 Association

Association algorithms try to see if two items depend on each other. For example, take a customer at a supermarket. If this customer has gone to buy bread, then it is very probable that the customer is also looking to buy butter or milk. In this way, we can associate different items based on their dependency on each other. Many companies use this technique to place associated items away from each other in a store so that the customer sees many other items on the way and may consider buying additional things. An example of the different associations in a store is given below:

Figure 10: https://annalyzin.files.wordpress.com/2016/04/association-rules-network-graph2.png
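How strongly two items depend on each other can be measured by counting shopping baskets. The sketch below uses two standard measures, support (how often items appear together) and confidence (how often the second item appears given the first). The baskets and function names are made up for the example.

```python
# Hypothetical shopping baskets from five customers.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"eggs", "milk"},
    {"bread", "butter", "eggs"},
]

def count(items):
    """Number of baskets that contain every item in the given set."""
    return sum(items <= basket for basket in baskets)

def support(items):
    """Fraction of all baskets that contain the items together."""
    return count(items) / len(baskets)

def confidence(if_items, then_items):
    """Of the baskets with if_items, the fraction that also has then_items."""
    return count(if_items | then_items) / count(if_items)

bread_and_butter = support({"bread", "butter"})     # bought together in 3 of 5 baskets
bread_to_butter = confidence({"bread"}, {"butter"})  # 3 of the 4 bread buyers
bread_to_eggs = confidence({"bread"}, {"eggs"})      # 1 of the 4 bread buyers
```

A high confidence for bread and butter and a low one for bread and eggs is the kind of pattern an association algorithm reports.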

8. Reinforcement Learning

In reinforcement learning, the program learns what to do by trial and error in its current environment. We can think of it as the program receiving a reward if it does something correct and a penalty if it does something incorrect. Take the analogy of a child. When a child is young, they do not know what is good or bad. The only way the child can learn is by trying new things. The child may touch something electric, get a shock, then instinctively not go near the thing again. The child now knows that that object is something that shouldn’t be touched because it will hurt. A reinforcement learning program works in a similar way. The difference here is that the machine can try thousands of operations in one second, and even though it may start by making very bad decisions, it will learn over time and will become a lot more sophisticated in its decisions. We can simulate giving a program a reward or penalty by giving it a score: if it does something incorrect, the score will lower, and conversely, if it does something correct, the score will increase. This type of program is based entirely on trial and error on the program’s part; it is also one of the closest things to a machine’s own creativity.

One of the most useful applications of reinforcement learning is simulation. For example, the program can be used to help create the optimal rocket engine for a rocket launch. If we put our program in a rocket launch environment in which the environment responds to the actions of our program, we can ‘reward’ the program if it’s helping the rocket launch with its actions or ‘punish’ the program if it’s not helping the rocket launch.

Figure 11: https://riptutorial.com/machine-learning/example/32668/reinforcement-learning
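The reward-and-penalty idea can be sketched with a toy program that tries actions, keeps a running score, and gradually prefers the actions that paid off. Everything here is an assumption made for illustration: the three actions, their fixed rewards, and the simple strategy of mostly repeating the best action so far while occasionally exploring.

```python
import random

# Hypothetical toy environment: each action gives a fixed reward or penalty.
REWARDS = {"touch socket": -1.0, "play with toy": 1.0, "do nothing": 0.0}

def learn_by_trial_and_error(steps=200, explore=0.1, seed=0):
    rng = random.Random(seed)
    actions = list(REWARDS)
    totals = {a: 0.0 for a in actions}   # reward collected per action
    counts = {a: 0 for a in actions}     # how often each action was tried
    score = 0.0
    for step in range(steps):
        if step < len(actions):
            action = actions[step]            # try everything once first
        elif rng.random() < explore:
            action = rng.choice(actions)      # occasionally explore
        else:
            # Otherwise repeat the action with the best average reward so far.
            action = max(actions, key=lambda a: totals[a] / counts[a])
        reward = REWARDS[action]
        totals[action] += reward
        counts[action] += 1
        score += reward
    estimates = {a: totals[a] / counts[a] for a in actions}
    return estimates, score

estimates, score = learn_by_trial_and_error()
best_action = max(estimates, key=estimates.get)
```

Like the child in the analogy, the program starts without knowing which actions hurt, and it only finds out by trying each of them.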

9. Linear Regression Implementation

For the linear regression code, we took input of the population of a city in 10,000s and its profit in $10,000s. We then plotted all of the coordinates and got the resulting graph:

As we can see the graph looks fairly linear, thus we can use linear regression on this.

The full code can be found at: https://github.com/ABehniwal/face-recognition/blob/main/Numpy-Linear-Regression.ipynb

10. Multi-Linear Regression Implementation

For the multi-linear regression code, we took input of the different features of a car (Engine Size, Cylinders, Fuel Consumption (City), Fuel Consumption (Comb)) and the resulting CO2 emission. We then plotted all of these features of the car separately with the CO2 Emissions to get a visual of how the different graphs look. This resulted in the following graphs.

10.1 Engine Size Graph

10.2 Cylinders Graph

10.3 Fuel Consumption (City) Graph

10.4 Fuel Consumption (Comb) Graph

Again, we see that all the graphs look fairly linear; however, since we have multiple different features of the car that we must take into account, we use multi-linear regression. The full code can be found at: https://github.com/ABehniwal/face-recognition/blob/main/Multi-Linear-Regression.ipynb

About the author

Amanbir Behniwal

Amanbir is currently an 11th grader at the Vincent Massey Secondary School in Ontario, Canada. He enjoys challenging himself with difficult math and computer science problems by participating in various contests. Amanbir is an avid fan of Barcelona and has been playing soccer for many years. Amongst other things, he likes to read books, help others with problem-solving, and delve deeper into the field of computer science.

The post The Theory and Implementation of Common Machine Learning Algorithms appeared first on Exploratio Journal.

Machine Learning: Theory of learning models and practice in python

Mahesh V N — Wed, 17 Mar 2021 17:03:59 +0000

Author: Mahesh V N
Shanghai American School
November, 2020

What is Machine Learning

Machine Learning is a subset of Artificial Intelligence (AI) which provides machines the ability to learn automatically and improve from experience without being explicitly programmed. Machine Learning is used anywhere from automating mundane tasks to offering intelligent insights. As the field of statistics grew, machine learning gained popularity, and the intersection of computer science and statistics gave birth to a probabilistic approach in AI. With large-scale data available, scientists started to build intelligent systems that were able to analyze and learn from large amounts of data. Machine Learning is a type of AI that mimics learning and becomes more accurate over time.

Figure (1.0)
(left) Segmented approach to AI, over the decades (right) Thought process of ML functioning pics. Credit https://towardsdatascience.com/ and www.edureka.com

Applications and Prospects.

Artificial Intelligence (AI) is all around us and we are using it in one way or the other. One of the popular applications of AI is Machine Learning (ML), in which computers, software, and devices learn through data how to make predictions. Companies are using ML to improve business decisions, forecast weather and much more. Machine learning is about training algorithms on a given set of data and making predictions on another set of unseen data. Some of the most common machine learning applications are:

Learning to predict whether an email is spam or not.
Clustering Wikipedia entries into different categories.
Social Media Analysis (recognise words and understand the context behind them) – The LionBridge project is a sentiment analysis tool that provides users with insights based on social media posts.
Smart Assistance – analyse voice requests or automate daily tasks as well as adapt to changing user needs – Alexa by Amazon uses all collected data to improve its pattern recognition skills and be able to address user needs.
News Classification – As the amount of content produced grows exponentially, businesses and individuals need tools that classify and sort out the information, with algorithms able to run through millions of articles in many languages and select the ones relevant to user interests and habits.
Image Recognition
Video Surveillance – With complex algorithms developed using machine learning for video recognition, the system will at first learn under human supervision to spot human figures, unknown cars and other suspicious objects; soon it will be possible to imagine a video surveillance system that functions without human intervention.
Optimisation of Search Engine Results – Algorithms can learn from search statistics; they would not rely on meta tags and keywords, but instead analyse the contents of the page. This is how Google RankBrain works.

Types of Machine Learning approaches

Machine learning is a unique way of programming computers. The underlying algorithm is selected or designed by a human. However, the algorithms learn from data, rather than direct human intervention. The parameters of a mathematical model are learnt by the algorithms in order to make predictions. Humans don’t know or set those parameters; the machine does.

To explain in simple terms, Machine Learning uses a data set to train a mathematical model, which is fed with enough samples to analyse the data predictably and output a sensible result. Machine Learning can be divided into a series of subclasses: Supervised Learning, Unsupervised Learning, and Reinforcement Learning. The supervised learning category is further divided into Regression and Classification for a more streamlined approach to a task, which results in faster and more accurate results.

Nomenclature of Machine Learning terms: A feature is an individual measurable property of the phenomenon being observed, e.g. the square footage of a house to predict the house’s cost, a 2D image input to perform object recognition, etc. It is a characteristic of the data that is used as input of the ML algorithm, and for this reason it is often used as a synonym for the ML input; it is commonly denoted as x.

The output of the algorithm is also called “prediction” or “outcome” and it is denoted with “y hat”. The label of the machine learning algorithm is usually denoted with y. So, for instance, the purpose of a supervised algorithm is to use input x to generate the output “y hat” which is as close as possible to the real outcome y.

Figure (2.0) Hierarchy of ML Classification. Credit: https://towardsdatascience.com/

Supervised Learning

Supervised learning is an approach to creating artificial intelligence (AI), where the program is given labeled input data and the expected output results. The AI system is specifically told what to look for, thus the model is trained until it can detect the underlying patterns and relationships, enabling it to yield good results when presented with never-before-seen data. Supervised learning problems can be of two types: classification and regression problems. Examples are: determining what category a news article belongs to or predicting the volume of sales for a given future date.

How does supervised learning work?

Like all machine learning algorithms, supervised learning is based on training. During its training phase, the system is fed with large amounts of labeled data, which tell the system what output should be obtained from each specific input value.

The trained model is then presented with test data to verify the result of the training and measure the accuracy.
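The train-then-test procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set, the `split` helper, and the threshold "model" are all hypothetical, and real projects usually shuffle the data before splitting.

```python
def split(samples, every=4):
    """Hold out every `every`-th sample as test data; train on the rest."""
    test = samples[::every]
    train = [s for i, s in enumerate(samples) if i % every != 0]
    return train, test

def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for x, y in samples if model(x) == y)
    return correct / len(samples)

# Hypothetical labeled data: the label y is 1 when x >= 5, otherwise 0.
data = [(x, 1 if x >= 5 else 0) for x in range(20)]
train, test = split(data)

# "Training": place a threshold halfway between the two classes seen in training.
largest_zero = max(x for x, y in train if y == 0)
smallest_one = min(x for x, y in train if y == 1)
threshold = (largest_zero + smallest_one) / 2

def model(x):
    """Predict 1 for inputs at or above the learned threshold."""
    return 1 if x >= threshold else 0

train_accuracy = accuracy(model, train)
test_accuracy = accuracy(model, test)
print(train_accuracy, test_accuracy)
```

In this toy setup the model fits the training data perfectly but misclassifies one held-out sample, which is exactly why accuracy is measured on test data rather than on the training data.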

Types of Supervised learning

Supervised learning can be further classified into two different machine learning categories:

Classification
Regression

Figure 3.0 – division of problem set in supervised learning. Credit: www.edureka.com

Figure (3.1) (left – a) labelled output feedback (center – b) learning approach. (right – c) Aim of supervised learning. Credit: www.edureka.com

In supervised learning, shown in Fig. 3.1(a), we train the algorithm on labelled input data, so the expected output for each input is known. As Fig. 3.1(b) shows, the model learns to map each labelled input to its known output. Applications of supervised learning include risk evaluation and sales forecasting for a business.

Classification in Machine Learning.

Classification is a supervised machine learning approach used to categorise a set of data into classes, so the outcomes are typically classes or categories. For example, a user may want to predict what is in a given image, where the possible categories are clearly demarcated.

Classifiers fall into two sub-categories: lazy learners and eager learners. A lazy learner simply stores the training data and waits until test data is presented; it spends little time on training but more time on predicting. An eager learner, by contrast, constructs a classification model from the training data before making any predictions. It spends more time on training and less on prediction, because it commits to a single hypothesis that must work for the entire input space.
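To make the distinction concrete, here is a minimal sketch contrasting the two styles. The choice of algorithms is an assumption made for illustration: a nearest-neighbour classifier stands in for a lazy learner and a nearest-centroid classifier for an eager one. The class names and the toy data are hypothetical.

```python
# Toy one-dimensional training data (hypothetical): (feature, label)
train = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (6.0, "B"), (6.5, "B"), (7.0, "B")]

class LazyNearestNeighbor:
    """Lazy learner: fit only stores the data; the work happens at prediction time."""

    def fit(self, samples):
        self.samples = list(samples)
        return self

    def predict(self, x):
        # Scan every stored sample to find the closest one.
        nearest = min(self.samples, key=lambda s: abs(s[0] - x))
        return nearest[1]

class EagerCentroid:
    """Eager learner: fit builds a compact model (one mean per class) up front."""

    def fit(self, samples):
        sums, counts = {}, {}
        for x, label in samples:
            sums[label] = sums.get(label, 0.0) + x
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {label: sums[label] / counts[label] for label in sums}
        return self

    def predict(self, x):
        # Only the per-class means are consulted, not the training data.
        return min(self.centroids, key=lambda label: abs(self.centroids[label] - x))

lazy = LazyNearestNeighbor().fit(train)
eager = EagerCentroid().fit(train)
print(lazy.predict(3.0), eager.predict(3.0))
```

The lazy learner's prediction cost grows with the size of the stored training set, while the eager learner's prediction cost depends only on the number of classes.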

Fig 4.0 Pictorial representation of Classification in Machine learning. Credit: www.edureka.com

In Figure 4.0 (above) we have a classifier, which in most cases is an algorithm used to map input data to a specific category. A classification model is trained to predict the class or category of the data.

Classification problems can be divided into two different types: binary classification and multi-class classification.

Binary classification has two outcomes, based on the categories specified for the required output (example: the output is to be noted only as a dog or a cat).

In case of a multi-class classification the outputs or predictions are more than two (in general a group of M possible predictions). The goal of the algorithm is to predict the specific output among the possible M.

Note: Classification is discussed in more detail in later sections.

Credit: www.edureka.com

Regression Machine Learning

Regression is a predictive statistical process in which the model attempts to find the relationship between dependent and independent variables. The goal of a regression algorithm is to predict a continuous number such as sales, income, or test scores. Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a relationship between the variables of interest. This does not necessarily imply that one variable causes the other (for example, higher SAT scores do not cause higher college grades), but that there is some significant association between the two variables.

Types of Regression: Simple (Univariate)

Conditions: One variable is considered

Figure 5.1 Pictorial representation of types of Regression Machine Learning. Credit: https://www.codeproject.com

A Mathematical Approach into Linear Regression

Y = ß0 + ß1 X is the equation of linear regression, where X is the independent variable and Y is the dependent variable. ß1 is the slope of the line and ß0 is the intercept (the value of Y when X = 0). Fitting this equation to the data yields a model that can predict unseen data. The fitting procedure consists of finding the parameters ß0 and ß1 that best fit the data. In machine learning terms, this means to “learn” the parameters ß0 and ß1 that produce the least error in fitting the data.

Least-Squares Regression

This is the most common method for fitting a regression line. It calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each point to the line (a point that lies exactly on the fitted line has a vertical deviation of 0). The deviations are squared before being summed so that negative and positive values do not cancel each other. Least-squares regression uses the “Ordinary Least Squares” or “OLS” method to determine the best-fitting line, i.e. the line that minimizes the total squared distance between the points and the regression line. Figures 6.1 to 6.3 give a visual description of this process. The distance is also known as the error in the system (see Figure 6.2). In Figure 6.1, as an example, we consider the price of a commodity and the year of production.

Figure 6.1 shows multiple candidate lines that could fit the data. Figure 6.2 shows a visual representation of the error for one of these lines.

Finally, figure 6.3 shows the best fitting line, i.e. the line which produces the least squared error. Credit: https://towardsdatascience.com/

Once the best fitting line has been found, we can use it to predict the price of the commodity in the year 2020 based on all the data from the previous years.
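The fitting and prediction steps above can be sketched with the standard closed-form least-squares formulas for a single feature. The prices below are made up purely for illustration (they are not the data behind the figures), and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for y = b0 + b1 * x, minimizing the sum of squared vertical deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx              # slope
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1

def sum_squared_error(b0, b1, xs, ys):
    """Sum of squared vertical deviations between the points and the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Illustrative, made-up commodity prices per production year.
years = [2015, 2016, 2017, 2018, 2019]
prices = [10.0, 12.0, 13.0, 15.0, 17.0]

b0, b1 = fit_line(years, prices)
price_2020 = b0 + b1 * 2020     # extrapolate the fitted line to 2020
print(round(b1, 3), round(price_2020, 3))
```

Any other line, for example one with a slightly different slope, gives a larger value of `sum_squared_error` on the same data, which is what makes the fitted line the "best" one in the least-squares sense.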

What is Multiple Linear Regression?

Multiple linear regression is the most common form of linear regression analysis. As a predictive analysis, the multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical.

A Mathematical approach

Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data.

Model for multiple linear regression, given n independent variables:

y = ß0 + ß1X1 + … + ßnXn + e

Where the values depict,

y – predicted value of the dependent variable.

ß0 – the y-intercept, i.e. the value of y when all independent variables are set to 0.

ß1X1 – regression coefficient (ß1) of the first independent variable (X1), which depicts the change in the predicted y – value with change in the independent variable.

ßnXn – regression coefficient of the nth independent variable (feature).

e – variation in the prediction of y value, also known as the model error.

(X1, X2, X3, …, Xn) are the independent variables factoring into the y value.
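As a rough sketch, the coefficients ß0 … ßn can be estimated by solving the least-squares normal equations. The example below does this in plain Python on hypothetical, noise-free data generated from y = 1 + 2·X1 + 3·X2, so the recovered coefficients should be close to 1, 2 and 3. The helper names are assumptions; in practice a numerical library would normally be used instead of hand-written elimination.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_multiple(X, y):
    """Least-squares coefficients [b0, b1, ..., bn] from the normal equations (A^T A) b = A^T y."""
    A = [[1.0] + list(row) for row in X]               # leading 1 models the intercept b0
    k = len(A[0])
    ata = [[sum(r[i] * r[j] for r in A) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * t for r, t in zip(A, y)) for i in range(k)]
    return solve(ata, aty)

# Hypothetical noise-free observations of two independent variables (X1, X2).
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]

b0, b1, b2 = fit_multiple(X, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```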

Unsupervised Learning

Unsupervised Learning is often used in the more advanced applications of artificial intelligence. It involves giving unlabeled training data to an algorithm and asking it to pick up whatever associations it can on its own. Unsupervised learning is popular in applications of clustering (the act of uncovering groups within data) and association (predicting rules that describe data).
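Clustering can be illustrated with a very small k-means sketch on one-dimensional data. K-means is chosen here only as a familiar example of a clustering algorithm; the points, the starting centers, and the function name are hypothetical. No labels are given: the algorithm groups the points purely by proximity.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and recomputing the centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster (unchanged if empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Unlabeled, made-up measurements that visibly form two groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers, clusters = kmeans_1d(points, centers=[0.0, 6.0])
print(centers, clusters)
```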

Figure 4.0 – division of problem set in Unsupervised learning

(a) (b) (c) Figure (4.1) (a) Machine is unaware of output (b) learning approach. (c) Applications. Credit: www.edureka.com

In Figure 4.0 (above) we have a classifier, which in most cases is an algorithm used to map input data to a specific category. A classification model predicts the class or category into which the data should be segregated. A feature is an individual measurable property of the phenomenon being observed. Binary classification has two outcomes, based on the categories specified for the required output (example: the output is to be noted only as a dog or a cat), whereas in multi-class classification each sample is assigned to one of more than two labels or targets.

Cost Function and Gradient Descent

A cost function measures how wrong the model is in its ability to estimate the relationship between x and y (in the examples above, the sum of squared errors minimized by OLS is the cost function for linear regression). It is typically expressed as a difference or distance between the predicted value and the actual value. The objective of a machine learning model is to find the parameters, structure, etc. that minimize the cost function.
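A minimal sketch of one common cost function, the mean squared error, is shown below. The data and the function names are hypothetical; the point is only that parameters which fit the data well produce a lower cost than parameters which fit it badly.

```python
def predict(b0, b1, xs):
    """Predictions of the line y = b0 + b1 * x for each input."""
    return [b0 + b1 * x for x in xs]

def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up data that follows y = 2 * x exactly.
xs = [1, 2, 3]
ys = [2, 4, 6]

cost_good = mean_squared_error(ys, predict(0, 2, xs))   # the true parameters
cost_bad = mean_squared_error(ys, predict(0, 1, xs))    # a slope that is too small
print(cost_good, cost_bad)
```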

The best-fit line is found by minimizing the difference between the actual values and the predicted values. Rather than applying brute force, linear regression can use an elegant method known as gradient descent to minimize the cost function and identify the best-fit line.

Gradient descent is an iterative optimisation algorithm for finding a local minimum of a function. It approaches the local minimum by taking steps proportional to the negative of the gradient of the function at the current point. The update rule is θ := θ − α ∇J(θ), where θ denotes the model parameters, J the cost function, and α (alpha) the learning rate.

The goal of the gradient descent algorithm is to minimize the given cost function by performing two steps iteratively.

Compute the gradient (slope), the first order derivative of the function at that point.

Make a step (move) in the direction opposite to the gradient, i.e. away from the direction in which the function increases, moving from the current point by alpha times the gradient at that point.

Alpha is the learning rate – a tuning parameter in the optimization process, which determines the length of the steps to be taken.
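The two steps can be sketched for simple linear regression with a mean-squared-error cost. This is an illustrative implementation under stated assumptions, not the code from the author's notebook: the data, the starting point (0, 0), the learning rate, and the number of steps are all arbitrary choices.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Fit y = b0 + b1 * x by repeatedly stepping against the gradient of the mean squared error."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Step 1: compute the gradient of the cost with respect to each parameter.
        grad_b0 = (2 / n) * sum(errors)
        grad_b1 = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        # Step 2: move opposite to the gradient, scaled by the learning rate alpha.
        b0 -= alpha * grad_b0
        b1 -= alpha * grad_b1
    return b0, b1

# Made-up data that follows y = 1 + 2 * x exactly.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

b0, b1 = gradient_descent(xs, ys)
print(round(b0, 4), round(b1, 4))
```

With these settings the parameters settle very close to the intercept 1 and slope 2 that generated the data.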

Credit: Coursera

The behaviour of gradient descent depends on the size of the learning rate alpha (written η below). There are four cases:
a) The learning rate is optimal: the model converges to the minimum.
b) The learning rate is too small: it takes more time but converges to the minimum.
c) The learning rate is higher than the optimal value: it overshoots but converges (1/C < η < 2/C).
d) The learning rate is very large: it overshoots and diverges, moving away from the minimum, and performance decreases as learning proceeds.
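These four regimes can be reproduced on a simple quadratic cost. The sketch below assumes the cost f(w) = (C/2)·w², so that C is the curvature of the cost and the optimal learning rate is 1/C; that interpretation of C is an assumption, since the text does not define it. The function name and the chosen values of η are hypothetical.

```python
def descend(eta, C=2.0, w=1.0, steps=20):
    """Gradient descent on f(w) = (C / 2) * w**2, whose gradient is C * w."""
    path = [w]
    for _ in range(steps):
        w = w - eta * C * w     # each step multiplies w by (1 - eta * C)
        path.append(w)
    return path

# With C = 2 the optimal rate is 1/C = 0.5 and the divergence limit is 2/C = 1.0.
optimal = descend(0.5)      # a) reaches the minimum w = 0 in a single step
too_small = descend(0.1)    # b) creeps towards 0 without changing sign
overshoot = descend(0.9)    # c) jumps across 0 each step but still shrinks
diverging = descend(1.1)    # d) jumps across 0 and grows in magnitude
print(optimal[1], too_small[-1], overshoot[-1], diverging[-1])
```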

Note: As the gradient decreases while moving towards the local minimum, the size of the step also decreases. So the learning rate (alpha) can be kept constant during the optimization and need not be varied at each iteration.

Classification: Logistic Regression

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable when the target has only two possible outcomes, for instance either 1 (success/yes) or 0 (failure/no). Logistic regression is further classified into two categories, namely binomial and multinomial.
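A binomial logistic regression with a single feature can be sketched as follows. The model is trained here with plain gradient descent on the log-loss, which is one common choice rather than the only one. The data (hours studied versus pass/fail), the learning rate, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, alpha=0.1, steps=5000):
    """Fit p(y = 1 | x) = sigmoid(w * x + b) by gradient descent on the average log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

# Made-up data: hours studied and whether the exam was passed (1) or failed (0).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
passed = [0, 0, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(hours, passed)
p_low = sigmoid(w * 1 + b)    # predicted probability of passing after 1 hour
p_high = sigmoid(w * 8 + b)   # predicted probability of passing after 8 hours
print(round(p_low, 3), round(p_high, 3))
```

The model outputs a probability for each input; a yes/no prediction is then obtained by comparing that probability with a threshold such as 0.5.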

Linear Regression vs Logistic Regression

Figure – 7.1 (a) Linear Regression graph. (b) Logistic Regression graph. Credit: www.edureka.com

In figure 7.1(a) above, the dependent variable Y can take values beyond the range 0 to 1. In logistic regression (figure 7.1(b)), by contrast, the predictions always lie between 0 and 1 (see figure 7.1(b) and figure 7.2). To obtain a binary output, we then apply a threshold: every value below 0.5 becomes 0 and every value above 0.5 becomes 1 (see figure 7.2).

Figure 7.2 sigmoid function curve graphical representation. Credit: www.edureka.com

Sigmoid Function Curve

The sigmoid function, also known as the S-curve, converts any value between negative infinity and positive infinity into a value between 0 and 1, which can then be turned into a discrete, binary output of 0 or 1 (refer to figure 7.2). In our example, the x-axis values between -4 and +4 are considered the transition region of the S-curve. Consider a data point with value 0.8, which is neither a discrete 0 nor a discrete 1. A threshold value of 0.5 is applied to obtain a discrete value: if the data point is greater than the threshold, the discrete predicted value is 1; otherwise it is 0.
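The sigmoid and the thresholding step can be written directly in Python. The function names are hypothetical, and the example follows the rule stated above: a value strictly greater than the threshold maps to 1, anything else to 0.

```python
import math

def sigmoid(z):
    """S-curve: maps any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def to_binary(probability, threshold=0.5):
    """Return 1 if the value is greater than the threshold, otherwise 0."""
    return 1 if probability > threshold else 0

# The curve is close to 0 at z = -4, exactly 0.5 at z = 0, and close to 1 at z = +4.
for z in (-4, 0, 4):
    print(z, round(sigmoid(z), 3), to_binary(sigmoid(z)))

# The data point 0.8 from the text lies above the threshold, so it becomes 1.
print(to_binary(0.8))
```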

Figure 7.3 comparison between Linear and Logistic Regression. Credit: www.edureka.com

Practical Use – Linear vs Logistic Regression.

Consider the weather forecast for a given calendar day. Logistic regression can be used to predict whether there will be rain, snow, a tide, or sunny weather. Each of these predictions is a Yes or No question, i.e. a discrete outcome. Linear regression can also be applied to weather forecasting, for example to predict the temperature on that same calendar day. Since the temperature is a continuous value, linear regression can be applied to the same set of samples to predict it.

Linear Regression – coding exercise

To develop practical coding skills, I have implemented linear regression in Python from scratch. The problem I considered was to predict the price of a house from its square footage, using existing real data. The code notebook can be found here: https://github.com/gdelfe/Machine-Learning-basics-course/blob/master/Exercise1/exercise1_MLbasics.ipynb

About the author

Mahesh V N

Mahesh is a student at the R. N. Shetty Institute of Technology in Bangalore.

The post Machine Learning: Theory of learning models and practice in python appeared first on Exploratio Journal.

Implementing a ResNet50 Architecture to Decrease Social Anxiety in Autistic Children by Detecting Emotions Portrayed Through Body Language Micro Gestures

Abstract

1. Introduction

2. Background and Significance

2.1 Definitions

2.2 Relevant Project Resources

3. Related Work

4. Methods and Analysis

5. Analysis and Results

6. Future Work

7. Conclusion

References

About the author

Kedaar Rentachintala

Multi-Armed Bandit: Study of Exploration vs. Exploitation

Abstract

Introduction

Explore then Commit Algorithm (ETC)

The Optimistic-Greedy Algorithm

The Epsilon-Greedy Algorithm (ε-Greedy)

The Upper Confidence Bound (UCB) Algorithm

UCB Algorithm: Asymptotic Optimality

UCB Algorithm: Minimax Optimality (MOSS)

UCB: Bernoulli Noise (KL-UCB)

Adversarial Bandits

Adversarial Bandits – The Exp3 Algorithm

Adversarial Bandits – The Exp3-IX Algorithm

Contextual Bandits

Contextual Bandits – Bandits with Expert Advice

Contextual Bandits – Exp4

Real Life Uses

Conclusion

References

About the author

Avi Shukla

Data Science Analysis of Stroke Prediction

1. Introduction

2. Stroke Prediction

3. Process of a Data Science Project

3.1 Data Acquisition

3.2 Data Cleaning

3.3 Data Exploration

3.4 Data Modeling

3.5 Data Interpretation

4. Machine Learning

4.1 Supervised & Unsupervised Problems

4.2 Classification & Regression

4.3 Linear Regression

4.4 Logistic Regression

5. Process of Stroke Prediction Project

5.1 Data Acquisition

5.2 Data Cleaning

5.3 Data Balancing

5.4 Data Modeling

5.5 Data Performance

5.6 Features Selection

6. Conclusion

Works Cited

About the author

Charisse Yeung

The Theory and Implementation of Common Machine Learning Algorithms

1. Introduction

2. Terminology

2.1 Features

2.2 Inputs

2.3 Outputs

2.4 Predicted Values

2.5 Expected Values

3. Supervised Learning

3.1 Classification

3.2 Regression

3.2.1 Linear Regression

4. Cost Function

4.1 Mean Absolute Error

4.2 Mean Squared Error

6. Multi-Linear Regression

7. Unsupervised Learning

7.1 Clustering

7.2 Association

Data Science Analysis of Stroke Prediction