data analysis Archives - Exploratio Journal

Data Science Analysis of Stroke Prediction

Charisse Yeung — Sun, 11 Sep 2022 16:35:22 +0000

Author: Charisse Yeung
Mentor: Dr. Gino Del Ferraro
Carlmont High School

1. Introduction

Today’s market is constantly reshaped by the rising popularity of AI and Machine Learning. Data science uses these technologies to solve modern problems and to link related data for future use. It is used extensively in many domains, such as marketing, healthcare, finance, banking, and policy. For my research project, I applied data science to healthcare, specifically to stroke prediction. Stroke is the fifth leading cause of death in the United States and a leading cause of severe long-term disability worldwide. Because treatment is costly and the effects are prolonged, prevention efforts and early identification of stroke risk would benefit a significant share of the country’s population, especially the disadvantaged. My goal is to help society use technology for stroke prediction. The paper is structured as follows: Section 2 introduces the causes and burden of stroke in the US population; Section 3 discusses the steps of a data science project; Section 4 introduces Machine Learning as a tool to make predictions; finally, Section 5 applies these analyses to a data set of stroke patients to make predictions.

2. Stroke Prediction

Every year, about 800,000 people in the United States are directly affected by stroke. The two major types of stroke are ischemic and hemorrhagic (Figure 2.1). Ischemic stroke results from a blocked artery that cuts off the blood supply to an area of the brain. North African, Middle Eastern, sub-Saharan African, North American, and Southeast Asian countries had the highest rates of ischemic stroke. Hemorrhagic stroke results from a broken or leaking blood vessel that spills blood into the brain.

Figure 2.1 Ischemic vs. Hemorrhagic stroke

In both cases, the brain does not receive enough oxygen and nutrients, and brain cells begin to die. Risk factors for stroke are old age, overweight, physical inactivity, heavy alcohol consumption, drug consumption, smoking, hypertension, diabetes, and heart disease (Figure 2.2). One in 3 American adults has at least one of these conditions or habits: high blood pressure, high cholesterol, smoking, obesity, and diabetes. In my project, I investigated risk factors in stroke patients to find a correlation and make stroke predictions. Furthermore, I chose to focus my research on American patients since stroke risk factors are much more prevalent in the United States than in other countries.

Figure 2.2 Risk factors of stroke

3. Process of a Data Science Project

In problem-solving, one must follow a particular series of steps and a deliberate plan to reach a resolution. The same technique applies to a data science project. A dataset isn’t enough to solve a problem; one needs an approach or a method that will give the most accurate results. A data science process is a guideline defining how to execute a project. The general steps in the data science process are: defining the topic of research, obtaining the data, organizing the data, exploring the data, modeling the data, and finally communicating the results.

Before starting any data science project, the topic of the research project must be defined. It is critical to brainstorm numerous relevant research ideas and then narrow the focus to one that is worth pursuing. Relevancy is the factor in research that helps both the data scientist and the reader develop confidence in the investigation’s findings and outcome. Relevant research topics can be social, economic, intellectual, environmental, etc., as long as they are up to date. For example, gun control would be a relevant social issue for research, and stroke prediction would be a relevant medical research idea. To gain deeper insight, one should research the specific topic thoroughly, for example by reading articles on the internet or talking to an expert. After developing a solid understanding, one should have a general idea of the ultimate purpose and goal of the project. One should ask themselves: “What problem am I trying to solve?” In my case, the problem I am trying to solve is the large number of deaths caused by stroke every year in the US. The purpose of this project is to use data science to make stroke predictions and to limit the effects of stroke on the population by identifying its early stages through correlations with known risk factors. Understanding and framing the problem will help build an effective model that will have a positive impact.

3.1 Data Acquisition

Next, one must find the data to be analyzed in the project. When searching for data, one should look for high-quality, targeted datasets. Not only the research topic but also the data needs to be relevant. Data from different sources can be extracted and sorted into categories to form a particular dataset; when the sources are web pages, this process is known as data scraping. One can find sources on the internet from research centers, government organizations, and websites for data scientists, such as Kaggle (Figure 3.1). The data must be accessible, so the most convenient formats for data science are CSV, JSON, or Excel files. Once the datasets have been downloaded, they must be imported into an environment that can read these formats directly. In most cases, data scientists import the data into Python or R. In my case, I downloaded from Kaggle a CSV file of stroke data consisting of patients from the US and their conditions, and then I imported the data into a Jupyter Notebook in Python.

Figure 3.1 Stroke patient data downloaded from Kaggle
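As a minimal sketch of the import step, assuming pandas is installed, the snippet below reads CSV text into a DataFrame. The three sample rows are invented for illustration, and the column names are an assumption based on the features listed in Section 5.1; in practice `pd.read_csv` would be given the path of the downloaded file instead.

```python
import io

import pandas as pd

# Invented sample rows standing in for the downloaded file (not real patient data).
SAMPLE_CSV = """id,gender,age,hypertension,heart_disease,avg_glucose_level,bmi,smoking_status,stroke
1,Male,67.0,0,1,228.69,36.6,formerly smoked,1
2,Female,61.0,0,0,202.21,,never smoked,1
3,Female,35.0,0,0,95.12,27.4,smokes,0
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.head())   # first rows, comparable to a notebook preview
print(df.shape)    # (number of rows, number of columns)
```

An empty field, such as the second patient's BMI, is read as NaN, which matters for the cleaning step that follows.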

3.2 Data Cleaning

The data acquired and imported is not perfect on its own. Thus, the data must be organized and cleaned to ensure the best quality. Duplicate and unnecessary data are removed, and missing data are replaced. Unnecessary data could be infinities, outliers, or data that does not belong in the sample. For my project on stroke predictions, I removed particular patients from the set if their BMI was infinite (Figure 3.2) or if they lived outside the United States, which is the scope of my study.

Figure 3.2 Example of infinities in a data set

Some irrelevant data are less obvious, and finding them requires analyzing the correlation between each parameter and the target. If the correlation is very low, the parameter is considered irrelevant and can be removed. If a value is missing for a patient, either locate the correct value and fill it in, or delete the patient from the dataset. The data is then consolidated by splitting, merging, and extracting columns to organize it and make it efficient to use. The efficiency and accuracy of the analysis depend considerably on the quality of the data, especially when it is used for making predictions.
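The generic cleaning operations described in this section can be sketched as follows, assuming pandas and NumPy. The records are invented, and dropping (rather than filling in) the unusable rows is only one of the two options mentioned above.

```python
import numpy as np
import pandas as pd

# Invented records: one exact duplicate, one infinite BMI, one missing BMI.
df = pd.DataFrame({
    "age": [67, 67, 45, 52, 30],
    "bmi": [36.6, 36.6, np.inf, np.nan, 24.1],
    "stroke": [1, 1, 0, 0, 0],
})

df = df.drop_duplicates()                     # remove repeated rows
df = df.replace([np.inf, -np.inf], np.nan)    # treat infinities as missing
df = df.dropna(subset=["bmi"])                # drop rows without a usable BMI
df = df.reset_index(drop=True)

print(df)
```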

3.3 Data Exploration

A critical part of exploring and analyzing the data is finding covariations, as mentioned earlier. Different types of data, such as numerical, categorical, and ordinal, require different treatments. Numerical data is a measurement or a count. Categorical data is a characteristic such as a person’s gender, marital status, hometown, or the types of movies they like. Categorical data can take numerical values, such as “0” indicating no and “1” indicating yes, but those numbers don’t have mathematical meaning. In my case, I used numerical data (age, average glucose level, and BMI) and categorical data (gender, hypertension, heart disease, marriage status, work, residence, smoking status, and stroke). I detected patterns and trends in the data using visualization features in Python with Numpy, Matplotlib, Pandas, and Scipy. With Numpy and Matplotlib, I could plot linear regressions, bar charts, and a heat map of the correlations between selected parameters and the target. Using insights gained from the visualizations and correlations, one can start to make conjectures about the problem being solved. This step is crucial for data modeling.
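Before plotting anything, the same kind of exploration can be done numerically. The sketch below, assuming pandas and using six invented patients, compares a numerical feature across the two target classes, computes the stroke rate within each category of a 0/1 feature, and lists each column's correlation with the target.

```python
import pandas as pd

# Invented toy data; column names are illustrative assumptions.
df = pd.DataFrame({
    "age": [25, 40, 58, 63, 71, 80],
    "hypertension": [0, 0, 1, 0, 1, 1],
    "stroke": [0, 0, 0, 1, 1, 1],
})

# Numerical feature: compare its average across the two target classes.
mean_age_by_stroke = df.groupby("stroke")["age"].mean()

# Categorical (0/1) feature: share of stroke cases within each category.
stroke_rate_by_hypertension = df.groupby("hypertension")["stroke"].mean()

# Pearson correlation of every column with the target.
corr_with_stroke = df.corr()["stroke"]

print(mean_age_by_stroke)
print(stroke_rate_by_hypertension)
print(corr_with_stroke)
```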

3.4 Data Modeling

Data modeling is the climax of the data science process. The pre-processed data is used to build a model with a learning algorithm and to perform a multi-component analysis. At this stage, a model is created to reach the goal and solve the problem. In my case, I used a Machine Learning algorithm as the model, which can be trained and tested using the dataset. Machine Learning is the use and development of computer systems that can learn and adapt without following explicit instructions by using algorithms and statistical models to analyze and draw inferences from patterns in data. The first step in data modeling with Machine Learning is data splitting (Figure 3.3), where the entire data set is divided into two parts: training data and testing data. Generally, data scientists use 80% of their data for training and the remaining 20% for testing. The training input data is fed to the Machine Learning model to train it. The data is tagged, or labeled, according to the defined criteria so that the model can produce the desired output. During this operation, the model recognizes the patterns linking the parameters and the target in the training data. Algorithms are trained to associate certain features with tags based on manually tagged samples, and then learn to make predictions when processing unseen data. The model is then tested for accuracy with the remaining 20% of the data. Since the correct label for each individual in the testing data is already known, running the model on the testing data shows whether its predictions are accurate.

Figure 3.3 Diagram of the Training-Testing cycle
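The 80/20 split can be illustrated with a few lines of standard-library Python. This is only a sketch with a hypothetical helper name, `train_test_split_rows`; libraries such as scikit-learn provide their own splitting function.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


patients = list(range(100))                 # stand-ins for 100 patient records
train, test = train_test_split_rows(patients)
print(len(train), len(test))
```

Shuffling before splitting matters when the file is sorted in some way, for example with all stroke patients listed first.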

The goal is to maximize the model’s accuracy by making final edits and testing it. One may encounter issues during testing and must fix them before deploying the model into production. This stage builds a model that best solves the problem.

3.5 Data Interpretation

The concluding step of the data science process is to communicate the results obtained from the model. Once the project is completed and the goal is accomplished, one must present the results to an audience through a research paper or a presentation. The presentation should be comprehensible to a non-technical audience. The findings could be visualized with graphs, scatterplots, heat maps, or other suitable visualizations. Useful data visualization tools include Matplotlib, ggplot, and Seaborn, which can be used from Python, as well as Tableau and d3js. To visualize the covariance between stroke and its primary causes, I used Matplotlib and Seaborn to create a heatmap. During the presentation, one should report the results and carefully explain their reasoning and meaning. My ultimate goal is to make predictions for strokes with given patient data, and I hope my research paper will raise awareness of this technology and its global benefits for stroke patients. A successful presentation will prompt the audience to take action in response to the purpose.

4. Machine Learning

The popularity of Machine Learning, particularly its subset Deep Learning, has grown rapidly in the past decade with skyrocketing interest in Artificial Intelligence. However, the history of Machine Learning dates back to the mid-twentieth century. Machine Learning is a subset of Artificial Intelligence, the field that imitates human behavior and cognition. The “learning” in Machine Learning expresses how the algorithm automatically learns from the data and improves with experience by constantly tuning its parameters to find the best solution. The data set trains a mathematical model to know what to output when it sees similar data in the future. Machine Learning can be classified into three algorithm types: Supervised Learning, Unsupervised Learning, and Reinforcement Learning (Figure 4.1). While Supervised and Unsupervised Learning are presented with a given set of data, a Reinforcement Learning system, known as an agent, learns by interacting with its environment. The agent makes observations and selects an action. When it takes the action, it receives feedback in the form of a reward or a punishment. Its goal is to maximize rewards and minimize penalties; thus, it learns and tunes its knowledge to take the actions leading to reward and avoid the actions leading to punishment.

Figure 4.1 Web diagram of Machine Learning

4.1 Supervised & Unsupervised Problems

The significant distinction between Supervised and Unsupervised Learning is whether the given data set is labeled. In Supervised Learning, the machine is given pre-labeled data. For my project, I used Supervised Learning and already had data from researchers who labeled each patient as with or without stroke. I used a portion of this labeled data to train the model to distinguish which patients have or do not have a stroke based on their given conditions. The system builds a mapping function that uses the pre-existing data to create the best-fit curve or line and make estimations. Subsequently, I used the remaining portion of my labeled data to test the model for its accuracy. The goal is to maximize the accuracy of the model’s approximations when given new input data. In Unsupervised Learning, the machine is given unlabeled and uncategorized data, so it applies statistical methods to the data without prior training. For example, I would be using Unsupervised Learning if I were to predict which of the given patients have diabetes without previous data on diabetes. To form a model, I would have to analyze the data distribution and separate it based on similar patterns. Without any labeling, I would divide the patients into two groups based on their similar characteristics and behavior. Unsupervised Learning is split into two types: clustering and dimensionality reduction. In clustering, the goal is to find the inherent groupings and reveal the structure of the data. Some examples of clustering are my previous example of grouping patients by likely diabetes status, targeted marketing, recommender systems, and customer segmentation. In dimensionality reduction, the goal is to reduce the number of dimensions (features) used to describe each example rather than the number of examples.

4.2 Classification & Regression

Supervised Learning is divided into two types: classification and regression. The goal of classification is to determine the specific labeled group the given input belongs to. The output variable is a discrete category or class. The only possibilities for my project are “stroke” or “no stroke.” The given data on the patients trains the model to relate various parameters (their conditions and behavior) to the corresponding output of “stroke” or “no stroke.” The output could also be a defined set of numbers, such as “0” representing no stroke and “1” representing stroke. A classification algorithm is evaluated by the accuracy of its categorization. As a result, the model could predict whether a new patient would have a stroke. For regression, the outputs are continuous and have an infinite set of possibilities, generally real numbers. For instance, the machine could estimate a house’s cost based on its location, size, and age. Standard algorithms with “regression” in their names are linear regression, logistic regression, and polynomial regression, although logistic regression is typically used for classification.

In the following sections, I will discuss two regression models: linear and logistic regression. The former is used as an introduction to the regression problem, whereas the latter is the algorithm that I used to perform stroke predictions.

4.3 Linear Regression

Linear regression uses the relationship between the data points to draw a straight line, known as the line of best fit, that passes as close as possible to all of them. This line of best fit is then used to predict output values. A linear function has a constant rate of change, or slope, and is usually written in the mathematical form:

$$y = \theta_1 x + \theta_0 \qquad \text{(Equation 4.1)}$$

where θ1 is the constant slope and θ0 is the y-intercept. When finding the line of best fit, there are infinitely many possible straight lines through the values (Figure 4.2), and the θ1-value (slope) and θ0-value (y-intercept) are adjusted to choose among them. The “θ0” and “θ1” are the two parameters of the function. Regression is the prediction of the exact numeric values these parameters must take to give the line of best fit. When given a data set, there exist various x-variables (features or inputs) and a y-variable (label or output). In my case, the features included gender, age, multiple diseases, and smoking status. The label is no stroke or stroke, listed as “0” and “1.” When using actual data, there will always be a distance between the actual and predicted y-values. This distance, known as the error, is minimized as much as possible to form the line of best fit.

Figure 4.2 Possible lines of best fit for a given dataset

The error is often represented by a cost function, which is the mean of the squared differences between the actual output and the predicted output:

$$C(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - g(x_i) \right)^2 \qquad \text{(Equation 4.2)}$$

where n is the number of data points, y_i is the real label output, g(x_i) = θ1x_i + θ0 is the approximation of the output, and (y_i – g(x_i)) is the error. The error is squared to ensure that the cost function is a sum of non-negative values. The line of best fit is found when the mean square error is as small as it can be. In Machine Learning, the model is trained to find the line of best fit using Gradient Descent, an optimization algorithm that finds a local minimum of a differentiable function. Gradient Descent can be represented with the formula:

$$\theta_n = \theta_{n-1} - \alpha \, \frac{\partial C}{\partial \theta}\left(\theta_{n-1}\right)$$

where α is the learning rate and ∂C/∂θ is the instantaneous rate of change of the cost function at θ. The learning rate determines the size of each step. Data scientists often choose 0.001 < α < 0.01 because an α that is too large will never converge to the minimum and an α that is too small will take too long to reach it. Moving down the function C(θ), θ_n and θ_{n-1} approach each other. Once the difference is very small, for example |θ_n – θ_{n-1}| < 0.001, the line of best fit is found.

One example of linear regression would be predicting the number of sales from the product’s price. There would be a set of data with various products at different prices (the inputs) and each of their sales (the outputs). Assuming the relationship between the prices and the sales is linear, one would be able to find the linear model with the smallest mean square error. Thus, one can predict the number of sales at a new price. When two inputs or independent variables exist, the function becomes three-dimensional (Figure 4.3), and the model becomes a plane of best fit.

Figure 4.3 Plane of best fit on a three-dimensional graph
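The gradient descent procedure described above can be sketched in plain Python, assuming the cost function is the mean squared error and a single input variable. The function name and the toy data are illustrative. The stopping tolerance is set much tighter than the 0.001 mentioned in the text so that the fitted parameters land close to the line used to generate the data.

```python
def fit_line_gradient_descent(xs, ys, alpha=0.01, tol=1e-9, max_iter=100_000):
    """Fit y = theta1 * x + theta0 by minimising the mean squared error."""
    n = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(max_iter):
        errors = [y - (theta1 * x + theta0) for x, y in zip(xs, ys)]
        # Partial derivatives of C = (1/n) * sum(error^2).
        grad0 = -2.0 / n * sum(errors)
        grad1 = -2.0 / n * sum(e * x for e, x in zip(errors, xs))
        new_theta0 = theta0 - alpha * grad0
        new_theta1 = theta1 - alpha * grad1
        step = max(abs(new_theta0 - theta0), abs(new_theta1 - theta1))
        theta0, theta1 = new_theta0, new_theta1
        if step < tol:   # successive estimates have stopped moving
            break
    return theta0, theta1


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # generated from y = 2x + 1
theta0, theta1 = fit_line_gradient_descent(xs, ys)
print(round(theta0, 3), round(theta1, 3))
```

With a learning rate that is too large for the data, the updates overshoot and the loop would stop only at `max_iter`, which is the non-convergence described above.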

4.4 Logistic Regression

The data may not always fit a linear model. For my data set on stroke predictions, the only two possible labels are no stroke and stroke, or “0” and “1,” which is an example of binary classification. Linear regression is not ideal in the case of binary classification.

Figure 4.4 Linear regression used in binary classification

The line of best fit would exceed the 0 to 1 range and would not be a good representation of the data, as seen in Figure 4.4. That is why I used a logistic function to model the data. A logistic function, also known as a sigmoid curve, is an “S”-shaped curve (Figure 4.5) that can be represented by the function:

$$f(x) = \frac{L}{1 + e^{-(\theta_0 + \theta_1 x)}}$$

where L is the curve’s maximum value and θ₀ + θ₁x = g(x), the linear regression function.

Figure 4.5 Logistic regression used in binary classification
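A short sketch of the logistic function, using only the standard library. The parameter values passed to `predict_probability` are arbitrary illustrative numbers, not fitted values.

```python
import math


def sigmoid(z, L=1.0):
    """Logistic function with maximum value L, evaluated at z = g(x)."""
    return L / (1.0 + math.exp(-z))


def predict_probability(x, theta0, theta1):
    """Pass the linear function g(x) = theta0 + theta1 * x through the sigmoid."""
    return sigmoid(theta0 + theta1 * x)


# Arbitrary illustrative parameters: g(x) = -5 + 0.1 * x.
for age in (20, 50, 80):
    print(age, predict_probability(age, theta0=-5.0, theta1=0.1))
```

Unlike the straight line in Figure 4.4, the output always stays strictly between 0 and L, whatever the input.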

In the case of a common sigmoid function, the output is in the range of 0 to 1, so L would be 1. There exists a threshold at 0.5: outputs less than 0.5 are set to 0, while outputs greater than or equal to 0.5 are set to 1. Logistic regression finds the curve of best fit, or the best sigmoid function, for the given data set. For linear regression, we found the line of best fit by minimizing the mean square error with Gradient Descent. For logistic regression, we use the Cross-Entropy Loss Function to determine the curve of best fit. Cross-entropy loss is the sum of the negative logarithms of the probabilities the model predicts for the correct classes. In my case, I had only two labels and used the Binary Cross-Entropy Loss, which can be represented by the formula:

$$\mathrm{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ t_i \log f(s_i) + (1 - t_i) \log\left(1 - f(s_i)\right) \right]$$

where N is the number of patients, s_i is the input for patient i, f is the sigmoid function, and t_i is the target label (0 or 1). The goal is to minimize the loss; thus, the smaller the loss, the better the model. When the best sigmoid function is found, the Binary Cross-Entropy is as close to 0 as the data allow. The machine completes most of the logistic regression process internally, so it will solve for the best function, which can then be applied to make predictions.
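The 0.5 threshold and the Binary Cross-Entropy Loss can be illustrated with the standard library alone. This sketch assumes the loss is averaged over the examples and takes the predicted probabilities f(s_i) as given; the probability lists are invented.

```python
import math


def binary_cross_entropy(targets, probabilities, eps=1e-12):
    """Average binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)


def to_label(p, threshold=0.5):
    """Probabilities at or above the threshold become 1, the rest 0."""
    return 1 if p >= threshold else 0


targets = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]   # probabilities close to the true labels
unsure = [0.6, 0.5, 0.5, 0.4]      # probabilities near the threshold

print(binary_cross_entropy(targets, confident))
print(binary_cross_entropy(targets, unsure))
print([to_label(p) for p in confident])
```

The confident predictions give the smaller loss, which is why minimizing the loss pushes the sigmoid toward the labels.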

5. Process of Stroke Prediction Project

In the following section, I will apply the machine learning methods discussed above, specifically the logistic regression algorithm, to the case of stroke prediction. The data set introduced in Section 3.1 and the data science process discussed in Section 3 will be used. I will describe the process of my project in detail and explain the analysis involved in interpreting the accuracy and efficiency of my model.

5.1 Data Acquisition

Before I started the data science research project, I researched various topics and current events and chose to do my project on stroke prediction. I obtained my organized data from the Kaggle website, which allowed me to download the file as a CSV file conveniently. I used the Jupyter Notebook application via Anaconda as my environment for this project. I imported my downloaded CSV file to the notebook (Figure 5.1).

Figure 5.1 First 15 lines of the imported dataset

As seen in the top row of Figure 5.1, there are various parameters or features: gender, age, hypertension, heart disease, marriage status, work type, residence type, average glucose level, BMI, and smoking status. The output or target I investigated was whether or not the patient had a stroke. The variables hypertension, heart disease, and stroke are defined by “0” being no and “1” being yes.

5.2 Data Cleaning

During the data cleaning process, I removed the redundant data for clarity by deleting the “other” values in gender, the never_worked values from work_type, and the id column (Figure 5.2 & Figure 5.4). In addition, I labeled all categorical features, or non-numerical columns, as ‘category’ and converted them into numerical values for analysis (Figure 5.2 & 5.3). Since the age values are non-integers, I converted them into integers in the last row of my code (Figure 5.2).

Figure 5.2 Code for removal and revision of dataset

Figure 5.3 Conversion of categorical to numerical

Figure 5.4 Histograms before and after removal of unnecessary data
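For readers without access to the figures, the cleaning steps described in this section might look roughly like the following pandas sketch. It is not the code shown in Figure 5.2: the rows are invented, and the exact spellings "Other" and "Never_worked" are assumptions about how those values appear in the file.

```python
import pandas as pd

# Invented rows; value spellings are assumptions, not taken from the figures.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "gender": ["Male", "Female", "Other", "Female", "Male"],
    "age": [67.0, 1.8, 26.0, 49.0, 16.0],
    "work_type": ["Private", "children", "Private", "Self-employed", "Never_worked"],
    "stroke": [1, 0, 0, 1, 0],
})

# Remove the rare categories and the identifier column.
keep = (df["gender"] != "Other") & (df["work_type"] != "Never_worked")
df = df[keep].copy()
df = df.drop(columns=["id"])

# Mark text columns as categorical, then replace each category by an integer code.
for column in ["gender", "work_type"]:
    df[column] = df[column].astype("category").cat.codes

# Convert ages to whole numbers (astype(int) truncates toward zero).
df["age"] = df["age"].astype(int)
df = df.reset_index(drop=True)

print(df)
```

The integer codes are assigned in sorted order of the category names, so they carry no mathematical meaning, as noted in Section 3.3.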

The next part of data cleaning is handling missing values, which I treated as outliers. I identified them by counting the “null” or nonexistent values in each column (Figure 5.5), labeled as NaN in the data as seen previously in Figure 3.2. Any non-zero count means the column contains such values.

Figure 5.5 Identification of outlier

In my dataset, the only feature with such values was BMI. Thus, I replaced the missing BMI values with the mean BMI value using the code in Figure 5.6. I was confident no more null values were present in my data since all the counts were then zero.

Figure 5.6 Removal of outlier
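A minimal sketch of counting null values and filling them with the column mean, assuming pandas and NumPy. The four rows are invented, and this is not the code from Figures 5.5 and 5.6.

```python
import numpy as np
import pandas as pd

# Invented rows with two missing BMI values.
df = pd.DataFrame({
    "age": [67, 61, 80, 49],
    "bmi": [36.6, np.nan, 32.5, np.nan],
})

print(df.isnull().sum())        # non-zero counts flag columns with missing values

mean_bmi = df["bmi"].mean()     # NaN entries are skipped by default
df["bmi"] = df["bmi"].fillna(mean_bmi)

print(df.isnull().sum())        # every count should now be zero
```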

5.3 Data Balancing

Even after data cleaning, my dataset was not yet ready for use due to imbalance. Imbalanced data refers to the issue in classification when the classes or targets are not equally represented. The number of patients without stroke was much higher than the number with stroke (left plot in Figure 5.8). To create a fair model, the number of patients in the stroke and no stroke classes should be equal. I could have resampled the data by undersampling (downsizing the larger class) or oversampling (upsizing the smaller class). I chose to oversample with the SMOTE algorithm (Figure 5.7) because the number of patients in the stroke class was too small, and undersampling would have lowered the accuracy.

Figure 5.7 Code for resampling

Figure 5.8 Histogram of gender to stroke before and after balancing

As a result of the oversampling, the ratio of stroke to no stroke should be 1:1 and thus balanced (Figure 5.7 & right plot in Figure 5.8).
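To give an idea of how SMOTE-style oversampling creates new minority examples, here is a deliberately simplified sketch in plain Python. It interpolates between a random minority point and its single nearest minority neighbour, whereas library implementations of SMOTE choose among several nearest neighbours, so it should be read as an illustration of the idea rather than the algorithm used in Figure 5.7. The function name and data are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority point and its nearest minority neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbour = min(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        gap = rng.random()   # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Invented (age, BMI) pairs: 20 "no stroke" rows and 4 "stroke" rows.
majority = [(float(20 + i), float(20 + i % 7)) for i in range(20)]
minority = [(60.0, 30.0), (65.0, 28.0), (72.0, 33.0), (80.0, 31.0)]

new_points = smote_like_oversample(minority, len(majority) - len(minority))
minority_balanced = minority + new_points
print(len(majority), len(minority_balanced))
```

Because every synthetic point lies between two real minority points, the new examples stay inside the region already occupied by the stroke class.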

5.4 Data Modeling

After dividing the resampled data into 80% training and 20% testing, I created a logistic regression model with the training data (Figure 5.9).

Figure 5.9 Code for logistic regression

The logistic regression algorithm was imported from sklearn.linear_model and automatically found the best fit curve representing the dataset.
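The modeling step can be sketched with scikit-learn as follows. This is not the code in Figure 5.9: the data here is synthetic (two well-separated groups of points generated with NumPy), and the `random_state` and `stratify` settings are choices made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the resampled data: two features, balanced classes.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),   # class 0
    rng.normal(3.0, 1.0, size=(200, 2)),   # class 1
])
y = np.array([0] * 200 + [1] * 200)

# 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = LogisticRegression()
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)   # fraction of correct test predictions
print(accuracy)
```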

5.5 Data Performance

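In order to determine the accuracy of my model, I used the mean square error or MSE (from Equation 4.2). Because the labels are only 0 and 1, the MSE equals the fraction of misclassified patients, and the accuracy is 1 minus the MSE. These values could be found with three methods: the score method, sklearn.metrics, and the equation (Figure 5.10).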

Figure 5.10 Three methods of finding MSE

As a result, my model had approximately 91.1% accuracy. For a more detailed understanding of the model’s performance, I used a confusion matrix, which is a 2×2 table dividing the predictions into four categories (Figure 5.11).

Figure 5.11 Confusion matrix plot

The four categories, as shown in Figure 5.11, are true positive (bottom right), true negative (top left), false positive (top right), and false negative (bottom left). The accuracy of the model is high as long as most of the results are in the true positive and true negative categories, because there the predicted values are equal to the actual values. Using the confusion matrix, I further analyzed the performance of the model by calculating the F-1 score (Equation 5.1 & Figure 5.12):

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad \text{(Equation 5.1)}$$

where precision = TP / (TP + FP) is the share of predicted strokes that are real and recall = TP / (TP + FN) is the share of real strokes that are detected. The F-1 score therefore reflects both precision and recall rather than accuracy alone. I used the sklearn.metrics module to calculate my F-1 score (Figure 5.12), but I also could have used the equation.

Figure 5.12 Code for F-1 score

As a result, my model had an F-1 score of approximately 90.8%. Both my accuracy and F-1 score were above 90.0%, and thus my model had high accuracy and precision.
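The relationship between the confusion matrix, accuracy, MSE, and F-1 score can be checked by hand on a small example. The ten labels below are invented and unrelated to the results reported above.

```python
def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


# Invented labels for ten patients.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, tn, fp, fn)
print(accuracy, mse, f1)
```

For 0/1 labels each wrong prediction contributes exactly 1 to the squared error, so accuracy and MSE always add up to 1.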

5.6 Features Selection

Although my model already had high performance, I attempted to further increase it by removing certain features from my data. I hypothesized that the accuracy would improve if I removed the unimportant features or features with little correlation to the presence of stroke. On the other hand, the accuracy would drastically decrease when I removed important features. I determined the important and unimportant features with a correlation matrix plot (Figure 5.13).

Figure 5.13 Correlation matrix plot

The labeled bar on the right of Figure 5.13 shows the scale of the correlation between the features and output. The algorithm found the correlation with the following equation:

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}}$$

where cov is the covariance, σ_x is the variability of x with respect to its mean (the standard deviation, which is the square root of the variance), x_i is a value of x, x̄ is the mean of x, and the y-variables have the same meanings for the y data set. To find the correlation between the parameters and stroke, I focused on the right-most column of the map. A correlation of 1.0 means the trends of the feature and output are equivalent, while a correlation of -1.0 means the trends of the feature and output are completely opposite. Both types of correlation are considered crucial when creating the logistic regression model. On the other hand, the feature and output have no linear relationship if the correlation is 0. Therefore, I considered the features with a correlation close to 0 (gender, residence type, children, and unknown smoking status) unimportant and removed them from my dataset (Figure 5.14).
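The correlation coefficient can be computed directly from its definition. The sketch below uses only the standard library, and the short lists are invented to show the three reference cases of 1, -1, and 0.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    # The 1/n factors of the covariance and variances cancel out.
    return cov / math.sqrt(spread_x * spread_y)


xs = [1.0, 2.0, 3.0, 4.0]
print(pearson_correlation(xs, [2.0, 4.0, 6.0, 8.0]))    # same trend
print(pearson_correlation(xs, [8.0, 6.0, 4.0, 2.0]))    # opposite trend
print(pearson_correlation(xs, [1.0, -1.0, -1.0, 1.0]))  # no linear trend
```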

Figure 5.14 Code for removal of unimportant features

After the removal, I repeated the processes of splitting the data, training the model, creating the logistic regression model, and calculating its accuracy with MSE and F-1 scores. Surprisingly, the accuracy and F-1 score dropped to approximately 86.6%; hence, removing these features left the model with less information to train on and made it less accurate and precise. I further tested this theory by removing the important features, keeping only the features deemed unimportant, and then repeating the data modeling process. Understandably, the accuracy dropped to 66.2%, and the F-1 score fell to 71.9%. In conclusion, I kept my original model with all the features because it had the highest accuracy and precision.

6. Conclusion

In this data science project, I applied Machine Learning algorithms to predict the likelihood of a patient in the United States having a stroke. The goal of making such predictions is to prevent the consequences of stroke, which affects a large population of Americans today. Throughout the project, I closely followed each step of the data science process: data acquisition, data cleaning, data exploration, data modeling, and data interpretation. I discussed how the difference between Supervised and Unsupervised Learning is whether the given data is labeled. Within Supervised Learning, there is Classification, with categorical outputs, and Regression, with numerical outputs. These data sets can be modeled with linear and logistic regression. In my project, I used a logistic regression algorithm to train and test my model. I evaluated my model with MSE and F-1 scores, and it had an accuracy of about 90%, which is a very promising outcome. To check whether the highest accuracy had been reached, I removed, in turn, the features with low correlation deemed unimportant and the features with high correlation deemed important. The removal of important features led to a drastic drop in accuracy, and thus those features of the dataset should continue to be collected and studied for stroke predictions. Meanwhile, the removal of the features deemed unimportant led to a small drop in accuracy, so those features are still useful and should be collected along with the important features. There may be other factors that play a role in the risk of stroke; however, the factors I have mentioned are of greatest significance based on the accuracy of my model.

Works Cited

Yeung, C. (2022, August 11). Stroke_Predictions_Project_Charisse_Yeung.ipynb. GitHub. Retrieved September 3, 2022, from https://github.com/honyeung21/data_science/blob/main/Stroke_Predictions_Project_Charisse_Yeung.ipynb

Medlock, B. (2022). Stroke. Headway. Retrieved September 3, 2022, from https://www.headway.org.uk/about-brain-injury/individuals/types-of-brain-injury/stroke/

Initiatives, C. H. (n.d.). Stroke prevention. CHI Health. Retrieved September 3, 2022, from https://www.chihealth.com/en/services/neuro/neurological-conditions/stroke/stroke-prevention.html

Fedesoriano. (2021, January 26). Stroke prediction dataset. Kaggle. Retrieved September 3, 2022, from https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

Wolff, R. (2020, November 2). What is training data in machine learning? MonkeyLearn Blog. Retrieved September 3, 2022, from https://monkeylearn.com/blog/training-data/

Pant, A. (2019, January 22). Introduction to machine learning for beginners. Medium. Retrieved September 3, 2022, from https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08

V.Kumar, V. (2020, May 28). ML 101: Linear regression. Medium. Retrieved September 3, 2022, from https://towardsdatascience.com/ml-101-linear-regression-bea0f489cf54

Gupta, S. (2020, July 17). What makes logistic regression a classification algorithm? Medium. Retrieved September 3, 2022, from https://towardsdatascience.com/what-makes-logistic-regression-a-classification-algorithm-35018497b63f

About the author

Charisse Yeung

Charisse is currently a 12th grader at Carlmont High School in California. Her academic interests are data science, computer science, healthcare, and mathematics.

The post Data Science Analysis of Stroke Prediction appeared first on Exploratio Journal.

United States Crime Data Analysis Using Modern Applied Statistics Methodologies

Adarsh Sasikumar — Sun, 11 Sep 2022 15:24:32 +0000

Author: Adarsh Sasikumar
Mentor: Dr. Hong Pan
Sri Kumaran Children’s Home of Educational Council

Abstract

While there is a burgeoning research literature on crime trends, much of the extant research has adopted a relatively narrow approach, and efforts across studies are highly variable. In this paper, we outline a method to relate the crime rate in 1959 and 1960 to the political, social, and economic factors of that time, and ask whether crime depends on the poverty and mindset of the people of that era. The correlation between the crime rate and the factors being studied is visualized using a scatterplot matrix (SPLOM). Multiple linear regression models under the ANOVA framework are fitted to evaluate which factors, and which of their interactions, affect the crime rate significantly. From these results, we suggest what should be done to reduce the crime rate and help the development of the country. Statistically, the factors that influence the crime rate the most are police expenditure in both years, Gross Domestic Product (GDP), state population, and probability of imprisonment.

I. Introduction

A. Background

A crime is an illegal act that is punishable by the government. It involves violating the legal code prescribed by a country’s parliament and judiciary. It is an unlawful act and a grave offense against human morality. The topic covered in this paper is the data analysis of crimes in the United States. Crime data analysis is a systematic way of detecting and investigating patterns and trends in crime.

Crime in the U.S. has been recorded since the early 1600s. Crime rates have varied over time, rising sharply after 1900 and reaching a broad peak between the 1970s and early 1990s. Criminal activities range from pickpocketing to serial killing and assassination. The basic aspects of a crime are the offender, the victim, the type of crime, its severity and level, and its location. These are the basic questions asked by law enforcement when investigating any incident. This information is entered into a government record through a police arrest report, also known as an incident report.

Society has a strong misconception about crime rates because media coverage heightens fear. How America’s crime rate compares with that of other countries of similar wealth and development depends on the nature of the crime used in the comparison. Overall crime statistics are difficult to compare, as the definition and categorization of crimes vary across countries. Thus, an agency in a foreign country may include crimes in its annual reports that the United States omits, and vice versa. However, some countries, such as Canada, define violent crime in much the same way as the United States. Overall, the total crime rate is higher in the US than in developed countries across Europe.

B. Problem Statement

Criminal behaviour has traditionally been linked to the offender’s presumed motivation, which may stem from factors such as unemployment or family circumstances due to poverty. Measured crime also depends on the dark figure of crime, the gap between reported and unreported crimes, which calls the reliability of official crime statistics into question; all measures of crime have a dark figure to some degree. The gap in official statistics is largest for less serious crimes. Crime also varies from region to region and, in fact, from state to state, and it depends on the type of community living in the region as well. Crime in metropolitan statistical areas tends to be above the national average; however, wide variance exists among and within metropolitan areas.

In this report, we take a deep dive into the factors influencing the crime rate (the rate of crime in a specific category per head of population), which serves as our dependent variable. We examine how it depends on 15 independent variables: the percentage of males aged 14–24, an indicator variable for a Southern state, mean years of schooling, police expenditure in 1960, police expenditure in 1959, labour force participation rate, number of males per 1000 females, state population, number of non-whites per 1000 people, unemployment rate of urban males aged 14–24, unemployment rate of urban males aged 35–39, gross domestic product per head, income inequality, probability of imprisonment, and average time served in state prisons.

II. Methodology

A. Descriptive

In this section, we describe, explore, and confirm how economic, social, and political factors affect the crime rate, using the MASS library in R.

Table 1 : Variables used in this Model

Each row of the dataset represents a state in the USA; the variables and their units are shown in Table 1.

Table 2 : Crime Dataset

The rate of crime serves as the dependent variable, which depends on a number of independent variables such as police expenditure, mean years of schooling, unemployment rate, income inequality, and probability of imprisonment.

A SPLOM is used to show the basic structure of the dataset. It is a collection of scatter plots organized into a matrix. It shows the direction, strength, and linearity of the relationship between the dependent and independent variables, and helps in assessing the correlation between each independent variable and the dependent variable, as seen in Figure 1.

Figure 1 : SPLOM Plot with All Variables

B. Exploratory:

The data is first explored and wrangled to check for the presence of outliers, and any that are present are removed.

The potential correlation between the variables is further evaluated with 14 simple regression models, to examine which independent variables are significant factors. Based on this set of simple regression models, 4 variables are chosen that have a statistically significant correlation with the dependent variable, crime rate.

Regression is used to predict a continuous outcome based on one or more predictor variables. The regression line is the straight line through the data that minimizes the sum of the squared differences between the observed data and the fitted points.
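As a minimal illustration of the least-squares line described above, the following Python sketch fits a line to a small made-up dataset using the closed-form formulas. The study itself was carried out in R; the function names fit_line and r_squared and the data values are assumptions introduced here for illustration only.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical predictor and outcome values (illustration only).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(x, y)
print("intercept:", intercept, "slope:", slope)
print("R-squared:", r_squared(x, y, intercept, slope))
```

The slope is the covariance of x and y divided by the variance of x, which is the value that minimizes the sum of squared residuals for a single predictor.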

An aov (analysis of variance) model is fitted to check the interrelation between the 4 independent variables chosen from the simple regression models. This analysis is performed iteratively on the various combinations: after each step, the least significant term (the one with the highest p value) is removed. Multiple regression analysis is then performed on these 4 independent variables with respect to the dependent variable, crime rate.

C. Confirmatory:

ANOVA is used to decide the best multiple regression model among the given combinations of the 4 independent variables. ANOVA and stepAIC are used to assess the interdependency between the 4 variables. A likelihood ratio test is performed to compare the goodness of fit of the nested regression models.

ANOVA, which stands for Analysis of Variance, is a fundamental statistical test used to analyze the differences between the means of more than two groups.

ANOVA is used in the analysis of comparative experiments, those in which only the difference in outcomes is of interest. The statistical significance of the experiment is determined by a ratio of two variances. This ratio is independent of several possible alterations to the experimental observations: Adding a constant to all observations does not alter significance. Multiplying all observations by a constant does not alter significance.

A one-way ANOVA uses one independent variable, while a two-way ANOVA uses two independent variables.

In ANOVA, the null hypothesis is that there is no difference among group means. If any group differs significantly from the overall group mean, then the ANOVA will report a statistically significant result.

So, the statistical significance reported by ANOVA is independent of constant bias and scaling errors, as well as of the units used to express the observations. This makes it well suited to comparing our models.
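The invariance described above can be checked numerically. The following Python sketch computes the one-way ANOVA F ratio (between-group mean square divided by within-group mean square) for three made-up groups, then repeats the calculation after adding a constant and after multiplying by a constant. The function name f_statistic and the group values are assumptions for illustration, not data from the study.

```python
def f_statistic(groups):
    """One-way ANOVA F ratio: between-group mean square / within-group mean square."""
    all_values = [v for g in groups for v in g]
    n = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical observations in three groups (illustration only).
groups = [[4.0, 5.0, 6.0], [6.0, 7.0, 8.0], [9.0, 10.0, 11.0]]
shifted = [[v + 100.0 for v in g] for g in groups]   # add a constant
scaled = [[v * 2.5 for v in g] for g in groups]      # multiply by a constant

print("original:", f_statistic(groups))
print("shifted: ", f_statistic(shifted))
print("scaled:  ", f_statistic(scaled))
```

Adding a constant leaves every deviation from a mean unchanged, and multiplying by a constant scales both mean squares by the same factor, so the ratio stays the same in both cases.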

We use ANOVA when we want to test a hypothesis. From our 3 models with different combinations of terms, we choose as the ideal model the one whose independent variables have the maximum influence on the dependent variable, crime rate. Finally, the model with all 4 variables is plotted. Following this, the best model is chosen from the 3 models.

The likelihood-ratio test compares the goodness of fit of two nested regression models based on the ratio of their likelihoods, specifically one obtained by maximization over the entire parameter space and another obtained after imposing some constraint. A nested model is one that uses only a subset of the predictor variables in the overall regression model.
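To make the likelihood-ratio comparison concrete, here is a minimal Python sketch for the simplest nested pair: an intercept-only model inside a one-predictor linear model. It assumes normally distributed errors, in which case the test statistic reduces to n times the log of the ratio of residual sums of squares, and it uses the chi-square approximation with 1 degree of freedom. The function names and data are hypothetical, and this is not the R code used in the study.

```python
import math


def rss_intercept_only(y):
    """Residual sum of squares when every fitted value is the mean of y."""
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)


def rss_simple_regression(x, y):
    """Residual sum of squares of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def likelihood_ratio_test(x, y):
    """Return (statistic, p_value) comparing the nested and the larger model."""
    n = len(y)
    rss_small = rss_intercept_only(y)
    rss_large = rss_simple_regression(x, y)
    # For Gaussian errors, 2 * (logLik_large - logLik_small) = n * ln(RSS_small / RSS_large).
    statistic = max(0.0, n * math.log(rss_small / rss_large))
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical data (illustration only).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 2.9, 4.2, 4.8, 6.1, 7.2, 7.9, 9.1]

statistic, p_value = likelihood_ratio_test(x, y)
print("LR statistic:", statistic, "p value:", p_value)
```

A small p value indicates that the extra predictor improves the fit by more than chance would explain; a large p value means the two nested models fit about equally well.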

III. Results

A. Descriptive

In this section, we present the results of our regression and ANOVA models.

Figure 2 : Correlation

From Figure 2, we can see that police expenditure in 1959 and police expenditure in 1960 are highly correlated with one another. This is not ideal for linear regression, because predictor variables should not be strongly correlated with one another; hence, we combine both into a single variable.
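As a rough sketch of this check, the Python code below computes the Pearson correlation between two made-up expenditure series and then merges them into one predictor. The paper does not state how the two variables were combined, so averaging them is an assumption made here, as are the names pearson_r, pol_exp_1959, pol_exp_1960, and pol_exp and all of the values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-state police expenditure values (illustration only).
pol_exp_1960 = [60.0, 100.0, 50.0, 150.0, 110.0]
pol_exp_1959 = [57.0, 96.0, 48.0, 143.0, 104.0]

r = pearson_r(pol_exp_1960, pol_exp_1959)
print("correlation between the two years:", r)

# Assumption: combine the two collinear predictors by taking their mean.
pol_exp = [(a + b) / 2 for a, b in zip(pol_exp_1960, pol_exp_1959)]
print("combined predictor:", pol_exp)
```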

SPLOM PLOT

As we can see in Figure 3, for police expenditure, state population, and GDP plotted against crime rate, the data points lie close to an upward-sloping line, indicating a strong positive linear relationship with crime rate. Similarly, for probability of imprisonment, the line slopes downward, indicating a strong negative relationship with crime rate.

From the plot, we can see that the four independent variables most correlated with crime rate are police expenditure, GDP, probability of imprisonment, and state population.

Figure 3 : SPLOM Plot with 4 Variables

B. Exploratory:

Data Wrangling

In Figure 4, the data is plotted with and without the outliers.

In a simple linear regression, the predictor variable should have a strong linear relationship with the target variable. So, for this data, we first check for the presence of outliers and, if there are any, remove them.

If we look at the fitted line (abline) in the plot, we can see that the outliers have a high influence on our model. Therefore, it is safer to remove the outliers from the data. Removing the outliers increases the R-squared of the model by ~0.6 points, and both the p value and the intercept decrease, which shows that removing the outliers improves the model.
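The effect of a single outlier on R-squared can be reproduced on toy data. In the Python sketch below, points are dropped when their outcome lies more than 1.5 interquartile ranges outside the quartiles. The paper does not say how outliers were identified, so this rule is an assumption, as are the function names and the data values.

```python
import statistics


def r_squared(x, y):
    """R-squared of the least-squares line of y on x (squared Pearson correlation)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def drop_iqr_outliers(x, y, k=1.5):
    """Keep only points whose y value lies within k IQRs of the quartiles."""
    q1, _, q3 = statistics.quantiles(y, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    kept = [(a, b) for a, b in zip(x, y) if low <= b <= high]
    return [a for a, _ in kept], [b for _, b in kept]


# Hypothetical data: a nearly linear trend with one extreme value at the end.
x = list(range(1, 11))
y = [2.0, 4.1, 5.9, 8.2, 10.1, 11.8, 14.2, 16.0, 18.1, 60.0]

r2_before = r_squared(x, y)
x_clean, y_clean = drop_iqr_outliers(x, y)
r2_after = r_squared(x_clean, y_clean)
print("R-squared with outlier:   ", r2_before)
print("R-squared without outlier:", r2_after)
```

Because least squares penalizes squared distances, one extreme point can pull the fitted line far from the rest of the data, which is why R-squared rises once it is removed.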

Figure 4 : Data Wrangling

A simple linear regression model is estimated for each of these 14 independent variables, and the results are tabulated below.

Figure 5 : Simple Regression

From these simple regression models, we can conclude that police expenditure, GDP, probability of imprisonment, and state population have the strongest relationship with crime rate: their p values are less than 0.05, with r of 72%, 44%, 43%, and 36% respectively. So, with these 4 variables, a multiple regression with stepAIC is fitted. As the next step, ANOVA is used to find the best model.

C. Confirmatory:

Interrelation between variables using multiple regression and stepAIC

AOV is performed to check which of the 4 variables are most interrelated with one another and hence affect the crime rate.

Figure 6 : Result of Interrelation – 4 Variables

Multiple regression and stepAIC are performed for the 4 interdependent variables, as shown in Figure 7.

Figure 7 : Model I – crime_rate ~ pol_exp * gdp * prob_imp * state_pop

The terms with the highest p values (gdp:state_pop and then gdp) are removed iteratively, and multiple regression and stepAIC are performed again after each removal, as shown in Figures 8 and 9.

Figure 8 : Model II – crime_rate ~ pol_exp * gdp * prob_imp * state_pop – (gdp:state_pop)

Figure 9 : Model III – crime_rate ~ pol_exp * gdp * prob_imp * state_pop – (gdp:state_pop) – gdp

ANOVA is performed to check the best model.

Figure 10 : ANOVA Model

ANOVA
A likelihood ratio test is performed on the 3 nested models. Since the p value is greater than 0.05, as seen in Figure 11, we fail to reject the null hypothesis, which means there is no significant difference between the models, and the first model (pol_exp * gdp * prob_imp * state_pop) is taken as the best fit.

Figure 11 : Likelihood Ratio Test

D. Model Validation

Figure 12 : Model Validation – Shapiro

Figure 13 : Best Fit Model – Residuals Plots

Figure 14 : Best Fit Model – Histogram Plots

Figure 15 : Best Fit Model – Plots

So, from the Shapiro test, we can see that the p value is more than 0.05; hence we do not reject normality, which is consistent with the data being normally distributed.

IV. Discussion

From the simple linear regression, multiple regression, and ANOVA models, we can conclude that the 4 factors on which crime rate depends the most are police expenditure, GDP, state population, and probability of imprisonment.

Of these 4, crime rate depends most on police expenditure, followed by probability of imprisonment (which has a negative relationship), then GDP, and then state population. From the interrelation model, we can see how closely these 4 are interrelated and how they affect crime rate. Economic factors, especially police expenditure and GDP, affected crime rate. Population also affected the crime rate, as highly populated states experienced more crime than other states.

This was closely followed by the probability of imprisonment, which plays a key role by putting fear into people’s minds. It is surprising that crime rate is influenced more by these economic factors than by unemployment.

A change in population in turn influences GDP, and the GDP of a state determines its expenditure on police. Despite the significant findings, there were several limitations to note in the study. The first limitation was the medium sample size.

V. Conclusion

From the simple linear regression, multiple regression, and ANOVA models, we can conclude that the 4 factors on which crime rate depends the most are police expenditure, GDP, state population, and probability of imprisonment. Of these 4, crime rate depends most on police expenditure, followed by probability of imprisonment (which has a negative relationship), then GDP, and then state population. From the interrelation model, we can see how closely these 4 are interrelated and how they affect crime rate. Economic factors, especially police expenditure and GDP, affected crime rate. Population also affected the crime rate, as highly populated states experienced more crime than other states. This was closely followed by the probability of imprisonment, which plays a key role by putting fear into people’s minds. It is surprising that crime rate is influenced more by these economic factors than by unemployment.

A change in population in turn influences GDP, and the GDP of a state determines its expenditure on police.

The state must allocate funds towards police training, enforce strict law and order, and improve GDP by increasing government spending, exports, and investment. This would help to bring down the crime rate.

VI. Reference list

Venables, W. N. & Ripley, B. D., Modern Applied Statistics with S, Fourth edition (2003)
Isaac Ehrlich, Participation in Illegitimate Activities: A Theoretical and Empirical Investigation (1973)
Geoffrey R Norman and David L Streiner, BIOSTATISTICS: The Bare Essentials, Third Edition (2008)
Adarsh S, GitHub Repository, https://github.com/AdarshS20/UScrimeDataAnalysis (2022)

Acknowledgements
I would like to take this opportunity to express my profound gratitude and deep regards to my mentor Dr. Hong Pan for his exemplary guidance, monitoring and constant encouragement for this research paper. I would also like to take this opportunity to express my gratitude to Ms. Danielle Voorhies for her cordial support and guidance. Lastly, I thank my parents and friends for their encouragement and support to complete this research paper.

About the author

Adarsh Sasikumar

Adarsh is a 12th Grader at Sri Kumaran Children’s Home of Educational Council, Bangalore, India. He discovered the subject of Mathematics at the age of 6 and it was love at first sight. As he grew up, he felt like he had encapsulated himself into the number, angles, variables, and equations. He is planning to pursue an undergraduate major in Mathematics and is interested in predictive analysis.

The post United States Crime Data Analysis Using Modern Applied Statistics Methodologies appeared first on Exploratio Journal.

SARS Data Analysis

Seoyeon Cho — Mon, 30 Aug 2021 03:57:20 +0000

Author: Seoyeon Cho
February 9, 2021

1. Introduction

Since the beginning of 2020, economic and social activity around the globe has ground to a halt due to the rapid spread of a novel contagious disease caused by a coronavirus first identified in Wuhan, China. Scientists named this virus “SARS-CoV-2” and the disease it causes “Coronavirus Disease 2019” (COVID-19). SARS, also known as Severe Acute Respiratory Syndrome, is a disease that can spread easily through surfaces and saliva. It was first identified in February 2003, and has reemerged since then. The World Health Organization (WHO) declared the new outbreak a public health emergency of international concern in January 2020, and then declared it a pandemic in March 2020 as the virus spread exponentially. COVID-19 has now spread to more than 200 countries and has caused 1.9 million deaths.

2. R script

This is an R chunk with no plots:

> library(readr)
> # Read the cleaned SARS 2003 dataset, parsing the Date column as a date
> sarsdataset <- read_csv("sars_2003_complete_dataset_clean.csv",
+     col_types = cols(Date = col_date(format = "%Y-%m-%d")))
> # View(sarsdataset)
> names(sarsdataset)
[1] "Date"                         "Country"
[3] "Cumulative number of case(s)" "Number of deaths"
[5] "Number recovered"
> # Give the three count columns shorter names
> names(sarsdataset)[3] <- "CumNumber"
> names(sarsdataset)[4] <- "Deaths"
> names(sarsdataset)[5] <- "Recovered"

> # install.packages("tidyverse")
> library(tidyverse)
> # install.packages("ggplot2")
> library(ggplot2)
> str(sarsdataset)
tibble [2,538 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Date     : Date[1:2538], format: "2003-03-17" "2003-03-17" ...
 $ Country  : chr [1:2538] "Germany" "Canada" "Singapore" "Hong Kong SAR, China" ...
 $ CumNumber: num [1:2538] 1 8 20 95 2 1 40 2 8 0 ...
 $ Deaths   : num [1:2538] 0 2 0 1 0 0 1 0 2 0 ...
 $ Recovered: num [1:2538] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   Date = col_date(format = "%Y-%m-%d"),
  ..   Country = col_character(),
  ..   `Cumulative number of case(s)` = col_double(),
  ..   `Number of deaths` = col_double(),
  ..   `Number recovered` = col_double()
  .. )
> # Treat Country as a factor so that its levels list the distinct countries
> sarsdataset$Country <- as.factor(sarsdataset$Country)
> # Explore data for individual country
> list_Country <- levels(sarsdataset$Country)
> # The five countries with the most rows in the dataset
> rev(sort(table(sarsdataset$Country)))[1:5]
            Thailand            Singapore Hong Kong SAR, China
                  96                   96                   96
             Germany                China
                  96                   96
> Country0 <- list_Country[11]
> countryindexes <- c(6, 11, 27)
> # Overwritten here so that every country is listed
> countryindexes <- c(1:length(list_Country))
> list_Country[countryindexes]
 [1] "Australia"            "Belgium"              "Brazil"
 [4] "Bulgaria"             "Canada"               "China"
 [7] "Colombia"             "Finland"              "France"
[10] "Germany"              "Hong Kong SAR, China" "India"
[13] "Indonesia"            "Italy"                "Japan"
[16] "Kuwait"               "Macao SAR, China"     "Malaysia"
[19] "Mongolia"             "New Zealand"          "Philippines"
[22] "Poland"               "Republic of Ireland"  "Republic of Korea"
[25] "Romania"              "Russian Federation"   "Singapore"
[28] "Slovenia"             "South Africa"         "Spain"
[31] "Sweden"               "Switzerland"          "Taiwan, China"
[34] "Thailand"             "United Kingdom"       "United States"
[37] "Viet Nam"

With Sweave files, there is only one figure per chunk

> countryindexes <- c(6)   # index 6 is "China" in list_Country
> for (j in countryindexes) {
+   Country0 <- list_Country[j]
+   # Keep only the rows for the selected country
+   sarsdataset0 <- sarsdataset[sarsdataset$Country == Country0, ]
+   # print(sarsdataset0)
+   # ggplot(data = sarsdataset0, aes(x = Date, y = CumNumber)) + geom_line()
+
+   # Arrange the four plots in a 2 x 2 grid
+   par(mfcol = c(2, 2))
+   plot(sarsdataset0$Date, sarsdataset0$CumNumber, type = "l",
+        main = Country0, xlab = "Date", ylab = "Cumulative Number of Cases")
+   plot(sarsdataset0$Date, sarsdataset0$Deaths, type = "l",
+        main = Country0, xlab = "Date", ylab = "Deaths")
+   plot(sarsdataset0$Date, sarsdataset0$Recovered, type = "l",
+        main = Country0, xlab = "Date", ylab = "Recovered")
+   # Deaths as a fraction of the cumulative number of cases
+   plot(sarsdataset0$Date, sarsdataset0$Deaths / sarsdataset0$CumNumber, type = "l",
+        main = Country0, xlab = "Date", ylab = "Death Rate")
+ }

3. From Google Docs

Background information

The purpose of this research paper is to critically understand the SARS outbreak of 2003, with the objective of projecting the near-future course of the COVID-19 crisis. This effort compares and contrasts quantitative information available from the two crises. It has now been almost a year since the outbreak started, with significant impacts on people all over the world, but to date there have been no significant improvements or comprehensive solutions in the countries of the world. The critical, analytical understanding of the SARS outbreak presented in this paper improves the potential to predict possible outcomes of COVID-19.

4. Research Topic

4.1 Issues

Contrasting each country and analyzing the SARS impact:
* Examining the progression of the disease in each country

Interesting points that may arise:
* Death rates may be higher in lower-income countries
* The World Health Organization classifies countries by geography and by income level; perhaps there are patterns in exposure levels and death rates
* When comparing the progression of the disease by country, accounting for the different population sizes of the countries (focusing on percentage increases/changes) and possibly their residential density
* Degree of government control on quarantines and isolation of active cases

4.2 Connection between COVID-19 and SARS

Countries that experienced SARS were possibly better prepared to deal with COVID-19 (e.g., testing protocols and other .)

SARS – data, analysis (organized discussion with logical flow) (list the actual number on the data)

5. Plots from GGPlot

6. Analysis

The similarities between the countries that had the most deaths

This graph shows deaths in different countries over time during the SARS outbreak of 2003. One can easily see that the highest number of deaths was recorded by China, followed by Hong Kong, Taiwan, and Canada.

One can also notice that these four countries can be grouped into two parts, one with a relatively large number of deaths and another with fewer deaths. China and Hong Kong can be grouped as the “most death rate” group, whereas Taiwan and Canada can be grouped as the “non-significant death rate” group. This separation is easy to make because of the distinct difference in the number of deaths.

The most distinct similarity within the “most death rate” group is the exponential growth in the number of deaths from April to late May. The green and blue curves show almost identical exponential growth during that time span, then slow down and remain almost constant from June to July, at about 380 deaths for China and 300 for Hong Kong.

Beyond the information in the graph, one can also note that the members of the “most death rate” group are geographically adjacent, since Hong Kong directly borders mainland China. This may seem obvious and unimportant, but it cannot be ignored, as both SARS and COVID-19 emerged in China and spread rapidly to other countries. (Add another graph with percentage deaths (death rate) as a percent of cases to compare countries)

The similarities between the countries that did not have many cases

From Figure 1, one can also notice the similarities between the countries that did not have as many deaths, Taiwan and Canada. Compared with the “most death rate” group, the increase in deaths in these two countries was not exponential but quite steady. Canada showed a nearly linear increase in deaths over time, while Taiwan showed a small exponential growth during May, though not as pronounced as that of China or Hong Kong.

Another distinct similarity is that these two countries, Canada and Taiwan, do not directly touch China’s national border. There may be many Chinese immigrants in Taiwan and Canada, but neither shares a border line with China.

7. Worst Experienced Country

In order to pin down which country experienced the worst outbreak, one has to define “worst experience” first. This is important because one person could think the country with the most deaths had the worst experience, whereas another could argue that the worst experience should be based on the number of deaths as a percentage of the national population. In this report, the “worst experience” will be based on the total number of SARS cases. This is because the absolute number of people who suffered from the disease matters more than the death rate or the number of deaths. According to Figure 2, it is clear that China had the highest cumulative number of SARS cases among the 4 countries, with an overwhelming 4800 cases, almost double Hong Kong’s.

8. Worst Impacted Country Among Top Rated Countries

Just as “worst experienced” was defined in the previous section, “impact” will be defined as how much of a shock SARS brought to a country. In other words, “impact” will be measured as the number of deaths as a percentage of the nation’s population.

Populations: Taiwan: 24 million; Canada: 38 million; Hong Kong: 7.5 million; China: 1.4 billion

Even without the detailed calculation, one can see that deaths per national population is highest for Hong Kong, as it has the smallest population but almost the same number of deaths as China according to Figure 1. Therefore, one can argue that Hong Kong suffered the worst impact among the four given countries.
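The comparison can be made explicit with a short calculation. The Python sketch below uses the population figures listed above and the approximate death counts read from the graph in the Analysis section (about 380 for China and 300 for Hong Kong). Taiwan and Canada are left out because their death counts are not stated in the text; the variable names are chosen here for illustration.

```python
# Populations as listed in this section.
population = {"China": 1.4e9, "Hong Kong": 7.5e6}

# Approximate death counts read from the graph in the Analysis section.
deaths = {"China": 380, "Hong Kong": 300}

# Deaths per million inhabitants for each country.
deaths_per_million = {
    country: deaths[country] / population[country] * 1e6 for country in deaths
}

for country, value in deaths_per_million.items():
    print(f"{country}: {value:.2f} deaths per million people")
```

On these figures, Hong Kong's deaths per million are roughly two orders of magnitude higher than China's, which supports the argument above.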

9. Seriousness of SARS Crisis

Unlike other diseases such as cancer or heart disease, the problem with SARS was that it was so contagious. This is important because a person with cancer or heart disease can continue their normal life after learning of the diagnosis. With SARS, however, even people who do not carry the disease cannot function normally, for fear that the unseen virus might be traveling as an aerosol in the air and infect them. This fear cannot be explained or quantified solely with graphs and numbers, which is also why the numbers and figures cannot fully convey the seriousness of SARS. SARS not only killed people but also damaged the economies of these four countries and of many other countries closely tied to China. The death of many people is a deep pain that cannot be forgotten, but the economic damage that SARS caused was also very significant.

From Figure 3, one can distinguish the differences and similarities between the four countries.

First of all, Canada showed a distinctive pattern that the other 3 countries never showed. During late February, the deaths/cumnumber ratio for Canada spiked all the way up to 0.3, but it soon fell to 0.1, then slowly increased and remained constant at about 0.15. This kind of sudden spike, decrease, and leveling off of the rate was not seen in any other country. The other two countries, Hong Kong and Taiwan, never showed a sudden spike like Canada’s but rather slower exponential increases. China, however, showed a very different curve as well: it was the only country with a linear increase among the 4 countries in Figure 3.
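The deaths/cumnumber ratio is simply the number of deaths divided by the cumulative number of cases on each report date. The Python sketch below shows the calculation on a made-up series and guards against dates on which no cases had yet been reported. The function name death_ratio and all values are hypothetical and are not taken from the SARS dataset.

```python
def death_ratio(deaths, cum_cases):
    """Deaths divided by cumulative cases per date; None where no cases exist yet."""
    return [d / c if c > 0 else None for d, c in zip(deaths, cum_cases)]


# Hypothetical cumulative counts on five consecutive report dates.
cum_cases = [0, 8, 20, 40, 100]
deaths = [0, 2, 3, 4, 15]

for cases, ratio in zip(cum_cases, death_ratio(deaths, cum_cases)):
    print(cases, "cases -> ratio", ratio)
```

When the cumulative case count is still small, one or two deaths move the ratio a great deal, which is one possible reason for an early spike of the kind described for Canada.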

Why did the number of deaths stop increasing dramatically in June?

From Figure 1, one can see that the exponential growth in deaths for China and Hong Kong continued until early June, after which growth slowed significantly. Many factors could explain this dramatic easing of SARS, such as the change of weather and the countless efforts of scientists, researchers, and doctors to stop the disease. As the growth in deaths slowed in all four countries as time went by, one could assume that the change of weather (from winter to summer) helped to stop the contagious disease. Another reason for the sudden easing of the exponential growth would be that all four governments, with the help of the WHO, were well aware of the situation and managed to stop the spread of the virus.

There could be multiple reasons why the dramatic increase stopped in June. One is the change of weather from winter to summer and the increased temperature. Another is the cumulative effort of people to stop the virus, through research for cures and governments’ determined efforts to keep people from going out.

The country that had the most severe crisis and what happened to it

Among these four countries, Hong Kong had the most severe crisis, with the highest percentage of both deaths and cumulative cases relative to its population. The death rate among SARS cases in Hong Kong was usually around 20%, but it rose to nearly 55% among the elderly.

Because the death rate of other common respiratory diseases was about 1% at the time, SARS did not seem to be a significant issue at the beginning of the outbreak. However, the death rate was particularly high in Hong Kong, which was the most vulnerable of the four countries in the data because it had the highest proportion of elderly people. This significantly changed the outcome of SARS in 2003. Furthermore, Hong Kong was not well prepared for this kind of outbreak: it has one of the highest population densities in the world, which allowed the SARS outbreak to damage the country and its economy exponentially.

Due to the SARS outbreak in 2003, Hong Kong’s tourism industry faced its worst period so far. With the outbreak ongoing in Hong Kong, no one in other countries would have voluntarily wanted to visit and explore Hong Kong. According to Paul Chan Mo Po, finance minister of Hong Kong in 2003, the number of foreign visitors decreased to 40% compared to that of 2002. This kind of economic impact hurt not only the tourism industry but also the local and international economy of Hong Kong. The damage was more severe for Hong Kong because it is so dense compared to others like Canada or China, in terms of both landscape and population density. It completely froze the economy and the country, and it took years of effort for Hong Kong and its people to recover from SARS.

10. Prediction

Comparing the conclusion of SARS with the current COVID pandemic

SARS was brought under control within only 6 months through international effort in 2003, but the ongoing COVID-19 pandemic has already lasted more than 12 months and does not seem likely to end in the near future. This could be explained by the fact that in 2003 the world was not as connected as it is now. For instance, traveling was treated as a huge privilege in 2003 and was not very common. However, as the world globalized exponentially, traveling stopped being a privilege for many people and had become common by 2021. One can believe that the main reason SARS ended so much faster than COVID, in only 6 months, is that the world was not as globalized in 2003 as it is in 2021.

SARS ended in only 6 months and caused only 774 deaths, but COVID has already killed more than 1.9 million people all over the world. It is very notable that SARS spread to only a few countries, whereas COVID has spread to more than 200 countries. The exponential globalization of the world may be the reason why COVID is not concluding the way SARS concluded.

11. Comparison Between SARS and COVID 19

SARS and COVID-19 are both caused by coronaviruses (SARS-CoV and SARS-CoV-2, respectively), which are very contagious and cause respiratory disease. The two are also similar in that they are much more dangerous for older people and for those with existing conditions such as obesity, cancer, or respiratory disease.

They are also similar in that both outbreaks originated in China and spread to nearby countries. Because Hong Kong is right next to China and easy to travel to, SARS was able to spread there rapidly, just as COVID-19 reached Hong Kong, one of the earliest places affected, within the first several months of the outbreak.

12. Conclusion

In conclusion, SARS and COVID-19 share very similar characteristics: both are caused by a coronavirus, are very contagious, and are dramatically more dangerous to older people and people with existing diseases. In 2003, the efforts of scientists, researchers, doctors, and governments successfully stopped SARS, helped by the fact that the world was less globalized than it is now and the virus spread to only a few countries. COVID-19, however, has now spread to more than 200 countries and caused almost 2 million deaths. From close observation of the SARS case, one can conclude that COVID-19 has passed a singularity point, but that time and the efforts of researchers and governments will stop it in the foreseeable future.

About the author

Seoyeon Cho

The post SARS Data Analysis appeared first on Exploratio Journal.
