data science Archives - Exploratio Journal

The Theory and Implementation of Common Machine Learning Algorithms

Amanbir Behniwal — Mon, 02 May 2022 14:53:58 +0000

Author: Amanbir Behniwal
Mentor: Dr. Gino Del Ferraro
Vincent Massey Secondary School

1. Introduction

Machine learning jobs are growing to become some of the most in-demand jobs in the world. The idea of machine learning first started to grow in the 1940s; it was envisioned as something that would emulate human thinking and learning. Machine learning has since grown to become a big part of our daily lives. For example, speech recognition software maps the different tones and nuances in a person's speech and tries to match them to words or to a specific speaker. Another example is a translator, which tries to understand the accents of people speaking a language and then translates what they say into another language. Many applications that we use today, such as Alexa, Siri, and Google Translate, use these machine learning algorithms. Furthermore, we are trying to integrate machine learning into our vehicles. Cars like the Tesla use machine learning algorithms to self-drive in traffic and detect danger. The future holds many possibilities due to machine learning.

In theory, we input great amounts of data into machine learning programs, which use statistics to categorize or predict outcomes by finding and applying patterns in the data. We can further categorize the different types of algorithms used in machine learning into supervised learning, unsupervised learning, and reinforcement learning. Supervised learning consists of regression and classification, while unsupervised learning consists of clustering and association.

In this report, we will first discuss important terminology needed to understand the contents of the report. We will then discuss the theory behind some of the machine learning algorithms. The algorithms implemented in this report are all regression algorithms; however, we will also discuss the theory behind other algorithms. Finally, we will see how to implement the code. GitHub links to the actual code are provided.

2. Terminology

Before we can get started with all the theory, we must develop an understanding of some key terminology that we will use quite often when working with machine learning programs. These are some basic terms that we should be familiar with:

2.1 Features

When we are trying to extrapolate from data using a linear model such as a line of best fit, we want the line to have an equation that best fits the data. In general, a line has an equation of h = θ₀ + θ₁x₁ + θ₂x₂ + ··· + θ_nx_n. Here we consider x₁, x₂, ..., x_{n-1}, x_n the features. We will go more in depth about this later on in the report.

2.2 Inputs

When we run a Python program, we must somehow store the data so that our program knows what we want it to work with. We then take ‘input’ of the data in a form that is convenient for us to work with. For example, let’s say we had a document that contained a few coordinates. We may want our program to take input of this data so that the x-coordinates and y-coordinates are stored separately. The process the program carries out to do this is called ‘taking input’. This process is explained in greater detail in the code.
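As a minimal sketch of taking input, assume the coordinates arrive as lines of text in the form `x,y`. The sample text and the function name `read_coordinates` are illustrative assumptions and are not taken from the linked notebooks.

```python
def read_coordinates(text):
    """Parse lines of the form 'x,y' into separate x and y lists."""
    xs, ys = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x_str, y_str = line.split(",")
        xs.append(float(x_str))
        ys.append(float(y_str))
    return xs, ys


# Hypothetical sample data standing in for the contents of a document.
sample = """1,2
2,4.5
3,6"""

xs, ys = read_coordinates(sample)
print(xs)  # the x-coordinates, stored separately
print(ys)  # the y-coordinates, stored separately
```

In a real program the text would usually come from a file rather than a string, but the separation into two lists works the same way.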

2.3 Outputs

After our code has calculated what we wanted it to, we want to see this information in an organized manner so that we can study it. We then make our program ’output’ this information. Outputs can consist of words, integers, etc.

2.4 Predicted Values

Let us say that we received input of many coordinates and we wanted our program to calculate the line of best fit. When we are testing different equations to see if they best fit the data, we input the same x-coordinates as the ones in our input data. However, our y-coordinates may not always be the exact same as that of the input data. We thus call our y-coordinates predicted values, since they are what our program predicted the coordinate lies at based on the equation that we came up with.

2.5 Expected Values

The values that we get from the inputted data are our expected values since they are the original values that we are comparing the predicted values to.
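The difference between predicted and expected values can be sketched as follows. The data points and the candidate line y = 0 + 2x are made up for illustration, and `predict` is a hypothetical helper name.

```python
def predict(theta0, theta1, xs):
    """Predicted y-value for every x under the line y = theta0 + theta1 * x."""
    return [theta0 + theta1 * x for x in xs]


# Hypothetical input data: these y-values are the expected values.
xs = [1.0, 2.0, 3.0]
expected = [2.0, 4.5, 6.0]

# A candidate line y = 0 + 2x that the program is currently testing.
predicted = predict(0.0, 2.0, xs)

for x, y_hat, y in zip(xs, predicted, expected):
    print(f"x={x}: predicted={y_hat}, expected={y}")
```

The x-coordinates are shared, while the predicted and expected y-values differ wherever the candidate line misses a data point.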

3. Supervised Learning

Supervised learning is the most commonly used type of machine learning and it is also the simplest to implement. When using supervised learning, we must train the algorithm by pairing labelled inputs with outputs. The program in this stage is trained to look for patterns that correlate the input to the output. When we have provided the algorithm with a good number of example pairings, the algorithm will be able to apply what it has learned to new inputs it receives. We can further split supervised learning into classification and regression.

3.1 Classification

Classification is a type of supervised learning. In classification, our output will always be a category that the algorithm has mapped the input to. An example of this would be our program receiving input of pictures of animals and then outputting what animal they are (their category). We first have to train the program by inputting many pictures of dogs and cats, labelled with their respective categories, so that the program will be able to establish patterns that distinguish the images of the dogs from the images of the cats. After we have inputted a sufficient number of images, the program will become accurate at determining if an animal is a cat or dog when it receives an input that it has not seen before.

3.2 Regression

Regression is another type of supervised learning. In regression, our output is not a category but rather a value such as money or age. We can take for example the price of houses and the total square footage of the house. Using regression, we identify the function that best fits between these values where we have reduced the amount of error as much as we can. We can then use the equation of this line to predict how much a house with a certain square footage will cost.

Figure 1: https://medium.com/machine-learning-in-practice/a-gentle-introduction-to-machine-learning-concepts-cfe710910eb

3.2.1 Linear Regression

When performing linear regression, the program will take input of data and plot it on a graph. It will then find a line of best fit and be able to make predictions based on this line of best fit. For example, we can graph the number of hours a student watches TV rather than studying compared to their test scores.

Figure 2: onlinemath4all.com/scatter-plots-and-trend-lines.html

As we can see, the graph looks fairly linear and it only has one feature: the amount of time spent watching TV rather than studying. This makes it a perfect candidate for linear regression. We want our program to come up with an approximate equation with which we can estimate a student’s test score based on how long they spent watching TV instead of studying. Really, we are looking for our program to find the line of best fit, since this line would be best for extrapolating the data and providing as accurate an estimate as possible of a test score based on the number of hours that were spent watching TV. Our program would then test many different lines until it reaches one line that fits the data better than any other line.

Figure 3: onlinemath4all.com/scatter-plots-and-trend-lines.html

As we can deduce, when calculating the equation of the line of best fit, our slope and y-intercept variables matter a lot. In fact, we are just making changes to these variables to try to find the line of best fit. Machine learning algorithms rely on these parameters (y-intercept/bias, slope, etc.) to run. When we want to find the best model for our data, we need to keep adjusting these parameters so that the direction of our line better fits the data and our predicted values are closer to the expected values. We must then introduce a function that tells us how to change these parameters by determining the amount of error that we are getting with the current parameters. This function is called the cost function.

4. Cost Function

The cost function essentially helps our program minimize the error it produces compared to the actual data set. When we are doing linear regression, it is very rare that we will get a data set where the data fits precisely on a line. Therefore, when we are computing the line of best fit, we want to find a line such that it has the least possible difference (error) between the actual coordinates and the coordinates our line gives (predicted values). There are multiple ways of defining the cost function; some examples are explained in the following sections.

4.1 Mean Absolute Error

When we take the mean absolute error, we are taking the absolute value of the difference between the predicted y-value and the expected y-value. The reasoning for this is that, since we are adding up all the error for each data point, we want to keep track of how much error we are accumulating.

Figure 4: https://gist.github.com/FisherKK/86f400f6d88facbf5375286db7029ca2

In this graph, the blue points are the original points of the data set, while the orange points are the ‘predicted’ points that our program is currently testing for the line of best fit. As we can see, each d_i represents the amount of ‘error’ our model/line produces for each point in the data set.

However, if we add negative numbers (our predicted point is below the original point), our program actually thinks it’s producing less error. To deal with this we take the absolute value, which is always non-negative, so that our program does not add negative error. Then our program can plug this into the formula, which is defined as

MAE = (1/m) · Σ_{i=1}^{m} |ŷ^(i) − y^(i)|

where m is the number of training examples, ŷ^(i) is the predicted value, y^(i) is the expected value and i is the index of the data point, since we want to sum the error of all the data points.
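A short sketch of the mean absolute error in Python shows why the absolute value matters. The two data points are invented for illustration: one prediction is too low and one is too high, so the signed errors cancel while the absolute errors do not.

```python
def mean_absolute_error(predicted, expected):
    """Average of the absolute differences between predicted and expected values."""
    m = len(expected)
    return sum(abs(y_hat - y) for y_hat, y in zip(predicted, expected)) / m


# Hypothetical values: the first prediction is 1 too low, the second 1 too high.
predicted = [3.0, 5.0]
expected = [4.0, 4.0]

# Without the absolute value the two errors cancel out.
signed_total = sum(y_hat - y for y_hat, y in zip(predicted, expected))
mae = mean_absolute_error(predicted, expected)

print(signed_total)  # signed errors sum to zero
print(mae)           # the absolute errors average to one
```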

4.2 Mean Squared Error

When we take the mean squared error, instead of taking the absolute value of the difference between the predicted and expected value, we take its square. In this way, we still don’t add up negative error since any real number squared is non-negative. The equation is defined as:

MSE = (1/m) · Σ_{i=1}^{m} (ŷ^(i) − y^(i))²

When using mean absolute error, we took the absolute value of the distance between the predicted value and the expected value. We are now taking the area of the square whose side length is the distance between the predicted value and the expected value. All these areas are summed and averaged.
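The same idea can be sketched for the mean squared error. The values below are made up; they are chosen so that a single large miss dominates the result, which is the practical difference from the mean absolute error.

```python
def mean_squared_error(predicted, expected):
    """Average of the squared differences between predicted and expected values."""
    m = len(expected)
    return sum((y_hat - y) ** 2 for y_hat, y in zip(predicted, expected)) / m


# Hypothetical values: two perfect predictions and one that is off by 3.
predicted = [2.0, 4.0, 9.0]
expected = [2.0, 4.0, 6.0]

mse = mean_squared_error(predicted, expected)
print(mse)  # (0 + 0 + 9) / 3
```

For the same data the mean absolute error would be 1, so squaring penalizes the single large error three times as heavily here.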

Now that we have discussed how our program will calculate the error that our model/line is producing, we must find a way to minimize the value our cost function is returning. The gradient descent algorithm is one of the most effective ways of doing so.

Figure 5: https://gist.github.com/FisherKK/86f400f6d88facbf5375286db7029ca2

5. Gradient Descent

For linear regression models, we assume that our data has a linear dependence and therefore can be modelled by using a linear equation as follows:

h_θ(x) = θ^T x = θ₀ + θ₁x,

where θ₀ is our bias (y-intercept) and θ₁ is our slope. Then, we want to change our parameters θ₀ and θ₁ in such a way that our line better fits the data and the cost function produces less error. In batch gradient descent, we update our theta values repeatedly with the following equation:

θ_j := θ_j − α · (1/m) · Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) · x_j^(i)

Here, θ_j is the value that we are updating and x_j^(i) is the j-th feature of the i-th data point, with x_0^(i) = 1 so that the same rule covers the bias θ₀. Again, m is the size of the data (how many points there are). The sum is the gradient of the mean squared error with respect to θ_j, with the constant factor of 2 absorbed into alpha. Alpha here represents the learning rate of our algorithm. If alpha is too big, our program takes larger steps and may be a lot faster, but it can overshoot the minimum and will not be nearly as accurate in determining the equation of a line of best fit as a smaller value of alpha may be. However, when we use too small a value for alpha, our program will be incredibly slow. It is best to find a good balance between these two extremes.
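Batch gradient descent for a single feature can be sketched in plain Python as below. This is an illustrative sketch and not the code from the linked notebooks: the data set, the learning rate of 0.05 and the iteration count are assumptions chosen so that the example converges.

```python
def gradient_descent(xs, ys, alpha=0.05, iterations=5000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # Error of the current line on every data point.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Average gradient for the bias (x_0 = 1) and for the slope.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters at the same time.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # should approach 1 and 2
```

Both gradients are computed from the same errors before either parameter changes, which is what makes this a batch update over the whole data set.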

6. Multi-Linear Regression

Now that we have discussed how to optimize our program so that it can calculate the line of best fit with equation h = θ₀ + θ₁x₁, we consider what we would do when we have multiple features. So far we have only been working with one feature, which, in the example presented, was the number of hours spent watching TV rather than studying. Let’s take another example: the price of a house. When determining the price of a house, we must consider its area, how many rooms it has, and how old it is, among other things. In this instance our data, when plotted, still looks linear. However, we cannot use the exact same technique as linear regression, since we have more than one feature. We use multi-linear regression in this situation because it is suited to dealing with more than one feature.

Multi-linear regression can be used with as many features as we’d like. Our equation is now

h = θ₀ + θ₁·x₁ + θ₂·x₂ + ··· + θ_n·x_n,

where all x_i represent the different features. When we now implement gradient descent, we must use it to update all θ_i so that our model better fits the data. The cost function can be implemented in much the same way.

The interesting thing to note about multi-linear regression is that we need an n-dimensional graph to plot all the points. If we take a 3-D graph as an example (two features and one output), our program is essentially finding the plane that best fits all the points.
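The extension of gradient descent to several features can be sketched as follows. Again this is an illustration under stated assumptions, not the notebook code: the five data points are made up to lie exactly on h = 1 + 2·x₁ + 3·x₂, and the learning rate and iteration count are chosen so that the sketch converges.

```python
def predict(theta, features):
    """h = theta[0] + theta[1]*x1 + ... + theta[n]*xn for one example."""
    return theta[0] + sum(t * x for t, x in zip(theta[1:], features))


def multi_gradient_descent(X, ys, alpha=0.05, iterations=5000):
    """Batch gradient descent that updates every theta_j together."""
    m, n = len(X), len(X[0])
    theta = [0.0] * (n + 1)
    for _ in range(iterations):
        errors = [predict(theta, row) - y for row, y in zip(X, ys)]
        # Gradient for the bias, then one gradient per feature.
        grads = [sum(errors) / m]
        for j in range(n):
            grads.append(sum(e * row[j] for e, row in zip(errors, X)) / m)
        theta = [t - alpha * g for t, g in zip(theta, grads)]
    return theta


# Hypothetical data generated from h = 1 + 2*x1 + 3*x2.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [0.0, 1.0]]
ys = [9.0, 8.0, 19.0, 18.0, 4.0]

theta = multi_gradient_descent(X, ys)
print(theta)  # should approach [1, 2, 3]
```

With real data the features often have very different scales, in which case they are usually rescaled first so that one learning rate suits all of them.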

Figure 6: https://aegis4048.github.io/mutiple_linear_regression_and_visualization_in_python

7. Unsupervised Learning

Unlike supervised learning, in unsupervised learning, we do not train the program with inputs and corresponding outputs. Rather, the program uses its built-in algorithms to try to find patterns in the unlabelled data and produce an output. For example, if we give input of shapes with different sizes, the algorithm can separate these based on how many sides there are in each shape. In general, unsupervised learning does not require labelled data, which makes suitable data sets easier to gather than for supervised learning. We can further split unsupervised learning into clustering and association.

7.1 Clustering

As discussed earlier, in unsupervised learning, we input unlabelled data into our program. Graphing our data, it may look like the following:

Figure 7: https://www.analyticsvidhya.com/blog/2021/04/k-means-clustering-simplified-in-python/

Once our program has graphed the data, we want our program to try to find patterns in the data. Specifically, clustering algorithms will try to look for clusters of points that seem to be together. The graph could then be divided into the following clusters:

Figure 8: https://www.analyticsvidhya.com/blog/2021/04/k-means-clustering-simplified-in-python/

Among the many applications of clustering, we can use the example of social networks. We may want to find which people seem to be very close friends on their social networks so our algorithm would make clusters of people that appear to be close friends.

Figure 9: https://www.mghassany.com/MLcourse/introduction.html

A more common example in our daily lives would be our spam filter. Our email uses clustering algorithms to try to group spam emails, update emails, advertisement emails, etc. together.

Furthermore, we can classify clustering as hard clustering or soft clustering. In hard clustering, each data point either belongs to a given cluster or does not, so it ends up in exactly one cluster. This type of clustering is useful in binary situations such as whether a movie is good or not. In contrast, when using soft clustering, a data point can belong to several clusters at once. This is more useful when we want to determine, for example, which books are similar.
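The text does not name a specific clustering algorithm, so as an illustrative choice here is a sketch of k-means, a common hard-clustering method. The points and the starting centroids are invented, and the starting centroids are fixed by hand so that the result is reproducible.

```python
def k_means(points, centroids, iterations=10):
    """Hard clustering: each point is assigned to exactly one centroid."""
    for _ in range(iterations):
        # Assignment step: every point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical unlabelled data with two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(0, 0), (10, 10)])

for centre, members in zip(centroids, clusters):
    print(centre, members)
```

A soft-clustering method would instead return, for each point, a degree of membership in every cluster rather than a single assignment.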

7.2 Association

Association algorithms try to see if two items depend on each other. For example, take a customer at a supermarket. If this customer has gone to buy bread, then it is very probable that the customer is also looking to buy butter or milk. In this way, we can associate different items based on their dependency on each other. Many companies use this technique to place associated items away from each other in a store so that the customer sees many other items on the way and may consider buying additional things. An example of the different associations in a store is given below:
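One standard way to quantify such a dependency, not named in the text above, is with the support and confidence of a rule such as "bread implies butter". The shopping baskets below are invented for illustration, and the function names are hypothetical.

```python
# Hypothetical shopping baskets, one set of items per customer.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


bread_support = support({"bread"}, baskets)
rule_confidence = confidence({"bread"}, {"butter"}, baskets)

print(bread_support)    # 4 of the 5 baskets contain bread
print(rule_confidence)  # 3 of those 4 baskets also contain butter
```

A high confidence for a rule suggests the two items are associated, which is the kind of relationship the store layout example relies on.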

Figure 10: https://annalyzin.files.wordpress.com/2016/04/association-rules-network-graph2.png

8. Reinforcement Learning

In reinforcement learning, the program learns what to do by trial and error in its current environment. We can think of it as the program receiving a reward if it does something correct and a penalty if it does something incorrect. Take the analogy of a child. When a child is young, they do not know what is good or bad. The only way the child can learn is by trying new things. The child may touch something electric, get a shock, then instinctively not go near the thing again. The child now knows that that object is something that shouldn’t be touched because it will hurt. A reinforcement learning program works in a similar way. The difference here is that the machine can try thousands of operations in one second, and even though it may start by making very bad decisions, it will learn over time and will become a lot more sophisticated in its decisions. We can simulate giving a program a reward or penalty by giving it a score: if it does something incorrect, the score will decrease, and conversely, if it does something correct, the score will increase. This type of program is based entirely on trial and error on the program’s part. It is also one of the closest things to a machine’s own creativity.

One of the most useful applications of reinforcement learning is in simulations. For example, the program can be used to help create the optimal rocket engine for a rocket launch. If we put our program in a rocket launch environment in which the environment responds to the actions of our program, we can ‘reward’ the program if its actions are helping the rocket launch or ‘punish’ the program if they are not.
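The reward and penalty idea can be sketched with a very small trial-and-error learner. As an illustrative choice that the text does not name, the sketch uses an epsilon-greedy strategy on a two-action problem: the program mostly repeats the action with the best average score so far, but sometimes tries a random one. The reward probabilities, step count and seed are assumptions.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn by trial and error which action earns the best average score."""
    rng = random.Random(seed)
    n = len(reward_probs)
    counts = [0] * n     # how often each action has been tried
    values = [0.0] * n   # running average score of each action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n)           # explore: try something at random
        else:
            action = values.index(max(values))  # exploit: best action so far
        # The environment gives a reward (+1) or a penalty (-1).
        reward = 1.0 if rng.random() < reward_probs[action] else -1.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return counts, values


# Hypothetical environment: action 1 is rewarded far more often than action 0.
counts, values = run_bandit([0.2, 0.8])
print(counts)
print(values)
```

Over many steps the program should end up choosing the better action far more often, even though it starts with no knowledge of either one.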

Figure 11: https://riptutorial.com/machine-learning/example/32668/reinforcement-learning

9. Linear Regression Implementation

For the linear regression code, we took input of the population of a city in 10,000s and its profit in $10,000s. We then plotted all of the coordinates and got the resulting graph:

As we can see, the graph looks fairly linear; thus we can use linear regression on this.

The full code can be found at: https://github.com/ABehniwal/face-recognition/blob/main/Numpy-Linear-Regression.ipynb

10. Multi-Linear Regression Implementation

For the multi-linear regression code, we took input of the different features of a car (Engine Size, Cylinders, Fuel Consumption (City), Fuel Consumption (Comb)) and the resulting CO2 emission. We then plotted all of these features of the car separately with the CO2 Emissions to get a visual of how the different graphs look. This resulted in the following graphs.

10.1 Engine Size Graph

10.2 Cylinders Graph

10.3 Fuel Consumption (City) Graph

10.4 Fuel Consumption (Comb) Graph

Again, we see that all the graphs look fairly linear. However, since we have multiple features of the car that we must take into account, we use multi-linear regression. The full code can be found at: https://github.com/ABehniwal/face-recognition/blob/main/Multi-Linear-Regression.ipynb

About the author

Amanbir Behniwal

Amanbir is currently an 11th grader at the Vincent Massey Secondary School in Ontario, Canada. He enjoys challenging himself with difficult math and computer science problems by participating in various contests. Amanbir is an avid fan of Barcelona and has been playing soccer for many years. Amongst other things, he likes to read books, help others with problem-solving, and delve deeper into the field of computer science.

The post The Theory and Implementation of Common Machine Learning Algorithms appeared first on Exploratio Journal.

Data Transmission Via Social Network Sites

Geonwoo Kim — Sun, 24 Apr 2022 14:22:33 +0000

Author: Geonwoo Kim
Mentor: Dr. Charalampos Tsourakakis
Crean Lutheran High School

Introduction

The advancement of the internet gave way to a new form of online interaction supported by social network sites (SNS). Over the past years, social network sites have witnessed tremendous growth in numbers, with over one billion users on Facebook and hundreds of millions on Pinterest, Twitter, and Google+. Sharing information on these social network sites has changed how people communicate. Through the sites, an individual can create a public profile, connect with other users, and share messages, videos, and images. This has led to unpredictable and emerging patterns of sharing, propagation, interaction, and content creation amongst users. One of the most important and active research areas has been understanding information diffusion on social network sites. Information diffusion refers to how information spreads among interconnected entities, or nodes, in a network. Studying information diffusion is linked to beneficial outcomes such as determining the various factors that affect the whole process. In addition, studying data-to-data transmission across social network sites can help in various sectors such as marketing. The data transmitted on social network sites in most cases involves the source content (the images, video, or text posted) and includes geolocation, posting time, and other meta information.

Background

Densest Subgraphs

According to Dasgupta and Gupta (1), social networks have been created by the billions of users who carry out various activities, including creating their profiles, linking, following, posting content, commenting, and other online interactions. In most cases, evolved graphs have been used in modeling social media, whereby nodes represent the entities, content, themes, and other metadata. A typical social media graph will in most cases have both edge and node properties. Dasgupta and Gupta (2) note that, for years, there has been increased research on determining the various subgraphs that allow analysts to determine the connection between two nodes. The sparse nature of social media graphs supports the emergence of various edges between the two nodes (Faloutsos, McCurley, and Tomkins 1). Unlike in other graph analyses, when analyzing social media subgraphs, representing the relationship between two nodes using a single path is limiting. It is thus essential to ensure that the connection between subgraphs is determined in the fastest way possible, since it will help in identifying the few most likely transmission paths of a disease, joke, information leak, or rumor from one user to the other (Faloutsos, McCurley, and Tomkins 1). More importantly, this will make it easier to ascertain unexpected affiliations between individuals or other members. Using graphs will help summarize the connection between two SNS users, thus providing the fastest means of determining how the data is transmitted between the users.

Given the importance of graphs in studying connections between SNS users, graph analysis allows one to map out all the edges and identify the social position of every user. Yazdi et al. (141) note that the best strategy for analyzing data-to-data transmission across social network sites is using graph theory. The diffusion patterns of information across SNS and its distribution have become a key study area. However, Yazdi et al. (142) argue that one of the most challenging problems has been finding the best and fastest strategy to determine and study data-to-data transmission across SNS, and thus to predict the diffusion paths based on actual data. Such prediction has many applications in critical areas such as gossip and news, blog postings, virus source detection, and e-commerce, among others. The popularity of a given news item plays a vital role in determining the nodes that will be influenced in the future, which helps to ensure that the nodes influenced in the past are used in outlining the transmission of the news. In this case, the future nodes will be predicted as a function of time (Yazdi et al. 142). The Louvain community detection algorithm was a widely used strategy in data-to-data transmission research. However, the inability to control the centers of clusters and their number made it ineffective. The importance of the centers of clusters is that they help in information propagation.

However, the densest subgraph algorithm can overcome the inefficiencies of the Louvain community detection algorithm since it supports a more efficient provision of cluster centers. Epasto, Lattanzi, and Sozio (1) note that various data analysis tasks, such as distance query indexing, event detection, community detection, and computational biology, among others, have been improved with the emergence of algorithms for finding densest subgraphs. The various users across SNS have been compared to actual communities, given that most share similar interests or have an affiliation to the same company, university, or other organization. In most cases, the emergence of certain words affiliated to places, cities, company names, or even persons in tweets and posts can indicate that a given event is about to take place. Densest subgraph algorithms have been used to study data-to-data transmission across SNS. This allows analysts to determine a compact representation of node distances in a graph. This, in turn, allows them to compute the distance between two nodes over time and determine the data transmission rates.

In most cases, new people will always join SNS while others leave, new friendships will be formed, and others will end. Moreover, new tweets and postings on SNS such as Facebook and Twitter will mean the older tweets have become less interesting. The result is that the SNS users’ communities will evolve with time, leading to the emergence of new events that trigger the formation of new densest subgraphs. The node distances will continually change in the long run, thus calling for frequent re-indexing. This can thus significantly hinder research into data-to-data transmission across SNS hence requiring algorithms that can keep up with the ever-evolving users and large and highly dynamic data input streams.

Moreover, graphs have been used to identify various concepts, not only in social media but also in biological and financial networks. However, given that the common problem is to find the most significant number of connections between nodes, there is a need to determine the best solution. Given that most social media networks are based on the formation of communities, this leads to a need for a mathematical task that will help detect data-to-data transmissions between various users, known as the densest subgraph problem. In most cases, the density of a k-node subgraph equals its number of edges divided by the maximum possible number of edges, k(k−1)/2. This indicates that by finding the density of graphs, one can determine the data-to-data transmission between various communities and even narrow it down to the respective metadata such as location and time. Tsourakakis (1) argues that various data mining techniques have been employed to determine the data-to-data transmissions across SNS. Most subgraph techniques have tried to ascertain which subgraphs are near-cliques, resulting in the emergence of the NP-hard problem associated with the densest subgraph. However, there has been much research aimed at coming up with solutions to the NP-hard densest subgraph problem, and it has proven to be solvable in practice, hence making the algorithm more effective than previous graph mining applications (Tsourakakis 1).

Various graph density concepts are used in determining the densest subgraph. One of the concepts is edge density. In this case, one determines the density measure by dividing the number of edges by the number of nodes. Another fundamental concept is the k-core, which allows one to ascertain the subgraph with the largest minimum degree instead of the largest average degree. The k-core concept was introduced in 1970 by Lick and White and was later analyzed in many other papers (Faragó and Mojaveri 4). The k-core has been widely used in densest subgraph computations since it is easy to find algorithmically. Therefore, mathematically speaking, the densest subgraph would be the best means for analyzing data-to-data transmissions on social media websites.
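The two density concepts above can be sketched in a few lines of Python. The small friendship graph is invented for illustration, and the k-core is found by the usual peeling idea of repeatedly removing nodes whose degree is below k; the function names are hypothetical.

```python
def build_adjacency(edges):
    """Map every node to the set of its neighbours (undirected graph)."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def edge_density(nodes, edges):
    """Number of edges divided by number of nodes."""
    return len(edges) / len(nodes)


def k_core(adjacency, k):
    """Repeatedly remove nodes with fewer than k neighbours; return what is left."""
    remaining = {node: set(neigh) for node, neigh in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in list(remaining):
            if len(remaining[node]) < k:
                for neigh in remaining[node]:
                    remaining[neigh].discard(node)
                del remaining[node]
                changed = True
    return set(remaining)


# Hypothetical network: four mutual friends (a, b, c, d) plus a chain d-e-f.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
         ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f")]
adjacency = build_adjacency(edges)

core = k_core(adjacency, 3)
core_edges = [(u, v) for u, v in edges if u in core and v in core]

print(edge_density(adjacency, edges))  # density of the whole graph
print(sorted(core))                    # nodes of the 3-core
print(edge_density(core, core_edges))  # density of the 3-core
```

In this example the 3-core is the group of four mutual friends, and its edge density is higher than that of the whole graph, which is why k-cores are a convenient starting point when searching for dense subgraphs.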

On the other hand, transmitting data from one user to another across social media websites is essential. The development of mobile-based communications has allowed people to access various SNS on their smartphones. Various challenges have marred the traditional mesh network, making data transmission from one user to the other hard (Yang, Wu, and Luo 1). This has led to the emergence of the opportunistic network, which, unlike the traditional mesh network, does not require the network size and node locations to be set in advance. In addition, there is no need for setting up a complete path between the source node and the target node. The main advantage of the opportunistic network is that nodes exchange data whenever they enter each other's communication range, thus facilitating a much faster exchange of data between users. Yang, Wu, and Luo (2) note that the opportunistic network will thus help eliminate problems arising from wireless network technologies, such as network delays and network splits, and also ensure that network communication is much less expensive.

Opportunistic Networks

Opportunistic networks are linked to remote area network transmission, handheld device networking, in-vehicle networking, and wildlife tracking. With the invention of the 5G network, tablets, Bluetooth-enabled computers, smartphones, and laptops have increased in number and have also been widely distributed across large geographical areas. People can now move from one place to another with the devices, which has led to the formation of social nodes. Therefore, unlike traditional signal transmission, which affects data transmission across nodes by affecting data acceptance, the opportunistic network improves the broadcast characteristics within the interference range, hence eliminating node broadcast delay (Yang, Wu, and Luo 2). This will lead to low latency and high data transmission rates across the SNS.

More importantly, opportunistic networks across SNS support the store-carry-forward transmission strategy. In this strategy, the data is sent from the source node to the destination node even when there is no end-to-end network connection available (Xiao and Wu 3). However, to ensure that opportunistic networks support efficient data transmission across SNS, it is vital to have an efficient routing algorithm in place. The study of routing algorithms for opportunistic networks has been widely debated, leading to the proposal of countless routing algorithms. Amin Vahdat and David Becker proposed the epidemic routing algorithm, which uses several meeting nodes to help in data transmission. The epidemic routing algorithm has been cited as reducing data transmission delay, improving the average number of hops and average delay times. Another algorithm, referred to as Spray and Wait, was proposed by Spyropoulos et al. (253) and sought to overcome the various shortcomings associated with the epidemic algorithm. This algorithm works in two phases: the spray phase and the wait phase. L copies of the data are sprayed by the source node towards neighboring nodes, after which the wait phase starts. The nodes holding copies must then wait until one of them can pass the message to the destination node. The core aim of the Spray and Wait algorithm is to support a much faster transfer rate across nodes. It is thus essential to ensure that the best routing algorithm is selected when setting up opportunistic networks to support faster data transmission rates across SNS.
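The two phases of Spray and Wait can be illustrated with a toy simulation. This is a simplified sketch and not the scheme exactly as specified by Spyropoulos et al.: it assumes that one random pair of nodes meets per time step, that only the source hands out copies (one per meeting), and that delivery happens only when a node holding a copy meets the destination directly. The node count, number of copies and seed are arbitrary.

```python
import random


def spray_and_wait(num_nodes, copies, source, destination, max_steps=10000, seed=0):
    """Toy simulation; returns (delivery step or None, set of nodes holding a copy)."""
    rng = random.Random(seed)
    carriers = {source}   # nodes currently holding a copy
    copies_left = copies  # copies held by the source, including its own
    for step in range(1, max_steps + 1):
        a, b = rng.sample(range(num_nodes), 2)  # one random meeting per step
        pair = {a, b}
        # Wait phase: any carrier that meets the destination delivers directly.
        if destination in pair and pair & carriers:
            return step, carriers
        # Spray phase: the source gives one spare copy to a node without one.
        if source in pair and copies_left > 1:
            other = b if a == source else a
            if other not in carriers:
                carriers.add(other)
                copies_left -= 1
    return None, carriers


step, carriers = spray_and_wait(num_nodes=20, copies=4, source=0, destination=19)
print(step)              # time step at which the message was delivered
print(sorted(carriers))  # nodes that held a copy at delivery time
```

Because at most L nodes ever hold a copy, the network carries far fewer duplicates than epidemic routing, where every meeting can create a new copy.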

Conclusion

Data-to-data transmission across social network sites is essential since it supports the ongoing interaction between the users. Information diffusion represents the process via which data and information are transmitted from one user to the other across social network sites. Every social network site must ensure that the information diffusion process is fast to eliminate any disappointment amongst users. Therefore, when studying data-to-data transmissions across social network sites, one of the fastest ways to support the whole process is using the densest subgraphs. The densest subgraph makes it easier for analysts to map out all the data-to-data transmissions between users, thus making it easier to ascertain the social position of every user. The densest subgraph helps overcome the inefficiencies linked with the Louvain community detection algorithm. On the other hand, data-to-data transmissions from one user to the other are critical. The traditional wireless technologies have been marred by high latency and low data transmission rates. The ever-increasing number of smartphones, tablets, and Bluetooth-enabled computers has distributed users across a larger geographical zone. With the emergence of the 5G network, using an opportunistic network will support much faster data-to-data transmission. However, it is essential to ensure that the routing algorithm used supports fast data-to-data transmission when using an opportunistic network.

Works Cited

Dasgupta, Subhasis, and Amarnath Gupta. “Discovering interesting subgraphs in social media networks.” 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 2020.

Epasto, Alessandro, Silvio Lattanzi, and Mauro Sozio. “Efficient densest subgraph computation in evolving graphs.” Proceedings of the 24th international conference on the world wide web. 2015.

Faloutsos, Christos, Kevin S. McCurley, and Andrew Tomkins. “Connection subgraphs in social networks.” SIAM International Conference on Data Mining, Workshop on Link Analysis, Counterterrorism and Security. Vol. 2. 2004.

Faragó, András, and Zohre R Mojaveri. “In search of the densest subgraph.” Algorithms 12.8 (2019): 157.

Spyropoulos, Thrasyvoulos, Konstantinos Psounis, and Cauligi S. Raghavendra. “Spray and wait: an efficient routing scheme for intermittently connected mobile networks.” Proceedings of the 2005 ACM SIGCOMM workshop on Delay-tolerant networking. 2005. 252-259.

Tsourakakis, Charalampos E. “A novel approach to finding near-cliques: The triangle-densest subgraph problem.” arXiv preprint arXiv:1405.1477 (2014).

Vahdat, Amin, and David Becker. “Epidemic routing for partially connected ad hoc networks.” (2000): 2019.

Xiao, Yutong, and Jia Wu. “Data transmission and management based on node communication in opportunistic social networks.” Symmetry 12.8 (2020): 1-13.

Yang, Weiyu, Jia Wu, and Jingwen Luo. “Effective data transmission and control based on social communication in social opportunistic complex networks.” Complexity 2020 (2020).

Yazdi Majbouri, Kasra, Adel Majbouri Yazdi, Saeid Khodayi, Jingyu Hou, Wanlei Zhou, Saeed Saedy, and Mehrdad Rostami. “Prediction optimization of diffusion paths in social networks using integration of ant colony and densest subgraph algorithms.” Journal of High-Speed Networks 26, no. 2 (2020): 141-153.

About the author

Geonwoo Kim

Geonwoo is currently a Junior at Crean Lutheran High School in Irvine, California.

The post Data Transmission Via Social Network Sites appeared first on Exploratio Journal.

Neural Data Analysis Using Spectral Techniques

Gitika Tirumishi Jada — Sun, 10 Oct 2021 13:07:13 +0000

Author: Gitika Tirumishi Jada
CMR Institute Of Technology, Bangalore
September 1, 2021

1. Introduction

Data science is the study of data. It is a concept that unifies statistics, data analysis,
informatics and their related methods to understand actual phenomena with data. Data
science is an interdisciplinary field focused on extracting knowledge from large data sets
(for example big data) and applying that knowledge to solve problems in a wide
range of application domains.

The methods used to process the data in this paper are similar to those of signal
processing. Digital signal processing is used to process discrete-time signals. Some of the
algorithms and techniques used are the Discrete Fourier Transform (DFT), the Fast
Fourier Transform (FFT), Finite Impulse Response (FIR) filters, etc. Along with these, we
make use of spectrograms to study the properties of these signals in different domains.

In this report, we will make use of LFP (local field potential) data, which is a form of
neural data. The data is read from the brain by probes inserted in it. These
micro-needles are inserted in various parts of the brain, thus giving rise to many signals
recorded at different spatial locations. We want to discuss how to perform neural analysis
of these brain signals in both the frequency and time domains; therefore, we introduce the
DFT and FFT techniques as well as the short-time Fourier transform (STFT) and the
spectrogram. Correlations of these LFP signals are introduced towards the end of the
report to investigate the relation among the signals and give an idea of how the brain
functions when subjected to certain tasks and which parts of it are functionally connected.

2. Time series

A time series is simply a collection of data over a period of time or at different points in
time. In most cases, a time series is a sequence taken at successive, equally spaced points.
It is then called a sequence of discrete-time data or a regular time series. If
the time series is not taken at equally spaced points in time, it is called an
irregular time series. When more than one variable is recorded at each point in time, the
result is a multivariate time series.

A time series provides a source of additional information that can be analysed and used
in the prediction process. Time series analysis studies the relationships between
different points in time within a single series.

Often while dealing with time series and data in the time domain, we use sampling as a
method to analyse the signal. In signal processing, when we sample signals, we come
across an effect called aliasing. This is an effect that causes different signals to become
indistinguishable once sampled, which happens when the sampling rate is too low for the
frequencies present in the signal. It can also refer to the distortion or artifact that results
when a signal reconstructed from its samples differs from the original continuous signal.
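A small numerical illustration of aliasing is sketched below in Python, using only the standard library. The sampling rate of 10 Hz and the two cosine frequencies are illustrative values chosen for this example, not values taken from the data in this report.

```python
import math


def sample_cosine(freq_hz, rate_hz, n_samples):
    """Sample cos(2*pi*f*t) at the instants t = n / rate_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]


rate = 10.0                            # sampling rate in Hz (illustrative)
slow = sample_cosine(1.0, rate, 20)    # 1 Hz, below the Nyquist frequency of 5 Hz
fast = sample_cosine(9.0, rate, 20)    # 9 Hz, above Nyquist: aliases to 10 - 9 = 1 Hz

# Largest difference between the two sampled sequences.
max_diff = max(abs(a - b) for a, b in zip(slow, fast))
print(max_diff)
```

The two sampled sequences agree to within rounding error, so after sampling at 10 Hz the 9 Hz cosine cannot be told apart from the 1 Hz cosine.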

One such example of a time series is neural data signals. There are many types, such as the
electroencephalogram (EEG) and the local field potential (LFP). These are data
recorded from brain signals. They are used in understanding how the brain works,
essentially which part of the brain is more active when subjected to certain tasks. We
will discuss the LFP further in later sections.

2.1 Time domain and frequency domain

The time domain is where signals are plotted with respect to time. Time domain analysis
is the analysis of a time series with reference to time. In the time domain, the signal’s
value is a real number at each instant. A graph in the time domain
shows how the signal changes with respect to time.

Fig 1: A cosine wave represented in the time domain, with a time period of 1000s. This wave was generated by mixing (adding) two different cosine waves with different periods.

The frequency domain is where signals are plotted with respect to frequency rather
than time. Frequency domain analysis is accordingly the analysis of a function or
a series in the frequency domain. A frequency-domain plot displays how much of the signal
lies within each frequency band over a range of frequencies.

Fig 2: This is the same mixture of cosine waves shown in Fig. 1 here displayed in the frequency domain. As we can see, there are two frequency components, one at 10000 Hz with an amplitude of 0.9 and another at 10 Hz with an amplitude of 2.8

2.2 Fourier Transform

A given function or signal can be converted between the time domain and the frequency domain by using mathematical operators called transforms. The most commonly used is the Fourier transform, which decomposes the time function into an integral of simple waves such as sines and cosines. The spectrum of the frequency components is the frequency domain representation of the signal. The Fourier transform of a signal x(t) can be represented as

The original signal can be reconstructed by applying an inverse Fourier transform. This can be written as-
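In one common convention (assumed here, since conventions differ in where the factor of 2π is placed), the transform pair described in this subsection reads as follows, with X(ω) denoting the Fourier transform of x(t):

```latex
X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-i\omega t}\, dt,
\qquad
x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, e^{i\omega t}\, d\omega
```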

2.3 Discrete Fourier Transform

The Fourier transform deals with continuous signals of unlimited duration, whereas the discrete Fourier transform (DFT) is a type of Fourier transform which converts a finite number of equally spaced samples of a function in the time domain into a complex-valued function of the same length in the frequency domain. Since experimentally we never have an infinite time series acquisition, we always have to deal with finite time series and, therefore, we use the discrete Fourier transform instead. The latter is expressed in formulas as
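One common way to write the DFT is given below. It assumes N samples x(0), ..., x(N-1), no normalisation factor on the forward transform, and the symbol Y for the transform; these choices are assumptions made for this sketch.

```latex
Y(\omega_k) = \sum_{t=0}^{N-1} x(t)\, e^{-i \omega_k t},
\qquad
\omega_k = \frac{2\pi k}{N}, \quad k = 0, 1, \dots, N-1
```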

This creates a spectrum of all the frequency components present in the signal, similarly to the Fourier transform. One of the many applications of the discrete Fourier transform is spectral analysis. When a sequence is represented as x(t) with samples uniformly spaced, the DFT can tell us about the frequency components of the signal or, in other words, the spectral content of the signal.
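As a sketch of how the DFT exposes the spectral content of a signal, the Python snippet below evaluates the defining sum directly on a test cosine. It uses only the standard library so that it runs anywhere; in practice `numpy.fft.fft` computes the same quantity far faster. The helper name `dft` and the test signal are choices made for this illustration.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) DFT: Y[k] = sum over t of x[t] * exp(-2j*pi*k*t/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


# Test signal: exactly 3 cycles of a cosine over 32 samples (illustrative).
N = 32
signal = [math.cos(2 * math.pi * 3 * t / N) for t in range(N)]
magnitudes = [abs(y) for y in dft(signal)]

# Bins whose magnitude is not numerically zero.
peak_bins = [k for k, m in enumerate(magnitudes) if m > 1e-6]
print(peak_bins)
```

Only bin 3 and its mirror bin N - 3 = 29 are non-zero, each with magnitude N/2, which is how a single cosine appears in the DFT of a real signal.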

2.4 Power Spectral Density

We can also derive a power spectrum from a time series. The power spectrum S_xx(ω) of a time series denoted by x(t) is the squared absolute value of the frequency spectrum obtained by taking the DFT of the said time series.

Where Y(ω) is the DFT of the signal x(t).

The spectral density obeys an important theorem, called Parseval’s theorem, which states that the integral of the spectral density over frequency equals the sum of the squared absolute values of the time signal, expressed in the following as-
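In symbols, and assuming a discrete-time signal with frequency measured in radians per sample (the 1/2π normalisation is an assumption of this convention), the spectral density and Parseval’s theorem can be written as:

```latex
S_{xx}(\omega) = \left| Y(\omega) \right|^{2},
\qquad
\sum_{t=-\infty}^{\infty} \left| x(t) \right|^{2}
  = \frac{1}{2\pi} \int_{-\pi}^{\pi} S_{xx}(\omega)\, d\omega
```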

In the above equation, we can notice one of the important features of this theorem: the total energy obtained by integrating over all components in the frequency domain is equal to the total energy obtained by summing over all samples in the time domain.
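The theorem can be checked numerically for a finite signal. The sketch below uses the discrete form, in which the integral becomes a sum over the N DFT bins divided by N. The sample values are arbitrary and the helper `dft` is written out only to keep the example self-contained.

```python
import cmath
import math


def dft(x):
    """Direct DFT: Y[k] = sum over t of x[t] * exp(-2j*pi*k*t/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0]   # arbitrary illustrative samples
power_spectrum = [abs(y) ** 2 for y in dft(x)]     # S_xx at the N DFT frequencies

time_energy = sum(v ** 2 for v in x)               # sum of |x(t)|^2
freq_energy = sum(power_spectrum) / len(x)         # (1/N) * sum of |Y(k)|^2
print(time_energy, freq_energy)
```

The two energies agree to within rounding error.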

2.5 Short Time Fourier Transform

Another type of Fourier transform is the short-time Fourier transform (STFT). This transform is used to measure the sinusoidal frequency and phase content of a particular window of the signal. It involves dividing the signal into shorter time segments of equal length and then computing the DFT on each segment separately. This gives the Fourier spectrum of each segment individually, which we then plot to see how the spectra change over time. Take a signal x(t) as an example. The short-time Fourier transform of this signal is essentially the Fourier transform of the product of x(t) and a window function which is non-zero only for a particular period of time.

Fig 3: Several segments of the same signal are taken one after another and the DFT is computed for each of them
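The segment-by-segment procedure can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the figures: the Hann window, the non-overlapping segments and the test signal are all assumptions made for the example.

```python
import cmath
import math


def dft(x):
    """Direct DFT: Y[k] = sum over t of x[t] * exp(-2j*pi*k*t/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def stft(x, window_length, hop):
    """Return one DFT per windowed segment of x (Hann window assumed)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / window_length)
              for n in range(window_length)]
    frames = []
    for start in range(0, len(x) - window_length + 1, hop):
        segment = [x[start + n] * window[n] for n in range(window_length)]
        frames.append(dft(segment))
    return frames


# Test signal: 2 cycles per 16 samples in the first half, 6 in the second half.
x = [math.cos(2 * math.pi * 2 * t / 16) for t in range(64)]
x += [math.cos(2 * math.pi * 6 * t / 16) for t in range(64)]

frames = stft(x, window_length=16, hop=16)
# Strongest positive-frequency bin in each segment.
peaks = [max(range(1, 8), key=lambda k: abs(frame[k])) for frame in frames]
print(peaks)
```

The strongest bin is 2 for the first four segments and 6 for the last four, so the change in frequency halfway through the signal shows up as a change between consecutive DFTs.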

3. Spectrogram

The spectrogram is a 2-dimensional representation of the STFT where time and frequency are expressed in the same plot, one on each of the two axes. As we can see in Fig. 3, a shorter time segment of the original signal is considered. In the spectrogram, the squared absolute value of the DFT of a segment (i.e. its spectral energy density) is laid out along the frequency (y) axis and colored according to its intensity. Consecutive DFTs are placed one after the other along the time axis.

We consider the time signal shown in Fig. 4. There is an active signal between 0 and 3 seconds, after which the signal is zero for one second. The active signal then resumes at the 4th second and continues till the end of the signal.

Fig 4: Time signal consisting of various frequencies.

Fig. 5 represents the spectrogram of the time signal shown above. As depicted, there are various frequencies from the zeroth second till the third second. There is a gap in the frequencies corresponding to the gap in the signal above. The spectrogram shows how strongly each frequency is present at each instant of time. The four distinct yellow lines mark the strongest frequency components occurring in the time signal. The colour bar helps the reader understand the magnitude of the various frequencies present in it.

Fig 5: Spectrogram of the given time signal shown in Fig. 4.

3.1 Time-Frequency Uncertainty Principle

Looking at this spectrogram, we may wonder how precisely the time or frequency of a component can be determined. There are several parameters we can adjust to achieve this precision, one of them being the size of the window we consider while taking the Fourier transform. If we have a narrow window, the temporal (time) precision will be high but there will be very few frequency points between 0 and the Nyquist frequency. As the window gets longer, there will be more frequency points, so the frequency resolution will increase, but at the same time the temporal precision will decrease as the integration occurs over longer periods of time. Therefore, there has to be a trade-off between the frequency resolution and the temporal resolution for us to attain a decent spectrogram.

Fig 6: Time-Frequency Trade-off
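The trade-off can be put into numbers. For a window of N samples taken at sampling rate fs, the window spans N/fs seconds and its DFT bins are spaced fs/N Hz apart, so the product of the two is always 1. The sketch below assumes a sampling rate of 1000 Hz and two window lengths purely for illustration.

```python
def window_resolution(window_samples, sampling_rate_hz):
    """Return (time span in seconds, DFT bin spacing in Hz) for one window."""
    time_span = window_samples / sampling_rate_hz
    freq_spacing = sampling_rate_hz / window_samples
    return time_span, freq_spacing


fs = 1000.0                                   # assumed sampling rate in Hz
short_window = window_resolution(50, fs)      # short window: good time precision
long_window = window_resolution(500, fs)      # long window: good frequency precision
print(short_window, long_window)
```

The 50-sample window localises events to within 0.05 s but resolves frequencies only 20 Hz apart, while the 500-sample window resolves 2 Hz but averages over 0.5 s.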

Depending on the length of the window we consider, we can have two types of spectrograms – Narrowband spectrogram and a Wideband spectrogram.

3.2 Narrowband Spectrogram

A narrowband spectrogram is one where the window length is long. This means there are more points for the computation of each DFT and, therefore, finer frequency resolution. The drawback is that there is less time resolution, as each window spans a long stretch of the signal. As shown in the figure below, the frequency lines are very sharp, indicating exactly where these frequencies lie, but the time scale is not very clear.

Fig 7: Narrowband spectrogram where the frequency resolution is very precise, but the time resolution is less accurate.

3.3 Wideband Spectrogram

A wideband spectrogram is one where the window length is short. This means there are numerous time segments, which allows transitions to be located precisely, i.e., high time resolution. However, as the window is short, there are fewer DFT points, which results in poor frequency resolution. In Fig. 8, we can see that the timelines are very accurate but the frequency lines are vague and it is harder to identify the precise frequencies of the signal.

Fig 8: Wideband spectrogram where we can see exactly where the time events happen, but the frequency resolution is less accurate and frequency lines are blurry

3.4 Neural Data Analysis of Brain Signal

From this point forward, all the graphs and pictures have been derived from actual brain data. This data is the LFP signal from the brain of an animal when subjected to certain testing conditions.

3.5 Local Field Potential

The local field potential is the electric potential recorded in the extracellular space around the neurons, typically using microneedles. It differs from the electroencephalogram (EEG), which is recorded at the surface of the scalp with macro-electrodes.

When messages are transmitted from one neuron to another, there is a spike in potential known as action potential. This is what the LFP picks up. These are very refined signals as they are taken from such close proximity to the neurons, whereas in the case of the EEG, the signal must propagate through various media like the cranium, the cerebrospinal fluid, dura mater, muscle and skin.

Fig.9: Neural data taken from the same electrode: two different trials of the same experiment

The LFP data shown in Fig. 9 comes from just two trials of a particular experiment, recorded by the same electrode. Each signal corresponds to the data collected by a single electrode. The probes were placed at different locations, but the data was taken during the same period of time. All the signals together are shown in Fig. 10.

Fig.10: LFP data of five trials.

The mean of all these trials is then used for computation of the spectrogram. Fig.11 represents the mean of all the signals in the five trials.

Fig 11: Graph showing the mean LFP of different trials of an experiment
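Averaging across trials is done sample by sample, so that the value of the mean signal at each instant is the average of the trial values at that instant. A minimal sketch is shown below; the three short trials are made-up placeholder numbers, not the recorded LFP values.

```python
def mean_across_trials(trials):
    """Average equal-length trials sample by sample."""
    n_trials = len(trials)
    return [sum(samples) / n_trials for samples in zip(*trials)]


# Three short made-up trials (placeholders, not recorded data).
trials = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
]
mean_signal = mean_across_trials(trials)
print(mean_signal)
```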

Figure 11 shows how the LFP varies with time. As we can see, the y-axis has both negative and positive voltages. From the zeroth second, the signal shoots up from a value of about -4 to about +2, and then keeps on varying.

Fig 12: This figure shows the DFT of the LFP signal shown in Fig. 11

The graph in Fig. 12 shows the DFT of the signal depicted in Fig. 11. The graph of the positive frequencies looks like a mirror image of the negative frequencies. This is because, for a real-valued signal, the DFT at a negative frequency is the complex conjugate of the DFT at the corresponding positive frequency, so the two have the same magnitude.

Fig 13: The expanded view of the DFT of Fig. 12 after eliminating the negative portion

The negative frequencies are discarded and the positive ones are shown on an expanded scale. As we can see, the graph is more readable now and the individual frequency components are easier to distinguish.
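For a real-valued signal, nothing is lost by dropping the negative frequencies, because bin N - k of the DFT is the complex conjugate of bin k. The sketch below checks this on arbitrary sample values and keeps only bins 0 to N/2. The helper `dft` and the numbers are illustrative and are not the LFP data.

```python
import cmath
import math


def dft(x):
    """Direct DFT: Y[k] = sum over t of x[t] * exp(-2j*pi*k*t/N)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 1.0]   # arbitrary real samples
Y = dft(x)
N = len(x)

# Bin N - k (a negative frequency) should equal the conjugate of bin k.
mirror_error = max(abs(Y[N - k] - Y[k].conjugate()) for k in range(1, N))

# One-sided magnitude spectrum: bins 0 .. N/2 only.
one_sided = [abs(y) for y in Y[: N // 2 + 1]]
print(mirror_error, len(one_sided))
```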

Fig.14: Spectrogram of the LFP data

The spectrogram of the given LFP data is depicted in Fig. 14. The purple color indicates low power, whereas the blue, green and yellow colors indicate progressively higher power at the corresponding time and frequency.

3.6 Correlation Function

A useful tool for comparing two signals which are a function of time is the correlation function. It measures how similar two signals are. It is a function which is dependent on a certain amount of time shift. There are two types – autocorrelation and cross-correlation.

3.7 Cross-Correlation

Cross-correlation is defined as the correlation between a signal and a time shifted version of another signal. This is also known as the sliding dot product or sliding inner product. We can consider two signals x(t) and y(t) which are functions of time. The cross-correlation function can be written as-
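For sampled signals, one common way to write this is given below. The sign convention, in which y is shifted by +s, is an assumption, and some texts shift x instead. For continuous signals the sum is replaced by an integral over t.

```latex
R_{xy}(s) = \sum_{t} x(t)\, y(t + s)
```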

Where s is the shift in time.

3.8 Autocorrelation

Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed version of itself. In other words, it measures the similarity between observations of the signal as a function of the time lag between them. In signal analysis we can use this to analyze functions or series of values. Considering the same example as above, we can write the autocorrelation function as-
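With the sliding dot product written as R_xy(s) = Σ x(t) y(t + s), the autocorrelation is the special case R_xx(s) = Σ x(t) x(t + s). The Python sketch below implements both for finite sequences, ignoring samples that fall outside the second sequence. The sign convention and the two made-up pulse signals are assumptions for illustration.

```python
def cross_correlation(x, y, shift):
    """Sliding dot product: sum over t of x[t] * y[t + shift]."""
    return sum(x[t] * y[t + shift]
               for t in range(len(x))
               if 0 <= t + shift < len(y))


def autocorrelation(x, shift):
    """Correlation of a signal with a shifted copy of itself."""
    return cross_correlation(x, x, shift)


x = [0.0, 1.0, 0.0, 0.0, 0.0]    # a pulse at t = 1 (made-up data)
y = [0.0, 0.0, 0.0, 1.0, 0.0]    # the same pulse delayed by 2 samples

# The shift that maximises the cross-correlation recovers the delay.
best_shift = max(range(-4, 5), key=lambda s: cross_correlation(x, y, s))
print(best_shift, autocorrelation(x, 0))
```

The cross-correlation peaks at a shift of 2 samples, which is the delay between the two pulses, and the autocorrelation at zero shift equals the energy of the signal.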

3.9 Covariance

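Covariance is a measure of how two variables vary together. In other words, covariance assigns a number to the joint variation of two variables: it is positive when they tend to increase together and negative when one tends to decrease as the other increases. It is represented by eqn (x).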
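A standard form of the covariance is shown below. The normalisation by N - 1 (the sample covariance) is an assumption; some texts divide by N instead.

```latex
\operatorname{cov}(x, y) = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})
```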

Where x_i and y_i are data values of x and y respectively, x̄ and ȳ are their mean values and N is the number of data values.

3.10 Pearson Correlation

Another method to calculate correlation is to find the Pearson coefficient. It is defined as the measure of linear correlation between two sets of data. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationship or correlation.

Eqn.(xi) represents the formula to calculate the Pearson coefficient. From this we can obtain the correlation matrix as follows-
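Under the usual definition, the Pearson coefficient of two signals and the resulting correlation matrix for n signals can be written as follows, where r_ij denotes the coefficient between signal i and signal j (the indexing is notation introduced for this sketch):

```latex
r_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^{2}}\;
               \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^{2}}},
\qquad
R = \begin{pmatrix}
1      & r_{12} & \cdots & r_{1n} \\
r_{21} & 1      & \cdots & r_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
r_{n1} & r_{n2} & \cdots & 1
\end{pmatrix}
```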

3.11 Correlation Of Neural Data

We consider two signals of the LFP data from two different electrodes, to calculate the correlation. The two signals are represented by two distinct colors. We can see how each of these signals changes with time, and how similar they are to each other.

Fig 15: LFP data taken into consideration for calculation of correlation

We now calculate the correlation among the LFP signals acquired in 5 different electrodes at different brain locations and obtain the correlation matrix represented in Fig. 16. For simplicity, we restricted ourselves to only 5 electrodes for the computation of the correlation matrix.

Fig.16: Correlation matrix for LFP data

Note that the diagonal elements of the matrix are all ‘1’, since they are the correlation of a signal with itself. Another point of observation is that this matrix is symmetric, because the Pearson correlation of x with y is the same as that of y with x.
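Both properties can be verified with a short Python sketch that builds a correlation matrix from the Pearson coefficient. The three channels below are made-up numbers standing in for electrode recordings, and the function names are chosen for this illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


def correlation_matrix(signals):
    """Matrix whose (i, j) entry is the Pearson coefficient of signals i and j."""
    return [[pearson(a, b) for b in signals] for a in signals]


# Three made-up channels standing in for electrode recordings.
signals = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]
R = correlation_matrix(signals)
for row in R:
    print([round(r, 3) for r in row])
```

The diagonal entries are 1, the matrix is symmetric, and the first and third channels, which move in exactly opposite directions, have a coefficient of -1.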

Fig.17: Correlation plot for LFP data corresponding to 5 electrodes

Fig. 17 shows the corresponding plot for the correlation matrix in Fig. 16. The diagonal elements of the plot are shaded white, indicating high correlation, which is expected because all the diagonal elements are one. By contrast, the elements shaded black have zero correlation. Taking the heat map as reference, we can see which pairs of electrodes have high correlation and which do not.

Below, we plot against time two LFP signals that show low correlation in Fig. 17: those from electrodes 1 and 2. This way we can judge to what extent these signals are similar.

Fig.18: LFP data from electrode 1 and electrode 2

We can observe in Fig. 18 that these signals are not very similar: the correlation is indeed very small, as one can see from the correlation matrix for these electrodes.

Fig.19: LFP data from electrode 1 and 3

Fig. 19 shows data from electrodes 1 and 3, which is much more similar than the data in Fig. 18. The peaks and dips of the signal from electrode 1 are consistent with those from electrode 3. Taking these observations into consideration, we can propose that the parts of the brain from which this data was taken are connected or work in coordination when subjected to certain tasks. These two electrodes indeed have a higher correlation, as one can observe from the correlation plot in Fig. 17.

4. Coding

All the graphs and plots in this paper have been coded using Python. The code for the respective figures can be found at the link given below-

https://github.com/giti21/Neural-Data-Analysis

5. References

Mentor: Dr. Gino Del Ferraro, NYU

Time Series – Stoica, P and Moses, R. (2004). Spectral Analysis of signals. Prentice Hall. Wikipedia
Frequency vs. time Domain – Wikipedia
Fourier Transform – Wikipedia STFT
Power Spectral density – Stoica, P and Moses, R. (2004). Spectral Analysis of signals. Prentice Hall.
Spectrogram – Wikipedia Spectrogram code
Correlation – Wikipedia, Tutorial Covariance Pearson Correlation
Local Field Potential – LFP

About the author

Gitika Tirumishi Jada

Gitika is a Senior in college where she studies Electronics. She recently picked up interest in the field of Data science and its applications in the medical field.

The post Neural Data Analysis Using Spectral Techniques appeared first on Exploratio Journal.

Data Quality Analysis Relating to Missing and Corrupted Data

Varshini Siddavatam — Sun, 22 Aug 2021 14:52:10 +0000

Author: Varshini Siddavatam
Sri Chaitanya Junior College
August 1, 2021

Abstract

It is the purpose of this paper to investigate the impact of missing values on commonly encountered data analysis problems. The ability to more effectively identify patterns in socio-demographic longitudinal data is critical in a wide range of social science settings, including academia. Because of the categorical and multidimensional nature of the data, as well as the contamination caused by missing and inconsistent values, it is difficult to perform fundamental analytical operations such as clustering, which groups data based on similarity patterns. Companies can suffer significant financial losses as a result of inaccurate data. Poor-quality data is frequently cited as the root cause of operational snafus, inaccurate analytics, and poorly thought-out business strategies, among other things. Examples of the economic harm that data quality problems can cause include increased costs when products are shipped to the wrong customer addresses, lost sales opportunities as a result of inaccurate or incomplete customer records, and fines for failing to comply with financial or regulatory reporting requirements. Processes such as data cleansing, also known as data scrubbing, are used to correct data errors, as well as work to enhance data sets by including missing values, more up-to-date information, or additional records, among other things. Afterwards, the results are monitored and measured in relation to the performance objectives, and any remaining deficiencies in data quality serve as a starting point for the next round of planned improvements. It is the goal of such a cycle to ensure that efforts to improve overall data quality continue after individual projects are finished.

I. Introduction

A. Background Information

Data quality is a measure of the condition of data based on factors such as consistency, accuracy, reliability, completeness and whether the data is up to date. Professionals regularly have to deal with missing and corrupted data in their work. To make data more dependable and usable, it is highly important to assess data quality and identify data errors. Missing data is comparable to missing pages in an important document: where values are missing, no information is available for the criteria that depend on them.

Especially in the last decade, with constantly increasing online data storage, the problem of corrupt data has been growing rapidly. People now share much of their personal information on social networking and other online sites, and the majority of working procedures depend on online networks. The reported rate of missing data is 15% to 20% (nih.gov, 2021). Given this, it is highly important to maintain the quality of data.

B. Thesis Statement

In this study, the researchers focus on the importance of analyzing the quality of data in relation to missing and corrupted data. The thesis statement of the research is that missing and corrupted data can be managed through effective solutions that improve the overall quality of data. Along with this, improving the storage capacity of the data collection process can protect valuable data from being corrupted or going missing.

With better knowledge and skills, data protection processes can be used far more effectively to secure the important documents that are uploaded to various online sites. Maintaining good-quality data that is not easy to imitate or steal also helps prevent data corruption. The study of missing data also indicates that cases of missing data have increased considerably all over the world. If prevention efforts receive proper governmental support, the process will be better understood by everyone.

II. Body

A. Support Paragraph 1

Being unable to handle missing and corrupted data can have a negative effect on an individual’s work process.

To handle missing and corrupted data, operators can calculate a representative (cluster) value for the column, for example its mean, and put the obtained number in the empty spot. As opined by Hao et al. (2018), guarding against sudden power outages can save data from being corrupted. System crashes are another common reason why data cannot be protected. As stated by Gudivada et al. (2017), when a PC hard disk fills up with junk files, data corruption becomes more likely. Restoring previous versions from the main storage can help recover from data corruption. In addition, keeping the computer’s operating system regularly updated can help operators protect their important data. As observed by Azeroual & Schöpfel (2019), the DISM tool is an effective way for administrators and developers to modify and repair system images. For recovering corrupted files, hard disk repair commands are recognized as another key tool for repairing missing data.
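The first idea above, filling an empty spot with a value computed from the rest of the column, can be sketched as follows. The sketch assumes that the computed value is the column mean and that missing entries are marked with `None`. Both are illustrative choices, and the column of ages is made up.

```python
def impute_column_mean(column):
    """Replace None entries with the mean of the observed values in the column."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)          # nothing to estimate the mean from
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


ages = [30.0, None, 50.0, 40.0, None]     # made-up column with two gaps
filled = impute_column_mean(ages)
print(filled)
```

Mean imputation keeps the column average unchanged but reduces its spread, so it is best regarded as a simple baseline rather than a general remedy.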

According to reports, internet-based stock frauds earn millions every year. Of all the data that goes missing, most cannot be repaired. As stated by Owusu et al. (2019), missing data is a particular concern for elderly people who have very little knowledge of technology and online procedures. Since most work is now done through online networking sites, keeping valuable data out of sight of hackers is a real risk factor. Elderly people are easily manipulated by cyber fraudsters’ phishing calls into sharing their personal details. With advanced, modern technology, hackers continue to break into other people’s important data easily. If valuable data is hacked, lost or corrupted, it can be used for any kind of criminal activity.

Securing these activities requires a proper approach to data protection. Missing or corrupted data not only affects the work procedures of an organization but also harms individuals through their personal information. As proposed by Morganstein & Ursano (2020), working remotely, far from the workplace, creates difficulties for employees who depend on the organization’s data security systems. This is also identified as a major cause of corrupted data. In many cases, employees who lack the proper knowledge and skills are not capable of protecting data from going missing. Data always remains important for establishing anything at an initial stage. “Missing data methods” do not simply replace a missing value in isolation; they combine the information available in the observed data with statistical assumptions.

Missing vital data or information also affects the research process and creates obstacles for researchers. The majority of missing-data cases are found in corporate and private workplaces, where entire work procedures run through internet-based networking sites. As per the view of Pan & Chen (2018), operating online gives cyber fraudsters and hackers new opportunities to commit offenses. Not all staff in a corporate setting are capable of handling data security processes, so when data goes missing they face many problems in their work. This makes it harder to carry out work smoothly and accurately and can consequently erode trust.

By adopting suitable strategic plans, administrators and developers can recover missing and corrupted data. Apart from this, acquiring proper knowledge and skills in data protection can also help reduce the effect of missing data. Corrupted data not only affects work processes in the corporate world but also harms customer trust. Because most work is now done through online networking sites, keeping valuable data out of sight of hackers is a real risk factor. Today, a lack of proper knowledge about data security leads to serious problems, especially in the workplace.

B. Support Paragraph 2

Being able to manage data quality analysis can recover missing and corrupted data that have a positive effect over an individual work process.

As poor-quality data often make limitations in the work process, it is important to adopt data quality analysis to have the ability to save work performance. As stated by Wahyudi et al. (2018), to make a more active operating system the quality of data can be maintained by the developers. As per the view of Uthayakumar et al. (2018), top quality databases can bring migration consideration for an individual work process. In this segment, estimating and implementing a data recovery warehouse is able to meet the need of work culture. As opined by Cappiello et al. (2018), within awareness regarding quality management helps in making an effective work process. Maintaining the use of good quality data helps to improve the decision-making process to make the work more authentic.

As in any work procedure data collection method and collected data both are equally important and have a vital role to precede the entire procedure. In order to protect the data there required a proper skill regarding handling the information and making them placed in a secure storage. As opined by Triguero et al. (2019), adoption of adequate data policy also can help in protecting valuable data for a long-term issue. While transforming big sized data, the majority areas cause corruption and missing data. Since big data is heavy to load and transfer with a minimum time, it requires a proper framework that can be helpful to support this approach. Along with this, focusing on the making process of data storage is also capable of securing informative data more protectively.

This approach especially helps the employees who are working in any corporate organization. Holding data properly is a significant requirement in a workplace, as it is related to the success procedure of the organization. According to Benzeval et al. (2020), based on the data the work process has to be done in any organization and it is able to predict whether the profit can be possible or not. Missing data and corruption of data is a random process that happens when the system is filled and overloaded. In this scenario, having computer knowledge can prevent large size loss and make it a little easier to handle. Utilizing good quality databases has the capability to retain important data for a long time to be used. Therefore, constant experiments regarding data quality analysis can assist the entire process to be more active to protect data from being corrupted.

The value of data can be preserved by adopting effective technologies that add extra layers of security, so that the data is not lost. However, data analysis needs to be efficient enough that any kind of error is noticed before it becomes a risk. According to Broeders et al. (2017), focusing on data quality can secure information and reduce risk factors. Given the current pace of change, it is crucial to develop new strategies and technologies in the workplace that bring innovation while maintaining data quality. Understanding the requirements of data analysis can also help in devising a proper strategy for managing corrupted data.

High-quality data can mitigate a lack of trust and provide reliable resources for completing any part of the work. Work in any organization has to be carried out on the basis of data analysis, which can indicate whether a profit is achievable. As noted above, transforming large datasets can introduce corruption and missing data in most areas of the process. Adopting advanced, modern security systems makes it possible to handle large datasets and protect them from corruption. To satisfy all of the criteria discussed in the section above, it is essential to follow a proper data analysis method, so that potential risk factors can be highlighted or marked for correction.
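As a rough illustration of how a data analysis step might highlight missing and corrupted values so they can be fixed, the sketch below scans a small CSV table using only the Python standard library. The function name `scan_rows`, the list of tokens treated as missing, the choice of numeric columns, and the sample data are all assumptions made for this example.

```python
import csv
import io

# Assumption: these spellings all mean "no value was recorded".
MISSING_TOKENS = {"", "na", "n/a", "nan", "null", "none"}


def scan_rows(text, numeric_columns):
    """Return (row_number, column, problem) tuples for every flagged cell."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            if column is None:
                # The row has more cells than the header has columns.
                problems.append((row_number, "<extra cells>", "corrupted"))
                continue
            cell = (value or "").strip()
            if cell.lower() in MISSING_TOKENS:
                problems.append((row_number, column, "missing"))
            elif column in numeric_columns:
                try:
                    float(cell)
                except ValueError:
                    # A numeric column holding non-numeric text is treated as corrupted.
                    problems.append((row_number, column, "corrupted"))
    return problems


# Hypothetical sample: row 2 lacks an age; row 3 has a damaged age and no income.
sample = "id,age,income\n1,34,52000\n2,,61000\n3,4x,NA\n"
report = scan_rows(sample, numeric_columns={"age", "income"})
for entry in report:
    print(entry)
# (2, 'age', 'missing')
# (3, 'age', 'corrupted')
# (3, 'income', 'missing')
```

A report of this kind only marks the cells that need attention. Deciding whether to correct, impute, or discard them remains a separate step.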

III. Conclusion

Assessing data quality helps determine whether a work process will be beneficial. The overall structure of the data analysis method needs to be more proactive in recovering missing and corrupted data. Maintaining proper rules and regulations can also help govern data collection methods and avoid data errors. Equipped with advanced, modern technology, hackers continue to access other people's important data with ease. To guard against such offences, organizations need to adopt a more effective and active data security system that will hold up over the long term. In many cases, a lack of proper knowledge about data security leads to serious problems, especially in working systems. Maintaining good-quality data that is not easy to imitate or steal also helps prevent data corruption. In addition, computer knowledge can prevent large-scale loss and make such incidents somewhat easier to handle.

Based on the study as a whole, it can be concluded that monitoring the results of improvement efforts is an effective way to manage data quality. High-quality data is always helpful, because work that would otherwise rest on inaccurate data needs good, valid information. With better knowledge and skills, data protection can be applied far more effectively to secure the important documents that are uploaded to various online sites. To run the work process in any organization, a proper framework is needed for analyzing the collected data, so that potential risks and errors can be identified and prevented at a very early stage.

Reference List

Azeroual, O., & Schöpfel, J. (2019). Quality issues of CRIS data: An exploratory investigation with universities from twelve countries. Publications, 7(1), 14. Retrieved From: https://www.mdpi.com/416282

Benzeval, M., Bollinger, C., Burton, J., Couper, M. P., Crossley, T. F., & Jäckle, A. (2020). Integrated data: research potential and data quality. Understanding Society Working Paper Series, (2020-02). Retrieved From: https://www.understandingsociety.ac.uk/sites/default/files/downloads/working-papers/2020-02.pdf

Broeders, D., Schrijvers, E., van der Sloot, B., van Brakel, R., de Hoog, J., & Ballin, E. H. (2017). Big Data and security policies: Towards a framework for regulating the phases of analytics and use of Big Data. Computer Law & Security Review, 33(3), 309-323. Retrieved From: https://www.sciencedirect.com/science/article/pii/S0267364917300675

Cappiello, C., Samá, W., & Vitali, M. (2018, June). Quality awareness for a successful big data exploitation. In Proceedings of the 22nd International Database Engineering & Applications Symposium (pp. 37-44). Retrieved From: https://dl.acm.org/doi/abs/10.1145/3216122.3216124

Gudivada, V., Apon, A., & Ding, J. (2017). Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. International Journal on Advances in Software, 10(1), 1-20. Retrieved From: https://www.researchgate.net/profile/Junhua-Ding/publication/318432363_Data_Quality_Considerations_for_Big_Data_and_Machine_Learning_Going_Beyond_Data_Cleaning_and_Transformations/links/59ded28b0f7e9bcfab244bdf/Data-Quality-Considerations-for-Big-Data-and-Machine-Learning-Going-Beyond-Data-Cleaning-and-Transformations.pdf

Hao, Y., Wang, M., Chow, J. H., Farantatos, E., & Patel, M. (2018). Modelless data quality improvement of streaming synchrophasor measurements by exploiting the low-rank Hankel structure. IEEE Transactions on Power Systems, 33(6), 6966-6977. Retrieved From: https://ieeexplore.ieee.org/abstract/document/8395403/

Morganstein, J. C., & Ursano, R. J. (2020). Ecological disasters and mental health: Causes, consequences, and interventions. Frontiers in Psychiatry, 11, 1. Retrieved From: https://www.frontiersin.org/articles/10.3389/fpsyt.2020.00001/full

nih.gov, 2021. The prevention and handling of the missing data [Online]. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/ [Accessed on 27 July, 2021]

Owusu, E. K., Chan, A. P., & Shan, M. (2019). Causal factors of corruption in construction project management: An overview. Science and Engineering Ethics, 25(1), 1-31. Retrieved From: https://link.springer.com/content/pdf/10.1007/s11948-017-0002-4.pdf

Pan, J., & Chen, K. (2018). Concealing corruption: How Chinese officials distort upward reporting of online grievances. American Political Science Review, 112(3), 602-620. Retrieved From: https://www.cambridge.org/core/journals/american-political-science-review/article/concealing-corruption-how-chinese-officials-distort-upward-reporting-of-online-grievances/43D20A0E5F63498BB730537B7012E47B

Triguero, I., García‐Gil, D., Maillo, J., Luengo, J., García, S., & Herrera, F. (2019). Transforming big data into smart data: An insight on the use of the k‐nearest neighbors algorithm to obtain quality data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(2), e1289. Retrieved From: https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/widm.1289

Uthayakumar, J., Vengattaraman, T., & Dhavachelvan, P. (2018). A survey on data compression techniques: From the perspective of data quality, coding schemes, data type and applications. Journal of King Saud University-Computer and Information Sciences. Retrieved From: https://www.sciencedirect.com/science/article/pii/S1319157818301101

Wahyudi, A., Kuk, G., & Janssen, M. (2018). A process pattern model for tackling and improving big data quality. Information Systems Frontiers, 20(3), 457-469. Retrieved From: https://link.springer.com/article/10.1007/s10796-017-9822-7

About the author

Varshini Siddavatam

Varshini is a senior at Sri Chaitanya Junior College. Always interested in coding and data, she hopes to pursue computer science as her undergraduate major. Apart from academics, she is also interested in basketball, painting, dancing, and writing.

The post Data Quality Analysis Relating to Missing and Corrupted Data appeared first on Exploratio Journal.