computer science Archives - Exploratio Journal

A Review of the Use of Computer Vision Techniques in Live-cell Imaging and Image Segmentation

Vedanth Ramji — Mon, 15 Aug 2022 05:51:06 +0000

Author: Vedanth Ramji
Mentor: Dr. Vincent Boudreau
APL Global School

Abstract

Advances in deep learning and computer vision have given rise to algorithms and techniques that can help us understand the content of images with more accuracy, efficiency and precision than previously attainable. Innovation in microscopy has also introduced expansive microscope image data sets that require analysis, and here, computer vision techniques have played a key role in revolutionising the field of microscopy and live-cell imaging.

Live-cell imaging and image segmentation have played a crucial role in understanding biological problems such as transcription regulation in bacterial and eukaryotic cells (Van Valen et al., 2016). Image segmentation is an important component in most live-cell imaging experiments, as it allows us to identify unique parts of an image which, for instance, can be useful to analyse the behaviour of individual cells or to identify and differentiate between cells in close proximity. Computer vision techniques, such as deep convolutional neural networks (a supervised machine learning model), enhance image segmentation by reducing the manual curation of images, making it easier for labs to share solutions and significantly increasing segmentation accuracy.

In this paper, we will discuss how computer vision techniques are used to tackle problems in different aspects of live-cell imaging. We will also highlight how deep convolutional neural networks are used in image segmentation.

Introduction

Artificial intelligence has made processing data more precise, fast and efficient. It has also proven to be a beneficial tool in many different fields. This is especially true in microscopy, particularly live-cell imaging. Live-cell imaging is a process by which living cells are imaged over time using light microscopy to understand biological systems in action. Computer vision, a subset of artificial intelligence that enables computers to procure meaningful information from images and videos, can be applied to microscope images or movies from live-cell imaging experiments to derive information about living systems.

A critical part of live-cell imaging data analysis is image segmentation. Image segmentation refers to partitioning an image into distinct and meaningful segments. Thresholding, Voronoi algorithms and the watershed transform are commonly used techniques and tools for image segmentation (Moen et al., 2019). However, they all face three principal challenges: curation time, segmentation accuracy and solution sharing. As different research groups use highly specific combinations of the above-mentioned image segmentation methods for unique segmentation problems, it is difficult for findings and solutions to be shared across labs. Many of these methods also require considerable amounts of manual curation, and even then the segmentation might be inaccurate. Deep convolutional neural networks address these three key issues.

Two other important areas of application in live-cell imaging are image classification and object tracking. Image classification is the task of giving an image a meaningful label. An example of image classification is identifying whether a protein is being expressed in the cytosol or the nucleus using fluorescence. Image classifier architectures for biological images share many similarities with commercial image classifiers; hence, the challenge in image classification usually lies in procuring well-annotated biological datasets (Moen et al., 2019).

Object tracking is the task of following objects through a time-lapse movie. An example of object tracking is the tracking of single cells in a live-cell imaging movie as they exhibit a phototropic response. Object tracking can be quite challenging: the analysis requires cells to be present, identified and differentiated in every frame of a time-lapse movie, while phototoxicity and photobleaching limit the frame rate and clarity of the movie. However, successful solutions and methods have made object tracking useful in understanding bacterial cell growth and cell motility (Moen et al., 2019; Kimmel, Chang, Brack & Marshall, 2018).

Deep convolutional neural networks, or conv-nets, are supervised machine learning models that can be used to solve large-scale image segmentation problems. Supervised machine learning models are trained using well-defined, labelled data sets and can then be applied to new data (Supervised Learning, 2020). Conv-nets have been shown to be incredibly useful for image segmentation and can be applied to many different problems, from cell type prediction to quantifying localization-based kinase reporters in mammalian cells (Van Valen et al., 2016).

Discussion

Augmented Microscopy

Another major application in live-cell imaging is augmented microscopy. Augmented microscopy refers to extracting information from biological images that is usually latent. An example of augmented microscopy is identifying the locations of cell nuclei and other large structures from bright-field microscope images. However, augmented microscopy is not limited to light microscopy images with fluorescent traces. It can also be used to improve image resolution and minimise phototoxicity in real time (Moen et al., 2019; see also Sullivan & Lundberg, 2018).

Mathematical Construct of Conv-nets

Conv-nets are composed of two main components: dimensionality reduction and classification (Van Valen et al., 2016). Dimensionality reduction refers to constructing a lower-dimensional representation of an input image using three key operations. First, we denote the input image as I and convolve it with a set of filters {f1, …, fn} to get the filtered images {I ∗ f1, …, I ∗ fn}. Second, a transfer function t(x) is applied to the filtered images to produce a set of feature maps {t(I ∗ f1), …, t(I ∗ fn)}. For example, the rectified linear unit, relu, can be used as a transfer function; it is defined as relu(x) = max(0, x). The third operation is optional: it scales the feature maps down to a smaller spatial scale by replacing each section of a feature map with the largest pixel value in that section (Convolutional Implementation of the Sliding Window Algorithm, n.d.). This is known as max pooling. The final output of dimensionality reduction (assuming max pooling was performed) is a set of reduced feature maps.
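The three operations of dimensionality reduction can be sketched in a few lines of code. The following is a minimal illustration and not the implementation used by Van Valen and colleagues: the class name ConvSketch and its method names are hypothetical, it handles one filter and a single-channel image, and it slides the filter over the image without flipping it, as deep learning libraries usually do.

```java
// Illustrative sketch only; ConvSketch and its method names are hypothetical.
class ConvSketch {
    // Operation 1: slide the filter over every position where it fits entirely
    // inside the image and sum the element-wise products (filter not flipped).
    static double[][] convolve(double[][] image, double[][] filter) {
        int rows = image.length - filter.length + 1;
        int cols = image[0].length - filter[0].length + 1;
        double[][] out = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                for (int i = 0; i < filter.length; i++) {
                    for (int j = 0; j < filter[0].length; j++) {
                        out[r][c] += image[r + i][c + j] * filter[i][j];
                    }
                }
            }
        }
        return out;
    }

    // Operation 2: the transfer function relu(x) = max(0, x), applied per pixel.
    static double[][] relu(double[][] m) {
        double[][] out = new double[m.length][m[0].length];
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < m[0].length; c++) {
                out[r][c] = Math.max(0.0, m[r][c]);
            }
        }
        return out;
    }

    // Operation 3 (optional): max pooling over non-overlapping size x size sections.
    static double[][] maxPool(double[][] m, int size) {
        int rows = m.length / size;
        int cols = m[0].length / size;
        double[][] out = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double best = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < size; i++) {
                    for (int j = 0; j < size; j++) {
                        best = Math.max(best, m[r * size + i][c * size + j]);
                    }
                }
                out[r][c] = best;
            }
        }
        return out;
    }
}
```

Chaining the calls as maxPool(relu(convolve(image, filter)), 2) yields one reduced feature map; a real conv-net repeats this for every filter and learns the filter values during training.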

The second component of a conv-net – classification – uses a neural network that takes the reduced feature maps and assigns them class labels, by the use of matrix multiplication and a transfer function to construct a nonlinear mapping.

Realizing Image Segmentation as Image Classification

For our purpose of live-cell imaging, we may manually annotate each pixel of a microscope image as cell interior, cell boundary or non-cell. Then, by sampling a small region around each pixel and marking each region as any one of the previously mentioned three classes, a training data set can be created. Now the task of image segmentation has been reduced to obtaining a classifier that can distinguish between the three different types of pixels – converting our task of image segmentation to image classification. Then, the classifier can be applied to other data sets.
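The sampling step described above can be sketched as follows. This is only an illustration under stated assumptions: the class PatchSketch, the method patchAround and the parameter half are hypothetical names, and padding with zeros at the image border is one possible choice that the source does not specify. Each patch would be paired with the annotated class (cell interior, cell boundary or non-cell) of its centre pixel to form one training example.

```java
// Illustrative sketch only; PatchSketch and patchAround are hypothetical names.
class PatchSketch {
    // Returns the (2*half+1) x (2*half+1) region centred on pixel (row, col).
    // Pixels that fall outside the image are left at zero (assumed padding).
    static double[][] patchAround(double[][] image, int row, int col, int half) {
        int size = 2 * half + 1;
        double[][] patch = new double[size][size]; // Java zero-initialises arrays
        for (int dr = -half; dr <= half; dr++) {
            for (int dc = -half; dc <= half; dc++) {
                int r = row + dr;
                int c = col + dc;
                boolean inside = r >= 0 && r < image.length
                        && c >= 0 && c < image[0].length;
                if (inside) {
                    patch[dr + half][dc + half] = image[r][c];
                }
            }
        }
        return patch;
    }
}
```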

Work from Van Valen and colleagues (2016) shows that conv-nets can precisely segment mammalian and bacterial cells when properly trained. They showed that conv-nets can reliably segment the following cell types/lines: E. coli (bacterial cells), MCF10A (human breast cells), HeLa-S3, RAW 264.7 (monocytes) and BMDMs (bone marrow-derived macrophages). Conv-nets, therefore, can be used on a wide variety of cells, which makes them a convenient solution for segmenting microscope images.

Using Conv-nets to Analyse Bacterial Growth

Single-cell bacterial growth curves have played a crucial role in our understanding of concepts such as metabolic co-dependence in biofilms. The data for single-cell bacterial growth curves have been traditionally created by imaging microcolonies growing in a medium such as agar or using bulk populations of cells in a Coulter counter. It is challenging to separate and identify individual cells in a medium due to the proximity of the growing cells and more importantly, the limit in the resolution of traditional methods. As we collect more single-cell information, the complexity of the data sets grows. Image segmentation becomes essential to follow large numbers of single cells over several frames of a time-lapse movie.

Conv-nets can effectively segment these images. This allows us to analyse time-lapse movies of growing cells and track the growth of each cell by determining the increase in the area it takes up in the image. These growth curves can then be used to create instantaneous growth rates for single cells. Researchers have also shown that these can be used to create a spatial heat map that is colored according to the growth rates (Van Valen et al., 2016). Therefore, conv-nets allow the quantitative measurement of single-cell dynamics at a high resolution. Spatial heat maps can also provide deeper insights into how individual cells grow such as the relationship between the position of the cell and its growth rate.
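As a rough sketch of how a growth curve can be turned into instantaneous growth rates, the code below assumes that a cell grows approximately exponentially between consecutive frames, so that the rate is the change in the logarithm of the segmented area divided by the time between frames. This model, the class name GrowthSketch and the parameter names are assumptions made for illustration; the source does not state which estimator was used.

```java
// Illustrative sketch only; GrowthSketch and growthRates are hypothetical names.
class GrowthSketch {
    // areas[t] is the segmented area of one cell in frame t (in pixels);
    // dt is the time between consecutive frames (in any consistent unit).
    // Assumes exponential growth between frames: rate = (ln A[t+1] - ln A[t]) / dt.
    static double[] growthRates(double[] areas, double dt) {
        double[] rates = new double[areas.length - 1];
        for (int t = 0; t < rates.length; t++) {
            rates[t] = (Math.log(areas[t + 1]) - Math.log(areas[t])) / dt;
        }
        return rates;
    }
}
```

Computing these rates for every tracked cell and colouring each cell by its value would give the kind of spatial heat map described above.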

Conv-nets and Predicting Cell Types

Semantic segmentation is the process by which each pixel in an image is labelled with a corresponding class (Jordan, 2018). This method allows the prediction of the contents of an image while segmenting the image itself. Researchers have shown that if conv-nets are modified to recognize differences between the intracellular structures of two different types of cells, conv-nets can perform semantic segmentation.

For example, images of NIH-3T3 and MCF10A cells show different morphologies under phase contrast. Work from Van Valen and colleagues (2016) demonstrated that modifying the conv-net architecture by increasing the number of classes it detects from three to four enables conv-nets to recognize the differences in the interior of these two cell types. Then, a training data set was created from different images of NIH-3T3 and MCF10A cells, with each image having the nuclear marker Hoechst 33342 (used to stain the nuclei of cells). A segmentation mask can be created to differentiate between these two cell lines, and the conv-net can be used on other datasets. Previous experiments using the same method have shown that conv-nets have an accuracy of 95% when differentiating between NIH-3T3 and MCF10A cells.

This can have a wide range of uses such as helping diagnoses and furthering our understanding of cellular dynamics in tissues.

Conclusion

Computer vision has proven to be a crucial tool for analysing microscope images and it will certainly be a very important tool for further studies. More efforts to create larger curated datasets of biological images would greatly improve existing computer vision and deep learning tools for live-cell imaging.

Conv-nets are already able to segment a wide variety of cells, and with time they will be trained to segment more. This offers researchers a convenient solution to quickly segment different types of cells. Conv-nets have also been shown to be precise, and they allow labs to share their work faster with one another, as conv-nets generalise a segmentation method that previously was very specific to each case. Conv-nets and other computer vision tools will also become more accessible due to the availability of deep learning and computer vision APIs such as Keras and TensorFlow, so applying them will become easier as time goes on. However, as the complexity of microscopy datasets grows, conv-nets may become less effective. Techniques to optimise conv-nets and other methods such as data-frugal deep learning could be a solution to this (Landram, 2022).

Computer vision will certainly be one of the main driving forces of new microscopy discoveries and will have a transformative impact on the field.

References

Van Valen DA, Kudo T, Lane KM, Macklin DN, Quach NT, DeFelice MM, et al. (2016) Deep Learning Automates the Quantitative Analysis of Individual Cells in Live-Cell Imaging Experiments. PLoS Comput Biol 12(11): e1005177. https://doi.org/10.1371/journal.pcbi.1005177
Moen, E., Bannon, D., Kudo, T. et al. Deep learning for cellular image analysis. Nat Methods 16, 1233–1246 (2019). https://doi.org/10.1038/s41592-019-0403-1
Supervised Learning. (2020, August 19). IBM Cloud Learn Hub. Retrieved August 7, 2022, from https://www.ibm.com/cloud/learn/supervised-learning
Convolutional implementation of the sliding window algorithm. (n.d.). Medium. Retrieved August 7, 2022, from https://medium.com/ai-quest/convolutional-implementation-of-the-sliding-window-algorithm-db93a49f99a0
Landram, K. (2022, January 6). Data-frugal Deep Learning Optimizes Microstructure Imaging. Carnegie Mellon University. Retrieved August 7, 2022, from http://www.cmu.edu/news/stories/archives/2022/january/deep-learning.html
Jordan, J. (2018, May 21). An overview of semantic image segmentation. Jeremy Jordan. Retrieved August 7, 2022, from https://www.jeremyjordan.me/semantic-segmentation/
Sullivan, D. P., & Lundberg, E. (2018). Seeing More: A Future of Augmented Microscopy. Cell, 173(3), 546–548. https://doi.org/10.1016/j.cell.2018.04.003
Kimmel, J., Chang, A., Brack, A., & Marshall, W. (2018). Inferring cell state by quantitative motility analysis reveals a dynamic state system and broken detailed balance. PLOS Computational Biology, 14(1), e1005927. doi: 10.1371/journal.pcbi.1005927

About the author

Vedanth Ramji

Vedanth is currently a sophomore at APL Global School, Thoraipakkam, Tamil Nadu, India. He is passionate about science and technology and wants to work at the intersection of natural sciences, math and computer science to explore possible solutions to current day challenges. Programming is of great interest to Vedanth and he enjoys creating coding projects that are interdisciplinary in nature.

The post A Review of the Use of Computer Vision Techniques in Live-cell Imaging and Image Segmentation appeared first on Exploratio Journal.

How do lightning rods work?

Winnie Shi — Wed, 20 Oct 2021 15:02:48 +0000

Author: Winnie Shi
Mentor: Dr. De La Torre, Caltech
Shanghai Starriver Bilingual School
October 1, 2021

1. Introduction

A lightning rod, usually made of copper, is a structure that protects buildings from damage by attracting lightning flashes through electromagnetic force and guiding the current to the ground. After learning a bit about electricity and experiencing a night of thunder and lightning, I set out to explore how lightning rods work. In this presentation, I will first introduce the historical research on lightning rods, and then explain how lightning rods work in general using electrostatic principles and some easy-to-understand analogies. I will then write a program to calculate the effective range of a lightning rod based on the Monte Carlo technique, and finally propose a lightning protection solution in conjunction with a 3D street view.

2. The History of the Lightning Rod

2.1 Franklin Kite Experiment

In 1746, Franklin turned his home into an electrical laboratory after occasionally discovering the electrical experiments of other scientists, and in a letter he described receiving an electric shock as “a numbing sensation from the beginning to the end”.

In 1747, thanks to Franklin’s discoveries, people stopped using glassy and resinous to describe electricity. They began to use positive and negative electricity.

In 1749, Franklin began to make analogies between lightning and batteries, and from then on lightning became palpable. He explained by analogy the bifurcation in lightning, the color of the lightning, and the deafening sound, and was determined to prove that lightning and electricity were directly related. In 1750, he began to focus his research on protective devices against lightning. This was man’s first step toward the lightning rod.

Fifteen years later, Franklin’s close friend recorded in his diary Franklin’s famous kite experiment. He took the risk of using a kite to try to get up close and personal with lightning. He even tied a key to the kite in order to attract an electrical charge. Even though the string of the kite was already made of insulating silk, this was still a very risky act considering the strength of the lightning bolt could even make the insulator relatively conductive. The conclusion of this experiment was that the key was seen to receive the electric charge brought by the lightning, and Franklin thus proved that lightning is electricity.

2.2 Tip Lightning Rod or Round-end Ones?

Almost simultaneously with the kite experiment, Franklin realized the fact that iron needles can conduct electricity, and tried to integrate this into the “lightning rod” invention. In his diary, he envisioned, “Could there be a way to protect people from sudden lightning strikes by inserting thin needles directly into clouds and pulling the electricity out of them before the lightning strikes the ground?”

Franklin focused on elevating the tip of the lightning rod, while Benjamin Wilson, a member of the Royal Court circle of George III, believed that the pointed lightning rod would attract lightning (a property that remains the main principle of the modern lightning rod) and was not as safe as the round-headed lightning rod. Most scholars at the time also supported Benjamin Wilson’s view, so much so that this eventually turned into a political showdown, with proponents of Franklin’s lightning rod being falsely accused of “trying to establish their own political group in England”. The war between science and politics officially ended when an East India Company was struck by lightning, and Franklin’s spiked design is still used today.

2.3 Three Primary Modern Lightning Rods

2.3.1 Early Streamer Emission (ESE)

ESE systems are the most similar to conventional lightning rods. They are designed to trigger the early initiation of an upward streamer, which increases the effective protection range. This trigger increases the probability of initiating a “streamer” discharge at or near the tip of the rod as the ionized “leader” approaches.

Fig 1 Early Streamer Emission (plotted by AXIS house)

2.3.2 Charge Transfer System (CTS)

The CTS is characterized by its designated protection zone. It is the only system that deters lightning strikes, rather than encouraging them. CTS technology is based on existing physical and mathematical principles. The CTS collects the induced charge from the thunderstorm clouds in the area and transfers it to the surrounding air via an ionizer, thereby reducing the electric field strength in the protected area. The resulting reduction in the potential difference between the site and the clouds inhibits the formation of upward currents and thus reduces electric shocks.

Fig 2 Charge Transfer System

2.3.3 Dissipation Array System (DAS)

DAS is a special type of CTS. Building on the “protected area” of CTS, DAS can completely isolate a facility from direct lightning strikes during a thunderstorm by reducing the induced charge within the protected area to 55% of its pre-installation level relative to its surroundings. Lightning requires the upward and downward currents to connect. When the electric field is reduced, the upward current does not get enough energy to make that connection, so lightning cannot be generated.

Fig 3 Dissipation Array System (DAS) (Plotted by India Mart)

3. The working principle of the lightning rod

3.1 Electron distribution

3.1.1 Electrons in the earth’s crust

Before we go into how lightning rods work, let’s take a look at how electrons behave to cause lightning. To begin with, the ground’s surface carries a positive charge because the dipole cloud produces an electric field that forces electrons to flow towards the earth’s core. The earth’s crust is thus depleted of negatively charged electrons, resulting in a positively charged ground. Colors have been employed to depict the phase cancellation process, with yellow indicating negative charges and blue representing positive charges. The green hue created by combining blue and yellow is neutral, but the absence of either color gives it a bluish or yellowish appearance.

The positive charge on the ground produces an electric field between the earth and the clouds, resulting in a negative charge covering the bottom of the clouds. The associated potential difference can reach 400,000 volts, creating a powerful electric field that lingers in the atmosphere. The exchange of positive and negative charge in the clouds acts essentially like an ion engine that repels the negative charge of the entire planet to the opposite side, according to the principle that opposite charges attract and like charges repel.

When the potential difference of the dipole reaches a particular level, the clouds unleash lightning, which neutralizes the charge at the cloud’s bottom relative to the ground. The clouds then repeat the process, rebuilding the potential difference along an exponential curve. Here’s a visual representation of this.

Fig 4. Resetting time. At t=5 the cloud releases the lightning

The time it takes to recharge is known as the resetting time, for which we have data of around 5 seconds, and we use it to determine the power of the ion pump described before. The current is I = Q/T. Substituting a cloud charge of -20 C and a resetting time of 5 seconds gives a current of magnitude 4 A.

Although 4 amps may appear to be a small quantity, comparable to double the current of a mobile phone charger (2A) or the current used in street lights (4A), it inflicts a great deal of damage because it is released through a very small opening.

3.1.2 Electrons in clouds

Clouds that carry lightning consist of soft hail particles and ice particles. Soft hail weighs more than the ice particles, so it falls to the bottom of the cloud during a thunderstorm while the small crystals are lifted to the top. This process leaves negatively charged hail at the bottom (6-8 km) and positively charged ice in the upper part of the cloud, up to 10 km.

3.1.3 Generation of lightning

There are three major hypotheses about how lightning comes into being.

1. The electric field inside a storm cloud is far higher than what has been measured.
2. Lightning is created via hydrometeors, that is, water particles in the cloud.
3. Energetic runaway electrons initiate the lightning.

These three points are actually interconnected. The energetic electrons of hypothesis 3 are associated with the hydrometeors of hypothesis 2 (ice crystals and water droplets traveling through the cloud), and, by Coulomb’s law, the situation described by hypothesis 2 usually boosts the electric field strength by a large margin.

By Coulomb’s law, the electric field strength is inversely proportional to the square of the distance. The energetic electrons therefore sit among many closely spaced charges of both signs, producing a huge electric field and thus the birth of lightning.
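For reference, the inverse-square dependence mentioned here is the standard Coulomb expression for the field magnitude of a point charge. The symbols q (the charge), r (the distance from it) and ε0 (the vacuum permittivity) are introduced for this illustration and do not appear in the original text.

```latex
E = \frac{1}{4\pi\varepsilon_0}\,\frac{|q|}{r^{2}}
\qquad\Longrightarrow\qquad
\frac{E(r/2)}{E(r)} = \frac{r^{2}}{(r/2)^{2}} = 4
```

Halving the distance to a charge therefore quadruples the field it produces, which is why charges packed closely together inside the cloud can raise the local field so sharply.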

Most lightning is intra-cloud lightning. Lightning that occurs outside of clouds is divided into four main types: two travel from the ground up to the thunderclouds, which are beyond the scope of this report, and two travel from the thunderclouds down to the ground. The first is downward lightning with a negatively charged leader, which starts from the negative charge at the bottom of the cloud together with a positive charge that activates it. The second is downward lightning with a positively charged leader, which starts from the positive charge at the top of the cloud and connects to the ground.

Fig 6 cloud-to-ground lightning flashes

3.2 Lightning propagation methods

The negative stepped leader, as its name suggests, extends the length of the leader channel by propagating in steps. Early studies of lightning pathways based on photography were skewed because some of the steps were too tiny to be seen with the human eye. In 2011, J. Howard, M.A. Uman, C. Biagi, D. Hill, V.A. Rakov and D.M. Jordan used the multiple-station dE/dt technique to localize each step. When the stepped leader brings the lightning close to the ground, its line charge density is around 10^-3 C/m, and the earth sends up a return stroke to meet it, resulting in lightning. This generally happens at the nearest elevated conductor, since lightning, like an agent in an automated pathfinding system, follows the path of least “resistance”, that is, the path along which the potential drops fastest. Because the charges at the ground and at the bottom of the cloud generally have opposite signs, the ground attracts the leader in this scenario.

Fig 7 This graph shows four main lightning strokes that have been witnessed in Florida, which records the waveform of the electric field they generate

Fig 9. The figure shows the path of lightning in 10 video frames plotted by Biagi et al. The lightning originates from a 150 m high cloud layer and the return stroke is shown in frame 10.[Source: Adapted from Biagi et al.]

Fig 10 The figure shows a zoomed-in schematic of the first 9 frames of the leader, and it is easy to see the trend with increased contrast and brightness.[Source: Adapted from Biagi et al.]

3.3 How do conductors work?

A conductor is a substance in which electrons are relatively free to move compared with an insulator, which is the property needed to build a lightning rod: any net charge resides on the surface, because ρ = 0 inside a conductor according to Gauss’s law. Net charge therefore gathers on the surface of the lightning rod, making it easy to strike. The reason is that the ground is usually conductive and carries charge throughout. When a conductor such as a metal is placed on the ground, electrons flow between the ground and the metal until the two are at the same potential, at which point they can be regarded as a single system. This follows from the formula E = −∇V, where E is the electric field, equal to the negative gradient of the electric potential V. Once the potential is uniform across the system, the gradient is 0, so there is no electric field and the electrons are static again. The metal can then be seen as one body with the ground. This reasoning is also valid for conductive buildings and lightning rods, which become more vulnerable to lightning strikes, as if they were a mountain range raised on the natural landscape.

What does a lightning rod do with the lightning it has intercepted? When lightning occurs, the lightning rod attracts the discharge channel, so that the lightning current flows from the rod into the earth, preventing huge currents from damaging buildings, equipment or trees, or injuring people or animals that happen to be walking on the ground.

4. The effective area of lightning rods.

4.1 Monte Carlo Technique

Abhay Srivastava calculated the protection zone of lightning rods by applying a mathematical model of a conducting rod using the Monte Carlo technique. It is a computer-simulated model that randomizes the distribution of lightning strokes. It assumes a concave lateral surface for the protection cone and uses the concept of striking distance from Golde’s formula, d_s = A·I^σ, where d_s is the striking distance, I is the peak lightning current, and A = 10 and σ = 0.65 are constants.
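A striking distance with constants A = 10 and σ = 0.65 is conventionally written d_s = A·I^σ, with the peak current I in kiloamperes and d_s in metres. The short sketch below evaluates it under that assumption; the class and method names are hypothetical and are not taken from the cited model.

```java
// Illustrative sketch only; StrikingDistance is a hypothetical helper class.
class StrikingDistance {
    static final double A = 10.0;      // constant A from the text
    static final double SIGMA = 0.65;  // constant sigma from the text

    // Assumed units: peak current in kiloamperes, result in metres.
    static double strikingDistance(double peakCurrentKiloamps) {
        return A * Math.pow(peakCurrentKiloamps, SIGMA);
    }
}
```

Under these assumptions a 10 kA stroke has a striking distance of roughly 45 m, so stronger strokes are captured from further away.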

Fig 11 The coordinate system of the model

As the graph suggests, the model initializes the sky at height k and the starting point of the lightning as h0. The leader then steps randomly down towards the ground until it detects a positively charged object within its detecting sphere; this point is assigned H_n. H_n-1 and H_n-2 are the last two steps before the striking point, and each point in the graph is given a three-dimensional coordinate.

The author assumes that the cube is 100*100*1000, and that a Cumulonimbus Cloud capable of storing lightning has a height of 500-16,000 meters.

The lightning will begin at a random location at the maximum height, 1000 m.

// random (x, y) on the 100 x 100 top face, at the maximum height of 1000 m
double[] origin = {Math.random() * 100, Math.random() * 100, 1000};

The step length of the leader is chosen at random within 5% to 80% of the striking distance. Two angles of the spherical polar coordinates are then drawn: the inclination angle α, lying between π/2 and 3π/2, and the azimuthal angle β, lying between 0 and 2π.

The mathematical formula is as follows. An iterative loop in Java is used to infer where the lightning will strike. The loop terminates when the program determines that the lightning leader has reached the monitoring range, which is simulated as a sphere of radius 20 m.

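/*
 * Precondition: called after every step of the lightning leader.
 * Postcondition: true means that the leader tip is within distance r of the
 * rod tip, so the lightning has been intercepted and is not counted among
 * the strikes that cause damage.
 */
public boolean inRegion(double[] leader, double[] rod, double r) {
    double dx = leader[0] - rod[0];
    double dy = leader[1] - rod[1];
    double dz = leader[2] - rod[2];
    // Euclidean distance
    return Math.sqrt(dx * dx + dy * dy + dz * dz) <= r;
}

Then comes the main program for generating the path, written according to the following mathematical equation.

public int[][] stepProcess(int[][] before, double[] leader, double[] rod, double r) {
    int[][] after = before; // same array: strike counts are updated in place
    double i = leader[0]; // x
    double j = leader[1]; // y
    double k = leader[2]; // z (height)
    while (k > 0 && !inRegion(new double[] {i, j, k}, rod, r)) {
        double alpha = Math.PI / 2 + Math.random() * Math.PI; // inclination α in [π/2, 3π/2]
        double beta = Math.random() * 2 * Math.PI;            // azimuth β in [0, 2π]
        i = i + r * Math.sin(alpha) * Math.cos(beta);
        j = j + r * Math.sin(alpha) * Math.sin(beta);
        k = k + r * Math.cos(alpha); // cos(α) <= 0, so the leader never rises
    }
    if (inRegion(new double[] {i, j, k}, rod, r)) {
        return before; // intercepted by the rod
    }
    // the higher the number in the array, the more dangerous the area
    int x = (int) Math.max(0, Math.min(after.length - 1, i));    // clamp to the grid
    int y = (int) Math.max(0, Math.min(after[0].length - 1, j));
    after[x][y] = after[x][y] + 1;
    return after;
}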

After writing the program, I ran 1000 trials. Assuming 10 strikes per year in this area, this corresponds to all the lightning this virtual area would suffer in 100 years. I then plotted the results, where green represents absolute safety, yellow represents relative safety (struck once) and red represents danger (struck more than once).
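The trial loop can be sketched as a self-contained class. This is a simplified illustration under stated assumptions rather than the program that produced the figures: the class LeaderSketch, the fixed striking distance ds passed in as a parameter, the seeded random generator and the choice to return only two counts (intercepted and ground strikes) are all hypothetical. The vertical update uses cos α, the usual spherical polar convention, so that an inclination between π/2 and 3π/2 always moves the leader downwards.

```java
import java.util.Random;

// Illustrative sketch only; LeaderSketch and its methods are hypothetical names.
class LeaderSketch {
    // One leader step of length len in spherical polar coordinates.
    static double[] step(double[] p, double len, double alpha, double beta) {
        return new double[] {
            p[0] + len * Math.sin(alpha) * Math.cos(beta),
            p[1] + len * Math.sin(alpha) * Math.sin(beta),
            p[2] + len * Math.cos(alpha)
        };
    }

    // True when point p lies within distance r of the rod tip.
    static boolean inRegion(double[] p, double[] rod, double r) {
        double dx = p[0] - rod[0], dy = p[1] - rod[1], dz = p[2] - rod[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz) <= r;
    }

    // Runs the given number of leaders from a height of 1000 m over a 100 x 100
    // area and returns {intercepted by the rod, reached the ground}.
    static int[] simulate(int trials, double[] rod, double r, double ds, long seed) {
        Random rng = new Random(seed);
        int intercepted = 0;
        int ground = 0;
        for (int t = 0; t < trials; t++) {
            double[] p = {rng.nextDouble() * 100, rng.nextDouble() * 100, 1000};
            while (p[2] > 0 && !inRegion(p, rod, r)) {
                double len = ds * (0.05 + 0.75 * rng.nextDouble());     // 5% to 80% of ds
                double alpha = Math.PI / 2 + Math.PI * rng.nextDouble(); // [π/2, 3π/2]
                double beta = 2 * Math.PI * rng.nextDouble();            // [0, 2π]
                p = step(p, len, alpha, beta);
            }
            if (inRegion(p, rod, r)) {
                intercepted++;
            } else {
                ground++;
            }
        }
        return new int[] {intercepted, ground};
    }
}
```

A call such as LeaderSketch.simulate(1000, new double[] {50, 50, 30}, 20, 45, 1L) mirrors the 1000 trials described above, with a hypothetical 30 m rod at the centre of the area; recording where each ground strike lands, instead of only counting them, would give the data for a map like Fig 12/13.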

Fig 12/13 The output of the program

4.2 Real Life Application

In the 3D street view of Gaudet Map, I captured a dense map of high-rise buildings of about 1000*1000 and used the model to simulate lightning rod placement. It is assumed that all the buildings need protection, but the open space can be ignored. Below are the before and after views of the map. Placing lightning rods in the area means making sure every building stays within the green or yellow circle of Fig 12/13; the program is executed with the height of the lightning rods, r, as a variable.

Fig 14 An actual overview of Lujiazui

Fig 15 Protection after applying the program.

5. Reference List

1.“Modern Lightning Protection Lightning Rods with Lightning Eliminators.” Edited by LEC By admin, LEC, 19 Sept. 2018, www.lightningprotection.com/lightning-rods-are-old-new-lightning-protection-part-3/.

2. J. Howard, M.A. Uman, C. Biagi, D. Hill, V.A. Rakov, D.M. Jordan, Measured close lightning leader-step electric-field-derivative waveforms, J. Geophys.

Res. 116 (2011) http://dx.doi.org/10.1029//2010JD015249.

3. E.P. Krider, C.D. Weidman, R.C. Noggle, The electric field produced by lightning stepped leaders, J. Geophys. Res. 82 (1977) 951–960.

4. Srivastava, Abhay, and Mrinal Mishra. “Lightning Modeling And Protection Zone Of Conducting Rod Using Monte Carlo Technique – ScienceDirect.” Lightning Modeling And Protection Zone Of Conducting Rod Using Monte Carlo Technique – ScienceDirect, Www.sciencedirect.com, 13 June. 2013, https://www.sciencedirect.com/science/article/pii/S0307904X13003478?via%3Dihub.

5. “Franklin’s Lightning Rod | The Franklin Institute.” The Franklin Institute, Www.fi.edu, 8 March. 2014, https://www.fi.edu/history-resources/franklins-lightning-rod.

6. Godwin, Ian. “Franklin Letter To King Fans Flames Of Lightning Debate › News In Science (ABC Science).” Franklin Letter To King Fans Flames Of Lightning Debate › News In Science (ABC Science), Www.abc.net.au, 26 March. 2003, https://www.abc.net.au/science/articles/2003/03/26/816484.htm.

7. M. Vargas, H. Torres. On the development of a lightning leader model for tortuous or branched channels – Part II: model description, J. Electrostat., 66 (2008), pp. 489-495

8. M.A. Uman, The Lightning Discharge, Academic Press, London, 1987, 376 pages, revised paperback edition, Dover, New York, 2001.

9. K. Berger, Blitzstrom-Parameter von Aufwärtsblitzen, Bull. Schweiz. Elektrotech. Ver. 69 (1978) 353–360.

About the author

Douyun (Winnie) Shi

Winnie is a Physics learner at the Starriver Bilingual School in Shanghai, China.

The post How do lightning rods work? appeared first on Exploratio Journal.

Data Quality Analysis Relating to Missing and Corrupted Data

Varshini Siddavatam — Sun, 22 Aug 2021 14:52:10 +0000

Author: Varshini Siddavatam
Sri Chaitanya Junior College
August 1, 2021

Abstract

It is the purpose of this paper to investigate the impact of missing values on commonly encountered data analysis problems. The ability to more effectively identify patterns in socio-demographic longitudinal data is critical in a wide range of social science settings, including academia. Because of the categorical and multidimensional nature of the data, as well as the contamination caused by missing and inconsistent values, it is difficult to perform fundamental analytical operations such as clustering, which groups data based on similarity patterns. Companies can suffer significant financial losses as a result of inaccurate data. Poor-quality data is frequently cited as the root cause of operational snafus, inaccurate analytics, and poorly thought-out business strategies, among other things. Examples of the economic harm that data quality problems can cause include increased costs when products are shipped to the wrong customer addresses, lost sales opportunities as a result of inaccurate or incomplete customer records, and fines for failing to comply with financial or regulatory reporting requirements. Processes such as data cleansing, also known as data scrubbing, are used to correct data errors, as well as work to enhance data sets by including missing values, more up-to-date information, or additional records, among other things. Afterwards, the results are monitored and measured in relation to the performance objectives, and any remaining deficiencies in data quality serve as a starting point for the next round of planned improvements. It is the goal of such a cycle to ensure that efforts to improve overall data quality continue after individual projects are finished.

I. Introduction

A. Background Information

Data quality is a measure of the condition of data, assessed against factors such as consistency, accuracy, reliability, completeness and timeliness. Professionals regularly have to deal with missing and corrupted data in their work. To make data more reliable and usable, it is important to assess data quality and identify data errors. Missing data is comparable to missing values in an important document or in the records of a whole unit: where informative data is missing, no information is available for the criteria that require it.

Especially in the past decade, with the constant growth of online data storage, the problem of corrupt data has been growing rapidly. People now share much of their personal information on social networking and other online sites, and the majority of working procedures depend on online networks. The reported rate of missing data is 15% to 20% (nih.gov, 2021). Given this, it is highly important to maintain the quality of data.

B. Thesis Statement

This study focuses on the importance of analyzing the quality of data in relation to missing and corrupted data. The thesis of the research is that missing and corrupted data can be managed through effective solutions that improve overall data quality. In addition, improving the storage capacity available to the data collection process can protect valuable data from being corrupted or lost.

With better knowledge and skills, data protection processes can be used far more effectively to secure the important documents that are uploaded to various online sites. Maintaining good-quality data that is not easy to imitate or steal also helps to prevent data corruption. The analysis of missing data in this study also suggests that cases of missing data have increased considerably worldwide. If prevention efforts receive proper governmental support, they will be better understood by everyone.

II. Body

A. Support Paragraph 1

Being unable to handle missing and corrupted data can have a negative effect on an individual work process.

To handle missing and corrupted data, operators can calculate the cluster value of a column and place the obtained number in the empty spot. According to Hao et al. (2018), guarding against sudden power outages can save data from being corrupted. System crashes are frequently another reason why data cannot be protected. As stated by Gudivada et al. (2017), when a PC hard disk fills up with junk files, data corruption becomes more likely. Restoring previous versions from main storage can help to recover from data corruption. In addition, updating the computer system on a daily basis can help operators to look after their important data. As observed by Azeroual & Schöpfel (2019), the DISM tool is an effective way for administrators and developers to modify and repair system images. Hard disk repair commands, which recover corrupted files, are recognized as another key means of repairing missing data.
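The fill-in approach described above can be sketched in a few lines of Python. This is only a minimal illustration: it assumes the value placed in the empty spot is the mean of the observed entries in the column, and the function name `fill_missing_with_mean` and the sample column are hypothetical rather than taken from the cited sources.

```python
from statistics import mean

def fill_missing_with_mean(column):
    """Return a copy of `column` where None entries are replaced by the
    mean of the observed (non-None) entries."""
    observed = [value for value in column if value is not None]
    if not observed:
        # Nothing to estimate from, so return the column unchanged.
        return list(column)
    fill_value = mean(observed)
    return [fill_value if value is None else value for value in column]

# Hypothetical column with two missing readings.
ages = [20, None, 30, 40, None]
print(fill_missing_with_mean(ages))  # [20, 30, 30, 40, 30]
```

Other representative values, such as the median or the most frequent value, could be substituted in the same way; which one is appropriate depends on the data.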

According to reports, internet-based stock frauds earn millions per year. Most missing data cannot be repaired. As stated by Owusu et al. (2019), missing data is a particular concern for older people who have very little knowledge of technology and online procedures. Since most work is now done through online networks, securing valuable data from hackers is a real risk. Older people are easily manipulated by phishing calls from cyber fraudsters into sharing their personal details. With advanced and modern technology, many hackers can easily steal other people’s important data. If valuable data is hacked, lost or corrupted, it can be used for any kind of criminal activity.

Securing these activities requires a proper approach to data protection. Missing or corrupted data not only affects work procedures within an organization but also harms individuals through their personal information. As proposed by Morganstein & Ursano (2020), working remotely, far from the workplace, creates difficulties for employees under the data security provider system. This is also identified as a major issue regarding corrupted data. In many cases, employees who lack the proper knowledge and skills are not capable of protecting data from going missing. Data always remains important and significant as evidence at an initial stage. “Missing data methods” do not simply replace a missing value in isolation; they combine the available information from the monitoring data with explicit assumptions.

The loss of vital data or information also affects the research process and creates obstacles for researchers. The majority of missing-data cases are found in corporate and private workplaces, where entire work procedures take place through internet-based networks. In the view of Pan & Chen (2018), operating online sites gives cyber fraudsters and hackers new opportunities to commit offences. Not all staff in a corporate setting are capable of handling data securely, so they face many problems in their work when data goes missing. This prevents work from running smoothly and can consequently increase trust issues.

By adopting strategic plans, administrators and developers in the field of computer science can recover missing and corrupted data. Acquiring proper knowledge and skills in data protection can also help to reduce the effect of missing data. Corrupted data not only affects work processes in the corporate world but also harms customer trust. Since most work is now done through online networks, securing valuable data from hackers is a real risk. At present, a lack of proper knowledge about data security leads to serious problems, especially in the workplace.

B. Support Paragraph 2

Being able to manage data quality analysis makes it possible to recover missing and corrupted data, which has a positive effect on an individual work process.

As poor-quality data often limits the work process, it is important to adopt data quality analysis to safeguard work performance. As stated by Wahyudi et al. (2018), developers can maintain data quality to make an operating system more active. In the view of Uthayakumar et al. (2018), top-quality databases can bring migration considerations into an individual work process. In this respect, planning and implementing a data recovery warehouse can meet the needs of the work culture. As opined by Cappiello et al. (2018), awareness of quality management helps to make the work process effective. Maintaining the use of good-quality data helps to improve decision-making and make the work more authentic.

In any work procedure, the data collection method and the collected data are equally important and play a vital role in carrying the procedure forward. Protecting data requires proper skill in handling information and placing it in secure storage. As opined by Triguero et al. (2019), adopting an adequate data policy can also help to protect valuable data over the long term. Most corruption and loss of data occurs while transforming large data sets. Since big data is heavy to load and transfer in a short time, it requires a proper framework to support this process. Along with this, attention to how data storage is built can also make informative data more secure.

This approach especially helps employees working in corporate organizations. Holding data properly is a significant requirement in a workplace, as it is tied to the success of the organization. According to Benzeval et al. (2020), work in any organization has to be based on data, which makes it possible to predict whether a profit is achievable. Data loss and data corruption are random events that happen when a system is full and overloaded. In this scenario, having computer knowledge can prevent large-scale loss and make the situation a little easier to handle. Good-quality databases are able to retain important data for long-term use. Therefore, constant experimentation with data quality analysis can help the entire process to protect data more actively from being corrupted.

The value of data can be preserved by adopting effective technologies that deliver additional security so that data cannot be lost. However, data analysis needs to be more efficient so that any kind of error can be noticed in time to prevent risks. In the words of Broeders et al. (2017), focusing on data quality makes it possible to secure information and reduce risk factors. Given the current pace of change, it is crucial to develop new strategies and technologies in the workplace that bring innovation while maintaining data quality. Understanding the requirements of data analysis can also help in devising a proper strategy for managing corrupted data.

Top-quality data can mitigate a lack of trust and provide reliable resources for completing any piece of work. Work in any organization has to be based on data analysis, which makes it possible to predict whether a profit is achievable. Most corruption and loss of data occurs while transforming large data sets. Adopting advanced and modern security systems makes it possible to handle big data and protect it from being corrupted. To meet all the criteria discussed in this section, it is essential to follow a proper data analysis method so that potential risk factors can be highlighted and marked for fixing.

III. Conclusion

Assessing data quality can indicate whether a work process will be beneficial or not. The overall data analysis method needs to be more active in order to recover missing and corrupted data. Maintaining proper rules and regulations can also help to control data collection methods and avoid data errors. With advanced and modern technology, many hackers can easily steal other people’s important data. To prevent all such offences, organizations need to adopt a more effective and active data security system that holds up over the long term. In many cases, a lack of proper knowledge about data security leads to serious problems, especially in the workplace. Maintaining good-quality data that is not easy to imitate or steal also helps to prevent data corruption. In addition, having computer knowledge can prevent large-scale loss and make the situation a little easier to handle.

On the basis of the entire study, it can be concluded that monitoring the results of improvements makes it possible to manage data quality. High-quality data is always helpful because it replaces inaccurate data with good, valid information to work with. With better knowledge and skills, data protection processes can be used far more effectively to secure the important documents that are uploaded to various online sites. Running the work process of any organization requires a proper framework for analyzing collected data, so that potential risks or errors are identified and prevented at a very early stage.

Reference List

Azeroual, O., & Schöpfel, J. (2019). Quality issues of CRIS data: An exploratory investigation with universities from twelve countries. Publications, 7(1), 14. Retrieved From: https://www.mdpi.com/416282

Benzeval, M., Bollinger, C., Burton, J., Couper, M. P., Crossley, T. F., & Jäckle, A. (2020). Integrated data: research potential and data quality. Understanding Society Working Paper Series, (2020-02). Retrieved From: https://www.understandingsociety.ac.uk/sites/default/files/downloads/working-papers/2020-02.pdf

Broeders, D., Schrijvers, E., van der Sloot, B., van Brakel, R., de Hoog, J., & Ballin, E. H. (2017). Big Data and security policies: Towards a framework for regulating the phases of analytics and use of Big Data. Computer Law & Security Review, 33(3), 309-323. Retrieved From: https://www.sciencedirect.com/science/article/pii/S0267364917300675

Cappiello, C., Samá, W., & Vitali, M. (2018, June). Quality awareness for a successful big data exploitation. In Proceedings of the 22nd International Database Engineering & Applications Symposium (pp. 37-44). Retrieved From: https://dl.acm.org/doi/abs/10.1145/3216122.3216124

Gudivada, V., Apon, A., & Ding, J. (2017). Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. International Journal on Advances in Software, 10(1), 1-20. Retrieved From: https://www.researchgate.net/profile/Junhua-Ding/publication/318432363_Data_Quality_Considerations_for_Big_Data_and_Machine_Learning_Going_Beyond_Data_Cleaning_and_Transformations/links/59ded28b0f7e9bcfab244bdf/Data-Quality-Considerations-for-Big-Data-and-Machine-Learning-Going-Beyond-Data-Cleaning-and-Transformations.pdf

Hao, Y., Wang, M., Chow, J. H., Farantatos, E., & Patel, M. (2018). Modelless data quality improvement of streaming synchrophasor measurements by exploiting the low-rank Hankel structure. IEEE Transactions on Power Systems, 33(6), 6966-6977. Retrieved From: https://ieeexplore.ieee.org/abstract/document/8395403/

Morganstein, J. C., & Ursano, R. J. (2020). Ecological disasters and mental health: causes, consequences, and interventions. Frontiers in psychiatry, 11, 1. Retrieved From: https://www.frontiersin.org/articles/10.3389/fpsyt.2020.00001/full

nih.gov, 2021. The prevention and handling of the missing data [Online]. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3668100/ [Accessed on 27 July, 2021]

Owusu, E. K., Chan, A. P., & Shan, M. (2019). Causal factors of corruption in construction project management: An overview. Science and engineering ethics, 25(1), 1-31. Retrieved From: https://link.springer.com/content/pdf/10.1007/s11948-017-0002-4.pdf

Pan, J., & Chen, K. (2018). Concealing corruption: How Chinese officials distort upward reporting of online grievances. American Political Science Review, 112(3), 602-620. Retrieved From: https://www.cambridge.org/core/journals/american-political-science-review/article/concealing-corruption-how-chinese-officials-distort-upward-reporting-of-online-grievances/43D20A0E5F63498BB730537B7012E47B

Triguero, I., García‐Gil, D., Maillo, J., Luengo, J., García, S., & Herrera, F. (2019). Transforming big data into smart data: An insight on the use of the k‐nearest neighbors algorithm to obtain quality data. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(2), e1289. Retrieved From: https://wires.onlinelibrary.wiley.com/doi/abs/10.1002/widm.1289

Uthayakumar, J., Vengattaraman, T., & Dhavachelvan, P. (2018). A survey on data compression techniques: From the perspective of data quality, coding schemes, data type and applications. Journal of King Saud University-Computer and Information Sciences. Retrieved From: https://www.sciencedirect.com/science/article/pii/S1319157818301101

Wahyudi, A., Kuk, G., & Janssen, M. (2018). A process pattern model for tackling and improving big data quality. Information Systems Frontiers, 20(3), 457-469. Retrieved From: https://link.springer.com/article/10.1007/s10796-017-9822-7

About the author

Varshini Siddavatam

Varshini is a senior at the Sri Chaitanya Junior College. Always interested in coding and data, she hopes to pursue computer science for her undergraduate major. Apart from academics, she is also interested in basketball, painting, dancing, and writing.

The post Data Quality Analysis Relating to Missing and Corrupted Data appeared first on Exploratio Journal.

Machine Learning: Theory of learning models and practice in python

Mahesh V N — Wed, 17 Mar 2021 17:03:59 +0000

Author: Mahesh V N
Shanghai American School
November, 2020

What is Machine Learning

Machine Learning is a subset of Artificial Intelligence (AI) which provides machines the ability to learn automatically and improve from experience without being explicitly programmed. Machine Learning is used anywhere from automating mundane tasks to offering intelligent insights. As the field of statistics grew, machine learning gained popularity, and the intersection of computer science and statistics gave birth to the probabilistic approach in AI. With large-scale data available, scientists started to build intelligent systems that were able to analyze and learn from large amounts of data. Machine Learning is a type of AI that mimics learning and becomes more accurate over time.

Figure (1.0)
(left) Segmented approach to AI over the decades. (right) Thought process of how ML functions. Credit: https://towardsdatascience.com/ and www.edureka.com

Applications and Prospects.

Artificial Intelligence (AI) is all around us and we are using it in one way or another. One of the popular applications of AI is Machine Learning (ML), in which computers, software, and devices learn through data how to make predictions. Companies are using ML to improve business decisions, forecast weather and much more. Machine learning is about training algorithms on a given set of data and making predictions on another set of unseen data. Some of the most common machine learning applications are:

Learning to predict whether an email is spam or not.
Clustering Wikipedia entries into different categories.
Social Media Analysis (recognising words and understanding the context behind them) – the LionBridge project is a sentiment analysis tool that provides users with insights based on social media posts.
Smart Assistance – analysing voice requests, automating daily tasks and adapting to changing user needs. Alexa by Amazon uses all collected data to improve its pattern recognition skills and address user needs.
News Classification – as the amount of content produced grows exponentially, businesses and individuals need tools that classify and sort information. Algorithms are able to run through millions of articles in many languages and select the ones relevant to user interests and habits.
Image Recognition
Video Surveillance – using complex video recognition algorithms developed with machine learning, a system first learns under human supervision to spot human figures, unknown cars and other suspicious objects; it will soon be possible to imagine a video surveillance system that functions without human intervention.
Optimisation of search engine results – algorithms can learn from search statistics and analyse the contents of a page instead of relying on meta tags and keywords. This is how Google RankBrain works.

Types of Machine Learning approaches

Machine learning is a unique way of programming computers. The underlying algorithm is selected or designed by a human. However, the algorithms learn from data, rather than from direct human intervention. The parameters of a mathematical model are learnt by the algorithms in order to make predictions. Humans don’t know or set those parameters; the machine does.

To explain in simple terms, machine learning uses a data set to train a mathematical model, which is fed enough samples to perform a predictive analysis and output a sensible result. Machine learning can be divided into a series of subclasses: supervised learning, unsupervised learning and reinforcement learning. The supervised learning category is further divided into regression and classification for a more streamlined approach to a task, which yields faster and more accurate results.

Nomenclature of Machine Learning terms: A feature is an individual measurable property of the phenomenon being observed, e.g. the square footage of a house used to predict the house’s cost, or a 2D image input used to perform object recognition. It is a characteristic of the data that is used as input to the ML algorithm; for this reason it is often used as a synonym for the ML input, and it is commonly denoted as x.

The output of the algorithm is also called “prediction” or “outcome” and it is denoted with “y hat”. The label of the machine learning algorithm is usually denoted with y. So, for instance, the purpose of a supervised algorithm is to use input x to generate the output “y hat” which is as close as possible to the real outcome y.

Figure (2.0) Hierarchy of ML Classification. Credit: https://towardsdatascience.com/

Supervised Learning

Supervised learning is an approach to creating artificial intelligence (AI), where the program is given labeled input data and the expected output results. The AI system is specifically told what to look for, thus the model is trained until it can detect the underlying patterns and relationships, enabling it to yield good results when presented with never-before-seen data. Supervised learning problems can be of two types: classification and regression problems. Examples are: determining what category a news article belongs to or predicting the volume of sales for a given future date.

How does supervised learning work?

Like all machine learning algorithms, supervised learning is based on training. The system is fed with massive amounts of data during its training phase, which instructs the system what output should be obtained from each specific input value.

The trained model is then presented with test data to verify the result of the training and measure the accuracy.
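The train-then-test cycle can be sketched as follows. This is a minimal illustration under stated assumptions, not a method from the text: the "model" is a single threshold on one feature, class 1 is assumed to have the larger feature values, and the function names and data are hypothetical.

```python
def train_threshold(features, labels):
    """Learn a threshold halfway between the mean feature value of
    class 0 and the mean feature value of class 1."""
    class0 = [x for x, y in zip(features, labels) if y == 0]
    class1 = [x for x, y in zip(features, labels) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def predict(threshold, x):
    """Predict class 1 when the feature exceeds the learned threshold."""
    return 1 if x > threshold else 0

def accuracy(threshold, features, labels):
    """Fraction of examples whose prediction matches the label."""
    correct = sum(predict(threshold, x) == y for x, y in zip(features, labels))
    return correct / len(labels)

# Hypothetical labelled data: x is the input feature, y the label.
train_x, train_y = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0], [0, 0, 0, 1, 1, 1]
test_x, test_y = [2.5, 4.0, 6.5, 8.5], [0, 0, 1, 1]

threshold = train_threshold(train_x, train_y)       # training phase
print(threshold)                                    # 5.0
print(accuracy(threshold, test_x, test_y))          # 1.0 on the unseen test data
```

The test examples are kept out of training so that the measured accuracy reflects how the model behaves on data it has not seen before.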

Types of Supervised learning

Supervised learning can be further classified into two different machine learning categories:

Classification
Regression

Figure 3.0 – division of problem set in supervised learning. Credit: www.edureka.com

Figure (3.1) (left – a) labelled output feedback (center – b) learning approach. (right – c) Aim of supervised learning. Credit: www.edureka.com

In the case of supervised learning shown in Fig. 3.1(a), we train the algorithm to take in data and produce an output that is already known, because the model is fed with labelled input. In Fig. 3.1(b), the model maps the labelled input to a known output. Applications of supervised learning include risk evaluation and sales forecasting for a given business.

Classification in Machine Learning.

Classification is a supervised machine learning approach used to categorise a set of data into classes. Typically, the outcomes are classes or categories with a clear demarcation between them. For example, a user may be trying to predict what is in a given image.

Classification algorithms fall into two sub-categories: lazy learners and eager learners. A lazy learner simply stores the training data and waits until test data is presented; it spends less time on training but more time on prediction. An eager learner constructs a classification model from the given training data before making any predictions; it spends a lot of time on training and less on prediction, because it commits to a single hypothesis that must work for the entire input space.
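To make the lazy-learner idea concrete, here is a minimal sketch of a nearest-neighbour classifier on a single numeric feature. Nearest-neighbour classification is not named in the text; it is used here as an assumed, commonly cited example of a lazy learner, and the class name and data are hypothetical.

```python
class NearestNeighbourClassifier:
    """A lazy learner: fit() only stores the data, and all the work
    happens when predict() is called."""

    def fit(self, features, labels):
        # "Training" is just memorising the labelled examples.
        self.features = list(features)
        self.labels = list(labels)
        return self

    def predict(self, x):
        # Scan every stored example and return the label of the closest one.
        distances = [abs(x - stored) for stored in self.features]
        return self.labels[distances.index(min(distances))]

# Hypothetical one-feature data set with two classes.
model = NearestNeighbourClassifier().fit(
    [1.0, 2.0, 8.0, 9.0], ["cat", "cat", "dog", "dog"]
)
print(model.predict(2.5))  # cat
print(model.predict(7.0))  # dog
```

An eager learner would instead do its computation inside `fit()`, reducing the training data to a compact model so that each later prediction is cheap.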

Fig 4.0 Pictorial representation of Classification in Machine learning. Credit: www.edureka.com

In Figure 4.0 (above) we have a classifier, which in most cases is an algorithm used to map input data to a specific category. A classification model is trained to predict the class or category of the data.

Classification problems can be divided into two different types: binary classification and multi-class classification.

Binary classification has two outcomes based on the categories specified for the required output (example: the output is to be noted only as a dog or a cat).

In case of a multi-class classification the outputs or predictions are more than two (in general a group of M possible predictions). The goal of the algorithm is to predict the specific output among the possible M.

Note: Classification is discussed in more detail in later sections.

Credit: www.edureka.com

Regression Machine Learning

Regression is a predictive statistical process where the model attempts to find the important relationship between dependent and independent variables. The goal of a regression algorithm is to predict a continuous number such as sales, income, test scores and so on. Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a relationship between the variables of interest. This does not necessarily imply that one variable causes the other (for example, higher SAT scores do not cause higher college grades), but that there is some significant association between the two variables.

Types of Regression: Simple (Univariate)

Conditions: One variable is considered

Figure 5.1 Pictorial representation of types of Regression Machine Learning. Credit: https://www.codeproject.com

A Mathematical Approach into Linear Regression

Y = β0 + β1X is the equation of simple linear regression, where X is the independent variable and Y is the dependent variable. β0 is the intercept (the value of Y when X = 0) and β1 is the slope of the line. Fitting this equation to the data makes it possible to predict unknown data. The fitting procedure consists of finding the parameters β0 and β1 that best fit the data. In machine learning terms, this means to “learn” the parameters β0 and β1 that produce the least error in fitting the data.

Least-Squares Regression

This is the most common method for fitting a regression line. It calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each point to the line (a point lying exactly on the fitted line has a vertical deviation of 0). The deviations are squared before being summed to prevent cancellation between negative and positive values. Least-squares regression uses the “Ordinary Least Squares” or “OLS” method to determine the best-fitting line: the line chosen is the one that minimizes the overall squared distance between the points and the regression line. Figure 6 gives a visual description of this process. The distance is also known as the error in the system (see Figure 6.2). In Figure 6.1, as an example, we consider the price of a commodity and its year of production.

Figure 6.1 shows multiple candidate lines that could fit the data. Figure 6.2 shows a visual representation of the error for one of these lines.

Finally, figure 6.3 shows the best fitting line, i.e. the line which produces the least squared error. Credit: https://towardsdatascience.com/

Once the best fitting line has been found, we can use it to predict the price of the commodity in the year 2020 based on all the data from the previous years.
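The least-squares fit and the prediction step can be sketched in Python as below. The closed-form expressions for the slope and intercept are the standard ones for simple linear regression; the function name and the yearly prices are hypothetical and are not the data shown in the figures.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the line that minimises the sum of
    squared vertical deviations between the points and the line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of (x - mean_x)(y - mean_y) / sum of (x - mean_x)^2
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical commodity prices that rise by 2 units every year.
years = [2015, 2016, 2017, 2018, 2019]
prices = [10.0, 12.0, 14.0, 16.0, 18.0]

intercept, slope = fit_least_squares(years, prices)
print(slope)                     # 2.0
print(intercept + slope * 2020)  # 20.0, the predicted price for 2020
```

Because these sample points lie exactly on a line, the fitted line passes through all of them; with real data the points scatter around the line and the squared deviations are minimised rather than eliminated.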

What is Multiple Linear Regression?

Multiple linear regression is the most common form of linear regression analysis. As a predictive analysis, the multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical.

A Mathematical approach

Multiple linear regression attempts to model the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data.

Model for multiple linear regression, given n independent variables,

y = β0 + β1X1 + … + βnXn + e

where

y – predicted value of the dependent variable.

β0 – the y-intercept, i.e. the value of y when all independent variables are set to 0.

β1X1 – regression coefficient (β1) of the first independent variable (X1), which gives the change in the predicted y value per unit change in that independent variable.

βnXn – regression coefficient (βn) of the nth independent variable (feature) Xn.

e – variation in the prediction of the y value, also known as the model error.

(X1, X2, X3, …, Xn) are the independent variables factoring into the y value.
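As a small worked sketch of this model, the snippet below evaluates the linear combination for one observation and computes the error term as the gap between the observed and predicted values. The coefficients, the two features (floor area and number of bedrooms) and the observed price are hypothetical values chosen for illustration; in practice the coefficients would be estimated from data.

```python
def predict(intercept, coefficients, features):
    """Evaluate intercept + b1*X1 + ... + bn*Xn for one observation."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))

# Hypothetical house-price model with two independent variables.
intercept = 50.0            # value when every feature is 0
coefficients = [0.5, 20.0]  # per square foot, per bedroom
house = [1000.0, 3.0]       # X1 = floor area, X2 = bedrooms

predicted = predict(intercept, coefficients, house)
print(predicted)            # 610.0

# The error term e is the part of the observed value the line does not explain.
observed = 625.0
print(observed - predicted)  # 15.0
```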

Unsupervised Learning

Unsupervised Learning is often used in the more advanced applications of artificial intelligence. It involves giving unlabeled training data to an algorithm and asking it to pick up whatever associations it can on its own. Unsupervised learning is popular in applications of clustering (the act of uncovering groups within data) and association (predicting rules that describe data).

Figure 4.0 – Division of the problem set in unsupervised learning.

Figure 4.1 – (a) Machine is unaware of output. (b) Learning approach. (c) Applications. Credit: www.edureka.com

In Figure 4.0 (above) we have a classifier, which in most cases is an algorithm that maps input data to a specific category. A classification model predicts the class or category into which the data should be placed. A feature is an individual measurable property of the phenomenon being observed. Binary classification has two possible outcomes, based on the categories specified for the required output (for example, the output can only be a dog or a cat), whereas in multi-class classification each sample is assigned to one of more than two labels or targets.

Cost Function and Gradient Descent

A cost function measures how wrong the model is in terms of its ability to estimate the relationship between x and y (in the examples above, the sum of squared errors minimized by OLS is the cost function for linear regression). It is typically expressed as a difference or distance between the predicted value and the actual value. The objective of a Machine Learning model is to find the parameters, structure, etc. that minimize the cost function.

The best-fit line is found by minimizing the difference between the actual values and the predicted values. Linear Regression does not apply brute force to achieve this; instead, it applies an elegant method known as Gradient Descent to minimize the cost function and identify the best-fit line.
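To make the idea of a cost function concrete, here is a small sketch in Python. It uses the mean squared error, which is the sum of squared errors divided by the number of samples, so it is minimized by the same parameters. The function name and the example values are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 7.0]
close_predictions = [3.0, 5.0, 8.0]   # only one value is off, by 1
poor_predictions = [0.0, 0.0, 0.0]    # every value is far off

close_cost = mean_squared_error(actual, close_predictions)
poor_cost = mean_squared_error(actual, poor_predictions)
print(close_cost, poor_cost)
```

The better the predictions, the lower the cost, and a model that predicts every value exactly has a cost of 0.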

Gradient Descent is an iterative optimisation algorithm for finding a local minimum of a function. The local minimum is found by taking steps proportional to the negative of the gradient of the function at the current point. Each iteration of gradient descent applies the update: new point = current point − alpha × (gradient at the current point).

The goal of the gradient descent algorithm is to minimize the given cost function by performing two steps iteratively:

Compute the gradient (slope), i.e. the first-order derivative of the function at the current point.

Make a step (move) in the direction opposite to the gradient, i.e. opposite to the direction in which the slope increases, moving from the current point by alpha times the gradient at that point.

Alpha is the learning rate – a tuning parameter in the optimization process, which determines the length of the steps to be taken.

Credit: Coursera
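The two steps of gradient descent can be sketched in Python for a line with a single input variable. This is an illustrative sketch, not the code from the exercise notebook: the cost is assumed to be the mean squared error, and the function name, data, learning rate and number of steps are hypothetical choices.

```python
def gradient_descent(xs, ys, alpha=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Step 1: compute the gradient of the cost with respect to w and b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step 2: move against the gradient by alpha times the gradient.
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b


# Hypothetical data generated from y = 2x + 1, so the expected fit is known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

With these settings the parameters approach w = 2 and b = 1, the line from which the data was generated.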

We now examine how the choice of Alpha, the learning rate, affects the steps:
a) Learning rate is optimal: the model converges to the minimum.
b) Learning rate is too small: it takes more time but converges to the minimum.
c) Learning rate is higher than the optimal value: it overshoots but converges (1/C < η < 2/C).
d) Learning rate is very large: it overshoots and diverges, moving away from the minimum, so performance gets worse as learning proceeds.

Note: As the gradient decreases while moving towards the local minimum, the size of the step decreases. So, the learning rate (alpha) can be kept constant over the optimization and need not be varied at each iteration.
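The four learning-rate cases can be reproduced on a toy problem. The sketch below assumes a one-dimensional cost f(x) = (C / 2) * x**2, where C is taken to be the curvature of the cost; this interpretation of C, the function name descend and the specific values are assumptions made for illustration. For this cost, each step multiplies x by (1 - alpha * C).

```python
def descend(alpha, curvature=2.0, start=1.0, steps=20):
    """Run gradient descent on f(x) = (curvature / 2) * x**2; return all visited x."""
    x = start
    path = [x]
    for _ in range(steps):
        gradient = curvature * x      # derivative of the cost at the current point
        x = x - alpha * gradient      # step against the gradient
        path.append(x)
    return path


C = 2.0
cases = {
    "a) optimal": 1.0 / C,
    "b) too small": 0.1 / C,
    "c) overshoots, converges": 1.5 / C,
    "d) diverges": 2.5 / C,
}

for label, alpha in cases.items():
    path = descend(alpha, curvature=C)
    print(f"{label:26s} alpha = {alpha:.2f}  final x = {path[-1]:.3g}")
```

In case (c) the sign of x flips at every step, which is the overshoot, but its magnitude still shrinks towards the minimum at 0. In case (d) the magnitude grows at every step.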

Classification: Logistic Regression

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable when the target has only two possible outcomes, for instance either 1 (stands for success/yes) or 0 (stands for failure/no). Logistic Regression is further classified into two categories, namely binomial and multinomial.

Linear Regression vs Logistic Regression

Figure – 7.1 (a) Linear Regression graph. (b) Logistic Regression graph. Credit: www.edureka.com

In Figure 7.1(a) above, the value of Y, the dependent variable, can take values outside the range 0 to 1. In the case of Logistic Regression (Figure 7.1(b)), by contrast, the outcome/predictions will be between 0 and 1 (see Figure 7.1(b) and Figure 7.2). To obtain a binary output, we then threshold every value below 0.5 to 0 and every value above 0.5 to 1 (see Figure 7.2).

Figure 7.2 sigmoid function curve graphical representation. Credit: www.edureka.com

Sigmoid Function Curve

The sigmoid function curve, also known as the S-curve, converts any value between negative infinity and positive infinity into a value between 0 and 1, which in our case is then turned into the binary format of 0 or 1 (refer to Figure 7.2). In our example, x-axis values between -4 and +4 are considered the transition values. Consider a data point with value 0.8, which is neither a discrete 0 nor a discrete 1. A “Threshold Value” of 0.5 is applied to obtain a discrete value of 0 or 1, depending on whether the data point is greater or less than the threshold: if the data point is greater than the threshold, the discrete predicted value is 1; otherwise the discrete predicted value is 0.
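The sigmoid function and the thresholding step can be sketched in Python as follows. The function names are hypothetical, and the two-branch form of the sigmoid is simply one way to avoid numerical overflow for very negative inputs.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use an equivalent form that avoids overflow in exp.
    e = math.exp(z)
    return e / (1.0 + e)


def classify(z, threshold=0.5):
    """Return 1 if the sigmoid output is greater than the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0


for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"z = {z:+.1f}  sigmoid = {sigmoid(z):.3f}  class = {classify(z)}")
```

An input of z = math.log(4) gives a sigmoid output of about 0.8, which is above the threshold of 0.5 and is therefore classified as 1.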

Figure 7.3 comparison between Linear and Logistic Regression. Credit: www.edureka.com

Practical Use – Linear vs Logistic Regression.

Let us consider the weather forecast for a given calendar day. Logistic regression is used to predict whether there will be rain, snow, a tide or sunny weather. Each of these predictions falls into the category of Yes or No, which are discrete outcomes. Linear Regression can also be applied to the weather forecast, for instance to predict the temperature on the same calendar day. Since the temperature of a day is a continuous value, linear regression can be applied to the same set of samples to predict the outcome.

Linear Regression – coding exercise

To develop practical coding skills, I have implemented Linear Regression in Python from scratch. The problem that I considered was to predict the price of a house from its square footage, using existing real data. The code notebook can be found here: https://github.com/gdelfe/Machine-Learning-basics-course/blob/master/Exercise1/exercise1_MLbasics.ipynb

About the author

Mahesh V N

Mahesh is a student at the R. N. Shetty Institute of Technology in Bangalore.

The post Machine Learning: Theory of learning models and practice in python appeared first on Exploratio Journal.