# DS-ML: Module 1 – Introduction to Data Science and Machine Learning

Data science and machine learning are among the most talked-about topics in technology today, and data is often described as one of the most valuable resources. Before starting with data science and machine learning, we need a rough idea of what the terms actually mean.

## Data Science

- Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
- In practice, data science uses data to build statistical models of various phenomena in order to make automated predictions.
- Data science is related to data mining, machine learning, and big data.
- These methods are **probabilistic in nature** and use statistical inference to estimate how likely different outcomes are.

## Machine Learning

Machine learning can be viewed as a kind of curve fitting in which the curve is not known in advance: given input and output data, we train the machine to determine the function that the underlying process implements.

When a trained machine produces results that we did not explicitly program, that factor of surprise is what we call a form of AI. **There are two forms of AI**:

- **Weak AI**: Typically some kind of machine learning. We are already using weak AI.
- **Strong AI**: Presence of self-awareness, as in human beings. We may be on our way to building some kind of strong AI in the future.

Here is a video explaining data science and machine learning by Evolutionary Intelligence.

### Need of Machine Learning

Everything we build has a concrete reason or motive behind it. Machine learning algorithms were likewise built to solve problems that are difficult for humans to solve directly. Let us understand the need with a simple example of a coin toss.

Why do we talk about the probability of getting heads? Although we know that a coin toss has only two outcomes, we still cannot correctly predict the outcome every time. This is because:

- The equation to calculate the correct outcome is too complex, and
- It incorporates too many factors like force used, height attained by the coin, wind speed, number of revolutions, and many more which take different values every time we toss a coin.

So, to sidestep the complexity of so many factors, we prefer the probabilistic approach, which states that in a toss of an unbiased coin there is a 50% chance of getting heads and a 50% chance of getting tails.
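
The probabilistic view can be checked with a small simulation. This is only a sketch: the function name `estimate_heads_probability` and the seed are illustrative choices, and the coin is an idealised fair coin that ignores force, height, wind, and every other physical factor.

```python
import random


def estimate_heads_probability(n_tosses, seed=0):
    """Toss a simulated fair coin n_tosses times and return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# The observed fraction of heads settles near 0.5 as the number of tosses grows.
for n in (10, 1_000, 100_000):
    print(n, estimate_heads_probability(n))
```

No single toss is predicted here; only the long-run frequency is, which is exactly what the probabilistic approach gives us.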

Things are quite simple as long as we ignore every particular factor in a coin toss. Now let us take the wind speed into consideration.

**Suppose the numbers of heads observed in regions of high wind speed and low wind speed are provided as:**

| Heads at high wind speed | Heads at low wind speed | Total no. of tosses |
|---|---|---|
| A | B | N |

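Here comes the notion of **conditional probability**. With A and B the numbers of heads observed at high and low wind speed, and N the total number of tosses, the overall probability of getting heads is:

**P(heads) = (A + B) / N**

But when we condition on something (here, the wind speed), we must divide by the number of tosses made under that condition. Writing N_high and N_low for the number of tosses made at high and low wind speed (so that N = N_high + N_low), we get:

**P(heads | High wind speed) = A / N_high**

**P(heads | Low wind speed) = B / N_low**

Note that A / N and B / N are the joint probabilities of getting heads together with high or low wind speed, not the conditional ones.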
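
A short sketch of this calculation follows. The counts are made up for illustration, and the parameter names are assumptions: `heads_high` and `heads_low` play the roles of A and B, and `tosses_high + tosses_low` is N. Conditioning on the wind speed means dividing by the number of tosses made under that wind condition.

```python
def heads_probabilities(heads_high, tosses_high, heads_low, tosses_low):
    """Return the overall and the wind-conditioned probabilities of heads."""
    total_tosses = tosses_high + tosses_low
    return {
        "P(heads)": (heads_high + heads_low) / total_tosses,
        "P(heads | high wind)": heads_high / tosses_high,
        "P(heads | low wind)": heads_low / tosses_low,
    }


# Hypothetical counts: 40 tosses in high wind, 60 tosses in low wind.
probs = heads_probabilities(heads_high=18, tosses_high=40, heads_low=33, tosses_low=60)
for name, value in probs.items():
    print(f"{name} = {value:.3f}")
```

With these counts the overall probability (0.51) lies between the two conditional ones (0.45 and 0.55), weighted by how often each wind condition occurred.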

Now consider a specific wind speed, for example: **What is the probability of getting heads when the wind speed is 114 km/h?** It is difficult to derive results for such specific conditions directly, so we need to predict them as accurately as possible using the **probabilistic approach**.

To do so, we record how often heads comes up under different conditions and wind speeds and plot the data. To make accurate predictions, we then try to fit a curve to the points.

There could be many possible curves through the given points. To determine the best one, we use extra points that were not used for fitting and check which curve comes closest to them. For a good curve fit, one needs to balance bias and variance so that the resulting machine learning algorithm is stable.
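
The selection step can be sketched with NumPy. Everything in the data is an assumption made for illustration: the measurements are synthetic (a smooth curve plus a little noise), the wind speed is rescaled to the range 0 to 1, and polynomials of degree 1, 3, and 7 stand in for the "many possible curves".

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: rescaled wind speed vs. observed fraction of heads.
wind_speed = np.linspace(0.0, 1.0, 12)
heads_fraction = (0.5 + 0.2 * np.sin(2 * np.pi * wind_speed)
                  + rng.normal(0.0, 0.01, wind_speed.size))

# Keep a few "extra" points aside; they are not used for fitting.
held_out = np.array([2, 5, 9])
train = np.setdiff1d(np.arange(wind_speed.size), held_out)


def held_out_error(degree):
    """Fit a polynomial on the training points and score it on the held-out points."""
    coeffs = np.polyfit(wind_speed[train], heads_fraction[train], degree)
    predictions = np.polyval(coeffs, wind_speed[held_out])
    return float(np.mean((predictions - heads_fraction[held_out]) ** 2))


# Candidate curves: the one closest to the held-out points wins.
errors = {degree: held_out_error(degree) for degree in (1, 3, 7)}
best_degree = min(errors, key=errors.get)
print(errors)
print("best degree:", best_degree)
```

The straight line cannot follow the bend in the data, so it loses on the held-out points. Which of the more flexible curves wins depends on the noise and on which points are held out.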

## Paradigms of Machine Learning

There are three paradigms of machine learning:

- Supervised learning
- Unsupervised learning
- Reinforcement learning

For now we will only define these paradigms; we will dive deeper into the methods later on.

### Supervised Learning

- Supervised learning tries to learn the function that maps a set of inputs to a set of outputs.
- There may be more than one input and more than one output.
- The objective is to approximate the function as closely as possible, because perfect prediction is nearly impossible in real-life problems.
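
As a minimal sketch of these points, the example below recovers a mapping from two inputs to two outputs using ordinary least squares. The "unknown" process (`true_weights`), the noise level, and the sample size are all made up for illustration; real problems are rarely this linear.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical process to be learned: 2 inputs -> 2 outputs, plus small noise.
true_weights = np.array([[2.0, -1.0],
                         [0.5, 3.0]])
inputs = rng.uniform(-1.0, 1.0, size=(200, 2))
outputs = inputs @ true_weights + rng.normal(0.0, 0.05, size=(200, 2))

# Supervised step: estimate the mapping from labelled (input, output) pairs.
learned_weights, *_ = np.linalg.lstsq(inputs, outputs, rcond=None)

# Use the learned function on an input that was not in the training data.
new_input = np.array([[0.3, -0.7]])
print(learned_weights)
print(new_input @ learned_weights)
```

The learned weights come out close to, but not exactly equal to, the true ones, which matches the last point: we approximate the function as closely as the noisy data allows.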

### Unsupervised Learning

- **Unsupervised learning** (**UL**) is a type of algorithm that learns patterns from untagged data. The hope is that, through mimicry, the machine is forced to build a compact internal representation of its world.
- In contrast to supervised learning (SL), where data is tagged by a human, e.g. as “car” or “fish”, UL exhibits self-organization that captures patterns as neuronal predilections or probability densities.
- Let us understand it better with an example: Suppose we have collected Amazon’s sales data and we need to segregate customers based on their pattern of purchases.
- Here we will not be identifying each person from his/her purchase patterns; rather, we will group the customers into several categories.

**In unsupervised learning, we do not have specific tags with training data.**
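
The customer example can be sketched with k-means clustering, which is one common unsupervised method (the text above does not name a specific algorithm). The customer data is synthetic, the two features (orders per month, average basket value) are assumptions, and the farthest-first initialisation is chosen only to keep this small example deterministic.

```python
import numpy as np


def kmeans(points, k, n_iters=20):
    """Group points into k clusters; no labels are used at any step."""
    # Farthest-first initialisation: start from the first point, then keep
    # adding the point that is farthest from the centroids chosen so far.
    centroids = [points[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[np.argmax(dists)])
    centroids = np.array(centroids, dtype=float)

    for _ in range(n_iters):
        # Assign every point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids


rng = np.random.default_rng(2)
# Synthetic customers: (orders per month, average basket value).
occasional = rng.normal(loc=[2.0, 20.0], scale=[0.5, 3.0], size=(50, 2))
frequent = rng.normal(loc=[12.0, 60.0], scale=[1.0, 5.0], size=(50, 2))
customers = np.vstack([occasional, frequent])

labels, centroids = kmeans(customers, k=2)
print(centroids)
```

The algorithm never sees which group a customer came from; the two categories emerge from the purchase patterns alone.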

### Reinforcement Learning

- According to Wikipedia, **reinforcement learning** (**RL**) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
- In very simple words, reinforcement learning makes a machine learn to perform a task by trying out actions and receiving rewards for good outcomes.
- A well-known example is AlphaGo, which defeated world-champion Go players.
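
The reward-driven loop can be sketched on a deliberately tiny problem: a three-armed bandit with an epsilon-greedy agent. This is a strong simplification (there is only one state, and it has nothing to do with how AlphaGo works); the reward probabilities, `epsilon`, and the function name `run_bandit` are illustrative assumptions.

```python
import random


def run_bandit(reward_probs, n_steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly pick the best-looking action, sometimes explore."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    value_estimates = [0.0] * n_actions  # running average reward per action
    counts = [0] * n_actions             # how often each action was taken
    total_reward = 0

    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: random action
        else:
            action = max(range(n_actions), key=lambda a: value_estimates[a])  # exploit
        # The environment answers with a reward of 1 or 0.
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        # Incremental update of the average reward seen for this action.
        value_estimates[action] += (reward - value_estimates[action]) / counts[action]
        total_reward += reward

    return value_estimates, counts, total_reward


estimates, counts, total = run_bandit([0.2, 0.5, 0.8])
print(estimates, counts, total)
```

Nobody tells the agent which action is best; it finds out by acting, observing rewards, and gradually favouring the action that pays off most often.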

## Generalization

- Generalization is the ability of a trained machine learning algorithm to perform well on data it has not seen during training.
- For example, if we train an algorithm with thousands of images of dogs and cats, the algorithm can then identify dogs and cats in new images given to it.
- In real-life problems like the one above, the accuracy may not be 100%.

### Curve Fitting: Overfit and Underfit

- Here, curve A is an **underfit**. It fails badly even on the training dataset; it has **high bias**.
- Curve B is an **overfit**. It fails badly on the test dataset; it has **high variance**.

We need to have a balance between the bias and the variance to have an ideal machine learning algorithm.
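
Bias and variance can be estimated numerically by refitting the same kind of curve on many noisy datasets. The sketch below is an illustration under assumptions: the true function (a sine wave), the noise level, and the polynomial degrees 1, 3, and 9 are all made up, with degree 1 standing in for an underfit curve and degree 9 for an overfit one.

```python
import numpy as np

rng = np.random.default_rng(3)


def true_function(x):
    return np.sin(2 * np.pi * x)


x_train = np.linspace(0.0, 1.0, 15)
x_test = np.linspace(0.05, 0.95, 50)


def bias_and_variance(degree, n_datasets=200, noise=0.2):
    """Refit a polynomial on many noisy datasets; return (bias^2, variance) on x_test."""
    predictions = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        y_train = true_function(x_train) + rng.normal(0.0, noise, x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        predictions[i] = np.polyval(coeffs, x_test)
    mean_prediction = predictions.mean(axis=0)
    # Bias: how far the average fit is from the truth.
    bias_squared = np.mean((mean_prediction - true_function(x_test)) ** 2)
    # Variance: how much the fit changes from one dataset to the next.
    variance = np.mean(predictions.var(axis=0))
    return float(bias_squared), float(variance)


results = {degree: bias_and_variance(degree) for degree in (1, 3, 9)}
for degree, (b2, var) in results.items():
    print(f"degree {degree}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

The straight line has the largest bias and the smallest variance, while the degree-9 polynomial shows the opposite pattern. A good model sits between the two extremes.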

## Embeddings

- If an algorithm that identifies dogs and cats is to be trained to identify lions as well, then **with machine learning algorithms built using the embedding technique, we only need to modify or add certain new computational elements**.
- Whereas with algorithms built without embeddings, a complete modification is required.
- The idea of embeddings is:
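
One way to picture the claim about adding lions is a nearest-prototype classifier over fixed embedding vectors. This is a rough sketch under assumptions, not necessarily the construction these notes have in mind: the class `PrototypeClassifier` is hypothetical, and the 2-D vectors are made-up stand-ins for the output of a trained image encoder.

```python
import numpy as np


class PrototypeClassifier:
    """Nearest-prototype classifier over fixed embedding vectors (illustrative only)."""

    def __init__(self):
        self.prototypes = {}

    def add_class(self, label, example_embeddings):
        # A class is represented by the mean of its example embeddings.
        self.prototypes[label] = np.mean(example_embeddings, axis=0)

    def predict(self, embedding):
        # Pick the class whose prototype is closest to the query embedding.
        return min(self.prototypes,
                   key=lambda label: np.linalg.norm(embedding - self.prototypes[label]))


# Made-up 2-D embeddings; a real encoder would produce many more dimensions.
cat_examples = np.array([[0.9, 0.1], [0.8, 0.2]])
dog_examples = np.array([[0.1, 0.9], [0.2, 0.8]])
lion_examples = np.array([[0.9, 0.9], [0.8, 1.0]])

clf = PrototypeClassifier()
clf.add_class("cat", cat_examples)
clf.add_class("dog", dog_examples)

query = np.array([0.85, 0.95])
before = clf.predict(query)           # only "cat" and "dog" are known here
clf.add_class("lion", lion_examples)  # extend: add one prototype, change nothing else
after = clf.predict(query)
print(before, after)
```

Adding the new class only adds one element (a prototype); the existing cat and dog prototypes and the embedding space itself are left untouched.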

In the next module, we will cover the basics of statistics, such as random variables, probability distributions, moment-generating functions, and the central limit theorem. You can also take a look at our ML Coding category for better coding insights in Python.
