The very core of a lot of data science algorithms includes the concept of probability. In reality, nearly all the solutions to data science problems are probabilistic in nature.

The most widely used probabilistic algorithm in data science is the Naive Bayes classifier, which is built directly on Bayes’ theorem.

Bayes’ Theorem, also known as Bayes’ Rule, is named after Thomas Bayes, an eighteenth-century Presbyterian minister and statistician. Bayes’ Theorem enables us to work on complex data science problems and underlies several machine learning algorithms whose outputs are probabilistic values.

__Prerequisites for Bayes’ Theorem –__

Before proceeding to Bayes’ theorem, we first need to understand a few concepts. These concepts are essentially the prerequisites for understanding Bayes’ Theorem.

**1. Experiment:**

An experiment is a planned operation carried out under controlled conditions.

Tossing a coin, rolling a die, and drawing a card out of a well-shuffled pack of cards are all examples of experiments.

**2. Sample Space:**

The result of an experiment is called an outcome. The set of all possible outcomes of an experiment is called the sample space.

For example, if our experiment is rolling a die and recording its outcome, the sample space will be:

S1 = {1, 2, 3, 4, 5, 6}

**3. Event:**

An event is a set of outcomes (i.e. a subset of the sample space) of an experiment.

Let’s get back to the experiment of rolling a die and define events A and B as:

A = An odd number is obtained = {1, 3, 5}

B = A number greater than 4 is obtained = {5, 6}

The probability of these events:

P (A) = Number of favorable outcomes / Total number of possible outcomes

= 3 / 6 = 0.5

Similarly,

P (B) = 2 / 6 ≈ 0.333
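The classical probability rule above is easy to check in code. The sketch below (an illustrative example; the `prob` helper is our own, not from any library) computes P (A) and P (B) for the die-rolling experiment:

```python
from fractions import Fraction

# Sample space for rolling a fair die
S = {1, 2, 3, 4, 5, 6}

A = {x for x in S if x % 2 == 1}   # event A: an odd number is obtained
B = {x for x in S if x > 4}        # event B: a number greater than 4 is obtained

def prob(event, space):
    """Classical probability: favorable outcomes / total outcomes."""
    return Fraction(len(event), len(space))

print(prob(A, S))  # 1/2
print(prob(B, S))  # 1/3
```

Using `Fraction` keeps the results exact instead of rounding to decimals.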

Because an event is a set, the basic set operations of union and intersection apply to events. Define two more events on the die-rolling experiment:

E = An even number is obtained = {2, 4, 6}

F = A number greater than 3 is obtained = {4, 5, 6}

Then, E ∪ F = {2, 4, 5, 6} and E ∩ F = {4, 6}

Now consider an event G = An odd number is obtained:

Then E ∩ G = empty set = Φ

Such events are called disjoint events. They are also called mutually exclusive events because at most one of the two events can occur at a time.
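Python’s built-in set operators map directly onto these event operations. A small sketch, using the same die-rolling events as above (E = even number, F = number greater than 3, G = odd number):

```python
S = {1, 2, 3, 4, 5, 6}   # sample space for rolling a die
E = {2, 4, 6}            # an even number is obtained
F = {4, 5, 6}            # a number greater than 3 is obtained
G = {1, 3, 5}            # an odd number is obtained

print(E | F)   # union: {2, 4, 5, 6}
print(E & F)   # intersection: {4, 6}
print(E & G)   # set() -> the empty set, so E and G are disjoint
```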

**4. Random Variable:**

A Random Variable is exactly what it sounds like – a variable taking on random values with each value having some probability (which can be zero). It is a real-valued function defined on the sample space of an experiment.

**5. Exhaustive Events:**

A set of events is said to be exhaustive if at least one of the events must occur at any time. Thus, two events A and B are said to be exhaustive if A ∪ B = S, the sample space.

For example, let’s say that A is the event that a card drawn out of a pack is black and B is the event that the card drawn is red. Here, A and B are exhaustive because the sample space S = {red, black}.

**6. Independent Events:**

If the occurrence of one event does not have any effect on the occurrence of another, then the two events are said to be independent. Mathematically, two events A and B are said to be independent if:

P (A ∩ B) = P (AB) = P (A) * P (B)

**7. Conditional Probability:**

Consider that we’re drawing a card from a given deck.

What is the probability that it is a **red** card? That’s easy – 1/2, right? However, what if we know it was a red card – then what would be the probability that it was a king?

The approach to this question is not as simple. This is where the concept of conditional probability comes into play.

Conditional probability is the probability of an event A, given that another event B has already occurred.

This is represented by P (A|B).

P (A|B) = P (A ∩ B) / P (B)

Let event A represent picking a king, and event B, picking a red card. Then, P (A|B) is:

P (A ∩ B) = P (Obtaining a red card which is a King) = 2/52

P (B) = P (Picking a red card) = 1/2

Thus, P (A|B) = (2/52) / (1/2) = 4/52 = 1/13.
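The card example can be verified by enumerating the full deck. This sketch builds all 52 cards and computes P (king | red) exactly as the formula prescribes:

```python
from fractions import Fraction
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = list(product(ranks, suits))   # all 52 cards

red = {(r, s) for r, s in deck if s in ('hearts', 'diamonds')}
kings = {(r, s) for r, s in deck if r == 'K'}

# P(A|B) = P(A ∩ B) / P(B), with A = king, B = red card
p_king_and_red = Fraction(len(kings & red), len(deck))  # 2/52
p_red = Fraction(len(red), len(deck))                   # 26/52 = 1/2
print(p_king_and_red / p_red)                           # 1/13
```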

__What is Bayes Theorem?__

Consider that A and B are any two events from a sample space S where P (B) ≠ 0.

By conditional probability, we have:

P (A|B) = P (A ∩ B) / P (B)

P (B|A) = P (A ∩ B) / P(A)

It follows that P (A ∩ B) = P (A|B) * P (B) = P (B|A) * P (A)

Thus, **P (A|B) = P (B|A)*P (A) / P (B)**

This is the Bayes’ theorem formula.

Here, P (A) and P (B) are the probabilities of observing A and B on their own, without reference to each other.

So, we can say that they are **marginal probabilities**.

P (B|A) and P (A|B) are **conditional probabilities**.

P (A) is called the **prior probability**, P (B) the **evidence**, P (B|A) the **likelihood**, and P (A|B) the **posterior probability**.
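The formula is simple enough to wrap in a small helper function. This sketch (the `bayes` function is our own, for illustration) checks it against the card example from the conditional-probability section:

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Card example: A = picking a king, B = picking a red card.
# P(B|A) = P(red | king) = 2/4, P(A) = 4/52, P(B) = 1/2
print(bayes(2 / 4, 4 / 52, 1 / 2))   # ≈ 0.0769, i.e. 1/13
```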

**Bayes Theorem explained –**

Let’s work through a Bayes’ theorem example –

There are 3 boxes labeled A, B, and C:

- Box A contains 4 red and 6 black balls
- Box B contains 6 red and 2 black balls
- And box C contains 2 red and 8 black balls

The three boxes are identical and have an equal probability of getting picked. Consider that a red ball is chosen. Then what is the probability that this red ball was picked out of box A?

Let E denote the event that a red ball is chosen and A, B, and C denote that the respective box is picked.

We are required to calculate the conditional probability **P (A|E)**.

We have prior probabilities P (A) = P (B) = P (C) = 1 / 3, since all boxes have equal probability of getting picked.

P (E|A) = Number of red balls in box A / Total number of balls in box A = 4 / 10

= 2 / 5

P (E|B) = 6 / 8 = 3 / 4

P (E|C) = 2 / 10 = 1 / 5

Then,

P (E) = P (E|A)*P (A) + P (E|B)*P (B) + P (E|C)*P (C)

= (2/5) * (1/3) + (3/4) * (1/3) + (1/5) * (1/3) = 0.45

Therefore, P (A|E) = P (E|A) * P (A) / P (E) = (2/5) * (1/3) / 0.45 = **0.296**
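The whole worked example can be reproduced in a few lines. This sketch computes the likelihoods, the evidence P (E) by the law of total probability, and the posterior P (A|E):

```python
from fractions import Fraction

# (red, black) ball counts per box, as in the example
boxes = {'A': (4, 6), 'B': (6, 2), 'C': (2, 8)}
prior = Fraction(1, 3)   # each box is equally likely to be picked

# Likelihood of drawing a red ball from each box: P(E|box)
like = {k: Fraction(r, r + b) for k, (r, b) in boxes.items()}

# Evidence by total probability: P(E) = sum of P(E|box) * P(box)
p_red = sum(like[k] * prior for k in boxes)    # 9/20 = 0.45

# Posterior via Bayes' theorem: P(A|E) = P(E|A) * P(A) / P(E)
p_a_given_red = like['A'] * prior / p_red
print(p_a_given_red)                           # 8/27 ≈ 0.296
```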

__Bayes’ theorem applications in Data Science__

There are plenty of applications of the Bayes’ theorem in the real world.

**Bayesian Decision Theory** is a statistical approach to the problem of pattern classification. Under this theory, it is assumed that the underlying probability distribution for the categories is known.

Thus, we obtain an ideal Bayes Classifier against which all other classifiers are judged for performance.
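To show how Bayes’ theorem drives classification, here is a minimal naive Bayes text classifier written from scratch. This is an illustrative sketch only: the words, labels, and training examples are invented, and add-one smoothing is a common choice rather than the only one.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: (words, label)
data = [
    (('offer', 'win'), 'spam'),
    (('win', 'money'), 'spam'),
    (('meeting', 'today'), 'ham'),
    (('project', 'meeting'), 'ham'),
]

label_counts = Counter(label for _, label in data)
word_counts = defaultdict(Counter)
for words, label in data:
    word_counts[label].update(words)

def posterior_scores(words):
    """Unnormalized P(label) * product of P(word|label), add-one smoothed."""
    vocab = {w for c in word_counts.values() for w in c}
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        p = label_counts[label] / len(data)          # prior
        for w in words:
            p *= (word_counts[label][w] + 1) / (total + len(vocab))  # likelihood
        scores[label] = p
    return scores

scores = posterior_scores(('win', 'offer'))
print(max(scores, key=scores.get))   # 'spam'
```

The "naive" part is the assumption that words are conditionally independent given the label, which lets the likelihood factor into a product.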

Here are some celebrated applications of Bayes’ theorem –

- Naive Bayes’ Classifiers
- Discriminant Functions and Decision Surfaces
- Bayesian Parameter Estimation