Tuesday, February 27, 2018

Distributions and how to plot them with SciPy



Intro

To make the plotting is simple (not that painful), in that post I will make few plots with different distributions.

First importing the libraries. I will use the matplotlib for plotting the data. SciPy one of the core libraries for data science calculations:

import numpy as np # we will use it for generation the arrays
from scipy.stats import bernoulli, binom, poisson, norm, uniform, beta
import matplotlib.pyplot as plt

To make printing comfortable a made the printing function for Mean, Variance, Skew, Kurt:
def print_mvsk(*args):
    t = args[0]
    mean, var, skew, kurt, = float(t[0]),float(t[1]),float(t[2]),float(t[3])
    sd = np.sqrt(var)
    print(f'mean:{mean:.4f}\tvar:{var:.4f}\tskew:{skew:.4f}\nsd:{sd:.4f}\tkurt:{kurt:.4f}')



Bernoulli Distribution

Bernoulli is long representation of regular Probabilities as we all know it. for example - toss of the coin, or rolling the dice. etc.

Example: Toss a fair coin once.
What is the distribution of the number of heads?
A single trial
The trail can result mutually exclusive results.

P(Success)=p
P(Failure)=1-p
Let X = 1 if the success occurs, and X=0 if a failure occurs.
Then X has a Bernoulli distribution:
$P(X=x)=p^x{(1-p)}^{1-x}$ Probability mass function of the Bernoulli distribution
Mean - $\mu=p$ mean
Variance $\sigma^2=p(1-p)$

This is how we code the Bernoulli distributions with using the scipy.stats.bernoulli

fig, ax = plt.subplots(1, 1) #creating the 1 plot

p = 1/6 # our probability of tossing the dice
x = [0,1] # the x axes
# print the distribution's properties
print_mvsk(bernoulli.stats(p, moments='mvsk'))
# generating the data for value 0, 1
data = bernoulli.pmf(x, p)

# vertical line
ax.vlines(x, 0, data, colors='y', lw=20)
plt.ylabel('Probability of winning in dice tos')
plt.xlabel('0 mean prob to lose \n 1 - chances to win')
plt.title('Bernulli Probability Distribution')
plt.grid(True)
plt.show()

# probability of tossing the coin
p = 1/2 #
x = [0,1]
fig, ax = plt.subplots(1, 1)<

data = bernoulli.pmf(x, p)

ax.vlines(x, 0, data, colors='y', lw=20)
plt.ylabel('Probability of winning in coin tos')
plt.xlabel('0 mean prob to lose \n 1 - chances to win')
plt.title('Bernulli Probability Distribution')
plt.grid(False)
plt.show()

Poisson

Suppose we are counting the number of occurrences of an event in a given unit of time, distance, area or volume.

For example:
  • The number of car accidents in a day.
  • The number of dandelions in a square meter plot pf land. 
Suppose:
  • Events are occurring independently
  • The probability that an event occurs in a given length of time does not change through time.
  • Events are occurring randomly and independently.
Then X, the number of events in a fixed unit of time, has a Poisson Distribution.

Probability - $P\left(X=x\right)=\frac{\lambda^xe^{-\lambda}\ }{x!}$
where e=2.71828, x=0,1,2…
Mean - $\mu=\lambda$
Variance - $\sigma^2=\lambda$


Example:
Plutonium-239 is an isotope of plutonium that is used in nuclear weapons and reactors.
One nanogram of Plutonium-239 will have an average of 2.3 radioactive decays per second, and the number of decays will follow a Poisson distribution.

Q:What is the probability that in a 2 second period there are exactly 3 radioactive decay?

Let X represent the number of decays in a 2 second period. Since the mean is lambda and we have avg 2.3 per second, and we are looking for 2 seconds period
$\lambda=2.3\ast2=4.6$
$P\left(X=3\right)=\frac{{4.6}^3e^{-4.6}\ }{3!}=0.163$

Q:What is probability of no more then 3 decays?
$P\left(X\le3\right)=P\left(X=0\right)+P\left(X=1\right)+P\left(X=2\right)+P\left(X=3\right)=$
$\frac{{4.6}^0e^{-4.6}\ }{0!}+\frac{{4.6}^1e^{-4.6}\ }{1!}+\frac{{4.6}^2e^{-4.6}\ }{2!}+\frac{{4.6}^3e^{-4.6}\ }{3!}=$
$0.010+0.046+0.106+0.163=0.326$

We will use the scipy.stats.poisson functions:
fig, ax = plt.subplots(1, 1)
mu = 4.6 # our lambda
print_mvsk(poisson.stats(mu, moments='mvsk'))
poisson.ppf(0.01, mu)
x = np.arange(poisson.ppf(0.00001, mu),
              poisson.ppf(0.99999, mu))

data = poisson.pmf(x, mu)

# creating new array
data2 = [0]*len(data)

# to show the answer with different color
data2[3]= poisson.pmf(3, mu)

ax.vlines(x, 0, data , colors='r', lw=18, alpha=1)
ax.vlines(x, 0, data2, colors='b', lw=18, alpha=1)
ax.vlines(x, 0, data2, colors='b', lw=18, alpha=1)

plt.ylabel('Probability')
plt.xlabel('Number of Decays')
plt.title('Plutonium-239 prob of having 3 decays ')
plt.show()



Normal / Gaussian Distribution

The random independent variables has Normal distribution. If you have allot random data it will tend to have Normal Distribution.
Normal Distribution is Continues distribution (not discrete)

$f(x)=\frac{1}{\sqrt{2\pi\sigma}}e^{-\ \frac{1}{2\sigma^2}{(x-\mu)}^2}$
for
$-\infty<x<\infty$
$-\infty<\mu<\infty$
$\sigma>0$
$\sigma^2>0$
If X is a random variable that has a normal distribution with mean $\mu$ and variance $\sigma^2$, we write this as
$X~N(\mu,\sigma^2)$
The standard normal distribution is a normal distribution with mean 0 and variance 1.
$X~N(0,1)$>

fig, ax = plt.subplots(1, 1)
print_mvsk(norm.stats(moments='mvsk'))

# using the linspace function to create 
# continues X axis, instead using the 
# np.arrange with step attribute
x = np.linspace(norm.ppf(0.00001),
              norm.ppf(0.99999), 1000)

data = norm.pdf(x)
# 'b-' means:
# Blue Color
# - means a solid line
ax.plot(x, data, 'b-', ms=1)

# cover all back ground with red
ax.vlines(x, 0, data, colors='r', lw=1, alpha=1)

# Cover 95 % (2Sigma) of Normal Dist with blue color
x_sigma2 = np.linspace(norm.ppf(0.025),
              norm.ppf(0.975), 1000)
sigma2 = norm.pdf(x_sigma2)
ax.vlines(x_sigma2,0, sigma2, color='b', lw=1, alpha=.5, label='asd')

# Cover 1 sigma with green
p_sigma1 = norm.pdf(1)
x_sigma1 = np.linspace(norm.ppf(p_sigma1),
              norm.ppf(1-p_sigma1), 1000)
sigma1 = norm.pdf(x_sigma1)
ax.vlines(x_sigma1,0, sigma1, color='g', lw=1, alpha=.5)

plt.ylabel('Prob')
plt.xlabel('Red 100%\nBlue 95%\nGreen=68.2')
plt.show()
Used Materials:

No comments:

Post a Comment