Alexei Matusevski personal Blog: Sampling Distribution

Sampling distribution

Imagine you have some data and you would like to know one of its properties, such as the mean (average) or something else: a point estimate. But you cannot work with the full data set, because it is too large or not fully available to you. You do have access to some portion of it.

To get a good approximation of that value, you can build a sampling distribution.

The sampling distribution represents the distribution of the point estimates based on samples of a fixed size from a certain population.

It is useful to think of a particular point estimate as being drawn from such a distribution.
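The idea can be sketched without a database. This is only an illustration under stated assumptions: the population below is synthetic (55,000 values drawn from an exponential distribution with mean 138, not the real order totals), and the function name `sampling_distribution` is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: 55,000 skewed "order totals" (NOT the real data).
population = rng.exponential(scale=138.0, size=55_000)


def sampling_distribution(population, sample_size, iterations, rng):
    """Return the means of `iterations` random samples of size `sample_size`."""
    means = np.empty(iterations)
    for i in range(iterations):
        # draw one sample without replacement and keep its point estimate
        sample = rng.choice(population, size=sample_size, replace=False)
        means[i] = sample.mean()
    return means


means = sampling_distribution(population, sample_size=1000, iterations=500, rng=rng)

print(f"population mean        = {population.mean():.2f}")
print(f"mean of sample means   = {means.mean():.2f}")
print(f"spread of sample means = {means.std(ddof=1):.2f}")
```

Each entry of `means` is one point estimate, and the whole array approximates the sampling distribution of the mean for that sample size.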

To make that point more visible, I ran a set of experiments to show that it really works. The sample size is the variable that changes between experiments.

import mysql.connector
import pandas as pd
import numpy as np
from scipy.stats import bernoulli, binom, poisson, norm, uniform, beta
import matplotlib.pyplot as plt

This is the function that fetches random values from the table:

# Query `count` random rows from the orders table and return their totals.
# Relies on the module-level `cursor` created below.
def get_random(count):
    query = ("SELECT totals FROM orders ORDER BY RAND() LIMIT %s")
    cursor.execute(query, (count,))
    # each row is a 1-tuple; keep only the value
    return [row[0] for row in cursor]


cnx = mysql.connector.connect(user='user', password='pass',
                              host='127.0.0.1',
                              database='somedb')

# our means
means = []
cursor = cnx.cursor()
n = 0
# during this experiment the sample size changed manually:
# 5500 - 10% of population
# 10,000
# 1,000
# 100
# 10
SAMPLE_SIZE = 10000

for i in range(1001):
    # draw one random sample of the chosen size
    t = get_random(SAMPLE_SIZE)
    if (i % 2 == 0 and i > 0):
        fig, ax = plt.subplots(1, 1)
        ax.hist(means)
        # mean
        m = np.mean(means)
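        # red line marks the current mean of the sample means
        ax.vlines(m, 0, 230, color='r')
        # zero-length white lines: invisible, but they keep x = 132 and
        # x = 147 inside the autoscaled x-axis on every frame
        ax.vlines(132, 1, 1, color='w')
        ax.vlines(147, 1, 1, color='w')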
        label = 'mean = {0:.6f}\ncount = {1}\nsample size={2}\niterations ={3}'.format(m, len(means), SAMPLE_SIZE, i)
        ax.annotate(label, xy=(142, 170))
        plt.xlabel(f'{i}')
        # the path to file is changed manually
        # images are saved so we can concat them into video.
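        # raw f-string so the backslashes in the path are not treated as escapes
        plt.savefig(rf'src\scale2\pic{n:04}.png')
        n+=1
        #plt.show()
        # close the figure; hundreds of them are created in this loop
        plt.close(fig)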
    means.append(np.mean(t))
cursor.close()
cnx.close()

First set: a sample size of 5,500 per iteration. It is worth noticing how quickly the mean settles at a value of about 138.

With a sample size of 10,000 the means "behave" much better: they stay within the range 130-150, and the distribution looks almost normal.

Sample size 1,000: the range of the means increases.

Sample size 100: the behavior is much more interesting, and many more iterations are needed to reach the value of 137.

Sample size 10: made for fun.
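The widening range at smaller sample sizes is what the standard error of the mean predicts: the spread of the sample means is roughly sigma / sqrt(n), with a small finite population correction when sampling without replacement. A rough sketch that compares the observed and predicted spread for the sample sizes used above, again on a synthetic population (an assumption for illustration, not the real `orders` table):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population standing in for the `orders.totals` column.
population = rng.exponential(scale=138.0, size=55_000)
N = population.size
sigma = population.std()

ITERATIONS = 300
spreads = {}
for sample_size in (10, 100, 1_000, 5_500, 10_000):
    means = [
        rng.choice(population, size=sample_size, replace=False).mean()
        for _ in range(ITERATIONS)
    ]
    observed = np.std(means, ddof=1)
    # Standard error of the mean, with the finite population correction
    # that applies when sampling without replacement.
    expected = sigma / np.sqrt(sample_size) * np.sqrt((N - sample_size) / (N - 1))
    spreads[sample_size] = (observed, expected)
    print(f"n={sample_size:>6}: observed spread={observed:7.3f}, expected={expected:7.3f}")
```

The observed spread should shrink as the sample size grows, which matches the narrower histograms seen for the larger samples.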

Saturday, March 03, 2018
