Statistics for Data Science with Python
This course is part of the Data Science Fundamentals with Python and SQL Specialization. It is useful for data analysis but is not included in this Data Science Professional Certificate.
- Introduction and Descriptive Statistics
- Data Visualization
- Introduction to Probability Distribution
- Regression Analysis
- Cheat Sheet for Statistical Analysis in Python
Introduction and Descriptive Statistics
Types of Data
Measure of Central Tendency
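# get information about each variable (column types and non-null counts)
df.info()
# summary statistics (count, mean, std, min, quartiles, max) for numeric columns
df.describe()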
Measure of Dispersion
Dispersion, which is also called variability, scatter or spread, is the extent to which the data distribution is stretched or squeezed. The common measures of dispersion are standard deviation and variance.
Reliability
- Average paints a partial picture
- Average statistics are incomplete without standard deviation/variance
- Risk metrics are all about variance
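To illustrate why an average alone is incomplete, here is a minimal sketch using two made-up samples (the names `steady` and `volatile` are invented for this example). Both have the same mean, but their variance and standard deviation differ widely.

```python
import numpy as np

# Two made-up samples with the same mean but very different spread.
steady = np.array([48, 49, 50, 51, 52])
volatile = np.array([10, 30, 50, 70, 90])

for name, data in [("steady", steady), ("volatile", volatile)]:
    mean = data.mean()
    # ddof=1 gives the sample (n - 1) variance and standard deviation.
    variance = data.var(ddof=1)
    std_dev = data.std(ddof=1)
    print(f"{name}: mean={mean:.1f}, variance={variance:.1f}, std={std_dev:.2f}")
```

Both samples report a mean of 50, so only the dispersion measures reveal how different they are.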
Jupyter Notebook: Descriptive Statistics
Data Visualization
The Extreme Presentation Method:
- The step-by-step approach for designing presentations of complex or controversial information in ways that drive people to action.
seaborn
matplotlib
Jupyter Notebook: Visualizing Data
Introduction to Probability Distribution
Hypothesis Test
To use the p-value and the significance level together, decide on a value for alpha after you state your hypotheses and before you collect the data. Suppose that alpha = 0.10 (or 10%). You then collect the data and calculate the p-value.
- If the p-value is greater than alpha, you fail to reject the null hypothesis. This does not prove that the null hypothesis is true; the data simply do not provide enough evidence against it.
- If the p-value is less than alpha, you reject the null hypothesis.
- If the p-value and the significance level are approximately equal, e.g. a p-value of 0.11 with alpha = 0.10, it is your call whether to reject or fail to reject, or you could decide to resample and collect more data.
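The decision rule above can be sketched in a few lines. This is an illustration only: the helper `decide()` and the sample values are made up for this example, and a one-sample t-test from `scipy.stats` stands in for whichever test produces the p-value.

```python
from scipy import stats

def decide(p_value, alpha=0.10):
    """Return the decision for a hypothesis test at significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# Made-up sample; the null hypothesis is that the population mean equals 50.
sample = [52.1, 48.3, 55.0, 51.2, 49.8, 53.4, 50.9, 54.2]
result = stats.ttest_1samp(sample, popmean=50)

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
print(decide(result.pvalue, alpha=0.10))
```

Fixing alpha as a default argument before looking at the data mirrors the order of steps described above.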
Normal Distribution:
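import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# x values from -4 to 4 (exclusive) in steps of 0.1
x_axis = np.arange(-4, 4, 0.1)
# density of the standard normal distribution (mean 0, standard deviation 1)
plt.plot(x_axis, norm.pdf(x_axis,0,1))
plt.show()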
Jupyter Notebook: T Test
Z test or T test
- If the population’s standard deviation is known, use z test
- Otherwise, use T-test
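As a rough sketch of the z test case, the statistic can be computed directly from summary statistics when the population standard deviation is known. All numbers below are hypothetical and chosen only for illustration.

```python
import math
from scipy.stats import norm

# Hypothetical summary statistics, chosen only for illustration.
sample_mean = 52.0
population_mean = 50.0   # value under the null hypothesis
population_sd = 10.0     # assumed to be known
n = 100

# z = (sample mean - hypothesized mean) / standard error
standard_error = population_sd / math.sqrt(n)
z = (sample_mean - population_mean) / standard_error

# Two-tailed p-value: area in both tails beyond |z|.
p_value = 2 * norm.sf(abs(z))
print(f"z = {z:.2f}, p = {p_value:.4f}")  # z = 2.00, p = 0.0455
```

When the population standard deviation is unknown, the sample standard deviation replaces it and the t-distribution replaces the normal distribution.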
Comparing means - 4 cases:
- Use Z test: Comparing sample mean to a population mean when the population standard deviation is known
- Use T test: Comparing sample mean to a population mean when the population standard deviation is not known
- Always use T test: Comparing the means of two independent samples with unequal variances
- Always use T test: Comparing the means of two independent samples with equal variances
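The two independent-sample cases differ only in the variance assumption. One way to express this, assuming `scipy.stats.ttest_ind()` is the tool used, is through its `equal_var` argument. The groups below are randomly generated for illustration.

```python
import numpy as np
from scipy import stats

# Randomly generated groups for illustration (fixed seed for repeatability).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=70, scale=5, size=40)
group_b = rng.normal(loc=73, scale=12, size=40)

# equal_var=True runs Student's t-test, which assumes equal variances.
student = stats.ttest_ind(group_a, group_b, equal_var=True)

# equal_var=False runs Welch's t-test, which does not assume equal variances.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

With equal group sizes the two t statistics coincide; the tests differ in their degrees of freedom and therefore in their p-values.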
Type of Test | z or t Statistic* | Expected p-value | Decision |
---|---|---|---|
Two-tailed test | The absolute value of the calculated z or t statistic is greater than 1.96 | Less than 0.05 | Reject the null hypothesis |
One-tailed test | The absolute value of the calculated z or t statistic is greater than 1.64 | Less than 0.05 | Reject the null hypothesis |
* In large samples this rule of thumb also holds for the t-test, because the t-distribution approaches the normal distribution as the sample size grows.
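The cutoffs 1.96 and 1.64 are the standard normal critical values for alpha = 0.05. The sketch below recovers them with `norm.ppf()` and shows the t critical value moving toward 1.96 as the degrees of freedom grow. The particular degrees of freedom are arbitrary choices for illustration.

```python
from scipy.stats import norm, t

alpha = 0.05

# Two-tailed test: alpha is split across both tails.
z_two_tailed = norm.ppf(1 - alpha / 2)   # about 1.96
# One-tailed test: all of alpha sits in one tail.
z_one_tailed = norm.ppf(1 - alpha)       # about 1.64

print(f"two-tailed z critical value: {z_two_tailed:.3f}")
print(f"one-tailed z critical value: {z_one_tailed:.3f}")

# The two-tailed t critical value approaches the z value as df increases.
for df in (10, 30, 100, 1000):
    print(f"df={df}: t critical value = {t.ppf(1 - alpha / 2, df):.3f}")
```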
Levene’s Test
Levene’s test is used to check that variances are equal across all samples, and it remains reliable when your data comes from a non-normal distribution. You can use Levene’s test to check the assumption of equal variances before running a test like One-Way ANOVA.
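A minimal sketch of this check with `scipy.stats.levene()` follows. The three groups are randomly generated for illustration, and the 0.05 threshold is an assumed significance level.

```python
import numpy as np
from scipy import stats

# Randomly generated groups for illustration; group_c has a larger spread.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=52, scale=5, size=30)
group_c = rng.normal(loc=51, scale=15, size=30)

# center="median" is scipy's default and is robust for non-normal data.
statistic, p_value = stats.levene(group_a, group_b, group_c, center="median")
print(f"Levene statistic = {statistic:.3f}, p = {p_value:.4f}")

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Reject equal variances; the equal-variance assumption looks doubtful.")
else:
    print("Fail to reject equal variances.")
```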
ANOVA
ANOVA - Comparing means of more than two groups
Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the “variation” among and between groups) used to analyze the differences among means.
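As an illustration, a one-way ANOVA on three groups can be run with `scipy.stats.f_oneway()`. The group names and scores below are made up for this sketch.

```python
from scipy import stats

# Made-up scores for three independent groups.
group_low = [62, 65, 70, 68, 64, 66]
group_mid = [70, 74, 72, 69, 75, 71]
group_high = [78, 80, 77, 82, 79, 81]

# Null hypothesis: all group means are equal.
result = stats.f_oneway(group_low, group_mid, group_high)
print(f"F = {result.statistic:.3f}, p = {result.pvalue:.6f}")
```

A small p-value suggests that at least one group mean differs, but it does not say which one.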
Correlation Test
Correlation test is used to evaluate the association between two or more variables. For instance, if we are interested to know whether there is a relationship between the heights of fathers and sons, a correlation coefficient can be calculated to answer this question.
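The fathers-and-sons example could be sketched with `scipy.stats.pearsonr()` as follows. The heights are invented for illustration and are not real measurements.

```python
from scipy import stats

# Invented heights in centimetres, for illustration only.
father_heights = [165, 170, 172, 175, 178, 180, 183, 185]
son_heights = [168, 171, 175, 174, 180, 179, 185, 184]

# pearsonr returns the correlation coefficient and a two-sided p-value
# for the null hypothesis of no linear correlation.
r, p_value = stats.pearsonr(father_heights, son_heights)
print(f"r = {r:.3f}, p = {p_value:.5f}")
```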
Jupyter Notebook: Hypothesis Testing
Regression Analysis
Linear regression models a linear relationship between a response variable and one or more predictor variables. It can be used to predict the value of a continuous variable based on the value of another continuous variable. The t-test statistic helps to determine whether a predictor is related to the response: a one-sample t-test is used in linear regression to test the null hypothesis that the slope, or coefficient, is equal to zero. In a multiple regression model, the null hypothesis for each predictor variable is that its coefficient is equal to zero.
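A minimal sketch of the slope test, using `scipy.stats.linregress()` rather than a full regression library. The data are simulated with a true slope of 2, which is an assumption made only for this example.

```python
import numpy as np
from scipy import stats

# Simulated data: y depends linearly on x (true slope 2) plus random noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(loc=0, scale=1.0, size=x.size)

result = stats.linregress(x, y)

# The t statistic for the slope is the estimate divided by its standard error.
t_statistic = result.slope / result.stderr

print(f"slope = {result.slope:.3f}, intercept = {result.intercept:.3f}")
print(f"t = {t_statistic:.2f}, p = {result.pvalue:.3g}")
```

The p-value reported by `linregress()` is for the null hypothesis that the slope is zero, which is the test described above.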
Regression in place of t-test
Regression in place of ANOVA
Regression in place of Correlation
Jupyter Notebook: Regression Analysis
Cheat Sheet for Statistical Analysis in Python
Descriptive Statistics
Here is a quick review of some popular functions:
- To find the average of the data, we use the mean() function
- To find the median of the data, we use the median() function
- To find the mode of the data, we use the mode() function
- To find the variance of the data, we use the variance() function
- To find the standard deviation of the data, we use the stdev() function
- To get the unique values in a dataset, we use the unique() function. .unique() returns the values and nunique() returns the number of unique values.

Note that variance() and stdev() come from Python's statistics module; on a pandas DataFrame or Series the equivalent methods are var() and std().
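The functions above can be tried on a small made-up dataset. The column names `age` and `division` are chosen for illustration, and the sketch uses the pandas methods alongside the `statistics` module functions.

```python
import statistics
import pandas as pd

# Small made-up dataset for illustration.
df = pd.DataFrame({
    "age": [25, 32, 32, 41, 50],
    "division": ["lower", "upper", "upper", "lower", "upper"],
})

print(df["age"].mean())       # average
print(df["age"].median())     # middle value
print(df["age"].mode()[0])    # most frequent value (mode() returns a Series)
print(df["age"].var())        # sample variance (pandas)
print(df["age"].std())        # sample standard deviation (pandas)

# The statistics module gives the same sample variance and standard deviation.
ages = df["age"].tolist()
print(statistics.variance(ages), statistics.stdev(ages))

print(df["division"].unique())    # the distinct values
print(df["division"].nunique())   # how many distinct values there are
```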
Data Visualization
One of the most popular visualization tools is the seaborn library. It is a Python data visualization library that is based on matplotlib. To get access to functions in the seaborn library, or any library, you must first import the library. To import the seaborn library: import seaborn.
Here is a quick summary for creating graphs and plots:
- Barplots: A barplot shows the relationship between a numeric and a categorical variable by plotting the categories as bars whose heights correspond to the numeric variable. In the seaborn library, barplots are created by using the barplot() function. The following code returns a barplot that shows the average evaluation scores for the lower-division and upper-division: ax = seaborn.barplot(x="division", y="eval", data=division_eval)
- Scatterplots: This is a two-dimensional plot that displays the relationship between two continuous variables. Scatter plots are created by using the scatterplot() function in the seaborn library. The following code returns a plot that shows the relationship between age and evaluation scores: ax = seaborn.scatterplot(x='age', y='eval', hue='gender', data=ratings_df)
- Boxplots: A boxplot is a way of displaying the distribution of the data. It shows the minimum, first quartile, median, third quartile, and maximum values of the data. We use the boxplot() function in the seaborn library. This code returns a boxplot with the data distribution for beauty: ax = seaborn.boxplot(y='beauty', data=ratings_df). We can make the boxplot horizontal by specifying x='beauty' in the argument instead.
- Other useful functions include catplot(), to represent the relationship between a numerical value and one or more categorical variables, and distplot() and histplot() for plotting histograms. Note that distplot() is deprecated in recent seaborn releases in favor of histplot() and displot().
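The three plot types can be drawn side by side as a quick sketch. The DataFrame below is a small made-up stand-in for `ratings_df`: the column names follow the snippets above, but the values are invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn

# Made-up stand-in for ratings_df; values are invented for illustration.
ratings_df = pd.DataFrame({
    "division": ["lower", "upper", "lower", "upper", "lower", "upper"],
    "gender": ["female", "male", "male", "female", "female", "male"],
    "age": [35, 42, 51, 38, 60, 47],
    "eval": [4.2, 3.8, 4.5, 3.9, 4.0, 4.4],
    "beauty": [0.3, -0.5, 0.1, 0.8, -0.2, 0.4],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Barplot: mean evaluation score per division.
seaborn.barplot(x="division", y="eval", data=ratings_df, ax=axes[0])
# Scatterplot: age against evaluation score, coloured by gender.
seaborn.scatterplot(x="age", y="eval", hue="gender", data=ratings_df, ax=axes[1])
# Horizontal boxplot: distribution of the beauty score.
seaborn.boxplot(x="beauty", data=ratings_df, ax=axes[2])

fig.tight_layout()
```

In an interactive session, calling `plt.show()` afterwards displays the figure.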
Hypothesis Testing
- Use the norm.cdf() function in the scipy.stats library to find the cumulative probability (the area to the left under the curve) for a standardized (z-score) value. When we are looking for the area to the right, we subtract the result from 1. Remember to import scipy.stats.
- Levene’s test for equal variance: Levene’s test is used to check for equality of variance among groups. We use scipy.stats.levene() from the scipy.stats library.
- T-test for two independent samples: This test compares the means of two independent groups to determine whether there is a significant difference between them. We use scipy.stats.ttest_ind() from the scipy.stats library.
- One-way ANOVA: It compares the means of two or more independent groups to determine whether there is a statistically significant difference between them. We use scipy.stats.f_oneway() from the scipy.stats library, or you can use anova_lm() from the statsmodels library.
- Chi-square (𝜒2) test for association: The chi-square test for association tests the association between two categorical variables. To do this we must first create a crosstab of the counts in each group, using the crosstab() function in the pandas library. Then we run scipy.stats.chi2_contingency() on the contingency table; it returns the 𝜒2 value, p-value, degrees of freedom and expected values.
- Pearson Correlation: Tests the correlation between two continuous variables. We use scipy.stats.pearsonr() to get the correlation coefficient.
- To run the tests using regression analysis, you will need OLS() from the statsmodels library. When running these tests using regression analysis, you have to fit() the model, make predictions using predict() and print out the model summary using model.summary().
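Two of the steps above lend themselves to a short sketch: finding the area to the right of a z-score, and building a crosstab before the chi-square test. The z-score and the `gender` and `tenure` values below are made up for illustration.

```python
import pandas as pd
from scipy import stats

# Area to the right of a z-score: subtract the cumulative probability from 1.
z_score = 1.5
area_left = stats.norm.cdf(z_score)
area_right = 1 - area_left
print(f"P(Z > {z_score}) = {area_right:.4f}")

# Made-up categorical data for the chi-square test for association.
df = pd.DataFrame({
    "gender": ["female"] * 6 + ["male"] * 6,
    "tenure": ["yes", "yes", "yes", "yes", "no", "no",
               "yes", "yes", "no", "no", "no", "no"],
})

# Step 1: build the contingency table of counts.
table = pd.crosstab(df["gender"], df["tenure"])
print(table)

# Step 2: run the test on the table.
chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
print(expected)  # expected counts under independence
```

With counts this small the chi-square approximation is rough; the example only shows the order of the calls.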