
Hypothesis Testing

What is Hypothesis Testing?

Statistics is all about data. A large amount of data is only useful if we analyze it and draw conclusions from it. Hypothesis testing is one of the main tools we use to reach such conclusions.


Statistical hypothesis testing checks an assumption (hypothesis) made about a population and draws a conclusion from it. The test is run on a sample that represents the whole population, and based on the results obtained, the hypothesis is either rejected or retained.

Pre-requisites: DSL | E1 Statistics

Steps in Hypothesis Testing

The three major steps are:

  1. Making an initial assumption.
    • We take the initial assumption as the null hypothesis, H0.
    • Example:
      • We want to know whether the defendant is guilty or innocent.
      • Thus, we take H0 as: the defendant is not guilty.
  2. Collecting the data.
    • This covers not only raw data but any pieces of evidence as well.
    • Example: Fingerprints, DNA, etc.
  3. Weighing the evidence to reject or retain the hypothesis.
    • Example:
      • If the evidence is consistent with H0 -> we keep the Null Hypothesis
      • If the evidence contradicts H0 -> we reject it in favor of the Alternate Hypothesis (H1)

Broken down further, the process consists of seven steps, as shown in the figure below:

[Figure: Steps in Hypothesis Testing]

Null & Alternate Hypothesis

The null and alternate hypotheses are represented by H0 and H1 respectively.

Hypothesis 0 (H0): It is an assumption made about the population which needs to be tested and is considered to be true until evidence is found against it.

Hypothesis 1 (H1): It is the opposite of the assumption made and is accepted when H0 is rejected.

Confusion Matrix

| Decision | H0 is true | H1 is true |
|----------|--------------|---------------|
| Accept H0 | OK | Type II Error |
| Reject H0 | Type I Error | OK |

Scenario-1: Suppose H0 is actually true, but the sample we happened to collect looks like evidence against it. In this case, we reject H0 and accept H1 even though H0 is true. This is called a Type I Error.

Scenario-2: Let's say, H0 = Market will crash. H1 = Market will not crash. Suppose H1 is actually true, but we do not have enough evidence to show it. In this case, we fail to reject H0 even though it is false. This is called a Type II Error.

Thus, both types of error play a major role in Hypothesis Testing. Which of the two matters more depends on the problem at hand, for example on the Machine Learning application in which the test is used.
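
To make the two error types concrete, here is a minimal sketch that labels a decision according to the confusion matrix above. The function name `classify_outcome` and its arguments are illustrative assumptions, not part of any library.

```python
def classify_outcome(h0_is_true: bool, reject_h0: bool) -> str:
    """Label a test decision as OK, Type I Error, or Type II Error."""
    if h0_is_true and reject_h0:
        return "Type I Error"   # rejected a null hypothesis that was true
    if not h0_is_true and not reject_h0:
        return "Type II Error"  # kept a null hypothesis that was false
    return "OK"                 # the decision matches reality


# Scenario-1: H0 is true, but we rejected it.
print(classify_outcome(h0_is_true=True, reject_h0=True))
# Scenario-2: H0 is false, but we failed to reject it.
print(classify_outcome(h0_is_true=False, reject_h0=False))
```

In practice we never know whether H0 is really true, so this labelling is only possible in thought experiments or simulations.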

Significance level, P-value & Confidence Level

The significance level is represented by the Greek letter alpha (α). It is the probability of rejecting the null hypothesis when it is actually true, i.e. the Type I Error rate we are willing to tolerate.

The common values used for alpha are 0.1%, 1%, 5% and 10%. A smaller alpha, such as 1% or 0.1%, demands stronger evidence before the null hypothesis is rejected.

The hypothesis test returns a probability value known as the p-value. Using this value we either reject the null hypothesis in favor of the alternate hypothesis, or fail to reject the null hypothesis.

  • p-value = Probability (data at least as extreme as observed | Null Hypothesis)
    • p-value <= α: Reject the null and accept the alternate hypothesis
    • p-value > α: Fail to reject the null hypothesis
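
The decision rule above can be sketched in a few lines. The function name `decide` and the default `alpha` of 0.05 are assumptions chosen for illustration.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the rule: reject H0 when p-value <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.03))              # p-value below the 5% level
print(decide(0.03, alpha=0.01))  # the same p-value under a stricter alpha
```

The second call shows why alpha should be fixed before looking at the data: the same p-value leads to different decisions under different thresholds.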

Let us try out the above rule by flipping a single coin five times. Here, H0 = the coin is fair.

Experiment Performed:

  • After flipping the coin five times, we got five heads in a row, so X = 5
  • Considering alpha = 0.05
  • p-value: P(X = 5 | H0)

Result:

  • No. of possible outcomes = 2^5 = 32, of which only 1 has all five heads
  • So, P(X = 5 | H0) = 1/32 = 0.03125, i.e. about 0.03
  • 0.03 signifies that a fair coin has only a 3% chance of giving five heads in a row, which is less than alpha.
  • P(X = 5 | H0) = 0.03 < alpha (0.05)
  • The observed data cannot be dismissed, so it is the assumption that gives way: the null hypothesis (H0) is rejected, and the alternate hypothesis is accepted.
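
The coin experiment can be reproduced with a short sketch using only the standard library. The helper `prob_heads` and the variable names are illustrative assumptions; the test is one-sided, matching the calculation above.

```python
from math import comb


def prob_heads(k: int, n: int, p: float = 0.5) -> float:
    """Binomial probability of exactly k heads in n flips."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_flips = 5
heads_seen = 5
alpha = 0.05

# One-sided p-value: probability of at least `heads_seen` heads if the coin is fair.
p_value = sum(prob_heads(k, n_flips) for k in range(heads_seen, n_flips + 1))
reject_h0 = p_value <= alpha

print(f"p-value = {p_value}")      # 1/32
print(f"reject H0: {reject_h0}")
```

Note that a two-sided test, which also counts five tails as equally extreme, would double the p-value to 2/32 = 0.0625 and would not reject H0 at alpha = 0.05.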

The confidence level is obtained by subtracting the significance level of the test from 1.

  • Confidence level = 1 - significance level (α)

[Figure: Gaussian Distribution]

For more information on p-value refer to:

https://www.analyticsvidhya.com/blog/2019/09/everything-know-about-p-value-from-scratch-data-science/

Types of tests and when to use which?

For now, we will focus on the concepts behind the t-test, the chi-square test, and ANOVA, using the sample data below.

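| Gender | Age Group | Weight (kg) | Height (m) |
|--------|-----------|-------------|------------|
| M | Elder | 70 | 1.4 |
| F | Adult | 65 | 1.2 |
| M | Adult | 65 | 1.4 |
| M | Child | 20 | 1 |
| F | Adult | 75 | 1.3 |
| M | Elder | 80 | 1.4 |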

Scenario-1:

Consider the first column of the table, Gender, which is a categorical variable. Our question would be: Is there any difference between the proportions of males and females?

Note: H0 is always assumed to be true initially.

H1 = There is a difference.

Now, when we create a bar plot for M & F, we can see whether there is any difference. However, that holds ONLY for the sample data and NOT for the whole population. Hence, we must start from H0.

H0 = No difference.

Now, to apply the test, we assume that H0 is true and ask: how likely is it that we would see sample data like ours? That probability is the p-value.

Recalling the Gaussian Distribution above, we set the significance level α = 0.05, and as there is one categorical feature, we apply the 'One sample proportion test'. The significance level needs to be fixed before we perform the test. If the result falls within the Confidence Interval, we fail to reject H0. If it falls in the Rejection Region (here, p <= 0.05), we reject the null hypothesis H0 and accept H1. For illustration, suppose for now that the test returns p <= 0.05.

Conclusion: Difference Exists.
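
As a rough sketch of a one-sample proportion test, the code below runs an exact two-sided binomial test on the Gender column of the six-row table. The helper names are assumptions, and the exact binomial test is swapped in here because the sample is far too small for a normal approximation.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_test(successes: int, n: int, p0: float = 0.5) -> float:
    """Two-sided p-value: total probability of outcomes no more likely than the observed one."""
    observed = binom_pmf(successes, n, p0)
    probabilities = [binom_pmf(k, n, p0) for k in range(n + 1)]
    return sum(pr for pr in probabilities if pr <= observed * (1 + 1e-9))


genders = ["M", "F", "M", "M", "F", "M"]  # Gender column from the table
males = genders.count("M")

# H0: the proportion of males in the population is 0.5 (no difference).
p_value = exact_binomial_test(males, len(genders), p0=0.5)
print(f"males = {males} of {len(genders)}, p-value = {p_value}")
```

With only six rows the p-value is 0.6875, well above 0.05, so this toy sample on its own would not reject H0. The "Difference Exists" conclusion above corresponds to the assumed case where the test returns p <= 0.05.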

Scenario-2:

We take two categorical features: Gender and Age Group. The question is: Does the proportion of males and females differ across age groups?

H0 = No difference.

H1 = Difference exists.

Test = Chi-square test.

If p <= 0.05, H1 is accepted.
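
A minimal sketch of the chi-square test of independence on the Gender and Age Group columns follows. The variable names are assumptions. The closed-form p-value `exp(-chi2 / 2)` is valid only because this particular table has 2 degrees of freedom; a general implementation would need the chi-square distribution from a statistics library.

```python
from math import exp

# Contingency table built from the six rows: counts per (Gender, Age Group).
age_groups = ["Child", "Adult", "Elder"]
observed = {
    "M": [1, 1, 2],
    "F": [0, 2, 0],
}

row_totals = {g: sum(counts) for g, counts in observed.items()}
col_totals = [sum(observed[g][j] for g in observed) for j in range(len(age_groups))]
grand_total = sum(row_totals.values())

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = 0.0
for g, counts in observed.items():
    for j, obs in enumerate(counts):
        expected = row_totals[g] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(age_groups) - 1)
assert dof == 2, "closed-form p-value below only holds for 2 degrees of freedom"
p_value = exp(-chi2 / 2)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
```

The expected counts here are all far below 5, so the chi-square approximation is unreliable for this toy table; the sketch only illustrates the mechanics.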

Scenario-3:

We take one numeric continuous variable: Height. The question is: Does the mean height of this sample differ from the mean height of a previous sample?

H0 = No difference.

H1 = Difference exists.

Test = t-test.

If p <= 0.05, H1 is accepted.
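
The one-sample t-test can be sketched as below using the Height column. The reference mean of 1.5 m is a purely hypothetical stand-in for the previous sample's mean, and instead of a p-value the code compares the statistic with the standard two-sided critical value for alpha = 0.05 and 5 degrees of freedom (about 2.571), taken from a t-table.

```python
from math import sqrt
from statistics import mean, stdev

heights = [1.4, 1.2, 1.4, 1.0, 1.3, 1.4]  # Height column from the table
reference_mean = 1.5                      # hypothetical mean of the previous sample
t_critical = 2.571                        # two-sided, alpha = 0.05, df = 5


def one_sample_t(sample: list[float], mu0: float) -> float:
    """t statistic: (sample mean - mu0) / (sample std dev / sqrt(n))."""
    standard_error = stdev(sample) / sqrt(len(sample))
    return (mean(sample) - mu0) / standard_error


t_stat = one_sample_t(heights, reference_mean)
reject_h0 = abs(t_stat) > t_critical  # equivalent to p <= 0.05 for this df

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```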

Scenario-4:

We take two numeric continuous variables: Height and Weight. The question is: Is there a relationship between height and weight?

H0 = No relationship.

H1 = Relationship exists.

Test = Correlation.

If p <= 0.05, H1 is accepted.
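
A correlation test on Height and Weight might look like the following sketch. It computes the Pearson correlation coefficient by hand, converts it to a t statistic with n - 2 degrees of freedom, and compares it with the two-sided critical value for alpha = 0.05 and 4 degrees of freedom (about 2.776, from a t-table). The function and variable names are assumptions.

```python
from math import sqrt

weights = [70, 65, 65, 20, 75, 80]        # Weight (kg) column
heights = [1.4, 1.2, 1.4, 1.0, 1.3, 1.4]  # Height (m) column
t_critical = 2.776                        # two-sided, alpha = 0.05, df = 4


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(weights, heights)
n = len(weights)
# Under H0 (no linear relationship), this statistic follows a t distribution with n - 2 df.
t_stat = r * sqrt(n - 2) / sqrt(1 - r**2)
reject_h0 = abs(t_stat) > t_critical

print(f"r = {r:.3f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```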

Scenario-5:

Consider any one numeric feature and one or more categorical features. If a categorical feature has more than two sub-categories (for example, Age Group has Child, Adult and Elder), we use the ANOVA test to compare the mean of the numeric feature across those sub-categories.
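
As an illustration, a one-way ANOVA of Weight across the three Age Group categories can be sketched as follows. The names are assumptions, and the F statistic is compared with the tabulated critical value for alpha = 0.05 with 2 and 3 degrees of freedom (about 9.55) rather than computing a p-value.

```python
# Weight (kg) grouped by Age Group, taken from the six-row table.
groups = {
    "Child": [20],
    "Adult": [65, 65, 75],
    "Elder": [70, 80],
}
f_critical = 9.55  # alpha = 0.05, df_between = 2, df_within = 3

all_values = [v for values in groups.values() for v in values]
grand_mean = sum(all_values) / len(all_values)

# Between-group and within-group sums of squares.
ss_between = 0.0
ss_within = 0.0
for values in groups.values():
    group_mean = sum(values) / len(values)
    ss_between += len(values) * (group_mean - grand_mean) ** 2
    ss_within += sum((v - group_mean) ** 2 for v in values)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F statistic: ratio of between-group to within-group mean squares.
f_stat = (ss_between / df_between) / (ss_within / df_within)
reject_h0 = f_stat > f_critical

print(f"F = {f_stat:.3f}, reject H0: {reject_h0}")
```

With a single child in the sample, this is only a demonstration of the arithmetic; a real analysis would need several observations per group.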

Source: Medium, YouTube

