
Hypothesis Testing

What is Hypothesis Testing?

Statistics is all about data. A large volume of data becomes useful only when we analyze it and draw conclusions from it. Hypothesis testing is one of the main tools for drawing such conclusions.


Statistical hypothesis testing checks an assumption (hypothesis) about a population and draws a conclusion from it. The test is run on a sample that represents the whole population, and based on the results the null hypothesis is either rejected or not rejected (informally, "accepted").

Pre-requisites: DSL | E1 Statistics

Steps in Hypothesis Testing

The three major steps are:

  1. Making an initial assumption.
    • We will take the initial assumption as the null hypothesis, H0.
    • Example:
      • We want to know whether the defendant is guilty or innocent.
      • Thus, we take H0 as Defendant not guilty.
  2. Collecting the data.
    • Here the data is the evidence gathered for the case.
    • Example: Fingerprints, DNA, etc.
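  3. Weighing the evidence to reject or fail to reject the hypothesis.
    • Example:
      • If the evidence is not strong enough to contradict H0 -> we keep the Null Hypothesis (H0)
      • If the evidence strongly contradicts H0 -> we reject it in favor of the Alternate Hypothesis (H1)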

Broken down further, the process consists of seven steps:

[Figure: Steps in Hypothesis Testing]

Null & Alternate Hypothesis

The null and alternative hypotheses are represented by H0 and H1 respectively.

Hypothesis 0 (H0): An assumption about the population that is to be tested. It is considered true until evidence is found against it.

Hypothesis 1 (H1): The opposite of that assumption. It is accepted when H0 is rejected.

Confusion Matrix

 

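| Decision  | H0 is true   | H1 is true    |
|-----------|--------------|---------------|
| Accept H0 | OK           | Type II Error |
| Reject H0 | Type I Error | OK            |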

Scenario-1: Suppose H0 is actually true, but the sample we happened to draw points against it. The test then rejects H0 and accepts H1 even though H0 is true. This is called a Type 1 Error (a false positive).

Scenario-2: Let's say H0 = Market will crash and H1 = Market will not crash. Suppose H1 is actually true, but we do not have enough evidence to reject H0. The test then retains H0 even though it is false. This is called a Type 2 Error (a false negative).

Thus, both types of error play a major role in hypothesis testing. Which of the two is more costly depends on the problem, and the same trade-off shows up when evaluating Machine Learning algorithms.
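To make the two error types concrete, here is a minimal sketch that computes both error rates exactly for a coin-flip test. The experiment size, the rejection rule and the biased-coin probability are all hypothetical values chosen for illustration; they do not come from the scenarios above.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_reject(n: int, threshold: int, p: float) -> float:
    """Probability that the rule 'reject H0 if heads >= threshold' fires."""
    return sum(binom_pmf(k, n, p) for k in range(threshold, n + 1))


N_FLIPS = 10     # hypothetical experiment size
THRESHOLD = 9    # hypothetical rule: reject H0 on 9 or 10 heads
P_FAIR = 0.5     # H0: the coin is fair
P_BIASED = 0.8   # one specific alternative under H1 (an assumption)

# Type I error: H0 is true (fair coin) but the rule rejects it anyway.
type_1_error = prob_reject(N_FLIPS, THRESHOLD, P_FAIR)

# Type II error: H1 is true (biased coin) but the rule fails to reject H0.
type_2_error = 1 - prob_reject(N_FLIPS, THRESHOLD, P_BIASED)

print(f"Type I error rate  = {type_1_error:.4f}")
print(f"Type II error rate = {type_2_error:.4f}")
```

With this strict rule the Type I error rate is small while the Type II error rate is large. Lowering the threshold would trade one error for the other.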

Significance level, P-value & Confidence Level

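The significance level is represented by the Greek letter alpha (α). It is the probability of rejecting the null hypothesis when it is actually true, that is, the Type I error rate we are willing to tolerate.

The common values used for alpha are 0.1%, 1%, 5% and 10%. A smaller alpha, such as 1% or 0.1%, demands stronger evidence before the null hypothesis is rejected, which makes a rejection more robust.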

The hypothesis test returns a probability value known as the p-value. Comparing this value with α, we either reject the null hypothesis in favor of the alternate hypothesis or fail to reject the null hypothesis.

  • p-value = Probability (data at least as extreme as observed | Null Hypothesis is true)
    • p-value <= α: Reject the null and accept the alternate hypothesis
    • p-value > α: Fail to reject the null hypothesis

Let us try out the above rule by flipping a single coin five times, with H0: the coin is fair.

Experiment Performed:

  • After flipping the coin five times, we got five heads in a row, so X = 5
  • Considering alpha = 0.05
  • p-value: Probability(X = 5 | H0)

Result:

  • No. of possible outcomes of five flips = 2^5 = 32, of which exactly 1 has all five heads
  • So, P(X = 5 | H0) = 1/32 ≈ 0.03
  • 0.03 signifies that a fair coin has only about a 3% chance of giving five heads in a row, which is less than alpha.
  • P(X = 5 | H0) ≈ 0.03 < alpha (0.05)
  • The observed data cannot be discarded, so the null hypothesis (H0) is rejected and the alternate hypothesis is accepted.
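
The coin experiment can be checked by brute force. The sketch below enumerates every possible outcome of five flips, assuming a fair coin under H0, and applies the p-value rule with alpha = 0.05. Variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

ALPHA = Fraction(5, 100)  # significance level used in the experiment

# Under H0 (fair coin) all 2^5 sequences of heads/tails are equally likely.
outcomes = list(product("HT", repeat=5))
all_heads = [o for o in outcomes if all(side == "H" for side in o)]

# p-value = P(X = 5 | H0): share of outcomes that are five heads in a row.
p_value = Fraction(len(all_heads), len(outcomes))

reject_h0 = p_value <= ALPHA
print(p_value, float(p_value), reject_h0)  # 1/32 0.03125 True
```
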
The confidence level is obtained by subtracting the significance level from 1.

  • Confidence level = 1 - significance level (α)

[Figure: Gaussian Distribution]

For more information on p-value refer to:

https://www.analyticsvidhya.com/blog/2019/09/everything-know-about-p-value-from-scratch-data-science/

Types of tests and when to use which?

For now we will focus on the concepts behind the t-test, the chi-square test, and ANOVA, using the following sample data.

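| Gender | Age Group | Weight (kgs) | Height (m) |
|--------|-----------|--------------|------------|
| M      | Elder     | 70           | 1.4        |
| F      | Adult     | 65           | 1.2        |
| M      | Adult     | 65           | 1.4        |
| M      | Child     | 20           | 1          |
| F      | Adult     | 75           | 1.3        |
| M      | Elder     | 80           | 1.4        |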

Scenario-1:

Consider the first column of the table, Gender, which is a categorical feature. Our question would be: Is there any difference between the proportion of males and females?

Note: H0 is always assumed to be true initially.

H0 = No difference.

H1 = There is a difference.

When we create a bar plot for M and F, we can see whether the counts differ. However, that picture describes ONLY the sample data and NOT the whole population, which is why we start from H0.

To apply the test, we assume H0 is true and ask: how likely is a sample like ours under that assumption? That probability is the p-value.

As there is one categorical feature, we apply the 'One sample proportion test'. The significance level (here, α = 0.05) must be chosen before the test is performed. Recalling the Gaussian Distribution above, if the test statistic falls within the confidence interval, we fail to reject H0. If it falls in the rejection region (here, p <= 0.05), we reject the null hypothesis H0 and accept H1. For illustration, assume for now that the test gives p <= 0.05.

Conclusion: Difference Exists.
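As a rough sketch of how a one-sample proportion test is computed, the code below runs a two-sided z-test on the Gender column of the table, with H0: the proportion of males is 0.5. The normal approximation is crude for only six rows, so treat the numbers as an illustration of the mechanics rather than a reliable result. Note also that the conclusion above rests on an assumed p <= 0.05, whereas these six rows on their own give a much larger p-value.

```python
from math import sqrt
from statistics import NormalDist

genders = ["M", "F", "M", "M", "F", "M"]  # Gender column of the table
P0 = 0.5       # H0: males and females are equally frequent
ALPHA = 0.05   # significance level

n = len(genders)
p_hat = genders.count("M") / n            # observed proportion of males

# z statistic: distance of p_hat from P0 in standard-error units under H0.
standard_error = sqrt(P0 * (1 - P0) / n)
z = (p_hat - P0) / standard_error

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_h0 = p_value <= ALPHA

print(f"z = {z:.3f}, p-value = {p_value:.3f}, reject H0: {reject_h0}")
```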

Scenario-2:

We take two categorical features: Gender and Age group. The question is: Does the proportion of males and females differ across age groups?

H0 = No difference.

H1 = Difference exists.

Test = Chi-square test.

If p <= 0.05, H1 is accepted.
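A minimal sketch of the chi-square test of independence on the Gender and Age Group columns follows. It computes the statistic by hand and uses the closed-form p-value exp(-x/2), which holds only for 2 degrees of freedom, as is the case here. The expected counts in such a small table are far below what the chi-square approximation normally requires, so this only illustrates the calculation.

```python
from collections import Counter
from math import exp

# (Gender, Age Group) pairs taken from the table.
rows = [("M", "Elder"), ("F", "Adult"), ("M", "Adult"),
        ("M", "Child"), ("F", "Adult"), ("M", "Elder")]
ALPHA = 0.05

observed = Counter(rows)                       # missing cells count as 0
gender_totals = Counter(g for g, _ in rows)
age_totals = Counter(a for _, a in rows)
n = len(rows)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n under independence (H0).
chi2 = 0.0
for gender, g_total in gender_totals.items():
    for age, a_total in age_totals.items():
        expected = g_total * a_total / n
        chi2 += (observed[(gender, age)] - expected) ** 2 / expected

dof = (len(gender_totals) - 1) * (len(age_totals) - 1)
if dof != 2:
    raise ValueError("the closed-form p-value below is only valid for 2 dof")

p_value = exp(-chi2 / 2)   # chi-square survival function for 2 dof
reject_h0 = p_value <= ALPHA

print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
```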

Scenario-3:

We take one numeric continuous variable: Height. The question is: Compared with a previous sample, is there a difference in the mean height?

H0 = No difference.

H1 = Difference exists.

Test = t-test.

If p <= 0.05, H1 is accepted.
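The one-sample t-test can be sketched on the Height column as follows. The reference mean `MU_0` is a hypothetical stand-in for the previous sample's mean, which the post does not give. Instead of computing a p-value, the sketch compares the statistic with the standard two-sided critical value for alpha = 0.05 and 5 degrees of freedom (about 2.571), which is an equivalent way of making the decision.

```python
from math import sqrt
from statistics import mean, stdev

heights = [1.4, 1.2, 1.4, 1.0, 1.3, 1.4]  # Height (m) column of the table
MU_0 = 1.3          # hypothetical mean height of the previous sample
T_CRITICAL = 2.571  # two-sided critical t value, alpha = 0.05, 5 dof

n = len(heights)

# t statistic: distance of the sample mean from MU_0 in standard-error units.
standard_error = stdev(heights) / sqrt(n)
t_stat = (mean(heights) - MU_0) / standard_error

# |t| > critical value is equivalent to p <= 0.05 for this test.
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```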

Scenario-4:

We take two numeric continuous variables: Height and Weight. The question is: Is there a relationship between height and weight, that is, does height tend to change as weight changes?

H0 = No relationship.

H1 = Relationship exists.

Test = Correlation.

If p <= 0.05, H1 is accepted.
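A minimal sketch of the correlation test on the Weight and Height columns: it computes Pearson's r by hand and converts it to a t statistic with n - 2 degrees of freedom. The decision uses the standard two-sided critical value for alpha = 0.05 and 4 degrees of freedom (about 2.776) rather than an explicit p-value. Pearson's r is assumed here as the correlation measure, since the post does not name one.

```python
from math import sqrt
from statistics import mean

weights = [70, 65, 65, 20, 75, 80]         # Weight (kgs) column
heights = [1.4, 1.2, 1.4, 1.0, 1.3, 1.4]   # Height (m) column
T_CRITICAL = 2.776  # two-sided critical t value, alpha = 0.05, 4 dof

n = len(weights)
mean_w, mean_h = mean(weights), mean(heights)

# Pearson correlation: covariance scaled by both standard deviations.
s_wh = sum((w - mean_w) * (h - mean_h) for w, h in zip(weights, heights))
s_ww = sum((w - mean_w) ** 2 for w in weights)
s_hh = sum((h - mean_h) ** 2 for h in heights)
r = s_wh / sqrt(s_ww * s_hh)

# Under H0 (no linear relationship) this follows a t distribution, n - 2 dof.
t_stat = r * sqrt(n - 2) / sqrt(1 - r**2)
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"r = {r:.3f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

With only six rows, the single Child record has a large influence on r, so the result says little about any wider population.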

Scenario-5:

Consider one numeric feature and one or more categorical features. If a categorical feature has more than two sub-categories (for example, Age Group with Child, Adult and Elder), we use the ANOVA test to compare the mean of the numeric feature across those sub-categories.
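As an illustration, the sketch below runs a one-way ANOVA (a single categorical feature, the simplest case) on Weight grouped by Age Group from the table. It computes the F statistic by hand and compares it with the standard critical value of F(2, 3) at alpha = 0.05 (about 9.55). The groups here are tiny, one of them holding a single row, so this only shows the mechanics.

```python
from statistics import mean

# Weight (kgs) grouped by Age Group, taken from the table.
weights_by_age = {"Elder": [70, 80], "Adult": [65, 65, 75], "Child": [20]}
F_CRITICAL = 9.55  # critical value of F(2, 3) at alpha = 0.05

groups = list(weights_by_age.values())
all_weights = [w for group in groups for w in group]
grand_mean = mean(all_weights)

# Between-group variation: how far each group mean is from the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Within-group variation: how far each value is from its own group mean.
ss_within = sum((w - mean(g)) ** 2 for g in groups for w in g)

df_between = len(groups) - 1
df_within = len(all_weights) - len(groups)

# F statistic: ratio of the between-group to the within-group mean square.
f_stat = (ss_between / df_between) / (ss_within / df_within)
reject_h0 = f_stat > F_CRITICAL

print(f"F = {f_stat:.2f}, reject H0: {reject_h0}")
```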

Source: Medium, YouTube

Reach me on LinkedIn
