Statistics in Data Science

Introduction

Statistics is a widely used discipline centered on data collection, data organization, data analysis, data interpretation, and data visualization. In the past, statistics was practiced mainly by statisticians, economists, and business owners to calculate and present relevant data in their fields. Today, statistics plays a pivotal role in diverse fields such as data science, machine learning, data analysis, business intelligence, computer science, and many more.



Statistics is a type of mathematical analysis that uses quantified models and representations to analyze a set of experimental data or real-world research. The fundamental benefit of statistics is that information is provided in an easy-to-understand style.

Statistical & Non-Statistical Analysis

Statistical analysis is used to better understand a wider population by analyzing data from a sample. It enables inferences about target markets, consumer cohorts, and the general population by extrapolating from sample data to forecast the behavior and attributes of the many based on the few. Data from multiple sources can be combined to support this process.

Non-statistical sampling refers to the selection of a test group based on the examiner's judgment rather than a formal statistical procedure. An examiner, for example, could use his or her own discretion to determine one or more of the following: 

  • The number of samples
  • Which items are chosen for the test group
  • How the outcomes are assessed

Statistical Analysis is the science of collecting, exploring, and presenting large amounts of data to identify patterns and trends.

Non-Statistical Analysis provides information and includes text, sound, still images, and moving images.


Statistical Analysis is also called Quantitative Analysis while Non-Statistical Analysis is termed Qualitative Analysis.

Categories of Statistics

Descriptive statistics is a crucial step before inferential statistics.

There are two types of statistics:

  • Descriptive statistics:
    • Present, organize, summarize, and describe the collected data using measures of center, measures of spread, the shape of the distribution, and outliers.
    • We can also use plots of our data to gain a better understanding.
    • Helps us organize data and focus on the main characteristics of the data.
    • Provides a summary of data numerically or graphically.
  • Inferential statistics:
    • This is where we run different tests on a sample and draw conclusions that we can generalize to a larger population.
    • Performing inferential statistics well requires that we take a sample that accurately represents our population of interest.
    • It generalizes from the sample to the larger dataset and applies probability theory to draw conclusions.
    • It allows us to infer population parameters based on sample statistics and to model relationships within the data.
    • Note: Modelling allows us to develop mathematical equations which describe the inter-relationships among two or more variables.
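
As a rough illustration of the two categories, the sketch below first summarizes a small sample (descriptive) and then uses it to estimate a range for the population mean (inferential). The sample values are made up, and the 95% interval uses a normal approximation, which is an assumption; for a sample this small a t-based interval would usually be preferred.

```python
import math
import statistics

# Hypothetical sample: heights (cm) of 10 people drawn from a larger population.
sample = [162, 170, 168, 175, 159, 181, 166, 172, 169, 178]

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)

# Inferential statistics: estimate the population mean from the sample.
# Normal-approximation 95% confidence interval (assumption, see note above).
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * stdev / math.sqrt(len(sample))
ci_low, ci_high = mean - margin, mean + margin

print(f"sample mean = {mean:.1f}, sample stdev = {stdev:.2f}")
print(f"approx. 95% CI for population mean: ({ci_low:.1f}, {ci_high:.1f})")
```

The first two numbers only describe the ten observed values; the interval is a statement about the unseen population, and it is only as good as the sample is representative.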

Statistical Terminologies

Statistics is used in many sectors:

  • Insurance
  • Stock Market
  • Genetics
  • Medical Studies
  • Shopping
  • Weather Forecasting

There are various statistical terms that one should be aware of while dealing with statistics. Some of them are:

  1. Population: A group from which data is to be collected.
  2. Sample: A sample is a subset of the population.
  3. Variable: A variable is a characteristic of a member of a population that differs in quantity or quality from one member to another.
  4. Quantitative Variable: A variable differing in quantity. Example: weight of a person, number of people in a car, etc.
  5. Qualitative Variable: A variable differing in quality. Example: the color of a car, degree of damage to a car in an accident, etc.
  6. Discrete Variable: A discrete variable is one that cannot take any value between two adjacent values. Example: Number of children in a family.
  7. Continuous Variable: A continuous variable is one that can take any value between two given values. Example: The time taken for a 100m race.
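
The terms above can be made concrete with a small simulation. This is only a sketch: the population of families, the field names `children` and `car_color`, and the sample size of 50 are all invented for illustration.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical population: 1,000 families, each with a discrete quantitative
# variable (number of children) and a qualitative variable (car color).
population = [
    {
        "children": random.randint(0, 4),
        "car_color": random.choice(["red", "blue", "white"]),
    }
    for _ in range(1000)
]

# A sample is a subset of the population, here chosen at random.
sample = random.sample(population, k=50)

pop_mean = sum(f["children"] for f in population) / len(population)
sample_mean = sum(f["children"] for f in sample) / len(sample)

print(f"population mean children = {pop_mean:.2f}")
print(f"sample mean children     = {sample_mean:.2f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean, which is why inferential methods attach uncertainty to sample-based estimates.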

Types of Statistical Measure

There are four types of statistical measures used to describe data:

  1. Measure of Frequency:
    • The frequency of the data indicates the number of occurrences of any particular data value in the given dataset.
    • The measures of frequency are count and percentage.
  2. Measure of Central Tendency:
    • Central tendency indicates whether the data values accumulate in the middle of the distribution or towards the ends.
    • The measures of central tendency are Mean, Median, Mode.
  3. Measure of Spread:
    • Spread describes how similar or varied the set of observed values are for a particular variable.
    • The measures of spread are Standard Deviation, Variance, and Quartiles.
    • The measure of spread is also called the measure of dispersion.
  4. Measure of Position:
    • The position identifies the exact location of a particular data value in the given dataset.
    • The measures of position are Percentiles, Quartiles, and Standard Scores.
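
The four kinds of measures can be computed with Python's standard library. The sketch below uses a made-up list of exam scores; `statistics.quantiles` is called with its default "exclusive" method, and other tools may compute quartiles slightly differently.

```python
import statistics
from collections import Counter

# Hypothetical dataset: exam scores of 11 students.
scores = [55, 60, 60, 65, 70, 70, 70, 75, 80, 85, 90]

# 1. Measures of frequency: count and percentage of each value.
counts = Counter(scores)
percentages = {value: 100 * n / len(scores) for value, n in counts.items()}

# 2. Measures of central tendency.
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# 3. Measures of spread (sample variance and sample standard deviation).
variance = statistics.variance(scores)
stdev = statistics.stdev(scores)

# 4. Measures of position: quartiles and a standard score (z-score).
q1, q2, q3 = statistics.quantiles(scores, n=4)
z_score_of_90 = (90 - mean) / stdev

print(f"count of 70 = {counts[70]} ({percentages[70]:.1f}%)")
print(f"mean = {mean:.2f}, median = {median}, mode = {mode}")
print(f"variance = {variance:.2f}, stdev = {stdev:.2f}")
print(f"quartiles = {q1}, {q2}, {q3}; z-score of 90 = {z_score_of_90:.2f}")
```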

Hypothesis Testing

Hypothesis Testing is an inferential statistical technique to determine whether there is enough evidence in the data sample to infer that a certain condition holds true for the entire population. 

Steps:

  1. State a hypothesis about a population parameter.
  2. Take a random sample.
  3. Analyze the properties of the sample.
  4. Test whether or not the conclusions drawn from the sample correctly represent the population.

Two types of hypothesis:

  1. Null Hypothesis (H0):
    • The null hypothesis is assumed to be true unless there is strong evidence to the contrary.
    • No variation exists between the variables.
    • Example:
      • A pharmaceutical company has introduced a medicine in the market for a particular disease and people have been using it for a considerable period of time, and it is generally considered safe.
      • The claim that the medicine is safe is the null hypothesis.
  2. Alternative Hypothesis (H1):
    • The alternative hypothesis is accepted when the null hypothesis is rejected.
    • Example:
      • In the above example, we need sufficient evidence that the medicine is unsafe in order to reject the null hypothesis. If the null hypothesis is rejected, the alternative hypothesis (the medicine is unsafe) is accepted.
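
To make the medicine example concrete, here is a minimal sketch of a one-sided exact binomial test. Everything numeric is an assumption made for illustration: "safe" is taken to mean an adverse-reaction rate of at most 5%, the sample of 200 patients with 18 reactions is invented, and 0.05 is used as the significance level. The helper `binomial_p_value_upper` is a hypothetical name, not part of any library.

```python
from math import comb


def binomial_p_value_upper(k: int, n: int, p0: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p0): the one-sided p-value under H0."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# H0: the adverse-reaction rate is at most 5% (the medicine is safe).
# H1: the adverse-reaction rate is greater than 5% (the medicine is unsafe).
p0 = 0.05
n_patients = 200   # hypothetical sample size
n_reactions = 18   # hypothetical number of adverse reactions observed
alpha = 0.05       # significance level (assumption)

p_value = binomial_p_value_upper(n_reactions, n_patients, p0)
reject_h0 = p_value < alpha

print(f"p-value = {p_value:.4f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

A small p-value means the observed data would be unlikely if H0 were true, so H0 is rejected in favor of H1. Failing to reject H0 does not prove the medicine is safe; it only means the sample did not provide strong evidence against it.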

Variable Types

Based on the nature of variables, variables are classified into four types:

  1. Nominal variables:
    • This has two or more categories, and the categories have no intrinsic order.
    • Example: Gender and Blood Group
  2. Ordinal variables:
    • This has values in a logical order. However, the relative distance between the two data values is not clear.
    • Example: Size of a coffee cup (S/M/L), ratings of the product (Good/Avg/Bad)
  3. Interval variables:
    • With an interval scale, equal differences between scale values have equal quantitative meaning.
    • An interval scale provides more quantitative information than an ordinal scale. However, an interval scale does not have a true zero point.
    • Example: The Fahrenheit degree scale used to measure temperature.
  4. Ratio variables:
    • Ratio scales are similar to interval scales in that equal differences between scale values have equal quantitative meaning.
    • It has a true zero point.
    • Example: Length measured in inches with a common ruler.
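
One way to see the difference between the four types is to look at which operations give meaningful results on each. The sketch below uses invented data; the helper `fahrenheit_to_celsius` and the variable names are illustrative only.

```python
import statistics

# Nominal: categories with no order, so only counting (the mode) is meaningful.
blood_groups = ["A", "O", "B", "O", "AB", "O", "A"]
most_common_group = statistics.mode(blood_groups)

# Ordinal: ordered categories, so a median is meaningful, but distances are not.
size_order = ["S", "M", "L"]
cup_sizes = ["M", "S", "L", "M", "M", "L", "S"]
ranks = sorted(size_order.index(s) for s in cup_sizes)
median_size = size_order[ranks[len(ranks) // 2]]


# Interval: differences are meaningful, ratios are not (no true zero).
def fahrenheit_to_celsius(f: float) -> float:
    return (f - 32) * 5 / 9


ratio_f = 80 / 40  # looks like "twice as hot" in Fahrenheit
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)  # differs in Celsius

# Ratio: a true zero makes ratios meaningful regardless of the unit.
ratio_in = 24 / 12
ratio_cm = (24 * 2.54) / (12 * 2.54)

print(most_common_group, median_size)
print(f"temperature ratio: {ratio_f:.1f} (F) vs {ratio_c:.1f} (C)")
print(f"length ratio: {ratio_in:.1f} (in) vs {ratio_cm:.1f} (cm)")
```

The temperature ratio changes when the unit changes, which is why "twice as hot" is not a meaningful statement on an interval scale, while "twice as long" holds in both inches and centimeters.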

Hypothesis Testing Procedure

There are two Hypothesis Testing Procedures:

  1. Parametric Tests:
    • Traditional tests such as t-test or ANOVA are called parametric tests. They depend on the specification of a probability distribution except for a set of free parameters.
    • If the population distribution is assumed to be completely described by its parameters, then the test is a parametric test.
  2. Non-parametric Tests:
    • If the population distribution or its parameters are not known and a hypothesis about the population still needs to be tested, then a non-parametric test is used.
    • They do not require any strict distributional assumptions.

Commonly used tests:
  • t-test
  • ANOVA
  • Chi-square (often classified as a non-parametric test)
  • Linear Regression
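
As an illustration of the non-parametric idea, the sketch below runs an exact permutation test on the difference between two group means. The permutation test is not named in the list above; it is used here only because it makes no assumption about the shape of the population distribution. The two groups of measurements are invented.

```python
from itertools import combinations
from statistics import mean

# Hypothetical measurements for two small groups.
group_a = [12.1, 14.3, 13.8, 15.2, 14.9]
group_b = [10.4, 11.9, 12.5, 11.1, 12.0]

observed = mean(group_a) - mean(group_b)

# Under H0 the group labels are interchangeable, so we recompute the
# difference in means for every possible way of relabeling the pooled data.
pooled = group_a + group_b
n_a = len(group_a)
n_b = len(pooled) - n_a
total = sum(pooled)

extreme = 0
n_splits = 0
for idx in combinations(range(len(pooled)), n_a):
    sum_a = sum(pooled[i] for i in idx)
    diff = sum_a / n_a - (total - sum_a) / n_b
    n_splits += 1
    # Small tolerance so the observed split itself is counted despite rounding.
    if abs(diff) >= abs(observed) - 1e-12:
        extreme += 1

# Two-sided p-value: share of relabelings at least as extreme as the observed one.
p_value = extreme / n_splits

print(f"observed difference = {observed:.2f}")
print(f"p-value = {p_value:.4f} ({extreme} of {n_splits} relabelings)")
```

A parametric alternative such as the t-test would instead assume the data in each group come from a normal distribution and compare the same difference against a t distribution.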
Source: https://www.simplilearn.com/