
Descriptive Statistics

Introduction

Descriptive statistics is one way of describing data. In data science, descriptive statistics are used to provide an overview of a large data collection and to check whether or not the data is normally distributed. They are presented in the form of charts, graphs, tables, frequency distributions, and so on, and provide summary statistics that include the Mean, Standard Error, Median, Mode, Standard Deviation, Variance, Kurtosis, Skewness, Range, Minimum, Maximum, Sum, and Count.

Descriptive statistics provides two major pieces of information about the data:

  • The measure of Central Tendency
  • The measure of Dispersion

Descriptive Statistics

Descriptive statistics answer the following questions:

  • What is the value that best describes the data set?
  • How much does a data set spread from its average value?
  • What is the smallest and largest number in a data set?
  • What are the outliers and how do they affect the data set?

Measure of Central Tendency

It describes a whole data column with a single numerical value which represents the center of the distribution. There are three main measures of central tendency: the mode, the median, and the mean.


Mean, Median, Mode, and Range

Limitation of mean:

  • It is affected by extreme values.
  • Very large or very small numbers can distort the answer.

Advantage of median:

  • It is NOT affected by extreme values.
  • Very large or very small numbers do not affect it.

Advantage of mode:

  • It can be used when the data is non-numerical.

Limitation of mode:

  • There may be no mode at all if no value in the data repeats.
  • There may be more than one mode.

By now, everyone should know how to calculate the mean, median, and mode. Each gives us a measure of central tendency (i.e. where the center of our data falls), but they often give different answers. So how do we know when to use each? Here are some general rules (a short code sketch follows the list):

  • Mean is the most frequently used measure of central tendency and is generally considered the best one; however, there are situations where the median or mode is preferred. The mean is the preferred measure when:
    • Your data is not skewed, i.e. it is roughly normally distributed. In other words, there are no extreme values (outliers) present in the data set.
  • Median is the preferred measure of central tendency when:
    • There are a few extreme scores in the distribution of the data. (NOTE: Remember that a single outlier can have a great effect on the mean).
    • There are some missing or undetermined values in your data.
    • There is an open-ended distribution. (For example, if you have a data field that records the number of children and the options are '0', '1', '2', '3', '4', '5' or '6 or more', the '6 or more' category is open-ended and makes calculating the mean impossible, since we do not know the exact values in that category.)
    • You have data measured on an ordinal scale (ordered categories). Example: a Likert scale -- 1. Strongly dislike, 2. Dislike, 3. Neutral, 4. Like, 5. Strongly like.
  • Mode is the preferred measure of central tendency when:
    • Data are measured on a nominal scale (unordered categories) and sometimes even on an ordinal scale (ordered categories).
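
As a rough illustration, here is a minimal Python sketch (standard library only, on a made-up sample) that computes all three measures. Note how the single extreme value pulls the mean away from the median and mode:

    # Hypothetical sample; 48 acts as an outlier.
    import statistics

    data = [2, 3, 3, 5, 7, 10, 48]

    mean = statistics.mean(data)      # ~11.14 -- pulled upward by the outlier
    median = statistics.median(data)  # 5      -- unaffected by the outlier
    mode = statistics.mode(data)      # 3      -- the most frequent value

    print(f"mean={mean:.2f}, median={median}, mode={mode}")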

Measure of Dispersion

It refers to the spread or dispersion of scores. There are four main measures of variability: Range, Interquartile range, Standard deviation, and Variance.

Two features may have the same mean, median, and mode, so a measure of central tendency alone is insufficient to accurately characterize the data. As a result, knowing the measure of dispersion is essential: it describes how spread out the data values are.

The various measures of dispersion are:

  • Range
  • Mean absolute deviation (MAD)
  • Variance
  • Standard deviation
  • Coefficient of variation
  • Coefficient of skewness

Range:

The range gives information about how much spread the given data has. The calculation of the range is very simple; use the following formula:

  • Maximum value  -  Minimum value
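
A quick sketch of that calculation (made-up sample, plain Python):

    data = [4, 8, 15, 16, 23, 42]

    data_range = max(data) - min(data)  # 42 - 4 = 38
    print("Range:", data_range)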

Advantages:

  • Easy to calculate.

Limitations: 

  • It is very sensitive to outliers and does not use all the observations in a data set.

Mean absolute deviation (MAD):

Mean Absolute Deviation is the average distance between each data value and the mean. To calculate MAD, use the following formula:

  • MAD = (1/n) × Σ |xᵢ - x̄|

where xᵢ = each observation, x̄ = mean of the observations, and n = total number of observations.
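
A minimal sketch of the formula above (made-up sample, built-ins only):

    data = [3, 6, 6, 7, 8, 11, 15, 16]

    n = len(data)
    mean = sum(data) / n                        # x bar = 9.0
    mad = sum(abs(x - mean) for x in data) / n  # average absolute distance = 3.75
    print("MAD:", mad)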

Variance:

It gives us an understanding of how far the Data is from the mean.

  • High variance means data points are very spread out from the mean and from one another.
  • Low variance means data points are close to each other and to the mean.

To calculate the variance, use the following formula:

  • Variance = Σ (xᵢ - x̄)² / n
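
A minimal sketch of the formula above (same made-up sample; note this is the population form, dividing by n, while many libraries default to the sample form, dividing by n - 1):

    data = [3, 6, 6, 7, 8, 11, 15, 16]

    n = len(data)
    mean = sum(data) / n                               # x bar = 9.0
    variance = sum((x - mean) ** 2 for x in data) / n  # 148 / 8 = 18.5
    print("Variance:", variance)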

Standard deviation:

Standard deviation is simply the square root of the variance. It is used to quantify the amount of dispersion of a set of data values from the average (mean).

  • Low standard deviation means most of the data values are close to the average.
  • High standard deviation means data values are far away from the mean.

To calculate Standard Deviation, first, calculate the variance and then find the square root of it.
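
A minimal sketch (same made-up sample and population form as above):

    import math

    data = [3, 6, 6, 7, 8, 11, 15, 16]

    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    std_dev = math.sqrt(variance)  # sqrt(18.5) is about 4.30
    print("Standard deviation:", round(std_dev, 2))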

Advantages:

  • It gives a better picture of your data than just the mean alone.

Limitations:

  • It doesn't give a clear picture of the whole range of the data.
  • It can give a skewed picture if the data contain outliers.

Sources: ListenData, Medium, LumenLearning

Reach me on LinkedIn
