Descriptive Statistics

Introduction

Descriptive statistics summarize a data set. In data science, they give an overview of a large data collection and help check, for example, whether the data is approximately normally distributed. The results are presented as charts, graphs, tables, frequency distributions, and so on. Common summary statistics include the mean, standard error, median, mode, standard deviation, variance, kurtosis, skewness, range, minimum, maximum, sum, and count.

Descriptive statistics provide two main kinds of information about the data:

  • The measure of Central Tendency
  • The measure of Dispersion

Descriptive statistics answer the following questions:

  • What is the value that best describes the data set?
  • How much does a data set spread from its average value?
  • What are the smallest and largest numbers in a data set?
  • What are the outliers, and how do they affect the data set?

Measure of Central Tendency

It describes a whole data column with a single numerical value which represents the center of the distribution. There are three main measures of central tendency: the mode, the median, and the mean.


Limitation of mean:

  • It is affected by extreme values.
  • Very large or very small numbers can distort the answer.

Advantage of median:

  • It is NOT affected by extreme values.
  • Very large or very small numbers do not affect it.

Advantage of mode:

  • It can be used when the data is non-numerical.

Limitation of mode:

  • There may be no mode at all if none of the data is the same.
  • There may be more than one mode.
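
The points about the mode can be illustrated with a minimal sketch using `statistics.multimode` from the Python standard library (Python 3.8 or later). The sample lists are made up for illustration. Note that when no value repeats, `multimode` returns every value, which in practice means the data has no useful mode.

```python
from statistics import multimode

# Non-numerical data: the mode is still well defined.
colors = ["red", "blue", "red", "green"]
color_modes = multimode(colors)

# No repeated values: every value ties, so there is no meaningful mode.
all_unique = [1, 2, 3, 4]
unique_modes = multimode(all_unique)

# Two values tie for most frequent: the data is bimodal.
two_peaks = [1, 1, 2, 2, 3]
peak_modes = multimode(two_peaks)

print(color_modes)   # ['red']
print(unique_modes)  # [1, 2, 3, 4]
print(peak_modes)    # [1, 2]
```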

Mean, median, and mode each give a measure of central tendency (i.e. where the center of the data falls), but they often give different answers. So how do we know when to use each? Here are some general rules:

  • Mean is the preferred measure of central tendency when:
    • No special circumstance applies. It is the most frequently used measure of central tendency and is generally considered the best default, although in some situations the median or mode is preferred.
    • Your data is not skewed, i.e. it is roughly symmetric (as in a normal distribution) and there are no extreme values (outliers) in the data set.
  • Median is the preferred measure of central tendency when:
    • There are a few extreme scores in the distribution of the data. (NOTE: Remember that a single outlier can have a great effect on the mean).
    • There are some missing or undetermined values in your data.
    • There is an open-ended distribution. For example, suppose a data field records the number of children and the options are '0', '1', '2', '3', '4', '5' or '6 or more'. Here, the '6 or more' option is open-ended and makes calculating the mean impossible, since we do not know the exact values in that category.
    • You have data measured on an ordinal scale (ordered categories). Example: a Likert scale (1. Strongly dislike 2. Dislike 3. Neutral 4. Like 5. Strongly like).
  • Mode is the preferred measure of central tendency when:
    • Data are measured on a nominal scale (unordered categories), and sometimes on an ordinal scale (ordered categories).
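
To see why the median is preferred when extreme scores are present, here is a small sketch using Python's standard `statistics` module. The numbers are invented for illustration only: adding a single outlier shifts the mean substantially while the median barely moves.

```python
from statistics import mean, median

# Hypothetical values with no extreme scores.
values = [30, 32, 35, 38, 40]

# The same values plus one extreme score.
values_with_outlier = values + [400]

mean_before = mean(values)                   # 35
median_before = median(values)               # 35

mean_after = mean(values_with_outlier)       # roughly 95.8
median_after = median(values_with_outlier)   # 36.5

print(f"mean:   {mean_before} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")
```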

Measure of Dispersion

It refers to the spread or dispersion of scores. There are four main measures of variability: Range, Interquartile range, Standard deviation, and Variance.

Two features may have the same mean, median, and mode yet be distributed very differently, so a measure of central tendency alone is insufficient to characterize the data. A measure of dispersion fills this gap by describing how much the values vary.

The various measures of dispersion:

  • Range
  • Mean absolute deviation (MAD)
  • Variance
  • Standard deviation
  • Coefficient of variation
  • Coefficient of skewness

Range:

The range shows how much spread the given data has. It is simple to calculate:

  • Range = Maximum value - Minimum value

Advantages:

  • Easy to calculate.

Limitations: 

  • It is very sensitive to outliers and does not use all the observations in a data set.
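
A minimal sketch of the range calculation, with made-up scores, shows this sensitivity. The helper name `value_range` is just an illustrative choice. Because the range uses only the two extreme observations, one outlier changes it dramatically.

```python
def value_range(values):
    """Return the range: maximum value minus minimum value."""
    return max(values) - min(values)

# Hypothetical scores clustered close together.
scores = [52, 55, 57, 58, 60]
range_without_outlier = value_range(scores)        # 60 - 52 = 8

# One extreme score is enough to inflate the range.
range_with_outlier = value_range(scores + [100])   # 100 - 52 = 48

print(range_without_outlier, range_with_outlier)
```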

Mean absolute deviation (MAD):

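Mean absolute deviation is the average distance between each data value and the mean. To calculate MAD, use the following formula:

$$\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert$$

where $x_i$ are the observations, $\bar{x}$ (x bar) is the mean of the observations, and $n$ is the total number of observations.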
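
As a rough sketch, the MAD definition translates directly into a few lines of Python. The function name `mean_absolute_deviation` and the sample data are assumptions made for illustration.

```python
from statistics import mean

def mean_absolute_deviation(values):
    """Average absolute distance between each value and the mean."""
    center = mean(values)
    return sum(abs(x - center) for x in values) / len(values)

# Hypothetical observations; their mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Absolute deviations are 3, 1, 1, 1, 0, 0, 2, 4, which sum to 12.
mad = mean_absolute_deviation(data)
print(mad)  # 1.5
```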

Variance:

Variance tells us how far the data lie from the mean, measured as the average squared deviation.

  • High variance means data points are very spread out from the mean, and from one another.
  • Low variance means data points are close to each other and to the mean.

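To calculate the variance, use the following formula (shown here for a full population; for a sample, divide by n - 1 instead of n):

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$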
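
A short sketch of the variance calculation, using Python's standard `statistics` module and invented data. `pvariance` divides by n (population variance), while `variance` divides by n - 1 (sample variance); which one applies depends on whether the data is the whole population or a sample from it.

```python
from statistics import pvariance, variance

# Hypothetical observations; their mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Squared deviations from the mean sum to 32.
population_variance = pvariance(data)  # 32 / 8 = 4
sample_variance = variance(data)       # 32 / 7, roughly 4.57

print(population_variance)
print(round(sample_variance, 2))
```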


Standard deviation:

Standard deviation is the square root of the variance. It quantifies the dispersion of a set of data values around the average (mean).

  • Low standard deviation means most of the data values are close to the average.
  • High standard deviation means the data values are spread far from the mean.

To calculate the standard deviation, first calculate the variance and then take its square root.
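
The two-step calculation can be sketched as follows, again with invented data and the standard `statistics` module. The population versions (`pvariance`, `pstdev`) are used here as an assumption; for a sample, `variance` and `stdev` would be the counterparts.

```python
import math
from statistics import pstdev, pvariance

# Hypothetical observations; their population variance is 4.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: calculate the variance. Step 2: take its square root.
std_from_variance = math.sqrt(pvariance(data))

# The library function gives the same result in one call.
std_direct = pstdev(data)

print(std_from_variance)  # 2.0
```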

Advantages:

  • It gives a better picture of your data than just the mean alone.

Limitations:

  • It doesn't give a clear picture of the whole range of the data.
  • It can give a skewed picture if data contain outliers.

Sources: ListenData, Medium, LumenLearning
