
Exploratory Data Analysis with Python

How many times have you been baffled by the beginner stages of data analysis? Probably many times, as I was too. So in this post, I will walk you through how to do Exploratory Data Analysis (EDA) with Python.


EDA is one of the basic steps involved in data analysis. Before you begin the analysis itself, understanding the data is very important. You should be familiar with the terminology of the kind of data you are working with. For instance, if you are working with financial data such as stocks, you should be familiar with terms like open, close, high, low, volume and so on. Now that we know why understanding the data matters, let’s move ahead with the analysis part!

Why Data Analysis?

We analyze data to draw inferences from it and to estimate likely future outcomes based on its historical values. Continuing the example above, suppose I analyze the performance of a particular stock from 2000 to 2020. I will find certain trends in the graph of its daily value. If it shows an upward trend for most of that period, I may want to invest in it; otherwise, I will analyze other stocks. An important thing to remember here is:

Analysis is never done on the basis of one or two parameters. There are usually many more: 5, 10, 20 or any number.

EDA is the first step of the analysis. EDA helps to:

  1. Get inferences from the data.
  2. Understand how the parameters of the data are correlated.
  3. Extract the important parameters and set aside the ones that are not.
  4. Find relationships between those parameters.
  5. Work towards a conclusion.

Let’s get our hands dirty with data

I will be using data provided by World Bank Data (Science and Technology Indicators). This dataset covers every country in the world, so I will narrow it down to India to make the insights easier to read. You can download it from here: Dataset (India)

1. Importing packages
  • We will work with pandas, numpy, seaborn and matplotlib here for data analysis and visualizations.
  • We use %matplotlib inline so that graphs are plotted within the notebook.
#import packages
import pandas as pd
import numpy as np
import seaborn as sns

#to plot within notebook
import matplotlib.pyplot as plt
%matplotlib inline

2. Loading the dataset and getting an overview of the data
  • Pass the file path to read_csv; df.head() will then print the first 5 rows of the dataset.
  • I am using Google Colaboratory (Colab), hence we will write ‘/content/India.csv’ as the file path.
df = pd.read_csv('/content/India.csv')
df.head()


  • To get the shape of the data, i.e. the number of rows and columns, we will use df.shape
df.shape


  • To obtain the names of the columns we will write df.columns.values
df.columns.values 


  • To find out which columns the dataset contains, what type each one is, and how many non-null values each one holds, we use the info() function.
df.info()
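
If you want the missing values as numbers rather than reading them off the info() output, a per-column count is one option. This is a minimal sketch on a small made-up frame; the column names and values are hypothetical stand-ins, not the actual India dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the India dataset; names and values are made up.
sample = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003],
    "Indicator A": [1.2, np.nan, 1.5, 1.7],
    "Indicator B": [np.nan, np.nan, 40.0, 42.0],
})

# Number of missing values in each column
missing_counts = sample.isna().sum()

# Fraction of missing values in each column (between 0 and 1)
missing_share = sample.isna().mean()

print(missing_counts)
print(missing_share)
```

On the real data, the same two lines would be called on df instead of sample.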


3. Statistical values
  • describe() method will give us statistical values of all the columns individually.
  • We can get to know mean, standard deviation, quartiles, min, max, etc.
df.describe()

  • From the output above, we can see that the mean of the first five columns is greater than or equal to the median (the 50% row).
  • There is a relatively big difference between the 75% and max values of the columns.
  • Together, these two observations suggest that the data is skewed and may contain unusually large values (outliers).
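
The two checks above can be sketched on a single column. The values below are made up for illustration, and the 1.5 × IQR fence is one common rule of thumb for flagging outliers, not something prescribed by the dataset.

```python
import pandas as pd

# Hypothetical values for one indicator; 40.0 is deliberately extreme.
values = pd.Series([2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 5.0, 6.0, 40.0],
                   name="hypothetical_indicator")

mean_value = values.mean()
median_value = values.median()

# Interquartile range and the usual 1.5 * IQR fences
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Rows that fall outside the fences
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("mean:", mean_value, "median:", median_value)
print("outliers:", outliers.tolist())
```

A mean well above the median, together with values beyond the upper fence, is the same pattern that the describe() output hints at.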

4. Pair Plot graph
  • sns.pairplot plots every numeric column against every other column, so that we can get a rough idea of how one column is related to another.
  • EDA in one line of code: sns.pairplot(df)
sns.set_style('whitegrid')
g = sns.pairplot(df)
g.fig.set_size_inches(20,20)


5. Correlation Matrix
  • A correlation matrix measures the linear relationship between every pair of columns, which is why it is commonly checked before Linear Regression. The matrix is symmetric: the correlation of A with B equals that of B with A.
  • We have used annot=True to print the values on the heatmap.
  • Values close to 1 indicate a strong positive correlation, values close to -1 a strong negative correlation, and values close to 0 little or no linear relationship.
  • Exactly 1 is a perfect positive linear relationship, and exactly -1 is a perfect negative one.
  • Drawback of the correlation matrix: it will not reveal non-linear relationships between columns.
# corr() already returns a DataFrame; numeric_only=True (pandas 1.5+) skips text columns
dataset = df.corr(numeric_only=True)
dataset

plt.figure(figsize=(10,5))
sns.set_style("white")
sns.heatmap(dataset, robust=True, annot=True)


  • From the heatmap above, we can see a strong positive correlation between Inflation, average consumer prices and Inflation, end of period consumer prices.
  • There is also a strong negative correlation between General Government net lending/borrowing and Year.
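
To see the drawback of the correlation matrix in practice, here is a minimal sketch on synthetic data (the columns x, linear and squared are made up and are not part of the India dataset). The squared column is fully determined by x, yet its Pearson correlation with x is essentially zero.

```python
import numpy as np
import pandas as pd

# Synthetic, symmetric range of x values
x = np.linspace(-1.0, 1.0, 201)

demo = pd.DataFrame({
    "x": x,
    "linear": 3.0 * x + 1.0,   # perfectly linear in x
    "squared": x ** 2,         # perfectly determined by x, but not linear
})

# Pearson correlation, the default method of DataFrame.corr()
corr = demo.corr()
print(corr.round(3))
```

The linear column correlates perfectly with x, while the squared column shows almost no correlation even though it depends entirely on x.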

6. PPS Matrix
  • PPS stands for Predictive Power Score.
  • To overcome the drawback of the correlation matrix, we will use the PPS matrix.
  • This matrix also captures non-linear relationships among variables. It is asymmetric, because how well x predicts y can differ from how well y predicts x.
  • The values are read differently from the correlation matrix: PPS ranges from 0 (no predictive power) to 1 (perfect predictive power), and it has no negative values.
# install ppscore from within the notebook
%pip install ppscore
import seaborn as sns
import matplotlib.pyplot as plt
import ppscore as pps
sns.set_style('whitegrid')
plt.figure(figsize=(10,5))
matrix_df = pps.matrix(df).pivot(columns='x', index='y',  values='ppscore')
sns.heatmap(matrix_df, annot=True, cmap='Blues')


  • To get the PPS of every other column as a predictor of one particular target column, we will write the line of code below:
pps.predictors(df, "Researchers in R&D (per million people)")
# every other column is tried as the feature x; the column named here is the target y
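
To build some intuition for why a predictive score can be asymmetric, here is a toy sketch. It is not the algorithm used by ppscore: the function toy_predictive_score is a hypothetical helper that predicts y from the median of y inside equal-sized bins of x, and compares its mean absolute error with a baseline that always predicts the overall median. The data is synthetic.

```python
import numpy as np

def toy_predictive_score(x, y, n_bins=20):
    """Toy stand-in for a predictive score (NOT the ppscore algorithm)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Baseline: always predict the overall median of y
    mae_naive = np.mean(np.abs(y - np.median(y)))

    # Model: sort rows by x, cut them into equal-sized bins,
    # and predict the median of y within each bin
    order = np.argsort(x, kind="stable")
    abs_errors = np.empty_like(y)
    for idx in np.array_split(order, n_bins):
        abs_errors[idx] = np.abs(y[idx] - np.median(y[idx]))
    mae_model = abs_errors.mean()

    # 1 means far better than the baseline, 0 means no better
    return max(0.0, 1.0 - mae_model / mae_naive)

# Synthetic data: y is fully determined by x, but x is not determined by y
x = np.linspace(-1.0, 1.0, 2001)
y = x ** 2

score_x_to_y = toy_predictive_score(x, y)
score_y_to_x = toy_predictive_score(y, x)
print("x predicts y:", round(score_x_to_y, 2))
print("y predicts x:", round(score_y_to_x, 2))
```

Knowing x tells you y almost exactly, but knowing y only tells you the size of x and not its sign, so the two directions score very differently. This is the kind of asymmetry the PPS matrix is meant to surface.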



The analysis part ends here. I would like to end the post with a quote:
Data is what you need to do analytics, information is what you need to do business.
— John Owen
