Exploratory Data Analysis with Python

How many times have you been baffled by the beginner’s part of Data Analysis? Probably many, as I was too. So in this post, I will walk you through how to do Exploratory Data Analysis (EDA) with Python.

EDA is one of the basic steps involved in Data Analysis. Before you begin the analysis itself, understanding the data is very important. You should be familiar with the terminology of the type of data you are working with. For instance, if you are working with financial data such as stocks, you should be familiar with terms like open, close, high, low, volume and so on. Alright, now that we know understanding the data is important, let’s move ahead with the analysis part!

Why Data Analysis?

To draw inferences from the data and to estimate possible future outcomes based on its historical values. Continuing the above example, let’s say I analyzed the performance of a particular stock from 2000 to 2020. I will find certain trends in the graph of its everyday value. If it shows an upward trend for most of that period, then I might like to invest in it; otherwise I will analyze other stocks. An important thing to remember here is:

Analysis is never done on the basis of one or two parameters. There are usually many other parameters: 5, 10, 20 or any number.

EDA is the first step of the analysis. It helps to:

Draw inferences from the data.
Understand how the parameters of the data are correlated.
Extract the important parameters and drop the ones that are not.
Find relationships between those parameters.
Reach a conclusion.

Let’s get our hands dirty with data

I will be using data provided by World Bank Data (Science and Technology Indicators). This dataset covers every country in the world, so I will narrow it down to India to make the insights easier to get. You can download it from here: Dataset (India)

1. Importing packages

We will work with pandas, numpy, seaborn and matplotlib here for data analysis and visualization.
The %matplotlib inline line makes our graphs render within the notebook.

#import packages
import pandas as pd
import numpy as np
import seaborn as sns

#to plot within notebook
import matplotlib.pyplot as plt
%matplotlib inline

2. Loading the dataset and getting an overview of the data

Pass the file path to read_csv; df.head() will then print the first 5 rows of the dataset.
I am using Google Colaboratory (Colab), hence we will write ‘/content/India.csv’ as the file path.

df = pd.read_csv('/content/India.csv')
df.head()

To get the shape of the data, i.e. the number of rows and columns, we will use df.shape

df.shape

To obtain the names of the columns we will write df.columns.values

df.columns.values 

To find out which columns the dataset contains, what their types are, and how many non-null values each one holds, we will use the info() function.

df.info()
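
As a small complement to info(), here is a minimal sketch of how the missing values per column can be counted directly. The DataFrame below is made up for illustration (the column names "Indicator A" and "Indicator B" are hypothetical and are not taken from the India dataset); on the real data you would call the same methods on df.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset; names and values are made up
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003],
    "Indicator A": [1.5, np.nan, 2.5, 3.0],
    "Indicator B": [np.nan, np.nan, 7.0, 8.0],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Share of missing values in each column, as a percentage
missing_pct = toy.isna().mean() * 100

print(missing_counts)
print(missing_pct)
```

Columns with a high share of missing values are usually the first candidates to drop or to fill before going further.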

3. Statistical values

The describe() method gives us summary statistics for each numeric column individually.
We get the count, mean, standard deviation, min, quartiles and max.

df.describe()

From the above data, we can conclude that the mean of the first five columns is greater than or equal to the median (the 50% row).
There is a relatively big difference between the 75% and max values of the columns.
Together, these two observations suggest that the data is skewed to the right and may contain some unusually large values (outliers).
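
To make these two checks concrete, here is a minimal sketch on made-up numbers (not the India dataset). It compares the mean with the median and then flags large values with the common 1.5 × IQR rule. That rule is my choice for the illustration; the analysis above only reads the describe() output.

```python
import pandas as pd

# Made-up values with one unusually large observation
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="toy_indicator")

mean_value = values.mean()
median_value = values.median()

# Interquartile range (IQR) rule: flag points above Q3 + 1.5 * IQR
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = values[values > upper_fence]

print(f"mean={mean_value:.2f}, median={median_value:.2f}")
print(outliers)
```

Here the single large value pulls the mean well above the median, which is the same pattern described above.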

4. Pair Plot graph

sns.pairplot plots every numeric column against every other one, so that we get a rough idea of how one column is related to another.
EDA in one line of code: sns.pairplot(df)

sns.set_style('whitegrid')
g = sns.pairplot(df)
g.fig.set_size_inches(20,20)

5. Correlation Matrix

The correlation matrix holds the correlation coefficient of every pair of numeric columns. It is symmetric, because the correlation of A with B is the same as that of B with A, and it is often used when choosing features for Linear Regression.
We have used annot=True to print the values on the heatmap.
Values near 1 show a positive correlation, values close to -1 show a negative correlation, and values near 0 show little or no linear relationship.
Exactly 1 indicates a perfect positive relationship, while exactly -1 indicates a perfect negative one.
Drawback of the correlation matrix: it will not reveal columns that have a non-linear relationship.
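
This drawback is easy to reproduce. The sketch below uses made-up data in which y is fully determined by x (y = x²), yet the Pearson correlation that pandas computes by default comes out at essentially zero, because the relationship is not linear.

```python
import numpy as np
import pandas as pd

# x is symmetric around zero; y depends on x exactly, but not linearly
x = np.arange(-10, 11)
toy = pd.DataFrame({"x": x, "y": x ** 2})

# Pearson correlation (the pandas default) between x and y
r = toy["x"].corr(toy["y"])
print(r)
```

A heatmap cell for this pair would look like "no relationship", which is why a second measure is worth checking.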

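# df.corr() already returns a DataFrame, so no conversion is needed
# numeric_only=True (available from pandas 1.5) skips any non-numeric columns
dataset = df.corr(numeric_only=True)
dataset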

plt.figure(figsize=(10,5))
sns.set_style("white")
sns.heatmap(dataset, robust=True, annot=True)

From the above, we can see a strong positive correlation between Inflation, average consumer prices and Inflation, end of period consumer prices.
However, there is a strong negative correlation between General Government net lending/borrowing and Year.
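
Reading the strongest pairs off a large heatmap by eye can get tedious. Here is a minimal sketch of ranking the pairs instead, using made-up columns a, b, c and d (hypothetical names, not from the India dataset). On the real data the same steps would apply to the correlation matrix computed above.

```python
import numpy as np
import pandas as pd

# Made-up data: b rises with a, c falls with a, d has no clear pattern
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],
    "c": [12, 10, 9, 7, 4, 1],
    "d": [3, 1, 4, 1, 5, 2],
})

corr = toy.corr()

# Indices of the upper triangle (k=1 skips the diagonal) so each pair appears once
rows, cols = np.triu_indices(len(corr), k=1)
labels = [f"{corr.index[i]} vs {corr.columns[j]}" for i, j in zip(rows, cols)]
pairs = pd.Series(corr.to_numpy()[rows, cols], index=labels)

# Sort by absolute strength, strongest first, keeping the sign visible
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting on the absolute value keeps strong negative pairs near the top alongside strong positive ones.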

6. PPS Matrix

PPS stands for Predictive Power Score.
To overcome the drawback of the correlation matrix, we will use the PPS matrix.
This matrix can also pick up non-linear relationships among variables. It is asymmetric, because how well x predicts y can differ from how well y predicts x.
Unlike correlation, the score ranges from 0 (no predictive power) to 1 (perfect prediction) and has no sign, so it does not tell us the direction of a relationship.
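
Why the matrix is asymmetric can be seen without ppscore at all. The sketch below is not the PPS computation; it is only a made-up example (y = x²) that counts how many distinct values of one column go with each value of the other.

```python
import numpy as np
import pandas as pd

x = np.arange(-5, 6)
toy = pd.DataFrame({"x": x, "y": x ** 2})

# How many distinct y values go with each x? Exactly one: x pins down y
y_per_x = toy.groupby("x")["y"].nunique().max()

# How many distinct x values go with each y? Up to two: y does not pin down x
x_per_y = toy.groupby("y")["x"].nunique().max()

print(y_per_x, x_per_y)
```

Since x determines y but y leaves two candidates for x, a score for "x predicts y" should be higher than one for "y predicts x", and that is the kind of difference an asymmetric matrix can show.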

pip install ppscore

import seaborn as sns
import matplotlib.pyplot as plt
import ppscore as pps
sns.set_style('whitegrid')
plt.figure(figsize=(10,5))
matrix_df = pps.matrix(df).pivot(columns='x', index='y',  values='ppscore')
sns.heatmap(matrix_df, annot=True, cmap='Blues')

To get the PPS of every other column as a predictor of one particular target column, we will write the line of code below:

pps.predictors(df, "Researchers in R&D (per million people)")
# x is feature, y is target

The analysis part is complete here. I would like to end the post with a quote:

Data is what you need to do analytics, information is what you need to do business.
— John Owen
