How many times have you been baffled by the beginner stage of data analysis? Probably many times, as I was too. So in this post, I will walk you through Exploratory Data Analysis (EDA) with Python.
EDA is one of the basic steps in data analysis. Before you begin the analysis itself, understanding the data is very important. You should be familiar with the terminology of the kind of data you are working with. For instance, if you are working with financial data such as stocks, you should be familiar with terms like open, close, high, low, volume and so on. Now that we know understanding the data is important, let’s move ahead with the analysis part!
Why Data Analysis?
To draw inferences from the data and to estimate possible future outcomes based on historical values. Continuing the example above, let’s say I analyzed the performance of a particular stock from 2000 to 2020. I will find certain trends in the graph of its daily value. If it shows an upward trend for most of that period, I would like to invest in it; otherwise I will analyze other stocks. An important thing to remember here is:
Analysis is never done on the basis of one or two parameters. There are usually many more: 5, 10, 20 or any number.
EDA is the first step of the analysis. EDA helps to:
- Draw inferences from the data.
- Understand how the parameters of the data are correlated.
- Extract the important parameters and drop the ones that are not.
- Find relationships between those parameters.
- Reach a conclusion.
Let’s get our hands dirty with data
- We will work with pandas, numpy, seaborn and matplotlib here for data analysis and visualization.
- We write %matplotlib inline to plot our graphs within the notebook.
- Pass the file path to read_csv; df.head() will print the first 5 rows of the dataset.
- I am using Google Colaboratory, hence we will write ‘/content/India.csv’ as the file path.
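The setup described above can be sketched as follows. The real ‘/content/India.csv’ file is not reproduced here, so the snippet reads a tiny in-memory CSV instead; its values and the column names col_a and col_b are made up purely for illustration and are not the actual dataset.

```python
import io

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# In a Jupyter/Colab notebook you would also run the magic below so that
# plots render inside the notebook (it is not valid in a plain .py script):
# %matplotlib inline

# In Colab, the post loads the real file like this:
# df = pd.read_csv('/content/India.csv')

# Stand-in so the sketch runs anywhere: a small CSV with made-up values.
sample_csv = io.StringIO(
    "Year,col_a,col_b\n"
    "2000,1.0,10.0\n"
    "2001,2.0,11.0\n"
    "2002,3.0,12.0\n"
    "2003,4.0,13.0\n"
    "2004,5.0,14.0\n"
    "2005,6.0,90.0\n"
)
df = pd.read_csv(sample_csv)

# head() returns the first 5 rows by default.
print(df.head())
```

With the real file, only the read_csv line changes; everything after it stays the same.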
- The info() function shows which columns the dataset contains, their types, and how many non-null values each one has.
- The describe() method gives us summary statistics for each column individually.
- We get the count, mean, standard deviation, quartiles, min, max, etc.
- From the output above, we can see that the mean of the first five columns is greater than or equal to the median (the 50% row).
- There is a relatively big difference between the 75% and max values of the columns.
- Together, these two observations suggest that the data is skewed and may contain outliers.
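As a rough sketch of the info() and describe() steps, the snippet below runs both on a small stand-in DataFrame and then computes the two comparisons discussed above (mean versus median, 75% versus max). The data is made up, with one deliberately extreme value in col_b; the column names are placeholders, not those of India.csv.

```python
import pandas as pd

# Made-up stand-in data; the last value of col_b is a deliberate outlier.
df = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003, 2004, 2005],
    "col_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "col_b": [10.0, 11.0, 12.0, 13.0, 14.0, 90.0],
})

# Column names, dtypes and non-null counts.
df.info()

# count, mean, std, min, quartiles (25%, 50%, 75%) and max per column.
stats = df.describe()
print(stats)

# A mean well above the median (the 50% row) hints at right skew.
mean_minus_median = stats.loc["mean"] - stats.loc["50%"]

# A large gap between the 75% quartile and the max hints at outliers.
max_minus_q3 = stats.loc["max"] - stats.loc["75%"]

print(mean_minus_median)
print(max_minus_q3)
```

Here col_a shows no gap between mean and median, while col_b shows a large one because of the single extreme value.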
- sns.pairplot plots every numeric column against every other one, so that we get a rough idea of how one column is related to another.
- EDA in one line of code: sns.pairplot(df)
- A correlation matrix measures linear relationships between columns, which is why it is commonly used with linear regression. It is a symmetric matrix.
- We have used annot=True to print the values in the cells.
- Values near 1 show a positive correlation, values near -1 show a negative correlation, and values near 0 show little or no linear relationship.
- Exactly 1 indicates a perfect positive relationship, while exactly -1 indicates a perfect negative relationship.
- Drawback of the correlation matrix: it will not reveal columns that have a non-linear relationship.
- From the above we can see that there is a strong positive correlation between Inflation, average consumer prices and Inflation, end of period consumer prices.
- However, there is a strong negative correlation between General Government net lending/borrowing and Year.
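The correlation matrix and its annotated heatmap can be sketched like this. The DataFrame is a made-up stand-in (the rising, falling and mixed columns are hypothetical), and the cmap, vmin and vmax arguments are my own choices to keep the color scale centered on 0; the post itself only mentions annot=True.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in data: one column rises with Year, one falls, one is mixed.
df = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003, 2004, 2005],
    "rising": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "falling": [12.0, 10.0, 8.0, 6.0, 4.0, 2.0],
    "mixed": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr)

# annot=True writes each correlation value inside its cell.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```

In this toy data, rising correlates with Year at 1 and falling at -1, which makes the two ends of the scale easy to see.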
- PPS stands for Predictive Power Score.
- To overcome the drawback of the correlation matrix, we will use the PPS matrix.
- This matrix also captures non-linear relationships among variables. It is asymmetric, because how well x predicts y can differ from how well y predicts x.
- PPS values range from 0 (no predictive power) to 1 (perfect predictive power); unlike correlation values, they are never negative.
- To get the PPS value for a particular feature, we call the score function of the ppscore library with the DataFrame, the feature column and the target column.
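The sketch below first shows the drawback in question on made-up data (a column and its square have a Pearson correlation of about 0), then wraps the PPS calls in two small helpers. The helper names pps_for_feature and pps_table are hypothetical, and the code assumes the third-party ppscore package (1.x API) is installed; it is imported inside the helpers so the correlation part runs without it.

```python
import numpy as np
import pandas as pd

# Made-up data with a purely non-linear relationship: squared = input ** 2.
values = np.linspace(-2, 2, 101)
df = pd.DataFrame({"input": values, "squared": values ** 2})

# Pearson correlation is close to 0 even though "squared" is fully
# determined by "input".
pearson_r = df["input"].corr(df["squared"])
print(f"Pearson r = {pearson_r:.3f}")


def pps_for_feature(frame, feature, target):
    """Return the PPS of `feature` as a predictor of `target`.

    Assumes ppscore is installed (pip install ppscore).
    """
    import ppscore as pps  # imported lazily; not needed for the part above

    return pps.score(frame, feature, target)["ppscore"]


def pps_table(frame):
    """Pivot the long-format output of pps.matrix into a square table."""
    import ppscore as pps

    long_form = pps.matrix(frame)[["x", "y", "ppscore"]]
    return long_form.pivot(columns="x", index="y", values="ppscore")
```

Calling pps_for_feature(df, "input", "squared") should give a score well above 0, and swapping the two column names will generally give a different score, which is the asymmetry mentioned above. Passing pps_table(df) to sns.heatmap with annot=True would draw the PPS matrix in the same way as the correlation heatmap.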