Have you heard of EDA? Let's take a deep dive into exploratory data analysis (EDA). It is the analysis a data scientist uses to spot patterns and trends and to form hypotheses, so that the data can be worked into a shape that answers your questions. It is a vital step in getting the most out of any data, and it is critical because it gives a clearer picture of the relationships between different variables. EDA is mainly practiced to understand your data better before making any assumptions. It answers basic questions about the data, such as its confidence intervals, its standard deviation, and the relationships that exist between different variables. EDA is the soul of any data analysis. According to John Tukey:
“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.”
There are four main types of EDA:
- Univariate non-graphical: It deals with a single variable, so it cannot identify relationships between variables. It is used to find patterns in the data through summary statistics. Because it produces no graphs, you need another type of analysis to get a clear picture.
- Univariate graphical: It fills that gap by describing a single variable with graphics such as histograms and boxplots.
- Multivariate non-graphical: Unlike univariate non-graphical analysis, it deals with multiple variables, so it helps explain the relationships that exist between them.
- Multivariate graphical: It uses graphics such as scatter plots, heat maps, and bubble charts to show the relationships between different variables.
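To make the four types a little more concrete, here is a minimal sketch in Python using only the standard library. The data (`hours`, `scores`) and the helper names (`summarize`, `pearson`, `bin_counts`) are made up for illustration. The text bars are only a crude stand-in for a real histogram, which you would normally draw with a plotting tool.

```python
import statistics

# Hypothetical toy data: hours studied and exam scores for six students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = [52.0, 55.0, 61.0, 64.0, 70.0, 73.0]


def summarize(values):
    """Univariate non-graphical: summary statistics for one variable."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def bin_counts(values, bins=3):
    """Univariate graphical (crude): counts per equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid dividing by zero for constant data
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


def pearson(xs, ys):
    """Multivariate non-graphical: Pearson correlation of two variables."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


hours_summary = summarize(hours)
hours_bins = bin_counts(hours)
r = pearson(hours, scores)

print(hours_summary)
for i, count in enumerate(hours_bins):
    print(f"bin {i}: " + "#" * count)  # text bars in place of a histogram
print(f"correlation between hours and scores: {r:.3f}")
```

A multivariate graphical view of the same data would be a scatter plot of `hours` against `scores`, which is left out here because it needs a plotting library.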
Let's focus on the tools that help us perform EDA. Most data visualization tools can be used for it, including Tableau, R, QlikView, Python, and many others. I have performed EDA in Tableau, and in my experience it is best done before building any machine learning model. The main question is: how do you do EDA?
To start off, we need to import the right dataset, identify the columns in the dataset, check the number of observations, and check whether the dataset contains any null values. After these steps, you need to focus on your categorical variables. Before doing so, you need to clean your data; in this step, all waste or unwanted data is removed. Predictive analysis can then be performed on the clean dataset.
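The steps above can be sketched roughly as follows, again with only the Python standard library. The CSV text, its column names, and the helper functions are hypothetical; empty fields are treated as nulls, and dropping incomplete rows is just one simple cleaning rule among many possible ones.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV content standing in for a real dataset file.
RAW_CSV = """name,city,age
Asha,Pune,29
Ben,,34
Chen,Leeds,
Dina,Pune,41
"""


def load_rows(text):
    """Import the dataset: one dict per observation."""
    return list(csv.DictReader(io.StringIO(text)))


def null_counts(rows, columns):
    """Count empty (null) values in each column."""
    return {
        col: sum(1 for row in rows if not (row[col] or "").strip())
        for col in columns
    }


def drop_incomplete(rows, columns):
    """Clean the data by keeping only rows with a value in every column."""
    return [
        row for row in rows
        if all((row[col] or "").strip() for col in columns)
    ]


rows = load_rows(RAW_CSV)
columns = list(rows[0].keys())           # identify the columns
n_observations = len(rows)               # number of observations
nulls = null_counts(rows, columns)       # null values per column
clean_rows = drop_incomplete(rows, columns)

# Look at a categorical variable once the data is clean.
city_counts = Counter(row["city"] for row in clean_rows)

print("columns:", columns)
print("observations:", n_observations)
print("nulls per column:", nulls)
print("rows after cleaning:", len(clean_rows))
print("city frequencies:", dict(city_counts))
```

On a real project you would read the file from disk and choose the cleaning rule based on what the missing values mean, rather than always dropping rows.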
EDA is a basic step that is often overlooked, but it is necessary for figuring out the best-suited model for your analysis. It plays a crucial role in shaping the research hypothesis. It is an asset to any data scientist because it is the first step before performing any machine learning operations. The results obtained from EDA should be valid and applicable to the business context.