
Road to data cleaning

Imagine you receive unstructured data and need to derive insights from it. Using such data as-is for data-driven decision-making is a mistake. The fix is a basic step: data cleaning. As the name suggests, it is the step you take before deriving insights and creating visualizations. In other words, it is a prerequisite for data visualization.

In my limited experience, I have handled data that arrived from a variety of sources, some of it entered manually, which raises the chances of duplicate and incorrect values that need to be removed before going further. Cleaning may involve eliminating some values or replacing others so that the data is fit for visualization.

In today's blog, I will cover 6 basic steps that have proved useful to me in every data cleaning exercise. These steps are widely used, though they can vary depending on the dataset and what you are trying to achieve with it.

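  • Remove unwanted spaces from your data by applying TRIM, for example =TRIM(A2), and filling the formula down the column. TRIM removes leading and trailing spaces and reduces repeated spaces between words to a single space. I have created a sample text, and you can easily spot the difference before and after applying the formula.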

  • Handle missing values by selecting the data range and, in the Home tab, opening Find & Select and choosing Go To Special. In the dialog that appears, select Blanks and click OK; all the blank cells are now selected. Type a value in one of the cells and press Ctrl+Enter to fill every selected blank with it.


  • Highlight duplicated values by selecting the data range, opening Conditional Formatting, choosing Highlight Cells Rules and then Duplicate Values. The duplicated data gets highlighted. To delete the duplicates, go to the Data tab, click Remove Duplicates and select the columns to compare. Excel keeps the first occurrence of each value and deletes the repeats.






  • Use Text to Columns to split data on a delimiter, for example to get a customer's first and last names in separate columns. Select the data range, go to the Data tab and choose Text to Columns. A wizard opens where you pick the delimiter and see a preview of how your data will look.


  • The opposite of Text to Columns is CONCATENATE. To combine the first and last name in a single column, this formula comes in handy. Apply the formula to both columns, for example =CONCATENATE(A2, " ", B2), and you will get the desired result.


  • Keep the case of your text consistent, whether upper or lower. To do so, apply UPPER or LOWER to the column, for example =UPPER(A2), and you will get the desired result.
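
If you ever need to repeat these six steps outside Excel, they can be sketched in a few lines of Python. This is only a rough illustration, not a replacement for the Excel workflow: the function names (trim, fill_blanks, remove_duplicates, split_name, join_name, fix_case), the "N/A" placeholder and the sample rows are all made up for this example, and the behavior only approximates the Excel features.

```python
import re


def trim(text):
    """Roughly like Excel's TRIM: strip spaces at the ends, collapse repeats."""
    return re.sub(r" +", " ", text.strip(" "))


def fill_blanks(rows, placeholder="N/A"):
    """Replace empty or whitespace-only cells with a placeholder value."""
    return [[cell if cell.strip() else placeholder for cell in row] for row in rows]


def remove_duplicates(rows):
    """Keep the first occurrence of each row and drop later repeats."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def split_name(full_name, delimiter=" "):
    """Like Text to Columns: split once on the delimiter."""
    first, _, last = full_name.partition(delimiter)
    return first, last


def join_name(first, last, separator=" "):
    """Like CONCATENATE: combine two columns into one."""
    return f"{first}{separator}{last}"


def fix_case(text, mode="proper"):
    """Like UPPER, LOWER or PROPER, depending on the mode."""
    if mode == "upper":
        return text.upper()
    if mode == "lower":
        return text.lower()
    if mode == "proper":
        return text.title()
    raise ValueError(f"unknown mode: {mode}")


# Made-up sample rows: [full name, city]
raw = [
    ["  john   smith ", "pune"],
    ["  john   smith ", "pune"],  # exact duplicate of the first row
    ["MARY jones", ""],           # missing city
]

cleaned = []
for name, city in remove_duplicates(raw):
    first, last = split_name(trim(name))
    cleaned.append([join_name(fix_case(first), fix_case(last)), city])
cleaned = fill_blanks(cleaned)

print(cleaned)
```

Duplicates are removed first here so that only exact repeats are dropped; if you trim and fix the case before de-duplicating, rows that differ only in spacing or case would also be treated as duplicates, which may or may not be what you want.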



These are some generic steps that I follow almost every time. Nowadays there are tools designed specifically for these tasks, but I believe doing the steps yourself helps you get to know your data much better. One common question I come across is: how is data cleaning different from data transformation? The answer lies in the definitions: data cleaning removes unnecessary or incorrect data, while data transformation converts data from one form to another.
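
To make the distinction concrete, here is a small Python sketch with made-up order records. The clean and transform functions, the field names and the date format are assumptions chosen for illustration: cleaning drops rows that should not be there, while transformation changes the form of the rows that remain.

```python
from datetime import datetime

# Made-up sample records; dates are assumed to be in day/month/year format.
orders = [
    {"order_id": "1001", "order_date": "03/01/2022"},
    {"order_id": "1001", "order_date": "03/01/2022"},  # duplicate entry
    {"order_id": "", "order_date": "04/01/2022"},      # missing id
    {"order_id": "1002", "order_date": "05/01/2022"},
]


def clean(records):
    """Cleaning: drop rows with a missing id and exact duplicates."""
    seen = set()
    kept = []
    for rec in records:
        key = (rec["order_id"], rec["order_date"])
        if rec["order_id"] and key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def transform(records):
    """Transformation: convert the id to an int and the date to ISO format."""
    return [
        {
            "order_id": int(rec["order_id"]),
            "order_date": datetime.strptime(rec["order_date"], "%d/%m/%Y")
            .date()
            .isoformat(),
        }
        for rec in records
    ]


result = transform(clean(orders))
print(result)
```

Cleaning runs first in this sketch so that the transformation never has to deal with a blank id.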

Almost all business intelligence tools have data cleaning features, whether it is a data interpreter or the Power BI query editor, where you can perform almost every task that you can do in Excel. So stay tuned for upcoming blogs, where we will get familiar with the basic steps to perform in the Power BI query editor.



Thanks for reading. Let's connect on LinkedIn.




