IMPORTANT: You can read the official documentation that Microsoft has on this topic from the following link (url).
Microsoft released these features almost a year ago and in the last couple of months it has received some love with some updates. I’ve noticed that most people don’t really use these features, mainly because they’re disabled by default, so I wanted to create this blog post to cover the why and when to use these features.
You might be wondering; are these features worth it? IMHO, absolutely, but it is subjective to how you use Power Query.
The whole reason why these features exist is to make your life easier working with Power Query.
How to activate the Data Profiling, Quality & Distribution feature in Power BI / Power Query?
Note: These features are currently only available inside of Power BI Desktop, but they’ll eventually become available inside of Power Query for Excel and perhaps other integrations of Power Query.
To enable these features, you need to go to the View tab à Data Preview Group à Check the following:
- Column quality
- Column distribution
- Column profile
What can I do with each feature? What do I gain out of them?
Before we go into detail of what each of these features do, let’s try to figure out what each of them is so we know what we’re referring to.
Note that in order to see the column profile you need to click on a column for it to pop up, so that might be the reason why you don’t see it right away.
What these features try to provide to you, as the end user, is a visual way to explore your data and get a sense of the composition of your dataset.
You get column distributions, check how many errors there are on a given column and how many unique values you have as well as some other stats of your column. It’s all contextual to the data that you’re currently seeing.
By default, Power Query does all of these profiling and checks over the first 1,000 rows of your dataset. Don’t worry, with just a simple click you’re able to change that behavior so you can get the whole picture by performing these checks over the entire dataset:
All you need to do is check the status bar at the bottom far left where it says “Column profiling based on top 1000 rows” and change that to be “Column profiling based on entire data set”.
Column Quality in Power BI / Power Query
The first feature that we’ll be using is the Column Quality. Our dataset comes with one column with the name Code which for some reason has some errors from the source.
Notice how the data quality section is telling me that this column consists of 19% errors. Notice how right underneath the column name there’s a little bar chart that is representative of the %s of the Valid, Error and Empty. Quite a subtle touch.
What do we want to do with those errors? We want to replace errors with the value “Z”.
From a User eXperience standpoint, the first thing that would come into my mind as an end-user is to actually click where it says Error on the Data Quality section and see if it’s interactive.
You don’t have to click – you can just mouseover and it’ll display some suggestions on what actions you can take based on the data quality results that you got. In our case we got this:
By default, it brings us the suggestion to Remove Errors, but if you click on the 3 dots you’ll see more options and the one that we’re after is Replace Errors. In that Replace Errors window all I have to input is Z:
And the result of that will look like this:
The most important aspect or the value added by this feature is that we didn’t have to use the right click or go through all of the options in Power Query to find what we wanted to do. It was an intuitive experience that led to the correct results.
Column Distribution in Power BI / Power Query
I’ve been testing this feature for months and most of the time it only provides a simple distribution of the data inside the column, but not any actionable or suggestion that has been beyond helpful.
Don’t get me wrong – it is absolutely amazing to visually realize that your query is missing some data because of distinct and uniqueness counts, but I just wish that it provided more than just a distribution and some stats.
The data inside those visualizations is sorted in descending order from the value with the higher frequency. You can mouseover those visuals, but you won’t get any tooltips as to what data point is for, so you don’t really know what’s going on just by looking at that simple visualization.
Besides being informational of the overall data inside the column (with distinct count and unique values), you could mouseover any of he visuals and get some suggestions of what you could do, but these are limited to only a few options that are not explicit to the values within that chart:
You only get 1 main suggestion from Power Query on what to do (usually a remove duplicates for column distribution) and the 3 dots will always give you the same options as the ones that we saw before with the Data Quality.
Column Profile in Power BI / Power Query
This one was added a few months ago and it extends or builds on top of what we saw before with the Column Distribution.
One of the main caveats of using this feature is that it takes A LOT of real estate of the screen to the point where in some small laptops and screens you’ll barely see the Data Preview on top of it. Nevertheless, this was a much-needed experience.
Usually I only have this feature and the Data Quality enabled.
Let’s look at the example for this one. When you try doing a filter over a column that only has 1 value, the filter is not correctly defined as it tries to do a “Select All” instead of just selecting one value. This is exactly what happens with the Master Account field.
And here’s where the Column profiling shines. You can select the column and then see ALL of the values inside of that column. You can then right click on a single value and do a specific filter to only get that single value:
You can also look on the left at the statistics. It goes beyond as to what we see with the column distribution and it adds the frequency of errors, empty string, min, max and depending on your column type it might add even more statistics. Sadly, this is only informational data – we can’t click on it to, let’s say, remove empty strings from the Column statistics pane.
Furthermore, the Column distribution to the right is also limited in the sense that we can not select multiple values at the same time to do a Keep or Remove of multiple rows at the same time
A bit of contrast: A competitor who provides a BETTER experience (IMHO)
One tool that I’ve had my eyes on for a number of years now is called Trifacta and they have this tool called Wrangler.
It’s a bit more than just a Data Preparation tool – it’s also a data exploration and they even say that it’s a data mining tool.
I even compared Power Query against Trifacta Wrangler a few years ago. You can check out that comparison here.
Instead of me telling you how cool that tool is, let me give you a screenshot of how their User Interface looks like:
They have a free version of their tool, so I encourage you to check it out.
Now, who had the column distribution, quality and profiling first? Power Query or Wrangle? It was actually Wrangler who had it first. Years before Power Query even first began to create beta version of it.
In comparison to what we have with Power Query, you can see that we don’t have the overall “distinct” and “unique” count, but rather several visualizations that rely on the type of data type that we’re dealing with.
- For a date data type, we see a timeline
- For a numeric field, we see a range of values
- For text strings, we see how many categories
The visuals themselves work almost like you’d expect from the actual Power BI report canvas. There’s cross-filtering, so when you select a value (yes! You can select a value) the UI immediately highlights the rows where that value appears:
You can even hold the ctrl key to select multiple values at the same time (in this next image i’m selecting the codes B & C):
There are some drawbacks in comparison to Power Query. In this case, for example, the Trifacta Wrangler connector detects the errors but instead of showing them as errors, it doesn’t display them at all:
They’re categorized as missing values in the Data Quality of Trifacta Wrangler, while in Power Query you do have that distinction and depending on the case that might be extremely important.
Conclusion: Power Query is evolving into a Data Preparation & Data Exploration tool
It is crystal clear that these features are trying to provide a basic Data Exploration experience. Power Query has been positioning itself as THE data preparation tool, but it has always lack on the presentation layer for its users, so it’s trying to invest more in improving the current User eXperience and we’re on the right track.
There’s still a long way to go as you can see when we compare Power Query against other tools in the market, but it’s nice to know that the Power Query team is investing in improving the User eXperience.
Let me know your comments below on how you’re currently using these features today and what you think of them.