Miguel Escobar Published August 13, 2019

Data Profiling, Quality & Distribution in Power BI / Power Query

Power BIPower Query

IMPORTANT: You can read the official documentation that Microsoft has on this topic from the following link (url).

Microsoft released these features almost a year ago and in the last couple of months it has received some love with some updates. I’ve noticed that most people don’t really use these features, mainly because they’re disabled by default, so I wanted to create this blog post to cover the why and when to use these features.

You might be wondering; are these features worth it? IMHO, absolutely, but it is subjective to how you use Power Query.

The whole reason why these features exist is to make your life easier working with Power Query.

How to activate the Data Profiling, Quality & Distribution feature in Power BI / Power Query?


Note: These features are currently only available inside of Power BI Desktop, but they’ll eventually become available inside of Power Query for Excel and perhaps other integrations of Power Query.

To enable these features, you need to go to the View tab à Data Preview Group à Check the following:

  • Column quality
  • Column distribution
  • Column profile

What can I do with each feature? What do I gain out of them?

Before we go into detail of what each of these features do, let’s try to figure out what each of them is so we know what we’re referring to.

Note that in order to see the column profile you need to click on a column for it to pop up, so that might be the reason why you don’t see it right away.

What these features try to provide to you, as the end user, is a visual way to explore your data and get a sense of the composition of your dataset.

You get column distributions, check how many errors there are on a given column and how many unique values you have as well as some other stats of your column. It’s all contextual to the data that you’re currently seeing.

By default, Power Query does all of these profiling and checks over the first 1,000 rows of your dataset. Don’t worry, with just a simple click you’re able to change that behavior so you can get the whole picture by performing these checks over the entire dataset:

All you need to do is check the status bar at the bottom far left where it says “Column profiling based on top 1000 rows” and change that to be “Column profiling based on entire data set”.

Column Quality in Power BI / Power Query

The first feature that we’ll be using is the Column Quality. Our dataset comes with one column with the name Code which for some reason has some errors from the source.

Notice how the data quality section is telling me that this column consists of 19% errors. Notice how right underneath the column name there’s a little bar chart that is representative of the %s of the Valid, Error and Empty. Quite a subtle touch.

What do we want to do with those errors? We want to replace errors with the value “Z”.

From a User eXperience standpoint, the first thing that would come into my mind as an end-user is to actually click where it says Error on the Data Quality section and see if it’s interactive.

You don’t have to click – you can just mouseover and it’ll display some suggestions on what actions you can take based on the data quality results that you got. In our case we got this:

By default, it brings us the suggestion to Remove Errors, but if you click on the 3 dots you’ll see more options and the one that we’re after is Replace Errors. In that Replace Errors window all I have to input is Z:

And the result of that will look like this:

The most important aspect or the value added by this feature is that we didn’t have to use the right click or go through all of the options in Power Query to find what we wanted to do. It was an intuitive experience that led to the correct results.

Column Distribution in Power BI / Power Query

I’ve been testing this feature for months and most of the time it only provides a simple distribution of the data inside the column, but not any actionable or suggestion that has been beyond helpful.

Don’t get me wrong – it is absolutely amazing to visually realize that your query is missing some data because of distinct and uniqueness counts, but I just wish that it provided more than just a distribution and some stats.

The data inside those visualizations is sorted in descending order from the value with the higher frequency. You can mouseover those visuals, but you won’t get any tooltips as to what data point is for, so you don’t really know what’s going on just by looking at that simple visualization.

Besides being informational of the overall data inside the column (with distinct count and unique values), you could mouseover any of he visuals and get some suggestions of what you could do, but these are limited to only a few options that are not explicit to the values within that chart:

You only get 1 main suggestion from Power Query on what to do (usually a remove duplicates for column distribution) and the 3 dots will always give you the same options as the ones that we saw before with the Data Quality.

Column Profile in Power BI / Power Query

This one was added a few months ago and it extends or builds on top of what we saw before with the Column Distribution.

One of the main caveats of using this feature is that it takes A LOT of real estate of the screen to the point where in some small laptops and screens you’ll barely see the Data Preview on top of it. Nevertheless, this was a much-needed experience.

Usually I only have this feature and the Data Quality enabled.

Let’s look at the example for this one. When you try doing a filter over a column that only has 1 value, the filter is not correctly defined as it tries to do a “Select All” instead of just selecting one value. This is exactly what happens with the Master Account field.

And here’s where the Column profiling shines. You can select the column and then see ALL of the values inside of that column. You can then right click on a single value and do a specific filter to only get that single value:

You can also look on the left at the statistics. It goes beyond as to what we see with the column distribution and it adds the frequency of errors, empty string, min, max and depending on your column type it might add even more statistics. Sadly, this is only informational data – we can’t click on it to, let’s say, remove empty strings from the Column statistics pane.

Furthermore, the Column distribution to the right is also limited in the sense that we can not select multiple values at the same time to do a Keep or Remove of multiple rows at the same time

A bit of contrast: A competitor who provides a BETTER experience (IMHO)

One tool that I’ve had my eyes on for a number of years now is called Trifacta and they have this tool called Wrangler.

It’s a bit more than just a Data Preparation tool – it’s also a data exploration and they even say that it’s a data mining tool.

I even compared Power Query against Trifacta Wrangler a few years ago. You can check out that comparison here.

Instead of me telling you how cool that tool is, let me give you a screenshot of how their User Interface looks like:

They have a free version of their tool, so I encourage you to check it out.

Now, who had the column distribution, quality and profiling first? Power Query or Wrangle? It was actually Wrangler who had it first. Years before Power Query even first began to create beta version of it.

In comparison to what we have with Power Query, you can see that we don’t have the overall “distinct” and “unique” count, but rather several visualizations that rely on the type of data type that we’re dealing with.

  • For a date data type, we see a timeline
  • For a numeric field, we see a range of values
  • For text strings, we see how many categories

The visuals themselves work almost like you’d expect from the actual Power BI report canvas. There’s cross-filtering, so when you select a value (yes! You can select a value) the UI immediately highlights the rows where that value appears:

You can even hold the ctrl key to select multiple values at the same time (in this next image i’m selecting the codes B & C):

There are some drawbacks in comparison to Power Query. In this case, for example, the Trifacta Wrangler connector detects the errors but instead of showing them as errors, it doesn’t display them at all:

They’re categorized as missing values in the Data Quality of Trifacta Wrangler, while in Power Query you do have that distinction and depending on the case that might be extremely important.

Conclusion: Power Query is evolving into a Data Preparation & Data Exploration tool

It is crystal clear that these features are trying to provide a basic Data Exploration experience. Power Query has been positioning itself as THE data preparation tool, but it has always lack on the presentation layer for its users, so it’s trying to invest more in improving the current User eXperience and we’re on the right track.

There’s still a long way to go as you can see when we compare Power Query against other tools in the market, but it’s nice to know that the Power Query team is investing in improving the User eXperience.

Let me know your comments below on how you’re currently using these features today and what you think of them.

Power BIPower Query
Subscribe
Notify of
guest
15 Comentario
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Olivier Travers

Trifacta started as Wrangler at Stanford 8 years ago and clearly showed the way, I agree that PQ still has ways to go from a profiling perspective. Google should have been a strong player in this space since they had both Google Refine and Needlebase, but since Google is terminally incompetent at product portfolio management, they shut all of that down. Interestingly Tableau shipped Prep last year with profiling built in from the start.

Bertrand

Hi Olivier, agreed with you that Google Refine was amazing. I was a big fan!
Since then, Google did a strong move to the Cloud and they now have a comprehensive smart analytics suite. You may have missed it, but Google added a data preparation component to it. This is Cloud Dataprep and the company behind it… is Trifacta 😉
You can try it out here https://clouddataprep.com/home
Best
Bertrand

Olivier Travers

Hi Bertand, I know about Dataprep and am aware it’s Trifacta. My point is that Google dropped the ball on their own ETL acquisitions that were awesome many years ago, to now try to catch up with a partner. This is inane product management.

Anthony Orzechowski

Hi,
Thanks for the article. As always very useful.

I see your comparison to Wrangler. I thought I might add to the discussion … there is a tool that, in my opinion, is years ahead of Power BI and Wrangler for data exploration and cleaning, it is called JMP by SAS.

It has been doing a great job in this area for over the past 20 years and the maturity and features present in this application is what I would hope someday Power BI and Wrangler could achieve. To me, it is the clear gold standard in this area. I love using Power BI, but I very often need to return to JMP to do highly efficient and effective data exploration, transformations and preparation.

Please feel free to contact me if you wish to discuss the differentiation of JMP and the features the clearly differentiate it.

Tony Orzechowski

Bertrand Cariou

Thanks, Miguel, for recognizing the antecedence of Trifacta rendering a header quality bar to inform the users in possible data quality flaws.
Having Microsoft adopting a similar look and feel is a strong acknowledgment we’re doing a good job!
Two additional things I wanted to mention:
– In Trifacta Wrangler, by interacting with the data quality issues, Trifacta suggests ways to solve them (based on the data or patterns) and a preview of what the corrected data would look like. A very cool way to validate the data in real-time.
– The data quality header is just an overview of potential data quality flaws. You can access to profiling details from the header menu for more statistics (outliers, min, max, median, patterns, and many more) and get a better idea of the data issues. I wrote a blog on this
https://medium.com/google-cloud/improving-data-quality-for-machine-learning-and-analytics-with-cloud-dataprep-c55a47580fd8
Cheers
Bertrand

AGNAOU Hamza

Hi Mister Escobar,

First of all, I want to thank you for the interesting informations you shared with us.

I am a starter learner of Power BI.

I encounter a problem in enabling “Enable column profiling” under (Options > preview features). So the data preview group (Power Query Editor > View) doesn’t contain neither the column quality, column profile or column distribution.

I am actually using the latest version of Power BI Desktop.

Thank you in advance for your help.

Karthik Rao

Hi Miguel
I have a huge file and it takes hours to do column profiling for entire dataset. After that is see these 3 measures for each column. I see a column that has errors.
I want to see only the rows with errors. Then I can compare with non error rows. And then decide to correct or delete the error rows.
However, no such options. Only remove replace etc. Howdo I decide without seeing? And then if I choose an option, it takes hours to redo the profiling.
Am I doing something wrong?
Any way to filter and see the errors?

Sara

Hi Miguel,
I took a look at your great article. However, I can’t find these new columns. Can you tell me how to view and enable the columns please? I can’t find them anywhere. Is it in a com add in? The article you wrote that is two years old doesn’t outline how those column options can be enabled. I noticed I am also missing the Tools menu too. Is that in an add-in as well. Your article doesn’t list the version of Excel it applies to. I’m using Excel version 2104 Build 13929.20296. I would love it if you could point me in the right direction because I really want to use those column features and they just aren’t anywhere to be seen for me.
Thank you so much!

Stephanie William

Hi !

Thank you for the great article.

I would like to know if it’s possible to convert profile data into the Power BI dash ?

Thank you in advance for your help.

Stephanie