IMPORTANT: You can read the official documentation that Microsoft has on this topic from the following link (url).
A long awaited post! The Fuzzy matching preview feature was added to Power BI Desktop MONTHS ago and here’s my take on it.
What is Fuzzy Matching? In short, it’s an algorithm for approximate string matching.
Why does it matter? Up until September of last year, Power BI / Power Query only gave us the option (natively) to do Merge / JOIN operations similar to a VLOOKUP (FALSE) where we can only do exact matches. That has changed and we can do now “close” or “approximate” matches thanks to Fuzzy Matching
What can I do with it? Let me give you a practical example of something that I recently had to do!
The Scenario: A Survey
Imagine that we recently submitted a pretty cool survey to our office. Basically a simple survey where we just had 1 main question:
What’s your favorite fruit?
Sadly, we didn’t have any type of validation, so people couldn’t pick from a list nor did we have some sort of autofill in place, so that field was completely free-form.
This meant that we would need to do some Data Cleansing in order to really understand the data.
When we received all the responses for our survey, the export looked like this:
So you can imagine that when we tried creating a chart out of it, it didn’t provide the insights that we were hoping for:
As you can tell, it appears that we have some Data Quality issues, meaning that we need to clean our Data and this is a perfect scenario for a Merge with Fuzzy Matching.
Implementing Fuzzy Matching
Before we move forward, we need to set ourselves a goal. Our goal is to reach the following table:
Where we would create this new column called Fruit so we can use that column as the axis of our chart:
To make that happen, we’ll be using a Fuzzy Matching, but even before we try to do the merge operation, we need to have another table besides our original Survey table.
Valid Responses or Dictionary Table
We need to create or have a table with the values that should be “valid”. In other words, you could say that this is a table with the values that are correctly typed OR this is the Dictionary table and we’re trying to figure out if any values from the Survey results look similar to the ones in this Valid Responses or Dictionary Table.
We’re going to keep our table short for example purposes, but our table will look like this:
with that table in place, we can now move on to the next phase which is doing the Merge operation with the Fuzzy Matching
Merge Operation with Fuzzy Matching
In the event that you’re new to Merge or Joins within Power BI / Power Query, I highly recommend that you check out this page on what they are and what they can do for you.
With both of the previously mentioned tables (our Survey Results and the Dictionary table) inside the Power Query Editor, we can perform a new Merge Operation as a New Query taking the Survey Results as the base:
and in that Merge window we tick the checkbox to “Use fuzzy matching to perform the merge”.
We can simply hit OK now and the result will look like this:
and the next thing that we need to do is simply expand the Table values inside the Column Dictionary by clicking on that icon that has the two arrows going in separate directions. The result of that will look like this:
Which does look good at first glance! but we’re missing the value for “Coco” which should be “Coconut”.
It would be great if we could further know how the Fuzzy Matching algorithm works, but that’s only information that Microsoft knows, but there are some Fuzzy Options that we can play with to see if we can get better results.
Fuzzy Merge Options
Let’s go back to the Source Step of that query and click on the gear icon so we can modify the Merge operation. You probably noticed an Expand icon next to the “Fuzzy Merge Options” in one of the previous screenshots. When you click on it, this is what you get:
Let’s do a roundup of the options that we see and what they do:
- Similarity threshold – this can be a number from 0 to 1, Which translates into how closely similar the text strings should match in order to show in the output table. By default is set up to 0.8, which is simply a way to say 80% similar.
- Ignore case – if checked, then the algorithm will ignore if the letters are in uppercase or lowercase and will not take the casing into consideration.
- Match by combining text parts – You probably saw the example that I had on “Watermelon”, which was typed incorrectly as “water melon”. What this option does is that in the event that there’s a space between text strings, it’ll combine both text strings into one. If this didn’t happen, then the value would’ve been merged into “Melon” or “Water” instead, but since this is enabled it ended up as “Watermelon”.
- Maximum number of matches – in some cases, you might want to limit the number of matches that you get depending on the “similarity threshold” that you’ve defined. This is the option where you can define how many matches you want, which can be a number from 1 to a REALLY high number (2147483647)
After playing with these options, I ended up with this set up:
but I still couldn’t get the result that I was expecting for the “Coco” value. It was still giving me a null.
And this is where the last option, Transformation table, comes in.
imagine a scenario where there’s simply no way to do a “close match” and you need to have some sort of explicitly tell the system that you want “Coco” to be referred as “Coconut”. This is where the Transformation Table comes in.
It’s like a sort of Translation table – or it could also be considered a mapping table.
This table does have some requirements, but they are pretty simple and those are that it needs 1 column by the name of From and another one by the name of To. Doesn’t matter if you’re in the Spanish version of the Power Query Editor, those NEED to be the names of the columns, otherwise this won’t work.
This is how my Transform table looks like:
Essentially, the Fuzzy Matching will look for the values from the “From” column and replace them with the value that we see on the “To” column. This is a explicit match or “Mapping”.
We then supply that table to the Fuzzy Matching options like this:
and this one looks promising as it does show that there are 10 out of 10 matches!
When I go back to the last step, I see this:
Exactly the output that we were going for!
At this point we can simply close & Load and create the chart that we’re looking for.
Advanced Fuzzy Matching Options
While we can get pretty much everything that we need through the UI, there are some hidden gems inside the code that we use.
This new experience uses a new function by the name of: Table.FuzzyNestedJoin
Here’s the documentation found within that function:
We’ve already seen most of them except the 2 that are highlighted.
This is what the rest of the documentation reads:
The one that drives my attention the most is the “ConcurrentRequests”, meaning that we can define how many threads this operation can use in the event that we need more power.
I haven’t had a situation where I’ve needed to define the culture yet, but I can see that this might be beneficial for something like a really long text string for fuzzy matching.
I’d love to know if you’ve ever tried using any of these advanced options.