IMPORTANT: You can read the official documentation that Microsoft has on this topic from the following link (url).
A long awaited post! The Fuzzy matching preview feature was added to Power BI Desktop MONTHS ago and here’s my take on it.
What is Fuzzy Matching? In short, it’s an algorithm for approximate string matching.
Why does it matter? Up until September of last year, Power BI / Power Query only gave us the option (natively) to do Merge / JOIN operations similar to a VLOOKUP (FALSE) where we can only do exact matches. That has changed and we can do now “close” or “approximate” matches thanks to Fuzzy Matching
What can I do with it? Let me give you a practical example of something that I recently had to do!
The Scenario: A Survey
Imagine that we recently submitted a pretty cool survey to our office. Basically a simple survey where we just had 1 main question:
What’s your favorite fruit?
Sadly, we didn’t have any type of validation, so people couldn’t pick from a list nor did we have some sort of autofill in place, so that field was completely free-form.
This meant that we would need to do some Data Cleansing in order to really understand the data.
When we received all the responses for our survey, the export looked like this:
So you can imagine that when we tried creating a chart out of it, it didn’t provide the insights that we were hoping for:
As you can tell, it appears that we have some Data Quality issues, meaning that we need to clean our Data and this is a perfect scenario for a Merge with Fuzzy Matching.
Implementing Fuzzy Matching
Before we move forward, we need to set ourselves a goal. Our goal is to reach the following table:
Where we would create this new column called Fruit so we can use that column as the axis of our chart:
To make that happen, we’ll be using a Fuzzy Matching, but even before we try to do the merge operation, we need to have another table besides our original Survey table.
Valid Responses or Dictionary Table
We need to create or have a table with the values that should be “valid”. In other words, you could say that this is a table with the values that are correctly typed OR this is the Dictionary table and we’re trying to figure out if any values from the Survey results look similar to the ones in this Valid Responses or Dictionary Table.
We’re going to keep our table short for example purposes, but our table will look like this:
with that table in place, we can now move on to the next phase which is doing the Merge operation with the Fuzzy Matching
Merge Operation with Fuzzy Matching
In the event that you’re new to Merge or Joins within Power BI / Power Query, I highly recommend that you check out this page on what they are and what they can do for you.
With both of the previously mentioned tables (our Survey Results and the Dictionary table) inside the Power Query Editor, we can perform a new Merge Operation as a New Query taking the Survey Results as the base:
and in that Merge window we tick the checkbox to “Use fuzzy matching to perform the merge”.
We can simply hit OK now and the result will look like this:
and the next thing that we need to do is simply expand the Table values inside the Column Dictionary by clicking on that icon that has the two arrows going in separate directions. The result of that will look like this:
Which does look good at first glance! but we’re missing the value for “Coco” which should be “Coconut”.
It would be great if we could further know how the Fuzzy Matching algorithm works, but that’s only information that Microsoft knows, but there are some Fuzzy Options that we can play with to see if we can get better results.
Fuzzy Merge Options
Let’s go back to the Source Step of that query and click on the gear icon so we can modify the Merge operation. You probably noticed an Expand icon next to the “Fuzzy Merge Options” in one of the previous screenshots. When you click on it, this is what you get:
Let’s do a roundup of the options that we see and what they do:
- Similarity threshold – this can be a number from 0 to 1, Which translates into how closely similar the text strings should match in order to show in the output table. By default is set up to 0.8, which is simply a way to say 80% similar.
- Ignore case – if checked, then the algorithm will ignore if the letters are in uppercase or lowercase and will not take the casing into consideration.
- Match by combining text parts – You probably saw the example that I had on “Watermelon”, which was typed incorrectly as “water melon”. What this option does is that in the event that there’s a space between text strings, it’ll combine both text strings into one. If this didn’t happen, then the value would’ve been merged into “Melon” or “Water” instead, but since this is enabled it ended up as “Watermelon”.
- Maximum number of matches – in some cases, you might want to limit the number of matches that you get depending on the “similarity threshold” that you’ve defined. This is the option where you can define how many matches you want, which can be a number from 1 to a REALLY high number (2147483647)
After playing with these options, I ended up with this set up:
but I still couldn’t get the result that I was expecting for the “Coco” value. It was still giving me a null.
And this is where the last option, Transformation table, comes in.
Transformation Table
imagine a scenario where there’s simply no way to do a “close match” and you need to have some sort of explicitly tell the system that you want “Coco” to be referred as “Coconut”. This is where the Transformation Table comes in.
It’s like a sort of Translation table – or it could also be considered a mapping table.
This table does have some requirements, but they are pretty simple and those are that it needs 1 column by the name of From and another one by the name of To. Doesn’t matter if you’re in the Spanish version of the Power Query Editor, those NEED to be the names of the columns, otherwise this won’t work.
This is how my Transform table looks like:
Essentially, the Fuzzy Matching will look for the values from the “From” column and replace them with the value that we see on the “To” column. This is a explicit match or “Mapping”.
We then supply that table to the Fuzzy Matching options like this:
and this one looks promising as it does show that there are 10 out of 10 matches!
When I go back to the last step, I see this:
Exactly the output that we were going for!
At this point we can simply close & Load and create the chart that we’re looking for.
Advanced Fuzzy Matching Options
While we can get pretty much everything that we need through the UI, there are some hidden gems inside the code that we use.
This new experience uses a new function by the name of: Table.FuzzyNestedJoin
Here’s the documentation found within that function:
We’ve already seen most of them except the 2 that are highlighted.
This is what the rest of the documentation reads:
The one that drives my attention the most is the “ConcurrentRequests”, meaning that we can define how many threads this operation can use in the event that we need more power.
I haven’t had a situation where I’ve needed to define the culture yet, but I can see that this might be beneficial for something like a really long text string for fuzzy matching.
I’d love to know if you’ve ever tried using any of these advanced options.
Thanks for this Miguel, I have a query, I’ve downloaded the latest version of powerquery (2.59.5135.201) /powerbi desktop (2.67.5404.981) but I cant seem to find the fuzzy match option in left outerjoin. May i know what i have been doing wrong? Thanks
Hey!
I highly recommend that you download the latest version of Power BI Desktop from here:
https://www.microsoft.com/en-us/download/details.aspx?id=45331
According to this, the latest version is:
2.70.5494.761
Thanks for the good information! I have 2 queries each is about 300 row. When I tried to do the fuzzy match the query runs really slow (took up to 1 hour) to match. Any idea why? I’m really new to this.
Hey Joyce!
Would you posting your scenario and the files that you’re using on the official Power Query forum? https://social.technet.microsoft.com/Forums/en-US/home?forum=powerquery
I’d love to replicate that scenario and see why it happens. Also, the people that work on this tool from Microsoft monitor that forum so it’s a great a idea to publish your scenario in there
Hi. I tried to use this fuzzy match option, but after selecting fuzzy match the “ok” button at the bottom disappears. I tried to install power bi desktop again, but i get the same thing, no ok button after selecting fuzzy match option.
Hey!
That seems pretty odd. Maybe it deselected the fields for the join after choosing the fuzzy option?
I’d recommend that you post your full scenario and sample dataset on the official Power Query forum:
https://social.technet.microsoft.com/Forums/en-US/home?forum=powerquery
Hi Miguel,
Thanks for the article. It’s super helpful! Have you used this tool for a big data set? One of the advantages of using Power Query is the ability of merging connections to documents that exceed the 1.2M rows etc so I wonder how good will perform in that scenario. Thanks
I haven’t yet. If you do try it out, please let me know your findings! So far it has been small surveys where there’s really no validation in place – just a free form text field
I’ve tested a fuzzy merge between two “not that big tables” (100k vs 250k rows) using the default threshold and didn’t perform well..eventually after 40-45 min found the matches but when I tried to bring the output in it failed… on top of that it doesn’t show you the score per match right? With the Fuzzy Lookup add-in the output shows you the score for all the matches found over the threshold which can be very convenient in case you want to review a certain set manually. Keep us posted if you test it deeper! Many Thanks
Hey! Be sure to send the frown button so that the MSFT team can improve the experience. I don’t generally see that many fuzzy scenarios, so I wouldn’t be the best person to test things out at a deeper level.
Something that I can assure you is that the team behind fuzzy lookup would LOVE to know about your findings, so contact them and let them know about your situation.
Thank you Miguel. Sure, I will do that. Apart from the frown button do you know any other way I can reach them directly? I guess they’d have tested this on a wider spectrum so would be great to get some insights.
the other way would be to post your scenario and findings on the official Power Query forum:
https://social.technet.microsoft.com/Forums/en-US/home?forum=powerquery
They follow that closely, so you’ll reach them in no time
Hey, I’m having the same issue with Fuzzy matches. One table is only 2600 rows, the other is 450k. I gave up after 30 min of loading.
I was thinking that it could be that the process is so complicated that it runs on the F-Engine which is a single thread processing machine instead of the Storage Engine which is multi-threaded?
No clue. Have you tried creating a distinct list of text strings and doing the operation over that list and then cascade those results using a simple join to the original 450K table ?
I have the same issue regarding slowness. Fuzzy matching matched very well, but was so extremely so that it was not worth it.
Can you provide more details about Similarity threshold. By default it is set to 0.8, which is to say 80% similar. How does it measure “similar”?
Does order matter? At 80%, would ABCD match DCBA. They are the same letters but in different order.
Is it saying 80% of the letters are the same? So ABCDEFGHIJ would match GHZABCZDEFZ because 8 of the 10 letters are the same.
Hey! Unfortunately, as stated in my post, only the Microsoft team really know what the algorithm is so I’d recommend asking them directly through here:
https://social.technet.microsoft.com/Forums/en-US/home?forum=powerquery
Hello, Can someone help me with which algorithm does Power BI to perform fuzzy match? I would like to replicate that algorithm on data base side. Please help me with this.
Thanks
Hi Miguel
Great post, I am facing an urgent requirement wherein I am required to provide the match% as slicer.
This will help the user to navigate results by adjusting the match percentage without going to queries.
Your extremely urgent support is appreciated.
Regards
Ank
Hey!
For business inquiries you can contact us via the contact page or via email info@poweredsolutions.co
Best!
I’ve tried to use Fuzzy Merge today, but it’s not working. My datasets are 92K vs 4.5M rows. It has been running for over 12 hours now without any results yet. In all the videos and examples for the Fuzzy Merge, they test it always with 5-10 lines of data, I know the reason now. In the real life you won’t have tables with 10 rows, but millions of them. So this feature on Power BI is pretty useless right now.
Are both tables using distinct values ? That would be one way to expedite the process.
Run the fuzzy merge over distinct values and then “flow” the results down using an exact merge
Hi Miguel, thanks for this post!
Quick question, is there a way to do Fuzzy Grouping (i.e. doing the string similarity on the same column of strings to consolidate it).
What I am trying to do is basically consolidating just the one table without joining it with another table?
Hey!
These two articles could help you:
https://docs.microsoft.com/en-us/power-query/cluster-values
https://docs.microsoft.com/en-us/power-query/group-by#fuzzy-grouping
thank you! I want to merge 2 tables based on the information provided in 2columns. The first column will be a Fuzzy match while the second column is exact match. Data as example here, Country should be exact match. My Table 1 database is huge with 2 mil item, I tried to use Fuzzy logic for both column, With a small Table 2, I can get the needed results but it is extremely slow. but when Table 2 is large like +50 lines it is not working.
Table 1
Inventory | Country | Price | Qty
Apple | US | $10 | 5000
Apple | MX | $8 | 6000
Apple | UK | $12 | 1000
Table 2
Orders | Country | Request
Aple | MX | 100
Appl | US | 20
Apples | US | 30
Result table
Orders | Country | Request | Inventory | Price | Qty
Aple | MX | 100 | Apple | $8 | 6000
Appl | US | 20 | Apple | $10 | 5000
Apples | US | 30 | Apple | $10 | 5000
hey!
I encourage you to post your full scenario with all the details possible on the official Power Query forum:
https://community.powerbi.com/t5/Power-Query/bd-p/power-bi-services