Miguel Escobar Published March 26, 2019

Fuzzy Matching in Power BI / Power Query

Power BIPower Query

IMPORTANT: You can read the official documentation that Microsoft has on this topic from the following link (url).

A long awaited post! The Fuzzy matching preview feature was added to Power BI Desktop MONTHS ago and here’s my take on it.

What is Fuzzy Matching?  In short, it’s an algorithm for approximate string matching.

Why does it matter? Up until September of last year, Power BI / Power Query only gave us the option (natively) to do Merge / JOIN operations similar to a VLOOKUP (FALSE) where we can only do exact matches. That has changed and we can do now “close” or “approximate” matches thanks to Fuzzy Matching

What can I do with it? Let me give you a practical example of something that I recently had to do!

The Scenario: A Survey

Imagine that we recently submitted a pretty cool survey to our office. Basically a simple survey where we just had 1 main question:

What’s your favorite fruit?

Sadly, we didn’t have any type of validation, so people couldn’t pick from a list nor did we have some sort of autofill in place, so that field was completely free-form.

This meant that we would need to do some Data Cleansing in order to really understand the data.

When we received all the responses for our survey, the export looked like this:

image

So you can imagine that when we tried creating a chart out of it, it didn’t provide the insights that we were hoping for:

SNAGHTML2074e39f

As you can tell, it appears that we have some Data Quality issues, meaning that we need to clean our Data and this is a perfect scenario for a Merge with Fuzzy Matching.

Implementing Fuzzy Matching

Before we move forward, we need to set ourselves a goal. Our goal is to reach the following table:

image

Where we would create this new column called Fruit so we can use that column as the axis of our chart:

image

To make that happen, we’ll be using a Fuzzy Matching, but even before we try to do the merge operation, we need to have another table besides our original Survey table.

Valid Responses or Dictionary Table

We need to create or have a table with the values that should be “valid”. In other words, you could say that this is a table with the values that are correctly typed OR this is the Dictionary table and we’re trying to figure out if any values from the Survey results look similar to the ones in this Valid Responses or Dictionary Table.

We’re going to keep our table short for example purposes, but our table will look like this:

image

with that table in place, we can now move on to the next phase which is doing the Merge operation with the Fuzzy Matching

Merge Operation with Fuzzy Matching

In the event that you’re new to Merge or Joins within Power BI / Power Query, I highly recommend that you check out this page on what they are and what they can do for you.

With both of the previously mentioned tables (our Survey Results and the Dictionary table) inside the Power Query Editor, we can perform a new Merge Operation as a New Query taking the Survey Results as the base:

image

and in that Merge window we tick the checkbox to “Use fuzzy matching to perform the merge”.

We can simply hit OK now and the result will look like this:

image

and the next thing that we need to do is simply expand the Table values inside the Column Dictionary by clicking on that icon that has the two arrows going in separate directions. The result of that will look like this:

image

Which does look good at first glance! but we’re missing the value for “Coco” which should be “Coconut”.

It would be great if we could further know how the Fuzzy Matching algorithm works, but that’s only information that Microsoft knows, but there are some Fuzzy Options that we can play with to see if we can get better results.

Fuzzy Merge Options

Let’s go back to the Source Step of that query and click on the gear icon so we can modify the Merge operation. You probably noticed an Expand icon next to the “Fuzzy Merge Options” in one of the previous screenshots. When you click on it, this is what you get:

image

Let’s do a roundup of the options that we see and what they do:

  • Similarity threshold – this can be a number from 0 to 1, Which translates into how closely similar the text strings should match in order to show in the output table. By default is set up to 0.8, which is simply a way to say 80% similar.
  • Ignore case – if checked, then the algorithm will ignore if the letters are in uppercase or lowercase and will not take the casing into consideration.
  • Match by combining text parts – You probably saw the example that I had on “Watermelon”, which was typed incorrectly as “water melon”. What this option does is that in the event that there’s a space between text strings, it’ll combine both text strings into one. If this didn’t happen, then the value would’ve been merged into “Melon” or “Water” instead, but since this is enabled it ended up as “Watermelon”.
  • Maximum number of matches – in some cases, you might want to limit the number of matches that you get depending on the “similarity threshold” that you’ve defined. This is the option where you can define how many matches you want, which can be a number from 1 to a REALLY high number (2147483647)

After playing with these options, I ended up with this set up:

image

but I still couldn’t get the result that I was expecting for the “Coco” value. It was still giving me a null.

And this is where the last option, Transformation table, comes in.

Transformation Table

imagine a scenario where there’s simply no way to do a “close match” and you need to have some sort of explicitly tell the system that you want “Coco” to be referred as “Coconut”. This is where the Transformation Table comes in.

It’s like a sort of Translation table – or it could also be considered a mapping table.

This table does have some requirements, but they are pretty simple and those are that it needs 1 column by the name of From and another one by the name of To. Doesn’t matter if you’re in the Spanish version of the Power Query Editor, those NEED to be the names of the columns, otherwise this won’t work.

This is how my Transform table looks like:

image

Essentially, the Fuzzy Matching will look for the values from the “From” column and replace them with the value that we see on the “To” column. This is a explicit match or “Mapping”.

We then supply that table to the Fuzzy Matching options like this:

image

and this one looks promising as it does show that there are 10 out of 10 matches!

When I go back to the last step, I see this:

image

Exactly the output that we were going for!

At this point we can simply close & Load and create the chart that we’re looking for.

Advanced Fuzzy Matching Options

While we can get pretty much everything that we need through the UI, there are some hidden gems inside the code that we use.

This new experience uses a new function by the name of: Table.FuzzyNestedJoin

Here’s the documentation found within that function:

image

We’ve already seen most of them except the 2 that are highlighted.

This is what the rest of the documentation reads:

image

The one that drives my attention the most is the “ConcurrentRequests”, meaning that we can define how many threads this operation can use in the event that we need more power.

I haven’t had a situation where I’ve needed to define the culture yet, but I can see that this might be beneficial for something like a really long text string for fuzzy matching.

I’d love to know if you’ve ever tried using any of these advanced options.

Power BIPower Query
Subscribe
Notify of
guest
26 Comentario
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Newa Nac

Thanks for this Miguel, I have a query, I’ve downloaded the latest version of powerquery (2.59.5135.201) /powerbi desktop (2.67.5404.981) but I cant seem to find the fuzzy match option in left outerjoin. May i know what i have been doing wrong? Thanks

Joyce

Thanks for the good information! I have 2 queries each is about 300 row. When I tried to do the fuzzy match the query runs really slow (took up to 1 hour) to match. Any idea why? I’m really new to this.

Swapnil

Hi. I tried to use this fuzzy match option, but after selecting fuzzy match the “ok” button at the bottom disappears. I tried to install power bi desktop again, but i get the same thing, no ok button after selecting fuzzy match option.

Joaquin

Hi Miguel,

Thanks for the article. It’s super helpful! Have you used this tool for a big data set? One of the advantages of using Power Query is the ability of merging connections to documents that exceed the 1.2M rows etc so I wonder how good will perform in that scenario. Thanks

Joaquin

I’ve tested a fuzzy merge between two “not that big tables” (100k vs 250k rows) using the default threshold and didn’t perform well..eventually after 40-45 min found the matches but when I tried to bring the output in it failed… on top of that it doesn’t show you the score per match right? With the Fuzzy Lookup add-in the output shows you the score for all the matches found over the threshold which can be very convenient in case you want to review a certain set manually. Keep us posted if you test it deeper! Many Thanks

Joaquin

Thank you Miguel. Sure, I will do that. Apart from the frown button do you know any other way I can reach them directly? I guess they’d have tested this on a wider spectrum so would be great to get some insights.

Miko

Hey, I’m having the same issue with Fuzzy matches. One table is only 2600 rows, the other is 450k. I gave up after 30 min of loading.

I was thinking that it could be that the process is so complicated that it runs on the F-Engine which is a single thread processing machine instead of the Storage Engine which is multi-threaded?

Lenzy Petty

I have the same issue regarding slowness. Fuzzy matching matched very well, but was so extremely so that it was not worth it.

General Ledger

Can you provide more details about Similarity threshold. By default it is set to 0.8, which is to say 80% similar. How does it measure “similar”?

Does order matter? At 80%, would ABCD match DCBA. They are the same letters but in different order.

Is it saying 80% of the letters are the same? So ABCDEFGHIJ would match GHZABCZDEFZ because 8 of the 10 letters are the same.

Srikanth

Hello, Can someone help me with which algorithm does Power BI to perform fuzzy match? I would like to replicate that algorithm on data base side. Please help me with this.
Thanks

Ank

Hi Miguel

Great post, I am facing an urgent requirement wherein I am required to provide the match% as slicer.

This will help the user to navigate results by adjusting the match percentage without going to queries.

Your extremely urgent support is appreciated.

Regards
Ank

Chris

I’ve tried to use Fuzzy Merge today, but it’s not working. My datasets are 92K vs 4.5M rows. It has been running for over 12 hours now without any results yet. In all the videos and examples for the Fuzzy Merge, they test it always with 5-10 lines of data, I know the reason now. In the real life you won’t have tables with 10 rows, but millions of them. So this feature on Power BI is pretty useless right now.

jomanaes

Hi Miguel, thanks for this post!

Quick question, is there a way to do Fuzzy Grouping (i.e. doing the string similarity on the same column of strings to consolidate it).

What I am trying to do is basically consolidating just the one table without joining it with another table?

Mahmoud

thank you! I want to merge 2 tables based on the information provided in 2columns. The first column will be a Fuzzy match while the second column is exact match. Data as example here, Country should be exact match. My Table 1 database is huge with 2 mil item, I tried to use Fuzzy logic for both column, With a small Table 2, I can get the needed results but it is extremely slow. but when Table 2 is large like +50 lines it is not working.

Table 1
Inventory | Country | Price | Qty
Apple | US | $10 | 5000
Apple | MX | $8 | 6000
Apple | UK | $12 | 1000

Table 2
Orders | Country | Request
Aple | MX | 100
Appl | US | 20
Apples | US | 30

Result table
Orders | Country | Request | Inventory | Price | Qty
Aple | MX | 100 | Apple | $8 | 6000
Appl | US | 20 | Apple | $10 | 5000
Apples | US | 30 | Apple | $10 | 5000