Miguel Escobar Published February 7, 2021

Remove duplicates in a text string with Power Query / Power BI

Power BI

A few days ago I was checking social media to find interesting cases where Power Query could be a good solution. I found this scenario posted by Brian in a Facebook group here (url).

Imagine a scenario where you have a column with text values, but within each text string there are duplicates.

Take the previous image where from Column1 we want to:

  1. Find the substring that gets repeated in the cell
  2. Remove duplicates and keep only one instance of the substring that repeats itself

Effectively, we want what you see in the fxRemoveDuplicateString column.

In more clear terms, if we have a value such as:

this is a really long sentence this is a really long sentence this is a really long sentence this is a really long sentence

Then we want to pass that text string through a function that should output only the substring that repeats itself and starts from position 0:

this is a really long sentence

The function that does the magic

Note: The following section will be displaying a lot of advanced M language content.

First I’d like to share the custom function that I’ve created to solve this scenario and then I’ll go step by step as to what it does and why:

The function does exactly what we need. It tries to find the substring that gets repeated and, once it finds the best candidate, it removes the duplicates and keeps that single instance.

At first I tried to use the Column from example (url) option in Power Query, but this didn’t yield the result that I was expecting, so I had to create my own custom function.

Let’s step into the function to see how it works and why. You can download the sample file from the download button below:

Combinations

= List.Transform( {1.. Text.Length(Sample_Text)}, each Text.Start( Sample_Text,_))

The first step of the function is to get the total characters in the strings. That’s why I use the Text.Length function against the Sample_Text. Once I have that value, I create a list with the same length as the total characters in the text value.

For my example, I’m using the text this is a really long sentence this is a really long sentence this is a really long sentence as my sample text. This one has a lenght of 92 characters, so I create a list of 92 elements with the values from 1 through 92.

Then I go through each element in that list and apply the function Text.Start which brings me just the text string, but cut off for the number of characters that I pass in the second argument of the function which happen to be the element of that list that gets passed as the argument of the function.

Since we know that the start of the string will always start from the first character, the best idea that I came up with was to create this list of possible substrings and then test each one until I find the one that fits our logic. That’s why the first step just creates this list of potential or possible substrings that could become the output of this function.

Iterations and Total_Iterations

This is the step where the magic happens. Conceptually, what this step is doing is a while loop where I go element by element from the Combinations list to test which one would have the highest probability to be the one that’s being duplicated. I’m still trying to optimize this step as I only had around 40 minutes to come up and test this solution, but it’s looking good so far at the conceptual level.

The output of this step is simply a list with N elements which I later count at the Total_Iterations step.

Filtered_Possible_List

= List.Transform( Combinations, each Text.Split(Sample_Text, _){Total_Iterations})

The goal of the previous iteration was to achieve a list that started giving me errors and one element in the list that had a null, blank or empty value as you can see in the row 30 of the image below:

The reason why I’m trying to find the first iteration that starts giving me errors at the list level is because this tells me that those errors are not good candidates as they are not repeating themselves as much as the rest.

Table_From_Possible

I then create a new table from the Filtered_Possible_List and the Combinations lists using the Table.FromColumns function.

= Table.FromColumns({ Filtered_Possible_List, Combinations})

And this is when things start to make more sense visually.

Removed Errors

In this step I simply remove the errors from the field with the name Column1.

Filtered rows

Then I filter the table using the values from the field with the name Column1 to only keep the values that are either null, empty or with a blank space. These rows are effectively the best candidates as they were the ones that had an almost perfect or near perfect match for the substring.

Kept Last Rows and Custom3

In some cases I may end up with multiple candidates, so I added this step to always pick the last row from the table which, in theory, should always be the longest string or string with most characters:

And the last step of the function, called Custom3, simply navigates to the cell that contains the string thatt we need and that step is the output of the function:

Conclussion

This was a pretty interesting scenario and one that I can’t figure out how to solve without using some sort of while loop or some sort of recursive function. There’s a lot of room for improvement with this function and I hope to make those improvements at a later time.

Post your comments in the section below and let me know if you have any other interesting scenarios that I could tackle.

Power BI
Subscribe
Notify of
guest
14 Comentario
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Thu

Hi Miguel, it’s really interesting. Thank you for sharing this.

Ivan

Hello,
I have a question – if the substrings within the text string are separated with comma is it possible to change the custom function in order the commas to remain for the resulting substrings after the duplicates have been removed?
As I am new to Power Query I also would like to ask how I introduce the custom function in Power Query? I tried to add a custom column (i have repeating substrings in column 1 of my file) and put the custom function in the menu however the result is that the new custom column has “Function” written in every raw and I can not see the values.
Thanks in advance.

Kaz

Hello,Miguel
This is very interesting. I am not sure if my custom function works in any set of words.

(str as text) =>
let
  Source=str,
  trim = Text.Trim(Source),
  split = Text.Split(trim," "),
  NumOfWords = List.Count(split),
  ListOfWords = List.Distinct(split),
  WordPosition = List.Transform(split,(x)=>List.PositionOf(ListOfWords,x)),
  PossibleNumOfWords = List.Accumulate({1..NumOfWords},{},(state,current)=> if Number.Mod(NumOfWords,current)=0 then state & {current} else state),
  ReptOfFirstN = List.Transform(PossibleNumOfWords,(x)=> if List.Repeat(List.FirstN(WordPosition,x),NumOfWords/x)=WordPosition then x else null),
  Ans = Text.Combine(List.FirstN(split,List.Min(ReptOfFirstN))," ")
in
  Ans

Thank you

Last edited 2 years ago by Miguel Escobar
Kaz

Hello,Miguel.
Thank you for your reply.
I think fxRemoveDuplicateString(“AA AA AA AA“) should be “AA”,
but fxRemoveDuplicateString(“AA AA AA AA“) returns “AA AA“.
big guns big guns big guns big guns” is another set of words fxRemoveDuplicateString does not work well with.
It seems to me that fxRemoveDuplicateString does not work properly when
the number of repetition is an even one like 4,6,8,10 and so on.

Kaz

Thank you very much for your explanation.
I’m looking forward to getting a paperback “Master Your Data with Excel and Power BI” due in November.

james

Hello ,
How can I create repeated rows in power bi columns a number of times

For example,
A
B
C
D
A
B
C
D
.
.
.
Thank you

Radoslaw Szczerbinski

Hi Miguel,

Your function is working great. Unfortunately, I’m working with a set of data in which some of the rows do not have duplicates. In such a case when applying your function I’m receiving an error for rows with no duplicates. Is there any way to simply return the original value?

Kara Sabayo

Hello, i have a question, i have this dataset:
| 2 , 0 , , , 1.75 , 0.5 , 2 , 2 , 3.25
| 0.25 , , , ,
-47.5 , -25.5
-1.5 , 221.5 , -8.5

What I need to do is to extract the “numbers”, “space comma space”, and “-” only, how can i do this in power query?