A few days ago I was checking social media to find interesting cases where Power Query could be a good solution. I found this scenario posted by Brian in a Facebook group here (url).
Imagine a scenario where you have a column with text values, but within each text string there are duplicates.

Take the previous image where from Column1 we want to:
- Find the substring that gets repeated in the cell
- Remove duplicates and keep only one instance of the substring that repeats itself
Effectively, we want what you see in the fxRemoveDuplicateString column.
In more clear terms, if we have a value such as:
this is a really long sentence this is a really long sentence this is a really long sentence this is a really long sentence
Then we want to pass that text string through a function that should output only the substring that repeats itself and starts from position 0:
this is a really long sentence
The function that does the magic
Note: The following section will be displaying a lot of advanced M language content.
First I’d like to share the custom function that I’ve created to solve this scenario and then I’ll go step by step as to what it does and why:
The function does exactly what we need. It tries to find the substring that gets repeated and, once it finds the best candidate, it removes the duplicates and keeps that single instance.
At first I tried to use the Column from example (url) option in Power Query, but this didn’t yield the result that I was expecting, so I had to create my own custom function.
Let’s step into the function to see how it works and why. You can download the sample file from the download button below:
Combinations
= List.Transform( {1.. Text.Length(Sample_Text)}, each Text.Start( Sample_Text,_))
The first step of the function is to get the total characters in the strings. That’s why I use the Text.Length function against the Sample_Text. Once I have that value, I create a list with the same length as the total characters in the text value.
For my example, I’m using the text this is a really long sentence this is a really long sentence this is a really long sentence as my sample text. This one has a lenght of 92 characters, so I create a list of 92 elements with the values from 1 through 92.
Then I go through each element in that list and apply the function Text.Start which brings me just the text string, but cut off for the number of characters that I pass in the second argument of the function which happen to be the element of that list that gets passed as the argument of the function.
Since we know that the start of the string will always start from the first character, the best idea that I came up with was to create this list of possible substrings and then test each one until I find the one that fits our logic. That’s why the first step just creates this list of potential or possible substrings that could become the output of this function.

Iterations and Total_Iterations
This is the step where the magic happens. Conceptually, what this step is doing is a while loop where I go element by element from the Combinations list to test which one would have the highest probability to be the one that’s being duplicated. I’m still trying to optimize this step as I only had around 40 minutes to come up and test this solution, but it’s looking good so far at the conceptual level.
The output of this step is simply a list with N elements which I later count at the Total_Iterations step.
Filtered_Possible_List
= List.Transform( Combinations, each Text.Split(Sample_Text, _){Total_Iterations})
The goal of the previous iteration was to achieve a list that started giving me errors and one element in the list that had a null, blank or empty value as you can see in the row 30 of the image below:

The reason why I’m trying to find the first iteration that starts giving me errors at the list level is because this tells me that those errors are not good candidates as they are not repeating themselves as much as the rest.
Table_From_Possible
I then create a new table from the Filtered_Possible_List and the Combinations lists using the Table.FromColumns function.
= Table.FromColumns({ Filtered_Possible_List, Combinations})
And this is when things start to make more sense visually.

Removed Errors
In this step I simply remove the errors from the field with the name Column1.

Filtered rows
Then I filter the table using the values from the field with the name Column1 to only keep the values that are either null, empty or with a blank space. These rows are effectively the best candidates as they were the ones that had an almost perfect or near perfect match for the substring.

Kept Last Rows and Custom3
In some cases I may end up with multiple candidates, so I added this step to always pick the last row from the table which, in theory, should always be the longest string or string with most characters:

And the last step of the function, called Custom3, simply navigates to the cell that contains the string thatt we need and that step is the output of the function:

Conclussion
This was a pretty interesting scenario and one that I can’t figure out how to solve without using some sort of while loop or some sort of recursive function. There’s a lot of room for improvement with this function and I hope to make those improvements at a later time.
Post your comments in the section below and let me know if you have any other interesting scenarios that I could tackle.
Hi Miguel, it’s really interesting. Thank you for sharing this.
Hello,
I have a question – if the substrings within the text string are separated with comma is it possible to change the custom function in order the commas to remain for the resulting substrings after the duplicates have been removed?
As I am new to Power Query I also would like to ask how I introduce the custom function in Power Query? I tried to add a custom column (i have repeating substrings in column 1 of my file) and put the custom function in the menu however the result is that the new custom column has “Function” written in every raw and I can not see the values.
Thanks in advance.
hey!
It really depends. You could modify the custom function to accomplish exactly what you need and what sort of logic it should follow.
I’d recommend reading these articles to get more familiar with custom functions and parameters in Power Query:
https://docs.microsoft.com/en-us/power-query/custom-function
https://www.thepoweruser.com/2019/02/12/parameters-and-functions-power-bi-power-query/
https://www.thepoweruser.com/2019/02/19/parameters-and-functions-in-power-bi-power-query-custom-functions/
https://www.thepoweruser.com/2019/02/26/parameters-and-functions-in-power-bi-power-query-combine-files/
https://www.thepoweruser.com/2019/03/05/parameter-tables-in-power-bi-power-query/
In short, using the “invoke custom function” should be the easiest way to invoke the function against a column from your table. You can also download the sample file to look at how I added that in the “Invoked Custom Function” step of the “Sample_Table”.
Hello,Miguel
This is very interesting. I am not sure if my custom function works in any set of words.
Thank you
I think I remember another MVP creating a pattern based on List.Accumulate which ran much faster than my version. That function is really the key to it all.
Hello,Miguel.
Thank you for your reply.
I think fxRemoveDuplicateString(“AA AA AA AA“) should be “AA”,
but fxRemoveDuplicateString(“AA AA AA AA“) returns “AA AA“.
“big guns big guns big guns big guns” is another set of words fxRemoveDuplicateString does not work well with.
It seems to me that fxRemoveDuplicateString does not work properly when
the number of repetition is an even one like 4,6,8,10 and so on.
it really depends on the context and the logic that you’re after.
Imagine a scenario where someone sent you a text that looks like this:
“h iya iya iya iya”
You could argue that the person that sent you that text meant to say “hiya” OR you could also argue that the person was trying to say “hi iya” (because the receiver’s name is Iya and because they have an inside joke of an “Echo” when saying each other’s name) and that person forgot the initial ‘i’ in “hi”.
If you ran that text through the function you’ll immediately find out that he tried to do the echo inside joke with you because the output would be:
“h iya iya”
But if we run it through the one that you suggest, it would be “h iya”, which (if you run it through another series of text interpreter transforms) maybe it would be interpreted as “hiya”.
In that sense you’d need to come up with a logic that satisfies your specific need if you want the output to be “AA” instead of “AA AA”. But again, it all depends on your specific analysis, the context that you’ll be using to analyze your data and what you’re expecting to get and do with the output.
Hope this helps explaining the function and the context in which I wrote the article.
Thank you very much for your explanation.
I’m looking forward to getting a paperback “Master Your Data with Excel and Power BI” due in November.
Hello ,
How can I create repeated rows in power bi columns a number of times
For example,
A
B
C
D
A
B
C
D
.
.
.
Thank you
Hey!
I’m not entirely sure that what you’ve mentioned is related to what’s shown in the article. Could you clarify what specific aspect of the article are you referring?
If this is not related to the topic showcased in the article, I highly suggest that you post your question in a forum such as the official Power BI Community forum so you get the help form the community to achieve your desired solution.
Best!
Hi Miguel,
Your function is working great. Unfortunately, I’m working with a set of data in which some of the rows do not have duplicates. In such a case when applying your function I’m receiving an error for rows with no duplicates. Is there any way to simply return the original value?
Hey!
you could wrap the function a “try” statement and if it gives you an error then you’d just throw the original value.
The article below might help:
https://www.thepoweruser.com/2019/06/18/error-handling-iferror-in-power-bi-power-query/
Best!
Hello, i have a question, i have this dataset:
| 2 , 0 , , , 1.75 , 0.5 , 2 , 2 , 3.25
| 0.25 , , , ,
-47.5 , -25.5
-1.5 , 221.5 , -8.5
What I need to do is to extract the “numbers”, “space comma space”, and “-” only, how can i do this in power query?
Hey! I’m not sure I follow your question.
I’d recommend posting the question on:
Power Query – Microsoft Power BI Community
With as many details as possible.
Best!