Miguel Escobar Published January 19, 2020

Incremental refresh for files in a Folder or SharePoint – Power BI

Power BI

A few years ago I posted my first impressions on Power BI Incremental refresh that was only made available for Power BI Premium licenses.

The whole idea of incremental refresh is superb:

  • You only refresh the data that you need – never again a FULL refresh needs to take place
  • It makes your data source work less – so you won’t be hitting your data source with expensive queries that take a long time to execute and load
  • Highway to insights! – you don’t have to wait those long hours just for your dataset to refresh

For the incremental to work as intended your data source has, to a certain degree, be able to handle query folding and most structured data sources like databases and OData are able to handle this with ease. How about files from a Folder either hosted locally or on SharePoint?

Note: Before you continue reading, I just wanted to say that this is an advanced topic on not only Power BI, but also on Power Query. If you or your company requires help setting this up, you can contact us at info@poweredsolutions.co for a quote on any type of assistance setting you up with the incremental refresh that works best for you.

The Scenario: Files in a SharePoint Folder

Imagine that we have Folder that has subfolders. One subfolder for every year and inside of those yearly folders we have the data for each of those years. It looks like this:

What we want to do is connect to this folder and then set up an incremental refresh logic, so we only refresh the data for the current year or, in other words, only get new data and keep the data from previous years to be intact and not be in the refresh process as it adds way too much processing / refresh time.

That means that while we’re in January of 2020, or any months in the year 2020 for that matter, we’ll only be refreshing the data for 2020. Once we get to the year 2021, we’ll start only refreshing the data for 2021 and the data for the year 2020 will be untouched.

In our case, we have Excel files in those folders, but we have SO many that it takes too much to refresh all of them.

The solution: taking advantage of Query Folding and Lazy Evaluation

Note: I highly recommend that you read these few articles if you intend to implement this on your own so you can understand what I’m doing:

What I need to do is quite straightforward. I need to use a Data Source function as a base and then create my custom function that will have the correct logic to drive the Incremental refresh with the RangeStart and RangeEnd parameters that the Power BI Incremental Refresh process requires.

We have 2 options in terms of Data Source functions. We can either use the “From Folder” (Folder.Files) data source function for local files or we can use the “SharePoint Files” (SharePoint.Files) for files hosted on SharePoint:

For this example, I’ll be using the SharePoint Folder connector / data source as I’m able to track what is going behind the scenes and if the incremental refresh is working correctly or not.

Step 1: Create the Custom Function

(StartDate as datetime, EndDate as datetime) =>
let
start = Number.From (Date.From(StartDate)),
end = Number.From(Date.From(EndDate))-1,
Years = List.Distinct( List.Transform( {start..end}, each Text.From( Date.Year(Date.From(_))))),
Source = SharePoint.Files("https://powerbipanama.sharepoint.com/sites/Services", [ApiVersion = 15]),
#"Filtered Rows" = Table.SelectRows(Source, each List.Contains( List.Transform( Years, (x)=> Text.Contains( [Folder Path],x) ), true)),
#"Added Custom" = Table.AddColumn(#"Filtered Rows", "Custom", each Excel.Workbook([Content], true)),
#"Removed Other Columns" = Table.SelectColumns(#"Added Custom",{"Custom"}),
#"Expanded Custom" = Table.ExpandTableColumn(#"Removed Other Columns", "Custom", {"Name", "Data", "Item", "Kind", "Hidden"}, {"Custom.Name", "Custom.Data", "Custom.Item", "Custom.Kind", "Custom.Hidden"}),
#"Custom Data" = Table.Combine(#"Expanded Custom"[Custom.Data]),
Schema = #table( type table [Date = date, ProductName = text, Territory = text, Sales = number, Code = text, Master Account = text], {})
in
try #"Custom Data" otherwise Schema

This is the custom function that I’ve created, but the part drives the magic is actually the Filtered Rows step:

#”Filtered Rows” = Table.SelectRows(Source, each List.Contains( List.Transform( Years, (x)=> Text.Contains( [Folder Path],x) ), true))

It is also super important that the names of the Folders are correctly entered and that the setup that we have for the naming convention of the folders is taken into consideration.

You could change this to your specific needs, and I’ve played with sub folders for the months and created other scenarios where there is an incremental refresh logic specifically for the months.

Again, you can customize this to your needs and all you need to know is a bit of the M language.

Step 2: Create the Parameters for Incremental Refresh

This part is quite straightforward. You just need to go through the Parameters window to add new parameters as I described in my first impressions article for the incremental refresh on Power BI.

Step 3: Using your new Custom Function

Now you just need to invoke that custom function. In my case my Power Query window looks like this:

For the sake of testing, just to make sure that we’re only getting the data that we want, let’s change the RangeEnd parameter value to be 2/1/2019 (February 1st, 2019). Now go into the Tools Menu (enable it from the preview features if you don’t have it yet) and select the option for Diagnose Step:

Once you click on it, refresh the query which will create a new set of queries that you’ll see in the queries pane on the left:

Go to the one that reads “Detailed” in its name. In there, go to the “Data Source Query” field and you’ll see all the calls that were made to the SharePoint API. In my case, I only see that data for the year 2018 and 2019 was requested from the API. Not the one from the year 2020 which is exactly the intended behavior that I was looking for:

For the sake of cleansing, let’s delete all of these new queries that were created by the diagnostics tool and also change the RangeEnd to be today’s date. In my case, I’m writing this blog post on the 18th of January. To me, it looks like this:

And all you have to do now is just hit “Close and Load” to load your data into your Data Model. This initial refresh will take a while, but we’re just a few steps from having incremental refresh for our files.

Step 4: Set up the Incremental Refresh Policy

Now that our data is in our Data Model, we need to set up the incremental refresh for that specific table / query. Right click on the query and select the incremental refresh option:

Once inside the incremental refresh window, I need to set it up to store only the data for the last 2 years and then only refresh the rows in the last (current) year.

You can absolutely change this and customize it to your needs, but just make sure that you also customize your function and your setup on your folders.

At this point, you’ve finalized everything that has to happen on the Power BI Desktop and you need to now publish your file to the service.

You can follow the steps found on the official Power BI documentation here (url).

Testing our Incremental refresh

Once the file has been published, go ahead and do the initial refresh of the file. I used a pretty small dataset so the refresh took just a few seconds:

The way that I can verify that everything is working as intended is by using the XMLA endpoint for this workspace/dataset and see how the partitions were created.

Checking from SQL Server Management Studio, I can see that all 3 partitions were created:

Don’t be fooled by the number of rows. I’m using the same files in each folder, but the column that we should be looking at is the  “Last Processed”.  That’s how the partitions look after their initial creation.

Now, I waited a few minutes and then refreshsed the dataset a few more times and check how the partitions look now:

IT WORKS!

Check how the values for the partitions on year 2018 and 2019 are exactly the same, but the timestamp for the 2020 partition has changed.

Final thoughts

This pattern can be customized either for a any repository of files such as SharePoint or a local folder.

Power BI
5 1 vote
Article Rating
Subscribe
Notify of
guest
45 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Tomislav Mališ
Tomislav Mališ
1 year ago

Does this work under Pro licence or only under Premium?

Roberto
Roberto
1 year ago

That stuff is just great. Thx.
Do I require a premium capacity to make it work?

Vicencio Merida
Vicencio Merida
1 year ago

Hi Miguel,

Do you have the same example using .csv files?

Matan Baruch
Matan Baruch
1 year ago

Hi Miguel,

Is this method should work now on the Power BI pro with February update?

Andrew McDonald
Andrew McDonald
1 year ago

Hi Miguel,

Query folding is not available for folder data sources (local, SP etc.)
Microsoft warns that incremental refresh without query folding can result in worse performance.
Your example demonstrates how to use incremental refresh against a SP folder and confirms that only the “incremental” data is updated in the model. However without query folding it is still possible (likely ?) that _all_ data is being pulled from the source and then partitioned locally on the server, before updating only the recent partition.
As all data is still being pulled the refresh itself _might_ still run slowly or consume a lot of resource.
Have you actually confirmed that the refresh performance is improved, not just that only the recent data is updated ?

This
https://ideas.powerbi.com/forums/265200-power-bi-ideas/suggestions/15387612-refresh-only-new-data-or-modified-file-in-folder

GMF
GMF
1 year ago

What are the limits for incremental refresh – for example, I assume incremental refresh only works when the data or transformations for each segment do not cross the file boundaries? If we are tracking lease payments and the files have monthly data, we need to know a start and end date for the lease. If we do a Group By or some other transformation, do we still have to do a full data load so that the proper dates are recorded?

Nazim Hussain
Nazim Hussain
1 year ago

Miguel, What abt your forthcoming book. You said earlier that it be out by march 2020. Now march is just round the corner. Is the book finally getting out or still there will be some delay or the idea has been shelved completely. Please do reply.

trackback
1 year ago

[…] a little thought you can make it do all kinds of other interesting things; Miguel Escobar’s recent blog post on how to use incremental refresh for files in a folder is a great example of this. In this post […]

Lenzy Petty
Lenzy Petty
1 year ago

Regarding SharePoint “Lists”, Can incremental Refresh be performed on a SharePoint Lists, based on the “Modified Date” of an Item (row) in the list?

Felipe
Felipe
1 year ago

Hi Miguel.
Really well explained articule.
I got a question regarding partitions.
Let´s say my data source have only 3 records, one for 2018, other for 2019 and one more for 2020.
Let´s also say I set incremental refresh policy to store the last 3 years and to refresh only the last year.
Once publish to power bi service, it will create 3 partirions (right?): 2018, 2019 and 2020.

Let´s say the data source change and the 2018 record gets a new time stamp, say, 2019. I run a full refresh in SSMS.
Would happen with the partitions after the full refresh? would it recreate them? (2019 and 2020) or would power bi keep the old ones (2018, 2019 and 2020).

Thanks!

Gustavo
1 year ago

Hey Miguel! Thanks for sharing!
I`ve been testing on my side.
It doesn’t seems works on Power BI Pro, I couldn’t verify any timing difference on Scheduled Refresh. Another behavior that see is schema is working on Dataflows and not working on Power BI Desktop Version:2.79.5768.1082 (20.03) (x64).
Do you plan to make some tests using a Pro Account?

Gustavo
Reply to  Miguel Escobar
1 year ago

Thank you! II’ll keep trying.
Just to clarify some points
I’m using Dataflows just to validate the M query.
The first refresh is 20 min, for all other(8 in total) is about the same average.

Hu Wei
Hu Wei
1 year ago

Thanks a lot for the tips and guide. I manage to setup the M code to query on Month folder which stored on my shared drive and have been using it since March. However I notice that the month change to April, it still pointing to month March folder. To troubleshoot this, I manually set the range start and range end on my pbix files to April and it manage to return all files on April. However, once I publish to PB Service, it no longer works. Is there anything wrong with my setup? Could you please help me?

Aaron F
Aaron F
1 year ago

Hi Miguel,

I took the all PQ Academy courses, I wanted to thank you for that, they are a huge resource. I had a little different scenario I wanted to try that is only refresh from files in a folder that have not already been appended to a data model. In other words I am connecting to a folder as a data source but periodically new files are deposited in that folder so I only want to appended those files that have not been appended before. I figured this maybe possible by running a preliminary query to get all the files in a folder and compare that with a list of unique file names from the data source field in the appended query. The diffculty though is only passing the missing files to the refresh query. Does that seem feasible?

Srikanth Kasamolu
Srikanth Kasamolu
1 year ago

Hi Miguel Escobar,

I implemented the solution you provided and its working great. I’m trying to add FileName as a column, how do I achieve this ?

Srikanth Kasamolu
Srikanth Kasamolu
Reply to  Miguel Escobar
1 year ago

Figured it out. Thanks!!

I’m trying to implement this solution using Dataflow but seems like Dataflow is looking for column to set up Incremental load, any ideas on how to set it up?

Anton
1 year ago

Great content!

I used this to try and implement a slightly different solution however it is not working 🙁

I am calling a web source (API) and need to have data for the last 1 years. Obviously incremental refresh will be a great option so that I don’t have to pull 12 months of data on every refresh.

I created the RangeStart and RangeEnd parameters, included them in the web url from and to criteria. I had to create a function to deal with pagination as well so I loop the url for each page of data. I enabled the incremental refresh to store data for the 12 months and refresh for the last 3 days. All looked well in the Desktop version after saving and applying (latest May 2020 release). I published it to the service but when I hit refresh it failed with missing column errors. I suspected that for older dates some columns might not have data so I opened the original desktop file again to troubleshoot. Before I changed the RangeStart date to a recent date just so that I don’t pull thousand of record to the desktop version but I couldn’t apply changes, I receive an ODBC error. The only way to fix it was to change the RangeStart and RangeEnd parameter type to Date from Date/Time which broke the incremental refresh functionality and automatically disabled it which enabled me to apply the changes again. I am not sure if this is an issue with the latest May release or if there are underlying issues using this option for a web source. So I have two issues, one the incremental refresh implemented with a web source URL is not working in the service and I am unable to make changes tot he desktop file once incremental updates are turned on.

Flavio
Flavio
1 year ago

Miguel

I’m still trying to understand and reproduce your approach. I saw on your function you are using StartDate and EndDate as input variables, shouldn’t those inputs be RangeStart and RangeEnd, which are the Incremental Refresh standard parameters?

I also have been trying to figure out how to apply a similar approach but look for “Date Modified” attribute of the files and apply Incremental Refresh policies on those before applying the Table.AddColumn and Table.ExpandTableColumn. Date Modified or Date Created are valuable attributes that should be used by PowerQuery/Dataflow when running this sort of method.

Liyanis Velazquez
Liyanis Velazquez
1 year ago

Muchas Gracias Miguel por la explicación de este artículo! Hice la implementación pero no me funcionó, en Power BI Desktop todo está OK pero al actualizarlo en la WEB todos los valores de mi reporte estaban alterados, el desarrollo fue con una cuenta Pro, el origen de datos es el sharepoint y son archivos csv. Le agradezco si me da algún consejo de como validar en el Servicio si se crearon bien las particiones y que pudo haber ocurrido.

Liyanis Velazquez
Liyanis Velazquez
Reply to  Miguel Escobar
1 year ago

Muchas Gracias Miguel!!! Lo haré así y le comento si me funciona. He leído todo sus artículos sobre la actualización incremental y no logro identificar porqué en mi reporte cuando actualizo en Power BI Service los datos se corrompen, en mi caso tengo un modelo dimensional y solo le apliqué a las tablas de hecho la solución que publicó a las tablas de dimensiones no, es decir tengo tablas que le definí actualización incremental y a otras no en mi modelo de datos, podría esto ser el motivo de porqué no me funciona bien al actualizarse? De antemano Muchas Gracias por su respuesta

Jonathan Fajardo
Jonathan Fajardo
9 months ago

Hola
Estoy teniendo problemas para actualizar mis datos que vienen de varios archivos de excel que descargo de un sharepoint y se agregan al final en una base combinada.
He puesto las restricciones de los paremtros en tanto las consultas madres como la consulta combinada, he impedido que se cargue al power bi las consultas madres y solo se estaria actualizado la combinada con las condiciones del incremental refresh. En mi desktop se actualiza muy bien pero en el service me trae el error que no encuentra una columna.

Data source error: {“error”:{“code”:”ModelRefresh_ShortMessage_ProcessingError”,”pbi.error”:{“code”:”ModelRefresh_ShortMessage_ProcessingError”,”parameters”:{},”details”:[{“code”:”Message”,”detail”:{“type”:1,”value”:”The column ‘Sales Quantity’ of the table wasn’t found.”}}],”exceptionCulprit”:1}}}

Alguna idea de que estoy haciendo mal?

Chathura W
Chathura W
9 months ago

I am writing this question for looking some help.

Currently I have add FTP folder (or Local folder) (containing multiple excel files) as my power BI data source under the combine and load option. When running it will combine all the excel files and load to Power BI.

But I am looking to set above for incremental data loading. What I need to do is only load the newly add excel files from last data loading. Finally my Power BI application should have (all last loading data + newly added excel data)

When I search on the internet I saw there are many for direct sources. But I couldn’t find a solution for the non-direct sources such as FTP/excel.

Do you have any idea on above? Could you please help?

Michael
Michael
3 months ago

Hi Miguel,
thanks for this great idea. What ist the last part of your function for? Why do you have to try if the table.combine works ? Is this a precaution, or neccesary step?

Sonnad
Sonnad
3 months ago

We are now in the midst of going Live with multiple reports for multiple team and we are kind of thinking what’s the best way to implement the logic of incremental refresh within Dataflows when combining historical data ( via sharepoint file) and daily load data via API calls .

Here are some of the key configurations from our side :
a. We are currently on power bi premium subscription
b. We would need to use dataflows (as all transformation logic is stored in dataflows as well) to setup incremental refresh , as this dataflow would be used by multiple reports within individual workspaces catered to end users
c. About data –
a. Historical data : historical (one off activity) data is delivered by our DWH vendor through sftp which we are storing into SharePoint online and pointing our Power BI to this location
b. Our daily data is to be loaded using API calls from our DWH vendor location
c. Key thing to also note is that Daily data is only available on weekdays only (Mon – Fri) So basically run the refresh on Monday to get last friday’s close data
d. Created a dataflow with two entities ( historical and Daily load)

With all the above information – here are some of the things i have tried the below two options and noticed few issues in these . Reaching out to you to hear any better options to set this up .

OPTION 1 – Set Daily refesh for last 4 or 5 days and append to historical table , so last 5 days data is kept updated incase of any backdataed transactions . Append the daily load to Historical load . 1st run works fine , then from Day 2 load – Day 1 data goes missing and Day 4 gets added .

OPTION 2 – Refresh daily load and append to historical table – This option works well but in this case , i would need to pass a date parameter to pick data from DWH for the last Friday’s close and again , if i need to refresh twice a day (AM/ PM ) then this might not work and cause duplicate data when appending to history table
please note – we are using Anonymous connection to connect to Web API

Any help would be appreciated

LJ
LJ
3 months ago

Got a question on incremental refresh. Currently set it up and published. However the file refreshes every hour based on one drive refresh. One drive refresh was set up before I did the incremental refresh setup on the query. Will the incremental refresh still work even with a one drive refresh?

Andrew McGregor
Andrew McGregor
2 months ago

Am I missing some information here?

The only code I see for step 1 is one line highlighted in yellow and beginning #”Filtered Rows” =

Anup Patil
Anup Patil
2 months ago

Hello, Somehow the custom function is not visible now. It was previously. Can you please help post it so that we can look to carry the steps