Miguel Escobar Published September 4, 2017

Guide and Resources for Web Data Extraction with Power BI & Power Query

Power BIPower Query

With the recent release of the Power Query Extensibility Model or Power BI Custom Connectors, I’ve been thinking about posting an article about the status of Web Scraping with Power BI / Power Query, its limitations and how Custom Connectors are here to provide the functionality that was restricted or limited to us within Power BI / Power Query.

The goal of this blog post is to serve you as a guide on what you can accomplish with the native functionality of the ‘Web Data’ connector inside Power Query and when you should consider creating a Custom Connector.

The Native Web Data Connector

(The Web Connector inside the Get Data dialog of Power BI Desktop)

I’ll start with the native Web Connector that Power Query has inside Excel and the Power BI Desktop. With it, we have 3 main methods in order to get data from the web:

  1. Using the Web.Page functionality
  2. Reading the response differently (overriding import format)
  3. Creating the Web.Contents formula by hand

Let’s go ahead and see each of them and when to use them.

Method One: Using the Web.Page function

image

(Navigator window with the Web View enabled for a better UX)

This method relies on the end-user inputting a web url that gives back a html document. It  then scans through the html document to find table tags and expose them to the end-user. It even provides a simple UI in the preview pane to highlight the tables found and a list of such tables on the left.

Alternatively, If no table tags were found, you can select the ‘Document’ option

image

This is quite daunting and is referred to as digging down the rabbit hole. I do not recommend going this route unless you know a bit about HTML, otherwise it might drive you a bit crazy

Gil Raviv wrote a few articles (here and here) about doing Web Scraping this way, but each webpage is different, so knowledge about HTML is needed and sometimes, even JavaScript and knowledge about other languages is needed. You REALLY need to know what you’re looking for before you dive into Power Query in these type of scenarios.

image (the navigation step will show you how deep you’re into the rabbit hole)

Always read the fine print – reading the ToS of the website that you’re trying to query. before you start extracting the data, is a MUST. Some sites might blacklist your ip for multiple queries at once and some others offer other methods like REST APIs that are dedicated for these scenarios.

You might want to read these articles before you start Web Scraping  Article 1Article 2

One of the caveats of using this Web.Page function is that you’ll need to use the Data Gateway in Personal Mode if you want to refresh your file on the Power BI Service.

Method Two: Override the Import Format

image

(using the gear icon to redefine how Power Query reads the HTML document)

As Ken and I describe in our book, you have the ability to change how Power Query reads certain files.

If you see a gear icon next to a step, it means that you can configure the settings of that step, and in this case we can change this html document to be treated as a Text document instead of a Web Page.

With that in mind, you still need to know a bit about HTML, but not as much as you needed to with the previous method. Note that with the previous method you never got to see all of the contents of the html document at once – it was all through a hierarchy search, but with this approach you can see everything at once and apply filters and splitting columns as needed.

This method is quite useful when your html document uses other tags such as lists (ul /li) as you can do a simple filter and get all of the elements of a list in just 2 steps.

image

(exposing the HTML Document as Text and filtering by the list tags)

 

Method Three: Using the Web.Contents for GET / POST requests

image

(Things that work on your Power BI Desktop might not refresh correctly on the Service)

Yes – there’s disparity between the Desktop App and the Cloud service. This means that some things might work on your desktop app, but not on the Power BI Cloud Service.

Usually when you create a query against a Web Service, the response is a binary. This binary can be a json, xml or perhaps even a plain txt file and most of the time you only get a subset of the data that you want and not all of it. If you want all of it, you’d need to paginate or use some metadata for the request that could be found inside the response Header.

This is where things get a bit complicated as there’s quite a few limitations when trying to create a query / solution that works both on your client and the Power BI Service. Again, you might make it work inside the Power BI Desktop, but that doesn’t mean that it’ll completely work inside the Service.

Here’s a list of a few things that are known to cause issues:

  • Dynamic parameters in the url address
  • Authentication methods
  • Pagination
  • Formula Firewall
  • Request Limit and Throttling
  • Tokens expiration and inaccessible Headers

We can definitely solve a few of the problems listed above, like using the 2nd parameter of the Web.Contents function that provides us with more control to add Headers, RelativePath, Queries and even the Content of a POST request.

Chris Webb has already addressed some of these issues here and here. I even created a thorough video on how to do paginations with Power Query here, but even with those articles and videos at hand + the full documentation of the Web.Contents function here, you might not get some queries to work because of how Power Query’s refresh operation was designed to work on the Power BI Service and how some other features are limited like not being able to see the response header of your Web.Contents request when using the Value.Metadata function.

So, the question is, what happens if we hit one of those issues/limitations ? do we have any other alternative?  the response to that is yes – we do have an alternative and it’s called the Power Query Extensibility Model or simply Power BI Custom Connectors

A new ERA: the Power Query Extensibility Model or Power BI Custom Connectors

image

(Power BI Custom Connectors)

If you’ve hit any of the limitations / issues mentioned before, I have the solution for you: Power BI Custom Connectors

Custom Connectors bring completely new possibilities that were unthinkable with just Power Query / Power BI.

Data Connectors allow you to create new data sources, or customize and extend an existing source. Common use cases include:

  • Creating a business analyst friendly view for a REST API
  • Providing branding for a source that Power Query supports with an existing connector (such as an OData service, or ODBC driver)
  • Implementing an OAuth v2 authentication flow for a SaaS offering
  • Exposing a limited/filtered view over your data source to improve usability
  • Supporting different authentication modes when creating a Power BI Content Pack
  • Enabling Direct Query for a data source via an ODBC driver

You can define the urls however you want – no longer are you bound to create your queries using RelativePath, Header, Query or other fields. You can even use the Uri.Combine function to make things easier

Implement OAuth 2.0, API Key and other methods – you have complete control on how you want to set up your authentication method so your private keys and other sensitive data can be stored securely and not exposed to the end-user in the file

Pagination made easy – you can implement paginations in a really easy way. You can say that you could’ve done it with just Power Query, but the good thing about Custom Connectors is that they things like Privacy levels and credentials from the Data Source definition record (your custom connector) so you no longer have to define the privacy level for each call that you make

Begone Formula Firewall! – since you have control over your Data Source, you can test it to make sure that it won’t interfere with other queries in your report and you won’t get any Formula Firewall or Privacy Level issues

Request Limit, Throttling, Token expiration and response Headers– a big restriction with Power Query is that, by design, you are unable to see the headers of the response of your requests. You only get the body of the response, but the Header contains important information, like the next page url, total elements in the endpoint and more. Now with the Extensibility Model you are able to see exactly what the response Header provides and use the values to paginate or just create a query plan for your connector so you never get throttled.

I encourage you to go ahead and give the Power Query Extensibility Model or Power BI Custom Connectors a try. Click in the next button to check the documentation that Microsoft has available.

Power BIPower Query
Subscribe
Notify of
guest
4 Comentario
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Naveen Ajmera

Great article. Is there any custom connector for SAP Business One?

Miguel Angel Escobar

hey! thanks for the comment.

I haven’t seen anyone creating one yet and I’m not really THAT familiar with the SAP solutions. Currently Power BI has a native connector for SAP BW and SAP HANA, but if you want to create a new custom connector for another SAP source I’d recommend that you find either the documentation for the SAP API or, better yet, find an ODBC driver that you can use in order to connect to your database using Power BI and build on top of it with the Extensibility Model.

Frank Tonsen

My impression is that currently neither M extensions nor custom connectors are supported in the Power BI Service.

Miguel Angel Escobar

It depends. As an SaaS you’d need to create a custom connector and then submit that for:
https://powerbi.microsoft.com/es-es/documentation/powerbi-developer-content-pack-overview/
as an individual or company not affiliated to that SaaS, we’d be able to refresh a custom connector via a Data Gateway (probably in personal mode) by the end of this year and have the integration with the Office Store as well.