Why is it so complex to connect to an API with Power BI or Power Query? The OAuth explained
It’s easier to explain using analogies, so let’s go with that!
Analogy: Imagine that you have a gym subscription and this gym also has a VIP lounge, but to get to the lounge you need to head to the front-desk to get authorization for a limited time.
Using MailChimp as an example:
- The gym is actually MailChimp – they put all the facilities for you
- The lounge section is their API and its where you can “access” your data
- The front-desk authorization is what we know as a Authorization Token, which grants us the chance to “be” in the API
I’m overly simplifying things here, but the main idea is that the “vip lounge” is the place that the “gym” gives you to do what they give you access to do. They (the gym) could give you access to read, write, delete or modify “data”.
Now, here’s the interesting part. Inside the VIP lounge, there are a number of doors that lead to other “places”, like a locker room, massage room, cafeteria and other facilities. These are what we call “endpoints” – places inside the actual “VIP lounge” where you can do specific actions.
Now, in the more technical sense. The “VIP lounge” address can be “https://us10.api.mailchimp.com/3.0”, but the exact address for that massage room inside the cafeteria could look like this “https://us10.api.mailchimp.com/3.0/massage-room” – so if we wanted to get straight into the massage room through Power BI or Power Query, we could create a web query to that specific url and inside the Header we would add the Authorization Token that we got from the front-desk.
You’re probably wondering, how do I get my Authorization Token from the Front Desk? Well, depending on the API service, you can simply use something called an API Key or use OAuth. We’ll go with OAuth in this case and, in order to get the authorization, you need to be authenticated through the Web Form that the gym gives you AND also be able to handle the response that they give you in the format and shape that they give it to you, which is different in every gym. You could do this part outside of Power Query and the Power BI Desktop, in your own browser, and then paste the Authorization Token as a new parameter in Power Query, that would work, but what happens when that Authorization Token expires and you need a new one?
So, how do I make this automatic? How can I get new Authorization Tokens on demand? – well, I forgot to tell you that besides giving you an Authorization Token, the Front-Desk usually gives you something called a “refresh token” so that once you’re inside the VIP lounge you can get a new Authorization Token (and a new Refresh token as well) right within the VIP lounge.
Here’s the full process that you need to go through in a normal API:
- Get to the front-desk and ask for authorization and refresh token through the web-form where you need to enter your credentials
- Head over to the specific endpoint inside the “VIP Lounge”
- Submit a GET or POST request (if you’re at the Cafeteria inside the VIP lounge, basically you’ll be requesting a beverage)
- GET = request data to be sent to you only
- POST = send data to the service and expect a response from the service
What are the limitations of Power BI Desktop & Power Query as of January 2017? the only limitation right now is the web-form in OAuth 2.0 and how we can’t set up that workflow. Now, we as end-users can’t do it, but the Power Query team can do it because they have access to other extensibility tools and that’s how new connectors for MailChimp, Google Analytics and such are being created.
We might be able to have the ability to create our own connectors at some point and I can see that being something amazing for Power Query in SSIS, for example, where you can create your own connector using Power Query and use the result of that query as a data source in SSIS. It’s just a matter of how or if the MSFT team can create a toolkit for developers and advanced Power Query users to create their own connector.
You can read more about the RESTful API here.
Using one of the API Connectors inside the Power BI Desktop
This blog post will cover how you can optimize your queries with the connectors available in the Power BI Desktop, but in part 2 I’ll show you how you can connect directly to the Youtube Analytics for Partners so you can extract your daily ad-revenue, views and more.
I’ll be using the MailChimp connector as an example, so before you begin, and in case you want to follow along, you’ll need the following tools:
Head over to the ‘Get Data’ experience inside Power BI Desktop and select “MailChimp” from the available connectors. You’ll be prompted with a window where you’ll need to enter your credentials. This is the OAuth 2.0 happening right in front of your eyes. Next thing that you’ll see is a window called “Navigator” where you can see all the options that you available from the connector:
You might notice that there appear to be 2 folders and then 2 functions:
- Campaigns Folderthis folder contains pre-defined queries created by the MSFT team that will get you data related to the campaigns. You can take them as ready-to-consume API GET requests
- Lists Folderthis folder contains pre-defined queries created by the MSFT team that will get you data related to the Lists. You can take them as ready-to-consume API GET requests as well
- MailChimp.Collection functionInstead of a pre-defined or pre-packaged query, this is just a simple function that has the OAuth workflow embedded so you can create your own GET, POST or any kind of requests that you need to do. The only input needed is the name of the MailChimp Collection, for example, Lists or Reports
- MailChimp.Instance function similar to the MailChimp.Collection function, but in this one you can point towards an specific element of the API, like the email activity endpoint here
Go ahead and select the Campaign Summary option from the Campaigns Folder:
this query will give you a table with all the campaigns for every lists that you have in your MailChimp Account. I want to focus on just one list as this report will be focused on the English side of my business instead of the Spanish one, so I can filter this table by list ID and that way only get campaigns from that specific list. I could even go further and only select. In the end, I’ll use this table as a Dimension table in my Data Model and I’d also be using this table to make other calls to the MailChimp API Service.
After doing a few clean ups, I finally ended with a simple Campaigns table with only 2 columns. The ID of the campaign and the timestamp when it was first sent:
I’m going to reference that Campaigns query and then try to add a new column using the MailChimp.Instance function to make my own GET request to the email activity endpoint. When I try that, I immediately get an error like this:
(a formula firewall – we can’t combine 2 different data sources this way if their privacy levels are different)
Fortunately, there are 2 ways to fix this:
- Head over to the File Menu > Options and Settings > Options > Privacy tab> “Ignore the Privacy Levels and potentially improve performance”
- Create everything in just 1 query and then simply create a duplicate or a reference and delete the duplicates
Before you make a choice….
Here’s a few things that you need to take in consideration:
The Power BI Desktops comes with “Parallel Loading of Tables” enabled in his options. This is mainly for query folding against relational databases, but in our case this might impact us in a really bad way.
Why would it affect me? well, imagine that you have a query that makes 20 calls to the service. You reference this query and, since this is running in parallel and not in series, you are now making 40 calls to the service. What if in this referenced query you need to also make a # of new requests? then you’d need to multiply the 20 calls of your current query times the new calls that you’ll be making per row. This would be sure to time you out OR, even worst, reach your api calls quota limit in no time before even putting your report to use.
Even when you disable this Parallel Loading options, you sometimes get parallel processing in your queries which might cause you to make undesired calls to the API service and this is why you need to use Fiddler to monitor your API calls and check if your solution is working as intended.
My recommendation would be to leave the Parallel Loading of Tables enabled and simply make all the calls needed in just 1 query so you don’t get hit by the Parallel processing running the same query multiple times
You should really check out this thread from someone that had some issues when dealing with Web API requests and the response from MSFT.
In short, here a list of some recommendations:
- Use Function.InvokeAfter to add a timer to your calls – something among the lines of make a call every 2 seconds so you don’t get timed out by multiple connections at the same time
- Caching and Lazy evaluation on the response from the server – use Binary.Buffer() against the Binaries that you get from the service so we get a local copy of the binary and not repeat the same call over and over again
- Use Fiddler or the service log to see how many calls you’re making – try to see how much time you spend on the handshake vs actually getting the data. Throttling can be a performance issue, too many calls at the same time and you might get timed out as well. This would require you to check your query plan
After spending quite a few weeks with the MailChimp API, I’ve reached the conclusion that the way to go would be to create all your calls within one query and then create new tables in your Data Model using DAX
Click here to check the Part 2 of this blog post on how to connect to the YouTube Analytics API with a Power BI Custom Connector (which uses OAuth 2.0 as an authentication method).
Update May-19-2018: Power BI Custom Connectors are now possible! You can create your own connector and connect to any data source of your choice using authentications like OAuth 2.0 – more information here