The latest version of Power BI Desktop came out with a new Web Connector specifically designed for Web Scraping Scenarios. in this blog post I’ll try to go deep into how this new experience works and how you can take advantage of it.
Before we move forward, you’re gonna need the latest version of Power BI Desktop (May 2018 for me) and also enable the Preview feature in the Options window:
My Scenario: Get Data From Amazon
I want to find out how many books are out there that have anything to do with Power Query. I want the reviews that go with them, the name of the authors, when they were released and, of course, the names of these books.
The best place to find this information is probably Amazon. So I went on Amazon and did a quick search using the keywords “power query”.
I then grabbed the url that it created for me, head over to Power BI Desktop, chose the “Web Connector”:
and went through the usual experience…except for one new button called “Extract Table Using Example”:
Clicking on it should take you to the new connector experience which looks like this:
On the upper side you get a preview of your webpage and underneath you can provide a sample of the data that you want to extract from that specific webpage. This is similar to the “Column From Example” experience that we know, but it works a little bit different under the hood.
I highly recommend that you expand this window to full screen mode so you can take the most advantage out of it.
Again, What I’d like to do is extract the following data from each book:
- Title
- Author(s)
- Total Reviews
- Average Rating
- Release Date
To do that, all I need to do is simply go through the webpage and provide examples for each of those columns:
The values in the darker font are the ones that I entered, while the ones in a lighter font are the ones that Power Query was able to come up with following that specific Pattern.
All we need to do now is hit the OK button and then we’ll be taken to the Query Editor to further play with our data.
Here’s one thing that caught my eye. We have some “sponsored” data which are basically ads. I don’t want any ads in my dataset, so how can I get rid of these ads? Well, you can do a simple “filter” to get rid of it inside of Power Query, or we can go under the hood and optimize this query to our needs
Figuring out what’s going on under hood
This new experience relies on 2 new functions and 1 new window (to input the examples). These 2 new functions are:
- Web.BrowserContents – returns a text snapshot of the html DOM for the specific url. It has an optional parameter that accepts a record with a field “WaitFor” so you can set up a timer or a trigger on when to exactly take the DOM snapshot. (Incredibly useful for pages that heavily rely on javascript)
- Html.Table – returns a table containing the results of running the specific CSS Selectors against the provided html. It has an optional parameter that accepts a record with the field “RowSelector” so you can specify how to find new rows in the html document. (can’t tell you how useful and convenient this is)
Let’s look at the code that Power Query created for me:
let
Source = Web.BrowserContents(“https://www.amazon.com/s?keywords=power+query”),
#”Extracted Table From Html” = Html.Table(Source, {{“Title”, “.s-access-detail-page”}, {“Author(s)”, “.a-col-right > .a-spacing-small > :nth-child(2)”}, {“Total Reviews”, “.a-col-right > :nth-child(2) > .a-span5 > DIV.a-row.a-spacing-mini:nth-child(1):nth-last-child(1) > .a-size-small”}, {“Average Rating”, “.a-col-right > :nth-child(2) > .a-span5 > DIV.a-row.a-spacing-mini:nth-child(1):nth-last-child(1) > SPAN”}, {“Release Date”, “.a-col-right > .a-spacing-small > :nth-child(1) > .a-size-small”}}, [RowSelector=”[id*=””result_””]”]),
#”Changed Type” = Table.TransformColumnTypes(#”Extracted Table From Html”,{{“Title”, type text}, {“Author(s)”, type text}, {“Total Reviews”, Int64.Type}, {“Average Rating”, type text}, {“Release Date”, type date}})
in
#”Changed Type”
Note: You can find a list of all the available css selectors here.
Let’s modify our code to support a more complex scenario. Let’s imagine that we need to wait 15 seconds before taking the DOM snapshot (for the Web.BrowserContents function) and we also need to make sure that we’re only extracting data from the book titles that are not sponsored (ads).
Adding a WaitFor optional Parameter
In theory, we should be able to simply change our first step to be like this:
= Web.BrowserContents(“https://www.amazon.com/s?keywords=power+query”,[WaitFor=#duration(0,0,0,15)])
and that should make Internet Explorer Chromium (through Power Query) wait 15 seconds before taking the snapshot. Unfortunately, when I tested this I couldn’t make it work (not even with the samples provided in the actual function), so I sent an email about this to the Power BI folks that handle this.
Since this is a preview feature, is understandable that some things might not work, but I wasn’t expecting to have problems with even the samples that they provided.
Either way, if this ends up working some day, this will be a CRUCIAL part of any web scrapping scenario inside of Power Query. You can also define other rules like “Wait for xyz element to load” before taking the snapshot.
UPDATE 14-May-2018 12:34pm : Just received some feedback from one of the PMs at Microsoft. The documentation is incorrect and the correct usage for the WaitFor record should be like the folowing
[WaitFor = [Selector = “div.foo”, Timeout = #duration(0,0,0,10)]]
Modifying the CSS Selectors created by Power Query
OK – so what is CSS and what are CSS Selectors? For the first question, you can find a full article about it here and for the second one, imagine that it’s a syntax so you can target or select a specific element.
Usually web pages follow a set of patterns to style and name things. For example, in a grid we can “color” each line or row using a specific “id” or “class” pattern, thus saving time so we don’t have to set up a specific color scheme to each individual line or row inside of a grid.
In most web browsers you can enable a developer tab or dock (inspect element) so you can see the elements of your html page. In my case, I’m using Chrome and I can go through the code to see what’s going on and I noticed that Amazon uses a list to display their products and within each of those lines there’s a pattern for the id. Notice how each of them start with the string “result_”, but here’s the thing – even the “Sponsored” products have that pattern for their id.
What sets them apart? Well,each line also has a class and they all have almost the same text string except that the sponsored products have “AdHolder” near the end of them:
Power Query uses something called a “Row Selector” which is basically a record or field of a record that tells Power Query that every time that it finds a certain pattern, to put that data inside a new row.
Now let’s go ahead and take a look at the “RowSelector” record that Power Query automatically created for us:
let
Source = Web.BrowserContents(“https://www.amazon.com/s?keywords=power+query”),
#”Extracted Table From Html” = Html.Table(Source, {{“Title”, “.s-access-detail-page”}, {“Author(s)”, “.a-col-right > .a-spacing-small > :nth-child(2)”}, {“Total Reviews”, “.a-col-right > :nth-child(2) > .a-span5 > DIV.a-row.a-spacing-mini:nth-child(1):nth-last-child(1) > .a-size-small”}, {“Average Rating”, “.a-col-right > :nth-child(2) > .a-span5 > DIV.a-row.a-spacing-mini:nth-child(1):nth-last-child(1) > SPAN”}, {“Release Date”, “.a-col-right > .a-spacing-small > :nth-child(1) > .a-size-small”}}, [RowSelector=“[id*=””result_””]”]),
#”Changed Type” = Table.TransformColumnTypes(#”Extracted Table From Html”,{{“Title”, type text}, {“Author(s)”, type text}, {“Total Reviews”, Int64.Type}, {“Average Rating”, type text}, {“Release Date”, type date}})
in
#”Changed Type”
The specific CSS Selector that is using is this one and the definition states that:
“The [attribute*=value] selector matches every element whose attribute value containing a specified value.”
What’s happening is that every time that Power Query finds an element that satisfies that selector, Power Query will create a new row with the corresponding data. In simple terms, there should be as many rows as there are ids with that specific pattern of “result_”…BUT! we need to add another logical condition to that. We want those elements that have the “result_” pattern for their id, but we also want to not target or extract data from the classes that contain “AdHolder” and for that we’re going to replace that RowSelector with:
[RowSelector=“:not([class*=””AdHolder””])[id*=””result_””]”]
which explicitly does what we want. The result of that shows me this:
After modifying the CSS Selectors that Power Query initially created for me, I ended up with the following code:
let
Source = Web.BrowserContents(“https://www.amazon.com/s?keywords=power+query”),
#”Extracted Table From Html” = Html.Table(Source, {{“Title”, “.s-access-detail-page”}, {“Author(s)”, “.a-col-right > .a-spacing-small > :nth-child(2)”}, {“Total Reviews”, “div > div > div > div.a-fixed-left-grid-col.a-col-right > div:nth-child(2) > div.a-column.a-span5.a-span-last > div > a”}, {“Average Rating”, “[class*=””a-icon a-icon-star””]”}, {“Release Date”, “span.a-size-small.a-color-secondary”}}, [RowSelector=”:not([class*=””AdHolder””])[id*=””result_””]”]),
#”Added Index” = Table.AddIndexColumn(#”Extracted Table From Html”, “Index”, 1, 1),
#”Changed Type” = Table.TransformColumnTypes(#”Added Index”,{{“Title”, type text}, {“Author(s)”, type text}, {“Total Reviews”, Int64.Type}, {“Average Rating”, type text}, {“Release Date”, type date}}),
#”Split Column by Delimiter” = Table.SplitColumn(#”Changed Type”, “Average Rating”, Splitter.SplitTextByDelimiter(” “, QuoteStyle.Csv), {“Average Rating”}),
#”Changed Type1″ = Table.TransformColumnTypes(#”Split Column by Delimiter”,{{“Average Rating”, Currency.Type}}),
#”Renamed Columns” = Table.RenameColumns(#”Changed Type1″,{{“Index”, “Position”}})
in
#”Renamed Columns”
Caveats: whenever you want to click the gear icon, be sure to hit the refresh button first or you might be working against cached data. The “window” experience to enter data from example might work the first time, but if your webpage changes then this window might crash. The Html.Table function only exposes span attributes for now – this limitation is by-design, but I can imagine that at one point we could look at specific attributes like “src” and simple “href” so we can get urls from the webpage (for images and links).
Conclusion: Web Scraping is now CRAZY simple with this feature, but this also offers a lot of flexibility for people that want to go hardcore with Web Scraping. There are still some things that you can’t do with Power Query like going through a page that requires a log in or “sign in” page. There’s no automatic way of doing a pagination yet in Power Query, but I believe that they might be working on exposing a native Power Query function that could get the job done – right now you need to create your pagination function as shown here.
Check out the report that I created from this dataset:
Don’t forget to leave your feedback in the comments section below!
UPDATE 24-July-2018: Check the improved Web Scraping experience here.
Amazing! Thanks Miguel.
nice article, so first of all thanks a lot!
I have a question: is it possible to “force” web page to load all images before I receive html code in PowerBI using Web.BrowserContents. The thing is that WaitFor field helps to check whether CSS Selector exists on page. And it does exist. But I need it to exist for all images, and page loads only first 2 of them. In more details I asked my question here:
https://community.powerbi.com/t5/Desktop/Web-BrowserContents-how-to-force-web-page-to-load-all-pictures/m-p/612187
Hey Dima!
I was able to bring this to the attention of the Power Query team and they confirmed my suspicions. There’s a script that lazily loads the images based on the viewport and scroll operations. You only get 2 src because that’s what the viewport (that Chromium uses) sees when using the Web.BrowserContents function. If the viewport was bigger and you could see the WHOLE page, then that would change everything, but we currently can’t define what size the viewport should be.
Thanks Miguel,
Following a hackathon based on Migeuls and Chris’s Crossjoin;
let
Source = Web.BrowserContents(“http://www.pleanala.ie/lists/2019/decided/dec20190510.htm”),
#”Extracted Table From Html” = Html.Table(Source, {{“Column0″,”DIV:nth-last-child(4) > DIV > P:nth-child(2) > A,DIV:nth-last-child(4) > DIV > P:nth-child(2) > H2,DIV:nth-last-child(4) > DIV > DIV:nth-child(2) > A,DIV:nth-last-child(4) > DIV > DIV:nth-child(2) > H2”},{“Column1”, “span:nth-child(1)”},{“Column2”, “span:nth-child(2)”}},[RowSelector=”[style*=””font””]”]),
#”Changed Type” = Table.TransformColumnTypes(#”Extracted Table From Html”,{{“Column1″, type text}}),
#”Filled Down” = Table.FillDown(#”Changed Type”,{“Column0″}),
#”Merged Columns” = Table.CombineColumns(#”Filled Down”,{“Column1”, “Column2”},Combiner.CombineTextByDelimiter(“”, QuoteStyle.None),”Merged”)
in
#”Merged Columns”
This succeeded in getting County name, and list of details for each case number in columns 1 and 2.
I cant scrape the detail for each of the list items from:
Board Reference
:Vacant Sites Levy – Appeal5.9 (
Description
: Entry onto vacant site
Hey! I highly suggest that you post this in the official Power Query forum:
https://social.technet.microsoft.com/Forums/en-US/home?forum=powerquery
This comments section isn’t the best place to post code or provide that much technical info due to length and how it’s being displayed in such a small space
Thanks Miguel, have done just now with your name attached, hope that’s ok
Hey Conor, I’m completely OK with it.
The official forum is monitored by members of the Power Query team and by avid Power Query users worldwide, so you should be getting some replies relatively quickly.
I even saw that you just received a reply from a member already. Unfortunately, my time is very limited at this time, so I’m only focusing in paid consulting / mentoring services and other projects.
If you do require further support, I do offer consulting / coaching services and you can contact us at info@poweredsolutions.co
Hey Miguel,
I was trying to solve this very issue recently, and I came to a different solution. In case you are interested, my technique stays within the “Get Data From Web” connector, and modifies it just a bit.
As you mention, by default the ‘columnNameSelectorPairs’ of the “Html.Table()” function will only pull data from the attribute, and therefore can’t get images that are typically in the attribute. However, the official documentation for the “Html.Table()” function* very briefly mentions another way to use the ‘columnNameSelectorPairs’:
– Use #1 (Default): ‘columnNameSelectorPairs’ uses 2 parameters, {‘New Column Name’, ‘Element Name With the Span Attribute’}
– – Example #1: {“Name”, “.name”}
– Use #2 : ‘columnNameSelectorPairs’ uses 3 parameters, {‘New Column Name’, ‘Element Name With the X Attribute’, ‘X Attribute’}
– – Example #2: {“Link”, “a”, each [Attributes][href]}
Using #2 above, you could add an image by adding the ‘img’ element and ‘src’ attribute:
{“Image”,”img”, each [Attributes][#”src”]}
This goes inside the “Html.Table()” function, with all the other parameters of that function unchanged, and will return a columns of *.jpg URLs.
Kindly,
Justin Hylton
Business Intelligence Manager for Mohawk Home
https://www.linkedin.com/in/justin-hylton/
*https://docs.microsoft.com/en-us/powerquery-m/html-table
Hey Justin!
Thanks for checking out this post. At the bottom of the post you should see a link to the follow up blog post where I covered this approach. This is where I showcased how to use the updated function.
Here’s the link to that one:
https://www.thepoweruser.com/2018/07/24/improved-web-scraping-experience/
Ok, sorry I missed that. I see it now, and you covered it well. Good stuff.
Thank you 🙂
Hey!
Thats amazingly cool!
Can you show me how I should adapt the code in order to get data from each page and not just the first result page?
Thank you!
BR, Nati
Hey!
You can check out this video for a way to accomplish pagination:
https://youtu.be/vhr4w5G8bRA
Greetings,
If I may ask, when a page has a “load more”button (like this one for example – https://www.imdb.com/title/tt8358332/reviews), what can we do to get the data hidden under the “load more”and do not show up in the data table in PowerBI?
thanx in advance for your time
Hey! That’s a tough one. It depends. It’s one of those case by case basis since every single website is different and implements their lazy loading or pagination differently.