Miguel Escobar Published July 24, 2018

Improved Web Scraping Experience in Power BI

Power BIPower Query

A lot of things have happened in the last couple of days. I’m for the first time in my life in Seattle spending some quality time with people that I’ve never met in my life other than through the power of the internet. It’s a nice feeling to finally put a face to a name that you always see on your inbox.

I’ll be speaking later today, but I had to get this post out to the world as soon as possible.

I’ve covered the new Web Scraping experience before in here, and talking with the product team today at the Business Applications Summit they (Ehren specifically) told me that they just updated and improved the experience, so he showed me a demo and I took home my homework so I could implement what he showed me on my own dataset

What’s new

If you want to follow along, just use the following url to get data from the web:

https://www.amazon.com/s?keywords=power+query

Now when you use ‘Add table from examples’ and you input your examples, you get a list of possibilities:

image

which helps you pin down exactly the css selector that you need.

The second point is that once you actually define your CSS selectors and what columns/data you want, you can now click on the OK button and a new “custom table” will appear on the Navigator window which is super convenient.

image

The last and most important new feature is that you can finally access attributes inside of your html. Something that I couldn’t do in my previous post was to grab the url of each specific book, but as of yesterday I can do that inside of Power BI Desktop. Here’s the query that I ended up creating:

let
Source = Web.BrowserContents(“https://www.amazon.com/s?keywords=power+query”),
#”Extracted Table From Html” = Html.Table(Source, {{“Title”, “.s-access-detail-page”}, {“Author(s)”, “.a-col-right > .a-spacing-small > :nth-child(2)”}, {“Total Reviews”, “div > div > div > div.a-fixed-left-grid-col.a-col-right > div:nth-child(2) > div.a-column.a-span5.a-span-last > div > a”}, {“Average Rating”, “[class*=””a-icon a-icon-star””]”}, {“Release Date”, “span.a-size-small.a-color-secondary”},{“url”, “.a-col-right > .a-spacing-small:nth-child(1) > :nth-child(1) > .a-link-normal “, each [Attributes][href]}, {“imgurl”, “img.s-access-image.cfMarker”, each [Attributes][src]}}, [RowSelector=”:not([class*=””AdHolder””])[id*=””result_””]”]),
#”Added Index” = Table.AddIndexColumn(#”Extracted Table From Html”, “Index”, 1, 1),
#”Changed Type” = Table.TransformColumnTypes(#”Added Index”,{{“Title”, type text}, {“Author(s)”, type text}, {“Total Reviews”, Int64.Type}, {“Average Rating”, type text}, {“Release Date”, type date}, {“imgurl”, type text}, {“url”, type text}})
in
#”Changed Type”

Notice that every element from the Html.Table’s first parameter (a list of lists), the most inner lists now accepts 3 arguments where the first one is the name for the new column, the 2nd the actual element that we’ll be grabbing and the last one a function that tells how to extract the attribute from that specific html element. Here’s the current documentation that was added to the latest version of Power BI Desktop specifically for this functionality:

image

Finally, thanks to this new feature I’m able to grab the urls of things like the book images (src) and also any links to the direct page of the book (href) so my final report can have a look more like this:

image

Special Thanks  to Ehren from the Power Query team for showing me this. Without him I would not have noticed that there was an updated experience since it was listed on the release notes of the latest Power BI Desktop.

Power BIPower Query
Subscribe
Notify of
guest
7 Comentario
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
andrew mccoy

HI Miguel,

I’m not following at a fundamental level here. Is this example Power BI Desktop? Which release, perhaps August 2018? Also, I don’t think I’ve ever seen ‘Add table from example’ but maybe that is because I usually use Excel version of Power Query.

thanks

Wedge

This is in PBI Desktop. I don’t think the Excel version of PQ has been updated in almost a year, and PQ features in Power BI were already well ahead of the Excel releases. Excel PQ used to get updates to the add-in module at least twice a year, so it’s possible any future PQ updates to Excel could only come through O365 now. A bit annoying as PQ is insanely powerful in Excel and has replaced 90% of work I used to have to do with VBA (having the .pdf connector for Excel would be great).

Alex Diamond

Hi Miguel, I’ve been trying to pull the list of law bills being discussed by our asamblea.

I have tried “Getting Data” from the following address:

https://asamblea.gob.pa/index.php/seguimiento-legislativo

However after connecting to the site, there are no tables showing. Under Display Options it shows “We didn’t find any tables on this web page”

Am I using the wrong address or does this mean the site is “designed” to avoid scraping?

Alex Diamond

Hi Miguel, yes. It just updated recently. My steps are:

1. Get Data –> Web
2. Choose Basic and paste URL: https://asamblea.gob.pa/seguimiento-legislativo, then hit Ok
3. Then I get the NAVIGATOR pop up, but under Display Options it still shows up blank.

Alex Diamond

Miguel, thanks for your help. I was able to extract a table that was not showing up before after I disabled the preview features for the web interface.

Now I see the columns for the data when it connects. However had one follow up question, how do I pull the data from several pages at the same time? The pull I am getting only has the law bills for page 1.

Alex Diamond

Hi Miguel, another question. Is there a way I can pull the hyperlinks for some of the fields instead of text (Ver Etapa and Documento columns)?