turbin3
I apologize in advance if this is far too basic for some of you. The other day I was remembering when I first learned the beauty of XPath and being able to use any number of programs to scrape a specific item or set of items from a specific page. I figured I'd put this up to encourage people who may be a bit more novice to expand their horizons.
What is it?
So what is XPath? It is exactly what the thread title says. Think of it as a "pathway" through a file, which leads directly to a specific element within that file. This could be HTML, XML, or any number of other structured document formats. It is a method of calling out specific elements, attributes, and other items to effectively give a program directions to reach the specific piece of data you're interested in. In certain cases, you can jump right to the "X", skipping most of the "pathway". In other cases, for specific uses and specific pages, it may be necessary to create a detailed XPath that leads the bot from start to finish through the document. The more unique and definable something is, the more easily you'll be able to go directly to it, as opposed to having to use a complex XPath. XPath can be used for many different purposes, but I will be speaking about it in the specific context of scraping data from a webpage, where it is an invaluable tool you will want to learn as an SEO, digital marketer, or web developer.
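To make that a bit more concrete, here's a minimal sketch in Python using the lxml library. The library choice is purely for illustration, and example.com is just a stand-in URL; any XPath-capable tool follows the same logic:

```python
# Minimal sketch: follow an XPath "pathway" through a page with Python + lxml.
import requests
from lxml import html

resp = requests.get("https://example.com")  # stand-in URL
tree = html.fromstring(resp.content)

# The "pathway": start at <html>, go into <head>, grab the <title> text.
title = tree.xpath("/html/head/title/text()")
print(title)  # e.g. ['Example Domain']
```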
Why you should care
Inevitably, when it comes to web development, database development, lead generation, competitive intelligence, and general analytics, you will eventually find yourself in need of data to accomplish your goals. If you don't already have that data, you must acquire it from somewhere. You can pay for it, but that's not always fun and it's not always affordable. Well, I have news for you. You are quite literally surrounded by more data than human beings have ever had access to in the entirety of human history... and it is right at your fingertips, FOR FREE, and can be acquired within a matter of seconds or minutes. It is often a matter of simply thinking a bit outside the box and figuring out a way to creatively acquire it.
Be Prepared
A bit of a caveat, to start off with. Be prepared to FAIL. Be prepared to be IP banned from websites for being an unforgiving scrapist. Be prepared to become a bit frustrated from time to time, trying to figure out the "recipe". XPath is often confusing, and takes a lot of trial and error to utilize successfully. You will achieve success by learning to diagnose those failures, reassessing, educating yourself, and persisting through trial and error. Just keep tweaking those XPaths until those empty fields come streaming in with data. The first time you create, troubleshoot, and successfully land an XPath that nets you data, on your own, a whole new horizon will open up for you. You will begin to realize that nearly anything is within your reach, with enough trial and error. It's a very liberating feeling.
So that's what XPath is, and what it can generally do, but how do we use it? I'm going to focus on a simple method that ANYONE reading this will be able to get up and running within minutes. If I can find the focus, I'll probably also be putting up some tutorials on developing your own scrapers in Python, using XPath, among other methods.
Resources & Examples
W3Schools has a few good resources to get you started with making sense of XPath and how to create/find one for a given page element:
https://www.w3schools.com/xml/xpath_intro.asp
https://www.w3schools.com/xml/xpath_syntax.asp
https://www.w3schools.com/xml/xpath_examples.asp
Pay particular attention to the "syntax" link. You'll probably need to refer back to that quite a bit. Also, trial, error, assessment, and self-education are paramount. I highly recommend trying as many things as you can, and Googling what you are generally trying to achieve.
You will often come across lots of Stack Overflow threads full of TONS of great info to help you develop the winning XPath combination that will net you that piece of data. For example, "xpath anchor href contains" might be something I would Google if I were trying to develop a more general XPath that would look at all of the hrefs on a page and scrape only the ones that contained certain text or certain attributes.
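To give you an idea, the sort of answers those searches turn up look something like the sketch below. The 'blog' and 'Download' strings are placeholders I made up, not anything from a real page:

```python
# Sketch: keep only anchors whose href or text contains something specific.
import requests
from lxml import html

tree = html.fromstring(requests.get("https://example.com").content)  # stand-in URL

# Anchors whose href contains a substring:
by_href = tree.xpath("//a[contains(@href, 'blog')]/@href")
# Anchors whose visible text contains a word:
by_text = tree.xpath("//a[contains(text(), 'Download')]/@href")
```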
A further example: say you wanted to develop a general XPath you could utilize to find any href linking to facebook.com, twitter.com, plus.google.com, youtube.com, etc., as sketched below. With that, you could dump a seed list of domains into your chosen scraping program and have a good degree of success in scraping the social profile links for each domain. Now think of all the paid services out there that offer you that sort of ability, or similar abilities. Are you starting to see where I'm going with this? As you go further down this rabbit hole, you will come to a realization: where you may have paid for certain data collection services (sales leads, for example...) in the past, you can simply spend a few minutes or hours and develop the capabilities to perform many, if not most, of those services yourself, almost entirely for free.
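As a rough sketch, using the same lxml setup as before, that general social-profile XPath might look something like this:

```python
# Sketch: one general XPath to catch common social profile links on a page.
import requests
from lxml import html

tree = html.fromstring(requests.get("https://example.com").content)  # stand-in URL

social_links = tree.xpath(
    "//a[contains(@href, 'facebook.com')"
    " or contains(@href, 'twitter.com')"
    " or contains(@href, 'plus.google.com')"
    " or contains(@href, 'youtube.com')]/@href"
)
print(social_links)
```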
As you begin to scale, and as you target certain sites, there can be some minor expense involved. You will eventually want to take steps to protect yourself by anonymizing and randomizing your activity so as to begin operating effectively in the "wilderness of mirrors". This will involve proxies, a remote VPS so that your true traffic origin is masked, a VPN to encrypt your connection and obfuscate any sort of trace, etc. I'll probably touch on those things more in another tutorial.
Nuts & Bolts
What can we use to scrape? One of the quickest and easiest ways you can get started is with Screaming Frog SEO Spider. Download it if you don't already have it. They also have some decent pointers on creating XPaths:
http://www.screamingfrog.co.uk/seo-spider/user-guide/configuration/#extraction
Where do we start in creating an XPath? With Firefox or Chrome, there are quick ways to get started with the developer tools window opened up. Open the site you want to scrape, right click on the element you want in the page, and click "Inspect". Now hover over that item in the source code in the inspector window, right click, and copy the XPath. Now open up Screaming Frog and select Configuration > Custom > Extraction. Select XPath from the dropdown, paste yours in, hit OK, then enter the URL to scrape and run it. What did you come up with? In this example, we came up with a big, fat NOTHING. So let's go back to the drawing board.
In this example, I'm trying to get a baller thumbnail image from a YouTube video. What you'll find with Chrome/FF developer tools is that the XPath you're given won't always be the one you need to accomplish what you want. Separately, Screaming Frog can be picky about the exact XPath syntax you use. Sometimes the one that works in Screaming Frog won't work in a custom coded Python bot, and vice versa. Get used to tweaking things. In the case of Screaming Frog, I often find that it doesn't seem to like "double quotes", but instead prefers 'single quotes' when you are specifying a class, ID, or certain other items.
So back to the task at hand. We want that image. This XPath didn't work: /html/head/meta[8] Was it the XPath, or was it maybe just a setting in Screaming Frog? Well, we have dropdown menus, so let's just try all of those options. First we used "Extract Inner HTML", but that didn't work. The thumbnail is in an href, so let's try "Extract HTML Element" instead. BAM! That worked! Though we got the full HTML string along with it. If you have a list of videos you want to do this with, it could get really tedious doing a Find/Replace in Excel, Sublime, or whatever, trying to delete the extraneous stuff so you just have the thumbnail links. For a few items, no biggie. For thousands, a much bigger deal, especially if that includes lots of other XPaths and data fields.
So let's try to grab JUST the URL and save ourselves a lot of hassle. You'll commonly see a couple of main types of XPaths. There's what I usually call the "general type", and what I call the "specific type". With the general type, you might be calling out common page elements that are present on most webpages. In the first example, we started with /html/head/meta[8], because almost every webpage is going to start with <html>, should have a <head> section, and probably has some <meta> elements. Do you see a potential issue with this general XPath, though? This one is basically saying: take a trip past the <html>, keep going past the <head>, and from the 8th <meta> tag, "Extract the HTML Element" (what we selected in the SF dropdown) and show it to me. What if that image isn't consistently the EIGHTH meta tag on the page, though? You might get nothing, or you might get some random HTML element. Again, not a biggie with a few pages. Absolute hell when you're scraping thousands, tens of thousands, hundreds of thousands, or more pages.
So that's the downside of the "general type" of XPath. The plus side is, if you have a simple element that has a simple path, and that element remains consistent across many/all pages, a general XPath is often quick and easy to create. For example, /html/head/title will often get you the page title from most webpages. Simple, right? With SF, thankfully a lot of that is built in, and you can just do a standard crawl without having to create any XPaths.
To protect yourself for the future and ensure the integrity of your scraped data, let's take a look at the more specific XPath type. Here's an example: //meta[starts-with(@property, 'og:image')][1]/@content
This is saying: for the [1st] <meta> tag whose property starts with 'og:image', scrape the content from it. In this case, it even works if you select HTML Element or Text in SF.
What if it isn't always the [1st] <meta> tag with og:image, though? You might not want to constrict yourself like that. In fact, there are at least 1,001 ways to skin this cat. Here's another: //*[contains(@content, 'maxresdefault.jpg')]/@content
This one is saying: for anything (*) that contains a content= attribute with 'maxresdefault.jpg' in the string, scrape that content for me and return it. In that case, even if there are sometimes multiple og:image tags on a page, it will only return the one with that exact string in the filename. It just so happens that, as of right now, YT consistently lists their thumbnails with this filename. That may change in the future, but for the time being, it gives us a consistent footprint to work with, using a specific XPath (specific in terms of EXACTLY the type of element that you want) that is not constrained by a specific position within the page. This also protects the integrity of the data you're scraping, should pages be a bit inconsistent in structure and those exact items end up located in slightly different areas of the page.
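Here's a quick sketch of both of those specific XPaths side by side in Python with lxml. The video ID is a placeholder, and YT's markup changes often, so treat this as illustrative rather than guaranteed:

```python
# Sketch: both "specific" XPaths against a YouTube watch page.
import requests
from lxml import html

url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder video ID
tree = html.fromstring(requests.get(url).content)

# 1) First <meta> whose property starts with og:image:
thumb_by_property = tree.xpath(
    "//meta[starts-with(@property, 'og:image')][1]/@content")

# 2) Anything whose content attribute mentions the thumbnail filename:
thumb_by_filename = tree.xpath(
    "//*[contains(@content, 'maxresdefault.jpg')]/@content")

print(thumb_by_property, thumb_by_filename)
```

Notice that neither expression cares where in the <head> the tag happens to sit, which is exactly the point.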
That's great, but what else can we do with this? Well, first off, realize that SF is limited and only allows up to 10 custom extractions. So if you want to scrape more than 10 specific things from a page or set of pages, you're going to have to move to something like Python, or other languages/programs that are more flexible. Another thing to be aware of: there are more ways to perform similar functions than just XPath. For example, there are also CSSPaths which, as the name denotes, use CSS selectors instead. From the above example of creating a path to a YT thumbnail image, here's what that CSSPath might look like: head > meta:nth-child(40) I'm sure you can see some of the potential issues there as well. Same principles, however. You can tweak that "Rubik's Cube" many different ways, both specific and general.
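To show the same idea from the CSS side, here's a quick sketch using BeautifulSoup. Again, the library is just my choice for illustration (in SF you'd paste the raw selector), and the video ID is a placeholder:

```python
# Sketch: the CSS-selector route to the same thumbnail, via BeautifulSoup.
import requests
from bs4 import BeautifulSoup

url = "https://www.youtube.com/watch?v=VIDEO_ID"  # placeholder video ID
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Positional selector, like the one dev tools copies (fragile, like meta[8]):
fragile = soup.select("head > meta:nth-child(40)")

# More specific selector, analogous to the og:image XPath:
robust = soup.select("meta[property='og:image']")
if robust:
    print(robust[0].get("content"))
```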
Possibilities
To give you some other ideas, earlier this week I used Scrapebox to scrape a few hundred keywords from YT suggest, then ~300K video links from YT related to those keywords. I then spent a few minutes creating the XPaths and a couple CSSPaths to pull this data for each of those videos: Views, Likes, Dislikes, Comments, Channel Name, Channel Link, Subscribers, Vid Description, and I think that was it. Plugged that into SF and let that run for an hour or two. BAM. A "curated" list of 300K (maybe 50-75% relevant) vids, with "metrics" to help prioritize them, to begin filling a bulk content upload for a site.
Another project involved scraping millions of businesses from a few directories, both to monetize as sales leads and to utilize for content in multiple ways across multiple sites. That involved Python + Scrapy + the wilderness of mirrors.
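If you're curious what the Scrapy side of something like that looks like, here's a bare-bones sketch. The domain, classes, and fields are all made-up placeholders, not the actual directories:

```python
# Bare-bones Scrapy spider sketch: XPath extraction plus pagination.
# Everything site-specific here (domain, classes, fields) is a placeholder.
import scrapy

class DirectorySpider(scrapy.Spider):
    name = "directory"
    start_urls = ["https://example-directory.com/listings"]  # placeholder

    def parse(self, response):
        # One XPath to find each listing block, then relative XPaths within it:
        for listing in response.xpath("//div[@class='listing']"):
            yield {
                "name": listing.xpath(".//h2/a/text()").get(),
                "phone": listing.xpath(".//span[@class='phone']/text()").get(),
                "website": listing.xpath(".//a[@class='site']/@href").get(),
            }
        # Follow pagination, if the site has a rel=next link:
        next_page = response.xpath("//a[@rel='next']/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```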
Next up, if I can manage to focus a bit in the coming weeks, maybe a crash course on scraping with Python as well as the Scrapy framework.