Python Tutorial: Let’s build a web crawler for sitemaps
A quick, niche tutorial for crawling internal webpages on a website.
Have you ever wanted to crawl a website, but only that website’s own pages, ignoring other stuff like external links? Okay, no worries if not, but if so, this article could be helpful.
One easy way to crawl internally is by leveraging sitemaps. Many of today’s websites include a sitemap.xml file. That file, which exists within the website code, improves SEO by showing the relationships between the pages, videos, and other files on a site.
I’m writing this article because I haven’t seen too many simple examples online for crawling webpages specifically by using sitemaps. So, if you’re interested in doing that very specific thing, here’s how!
First off, you’ll need Python. If you don’t have it, get it. Once you have it, use pip to install these two libraries if you don’t have them yet: requests and bs4 (published on PyPI as beautifulsoup4). Then create a new Python file, perhaps called crawler.py, and add these imports to the top of the script:
import requests
from bs4 import BeautifulSoup
Now it’s time to set up the foundation for your crawler. This part is pretty straightforward: just write down the URL to the sitemap for your website of interest. Then, using your imported libraries, request that URL and parse the response. Python’s built-in HTML parser is fine here even though a sitemap is XML, since all we need is to pull out a few simple tags.
myURL = 'https://my-website-goes-here.com/sitemap.xml'
page = requests.get(myURL)
soup = BeautifulSoup(page.content, 'html.parser')
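One detail in the step above is easy to trip over: requests expects a full URL, scheme included, so a bare domain will not work. Here is a minimal sketch of a helper that builds the sitemap URL from whatever you have on hand. The name sitemap_url_for is made up for illustration, and the helper assumes two things that may not hold for your site: that it is served over HTTPS, and that the sitemap lives at the conventional /sitemap.xml path (some sites put it elsewhere and list it in robots.txt).

```python
from urllib.parse import urljoin


def sitemap_url_for(site):
    """Build a sitemap URL for a site (hypothetical helper, not part of requests or bs4)."""
    # Assumption: default to HTTPS when the caller leaves the scheme off.
    if "://" not in site:
        site = "https://" + site
    # Assumption: the sitemap sits at the conventional /sitemap.xml path.
    # The leading slash makes urljoin replace any path already on the URL.
    return urljoin(site, "/sitemap.xml")


print(sitemap_url_for("my-website-goes-here.com"))
# https://my-website-goes-here.com/sitemap.xml
print(sitemap_url_for("https://my-website-goes-here.com/blog/some-post"))
# https://my-website-goes-here.com/sitemap.xml
```

The result can be passed straight to requests.get in place of a hand-typed string.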
This is where things start to get interesting. Now you need to figure out where in that parsed sitemap your crawler will find the internal page links.
Sometimes, it’s all right there in that sitemap, but in other cases, the sitemap links to other sitemaps and you have to dig into those (see Google’s sitemap, for instance). You might also run into permissions issues and such, which makes sense, because lots of pages are private.
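If you’re not sure which kind of sitemap you’re looking at, the root tag tells you: in the sitemaps.org format, a file that lists pages starts with urlset, and one that points to other sitemaps starts with sitemapindex. The sketch below shows one way to tell them apart. To keep it runnable without a network request, it uses the standard library’s xml.etree.ElementTree on two small made-up samples rather than requests and BeautifulSoup; the function name read_sitemap and the example.com URLs are purely illustrative.

```python
import xml.etree.ElementTree as ET

# XML namespace used by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sample: a sitemap that only points to other sitemaps.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

# Made-up sample: a sitemap that lists pages directly.
SAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""


def read_sitemap(xml_text):
    """Return (kind, urls), where kind is 'index' or 'urlset' (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # ElementTree reports tags as '{namespace}name'; keep only the name.
    root_name = root.tag.rsplit("}", 1)[-1]
    if root_name == "sitemapindex":
        kind, path = "index", "sm:sitemap/sm:loc"
    else:
        kind, path = "urlset", "sm:url/sm:loc"
    urls = [loc.text.strip() for loc in root.findall(path, SITEMAP_NS) if loc.text]
    return kind, urls


print(read_sitemap(SAMPLE_INDEX))
print(read_sitemap(SAMPLE_URLSET))
```

When the result is an index, each returned URL is itself a sitemap that you would fetch and read the same way; when it is a urlset, the URLs are the pages themselves.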
To keep things simple, here’s an example of what this looks like in the basic case where you don’t have to dig deep for those links: