Photo credit to Clément Hélardot — that’s JavaScript in the photo, not Python, but I liked the way this looked.

Python Tutorial: Let’s build a web crawler for sitemaps

A quick, niche tutorial for crawling internal webpages on a website.

Ben Scheer
3 min read · Oct 12, 2021


Have you ever wanted to crawl a website, but only that site’s own pages, ignoring other stuff like external links? Okay, no worries if not, but if so, this article could be helpful.

One easy way to crawl internally is by leveraging sitemaps. Many of today’s websites include a sitemap.xml file. That file, which exists within the website’s code, improves SEO by showing the relationships between the pages, videos, and other files on a site.
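To make that concrete, a sitemap is essentially a list of url entries, each with a loc element holding one page address. The sketch below is illustrative only: the sample XML is made up, list_page_urls is a hypothetical helper name, and it uses Python’s built-in xml.etree.ElementTree so that it runs without installing anything.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content, made up for illustration
# (the leading XML declaration is left out for brevity).
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2021-10-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must include it.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_page_urls(sitemap_xml):
    """Return the text of every <loc> inside a <url> entry."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


print(list_page_urls(SAMPLE_SITEMAP))
# ['https://example.com/', 'https://example.com/about']
```

Because every address in the list comes from the site’s own sitemap, a crawler that only visits these URLs stays internal by construction.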

I’m writing this article because I haven’t seen many simple examples online for crawling webpages specifically by using sitemaps. So, if you’re interested in doing that very specific thing, here’s how!

First off, you’ll need Python. If you don’t have it, get it. When you have it, install these two libraries using pip, if you don’t have them installed yet: requests and bs4. Then, create a new Python file, perhaps called crawler.py, and add these imports to the top of the script:

import requests
from bs4 import BeautifulSoup

Now it’s time to set up the foundation for your crawler. This part is pretty straightforward — just write down the URL…
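As a rough idea of where a crawler like this might go, here is a minimal sketch using the two libraries imported above. It is not the author’s code: the helper names fetch_sitemap and extract_page_urls are hypothetical, https://example.com/sitemap.xml is a placeholder address, and the choice of Python’s built-in "html.parser" (to avoid needing an extra XML parser) is an assumption.

```python
import requests
from bs4 import BeautifulSoup


def fetch_sitemap(sitemap_url):
    """Download the sitemap and return its raw text (hypothetical helper)."""
    response = requests.get(sitemap_url, timeout=10)
    response.raise_for_status()  # stop early on 404s and other HTTP errors
    return response.text


def extract_page_urls(sitemap_xml):
    """Collect the address inside every <loc> tag (hypothetical helper)."""
    # Assumption: "html.parser" is used so no additional parser is required.
    soup = BeautifulSoup(sitemap_xml, "html.parser")
    return [loc.get_text(strip=True) for loc in soup.find_all("loc")]


# Example usage with a placeholder address (not run here):
# page_urls = extract_page_urls(fetch_sitemap("https://example.com/sitemap.xml"))
```

Splitting the download from the parsing keeps the parsing step easy to try on a saved or hand-written sitemap before pointing the script at a live site.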
