Figuring out every single page a website has is the starting point for any serious SEO audit, competitor deep-dive, or data scraping project. It's about moving past just clicking around. You need a mix of smart techniques—like parsing sitemaps, crawling every link, and even rendering the site like a browser—to find all pages on a website, especially those buried under layers of JavaScript.
Why Finding Every Page Actually Matters
In a world with over a billion websites, the old ways of manual checks just don't cut it. A lot of modern sites are built on dynamic platforms that create pages on the fly. Basic crawlers will trip over anything rendered with JavaScript, leaving you with huge blind spots in your data.
This guide isn't about the basics. We're diving into the real-world headaches developers hit when trying to map out a site, from dealing with client-side frameworks to getting around smart bot defenses. A complete page inventory is the foundation for any real analysis.
The Problem of Scale and Complexity
Just think about the sheer size of the web for a second. There are over 1.13 billion websites out there, with around 200 million of them actively updated. With about 252,000 new sites popping up every single day, trying to keep track of it all manually is a lost cause. Automation isn't just nice to have; it's essential.
Having a complete list of pages is make-or-break for a few key jobs:
- Comprehensive SEO Audits: You can't fix what you can't find. Unearthing every URL helps you spot thin content, duplicate titles, and broken links that are dragging down your rankings.
- Accurate Competitive Analysis: When you map out a competitor's entire website, you see their full content strategy, which keywords they're chasing, and just how deep their product offerings go.
- Large-Scale Data Projects: Whether you're feeding a machine learning model or building a market intelligence tool, you need the whole dataset. Incomplete data leads to bad results. Plain and simple.
Before jumping into the code-heavy strategies, a good first step is to perform a website audit. It helps set the stage and gives you a baseline to work from.
Page Discovery Methods at a Glance
There's more than one way to find a website's pages, and the best approach often depends on what you're up against. Some methods are quick and easy, while others are built to handle the complexities of modern, JavaScript-heavy sites.
This table gives a quick rundown of the techniques we'll be covering, so you can pick the right tool for the job.
| Method | Best For | Complexity | Dynamic Site Effectiveness |
| --- | --- | --- | --- |
| Sitemaps & robots.txt | Quick initial discovery of intended pages. | Low | Low |
| Search Engine Operators | Finding what Google has indexed, including subdomains. | Low | Medium |
| Standard Link Crawlers | Mapping static HTML sites and following internal links. | Medium | Low |
| Headless Browsers | Discovering pages on JavaScript-heavy SPAs and dynamic sites. | High | High |
| Custom Scrapy Crawler | Large-scale, customized, and efficient page discovery. | High | High (with middleware) |
Each method has its place. Simple sites might only need a quick sitemap check, but for a sprawling, dynamic web app, you'll likely need to pull out the heavy machinery like a custom crawler or headless browser to get the full picture.
Start with Sitemaps and robots.txt
Before you even think about unleashing a heavy-duty crawler, the smartest first move is to check for the website's own roadmap. Pretty much any well-built site gives you two key files that act as a guide: robots.txt and one or more XML sitemaps. It's like asking for directions before you start wandering around a new city.
Your first stop should always be the robots.txt file. It’s a simple text file you can find at the root of a domain (like example.com/robots.txt). While its main job is to give instructions to web crawlers, it often hands you a direct link to the site's sitemap on a silver platter. This makes it the perfect starting point to find all the pages the site owner wants you to see.
Locating and Parsing Sitemaps
Once you’ve got the robots.txt file, you can programmatically scan it for any lines that start with Sitemap:. This directive points you straight to an XML sitemap. For a developer, this is the lowest-hanging fruit imaginable and a massive time-saver compared to firing up a blind crawl from the get-go.
You'll run into a few different formats out in the wild:
- Standard XML Sitemap: This is just a single .xml file listing out URLs. It often includes extra metadata, like the last time a page was modified.
- Sitemap Index File: Think of this as a sitemap of sitemaps. It's an XML file that points to a bunch of other sitemap files. You'll see this all the time on larger websites, where it helps keep things organized.
- Plain Text URL Lists: These are less common, but some sites keep it simple with a .txt file that has one URL per line.
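As a rough illustration of the robots.txt scan and the three formats above, here is a minimal Python sketch that works on text you have already downloaded. The function names (sitemaps_from_robots, parse_sitemap) are hypothetical helpers invented for this example, and it assumes sitemap files follow the usual layout of a loc element directly inside each url or sitemap entry.

```python
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return every URL declared on a 'Sitemap:' line (case-insensitive)."""
    urls = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def parse_sitemap(body: str) -> tuple[str, list[str]]:
    """Classify a sitemap body and return (kind, urls).

    kind is "index" (urls are child sitemaps to fetch next),
    "urlset" (urls are pages), or "text" (one URL per line).
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not XML: treat the body as a plain-text list, one URL per line.
        lines = [ln.strip() for ln in body.splitlines()]
        return "text", [ln for ln in lines if ln.startswith(("http://", "https://"))]

    def local(tag: str) -> str:
        # Tags are namespaced ("{namespace}loc"), so keep only the local name.
        return tag.rsplit("}", 1)[-1]

    kind = "index" if local(root.tag) == "sitemapindex" else "urlset"
    # Only read <loc> directly under each <url>/<sitemap> entry, which skips
    # nested extension tags such as image locations.
    locs = [
        child.text.strip()
        for entry in root
        for child in entry
        if local(child.tag) == "loc" and child.text
    ]
    return kind, locs
```

When parse_sitemap reports "index", each returned URL is itself a sitemap to download and parse the same way; the other two kinds return page URLs directly.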
If you're looking to automate this, a Python script using libraries like requests to fetch the files and xml.etree.ElementTree to parse them is a solid, effective approach. You can quickly rip through these files and extract every <loc> tag, which is where the URL lives. If you're building a full-on scraping solution, understanding the basics laid out in this practical developer's guide to scraping a website will give you a great foundation.
The Inherent Limitations of Sitemaps
Look, sitemaps are a fantastic starting point, but relying on them exclusively is a rookie mistake. They give you a curated list, not a complete one. The information inside can be misleading or just plain incomplete for a few strategic reasons.
Here’s exactly why you can't just stop after parsing a sitemap:
- They are often outdated: Sure, many content management systems generate sitemaps automatically, but I’ve seen plenty of custom-built sites or poorly configured plugins with sitemaps that haven't been touched in months—sometimes years. You'll miss every single piece of new content.
- They intentionally omit pages: Marketers love creating specific landing pages for paid campaigns or flash sales. These pages are live and accessible, but they're purposely left out of the sitemap so they don't get indexed in organic search results.
- Orphaned pages are invisible: If a page isn't linked from any other page on the site and it's not in the sitemap, it's what we call an "orphaned page." A sitemap will never, ever help you find these.
At the end of the day, using sitemaps is a crucial first step for efficient page discovery. It gives you a huge head start by quickly populating your list of URLs to check. But it should always be followed by more active methods, like a comprehensive link-crawling process, to build a truly complete picture of the site's architecture.
Uncover Hidden Pages with Advanced Web Crawling
Sitemaps give you the official tour, but the real discoveries happen when you go off-road. This is where advanced web crawling comes in—actively following every link on a site to map out its true, often messy, structure. It’s the difference between reading the map and actually walking the streets yourself.
The core idea sounds simple enough: start at the homepage, scrape every <a> tag, toss those URLs into a queue, and visit each one, repeating the process until you run out of new links. But the modern web is full of curveballs that can quickly trip up a basic crawler, turning a straightforward task into a real engineering headache.
Navigating the Labyrinth of a Modern Website
Once you unleash your crawler, you’ll immediately hit some practical roadblocks. The first is just figuring out where to go next. A crawler has to be smart enough to convert relative paths like /about-us into absolute URLs (https://example.com/about-us) before it can actually visit them.
You also need to set clear boundaries. Without them, your crawler might follow an external link and start mapping out an entirely different website. The goal is to find all pages on a website, so your logic needs to be strict about staying within the target domain and its subdomains.
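Both checks fit in a few lines of standard-library Python. This is a minimal sketch: resolve and in_scope are hypothetical helper names, and the scope rule assumes "in scope" means the root domain plus any subdomain of it.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def resolve(base_url: str, href: str) -> str:
    """Turn a possibly relative href into an absolute URL without its #fragment."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute


def in_scope(url: str, root_domain: str) -> bool:
    """True for http(s) URLs on root_domain or any of its subdomains."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return parts.scheme in ("http", "https") and (
        host == root_domain or host.endswith("." + root_domain)
    )
```

Matching on "." + root_domain rather than a bare suffix keeps look-alike hosts such as notexample.com out of the crawl.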
Deduplication is another big one. URLs with tracking parameters (?utm_source=...) or session IDs often point to the exact same content. A naive crawler sees these as unique pages, bloating your results and wasting time. You need smart logic to normalize these URLs and pinpoint the canonical version, ensuring you only process each unique page once.
This flowchart lays out the foundational steps for page discovery, from checking the robots.txt file to parsing sitemaps and building an initial URL list.
This process gives you a solid starting point, but it's the active, link-by-link crawl that uncovers all the pages that didn't make it into the official guides.
The JavaScript Challenge and Headless Browsing
The biggest hurdle for any modern crawler is JavaScript. So many sites today are Single Page Applications (SPAs) built with frameworks like React, Vue, or Angular. On these sites, the initial HTML your crawler gets is often just a barebones shell.
The real content—including all the links you need to follow—is rendered on the client-side by running JavaScript in the browser. A simple crawler that just makes an HTTP request and parses the raw HTML will find next to nothing. It’s like trying to read a book by only looking at the cover.
This is where headless browser rendering becomes non-negotiable. A headless browser is a real web browser, just without the graphical user interface. You control it programmatically. It loads a page, executes all the JavaScript, and waits for dynamic content to pop into existence, just like a real user's browser would.
By crawling with a headless browser, you get access to the fully rendered Document Object Model (DOM), allowing you to extract links that were completely invisible to a traditional static crawler. It's the only reliable way to map out a JavaScript-heavy website.
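For a sense of what this looks like in code, here is a minimal sketch using Playwright's synchronous Python API. Playwright is an assumption on my part (this guide does not prescribe a library), and it needs both the playwright package and a browser installed via "playwright install chromium". The helper names are hypothetical.

```python
def links_from_page(page):
    """Collect absolute hrefs from the rendered DOM of an already-loaded page."""
    # The browser resolves el.href to an absolute URL for us.
    hrefs = page.eval_on_selector_all("a[href]", "els => els.map(el => el.href)")
    return sorted(set(hrefs))


def rendered_links(url, timeout_ms=30_000):
    """Load url in headless Chromium, let JavaScript run, then read the links."""
    # Imported here so links_from_page stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits for network activity to settle, which is a
            # rough proxy for "client-side rendering has finished".
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return links_from_page(page)
        finally:
            browser.close()
```

Waiting for network idle is a heuristic; on pages that poll continuously, waiting for a specific selector to appear is usually more dependable.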
Comparing Crawler Types: Static vs. Dynamic
Not all crawlers are built the same. The choice between a simple static crawler and a more complex dynamic one depends entirely on the kind of websites you're targeting. Here’s a quick breakdown of how they stack up.
Crawler Type Comparison: Static vs. Dynamic Rendering
| Feature | Static Crawler (e.g., Requests + BeautifulSoup) | Dynamic Crawler (e.g., Scrappey, Puppeteer) |
| --- | --- | --- |
| Technology | Makes direct HTTP requests and parses raw HTML. | Runs a full browser instance to render pages. |
| Speed | Very fast. Low overhead. | Slower due to browser rendering and resource usage. |
| Resource Usage | Lightweight. Low CPU and memory consumption. | Heavy. Requires significant CPU and memory. |
| JS Support | None. Cannot see content rendered by JavaScript. | Full. Executes all client-side JavaScript. |
| Effectiveness | Great for simple, static HTML sites (think Wikipedia). | Essential for modern SPAs and JS-heavy sites. |
| Complexity | Simple to build and maintain. | Complex to manage, especially at scale. |
Ultimately, while static crawlers have their place, the modern web increasingly demands the power of dynamic rendering to get a complete picture.
Why Services Like Scrappey Are a Game-Changer
Trying to manage your own fleet of headless browsers is a massive undertaking. You're suddenly dealing with huge infrastructure costs, complex software management, and the constant headache of browser updates and security patches. For most developers, it’s a huge distraction from the actual goal: finding the pages.
This is where a service like Scrappey offers a massive advantage. Instead of building and maintaining your own rendering farm, you make a simple API call. Scrappey handles the headless browser execution, solves CAPTCHAs, manages proxies, and sends back the fully rendered HTML. This approach lets you uncover pages that basic crawlers would miss entirely, but without all the operational overhead.
Top sites are constantly evolving, with JavaScript rendering dynamic content that static crawlers just can't see. When 61.19% of all site traffic globally is mobile and 68% of online experiences begin with a search engine, you need a tool that can handle the complexity.
This is especially critical for scraping content loaded via asynchronous requests (AJAX/XHR). Think of a product page with a "load more" button. A crawler needs to simulate a click or scroll and then capture the new data loaded from an API. You can dive deeper into this technique in our guide on how to intercept XHR requests. By using a service that handles these interactions for you, you can build a far more complete and accurate map of any website.
Building a Scalable Crawler with the Scrappey API
Alright, let's move from theory to practice. Building a crawler that works on your local machine is one thing; building one that can handle a real-world, complex website is another beast entirely. This is where an API-driven approach really shines, letting you offload the grunt work—like managing browsers and dodging bot detection—so you can focus purely on your page discovery logic.
Imagine you need to map out the entire product catalog of a huge e-commerce competitor. Their site is a JavaScript-heavy labyrinth of categories, subcategories, and paginated lists. Trying to build a robust crawler for this from scratch is a massive engineering headache.
But with a service like the Scrappey API, that giant task shrinks down to a series of API calls. You just send a URL and get back the fully rendered HTML, without ever having to spin up a browser yourself. That abstraction is the secret to building scalable crawlers fast.
Initial API Calls and Concurrency
Getting started is pretty simple. First, you need to hit the homepage to grab your initial batch of links. A basic create session request in Scrappey can fetch the page content for you, handling all the proxy and browser rendering stuff automatically.
Once you get that initial HTML response, you parse out all the internal links, just like any crawler would. But here's the fun part: instead of visiting them one by one, you can use concurrency to blow the doors off your crawl speed. You can fire off multiple API requests at the same time, exploring different branches of the site in parallel.
For example, if the homepage links to ten main product categories, you can immediately send ten concurrent API requests—one for each category. This parallel processing cuts your discovery time down dramatically.
Of course, you can't just send an infinite number of requests. Every server has its limits, and your own plan will have constraints. It's crucial to manage your request rate. Definitely check out Scrappey's guidance on understanding concurrency limits to build a crawler that’s both fast and respectful.
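One common way to stay inside a concurrency limit is a semaphore around each request. The sketch below uses asyncio from the standard library; fetch is a placeholder for whatever coroutine actually performs the API call (it is not a Scrappey function), and max_concurrency should be set from your own plan's limit.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=5):
    """Run fetch(url) for every URL with at most max_concurrency in flight.

    Returns a dict mapping each URL to its result, or to the exception raised.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:  # blocks while max_concurrency calls are running
            try:
                return url, await fetch(url)
            except Exception as exc:  # keep one failure from sinking the batch
                return url, exc

    pairs = await asyncio.gather(*(bounded(url) for url in urls))
    return dict(pairs)
```

With the ten-category example above and max_concurrency=5, the first five requests start immediately and each remaining one starts as soon as a slot frees up.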
Managing Sessions and State
Lots of modern websites need you to maintain a consistent state to navigate correctly. Think about setting your location to see local products or accepting a cookie banner to unlock the rest of the page. A stateless crawler making isolated requests will get blocked or see incomplete info constantly.
This is where session management becomes critical. When you kick off a crawl with Scrappey, you create a session that persists across multiple requests. This means cookies, local storage, and session data are all maintained, making your crawler act more like a real user.
- Persistent Cookies: Any cookie set on one request gets automatically sent with the next request in the same session.
- Stateful Navigation: This allows you to perform multi-step actions, like logging into an account or clicking through a checkout flow.
- Avoiding Blocks: Consistent session data makes your crawler look less like a bot, reducing the chances of being flagged.
Just use a named session parameter in your API calls, and each step of your crawl will build on the last. This lets you navigate complex user flows that would otherwise be impossible.
Handling Proxies and Avoiding Blocks
If you pound a server with thousands of requests from a single IP address, you're going to get blocked. It's the oldest mistake in the book. Websites are always watching for unusual activity, and a firehose of requests from one source is a giant red flag.
An API service gets around this by automatically rotating your requests through a massive pool of proxies.
- Residential & Datacenter Proxies: Your requests can come from a mix of IP types, mimicking real users from all over the world.
- Automatic Rotation: You don't have to manage a proxy list. The API handles swapping them out if one gets flagged or blocked.
- Geo-Targeting: You can even make requests appear to come from specific countries—a must for sites with location-based content.
This built-in proxy management is a lifesaver. It abstracts away one of the most tedious and failure-prone parts of web scraping, letting you focus on your code instead of infrastructure.
For our data engineers, the goal to 'find all pages on a website' often means tackling CMS behemoths where sitemaps hide thousands of auto-generated URLs. Platforms like WordPress and Shopify power these dynamic sites—with 26.6 million e-commerce sites surging 204% in 2021 alone. Scrappey's rotating proxies and smart queueing bypass rate limits, rendering the full DOM for JavaScript-heavy pages. You can find more on these trends in this statistical breakdown from DiviFlash.
Deduplication and Canonical URLs
As your crawler digs through a site, it’s going to find multiple URLs pointing to the exact same content. This happens all the time on e-commerce sites with filtering and sorting parameters. The same category page could have dozens of URLs depending on how the products are displayed.
To avoid processing the same page over and over, your logic needs to handle deduplication. The best way to do this is by looking for the canonical link element in the page's <head>.
This tag (<link rel="canonical" href="...">) tells search engines which URL is the "master" version. By extracting and storing only the canonical URL, you keep your final list clean and free of junk. Your crawler's logic should be:
- Fetch a page.
- Check for a canonical tag.
- If one exists, use its href as the definitive URL for that content.
- If that canonical URL is already in your discovered list, just toss the current URL and move on.
This simple check keeps your crawler from getting stuck in loops or wasting resources on redundant pages, giving you a much more accurate and efficient map of the site.
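The canonical check described in this section can be sketched with the standard library's HTML parser. The names below (CanonicalParser, canonical_url, record_page) are hypothetical, and the sketch assumes you already have the page's HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> it sees."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")


def canonical_url(page_url, html):
    """Return the page's canonical URL, falling back to page_url if none is set."""
    parser = CanonicalParser()
    parser.feed(html)
    # urljoin also handles a relative canonical href such as "/shoes".
    return urljoin(page_url, parser.canonical) if parser.canonical else page_url


def record_page(page_url, html, discovered):
    """Add the canonical URL to the discovered set; return True if it was new."""
    url = canonical_url(page_url, html)
    if url in discovered:
        return False
    discovered.add(url)
    return True
```

A False return from record_page is the signal to skip further processing of that page, since its content is already represented in the list.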
Find Orphaned Pages with Search Engines and Archives
Even the most relentless crawler has a blind spot: pages that aren't linked anywhere on the site. These are your "orphaned pages": forgotten landing pages, old blog posts, or test environments accidentally left live. They’re totally invisible to any tool that works by following <a> tags.
To truly find all pages on a website, you have to think beyond your own crawler. You need to supplement your efforts by looking at what the rest of the world has already seen, indexed, and saved.
This is where search engine indexes and historical archives come in. They give you that critical external perspective, helping you build a page list that’s as complete as it gets.
Using Search Operators for Deeper Discovery
Why do all the crawling yourself when search engines have already done most of the heavy lifting? By using advanced search operators, you can query their massive indexes directly to see every URL they've associated with a domain.
The most powerful operator for this is site:. A quick search for site:example.com will show you most of the pages Google knows about for that domain. You'll often find subdomains and obscure URLs your crawler missed, especially if they’re linked from other websites but not from within the site itself.
You can get even more granular by combining it with other operators:
- site:example.com -inurl:www: This is a great trick for finding pages on non-www subdomains that might be flying under the radar.
- site:example.com filetype:pdf: Instantly uncover PDF files and other documents that aren't standard HTML pages.
- site:example.com "specific keyword": Hunt down pages related to a certain topic that might not be part of the main navigation.
This is a fast and surprisingly effective way to find pages that are indexed but potentially orphaned. It’s the perfect cross-reference to check against your own crawl results.
Exploring Historical Web Archives
What about the pages that no longer have active links and have been de-indexed by Google? These digital ghosts exist in a kind of limbo, and web archives are the best place to go hunting for them.
The Internet Archive's Wayback Machine is the most well-known tool for this. It has been taking snapshots of websites for decades. Just pop in a domain, and you can explore a calendar of saved versions, browsing the site exactly as it existed on a specific date. You'll often uncover old sections, campaign pages, or entire site structures that are no longer live but provide priceless context.
For a deeper dive into finding historical data and discovering pages that are no longer linked, it's worth checking out a guide on the Archive.org Wayback Machine.
These archived versions are goldmines for URLs that have been scrubbed from the current sitemap and internal navigation. By comparing a historical snapshot with your current crawl, you can easily spot pages that have vanished—a critical step for setting up 301 redirects during a site migration or conducting a thorough content audit.
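If you want to do that comparison programmatically rather than by browsing snapshots, the Internet Archive exposes a CDX index that lists captured URLs. The sketch below reflects my understanding of that public API (the endpoint and the url, matchType, output, fl, collapse, and limit parameters); treat those details as assumptions and check them against the Archive's CDX documentation. The function names are hypothetical, and the code only builds the query and parses a JSON response you have already downloaded.

```python
import json
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(domain, limit=10000):
    """Build a query for every captured URL on a domain and its subdomains."""
    params = {
        "url": domain,
        "matchType": "domain",  # include subdomains
        "output": "json",
        "fl": "original",       # return only the original URL column
        "collapse": "urlkey",   # one row per distinct URL
        "limit": str(limit),
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def archived_urls(cdx_json):
    """Parse a JSON response body; the first row is a header, so skip it."""
    rows = json.loads(cdx_json) if cdx_json.strip() else []
    return {row[0] for row in rows[1:]}


def only_in_archive(archived, crawled):
    """URLs the archive has seen that the current crawl did not find."""
    return sorted(set(archived) - set(crawled))
```

Archived URLs often differ from live ones in scheme, port, or trailing slash, so normalize both sets the same way before diffing them.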
So, your crawler has finally ground to a halt. Don't pop the champagne just yet. What you're left with is a massive, raw list of URLs—a messy jumble of live pages, redirects, and plenty of dead ends. Your job isn't done until you turn this data dump into a clean, actionable map of the website.
This final cleanup is absolutely critical. It's the part where you transform a chaotic list into a reliable site inventory. And the very first thing you need to do is validate every single URL. Your crawler just found links; it didn't check to see if they actually work.
Are These Pages Even Live?
A super-efficient way to validate your list is to send a HEAD request to each URL. Unlike a GET request that downloads the whole page, a HEAD request just asks the server for the headers. It's incredibly fast and light on resources, which is perfect when you're checking thousands upon thousands of URLs. Some servers reject or mishandle HEAD, so fall back to a GET request when you get a 405 or a response that looks wrong.
Your goal here is simple: check the HTTP status code the server sends back.
- 200 OK: Perfect. The page is live and accessible. This one's a keeper.
- 301/302 Redirect: This page has moved. You'll want to follow that redirect to its final destination and update your list with the new, correct URL.
- 404 Not Found: A dead end. The page is broken or gone, so you can safely toss this URL from your final list.
- 5xx Server Error: Something's wrong on their end, not yours. You could try these again later, but for now, they're not reachable.
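The status handling above can be sketched with the standard library's http.client, which does not follow redirects on its own and therefore lets you see each 301 or 302 directly. The function names and verdict labels are hypothetical, and the extra codes (303, 307, 308 as redirects, 410 alongside 404) are my additions to the list above.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def head(url, timeout=10.0):
    """Send one HEAD request without following redirects.

    Returns (status_code, Location header or None).
    """
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        target = (parts.path or "/") + ("?" + parts.query if parts.query else "")
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def classify(status):
    """Map a status code to what to do with the URL."""
    if 200 <= status < 300:
        return "live"
    if status in (301, 302, 303, 307, 308):
        return "redirect"
    if status in (404, 410):
        return "gone"
    if 500 <= status < 600:
        return "retry_later"
    return "review"  # e.g. 403 or 405: needs a closer look or a GET fallback


def resolve_final(url, max_hops=5):
    """Follow redirects hop by hop; return (final_url, verdict)."""
    for _ in range(max_hops):
        status, location = head(url)
        verdict = classify(status)
        if verdict != "redirect" or not location:
            return url, verdict
        url = urljoin(url, location)  # Location may be relative
    return url, "redirect_loop"
```

Capping the number of hops keeps a misconfigured redirect chain from stalling the whole validation pass.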
Once you’ve filtered out the dead links and server errors, your list is much cleaner. Now you can move on to the final touches, like deduplicating any remaining stragglers and normalizing the URL formats. For example, decide if you're keeping trailing slashes or not and make everything consistent. Just remember to keep respecting robots.txt and applying those rate limits you set up earlier to stay on the right side of ethical scraping.
Even with a solid game plan, you're bound to hit a few snags when trying to map out a website. Let's tackle some of the most common questions that pop up during the process.
How Do I Find Pages Not Indexed by Google?
Just because a page isn't on Google doesn't mean it doesn't exist. It might be too new, blocked by a noindex tag, or simply missed by the search engine's crawler. The only way to be sure you find everything is to crawl the site directly and exhaustively.
This is where a headless browser really shines. By starting at the homepage and following every single internal link, a headless solution renders the full DOM. It will catch all those links hidden away in JavaScript that a standard crawler (or even a search engine bot) might overlook. You're mapping the site based on its real structure, not just what it chooses to show Google.
Is It Legal to Crawl All Pages on a Website?
Crawling publicly accessible web pages is generally accepted practice, but you have to play by the rules, and the specifics vary with jurisdiction and the site's terms of service. The golden rule is to always respect the robots.txt file. This is the site owner's instruction manual for bots, and ignoring it is just bad form.
Remember, crawling that disrupts a site's service can definitely cross into a legal gray area, so your top priority should always be to be a good internet citizen.
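Python ships a robots.txt parser, so honoring those rules takes very little code. A minimal sketch, assuming you have already downloaded the robots.txt text; the helper names and the "my-site-mapper" user agent are placeholders.

```python
from urllib.robotparser import RobotFileParser


def build_rules(robots_txt):
    """Parse robots.txt text that has already been fetched."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def allowed(rules, url, user_agent="my-site-mapper"):
    """True if the rules permit this user agent to fetch the URL."""
    return rules.can_fetch(user_agent, url)


def delay_seconds(rules, user_agent="my-site-mapper", default=1.0):
    """Use the site's Crawl-delay if it declares one, else a polite default."""
    delay = rules.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

Calling allowed before every request and sleeping for delay_seconds between requests to the same host covers both the access rules and the rate limiting mentioned above.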
What Is the Best Way to Handle Infinite Scroll?
Ah, infinite scroll—the classic crawler trap. Because new content is loaded dynamically via JavaScript as you scroll, a simple HTML request will only ever grab the first batch of content. Your crawler will think it's done when it's barely scratched the surface.
The only reliable way around this is to use a tool that can act like a real person. A headless browser can be scripted to:
- Scroll all the way to the bottom of the page.
- Wait for the new content to load.
- Scrape the links that just appeared.
- Keep doing this until no more content loads.
This approach mimics what a user does, so your crawler captures the links that only appear as the page loads more content. Put a cap on the number of scroll rounds, because some feeds never actually end.
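The scroll loop can be sketched as a function that drives an already-open page object. The calls used here (evaluate, wait_for_timeout, eval_on_selector_all) match Playwright's synchronous Page API, which is my assumption for the headless browser; the function name and the default pause and round limit are illustrative.

```python
def scroll_to_end(page, pause_ms=1000, max_rounds=50):
    """Scroll until the page height stops growing; return every link seen."""

    def current_links():
        return page.eval_on_selector_all("a[href]", "els => els.map(el => el.href)")

    links = set(current_links())
    last_height = page.evaluate("document.body.scrollHeight")
    for _ in range(max_rounds):
        # Jump to the bottom, then give the page time to load the next batch.
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)
        height = page.evaluate("document.body.scrollHeight")
        if height == last_height:
            break  # nothing new was appended, so the feed is exhausted
        # Collect every round, since some feeds drop older items from the DOM.
        links.update(current_links())
        last_height = height
    return sorted(links)
```

Comparing page height between rounds is a simple stop condition; on sites that load content inside a scrollable container rather than the document body, you would measure and scroll that element instead.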
Ready to build a crawler that can handle any website, no matter how complex? Scrappey provides the headless browser infrastructure and proxy management you need to find every page without the engineering overhead. Get started today at https://scrappey.com.
