How to Scrape Multiple Pages of a Website Using Python?

Web scraping multiple pages allows us to collect large amounts of data spread across paginated web content. In Python, this is done by sending repeated requests, handling page links and extracting required information in a structured way. In this article, we’ll take the GeeksforGeeks website as an example and write a Python script to extract the titles of all articles available on its homepage.

Scraping Multiple Pages of a Website Using Python

When we need to collect data from several pages of the same website or from different URLs, writing separate code for each page is slow and repetitive. To make this process easier, we’ll learn two simple techniques to scrape data from multiple webpages:

From multiple pages of the same website
From different website URLs

Approach:

Import the necessary libraries.
Set up the base URL and connect using the requests library.
Parse the webpage data using BeautifulSoup.
Locate and extract the HTML tags or classes containing the required information.
Test it on one page, then use a loop to scrape multiple pages automatically.
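
Steps 3 and 4 can be tried offline before touching a live site. The sketch below parses a small hand-written HTML string; the markup, the SAMPLE_HTML constant and the extract_titles helper are made up for illustration and are not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup written by hand for this sketch; a real page will differ.
SAMPLE_HTML = """
<html><body>
  <div class="head">First article</div>
  <div class="head">Second article</div>
  <div class="footer">Not a title</div>
</body></html>
"""


def extract_titles(html, css_class="head"):
    """Return the stripped text of every <div> that carries css_class."""
    soup = BeautifulSoup(html, "html.parser")
    matches = soup.find_all("div", attrs={"class": css_class})
    return [div.get_text(strip=True) for div in matches]


print(extract_titles(SAMPLE_HTML))  # ['First article', 'Second article']
```

Once the selector returns the expected items on a sample like this, the same function can be pointed at the text of a downloaded page.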

Example 1: Looping through the page numbers

(Figure: page numbers at the bottom of the GeeksforGeeks website)

Many websites split their content across pages numbered from 1 to N. Because the page structure usually stays the same, it is easy to loop through them.

(Figure: the last section of the URL is page/4/)

For example, if the URL of a page ends with something like page/4/, we can change that number dynamically in our code. By using a simple for loop and replacing the page number (i) in the URL, we can automatically visit each page and extract the required data without manually editing the URL each time.
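
As a minimal sketch of that idea, the helper below builds one URL per page number. The function name build_page_urls is hypothetical, and it assumes the site really follows the page/<number>/ pattern described above.

```python
def build_page_urls(base_url, first_page, last_page):
    """Return one URL per page number, first_page to last_page inclusive."""
    base = base_url.rstrip("/")  # avoid a doubled slash before "page"
    return [f"{base}/page/{number}/" for number in range(first_page, last_page + 1)]


for url in build_page_urls("https://www.geeksforgeeks.org/", 1, 3):
    print(url)
```

Building the list first makes it easy to inspect the URLs before sending any request.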

The following example demonstrates how to scrape data from multiple pages using a for loop in Python.

Python

import requests
from bs4 import BeautifulSoup as bs

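URL = 'https://www.geeksforgeeks.org/page/1/'

# Download the page and parse its HTML
req = requests.get(URL)
soup = bs(req.text, 'html.parser')

# Collect every <div> whose class is 'head'
titles = soup.find_all('div', attrs={'class': 'head'})

# Print the text of the fifth match
print(titles[4].text)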

Now, using the above code, we can get the titles of all the articles by wrapping those lines in a loop.

Python

import requests
from bs4 import BeautifulSoup as bs

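BASE_URL = 'https://www.geeksforgeeks.org/page/'

for page in range(1, 11):  # pages 1 to 10
    req = requests.get(BASE_URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', attrs={'class': 'head'})

    # As in the single-page test, the 15 article titles are taken from positions 4 to 18
    for i, title in enumerate(titles[4:19], start=1):
        # Running number across pages: 1-15 on page 1, 16-30 on page 2, ...
        print(f"{(page - 1) * 15 + i}. {title.text}")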

Note: The above code fetches the first 10 pages of the website and prints the 150 article titles found on them (15 per page).
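
The running number printed in front of each title is plain arithmetic: with 15 titles per page, the title at position k on page p gets number (p - 1) * 15 + k. The short sketch below checks that formula; the name running_number and the PER_PAGE constant are introduced here only for illustration.

```python
PER_PAGE = 15  # titles per page, matching the 150 titles over 10 pages noted above


def running_number(page, position, per_page=PER_PAGE):
    """Number of the title at `position` on `page`; both counts start at 1."""
    return (page - 1) * per_page + position


print(running_number(1, 1))    # 1   (first title on page 1)
print(running_number(2, 1))    # 16  (first title on page 2)
print(running_number(10, 15))  # 150 (last title on page 10)
```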

Example 2: Looping through a list of different URLs

The previous method worked well when pages followed a numbered pattern. But sometimes, we may want to scrape data from pages that don’t have page numbers or follow different URL structures.

In such cases, instead of writing separate code for each page, we can simply store all the URLs in a list and loop through them to extract data easily. Here’s an example:

Python

import requests
from bs4 import BeautifulSoup as bs

URLS = ['https://www.geeksforgeeks.org/', 'https://www.geeksforgeeks.org/page/10/']

for index, url in enumerate(URLS):
    req = requests.get(url)
    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', attrs={'class': 'head'})

    # Take the 15 titles at positions 4 to 18 and number them across URLs
    for i, title in enumerate(titles[4:19], start=1):
        print(f"{index * 15 + i}. {title.text}")
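
When the URLs in the list are unrelated, one of them may fail while the others still work. The sketch below is one possible way to keep the loop going. The scrape_many function and the fake_fetch and fake_extract stand-ins are hypothetical names, and the example.com pages are invented so that the code runs without network access.

```python
def scrape_many(urls, fetch, extract):
    """Apply extract(fetch(url)) to each URL; record failures instead of stopping."""
    results, failures = {}, {}
    for url in urls:
        try:
            results[url] = extract(fetch(url))
        except Exception as exc:  # broad on purpose: any failure is recorded and skipped
            failures[url] = repr(exc)
    return results, failures


# Stand-ins so the sketch runs offline; these pages are invented.
FAKE_PAGES = {
    "https://example.com/a": "Alpha|Beta",
    "https://example.com/b": "Gamma",
}


def fake_fetch(url):
    return FAKE_PAGES[url]  # raises KeyError for an unknown URL


def fake_extract(text):
    return text.split("|")


results, failures = scrape_many(
    ["https://example.com/a", "https://example.com/missing", "https://example.com/b"],
    fake_fetch,
    fake_extract,
)
print(results)
print(list(failures))
```

In a real run, fetch could be a small function that calls requests.get(url).text and extract could wrap the find_all call used in this article.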

How to avoid getting our IP address banned

When scraping multiple pages, sending too many requests in a short time can overload the website’s server. This might lead to our IP address getting blocked or blacklisted.

To prevent this, it’s important to control our crawl rate, that is, how frequently our program sends requests. A simple way to do this is to add short, random pauses between requests, so that our script behaves more like a person browsing naturally.

We can achieve this using two Python functions:

randint() from the random module: returns a random integer between two limits, both included.
sleep() from the time module: pauses the program for the given number of seconds.

Example:

Python

from time import sleep
from random import randint

for i in range(0, 3):
    # pick a random integer from 2 to 5 (both ends included)
    x = randint(2, 5)
    print(x)
    sleep(x)
    print(f'I waited {x} seconds')

Output

5
I waited 5 seconds
4
I waited 4 seconds
5
I waited 5 seconds

Now, let’s apply this logic to our web scraping loop:

Python

import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep

BASE_URL = 'https://www.geeksforgeeks.org/page/'

# The site has more than 5000 pages; only the first 10 are fetched in this example
for page in range(1, 11):
    req = requests.get(BASE_URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')

    titles = soup.find_all('div', attrs={'class': 'head'})

    # The 15 article titles are taken from positions 4 to 18, numbered across pages
    for i, title in enumerate(titles[4:19], start=1):
        print(f"{(page - 1) * 15 + i}. {title.text}")

    # Wait 2 to 10 seconds before requesting the next page
    sleep(randint(2, 10))

Output

The program has paused its execution and is waiting to resume
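
The pause logic can also be kept separate from the scraping code. The following is a minimal sketch, not a definitive implementation: crawl_with_pauses and its parameters are hypothetical names, and the pause function is passed in so the demo can replace it with a no-op and finish instantly.

```python
from random import randint
from time import sleep


def crawl_with_pauses(pages, handle_page, min_wait=2, max_wait=10, pause=sleep):
    """Call handle_page for each page, pausing a random number of seconds in between."""
    pages = list(pages)
    waits = []
    for index, page in enumerate(pages):
        handle_page(page)
        if index < len(pages) - 1:  # no need to wait after the last page
            wait = randint(min_wait, max_wait)
            pause(wait)
            waits.append(wait)
    return waits


# Demo with a no-op pause; leave out the `pause` argument to really sleep.
visited = []
waits = crawl_with_pauses(range(1, 4), visited.append, pause=lambda seconds: None)
print(visited)     # [1, 2, 3]
print(len(waits))  # 2
```

Here handle_page would be the function that requests and parses a single page, so the waiting time is applied once per request and skipped after the final one.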
