One needs a proxy pool to scrape some of the sites that implement countermeasures against scraping. These countermeasures may take the form of IP-based rate limiting, geographic restrictions, AS-level traffic filtering or even something more advanced such as TLS fingerprinting. When the Scrapy framework is used to implement a scraper, there are a couple of built-in ways to route traffic through a proxy:
- Setting the `HTTPS_PROXY` environment variable when running a spider will make the traffic go through the proxy. Note that this applies not only to requests made from the spider, but also to those made from pipelines and other parts of the project, which may not be desirable.
- Setting a proxy URL at the `proxy` key of the `meta` dictionary for each request (see the sketch after this list). This gives us flexibility over which requests are sent through a proxy, but makes the code less clean by moving the proxy integration into the spider class.
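For instance, per-request proxy assignment might look like the following minimal sketch; the spider name, URL and proxy credentials here are placeholders, not part of the original project:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        # Only this request is routed through the proxy; requests
        # without the "proxy" meta key would go out directly.
        yield scrapy.Request(
            "https://example.com/",
            meta={"proxy": "http://user:password@proxy.example.com:8080"},
        )

    def parse(self, response):
        self.logger.info("Fetched %s via proxy", response.url)
```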
Both of these options depend on the standard HTTP proxy middleware that is available in a vanilla Scrapy project.
However, we may want more flexibility and the ability to introduce additional logic regarding how exactly the requests are routed through a proxy pool. This can be done by implementing an additional downloader middleware that augments `HttpProxyMiddleware` by assigning a proxy URL to some or all of the requests. To do so, we edit the middlewares.py file in a Scrapy project and create a downloader middleware class there.
To process each HTTP request (represented by a `scrapy.Request` object) we must implement our own `process_request()` method. If we need to assign a proxy URL to a request, we put it into the `meta` dictionary at the `proxy` key. If not, we leave the request as-is. Either way, we don't need to return anything from this method.
For example, we may want to integrate the Bright Data proxy pool in a way that forces exit IP randomisation. The following code implements a simple downloader middleware that takes Bright Data zone credentials from settings, appends a session ID to the username, generates the proxy URL and puts it into the `meta` dictionary:
```python
from museums.settings import (
    BRIGHT_DATA_ENABLED,
    BRIGHT_DATA_ZONE_USERNAME,
    BRIGHT_DATA_ZONE_PASSWORD,
)
from w3lib.http import basic_auth_header
import random


class BrightDataDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    def process_request(self, request, spider):
        if not BRIGHT_DATA_ENABLED:
            return None
        # Route the request through the Bright Data super proxy.
        request.meta["proxy"] = "http://zproxy.lum-superproxy.io:22225"
        # A random session ID in the username makes Bright Data pick
        # a fresh exit IP instead of reusing one sticky session.
        username = BRIGHT_DATA_ZONE_USERNAME + "-session-" + str(random.random())
        request.headers["Proxy-Authorization"] = basic_auth_header(
            username, BRIGHT_DATA_ZONE_PASSWORD
        )
```
It also sets a `Proxy-Authorization` header so that the proxy credentials are passed via HTTP Basic Auth.
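For reference, `basic_auth_header()` from w3lib simply builds the base64-encoded value that HTTP Basic Auth expects; a quick check in a Python shell (with made-up credentials) shows the shape of the header value:

```python
>>> from w3lib.http import basic_auth_header
>>> basic_auth_header("user", "pass")  # base64 of "user:pass"
b'Basic dXNlcjpwYXNz'
```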
To use this middleware in the Scrapy project, we must not only set the configuration variables, but also edit `DOWNLOADER_MIDDLEWARES` to activate the middleware with a priority value:
```python
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'museums.middlewares.BrightDataDownloaderMiddleware': 500,
}
```
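The configuration variables the middleware imports could live in settings.py along these lines; the variable names are taken from the middleware above, while the values shown are placeholders for your own Bright Data zone credentials:

```python
# Placeholder Bright Data zone configuration in settings.py.
BRIGHT_DATA_ENABLED = True
BRIGHT_DATA_ZONE_USERNAME = "your-zone-username"
BRIGHT_DATA_ZONE_PASSWORD = "your-zone-password"
```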
If this does not work, you may want to check that the priority of your custom middleware is higher (numerically lower) than that of `HttpProxyMiddleware`, which sits at 750 in `DOWNLOADER_MIDDLEWARES_BASE`, as we need to set the proxy URL and credentials before the request gets to the regular middleware.
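One way to verify the effective ordering is to print the merged middleware configuration; the following is a quick sketch that assumes it is run from within the Scrapy project directory:

```python
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# getwithbase() merges DOWNLOADER_MIDDLEWARES with the built-in
# DOWNLOADER_MIDDLEWARES_BASE, so the printout reflects the order
# in which middlewares will process outgoing requests.
merged = settings.getwithbase("DOWNLOADER_MIDDLEWARES")
for name, priority in sorted(merged.items(), key=lambda kv: kv[1] or 0):
    print(priority, name)
```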
For further examples of Scrapy middlewares that integrate with proxy pools, you may want to read the code of the following projects:
- scrapy-zyte-smartproxy - the official middleware for the Zyte Smart Proxy service. It implements reliability and error handling features such as exponential backoff and graceful handling of a proxy pool outage, as well as managing vendor-specific HTTP headers.
- scrapy-rotating-proxies - two middlewares that automatically rotate proxies for you with some ban detection logic, and let you customise the ban detection policy.