Use Scrapy shell for interactive experimentation
Running scrapy shell gives you an interactive environment for experimenting with the site being scraped. For example, calling fetch() with the URL of a page downloads that page and creates a response variable holding a scrapy.http.Response object for it.
Calling view(response) opens the fetched page in your browser, letting you see exactly the HTML that the Scrapy spider would receive. This shows the page before any client-side rendering has taken place and also lets you detect whether the site applies countermeasures against scraping.
Furthermore, calling the css() or xpath() methods on the response object is a convenient way to refine your CSS or XPath queries before putting them into actual Python code.
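For instance, a quick interactive session might look like this (the target here is the Scrapy demo site quotes.toscrape.com; the selectors are only illustrative and the output depends on the live page):
$ scrapy shell
>>> fetch('https://quotes.toscrape.com/')
>>> response.status
200
>>> # Refine selectors interactively before copying them into spider code
>>> response.css('small.author::text').getall()
>>> response.xpath('//span[@class="text"]/text()').get()
>>> view(response)   # opens the downloaded HTML in your default browser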
To learn more about Scrapy shell, see: https://docs.scrapy.org/en/latest/topics/shell.html
Use image/file pipeline for file downloading
Sometimes your web scraping project will involve downloading images or other kinds of files. Scrapy provides some official pipelines for this exact task.
To download arbitrary files, we can enable FilesPipeline by adding it to the ITEM_PIPELINES dictionary in the settings.py file like this:
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
You also need to set the directory path where downloaded files will be stored:
FILES_STORE = '/path/to/valid/dir'
Furthermore, your items will need to include a file_urls field containing a list of file URLs. When the files pipeline processes an item, it downloads each file into the directory you have configured and sets a files field with a list of dictionaries containing the original URLs and the paths of the downloaded files. File names are derived from file URLs: the SHA-1 hash of the URL is computed and the original file extension is appended to it.
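As a minimal sketch of how this fits together (the spider name, start URL and CSS selector below are placeholders, not a real target site), a spider feeding PDF links into the files pipeline could look like this:
import scrapy

class ReportSpider(scrapy.Spider):
    name = 'reports'
    # Placeholder start URL; replace with the page that links to the files
    start_urls = ['https://example.com/reports']

    def parse(self, response):
        # FilesPipeline downloads every URL listed under 'file_urls' and
        # records the results (original URL, local path, checksum) under 'files'
        pdf_links = response.css('a[href$=".pdf"]::attr(href)').getall()
        yield {
            'title': response.css('title::text').get(),
            'file_urls': [response.urljoin(link) for link in pdf_links],
        }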
Integration of the images pipeline is rather similar. You would edit settings.py to add scrapy.pipelines.images.ImagesPipeline to the ITEM_PIPELINES dictionary and set IMAGES_STORE to the path of the directory that will be used for storing images. However, the images pipeline provides some image processing capabilities, such as generating thumbnails and filtering out images that are too small, and therefore requires the Pillow module to be installed. By default, the images pipeline converts all downloaded images to JPEG format. It expects the image_urls item field to be filled with a list of image URLs and will set an images field in a way equivalent to the files field of the files pipeline.
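For example (the directory path and selector are illustrative), the corresponding settings.py entries and a spider callback could look like this:
# settings.py
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
IMAGES_STORE = '/path/to/valid/dir'

# in a spider callback
def parse(self, response):
    yield {
        'page': response.url,
        # ImagesPipeline downloads every URL under 'image_urls' and stores
        # the results (URL, path, checksum) under the 'images' field
        'image_urls': [response.urljoin(src)
                       for src in response.css('img::attr(src)').getall()],
    }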
Both pipelines also allow storing downloaded files on storage backends other than the local file system (an S3 example follows the list below):
- AWS S3 buckets
- Remote FTP servers
- Google Cloud Storage buckets
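For instance, to store files in an S3 bucket you can point FILES_STORE (or IMAGES_STORE) at an s3:// URI instead of a local path; the bucket name below is a placeholder and the credentials can also come from the environment:
# settings.py
FILES_STORE = 's3://my-scrapy-bucket/downloads/'  # placeholder bucket name
AWS_ACCESS_KEY_ID = 'your-access-key'             # or rely on environment/IAM credentials
AWS_SECRET_ACCESS_KEY = 'your-secret-key'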
To learn more about this, see: https://docs.scrapy.org/en/latest/topics/media-pipeline.html
Use automatic throttling for scraping rate-limited sites
Some sites implement rate limiting and will refuse to serve proper pages if you generate too many requests too quickly from a single IP address. A simple way to slow down your Scrapy project is to decrease the following values in settings.py:
CONCURRENT_REQUESTS_PER_DOMAIN
CONCURRENT_REQUESTS_PER_IP
These settings cap how many concurrent requests the spider is allowed to have in flight per domain or per target IP address. Only one of them applies at a time: if CONCURRENT_REQUESTS_PER_IP is set to a non-zero value, it takes precedence and the per-domain limit is ignored.
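For example (the values below are arbitrary starting points, not recommendations for any particular site):
# settings.py
CONCURRENT_REQUESTS_PER_DOMAIN = 2
# If set to a non-zero value, the per-IP limit takes precedence over the per-domain one
#CONCURRENT_REQUESTS_PER_IP = 2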
A more advanced way to slow down is to use the AutoThrottle extension by uncommenting the following parts of the settings.py file and experimenting with the values until no requests are being dropped:
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
This enables an adaptive throttling algorithm that takes into account the load of both your Scrapy instance and the remote server(s) being scraped.
To learn more about automatic throttling, see: https://docs.scrapy.org/en/latest/topics/autothrottle.html
Deploy Scrapy projects to the cloud
Once you have your Scrapy project running properly, you may want to avoid having to run it on your local machine. There are several ways to get it running in a cloud environment.
The simplest way is to get a cheap VPS (e.g. a $5/month DigitalOcean droplet), install Scrapy there, upload your Scrapy project via SFTP and run it in a tmux session.
Another way is to self-host a solution like ScrapydWeb, which provides a web interface for uploading your Scrapy project and monitoring scraping progress on a dashboard.
Yet another way is to sign up for Scrapy Cloud, the official service from the creators of Scrapy. This service has its own CLI tool for uploading your Scrapy project and launching it in the cloud, which means you can integrate it into your CI/CD pipeline.
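As a rough sketch of the Scrapy Cloud workflow (the spider name is a placeholder; check the current shub documentation for exact usage), deployment typically boils down to a few commands run from the project directory:
pip install shub        # install the Scrapy Cloud CLI
shub login              # paste your Scrapy Cloud API key when prompted
shub deploy             # upload the project (asks for the project ID on first run)
shub schedule myspider  # start a cloud run of the spider named 'myspider'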
Use Item Loader to streamline item creation
Scrapy lets you create Item Loaders to simplify and streamline item creation based on CSS selectors and XPath queries. They also allow you to attach extra processing steps, such as whitespace stripping and other cleanup, to individual fields.
The Scrapy documentation provides the following example of how an Item Loader can be used in a spider:
from scrapy.loader import ItemLoader
from myproject.items import Product
def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # you can also use literal values
    return l.load_item()
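To attach the cleanup steps mentioned above, you can define input and output processors on a custom loader. A minimal sketch, assuming a recent Scrapy version where the processors are provided by the itemloaders package (older versions exposed them as scrapy.loader.processors):
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader

class ProductLoader(ItemLoader):
    # Strip whitespace from every extracted value and keep only the first
    # match per field instead of a list
    default_input_processor = MapCompose(str.strip)
    default_output_processor = TakeFirst()
You would then instantiate ProductLoader instead of ItemLoader in the parse() method above.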
To learn more about Item Loaders, see: https://docs.scrapy.org/en/latest/topics/loaders.html