Scrapy documentation
This documentation contains everything you need to know about Scrapy.
Getting help
Having trouble? We’d like to help!
- Try the FAQ – it’s got answers to some common questions.
- Looking for specific information? Try the Index or Module Index.
- Ask or search questions on Stack Overflow using the scrapy tag.
- Ask or search questions in the Scrapy subreddit.
- Search for questions in the archives of the scrapy-users mailing list.
- Ask a question in the #scrapy IRC channel.
- Report bugs with Scrapy in our issue tracker.
First steps
- Scrapy at a glance
- Understand what Scrapy is and how it can help you.
- Installation guide
- Get Scrapy installed on your computer.
- Scrapy Tutorial
- Write your first Scrapy project.
- Examples
- Learn more by playing with a pre-made Scrapy project.
Basic concepts
- Command line tool
- Learn about the command-line tool used to manage your Scrapy project.
- Spiders
- Write the rules to crawl your websites.
- Selectors
- Extract the data from web pages using XPath.
- Scrapy shell
- Test your extraction code in an interactive environment.
- Items
- Define the data you want to scrape.
- Item Loaders
- Populate your items with the extracted data.
- Item Pipeline
- Post-process and store your scraped data.
- Feed exports
- Output your scraped data using different formats and storage backends.
- Requests and Responses
- Understand the classes used to represent HTTP requests and responses.
- Link Extractors
- Convenient classes to extract links to follow from pages.
- Settings
- Learn how to configure Scrapy and see all available settings.
- Exceptions
- See all available exceptions and their meaning.
Built-in services
- Logging
- Learn how to use Python’s built-in logging with Scrapy.
- Stats Collection
- Collect statistics about your crawler.
- Sending e-mail
- Send email notifications when certain events occur.
- Telnet Console
- Inspect a running crawler using a built-in Python console.
- Web Service
- Monitor and control a crawler using a web service.
Solving specific problems
- Frequently Asked Questions
- Get answers to the most frequently asked questions.
- Debugging Spiders
- Learn how to debug common problems in your Scrapy spider.
- Spiders Contracts
- Learn how to use contracts for testing your spiders.
- Common Practices
- Get familiar with some common Scrapy practices.
- Broad Crawls
- Tune Scrapy for crawling a lot of domains in parallel.
- Using Firefox for scraping
- Learn how to scrape with Firefox and some useful add-ons.
- Using Firebug for scraping
- Learn how to scrape efficiently using Firebug.
- Debugging memory leaks
- Learn how to find and get rid of memory leaks in your crawler.
- Downloading and processing files and images
- Download files and/or images associated with your scraped items.
- Deploying Spiders
- Deploy your Scrapy spiders and run them on a remote server.
- AutoThrottle extension
- Adjust crawl rate dynamically based on load.
- Benchmarking
- Check how Scrapy performs on your hardware.
- Jobs: pausing and resuming crawls
- Learn how to pause and resume crawls for large spiders.
Extending Scrapy
- Architecture overview
- Understand the Scrapy architecture.
- Downloader Middleware
- Customize how pages get requested and downloaded.
- Spider Middleware
- Customize the input and output of your spiders.
- Extensions
- Extend Scrapy with your custom functionality.
- Core API
- Use it in extensions and middlewares to extend Scrapy functionality.
- Signals
- See all available signals and how to work with them.
- Item Exporters
- Quickly export your scraped items to a file (XML, CSV, etc.).
All the rest
- Release notes
- See what has changed in recent Scrapy versions.
- Contributing to Scrapy
- Learn how to contribute to the Scrapy project.
- Versioning and API Stability
- Understand Scrapy versioning and API stability.