Crawling pages with Scrapy

Scrapy is a powerful Python framework for building crawlers, scrapers and parsers. It is built on top of Twisted, which means that the networking operations under the hood are asynchronous, and that improves crawling performance significantly. I'm going to describe the essential Scrapy parts that you will usually use when building a crawler.

Request & Response

These two classes are central to how the framework operates. You create a Request by passing it a URL, and you receive a Response object as an argument in a callback. A callback (sometimes called a handler) is a function that the framework invokes once the request has finished its job, passing it the resulting Response. The Response basically carries the HTML code of the page we've requested, along with metadata such as the URL and the HTTP status:

import scrapy

def parse(response):
	# do something with the response
	pass

# creating a request with the callback we've defined above
scrapy.Request('http://example.com', callback=parse)
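
Inside a callback you can also look at a few attributes that every HTML response carries. Here's a minimal sketch (the printed values are only illustrative):

def parse(response):
	print(response.url)         # the address that was actually requested
	print(response.status)      # HTTP status code, e.g. 200
	print(response.text[:100])  # the first characters of the HTML body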

Selectors

Selectors are pretty easy to understand. Once the request is finished and the handler is invoked, it receives a Response object as its argument. In order to find and extract data from this response we invoke selectors on it by calling the css() method. This method accepts a CSS selector string which tells it what HTML elements to pick. If you have ever written jQuery or CSS then this concept will be familiar. First you invoke a selector on the response, passing a CSS selector string as an argument, then on the resulting selector you call either extract() (to get all occurrences as a list) or extract_first() (to get the first occurrence). Here's an example: the first block is a simplified response body and the second block is a request handler which receives it as an argument:

<html>
<body>
	<h1 class="title">A website</h1>
	<div class="salutation">Hello world</div>
</body>
</html>

def parse(self, response):
	greetings_text = response.css('div.salutation::text').extract_first()
	print(greetings_text) # this will print "Hello world"
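
If the page contained several matching elements, extract() would give you all of them at once. A quick sketch with made-up markup, assuming the page has a few <li class="item"> tags:

def parse(self, response):
	# extract() returns every match as a list of strings
	items = response.css('li.item::text').extract()
	# e.g. ['first item', 'second item', 'third item']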

Spider

This is the place to write your logic: this is where you launch your requests and parse responses. The main entry point of a Spider is the start_requests method, which is invoked when the spider starts and usually contains your initial Request invocations. Further requests can be made inside the callback functions, so in the end we have a hierarchy of requests and handlers that looks like a tree with its root at start_requests.

For a simple parser the usual scenario is that start_requests makes a bunch of requests, which in turn invoke their handlers upon completion. If the requests are of the same nature (e.g. multiple pages of search results which share the same structure but differ in data) you will probably want the same handler for all of them. In a more advanced parser the handler won't be the last execution point: you will probably want to invoke more requests based on the data you've just collected from the response. After you've extracted the data you might want to pass it on for further processing (e.g. storing it in a database or dumping it to a JSON file). You can do that by yielding the object from the handler. Here's another example:

class ExampleSpider(scrapy.Spider):
	# every spider needs a name
	name = 'example'

	def start_requests(self):
		# invoking initial request
		yield scrapy.Request('http://example.com', self.parse)

	def parse(self, response):
		# parsing response from the initial request
		# collecting links
		links = response.css('a.title::attr(href)').extract()
		for link in links:
			# make a request for each link we've collected
			# the handler is different from the one we had
			# in initial request
			yield scrapy.Request(link, self.parse_page)

	def parse_page(self, response):
		# parsing response from each link
		title = response.css('h1.title::text').extract()
		content = response.css('div.content p').extract()

		# returning structured data for further processing
		yield {'title': title, 'content': content}
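
Everything the handlers yield can be picked up by Scrapy's built-in feed exports, so a spider like this can be dumped straight to a JSON file from the command line (assuming the class above is saved as example_spider.py):

scrapy runspider example_spider.py -o output.json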

This simple parser scenario is so common that there is a shortcut to reduce the boilerplate. Here’s a reduced example:

class SimpleSpider(scrapy.Spider):
	name = 'simple'
	# these are the initial urls that you would otherwise
	# write in a start_requests method. A request will be
	# made for each url in this list
	start_urls = ['http://example.com']

	# this method will be called automatically for each
	# finished request from the start_urls.
	def parse(self, response):
		# you can either parse the response and return the data
		# or you can collect further urls and create additional
		# handlers for them, like we did with parse_page previously
		pass
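
If you prefer to launch a spider from a plain Python script instead of the scrapy command, CrawlerProcess can run it for you. A minimal sketch with default settings:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(SimpleSpider)
process.start()  # blocks until the crawl is finished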

Pipeline

This is the final entity that we're going to cover. It is the place where you post-process your data. Post-processing usually means cleaning the data of garbage, applying transformations and, most importantly, saving the data to a database or a file. Pipelines are pretty straightforward. They have a process_item method which is invoked for each piece of data returned from the spider; in this method we usually clean the data or save it to a database. The other methods are open_spider and close_spider: they are used to set up resources (like a database connection or a file handle) before processing begins and to release them once the spider is done. There can be multiple pipelines stacked upon each other, each processing the same item with its own logic. For example, the first pipeline removes HTML tags, the second one appends some text, and the third one saves the data to a database. In my example markparser.storage encapsulates the SQLAlchemy logic.

from markparser.storage import get_session, Place

class MarkparserPipeline(object):

	def open_spider(self, spider):
		# this method is invoked once the spider
		# is initialized. No requests have been
		# made at this point yet
		self.session = get_session()

	def close_spider(self, spider):
		# this method is invoked when the spider
		# is about to exit. All requests have been
		# made already.
		self.session.close()

	def process_item(self, item, spider):
		# here we place our item processing logic
		# we can either modify our data and pass it on
		# for further processing or we can save this
		# item to a database and finish the execution
		record = Place(**item)
		self.session.add(record)
		self.session.commit()
		return item
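
One thing to keep in mind: a pipeline only receives items after it has been enabled in the project settings via ITEM_PIPELINES. The integer defines the order when several pipelines are stacked, with lower numbers running first (the module path below assumes the class lives in markparser/pipelines.py):

ITEM_PIPELINES = {
	'markparser.pipelines.MarkparserPipeline': 300,
}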

OK, that’s it for now. There are a lot of other things like Items or Item Loaders (these help structure the data) which I'll probably cover in the next part of this tutorial.