Using Scrapy from a script or Celery task

3 minute read

Scrapy is a web scraping framework for Python. If you followed the tutorial, the steps include creating a project, defining an item, writing a spider, and initiating a crawl from the command line.

This method is fine for a large scraping project, but what if you’d like to scrape some web content from within another application, or spawn a Celery task to do so asynchronously? The documentation describes a way to run Scrapy from a script, but I found the CrawlerProcess and Twisted plumbing more complex than it should be. Similarly, a Google search turns up various methods based on the multiprocessing library, but they can be difficult to follow, and often won’t work under Celery, whose worker processes don’t play well with spawning subprocesses of their own.
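For reference, the documented approach looks roughly like the sketch below (MySpider stands in for your own spider class). The sticking point is that process.start() runs a Twisted reactor, which can only be started once per process, so it is awkward to embed in a long-running application or a Celery worker.

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({'USER_AGENT': 'my-user-agent'})  # settings as a plain dict
process.crawl(MySpider)  # MySpider is a placeholder for your own spider class
process.start()          # blocks until the crawl finishes; the reactor can't be restarted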

To simplify things, I created a package called scrapyscript and made it available on PyPI. The source is on GitHub. It works on Python 2.7 and 3.4.
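If you’d like to follow along, it installs the usual way with pip install scrapyscript.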

To use it, you first need to define an instance of scrapy.spiders.Spider. For example, this code instructs Scrapy to retrieve www.python.org, parse out the title using the XPath expression //title/text(), extract() the text itself, and return the result as a Python dictionary. Of course, your spider could be as complex as you like, limited only by the capabilities of Scrapy itself.

This example also makes use of the payload option, which is explained later.

from scrapy.spiders import Spider

class PythonSpider(Spider):
    name = 'myspider'
    start_urls = ['http://www.python.org']

    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        # payload is attached to the spider by scrapyscript's Job (see below)
        if self.payload:
            mantra = self.payload.get('mantra', None)
        else:
            mantra = None
        return {'title': title,
                'mantra': mantra}

spider = PythonSpider()

Optionally, you can create an instance of scrapy.settings.Settings, to further customize how Scrapy will behave. Otherwise, scrapyscript will use Scrapy’s default values.

Let’s define settings to specify a user agent.

from scrapy.settings import Settings
settings = Settings()
settings.set('USER_AGENT',
             'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.86 Safari/537.36')

Jobs

Next, we need to define one or more instances of scrapyscript.Job. Jobs contain your spider, and any data (payload) you would like to pass into the running instance.

Basic Job

from scrapyscript import Job

basicjob = Job(spider)

Passing an object

To make extra data available in the running spider, add it to the job using the payload kwarg.

jobwithdata = Job(spider,
                  payload={'mantra': 'Simple is better than complex.'})  # available in spider as self.payload

Processor

We have now defined two jobs, basicjob and jobwithdata. To start Scrapy and run them, we need to create an instance of scrapyscript.Processor and call its run method, passing the jobs in as a list.

Optionally, you can supply settings. Since we’ve defined them, we’ll pass them in.

Processor(settings=settings).run([basicjob, jobwithdata])
[{'mantra': None, 'title': ['Welcome to Python.org']},
 {'mantra': 'Simple is better than complex.',
  'title': ['Welcome to Python.org']}]

How does it work?

scrapyscript uses a fork of Python’s multiprocessing library to create a separate process, launch a Twisted reactor inside it, and run the spiders under Scrapy. All Jobs run in parallel, and the call blocks until they are all complete. The results come back as a list, as you can see above.

Using Celery

In part 2, I’ll walk through the process of using scrapyscript in a Celery worker.
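As a rough preview (a minimal sketch, not the full walkthrough; the broker URL is a placeholder, and it assumes the PythonSpider and settings defined above), a Celery task wrapping scrapyscript might look something like this:

from celery import Celery
from scrapyscript import Job, Processor

app = Celery('tasks', broker='redis://localhost:6379/0')  # placeholder broker URL

@app.task
def crawl(mantra):
    # build the Job inside the task so the crawl runs in the worker process
    job = Job(PythonSpider(), payload={'mantra': mantra})
    return Processor(settings=settings).run([job])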

Complete Code

For your convenience, here is the full example.

from scrapy.spiders import Spider
from scrapy.settings import Settings
from scrapyscript import Job, Processor

class PythonSpider(Spider):
    name = 'myspider'
    start_urls = ['http://www.python.org']

    def parse(self, response):
        title = response.xpath('//title/text()').extract()
        # payload is attached to the spider by scrapyscript's Job
        if self.payload:
            mantra = self.payload.get('mantra', None)
        else:
            mantra = None
        return {'title': title,
                'mantra': mantra}

spider = PythonSpider()

settings = Settings()
settings.set('USER_AGENT',
             'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.86 Safari/537.36')

basicjob = Job(spider)
jobwithdata = Job(spider,
                  payload={'mantra': 'Simple is better than complex.'})  # available in spider as self.payload

results = Processor(settings=settings).run([basicjob, jobwithdata])
print(results)
