
Chapter 22. Crawler

Table of Contents

22.1. Scrapy - Python web scraping and crawling framework
22.1.1. Installing the Scrapy development environment
22.1.2. The scrapy command
22.1.3. Scrapy Shell
22.1.4. Crawler projects
22.1.5. Downloading images
22.1.6. XPath

22.1. Scrapy - Python web scraping and crawling framework

https://scrapy.org

22.1.1. Installing the Scrapy development environment

22.1.1.1. Mac

neo@MacBook-Pro ~ % brew install python3
neo@MacBook-Pro ~ % pip3 install scrapy
			

22.1.1.2. Ubuntu

Search for the scrapy package. Scrapy supports both Python 2.7 and Python 3; we only need the Python 3 version.

neo@netkiller ~ % apt-cache search scrapy | grep python3
python3-scrapy - Python web scraping and crawling framework (Python 3)
python3-scrapy-djangoitem - Scrapy extension to write scraped items using Django models (Python3 version)
python3-w3lib - Collection of web-related functions (Python 3)			
			

The default scrapy version on Ubuntu 17.04 is 1.3.0-1. If you need the latest 1.4.0, install it with pip instead.

neo@netkiller ~ % apt search python3-scrapy
Sorting... Done
Full Text Search... Done
python3-scrapy/zesty,zesty 1.3.0-1~exp2 all
  Python web scraping and crawling framework (Python 3)

python3-scrapy-djangoitem/zesty,zesty 1.1.1-1 all
  Scrapy extension to write scraped items using Django models (Python3 version)
			

Install scrapy:

neo@netkiller ~ % sudo apt install python3-scrapy
[sudo] password for neo: 
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess python3-attr python3-boto python3-bs4 python3-cffi-backend python3-click python3-colorama python3-constantly
  python3-cryptography python3-cssselect python3-decorator python3-html5lib python3-idna python3-incremental python3-ipython python3-ipython-genutils python3-libxml2 python3-lxml python3-mysqldb
  python3-openssl python3-pam python3-parsel python3-pexpect python3-pickleshare python3-pil python3-prompt-toolkit python3-ptyprocess python3-pyasn1 python3-pyasn1-modules python3-pydispatch
  python3-pygments python3-queuelib python3-serial python3-service-identity python3-setuptools python3-simplegeneric python3-traitlets python3-twisted python3-twisted-bin python3-w3lib python3-wcwidth
  python3-webencodings python3-zope.interface
Suggested packages:
  python-pexpect-doc python-attr-doc python-cryptography-doc python3-cryptography-vectors python3-genshi python3-lxml-dbg python-lxml-doc default-mysql-server | virtual-mysql-server
  python-egenix-mxdatetime python3-mysqldb-dbg python-openssl-doc python3-openssl-dbg python3-pam-dbg python-pil-doc python3-pil-dbg doc-base python-pydispatch-doc ttf-bitstream-vera python-scrapy-doc
  python3-wxgtk3.0 | python3-wxgtk python-setuptools-doc python3-tk python3-gtk2 python3-glade2 python3-qt4 python3-wxgtk2.8 python3-twisted-bin-dbg
The following NEW packages will be installed:
  ipython3 libmysqlclient20 libwebpmux2 mysql-common python-pexpect python-ptyprocess python3-attr python3-boto python3-bs4 python3-cffi-backend python3-click python3-colorama python3-constantly
  python3-cryptography python3-cssselect python3-decorator python3-html5lib python3-idna python3-incremental python3-ipython python3-ipython-genutils python3-libxml2 python3-lxml python3-mysqldb
  python3-openssl python3-pam python3-parsel python3-pexpect python3-pickleshare python3-pil python3-prompt-toolkit python3-ptyprocess python3-pyasn1 python3-pyasn1-modules python3-pydispatch
  python3-pygments python3-queuelib python3-scrapy python3-serial python3-service-identity python3-setuptools python3-simplegeneric python3-traitlets python3-twisted python3-twisted-bin python3-w3lib
  python3-wcwidth python3-webencodings python3-zope.interface
0 upgraded, 49 newly installed, 0 to remove and 0 not upgraded.
Need to get 7,152 kB of archives.
After this operation, 40.8 MB of additional disk space will be used.
Do you want to continue? [Y/n]
			

Type an uppercase "Y" and press Enter to continue.

22.1.1.3. Installing Scrapy with pip

neo@netkiller ~ % sudo apt install python3-pip
neo@netkiller ~ % pip3 install scrapy
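
After installation you can check which Scrapy version is on the PATH; the version shown below assumes Scrapy 1.4.0 and will differ on other setups.

neo@netkiller ~ % scrapy version
Scrapy 1.4.0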
			

22.1.1.4. Testing Scrapy

Create a test program to verify that the Scrapy installation works correctly.

			
$ cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for title in response.css('h2.entry-title'):
            yield {'title': title.css('a ::text').extract_first()}

        for next_page in response.css('div.prev-post > a'):
            yield response.follow(next_page, self.parse)
EOF
			
				

Run the spider:

$ scrapy runspider myspider.py
			

22.1.2. The scrapy command

		
neo@MacBook-Pro ~/Documents/crawler % scrapy     
Scrapy 1.4.0 - project: crawler

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command
		
			

22.1.2.1. Creating a project

			
neo@MacBook-Pro ~/Documents % scrapy startproject crawler 
New Scrapy project 'crawler', using template directory '/usr/local/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/neo/Documents/crawler

You can start your first spider with:
    cd crawler
    scrapy genspider example example.com
			
				

22.1.2.2. Creating a new spider

			
neo@MacBook-Pro ~/Documents/crawler % scrapy genspider netkiller netkiller.cn
Created spider 'netkiller' using template 'basic' in module:
  crawler.spiders.netkiller
			
				

22.1.2.3. Listing available spiders

			
neo@MacBook-Pro ~/Documents/crawler % scrapy list
bing
book
example
netkiller			
			
				

22.1.2.4. Running a spider

			
neo@MacBook-Pro ~/Documents/crawler % scrapy crawl netkiller
			
				

Write the results to a JSON file:

			
neo@MacBook-Pro ~/Documents/crawler % scrapy crawl netkiller -o output.json					
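
Scrapy picks the export format from the output file's extension, so the same spider can also write other feed formats; a quick sketch (output.csv and output.xml are only example filenames):

neo@MacBook-Pro ~/Documents/crawler % scrapy crawl netkiller -o output.csv
neo@MacBook-Pro ~/Documents/crawler % scrapy crawl netkiller -o output.xml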
			
				

22.1.3. Scrapy Shell

The Scrapy shell is an interactive command-line debugging tool for crawlers; you can use it to analyze the pages being scraped.

		
neo@MacBook-Pro /tmp % scrapy shell http://www.netkiller.cn
2017-09-01 15:23:05 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-09-01 15:23:05 [scrapy.utils.log] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2017-09-01 15:23:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2017-09-01 15:23:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-01 15:23:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-01 15:23:05 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-09-01 15:23:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-01 15:23:05 [scrapy.core.engine] INFO: Spider opened
2017-09-01 15:23:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.netkiller.cn> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x103b2afd0>
[s]   item       {}
[s]   request    <GET http://www.netkiller.cn>
[s]   response   <200 http://www.netkiller.cn>
[s]   settings   <scrapy.settings.Settings object at 0x1049019e8>
[s]   spider     <DefaultSpider 'default' at 0x104be2a90>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> 
		
			

22.1.3.1. response

response is the page returned by the crawler; you can extract the content you need with methods such as css() and xpath().

The current URL
				
>>> response.url
'https://netkiller.cn/linux/index.html'
				
					
status: the HTTP status code
				
>>> response.status
200
				
					
text: the page body

Returns the decoded HTML body of the page:

response.text
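
A minimal sketch of inspecting the body in the shell (the actual values depend on the page being crawled):

>>> len(response.text)     # length of the decoded body
>>> response.text[:20]     # first characters of the HTML source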
				
css

The css() method selects elements using CSS selectors:

				
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Netkiller ebook - Linux ebook</ti'>]

>>> response.css('title').extract()
['<title>Netkiller ebook - Linux ebook</title>']

>>> response.css('title::text').extract()
['Netkiller ebook - Linux ebook']
				
					

Selecting by class

				
>>> response.css('a.ulink')[1].extract()
'<a class="ulink" href="http://netkiller.github.io/" target="_top">http://netkiller.github.io</a>'	

>>> response.css('a.ulink::text')[3].extract()
'http://netkiller.sourceforge.net'
				
					

Working with the result list

				
>>> response.css('a::text').extract_first()
'简体中文'

>>> response.css('a::text')[1].extract()
'繁体中文'

>>> response.css('div.blockquote')[1].css('a.ulink::text').extract()
['Netkiller Architect 手札', 'Netkiller Developer 手札', 'Netkiller PHP 手札', 'Netkiller Python 手札', 'Netkiller Testing 手札', 'Netkiller Java 手札', 'Netkiller Cryptography 手札', 'Netkiller Linux 手札', 'Netkiller FreeBSD 手札', 'Netkiller Shell 手札', 'Netkiller Security 手札', 'Netkiller Web 手札', 'Netkiller Monitoring 手札', 'Netkiller Storage 手札', 'Netkiller Mail 手札', 'Netkiller Docbook 手札', 'Netkiller Project 手札', 'Netkiller Database 手札', 'Netkiller PostgreSQL 手札', 'Netkiller MySQL 手札', 'Netkiller NoSQL 手札', 'Netkiller LDAP 手札', 'Netkiller Network 手札', 'Netkiller Cisco IOS 手札', 'Netkiller H3C 手札', 'Netkiller Multimedia 手札', 'Netkiller Perl 手札', 'Netkiller Amateur Radio 手札']
				
					

Regular expressions

				
>>> response.css('title::text').re(r'Netkiller.*')
['Netkiller ebook - Linux ebook']

>>> response.css('title::text').re(r'N\w+')
['Netkiller']

>>> response.css('title::text').re(r'(\w+) (\w+)')
['Netkiller', 'ebook', 'Linux', 'ebook']
				
					
Getting HTML attributes

Use a::attr() to get the value of an attribute on an HTML tag:

					
>>> response.css('td a::attr(href)').extract_first()
'http://netkiller.github.io/'					
					
						
xpath
				
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Netkiller ebook - Linux ebook</ti'>]

>>> response.xpath('//title/text()').extract_first()
'Netkiller ebook - Linux ebook'				
				
					

XPath selectors can also use the re() method for regular-expression matching:

				
>>> response.xpath('//title/text()').re(r'(\w+)')
['Netkiller', 'ebook', 'Linux', 'ebook']	

>>> response.xpath('//div[@class="time"]/text()').re('[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}')
['2017-09-21 02:01:38']	
				
					

Extract HTML attribute values, such as image URLs:

				
>>> response.xpath('//img/@src').extract()
['graphics/spacer.gif', 'graphics/note.gif', 'graphics/by-nc-sa.png', '/images/weixin.jpg', 'images/neo.jpg', '/images/weixin.jpg']
				
					

Filtering by class

				
>>> response.xpath('//a/@href')[0].extract()
'http://netkiller.github.io/'

>>> response.xpath('//a/text()')[0].extract()
'简体中文'

>>> response.xpath('//div[@class="blockquote"]')[1].css('a.ulink::text').extract()
['Netkiller Architect 手札', 'Netkiller Developer 手札', 'Netkiller PHP 手札', 'Netkiller Python 手札', 'Netkiller Testing 手札', 'Netkiller Java 手札', 'Netkiller Cryptography 手札', 'Netkiller Linux 手札', 'Netkiller FreeBSD 手札', 'Netkiller Shell 手札', 'Netkiller Security 手札', 'Netkiller Web 手札', 'Netkiller Monitoring 手札', 'Netkiller Storage 手札', 'Netkiller Mail 手札', 'Netkiller Docbook 手札', 'Netkiller Project 手札', 'Netkiller Database 手札', 'Netkiller PostgreSQL 手札', 'Netkiller MySQL 手札', 'Netkiller NoSQL 手札', 'Netkiller LDAP 手札', 'Netkiller Network 手札', 'Netkiller Cisco IOS 手札', 'Netkiller H3C 手札', 'Netkiller Multimedia 手札', 'Netkiller Perl 手札', 'Netkiller Amateur Radio 手札']

				
					

Use | to match several rules at once:

				
>>> response.xpath('//ul[@class="topnews_nlist"]/li/h2/a/@href|//ul[@class="topnews_nlist"]/li/a/@href').extract()
				
					
headers
response.headers.getlist('Set-Cookie')
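
Other headers can be read in the same way; a small sketch (the values are returned as bytes and depend on the server):

>>> response.headers['Content-Type']
>>> response.headers.get('Server')
>>> response.request.headers.get('User-Agent')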
				

22.1.4. Crawler projects

22.1.4.1. Creating a project

Create a crawler project:

scrapy startproject project
			

Before scraping, you need to create a new Scrapy project.

			
neo@MacBook-Pro ~/Documents % scrapy startproject crawler 
New Scrapy project 'crawler', using template directory '/usr/local/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/neo/Documents/crawler

You can start your first spider with:
    cd crawler
    scrapy genspider example example.com

neo@MacBook-Pro ~/Documents % cd crawler 
neo@MacBook-Pro ~/Documents/crawler % find .
.
./crawler
./crawler/__init__.py
./crawler/__pycache__
./crawler/items.py
./crawler/middlewares.py
./crawler/pipelines.py
./crawler/settings.py
./crawler/spiders
./crawler/spiders/__init__.py
./crawler/spiders/__pycache__
./scrapy.cfg

			
				

A Scrapy project directory consists mainly of the following files:

scrapy.cfg: deployment configuration file
middlewares.py: project middlewares file
items.py: project items file
pipelines.py: project pipelines file
settings.py: project settings file
spiders: directory that holds the spiders
			

22.1.4.2. Spider

Create a spider named netkiller that crawls the domain netkiller.cn:

			
neo@MacBook-Pro ~/Documents/crawler % scrapy genspider netkiller netkiller.cn
Created spider 'netkiller' using template 'basic' in module:
  crawler.spiders.netkiller
neo@MacBook-Pro ~/Documents/crawler % find .
.
./crawler
./crawler/__init__.py
./crawler/__pycache__
./crawler/__pycache__/__init__.cpython-36.pyc
./crawler/__pycache__/settings.cpython-36.pyc
./crawler/items.py
./crawler/middlewares.py
./crawler/pipelines.py
./crawler/settings.py
./crawler/spiders
./crawler/spiders/__init__.py
./crawler/spiders/__pycache__
./crawler/spiders/__pycache__/__init__.cpython-36.pyc
./crawler/spiders/netkiller.py
./scrapy.cfg

			
				

Open the crawler/spiders/netkiller.py file and change its contents as follows:

			
# -*- coding: utf-8 -*-
import scrapy


class NetkillerSpider(scrapy.Spider):
    name = 'netkiller'
    allowed_domains = ['netkiller.cn']
    start_urls = ['http://www.netkiller.cn/']

    def parse(self, response):
        for link in response.xpath('//div[@class="blockquote"]')[1].css('a.ulink'):
            # self.log('This url is %s' % link)
            yield {
                'name': link.css('a::text').extract(),
                'url': link.css('a.ulink::attr(href)').extract()
                }
            
        pass
			
				

Run the spider:

			
neo@MacBook-Pro ~/Documents/crawler % scrapy crawl netkiller -o output.json
2017-09-08 11:42:30 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: crawler)
2017-09-08 11:42:30 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'crawler', 'FEED_FORMAT': 'json', 'FEED_URI': 'output.json', 'NEWSPIDER_MODULE': 'crawler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['crawler.spiders']}
2017-09-08 11:42:30 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2017-09-08 11:42:30 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-08 11:42:30 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-08 11:42:30 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-09-08 11:42:30 [scrapy.core.engine] INFO: Spider opened
2017-09-08 11:42:30 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-08 11:42:30 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-08 11:42:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.netkiller.cn/robots.txt> (referer: None)
2017-09-08 11:42:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.netkiller.cn/> (referer: None)
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Architect 手札'], 'url': ['../architect/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Developer 手札'], 'url': ['../developer/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller PHP 手札'], 'url': ['../php/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Python 手札'], 'url': ['../python/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Testing 手札'], 'url': ['../testing/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Java 手札'], 'url': ['../java/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Cryptography 手札'], 'url': ['../cryptography/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Linux 手札'], 'url': ['../linux/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller FreeBSD 手札'], 'url': ['../freebsd/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Shell 手札'], 'url': ['../shell/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Security 手札'], 'url': ['../security/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Web 手札'], 'url': ['../www/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Monitoring 手札'], 'url': ['../monitoring/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Storage 手札'], 'url': ['../storage/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Mail 手札'], 'url': ['../mail/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Docbook 手札'], 'url': ['../docbook/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Project 手札'], 'url': ['../project/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Database 手札'], 'url': ['../database/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller PostgreSQL 手札'], 'url': ['../postgresql/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller MySQL 手札'], 'url': ['../mysql/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller NoSQL 手札'], 'url': ['../nosql/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller LDAP 手札'], 'url': ['../ldap/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Network 手札'], 'url': ['../network/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Cisco IOS 手札'], 'url': ['../cisco/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller H3C 手札'], 'url': ['../h3c/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Multimedia 手札'], 'url': ['../multimedia/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Perl 手札'], 'url': ['../perl/index.html']}
2017-09-08 11:42:31 [scrapy.core.scraper] DEBUG: Scraped from <200 http://www.netkiller.cn/>
{'name': ['Netkiller Amateur Radio 手札'], 'url': ['../radio/index.html']}
2017-09-08 11:42:31 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-08 11:42:31 [scrapy.extensions.feedexport] INFO: Stored json feed (28 items) in: output.json
2017-09-08 11:42:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 438,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 6075,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 9, 8, 3, 42, 31, 157395),
 'item_scraped_count': 28,
 'log_count/DEBUG': 31,
 'log_count/INFO': 8,
 'memusage/max': 49434624,
 'memusage/startup': 49434624,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 9, 8, 3, 42, 30, 931267)}
2017-09-08 11:42:31 [scrapy.core.engine] INFO: Spider closed (finished)
			
				

You will see results like this:

			
{'name': ['Netkiller Architect 手札'], 'url': ['../architect/index.html']}			
			
				
Following pages

Next we demonstrate how a spider follows pages. For example, to traverse the ebook 《Netkiller Linux 手札》 at https://netkiller.cn/linux/index.html, first create a spider:

				
neo@MacBook-Pro ~/Documents/crawler % scrapy genspider book netkiller.cn
Created spider 'book' using template 'basic' in module:
  crawler.spiders.book
				
					

Edit the spider:

				
# -*- coding: utf-8 -*-
import scrapy


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['netkiller.cn']
    start_urls = ['https://netkiller.cn/linux/index.html']

    def parse(self, response):
        yield {'title': response.css('title::text').extract()}
        # Extract the link to the next page
        next_page = response.xpath('//a[@accesskey="n"]/@href').extract_first() 
        self.log('Next page: %s' % next_page)
        # If there is a next page, hand it to response.follow to crawl
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)    

        pass
				
					
Saving the crawled content to files

The following example saves the content returned in response.body to a file:

				
# -*- coding: utf-8 -*-
import scrapy


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['netkiller.cn']
    start_urls = ['https://netkiller.cn/linux/index.html']

    def parse(self, response):
        yield {'title': response.css('title::text').extract()}

        filename = '/tmp/%s' % response.url.split("/")[-1]

        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

        next_page = response.xpath('//a[@accesskey="n"]/@href').extract_first()
        self.log('Next page: %s' % next_page)
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)    

        pass
				
				
					

After the task finishes running, check the files that were collected:

				
neo@MacBook-Pro ~/Documents/crawler % ls /tmp/*.html
/tmp/apt-get.html            /tmp/disc.html               /tmp/infomation.html         /tmp/lspci.html              /tmp/svgatextmode.html
/tmp/aptitude.html           /tmp/dmidecode.html          /tmp/install.html            /tmp/lsscsi.html             /tmp/swap.html
/tmp/author.html             /tmp/do-release-upgrade.html /tmp/install.partition.html  /tmp/lsusb.html              /tmp/sys.html
/tmp/avlinux.html            /tmp/dpkg.html               /tmp/introduction.html       /tmp/package.html            /tmp/sysctl.html
/tmp/centos.html             /tmp/du.max-depth.html       /tmp/kernel.html             /tmp/pr01s02.html            /tmp/system.infomation.html
/tmp/cfdisk.html             /tmp/ethtool.html            /tmp/kernel.modules.html     /tmp/pr01s03.html            /tmp/system.profile.html
/tmp/console.html            /tmp/framebuffer.html        /tmp/kudzu.html              /tmp/pr01s05.html            /tmp/system.shutdown.html
/tmp/console.timeout.html    /tmp/gpt.html                /tmp/linux.html              /tmp/preface.html            /tmp/tune2fs.html
/tmp/dd.clone.html           /tmp/hdd.label.html          /tmp/locale.html             /tmp/proc.html               /tmp/udev.html
/tmp/deb.html                /tmp/hdd.partition.html      /tmp/loop.html               /tmp/rpm.html                /tmp/upgrades.html
/tmp/device.cpu.html         /tmp/hwinfo.html             /tmp/lsblk.html              /tmp/rpmbuild.html           /tmp/yum.html
/tmp/device.hba.html         /tmp/index.html              /tmp/lshw.html               /tmp/smartctl.html				
				
					

This is only a demonstration. In production, do not handle storage inside parse(self, response); use a Pipeline, which is covered later.

22.1.4.3. settings.py: the crawler configuration file

Ignore the robots.txt rules:
				
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
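
Other settings that are often adjusted in the same file include the user agent and the download delay; a sketch with example values only:

# Identify the crawler to the site being scraped (example value)
USER_AGENT = 'crawler (+https://netkiller.cn)'
# Wait between requests to avoid overloading the target site
DOWNLOAD_DELAY = 0.25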
				
					

22.1.4.4. Item

In Scrapy, an Item is similar to the concept of an "entity" or "POJO": it is a data-structure class. The spider uses an ItemLoader to put scraped data into the Item.

Here is the items.py file:

			
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
    ctime = scrapy.Field()
    
    pass

			
				

Here is the spider file:

			
# -*- coding: utf-8 -*-
import scrapy
from scrapy.loader import ItemLoader
from crawler.items import CrawlerItem 
import time

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['netkiller.cn']
    start_urls = ['https://netkiller.cn/java/index.html']
    def parse(self, response):

        item_selector = response.xpath('//a/@href')
        for url in item_selector.extract():
            if 'html' in url.split('.'):
                url = response.urljoin(url)
                yield response.follow( url, callback=self.parse_item)

        next_page = response.xpath('//a[@accesskey="n"]/@href').extract_first()
        self.log('Next page: %s' % next_page)
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)   
        
    def parse_item(self, response):
        l = ItemLoader(item=CrawlerItem(), response=response)
        l.add_css('title', 'title::text')
        l.add_value('ctime', time.strftime( '%Y-%m-%d %X', time.localtime() ))
        l.add_value('content', response.body)
        return l.load_item()
			
				

yield response.follow(url, callback=self.parse_item) calls back into parse_item(self, response), which places the scraped data into the Item.

22.1.4.5. Pipeline

A Pipeline's main job is to process Item data, for example computing or merging values; it is usually where the data is saved. The example below writes the scraped data to a JSON file.

Pipelines are disabled by default. To enable pipeline support, edit the settings.py file, find the following setting, and uncomment it.

			
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'crawler.pipelines.CrawlerPipeline': 300,
}
			
				

Edit the pipelines.py file.

			
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

class CrawlerPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.json', 'w')

    def close_spider(self, spider):
        self.file.close()
    def process_item(self, item, spider):
        # self.log("PIPE: %s" % item)
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)   
        return item

			
				

The items.py file is the same as in the previous section:

			
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class CrawlerItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
    ctime = scrapy.Field()
    
    pass

			
				

The spider file is also unchanged:

			
# -*- coding: utf-8 -*-
import scrapy
from scrapy.loader import ItemLoader
from crawler.items import CrawlerItem 
import time

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['netkiller.cn']
    start_urls = ['https://netkiller.cn/java/index.html']
    def parse(self, response):

        item_selector = response.xpath('//a/@href')
        for url in item_selector.extract():
            if 'html' in url.split('.'):
                url = response.urljoin(url)
                yield response.follow( url, callback=self.parse_item)

        next_page = response.xpath('//a[@accesskey="n"]/@href').extract_first()
        self.log('Next page: %s' % next_page)
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)   
        
    def parse_item(self, response):
        l = ItemLoader(item=CrawlerItem(), response=response)
        l.add_css('title', 'title::text')
        l.add_value('ctime', time.strftime( '%Y-%m-%d %X', time.localtime() ))
        l.add_value('content', response.body)
        return l.load_item()
			
				

The resulting items.json file looks like this:

			
{"title": ["5.31.\u00a0Spring boot with Data restful"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.30.\u00a0Spring boot with Phoenix"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.29.\u00a0Spring boot with Apache Hive"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.28.\u00a0Spring boot with Elasticsearch 5.5.x"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.27.\u00a0Spring boot with Elasticsearch 2.x"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.23.\u00a0Spring boot with Hessian"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.22.\u00a0Spring boot with Cache"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.26.\u00a0Spring boot with HTTPS SSL"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.25.\u00a0Spring boot with Git version"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.24.\u00a0Spring boot with Apache Kafka"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.21.\u00a0Spring boot with Scheduling"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.20.\u00a0Spring boot with Oauth2"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.19.\u00a0Spring boot with Spring security"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.16.\u00a0Spring boot with PostgreSQL"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.18.\u00a0Spring boot with Velocity template"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.13.\u00a0Spring boot with MongoDB"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.11.\u00a0Spring boot with Session share"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.17.\u00a0Spring boot with Email"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.15.\u00a0Spring boot with Oracle"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.14.\u00a0Spring boot with MySQL"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.10.\u00a0Spring boot with Logging"], "ctime": ["2017-09-11 11:57:53"]}
{"title": ["5.9.\u00a0String boot with RestTemplate"], "ctime": ["2017-09-11 11:57:53"]}			
			
				

22.1.5. Downloading images

Create a project:

		
neo@MacBook-Pro ~/Documents % scrapy startproject photo			
		
			
		
neo@MacBook-Pro ~/Documents % cd photo
		
			

Install the dependency library:

		
neo@MacBook-Pro ~/Documents/photo % pip3 install image		
		
			

Create a spider:

		
neo@MacBook-Pro ~/Documents/photo % scrapy genspider jiandan jandan.net			
		
			

22.1.5.1. Configuring settings.py

Ignore the robots.txt rules:

			
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
			
				

Configure the image storage path and thumbnail sizes:

			
# Path where downloaded images are stored
IMAGES_STORE='/tmp/photo'
#DOWNLOAD_DELAY = 0.25
# Thumbnail sizes; setting this value generates thumbnails
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (200, 200),
}				
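
The custom image pipeline written in the next section also has to be enabled here; a sketch assuming the default project layout, where the class path is photo.pipelines.PhotoPipeline:

ITEM_PIPELINES = {
    'photo.pipelines.PhotoPipeline': 300,
}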
			
				

22.1.5.2. Editing the pipelines.py file

Add the get_media_requests() and item_completed() methods.

Note: PhotoPipeline must inherit from ImagesPipeline.

			
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class PhotoPipeline(ImagesPipeline):
    # def process_item(self, item, spider):
        # return item
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield scrapy.http.Request('http:'+image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
			
				

22.1.5.3. Editing items.py

Define the fields used by the image pipeline:

			
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class PhotoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # URLs of the images to download
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()
    pass
			
				

22.1.5.4. The spider file

			
# -*- coding: utf-8 -*-
import scrapy
from scrapy.loader import ItemLoader
from photo.items import PhotoItem

class JiandanSpider(scrapy.Spider):
    name = 'jiandan'
    # allowed_domains = ['jandan.net']
    allowed_domains = [] 
    start_urls = ['http://jandan.net/ooxx']

    def parse(self, response):
       
        l = ItemLoader(item=PhotoItem(), response=response)
        l.add_xpath('image_urls','//img//@src' )
        yield l.load_item()

        next_page = response.xpath('//a[@class="previous-comment-page"]//@href').extract_first() # next page
        if next_page:
            yield response.follow(next_page,self.parse)
        pass
    def parse_page(self, response):
        l = ItemLoader(item=PhotoItem(), response=response)
        l.add_xpath('image_urls','//img//@src' )
        return l.load_item()				
			
				

22.1.6. XPath

22.1.6.1. Logical operators

and
				
>>> response.xpath('//span[@class="time" and @id="news-time"]/text()').extract()
['2017-10-09 09:46']				
				
					
or
				
//*[@class='foo' or contains(@class,' foo ') or starts-with(@class,'foo ') or substring(@class,string-length(@class)-3)=' foo']				
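
A sketch of using such an expression in the shell ("date" is a hypothetical class name, used only to illustrate the or operator):

>>> response.xpath('//div[@class="time" or @class="date"]/text()').extract()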
				
					

22.1.6.2. Functions

text()
				
>>> response.xpath('//title/text()').extract_first()
'Netkiller ebook - Linux ebook'				
				
					
contains()

contains() matches elements whose class attribute contains a given string:

				
//*[contains(@class,'foo')]
				
					
				
>>> response.xpath('//ul[contains(@class, "topnews_nlist")]/li/h2/a/@href|//ul[contains(@class, "topnews_nlist")]/li/a/@href').extract()	
				
					

Matching text content

				
>>> response.xpath('//div[@id="epContentLeft"]/h1[contains(text(),"10")]/text()').extract()
['美联储10月起启动渐进式缩表 维持基准利率不变']