Python Crawlers: Setting Up the Scrapy Environment
How to set up the Scrapy environment
First, install Python itself. For setting up the Python environment, see: https://blog.csdn.net/alice_tl/article/details/76793590
Next, install Scrapy.
1. Install Scrapy: in the terminal, run pip install Scrapy (note: this works best on a network that can reach PyPI directly).
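If PyPI is slow or unreachable from your network, pointing pip at a mirror may help. The -i flag is a standard pip option; the mirror URL below (the Tsinghua mirror) is just one example:
pip install Scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple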
The installation progress output looks like this:
alicedeMacBook-Pro:~ alice$ pip install Scrapy
Collecting Scrapy
Using cached https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl
Collecting w3lib>=1.17.0 (from Scrapy)
Using cached https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from Scrapy)
xxxxxxxxxxxxxxxxxxxxx
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/dist.py", line 380, in fetch_build_egg
return cmd.easy_install(req)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/command/easy_install.py", line 632, in easy_install
raise DistutilsError(msg)
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('incremental>=16.10.1')
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/
An error appears saying that Twisted is missing:
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/
2. Install Twisted. In the terminal, enter: sudo pip install twisted==13.1.0
alicedeMacBook-Pro:~ alice$ pip install twisted==13.1.0
Collecting twisted==13.1.0
Downloading https://files.pythonhosted.org/packages/10/38/0d1988d53f140ec99d37ac28c04f341060c2f2d00b0a901bf199ca6ad984/Twisted-13.1.0.tar.bz2 (2.7MB)
100% |████████████████████████████████| 2.7MB 398kB/s
Requirement already satisfied: zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from twisted==13.1.0) (4.1.1)
Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->twisted==13.1.0) (18.5)
Installing collected packages: twisted
Running setup.py install for twisted ... error
Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-inJwZ2/twisted/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-record-OmuVWF/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.macosx-10.13-intel-2.7
creating build/lib.macosx-10.13-intel-2.7/twisted
copying twisted/copyright.py -> build/lib.macosx-10.13-intel-2.7/twisted
copying twisted/_version.py -> build/li
3. Run sudo pip install Scrapy again. It still fails, this time with an error saying lxml is not installed:
Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)
alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
100% |████████████████████████████████| 256kB 210kB/s
Collecting w3lib>=1.17.0 (from Scrapy)
xxxxxxxxxxxx
Downloading https://files.pythonhosted.org/packages/90/50/4c315ce5d119f67189d1819629cae7908ca0b0a6c572980df5cc6942bc22/Twisted-18.7.0.tar.bz2 (3.1MB)
100% |████████████████████████████████| 3.1MB 59kB/s
Collecting lxml (from Scrapy)
Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)
4. Install lxml with: sudo pip install lxml
alicedeMacBook-Pro:~ alice$ sudo pip install lxml
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting lxml
Downloading https://files.pythonhosted.org/packages/a1/2c/6b324d1447640eb1dd240e366610f092da98270c057aeb78aa596cda4dab/lxml-4.2.4-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.7MB)
100% |████████████████████████████████| 8.7MB 187kB/s
Installing collected packages: lxml
Successfully installed lxml-4.2.4
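Optionally, confirm that lxml is importable before retrying Scrapy; lxml.etree exposes an LXML_VERSION tuple, so with lxml 4.2.4 installed the following one-liner should print (4, 2, 4, 0):
alicedeMacBook-Pro:~ alice$ python -c "from lxml import etree; print(etree.LXML_VERSION)"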
5. Install Scrapy once more with sudo pip install Scrapy. This time it succeeds.
alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
100% |████████████████████████████████| 256kB 11.5MB/s
Collecting w3lib>=1.17.0 (from Scrapy)
xxxxxxxxx
Requirement already satisfied: lxml in /Library/Python/2.7/site-packages (from Scrapy) (4.2.4)
Collecting functools32; python_version < "3.0" (from parsel>=1.1->Scrapy)
Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/functools32/
Downloading https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
100% |████████████████████████████████| 61kB 66kB/s
Installing collected packages: w3lib, cssselect, functools32, parsel, queuelib, PyDispatcher, attrs, pyasn1-modules, service-identity, zope.interface, constantly, incremental, Automat, idna, hyperlink, PyHamcrest, Twisted, Scrapy
Running setup.py install for functools32 ... done
Running setup.py install for PyDispatcher ... done
Found existing installation: zope.interface 4.1.1
Uninstalling zope.interface-4.1.1:
Successfully uninstalled zope.interface-4.1.1
Running setup.py install for zope.interface ... done
Running setup.py install for Twisted ... done
Successfully installed Automat-0.7.0 PyDispatcher-2.0.5 PyHamcrest-1.9.0 Scrapy-1.5.1 Twisted-18.7.0 attrs-18.1.0 constantly-15.1.0 cssselect-1.0.3 functools32-3.2.3.post2 hyperlink-18.0.0 idna-2.7 incremental-17.5.0 parsel-1.5.0 pyasn1-modules-0.2.2 queuelib-1.5.0 service-identity-17.0.0 w3lib-1.19.0 zope.interface-4.5.0
6. Check whether Scrapy installed successfully by entering scrapy --version.
If Scrapy's version information appears, e.g. Scrapy 1.5.1 - no active project, the install is good.
alicedeMacBook-Pro:~ alice$ scrapy --version
Scrapy 1.5.1 - no active project
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
[ more ] More commands available when run from project directory
Use "scrapy command> -h" to see more info about a command
PS: If at any point you could not reach pypi.org normally, or you installed without sudo administrator privileges, you will see error messages similar to the following:
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip/_internal/basecommand.py", line 141, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip/_internal/commands/install.py", line 299, in run
resolver.resolve(requirement_set)
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip/_internal/basecommand.py", line 141, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip/_internal/commands/install.py", line 299, in run
resolver.resolve(requirement_set)
File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 102, in resolve
self._resolve_one(requirement_set, req)
File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 256, in _resolve_one
abstract_dist = self._get_abstract_dist_for(req_to_install)
File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 209, in _get_abstract_dist_for
self.require_hashes
File "/Library/Python/2.7/site-packages/pip/_internal/operations/prepare.py", line 283, in prepare_linked_requirement
progress_bar=self.progress_bar
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 836, in unpack_url
progress_bar=progress_bar
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 673, in unpack_http_url
progress_bar)
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 897, in _download_http_url
_download_url(resp, link, content_file, hashes, progress_bar)
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 617, in _download_url
hashes.check_against_chunks(downloaded_chunks)
File "/Library/Python/2.7/site-packages/pip/_internal/utils/hashes.py", line 48, in check_against_chunks
for chunk in chunks:
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 585, in written_chunks
for chunk in chunks:
File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 574, in resp_read
decode_content=False):
File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 465, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 430, in read
raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 345, in _error_catcher
raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.
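These ReadTimeoutError failures are a network problem between pip and PyPI rather than anything Scrapy-specific. Two standard options that often help: raise pip's download timeout with --default-timeout, and pass sudo's -H flag so pip uses root's home directory for its cache (which also silences the cache-permission warnings above):
sudo -H pip install --default-timeout=60 Scrapy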
With that, the Scrapy environment is set up as the guide describes.
Common Errors When Running Scrapy Spiders, and How to Fix Them
Following the first Spider exercise, the code below is saved as dmoz_spider.py in the tutorial/spiders directory:
import scrapy

class DmozSpider(scrapy.Spider):
    # The spider's unique name; "scrapy crawl dmoz" refers to it by this name
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Save each downloaded page to a local file named after
        # the second-to-last segment of the URL path
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
In the terminal, run scrapy crawl dmoz to try to start the spider.
Error 1:
Scrapy 1.6.0 - no active project
Unknown command: crawl
alicedeMacBook-Pro:~ alice$ scrapy crawl dmoz
Scrapy 1.6.0 - no active project
Unknown command: crawl
Use "scrapy" to see available commands
Cause: when you run startproject, Scrapy automatically generates a scrapy.cfg file. When you later launch a spider from the command line, crawl looks for scrapy.cfg in the current working directory (the official documentation notes this as well); if it cannot find scrapy.cfg, it concludes there is no active project.
Solution: cd into the root directory of the dmoz project, i.e. the directory containing scrapy.cfg, and run scrapy crawl dmoz there; see the layout sketched below.
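For reference, a project generated by scrapy startproject tutorial looks roughly like this (newer Scrapy versions add a few more module files, trimmed here to the parts that matter):
tutorial/
    scrapy.cfg            # run "scrapy crawl dmoz" from this level
    tutorial/
        __init__.py
        items.py
        settings.py
        spiders/
            __init__.py
            dmoz_spider.py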
Under normal circumstances the output should be:
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
In practice, however, that is not what happened.
Error 2:
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: dmoz'
alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
2019-04-19 09:28:23 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-04-19 09:28:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:39:00) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.3.0-x86_64-i386-64bit
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 69, in load
return self._spiders[spider_name]
KeyError: 'dmoz'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load
raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: dmoz'
Cause: the working directory is wrong; you need to be inside the project directory that contains the dmoz spider.
Solution: equally simple; check the directory again, cd into the right one, and verify with scrapy list as shown below.
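scrapy list prints the name of every spider the current project can load, so it is a quick way to confirm you are in the right place; with the code above it should print dmoz:
alicedeMacBook-Pro:tutorial alice$ scrapy list
dmoz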
Error 3:
File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in <module>
from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util
alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
2018-08-06 22:25:23 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: tutorial)
2018-08-06 22:25:23 [scrapy.utils.log] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.10 (default, Jul 15 2017, 17:16:57) - [GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.31)], pyOpenSSL 0.13.1 (LibreSSL 2.2.7), cryptography unknown, Platform Darwin-17.3.0-x86_64-i386-64bit
2018-08-06 22:25:23 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 11, in module>
sys.exit(execute())
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 150, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
func(*a, **kw)
File "/Library/Python/2.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
t/ssl.py", line 230, in module>
from twisted.internet._sslverify import (
File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in module>
from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util
I searched online for a long time without finding an answer. Some bloggers said the pyOpenSSL or Scrapy install itself was broken, so I reinstalled pyOpenSSL and Scrapy, but at first the same error kept coming back and I had no idea how to fix it.
Eventually, after reinstalling pyOpenSSL and Scrapy once more, the problem seemed to be resolved ~
2019-04-19 09:46:37 [scrapy.core.engine] INFO: Spider opened
2019-04-19 09:46:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-19 09:46:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/robots.txt> (referer: None)
2019-04-19 09:46:39 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2019-04-19 09:46:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>: HTTP status code is not handled or not allowed
2019-04-19 09:46:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2019-04-19 09:46:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/>: HTTP status code is not handled or not allowed
2019-04-19 09:46:40 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-19 09:46:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 737,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 2103,
'downloader/response_count': 3,
'downloader/response_status_count/403': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 19, 1, 46, 40, 570939),
'httperror/response_ignored_count': 2,
'httperror/response_ignored_status_count/403': 2,
'log_count/DEBUG': 3,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'memusage/max': 65601536,
'memusage/startup': 65597440,
'response_received_count': 3,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/403': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2019, 4, 19, 1, 46, 37, 468659)}
2019-04-19 09:46:40 [scrapy.core.engine] INFO: Spider closed (finished)
alicedeMacBook-Pro:tutorial alice$
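Note that the spider itself now runs, but every request came back 403: the target site refused us (dmoz.org has been offline since 2017, and many live sites also reject Scrapy's default user agent). If you practice against a live site instead, setting a browser-like user agent in the project's settings.py sometimes helps; USER_AGENT is a standard Scrapy setting, and the string below is only an example:
# tutorial/tutorial/settings.py
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'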
This concludes this tutorial on setting up the Scrapy environment for Python crawlers. For more on Scrapy environment setup, please search 腳本之家's earlier articles, and we hope you will continue to support 腳本之家!