Simulating User Login Easily with Scrapy's FormRequest.from_response()

科技一点鑫得 2024-03-11 07:44:38

Scrapy is a very widely used crawling framework, with more than 50k stars on GitHub. Most sites worth scraping today require a login, and this article shows how to use the FormRequest.from_response() method provided by the Scrapy framework itself to easily simulate a GitHub user login. This article does not go deep into how Scrapy works internally; reading the official documentation is recommended. The Scrapy architecture diagram in the official docs is packed with useful information, and comparing your experiments against it as you practice will keep yielding new insights.

Installing Scrapy

It is recommended to create a virtual environment with python -m venv venv and do project development inside it; the benefit of virtual environments is that every project can use different versions of third-party libraries without interfering with the others. Run pip install scrapy to install the Scrapy framework.

Note: even after a successful install, running a Scrapy spider on Windows may fail with an error from the cryptography dependency. This is likely caused by a version mismatch among Scrapy's dependencies. Taking Scrapy 2.11.1, the version where I hit this problem, as an example: the cryptography version installed alongside it was 42.0.5. If you run into an error like the one below, try pip uninstall cryptography to remove the library, then install a different version with pip install cryptography==41.0.7.

(venv) PS C:\xinRepo\xinspiders> scrapy crawl debug
...
  File "c:\xinrepo\xinspiders\venv\lib\site-packages\OpenSSL\SSL.py", line 10, in <module>
    from OpenSSL._util import (
  File "c:\xinrepo\xinspiders\venv\lib\site-packages\OpenSSL\_util.py", line 6, in <module>
    from cryptography.hazmat.bindings.openssl.binding import Binding
  File "c:\xinrepo\xinspiders\venv\lib\site-packages\cryptography\hazmat\bindings\openssl\binding.py", line 15, in <module>
    from cryptography.exceptions import InternalError
  File "c:\xinrepo\xinspiders\venv\lib\site-packages\cryptography\exceptions.py", line 9, in <module>
    from cryptography.hazmat.bindings._rust import exceptions as rust_exceptions
ImportError: DLL load failed while importing _rust: The specified procedure could not be found.
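
If you see this error, the workaround described above looks like the following in the same PowerShell session (41.0.7 is the version that worked for me; other nearby versions may also work):

(venv) PS C:\xinRepo\xinspiders> pip uninstall cryptography
(venv) PS C:\xinRepo\xinspiders> pip install cryptography==41.0.7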

Creating the Spider Project

After activating the virtual environment and installing Scrapy, run scrapy startproject projectname to create a project:

(venv) PS C:\xinRepo\demo\spider_demo> scrapy startproject spider_demo
New Scrapy project 'spider_demo', using template directory 'C:\xinRepo\demo\spider_demo\venv\Lib\site-packages\scrapy\templates\project', created in:
    C:\xinRepo\demo\spider_demo\spider_demo

You can start your first spider with:
    cd spider_demo
    scrapy genspider example example.com

Following the printed hint, use the scrapy genspider command to create the GitHub spider, here passing github as the spider name and github.com/login as the start URL:

(venv) PS C:\xinRepo\demo\spider_demo> cd .\spider_demo\
(venv) PS C:\xinRepo\demo\spider_demo\spider_demo> scrapy genspider github github.com/login
Created spider 'github' using template 'basic' in module:
  spider_demo.spiders.github

After the project and spider are created, the generated project directory structure looks like the tree below; github.py is what the scrapy genspider command generated. Multiple spiders can live under the spiders directory, and you can also add spider files by hand instead of using the command.
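
For reference, scrapy startproject plus scrapy genspider produce the standard Scrapy template layout, which looks roughly like this:

spider_demo/
    scrapy.cfg            # deployment configuration
    spider_demo/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/
            __init__.py
            github.py     # generated by scrapy genspider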

Simulating GitHub Login

Here is the code that simulates a GitHub login, so you can get a direct feel for how simple logging in with scrapy.FormRequest.from_response() is: essentially one line of code does the job.

import scrapy


class GithubSpider(scrapy.Spider):
    name = "github"
    allowed_domains = ["github.com"]
    start_urls = ["https://github.com/login"]

    def parse(self, response):
        # Extract the login form from the response and submit it with our credentials
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"login": "yourusername", "password": "yourpassword"},
            callback=self.after_login,
        )

    def after_login(self, response):
        if response.css("div.AppHeader-user"):
            self.log("Simulated GitHub login succeeded")
        else:
            self.log("Simulated GitHub login failed")

scrapy.FormRequest is simply a simulated form submission, and from_response means the form is extracted directly from the response. If you analyze GitHub's login flow by hand, you will find that besides the username and password, the submission must also carry the value of a hidden authenticity_token field to succeed, and that value is already set in the response of the login page. from_response automatically takes care of such hidden form values, so we only need to put the username and password into formdata; the keys "login" and "password" correspond to the name attributes of the matching input elements.
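
To make the hidden-field handling concrete, here is a minimal sketch of roughly what from_response automates, assuming the login page still exposes its CSRF token as a hidden input named authenticity_token and posts to /session (the 302 redirect from POST https://github.com/session in the run log later in this article suggests it does). This is for illustration only, not a replacement for from_response:

    def parse(self, response):
        # Pull the hidden CSRF token out of the login form ourselves...
        token = response.css('input[name="authenticity_token"]::attr(value)').get()
        # ...and build the POST request by hand instead of letting
        # from_response() copy the hidden fields for us.
        yield scrapy.FormRequest(
            url=response.urljoin("/session"),
            formdata={
                "authenticity_token": token,
                "login": "yourusername",
                "password": "yourpassword",
            },
            callback=self.after_login,
        )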

self.after_login is the callback for the login request. It only performs a simple check to decide whether the login succeeded: if the login worked, the response is sure to contain the avatar element, so if Scrapy's CSS selector finds a div.AppHeader-user element the login succeeded; otherwise it failed.

After a successful login in a browser, the site keeps the user logged in by setting cookies, and the Scrapy framework likewise manages cookies automatically. In other words, once the login succeeds, every subsequent crawl request carries the cookies and therefore stays in the logged-in state.
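
For example, after the login check you could keep crawling pages that require authentication without any extra cookie handling. A minimal sketch extending after_login inside GithubSpider (the URL and helper below are illustrative assumptions, not from the original article):

    def after_login(self, response):
        if response.css("div.AppHeader-user"):
            self.log("Simulated GitHub login succeeded")
            # The session cookies set during login are attached automatically,
            # so this request is made as the logged-in user.
            yield scrapy.Request(
                "https://github.com/settings/profile",
                callback=self.parse_profile,
            )

    def parse_profile(self, response):
        # Parse a page that is only reachable while logged in.
        self.log(f"Fetched {response.url} as a logged-in user")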

Proxy Configuration

Access to GitHub from mainland China tends to be slow, so you may need a proxy to speed things up. Configuring a proxy in Scrapy is also very convenient: looking at the Scrapy architecture diagram, in step 4 the Engine passes each request through a series of middlewares before it reaches the Downloader. The project created by scrapy startproject already contains a middleware template; find DOWNLOADER_MIDDLEWARES in the settings.py configuration file and uncomment it to enable the middleware.

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    "spider_demo.middlewares.SpiderDemoDownloaderMiddleware": 543,
}

Then locate the corresponding middleware class in the middlewares.py file and add the proxy setting inside its process_request method:

from scrapy import signals


class SpiderDemoDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Add the proxy configuration (replace with your proxy address)
        request.meta["proxy"] = "http://xxxx:port"

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)
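
Because Scrapy's built-in HttpProxyMiddleware (enabled by default) also honors request.meta["proxy"], an alternative to a custom middleware is to set the proxy on individual requests in the spider itself. A small sketch inside GithubSpider, with the proxy address as a placeholder:

    def start_requests(self):
        for url in self.start_urls:
            # Route only this request through the proxy; requests without
            # the meta key are fetched directly.
            yield scrapy.Request(
                url,
                meta={"proxy": "http://xxxx:port"},
                callback=self.parse,
            )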

Run scrapy crawl github to start the spider; the printed log shows the successful-login message:

(venv) PS C:\xinRepo\demo\spider_demo\spider_demo> scrapy crawl github
2024-03-10 10:24:50 [scrapy.utils.log] INFO: Scrapy 2.11.1 started (bot: spider_demo)
2024-03-10 10:24:50 [scrapy.utils.log] INFO: Versions: lxml 5.1.0.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 24.3.0, Python 3.12.2 (tags/v3.12.2:6abddd9, Feb 6 2024, 21:26:36) [MSC v.1937 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.1 30 Jan 2024), cryptography 42.0.5, Platform Windows-11-10.0.22631-SP0
2024-03-10 10:24:50 [scrapy.addons] INFO: Enabled addons:
...
2024-03-10 10:24:50 [scrapy.core.engine] INFO: Spider opened
2024-03-10 10:24:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-03-10 10:24:50 [github] INFO: Spider opened: github
2024-03-10 10:24:50 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-03-10 10:24:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/robots.txt> (referer: None)
2024-03-10 10:24:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/login> (referer: None)
2024-03-10 10:24:52 [py.warnings] WARNING: C:\xinRepo\demo\spider_demo\venv\Lib\site-packages\scrapy\spidermiddlewares\referer.py:305: RuntimeWarning: Could not load referrer policy 'origin-when-cross-origin, strict-origin-when-cross-origin'
  warnings.warn(msg, RuntimeWarning)
2024-03-10 10:24:52 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://github.com/> from <POST https://github.com/session>
2024-03-10 10:24:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://github.com/> (referer: https://github.com/login)
2024-03-10 10:24:53 [github] DEBUG: Simulated GitHub login succeeded
2024-03-10 10:24:53 [scrapy.core.engine] INFO: Closing spider (finished)
2024-03-10 10:24:53 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
...
2024-03-10 10:24:53 [scrapy.core.engine] INFO: Spider closed (finished)

Summary

Although FormRequest.from_response() is convenient, login verification on more and more web platforms keeps getting more complex, and it is no longer a good fit for scenarios that require entering various kinds of CAPTCHA. Dynamic CAPTCHA scenarios generally require combining Scrapy with a browser-automation library such as Selenium to achieve a simulated login; future articles will cover this.
