Scrapy quotesbot source code analysis

2022-05-12

GitHub: https://github.com/scrapy/quotesbot

Both spiders extract the same data from the same website, but toscrape-css employs CSS selectors, while toscrape-xpath employs XPath expressions.
According to the README, one of the example spiders is written with CSS selectors and the other with XPath expressions.

You can run a spider using the scrapy crawl command, such as:
$ scrapy crawl toscrape-css

If you want to save the scraped data to a file, you can pass the -o option:
$ scrapy crawl toscrape-css -o quotes.json

Start a crawl and save the output to a local JSON file: scrapy crawl crawlname -o quotes.json

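Two methods:
extract(): returns a list containing every matched string. If there is only one match, it still returns a list, in the form ['ABC'].
extract_first(): returns a single string, the first element of that list, or None if nothing matched.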

Note: all of the sites crawled below were saved locally and served through phpstudy.
Source of the official example spider:

# -*- coding: utf-8 -*-
import scrapy


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
       # 'http://quotes.toscrape.com/',
        "http://localhost/qts.html"
    ]

    def parse(self, response):
        # From the page source, each block on the page is one quote,
        # so "//" is used to select every div whose class is "quote".
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('./span[@class="text"]/text()').extract_first(),
                'author': quote.xpath('.//small[@class="author"]/text()').extract_first(),
                'tags': quote.xpath('.//div[@class="tags"]/a[@class="tag"]/text()').extract()
            }

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()  # href of the "next" link, or None on the last page
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))  # resolve the link against the current page URL

Source of a practice spider I wrote to scrape job listings from a recruitment site:

import scrapy
import time

class BossSpider(scrapy.Spider):
    name = "witherc"
    #allowed_domains = ["www.zhipin.com"]  # domains the spider may crawl (domain names only, no scheme)

    #with open('/Users/w1therc/Desktop/「全国Python招聘」 - BOSS直聘.html','r', encoding='UTF8') as url:
    #    start_urls = url

    start_urls = [
        "http://localhost/boss.html"
        #"/Users/w1therc/Desktop/「全国Python招聘」 - BOSS直聘.html"
    ]

    def parse(self, response):
        for quote in response.xpath('//div[@class="job-primary"]'):
            yield {
                '职位名称': quote.xpath('.//span[@class="job-name"]/a[@target="_blank"]/text()').extract_first(),  # job title
                '薪资': quote.xpath('.//span[@class="red"]/text()').extract_first(),  # salary
                '工作地点': quote.xpath('.//span[@class="job-area"]/text()').extract()  # work location
            }
        '''
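        The three lines below, which follow the link to the next page and crawl it, are correct in themselves. The error occurs because the crawled page is saved locally at http://localhost/boss.html, but the next-page URL points back to https://www.zhipin.com/c100010000-p100407/?page=2&ka=page-2
        This is still unresolved; for now, crawl a single page only.
        next_page_url = response.xpath('//div[@class="page"]/a[@class="next"]/@href').extract_first()  # href of the "next" link
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))  # resolve the link against the current page URL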
        '''

Both spiders above use XPath expressions to parse the page structure.

next_page_url = response.xpath('//div[@class="page"]/a[@class="next"]/@href').extract_first()  # href of the "next" link
if next_page_url is not None:
    yield scrapy.Request(response.urljoin(next_page_url))  # resolve the link against the current page URL

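The three lines above, which follow the link to the next page and crawl it, are correct in themselves. The error occurs because the crawled page is saved locally at http://localhost/boss.html, but the next-page URL points back to https://www.zhipin.com/c100010000-p100407/?page=2&ka=page-2
This is still unresolved; for now, crawl a single page only. (Current idea: save several levels of pages locally and scrape those. At the thesis defense I can explain that the pages were saved locally for testing, and that in practice more mature code could be used to avoid crawler detection, so the original URLs could be crawled and exported directly.)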
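
The behaviour described here can be reproduced with the standard library alone. Scrapy's response.urljoin resolves a link against the URL of the current response, much as urllib.parse.urljoin does, so the sketch below uses urllib.parse.urljoin as a stand-in. The relative href is an assumed example, not a value taken from the real page.

```python
from urllib.parse import urljoin

# URL of the locally saved page, as given in the post.
base = "http://localhost/boss.html"

# Assumption: what a relative "next" link might look like.
relative_href = "/c100010000-p100407/?page=2&ka=page-2"
# An absolute link like the one the post reports being followed.
absolute_href = "https://www.zhipin.com/c100010000-p100407/?page=2&ka=page-2"

# A relative link is resolved against the local host.
relative_result = urljoin(base, relative_href)
# An absolute link is returned unchanged, so the crawl leaves localhost.
absolute_result = urljoin(base, absolute_href)

print(relative_result)
print(absolute_result)
```

If the saved HTML contains absolute links, rewriting them to relative paths in the local copy would be one way to keep the crawl on localhost.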
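How to save the crawl output as an xls file:
~~1. Export directly to a CSV file (solved): scrapy crawl witherc -o 333.csv~~
~~To fix garbled characters when opening the CSV file, add FEED_EXPORT_ENCODING = 'gb18030' to settings.py.~~
~~Alternatively, install openpyxl and configure pipelines.py.~~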
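
A likely explanation for the garbled text is that the program opening the file assumes a GB-family encoding while the file was written as UTF-8. The sketch below shows the effect with the standard csv module rather than Scrapy's exporter. The sample row and the helper name to_csv_bytes are made up for illustration.

```python
import csv
import io

# Hypothetical scraped row, using the same keys as the spider above.
rows = [{"职位名称": "Python 开发", "薪资": "15-25K"}]


def to_csv_bytes(rows, encoding):
    """Serialize rows to CSV text, then encode it with the given encoding."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode(encoding)


utf8_bytes = to_csv_bytes(rows, "utf-8")
gb_bytes = to_csv_bytes(rows, "gb18030")

# A reader that assumes gb18030 misreads the UTF-8 bytes...
garbled = utf8_bytes.decode("gb18030", errors="replace")
# ...but reads the gb18030 bytes correctly.
readable = gb_bytes.decode("gb18030")

print(garbled)
print(readable)
```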

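~~Part three design: work out how to link parts one and two, i.e. look into using the pd.read_excel method from the pandas library to process the data saved in the Excel file and produce visual output.~~

~~Idea for visualizing the CSV: first read the contents of the CSV file, then visualize them.~~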
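
A minimal sketch of that read-then-visualize idea, using only the standard library: parse the exported CSV into rows, then aggregate one column into the labels and values a chart would need. The sample rows and the choice to count jobs per location are assumptions for illustration.

```python
import csv
import io
from collections import Counter

# Stand-in for the exported CSV file; these rows are made up.
csv_text = (
    "职位名称,薪资,工作地点\n"
    "Python 开发,15-25K,北京\n"
    "爬虫工程师,12-20K,上海\n"
    "数据分析师,10-18K,北京\n"
)

# Each row becomes a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Count listings per location; Counter keeps first-seen order.
location_counts = Counter(row["工作地点"] for row in rows)

# Parallel lists are a convenient shape for most charting libraries.
labels = list(location_counts)
values = [location_counts[label] for label in labels]

print(labels, values)
```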

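The 403 is returned because the URL obtained by the code above is the original zhipin.com URL rather than the local localhost one. My IP had already been banned during earlier visits, hence the 403 error.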

~~Convert to a JSON file and then visualize it; data loaded from JSON is stored as dicts, which makes it convenient to work with.~~
On data formats: what pyecharts essentially does is serialize ECharts options from Python dicts into JSON, so which data types pyecharts supports depends on which data types JSON supports.
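
To make the point about JSON types concrete, here is a sketch using the standard json module rather than pyecharts itself. The options dict is a made-up stand-in for a chart configuration. Values with a JSON counterpart serialize cleanly, while a Python set does not.

```python
import json

# Hypothetical chart options; the keys are illustrative, not a real pyecharts config.
options = {
    "title": "职位数量",
    "xAxis": ("北京", "上海"),  # a tuple becomes a JSON array
    "series": [2, 1],
    "smooth": True,  # becomes true
    "subtitle": None,  # becomes null
}

encoded = json.dumps(options, ensure_ascii=False)
decoded = json.loads(encoded)

# A set has no JSON equivalent, so the default encoder rejects it.
try:
    json.dumps({"tags": {"python", "scrapy"}})
    set_supported = True
except TypeError:
    set_supported = False

print(encoded)
print(set_supported)
```

In practice this means converting values such as sets to lists before handing them to a chart.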