Exporting Data with an Exporter

The component responsible for exporting data is called an Exporter. Scrapy's built-in exporters are:

{
    'json': 'scrapy.exporters.JsonItemExporter',
    'jsonlines': 'scrapy.exporters.JsonLinesItemExporter',
    'jl': 'scrapy.exporters.JsonLinesItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
    'xml': 'scrapy.exporters.XmlItemExporter',
    'marshal': 'scrapy.exporters.MarshalItemExporter',
    'pickle': 'scrapy.exporters.PickleItemExporter',
}

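To export in any other format, you need to implement your own Exporter. For example, to export data in Excel format:

First, create a Python file in the same directory as settings.py (my_exporters.py, to match the module path registered in settings.py later on) and add the following code: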

from scrapy.exporters import BaseItemExporter
# Use the third-party library xlwt to write data to Excel
import xlwt


class ExcelItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        # Let the base class process the common exporter options
        self._configure(kwargs)
        self.file = file
        # Create the workbook object
        self.wbook = xlwt.Workbook()
        # Create the worksheet object
        self.wsheet = self.wbook.add_sheet('scrapy')
        # self.row tracks the index of the next row to write
        self.row = 0

    def finish_exporting(self):
        # Write the whole workbook to the output file once the export is done
        self.wbook.save(self.file)

    def export_item(self, item):
        # The base class's _get_serialized_fields method returns an
        # iterator over the item's (name, value) field pairs
        fields = self._get_serialized_fields(item)
        for col, v in enumerate(x for _, x in fields):
            # Write each field value into its own column of the current row
            self.wsheet.write(self.row, col, v)
        self.row += 1
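
To make the row and column bookkeeping in export_item easier to follow, here is a minimal sketch that runs without Scrapy or xlwt. RecordingSheet and write_items are hypothetical names introduced only for this illustration; the sheet simply records every write(row, col, value) call, and plain dicts stand in for Scrapy items.

```python
class RecordingSheet:
    """Hypothetical stand-in for an xlwt worksheet: it only records writes."""

    def __init__(self):
        self.cells = {}

    def write(self, row, col, value):
        # Same call shape as the wsheet.write(row, col, value) call above
        self.cells[(row, col)] = value


def write_items(sheet, items):
    """Write one item per row and one field value per column.

    Returns the number of rows written.
    """
    row = 0
    for item in items:
        # Plain dicts stand in for Scrapy items (an assumption of this sketch)
        for col, value in enumerate(item.values()):
            sheet.write(row, col, value)
        row += 1
    return row


sheet = RecordingSheet()
rows_written = write_items(sheet, [
    {'name': 'Book A', 'price': 10},
    {'name': 'Book B', 'price': 12},
])
```

Each item lands on its own row, and the row counter only advances after all of that item's fields have been written, which is the same pattern the exporter uses with self.row.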

After adding a new export format, register it in settings.py:

FEED_EXPORTERS = {
    # The new Exporter
    'excel': 'pachong.my_exporters.ExcelItemExporter'
}
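
The idea behind this setting can be sketched as a simple lookup: the format key selects a dotted class path, and project-level entries are layered on top of the built-in ones. The code below is an illustration only, not Scrapy's actual implementation; BUILTIN_EXPORTERS, resolve_exporter_path and load_class are hypothetical names, and the built-in mapping is abbreviated.

```python
from importlib import import_module

# Abbreviated copy of the built-in mapping listed earlier (illustration only)
BUILTIN_EXPORTERS = {
    'json': 'scrapy.exporters.JsonItemExporter',
    'csv': 'scrapy.exporters.CsvItemExporter',
}

# Project-level additions, as registered in settings.py
FEED_EXPORTERS = {
    'excel': 'pachong.my_exporters.ExcelItemExporter',
}


def resolve_exporter_path(fmt, custom=FEED_EXPORTERS, builtin=BUILTIN_EXPORTERS):
    """Return the dotted class path registered for a format key."""
    merged = {**builtin, **custom}  # custom entries override built-in ones
    if fmt not in merged:
        raise ValueError(f'Unknown feed format: {fmt!r}')
    return merged[fmt]


def load_class(path):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, attr = path.rpartition('.')
    return getattr(import_module(module_name), attr)
```

With this in place, resolve_exporter_path('excel') gives back the path registered above, and load_class turns any importable dotted path into the class itself.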

Information required to export data

Export file path
Export data format

Ways to export

Via the command line
Via the configuration file

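Via the command line

# -o specifies the export file path; -t specifies the export format
# For books.json, Scrapy infers from the .json extension that we want JSON, so -t is not needed
# The fully explicit form: scrapy crawl books -t json -o books.data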
scrapy crawl books -o books.json
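
The extension-based inference described in the comments can be sketched roughly as follows. This is a simplified illustration under the assumption that the format is just the file extension without its dot; infer_feed_format is a hypothetical helper, not a Scrapy function.

```python
import os


def infer_feed_format(path, explicit_format=None):
    """Pick the export format for an output path.

    An explicitly given format (the -t option) wins; otherwise the
    file extension, without the leading dot, is used.
    """
    if explicit_format:
        return explicit_format
    extension = os.path.splitext(path)[1]
    return extension.lstrip('.')
```

This is why books.json needs no -t option, while books.data does: the .data extension does not correspond to any registered format key.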

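Via the configuration file

Parameters

FEED_URI
Export file path

FEED_FORMAT
Export data format

FEED_EXPORT_ENCODING
Encoding of the exported file (by default, JSON output uses numeric \uXXXX escapes for non-ASCII characters; the other formats use utf-8)

FEED_EXPORT_FIELDS
Fields to include in the exported data (all fields are exported by default)

FEED_EXPORTERS
Dictionary of user-defined Exporters, used to add new export formats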
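
Put together, a settings.py fragment using these parameters might look like the sketch below. The file name and the field names 'name' and 'price' are assumptions made for illustration; substitute the fields your own items define. Newer Scrapy releases also provide a FEEDS setting, but this sketch sticks to the parameters listed above.

```python
# settings.py (fragment): an illustrative sketch, values are assumptions

# Export file path
FEED_URI = 'books.csv'

# Export data format; must be a key known to Scrapy or to FEED_EXPORTERS
FEED_FORMAT = 'csv'

# Write the output as utf-8 instead of relying on the per-format default
FEED_EXPORT_ENCODING = 'utf-8'

# Export only these fields, in this order (hypothetical field names)
FEED_EXPORT_FIELDS = ['name', 'price']

# Register the custom Excel exporter under the 'excel' format key
FEED_EXPORTERS = {
    'excel': 'pachong.my_exporters.ExcelItemExporter',
}
```

With settings like these, running scrapy crawl books would be expected to write the feed without any -o or -t option on the command line.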