Python Scrapy spider code and pitfall fixes

This post covers crawling detail pages.


Directory structure:

kaoshi_bqg.py

import scrapy
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from ..items import BookBQGItem


class KaoshiBqgSpider(scrapy.Spider):
    name = 'kaoshi_bqg'
    allowed_domains = ['biquge5200.cc']
    start_urls = ['https://www.biquge5200.cc/xuanhuanxiaoshuo/']

    # Pitfall: a `rules` tuple is only honoured by CrawlSpider, not
    # scrapy.Spider, so these rules are ignored here and the links are
    # followed manually in the callbacks below. Also note the regex
    # quantifier syntax is {m,n}, not {m-n}.
    rules = (
        # Rule matching the article list pages
        Rule(LinkExtractor(allow=r'https://www.biquge5200.cc/xuanhuanxiaoshuo/'), follow=True),
        # Rule matching the article detail pages
        Rule(LinkExtractor(allow=r'.+/[0-9]{1,3}_[0-9]{2,6}/'), callback='parse_item', follow=False),
    )

    # Book titles on the listing page
    def parse(self, response):
        a_list = response.xpath('//*[@id="newscontent"]/div[1]/ul//li//span[1]/a')
        for li in a_list:
            name = li.xpath(".//text()").get()
            detail_url = li.xpath(".//@href").get()
            yield scrapy.Request(url=detail_url, callback=self.parse_book, meta={'info': name})

    # All chapter names of a single book
    def parse_book(self, response):
        name = response.meta.get('info')
        # Skip the first 20 <dd> entries (the "latest chapters" block duplicates them)
        list_a = response.xpath('//*[@id="list"]/dl/dd[position()>20]//a')
        for li in list_a:
            chapter = li.xpath(".//text()").get()
            url = li.xpath(".//@href").get()
            yield scrapy.Request(url=url, callback=self.parse_content, meta={'info': (name, chapter)})

    # Content of each chapter
    def parse_content(self, response):
        name, chapter = response.meta.get('info')
        # getall() returns a list of paragraph strings
        content = response.xpath('//*[@id="content"]//p/text()').getall()
        item = BookBQGItem(name=name, chapter=chapter, content=content)
        yield item

Article reprinted from: http://ybzwz.com/article/pcjcj.html