I am trying to scrape email addresses from a website. To do that, I extract every link in the results list, and then on each extracted page I pull out the email address. The problem is that the next-page button only goes up to page 50. However, if I edit the URL by hand and enter 51, a new page still loads. So I would like to build the next-page link with a for loop, e.g. looping from 1 to 999 and updating the next-page URL on each iteration. Below is my code, which works fine as long as the next_page button is available.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BestMoviesSpider(CrawlSpider):
    name = 'best_movies'
    allowed_domains = ['dastelefonbuch.de']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.dastelefonbuch.de/Suche/Textilien%20Gmbh',
            headers={'User-Agent': self.user_agent})

    rules = (
        Rule(LinkExtractor(restrict_xpaths="//a[@class=' name']"),
             callback='parse_item', follow=True,
             process_request='set_user_agent'),
        Rule(LinkExtractor(restrict_xpaths="//a[@class='nextLink next'][2]"),
             follow=True,
             process_request='set_user_agent'),
    )

    def set_user_agent(self, request):
        request.headers['User-Agent'] = self.user_agent
        return request

    def parse_item(self, response):
        yield {
            'email': response.xpath(
                "//a[starts-with(@href,'mailto')]/@href").get(),
        }

Posted on 2020-02-17 13:45:56
Check the url in the start_requests function. It is not right. I think you meant: "https://www.dastelefonbuch.de/Suche/Textilien"
https://stackoverflow.com/questions/60117774