Q: How do I reach the next page while scraping if the link stays the same?

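Stack Overflow user

Asked 2022-02-17 13:17:51

2 answers · 131 views · 0 followers · 0 votes

I have recently been working on web scraping and I got stuck. I need to scrape the data from the next page, but there is only a clickable button and the link stays the same. So my question is: if the URL stays the same, how do I get to the next page? The page I am scraping is http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp

My code so far: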

import json

import scrapy


class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    # allowed_domains takes bare domain names; a value with a URL path never matches
    allowed_domains = ['esg.krx.co.kr']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp/']

    def start_requests(self):
        # Send a POST request to the endpoint that returns the table data
        return [scrapy.FormRequest("http://esg.krx.co.kr/contents/99/ESG99000001.jspx",
                                   formdata={'sch_com_nm': '',
                                             'sch_yy': '2021',
                                             'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                                             'code': '02/02020000/esg02020000',
                                             'pageFirstCall': 'Y'},
                                   callback=self.parse)]

    def parse(self, response):
        dict_data = json.loads(response.text)

        # Loop over the results and print each company's name and share ID
        for i in dict_data['result']:
            company_name = i['com_abbrv']
            company_share_id = i['isu_cd']
            print(company_name, company_share_id)

So far this only gets the information from the first page. Now I need to move on to the next page. Can someone explain how to do that?

web-scraping

scrapy

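2 Answers

Stack Overflow user

Accepted answer

Posted 2022-02-17 17:02:45

The website you are scraping exposes an API that you can call directly instead of using Splash. If you check the network tab in your browser, you will see the POST request being sent to the server.

See the sample code below. I have hard-coded the total number of pages, but you can find a way to get that number automatically instead of hard-coding the value.

Note the use of response.follow. It automatically handles cookies and other headers.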

import scrapy

class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    allowed_domains = ['esg.krx.co.kr']
    start_urls = ['http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    custom_settings = {
        "USER_AGENT": 'Mozilla/5.0 (X11; Linux x86_64; rv:97.0) Gecko/20100101 Firefox/97.0'
    }

    def parse(self, response):
        # Send one POST request per page to the JSON API
        url = "http://esg.krx.co.kr/contents/99/ESG99000001.jspx"

        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        }

        total_pages = 77  # hard-coded page count
        for page in range(total_pages):
            payload = f"sch_com_nm=&sch_yy=2021&pagePath=%2Fcontents%2F02%2F02020000%2FESG02020000.jsp&code=02%2F02020000%2Fesg02020000&curPage={page+1}"
            yield response.follow(url=url, method='POST', callback=self.parse_result, headers=headers, body=payload)

    def parse_result(self, response):
        # Loop over the results and yield each company's name and share ID
        for item in response.json().get('result', []):
            yield {
                'company_name': item.get('com_abbrv'),
                'company_share_id': item.get('isu_cd')
            }
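
One possible way to avoid a hard-coded page count is to keep requesting pages until one comes back empty. The sketch below shows only that stopping rule in plain Python. It rests on an unverified assumption: that the endpoint returns an empty `result` list once `curPage` goes past the last page. The names `iter_all_rows`, `fetch_page`, `max_pages`, and `fake_fetch_page` are hypothetical and exist only for illustration; `fake_fetch_page` stands in for the real POST request.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]


def iter_all_rows(fetch_page: Callable[[int], List[Row]],
                  max_pages: int = 1000) -> Iterator[Row]:
    """Yield rows page by page and stop at the first empty page.

    max_pages is a safety cap so the loop still ends if the
    "empty page after the last one" assumption turns out to be wrong.
    """
    for page in range(1, max_pages + 1):
        rows = fetch_page(page)
        if not rows:
            return
        yield from rows


def fake_fetch_page(page: int) -> List[Row]:
    """Hypothetical stand-in for the POST request: two pages of made-up rows."""
    data = {
        1: [{"com_abbrv": "A", "isu_cd": "000001"},
            {"com_abbrv": "B", "isu_cd": "000002"}],
        2: [{"com_abbrv": "C", "isu_cd": "000003"}],
    }
    return data.get(page, [])


rows = list(iter_all_rows(fake_fetch_page))
print(len(rows))
```

In a spider, the same idea would probably mean yielding the request for page + 1 from the callback only while the current `result` is non-empty, rather than scheduling every page up front.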

2 votes

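Stack Overflow user

Posted 2022-02-17 13:56:17

I find it easier to use scrapy_splash with JavaScript-heavy websites like the one you are working with, because they usually take a while to load after a request is sent. So I wrote a simple Lua script that loads the site, and then I parse the information I need.

You will find that the payload contains the current page number; by iterating that number up to the last page on the site, you can fetch each following page.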
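
As a small illustration of that payload, here is a sketch that builds the form body for a given page with the standard library. The field names and values are copied from the code in this thread; the helper name `build_payload` and its `year` parameter are hypothetical, introduced only for this example.

```python
from urllib.parse import urlencode


def build_payload(page: int, year: str = "2021") -> str:
    """Return the URL-encoded form body for one page of results."""
    fields = {
        "sch_com_nm": "",
        "sch_yy": year,
        "pagePath": "/contents/02/02020000/ESG02020000.jsp",
        "code": "02/02020000/esg02020000",
        "curPage": page,  # the only field that changes between pages
    }
    return urlencode(fields)


# Bodies for the first three pages
payloads = [build_payload(page) for page in range(1, 4)]
for body in payloads:
    print(body)
```

Only `curPage` differs between the bodies, which is why the URL in the browser never changes while the table does.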

Because sites like this can block you very quickly, it is important to add timers and download delays so that you are less likely to get blocked.
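
A minimal sketch of what such delays might look like as Scrapy settings follows. The setting names are standard Scrapy settings, but the dictionary name `POLITE_SETTINGS` is hypothetical, and every value except the 3-second `DOWNLOAD_DELAY` used in this answer is an assumption to tune for the site.

```python
# Hypothetical throttling values; adjust them for the target site.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 3,                  # base wait between requests, in seconds
    "RANDOMIZE_DOWNLOAD_DELAY": True,     # vary the wait between 0.5x and 1.5x of the base
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # one request at a time to the same domain
    "AUTOTHROTTLE_ENABLED": True,         # let Scrapy adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 3,        # initial delay used by AutoThrottle
    "AUTOTHROTTLE_MAX_DELAY": 30,         # upper bound when the server is slow
}
```

A dictionary like this could be assigned to a spider's `custom_settings` attribute or merged into the project's settings.py.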

Here is a working scraper:

import scrapy
from scrapy_splash import SplashRequest
import json

script = """
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(7))
  return splash:html()
end
"""
class KorenSiteSpider(scrapy.Spider):
    name = 'k-site'
    start_urls = ['https://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp']
    custom_settings = {
        'USER_AGENT':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
        'DOWNLOAD_DELAY':3
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url = url,
                callback = self.parse, 
                endpoint='execute',
                args = {'lua_source':script}
            )

    def parse(self, response):
        for i in range(1, 78, 1):
            yield scrapy.FormRequest(
                url = 'https://esg.krx.co.kr/contents/99/ESG99000001.jspx',
                method = 'POST',
                formdata = {
                            'sch_com_nm': '',
                            'sch_yy': '2021',
                            'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                            'code': '02/02020000/esg02020000',
                            'curPage': str(i)
                            },
                callback = self.parse_json
            )

    def parse_json(self, response):
        dict_data = json.loads(response.text)

        # Loop over the results and yield each company's name and share ID
        for i in dict_data['result']:
            company_name = i['com_abbrv']
            company_share_id = i['isu_cd']
            yield {
                'company:name':company_name,
                'company_share_id':company_share_id
            }

Output:

2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '페이퍼코리아', 'company_share_id': '001020'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '평화산업', 'company_share_id': '090080'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '평화홀딩스', 'company_share_id': '010770'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '포스코', 'company_share_id': '005490'}
2022-02-17 13:55:04 [scrapy.core.scraper] DEBUG: Scraped from <200 https://esg.krx.co.kr/contents/99/ESG99000001.jspx>
{'company:name': '포스코강판', 'company_share_id': '058430'}

0 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/71158877
