Q: Scrapy FormRequest to the next page: callback does not go to the next function

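Stack Overflow user

Asked 2022-02-19 04:29:13

Answers: 1 · Views: 153 · Followers: 0 · Votes: 1

I recently started learning Scrapy and web scraping. I am working on my first project and I am stuck. I would be very grateful if someone could help me solve this :)

I am scraping the page http://esg.krx.co.kr/contents/02/02020000/ESG02020000.jsp

So far my program scrapes all 77 pages (I know that is a bit hard-coded; I will try to change it later) and gets company_name and company_share_id. Now I am trying to go to company_page_url and send another POST request to get the data from the chart (not every company has this chart). However, it does not seem to call parse_company_result.

My code is below: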

import scrapy
import json
from scrapy.http import Request


class EsgKrx1Spider(scrapy.Spider):
name = 'esg_krx1'
allowed_domains = ['esg.krx.co.kr']

def start_requests(self):
    #sending a post request to the web
    return [scrapy.FormRequest("http://esg.krx.co.kr/contents/99/ESG99000001.jspx",
                               formdata={'sch_com_nm': '',
                                         'sch_yy': '2021',
                                         'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                                         'code': '02/02020000/esg02020000',
                                         'pageFirstCall': 'Y'},
                               callback=self.parse)]

def parse(self, response):
    url = "http://esg.krx.co.kr/contents/99/ESG99000001.jspx"

    total_pages = 77
    for page in range(total_pages):
        payload = {
            'sch_com_nm': '',
            'sch_yy': '2021',
            'pagePath': '/contents/02/02020000/ESG02020000.jsp',
            'code': '02/02020000/esg02020000',
            'curPage': str(page+1)
        }

        yield scrapy.FormRequest(url=url,
                                 method='POST',
                                 formdata=payload,
                                 callback=self.parse_result)

def parse_result(self, response):
    dict_data = json.loads(response.text)

    # looping in the result and assigning the company name
    for i in dict_data['result']:
        company_name = i['com_abbrv']
        compay_share_id = i['isu_cd']
        print(company_name, compay_share_id)

        company_page_url = f"http://esg.krx.co.kr/contents/02/02010000/ESG02010000.jsp?isu_cd={compay_share_id}"
        yield Request(company_page_url)

        data_url = "http://esg.krx.co.kr/contents/99/ESG99000001.jspx"

        headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
        }

        # yield response.follow(url=data_url, method='POST', callback=self.parse_company_result, headers=headers)
        yield scrapy.FormRequest(url=data_url,
                                 method='POST',
                                 headers=headers,
                                 callback=self.parse_company_result)


def parse_company_result(self, response):
    graph_data = json.loads(response.text)
    print(graph_data)

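Of course, all the functions are inside the class; the code just did not paste the way I expected.

So my questions are:

How do I get to the company page URL?

Maybe the request is right, but then what am I doing wrong afterwards?

Maybe I am not getting a response from data_url?

I would appreciate any help.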

web-scraping

callback

scrapy

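1 Answer

Stack Overflow user

Accepted answer

Posted 2022-02-19 21:33:57

I updated your script because it had quite a few errors, namely:

In parse_result, it is better to create a separate function to parse the company URLs rather than handling them in the same function.
You need to include the payload in order to get the JSON back from the request URL. Again, it is better to split these steps into separate parsers so you can see what is happening at each stage.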
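
The second point can be checked without running the spider. As a rough sketch using only the standard library, and assuming that FormRequest URL-encodes its formdata into the POST body, the snippet below builds the body for one chart request. The helper name build_chart_payload and the chart number '1' are illustrative assumptions; the field names are the ones used in the spider's parse_company, and the company id comes from the sample output.

```python
from urllib.parse import urlencode, parse_qs


def build_chart_payload(company_share_id, code_no, chart_no):
    """Return the form fields for one chart request (hypothetical helper)."""
    return {
        'url_isu_cd': str(company_share_id),
        'isu_cd': '',
        'sch_com_nm': '',
        'pagePath': '/contents/02/02010000/ESG02010000.jsp',
        'code': f'02/02010000/esg02010000_0{code_no}',
        'chartNo': str(chart_no),
    }


# '1' is an illustrative chart number, not a value confirmed by the site.
payload = build_chart_payload('008700', 1, '1')
body = urlencode(payload)  # roughly what ends up in the POST body
print(body)

# An empty payload gives an empty body, so the server cannot tell
# which company or chart is being requested.
empty_body = urlencode({})
print(repr(empty_body))
```

This is why the original request, which was sent with headers but no formdata, had nothing useful for the endpoint to answer.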

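I have built the scraper in a layered way so that you can follow what happens from top to bottom.

Additional notes:

cb_kwargs lets you carry variables from one parser to another. That way I can take the company id and name from parse_result and yield them in the last parser. Note: the company id is essential for the payload in parse_company, so it is worth getting used to how cb_kwargs works.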
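
To make the mechanism concrete, here is a small simulation in plain Python with no Scrapy involved. It assumes that Scrapy invokes the callback as callback(response, **cb_kwargs). FakeRequest and deliver are hypothetical stand-ins for the real request object and engine, and the response is just a JSON string.

```python
import json


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: stores a callback and its kwargs."""

    def __init__(self, url, callback, cb_kwargs=None):
        self.url = url
        self.callback = callback
        self.cb_kwargs = dict(cb_kwargs or {})


def deliver(request, response_text):
    """Hypothetical stand-in for the engine: call the callback with the extra kwargs."""
    return list(request.callback(response_text, **request.cb_kwargs))


def parse_company_result(response_text, company_share_id, company_name):
    # The keys of cb_kwargs must match these parameter names exactly.
    yield {
        'data': json.loads(response_text),
        'company_name': company_name,
        'company_share_id': company_share_id,
    }


request = FakeRequest(
    'http://esg.krx.co.kr/contents/99/ESG99000001.jspx',
    callback=parse_company_result,
    cb_kwargs={'company_share_id': '008700', 'company_name': '아남전자'},
)
items = deliver(request, '{"block1": []}')
print(items)
```

If a key in cb_kwargs does not match a parameter name of the callback, the call fails with a TypeError, which is a common source of confusion when renaming variables.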

import scrapy
import json
from scrapy.http import Request

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:97.0) Gecko/20100101 Firefox/97.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'en-GB,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'X-Requested-With': 'XMLHttpRequest',
    'Origin': 'http://esg.krx.co.kr',
    'Connection': 'keep-alive',
    'Referer': 'http://esg.krx.co.kr/contents/02/02010000/ESG02010000.jsp?isu_cd=004710',
}

class EsgKrx1Spider(scrapy.Spider):
    name = 'esg_krx1'
    allowed_domains = ['esg.krx.co.kr']
    
    def start_requests(self):
        #sending a post request to the web
        return [scrapy.FormRequest("http://esg.krx.co.kr/contents/99/ESG99000001.jspx",
                                formdata={'sch_com_nm': '',
                                            'sch_yy': '2021',
                                            'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                                            'code': '02/02020000/esg02020000',
                                            'pageFirstCall': 'Y'},
                                callback=self.parse)]
    
    def parse(self, response):
        url = "http://esg.krx.co.kr/contents/99/ESG99000001.jspx"
    
        total_pages = 77
        for page in range(total_pages):
            payload = {
                'sch_com_nm': '',
                'sch_yy': '2021',
                'pagePath': '/contents/02/02020000/ESG02020000.jsp',
                'code': '02/02020000/esg02020000',
                'curPage': str(page+1)
            }
    
            yield scrapy.FormRequest(url=url,
                                    method='POST',
                                    formdata=payload,
                                    callback=self.parse_result)
    
    def parse_result(self, response):
        dict_data = json.loads(response.text)
    
        # looping in the result and assigning the company name
        for i in dict_data['result']:
            company_name = i['com_abbrv']
            company_share_id = i['isu_cd']

            company_page_url = f"http://esg.krx.co.kr/contents/02/02010000/ESG02010000.jsp?isu_cd={company_share_id}"
            yield Request(company_page_url,
            #headers=headers, 
            callback = self.parse_company, cb_kwargs = {
                'company_share_id':company_share_id,
                'company_name':company_name
            })

    def parse_company(self, response, company_share_id, company_name):
        """Grab the chart ID from the web page and store it as a list."""

        chart_id = response.xpath("(//div[@class='CHART-AREA'])[1]//div//@id").get()
        if chart_id is None:
            # Not every company page has a chart, so there is nothing to request.
            return
        chart_id = [chart_id.split("chart")[-1]]

        # Notice that the number at the end of 'code' in the payload changes for each chart.

        for id_of_chart in chart_id:
            for code_no in  range(1, 3):
                yield scrapy.FormRequest(
                    url = 'http://esg.krx.co.kr/contents/99/ESG99000001.jspx',
                    method='POST',
                    # headers=headers,
                    formdata = {
                            'url_isu_cd': str(company_share_id),
                            'isu_cd': '',
                            'sch_com_nm': '',
                            'pagePath': '/contents/02/02010000/ESG02010000.jsp',
                            'code': f'02/02010000/esg02010000_0{code_no}',
                            'chartNo': f'{id_of_chart}'
                                                                    },
                    callback = self.parse_company_result,
                    cb_kwargs = {
                        'company_share_id':company_share_id,
                        'company_name':company_name
                    }
                )
        
    def parse_company_result(self, response, company_share_id, company_name):
        graph_data = json.loads(response.text)
        yield {
            'data':graph_data, 
            'company_name':company_name,
            'company_share_id':company_share_id
        }

Output:

{'data': {'block1': [{'yy': '2019', 'pnt0': '7', 'pnt1': '2', 'pnt2': 'null'}, {'yy': '2020', 'pnt0': '7', 'pnt1': '2', 'pnt2': 'null'}, {'yy': '2021', 'pnt0': '7', 'pnt1': '2', 'pnt2': 'null'}]}, 'company_name': '아남전자', 'company_share_id': '008700'}

...
...
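
The values in block1 arrive as strings, including the literal string 'null'. As a sketch of one possible post-processing step, the snippet below converts the sample item shown above into integers and None. The function name clean_block is hypothetical, the meaning of pnt0 to pnt2 is not stated in the source, and the code assumes every non-null value is an integer string, as in the sample.

```python
def clean_block(rows):
    """Convert string fields to int, mapping the literal 'null' to None.

    Assumes every non-'null' value is an integer string, as in the sample.
    """
    cleaned = []
    for row in rows:
        cleaned.append({
            key: None if value == 'null' else int(value)
            for key, value in row.items()
        })
    return cleaned


# Sample item copied from the spider output above.
item = {
    'data': {'block1': [
        {'yy': '2019', 'pnt0': '7', 'pnt1': '2', 'pnt2': 'null'},
        {'yy': '2020', 'pnt0': '7', 'pnt1': '2', 'pnt2': 'null'},
        {'yy': '2021', 'pnt0': '7', 'pnt1': '2', 'pnt2': 'null'},
    ]},
    'company_name': '아남전자',
    'company_share_id': '008700',
}

rows = clean_block(item['data']['block1'])
for row in rows:
    print(row)
```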

Votes: 2

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/71182185
