Q: Parsing HTML strings from a web page based on a CSS attribute

Stack Overflow user

Asked 2017-05-06 02:15:25

Answers: 2 | Views: 1.8K | Followers: 0 | Votes: 1

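I'm trying to extract specific URLs from a web page based on a CSS attribute. I can pull the first one, but I'm having trouble getting the full URL prepended, or getting more than one URL.

I have tried using join or parse and ran into many issues. I keep getting global errors with joinurl.

Is there an easier way to do this?

I'm using CentOS 6.5 and Python 2.7.5.

The code below gives the first URL, but without the http://www... part inline.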

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"  # Name of the Spider, required value

    start_urls = ["http://www.pdga.com/videos/"]

    # Entry point for the spiders
    def parse(self, response):
        SET_SELECTOR = 'tbody'
        for brickset in response.css(SET_SELECTOR):

            HTML_SELECTOR = 'td.views-field.views-field-title a ::attr(href)'
            yield {
                'http://www.pdga.com': brickset.css(HTML_SELECTOR).extract()[0]
            }

Current output

http://www.pdga.com

/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman

Expected output

A complete list of unbroken URLs.

I don't have enough reputation points to post several examples.

centos6.5

python

css

scrapy

2 Answers

Stack Overflow user

Accepted answer

Posted 2017-05-06 07:27:00

To get absolute URLs from relative links, you can use Scrapy's response.urljoin() method and rewrite the code as follows:

import scrapy

class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]

    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            yield scrapy.Request(response.urljoin(link), callback=self.parse_page)

        # If page contains link to next page extract link and parse
        next_page = response.xpath('//a[contains(., "next")]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

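    def parse_page(self, response):
        link = response.xpath('//iframe/@src').extract_first()
        # extract_first() returns None when the page has no iframe, so skip those pages
        if link:
            yield {
                # Prefix the scheme and drop the query string from the iframe src
                'you_tube_link': 'http:' + link.split('?')[0]
            }

# To save the links in CSV format, run in a console: scrapy crawl pdgavideos -o links.csv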
# http://www.youtube.com/embed/tYBF-BaqVJ8
# http://www.youtube.com/embed/_H0hBBc1Azg
# http://www.youtube.com/embed/HRbKFRCqCos
# http://www.youtube.com/embed/yz3D1sXQkKk
# http://www.youtube.com/embed/W7kuKe2aQ_c
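
As a rough illustration of what the URL joining does, the sketch below uses the standard library's urljoin, which Scrapy's response.urljoin() is documented to wrap, using the response's own URL as the base. The helper names absolute and strip_query are made up for this example, and the "?rel=0" query string is an assumed sample value, not something taken from the site.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7, as used in the question

# URL of the page the links were found on (from start_urls above)
PAGE_URL = "http://www.pdga.com/videos/"


def absolute(page_url, href):
    """Resolve href against the URL of the page it was found on (hypothetical helper)."""
    return urljoin(page_url, href)


def strip_query(url):
    """Drop everything from the first '?' onward (hypothetical helper)."""
    return url.split("?")[0]


# A root-relative link like the one in the question
relative = "/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman"
print(absolute(PAGE_URL, relative))

# A protocol-relative iframe src; the "?rel=0" part is an assumed example value
protocol_relative = "//www.youtube.com/embed/tYBF-BaqVJ8?rel=0"
print(strip_query(absolute(PAGE_URL, protocol_relative)))
```

Because urljoin also fills in the scheme for protocol-relative links, it could stand in for the manual 'http:' + link concatenation in parse_page, if you prefer not to hard-code the scheme.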

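Votes: 1

Stack Overflow user

Posted 2017-05-06 03:28:43

Your code yields a dictionary whose key is the base URL and whose value is the relative path, which is why the URL comes out broken: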

{'http://www.pdga.com': u'/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'}

What you can do is yield the dictionary like this:

yield {
    'href_link': 'http://www.pdga.com' + brickset.css(HTML_SELECTOR).extract()[0]
}

This gives you a new dict whose value is the unbroken href:

{'href_link': u'http://www.pdga.com/videos/2017-glass-blown-open-fpo-rd-2-pt-2-pierce-fajkus-leatherman-c-allen-sexton-leatherman'}

Note: a spider callback must return a Request, BaseItem, dict, or None; see the documentation for the parse function.
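
The concatenation above still uses extract()[0], so it only covers the first match, while the question asks for the full list. A minimal sketch of yielding one dict per link is shown below. It assumes the hrefs have already been extracted into a plain list (in the spider that would be the result of brickset.css(HTML_SELECTOR).extract()); the function name items_from_hrefs and the sample paths are made up for illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7, as used in the question


def items_from_hrefs(base_url, hrefs):
    """Yield one dict per href, each holding an absolute URL (hypothetical helper)."""
    for href in hrefs:
        yield {"href_link": urljoin(base_url, href)}


# Hypothetical sample data standing in for brickset.css(HTML_SELECTOR).extract()
sample_hrefs = ["/videos/first-video", "/videos/second-video"]

for item in items_from_hrefs("http://www.pdga.com", sample_hrefs):
    print(item)
```

Inside the spider, the same loop would sit in parse() and yield each dict directly, which is one of the return types the note above lists.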

Votes: 1

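Original page content provided by Stack Overflow. Translation support for the Chinese version was provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link: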

https://stackoverflow.com/questions/43815997
