
Q: Downloading PDFs with Selenium in Python + saving each PDF under a specified name

Stack Overflow user

Asked 2022-01-11 09:55:08

1 answer, 1.3K views, 0 followers, score -2

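The goal is to 1) scrape a batch of PDFs for a group of companies and 2) save each one under the corresponding company name. Both the PDFs and the names come from https://www1.hkexnews.hk/app/appyearlyindex.html?lang=en&board=mainBoard.

My code already downloads the PDFs, and the following snippet, which is responsible for the automatic download, comes in handy: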

chrome_options = Options()
chrome_options.add_experimental_option('prefs', {
    "download.default_directory": "/Users/XXX/Downloads",  # change the default download directory
    "download.prompt_for_download": False,  # download automatically, without a prompt
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True  # do not display the PDF inside Chrome
})

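I still need to save each PDF under the name of its corresponding company rather than under an arbitrary PDF file name. The company names can be scraped with: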

all_names = driver.find_elements(By.XPATH, "//div[@class='applicant-name']")

But how can the full code below be modified to include a loop that saves each file under its company name (instead of a random file name)?

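from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

chrome_options = Options()
chrome_options.add_experimental_option('prefs', {
    "download.default_directory": "/Users/XXX/Downloads",  # change the default download directory
    "download.prompt_for_download": False,  # download automatically, without a prompt
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True  # do not display the PDF inside Chrome
})

year = input("Please enter the year for which you want to download the Application Proofs: ")
link = "https://www1.hkexnews.hk/app/appyearlyindex.html?lang=en&board=mainBoard&year=" + year
print("Now loading: ", link)
print("Found the following companies: ")

# Selenium 4 takes the chromedriver path through a Service object
driver = webdriver.Chrome(service=Service('/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/chromedriver'), options=chrome_options)
wait = WebDriverWait(driver, 10)
driver.get(link)

# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which was removed in Selenium 4.3
all_proofs = driver.find_elements(By.XPATH, "//tr[@class='record-ap-phip']//a[contains(.,'Full Version')]")
all_names = driver.find_elements(By.XPATH, "//div[@class='applicant-name']")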

for i in all_names:
    print('---> ', i.text)

print("\nTotal number of proofs in year ",year,": ",len(all_proofs))
Y = 0
N = 0
for proof in all_proofs:
    try:
        proof.click()
        wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='warning-statement-dialog']//label[@for='warning-statement-accept']"))).click()
        wait.until(EC.presence_of_element_located((By.XPATH, "//*[@id='warning-statement-dialog']//a[contains(@class,'btn-ok')]"))).click()
        Y += 1
    except Exception as exc:
        # Report the failure instead of silently discarding it
        print(f'An exception occurred: {exc}')
        N += 1

print("Number of application proofs downloaded: ", Y)
print("Number of exceptions: ", N)
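For readers who want to keep the Selenium approach, one way to pair each download with a company name is to watch the download directory: record its contents before the click, wait for a new finished file to appear, then rename it. The following is a minimal sketch, not a tested solution for this site. It assumes Chrome marks in-progress downloads with a `.crdownload` suffix and that only one download runs at a time; `wait_for_new_file` and `rename_download` are hypothetical helper names.

```python
import os
import time


def wait_for_new_file(download_dir, before, timeout=60.0, poll=0.5):
    """Return the path of a finished file that was not present in `before`.

    `before` is the set of file names in the directory before the click.
    Names ending in .crdownload or .tmp are treated as unfinished (assumption).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new_names = set(os.listdir(download_dir)) - before
        finished = [n for n in new_names
                    if not n.endswith(('.crdownload', '.tmp'))]
        if finished:
            return os.path.join(download_dir, sorted(finished)[0])
        time.sleep(poll)
    raise TimeoutError(f'no new download appeared in {download_dir}')


def rename_download(path, company_name):
    """Rename a finished download to '<company_name>.pdf' in the same folder."""
    target = os.path.join(os.path.dirname(path), company_name + '.pdf')
    os.replace(path, target)  # overwrites an existing file with that name
    return target
```

Inside the download loop this would be used roughly as `before = set(os.listdir(download_dir))` ahead of `proof.click()`, followed by `rename_download(wait_for_new_file(download_dir, before), name)` once the dialog has been accepted. Two caveats: the company name may contain characters such as `/` that are not valid in file names and would need replacing first, and matching `all_names` to `all_proofs` by position assumes both lists have the same length and order, which the page may not guarantee.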

Tags: selenium, pdf, web-scraping, python

1 answer

Stack Overflow user

Posted 2022-01-11 11:36:12

As already pointed out, the data is available as a JSON file (visible in Developer Tools under Network / XHR), which can easily be scraped without Selenium:

import re

import requests

for year in range(2015, 2023):
    # JSON endpoint found in Developer Tools - Network - Fetch/XHR
    data_url = f'https://www1.hkexnews.hk/ncms/json/eds/app_{year}_sehk_e.json?_=1641899494829'
    data = requests.get(data_url).json()

    for company in data['app']:
        # Company name with characters unsuitable for a file name replaced by '_'
        filename = re.sub(r'[^\w\-_ ]', '_', company['a']) + '.pdf'
        try:
            pdf_url = 'https://www1.hkexnews.hk/app/' + company['ls'][0]['u1']
        except (KeyError, IndexError, TypeError):
            # No document link for this entry; skip it
            continue

        pdf_data = requests.get(pdf_url)

        print(f'Saving {filename}')
        with open(filename, 'wb') as file:
            file.write(pdf_data.content)
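One thing to watch in the loop above: the file name is built from the company name alone, so if the same sanitized name turns up more than once (for instance in two different years), the later file would overwrite the earlier one. A small sketch of one way around that, using only the standard library; `safe_stem` and `unique_path` are hypothetical helpers, and the `_2`, `_3` suffix scheme is just one possible convention.

```python
import re
from pathlib import Path


def safe_stem(name):
    """Replace anything other than word characters, '-', '_' and spaces with '_'."""
    return re.sub(r'[^\w\-_ ]', '_', name).strip()


def unique_path(directory, stem, suffix='.pdf'):
    """Return directory/stem.pdf, or stem_2.pdf, stem_3.pdf, ... if that is taken."""
    directory = Path(directory)
    candidate = directory / f'{stem}{suffix}'
    counter = 2
    while candidate.exists():
        candidate = directory / f'{stem}_{counter}{suffix}'
        counter += 1
    return candidate
```

With these, the `open(filename, 'wb')` call could instead open `unique_path('.', safe_stem(company['a']))`. Including the year in the stem would be another way to keep files apart.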

Score: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/70664733
