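I think this should be simple, but I have been struggling to get it right. I am trying to extract the number of employees ("2,300,000") from this web page: https://fortune.com/company/walmart/
I used the Chrome extension SelectorGadget to locate the number: ".info__row--7f9lE:nth-child(13) .info__value--2AHH7"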
library(RSelenium)
library(rvest)
library(netstat)
rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE,port=free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
Employees<-remDr$findElement(using = 'xpath','//h3[@class="info__row--7f9lE:nth-child(13) .info__value--2AHH7"]')
Employees
An error says
> "Selenium message:no such element: Unable to locate element".
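I have also tried:
Employees<-remDr$findElement(using = 'class name','info__value--2AHH7')
But it does not return the data I want.
Can someone point out the problem? Really appreciate it!
Update: I modified the code suggested by Frodo in the comments below to apply it to multiple web pages and save the statistics in a data frame. But I still get an error.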
library(RSelenium)
library(rvest)
library(netstat)
rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=netstat::free_port())
remDr<-rs_driver_object$client
Data<-data.frame("url" = c("https://fortune.com/company/walmart/", "https://fortune.com/company/amazon-com/"
,"https://fortune.com/company/apple/"
,"https://fortune.com/company/cvs-health/"
,"https://fortune.com/company/jpmorgan-chase/"
,"https://fortune.com/company/verizon/"
,"https://fortune.com/company/ford-motor/"
, "https://fortune.com/company/general-motors/"
,"https://fortune.com/company/anthem/"
, "https://fortune.com/company/centene/"
,"https://fortune.com/company/fannie-mae/"
, "https://fortune.com/company/comcast/"
, "https://fortune.com/company/chevron/"
,"https://fortune.com/company/dell-technologies/"
,"https://fortune.com/company/bank-of-america-corp/"
,"https://fortune.com/company/target/") )
# Placeholder column; the scraped values are text such as "2,300,000"
Data$numEmp <- NA_character_
for (i in 1:length(Data$url))
{
remDr$navigate(url = Data$url[i])
pgSrc <- remDr$getPageSource()
pgCnt <- read_html(pgSrc[[1]])
Data$numEmp[i] <- pgCnt %>%
html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
html_text(trim = TRUE)
}
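Data$numEmp
Selenium message:unknown error: unexpected command response
(Session info: chrome=103.0.5060.114)
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'DESKTOP-VCCIL8P', ip: '192.168.1.249', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_311'
Driver info: driver.version: unknown
Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: org.openqa.selenium.WebDriverException
Further Details: run errorDetails method
Can anyone take another look?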
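Answered on 2022-07-13 08:58:06
Use RSelenium to load the web page and get the page source:
remDr$navigate(url = 'https://fortune.com/company/walmart/')
pgSrc <- remDr$getPageSource()
Use rvest to read the contents of the page:
pgCnt <- read_html(pgSrc[[1]])
Then use the rvest::html_nodes and rvest::html_text functions with the relevant XPath selector to extract the text. (A Chrome extension for finding XPath selectors should help here.)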
reqTxt <- pgCnt %>%
html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
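html_text(trim = TRUE)
Output of reqTxt:
> reqTxt
[1] "2,300,000"
Update
The error Selenium message:unknown error: unexpected command response appears to occur specifically with version 103 of ChromeDriver; more information is in the linked discussion. One of the answers there suggests a simple 5-second wait before and after the driver navigates to a URL. I also wrapped the navigation in tryCatch inside a while loop, so that a failed attempt does not stop the script. Essentially, the code keeps retrying until the page loads. This seems to work well.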
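A retry loop that runs until the page loads has no upper bound, so a URL that never loads would keep the script running forever. The following is a minimal sketch of a capped variant, not part of the original answer: retry_with_wait, flaky_navigate, max_tries, and wait are hypothetical names introduced here, and flaky_navigate only simulates a navigation call that fails twice before succeeding.

```r
# Hypothetical helper: call `action` until it succeeds, at most `max_tries` times,
# sleeping `wait` seconds after each failed attempt.
retry_with_wait <- function(action, max_tries = 5, wait = 5) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = action()),
      error = function(e) list(ok = FALSE, value = conditionMessage(e))
    )
    if (result$ok) return(result$value)
    if (attempt < max_tries) Sys.sleep(wait)
  }
  stop("giving up after ", max_tries, " attempts; last error: ", result$value)
}

# Stand-in for a navigation call (assumption: fails twice, then succeeds).
calls <- 0
flaky_navigate <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("unexpected command response")
  "loaded"
}

status <- retry_with_wait(flaky_navigate, max_tries = 5, wait = 0)
```

With a live session, the same helper could presumably wrap the real call, for example retry_with_wait(function() remDr$navigate(url = myURL)), so that a persistent failure raises an error instead of looping indefinitely.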
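# Function to fetch the employee count for one company page.
# Assumes remDr (the RSelenium client) and rvest are set up as above.
getEmployees <- function(myURL) {
  pageLoaded <- FALSE
  # Keep retrying until navigate() completes without an error
  while (!pageLoaded) {
    pageLoaded <- tryCatch(
      expr = {
        remDr$navigate(url = myURL)
        TRUE
      },
      error = function(e) FALSE
    )
  }
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  # Select the div that follows the "Employees" label
  pgCnt %>%
    html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
    html_text(trim = TRUE)
}
Apply this function to all of the URLs in Data.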
for(i in 1:nrow(Data)) {
Sys.sleep(5)
Data[i, 2] <- getEmployees(Data[i, 1])
Sys.sleep(5)
}
Now, looking at the output in the second column:
> Data[, 2]
[1] "2,300,000" "1,608,000" "154,000" "258,000" "271,025" "118,400"
[7] "183,000" "157,000" "98,200" "72,500" "7,400" "189,000"
[13] "42,595" "133,000" "208,248" "450,000"
Answered on 2022-07-12 07:15:50
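Does it have to be done with RSelenium alone? In my experience, the most flexible approach is to use RSelenium to navigate to the desired page (findElement helps you find boxes to type text into or buttons to click) and then use rvest to extract what you need from the page.
Start with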
rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=netstat::free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
page_source <- remDr$getPageSource()
pg <- xml2::read_html(page_source[[1]])
Then, how you handle this depends on how specific you want the solution to be with respect to this exact page. Here is one approach:
rvest::html_elements(pg, "div.info__row--7f9lE") |>
rvest::html_text2()
or
rvest::html_elements(pg, "div:nth-child(13) > div.info__value--2AHH7") |>
rvest::html_text2()
or
rvest::html_elements(pg, "div.info__row--7f9lE")[11] |>
rvest::html_children()
or
rvest::html_elements(pg, '.info__row--7f9lE:nth-child(13) .info__value--2AHH7') |>
rvest::html_text2()
and so on. What you do in the rvest part will depend on how general you want the selection/extraction process to be.
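To illustrate the trade-off between a page-specific and a more general selector, here is a small sketch on a made-up HTML fragment. The class names row-abc12, label-x, and value-y are placeholders, not the real page's classes, and the fragment only imitates a label/value layout. The label-based XPath keeps working if rows are reordered, while the positional CSS selector depends on the row order and on the generated class names. The last line shows one possible way to turn the extracted text into a number.

```r
library(rvest)

# Made-up fragment (assumption): two label/value rows with placeholder class names.
html <- '
<div>
  <div class="row-abc12"><div class="label-x">Country</div><div class="value-y">U.S.</div></div>
  <div class="row-abc12"><div class="label-x">Employees</div><div class="value-y">2,300,000</div></div>
</div>'

page <- read_html(html)

# General: find the div whose text is "Employees", then take the sibling div after it.
by_label <- page %>%
  html_elements(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text2()

# Specific: rely on the row being the second child and on the exact class names.
by_position <- page %>%
  html_elements("div.row-abc12:nth-child(2) > div.value-y") %>%
  html_text2()

# Strip the thousands separators before converting the text to a number.
employees <- as.numeric(gsub(",", "", by_label, fixed = TRUE))
```

Both selectors return the same text on this fragment; only the label-based one would be expected to survive a change in row order.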
https://stackoverflow.com/questions/72934437