Question: Web scraping with RSelenium findElement

Stack Overflow user

Asked 2022-07-11 06:21:23

2 answers, 112 views, 0 followers, 1 vote

I feel this should be simple, but I have been struggling to get it right. I am trying to extract the number of employees ("2,300,000") from this web page: https://fortune.com/company/walmart/

I used the Chrome extension SelectorGadget to locate the number: ".info__row--7f9lE:nth-child(13) .info__value--2AHH7"

library(RSelenium)
library(rvest)
library(netstat)

rs_driver_object <- rsDriver(browser = 'chrome', chromever = '103.0.5060.53', verbose = FALSE, port = free_port())
remDr <- rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
Employees <- remDr$findElement(using = 'xpath', '//h3[@class="info__row--7f9lE:nth-child(13) .info__value--2AHH7"]')
Employees

An error says 

> "Selenium message:no such element: Unable to locate element".

I have also tried:

Employees <- remDr$findElement(using = 'class name', 'info__value--2AHH7')

But it does not return the data I want.


Can someone point out the problem? Really appreciate it!

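Update: I modified the code as Frodo suggested in the comments below so that it applies to multiple web pages and saves the statistics in a data frame. But I still run into an error.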

    library(RSelenium)
    library(rvest)
    library(netstat)
    
rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=netstat::free_port())
remDr<-rs_driver_object$client


Data<-data.frame("url" = c("https://fortune.com/company/walmart/", "https://fortune.com/company/amazon-com/"              
                           ,"https://fortune.com/company/apple/"                   
                           ,"https://fortune.com/company/cvs-health/" 
                           ,"https://fortune.com/company/jpmorgan-chase/"          
                           ,"https://fortune.com/company/verizon/"                 
                           ,"https://fortune.com/company/ford-motor/"              
                           , "https://fortune.com/company/general-motors/"          
                           ,"https://fortune.com/company/anthem/"                  
                           , "https://fortune.com/company/centene/"                 
                           ,"https://fortune.com/company/fannie-mae/"              
                           , "https://fortune.com/company/comcast/"                 
                           , "https://fortune.com/company/chevron/"                 
                           ,"https://fortune.com/company/dell-technologies/"       
                           ,"https://fortune.com/company/bank-of-america-corp/"    
                           ,"https://fortune.com/company/target/") )

Data$numEmp<-"NA"
Data$numEmp <- numeric()



for (i in 1:length(Data$url))
  {
  
remDr$navigate(url = Data$url[i])
pgSrc <- remDr$getPageSource()
pgCnt <- read_html(pgSrc[[1]])
Data$numEmp[i] <- pgCnt %>%
  html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text(trim = TRUE)

}
Data$numEmp

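Selenium message:unknown error: unexpected command response
  (Session info: chrome=103.0.5060.114)
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'DESKTOP-VCCIL8P', ip: '192.168.1.249', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_311'
Driver info: driver.version: unknown

Error: Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
class: org.openqa.selenium.WebDriverException
Further Details: run errorDetails method

Can anyone take another look?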

web-scraping

rselenium

findelement

2 Answers

Stack Overflow user

Accepted answer

Answered 2022-07-13 08:58:06

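Use RSelenium to load the web page and get the page source:

remDr$navigate(url = 'https://fortune.com/company/walmart/')
pgSrc <- remDr$getPageSource()

Use rvest to read the contents of the web page:

pgCnt <- read_html(pgSrc[[1]])

Then use the rvest::html_nodes and rvest::html_text functions with a suitable XPath selector to extract the text. (This Chrome extension should help.)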

reqTxt <- pgCnt %>%
  html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text(trim = TRUE)

Output of reqTxt:

> reqTxt
[1] "2,300,000"
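To see what the XPath is doing without starting a browser, here is a minimal offline sketch. The HTML string is a hypothetical stand-in loosely modelled on a label/value row; the real page's markup and class names may differ.

```r
library(rvest)

# Hypothetical markup for illustration only, not copied from the real page.
html <- '
<div class="row"><div>Sector</div><div>Retailing</div></div>
<div class="row"><div>Employees</div><div> 2,300,000 </div></div>
'

pg <- read_html(html)

# Find the <div> whose text is exactly "Employees", then select the
# sibling <div> elements that come after it.
employees <- pg %>%
  html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text(trim = TRUE)

employees
```

Because the selector keys on the visible label rather than on a generated class name, it should keep working as long as the label text and the label/value layout stay the same.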

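Update

The error Selenium message:unknown error: unexpected command response seems to occur specifically with version 103 of ChromeDriver. More information here. One of the answers there suggests a simple 5-second wait before and after the driver navigates to the URL. I also wrapped the navigation in tryCatch inside a while loop so the code keeps going after a failure. Essentially, the code retries until the page loads. This seems to work well.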

# Function to fetch the employee count for one URL.
# Assumes remDr is an open RSelenium client and rvest is loaded.
getEmployees <- function(myURL) {
  pagestatus <- 0
  # Keep retrying until navigate() completes without an error
  while (pagestatus == 0) {
    tryCatch(
      expr = {
        remDr$navigate(url = myURL)
        pagestatus <- 1
      },
      error = function(e) {
        # Navigation failed; leave the flag at 0 so the loop tries again
        pagestatus <<- 0
      }
    )
  }
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  pgCnt %>%
    html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
    html_text(trim = TRUE)
}
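One caveat with an open-ended while loop is that it never exits if a page keeps failing. A bounded retry is one possible variation, sketched here in base R. The names retry_call, max_tries and wait are hypothetical, and flaky() is only a stand-in for remDr$navigate().

```r
# Retry `f` up to `max_tries` times, pausing `wait` seconds between attempts.
# retry_call, max_tries and wait are hypothetical names for this sketch.
retry_call <- function(f, max_tries = 5, wait = 5) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = f()),
      error = function(e) list(ok = FALSE, value = conditionMessage(e))
    )
    if (result$ok) return(result$value)
    if (attempt < max_tries) Sys.sleep(wait)
  }
  stop("all ", max_tries, " attempts failed; last error: ", result$value)
}

# Stand-in for remDr$navigate(): fails twice, then succeeds.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("unexpected command response")
  "loaded"
}

status <- retry_call(flaky, max_tries = 5, wait = 0)
```

In the scraping function, the call would look something like retry_call(function() remDr$navigate(url = myURL)), so a URL that never loads raises an error instead of looping forever.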

Apply this function to every URL in Data.

for(i in 1:nrow(Data)) {
  Sys.sleep(5)
  Data[i, 2] <- getEmployees(Data[i, 1])
  Sys.sleep(5)
}

Now, looking at the output in the second column:

> Data[, 2]
 [1] "2,300,000" "1,608,000" "154,000"   "258,000"   "271,025"   "118,400"  
 [7] "183,000"   "157,000"   "98,200"    "72,500"    "7,400"     "189,000"  
[13] "42,595"    "133,000"   "208,248"   "450,000"
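The values come back as character strings with thousands separators. Since the question wanted a numeric column, one way to convert them is sketched below in base R; num_emp_text and num_emp are names made up for this example, and only the first four values from the output above are used.

```r
# First four values copied from the output above.
num_emp_text <- c("2,300,000", "1,608,000", "154,000", "258,000")

# Strip the thousands separators, then convert to numeric.
num_emp <- as.numeric(gsub(",", "", num_emp_text, fixed = TRUE))

num_emp
```

The same expression could be applied to the whole column, for example Data$numEmp <- as.numeric(gsub(",", "", Data$numEmp, fixed = TRUE)).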

Votes: 3

Stack Overflow user

Answered 2022-07-12 07:15:50

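Does it have to be done with RSelenium alone? In my experience, the most flexible approach is to use RSelenium to navigate to the page you need (findElement helps you find a box to type text into or a button to click) and then use rvest to extract what you need from the page.

Start with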

rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=netstat::free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
page_source <- remDr$getPageSource()
pg <- xml2::read_html(page_source[[1]])

Then, how you proceed depends on how specific to this exact page you want the solution to be. Here is one way:

rvest::html_elements(pg, "div.info__row--7f9lE") |> 
  rvest::html_text2()

or

rvest::html_elements(pg, "div:nth-child(13) > div.info__value--2AHH7") |> 
  rvest::html_text2()

or

rvest::html_elements(pg, "div.info__row--7f9lE")[11] |> 
  rvest::html_children()

or

rvest::html_elements(pg, '.info__row--7f9lE:nth-child(13) .info__value--2AHH7') |> 
  rvest::html_text2()

and so on. What you do in the rvest part will depend on how general you want the selection/extraction process to be.
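The trade-off between these selectors can be tried offline. The sketch below uses a hypothetical HTML snippet with made-up class names (info, row, label, value) rather than the real page's generated ones. It compares a positional CSS selector with indexing into the matched rows.

```r
library(rvest)

# Hypothetical markup; class names are invented for this illustration.
html <- paste0(
  '<div class="info">',
  '<div class="row"><div class="label">Sector</div><div class="value">Retailing</div></div>',
  '<div class="row"><div class="label">Employees</div><div class="value">2,300,000</div></div>',
  '</div>'
)
pg <- read_html(html)

# Positional CSS selector: the value inside the second row.
by_position <- html_elements(pg, ".row:nth-child(2) .value") |>
  html_text2()

# Index into all matched rows, then inspect that row's children.
row_children <- html_elements(pg, ".row")[2] |>
  html_children() |>
  html_text2()
```

Both approaches depend on the row's position, so they would need updating if the page adds or reorders rows; a selector keyed on the label text is less sensitive to that.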

Votes: 3

The original content of this page is provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/72934437
