首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >在lappy函数中查找不需要的删除原因

在lappy函数中查找不需要的删除原因
EN

Stack Overflow用户
提问于 2019-12-16 11:13:28
回答 1查看 81关注 0票数 0

我将一个.txt文件上传到R中,如下所示:Election_Parties <- readr::read_lines("Election_Parties.txt")文件中有以下文本:巴斯丁链

文本大致如下(请使用实际文件进行解决!):

代码语言:javascript
复制
BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento 
Nacionalista Revolucionario [MNR])
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])

COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas 
de Colombia)

我想把关于聚会的所有信息都放在一条线上,不管它有多长。

期望产出:

代码语言:javascript
复制
BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento Nacionalista Revolucionario 
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])

COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colombia)

我有一个解决方案,它几乎完全完成了@JBGruber的技巧,可以找到这里

代码语言:javascript
复制
lines <- readr::read_lines("https://pastebin.com/raw/jSrvTa7G")
head(lines)
entries <- split(lines, cumsum(grepl("^$|^ $", lines)))

library(stringr)
library(dplyr)
df <- lapply(entries, function(entry) {
  entry <- entry[!grepl("^$|^ $", entry)] # remove empty elements
  header <- entry[1] # first non empty is the header
  entry <- tail(entry, -1)  # remove header from entry
  desc <- str_extract(entry, "^P\\d+-")  # extract description

  for (l in which(is.na(desc))) { # collapse lines that go over 2 elements
    entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
  }

  entry <- entry[!is.na(desc)]
  desc <- desc[!is.na(desc)]

  # turn into nice format
  df <- tibble::tibble(
    header,
    desc,
    entry
  )
  df$entry <- str_replace_all(df$entry, fixed(df$desc), "") # remove description from entry
  return(df)
}) %>% 
  bind_rows() # turn list into one data.frame

但它不知怎么地删除了信息。例如,这一信息:

代码语言:javascript
复制
P1-Movement for a Prosperous Czechoslovakia (Hnutie za prosperujúce Česko + Slovensko
[HZPČS])
P2-Social Democracy (Sociálna demokracia [SD])
P3-Association for Workers in Slovakia (Združenie robotníkov Slovenska [ZRS])

我不太了解代码,看不出删除可能发生在哪里,也不知道如何一步一步地检查它发生在哪里(就像在lapply中发生的一切一样)。有人能帮忙吗?

请注意,使用data.table的解决方案也同样受欢迎。

编辑:

EN

回答 1

Stack Overflow用户

发布于 2019-12-16 12:21:18

@JBGruber回答的一个纯基数R选项:

代码语言:javascript
复制
txt <- readLines("https://pastebin.com/raw/KKu9FmF6")

txtgrps <- split(txt, cumsum(grepl("P00-$", txt)))

l <- lapply(txtgrps, function(grp) {
  grp <- tail(grp, -1)
  country <- gsub("^P\\d+-", "", grp[1])
  grp <- tail(grp, -1)
  grp <- tapply(grp, cumsum(grepl("^P\\d+-", grp)), paste, collapse = " ")
  code <- sub("(P\\d+)-.*", "\\1", grp)
  party <- gsub("^P\\d+-", "", grp)
  df <- data.frame(country, code, party)
  return(df)
})

df <- do.call(rbind, l)

这意味着:

代码语言:javascript
复制
> head(df)
    country code                                                                                                                                                   party
1.1 ALBANIA   P1                                                                                             Democratic Alliance Party (Partia Aleanca Democratike [AD])
1.2 ALBANIA   P2                                                                                                    National Unity Party (Partia Uniteti Kombëtar [PUK])
1.3 ALBANIA   P3                                       Social Spectrum Parties-Party of National Unity (Partitë e Spektrit Social-Partia e Unitetit Kombëtar [PSHS-PUK])
1.4 ALBANIA   P4                                                          Alliance Party for Solidarity and Welfare (Partia Aleanca për Mirëqenie dhe Solidaritet [AMS])
1.5 ALBANIA   P5 Albanian Democratic Union-Alliance for Freedom, Justice and Welfare (Partia Bashkimi Demokrat Shqiptar-Aleanca për Liri, Drejtësi dhe Mirëqenie [BDSH])
1.6 ALBANIA   P6                                                                                         Liberal Democrat Party (Partia Bashkimi Liberal Demokrat [BLD])

对于新输入,可以将解决方案调整为:

代码语言:javascript
复制
txt <- readLines("https://pastebin.com/raw/FTV3Gded")

txtgrps <- split(txt, cumsum(grepl("^$|^ $", txt)))
# based on: https://stackoverflow.com/a/59006739/2204410

l <- lapply(txtgrps, function(grp) {
  grp <- tail(grp, -1)
  country <- grp[1]
  grp <- tail(grp, -1)
  grp <- tapply(grp, cumsum(grepl("^P\\d+", grp)), paste, collapse = " ")
  code <- sub("(P\\d+).*", "\\1", grp)
  party <- substring(sub("^P\\d+", "", grp), 2)
  df <- data.frame(country, code, party)
  return(df)
})

df <- do.call(rbind, l)
票数 1
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/59355505

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档