我将一个.txt文件上传到R中,如下所示:Election_Parties <- readr::read_lines("Election_Parties.txt")文件中有以下文本:巴斯丁链。
文本大致如下(请使用实际文件进行解决!):
BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento
Nacionalista Revolucionario [MNR])
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])
COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas
de Colombia)我想把关于聚会的所有信息都放在一条线上,不管它有多长。
期望产出:
BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento Nacionalista Revolucionario
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])
COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colombia)我有一个解决方案,它几乎完全完成了@JBGruber的技巧,可以找到这里
lines <- readr::read_lines("https://pastebin.com/raw/jSrvTa7G")
head(lines)
entries <- split(lines, cumsum(grepl("^$|^ $", lines)))
library(stringr)
library(dplyr)
df <- lapply(entries, function(entry) {
entry <- entry[!grepl("^$|^ $", entry)] # remove empty elements
header <- entry[1] # first non empty is the header
entry <- tail(entry, -1) # remove header from entry
desc <- str_extract(entry, "^P\\d+-") # extract description
for (l in which(is.na(desc))) { # collapse lines that go over 2 elements
entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
}
entry <- entry[!is.na(desc)]
desc <- desc[!is.na(desc)]
# turn into nice format
df <- tibble::tibble(
header,
desc,
entry
)
df$entry <- str_replace_all(df$entry, fixed(df$desc), "") # remove description from entry
return(df)
}) %>%
bind_rows() # turn list into one data.frame但它不知怎么地删除了信息。例如,这一信息:
P1-Movement for a Prosperous Czechoslovakia (Hnutie za prosperujúce Česko + Slovensko
[HZPČS])
P2-Social Democracy (Sociálna demokracia [SD])
P3-Association for Workers in Slovakia (Združenie robotníkov Slovenska [ZRS])我不太了解代码,看不出删除可能发生在哪里,也不知道如何一步一步地检查它发生在哪里(就像在lapply中发生的一切一样)。有人能帮忙吗?
请注意,使用data.table的解决方案也同样受欢迎。
编辑:

发布于 2019-12-16 12:21:18
@JBGruber回答的一个纯基数R选项:
txt <- readLines("https://pastebin.com/raw/KKu9FmF6")
txtgrps <- split(txt, cumsum(grepl("P00-$", txt)))
l <- lapply(txtgrps, function(grp) {
grp <- tail(grp, -1)
country <- gsub("^P\\d+-", "", grp[1])
grp <- tail(grp, -1)
grp <- tapply(grp, cumsum(grepl("^P\\d+-", grp)), paste, collapse = " ")
code <- sub("(P\\d+)-.*", "\\1", grp)
party <- gsub("^P\\d+-", "", grp)
df <- data.frame(country, code, party)
return(df)
})
df <- do.call(rbind, l)这意味着:
> head(df)
country code party
1.1 ALBANIA P1 Democratic Alliance Party (Partia Aleanca Democratike [AD])
1.2 ALBANIA P2 National Unity Party (Partia Uniteti Kombëtar [PUK])
1.3 ALBANIA P3 Social Spectrum Parties-Party of National Unity (Partitë e Spektrit Social-Partia e Unitetit Kombëtar [PSHS-PUK])
1.4 ALBANIA P4 Alliance Party for Solidarity and Welfare (Partia Aleanca për Mirëqenie dhe Solidaritet [AMS])
1.5 ALBANIA P5 Albanian Democratic Union-Alliance for Freedom, Justice and Welfare (Partia Bashkimi Demokrat Shqiptar-Aleanca për Liri, Drejtësi dhe Mirëqenie [BDSH])
1.6 ALBANIA P6 Liberal Democrat Party (Partia Bashkimi Liberal Demokrat [BLD])对于新输入,可以将解决方案调整为:
txt <- readLines("https://pastebin.com/raw/FTV3Gded")
txtgrps <- split(txt, cumsum(grepl("^$|^ $", txt)))
# based on: https://stackoverflow.com/a/59006739/2204410
l <- lapply(txtgrps, function(grp) {
grp <- tail(grp, -1)
country <- grp[1]
grp <- tail(grp, -1)
grp <- tapply(grp, cumsum(grepl("^P\\d+", grp)), paste, collapse = " ")
code <- sub("(P\\d+).*", "\\1", grp)
party <- substring(sub("^P\\d+", "", grp), 2)
df <- data.frame(country, code, party)
return(df)
})
df <- do.call(rbind, l)https://stackoverflow.com/questions/59355505
复制相似问题