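I read a .txt file into R as follows:
Election_Parties <- readr::read_lines("Election_Parties.txt")
The file contains the following text: pastebin link.
The text looks roughly like this (please use the actual file when solving!):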
BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento
Nacionalista Revolucionario [MNR])
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])
COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
P3-Indigenous Authorities of Colombia (Autoridades Indígenas
de Colombia)
I would like all the information about a party to be on a single line, no matter how long it is.
Desired output:
BOLIVIA
P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento Nacionalista Revolucionario [MNR])
P19-Liberty and Justice (Libertad y Justicia [LJ])
P20-Tupak Katari Revolutionary Movement (Movimiento Revolucionario Tupak Katari [MRTK])
COLOMBIA
P1-Democratic Aliance M-19 (Alianza Democratica M-19 [AD-M19])
P2-National Popular Alliance (Alianza Nacional Popular [ANAPO])
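P3-Indigenous Authorities of Colombia (Autoridades Indígenas de Colombia)
The following answer from a linked post
strsplit(paste(Election_Parties, collapse=" "), "\\s+(?=P\\d+-)", perl=TRUE)[[1]]
fixes the wrapped strings, but it does not handle the headers (BOLIVIA, COLOMBIA) and the empty lines correctly. Handling them matters because I want to apply another linked solution afterwards.
Although the comments on that post gave me an answer that works for this example, it does not work on my actual text file.
How can I adapt the solution so that it leaves the headers and empty lines alone?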
Posted on 2019-11-23 18:39:47
I turned the whole thing into a tidy, usable format. Take a look:
First, I read in the file:
lines <- readr::read_lines("https://pastebin.com/raw/jSrvTa7G")
head(lines)
#> [1] ""
#> [2] "ALBANIA"
#> [3] "P1-Democratic Alliance Party (Partia Aleanca Democratike [AD])"
#> [4] "P2-National Unity Party (Partia Uniteti Kombëtar [PUK])"
#> [5] "P3-Social Spectrum Parties-Party of National Unity (Partitë e Spektrit Social-Partia e Unitetit Kombëtar"
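#> [6] "[PSHS-PUK])"
I split the raw lines into separate entries by looking for empty lines, which happen to occur right before each new entry:
entries <- split(lines, cumsum(grepl("^$|^ $", lines)))
Then I loop over each entry and convert it into a tibble: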
library(stringr)
library(dplyr)
df <- lapply(entries, function(entry) {
  entry <- entry[!grepl("^$|^ $", entry)] # remove empty elements
  header <- entry[1] # first non-empty element is the header
  entry <- tail(entry, -1) # remove header from entry
  desc <- str_extract(entry, "^P\\d+-") # extract the "P<number>-" prefix (NA if absent)
  for (l in which(is.na(desc))) { # append each continuation line to the line before it
    entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
  }
  entry <- entry[!is.na(desc)] # drop the continuation lines, now merged
  desc <- desc[!is.na(desc)]
  # turn into nice format
  df <- tibble::tibble(
    header,
    desc,
    entry
  )
  df$entry <- str_replace_all(df$entry, fixed(df$desc), "") # remove prefix from entry
  return(df)
}) %>%
  bind_rows() # turn list into one data.frame
Now we have a tidy data.frame that is easy to work with:
df
#> # A tibble: 5,525 x 3
#> header desc entry
#> <chr> <chr> <chr>
#> 1 ALBANIA P1- Democratic Alliance Party (Partia Aleanca Democratike [AD~
#> 2 ALBANIA P2- National Unity Party (Partia Uniteti Kombëtar [PUK])
#> 3 ALBANIA P3- Social Spectrum Parties-Party of National Unity (Partitë ~
#> 4 ALBANIA P4- Alliance Party for Solidarity and Welfare (Partia Aleanca~
#> 5 ALBANIA P5- Albanian Democratic Union-Alliance for Freedom, Justice a~
#> 6 ALBANIA P6- Liberal Democrat Party (Partia Bashkimi Liberal Demokrat ~
#> 7 ALBANIA P7- Linking Blerta Albanian Party (Partia Lidhja e Blertë Shq~
#> 8 ALBANIA P8- Democratic Movement for Integration (Lëvizja Demokratike ~
#> 9 ALBANIA P9- Movement of Human Rights and Freedoms Party (Partia Lëviz~
#> 10 ALBANIA P10- Socialist Party of Albania (Partia Socialiste e Shqipëris~
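#> # ... with 5,515 more rows
Strings that are spread over multiple lines are corrected in this bit:
for (l in which(is.na(desc))) { # append each continuation line to the line before it
  entry[l - 1] <- paste(entry[l - 1], entry[l], sep = " ")
}
If a line does not start with "P1-" (where 1 can be any number), desc is NA for that line. In that case, the line is pasted onto the previous one. Afterwards the NA elements are dropped, so the information is left only on the correct lines.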
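One caveat, as far as can be told from reading the loop above: it pastes each continuation line onto the line directly before it, so a party name wrapped over three or more lines would lose its later pieces once the continuation lines are dropped. A possible variation, sketched below in base R, groups every line with the most recent "starting" line using cumsum() and keeps headers as their own lines, which is the output format the question asks for. The function name collapse_wrapped and the sample vector are hypothetical; the three-line wrap of the ANAPO entry is made up for illustration, and the sketch assumes (as in the answer above) that every header is the first non-empty line after an empty line.

```r
# Hypothetical sample modelled on the excerpt in the question (not the real file).
# The COLOMBIA party is deliberately wrapped over three lines for illustration.
sample_lines <- c(
  "",
  "BOLIVIA",
  "P1-Nationalist Revolutionary Movement-Free Bolivia Movement (Movimiento",
  "Nacionalista Revolucionario [MNR])",
  "P19-Liberty and Justice (Libertad y Justicia [LJ])",
  "",
  "COLOMBIA",
  "P2-National Popular Alliance (Alianza",
  "Nacional",
  "Popular [ANAPO])"
)

# Hypothetical helper: join wrapped lines, keep headers, drop empty lines.
collapse_wrapped <- function(lines) {
  blank <- grepl("^[[:space:]]*$", lines)
  # Assumption: a header is the first non-empty line after an empty line
  # (or the very first line of the file).
  prev_blank <- c(TRUE, head(blank, -1))
  is_header <- !blank & prev_blank
  # A party entry starts with "P<number>-".
  is_party <- grepl("^P[0-9]+-", lines)
  # Every header or party line opens a new group; continuation lines
  # inherit the group of the line that precedes them.
  group <- cumsum(is_header | is_party)
  keep <- !blank
  pieces <- split(lines[keep], group[keep])
  unname(vapply(pieces, paste, character(1), collapse = " "))
}

fixed_lines <- collapse_wrapped(sample_lines)
print(fixed_lines)
```

Because the grouping does not depend on how many continuation lines follow a party line, the same call should work on the full file, for example collapse_wrapped(readr::read_lines("Election_Parties.txt")), provided the header assumption holds there.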
https://stackoverflow.com/questions/59006401