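I have some interview transcripts that are partly irregularly formatted:
tst <- c("In: ja COOL; #00:04:24-6# ",
" in den vier, FÜNF wochen, #00:04:57-8# ",
"In: jah, #00:02:07-8# ",
"In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
" also jA:h; #00:03:16-6# (1.1)",
"Bz: [E::hm; ] #00:03:51-4# (3.0) ",
"Bz: [mhmh, ]",
" in den bilLIE da war;")
What I need to do is structure this data by extracting its key elements into the columns of a dataframe. There are four such key elements:
Role in the interview: interviewee or interviewer
Utterance: the interview partner's speech
Timestamp: enclosed in # at both ends
Gap: a decimal number in parentheses
The problem is that both Timestamp and Gap occur inconsistently. While I can make the last capture group, for Gap, optional, the strings that have neither a Timestamp nor a Gap are not rendered correctly.
I am using extract from tidyr for the extraction: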
library(tidyr)
data.frame(tst) %>%
extract(col = tst,
into = c("Role", "Utterance", "Timestamp", "Gap"),
regex = "^(\\w{2}:\\s|\\s+)([\\S\\s]+?)\\s*#([^#]+)?#\\s*(\\([0-9.]+\\))?\\s*")
  Role Utterance                 Timestamp  Gap
1 In:  ja COOL;                  00:04:24-6
2      in den vier, FÜNF wochen, 00:04:57-8
3 In:  jah,                      00:02:07-8
4 In:  [ja; ]                    00:03:25-5
5      also jA:h;                00:03:16-6 (1.1)
6 Bz:  [E::hm; ]                 00:03:51-4 (3.0)
7 <NA> <NA>                      <NA>       <NA>
8 <NA> <NA>                      <NA>       <NA>
How can I refine the regex so that I get the desired output:
  Role Utterance                 Timestamp  Gap
1 In:  ja COOL;                  00:04:24-6
2      in den vier, FÜNF wochen, 00:04:57-8
3 In:  jah,                      00:02:07-8
4 In:  [ja; ]                    00:03:25-5
5      also jA:h;                00:03:16-6 (1.1)
6 Bz:  [E::hm; ]                 00:03:51-4 (3.0)
7 Bz:  [mhmh, ]
8      in den bilLIE da war;
Posted on 2021-11-27 13:41:08
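You can update your pattern to keep your 4 capture groups and make the last part optional, by optionally matching groups 3 and 4 and asserting the end of the string: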
library(tidyr)
tst <- c("In: ja COOL; #00:04:24-6# ",
" in den vier, FÜNF wochen, #00:04:57-8# ",
"In: jah, #00:02:07-8# ",
"In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
" also jA:h; #00:03:16-6# (1.1)",
"Bz: [E::hm; ] #00:03:51-4# (3.0) ",
"Bz: [mhmh, ]",
" in den bilLIE da war;")
data.frame(tst) %>%
extract(col = tst,
into = c("Role", "Utterance", "Timestamp", "Gap"),
regex = "^(\\w{2}:\\s|\\s+)([\\s\\S]*?)(?:\\s*#([^#]+)(?:#\\s*(\\([0-9.]+\\))?\\s*)?)?$")
Output
  Role Utterance                  Timestamp  Gap
1 In:  ja COOL;                   00:04:24-6
2      in den vier, FÜNF wochen,  00:04:57-8
3 In:  jah,                       00:02:07-8
4 In:  [ja; ] #00:03:25-5# [ja; ] 00:03:26-1
5      also jA:h;                 00:03:16-6 (1.1)
6 Bz:  [E::hm; ]                  00:03:51-4 (3.0)
7 Bz:  [mhmh, ]
8      in den bilLIE da war;
Posted on 2021-11-27 15:03:27
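An alternative to one complex regular expression is to use several extract calls, each with a simpler regular expression. Afterwards, convert any NA to "" and trim the unwanted whitespace.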
library(dplyr)
library(tidyr)
data.frame(tst) %>%
extract(tst, "Gap", "(\\(.*?\\))", remove = FALSE) %>%
extract(tst, "Timestamp", "(#.*?#)", remove = FALSE) %>%
extract(tst, c("Role", "Utterance"), "^(\\S+:|)([^#]*)") %>%
mutate(across(everything(), ~ coalesce(.x, "")), Utterance = trimws(Utterance))
giving:
  Role Utterance                 Timestamp    Gap
1 In:  ja COOL;                  #00:04:24-6#
2      in den vier, FÜNF wochen, #00:04:57-8#
3 In:  jah,                      #00:02:07-8#
4 In:  [ja; ]                    #00:03:25-5#
5      also jA:h;                #00:03:16-6# (1.1)
6 Bz:  [E::hm; ]                 #00:03:51-4# (3.0)
7 Bz:  [mhmh, ]
8      in den bilLIE da war;
https://stackoverflow.com/questions/70134684
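A further sketch, not taken from either answer: row 4 contains two timestamps. The single-pattern answer is anchored at the end of the string, so it folds the first timestamp into Utterance, while the stepwise answer keeps the # delimiters around the timestamp. Assuming the first timestamp is the one wanted, and that anything after it other than a directly following gap can be dropped, a pattern along the following lines may come closer to the desired output. It relies on stringr::str_match, which is an assumption here since the thread itself only loads tidyr and dplyr; the pattern and the names m and res are illustrative only.

```r
library(stringr)

tst <- c("In: ja COOL; #00:04:24-6# ",
         " in den vier, FÜNF wochen, #00:04:57-8# ",
         "In: jah, #00:02:07-8# ",
         "In: [ja; ] #00:03:25-5# [ja; ] #00:03:26-1#",
         " also jA:h; #00:03:16-6# (1.1)",
         "Bz: [E::hm; ] #00:03:51-4# (3.0) ",
         "Bz: [mhmh, ]",
         " in den bilLIE da war;")

# Hypothetical pattern (not from the thread):
#   group 1: role label such as "In: ", or leading whitespace
#   group 2: utterance, matched lazily and never crossing a '#'
#   group 3: first timestamp between '#' delimiters (optional)
#   group 4: gap in parentheses right after that timestamp (optional)
#   the trailing '.*' discards whatever follows, e.g. a second timestamp
pattern <- "^(\\w{2}:\\s|\\s+)([^#]*?)\\s*(?:#([^#]+)#\\s*(\\([0-9.]+\\))?.*)?$"

# Column 1 of the result is the full match, columns 2 to 5 are the groups
m <- str_match(tst, pattern)

res <- data.frame(
  Role      = trimws(m[, 2]),
  Utterance = m[, 3],
  Timestamp = m[, 4],
  Gap       = m[, 5],
  stringsAsFactors = FALSE
)
res[is.na(res)] <- ""  # optional groups that did not take part come back as NA
res
```

Excluding '#' from the utterance group is what stops it from running past the first timestamp; the trade-off is that an utterance which itself contains a '#' would be cut short, so this only fits transcripts where '#' is reserved for timestamps.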