I have a list of lowercase keywords, for example:
keywords = ['machine learning', 'data science', 'artificial intelligence']
and a list of lowercase texts, for example:
texts = [
'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]
I need to transform these texts into:
[[['the', 'new',
'machine_learning',
'model',
'built',
'by',
'google',
'is',
'revolutionary',
'for',
'the',
'current',
'state',
'of',
'artificial_intelligence'],
['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']],
[['data_science',
'and',
'artificial_intelligence',
'are',
'two',
'different',
'fields',
'although',
'they',
'are',
'interconnected'],
['scientists',
'from',
'harvard',
'are',
'explaining',
'it',
'in',
'a',
'detailed',
'presentation',
'that',
'could',
'be',
'found',
'on',
'our',
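'page']]]
What I do now is check whether each keyword occurs in a text and replace it with its underscore-joined form. But this has m*n complexity, and it is really slow when you have 700 long texts and 2M keywords, as in my case.
I tried to use gensim's Phrases, but I couldn't manage to build one using only my keywords.
Could someone suggest a more optimized way of doing this?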
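Answered 2019-11-14 19:27:32
The Phrases/Phraser classes of gensim are designed to use their internal, statistically derived records of which word pairs should be promoted to phrases, not user-supplied pairings. (You could perhaps poke and prod a Phraser into doing what you want by synthesizing scores/thresholds, but that would be somewhat awkward.)
You could instead mimic their general approach: (1) operate on lists of tokens rather than raw strings; (2) learn and remember the token pairs that should be combined; and (3) perform the combination in a single pass. That should work far more efficiently than anything based on repeated search-and-replace over a string, which it sounds like you have already tried and found wanting.
For example, let's first create a dictionary whose keys are tuples of the word pairs that should be combined, and whose values are tuples containing the designated combined token followed by an empty tuple. (The reason for the empty tuple will become clear later.)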
keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = [
'the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking',
'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.'
]
combinations_dict = {tuple(kwsplit): ('_'.join(kwsplit), ())
                     for kwsplit in [kwstr.split() for kwstr in keywords]}
combinations_dict
After this step, combinations_dict is:
{('machine', 'learning'): ('machine_learning', ()),
('data', 'science'): ('data_science', ()),
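 ('artificial', 'intelligence'): ('artificial_intelligence', ())}
Now, we can use a Python generator function to create an iterable transformation of any other sequence of tokens. It takes the original tokens one by one, but before emitting anything, it adds the latest token to a buffered candidate pair. If that pair is one that should be combined, a single combined token is yielded. If not, only the first token is emitted, leaving the second to be paired with the next token in a new candidate pair.
For example:
def combining_generator(tokens, comb_dict):
    buff = ()  # start with empty buffer
    for in_tok in tokens:
        buff += (in_tok,)  # add latest to buffer
        if len(buff) < 2:  # grow buffer to 2 tokens if possible
            continue
        # lookup what to do for current pair...
        # ...defaulting to emit-[0]-item, keep-[1]-item in new buff
        out_tok, buff = comb_dict.get(buff, (buff[0], (buff[1],)))
        yield out_tok
    if buff:
        yield buff[0]  # last solo token if any
Here we see the reason for the earlier () empty tuples: that is the preferred state of buff after a successful replacement. Driving both the result and the next state this way lets us use the dict.get(key, default) form, which supplies a specific value to use if the key isn't found.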
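To make the buffering concrete, here is a small self-contained trace. It repeats the generator above so the snippet runs on its own, and feeds it a short token list that is made up purely for illustration (it does not come from the question).

```python
def combining_generator(tokens, comb_dict):
    buff = ()
    for in_tok in tokens:
        buff += (in_tok,)
        if len(buff) < 2:
            continue
        # hit: emit combined token and reset the buffer to ()
        # miss: emit buff[0] and keep buff[1] as the start of the next pair
        out_tok, buff = comb_dict.get(buff, (buff[0], (buff[1],)))
        yield out_tok
    if buff:
        yield buff[0]

comb_dict = {('machine', 'learning'): ('machine_learning', ())}

# Illustrative tokens only: a repeated first word, then a trailing solo token.
tokens = ['machine', 'machine', 'learning', 'tools']
result = list(combining_generator(tokens, comb_dict))
print(result)  # ['machine', 'machine_learning', 'tools']
```

The pair ('machine', 'machine') misses, so only the first 'machine' is emitted; the second one then pairs with 'learning' and is combined. Because the buffer never holds more than two tokens, this approach as written only handles two-word keywords.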
The specified combinations can now be applied via:
tokenized_texts = [text.split() for text in texts]
retokenized_texts = [list(combining_generator(tokens, combinations_dict)) for tokens in tokenized_texts]
retokenized_texts
...which reports retokenized_texts as:
[
['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial', 'intelligence.', 'it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking'],
['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields,', 'although', 'they', 'are', 'interconnected.', 'scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page.']
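]
Note that the tokens ('artificial', 'intelligence.') aren't combined here, because the very simple .split() tokenization leaves the punctuation attached, which prevents an exact match against the rule.
Real projects will want a more sophisticated tokenization, one that strips punctuation, keeps punctuation as separate tokens, or does other preprocessing, so that 'intelligence' is passed as a token without the attached '.'. For example, a simple tokenization that keeps only runs of word characters and discards punctuation would be: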
import re
tokenized_texts = [re.findall(r'\w+', text) for text in texts]
tokenized_texts
Another, which keeps any stray non-word/non-space characters (punctuation) as standalone tokens, would be:
tokenized_texts = [re.findall(r'\w+|(?:[^\w\s])', text) for text in texts]
tokenized_texts
Either of these alternatives to a simple .split() ensures that your first text presents the ('artificial', 'intelligence') pair needed for the combination.
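The question asks for one list of tokens per sentence, which the steps above don't produce directly. Here is a minimal sketch of one way to put the pieces together. It assumes that splitting on '.', '!' or '?' is a good enough sentence boundary for this data. The helper name to_sentence_tokens is hypothetical, and the sample text is a shortened version of the first text in the question.

```python
import re

def combining_generator(tokens, comb_dict):
    # Same pair-combining generator as in the answer above.
    buff = ()
    for in_tok in tokens:
        buff += (in_tok,)
        if len(buff) < 2:
            continue
        out_tok, buff = comb_dict.get(buff, (buff[0], (buff[1],)))
        yield out_tok
    if buff:
        yield buff[0]

def to_sentence_tokens(text, comb_dict):
    # Hypothetical helper. Assumption: '.', '!' and '?' end sentences;
    # real text (abbreviations, decimals) may need a proper sentence splitter.
    sentences = [s for s in re.split(r'[.!?]', text) if s.strip()]
    # Tokenize each sentence on runs of word characters, then combine pairs.
    return [list(combining_generator(re.findall(r'\w+', s), comb_dict))
            for s in sentences]

keywords = ['machine learning', 'data science', 'artificial intelligence']
comb_dict = {tuple(kw.split()): ('_'.join(kw.split()), ()) for kw in keywords}

# Shortened sample text, for illustration only.
text = ('the new machine learning model is revolutionary for '
        'artificial intelligence. it may change the way we are thinking')
nested = to_sentence_tokens(text, comb_dict)
print(nested)
```

Splitting into sentences before combining also means a keyword is never formed across a sentence boundary, which seems to be the behaviour the desired output implies.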
Answered 2019-11-13 14:43:35
This is probably not the most Pythonic way to do it, but it works in 3 steps.
keywords = ['machine learning', 'data science', 'artificial intelligence']
texts = ['the new machine learning model built by google is revolutionary for the current state of artificial intelligence. it may change the way we are thinking', 'data science and artificial intelligence are two different fields, although they are interconnected. scientists from harvard are explaining it in a detailed presentation that could be found on our page.']
import re  # needed for re.sub below

# Add underscore
for idx, text in enumerate(texts):
    for keyword in keywords:
        reload_text = texts[idx]
        if keyword in text:
            texts[idx] = reload_text.replace(keyword, keyword.replace(" ", "_"))
# Split text for each "." encountered
for idx, text in enumerate(texts):
    texts[idx] = list(filter(None, text.split(".")))
print(texts)
# Split text to get each word
for idx, text in enumerate(texts):
    for idx_s, sentence in enumerate(text):
        # map to delete every undesired character
        texts[idx][idx_s] = list(map(lambda x: re.sub(r"[,.!?]", "", x), sentence.split()))
print(texts)
Output:
[
[
['the', 'new', 'machine_learning', 'model', 'built', 'by', 'google', 'is', 'revolutionary', 'for', 'the', 'current', 'state', 'of', 'artificial_intelligence'],
['it', 'may', 'change', 'the', 'way', 'we', 'are', 'thinking']
],
[
['data_science', 'and', 'artificial_intelligence', 'are', 'two', 'different', 'fields', 'although', 'they', 'are', 'interconnected'],
['scientists', 'from', 'harvard', 'are', 'explaining', 'it', 'in', 'a', 'detailed', 'presentation', 'that', 'could', 'be', 'found', 'on', 'our', 'page']
]
]
https://stackoverflow.com/questions/58839049