I am trying to get a count of the substrings in column b that match column a, in any order.
Example:
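[col a]          [col b]                           [frequency]
big red car      elon musk drives a big red car    1
elon musk car    elon musk drives a big red car    1
red big car      elon musk drives a big red car    1

The number of matches must be capped at 1. For example, "big red car" should match only once, rather than once for every combination of its words.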
If possible, it should match whole words only: "car" should not match "cars", "card", etc.
What I have tried:
df["frequency"] = df.apply(lambda x: x['col b'].count(x['col a']), axis=1)
This only finds exact matches, but I need the words to match in any order.
Any help is appreciated.
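To make the problem reproducible, here is a minimal sketch that rebuilds the example table and runs the attempt above. The DataFrame construction is an assumption based on the sample rows, and the column names follow the question's own snippet. Only the first row matches, because `str.count` looks for the whole phrase as one contiguous substring.

```python
import pandas as pd

# Example data from the question (construction assumed from the sample rows).
df = pd.DataFrame({
    "col a": ["big red car", "elon musk car", "red big car"],
    "col b": ["elon musk drives a big red car"] * 3,
})

# The attempted approach: count literal occurrences of the full phrase.
df["frequency"] = df.apply(lambda x: x["col b"].count(x["col a"]), axis=1)

print(df["frequency"].tolist())  # [1, 0, 0]
```

Rows 2 and 3 come out as 0 even though all of their words appear in col b, which is the behaviour the question wants to change.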
Posted on 2021-07-15 08:47:23
Assuming you want to check that all the words in "col a" are present in "col b":
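def ismatch(s):
    # Split both columns into sets of whitespace-separated words
    A = set(s['[col a]'].split())
    B = set(s['[col b]'].split())
    # True when every word of A also appears in B, regardless of order
    return A.intersection(B) == A

df.apply(ismatch, axis=1)
Input: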
   [col a]          [col b]                           [frequency]
0  big red car      elon musk drives a big red car    1
1  elon musk car    elon musk drives a big red car    1
2  red big car      elon musk drives a big red car    1
3  red big card     elon musk drives a big red car    1
Output:
0     True
1     True
2     True
3    False

Posted on 2021-07-15 09:57:46
Try str.contains():
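# Build one regex alternation from the distinct [col a] phrases
words = '|'.join(df['[col a]'].unique())
# Finally: 1 if [col b] contains any of those phrases, else 0
df['[frequency]'] = df['[col b]'].str.contains(words).astype(int)
# OR (Series.view is deprecated in recent pandas versions; prefer astype)
df['[frequency]'] = df['[col b]'].str.contains(words).view('i1')
Output of df: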
[col a]          [col b]                           [frequency]
big red car      elon musk drives a big red car    1
elon musk car    elon musk drives a big red car    1
red big car      elon musk drives a big red car    1

https://stackoverflow.com/questions/68390475
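As a possible way to combine the two answers, the sketch below returns a 0/1 frequency that ignores word order and only counts whole words. The helper name `whole_word_match`, the lowercasing, and the `\w+` tokenization are assumptions made for illustration and are not part of either answer. Note that an empty "col a" would count as a match, since an empty set is a subset of any set.

```python
import re
import pandas as pd

def whole_word_match(row, a_col="[col a]", b_col="[col b]"):
    """Return 1 if every word in a_col appears as a whole word in b_col, else 0."""
    # \w+ keeps runs of letters/digits/underscores, so trailing punctuation
    # such as "car." does not block a match (assumed tokenization rule).
    words_a = set(re.findall(r"\w+", row[a_col].lower()))
    words_b = set(re.findall(r"\w+", row[b_col].lower()))
    # Subset test: order is irrelevant and the result is capped at 1.
    return int(words_a <= words_b)

df = pd.DataFrame({
    "[col a]": ["big red car", "elon musk car", "red big car", "red big card"],
    "[col b]": ["Elon Musk drives a big red car."] * 4,
})
df["[frequency]"] = df.apply(whole_word_match, axis=1)

print(df["[frequency]"].tolist())  # [1, 1, 1, 0]
```

Because the comparison is done on whole tokens, "card" does not match "car", which a plain substring or unanchored regex search would not guarantee.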