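I am trying to search for Hindi words, stored one per line in file-1, and find them in the lines of file-2. I have to print each line number together with the number of words found on that line. The code is as follows: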
import codecs
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines()
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
count_arr = []
for counter, line in enumerate(hypernyms):
    count_arr.append(0)
    for word in words:
        if line.find(word) >= 0:
            count_arr[counter] += 1
for iterator, count in enumerate(count_arr):
    if count > 0:
        print iterator, ' ', count
This finds some of the words but ignores others. The input files are:
File-1:
पौधा
वनस्पति
File-2:
वनस्पति, पेड़-पौधा
वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग
पादप_समूह, पेड़-पौधे, वनस्पति_समूह
पेड़-पौधा
This gives the output:
0 1
3 1
Clearly, it ignores वनस्पति and only finds पौधा. I have tried other inputs as well, and it only ever finds one word. Any idea how to correct this?
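Posted on 2012-04-07 19:20:11
This is because you are not removing the "\n" character at the end of each line, so you are searching for "some_pattern\n" rather than "some_pattern". Use the strip() function to chop the newlines off, like this: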
import codecs
# strip() removes the trailing "\n" (and other surrounding whitespace) from each word
words = [word.strip() for word in codecs.open("hypernyms_en2hi.txt", "r", "utf-8")]
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8")
count_arr = []
for line in hypernyms:
    count_arr.append(0)
    for word in words:
        # a bool adds as 0 or 1, so this counts the words found in the line
        count_arr[-1] += (word in line)
for iterator, count in enumerate(count_arr):
    if count:
        print iterator, ' ', count
Posted on 2012-04-07 18:59:03
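I think the problem is here:
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
.readlines() keeps the newline at the end of each line, so you are searching not for पौधा but for पौधा\n, which can only match at the end of a line. If I use .read().split() instead, I get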
0 2
2 1
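3 1
Posted on 2012-04-07 19:33:58
Put in this code and you will see why this happens. It is because of whitespace: in file 1, the first word is पौधा followed by whitespace...
for i in hypernyms:
    print "file1",i
for i in words:
    print "file2",i
Put it after count_arr = [] and before the counter, at line ...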
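A plain print does not show a trailing newline very clearly, so repr() may make the stray whitespace easier to spot. Below is a minimal sketch of that check, combined with the strip() approach from the other answer. It is written for Python 3 (print as a function), unlike the Python 2 code in the thread. The in-memory lists are hypothetical stand-ins for the two files, assuming every line ends with "\n", and count_matches is a helper name made up for this illustration.

```python
# Hypothetical in-memory stand-ins for the two files in the question.
# Assumption: every entry keeps its trailing "\n", as readlines() returns it.
words = ["पौधा\n", "वनस्पति\n"]
lines = [
    "वनस्पति, पेड़-पौधा\n",
    "वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग\n",
    "पादप_समूह, पेड़-पौधे, वनस्पति_समूह\n",
    "पेड़-पौधा\n",
]


def count_matches(lines, words):
    """Return one count per line: how many of the words occur in that line."""
    # strip() drops surrounding whitespace, including the trailing newline.
    # Empty entries are skipped, because "" is a substring of every line.
    cleaned = [w.strip() for w in words if w.strip()]
    return [sum(w in line for w in cleaned) for line in lines]


# repr() shows the otherwise invisible "\n" at the end of each word.
for w in words:
    print("file1", repr(w))

# Counting with the raw words reproduces the problem from the question.
raw_counts = [sum(w in line for w in words) for line in lines]
# Counting with the stripped words finds both words.
fixed_counts = count_matches(lines, words)

for number, count in enumerate(fixed_counts):
    if count:
        print(number, count)
```

On this sample data the raw counts are nonzero only for lines 0 and 3, as in the question, while the stripped counts match the 0 2, 2 1, 3 1 output reported in the .read().split() answer. Note that this is still substring matching, so वनस्पति also counts inside वनस्पति_समूह.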
https://stackoverflow.com/questions/10053756