
Counting the total number of modal verbs in a text

Stack Overflow user
Asked on 2020-09-29 06:57:49
1 answer · 357 views · 0 followers · Score 1

I am trying to create a custom collection of words, like this:

Modal    Tentative    Certainty    Generalizing
Can      Anyhow       Undoubtedly  Generally
May      anytime      Ofcourse     Overall
Might    anything     Definitely   On the Whole
Must     hazy         No doubt     In general
Shall    hope         Doubtless    All in all
ought to hoped        Never        Basically
will     uncertain    always       Essentially
need     undecidable  absolute     Most
Be to    occasional   assure       Every
Have to  somebody     certain      Some
Would    someone      clear        Often
Should   something    clearly      Rarely
Could    sort         inevitable   None
Used to  sorta        forever      Always

I am reading the text line by line from a CSV file:

import nltk
import numpy as np
import pandas as pd
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize

count = defaultdict(int)
header_list = ["modal","Tentative","Certainity","Generalization"]
categorydf = pd.read_csv('Custom-Dictionary1.csv', names=header_list)
def analyze(file):
    df = pd.read_csv(file)
    modals = str(categorydf['modal'])
    tentative = str(categorydf['Tentative'])
    certainity = str(categorydf['Certainity'])
    generalization = str(categorydf['Generalization'])
    for text in df["Text"]:
        tokenize_text = text.split()
        for w in tokenize_text:          
            if w in modals:
                count[w] += 1
                       
analyze("test1.csv")
print(sum(count.values()))
print(count)

I want to find the Modal/Tentative/Certainty words from the table above in each line of test1.csv, but I cannot. This is the word-frequency output it produces:

19
defaultdict(<class 'int'>, {'to': 7, 'an': 1, 'will': 2, 'a': 7, 'all': 2})

Notice that 'an' and 'a' are not in the table. What I want is the total count of modal verbs per line of the test1.csv text.

test1.csv

"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
"They convey the content of a communication."
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"

I am stuck and getting nothing. How can I do this?


1 Answer

Stack Overflow user

Answered on 2020-09-29 08:52:24

I have solved the task for your initial CSV input format; if needed, it can be adapted for XML input.

I wrote the main solution with NumPy, which is why it may look a bit complex, but it runs very fast and scales to large data, even gigabytes.

It sorts the table of words, also sorts the unique text words with their counts, and binary-searches the sorted table, so it works in O(n log n) time.
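The sorted-table membership test at the core of this can be seen in isolation (toy words, for illustration only):

```python
import numpy as np

# A tiny sorted lookup table and some text words (illustrative data only).
table = np.sort(np.array(["can", "clear", "will"]))
words = np.array(["a", "clear", "will", "zzz"])

pos = np.searchsorted(table, words)           # candidate insertion positions
mask = pos < table.size                       # guard against out-of-range positions
mask[mask] = table[pos[mask]] == words[mask]  # keep only exact matches

print(mask.tolist())  # [False, True, True, False]
```

Only "clear" and "will" survive the exact-match check; "a" lands at a position holding a different word, and "zzz" falls past the end of the table.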

For each input line it prints the original text first, then a Found line listing, in sorted order, every word found in the table as "word": (Count, Category, (TableRow, TableCol)), then a Non-Found line listing the words not in the table together with their counts (how many times each word occurs in the text).

A simpler (but slower) solution follows after this first one.

Try it online!

import io, pandas as pd, numpy as np

# Instead of io.StringIO(...) provide filename.
tab = pd.read_csv(io.StringIO("""
Modal,Tentative,Certainty,Generalizing
Can,Anyhow,Undoubtedly,Generally
May,anytime,Ofcourse,Overall
Might,anything,Definitely,On the Whole
Must,hazy,No doubt,In general
Shall,hope,Doubtless,All in all
ought to,hoped,Never,Basically
will,uncertain,always,Essentially
need,undecidable,absolute,Most
Be to,occasional,assure,Every
Have to,somebody,certain,Some
Would,someone,clear,Often
Should,something,clearly,Rarely
Could,sort,inevitable,None
Used to,sorta,forever,Always
"""))
tabc = np.array(tab.columns.values.tolist(), dtype = np.str_)  # category (column) names
taba = tab.values.astype(np.str_)
tabw = np.char.lower(taba.ravel())                             # all table words, lower-cased
tabi = np.zeros([tabw.size, 2], dtype = np.int64)              # (row, col) index of each word
tabi[:, 0], tabi[:, 1] = [e.ravel() for e in np.split(np.mgrid[:taba.shape[0], :taba.shape[1]], 2, axis = 0)]
t = np.argsort(tabw)
tabw, tabi = tabw[t], tabi[t, :]                               # sort words for binary search

texts = pd.read_csv(io.StringIO("""
Text
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
They convey the content of a communication.
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
""")).values[:, 0].astype(np.str_)

for i, (a, text) in enumerate(zip(map(np.array, np.char.split(texts)), texts)):
    vs, cs = np.unique(np.char.lower(a), return_counts = True)  # unique words + counts
    ps = np.searchsorted(tabw, vs)       # candidate positions in the sorted table
    psm = ps < tabi.shape[0]             # guard against out-of-range positions
    psm[psm] = tabw[ps[psm]] == vs[psm]  # keep only exact matches
    print(
        i, ': Text:', text,
        '\nFound:',
        ', '.join([f'"{vs[i]}": ({cs[i]}, {tabc[tabi[ps[i], 1]]}, ({tabi[ps[i], 0]}, {tabi[ps[i], 1]}))'
            for i in np.flatnonzero(psm).tolist()]),
        '\nNon-Found:',
        ', '.join([f'"{vs[i]}": {cs[i]}'
            for i in np.flatnonzero(~psm).tolist()]),
        '\n',
    )

Output:

0 : Text: When LIWC was first developed, the goal was to devise an efficient will system
Found: "will": (1, Modal, (6, 0))
Non-Found: "an": 1, "developed,": 1, "devise": 1, "efficient": 1, "first": 1, "goal": 1, "liwc": 1, "system": 1, "the": 1, "to": 1, "was": 2, "when": 1

1 : Text: Within a few years, it became clear that there are two very broad categories of words
Found: "clear": (1, Certainty, (10, 2))
Non-Found: "a": 1, "are": 1, "became": 1, "broad": 1, "categories": 1, "few": 1, "it": 1, "of": 1, "that": 1, "there": 1, "two": 1, "very": 1, "within": 1, "words": 1, "years,": 1

2 : Text: Content words are generally nouns, regular verbs, and many adjectives and adverbs.
Found: "generally": (1, Generalizing, (0, 3))
Non-Found: "adjectives": 1, "adverbs.": 1, "and": 2, "are": 1, "content": 1, "many": 1, "nouns,": 1, "regular": 1, "verbs,": 1, "words": 1

3 : Text: They convey the content of a communication.
Found:
Non-Found: "a": 1, "communication.": 1, "content": 1, "convey": 1, "of": 1, "the": 1, "they": 1

4 : Text: To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”
Found:
Non-Found: "a": 1, "and": 2, "are:": 1, "back": 1, "content": 1, "dark": 1, "go": 1, "night”": 1, "phrase": 1, "stormy": 1, "the": 2, "to": 2, "was": 1, "words": 1, "“dark,”": 1, "“it": 1, "“night.”": 1, "“stormy,”": 1

The second solution is implemented in pure Python, just for simplicity, using only the standard modules io and csv.
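The heart of this version is flattening the table into a single word-to-category dict; on a two-column toy table (illustrative words only) the idiom looks like this:

```python
import csv, io

# A tiny stand-in table; your real code reads the full CSV instead.
tab = csv.DictReader(io.StringIO("Modal,Certainty\nCan,clear\nwill,always\n"))

# Invert: each cell value (lower-cased) maps to its column header.
tabi = {v.lower(): k for row in tab for k, v in row.items()}

print(tabi)  # {'can': 'Modal', 'clear': 'Certainty', 'will': 'Modal', 'always': 'Certainty'}
```

A single dict lookup then gives both the membership test and the category name at once.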

Try it online!

import io, csv

# Instead of io.StringIO(...) just read from filename.
tab = csv.DictReader(io.StringIO("""Modal,Tentative,Certainty,Generalizing
Can,Anyhow,Undoubtedly,Generally
May,anytime,Ofcourse,Overall
Might,anything,Definitely,On the Whole
Must,hazy,No doubt,In general
Shall,hope,Doubtless,All in all
ought to,hoped,Never,Basically
will,uncertain,always,Essentially
need,undecidable,absolute,Most
Be to,occasional,assure,Every
Have to,somebody,certain,Some
Would,someone,clear,Often
Should,something,clearly,Rarely
Could,sort,inevitable,None
Used to,sorta,forever,Always
"""))

texts = csv.DictReader(io.StringIO("""
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
They convey the content of a communication.
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
"""), fieldnames = ['Text'])

tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]

for text in texts:
    cnt, mod = {}, {}
    for word in text.lower().split():
        if word in tabi:
            cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
    print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))

Its output is as follows:

'will': (1, Modal)
'clear': (1, Certainty)
'generally': (1, Generalizing)
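As a side note, the reason your original attempt counted 'a', 'an' and 'to' is that str(categorydf['modal']) turns the whole column into its printed representation, so `w in modals` is a substring test against that string. A minimal fix of your own approach (a sketch using inline StringIO data in place of your files, keeping your column layout) is to test membership in a set of lower-cased words instead:

```python
import io
import pandas as pd
from collections import defaultdict

# In place of io.StringIO(...) you would pass your real file names.
dictionary_csv = io.StringIO("modal\nCan\nMay\nwill\nneed\n")
text_csv = io.StringIO('Text\n"When LIWC was developed, the goal was to devise an efficient will system"\n')

categorydf = pd.read_csv(dictionary_csv)
# A set of lower-cased words, not the string repr of the whole column.
modal_set = set(categorydf['modal'].dropna().str.lower())

count = defaultdict(int)
for text in pd.read_csv(text_csv)["Text"]:
    for w in text.split():
        if w.lower() in modal_set:       # exact membership test, not substring
            count[w.lower()] += 1

print(sum(count.values()), dict(count))  # 1 {'will': 1}
```

Note that multi-word table entries such as "ought to" or "No doubt" still cannot match single tokens produced by split(); those need a phrase-level check.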

I read the CSV content from StringIO only for convenience, so that the code is self-contained and needs no extra files; in your case you will certainly want to read directly from files, as in the next snippet (and its Try it online! link):

Try it online!

import csv

tab = csv.DictReader(open('table.csv', 'r', encoding = 'utf-8-sig'))
texts = csv.DictReader(open('texts.csv', 'r', encoding = 'utf-8-sig'), fieldnames = ['Text'])

tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]

for text in texts:
    cnt, mod = {}, {}
    for word in text.lower().split():
        if word in tabi:
            cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
    print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))
Score 0
Original content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/64114399
