Q: Regex: multiple options for a named match

Stack Overflow user

Asked 2020-09-30 03:41:40

Answers: 1 · Views: 256 · Followers: 0 · Votes: 1

I have a regex pattern that only partially captures what I want. The input can look like any of the following:

"caller command"
"caller command specifier"
"caller command 'two-worded specifier'"
"caller 'two-worded command' specifier"
"caller 'two-worded command' 'two-worded specifier'"

My current code matches these into named groups, using the yes/no pattern shown in the documentation for Python's re library.

import re

messages = ["your.majesty hello", "proclamation honor Dom", "your.majesty query 'Weekly Coding Challenge'", "your.majesty 'build test' submissions", "your.majesty 'build test' 'Weekly Coding Challenge'"]
call = "(?P<call>.*?)"
command = "(?P<command>'(.*?)'|(.*?))"
specifier = "(?P<specifier>'(.*?.)'|(.*?))"
# raw f-strings so that \s reaches the regex engine without an escape warning
duo = rf"{call}\s{command}"
trio = rf"({call}\s{command}\s{specifier})"

regex_duo = re.compile(duo, flags=re.DOTALL)
regex_trio = re.compile(trio)

for msg in messages:
    match = regex_trio.match(msg)
    if match is None:
        match = regex_duo.match(msg)
    print(match)

Its output is

<re.Match object; span=(0, 13), match='your.majesty '>
<re.Match object; span=(0, 19), match='proclamation honor '>
<re.Match object; span=(0, 44), match="your.majesty query 'Weekly Coding Challenge'">
<re.Match object; span=(0, 26), match="your.majesty 'build test' ">
<re.Match object; span=(0, 51), match="your.majesty 'build test' 'Weekly Coding Challeng>

whereas I want

<re.Match object; span=(0, ...), match='your.majesty hello'>
<re.Match object; span=(0, ...), match='proclamation honor Dom'>
<re.Match object; span=(0, ...), match="your.majesty query 'Weekly Coding Challenge'">
<re.Match object; span=(0, ...), match="your.majesty 'build test' submissions">
<re.Match object; span=(0, ...), match="your.majesty 'build test' 'Weekly Coding Challenge'">

Is there a better way to do this than what I am doing now?
And why does the match get cut off so much even though I am using greedy matching?

Tags: python, regex, string, string-matching

1 Answer

Stack Overflow user

Answered 2020-10-01 01:40:17

Solution 1: csv.reader (reuse the wheel)

Simply convert the problem into a format that csv.reader can read, using io.StringIO.

Code

from io import StringIO
import csv

messages = [
    "your.majesty hello",
    "proclamation honor Dom",
    "your.majesty query 'Weekly Coding Challenge'",
    "your.majesty 'build test' submissions",
    "your.majesty 'build test' 'Weekly Coding Challenge'"
]

# Avoid creating StringIO object multiple times
# for s in messages:
#    reader = csv.reader(StringIO(s), delimiter=" ", quotechar="'")

# load at once
ss = "\n".join(messages)
reader = csv.reader(StringIO(ss), delimiter=" ", quotechar="'")    

for row in reader:  # type(row) is a list
    caller = row[0]
    command = row[1]
    specifier = row[2] if len(row) == 3 else ""
    # check
    print(f"caller = {caller}, command = {command}, specifier = {specifier}")
    # do something with the parsed components here

Output

caller = your.majesty, command = hello, specifier = 
caller = proclamation, command = honor, specifier = Dom
caller = your.majesty, command = query, specifier = Weekly Coding Challenge
caller = your.majesty, command = build test, specifier = submissions
caller = your.majesty, command = build test, specifier = Weekly Coding Challenge

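This solution does not produce re.Match objects; it parses the three components directly. Because they are plain strings rather than matched groups, subsequent operations should be easier.

The advantage: we know the existing csv loader handles quoting and space-delimited formats correctly, right? So don't reinvent the wheel; try to repurpose it. The code is also easier to maintain this way.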
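If the messages arrive one at a time rather than as a batch, the same idea can be wrapped in a small helper. The following is only a sketch: the names `Parsed` and `parse_message` are made up for illustration, and it assumes every message has exactly two or three fields. It relies on the fact that csv.reader accepts any iterable of strings, so no StringIO object is needed for a single line.

```python
import csv
from typing import NamedTuple, Optional


class Parsed(NamedTuple):
    """Hypothetical container for the three components."""
    caller: str
    command: str
    specifier: Optional[str]


def parse_message(message: str) -> Parsed:
    """Split one message on spaces, treating '...' as a single field."""
    # A one-element list is enough: csv.reader iterates over lines of text.
    row = next(csv.reader([message], delimiter=" ", quotechar="'"))
    if not 2 <= len(row) <= 3:
        raise ValueError(f"expected 2 or 3 fields, got {len(row)}: {message!r}")
    specifier = row[2] if len(row) == 3 else None
    return Parsed(row[0], row[1], specifier)


print(parse_message("your.majesty 'build test' 'Weekly Coding Challenge'"))
print(parse_message("your.majesty hello"))
```

Returning None for a missing specifier (instead of an empty string) is a design choice of this sketch; it makes "absent" distinguishable from "present but empty".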

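Using pandas.read_csv

Note: pandas.read_csv() can also be used to produce a pandas.DataFrame directly. The same syntax applies, except that the column names must be assigned manually. A possibly missing column (the last one) is handled properly.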

import pandas as pd

pd.read_csv(StringIO(ss), delimiter=" ", quotechar="'", names=["caller", "command", "specifier"])
Out[38]: 
         caller     command                specifier
0  your.majesty       hello                      NaN
1  proclamation       honor                      Dom
2  your.majesty       query  Weekly Coding Challenge
3  your.majesty  build test              submissions
4  your.majesty  build test  Weekly Coding Challenge

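Solution 2: improved regex (more general)

As for the regex: yes, it can also be improved a lot. I personally think this is worth elaborating on, because many parsing tasks (possibly most) cannot be solved by an existing library.

Summary

Use a raw triple-quoted string together with re.VERBOSE so that the pattern can be documented in detail. (Such regexes also display quite readably in PyCharm.)
Be more precise about the pattern being matched. In general, do not use .* unless the characters to match really are arbitrary.
Use the ? quantifier to express optional presence.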
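The second point is also what explains the truncated matches in the question. A minimal sketch, using a shortened version of the question's two-part pattern (the variable names `lazy` and `precise` are mine): `.*?` is a lazy quantifier, not a greedy one, so it takes as few characters as it can, and match() does not require the pattern to reach the end of the string.

```python
import re

msg = "your.majesty hello"

# Lazy `.*?` at the end of a pattern is satisfied by zero characters,
# so the command group comes back empty and the match stops after the space.
lazy = re.compile(r"(?P<call>.*?)\s(?P<command>.*?)")
m = lazy.match(msg)
print(repr(m.group()))            # 'your.majesty '
print(repr(m.group("command")))   # ''

# fullmatch() must consume the whole string, which forces `.*?` to expand.
print(repr(lazy.fullmatch(msg).group("command")))   # 'hello'

# A precise character class avoids the problem without any anchoring.
precise = re.compile(r"(?P<call>\S+)\s(?P<command>\S+)")
print(repr(precise.match(msg).group("command")))    # 'hello'
```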

Code

import re

regex_uni = re.compile(r"""
    (?P<call>\S+) 
    \             # a space character
    (?P<command>  # group 2:
        (?:         # 1st option (non-capturing group):
           '          # begins with SQ
           [^']+      # followed by one or more consecutive non-SQ chars
           '          # ends with SQ
        )
        |         # or
        \S+         # 2nd option: consecutive non-space chars (assuming no SQ)
    )        
    \ ?  # optional space character
    (?P<specifier>       # group 3:   
        (?:'[^']+')|\S+    # same as group 2
    )?                   # but the existence is optional
    """, re.VERBOSE
)

for msg in messages:
    match = regex_uni.match(msg)
    if match is not None:
        print(f"* input = {match.group()}")
        print(f"    call = {match.group('call')}")
        print(f"    command = {match.group('command')}")
        print(f"    specifier = {match.group('specifier')}")

Output

* input = your.majesty hello
    call = your.majesty
    command = hello
    specifier = None
* input = proclamation honor Dom
    call = proclamation
    command = honor
    specifier = Dom
* input = your.majesty query 'Weekly Coding Challenge'
    call = your.majesty
    command = query
    specifier = 'Weekly Coding Challenge'
* input = your.majesty 'build test' submissions
    call = your.majesty
    command = 'build test'
    specifier = submissions
* input = your.majesty 'build test' 'Weekly Coding Challenge'
    call = your.majesty
    command = 'build test'
    specifier = 'Weekly Coding Challenge'
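Unlike Solution 1, the groups above keep their surrounding single quotes, and match() accepts input with trailing text it did not consume. If neither is wanted, something along these lines might work. It is a sketch under assumptions, not part of the original pattern: `PATTERN` is a compact variant in which the separating space belongs to the optional specifier group, `parse` is a made-up helper name, and unquoted tokens are assumed not to begin or end with a single quote (strip would remove it).

```python
import re

# Compact variant of the Solution 2 pattern; the space before the specifier
# is placed inside the optional non-capturing group.
PATTERN = re.compile(
    r"(?P<call>\S+) "
    r"(?P<command>'[^']+'|\S+)"
    r"(?: (?P<specifier>'[^']+'|\S+))?"
)


def parse(msg):
    """Return a dict of the three components, or None if msg does not fit."""
    m = PATTERN.fullmatch(msg)  # fullmatch rejects leftover trailing text
    if m is None:
        return None
    # Drop the surrounding single quotes; an absent specifier stays None.
    return {
        name: value.strip("'") if value is not None else None
        for name, value in m.groupdict().items()
    }


print(parse("your.majesty 'build test' 'Weekly Coding Challenge'"))
print(parse("your.majesty hello"))
print(parse("too many plain words here"))
```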

Votes: 0

Original page content provided by Stack Overflow.

Original link: https://stackoverflow.com/questions/64130504
