
Regex: multiple options for a named match

Stack Overflow user
Asked 2020-09-30 03:41:40
Answers: 1 · Views: 256 · Followers: 0 · Votes: 1

I have a regex pattern that only partially captures what I want. The input can look like any of the following:

"caller command"
"caller command specifier"
"caller command 'two-worded specifier'"
"caller 'two-worded command' specifier"
"caller 'two-worded command' 'two-worded specifier'"

My current code matches these into named groups, using the yes/no pattern shown in the documentation of Python's re library.

import re

messages = [
    "your.majesty hello",
    "proclamation honor Dom",
    "your.majesty query 'Weekly Coding Challenge'",
    "your.majesty 'build test' submissions",
    "your.majesty 'build test' 'Weekly Coding Challenge'",
]
call = "(?P<call>.*?)"
command = "(?P<command>'(.*?)'|(.*?))"
specifier = "(?P<specifier>'(.*?.)'|(.*?))"
# Raw f-strings keep \s as a regex escape instead of an invalid string escape
duo = rf"{call}\s{command}"
trio = rf"({call}\s{command}\s{specifier})"

regex_duo = re.compile(duo, flags=re.DOTALL)
regex_trio = re.compile(trio)

for msg in messages:
    match = regex_trio.match(msg)
    if match is None:
        match = regex_duo.match(msg)
    print(match)

Its output is:

<re.Match object; span=(0, 13), match='your.majesty '>
<re.Match object; span=(0, 19), match='proclamation honor '>
<re.Match object; span=(0, 44), match="your.majesty query 'Weekly Coding Challenge'">
<re.Match object; span=(0, 26), match="your.majesty 'build test' ">
<re.Match object; span=(0, 51), match="your.majesty 'build test' 'Weekly Coding Challeng>

whereas I want:

<re.Match object; span=(0, ...), match='your.majesty hello'>
<re.Match object; span=(0, ...), match='proclamation honor Dom'>
<re.Match object; span=(0, ...), match="your.majesty query 'Weekly Coding Challenge'">
<re.Match object; span=(0, ...), match="your.majesty 'build test' submissions">
<re.Match object; span=(0, ...), match="your.majesty 'build test' 'Weekly Coding Challenge'">

  1. Is there a better way to do this than what I am doing now?
  2. Why is the match cut off so much even though I am using greedy matching?

1 Answer

Stack Overflow user

Answered 2020-10-01 01:40:17

Solution 1: csv.reader (reuse the wheel)

Simply convert the problem into a format that csv.reader can read, using io.StringIO.

from io import StringIO
import csv

messages = [
    "your.majesty hello",
    "proclamation honor Dom",
    "your.majesty query 'Weekly Coding Challenge'",
    "your.majesty 'build test' submissions",
    "your.majesty 'build test' 'Weekly Coding Challenge'"
]

# Avoid creating StringIO object multiple times
# for s in messages:
#    reader = csv.reader(StringIO(s), delimiter=" ", quotechar="'")

# load at once
ss = "\n".join(messages)
reader = csv.reader(StringIO(ss), delimiter=" ", quotechar="'")    

for row in reader:  # type(row) is a list
    caller = row[0]
    command = row[1]
    specifier = row[2] if len(row) == 3 else ""
    # check
    print(f"caller = {caller}, command = {command}, specifier = {specifier}")
    # do something with the parsed components here

Output

caller = your.majesty, command = hello, specifier = 
caller = proclamation, command = honor, specifier = Dom
caller = your.majesty, command = query, specifier = Weekly Coding Challenge
caller = your.majesty, command = build test, specifier = submissions
caller = your.majesty, command = build test, specifier = Weekly Coding Challenge

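This solution does not produce re.Match objects; it parses the three components directly. Because they are plain strings rather than matched groups, subsequent processing should be easier.

The advantage: we already know that the existing csv loader handles quoted, space-delimited formats correctly, right? So rather than reinventing the wheel, try to reuse it. That also makes the code easier to maintain.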
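As a minimal sketch of the same idea packaged for reuse: csv.reader accepts any iterable of strings, so the list of messages can be passed in directly without StringIO. The names `Parsed` and `parse_messages` are hypothetical helpers introduced here for illustration, and the choice to raise on rows with an unexpected number of fields is an assumption, not part of the original answer.

```python
import csv
from typing import List, NamedTuple, Optional


class Parsed(NamedTuple):
    """Hypothetical container for the three components of a message."""
    caller: str
    command: str
    specifier: Optional[str]  # None when the message has no specifier


def parse_messages(messages: List[str]) -> List[Parsed]:
    # csv.reader accepts any iterable of strings, so the list can be
    # passed directly; each string is treated as one line.
    reader = csv.reader(messages, delimiter=" ", quotechar="'")
    parsed = []
    for row in reader:
        # Assumption: anything other than 2 or 3 fields is an error.
        if len(row) not in (2, 3):
            raise ValueError(f"expected 2 or 3 fields, got {row!r}")
        specifier = row[2] if len(row) == 3 else None
        parsed.append(Parsed(row[0], row[1], specifier))
    return parsed


messages = [
    "your.majesty hello",
    "your.majesty 'build test' 'Weekly Coding Challenge'",
]
for item in parse_messages(messages):
    print(item)
```

Returning None instead of an empty string for a missing specifier keeps "no specifier" distinguishable from an explicitly empty one; whether that distinction matters depends on the caller.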

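Using pandas.read_csv

Note: pandas.read_csv() can also be used to produce a pandas.DataFrame directly. The same syntax applies, except that the column names must be assigned manually. The column that may be missing (the last one) is handled properly: it is filled with NaN.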

import pandas as pd

pd.read_csv(StringIO(ss), delimiter=" ", quotechar="'", names=["caller", "command", "specifier"])
Out[38]: 
         caller     command                specifier
0  your.majesty       hello                      NaN
1  proclamation       honor                      Dom
2  your.majesty       query  Weekly Coding Challenge
3  your.majesty  build test              submissions
4  your.majesty  build test  Weekly Coding Challenge

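Solution 2: improved regex (more general)

As for the regex: yes, it can be improved a lot as well. I personally think this is also worth elaborating on, because many parsing tasks (probably most of them) cannot be solved by an existing library.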

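Summary

  • Use a raw triple-quoted string with re.VERBOSE so that the pattern can carry detailed inline documentation. (PyCharm displays such regexes quite comfortably.)
  • Be more precise about the pattern to match. In general, do not use .* unless the characters to match really are arbitrary.
  • Use the ? quantifier to mark a part as optional.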
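To illustrate the second point, and the question of why the matches were cut off, here is a small sketch (not part of the original answer). The quantifier `.*?` is lazy, not greedy: it matches as little as possible, so when nothing follows it in the pattern it is satisfied with the empty string. The patterns below are simplified stand-ins for the ones in the question.

```python
import re

msg = "your.majesty hello"

# Lazy quantifier at the end of the pattern: nothing forces it to expand,
# so the last group matches the empty string.
lazy = re.match(r"(?P<call>.*?)\s(?P<command>.*?)", msg)
print(repr(lazy.group("command")))      # ''

# fullmatch requires the whole string to match, which forces the lazy
# group to consume the rest of the input.
anchored = re.fullmatch(r"(?P<call>.*?)\s(?P<command>.*?)", msg)
print(repr(anchored.group("command")))  # 'hello'

# A precise character class captures the whole word without any anchor.
precise = re.match(r"(?P<call>\S+)\s(?P<command>\S+)", msg)
print(repr(precise.group("command")))   # 'hello'
```

This is why the first two results in the question end right after a space: the trailing lazy group stopped at zero characters.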

import re

regex_uni = re.compile(r"""
    (?P<call>\S+) 
    \             # a space character
    (?P<command>  # group 2:
        (?:         # 1st option (non-capturing group):
           '          # begins with SQ
           [^']+      # followed by one or more consecutive non-SQ chars
           '          # ends with SQ
        )
        |         # or
        \S+         # 2nd option: consecutive non-space chars (assuming no SQ)
    )        
    \ ?  # optional space character
    (?P<specifier>       # group 3:   
        (?:'[^']+')|\S+    # same as group 2
    )?                   # but the existence is optional
    """, re.VERBOSE
)

for msg in messages:
    match = regex_uni.match(msg)
    if match is not None:
        print(f"* input = {match.group()}")
        print(f"    call = {match.group('call')}")
        print(f"    command = {match.group('command')}")
        print(f"    specifier = {match.group('specifier')}")

Output

* input = your.majesty hello
    call = your.majesty
    command = hello
    specifier = None
* input = proclamation honor Dom
    call = proclamation
    command = honor
    specifier = Dom
* input = your.majesty query 'Weekly Coding Challenge'
    call = your.majesty
    command = query
    specifier = 'Weekly Coding Challenge'
* input = your.majesty 'build test' submissions
    call = your.majesty
    command = 'build test'
    specifier = submissions
* input = your.majesty 'build test' 'Weekly Coding Challenge'
    call = your.majesty
    command = 'build test'
    specifier = 'Weekly Coding Challenge'
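Note that, unlike the csv approach, the regex keeps the surrounding single quotes in the captured groups. If the bare text is wanted, one possible variation is sketched below. It is an assumption-laden sketch rather than the answer's own code: `TOKEN`, `PATTERN`, `unquote`, and `parse` are hypothetical names, the separating space is moved inside the optional specifier group, and fullmatch is used so that inputs with extra trailing tokens are rejected instead of silently truncated.

```python
import re

# A quoted phrase or a run of non-space characters (same idea as above).
TOKEN = r"'[^']+'|\S+"
PATTERN = re.compile(
    rf"(?P<call>\S+) (?P<command>{TOKEN})(?: (?P<specifier>{TOKEN}))?"
)


def unquote(token):
    """Strip one pair of surrounding single quotes; pass None through."""
    if token is not None and len(token) >= 2 and token[0] == token[-1] == "'":
        return token[1:-1]
    return token


def parse(msg):
    # fullmatch: the entire message must fit the pattern.
    match = PATTERN.fullmatch(msg)
    if match is None:
        return None
    return {name: unquote(value) for name, value in match.groupdict().items()}


print(parse("your.majesty 'build test' 'Weekly Coding Challenge'"))
print(parse("your.majesty hello"))
print(parse("your.majesty hello there extra"))  # too many tokens
```

The first call yields the three components without quotes, the second has specifier set to None, and the third returns None because the message does not fit the caller/command/specifier shape.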
Votes: 0
Original page content provided by Stack Overflow.
Original link: https://stackoverflow.com/questions/64130504
