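I want to use a Python regular expression to remove the numbers from a string, except for 754 and 1231, because those relate to tax section code 754 and Sec. code 1231. For example, I have the following text data: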
test="""Dividends 9672
Dividends 9680
Interest Income
Ordinary Dividends
Royalties
Capital Gain Distributions
Income from Blackstone
Ordinary Income
Rental Income
Long Term Capital Gain
Short Term Capital Gain
1231 Gain
Section 754 Stock Basis Adjustment - 2015
M-1 Section 754 Stock Basis Adjustment - 2015
Section 754 Stock Basis Adjustment - 2018
M-1 Section 754 Stock basis adjustment - 2018
"""
I want the output to be:
Dividends
Dividends
Interest Income
Ordinary Dividends
Royalties
Capital Gain Distributions
Income from Blackstone
Ordinary Income
Rental Income
Long Term Capital Gain
Short Term Capital Gain
1231 Gain
Section 754 Stock Basis Adjustment
M- Section 754 Stock Basis Adjustment
Section 754 Stock Basis Adjustment
M- Section 754 Stock basis adjustment
My attempt is:
import re
test=re.sub(r'[^(754)(1231)A-Za-z]','',test)
print(test)
But it does not treat 754 or 1231 as a whole; it only removes the digits 6, 8, and 9.
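To illustrate why the attempt behaves this way: inside a character class, parentheses are literal characters and every character is a separate member, so `[^(754)(1231)A-Za-z]` is just a negated set of individual characters. The following is a minimal sketch; the `equivalent` pattern and the short `sample` string are illustrative assumptions, not part of the original post.

```python
import re

# The asker's pattern, and a hand-written equivalent (an assumption made for
# illustration): the class simply lists the characters ( ) 1 2 3 4 5 7 and
# the ASCII letters, then negates that set.
original = r'[^(754)(1231)A-Za-z]'
equivalent = r'[^()123457A-Za-z]'

sample = "Dividends 9672\n1231 Gain\nSection 754 - 2015"

# Both patterns delete every character outside the set, including spaces,
# newlines, hyphens and the digits 0, 6, 8, 9.
print(re.sub(original, '', sample))    # Dividends721231GainSection754215
print(re.sub(equivalent, '', sample))  # Dividends721231GainSection754215
```

Because 7, 5, 4, 1, 2 and 3 are each allowed on their own, digits such as the 7 and 2 in 9672 survive, while spaces and newlines are removed along with 6, 8 and 9.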
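Posted on 2022-04-22 20:48:40
You can use
re.sub(r'(754|1231)|[^A-Za-z\s]', r'\1', text)
See the regex demo.
Here, (754|1231) matches a 754 or 1231 digit sequence and captures it into Group 1, while the alternative |[^A-Za-z\s] matches any character other than an ASCII letter or any Unicode whitespace. Each match is replaced with the Group 1 value, so a captured 754 or 1231 stays in the string and anything matched by the second alternative is removed.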
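As a rough check of the explanation above, the sketch below applies that pattern to a few lines from the question. The helper name `strip_other_digits` and the shortened `sample` are hypothetical, chosen only for illustration.

```python
import re

# Pattern from the answer: group 1 keeps 754/1231; the second alternative
# matches any other character that is not an ASCII letter or whitespace.
pattern = re.compile(r'(754|1231)|[^A-Za-z\s]')

def strip_other_digits(text):
    """Hypothetical helper: replace each match with group 1.

    When the second alternative matched, group 1 did not participate and
    is substituted as an empty string (Python 3.5+), deleting the match.
    """
    return pattern.sub(r'\1', text)

sample = "Dividends 9672\n1231 Gain\nM-1 Section 754 Stock Basis Adjustment - 2015"
print(strip_other_digits(sample))
```

Note that this also removes the hyphen in "M-1" (giving "M Section 754 ...") and leaves the spaces that preceded the removed digits, so the result differs slightly from the desired output shown in the question.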
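Note: if you want to match the numbers only as exact numbers (not as part of a longer digit sequence), use digit boundaries:
re.sub(r'(?<!\d)(754|1231)(?!\d)|[^A-Za-z\s]', r'\1', text)
Posted on 2022-04-23 02:22:38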
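You can write the following.
rgx = r' *-? *(?<!\d)(?!(?:754|1231)(?!\d))\d+'
re.sub(rgx, '', test)
Note that this removes the unwanted spaces and hyphens along with the digits. Also, a longer number such as '7541' is matched and replaced with an empty string.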
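The behavior described above can be tried out with a short sketch like the one below. The `lines` list reuses a few lines from the question, plus one made-up line ("Code 7541") added as an assumption to show the longer-number case.

```python
import re

# Pattern from the answer, unchanged.
rgx = r' *-? *(?<!\d)(?!(?:754|1231)(?!\d))\d+'

lines = [
    "Dividends 9672",
    "1231 Gain",
    "Section 754 Stock Basis Adjustment - 2015",
    "M-1 Section 754 Stock basis adjustment - 2018",
    "Code 7541",  # hypothetical line: 754 followed by another digit
]

# Remove each matched run of digits together with any spaces/hyphen before it.
cleaned = [re.sub(rgx, '', line) for line in lines]
for line in cleaned:
    print(line)
```

With this pattern the trailing " - 2015" disappears entirely, "1231" and "754" are left alone, and "M-1" becomes "M" because the hyphen directly before the digit is consumed as well.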
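The regular expression can be broken down as follows (I have replaced the initial space with a character class containing a space so that it is visible).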
[ ]*-? *           # match >= 0 spaces, optionally followed by a hyphen,
                   # followed by >= 0 spaces
(?<!\d)            # negative lookbehind asserts that preceding character is
                   # not a digit
(?!                # begin negative lookahead
  (?:754|1231)     # match '754' or '1231'
  (?!\d)           # negative lookahead asserts that next character is
                   # not a digit
)                  # end negative lookahead
\d+                # match >= 1 digits
https://stackoverflow.com/questions/71974243