I am trying to wrap every word in a paragraph in a ruby tag. The HTML document looks like this:
<div class = "bodyTxt">
<p>Lorem ipsum dolor sit amet, no postea maiorum sadipscing quo, ad illum percipitur
inciderint usu. Rebum vidisse apeirian an vel. Vis nostro iudicabit instructior ex, ne eos
facer iudicabit. Nec ludus ridens facete ea, ad vix populo adversarium, te mel meis malis
mundi.
</p>
<p>Putant omittam no qui, ei sed esse saperet. Te alii unum ignota has, vix ei maiestatis
expetendis. Et error iracundia argumentum vim, mel maiestatis delicatissimi ex. Sit altera
vivendo ad, vis dolorem consetetur et, fabulas admodum sadipscing te est. Sit et senserit
consequuntur interpretaris, et sale ornatus consequat has, modus aeque omittantur has te.
</p>
</div>
Afterwards, I would like it to look like this:
<div class = "bodyTxt">
<p><ruby>Lorem</ruby> <ruby>ipsum</ruby> <ruby>dolor</ruby> <ruby>sit</ruby>
<ruby>amet</ruby>, <ruby>no</ruby> <ruby>postea</ruby> <ruby>maiorum</ruby>
<ruby>sadipscing</ruby> <ruby>quo</ruby>,
<ruby>ad</ruby> <ruby>illum</ruby> <ruby>percipitur</ruby>
</p>
<p><ruby>Putant</ruby> <ruby>omittam</ruby> <ruby>no</ruby> <ruby>qui</ruby>, <ruby>ei</ruby>
<ruby>sed</ruby> <ruby>esse</ruby> <ruby>saperet</ruby>
</p>
</div>
What I tried was to first get the text from the bodyTxt class and then add the tags, but the problem is that it only adds a tag at the beginning and end of each paragraph.
for textSection in bodyText.stripped_strings:
    RubyTag = soup.new_tag('ruby')
    RubyTag.string = textSection
    textSection.replace_with(RubyTag)
I also tried iterating over the paragraphs, getting each word, and adding the tag like this:
for textSection in bodyText.stripped_strings:
    for word in textSection:
        RubyTag = soup.new_tag('ruby')
        RubyTag.string = word
        textSection.replace_with(RubyTag)
But this raises the error AttributeError: 'str' object has no attribute 'replace_with'
Posted on 2021-06-12 19:51:36
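Going by the documentation, one approach might be to use new_tag() and decompose(). Since you want punctuation handled in separate tags, you can use a regex to generate the content for each new ruby tag. I used the regex by @user3850.
Inside the loop, create a new p tag and append your ruby tags to it; you can then decompose() the original p tag.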
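To see what that regex yields before any tags are built, here is a minimal sketch. The pattern is the one used in the answer's code; the sample string and the name TOKEN_RE are only illustrative assumptions.

```python
import re

# Pattern from the answer: either a run of word characters, or a single
# character that is neither whitespace nor a word character.
TOKEN_RE = re.compile(r"[\w]+|[^\s\w]")

# Hypothetical sample text, shortened from the question's first paragraph.
sample = "Lorem ipsum dolor sit amet, no postea."

tokens = TOKEN_RE.findall(sample)
print(tokens)
# ['Lorem', 'ipsum', 'dolor', 'sit', 'amet', ',', 'no', 'postea', '.']
```

Whitespace is dropped entirely, and every punctuation mark becomes its own token, which is why each one ends up in its own ruby tag.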
from bs4 import BeautifulSoup as bs
import re
html = '''<div class = "bodyTxt">
<p>Lorem ipsum dolor sit amet, no postea maiorum sadipscing quo, ad illum percipitur
inciderint usu. Rebum vidisse apeirian an vel. Vis nostro iudicabit instructior ex, ne eos
facer iudicabit. Nec ludus ridens facete ea, ad vix populo adversarium, te mel meis malis
mundi.
</p>
<p>Putant omittam no qui, ei sed esse saperet. Te alii unum ignota has, vix ei maiestatis
expetendis. Et error iracundia argumentum vim, mel maiestatis delicatissimi ex. Sit altera
vivendo ad, vis dolorem consetetur et, fabulas admodum sadipscing te est. Sit et senserit
consequuntur interpretaris, et sale ornatus consequat has, modus aeque omittantur has te.
</p>
</div>'''
soup = bs(html, 'lxml')
for t in soup.select('.bodyTxt > p'):
    # Build the replacement <p> directly after the original one
    parent = soup.new_tag('p')
    t.insert_after(parent)
    # Each word and each punctuation mark gets its own <ruby> tag
    for i in re.findall(r"[\w]+|[^\s\w]", t.text):
        new_tag = soup.new_tag('ruby')
        new_tag.string = i
        parent.append(new_tag)
    # Remove the original <p>
    t.decompose()
print(soup.prettify())
https://stackoverflow.com/questions/67951593
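If the goal is closer to the desired output in the question, where punctuation and spaces stay outside the ruby tags, a variant might split each text node in place instead of rebuilding the paragraph. This is only a sketch under assumptions: it uses the built-in html.parser rather than lxml, a shortened hypothetical input, and a different regex (WORD_RE) from the one in the answer.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Shortened, hypothetical input in the same shape as the question's HTML.
html = '<div class="bodyTxt"><p>Lorem ipsum, dolor sit.</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Splitting on a capturing group keeps the matches in the result:
# odd indices are words, even indices are the text between them.
WORD_RE = re.compile(r"(\w+)")

for p in soup.select(".bodyTxt > p"):
    # Snapshot the text nodes first, because the loop modifies the tree.
    for node in list(p.find_all(string=True)):
        pieces = WORD_RE.split(str(node))
        for index, piece in enumerate(pieces):
            if not piece:
                continue  # skip empty strings produced by the split
            if index % 2 == 1:
                # A word: wrap it in a new <ruby> tag.
                item = soup.new_tag("ruby")
                item.string = piece
            else:
                # Punctuation or whitespace: keep it as plain text.
                item = NavigableString(piece)
            node.insert_before(item)
        # The original text node has been fully re-emitted; drop it.
        node.extract()

print(soup)
# <div class="bodyTxt"><p><ruby>Lorem</ruby> <ruby>ipsum</ruby>, <ruby>dolor</ruby> <ruby>sit</ruby>.</p></div>
```

Working on NavigableString nodes rather than on the plain strings they contain also avoids the AttributeError from the question, since replace_with, insert_before and extract exist only on nodes that belong to the tree.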