<xml>
<maintag>
<content> lorem ipsum <strong> dolor sit </strong> and so on </content>
</maintag>
</xml>
The XML files I routinely parse may contain markup inside the content tag, as shown above.
Here is how I parse the file:
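from lxml import etree
from StringIO import StringIO

parser = etree.XMLParser(remove_blank_text=False)
tree = etree.parse(StringIO(xmlFile), parser)
for item in tree.iter('maintag'):
    my_content = item.find('content').text
    #print my_content
    #output: lorem ipsum
So the result is my_content = 'lorem ipsum', rather than the full 'lorem ipsum dolor sit and so on' that I would like to see.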
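How can I read the content as 'lorem ipsum dolor sit and so on'?
Note: the content tag may contain some tag other than strong, or it may contain no nested tags at all.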
Posted on 2013-11-06 13:23:32
The text attribute returns only the text that comes before the first child element.
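To see where the remaining text ends up, here is a minimal sketch that prints the text and tail attributes of the elements involved. It is written for Python 3 and falls back to the standard library's xml.etree.ElementTree when lxml is not installed; that fallback, and the names root, content, strong and parts, are assumptions of this sketch rather than part of the original post.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the standard library offers the same find()/text/tail API.
    import xml.etree.ElementTree as etree

root = etree.fromstring(
    "<xml><maintag>"
    "<content> lorem ipsum <strong> dolor sit </strong> and so on </content>"
    "</maintag></xml>"
)
content = root.find("maintag/content")
strong = content.find("strong")

# Each element carries its own leading text plus the text that follows it.
parts = {
    "content.text": content.text,  # text before the first child
    "strong.text": strong.text,    # text inside <strong>
    "strong.tail": strong.tail,    # text after </strong>, still inside <content>
}
for name, value in parts.items():
    print(name, repr(value))
```

The text after the nested tag is stored as the tail of that tag, not on the parent, which is why reading content.text alone stops at ' lorem ipsum '.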
Try one of the following:
>>> from lxml import etree
>>> from StringIO import StringIO
>>> xmlFile = '''
... <xml>
... <maintag>
... <content> lorem ipsum <strong> dolor sit </strong> and so on </content>
... </maintag>
... </xml>
... '''
>>> parser = etree.XMLParser(remove_blank_text=False)
>>> tree = etree.parse(StringIO(xmlFile), parser)
>>> for my_content in tree.xpath('maintag/content//text()'):
...     print my_content
...
lorem ipsum
dolor sit
and so on
Or:
>>> for my_content in tree.find('maintag/content').itertext():
...     print my_content
...
lorem ipsum
dolor sit
and so on
>>> ' '.join(tree.find('maintag/content').itertext())
' lorem ipsum   dolor sit   and so on '
>>> ' '.join(t.strip() for t in tree.find('maintag/content').itertext())
'lorem ipsum dolor sit and so on'
https://stackoverflow.com/questions/19813192
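The session above is Python 2 (print statement, StringIO module). As a rough Python 3 sketch of the same itertext() idea, the helper below collects all text inside an element and normalizes the whitespace. The name content_text is hypothetical, and the fallback to xml.etree.ElementTree when lxml is missing is an assumption of this sketch, not something the answer relies on.

```python
try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: the standard library offers the same find()/itertext() API.
    import xml.etree.ElementTree as etree

XML_DOC = """
<xml>
<maintag>
<content> lorem ipsum <strong> dolor sit </strong> and so on </content>
</maintag>
</xml>
"""


def content_text(element):
    """Return all text inside element, nested tags included, as one clean string."""
    if element is None:  # the tag was not found
        return ""
    # itertext() yields every text and tail string in document order;
    # split()/join() then collapses runs of whitespace to single spaces.
    return " ".join("".join(element.itertext()).split())


root = etree.fromstring(XML_DOC)
for item in root.iter("maintag"):
    print(content_text(item.find("content")))
```

Because the pieces are concatenated before the whitespace is collapsed, the helper behaves the same whether the content tag holds strong, some other tag, or no nested tags at all.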