Python xml Module Study Notes

This content is licensed under the CC 4.0 BY-SA license.


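This post introduces Python's XML processing modules, with a focus on the ElementTree API. It covers generating, parsing, searching, and modifying XML. ElementTree offers a simple and efficient way to work with XML documents: building XML files, parsing XML structure, finding elements, and changing content. The examples show how to use ElementTree for each of these operations, including adding, removing, and modifying elements.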

XML Processing Modules

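Warning: The XML modules are not secure against erroneous or maliciously constructed data. If you need to parse untrusted or unauthenticated data, see the "XML vulnerabilities" and "The defusedxml package" sections of the Python documentation.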

The documentation for the xml.dom and xml.sax packages defines the Python bindings for the DOM and SAX interfaces.

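The XML processing submodules are:

xml.etree.ElementTree: the ElementTree API, a simple and lightweight XML processor. It is easy to use, fast, and light on memory.
xml.dom: the DOM API definition. XML data is parsed into an in-memory tree, and the XML is manipulated by operating on that tree.
xml.dom.minidom: a minimal DOM implementation
xml.dom.pulldom: support for building partial DOM trees
xml.sax: SAX2 base classes and convenience functions. SAX uses an event-driven model: while parsing the XML, the parser fires events and calls user-defined callbacks to process the document.
xml.parsers.expat: the Expat parser binding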
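
To make the difference between the event-driven SAX model and the tree-based DOM model concrete, here is a minimal sketch. The sample document, the `TagCounter` class, and the variable names are assumptions made up for this illustration; only the `xml.sax` and `xml.dom.minidom` calls come from the standard library.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document, used only for this sketch.
SAMPLE = b"<root><item>a</item><item>b</item><other/></root>"


class TagCounter(xml.sax.ContentHandler):
    """SAX handler that counts how many times each tag is opened."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # The parser calls this callback for every opening tag it reads.
        self.counts[name] = self.counts.get(name, 0) + 1


# SAX: no tree is built; the handler only sees a stream of events.
handler = TagCounter()
xml.sax.parseString(SAMPLE, handler)
print(handler.counts)

# DOM: the whole document is loaded into a tree that can be queried afterwards.
doc = minidom.parseString(SAMPLE)
item_texts = [node.firstChild.data for node in doc.getElementsByTagName("item")]
print(item_texts)
```

SAX tends to suit large documents that only need one pass, while DOM (and ElementTree) are more convenient when the document has to be navigated or modified.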

1. ElementTree

Documentation

1.1 Generating XML

# Element builds a node, SubElement attaches a child, ElementTree wraps the root for writing
from xml.etree.ElementTree import Element, SubElement, ElementTree

# generate root node
root = Element('root')

# generate first child-node head
head = SubElement(root, 'first-sub')
head.set('a1', '123')
# child-node of head node
title = SubElement(head, 'title')
title.text = "This is first subelement!"

# generate second child-node body
body = SubElement(root, 'second-sub')
body.text = "This is second subelement!"

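# wrap the root in a tree and write it to the file that section 1.2 parses
tree = ElementTree(root)
tree.write('example.xml')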

The generated file has the following structure:

<root><first-sub a1="123"><title>This is first subelement!</title></first-sub><second-sub>This is second subelement!</second-sub></root>
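
As a quick way to check the serialized form without touching the file system, the same tree can be rendered with `ET.tostring()` or written into an in-memory buffer. This is a minimal sketch; the `flat` and `buffer` names are assumptions for illustration, and the XML declaration is requested explicitly because `write()` does not emit one by default.

```python
import io
import xml.etree.ElementTree as ET

# Rebuild the tree from section 1.1.
root = ET.Element('root')
head = ET.SubElement(root, 'first-sub')
head.set('a1', '123')
title = ET.SubElement(head, 'title')
title.text = "This is first subelement!"
body = ET.SubElement(root, 'second-sub')
body.text = "This is second subelement!"

# Serialize to a str instead of a file.
flat = ET.tostring(root, encoding='unicode')
print(flat)

# write() also accepts a binary file object; ask for the XML declaration explicitly.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding='utf-8', xml_declaration=True)
print(buffer.getvalue().decode('utf-8'))
```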

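XML indentation
Every element has a tail attribute that holds the text following its closing tag (and a text attribute that holds the text before its first child). To indent the output, check whether the next node is a child or a sibling and set the whitespace to match the nesting depth, for example
node.tail = '\n\t' or node.tail = '\n\t\t'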

<root>
	<first-sub a1="123">
		<title>This is first subelement!</title>
	</first-sub>
	<second-sub>This is second subelement!</second-sub>
</root>
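
The tail-based approach described above can be sketched as a small recursive helper. `indent_tree` is a hypothetical name written for this note, not a library function, and it overwrites any existing tail text. On Python 3.9 and later the standard library ships `ET.indent()`, which does the same job.

```python
import xml.etree.ElementTree as ET

FLAT = ('<root><first-sub a1="123"><title>This is first subelement!</title></first-sub>'
        '<second-sub>This is second subelement!</second-sub></root>')


def indent_tree(elem, level=0, space="\t"):
    """Set text/tail whitespace so each child sits one level deeper than its parent."""
    children = list(elem)
    if not children:
        return
    # Whitespace before the first child: newline plus the child's indentation.
    if not (elem.text and elem.text.strip()):
        elem.text = "\n" + space * (level + 1)
    for child in children:
        indent_tree(child, level + 1, space)
        # Whitespace after a child lines up the next sibling.
        child.tail = "\n" + space * (level + 1)
    # The last child is followed by the parent's closing tag, one level up.
    children[-1].tail = "\n" + space * level


root = ET.fromstring(FLAT)
indent_tree(root)
pretty = ET.tostring(root, encoding="unicode")
print(pretty)

# Built-in alternative, available from Python 3.9.
if hasattr(ET, "indent"):
    other = ET.fromstring(FLAT)
    ET.indent(other, space="\t")
    print(ET.tostring(other, encoding="unicode"))
```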

1.2 Parsing XML

import xml.etree.ElementTree as ET
tree = ET.parse('example.xml')
root = tree.getroot()
# or parse directly from a string (example_as_string holds the XML text)
root = ET.fromstring(example_as_string)

Parsing the XML generated in 1.1:

>>> root.tag
'root'
>>> root.attrib
{}
>>> for child in root:
...     print(child.tag, child.attrib)
...
first-sub {'a1': '123'}
second-sub {}
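
The session above can be reproduced without a file on disk. The sketch below is self-contained; `example_as_string` is filled with the document from 1.1, and `io.StringIO` stands in for a real file so that `ET.parse()` can be shown alongside `ET.fromstring()`.

```python
import io
import xml.etree.ElementTree as ET

example_as_string = (
    '<root><first-sub a1="123"><title>This is first subelement!</title></first-sub>'
    '<second-sub>This is second subelement!</second-sub></root>'
)

# fromstring() returns the root Element directly.
root = ET.fromstring(example_as_string)
print(root.tag)          # tag is an attribute, not a method
print(root[0].attrib)    # attribute values are always strings
print(root[0][0].text)   # children can be reached by index
print(root[1].text)

# parse() takes a filename or file object and returns an ElementTree.
tree = ET.parse(io.StringIO(example_as_string))
print(tree.getroot().tag)
```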

1.3 Finding elements

Take the following XML as an example:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

Element.iter() iterates recursively over the whole subtree below an element (its children, their children, and so on):

>>> for neighbor in root.iter('neighbor'):
...     print(neighbor.attrib)
...
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}
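
A self-contained sketch of the same idea, using a shortened copy of the country data (the `COUNTRY_XML` constant is an assumption for this illustration). It also shows that calling `iter()` without a tag walks every element in document order, starting with the element itself.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample data, for illustration only.
COUNTRY_XML = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>
"""

root = ET.fromstring(COUNTRY_XML)

# iter('neighbor') finds matching elements at any depth.
neighbor_names = [n.get('name') for n in root.iter('neighbor')]
print(neighbor_names)

# iter() with no tag yields every element, beginning with root itself.
all_tags = [elem.tag for elem in root.iter()]
print(all_tags)
```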

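Element.findall() finds only elements with the given tag that are direct children of the current element.
Element.find() finds the first child with a particular tag; Element.text then gives the element's text content, and Element.get() gives its attributes: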

>>> for country in root.findall('country'):
...     rank = country.find('rank').text
...     name = country.get('name')
...     print(name, rank)
...
Liechtenstein 1
Singapore 4
Panama 68
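
Because a plain tag name only matches direct children, reaching deeper elements with `findall()` needs a path expression. The sketch below uses ElementTree's limited XPath support on a shortened copy of the data; the `population` tag is a made-up name used to show what happens when nothing matches.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample data, for illustration only.
COUNTRY_XML = """
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>
"""

root = ET.fromstring(COUNTRY_XML)

# neighbor elements are grandchildren of root, so a plain tag finds nothing.
direct = root.findall('neighbor')
# './/' searches the whole subtree.
nested = root.findall('.//neighbor')
print(len(direct), len(nested))

# A predicate selects by attribute value.
singapore = root.find("./country[@name='Singapore']")
print(singapore.findtext('rank'))

# find() returns None when nothing matches; findtext() accepts a default.
missing = singapore.find('population')
fallback = singapore.findtext('population', default='n/a')
print(missing, fallback)
```

Checking the result of `find()` for `None` before reading `.text` avoids an `AttributeError` on documents where an element is optional.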

1.4 Modifying XML

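A modified document is written back to a file with the ElementTree.write() method.
Once created, an Element object can be manipulated directly: change its text through Element.text, add or modify attributes with Element.set(), and add new children with Element.append().

Update the rank of every country and add an updated attribute: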

>>> for rank in root.iter('rank'):
...     new_rank = int(rank.text) + 1
...     rank.text = str(new_rank)
...     rank.set('updated', 'yes')
...
>>> tree.write('output.xml')

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
    <!-- rank incremented by one, `updated` attribute added -->
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank updated="yes">69</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
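
The rank example covers Element.text and Element.set(); the sketch below adds Element.append() as well. It runs on a one-country document made up for this illustration and serializes to a string rather than a file.

```python
import xml.etree.ElementTree as ET

# Minimal made-up document, for illustration only.
root = ET.fromstring('<data><country name="Singapore"><rank>4</rank></country></data>')
country = root.find('country')

# Change text and attributes in place.
rank = country.find('rank')
rank.text = str(int(rank.text) + 1)
rank.set('updated', 'yes')

# Build a new element and attach it as the last child.
year = ET.Element('year')
year.text = '2011'
country.append(year)

result = ET.tostring(root, encoding='unicode')
print(result)
```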

Use Element.remove() to delete elements:

>>> for country in root.findall('country'):
...     # using root.findall() to avoid removal during traversal
...     rank = int(country.find('rank').text)
...     # delete countries with rank higher than 50
...     if rank > 50:
...         root.remove(country)
...
>>> tree.write('output.xml')

<data>
    <country name="Liechtenstein">
        <rank updated="yes">2</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank updated="yes">5</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>
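
The comment about using root.findall() matters because removing children while iterating over the element itself can skip items, just as with a Python list. The sketch below contrasts the two loops on a small made-up document; the country names A, B, C and the helper `remaining_names` are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET

# Made-up data: A and B rank above 50, C does not.
SAMPLE = (
    '<data>'
    '<country name="A"><rank>60</rank></country>'
    '<country name="B"><rank>70</rank></country>'
    '<country name="C"><rank>1</rank></country>'
    '</data>'
)


def remaining_names(root):
    """Return the names of the countries still attached to root."""
    return [c.get('name') for c in root.findall('country')]


# Safe: findall() returns a separate list, so removal does not disturb the loop.
safe_root = ET.fromstring(SAMPLE)
for country in safe_root.findall('country'):
    if int(country.find('rank').text) > 50:
        safe_root.remove(country)
safe_names = remaining_names(safe_root)
print(safe_names)

# Unsafe: iterating over the element while removing from it.
# After A is removed, B shifts into its slot and is never visited.
unsafe_root = ET.fromstring(SAMPLE)
for country in unsafe_root:
    if int(country.find('rank').text) > 50:
        unsafe_root.remove(country)
unsafe_names = remaining_names(unsafe_root)
print(unsafe_names)
```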