How to Create a Pandas DataFrame from Nested XML


Creating a Pandas DataFrame from nested XML can be broken down into the following steps:

  1. Import the required libraries
import xml.etree.ElementTree as ET
import pandas as pd
  2. Parse the XML file
tree = ET.parse('file.xml')
root = tree.getroot()

Here we assume that file.xml contains the following XML data:

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
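
For experimenting without a file on disk, the same document can be parsed from a string. The following is a minimal sketch: the variable name XML_TEXT is only illustrative, and ET.fromstring() returns the root element directly, so no getroot() call is needed.

```python
import xml.etree.ElementTree as ET

# Illustrative name: the sample document held in a string instead of file.xml
XML_TEXT = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>"""

# fromstring() parses the text and returns the root element (<data>)
root = ET.fromstring(XML_TEXT)

# Iterating over an element yields its direct children
country_names = [child.get("name") for child in root]
print(root.tag, country_names)
```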
  3. Extract the data needed for the DataFrame

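We can use either iter() or findall() to extract the data; which one fits depends on the structure of the XML. findall() matches only the direct children of the element it is called on, so when the repeated tags sit directly under the root, as the country elements do here, it collects every one of them: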

data = []
for country in root.findall('country'):
    rank = country.find('rank').text
    name = country.get('name')
    year = country.find('year').text
    gdppc = country.find('gdppc').text
    neighbors = [neighbor.get('name') for neighbor in country.findall('neighbor')]
    data.append((rank, name, year, gdppc, neighbors))
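
One caveat with this loop is that find() returns None for a missing tag, so accessing .text on the result raises AttributeError when a country lacks one of the children. A possible way to tolerate such gaps is findtext(), which returns None instead. The helper name country_to_row and the incomplete sample element below are hypothetical, used only for illustration.

```python
import xml.etree.ElementTree as ET


def country_to_row(country):
    """Build one row tuple from a <country> element (hypothetical helper).

    findtext() returns None when the child tag is absent, so an
    incomplete element does not raise AttributeError.
    """
    return (
        country.findtext("rank"),
        country.get("name"),
        country.findtext("year"),
        country.findtext("gdppc"),
        [n.get("name") for n in country.findall("neighbor")],
    )


# Made-up element that lacks <year>, <gdppc> and any <neighbor>
sample = ET.fromstring('<country name="Nowhere"><rank>7</rank></country>')
row = country_to_row(sample)
print(row)
```

With the tree from the article, the loop body would then reduce to data.append(country_to_row(country)).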

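iter(), by contrast, walks the whole subtree recursively, so it also finds country elements nested at any depth. The loop below visits every country (not just the first one) and extracts the same values; it does not store them, so add a data.append(...) call as in the previous loop if the rows should be collected: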

for child in root.iter('country'):
    rank = child.find('rank').text
    name = child.get('name')
    year = child.find('year').text
    gdppc = child.find('gdppc').text
    neighbors = [neighbor.get('name') for neighbor in child.findall('neighbor')]
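
The difference between the two methods only shows up when matching elements sit at different depths. The small document below is made up for illustration (it is not the article's file.xml) and places one country directly under the root and another inside a region element.

```python
import xml.etree.ElementTree as ET

# Made-up document: country "B" is nested one level deeper than country "A"
nested = ET.fromstring(
    "<data>"
    "<country name='A'/>"
    "<region><country name='B'/></region>"
    "</data>"
)

# findall('country') only looks at direct children of <data>
direct = [c.get("name") for c in nested.findall("country")]

# iter('country') searches the whole subtree in document order
anywhere = [c.get("name") for c in nested.iter("country")]

print(direct)
print(anywhere)
```

For the sample data in this article both methods return the same three elements, because every country is a direct child of the root.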
  4. Convert the data into a DataFrame
df = pd.DataFrame(data, columns=['Rank', 'Name', 'Year', 'GDP Per Capita', 'Neighbors'])
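
Because element text is always read as a string, the Rank, Year and GDP Per Capita columns end up with object dtype. A minimal sketch of converting them to numbers with pd.to_numeric follows; the data list is written out by hand here so the example runs on its own, and it mirrors what the extraction loop produces for the sample XML.

```python
import pandas as pd

# Rows as the extraction loop would produce them for the sample XML
data = [
    ("1", "Liechtenstein", "2008", "141100", ["Austria", "Switzerland"]),
    ("4", "Singapore", "2011", "59900", ["Malaysia"]),
    ("68", "Panama", "2011", "13600", ["Costa Rica", "Colombia"]),
]
df = pd.DataFrame(
    data, columns=["Rank", "Name", "Year", "GDP Per Capita", "Neighbors"]
)

# Convert the string columns to numeric dtypes, one column at a time
numeric_cols = ["Rank", "Year", "GDP Per Capita"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)

print(df.dtypes)
```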

The complete code snippet is as follows:

import xml.etree.ElementTree as ET
import pandas as pd

# Parse the file and get the root <data> element
tree = ET.parse('file.xml')
root = tree.getroot()

# Build one tuple per <country>; element text is read as strings
data = []
for country in root.findall('country'):
    rank = country.find('rank').text
    name = country.get('name')
    year = country.find('year').text
    gdppc = country.find('gdppc').text
    # Collect the names of all nested <neighbor> elements into a list
    neighbors = [neighbor.get('name') for neighbor in country.findall('neighbor')]
    data.append((rank, name, year, gdppc, neighbors))

df = pd.DataFrame(data, columns=['Rank', 'Name', 'Year', 'GDP Per Capita', 'Neighbors'])
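
The code above keeps the nested neighbor elements as a Python list in a single cell. If a flat table is preferred, one alternative is to emit one row per (country, neighbor) pair, which also keeps the direction attribute. This is only a sketch: the column names Country, Neighbor and Direction are assumptions, and a shortened copy of the sample XML is embedded so the example is self-contained.

```python
import xml.etree.ElementTree as ET
import pandas as pd

# Shortened copy of the sample data, reduced to the nested part
XML_TEXT = """<data>
    <country name="Liechtenstein">
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

root = ET.fromstring(XML_TEXT)

# One dictionary per nested <neighbor>, repeating the parent country's name
rows = [
    {
        "Country": country.get("name"),
        "Neighbor": neighbor.get("name"),
        "Direction": neighbor.get("direction"),
    }
    for country in root.findall("country")
    for neighbor in country.findall("neighbor")
]

neighbors_df = pd.DataFrame(rows)
print(neighbors_df)
```

A table in this shape can be joined back to the per-country DataFrame on the country name when both views are needed.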