How to Efficiently Convert pandas DataFrames and Series to Python Lists, and How to Fix Common Problems

威震华夏关云长 · Posted on 2025-9-12 23:10:01

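Introduction

pandas is one of the most popular data analysis libraries in Python, providing powerful data structures and analysis tools. DataFrame and Series are its two most commonly used data structures. In some cases, however, they need to be converted to native Python lists, for example to integrate with other libraries or to perform operations that expect plain Python objects. This article describes how to convert DataFrames and Series to Python lists efficiently, and how to solve the problems that commonly come up along the way.

A brief overview of pandas data structures

Before looking at the conversion methods, here is a brief review of the two main pandas data structures:

DataFrame

A DataFrame is pandas' two-dimensional tabular data structure. It can be thought of as a dictionary of Series that share one index. It has both a row index and a column index, and each column can hold a different data type.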

import pandas as pd

# Create a simple DataFrame
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})
print(df)

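Series

A Series is pandas' one-dimensional labeled array, similar to a NumPy array with labels. It consists of a sequence of values and an associated set of labels (the index).

# Create a simple Series
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print(s)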

Converting a DataFrame to a list

Converting an entire DataFrame to a list

There are several ways to convert an entire DataFrame to a list:

# Convert the entire DataFrame to a list of lists (one inner list per row)
list_of_lists = df.values.tolist()
print(list_of_lists)
# Output: [['Alice', 25, 'New York'], ['Bob', 30, 'Los Angeles'], ['Charlie', 35, 'Chicago'], ['David', 40, 'Houston']]

# Convert to a NumPy array with to_numpy(), then convert that to a list
list_of_lists = df.to_numpy().tolist()
print(list_of_lists)
# Output: [['Alice', 25, 'New York'], ['Bob', 30, 'Los Angeles'], ['Charlie', 35, 'Chicago'], ['David', 40, 'Houston']]
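
Both methods above drop the column names. If the receiving code expects a header row (for example when writing CSV-style output), the names can be prepended by hand. The following is a minimal sketch; the variable names header, rows and table are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# Column labels as a plain list
header = df.columns.tolist()

# Row data as a list of lists
rows = df.to_numpy().tolist()

# Header row followed by the data rows
table = [header] + rows
print(table[0])  # ['Name', 'Age', 'City']
print(table[1])  # ['Alice', 25, 'New York']
```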

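Converting DataFrame columns to lists

# Convert a single column to a list
name_list = df['Name'].tolist()
print(name_list)
# Output: ['Alice', 'Bob', 'Charlie', 'David']

# Or use dot notation (only works when the column name is a valid Python
# identifier and does not clash with an existing DataFrame attribute or method)
age_list = df.Age.tolist()
print(age_list)
# Output: [25, 30, 35, 40]

# Select a column by position and convert it to a list
city_list = df.iloc[:, 2].tolist()  # third column (position 2)
print(city_list)
# Output: ['New York', 'Los Angeles', 'Chicago', 'Houston']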

Converting DataFrame rows to lists

# Use iterrows() to convert each row to a list
row_lists = []
for index, row in df.iterrows():
    row_lists.append(row.tolist())
print(row_lists)
# Output: [['Alice', 25, 'New York'], ['Bob', 30, 'Los Angeles'], ['Charlie', 35, 'Chicago'], ['David', 40, 'Houston']]

# Use itertuples() to get each row as a namedtuple, then convert it to a list
row_lists = [list(row) for row in df.itertuples(index=False)]
print(row_lists)
# Output: [['Alice', 25, 'New York'], ['Bob', 30, 'Los Angeles'], ['Charlie', 35, 'Chicago'], ['David', 40, 'Houston']]

# Use values with a list comprehension to convert each row to a list
row_lists = [list(row) for row in df.values]
print(row_lists)
# Output: [['Alice', 25, 'New York'], ['Bob', 30, 'Los Angeles'], ['Charlie', 35, 'Chicago'], ['David', 40, 'Houston']]

Converting a Series to a list

Basic conversion methods

# Convert a Series to a list
s_list = s.tolist()
print(s_list)
# Output: [10, 20, 30, 40]

# Use the list() constructor to convert a Series to a list
s_list = list(s)
print(s_list)
# Output: [10, 20, 30, 40]

# Go through the values attribute to convert a Series to a list
s_list = s.values.tolist()
print(s_list)
# Output: [10, 20, 30, 40]
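
The three methods above all return native Python integers. One variation does not: calling list() directly on the underlying NumPy array keeps NumPy scalar objects, which some consumers such as the standard json module reject. The sketch below illustrates the difference; the names native, boxed and boxed_is_serializable are made up for this example.

```python
import json

import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# tolist() converts every element to a native Python int
native = s.tolist()

# list() over the NumPy array keeps NumPy integer scalars
boxed = list(s.to_numpy())

print(type(native[0]) is int)             # True
print(isinstance(boxed[0], np.integer))   # True

# Native ints serialize without trouble
print(json.dumps(native))  # [10, 20, 30, 40]

# NumPy scalars are not understood by the json module
try:
    json.dumps(boxed)
    boxed_is_serializable = True
except TypeError:
    boxed_is_serializable = False
print(boxed_is_serializable)  # False
```

When the list is going to leave pandas and NumPy, tolist() is therefore usually the safer choice.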

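Conversion that preserves the index

If the index information needs to be kept during conversion:

# Convert the Series to a list of (index, value) tuples.
# s.items() yields native Python scalars; zip(s.index, s.values) would yield NumPy scalars instead.
indexed_list = list(s.items())
print(indexed_list)
# Output: [('a', 10), ('b', 20), ('c', 30), ('d', 40)]

# Or convert to a list of dictionaries
dict_list = [{'index': idx, 'value': val} for idx, val in s.items()]
print(dict_list)
# Output: [{'index': 'a', 'value': 10}, {'index': 'b', 'value': 20}, {'index': 'c', 'value': 30}, {'index': 'd', 'value': 40}]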
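
Two built-in alternatives may also be worth knowing when the index has to travel with the values: to_dict() and reset_index(). The sketch below shows both and a simple round trip back to a Series; the names as_dict, pairs and restored are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Index labels become dictionary keys
as_dict = s.to_dict()
print(as_dict)  # {'a': 10, 'b': 20, 'c': 30, 'd': 40}

# reset_index() turns the index into a regular column,
# so each row becomes an [index, value] pair
pairs = s.reset_index().values.tolist()
print(pairs)  # [['a', 10], ['b', 20], ['c', 30], ['d', 40]]

# The dictionary form is enough to rebuild the original Series
restored = pd.Series(as_dict)
print(restored.equals(s))  # True
```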

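Efficient conversion techniques

Using the to_numpy() method

to_numpy() is the interface the pandas documentation recommends over the values attribute. For a typical DataFrame the two return the same array and perform about the same, so the reason to prefer to_numpy() is its explicit dtype, copy and na_value parameters rather than raw speed:

import numpy as np

# Create a large DataFrame
large_df = pd.DataFrame(np.random.rand(100000, 5), columns=['A', 'B', 'C', 'D', 'E'])

# Note: %timeit is an IPython/Jupyter magic; in a plain script use the timeit module instead
# Convert to a NumPy array with to_numpy(), then to a list
%timeit large_list = large_df.to_numpy().tolist()
# Output: average execution time (the exact figure depends on hardware)

# Convert to a list through the values attribute
%timeit large_list = large_df.values.tolist()
# Output: average execution time (typically about the same as to_numpy())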
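
As an illustration of what to_numpy() offers beyond values, the sketch below uses its dtype and na_value parameters on two small Series. The sample data and the names prices, counts, filled and as_float are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value
prices = pd.Series([1.5, np.nan, 3.0])

# na_value replaces missing entries while the array is being built
filled = prices.to_numpy(dtype=float, na_value=0.0).tolist()
print(filled)  # [1.5, 0.0, 3.0]

# dtype forces the element type of the resulting array
counts = pd.Series([1, 2, 3])
as_float = counts.to_numpy(dtype=float).tolist()
print(as_float)  # [1.0, 2.0, 3.0]

# The original Series is left untouched
print(int(prices.isna().sum()))  # 1
```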

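Using list comprehensions

A list comprehension is more concise, and usually somewhat faster, than an explicit loop that appends to a list:

# Use a list comprehension to convert every column to a list
column_lists = [df[col].tolist() for col in df.columns]

# Or use a dictionary comprehension to keep the column names
column_dict = {col: df[col].tolist() for col in df.columns}
print(column_dict)
# Output: {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 35, 40], 'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']}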
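
For the column-name-to-list mapping, pandas also has a built-in that gives the same result as the dictionary comprehension: to_dict(orient='list'). A short sketch follows; the variable names simply mirror the ones used above.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# One dictionary entry per column, each value a plain list
column_dict = df.to_dict(orient='list')
print(column_dict['Age'])  # [25, 30, 35, 40]

# The per-column lists without the names, in column order
column_lists = list(column_dict.values())
print(column_lists[0])  # ['Alice', 'Bob', 'Charlie', 'David']
```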

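Using itertuples() for row iteration

For row iteration, itertuples() is more efficient than iterrows(), because iterrows() has to build a new Series for every row:

# Row iteration with itertuples() (%timeit requires IPython/Jupyter)
%timeit row_tuples = [tuple(row) for row in df.itertuples(index=False)]
# Output: average execution time

# Row iteration with iterrows()
%timeit row_tuples = [tuple(row) for index, row in df.iterrows()]
# Output: average execution time (usually slower than itertuples())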
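
Speed is not the only difference. iterrows() packs each row into a single Series, so a row that mixes integers and floats is upcast to float, while itertuples() keeps each column's own type. The sketch below shows this on a small, made-up DataFrame (df_num and its column names are hypothetical).

```python
import pandas as pd

# Hypothetical data: one integer column and one float column
df_num = pd.DataFrame({'qty': [1, 2], 'ratio': [0.5, 1.5]})

# iterrows(): each row is a float64 Series, so qty becomes a float
rows_iterrows = [row.tolist() for _, row in df_num.iterrows()]
print(rows_iterrows)  # [[1.0, 0.5], [2.0, 1.5]]

# itertuples(): values keep their per-column types
rows_itertuples = [list(row) for row in df_num.itertuples(index=False)]
print(rows_itertuples)  # [[1, 0.5], [2, 1.5]]
```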

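Using the apply method

For more complex row-wise transformations you can use apply. Note that apply with axis=1 calls the function once per row, so for a plain conversion it is typically much slower than values.tolist():

# Use apply to convert each row to a list
row_lists = df.apply(lambda row: row.tolist(), axis=1).tolist()
print(row_lists)
# Output: [['Alice', 25, 'New York'], ['Bob', 30, 'Los Angeles'], ['Charlie', 35, 'Chicago'], ['David', 40, 'Houston']]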

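Common problems and solutions

Problem 1: Handling missing values

Missing values (NaN) in a DataFrame or Series are carried into the list as float nan. This can cause trouble downstream: nan does not compare equal to itself, and a single NaN forces an otherwise integer column to float.

# Create a DataFrame that contains missing values
df_with_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, np.nan]
})

# Convert directly to a list
list_with_nan = df_with_nan.values.tolist()
print(list_with_nan)
# Output: [[1.0, 5.0, 9.0], [2.0, nan, 10.0], [nan, 7.0, 11.0], [4.0, 8.0, nan]]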

Solutions:

# Method 1: fill missing values with fillna()
filled_list = df_with_nan.fillna(0).values.tolist()
print(filled_list)
# Output: [[1.0, 5.0, 9.0], [2.0, 0.0, 10.0], [0.0, 7.0, 11.0], [4.0, 8.0, 0.0]]

# Method 2: drop rows that contain missing values with dropna()
dropped_list = df_with_nan.dropna().values.tolist()
print(dropped_list)
# Output: [[1.0, 5.0, 9.0]]

# Method 3: handle missing values after converting to a list
import math
processed_list = [[0 if math.isnan(x) else x for x in row] for row in df_with_nan.values.tolist()]
print(processed_list)
# Output: [[1.0, 5.0, 9.0], [2.0, 0, 10.0], [0, 7.0, 11.0], [4.0, 8.0, 0]]
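
Filling with 0 is not always appropriate. When the list is headed for JSON, a common alternative is to map missing values to None, which json.dumps writes as null (a bare NaN is not valid JSON). The following is a sketch of that idea using a trimmed-down version of the DataFrame above; rows and payload are illustrative names.

```python
import json

import numpy as np
import pandas as pd

df_with_nan = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8]
})

# Replace every missing value with None after converting to a list
rows = [
    [None if pd.isna(x) else x for x in row]
    for row in df_with_nan.to_numpy().tolist()
]
print(rows)  # [[1.0, 5.0], [2.0, None], [None, 7.0], [4.0, 8.0]]

# None is serialized as JSON null
payload = json.dumps(rows)
print(payload)  # [[1.0, 5.0], [2.0, null], [null, 7.0], [4.0, 8.0]]
```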

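Problem 2: Performance with large data sets

For a large DataFrame, converting to a list can consume a lot of memory and time, because every value becomes a separate Python object.

Solutions:

# Create a large DataFrame
large_df = pd.DataFrame(np.random.rand(1000000, 5), columns=['A', 'B', 'C', 'D', 'E'])

# Method 1: process in chunks so that only one chunk exists as a list at a time
chunk_size = 100000
rows_processed = 0
for i in range(0, len(large_df), chunk_size):
    chunk_rows = large_df.iloc[i:i+chunk_size].values.tolist()
    rows_processed += len(chunk_rows)  # replace with the real per-chunk work

# Method 2: use a generator expression (suited to row-by-row processing)
row_generator = (list(row) for row in large_df.itertuples(index=False))

# Consume the generator row by row; the full list of lists is never built
# (the DataFrame itself still stays in memory)
for i, row in enumerate(row_generator):
    if i < 3:  # print only the first 3 rows as an example
        print(row)
    else:
        break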
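
The two ideas can be combined: a generator function that yields one chunk of rows at a time. The sketch below assumes a hypothetical helper named iter_row_chunks and uses a tiny DataFrame so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd


def iter_row_chunks(frame, chunk_size):
    """Yield successive lists of rows, each at most chunk_size rows long."""
    for start in range(0, len(frame), chunk_size):
        yield frame.iloc[start:start + chunk_size].to_numpy().tolist()


# Small demo frame: 5 rows, 2 columns, values 0..9
demo_df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['A', 'B'])

# Chunks of 2 rows: two full chunks and one partial chunk
chunk_lengths = [len(chunk) for chunk in iter_row_chunks(demo_df, 2)]
print(chunk_lengths)  # [2, 2, 1]

first_chunk = next(iter_row_chunks(demo_df, 2))
print(first_chunk)  # [[0, 1], [2, 3]]
```

Because each chunk is produced on demand, the caller decides whether to keep it or discard it after processing.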

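Problem 3: Handling a MultiIndex

For a DataFrame or Series with a multi-level index, values.tolist() discards the index, so extra care is needed if the index structure has to be preserved.

# Create a DataFrame with a MultiIndex
index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['letter', 'number'])
multi_df = pd.DataFrame({'X': [10, 20, 30, 40], 'Y': [50, 60, 70, 80]}, index=index)
print(multi_df)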

Solutions:

# Method 1: turn the index levels into columns, then convert to a list
multi_df_reset = multi_df.reset_index()
list_with_index = multi_df_reset.values.tolist()
print(list_with_index)
# Output: [['A', 1, 10, 50], ['A', 2, 20, 60], ['B', 1, 30, 70], ['B', 2, 40, 80]]

# Method 2: iterate and prepend the index levels explicitly
indexed_list = []
for idx, row in multi_df.iterrows():
    indexed_list.append([idx[0], idx[1]] + row.tolist())
print(indexed_list)
# Output: [['A', 1, 10, 50], ['A', 2, 20, 60], ['B', 1, 30, 70], ['B', 2, 40, 80]]
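
If the index levels should stay grouped as a key rather than being flattened into the row, the index itself can be converted with tolist(), which gives one tuple per row for a MultiIndex. The sketch below pairs those tuples with the row values; keys, keyed_rows and by_key are illustrative names.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)],
    names=['letter', 'number']
)
multi_df = pd.DataFrame({'X': [10, 20, 30, 40], 'Y': [50, 60, 70, 80]}, index=index)

# Each MultiIndex entry becomes a tuple of its level values
keys = multi_df.index.tolist()
print(keys)  # [('A', 1), ('A', 2), ('B', 1), ('B', 2)]

# Pair every key tuple with its row of values
keyed_rows = list(zip(keys, multi_df.to_numpy().tolist()))
print(keyed_rows[0])  # (('A', 1), [10, 50])

# A dictionary makes lookups by index key straightforward
by_key = dict(keyed_rows)
print(by_key[('B', 2)])  # [40, 80]
```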

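Problem 4: Handling data type conversion issues

values and to_numpy() return a single array with one dtype, so a DataFrame with mixed column types is first converted to a common dtype (object in the example below). The element types in the resulting list may therefore not be what you expect.

# Create a DataFrame with several different data types
mixed_df = pd.DataFrame({
    'int_col': [1, 2, 3, 4],
    'float_col': [1.1, 2.2, 3.3, 4.4],
    'str_col': ['a', 'b', 'c', 'd'],
    'bool_col': [True, False, True, False]
})

# Convert directly to a list
mixed_list = mixed_df.values.tolist()
print(mixed_list)
# Output: [[1, 1.1, 'a', True], [2, 2.2, 'b', False], [3, 3.3, 'c', True], [4, 4.4, 'd', False]]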

Solutions:

# Method 1: make each column's type explicit before converting
typed_list = mixed_df.astype({
    'int_col': int,
    'float_col': float,
    'str_col': str,
    'bool_col': bool
}).values.tolist()
print(typed_list)
# Output: [[1, 1.1, 'a', True], [2, 2.2, 'b', False], [3, 3.3, 'c', True], [4, 4.4, 'd', False]]

# Method 2: fix the types after converting to a list
processed_list = []
for row in mixed_df.values.tolist():
    processed_row = [
        int(row[0]),
        float(row[1]),
        str(row[2]),
        bool(row[3])
    ]
    processed_list.append(processed_row)
print(processed_list)
# Output: [[1, 1.1, 'a', True], [2, 2.2, 'b', False], [3, 3.3, 'c', True], [4, 4.4, 'd', False]]
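
The type problem is easiest to see with purely numeric columns: an integer column next to a float column comes out as floats when the whole frame goes through one array. Converting column by column and zipping the results back into rows is one way to keep each column's type. The sketch below uses a small, hypothetical num_df.

```python
import pandas as pd

# Hypothetical data: one integer column and one float column
num_df = pd.DataFrame({'int_col': [1, 2, 3], 'float_col': [1.1, 2.2, 3.3]})

# Whole-frame conversion: the common dtype is float64, so ints become floats
upcast_rows = num_df.to_numpy().tolist()
print(upcast_rows)  # [[1.0, 1.1], [2.0, 2.2], [3.0, 3.3]]

# Column-by-column conversion keeps each column's own type
typed_columns = [num_df[col].tolist() for col in num_df.columns]

# zip(*...) transposes the column lists back into rows
typed_rows = [list(row) for row in zip(*typed_columns)]
print(typed_rows)  # [[1, 1.1], [2, 2.2], [3, 3.3]]
```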

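Practical scenarios

Scenario 1: Preparing data for visualization

Python visualization libraries such as Matplotlib and Seaborn accept plain lists as input, which is convenient when the plotting code should not depend on pandas objects.

import matplotlib.pyplot as plt

# Prepare the data
x = df['Age'].tolist()
y = df['Name'].tolist()

# Create a horizontal bar chart
plt.barh(y, x)
plt.xlabel('Age')
plt.ylabel('Name')
plt.title('Age by Name')
plt.show()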

Scenario 2: Interacting with a Web API

When data has to be sent to a Web API, the DataFrame usually needs to be converted to JSON, and lists are one of JSON's basic building blocks.

import json

# Convert the DataFrame to a list of dictionaries, then to JSON
json_data = df.to_dict(orient='records')
json_string = json.dumps(json_data)
print(json_string)
# Output: [{"Name": "Alice", "Age": 25, "City": "New York"}, {"Name": "Bob", "Age": 30, "City": "Los Angeles"}, {"Name": "Charlie", "Age": 35, "City": "Chicago"}, {"Name": "David", "Age": 40, "City": "Houston"}]
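
pandas can also produce the JSON string directly with to_json(orient='records'), skipping the intermediate list. The sketch below checks that decoding that string gives the same records as to_dict(orient='records'); decoded is an illustrative name.

```python
import json

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# Serialize straight to a JSON string, one object per row
json_string = df.to_json(orient='records')

# Decode it again to confirm it matches the list-of-dicts form
decoded = json.loads(json_string)
print(decoded[0])  # {'Name': 'Alice', 'Age': 25, 'City': 'New York'}
print(decoded == df.to_dict(orient='records'))  # True
```

The list-of-dicts route is still useful when the records need to be modified in Python before they are serialized.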

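Scenario 3: Preparing features for machine learning

In machine learning, features usually have to be passed to the model as lists or NumPy arrays.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Prepare the features and the target variable.
# The sample df has only one numeric column, so Age serves as both feature and
# target here purely to show the expected list shapes.
X = df[['Age']].values.tolist()  # features: a list of single-element lists
y = df['Age'].tolist()  # target: a flat list

# Split the data set; test_size=0.5 leaves two test rows, because R^2 is not
# well defined on a single sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate the model
score = model.score(X_test, y_test)
print(f"Model R^2 score: {score}")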

Scenario 4: Database operations

When working with a database, data often has to be supplied as a list of rows.

import sqlite3

# Create an in-memory database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Create a table
cursor.execute('''
    CREATE TABLE employees (
        name TEXT,
        age INTEGER,
        city TEXT
    )
''')

# Convert the DataFrame to a list and insert it into the database
employees = df.values.tolist()
cursor.executemany('INSERT INTO employees VALUES (?, ?, ?)', employees)

# Query the data
cursor.execute('SELECT * FROM employees')
rows = cursor.fetchall()
print(rows)
# Output: [('Alice', 25, 'New York'), ('Bob', 30, 'Los Angeles'), ('Charlie', 35, 'Chicago'), ('David', 40, 'Houston')]

# Close the connection
conn.close()
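
Since executemany() accepts any iterable of row sequences, the intermediate list is optional. As a sketch, itertuples(index=False, name=None) yields plain tuples that can be fed to it directly; the table layout simply mirrors the example above.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE employees (name TEXT, age INTEGER, city TEXT)')

# name=None makes itertuples() yield plain tuples instead of namedtuples;
# rows are streamed into executemany() without building a list first
conn.executemany(
    'INSERT INTO employees VALUES (?, ?, ?)',
    df.itertuples(index=False, name=None)
)

rows = conn.execute('SELECT name, age, city FROM employees ORDER BY age').fetchall()
print(rows[0])  # ('Alice', 25, 'New York')
conn.close()
```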

Performance comparison and best practices

Performance comparison

Let's compare the performance of the different methods:

import timeit

# Create a test DataFrame
test_df = pd.DataFrame(np.random.rand(10000, 5), columns=['A', 'B', 'C', 'D', 'E'])

# The methods under test
def method1():
    return test_df.values.tolist()

def method2():
    return test_df.to_numpy().tolist()

def method3():
    return [list(row) for row in test_df.itertuples(index=False)]

def method4():
    return [row.tolist() for index, row in test_df.iterrows()]

# Measure the execution time of 100 runs each
time1 = timeit.timeit(method1, number=100)
time2 = timeit.timeit(method2, number=100)
time3 = timeit.timeit(method3, number=100)
time4 = timeit.timeit(method4, number=100)
print(f"values.tolist(): {time1:.4f} seconds")
print(f"to_numpy().tolist(): {time2:.4f} seconds")
print(f"itertuples(): {time3:.4f} seconds")
print(f"iterrows(): {time4:.4f} seconds")
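
A timing comparison only means something if the methods return the same data, and a single measurement can be noisy. The sketch below first checks that two of the methods agree and then takes the best of several repeats with timeit.repeat. It uses a smaller DataFrame than above so it runs quickly; the names methods, results and best are illustrative, and no particular timings are implied.

```python
import timeit

import numpy as np
import pandas as pd

test_df = pd.DataFrame(np.random.rand(1000, 5), columns=['A', 'B', 'C', 'D', 'E'])

methods = {
    'to_numpy().tolist()': lambda: test_df.to_numpy().tolist(),
    'itertuples()': lambda: [list(row) for row in test_df.itertuples(index=False)],
}

# Step 1: confirm both methods produce identical lists
results = {name: fn() for name, fn in methods.items()}
all_equal = results['to_numpy().tolist()'] == results['itertuples()']
print(f"results identical: {all_equal}")

# Step 2: time each method several times and keep the fastest run
best = {
    name: min(timeit.repeat(fn, number=5, repeat=3))
    for name, fn in methods.items()
}
for name, seconds in best.items():
    print(f"{name}: {seconds:.4f} seconds (best of 3)")
```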

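Best practices

Based on performance and readability, here are some best practices:

1. Converting an entire DataFrame to a list: use df.values.tolist() or df.to_numpy().tolist(); these two perform best.
2. Converting a single column to a list: use df['column_name'].tolist() directly; it is the most readable and efficient option.
3. Converting rows to lists: use a list comprehension with itertuples(), i.e. [list(row) for row in df.itertuples(index=False)], which is more efficient than iterrows().
4. Large DataFrames: consider processing in chunks or using a generator to avoid memory problems.
5. Missing values: handle them with fillna() or dropna() before converting, so the resulting data is consistent.
6. Preserving the index: if the index information is needed, use reset_index() or include the index explicitly during conversion.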
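
Several of these practices can be folded into one small helper. The function below is a hypothetical sketch, not part of pandas: frame_to_rows and its parameters include_index and na_value are names invented for this example.

```python
import numpy as np
import pandas as pd


def frame_to_rows(frame, include_index=False, na_value=None):
    """Hypothetical helper: convert a DataFrame to a list of row lists.

    include_index: prepend the index level(s) to every row via reset_index().
    na_value: value that replaces missing entries in the output.
    """
    if include_index:
        frame = frame.reset_index()
    rows = frame.to_numpy().tolist()
    # Replace missing values cell by cell after the conversion
    return [[na_value if pd.isna(x) else x for x in row] for row in rows]


# Made-up sample with one missing value and a labeled index
sample = pd.DataFrame(
    {'A': [1.0, np.nan], 'B': ['x', 'y']},
    index=['r1', 'r2']
)

print(frame_to_rows(sample))
# [[1.0, 'x'], [None, 'y']]

print(frame_to_rows(sample, include_index=True, na_value=0))
# [['r1', 1.0, 'x'], ['r2', 0, 'y']]
```

The helper assumes every cell holds a scalar; cells that themselves contain lists or arrays would need different missing-value handling.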

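Summary

This article covered how to convert pandas DataFrames and Series to Python lists efficiently, including the basic conversion methods, techniques for efficient conversion, common problems and their solutions, and practical scenarios. With these techniques you can handle data conversion tasks more flexibly and make your data analysis more efficient and accurate.

In practice, the right conversion method depends on your specific needs, the size of the data, and your performance requirements. Hopefully this article helps you better understand and use the ways of converting pandas data structures to Python lists in your own work.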
