转载自https://blog.csdn.net/lijinlon/article/details/81676859
Python数据分析首先需要进行数据清洗处理,涉及到很多DataFrame和Series相关知识,这里对涉及到的常用方法进行整理,主要设计数据增减、变更索引、数值替换等。其中一些函数的参数并没有介绍齐全,可以通过参考pandas文档或者在编辑器输入方法+?查询(例如df.reindex?),实践是检验知识水平的最好途径。
import pandas as pdimport numpy as npdf = pd.DataFrame({'name':['James','Curry','James','Kobe','Wade'], 'age':[31,30,31,35,38], 'score':[18,25,18,17,15], 'block':[5,2,5,3,2]}, index = ['player1','player2','player3','player4','player5'])print(df)
age block name scoreplayer1 31 5 James 18player2 30 2 Curry 25player3 31 5 James 18player4 35 3 Kobe 17player5 38 2 Wade 15
1 更改索引
创建一个新索引(行列)reindex:重新创建新索引,原有数据会根据新索引进行重排,如果索引值不存在,会引入缺失值,原有索引对应的值不会发生变化
# 设置inplace = True 可以直接在原dDataFrame上修改,否则会复制修改df_reindex = df.reindex(columns = ['name','age','block','score','reb'], index = ['player1','player2','player3','player4','player5','player6'] )print(df_reindex)
name age block score rebplayer1 James 31.0 5.0 18.0 NaNplayer2 Curry 30.0 2.0 25.0 NaNplayer3 James 31.0 5.0 18.0 NaNplayer4 Kobe 35.0 3.0 17.0 NaNplayer5 Wade 38.0 2.0 15.0 NaNplayer6 NaN NaN NaN NaN NaN
重新给索引命名rename,可以结合字典给部分索引重新命名,或者结合相关函数对索引进行整体重新命名
# 用字典修改new_index = {'player1':'PLAYER1'}new_col = {'name':'Name','age':'Age'}df_rename_dict = df.rename(index = new_index, columns = new_col)print(df_rename_dict)
Age block Name scorePLAYER1 31 5 James 18player2 30 2 Curry 25player3 31 5 James 18player4 35 3 Kobe 17player5 38 2 Wade 15
# 用函数修改df_rename_fun = df.rename(columns = new_col)print(df_rename_fun)
Age block Name scoreplayer1 31 5 James 18player2 30 2 Curry 25player3 31 5 James 18player4 35 3 Kobe 17player5 38 2 Wade 15
# 用map函数修改,这种方法是直接在原DataFrame上修改df.columns = df.columns.map(str.title)df.index = df.index.map(str.upper)print(df)
Age Block Name ScorePLAYER1 31 5 James 18PLAYER2 30 2 Curry 25PLAYER3 31 5 James 18PLAYER4 35 3 Kobe 17PLAYER5 38 2 Wade 15
将一列或者多列变为行索引 set_index
df_setlindex = df.set_index(['Name'])print(df_setlindex)
Age Block ScoreName James 31 5 18Curry 30 2 25James 31 5 18Kobe 35 3 17Wade 38 2 15
将两列作为索引,默认这些作为索引的列会从DataFrame中删除,设置 drop = False 可以将其保留
df_set2index = df.set_index(['Name', 'Block'], drop = False)print(df_set2index)
Age Block Name ScoreName Block James 5 31 5 James 18Curry 2 30 2 Curry 25James 5 31 5 James 18Kobe 3 35 3 Kobe 17Wade 2 38 2 Wade 15
将行索引变为DataFrame的一列 reset_index
df_rIdx = df.reset_index()print(df_rIdx)
index Age Block Name Score0 PLAYER1 31 5 James 181 PLAYER2 30 2 Curry 252 PLAYER3 31 5 James 183 PLAYER4 35 3 Kobe 174 PLAYER5 38 2 Wade 15
2 数据删除
主要包括多重形式下行列的删除
删除整列
用del删除,在原DataFrame上直接修改删除
用drop方法删除,返回删除后的复制版本,不会修改原DataFramdf2 = df.copy()print(df2)
Age Block Name ScorePLAYER1 31 5 James 18PLAYER2 30 2 Curry 25PLAYER3 31 5 James 18PLAYER4 35 3 Kobe 17PLAYER5 38 2 Wade 15
# 用del方法删除del df2['Age']print(df2)
Block Name ScorePLAYER1 5 James 18PLAYER2 2 Curry 25PLAYER3 5 James 18PLAYER4 3 Kobe 17PLAYER5 2 Wade 15
# 用drop方法删除,默认axis = 0,设置axis=1才能删除列df2_drop = df2.drop(['Block', 'Score'], axis = 1)print(df2_drop)
NamePLAYER1 JamesPLAYER2 CurryPLAYER3 JamesPLAYER4 KobePLAYER5 Wade
删除整行
df3 = df.copy()print(df3)
Age Block Name ScorePLAYER1 31 5 James 18PLAYER2 30 2 Curry 25PLAYER3 31 5 James 18PLAYER4 35 3 Kobe 17PLAYER5 38 2 Wade 15
# 默认drop参数axis=0,删除行df3_drop = df3.drop(['PLAYER1'])print(df3_drop)
Age Block Name ScorePLAYER2 30 2 Curry 25PLAYER3 31 5 James 18PLAYER4 35 3 Kobe 17PLAYER5 38 2 Wade 15
删除重复行
重复判断 duplicated(),返回一个布尔型的Series,表示各行是否与前面重复,重复则显示True
df.duplicated()
PLAYER1 FalsePLAYER2 FalsePLAYER3 TruePLAYER4 FalsePLAYER5 Falsedtype: bool
该方法可以通过设置subset = [‘列名’]根据一列或多列对重复值进行判断,设置 keep=’last’使重复项最后一项显示False,其余为True
配合sum函数可以迅速判断,该行是否存在重复值,sum返回的数值即为重复行的数目df.duplicated().sum()
1
还可以用Series的is_unique方法对单列是否有重复值进行判断,该方法能判断Series的values是否独立,没有重复则返回True
df['Name'].is_unique
False
df3_drop['Name'].is_unique
True
重复值的删除使用drop_duplicates方法,返回的是删除掉重复行的DataFrame,不会修改原DataFrame
# 依据全部列进行判断df_d = df.drop_duplicates()print(df_d)
Age Block Name ScorePLAYER1 31 5 James 18PLAYER2 30 2 Curry 25PLAYER4 35 3 Kobe 17PLAYER5 38 2 Wade 15
# 依据设定的一列或多列进行判断,默认会保留第一个出现的值组合,传入keep = 'last'后会保留最后一个,传入inplace = True则会取代原DataFramedf_d2 = df.drop_duplicates(subset = ['Block'], keep = 'last')print(df_d2)
Age Block Name ScorePLAYER3 31 5 James 18PLAYER4 35 3 Kobe 17PLAYER5 38 2 Wade 15
包含缺失值的行/列删除
滤除缺失数据一般使用dropna,返回删除后的复制版本,不会修改原DataFrame
df_data = df.copy()df_data.iloc[1,2] =np.nandf_data.iloc[2] = np.nanprint(df_data)
Age Block Name ScorePLAYER1 31.0 5.0 James 18.0PLAYER2 30.0 2.0 NaN 25.0PLAYER3 NaN NaN NaN NaNPLAYER4 35.0 3.0 Kobe 17.0PLAYER5 38.0 2.0 Wade 15.0
# 默认只要行内有一个NaN值,该行就会被删除data_drop = df_data.dropna()print(data_drop)
Age Block Name ScorePLAYER1 31.0 5.0 James 18.0PLAYER4 35.0 3.0 Kobe 17.0PLAYER5 38.0 2.0 Wade 15.0
# 如果只想删除全部为NaN的行,可以传入 how = 'all'data_drop2 = df_data.dropna(how = 'all')print(data_drop2)
Age Block Name ScorePLAYER1 31.0 5.0 James 18.0PLAYER2 30.0 2.0 NaN 25.0PLAYER4 35.0 3.0 Kobe 17.0PLAYER5 38.0 2.0 Wade 15.0
# 如果像删除列,可以传入axis=1data_drop3 = data_drop2.dropna(axis = 1)print(data_drop3)
Age Block ScorePLAYER1 31.0 5.0 18.0PLAYER2 30.0 2.0 25.0PLAYER4 35.0 3.0 17.0PLAYER5 38.0 2.0 15.0
此外,dropna还有 (thresh=None, subset=None, inplace=False)三个参数,分别控制缺失值删除数目的阈值,根据subset指定列名的空值删除以及是否取代原DataFrame
3 数据替换
缺失值替换
缺失值替换可以采用fillna
# 直接替换全部为同一个值df_1 = df_data.fillna(0)print(df_1)
Age Block Name ScorePLAYER1 31.0 5.0 James 18.0PLAYER2 30.0 2.0 0 25.0PLAYER3 0.0 0.0 0 0.0PLAYER4 35.0 3.0 Kobe 17.0PLAYER5 38.0 2.0 Wade 15.0
# 也可以传入列名为键的字典为不同列替换为不同值df_dict = df_data.fillna({'Age':30, 'Block':10})print(df_dict)
Age Block Name ScorePLAYER1 31.0 5.0 James 18.0PLAYER2 30.0 2.0 NaN 25.0PLAYER3 30.0 10.0 NaN NaNPLAYER4 35.0 3.0 Kobe 17.0PLAYER5 38.0 2.0 Wade 15.0
# 可以设置method= ‘ffill'或者method='bfill'分别为前后值填充df_m = df_data.fillna(method = 'ffill')print(df_m)
Age Block Name ScorePLAYER1 31.0 5.0 James 18.0PLAYER2 30.0 2.0 James 25.0PLAYER3 30.0 2.0 James 25.0PLAYER4 35.0 3.0 Kobe 17.0PLAYER5 38.0 2.0 Wade 15.0
此外还有axis、limit、inplace参数分别设置轴、前后替换的阈值和是否替代
其他值替换
使用replace替换
# 利用列表实现将不同值替换为同一值 # 将Curry、kobe替换为Stephendf_replace = df.replace(['Curry','Kobe'], 'Stephen')print(df_replace)
Age Block Name ScorePLAYER1 31 5 James 18PLAYER2 30 2 Stephen 25PLAYER3 31 5 James 18PLAYER4 35 3 Stephen 17PLAYER5 38 2 Wade 15
# 利用字典实现对不同值得不同替换df_reDict = df.replace({'Curry':'Stephen'})print(df_reDict)
Age Block Name ScorePLAYER1 31 5 James 18PLAYER2 30 2 Stephen 25PLAYER3 31 5 James 18PLAYER4 35 3 Kobe 17PLAYER5 38 2 Wade 15
# 利用双列表实现对不同值的不同替换df_reList = df.replace(['Curry','Kobe'],['Stephen','Bryant'])print(df_reList)
Age Block Name ScorePLAYER1 31 5 James 18PLAYER2 30 2 Stephen 25PLAYER3 31 5 James 18PLAYER4 35 3 Bryant 17PLAYER5 38 2 Wade 15
4数据索引
标签索引 loc
行索引
print(df)
Age Block Name ScorePLAYER1 31 5 James 18PLAYER2 30 2 Curry 25PLAYER3 31 5 James 18PLAYER4 35 3 Kobe 17PLAYER5 38 2 Wade 15
列索引
# 单列df['Age']
PLAYER1 31PLAYER2 30PLAYER3 31PLAYER4 35PLAYER5 38Name: Age, dtype: int64
# 多列df[['Age', 'Name']]
Age | Name | |
---|---|---|
PLAYER1 | 31 | James |
PLAYER2 | 30 | Curry |
PLAYER3 | 31 | James |
PLAYER4 | 35 | Kobe |
PLAYER5 | 38 | Wade |
# loc选取df.loc[:, 'Name']
PLAYER1 JamesPLAYER2 CurryPLAYER3 JamesPLAYER4 KobePLAYER5 WadeName: Name, dtype: object
行列共同索引
df.loc['PLAYER2',['Age','Name']]
Age 30Name CurryName: PLAYER2, dtype: object
位置索引 iloc
行索引
# 单行df.iloc[1]
Age 30Block 2Name CurryScore 25Name: PLAYER2, dtype: object
# 列索引df.iloc[:,1]
PLAYER1 5PLAYER2 2PLAYER3 5PLAYER4 3PLAYER5 2Name: Block, dtype: int64
# 连续多列df.iloc[:, 1:3]
Block | Name | |
---|---|---|
PLAYER1 | 5 | James |
PLAYER2 | 2 | Curry |
PLAYER3 | 5 | James |
PLAYER4 | 3 | Kobe |
PLAYER5 | 2 | Wade |
# 列行同时索引df.iloc[0:2,2:4]
Name | Score | |
---|---|---|
PLAYER1 | James | 18 |
PLAYER2 | Curry | 25 |
通过逻辑选择
df_logic = df[df['Score']>17]print(df_logic)
Age Block Name ScorePLAYER1 31 5 James 18PLAYER2 30 2 Curry 25PLAYER3 31 5 James 18
5数据排序
按照索引排序 sort_index
# 可以设置axis按照行列进行排序,并可以设置ascending悬着升序降序df_sort = df.sort_index(axis = 0, ascending = False)print(df_sort)
Age Block Name ScorePLAYER5 38 2 Wade 15PLAYER4 35 3 Kobe 17PLAYER3 31 5 James 18PLAYER2 30 2 Curry 25PLAYER1 31 5 James 18
按照值进行排序 sort_values
可以设置ascending选择升序降序
df_sort2 = df.sort_values (by = ['Age', 'Score'])print(df_sort2)
Age Block Name ScorePLAYER2 30 2 Curry 25PLAYER1 31 5 James 18PLAYER3 31 5 James 18PLAYER4 35 3 Kobe 17PLAYER5 38 2 Wade 15