Python Data Analysis (Getting Started with pandas)

1. pandas data structures: the DataFrame

A DataFrame can be created in several ways: 1. from another DataFrame; 2. from a NumPy array (or a composite of arrays) with a two-dimensional shape; 3. from a Series; 4. from a file such as a CSV file. A quick sketch of the first three routes is given below; the rest of this section then walks through basic DataFrame usage, starting with the CSV route.
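This sketch is not part of the original code; the values and column names are invented purely for illustration, assuming pandas and NumPy are installed:

import numpy as np
import pandas as pd

# 1. from a two-dimensional NumPy array
arr = np.arange(6).reshape(2, 3)
df_from_array = pd.DataFrame(arr, columns=['a', 'b', 'c'])

# 2. from a Series (the Series becomes a single column)
s = pd.Series([1.5, 2.5, 3.5], name='values')
df_from_series = pd.DataFrame(s)

# 3. from another DataFrame (copy() returns an independent object)
df_copy = df_from_array.copy()

print(df_from_array)
print(df_from_series)
print(df_copy)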

a): Reading a file

Code:


from pandas.io.parsers import read_csv
df=read_csv("H:\Python\data\WHO.csv")
print "DataFrame:",df

Output (partial):


DataFrame: Country CountryID Continent \
0 Afghanistan 1 1
1 Albania 2 2
2 Algeria 3 3
3 Andorra 4 2
4 Angola 5 3

b): Getting the shape

Code:


print "Shape:",df.shape #大小
print "Length:",len(df) #長度

Output:


Shape: (202, 358)
Length: 202

c): Getting the column headers and data types

Code:


print "Column Headers",df.columns #得到每列的標題
print "Data type",df.dtypes #得到每列數據的類型

Output (partial):


Column Headers Index([u'Country', u'CountryID', u'Continent',
u'Adolescent fertility rate (%)', u'Adult literacy rate (%)',
u'Gross national income per capita (PPP international $)',
u'Net primary school enrolment ratio female (%)',
u'Net primary school enrolment ratio male (%)',
u'Population (in thousands) total',
u'Population annual growth rate (%)',
...
u'Total_CO2_emissions', u'Total_income', u'Total_reserves',
u'Trade_balance_goods_and_services', u'Under_five_mortality_from_CME',
u'Under_five_mortality_from_IHME', u'Under_five_mortality_rate',
u'Urban_population', u'Urban_population_growth',
u'Urban_population_pct_of_total'],
dtype='object', length=358)
Data type Country object
CountryID int64
Continent int64
Adolescent fertility rate (%) float64
Adult literacy rate (%) float64
Gross national income per capita (PPP international $) float64
Net primary school enrolment ratio female (%) float64
Net primary school enrolment ratio male (%) float64

d): The index

Code:

print "Index:",df.index

Output:

Index: RangeIndex(start=0, stop=202, step=1)

e): values (missing entries appear as NaN)

Code:

print "Values:",df.values

Output:


Values: [['Afghanistan' 1L 1L ..., 5740436.0 5.44 22.9]
['Albania' 2L 2L ..., 1431793.9 2.21 45.4]
['Algeria' 3L 3L ..., 20800000.0 2.61 63.3]
...,
['Yemen' 200L 1L ..., 5759120.5 4.37 27.3]
['Zambia' 201L 3L ..., 4017411.0 1.95 35.0]
['Zimbabwe' 202L 3L ..., 4709965.0 1.9 35.9]]


2. pandas data structures: the Series

The pandas Series is a one-dimensional, labeled array that can hold elements of any type. A Series can be created from a Python dict, from a NumPy array, or from a single scalar value.
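None of these three creation routes appears in the code of this post, so here is a minimal sketch; the labels and values are invented for illustration, assuming pandas and NumPy are installed:

import numpy as np
import pandas as pd

# from a Python dict: the keys become the index labels
s_from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})

# from a NumPy array, with an explicit index
s_from_array = pd.Series(np.array([0.1, 0.2, 0.3]), index=['x', 'y', 'z'])

# from a single scalar value, repeated for every index label
s_from_scalar = pd.Series(5, index=['p', 'q', 'r'])

print(s_from_dict)
print(s_from_array)
print(s_from_scalar)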

a): Type. Selecting a single column of a DataFrame yields a Series.

Code:


country_df=df["Country"]
print "Type df:",type(df)
print "Type country_df:",type(country_df)

Output:


Type df: <class 'pandas.core.frame.DataFrame'>
Type country_df: <class 'pandas.core.series.Series'>

b): Attributes. A Series shares several attributes with the DataFrame and additionally exposes a name attribute.

Code:


print "Series Shape:",country_df.shape #獲取列的形狀
print "Series index:",country_df.index #獲取索引
print "Series values:",country_df.values #獲取該列的所有值
print "Series name:",country_df.name #獲取列名(標題)

Output:


Series Shape: (202L,)
Series index: RangeIndex(start=0, stop=202, step=1)
Series values: ['Afghanistan' 'Albania' 'Algeria' 'Andorra' 'Angola' 'Antigua and Barbuda'
'Argentina' 'Armenia' 'Australia' 'Austria' 'Azerbaijan' 'Bahamas'
'Bahrain' 'Bangladesh' 'Barbados' 'Belarus' 'Belgium' 'Belize' 'Benin'
'Bermuda' 'Bhutan' 'Bolivia' 'Bosnia and Herzegovina' 'Botswana' 'Brazil'
'Brunei Darussalam' 'Bulgaria' 'Burkina Faso' 'Burundi' 'Cambodia'
'Cameroon' 'Canada' 'Cape Verde' 'Central African Republic' 'Chad' 'Chile'
'China' 'Colombia' 'Comoros' 'Congo, Dem. Rep.' 'Congo, Rep.'
'Cook Islands' 'Costa Rica' "Cote d'Ivoire" 'Croatia' 'Cuba' 'Cyprus'
'Czech Republic' 'Denmark' 'Djibouti' 'Dominica' 'Dominican Republic'
'Ecuador' 'Egypt' 'El Salvador' 'Equatorial Guinea' 'Eritrea' 'Estonia'
'Ethiopia' 'Fiji' 'Finland' 'France' 'French Polynesia' 'Gabon' 'Gambia'
'Georgia' 'Germany' 'Ghana' 'Greece' 'Grenada' 'Guatemala' 'Guinea'
'Guinea-Bissau' 'Guyana' 'Haiti' 'Honduras' 'Hong Kong, China' 'Hungary'
'Iceland' 'India' 'Indonesia' 'Iran (Islamic Republic of)' 'Iraq'
'Ireland' 'Israel' 'Italy' 'Jamaica' 'Japan' 'Jordan' 'Kazakhstan' 'Kenya'
'Kiribati' 'Korea, Dem. Rep.' 'Korea, Rep.' 'Kuwait' 'Kyrgyzstan'
"Lao People's Democratic Republic" 'Latvia' 'Lebanon' 'Lesotho' 'Liberia'
'Libyan Arab Jamahiriya' 'Lithuania' 'Luxembourg' 'Macao, China'
'Macedonia' 'Madagascar' 'Malawi' 'Malaysia' 'Maldives' 'Mali' 'Malta'

'Marshall Islands' 'Mauritania' 'Mauritius' 'Mexico'
'Micronesia (Federated States of)' 'Moldova' 'Monaco' 'Mongolia'
'Montenegro' 'Morocco' 'Mozambique' 'Myanmar' 'Namibia' 'Nauru' 'Nepal'
'Netherlands' 'Netherlands Antilles' 'New Caledonia' 'New Zealand'
'Nicaragua' 'Niger' 'Nigeria' 'Niue' 'Norway' 'Oman' 'Pakistan' 'Palau'
'Panama' 'Papua New Guinea' 'Paraguay' 'Peru' 'Philippines' 'Poland'
'Portugal' 'Puerto Rico' 'Qatar' 'Romania' 'Russia' 'Rwanda'
'Saint Kitts and Nevis' 'Saint Lucia' 'Saint Vincent and the Grenadines'
'Samoa' 'San Marino' 'Sao Tome and Principe' 'Saudi Arabia' 'Senegal'
'Serbia' 'Seychelles' 'Sierra Leone' 'Singapore' 'Slovakia' 'Slovenia'
'Solomon Islands' 'Somalia' 'South Africa' 'Spain' 'Sri Lanka' 'Sudan'
'Suriname' 'Swaziland' 'Sweden' 'Switzerland' 'Syria' 'Taiwan'
'Tajikistan' 'Tanzania' 'Thailand' 'Timor-Leste' 'Togo' 'Tonga'
'Trinidad and Tobago' 'Tunisia' 'Turkey' 'Turkmenistan' 'Tuvalu' 'Uganda'
'Ukraine' 'United Arab Emirates' 'United Kingdom'
'United States of America' 'Uruguay' 'Uzbekistan' 'Vanuatu' 'Venezuela'
'Vietnam' 'West Bank and Gaza' 'Yemen' 'Zambia' 'Zimbabwe']
Series name: Country

c): Slicing.

Code:


print "Last 2 countries:",country_df[-2:]
print "Last 2 countries type:",type(country_df[-2:])

Output:


Last 2 countries: 200 Zambia
201 Zimbabwe
Name: Country, dtype: object
Last 2 countries type: <class 'pandas.core.series.Series'>


3. Querying data with pandas

a): The head() and tail() functions:

Code:


sunspots=read_csv("H:\Python\data\sunspots.csv")
print "Head 2:",sunspots.head(2) #first two rows
print "Tail 2:",sunspots.tail(2) #last two rows

Output:


Head 2: Date Yearly Mean Total Sunspot Number
0 2016/12/31 39.8
1 2015/12/31 69.8
Tail 2: Date Yearly Mean Total Sunspot Number
316 1701-12-31 18.3
317 1700-12-31 8.3

b): The loc indexer

Code:


last_date=sunspots.index[-1]
print "Last value:\n",sunspots.loc[last_date]

Output:


Last value:
Date 1700-12-31
Yearly Mean Total Sunspot Number 8.3
Name: 317, dtype: object

4. Statistical calculations with the pandas DataFrame

The pandas DataFrame provides a number of statistical methods. The main ones are listed below with a brief description of each.

  • describe — returns descriptive statistics
  • count — returns the number of non-NaN items
  • mad — computes the mean absolute deviation, a dispersion measure similar in spirit to the standard deviation
  • median — returns the median, i.e. the value at the 50th percentile
  • min — returns the minimum value
  • max — returns the maximum value
  • mode — returns the mode, the value that occurs most frequently
  • std — returns the standard deviation, the square root of the variance
  • var — returns the variance
  • skew — returns the skewness coefficient, which describes how symmetric the distribution is
  • kurt — returns the kurtosis coefficient, which describes how peaked or flat the distribution is

Code:

print "Describe:\n",sunspots.describe()
print "Non NaN observations:\n",sunspots.count()
print "MAD:\n",sunspots.mad()
print "Median:\n",sunspots.median()
print "Min:\n",sunspots.min()
print "Max:\n",sunspots.max()
print "Mode:\n",sunspots.mode()
print "Standard Deviation:\n",sunspots.std()
print "Variance:\n",sunspots.var()
print "Skewness:\n",sunspots.skew()
print "Kurtosis:\n",sunspots.kurt()

Output:


Describe:
Yearly Mean Total Sunspot Number
count 318.000000
mean 79.193396
std 61.988788
min 0.000000
25% 24.950000
50% 66.250000
75% 116.025000
max 269.300000
Non NaN observations:
Date 318
Yearly Mean Total Sunspot Number 318
dtype: int64
MAD:
Yearly Mean Total Sunspot Number 50.925104
dtype: float64
Median:
Yearly Mean Total Sunspot Number 66.25
dtype: float64
Min:
Date 1700-12-31
Yearly Mean Total Sunspot Number 0
dtype: object
Max:
Date 2016/12/31
Yearly Mean Total Sunspot Number 269.3
dtype: object
Mode:
Date Yearly Mean Total Sunspot Number
0 1985/12/31 18.3
Standard Deviation:
Yearly Mean Total Sunspot Number 61.988788
dtype: float64
Variance:
Yearly Mean Total Sunspot Number 3842.60983
dtype: float64
Skewness:
Yearly Mean Total Sunspot Number 0.808551
dtype: float64
Kurtosis:
Yearly Mean Total Sunspot Number -0.130045
dtype: float64


5. Data aggregation with the pandas DataFrame

a): Seed NumPy's random number generator so that repeated runs of the program produce the same data. The generated data has four columns:

1. Weather (a string);

2. Food (a string);

3. Price (a random float);

4. Number (a random integer from 1 to 8; randint(1, 9) samples the half-open interval [1, 9)).

Code:


import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import randint
import numpy as np
seed(42)
#rand(n) returns n random floats drawn uniformly from [0, 1)
#randint(low,high,size) returns random integers drawn from the half-open interval [low, high)

df=pd.DataFrame({'Weather':['cold','hot','cold','hot','cold','hot','cold'],'Food':['soup','soup','icecream','chocolate','icecream','icecream','soup'],
'Price':10*rand(7),'Number':randint(1,9,size=(7,))})
print df

Output:


Food Number Price Weather
0 soup 8 3.745401 cold
1 soup 5 9.507143 hot
2 icecream 4 7.319939 cold
3 chocolate 8 5.986585 hot
4 icecream 8 1.560186 cold
5 icecream 3 1.559945 hot
6 soup 6 0.580836 cold

b): Group the data by the Weather column, then iterate over the groups.

Code:


weather_group=df.groupby('Weather') #group the rows by weather
i=0
for name,group in weather_group:
    i=i+1
    print "Group ",i,name
    print group

Output:


Group 1 cold
Food Number Price Weather
0 soup 8 3.745401 cold
2 icecream 4 7.319939 cold
4 icecream 8 1.560186 cold
6 soup 6 0.580836 cold
Group 2 hot
Food Number Price Weather
1 soup 5 9.507143 hot
3 chocolate 8 5.986585 hot
5 icecream 3 1.559945 hot

c): The variable weather_group is a special pandas object returned by groupby(). It provides a set of aggregation methods, demonstrated below:

Code:


print "Weather group first:\n",weather_group.first() #展示各組第一行內容
print "Weather group last:\n",weather_group.last() #展示各組最後一行內容
print "Weather group mean:\n",weather_group.mean() #計算各組均值

Output:

Weather group first:
Food Number Price
Weather
cold soup 8 3.745401
hot soup 5 9.507143
Weather group last:
Food Number Price
Weather
cold soup 6 0.580836
hot icecream 3 1.559945
Weather group mean:
Number Price
Weather
cold 6.500000 3.301591
hot 5.333333 5.684558

d): Just as in a database query, data can also be grouped by multiple columns.

The groups attribute then shows which groups were formed and which rows belong to each of them (a sketch of counting rows per group follows the output below):

Code:


wf_group=df.groupby(['Weather','Food'])

print "WF Group:\n",wf_group.groups

Output:


WF Group:
{('hot', 'chocolate'): Int64Index([3], dtype='int64'), ('cold', 'icecream'): Int64Index([2, 4], dtype='int64'), ('cold', 'soup'): Int64Index([0, 6], dtype='int64'), ('hot', 'soup'): Int64Index([1], dtype='int64'), ('hot', 'icecream'): Int64Index([5], dtype='int64')}
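To read off the number of rows in each group directly, instead of counting the index entries above, the groupby object's size() method can be used; a minimal sketch reusing wf_group from the code above:

print(wf_group.size())   #one row count per (Weather, Food) combination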

e): With the agg method, a series of NumPy functions can be applied to each group:

Code:

print "WF Aggregated:\n",wf_group.agg([np.mean,np.median])

Output:


WF Aggregated:
Number Price
mean median mean median
Weather Food
cold icecream 6 6 4.440063 4.440063
soup 7 7 2.163119 2.163119
hot chocolate 8 8 5.986585 5.986585
icecream 3 3 1.559945 1.559945
soup 5 5 9.507143 9.507143

6. Concatenating and appending DataFrames

a): Database tables support inner and outer joins. pandas DataFrames offer comparable operations, and data can also be concatenated and appended.

The concat() function concatenates DataFrames; for example, a DataFrame made up of three rows can be concatenated with other rows, here with another copy of the same three rows:

Code:


print "df:3\n",df[:3]
print "Contact Back together:\n",pd.concat([df[:3],df[:3]])

Output:


df:3
Food Number Price Weather
0 soup 8 3.745401 cold
1 soup 5 9.507143 hot
2 icecream 4 7.319939 cold
Concat back together:
Food Number Price Weather
0 soup 8 3.745401 cold
1 soup 5 9.507143 hot
2 icecream 4 7.319939 cold
0 soup 8 3.745401 cold
1 soup 5 9.507143 hot
2 icecream 4 7.319939 cold
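concat() can also stitch DataFrames together side by side along the column axis; this is not part of the original example, but a minimal sketch reusing the df defined earlier would be:

#axis=1 concatenates column-wise, aligning rows on the index
print(pd.concat([df[['Food']], df[['Price']]], axis=1))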

b): To append rows, use the append function:

Code:

print "Appending rows:\n",df[3:].append(df[5:])

Output:


Appending rows:
Food Number Price Weather
3 chocolate 8 5.986585 hot
4 icecream 8 1.560186 cold
5 icecream 3 1.559945 hot
6 soup 6 0.580836 cold
5 icecream 3 1.559945 hot
6 soup 6 0.580836 cold
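Note that in recent pandas releases DataFrame.append() has been deprecated in favour of pd.concat(); assuming such a version, the same result as above can be obtained with:

#equivalent to df[3:].append(df[5:]) in newer pandas versions
print(pd.concat([df[3:], df[5:]]))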

7. Joining DataFrames

a): Create two CSV files, dest.csv and tips.csv, and read them in:

Code:


dests=pd.read_csv("H:\Python\data\dest.csv")
tips=pd.read_csv("H:\Python\data\\tips.csv")
print "dests:\n",dests
print "tips:\n",tips

Output:


dests:
EmpNr Dest
0 5 The Hague
1 3 Amsterdam
2 9 Rotterdam
tips:
EmpNr Amount
0 5 10.0
1 9 5.0
2 7 2.5

b): The pandas merge() function and the DataFrame join() instance method both implement database-style join operations.

pandas supports all of the usual join types; only the inner join and the full outer join are covered here, with a sketch of left and right outer joins at the end of this section.

  • Join on the employee number with the merge() function:
print "Merge() on key:\n",pd.merge(dests,tips,on='EmpNr')

Output:


Merge() on key:
EmpNr Dest Amount
0 5 The Hague 10.0
1 9 Rotterdam 5.0
  • When joining with the join() method, suffixes are needed to label the columns coming from the left and right operands:
print "Dest join() tips:\n",dests.join(tips,lsuffix='Dest',rsuffix='Tips')

Output:


Dest join() tips:
EmpNrDest Dest EmpNrTips Amount
0 5 The Hague 5 10.0
1 3 Amsterdam 9 5.0
2 9 Rotterdam 7 2.5
  • A more explicit way to perform inner and outer joins with merge() is shown below:

Code:


print "Inner join with merge():\n",pd.merge(dests,tips,how='inner') #內連接
print "Outer join with merge():\n",pd.merge(dests,tips,how='outer') #完全外部連接

Output:


Inner join with merge():
EmpNr Dest Amount
0 5 The Hague 10.0
1 9 Rotterdam 5.0
Outer join with merge():
EmpNr Dest Amount
0 5 The Hague 10.0
1 3 Amsterdam NaN
2 9 Rotterdam 5.0
3 7 NaN 2.5
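The left and right outer joins mentioned earlier, which keep all rows from one side only, work the same way; a minimal sketch using the same dests and tips:

#keep every row of dests; tips without a matching EmpNr become NaN
print(pd.merge(dests, tips, how='left'))
#keep every row of tips; dests without a matching EmpNr become NaN
print(pd.merge(dests, tips, how='right'))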

8. Handling missing data

a): Reading the data.

Code:


df=pd.read_csv("H:\Python\data\WHO.csv")
#print df.head()
df=df[['Country',df.columns[6]]][:2] #build a new DataFrame from the Country column and the column at position 6, keeping the first two rows
print "New df:\n",df

Output:


New df:
Country Net primary school enrolment ratio female (%)
0 Afghanistan NaN
1 Albania 93.0

b): pandas marks missing values as NaN (Not a Number), which plays roughly the role of None. The isnull() function helps locate the missing entries.

Code:


print "Null Values:\n",pd.isnull(df) #檢查每行缺失的數
print "Not Null Values:\n",pd.notnull(df) #檢查非缺失的數
print "Last Column Doubled:\n",2*df[df.columns[-1]] #NAN值乘以一個數後還是NAN
print "Last Column plus NaN:\n",df[df.columns[-1]]+np.nan #非NAN值加上NAN後變為了NAN

print "Zero filled:\n",df.fillna(0) #使用0替換NAN

Output:


Null Values:
Country Net primary school enrolment ratio female (%)
0 False True
1 False False
Not Null Values:
Country Net primary school enrolment ratio female (%)
0 True False
1 True True
Last Column Doubled:
0 NaN
1 186.0
Name: Net primary school enrolment ratio female (%), dtype: float64
Last Column plus NaN:
0 NaN
1 NaN
Name: Net primary school enrolment ratio female (%), dtype: float64
Zero filled:
Country Net primary school enrolment ratio female (%)
0 Afghanistan 0.0
1 Albania 93.0

9. Handling dates

a): Define a date range of 42 days starting on January 1, 1900.

Code:

print "Date range:\n",pd.date_range('1/1/1900',periods=42,freq='D') #42表示天數,D表示使用日頻率。如果periods='W',表示42周

Output (partial):


Date range:
DatetimeIndex(['1900-01-01', '1900-01-02', '1900-01-03', '1900-01-04',
               '1900-01-05', '1900-01-06', '1900-01-07', '1900-01-08',
               ...
               '1900-02-08', '1900-02-09', '1900-02-10', '1900-02-11'],
              dtype='datetime64[ns]', freq='D')

b): In pandas the representable date range is limited. pandas timestamps are based on the NumPy datetime64 type with nanosecond resolution, and the value is stored in a 64-bit integer. As a consequence, valid timestamps fall roughly between the years 1677 and 2262, and not every date in those boundary years is valid. The exact midpoint of this range is January 1, 1970; thus January 1, 1677 cannot be represented as a pandas timestamp, while September 30, 1677 can. The following code illustrates this:

Code:


import pandas as pd
import sys
try:
    print "Date range:\n",pd.date_range('1/1/1677',periods=4,freq='D')
except:
    etype,value,_=sys.exc_info() #get the exception type and value
    print "Error encountered:\n",etype,value #print them

Output:


Date range:
Error encountered:
Out of bounds nanosecond timestamp: 1677-01-01 00:00:00

c): Use the pandas DateOffset class to compute the permitted date range:

Code:


offset=pd.DateOffset(seconds=2**63/10**9)
mid=pd.to_datetime('1/1/1970')
print "Start valid range:\n",mid-offset
print "End valid range:\n",mid+offset

Output:


Start valid range:
1677-09-21 00:12:44
End valid range:
2262-04-11 23:47:16

d): pandas can convert a list of strings into dates:

Code:

print "With format:\n",pd.to_datetime(['1901113','19031230'],format='%Y%m%d')

Output:


With format:
DatetimeIndex(['1901-11-03', '1903-12-30'], dtype='datetime64[ns]', freq=None)

e): A string that is clearly not a date cannot be converted; passing coerce=True forces the conversion and turns such entries into NaT:

Code:


print "Illegal date:\n",pd.to_datetime(['1901-11-13','not a date']) #第二個字符串無法轉換,運行報錯

print "Illegal date:\n",pd.to_datetime(['1901-11-13','not a date'],coerce=True) #強制轉化,得到非時間數NAT

Output:


Illegal date:
DatetimeIndex(['1901-11-13', 'NaT'], dtype='datetime64[ns]', freq=None)
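The coerce keyword belongs to older pandas versions; in later releases it was replaced by the errors parameter, so the equivalent call would be:

#errors='coerce' turns unparseable strings into NaT instead of raising an error
print(pd.to_datetime(['1901-11-13', 'not a date'], errors='coerce'))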

10. Pivot tables

a): A pivot table aggregates data over rows and columns chosen from a flat file; the aggregation can be a sum, a mean, a standard deviation, and so on.


import pandas as pd
from numpy.random import seed
from numpy.random import rand
from numpy.random import randint
import numpy as np

seed(42)
N=7
df=pd.DataFrame({'Weather':['cold','hot','cold','hot','cold','hot','cold'],'Food':['soup','soup','icecream','chocolate','icecream','icecream','soup'],
'Price':10*rand(7),'Number':randint(1,9,size=(7,))})
print "DataFrame:\n",df
print pd.pivot_table(df,index='Food',aggfunc=np.sum) #sum the numeric columns for each type of Food

Output:


DataFrame:
Food Number Price Weather
0 soup 8 3.745401 cold
1 soup 5 9.507143 hot
2 icecream 4 7.319939 cold
3 chocolate 8 5.986585 hot
4 icecream 8 1.560186 cold
5 icecream 3 1.559945 hot
6 soup 6 0.580836 cold
Number Price
Food
chocolate 8 5.986585
icecream 15 10.440071
soup 19 13.833380
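Summation is only one of the aggregations mentioned in a); pivot_table() also accepts other functions and can spread a second key across the columns. A minimal sketch reusing df from above (not part of the original code):

#average Price per Food, with Weather spread across the columns
print(pd.pivot_table(df, values='Price', index='Food', columns='Weather', aggfunc=np.mean))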