学过sql的童鞋都知道在sql的世界里面,查询语句就是一切,不管查询一列,还是多列,还是条件查询,加上排序等等,都是很快很高效,但是有一个缺点不可避免的就是,你需要把数据导入到数据库,然后建立查询语句,比较复杂的查询语句,比较复杂,也是很难编写,如果电脑配置比较低,即便是简单的查询语句,在面临巨大的数据面前,也是会变得很慢。
今天主要是讲2点:
- python下的pandas如何筛选数据
- SQL的查询方法如何在pandas里面实现
- 相比较你感受一下谁的速度跟快更便捷
查询几列数据并只显示前几行
SQL实现方法:
SELECT 'Year', 'Month', 'DayofMonth', 'DayOfWeek' FROM tips LIMIT 5
pandas实现方法:
df[['Year', 'Month', 'DayofMonth', 'DayOfWeek']]
Where :查询指定的值
时间为year=2011的前五行数据:
SQL:
SELECT * FROM tips WHERE tYear = 'Dinner' LIMIT 5;
pandas:
df[df["Year"]==2011].head()
多条件查询 AND:
通过多个条件限制查询的值得范围
SQL:
SELECT * FROM tips WHERE Year = 2011 AND Month > 6;
pandas:
df[(df["Year"]==2011)&(df["Month"]>6)].head()
多条件查询 OR:
多个条件只要满足一个就可以
SQL:
SELECT * FROM tips WHERE Year = 2011 OR Month > 6;
pandas:
df[(df["Year"]==2011)|(df["Month"]>6)].head()
筛选是否含有空值:
筛选某列包含空值的数据:
SQL:
SELECT * FROM frame WHERE Month IS NULL;
pandas:
df[df["Month"].isna()].head
筛选不包含空值的列:
SQL:
SELECT * FROM frame WHERE Month IS NOT NULL;
pandas:
df[df["Month"].notna()].head
分组统计:
SQL:
SELECT Year, count(*) FROM tips GROUP BY Year;
pandas:
df.groupby("Year").size()
分组统计应用不容的统计方式:
SQL:
SELECT Year, AVG(TaxiOut), COUNT(*) FROM tips GROUP BY Year;
pandas:
df.groupby('Year').agg({'TaxiOut': np.mean, 'Year': np.size})
数据连接查询:
INNER JOIN
SQL:
SELECT * FROM df1 INNER JOIN df2 ON df1.key = df2.key;
pandas:
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
'value': np.random.randn(4)})
df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
'value': np.random.randn(4)})
pd.merge(df1, df2, on='key')
LEFT OUTER JOIN
SQL:
SELECT * FROM df1 LEFT OUTER JOIN df2 ON df1.key = df2.key;
pandas:
pd.merge(df1, df2, on='key', how='left')
RIGHT JOIN
SQL:
SELECT * FROM df1 RIGHT OUTER JOIN df2 ON df1.key = df2.key;
pandas:
pd.merge(df1, df2, on='key', how='right')
FULL JOIN
SQL :
SELECT * FROM df1 FULL OUTER JOIN df2 ON df1.key = df2.key;
pandas:
pd.merge(df1, df2, on='key', how='outer')
合并查询
Union 包括两种情况,删除重复和不删除
不删除:
SQL:
SELECT city, rank FROM df1 UNION ALL SELECT city, rank FROM df2;
pandas:
pd.concat([df1, df2])
删除:
SQL:
SELECT city, rank FROM df1 UNION SELECT city, rank FROM df2;
pandas:
pd.concat([df1, df2]).drop_duplicates()
排序
SQL:
SELECT * FROM tips ORDER BY Month DESC
pandas:
df.sort_values(by="Month",ascending=False)
更新
SQL:
UPDATE tips SET TaxiIn = TaxiIn*2 WHERE Month < 5
pandas:
df.loc[df["Month"]<5,"TaxiIn"]*=2
df[df["Month"]<5]["TaxiIn"]
删除
SQL:
DELETE FROM tips WHERE Month > 9;
pandas:
df.drop(df[df["Month"]>9].index)
閱讀更多 嘩啦圈的夢 的文章