
Learning Python: Data Mining (Part 2)

张伟江, 2019-08-18 09:21 [Python]

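Numpy: an efficient computation tool

Advantages of Numpy

ndarray attributes

Basic operations

ndarray.method()

numpy.function_name()

ndarray operations

Logical operations

Statistical operations

Operations between arrays

Merging, splitting, IO operations, data processing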

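3.1 Advantages of Numpy

3.1.1 Introduction to Numpy: a numerical computing library

num - numerical

py - python

ndarray

n - any number of

d - dimension

array - array

3.1.2 Introduction to ndarray

3.1.3 Speed comparison: ndarray vs. native Python list

3.1.4 Advantages of ndarray

1) Storage style

ndarray - all elements share one type - less general-purpose

list - elements may have different types - very general-purpose

2) Parallelized computation: ndarray supports vectorized operations

3) Underlying language

Implemented in C; much of the compiled code releases the GIL while it runs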
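The storage difference between list and ndarray can be seen in a small sketch. The variable names below are illustrative and not from the original notes; the point is only that a list keeps each element's own type, while an ndarray converts everything to one common dtype and then operates on the whole array at once.

```python
import numpy as np

# A list keeps each element's own Python type.
mixed_list = [1, 2.5, True]
list_types = [type(x).__name__ for x in mixed_list]
print(list_types)

# An ndarray converts every element to one common dtype (float64 here).
arr = np.array(mixed_list)
print(arr.dtype)

# Vectorized operation: one expression is applied to every element.
doubled = arr * 2
print(doubled)
```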

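3.2 Getting to know the N-dimensional array: ndarray attributes

3.2.1 Attributes of ndarray

shape

ndim

size

dtype

itemsize

If no type is specified when an ndarray is created, the defaults are:

integers: int64 (on most 64-bit platforms; the default integer type is platform-dependent)

floating-point numbers: float64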
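A minimal sketch of how the default and explicit types show up in dtype and itemsize. The array names are made up for illustration, and the integer default is printed rather than asserted because it depends on the platform and NumPy version.

```python
import numpy as np

ints = np.array([1, 2, 3])            # default integer type (platform-dependent)
floats = np.array([1.0, 2.0, 3.0])    # default float type is float64
small = np.array([1, 2, 3], dtype=np.int8)  # explicitly requested type

print(ints.dtype, ints.itemsize)
print(floats.dtype, floats.itemsize)
print(small.dtype, small.itemsize)
```

Choosing a smaller dtype reduces memory use per element, at the cost of a narrower value range.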

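3.2.2 Shape of an ndarray

[1, 2, 3, 4] is one-dimensional, shape (4,)

[[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]] is two-dimensional, shape (3, 4)

[[[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]], [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]], [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]] is three-dimensional, shape (3, 3, 4)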
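The three nested lists above can be checked directly. This is a small sketch with illustrative names (a, b, c); it builds each array and prints its ndim, shape and size.

```python
import numpy as np

row = [1, 2, 3, 4]
a = np.array(row)                         # one-dimensional
b = np.array([row, row, row])             # two-dimensional
c = np.array([[row, row, row]] * 3)       # three-dimensional

for arr in (a, b, c):
    print(arr.ndim, arr.shape, arr.size)
```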

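3.3 Basic operations

ndarray.method()

np.function_name()

np.array()

3.3.1 Ways to generate arrays

1) Arrays of 0s and 1s

np.zeros(shape)

np.ones(shape)

2) From an existing array

np.array() np.copy() deep copy

np.asarray() shallow copy (no copy is made when the input is already an ndarray)

3) Arrays over a fixed range

np.linspace(0, 10, 100)

100 evenly spaced values over [0, 10], both endpoints included

np.arange(a, b, c)

like range(a, b, c)

[a, b), c is the step size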
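The difference between the closed interval of np.linspace and the half-open interval of np.arange is easy to miss, so here is a short sketch. The names closed and half_open are illustrative only.

```python
import numpy as np

# linspace: the third argument is the number of points; the endpoint 10 is included.
closed = np.linspace(0, 10, 5)
print(closed)

# arange: the third argument is the step; the stop value 10 is excluded.
half_open = np.arange(0, 10, 2.5)
print(half_open)
```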
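4) Random arrays
Distribution: visualized with a histogram
1) Uniform distribution
every bin is equally likely
2) Normal distribution
σ (standard deviation) measures spread: the amplitude of fluctuation, how concentrated or dispersed the values are, and how stable they are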
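To make the two distributions concrete without plotting, the sketch below counts samples per bin and compares standard deviations. It assumes the newer np.random.default_rng generator with a fixed seed (the cases in this article use the older np.random.uniform and np.random.normal functions); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Uniform distribution: each of the 10 bins should hold roughly the same count.
uniform = rng.uniform(low=-1, high=1, size=100_000)
counts, _ = np.histogram(uniform, bins=10, range=(-1, 1))
print(counts)

# Normal distribution: a larger scale (sigma) means more widely spread values.
narrow = rng.normal(loc=0, scale=0.5, size=100_000)
wide = rng.normal(loc=0, scale=2.0, size=100_000)
print(narrow.std(), wide.std())
```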
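3.3.2 Indexing and slicing arrays
3.3.3 Changing shape
ndarray.reshape(shape) returns a new ndarray; the shape of the original is unchanged
ndarray.resize(shape) returns nothing; it modifies the original ndarray in place
ndarray.T transpose: rows become columns and columns become rows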
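reshape and T can produce the same shape with different contents, which is a common source of confusion. The sketch below uses a small illustrative array to show that reshape only regroups the elements in their existing order, while T actually swaps rows and columns.

```python
import numpy as np

a = np.arange(6).reshape((2, 3))   # [[0, 1, 2], [3, 4, 5]]

# Slicing: row 0, first two columns.
first_two = a[0, :2]
print(first_two)

# reshape keeps the element order and only regroups it into 3 rows of 2.
reshaped = a.reshape((3, 2))
print(reshaped)

# T swaps the axes, so each original column becomes a row.
transposed = a.T
print(transposed)
```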
3.3.4 Changing type
ndarray.astype(type)
Serializing an ndarray to raw bytes
ndarray.tobytes() (ndarray.tostring() is an older alias, deprecated since NumPy 1.19)
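A short sketch of type conversion and byte serialization, using illustrative names. It shows that astype to an integer type truncates toward zero, and that the raw bytes can be read back with np.frombuffer as long as the dtype is supplied again, since the bytes carry no type or shape information.

```python
import numpy as np

values = np.array([1.9, -1.9, 0.5])

# Converting float to int drops the fractional part (truncation toward zero).
as_int = values.astype(np.int32)
print(as_int)

# Serialize to raw bytes: 3 elements * 4 bytes each.
raw = as_int.tobytes()
print(len(raw))

# Restore from bytes; the dtype must be given explicitly.
restored = np.frombuffer(raw, dtype=np.int32)
print(restored)
```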

3.3.5 Removing duplicates
np.unique(), or set() on a flattened (one-dimensional) array
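As a minimal sketch of deduplication (names are illustrative): np.unique works on an array of any shape and returns the sorted distinct values, and its return_counts option also reports how often each value occurs. set() needs a flattened array, because the rows of a 2-D array are not hashable.

```python
import numpy as np

temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])

# Sorted distinct values plus the number of occurrences of each.
values, counts = np.unique(temp, return_counts=True)
print(values)
print(counts)

# set() gives the same distinct values once the array is flattened.
as_set = set(temp.flatten().tolist())
print(as_set)
```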
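3.4 ndarray operations

Logical operations

Boolean indexing

General-purpose test functions

np.all(boolean array)

returns False if any element is False; returns True only if all are True

np.any() returns True if any element is True; returns False only if all are False

np.where (ternary operator)

np.where(condition, value where True, value where False)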
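The logical operations above can be traced on a small fixed array, which makes the results easier to verify than with random data. The array change and the other names are illustrative.

```python
import numpy as np

change = np.array([[0.8, -0.2, 0.6],
                   [-0.7, 0.1, 0.9]])

# Boolean indexing: select the elements where the condition holds.
big_moves = change[change > 0.5]
print(big_moves)

# np.all / np.any reduce a boolean array to a single answer.
all_up = np.all(change > 0)
any_big = np.any(change > 0.5)
print(all_up, any_big)

# np.where as a ternary operator: 1 where positive, otherwise 0.
flags = np.where(change > 0, 1, 0)
print(flags)
```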

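Statistical operations

Statistical functions

min, max, mean, median, var, std

np.function_name

ndarray.method_name (median is available only as np.median)

Positions of the maximum and minimum values

np.argmax(temp, axis=)

np.argmin(temp, axis=)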
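The effect of the axis argument is easiest to see on a tiny array. In this sketch (illustrative values and names), axis=0 reduces down each column and axis=1 reduces across each row; argmax and argmin return index positions rather than values.

```python
import numpy as np

temp = np.array([[1, 5, 3],
                 [4, 2, 6]])

col_max = temp.max(axis=0)               # max of each column
row_max = temp.max(axis=1)               # max of each row
row_argmax = np.argmax(temp, axis=1)     # column index of each row's max
col_argmin = np.argmin(temp, axis=0)     # row index of each column's min
row_mean = temp.mean(axis=1)
overall_median = np.median(temp)         # function form only

print(col_max, row_max)
print(row_argmax, col_argmin)
print(row_mean, overall_median)
```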

3.5 Operations between arrays

3.5.1 Use cases

3.5.2 Array and scalar operations

3.5.3 Array and array operations

3.5.4 Broadcasting
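Broadcasting compares the shapes of the two arrays from the trailing dimension backwards; each pair of dimensions must be equal, or one of them must be 1 (or missing). The sketch below, with illustrative names, shows two compatible cases and one incompatible case that raises ValueError.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])            # shape (2, 3)
col = np.array([[10], [20]])         # shape (2, 1): stretched across columns
row = np.array([100, 200, 300])      # shape (3,): stretched across rows

plus_col = a + col
plus_row = a + row
print(plus_col)
print(plus_row)

# Shapes (2, 3) and (2, 2) disagree in the last dimension and neither is 1.
try:
    a + np.array([[1, 2], [3, 4]])
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
print(broadcast_failed)
```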

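3.5.5 Matrix operations

1 What is a matrix

A matrix is a two-dimensional array

Matrix vs. two-dimensional array

Two ways to store a matrix

1) ndarray: a two-dimensional array

Matrix multiplication:

np.matmul

np.dot

2) The matrix data structure

2 Matrix multiplication

Shape

(m, n) * (n, l) = (m, l)

Rule

A (2, 3), B (3, 2): A * B has shape (2, 2)

Here * denotes the matrix product; on ndarrays the * operator is element-wise, so use np.matmul or np.dot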
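The shape rule can be worked through on concrete numbers. In this sketch A and B are illustrative values chosen so the products are easy to check by hand: each entry of the result is the dot product of a row of A with a column of B.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
B = np.array([[1, 2],
              [3, 4],
              [5, 6]])           # shape (3, 2)

# (2, 3) times (3, 2) gives (2, 2).
C = np.matmul(A, B)
print(C.shape)
print(C)

# For 2-D ndarrays, np.dot and the @ operator give the same result.
same_as_dot = np.array_equal(C, np.dot(A, B))
same_as_at = np.array_equal(C, A @ B)
print(same_as_dot, same_as_at)
```

For example, the top-left entry is 1*1 + 2*3 + 3*5 = 22.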

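3.6 Merging and splitting

3.7 IO operations and data processing

3.7.1 Reading data with Numpy

3.7.2 Handling missing values

Two approaches:

Drop the samples that contain missing values

Replace / impute

Compute the mean of each column and use it to fill the gaps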
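Both approaches can be sketched compactly on a small array. This version assumes np.nanmean for the column means, which is a different route from the explicit loop in Case 8 but should give the same result; the data and names are made up for illustration.

```python
import numpy as np

data = np.array([[1.0, 2.0],
                 [np.nan, 4.0],
                 [3.0, np.nan]])

# Approach 1: drop every row (sample) that contains a NaN.
dropped = data[~np.isnan(data).any(axis=1)]
print(dropped)

# Approach 2: fill each NaN with the mean of its column, ignoring NaNs.
col_means = np.nanmean(data, axis=0)
filled = np.where(np.isnan(data), col_means, data)
print(col_means)
print(filled)
```

Dropping rows is simple but can discard a lot of data; imputation keeps every sample at the cost of introducing estimated values.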

Case 1: Efficiency of a native list vs. an ndarray

# Speed comparison: ndarray vs. native Python list
import numpy as np
import random
import time

# Build a list of random floats (lower the count if memory is limited)
python_list = []
for i in range(100000000):
    python_list.append(random.random())
ndarray_list = np.array(python_list)

# Sum the native list
t1 = time.time()
a = sum(python_list)
t2 = time.time()
d1 = t2 - t1

# Sum the ndarray
t3 = time.time()
b = np.sum(ndarray_list)
t4 = time.time()
d2 = t4 - t3

print("list time: {}".format(d1), "ndarray time: {}".format(d2))

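Case 2: ndarray attributes

# ndarray attributes
import numpy as np
score = np.array([[80, 89, 86, 67, 79],
                  [78, 97, 89, 67, 81],
                  [90, 94, 78, 67, 74],
                  [91, 91, 90, 67, 69],
                  [76, 87, 75, 67, 86],
                  [70, 79, 84, 67, 84],
                  [94, 92, 93, 67, 64],
                  [86, 85, 83, 67, 80]])
# shape: the length of each dimension
print(score.shape)
# ndim: the number of dimensions
print(score.ndim)
# size: the total number of elements
print(score.size)
# dtype: the data type of the elements
print(score.dtype)
# itemsize: the number of bytes each element occupies
print(score.itemsize)
# With no type specified, integers default to int64 (on most 64-bit platforms) and floats to float64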

Case 3: Ways to generate arrays

# Ways to generate arrays
import numpy as np
import matplotlib.pyplot as pt
# 1. Arrays of 0s and 1s
print(np.zeros(shape=(3, 4), dtype="float32"))
print(np.ones(shape=[2, 3], dtype=np.int32))
# 2. From an existing array
score = np.array([[1, 2], [3, 4]])
# np.array(): deep copy
data1 = np.array(score)
# np.asarray(): shallow copy (shares data with score)
data2 = np.asarray(score)
# np.copy(): deep copy
data3 = np.copy(score)
# Modify the original; only data2 reflects the change
score[0, 0] = 10
print('np.array() value:', data1)
print('np.asarray() value:', data2)
print('np.copy() value:', data3)
# 3. Arrays over a fixed range
print(np.linspace(0, 10, 5))
print(np.arange(0, 11, 5))
# 4. Random arrays
# Uniformly distributed random numbers (every bin is equally likely)
uniform = np.random.uniform(low=-1, high=1, size=100000)
pt.figure()
pt.hist(uniform, 100)
pt.show()
# Normally distributed random numbers
normal = np.random.normal(loc=1.75, scale=0.1, size=100000)
# 1. Create the canvas
pt.figure(figsize=(20, 8), dpi=80)
# 2. Draw the histogram
pt.hist(normal, 1000)
# 3. Show the figure
pt.show()

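Case 4: Slicing, indexing, and changing shape

# Slicing an array
import numpy as np
# Requirement: randomly generate the daily changes of 8 stocks over 2 weeks of trading days
stock_change = np.random.normal(loc=0, scale=1, size=(8, 10))
# Get the changes of the first stock over the first 3 trading days
print(stock_change[0, :3])
# Changing the shape
# Requirement: swap the stock rows and date columns, so that dates are rows and stocks are columns
# reshape and resize only regroup the elements; only .T actually swaps rows and columns
print(stock_change.reshape((10, 8)))  # returns a new ndarray; the original is unchanged
print(stock_change.resize((10, 8)))  # returns None; modifies the original ndarray in place
print(stock_change)
stock_change.resize((8, 10))
print(stock_change.T)  # transpose: rows become columns and columns become rows
print(stock_change)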

Case 5: Changing type and removing duplicates

# Type conversion and deduplication
import numpy as np
stock_change = np.random.normal(loc=0, scale=1, size=(8, 10))
print(stock_change.astype("int32"))
# tobytes() replaces the deprecated tostring()
stock_change.tobytes()
temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
print(np.unique(temp))
print(set(temp.flatten()))

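Case 6: Logical operations, statistical operations, array-scalar operations, array-array operations, and matrix operations

# Logical operations
import numpy as np
stock_change = np.random.normal(loc=0, scale=1, size=(8, 10))
# Element-wise test: True where the change is greater than 0.5, otherwise False
print(stock_change > 0.5)
# Boolean indexing: set every change greater than 0.5 to 1.1
stock_change[stock_change > 0.5] = 1.1
print(stock_change)
# Check whether stock_change[0:2, 0:5] are all gains
print(stock_change[0:2, 0:5] > 0)
# np.all: False if any element is False; True only if all are True
print(np.all(stock_change[0:2, 0:5] > 0))
# np.any: True if any element is True; False only if all are False
# Check whether any of the first 5 stocks rose during this period
print(np.any(stock_change[:5, :] > 0))
# Ternary operator
# First four stocks, first four days: 1 where the change is greater than 0, otherwise 0
temp = stock_change[:4, :4]
print(np.where(temp > 0, 1, 0))
# 1 where the change is greater than 0.5 and less than 1, otherwise 0
print(np.where(np.logical_and(temp > 0.5, temp < 1), 1, 0))
# 1 where the change is greater than 0.5 or less than -0.5, otherwise 0
print(np.where(np.logical_or(temp > 0.5, temp < -0.5), 1, 0))

# Statistical operations
# Largest gains of the first four stocks over the first four days
print(temp)  # shape (4, 4): axis 0, axis 1
print(temp.max(axis=0))  # axis=0 takes the max of each column; axis=1 takes the max of each row
print(np.max(temp, axis=-1))
print(np.argmax(temp, axis=-1))

# Array and scalar operations
arr = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])
print(arr / 10)

# Array and array operations
# Broadcasting condition: each pair of dimensions is equal, or one of the two is 1
# e.g. (8, 1) and (8, 3) are compatible; (8, 1) and (7, 3) are not; (8, 2) and (8, 3) are not
arr1 = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])
arr2 = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
# arr1 + arr2 would raise ValueError: shapes (2, 6) and (2, 4) cannot be broadcast
arr2 = np.array([[1], [3]])
print(arr1 + arr2)

# Matrix operations
# Store the matrix as an ndarray
data = np.array([[80, 86],
                 [82, 80],
                 [85, 78],
                 [90, 90],
                 [86, 82],
                 [82, 90],
                 [78, 80],
                 [92, 94]])

# Store the matrix as a matrix object
# (np.matrix is used here because the np.mat alias was removed in NumPy 2.0)
data_mat = np.matrix([[80, 86],
                      [82, 80],
                      [85, 78],
                      [90, 90],
                      [86, 82],
                      [82, 90],
                      [78, 80],
                      [92, 94]])
print(type(data))
print(type(data_mat))
# (8, 2) * (2, 1) = (8, 1)
weights = np.array([[0.3], [0.7]])
print(np.matmul(data, weights))
print(np.dot(data, weights))
weights_mat = np.matrix([[0.3], [0.7]])
# For matrix objects, * is matrix multiplication
print(data_mat * weights_mat)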

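Case 7: Merging and splitting arrays

# Merging
import numpy as np
stock_change = np.random.normal(loc=0, scale=1, size=(8, 10))
a = stock_change[:2, 0:4]
b = stock_change[4:6, 0:4]
# Horizontal merge: hstack is equivalent to concatenate with axis=1
print(np.hstack((a, b)))
print(np.concatenate((a, b), axis=1))
# Vertical merge: vstack is equivalent to concatenate with axis=0
print(np.vstack((a, b)))
print(np.concatenate((a, b), axis=0))
# Splitting at indices 3, 5, 6 and 10 (the last piece is empty because x has only 8 elements)
x = np.arange(8.0)
print(np.split(x, [3, 5, 6, 10]))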

Case 8: Reading data with numpy

# Reading data with numpy
import numpy as np
data = np.genfromtxt("test.csv", delimiter=",")
def fill_nan_by_column_mean(t):
    for i in range(t.shape[1]):
        # Count the NaNs in column i
        nan_num = np.count_nonzero(np.isnan(t[:, i]))
        if nan_num > 0:
            now_col = t[:, i]
            # Sum of the non-NaN values
            now_col_not_nan = now_col[~np.isnan(now_col)].sum()
            # Mean = sum / number of non-NaN values
            now_col_mean = now_col_not_nan / (t.shape[0] - nan_num)
            # Fill the NaNs in now_col with the mean
            now_col[np.isnan(now_col)] = now_col_mean
            # Write the column back into t
            t[:, i] = now_col
    return t
print(fill_nan_by_column_mean(data))