Kaggle Notes

Florin Wang

2026-05-01

GAME

Pandas

A very useful tool for working with tabular data.

import pandas as pd

melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.columns

data.describe()     # summary statistics for each numeric column: count, mean, std, min, quartiles, max
data['123'].mean()  # a single statistic for one selected column
data.head()         # quick look at the first few rows

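Model Validation

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
# Randomly split the data into a training set and a validation set; by default 75% goes to training and 25% to validation

melbourne_model = DecisionTreeRegressor(random_state=0)
melbourne_model.fit(train_X, train_y)
# Fit the model on the training set

val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
# Mean absolute error (MAE) on the validation set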

Decision Trees

y = home_data.SalePrice
# Select the prediction target

feature_names = ['LotArea',
                 'YearBuilt',
                 '1stFlrSF',
                 '2ndFlrSF',
                 'FullBath',
                 'BedroomAbvGr',
                 'TotRmsAbvGrd']
# Select the features to use

X = home_data[feature_names]
# Feature matrix X

from sklearn.tree import DecisionTreeRegressor
iowa_model = DecisionTreeRegressor(random_state=1)
# Build a decision tree; random_state fixes the random seed so results are reproducible

iowa_model.fit(X, y)
# Fit the model

predictions = iowa_model.predict(X)
print(predictions)
# Try out predictions (here on the training data itself)

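Overfitting

Limiting the maximum number of leaf nodes helps to prevent overfitting. Too few leaves underfit and too many overfit, so compare several values on the validation set.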

from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return(mae)


for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

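Random Forests

Usage is exactly the same as for the decision tree above; only the underlying algorithm changes.

from sklearn.ensemble import RandomForestRegressor
RandomForestRegressor(random_state=1)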

Deep Learning

A Basic Neuron

input_shape = [8]  # the input has 8 features

from tensorflow import keras
from tensorflow.keras import layers

# Create a network with 1 linear unit
model = keras.Sequential([
    layers.Dense(units=1, input_shape=[3])
])  # Sequential: a linear stack of layers

w, b = model.weights  # inspect the weights (kernel w and bias b)

Model

model = keras.Sequential([
    layers.Dense(32, input_shape=[8]),
    layers.Activation('relu'),
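    # Put an activation between two Dense layers; this is equivalent to Dense(32, activation='relu'), so the activation is not counted as a layer of its own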
    layers.Dense(32),
    layers.Activation('relu'),
    layers.Dense(1),
])

Optimizer

model.compile(
    optimizer="adam",
    loss="mae",
)
# Attach the optimizer and the loss function

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(128, activation='relu', input_shape=input_shape),
    layers.Dense(128, activation='relu'),    
    layers.Dense(64, activation='relu'),
    layers.Dense(1),
])
model.compile(
    optimizer='adam',
    loss='mae',
)
history = model.fit(
    X, y,
    batch_size=128,
    epochs=200,
)
import pandas as pd

history_df = pd.DataFrame(history.history)
history_df.loc[5:, ['loss']].plot();

from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=20, # how many epochs to wait before stopping
    restore_best_weights=True,
)  # early-stopping callback
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=256,
    epochs=500,
    callbacks=[early_stopping], # put your callbacks in a list
    verbose=0,  # turn off the training log
)

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();
print("Minimum validation loss: {}".format(history_df['val_loss'].min()))

Activation Functions

relu elu selu swish mish gelu

as well as tanh, softmax, and others.

Dense

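Fully connected layer.

Dropout

Randomly removes some units from each training step. The network can then rely less on narrow, local patterns and has to learn more general ones, which helps to prevent overfitting.

layers.Dropout(rate=0.3),  # randomly drop 30% of the units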
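As a rough illustration of what a dropout layer does during training, here is a minimal NumPy sketch of "inverted" dropout. The function name `dropout` and its arguments are made up for this example; Keras performs the equivalent masking and rescaling internally.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is only active during training
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True means the unit is kept
    # Dividing by keep_prob keeps the expected activation unchanged
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((4, 10))
dropped = dropout(x, rate=0.3, rng=rng)
print(dropped)
```

Each entry of `dropped` is either 0 (dropped) or 1 / 0.7 (kept and rescaled).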

BatchNormalization

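Batch normalization: rescales the activations of each batch to roughly zero mean and unit variance, followed by a learnable scale and shift.

layers.BatchNormalization(),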
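The arithmetic of the training-time forward pass can be sketched in NumPy as follows. This is a simplified illustration: `batch_norm`, `gamma`, `beta`, and `eps` are names chosen for the example, and the running statistics a real layer keeps for inference are left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch axis, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=50.0, scale=10.0, size=(256, 3))  # badly scaled inputs
out = batch_norm(batch, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0), out.std(axis=0))
```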

Binary Classification

Building the Model

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.BatchNormalization(input_shape=input_shape),
    layers.Dense(256,activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(256,activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    layers.Dense(1,activation='sigmoid'),
])
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['binary_accuracy'],
)  # attach the optimizer, the loss, and the metric

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer

hotel = pd.read_csv('../input/dl-course-data/hotel.csv')

X = hotel.copy()
y = X.pop('is_canceled')

X['arrival_date_month'] = \
    X['arrival_date_month'].map(
        {'January':1, 'February': 2, 'March':3,
         'April':4, 'May':5, 'June':6, 'July':7,
         'August':8, 'September':9, 'October':10,
         'November':11, 'December':12}
    )

features_num = [
    "lead_time", "arrival_date_week_number",
    "arrival_date_day_of_month", "stays_in_weekend_nights",
    "stays_in_week_nights", "adults", "children", "babies",
    "is_repeated_guest", "previous_cancellations",
    "previous_bookings_not_canceled", "required_car_parking_spaces",
    "total_of_special_requests", "adr",
]
features_cat = [
    "hotel", "arrival_date_month", "meal",
    "market_segment", "distribution_channel",
    "reserved_room_type", "deposit_type", "customer_type",
]

transformer_num = make_pipeline(
    SimpleImputer(strategy="constant"), # there are a few missing values
    StandardScaler(),
)
transformer_cat = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown='ignore'),
)

preprocessor = make_column_transformer(
    (transformer_num, features_num),
    (transformer_cat, features_cat),
)

# stratify - make sure classes are evenly represented across splits
X_train, X_valid, y_train, y_valid = \
    train_test_split(X, y, stratify=y, train_size=0.75)

X_train = preprocessor.fit_transform(X_train)
X_valid = preprocessor.transform(X_valid)

input_shape = [X_train.shape[1]]
# The code above loads and preprocesses the dataset

early_stopping = keras.callbacks.EarlyStopping(
    patience=5,
    min_delta=0.001,
    restore_best_weights=True,
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_valid, y_valid),
    batch_size=512,
    epochs=200,
    callbacks=[early_stopping],
)

history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot(title="Cross-entropy")
history_df.loc[:, ['binary_accuracy', 'val_binary_accuracy']].plot(title="Accuracy")
# The code above trains the model and plots the learning curves

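Handling Missing Values

Drop the entire row (or column) outright (this may lose important information).
Imputation (fill in each missing value, for example with the column mean).
Imputation plus an indicator: add a new feature such as "missing", set it to True where the value was absent, then fill the gap with the mean.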
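The three approaches above can be sketched with pandas on a toy table. The column names `rooms`, `price`, and `rooms_was_missing` are invented for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [300, 250, 500, 410],
})

# Approach 1: drop every row that contains a missing value
dropped = df.dropna()

# Approach 2: imputation, here with the column mean
imputed = df.fillna(df.mean())

# Approach 3: record where the value was missing, then impute
extended = df.copy()
extended["rooms_was_missing"] = extended["rooms"].isna()
extended["rooms"] = extended["rooms"].fillna(extended["rooms"].mean())

print(extended)
```

In a real project the mean should be computed on the training split only and then reused for the validation split.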

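Handling Categorical Variables

Drop them outright.
Ordinal encoding: map the values to increasing integers when they have a natural order, e.g. "occasionally" < "often".
One-hot encoding: give every value its own indicator column, e.g. "yellow", "green", "apple", "banana".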
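A small pandas sketch of the two encodings above. The toy columns `frequency` and `color` and the chosen category order are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "frequency": ["never", "rarely", "often", "rarely"],
    "color": ["yellow", "green", "yellow", "red"],
})

# Ordinal encoding: the categories have a natural order, so map them to increasing integers
order = {"never": 0, "rarely": 1, "often": 2}
df["frequency_encoded"] = df["frequency"].map(order)

# One-hot encoding: no natural order, so create one indicator column per value
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df)
print(one_hot)
```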

Pipeline

A convenient way to bundle preprocessing and modeling into one standardized workflow.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

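Cross-Validation

In each round a different part (fold) of the data is held out for validation and the model is trained on the rest. Because the validation fold changes every round, and data used for training in one round may be used for validation in the next, it is called cross-validation.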
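To make the rotation of folds concrete, here is a minimal NumPy sketch of k-fold cross-validation. A plain least-squares fit stands in for "the model", and the helper names `k_fold_indices` and `cross_val_mae` are invented for this example; scikit-learn provides ready-made utilities for the same job.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds, rng):
    """Shuffle the row indices and split them into n_folds validation folds."""
    return np.array_split(rng.permutation(n_samples), n_folds)

def cross_val_mae(X, y, n_folds=5, seed=0):
    """Return one validation MAE per fold."""
    rng = np.random.default_rng(seed)
    folds = k_fold_indices(len(X), n_folds, rng)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        A_val = np.column_stack([X[val_idx], np.ones(len(val_idx))])
        scores.append(np.mean(np.abs(A_val @ coef - y[val_idx])))
    return np.array(scores)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)  # synthetic data

scores = cross_val_mae(X, y)
print("MAE per fold:", scores)
print("Mean MAE:", scores.mean())
```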

XGBoost

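A gradient boosting library, used here with tree models. Boosting builds an additive ensemble: each new tree is fitted to the errors of the trees before it, and its output is added to the running prediction. Reportedly this additive requirement is why it is used with trees rather than neural networks.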

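Data Leakage

Target leakage: features that are linked to the target in time. You cannot reverse time: a feature whose value only becomes available after the moment you make the prediction must not be used for training.
People who have pneumonia usually take antibiotics.
But at the moment pneumonia is being diagnosed they have not taken antibiotics yet, so a model that uses this feature reverses time, which is wrong.

The other kind is test data accidentally leaking into training: the model has been trained on the data it is evaluated on, so the score is naturally very high.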

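Time Series

Predict the future from historical data.

Two kinds of features are specific to time series: time-step features and lag features.

Time-Step Features and Lag Features

Take sales as an example.

If the model finds that sales tend to be better on weekends, that is a time-step feature: it depends on the timestamp itself.
If the model finds that good sales last Wednesday tend to be followed by good sales this Wednesday, that is a lag feature.

The difference is that a time-step feature depends on the timestamp itself, while a lag feature depends on the values of earlier days (the influence of earlier data "lingers" into today).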
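A small pandas sketch of the two feature types on an invented week of daily sales. The column names `time`, `is_weekend`, and `lag_1` are choices made for this example.

```python
import pandas as pd

sales = pd.DataFrame(
    {"sales": [10, 12, 13, 15, 14, 30, 32]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"),  # starts on a Monday
)

# Time-step features are derived from the timestamp alone
sales["time"] = range(len(sales))
sales["is_weekend"] = sales.index.dayofweek >= 5

# A lag feature is an earlier value of the series itself
sales["lag_1"] = sales["sales"].shift(1)

print(sales)
```

The first row has no previous day, so its `lag_1` is missing and is usually dropped or filled before training.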

# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.time_series.ex1 import *

# Setup notebook
from pathlib import Path
from learntools.time_series.style import *  # plot style settings

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.linear_model import LinearRegression


data_dir = Path('../input/ts-course-data/')
comp_dir = Path('../input/store-sales-time-series-forecasting')

book_sales = pd.read_csv(
    data_dir / 'book_sales.csv',
    index_col='Date',
    parse_dates=['Date'],
).drop('Paperback', axis=1)
book_sales['Time'] = np.arange(len(book_sales.index))
book_sales['Lag_1'] = book_sales['Hardcover'].shift(1)
book_sales = book_sales.reindex(columns=['Hardcover', 'Time', 'Lag_1'])

ar = pd.read_csv(data_dir / 'ar.csv')

dtype = {
    'store_nbr': 'category',
    'family': 'category',
    'sales': 'float32',
    'onpromotion': 'uint64',
}
store_sales = pd.read_csv(
    comp_dir / 'train.csv',
    dtype=dtype,
    parse_dates=['date'],
    infer_datetime_format=True,
)
store_sales = store_sales.set_index('date').to_period('D')
store_sales = store_sales.set_index(['store_nbr', 'family'], append=True)
average_sales = store_sales.groupby('date').mean()['sales']

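Trend

The trend is the persistent, long-term change in the mean of the series.

Moving Average Plots

These are simply plots like the 5-day or 10-day moving average line; nothing fancy.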
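For instance, a 5-step moving average can be computed with pandas `rolling`. The numbers below are invented, and the centered variant is shown only as one common option.

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

# Trailing 5-step average: undefined until 5 observations are available
ma_5 = prices.rolling(window=5).mean()

# Centered average, allowing shorter windows at the edges
ma_5_centered = prices.rolling(window=5, center=True, min_periods=3).mean()

print(pd.DataFrame({"price": prices, "ma_5": ma_5, "ma_5_centered": ma_5_centered}))
```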

Trend Modeling

Fit a curve to the series against a time axis.

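The upper and lower plots use the following models, respectively:

target = a * time + b
target = a * time ** 2 + b * time + c

High-order polynomials are generally not recommended, because the fit is dominated by the highest-order term, which makes forecasts beyond the observed range unreliable.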
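The two trend models can be fitted with `numpy.polyfit`. This is a sketch on synthetic data; the coefficients 0.5, -3.0, and 4.0 are made up so that the result is easy to check.

```python
import numpy as np

time = np.arange(20, dtype=float)
target = 0.5 * time ** 2 - 3.0 * time + 4.0  # synthetic series with a quadratic trend

linear_coefs = np.polyfit(time, target, deg=1)     # target = a * time + b
quadratic_coefs = np.polyfit(time, target, deg=2)  # target = a * time**2 + b * time + c

# Extrapolate both trends a few steps past the observed data
future = np.array([25.0])
linear_forecast = np.polyval(linear_coefs, future)
quadratic_forecast = np.polyval(quadratic_coefs, future)

print("linear:", linear_coefs, linear_forecast)
print("quadratic:", quadratic_coefs, quadratic_forecast)
```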

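Seasonality

There are two ways to fit it: seasonal indicators or Fourier features.

Seasonal Indicators

Choose the features by hand from the seasonal plot, then drop one of them (its effect is absorbed into the bias). For a weekly cycle, for example, drop Monday, so that Monday's effect becomes part of the bias. This suits short seasons with few observations per cycle, such as a weekly cycle in daily data.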
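A sketch of weekly indicators with pandas, dropping Monday as described above. The date range and the `day_` prefix are arbitrary choices for this example.

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=14, freq="D")
day = pd.Series(dates.day_name(), index=dates)

# One indicator column per weekday
indicators = pd.get_dummies(day, prefix="day")

# Drop Monday: its effect is absorbed into the model's bias (intercept)
indicators = indicators.drop(columns="day_Monday")

print(indicators.head(8))
```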

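Fourier Features

A Fourier series is essentially a stack of sine and cosine curves added together, so a few pairs can describe a long season without needing one indicator per observation.
A periodogram can be used to decide how many Fourier pairs to include.

(For example, the seasons of the year; a week is too short and is better modeled with indicators.)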
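Fourier features are just sine/cosine pairs at multiples of the seasonal frequency. Here is a minimal NumPy sketch; the function name `fourier_features` and the choice of 4 pairs for an annual season are assumptions for illustration.

```python
import numpy as np

def fourier_features(time, period, order):
    """Return `order` sin/cos pairs for a season of length `period` (in time steps)."""
    columns = []
    for k in range(1, order + 1):
        angle = 2.0 * np.pi * k * time / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)

t = np.arange(730)  # two years of daily time steps
X_fourier = fourier_features(t, period=365.25, order=4)
print(X_fourier.shape)
```

Eight columns describe the annual season here, instead of hundreds of day-of-year indicators.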

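Serial Dependence and Time Dependence

These mirror the two feature types above, framed as dependence instead.
Serial dependence relates to the past values of the series; time dependence relates to time itself.

Time Series Features

The course introduces a new plot, the partial autocorrelation plot.

Use it to judge which lags have a large effect, and therefore how many lags to include to model the series well.

There is also the lag plot, used to check whether the series and its lag are linearly related (if not, a transformation towards a linear relationship may be worth trying).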

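Hybrid Models

Feature Engineering

A model performs best when the relationships it is asked to learn are ones it is able to learn, so we work on the data.

A model trained on the data without any processing is called the baseline; new features can then be synthesized from the existing ones.

For the concrete dataset, for example, add features such as the proportion of water. Constructing features that relate as linearly as possible to the target helps the model recognize patterns and predict better.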

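MI (Mutual Information)

Measures how much knowing a feature reduces uncertainty about the target.

It can be understood by grouping. For example, suppose 50% of customers bought a given product and 50% did not.
Split the customers by sex, and suppose we find that 90% of the men did not buy while 90% of the women did; the sex feature is then very informative.

Everyday examples of digging out informative features are easy to find.
For instance, the first few digits of a phone number are the area code, and the area may carry a hint, so a phone number is best split into its meaningful groups.
Aggregations can then be applied within each group, for example counting how many people belong to the group.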
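The purchase example can be worked through numerically. The sketch below computes mutual information (in bits) from a joint probability table, assuming equal numbers of men and women, which the notes do not state but which matches the 50/50 overall split. The helper name `mutual_information` is invented here.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in bits between the row variable and the column variable."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_row = joint.sum(axis=1, keepdims=True)
    p_col = joint.sum(axis=0, keepdims=True)
    expected = p_row @ p_col  # joint distribution if the two variables were independent
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / expected[nonzero])))

# Rows: men, women. Columns: did not buy, bought.
joint = [[0.45, 0.05],
         [0.05, 0.45]]
mi = mutual_information(joint)

# If sex told us nothing about the purchase, the score would be zero
independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])

print(round(mi, 3), independent)
```

Without the feature the outcome carries 1 bit of uncertainty; knowing the customer's sex removes about 0.53 bits of it.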

K-means

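Because it is unsupervised (it uses no labels), it is often used to discover features through clustering.

PCA (Principal Component Analysis)

Finds the core components of the data as linear combinations of the original features, along the directions in which the data varies the most.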

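Target Encoding

A way to map string labels to numbers.
Giving every brand its own one-hot column, as above, leads to the curse of dimensionality, while encoding brands directly as integers makes the model read meaning into the numeric order, so target encoding is used instead.

Target encoding replaces a brand's label with the mean of the target for that brand. For example, if the cryptocurrency FluuC was sold three times at 50, 100, and 150, FluuC is replaced by the mean price, 100.

However, target encoding overfits easily when a brand has very few rows: with a single row for some brand, the encoding can be badly skewed. Data leakage must also be avoided.

Smoothing

To address brands with too little data, apply smoothing: design the formula so that the encoded value tends toward the overall mean when the brand has few rows and toward the brand's own mean when it has many.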
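One common formula with this behavior is the m-estimate: encoding = weight * brand_mean + (1 - weight) * overall_mean, with weight = n / (n + m), where n is the number of rows for the brand and m is a smoothing factor you choose. The pandas sketch below uses the FluuC prices from above plus an invented one-row brand called "Rare"; the function name and m = 2 are assumptions for illustration.

```python
import pandas as pd

def smoothed_target_encoding(df, category_col, target_col, m):
    """Blend each category's target mean with the overall mean, weighted by row count."""
    overall_mean = df[target_col].mean()
    stats = df.groupby(category_col)[target_col].agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + m)
    return weight * stats["mean"] + (1 - weight) * overall_mean

df = pd.DataFrame({
    "brand": ["FluuC", "FluuC", "FluuC", "Rare"],  # "Rare" is a hypothetical one-row brand
    "price": [50, 100, 150, 1000],
})

encoding = smoothed_target_encoding(df, "brand", "price", m=2)
unsmoothed = smoothed_target_encoding(df, "brand", "price", m=0)
print(encoding)
print(unsmoothed)
```

With m = 2 the one-row brand is pulled from 1000 towards the overall mean of 325. To avoid leakage, the encoding should be computed on a split that is not also used to train the model.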

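Scaling and Normalization

Scaling maps the data linearly into a new range without changing the shape of its distribution; normalization changes the shape so that the data becomes approximately normally distributed (for example via the Box-Cox transformation).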
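The difference can be seen on skewed synthetic data. In this NumPy sketch the helpers `min_max_scale`, `box_cox`, and `skewness` are written out by hand for illustration, and lambda is fixed at 0 (the log transform) rather than estimated from the data as a library routine would do.

```python
import numpy as np

def min_max_scale(x):
    """Scaling: map the data linearly onto [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def box_cox(x, lam):
    """Box-Cox transform for strictly positive data with a fixed lambda."""
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

def skewness(x):
    centered = x - x.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

scaled = min_max_scale(skewed)         # new range, same shape
normalized = box_cox(skewed, lam=0.0)  # new shape, close to a normal distribution

print("skewness raw / scaled / normalized:",
      skewness(skewed), skewness(scaled), skewness(normalized))
```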

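CV (Computer Vision)

The rough CV pipeline is: first break the image into several pieces (using a feature classifier), then

CNN (Convolutional Neural Network)

Uses convolution kernels, each sensitive to a particular pattern, which is what lets the network recognize patterns in images.

Convolutional Classifier

image -> base (convolutional base) -> head (dense head) -> result

The convolutional base breaks the image down and extracts features. Extraction itself does not decide which class the features belong to, so a convolutional base can be reused.
The dense head does the recognition from those features, so it needs to be trained.

Feature Extraction

image -(convolve with a kernel)-> features -(apply ReLU)-> more pronounced features -(apply pooling)-> condensed features.

Convolution is also called filtering.
ReLU produces large regions of zeros, which carry no useful information, so a pooling layer is applied to condense the features.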
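The filter -> ReLU -> pooling chain can be traced by hand on a tiny image. This NumPy sketch uses an invented 6x6 image with a vertical edge and a hand-picked edge kernel; as in deep learning libraries, the "convolution" slides the kernel without flipping it.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are cut off."""
    h = x.shape[0] // size * size
    w = x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0  # dark left half, bright right half

kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])  # responds to dark-to-bright vertical edges

features = convolve2d_valid(image, kernel)  # filter
activated = relu(features)                  # detect
pooled = max_pool(activated)                # condense

print(features)
print(pooled)
```

Mirroring the image turns the edge into bright-to-dark; the filter response is then negative and ReLU sets it to zero.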

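Sliding Window

Take the max (or apply the kernel) over a window that slides across the image; the stride and the sampling are set by you.

Custom Convnets

Convolutional block: (convolution -> ReLU -> pooling)

Data Augmentation

The original images can be flipped or rotated. As long as the subject of the image stays the same, such transformations are harmless, so they enlarge the dataset and help the model generalize.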
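A minimal NumPy sketch of flip-and-rotate augmentation. The `augment` helper and the 4x4 stand-in "image" are invented for this example; deep learning frameworks ship their own augmentation layers.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped and rotated copy of an (H, W) or (H, W, C) image."""
    if rng.random() < 0.5:
        image = np.fliplr(image)  # horizontal flip
    k = int(rng.integers(0, 4))   # rotate by 0, 90, 180, or 270 degrees
    return np.rot90(image, k=k, axes=(0, 1))

rng = np.random.default_rng(0)
image = np.arange(16).reshape(4, 4)  # stand-in for a real image
batch = [augment(image, rng) for _ in range(8)]

for variant in batch[:2]:
    print(variant)
```

Whether a transformation is safe depends on the task: flipping a photo of a car is fine, while flipping a handwritten digit may change its meaning.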

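Miscellaneous

To submit on Kaggle, save a version in the cloud with Run All, then choose the output file from that version and submit it for scoring.
A row runs horizontally; a column runs vertically.
Neural network training relies on derivatives. Saturating activation functions such as sigmoid and tanh have derivatives close to zero when the input is very large or very small, so learning stalls. ReLU has a derivative of exactly 1 for positive inputs, which avoids the problem of the network becoming untrainable.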
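The effect is easy to check numerically. This sketch compares the derivative of sigmoid (used here as a representative saturating activation) with that of ReLU at a few arbitrary inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # 1 for positive inputs, 0 otherwise
    return (x > 0).astype(float)

x = np.array([-10.0, -1.0, 0.5, 10.0])
print("sigmoid'(x):", sigmoid_grad(x))
print("relu'(x):   ", relu_grad(x))
```

The sigmoid derivative never exceeds 0.25 and is almost zero at x = 10, while the ReLU derivative stays at 1 for every positive input (and is 0 for negative inputs).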