Summary of Hung-yi Lee's Machine Learning Homework

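I first did these assignments four or five months ago, when I had just started with deep learning, but I felt I had rushed through them without thinking carefully about the details. Recently, during the Datawhale summer camp, I was badly burned by building and training models and realized how poor my hyperparameter tuning skills were, so I picked Professor Lee's homework back up to work through it carefully from the start.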

Homework1

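Assignment Overview

A regression problem: given a number of patients with different symptoms, predict the probability of a positive COVID-19 result from those symptoms.

Baselines:

Tuning Log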

【Simple Baseline】

Just run the original code.

【Medium + Strong Baseline】

Modify the feature selection (drop the first 35 columns and keep the rest, i.e. do not use the region as a learning feature):

def select_feat(train_data, valid_data, test_data, select_all=True):
    '''Selects useful features to perform regression'''
    y_train, y_valid = train_data[:,-1], valid_data[:,-1]
    raw_x_train, raw_x_valid, raw_x_test = train_data[:,:-1], valid_data[:,:-1], test_data

    if select_all:
        feat_idx = list(range(raw_x_train.shape[1]))
    else:
        feat_idx = list(range(35, raw_x_train.shape[1])) # TODO: Select suitable feature columns.

    return raw_x_train[:,feat_idx], raw_x_valid[:,feat_idx], raw_x_test[:,feat_idx], y_train, y_valid

Then set select_all to False:

device = 'cuda' if torch.cuda.is_available() else 'cpu'
config = {
    'seed': 5201314,      # Your seed number, you can pick your lucky number. :)
    'select_all': False,   # Whether to use all features.
    'valid_ratio': 0.2,   # validation_size = train_size * valid_ratio
    'n_epochs': 5000,     # Number of epochs.
    'batch_size': 256,
    'learning_rate': 1e-5,
    'early_stop': 600,    # If model has not improved for this many consecutive epochs, stop training.
    'save_path': './models/model.ckpt'  # Your model will be saved here.
}

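Result:

Oddly, the TAs' hint says that selecting specific features gets you to the medium baseline, while reaching the strong baseline also requires improving the model. Yet feature selection alone was enough to pass the strong baseline. It seems the saying that "data matters far more than the model" is not without merit.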

【Boss Baseline】

From the experience with the previous baseline, feature selection is very important, so I used a library to select the best k features:

from sklearn.feature_selection import SelectKBest, f_regression

def select_feat(train_data, valid_data, test_data, select_all=True):
    '''Selects useful features to perform regression'''
    y_train, y_valid = train_data[:,-1], valid_data[:,-1]
    raw_x_train, raw_x_valid, raw_x_test = train_data[:,:-1], valid_data[:,:-1], test_data

    if select_all:
        feat_idx = list(range(raw_x_train.shape[1]))
    else:
        # TODO: Select suitable feature columns.
        selector = SelectKBest(score_func=f_regression, k=24)
        result = selector.fit(raw_x_train, y_train)
        idx = np.argsort(result.scores_)[::-1]
        feat_idx = list(np.sort(idx[:24]))

    return raw_x_train[:,feat_idx], raw_x_valid[:,feat_idx], raw_x_test[:,feat_idx], y_train, y_valid

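A note on this code (my understanding): SelectKBest is a scikit-learn class that selects the K best features. The selection criterion is given by the function f_regression, which, as the name suggests, scores features for regression tasks. After defining the selector (selector = SelectKBest(score_func=f_regression, k=24)), calling its fit method computes a score for every feature (result = selector.fit(raw_x_train, y_train)). The scores are then sorted in descending order, and the indices of the top k are used to slice out the selected features.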
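To make the scoring step concrete, here is a minimal pure-Python sketch of a univariate F-test ranking. It assumes no feature column is constant: each feature gets the squared Pearson correlation r^2 with the target, which is converted to F = r^2 / (1 - r^2) * (n - 2), and the k highest-scoring column indices are kept. The helper names f_scores and top_k_features and the toy data are made up for illustration; scikit-learn's own implementation is vectorized and handles edge cases this sketch ignores.

```python
def f_scores(columns, y):
    """Return one F statistic per feature column (each column is a list of floats)."""
    n = len(y)
    y_mean = sum(y) / n
    y_dev = [v - y_mean for v in y]
    y_ss = sum(d * d for d in y_dev)
    scores = []
    for col in columns:
        c_mean = sum(col) / n
        c_dev = [v - c_mean for v in col]
        c_ss = sum(d * d for d in c_dev)
        cov = sum(a * b for a, b in zip(c_dev, y_dev))
        r2 = cov * cov / (c_ss * y_ss)          # squared Pearson correlation
        r2 = min(r2, 1.0 - 1e-12)               # avoid dividing by zero when r2 == 1
        scores.append(r2 / (1.0 - r2) * (n - 2))
    return scores


def top_k_features(columns, y, k):
    """Indices of the k highest-scoring columns, returned in ascending order."""
    scores = f_scores(columns, y)
    ranked = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): columns 0 and 2 track the target, column 1 barely does.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
columns = [
    [1.1, 1.9, 3.2, 3.9, 5.1],
    [5.0, 1.0, 4.0, 2.0, 3.0],
    [-1.0, -2.1, -2.9, -4.2, -5.0],
]
selected = top_k_features(columns, y, k=2)
print(selected)
```

A strongly negative correlation scores just as high as a strongly positive one, because only r^2 enters the statistic.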

After repeated trials, k between 20 and 24 worked best.

Next I adjusted the model size and parameters, adding LeakyReLU, BatchNorm, and Dropout:

class My_Model(nn.Module):
    def __init__(self, input_dim):
        super(My_Model, self).__init__()
        # TODO: modify model's structure, be aware of dimensions.
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.LeakyReLU(0.2),
            nn.BatchNorm1d(64),
            nn.Dropout(0.1),
            nn.Linear(64, 16),
            nn.BatchNorm1d(16),
            nn.LeakyReLU(0.2),
            nn.Dropout(0.1),
            nn.Linear(16, 1)
        )

    def forward(self, x):
        x = self.layers(x)
        x = x.squeeze(1) # (B, 1) -> (B)
        return x

On the training side, I added a learning rate scheduler and switched to Adam:

    optimizer = torch.optim.Adam(model.parameters(), lr=config['learning_rate'] * 10, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=2,T_mult=2,eta_min=config['learning_rate'])

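About CosineAnnealingWarmRestarts: within each cycle the learning rate decays from its initial value to eta_min along a cosine curve, then jumps back to the initial value, and the pattern repeats. The n-th cycle lasts T_0 * T_mult^(n-1) epochs, so with T_0=2 and T_mult=2 the cycles are 2, 4, 8, ... epochs long.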
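The cycle structure is easier to see with numbers. Below is a small pure-Python sketch of the cosine formula lr = eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * T_cur / T_i)), assuming the scheduler is stepped once per epoch. The function name warm_restart_lr is made up for illustration; the values base_lr=1e-2 and eta_min=1e-3 correspond to the learning_rate * 10 and learning_rate used in the optimizer lines above.

```python
import math


def warm_restart_lr(epoch, base_lr, eta_min, t_0, t_mult):
    """Learning rate at an integer epoch under cosine annealing with warm restarts."""
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:      # skip over the cycles that are already finished
        t_cur -= t_i
        t_i *= t_mult        # each new cycle is t_mult times longer
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


schedule = [warm_restart_lr(e, base_lr=1e-2, eta_min=1e-3, t_0=2, t_mult=2)
            for e in range(14)]
# Epochs at which the learning rate is back at its initial value.
restarts = [e for e, lr in enumerate(schedule) if abs(lr - 1e-2) < 1e-12]
print(restarts)
for e, lr in enumerate(schedule):
    print(e, round(lr, 6))
```

The restarts land at epochs 0, 2 and 6, i.e. after cycles of length 2 and 4; the next one would come at epoch 14, after a cycle of length 8.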

In addition, AdamW performed worse than Adam here; there may be something I have not understood yet.

Finally, some parameter settings:

device = 'cuda' if torch.cuda.is_available() else 'cpu'
config = {
    'seed': 5201314,      # Your seed number, you can pick your lucky number. :)
    'select_all': False,   # Whether to use all features.
    'valid_ratio': 0.2,   # validation_size = train_size * valid_ratio
    'n_epochs': 10000,     # Number of epochs.
    'batch_size': 256,
    'learning_rate': 1e-3,
    'early_stop': 1000,    # If model has not improved for this many consecutive epochs, stop training.
    'save_path': './models/model.ckpt'  # Your model will be saved here.
}

Final result:

Only a little short of the boss baseline; the feature selection is probably still not good enough. I did not pursue it further, though.

Homework2

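Assignment Overview

Given a number of audio recordings, split each into short segments (frames) and use deep learning to determine which sound (phoneme) the speaker is uttering in each frame. In short, this is a classification problem.

Baselines:

Tuning Log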

【Simple Baseline】

As usual, just run the code provided by the TAs.

【Medium Baseline】

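According to the hint, reaching the medium baseline requires concatenating a suitable number of frames, which preserves as much of the whole phoneme's information as possible. The model also needs more layers.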
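As a rough illustration of what concatenating frames means, the sketch below stacks every frame with its (n - 1) / 2 neighbours on each side, so a 39-dimensional frame becomes a 39 * n dimensional input. The function name concat_frames is hypothetical, and padding the edges by repeating the first or last frame is an assumption of this sketch, not something confirmed here about the starter code.

```python
def concat_frames(frames, n):
    """Stack each frame with its (n - 1) / 2 neighbours on both sides.

    frames is a list of feature vectors (lists of floats). At the edges the
    first or last frame is repeated so every output has the same length.
    """
    assert n % 2 == 1, "n must be odd"
    k = n // 2
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        window = []
        for offset in range(-k, k + 1):
            idx = min(max(t + offset, 0), last)   # clamp to a valid frame index
            window.extend(frames[idx])
        out.append(window)
    return out


# Four toy frames with 2 features each (real frames would have 39).
frames = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
stacked = concat_frames(frames, n=3)
print(len(stacked), len(stacked[0]))
print(stacked[0])
print(stacked[2])
```

The label of each stacked vector is still the label of its centre frame; the neighbours only provide context.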

First, add more layers to the block, along with dropout and batchnorm:

class BasicBlock(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(BasicBlock, self).__init__()

        self.block = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.BatchNorm1d(output_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(output_dim, output_dim * 2),
            nn.BatchNorm1d(output_dim * 2),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(output_dim * 2, output_dim),
            nn.BatchNorm1d(output_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
        )

    def forward(self, x):
        x = self.block(x)
        return x

Then change the hidden layer size and concatenate more frames (n = 11 here):

# data parameters
concat_nframes = 11              # the number of frames to concat with, n must be odd (total 2k+1 = n frames)
train_ratio = 0.8               # the ratio of data used for training, the rest will be used for validation

# training parameters
seed = 1213                        # random seed
batch_size = 512                # batch size
num_epoch = 10                   # the number of training epoch
learning_rate = 1e-4         # learning rate
model_path = './model.ckpt'     # the path where the checkpoint will be saved

# model parameters
input_dim = 39 * concat_nframes # the input dim of the model, you should not change the value
hidden_layers = 2               # the number of hidden layers
hidden_dim = 512                # the hidden dim

Result:

As you can see, the result is not ideal. So I switched to a deeper and wider network (layers=6, dim=1024) and concatenated more frames (n=17). The result improved:

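Exercise: the slides ask us to run a small experiment on whether deeper, narrower layers or shallower, wider layers work better. Following the slides, I reran the model above with layers=2 and dim=1750, and the result was clearly better than the model above.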
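For this comparison to be fair, the two networks should have a similar number of parameters. The sketch below counts weights and biases for a plain stack of fully connected layers: one input layer, hidden_layers hidden layers, and one output layer with 41 classes. This is a simplification; the modified BasicBlock above has three linear layers plus BatchNorm, which are not counted, and the input size 39 * 17 assumes n=17 was used for both runs. The function name mlp_param_count is made up for illustration.

```python
def mlp_param_count(input_dim, hidden_dim, hidden_layers, output_dim=41):
    """Weights plus biases of a plain stack of fully connected layers."""
    first = input_dim * hidden_dim + hidden_dim
    hidden = hidden_layers * (hidden_dim * hidden_dim + hidden_dim)
    last = hidden_dim * output_dim + output_dim
    return first + hidden + last


input_dim = 39 * 17   # assumption: 17 concatenated frames of 39 features each
deep_narrow = mlp_param_count(input_dim, hidden_dim=1024, hidden_layers=6)
shallow_wide = mlp_param_count(input_dim, hidden_dim=1750, hidden_layers=2)
print(deep_narrow, shallow_wide)
```

Under these assumptions both configurations land at roughly 7 million parameters, so the difference in results is not simply a matter of model size.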

【Strong Baseline】

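Following the approach above, I first deepened the model further, with layers=12 and dim=1024. I also added a softmax after the model's final output layer. However, the result was surprisingly poor: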

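I wondered whether the softmax was the problem, so I removed it and reran the experiment. The result improved, which confirmed that the softmax was indeed the cause: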

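After looking into it, I found that PyTorch's cross-entropy loss (CrossEntropyLoss) already applies a softmax internally (more precisely, a log-softmax), so adding another softmax layer in the model makes it hard to converge.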
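One way to see why the extra softmax hurts: once the inputs to the loss are probabilities in [0, 1], the largest possible gap between two "logits" is 1, so the loss can never approach zero. The pure-Python sketch below reproduces this with a cross-entropy that applies softmax internally; the helper names and the toy logits are made up for illustration, and 41 is the number of phoneme classes used in the classifier's output layer.

```python
import math


def softmax(logits):
    m = max(logits)                              # subtract the max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy from raw scores; the softmax is applied inside."""
    return -math.log(softmax(logits)[target])


num_classes = 41
# A very confident and correct prediction for class 0.
logits = [20.0] + [0.0] * (num_classes - 1)

loss_raw = cross_entropy(logits, target=0)              # raw scores fed to the loss
loss_double = cross_entropy(softmax(logits), target=0)  # softmax applied twice

# Lowest loss reachable when the inputs are confined to [0, 1].
floor = -math.log(math.e / (math.e + num_classes - 1))
print(loss_raw, loss_double, floor)
```

Even a perfect prediction is stuck at a loss of about 2.75 with the double softmax, not far below the ln(41) of a uniform guess, which matches the poor convergence described above.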

【Boss Baseline】

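The slides hinted at by the TAs say that passing the boss baseline requires an RNN. My first thought was an LSTM. I followed an article I had read earlier and built the model from scratch in the order it recommends, as a chance to put it into practice. Following its first piece of advice, I built the following model: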

class Classifier(nn.Module):
    def __init__(self, input_dim, output_dim=41, hidden_layers=1, hidden_dim=256):
        super(Classifier, self).__init__()

        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=hidden_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x, _ = self.lstm(x)
        x = self.fc(x)
        return x

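With layers=2 and dim=1024, the model successfully overfit a small dataset of size [10000, 2000]:

I also tried every combination of layers=2, 3 and dim=512, 1024, 2048. In the end, only with the chosen combination did the train_loss curve most resemble a log function, consistent with what the advice describes.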

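Then, following the fifth and sixth pieces of advice, I set the learning rate to 1e-4 and used Adam with CosineAnnealingLR. Following the seventh, I used gradient clipping...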
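For reference, gradient clipping by global norm works roughly like the sketch below: compute the L2 norm over all gradients and, if it exceeds a threshold, rescale every gradient by the same factor. This is a minimal pure-Python illustration with a made-up function name and made-up numbers; the post does not record which clipping variant or threshold was used in the actual run.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # direction is kept, length is capped
    return grads, total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
untouched, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(norm_before, clipped)
print(untouched)
```

Because all gradients share one scale factor, the update direction is unchanged; only overly large steps, which are common in RNN training, are shortened.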

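There are too many references to list! Go read the blog posts yourself...

After running overnight, the result was unsatisfactory:

After that I consulted many blog posts and articles online, and finally made the model deeper still: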

import torch.nn as nn
import torch.nn.init as init

class BasicBlock(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(BasicBlock, self).__init__()

        self.block = nn.Sequential(
            nn.Linear(input_dim, output_dim),
            nn.ReLU(),
            nn.BatchNorm1d(output_dim),
            nn.Dropout(0.25),
        )

    def forward(self, x):
        x = self.block(x)
        return x


class Classifier(nn.Module):
    def __init__(self, input_dim, output_dim=41, hidden_layers=1, hidden_dim=256):
        super(Classifier, self).__init__()


        self.hidden_layers = 5
        self.hidden_dim = 512
        self.input_dim = 39

        self.lstm = nn.LSTM(self.input_dim, self.hidden_dim, num_layers=self.hidden_layers,dropout=0.25,batch_first=True,bidirectional=True)
        self.norm = nn.LayerNorm(self.hidden_dim * 2)
        self.relu = nn.ReLU()

        self.fc = nn.Sequential(
            BasicBlock(self.hidden_dim * 2, hidden_dim),
            *[BasicBlock(hidden_dim, hidden_dim) for _ in range(hidden_layers)],
            nn.Linear(hidden_dim, output_dim),
        )

        self.dropout = nn.Dropout(0.25)
        self.init_weights()

    def init_weights(self):
        for name, param in self.lstm.named_parameters():
            if 'weight_ih' in name:  # input to hidden weights
                init.xavier_uniform_(param.data)
            elif 'weight_hh' in name:  # hidden to hidden weights
                init.orthogonal_(param.data)
            elif 'bias' in name:  # biases
                init.zeros_(param.data)
            else:
                init.kaiming_uniform_(param.data)  # He initialization; torch.nn.init has no he_uniform_

    def forward(self, x):
        x = x.view(x.shape[0], concat_nframes, 39)
        x, _ = self.lstm(x)
        x = x[:, -1]
        x = self.relu(x)
        x = self.norm(x)
        x = self.dropout(x)
        x = self.fc(x)
        return x

Then, following the second piece of advice, I tuned on small datasets of [60000, 3000] and [80000, 4000] respectively, and settled on these hyperparameters:

# data parameters
concat_nframes = 81              # the number of frames to concat with, n must be odd (total 2k+1 = n frames)
train_ratio = 0.95               # the ratio of data used for training, the rest will be used for validation

# training parameters
seed = 1213                        # random seed
batch_size = 256                # batch size
num_epoch = 20                   # the number of training epoch
learning_rate = 2e-4         # learning rate
model_path = './model.ckpt'     # the path where the checkpoint will be saved

# model parameters
input_dim = 39 * concat_nframes # the input dim of the model, you should not change the value
hidden_layers = 4               # the number of hidden layers
hidden_dim = 1024                # the hidden dim

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer,T_0=2,T_mult=2,eta_min=0.1 * learning_rate)

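The resulting loss curve is shown below:

As you can see, the model fits well with these parameters. After training for 20 epochs (about 16 hours), the final result was:

Unfortunately, it still did not pass the boss baseline. The loss curve is as follows:

The loss clearly had not fully converged by the end (the train loss had not even dropped below the val loss), so I decided to train for a few more epochs. After 5 more epochs the model had converged, but resubmitting brought no improvement:

It is a pity, since it was so close. But retraining from scratch takes too much time and would not teach me much more, so I will leave it at that for now!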

Homework3

Assignment Overview

Use a CNN to classify images of food into 11 different categories.

Baselines:

Tuning Log

【Simple baseline】

As usual, just get the sample code running:

【Medium baseline】

According to the hint, we first need some image augmentation. I am also recording Report 1 here:

from PIL import Image
from torchvision import transforms

homework_tfm = transforms.Compose([transforms.RandomGrayscale(),
                transforms.RandomResizedCrop(128,(0.1, 1),(0.5, 2)),
                transforms.RandomHorizontalFlip(),
                transforms.RandomVerticalFlip(),
                transforms.ColorJitter(0.5, 0.5, 0.5, 0.3),
                transforms.GaussianBlur(7)])
init = transforms.Resize((128, 128))
img = Image.open('/kaggle/input/ml2023spring-hw3/train/0_0.jpg')
img = init(img)
display(img)
for _ in range(5):
    display(homework_tfm(img))

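The effect is shown below:

Honestly, after these transforms even I have trouble making out the images; I wonder whether the machine can really understand them...

After more than 70 epochs, it did not pass the baseline:

I suspected my image transforms were the problem. After looking up some material online (https://zhuanlan.zhihu.com/p/430563265), I chose the following scheme: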

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3. / 4., 4. / 3.)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize
 ])

test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
 ])

The result improved very significantly:

【Strong Baseline】

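According to the hint, pick a predefined model to train; I used ResNet18 here. It helped, but not by much:

I then tried ResNet34 and ResNet50, but they brought only slight improvements:

A side note: when I first defined the model, the official documentation did not say that the num_classes argument had to be given, so I simply left it out. I did not expect the default value of num_classes to be 1000! This means every model in the screenshot above was trained with a 1000-class output. Once I noticed this, I immediately added the argument to the model definition and retrained. Surprisingly, the result was about the same, though it did come close to the strong baseline: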

【Boss Baseline】

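The most dramatic moment came when I tried to pick a better model on top of the strong baseline. I chose EfficientNet-B3, and the result passed the boss baseline outright:

So I went through the original EfficientNet paper and continued the experiment with the more powerful B4 model:

This also improved the private score.

In the face of a truly powerful model, tricks like cross validation and TTA all seem insignificant...

And so I muddled my way past the boss baseline and went straight on to the next task ^^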

Homework4

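Assignment Overview

Multiclass classification for speaker identification. The goal is to predict the speaker's identity class from the given speech data. In this task, you need to build a model that identifies different speakers based on features of the speech signal.

Baselines:

Tuning Log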

【Simple Baseline】

Just run the original code:

【Medium/Strong Baseline】

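According to the assignment hint, the number of self-attention layers in the transformer and the hidden layer sizes need to be adjusted. I used the settings from the original "Attention Is All You Need" paper (d_model=512, dim_feedforward=2048, nhead=8), but with 8 encoder layers where the paper's base model uses 6. The code is as follows: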

class Classifier(nn.Module):
	def __init__(self, d_model=512, n_spks=600, dropout=0.1):
		super().__init__()
		# Project the dimension of features from that of input into d_model.
		self.prenet = nn.Linear(40, d_model)
		# TODO:
		#   Change Transformer to Conformer.
		#   https://arxiv.org/abs/2005.08100
		self.encoder_layer = nn.TransformerEncoderLayer(
			d_model=d_model, dim_feedforward=2048, nhead=8, batch_first=True
		)
		self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=8)

		# Project the the dimension of features from d_model into speaker nums.
		self.pred_layer = nn.Sequential(
			nn.Linear(d_model, d_model),
			nn.Sigmoid(),
			nn.Linear(d_model, n_spks),
		)

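The result was still some distance from the medium baseline:

Since it all comes down to tuning anyway, I looked into a good automatic hyperparameter tuning framework, optuna. Using its default search method together with median pruning, I found a fairly good set of hyperparameters: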

class Classifier(nn.Module):
	def __init__(self, d_model=1024, n_spks=600, dropout=0.4):
		super().__init__()
		# Project the dimension of features from that of input into d_model.
		self.prenet = nn.Linear(40, d_model)
		# TODO:
		#   Change Transformer to Conformer.
		#   https://arxiv.org/abs/2005.08100
		self.encoder_layer = nn.TransformerEncoderLayer(
			d_model=d_model, dim_feedforward=512, nhead=16, batch_first=True
		)
		self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=8)

		# Project the the dimension of features from d_model into speaker nums.
		self.pred_layer = nn.Sequential(
			nn.Linear(d_model, d_model * 2),
			nn.Sigmoid(),
			nn.Dropout(dropout),
			nn.Linear(d_model * 2, n_spks),
		)

config = {
        "data_dir": "/kaggle/input/ml2023springhw4/Dataset",
        "save_path": "model.ckpt",
        "batch_size": 32,
        "n_workers": 2,
        "valid_steps": 2000,
        "warmup_steps": 2000,
        "save_steps": 10000,
        "total_steps": 100000,
}

criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=4e-5, weight_decay=7e-8)
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
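To show what a warmup-plus-cosine schedule does with these settings, here is a pure-Python sketch of the learning rate multiplier: it rises linearly from 0 to 1 over the warmup steps and then follows half a cosine period down to 0 at the last step. The function name warmup_cosine_factor is made up, and the half-period shape is an assumption about how get_cosine_schedule_with_warmup is defined in the notebook.

```python
import math


def warmup_cosine_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given training step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)              # linear warmup from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))  # cosine decay to 0


base_lr = 4e-5
factors = {step: warmup_cosine_factor(step, warmup_steps=2000, total_steps=100000)
           for step in (0, 1000, 2000, 51000, 100000)}
for step, factor in factors.items():
    print(step, base_lr * factor)
```

With warmup_steps=2000 and total_steps=100000, the peak learning rate of 4e-5 is reached at step 2000 and is halved by step 51000.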

The result was very good; it passed the strong baseline in one go:

【Boss Baseline】

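According to the TAs' hint, the transformer needs to be replaced with a conformer. I used a conformer written in PyTorch by a developer on GitHub; after installing it with pip, it can be imported directly. I again used optuna for the hyperparameter search. The code and parameter settings are as follows: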

from conformer import Conformer
class Classifier(nn.Module):
    def __init__(self, d_model=512, n_spks=600, dropout=0.3, nhead=16, ff_mult=4, conv_expansion_factor=8, num_layers=4):
        super().__init__()
        # Project the dimension of features from that of input into d_model.
        self.prenet = nn.Linear(40, d_model)
        # TODO:
        #   Change Transformer to Conformer.
        #   https://arxiv.org/abs/2005.08100
        # self.encoder_layer = nn.TransformerEncoderLayer(
        # 	d_model=d_model, dim_feedforward=dim_feedforward, nhead=nhead, batch_first=True
        # )
        # self.encoder = nn.TransformerEncoder(self.encoder_layer, num_layers=num_layers)
        self.conformer_block = Conformer(dim=d_model, depth=num_layers, dim_head=(d_model//nhead), heads=nhead,
                                      ff_mult=ff_mult, conv_expansion_factor=conv_expansion_factor, attn_dropout=dropout,
                                      ff_dropout=dropout, conv_dropout=dropout)

config = {
        "data_dir": "/kaggle/input/ml2023springhw4/Dataset",
        "save_path": "model.ckpt",
        "batch_size": 32,
        "n_workers": 2,
        "valid_steps": 2000,
        "warmup_steps": 2000,
        "save_steps": 10000,
        "total_steps": 100000,
}

criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=4e-5, weight_decay=7e-8)
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)

In the end the result reached the boss baseline directly! None of the other tricks (such as AMSoftmax) were needed:

Summary

Since Kaggle no longer seems to allow joining the competitions for hw5 and later, I will stop here for now!