TextCNN + ELMo

Experiment 3 (Supplementary): Principles of Word Embeddings and Using Pretrained Word Vectors (Report)

郑书航

2022113601

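Experiment Background

Dataset

This experiment uses a subsampled version of the RT-Polarity dataset, which consists of text labeled with positive or negative sentiment. Positive and negative texts are stored in two separate files. We cleaned and tokenized the data, then split it 80:20 into a training set and a test set.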
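The 80:20 split can be sketched with the standard library alone. The report itself relies on scikit-learn for this step; the helper name `split_80_20`, the fixed seed, and the toy data below are assumptions made purely for illustration.

```python
import random

def split_80_20(samples, labels, seed=0):
    """Shuffle (sample, label) pairs and split them 80:20 (hypothetical helper)."""
    pairs = list(zip(samples, labels))
    random.Random(seed).shuffle(pairs)   # fixed seed keeps the split reproducible
    cut = int(len(pairs) * 0.8)          # the first 80% becomes the training set
    return pairs[:cut], pairs[cut:]

# Toy data standing in for the cleaned sentences and their labels.
samples = [f"sentence {i}" for i in range(10)]
labels = [1] * 5 + [0] * 5
train, test = split_80_20(samples, labels)
print(len(train), len(test))  # 8 2
```

Shuffling before the cut matters here because the data is loaded with all positive examples first and all negative examples after them.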

rt-polarity.neg: negative reviews of films. Four sample sentences:

simplistic , silly and tedious .
it's so laddish and juvenile , only teenage boys could possibly find it funny .
exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .
[garbus] discards the potential for pathological study , exhuming instead , the skewed melodrama of the circumstantial situation .

rt-polarity.pos: positive reviews of films. Three sample sentences:

the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claude van damme or steven seagal .
the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .
effective but too-tepid biopic

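Experiment Environment

The experiment runs in a Python environment and depends mainly on the following libraries:

PyTorch: builds and trains the neural network model.
AllenNLP: loads and runs the ELMo word embedding model.
Scikit-learn: splits the dataset and evaluates the model.
Matplotlib: plots the loss and accuracy curves recorded during training.

Training ran on a CUDA-capable GPU to keep computation fast.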

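Word Embeddings and Pretrained Word Vectors

A word embedding represents each word in the vocabulary as a fixed-length vector. These vectors capture semantic relationships between words, so that words with similar meanings lie close together in the vector space. Common methods include Word2Vec and GloVe. Both are pretrained on large corpora, learn from the contexts in which each word appears, and assign every word a single, context-independent vector.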
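The idea that similar words lie close together can be made concrete with cosine similarity. The three-dimensional vectors below are made up for illustration and do not come from Word2Vec or GloVe; real pretrained vectors are much longer.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen by hand.
toy_vectors = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "table": [0.1, 0.9, 0.0],
}
sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["table"])
print(sim_close > sim_far)  # True
```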

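ELMo Word Vectors

ELMo (Embeddings from Language Models) is a deep contextual word embedding model based on bidirectional LSTMs, proposed by researchers at the Allen Institute for AI and distributed with the AllenNLP library. Unlike traditional word embedding methods, ELMo uses the surrounding context to produce dynamic embeddings: the same word receives different vectors in different contexts. This helps a downstream model handle polysemous words and complex sentence structure.

In this experiment we use a pretrained ELMo model to generate word vectors for the text. These vectors carry contextual semantic information and can improve model performance on sentiment classification. The model used here is built on a two-layer bidirectional LSTM with 2048 hidden units per layer, and it outputs 512-dimensional word embeddings. We freeze its parameters so that the pretrained weights are not updated during training.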

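Experiment Design

Data Preprocessing

The preprocessing code cleans, tokenizes, and loads the text data. Each part is explained below.

Data cleaning function `clean_str`

```python
import re

def clean_str(string):
    """Normalize a raw review line into space-separated lowercase tokens."""
    # Keep letters, digits and ( ) , ! ? ' ` ; anything else, including ".", becomes a space.
    string = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", string)
    # Split common contractions off the preceding word.
    string = re.sub(r"'s", " 's", string)
    string = re.sub(r"'ve", " 've", string)
    string = re.sub(r"n't", " n't", string)
    string = re.sub(r"'re", " 're", string)
    string = re.sub(r"'d", " 'd", string)
    string = re.sub(r"'ll", " 'll", string)
    # Surround the kept punctuation with spaces so each mark is its own token.
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    # Plain replacement strings: writing " \( " here would leave a literal backslash in the output.
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    # Collapse runs of whitespace into one space.
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()
```

Purpose: this function cleans an input string, removing unwanted characters and normalizing the text format.
Steps:
- Use re.sub to replace every character other than letters, digits, and the punctuation marks ( ) , ! ? ' ` with a space. Periods are not in the kept set, so they are removed.
- Split common contractions ('s, 've, n't, and so on) from the preceding word by inserting a space in front of them, so that they become separate tokens.
- Surround commas, exclamation marks, parentheses, and question marks with spaces so that each becomes its own token.
- Collapse runs of whitespace into a single space, remove leading and trailing whitespace with strip(), and convert the string to lowercase.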
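The cleaning steps above can be checked on a few sentences. The block below is a self-contained copy of `clean_str` in which the parenthesis and question-mark replacements are written as plain strings without backslashes; the input sentences are made up for illustration.

```python
import re

def clean_str(string):
    # Copy of the cleaning function so that this example runs on its own.
    string = re.sub(r"[^A-Za-z0-9(),!?'`]", " ", string)
    string = re.sub(r"'s", " 's", string)
    string = re.sub(r"'ve", " 've", string)
    string = re.sub(r"n't", " n't", string)
    string = re.sub(r"'re", " 're", string)
    string = re.sub(r"'d", " 'd", string)
    string = re.sub(r"'ll", " 'll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()

print(clean_str("I love this movie!"))                 # i love this movie !
print(clean_str("This film is fantastic."))            # this film is fantastic
print(clean_str("This film isn't fantastic, is it?"))  # this film is n't fantastic , is it ?
```

The second call shows that a sentence-final period is dropped rather than kept as a token, because "." is not among the characters the first substitution preserves.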

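Initializing the ELMo model

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

elmo_options_file = "elmo_2x2048_256_2048cnn_1xhighway_options.json"
elmo_weight_file = "elmo_2x2048_256_2048cnn_1xhighway_weights.hdf5"
elmo_dim = 512  # output dimension of this pretrained model

def init_elmo():
    # Request one output representation and disable dropout.
    elmo = Elmo(elmo_options_file, elmo_weight_file, 1, dropout=0)
    # Freeze the pretrained weights.
    for param in elmo.parameters():
        param.requires_grad = False
    return elmo
```

Purpose: this code initializes the ELMo (Embeddings from Language Models) model.
Steps:
- Specify the ELMo options file and weight file.
- Create an Elmo instance that returns one output representation, with dropout set to 0 (no dropout).
- Set requires_grad to False on every parameter so that the parameters are not updated during training.
- Return the initialized ELMo model.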

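Getting ELMo word vectors: `get_elmo`

```python
def get_elmo(model, sentence_lists):
    # Convert tokenized sentences to character IDs, then run ELMo on them.
    character_ids = batch_to_ids(sentence_lists)
    embeddings = model(character_ids)
    # Only one output representation was requested, so take the first.
    return embeddings['elmo_representations'][0]
```

Purpose: this function returns the ELMo word vectors for the input sentences.
Steps:
- Use batch_to_ids to convert the list of tokenized sentences into character IDs.
- Feed the character IDs to the ELMo model to obtain the word vector representations.
- Return the resulting tensor, which has shape (batch_size, max_len, 512).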

Loading data and generating labels: `load_data_and_labels`

```python
import numpy as np

def load_data_and_labels(positive_data_file, negative_data_file):
    # Read one review per line and strip surrounding whitespace.
    with open(positive_data_file, "r", encoding="utf-8") as f:
        positive_examples = [s.strip() for s in f]
    with open(negative_data_file, "r", encoding="utf-8") as f:
        negative_examples = [s.strip() for s in f]

    # Clean every sentence, then split it into a list of tokens.
    x_text = positive_examples + negative_examples
    x_text = [clean_str(sent).split() for sent in x_text]

    # 1 = positive, 0 = negative, in the same order as x_text.
    positive_labels = [1 for _ in positive_examples]
    negative_labels = [0 for _ in negative_examples]
    y = np.array(positive_labels + negative_labels)
    return [x_text, y]
```

Purpose: this function loads the positive and negative data and generates the corresponding labels.
Steps:
- Read the positive and negative data files and strip leading and trailing whitespace from each line.
- Concatenate the positive and negative examples into one list, x_text.
- Apply clean_str to every sentence and split each sentence into a list of words.
- Assign label 1 to positive examples and label 0 to negative examples, and combine the labels into a NumPy array y.
- Return the cleaned text and the labels.

Worked example

Using the code above, this example shows step by step how the data is processed. Suppose we have the following two text files.

Positive data file (rt-polarity.pos):

```text
I love this movie!
This film is fantastic.
What a great experience!
```

Negative data file (rt-polarity.neg):

```text
I hate this movie.
This film is terrible.
What a bad experience.
```

Loading the data:
Call load_data_and_labels on the positive and negative data files.

```python
positive_data_file = "rt-polarity.pos"
negative_data_file = "rt-polarity.neg"
x_text, y = load_data_and_labels(positive_data_file, negative_data_file)
```

Reading the files:

The raw lines read from the positive file are:

```text
['I love this movie!\n', 'This film is fantastic.\n', 'What a great experience!\n']
```

The raw lines read from the negative file are:

```text
['I hate this movie.\n', 'This film is terrible.\n', 'What a bad experience.\n']
```

strip() removes the trailing newline from each line before the lines are stored in positive_examples and negative_examples.

Cleaning the data:
Inside load_data_and_labels, clean_str is applied to every sentence.

Cleaning results:
- Positive examples:
  - I love this movie! → i love this movie !
  - This film is fantastic. → this film is fantastic
  - What a great experience! → what a great experience !
- Negative examples:
  - I hate this movie. → i hate this movie
  - This film is terrible. → this film is terrible
  - What a bad experience. → what a bad experience

The period is not among the characters that clean_str keeps, so it is replaced by a space and disappears. The exclamation mark is kept and becomes its own token.

Final x_text:

```text
[
  ['i', 'love', 'this', 'movie', '!'],
  ['this', 'film', 'is', 'fantastic'],
  ['what', 'a', 'great', 'experience', '!'],
  ['i', 'hate', 'this', 'movie'],
  ['this', 'film', 'is', 'terrible'],
  ['what', 'a', 'bad', 'experience']
]
```

Generating labels:

Positive examples get label 1 and negative examples get label 0.

y will contain:

```text
[1, 1, 1, 0, 0, 0]
```

Return value:
load_data_and_labels returns the cleaned text and the labels:

```python
return [x_text, y]
```

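Initializing the ELMo model:
Call init_elmo to initialize the ELMo model.

```python
elmo_model = init_elmo()
```

Getting ELMo word vectors:
Call get_elmo to obtain the ELMo word vectors for the sentences.

```python
embeddings = get_elmo(elmo_model, x_text)
```

Input: sentence_lists is a list of sentences, each of which is a list of words. For example:

```text
sentence_lists = [
    ['i', 'love', 'this', 'movie', '!'],
    ['this', 'film', 'is', 'fantastic'],
    ['what', 'a', 'great', 'experience', '!'],
    ['i', 'hate', 'this', 'movie'],
    ['this', 'film', 'is', 'terrible'],
    ['what', 'a', 'bad', 'experience']
]
```

Processing:
- batch_to_ids is an AllenNLP function that converts a batch of tokenized sentences into character IDs that the ELMo model can process. It works at the character level rather than looking words up in a vocabulary, which is why ELMo can embed words it never saw during pretraining.
- Conceptually, the word "i" maps to the character code [105], "love" maps to [108, 111, 118, 101], and so on.

Output: character_ids holds the character IDs of every word in every sentence. In simplified form:

```text
character_ids = [
    [[105], [108, 111, 118, 101], [116, 104, 105, 115], [109, 111, 118, 105, 101], [33]],
    [[116, 104, 105, 115], [102, 105, 108, 109], [105, 115], [102, 97, 110, 116, 97, 115, 116, 105, 99]],
    ...
]
```

Each sentence is thus converted into a list of per-word character IDs. The tensor that batch_to_ids actually returns additionally pads every word to a fixed number of character slots, adds word-boundary markers, and pads shorter sentences to the length of the longest sentence in the batch.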
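The per-word conversion can be sketched in plain Python. This is a simplified re-implementation for illustration, not the library code: the word length of 50, the special IDs 258/259/260, and the final shift by one reflect our reading of AllenNLP's ELMo character mapper and should be treated as assumptions to verify against the library source. The function name `word_to_char_ids` is hypothetical.

```python
MAX_WORD_LENGTH = 50                        # assumed number of character slots per word
BEGIN_WORD, END_WORD, PAD = 258, 259, 260   # assumed special IDs

def word_to_char_ids(word):
    """Simplified sketch of ELMo-style character IDs for a single word."""
    encoded = word.encode("utf-8", "ignore")[: MAX_WORD_LENGTH - 2]
    ids = [PAD] * MAX_WORD_LENGTH
    ids[0] = BEGIN_WORD                     # word-start marker
    for k, byte in enumerate(encoded, start=1):
        ids[k] = byte                       # one slot per UTF-8 byte
    ids[len(encoded) + 1] = END_WORD        # word-end marker
    return [i + 1 for i in ids]             # shift by one so that 0 is free for masking

ids = word_to_char_ids("i")
print(len(ids), ids[:4])  # 50 [259, 106, 260, 261]
```

Under these assumptions the byte value 105 for "i" shows up as 106 in the tensor, between the shifted start and end markers.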

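Input: character_ids is the tensor of character IDs produced in the previous step.
- Processing:
  - The ELMo model uses these character IDs to compute a contextual embedding for every word of every sentence, taking into account the word's position in the sentence and its surrounding context.
  - This is a forward pass through the pretrained network: a character-level CNN followed by two bidirectional LSTM layers.

Output: embeddings is a dictionary with the following structure:

```text
embeddings = {
    'elmo_representations': [
        # one tensor per requested output representation (here exactly one),
        # each of shape (batch_size, max_len, 512)
        tensor,
    ],
    'mask': tensor,  # shape (batch_size, max_len), marks real tokens versus padding
}
```

Here elmo_representations is a list of tensors. Because the model was created with one output representation, the list has a single element, which is what get_elmo returns. Each word is represented by a 512-dimensional vector for the model used here (the larger original ELMo model produces 1024-dimensional vectors).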

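Model Training

This part implements training and evaluation of a text classification model. It uses a TextCNN model on top of the word embeddings produced by the pretrained ELMo model. The code is explained in detail below.

`TextDataset` class

```python
from torch.utils.data import Dataset

class TextDataset(Dataset):
    def __init__(self, x, y):
        # Pair each tokenized sentence with its label.
        self.data = list(zip(x, y))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        assert idx < len(self)
        return self.data[idx]
```

TextDataset: a custom dataset class that inherits from PyTorch's Dataset class. It combines the inputs x and the labels y into one list and provides methods for getting the dataset length and the item at a given index.
__init__: the constructor; zips x and y into a list of (sentence, label) pairs.
__len__: returns the length of the dataset.
__getitem__: returns the item at the given index.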

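`collate_fn` function

```python
def collate_fn(batch):
    # batch is a list of (sentence, label) pairs; split it into two tuples.
    data, label = zip(*batch)
    return data, label
```

collate_fn: packs several samples into one batch. During data loading it separates the sentences from the labels in batch and returns two tuples, one holding the sentences and one holding the labels.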

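`Block` class

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, kernel_s, embeddin_num, max_len, hidden_num):
        super().__init__()
        # The kernel spans kernel_s tokens and the full embedding width.
        self.cnn = nn.Conv2d(in_channels=1, out_channels=hidden_num, kernel_size=(kernel_s, embeddin_num))
        self.act = nn.ReLU()

    def forward(self, batch_emb):
        c = self.cnn(batch_emb)   # (batch, hidden_num, max_len - kernel_s + 1, 1)
        a = self.act(c)
        a = a.squeeze(dim=-1)     # (batch, hidden_num, max_len - kernel_s + 1)
        m = a.mean(dim=2)         # average over time: (batch, hidden_num)
        return m
```

Block: a convolutional (CNN) block that processes the input embeddings.
__init__: the initializer; sets up the convolution layer and the activation function. The max_len argument is accepted but not used.
- self.cnn: a 2D convolution layer. in_channels=1 means one input channel, out_channels=hidden_num sets the number of output channels to the number of hidden units, and kernel_size=(kernel_s, embeddin_num) sets the kernel size.
- self.act: the ReLU activation function.
forward: the forward pass; applies the convolution and then ReLU, squeezes the last dimension (which has size 1 because the kernel spans the full embedding width), and averages over the time steps (the sequence length).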
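What a single filter of such a block computes can be traced by hand in plain Python. This is a sketch of the arithmetic only, not the PyTorch implementation; the helper name `conv_relu_mean`, the toy embeddings, and the kernel weights are made up for illustration.

```python
def conv_relu_mean(embeddings, kernel, bias=0.0):
    """One filter: slide a (k, emb_dim) kernel over time, apply ReLU, average over time."""
    k = len(kernel)
    windows = len(embeddings) - k + 1            # output length = max_len - k + 1
    activations = []
    for t in range(windows):
        total = bias
        for i in range(k):                       # tokens inside the window
            for j, w in enumerate(kernel[i]):    # embedding dimensions
                total += w * embeddings[t + i][j]
        activations.append(max(0.0, total))      # ReLU
    return sum(activations) / len(activations)   # mean over the time axis

# Toy input: 4 tokens with 2-dimensional embeddings.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, -3.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]                # kernel_s = 2, full embedding width
print(conv_relu_mean(emb, kernel))  # 1.0
```

The three windows give 2, 1, and -2 before the activation; ReLU turns the last one into 0, and the mean of 2, 1, and 0 is 1.0. A block with hidden_num filters produces hidden_num such numbers per sentence.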

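`TextCNNModel` class

```python
class TextCNNModel(nn.Module):
    def __init__(self, max_len, class_num, hidden_num, elmo_model):
        super().__init__()
        self.elmo_model = elmo_model
        self.emb_num = 512  # ELMo output dimension
        # Three parallel blocks with kernel sizes 2, 3 and 4.
        self.block1 = Block(2, self.emb_num, max_len, hidden_num)
        self.block2 = Block(3, self.emb_num, max_len, hidden_num)
        self.block3 = Block(4, self.emb_num, max_len, hidden_num)
        self.classifier = nn.Linear(hidden_num * 3, class_num)
        self.loss_fun = nn.CrossEntropyLoss()

    def forward(self, batch_idx):
        # batch_idx is a list of tokenized sentences.
        batch_emb = get_elmo(self.elmo_model, batch_idx)
        batch_emb = torch.unsqueeze(batch_emb, dim=1)  # add the channel dimension

        b1_result = self.block1(batch_emb)
        b2_result = self.block2(batch_emb)
        b3_result = self.block3(batch_emb)
        feature = torch.cat([b1_result, b2_result, b3_result], dim=1)
        pre = self.classifier(feature)
        return pre
```

TextCNNModel: the core of the text classification model, consisting of several Block modules and a fully connected layer for classification.
__init__: the initializer; defines three convolution blocks with different kernel sizes (2, 3, 4) and the final fully connected classifier. It also creates self.loss_fun, which forward does not use.
forward: the forward pass; obtains word embeddings from ELMo, passes them through the three convolution blocks, concatenates the results, and feeds them to the fully connected layer for classification.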

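`train_and_evaluate` function

```python
import torch
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

def train_and_evaluate(model, train_loader, dev_loader, opt, loss_fn, device, epochs, save_model_best):
    train_losses, dev_losses, dev_accuracies = [], [], []
    acc_max = 0

    for epoch in range(epochs):
        # Training phase
        model.train()
        loss_sum, count = 0, 0
        print(f"**** Epoch {epoch+1} ****")
        for batch_index, batch_data in enumerate(train_loader):
            x, labels = batch_data
            batch_label = torch.LongTensor(labels)
            batch_text = list(x)
            batch_label = batch_label.to(device)
            pred = model(batch_text)

            loss = loss_fn(pred, batch_label)
            opt.zero_grad()   # clear old gradients
            loss.backward()   # compute new gradients
            opt.step()        # update the parameters
            loss_sum += loss.item()
            count += 1

            # Log the running average loss at batch 21, 1021, 2021, ...
            if batch_index % 1000 == 20:
                msg = "[{0}/{1:5d}]\tTrain_Loss:{2:.4f}"
                print(msg.format(epoch + 1, batch_index + 1, loss_sum / count))
                train_losses.append(loss_sum / count)
                loss_sum, count = 0.0, 0

        # Evaluation phase
        model.eval()
        dev_loss_sum, count = 0, 0
        all_pred, all_true = [], []
        with torch.no_grad():
            for batch_index, batch_data in enumerate(dev_loader):
                x, labels = batch_data
                batch_label = torch.LongTensor(labels)
                batch_text = list(x)
                batch_label = batch_label.to(device)
                pred = model(batch_text)
                dev_loss_sum += loss_fn(pred, batch_label).item()
                pred = torch.argmax(pred, dim=1)
                all_pred.extend(pred.cpu().numpy())
                all_true.extend(batch_label.cpu().numpy())
                count += 1

        dev_losses.append(dev_loss_sum / count)
        acc = accuracy_score(all_true, all_pred)
        dev_accuracies.append(acc)
        print(f"Epoch {epoch+1}, Dev Loss: {dev_loss_sum / count:.4f}, Dev Accuracy: {acc:.4f}")

        # Keep only the checkpoint with the best validation accuracy so far.
        if acc > acc_max:
            acc_max = acc
            torch.save(model.state_dict(), save_model_best)
            print("Saved best model")

        print("*" * 50)

    # Plot and save the loss and accuracy curves.
    plt.figure(figsize=(10, 5))
    plt.subplot(1, 2, 1)
    plt.plot(train_losses, label="Train Loss")
    plt.plot(dev_losses, label="Validation Loss")
    plt.title("Loss over Epochs")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(dev_accuracies, label="Validation Accuracy")
    plt.title("Accuracy over Epochs")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend()

    plt.tight_layout()
    plt.savefig("training_metrics.png")
    plt.show()
```

Training:
- At the start of each epoch the model is put in training mode (model.train()).
- For each batch, a forward pass through the model computes the predictions pred, from which the loss is computed.
- opt.zero_grad() clears the old gradients, loss.backward() computes new gradients by backpropagation, and opt.step() updates the model parameters.
- The running average training loss is printed and recorded whenever batch_index % 1000 == 20, that is, at batch 21, 1021, 2021, and so on of each epoch.
Evaluation:
- The model is put in evaluation mode (model.eval()), and torch.no_grad() disables gradient tracking, so no parameters are updated.
- A forward pass over the development (validation) set computes the validation loss and accuracy.
- If the validation accuracy of the current epoch exceeds the previous best, the model's state_dict is saved.
Results: after training, the code plots and saves the loss and accuracy curves for analysis. Note that train_losses receives one point per logging event while dev_losses receives one point per epoch, so the two loss curves line up on the "Epoch" axis only when exactly one training loss is logged per epoch (between 21 and 1020 batches per epoch).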
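Two small rules in this loop are easy to misread: the logging condition and the best-checkpoint rule. The sketch below replays both in plain Python without any model; the batch count of 2500 and the validation accuracies are made-up values for illustration.

```python
# Which batches trigger a log line under `batch_index % 1000 == 20`?
logged = [i + 1 for i in range(2500) if i % 1000 == 20]  # 1-based, as printed
print(logged)  # [21, 1021, 2021]

# Best-checkpoint rule, replayed on hypothetical per-epoch validation accuracies.
dev_accuracies = [0.71, 0.74, 0.73, 0.76]
acc_max, saved_epochs = 0, []
for epoch, acc in enumerate(dev_accuracies, start=1):
    if acc > acc_max:            # strictly better than every earlier epoch
        acc_max = acc
        saved_epochs.append(epoch)
print(saved_epochs)  # [1, 2, 4]
```

Epoch 3 does not overwrite the checkpoint because its accuracy is below the best value seen so far, so the saved file always corresponds to the best validation epoch.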

Worked example

Suppose x and y start out as follows:

```python
x = [
  ['i', 'love', 'this', 'movie', '!'],
  ['this', 'film', 'is', 'fantastic'],
  ['what', 'a', 'great', 'experience', '!'],
  ['i', 'hate', 'this', 'movie'],
  ['this', 'film', 'is', 'terrible'],
  ['what', 'a', 'bad', 'experience']
]

y = [1, 1, 1, 0, 0, 0]
```

Let us follow x and y through the training pipeline.

Initializing the dataset

First, x and y are passed to the constructor of the TextDataset class defined earlier to create a dataset instance.

Constructor (__init__): x and y are zipped into a list of tuples, self.data, in which each tuple holds a sentence and its label, for example [(['i', 'love', 'this', 'movie', '!'], 1), (['this', 'film', 'is', 'fantastic'], 1), ...].
__len__ method: returns the length of the dataset; here len(x) is 6.
__getitem__ method: takes an index idx and returns the corresponding item; for idx=0 it returns (['i', 'love', 'this', 'movie', '!'], 1).

Data loading and batching

During training, a DataLoader draws batches from the TextDataset instance.

```python
train_loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)
```

collate_fn turns one batch from the DataLoader into two tuples, data (the sentences) and label (the labels), and returns them. With batch_size=2, one batch might be:

```python
batch = [
  (['i', 'love', 'this', 'movie', '!'], 1),
  (['this', 'film', 'is', 'fantastic'], 1)
]
```

After data, label = zip(*batch) inside collate_fn, data is (['i', 'love', 'this', 'movie', '!'], ['this', 'film', 'is', 'fantastic']) and label is (1, 1).

The *batch syntax is called unpacking in Python: it passes each element of an iterable to a function as a separate argument, so zip(*batch) is equivalent to zip(batch[0], batch[1]).
zip then groups the first elements of all its arguments into one tuple and the second elements into another, which separates the sentences from the labels.
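The unpacking step can be tried directly without PyTorch. The batch below is the two-sample example from the walk-through, with tokens as clean_str would produce them.

```python
def collate_fn(batch):
    # Split a list of (sentence, label) pairs into two tuples.
    data, label = zip(*batch)
    return data, label

batch = [
    (['i', 'love', 'this', 'movie', '!'], 1),
    (['this', 'film', 'is', 'fantastic'], 1),
]
data, label = collate_fn(batch)
print(data)   # a tuple holding the two token lists
print(label)  # (1, 1)

# zip(*batch) is the same as passing each pair as its own argument.
same = list(zip(*batch)) == list(zip(batch[0], batch[1]))
print(same)   # True
```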

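Generating embeddings with the ELMo model

In the forward method of the TextCNNModel class:

```python
def forward(self, batch_idx):
    batch_emb = get_elmo(self.elmo_model, batch_idx)
    batch_emb = torch.unsqueeze(batch_emb, dim=1)
```

batch_idx is the data returned by collate_fn (the sentences of the batch), and get_elmo turns it into ELMo embeddings. The output of get_elmo is a tensor of shape (batch_size, max_len, emb_num), where max_len is the length of the longest sentence in the batch and emb_num is the embedding dimension of each word.
torch.unsqueeze expands batch_emb from (batch_size, max_len, emb_num) to (batch_size, 1, max_len, emb_num), where the 1 is an extra channel dimension.
For a concrete shape, assume a batch of two sentences whose longest sentence has 13 tokens, with a 512-dimensional embedding per word. batch_to_ids pads shorter sentences to the longest sentence in the batch, so this length can differ from batch to batch. batch_emb then has shape:

```text
torch.Size([2, 13, 512])
```

- 2 is the number of samples in the batch.
- 13 is the number of tokens per sentence after padding.
- 512 is the ELMo embedding dimension of each token.

torch.unsqueeze(batch_emb, dim=1) adds a new dimension that serves as the number of input channels (in_channels) expected by the convolution. Afterwards batch_emb has shape:

```text
torch.Size([2, 1, 13, 512])
```

- 1 is the number of input channels (a single channel).

Convolution and feature extraction

TextCNNModel contains three convolution blocks with kernel sizes 2, 3, and 4, which extract features at different scales:

```python
b1_result = self.block1(batch_emb)
b2_result = self.block2(batch_emb)
b3_result = self.block3(batch_emb)
```

After the convolution, ReLU, and squeeze, each block holds a tensor of shape (batch_size, hidden_num, max_len - kernel_size + 1). mean(dim=2) then averages over the time dimension, so each block outputs a tensor of shape (batch_size, hidden_num).
Inside Block, the data first passes through a 2D convolution layer. With a kernel of size (2, 512), the kernel slides along the sentence-length dimension (the 13), covering a window of two words at a time across the full 512-dimensional embedding.
Assuming hidden_num = 2, the convolution output has shape torch.Size([2, 2, new_len, 1]), and squeeze(dim=-1) removes the last dimension, leaving:

```text
torch.Size([2, 2, new_len])
```

- The first 2 is the batch size.
- The second 2 is the number of kernels (hidden_num).
- new_len is the sentence length after the convolution. The convolution here uses stride 1 and no padding, so new_len = max_len - kernel_size + 1, which is 12 for a kernel size of 2.

Finally, torch.cat concatenates the outputs of the three blocks into a feature tensor of shape (batch_size, hidden_num * 3).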
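The shape bookkeeping for the three blocks can be reproduced with a few lines of arithmetic. The sizes (batch of 2, 13 tokens, hidden_num of 2) are the assumed values from the walk-through; the helper name `conv_output_len` is hypothetical.

```python
def conv_output_len(max_len, kernel_s):
    """Length along the time axis after a stride-1 convolution without padding."""
    return max_len - kernel_s + 1

batch_size, max_len, hidden_num = 2, 13, 2   # values assumed in the walk-through
for kernel_s in (2, 3, 4):
    new_len = conv_output_len(max_len, kernel_s)
    before_pool = (batch_size, hidden_num, new_len)
    after_pool = (batch_size, hidden_num)    # mean over the time axis
    print(kernel_s, before_pool, "->", after_pool)

feature_shape = (batch_size, hidden_num * 3)  # three blocks concatenated on dim=1
print(feature_shape)  # (2, 6)
```

The time lengths 12, 11, and 10 differ between blocks, which is why each block is pooled to a fixed size before concatenation.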
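
Classification and loss computation

The concatenated feature tensor is passed to the fully connected layer self.classifier, which outputs the prediction tensor pre of shape (batch_size, class_num). During training, the cross-entropy loss between the predictions and the true labels is computed by the loss_fn passed to train_and_evaluate; the model also defines self.loss_fun, but forward does not call it.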
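For one sample, the cross-entropy loss on raw classifier outputs (logits) is the negative log of the softmax probability of the true class. The sketch below computes it by hand in plain Python; the logits are made up, and the helper `cross_entropy` is an illustration of the formula rather than PyTorch's implementation.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one sample: -log(softmax(logits)[label])."""
    shifted = [z - max(logits) for z in logits]   # subtract the max for numerical stability
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    return log_sum_exp - shifted[label]

logits = [2.0, 0.0]                    # hypothetical scores for classes 0 and 1
loss_right = cross_entropy(logits, 0)  # true class has the higher score
loss_wrong = cross_entropy(logits, 1)  # true class has the lower score
print(round(loss_right, 4), round(loss_wrong, 4))  # 0.1269 2.1269
```

The loss is small when the higher logit belongs to the true class and large otherwise. nn.CrossEntropyLoss applies the same formula to every sample and, by default, averages over the batch.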

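Exercise

Briefly explain why embeddings = torch.unsqueeze(embeddings, dim=1) is executed, including the effect of the unsqueeze operation and the reason for applying it to the embeddings.

embeddings = torch.unsqueeze(embeddings, dim=1) adds a new dimension to the tensor so that its layout matches the input format that a later operation, here the convolution layer, requires.

Effect of unsqueeze
torch.unsqueeze inserts a dimension of size 1 at the specified position. If embeddings originally has shape [batch_size, seq_len, embedding_dim], the new shape after unsqueeze is [batch_size, 1, seq_len, embedding_dim]. In detail:
- Original shape [batch_size, seq_len, embedding_dim]:
  - batch_size: the number of samples processed at once.
  - seq_len: the sequence length of each sample (the number of words in the sentence).
  - embedding_dim: the dimension of each word's embedding vector.
- Shape after unsqueeze [batch_size, 1, seq_len, embedding_dim]:
  - A dimension of size 1 is inserted at position dim=1; it is interpreted as the "channel" dimension.

Reasons for applying unsqueeze to the embeddings
- Compatibility with convolution: nn.Conv2d, which comes from image processing, expects input of shape [batch_size, channels, height, width]. Adding a channel dimension to obtain [batch_size, 1, seq_len, embedding_dim] lets the embedding matrix of each sentence be passed to the convolution layer as a single-channel "image" of height seq_len and width embedding_dim.
- Flexibility: the channel dimension makes later extensions easier, for example stacking several embedding channels (such as different pretrained embeddings) or applying other operations before the convolution layer.
- Consistency: networks that process multi-dimensional data such as images or sequences need inputs with a consistent layout. The unsqueeze operation guarantees that every input has the same channel dimension, which simplifies the operations that follow.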
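The effect on the shape can be imitated with nested Python lists. This is only an analogue of torch.unsqueeze(x, dim=1), not a call into PyTorch; the helper `shape` and the small sizes are assumptions chosen for illustration.

```python
def shape(nested):
    """Shape of a regular nested list, e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

batch_size, seq_len, embedding_dim = 2, 5, 4   # small made-up sizes
embeddings = [[[0.0] * embedding_dim for _ in range(seq_len)]
              for _ in range(batch_size)]

# Wrapping each sample in a one-element list adds a size-1 axis at position 1.
with_channel = [[sample] for sample in embeddings]

print(shape(embeddings))    # (2, 5, 4)
print(shape(with_channel))  # (2, 1, 5, 4)
```

No values change; only the nesting does, which mirrors how unsqueeze changes the tensor's shape without touching its data.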
