Definition and Scope

Natural Language Processing (NLP) is an interdisciplinary field of artificial intelligence and linguistics that aims to enable computers to understand, generate, and interact in natural language.¹

NLP Task Categories

| Task type | Examples | Characteristics |
| --- | --- | --- |
| Text classification | spam detection, sentiment analysis | sentence-level output |
| Sequence labeling | word segmentation, POS tagging, named entity recognition | one label per token |
| Sequence generation | machine translation, text summarization | variable-length output sequence |
| Text matching | question answering, semantic similarity | relation between two pieces of text |

Tokenization

Tokenization splits text into minimal semantic units.

Word-level tokenization

# simple whitespace tokenization
text = "The quick brown fox jumps"
tokens = text.split()
# ['The', 'quick', 'brown', 'fox', 'jumps']

Subword-level tokenization

BPE (Byte Pair Encoding) is the most widely used subword algorithm:

# BPE training loop (pseudocode)
def train_bpe(corpus, vocab_size):
    vocab = set(characters_in(corpus))
    pairs = count_adjacent_pairs(corpus)

    while len(vocab) < vocab_size:
        most_frequent = find_most_common_pair(pairs)
        new_token = merge_pair(most_frequent)
        vocab.add(new_token)
        pairs = update_pairs(corpus, most_frequent, new_token)

    return vocab
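The pseudocode above can be made concrete. The sketch below (the toy corpus and all names are illustrative) represents each word as a tuple of symbols and repeatedly merges the most frequent adjacent pair until the vocabulary reaches the target size:

```python
from collections import Counter

def train_bpe(corpus, vocab_size):
    # represent each whitespace-split word as a tuple of symbols (characters to start)
    words = Counter()
    for line in corpus:
        for w in line.split():
            words[tuple(w)] += 1

    vocab = {ch for word in words for ch in word}
    merges = []

    while len(vocab) < vocab_size:
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break

        best = max(pairs, key=pairs.get)
        new_token = best[0] + best[1]
        merges.append(best)
        vocab.add(new_token)

        # apply the merge to every word
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(new_token)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words

    return vocab, merges
```

On a toy corpus such as `["low low low lower lowest"]` with `vocab_size=10`, the first merges produce "lo" and then "low", mirroring how BPE builds frequent subwords bottom-up.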

Comparison of common tokenizers

| Tokenizer | Algorithm | Notes |
| --- | --- | --- |
| WordPiece | greedy likelihood-based merging | used by Google's BERT |
| SentencePiece | unified framework | supports BPE and Unigram |
| BPE | frequency-based merging | used by GPT-2 |
| Character-level | one token per character | robust, but long sequences |

Word Embedding

Word embeddings map discrete words into a continuous vector space.

One-Hot Encoding

import numpy as np
 
vocab = ["cat", "dog", "bird"]
vocab_size = len(vocab)
 
# one-hot vectors
cat = np.array([1, 0, 0])
dog = np.array([0, 1, 0])
bird = np.array([0, 0, 1])

Problems: high dimensionality (= |V|), sparsity, and no semantic relationships between vectors.
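The lack of semantic structure is easy to see numerically: every pair of distinct one-hot vectors is orthogonal, so their cosine similarity is always 0. A small check using `np.eye` to build the whole one-hot matrix at once:

```python
import numpy as np

vocab = ["cat", "dog", "bird"]
one_hot = np.eye(len(vocab))  # row i is the one-hot vector of vocab[i]

# distinct one-hot vectors are orthogonal; since each has norm 1,
# the dot product is already the cosine similarity
cat, dog = one_hot[0], one_hot[1]
cos_sim = float(cat @ dog)  # 0.0 for any pair of different words
```

"cat" is no closer to "dog" than to "bird" under this encoding, which is why learned embeddings are needed.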

Word2Vec

Word2Vec learns word vectors from context and comes in two architectures:

CBOW (Continuous Bag-of-Words): predicts the center word from its context

Skip-gram: predicts the context from the center word

from gensim.models import Word2Vec
 
sentences = [["cat", "say", "meow"], ["dog", "say", "woof"]]
model = Word2Vec(sentences, vector_size=100, window=3, min_count=1, sg=0)  # sg=0: CBOW, sg=1: Skip-gram

# word vector
vector = model.wv["cat"]  # 100-dimensional vector

# semantic similarity
similarity = model.wv.similarity("cat", "dog")

GloVe

GloVe (Global Vectors) combines global matrix factorization with local context windows, minimizing

J = Σ_{i,j=1..V} f(X_ij) (w_i · w̃_j + b_i + b̃_j − log X_ij)²

where f is a weighting function and X_ij is the number of times word j appears in the context of word i.
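As a toy illustration of X_ij, a symmetric co-occurrence matrix with a context window of 1 can be built by hand (the two-sentence corpus and the window size are made up for illustration):

```python
import numpy as np

corpus = ["the cat sat", "the dog sat"]
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

# symmetric co-occurrence counts with a window of 1
X = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    words = s.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):  # left and right neighbor
            if 0 <= j < len(words):
                X[idx[w], idx[words[j]]] += 1
```

GloVe factorizes (the log of) such a matrix; counting both neighbors makes X symmetric.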

Embedding Layer

A trainable embedding layer in deep learning models:

import torch
import torch.nn as nn
 
vocab_size = 10000
embed_dim = 256
 
embedding = nn.Embedding(vocab_size, embed_dim)
input_ids = torch.tensor([[1, 5, 3, 2]])  # (batch=1, seq_len=4)

embedded = embedding(input_ids)  # (1, 4, 256)
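Under the hood, `nn.Embedding` is just a trainable lookup table: indexing its weight matrix with the token ids gives the same result as calling the layer. A quick check with small illustrative sizes:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10, 4)           # vocab of 10, 4-dim vectors
input_ids = torch.tensor([[1, 5, 3, 2]])  # (batch=1, seq_len=4)

embedded = embedding(input_ids)           # (1, 4, 4)
# equivalent to directly indexing the weight matrix
same = embedding.weight[input_ids]
```

The gradient of the loss therefore only touches the rows that were actually looked up.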

Seq2Seq and Attention

Seq2Seq maps a variable-length input sequence to a variable-length output sequence.

Encoder-Decoder Architecture

Encoder:                    Decoder:
h_t = f(x_t, h_{t-1})       s_t = f(s_{t-1}, y_{t-1}, c_t)
                            y_t = g(s_t, y_{t-1}, c_t)

where c_t is the context vector passed from the encoder.
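These recurrences can be sketched with GRUs (all sizes are illustrative; a real decoder would embed its previous prediction, and with attention would also consume a context vector c_t at each step):

```python
import torch
import torch.nn as nn

enc = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
dec = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

src = torch.randn(1, 5, 8)     # source sequence: batch=1, 5 steps
enc_outputs, h = enc(src)      # h: final encoder state, used to initialize the decoder

y = torch.zeros(1, 1, 8)       # start-of-sequence input
outputs = []
for _ in range(3):             # generate 3 decoder steps
    out, h = dec(y, h)         # s_t = f(s_{t-1}, y_{t-1})
    outputs.append(out)
    y = torch.randn(1, 1, 8)   # stand-in for embedding the previous output token
```

The single vector `h` handed from encoder to decoder is exactly the bottleneck that attention (next section) removes.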

The Attention Mechanism

Attention alleviates the bottleneck of compressing the whole input into a single encoder vector:

class Attention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.attn = nn.Linear(hidden_dim * 2, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1, bias=False)
    
    def forward(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (batch, hidden_dim)
        # encoder_outputs: (batch, seq_len, hidden_dim)

        seq_len = encoder_outputs.size(1)
        decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, seq_len, 1)

        energy = torch.tanh(self.attn(
            torch.cat([decoder_hidden, encoder_outputs], dim=2)
        ))
        attention = self.v(energy).squeeze(2)  # (batch, seq_len)

        return torch.softmax(attention, dim=1)

Bahdanau Attention vs Luong Attention

| Feature | Bahdanau | Luong |
| --- | --- | --- |
| Where attention is computed | before the decoder step | after the decoder step |
| Score function | additive (concat + tanh) | multiplicative (dot / general) |
| Context vector c_t | fed into the decoder as input | used directly at the output layer |

Transformer

The Transformer is built entirely on attention, dispensing with recurrence.²

Self-Attention

Self-attention relates positions within a single sequence:

import torch
import torch.nn.functional as F
import math
 
def self_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V), attention_weights
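The `mask` argument is what turns this into causal (unidirectional) attention. Inlining the same computation with a lower-triangular mask shows that each position can only attend to itself and earlier positions, and that the weights along the last dimension still sum to 1:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q = K = V = torch.randn(1, 4, 8)  # (batch, seq_len, d_k)

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))
mask = torch.tril(torch.ones(4, 4))          # 1 = may attend, 0 = masked
scores = scores.masked_fill(mask == 0, -1e9)
weights = F.softmax(scores, dim=-1)          # each row sums to 1
```

Position 0 has only itself unmasked, so its entire attention weight lands on position 0.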

Multi-Head Attention

Multi-head attention runs several attention heads in parallel:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def split_heads(self, x):
        batch_size, seq_len = x.size(0), x.size(1)
        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
    
    def forward(self, query, key, value, mask=None):
        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))
        
        attn_output, attn_weights = self_attention(Q, K, V, mask)
        
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.view(attn_output.size(0), -1, self.d_model)
        
        return self.W_o(attn_output)
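PyTorch also ships an equivalent built-in layer, which is handy for sanity-checking shapes (`batch_first=True` makes it accept (batch, seq_len, d_model) tensors; the sizes here are illustrative):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)     # (batch, seq_len, d_model)
out, weights = mha(x, x, x)    # self-attention: query = key = value
```

The output keeps the input shape (2, 10, 64), and the returned weights, averaged over heads, have shape (batch, target_len, source_len).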

Positional Encoding

Because the Transformer has no recurrent structure, positional information must be added explicitly:

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
    
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
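Two easy properties to check on this table: at position 0 every sine channel is 0 and every cosine channel is 1, and all values stay in [-1, 1]. Recomputing the table standalone with small illustrative sizes:

```python
import math
import torch

d_model, max_len = 16, 50
position = torch.arange(0, max_len).unsqueeze(1).float()
div_term = torch.exp(
    torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
)

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)  # even channels: sine
pe[:, 1::2] = torch.cos(position * div_term)  # odd channels: cosine
```

The bounded range is why the encoding can simply be added to the embeddings without rescaling.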

Full Transformer Structure

Input Embedding + Positional Encoding
           ↓
     [Encoder Layer] × N
           ↓
     [Decoder Layer] × N
           ↓
        Linear + Softmax
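The whole stack is also available as a single PyTorch module; a quick shape check of the encoder-decoder flow sketched above (hyperparameters illustrative):

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=32, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)

src = torch.randn(1, 10, 32)  # already embedded + position-encoded source
tgt = torch.randn(1, 7, 32)   # already embedded + position-encoded target

out = model(src, tgt)         # (batch, tgt_len, d_model), before Linear + Softmax
```

Note that `nn.Transformer` covers only the encoder/decoder stacks; the input embedding, positional encoding, and final Linear + Softmax are left to the user, as in the diagram.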

Pretrained Models

BERT (Bidirectional Encoder Representations from Transformers)

BERT is trained with a Masked Language Model (MLM) objective:

# MLM: randomly mask 15% of the tokens
input_ids = torch.tensor([[101, 2003, 103, 2023, 2154, 102]])  # position 2 replaced by [MASK] (id 103)
labels = torch.tensor([[-100, -100, 2305, -100, -100, -100]])  # loss is computed only at masked positions

# BERT outputs (bert_model loaded beforehand, e.g. via transformers)
outputs = bert_model(input_ids)
sequence_output = outputs.last_hidden_state  # (batch, seq_len, hidden)
pooled_output = outputs.pooler_output        # (batch, hidden)
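The corruption step itself can be reproduced without a pretrained model. A simplified sketch (real BERT replaces only 80% of the chosen tokens with [MASK], keeping 10% random and 10% unchanged; 103 is the [MASK] id in BERT's English vocab, and -100 is the label value PyTorch's cross-entropy ignores):

```python
import torch

torch.manual_seed(0)
input_ids = torch.randint(1000, 2000, (1, 12))  # stand-in token ids
labels = input_ids.clone()

mask = torch.rand(input_ids.shape) < 0.15       # choose ~15% of positions
labels[~mask] = -100                            # ignored by the loss
input_ids = input_ids.masked_fill(mask, 103)    # replace chosen tokens with [MASK]
```

Unmasked positions contribute nothing to the loss, so the model is graded only on reconstructing what was hidden.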

GPT (Generative Pre-trained Transformer)

GPT uses a unidirectional (autoregressive) language model, maximizing

L(θ) = Σ_t log P(x_t | x_{<t}; θ)

so each token is predicted only from the tokens before it.

T5 (Text-to-Text Transfer Transformer)

T5 casts every NLP task into a text-to-text format:

| Task | Input | Output |
| --- | --- | --- |
| Translation | "translate English to French: hello" | "bonjour" |
| Summarization | "summarize: article text…" | "summary…" |
| Question answering | "question: … context: …" | "answer" |

Common NLP Tools

Hugging Face Transformers

from transformers import pipeline, AutoTokenizer, AutoModel
 
# ready-made pipeline with a pretrained model
classifier = pipeline("sentiment-analysis")
result = classifier("I love this movie!")
# [{'label': 'POSITIVE', 'score': 0.9998}]
 
# load a tokenizer and model
model_name = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

jieba (Chinese word segmentation)

import jieba
 
text = "北京大学的学生在研究自然语言处理"
 
# accurate mode
words = jieba.cut(text)
print("/".join(words))
# 北京大学/的/学生/在/研究/自然语言处理
 
# full mode
words = jieba.cut(text, cut_all=True)
print("/".join(words))
# 北京/北京大学/大学/的/学生/在/研究/自然/自然语言/语言/处理

References

  1. Speech and Language Processing - Daniel Jurafsky & James H. Martin. https://web.stanford.edu/~jurafsky/slp3/

  2. Attention Is All You Need - Vaswani et al. (2017). https://arxiv.org/abs/1706.03762