Llama Architecture Analysis


An analysis of the Llama architecture from the code's point of view.


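Preface

Meta developed and publicly released the Llama family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.

According to the original LLaMA paper, LLaMA-13B outperforms GPT-3 (175B) on most tasks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.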

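Llama Architecture Analysis

Tokenization

The tokenization step uses a text tokenizer to split the input text into tokens and map them to token IDs (input_ids).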

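    from transformers import AutoTokenizer

    # PATH_TO_CONVERTED_TOKENIZER is a placeholder for the path of the converted tokenizer files
    tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
    text = "Hey, are you conscious? Can you talk to me?"
    # return_tensors="pt" returns input_ids and attention_mask as PyTorch tensors
    inputs = tokenizer(text, return_tensors="pt")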

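Model Backbone

The backbone feeds the input_ids produced by tokenization into the embedding layer, which turns them into vectors (hidden_states, the intermediate result). These vectors then pass through the stack of layers, producing new hidden_states that the downstream task heads consume.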

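    # Excerpt from the model's __init__ (assumes torch.nn is imported as nn)
    # Token embedding: maps each input ID to a vector of size hidden_size
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
    # Stack of num_hidden_layers decoder layers
    self.layers = nn.ModuleList(
        [LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
    )
    self._use_flash_attention_2 = config._attn_implementation == "flash_attention_2"
    # Final RMSNorm applied to the output of the last decoder layer
    self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)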

DecoderLayer

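The backbone's layers are a stack of DecoderLayer modules, and their number is set by the num_hidden_layers parameter. The model size we usually quote depends largely on this number, together with the width of each layer: the 7B model has 32 DecoderLayer layers.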
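To make the link between layer count and model size concrete, here is a rough parameter count. Only the 32 layers come from the text above; hidden_size 4096, intermediate_size 11008 and vocab_size 32000 are the commonly published 7B values and are assumptions here, as are the untied input and output embeddings and the bias-free linear layers. The helper name count_params is made up for this sketch.

```python
def count_params(vocab_size, hidden_size, intermediate_size, num_hidden_layers):
    """Rough parameter count for a Llama-style decoder (illustrative helper)."""
    embed = vocab_size * hidden_size                # token embedding table
    attn = 4 * hidden_size * hidden_size            # q, k, v, o projections (no grouped KV heads assumed)
    mlp = 3 * hidden_size * intermediate_size       # gate_proj, up_proj, down_proj
    norms = 2 * hidden_size                         # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    final_norm = hidden_size                        # RMSNorm after the last layer
    lm_head = hidden_size * vocab_size              # output projection, assumed untied
    return embed + num_hidden_layers * per_layer + final_norm + lm_head

# Assumed 7B configuration (see the note above)
total = count_params(vocab_size=32000, hidden_size=4096,
                     intermediate_size=11008, num_hidden_layers=32)
print(f"{total:,}")  # 6,738,415,616
```

Under these assumptions the 32 decoder layers account for roughly 6.48 billion of the 6.74 billion parameters, which is why the layer count matters so much for the overall size.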

Each DecoderLayer in turn contains an Attention layer and an MLP layer. Its key design idea is the residual connection.

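As shown in the figure below, the layer has two parts.

Part one

First, hidden_states (the result of vectorizing the text) is copied as the residual. It is then normalized, passed through the attention layer, and added back to the residual.

Part two

The hidden_states produced by part one is copied again as the residual. It is then normalized, passed through the MLP layer, and added back to the residual.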

[Figure: structure of the DecoderLayer (image not available)]

    # Excerpt from the decoder layer's forward()
    # Keep a copy for the residual connection
    residual = hidden_states
    # Normalize
    hidden_states = self.input_layernorm(hidden_states)
    # Self-attention
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
        hidden_states=hidden_states,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_value=past_key_value,
        output_attentions=output_attentions,
        use_cache=use_cache,
        padding_mask=padding_mask,
    )
    # Add the residual
    hidden_states = residual + hidden_states

    # Keep a copy for the second residual connection
    residual = hidden_states
    # Normalize
    hidden_states = self.post_attention_layernorm(hidden_states)
    # MLP
    hidden_states = self.mlp(hidden_states)
    # Add the residual
    hidden_states = residual + hidden_states

    outputs = (hidden_states,)
    if output_attentions:
        outputs += (self_attn_weights,)
    if use_cache:
        outputs += (present_key_value,)
    return outputs
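The "copy, normalize, sublayer, add" pattern can be sketched in plain Python on a single vector. This is only an illustration, assuming the normalization is RMSNorm as in the backbone code; the function names rms_norm and pre_norm_residual are made up for the sketch, and the real layers operate on batched tensors.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]

def pre_norm_residual(x, weight, sublayer):
    """Normalize first, run the sublayer, then add the untouched input back."""
    residual = x
    hidden = sublayer(rms_norm(x, weight))
    return [r + h for r, h in zip(residual, hidden)]

def zero_sublayer(h):
    """Stand-in sublayer that outputs zeros, so the block reduces to the identity."""
    return [0.0 for _ in h]

x = [3.0, 4.0]
normed = rms_norm(x, weight=[1.0, 1.0], eps=0.0)
passthrough = pre_norm_residual(x, [1.0, 1.0], zero_sublayer)
print(normed, passthrough)
```

Because the residual path bypasses both the normalization and the sublayer, a sublayer that contributes nothing leaves the input unchanged, which is part of what makes deep stacks of such layers easier to train.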

Attention

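The attention layer applies positional encoding, in the form of rotary position embeddings, so that the model can better capture contextual information.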
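A minimal sketch of the rotary idea on a single vector, in plain Python. It pairs dimension i with dimension i + d/2 and rotates each pair by an angle that grows with the token position. The base of 10000 and the function names rope and dot are assumptions of this sketch, not taken from the text.

```python
import math

def rotate_half(x):
    """Map (x1, x2) to (-x2, x1), where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def rope(x, position, base=10000.0):
    """Rotate vector x according to its token position (len(x) must be even)."""
    d = len(x)
    assert d % 2 == 0, "head dimension must be even"
    inv_freq = [1.0 / (base ** (2 * i / d)) for i in range(d // 2)]
    angles = [position * f for f in inv_freq] * 2   # same angle for both halves of a pair
    rotated = rotate_half(x)
    return [xv * math.cos(a) + rv * math.sin(a)
            for xv, rv, a in zip(x, rotated, angles)]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.5, -1.0, 2.0, 0.3]
k = [1.5, 0.2, -0.7, 1.1]
score_near_start = dot(rope(q, 5), rope(k, 3))     # positions 5 and 3
score_shifted = dot(rope(q, 12), rope(k, 10))      # positions 12 and 10, same offset
print(score_near_start, score_shifted)
```

The two scores agree up to floating-point error: after the rotation, the query-key dot product depends only on the offset between the two positions, which is how relative position enters the attention weights.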

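    # Excerpt from the attention layer's forward() (assumes math, torch and torch.nn as nn are imported)
    bsz, q_len, _ = hidden_states.size()

    # Linear projections
    query_states = self.q_proj(hidden_states)
    key_states = self.k_proj(hidden_states)
    value_states = self.v_proj(hidden_states)

    # Reshape to (bsz, heads, q_len, head_dim) for multi-head attention
    query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
    key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
    value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
    kv_seq_len = key_states.shape[-2]

    # Compute cos and sin, then apply the rotary position embedding to queries and keys
    cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

    # Repeat the key/value heads so that they match the number of query heads
    key_states = repeat_kv(key_states, self.num_key_value_groups)
    value_states = repeat_kv(value_states, self.num_key_value_groups)

    # Scaled dot-product attention scores
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
    # Add the attention mask
    attn_weights = attn_weights + attention_mask
    # Softmax in float32, then cast back to the working dtype
    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
    attn_output = torch.matmul(attn_weights, value_states)

    # Merge the heads back to (bsz, q_len, hidden_size) before the output projection
    attn_output = attn_output.transpose(1, 2).contiguous()
    attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
    attn_output = self.o_proj(attn_output)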
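The masking, softmax and key/value sharing steps can be sketched in plain Python. This is a simplified illustration that assumes a causal mask, one sequence and no key/value cache; causal_attention is a made-up name, and repeat_kv here is a list-based stand-in for the tensor helper of the same name in the excerpt.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head attention; q, k, v are lists of per-token vectors of equal length."""
    head_dim = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if j > i:
                scores.append(float("-inf"))        # mask: a token cannot see later tokens
            else:
                scores.append(sum(a * b for a, b in zip(qi, kj)) / math.sqrt(head_dim))
        weights = softmax(scores)
        out.append([sum(weights[j] * v[j][d] for j in range(len(v)))
                    for d in range(len(v[0]))])
    return out

def repeat_kv(kv_heads, n_rep):
    """Repeat each key/value head n_rep times so it lines up with the query heads."""
    return [head for head in kv_heads for _ in range(n_rep)]

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# One key/value head shared by two query heads (both query heads reuse q here for brevity)
kv_heads = repeat_kv([(k, v)], n_rep=2)
outputs = [causal_attention(q, k_h, v_h) for k_h, v_h in kv_heads]
print(outputs[0][0])  # the first token only attends to itself
```

Sharing key/value heads across several query heads reduces the size of the key/value projections and cache while keeping the number of query heads unchanged.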

MLP

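The MLP layer applies a non-linear activation function and linear projections.

The hidden states coming out of the attention part (after the residual add and normalization) first pass through two linear layers, gate_proj and up_proj. The gate_proj output goes through the activation function and is multiplied element-wise by the up_proj output. A final linear layer, down_proj, then produces the result.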

    # In __init__ (assumes torch.nn is imported as nn)
    self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
    self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
    self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
    self.act_fn = ACT2FN[config.hidden_act]

    # In forward(x): gate the up projection, then project back to hidden_size
    down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
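The gated computation can be written out on a single vector in plain Python. This sketch assumes the activation selected by config.hidden_act is SiLU; the tiny weight matrices and the names matvec and gated_mlp are made up for illustration.

```python
import math

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector, with no bias term."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_mlp(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]     # activation on the gate branch only
    up = matvec(W_up, x)                            # plain linear branch
    hidden = [g * u for g, u in zip(gate, up)]      # element-wise product
    return matvec(W_down, hidden)                   # project back to hidden_size

# hidden_size = 2, intermediate_size = 3 (toy sizes)
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
W_down = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

y = gated_mlp([1.0, 2.0], W_gate, W_up, W_down)
y_zero = gated_mlp([0.0, 0.0], W_gate, W_up, W_down)
print(y, y_zero)
```

The gate branch decides, per intermediate dimension, how much of the up branch passes through; the input and output both have size hidden_size, so the block fits inside the residual connection.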

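Downstream Tasks

Causal Language Modeling

Causal language modeling is an autoregressive task: the model predicts the next token from the tokens that precede it. The head is a linear layer that maps each hidden state to one score per vocabulary entry.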

self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
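A minimal sketch of how the lm_head scores can be used, in plain Python: pick the highest-scoring token for generation, and score position t against token t+1 for training. The function names and toy weights are assumptions of this sketch, and greedy selection is only one of several possible decoding strategies.

```python
import math

def lm_head(hidden, W):
    """Map one hidden state to vocabulary logits; W has vocab_size rows, no bias."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W]

def greedy_next_token(logits):
    """Index of the highest-scoring vocabulary entry."""
    return max(range(len(logits)), key=lambda i: logits[i])

def next_token_loss(logits_per_pos, input_ids):
    """Mean cross-entropy where the logits at position t are scored against token t+1."""
    losses = []
    for logits, target in zip(logits_per_pos[:-1], input_ids[1:]):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[target])
    return sum(losses) / len(losses)

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]            # vocab_size = 3, hidden_size = 2
logits = lm_head([0.5, 2.0], W)
next_id = greedy_next_token(logits)

# With uniform logits the loss equals log(vocab_size)
uniform_loss = next_token_loss([[0.0, 0.0, 0.0]] * 3, [0, 1, 2])
print(logits, next_id, uniform_loss)
```

Although the backbone is shared, this head turns each position into a classification over the whole vocabulary, which is what makes token-by-token generation possible.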

Text Classification

This is a classification task: a linear layer maps the hidden state to num_labels scores.

self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
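Since a decoder-only model has no dedicated classification token, a common choice is to classify from the hidden state of the last non-padding token, which has seen the whole input. The sketch below assumes right padding and a pad_token_id that never appears inside the text; the function names and toy values are made up for illustration.

```python
def last_token_index(input_ids, pad_token_id):
    """Index of the last real token, assuming padding is appended on the right."""
    for i, tok in enumerate(input_ids):
        if tok == pad_token_id:
            return i - 1
    return len(input_ids) - 1

def classify(hidden_states, input_ids, W_score, pad_token_id):
    """Score the last real token's hidden state; W_score has num_labels rows, no bias."""
    idx = last_token_index(input_ids, pad_token_id)
    pooled = hidden_states[idx]
    logits = [sum(w * h for w, h in zip(row, pooled)) for row in W_score]
    label = max(range(len(logits)), key=lambda i: logits[i])
    return label, logits

input_ids = [5, 7, 9, 0, 0]                         # 0 is the assumed pad_token_id
hidden_states = [[0.1, 0.0], [0.4, 0.3], [0.2, 0.9], [0.0, 0.0], [0.0, 0.0]]
W_score = [[1.0, 0.0], [0.0, 1.0]]                  # num_labels = 2, hidden_size = 2

label, logits = classify(hidden_states, input_ids, W_score, pad_token_id=0)
print(label, logits)
```

Only one position per sequence contributes to the prediction, so the padding positions never influence the label.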
