{"posts":[{"title":"BLIP论文阅读","text":"BLIP是一种统一了多模态理解和文本生成的方法 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation 主要有两个部分: 一种新的框架和一种新的多模态数据构造方法 1.模型部分: BLIP主要构造了3个encoder和一个decoder,在文章中称之为MED(Multimodal mixture of Encoder-Decoder): [1]unimodel encoder:text encoder和BERT是一样的结构,Image encoder和ViT是一样的结构。 [2]image-grounded text encoder:本质是一个模态融合模块,用一个简单的cross-attention补充交互能力,image的表示和”[Encode] text”做融合 [3]image-grounded text decoder:是一个普通的transformer decoder,只是cross-attention的输入是来自于image模态。 对应于三个encoder,有不同的预训练损失: [1]Image-Text-Contrastive Loss(ITC):希望一个text-image pair的表示尽可能相似 [2]Image-Text Matching Loss(ITM):希望他能正确的判断当前的text-image pair是不是正确匹配的 [3]Language Modeling Loss(LM):希望对于一个image,能生成它的描述 2.数据构造方法CapFilt(Captioning and Filtering): 这是一类自举(Bootstrapping)的方法,自举这种方法的特点是用自己来增强自己 从图中可以看出,流程为: [1]先用人工+web数据训两个模型,需要pretrained和在COCO上的SFT(SFT的部分各训各的对应任务),这样就有了captioner(给图片写描述)和filter(判断当前的pair是不是匹配) [2]用captioner给图打标签,得到$ T_s $(合成text) [3]用filter去判断原来的web text/合成 text是否与图片匹配,如果匹配,就放到数据里 [4]循环往复,左脚踩右脚原地升天 注:这里有个很怪的地方是:pretrained data里其实就含有COCO,再做SFT可能只是想增强某种能力(增强/判断) 3.下游任务的 模型输入输出Visual Question Answering (VQA) 就是遵循一般的模型结构,直接问 Natural Language Visual Reasoning ($ \\text{NLVR}^2 $) 这类任务是给定两张图片,问这两张图满不满足给定的某句话,输入是两张图和一句话,输出是true/false 模型结构修改为: 这里不是采用LLM里用的直接预测结果,而是类似于用T5的Encoder做分类,两边的部分是MED里的image-encoder,取消掉text encoder,直接用image-grounded text encoder来做模态融合。 取[Encode]标签的表示做为二分类的结果 for each transformer block in the image-grounded text encoder, there exist two cross-attention layers to process the two input images, and their outputs are merged and fed to the FFN. The two CA layers are intialized from the same pre-trained weights. The merge layer performs simple average pooling in the first 6 layers of the encoder, and performs concatenation followed by a linear projection in layer 6-12. An MLP classifier is applied on the output embedding of the [Encode] token. Visual Dialog (VisDial) 在Visual Dialog原文中,模型的评估方法是给定一张图片Image,图片描述C,对话历史Dialogue History,待问的问题Q和100候选答案。通过让模型做rerank来实现客观指标的评价。 在本文中,作者选择的模式是discriminative settings,也就是每次给不同的候选answer,在true/false两个选择中打分+softmax,最后按得分排序,计算的是类似检索的指标recall@k,MRR等","link":"/2024/10/12/BLIP%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB/"},{"title":"ViT论文阅读","text":"ViT(Vision Transformer)是借鉴Transformer构成的结构,但预训练任务却大有不同 An Image is Worth 16X16 Words: Transformers for Image Recognition at Scale, ICLR’21 1.怎么把图片变成NLP里的Embedding首先将一张图切成1616(patch size)的小块(称之为patch),每个patch对应的是NLP里的一个token,一张图片就像是一个句子按行排列一样,然后通过某种变换,让他们从16163(rgb channels)转换成768,原版的比较朴素,直接一个线性变换,从1616*3拉到768 这里不同的系统有不同的实现 在vit_pytorch里实现的ViT,使用的是原版线性变换 https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit.py 123456789101112131415161718192021222324252627282930313233343536373839404142434445class ViT(nn.Module): def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim, pool = 'cls', channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0.): super().__init__() image_height, image_width = pair(image_size) patch_height, patch_width = pair(patch_size) assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.' 
ViT Paper Notes
Link: /2024/10/10/ViT%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB/

ViT (Vision Transformer) borrows the Transformer architecture, but its pre-training task is quite different.
Paper: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR'21

1. How do you turn an image into NLP-style embeddings?

First cut the image into small 16×16 (the patch size) blocks, called patches. Each patch corresponds to a token in NLP, so an image is like a sentence whose tokens are the patches laid out row by row. Some transformation then maps each patch from 16×16×3 (RGB channels) to 768 dimensions. The original version is plain: a single linear layer flattens the 16×16×3 values and projects them to 768. Different code bases implement this differently.

The ViT implemented in vit_pytorch uses the original linear projection:
https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/vit.py

```python
class ViT(nn.Module):
    def __init__(self, *, image_size, patch_size, num_classes, dim, depth, heads, mlp_dim,
                 pool='cls', channels=3, dim_head=64, dropout=0., emb_dropout=0.):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)

        assert image_height % patch_height == 0 and image_width % patch_width == 0, \
            'Image dimensions must be divisible by the patch size.'

        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width
        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'

        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=patch_height, p2=patch_width),
            nn.LayerNorm(patch_dim),
            nn.Linear(patch_dim, dim),
            nn.LayerNorm(dim),
        )

        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)

        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout)

        self.pool = pool
        self.to_latent = nn.Identity()

        self.mlp_head = nn.Linear(dim, num_classes)

    def forward(self, img):
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape

        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b=b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n + 1)]
        x = self.dropout(x)

        x = self.transformer(x)

        x = x.mean(dim=1) if self.pool == 'mean' else x[:, 0]

        x = self.to_latent(x)
        return self.mlp_head(x)
```
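A minimal usage sketch for the class above, assuming the vit_pytorch package is installed; the hyperparameters are illustrative (roughly ViT-Base scale), not values the repository prescribes:

```python
import torch
from vit_pytorch import ViT

# a 224x224 image cut into 16x16 patches gives (224/16)^2 = 196 patches plus one cls token
model = ViT(
    image_size=224,
    patch_size=16,
    num_classes=1000,
    dim=768,        # embedding width, i.e. the 16*16*3 -> 768 projection described above
    depth=12,
    heads=12,
    mlp_dim=3072,
)

img = torch.randn(1, 3, 224, 224)   # dummy RGB batch
logits = model(img)                 # -> shape [1, 1000]
```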
The ViT implemented in transformers instead uses a 16×16 CNN (a Conv2d whose kernel size and stride equal the patch size) as the projection module:
https://github.com/huggingface/transformers/blob/v4.45.2/src/transformers/models/vit/modeling_vit.py#L148

```python
class ViTPatchEmbeddings(nn.Module):
    """
    This class turns `pixel_values` of shape `(batch_size, num_channels, height, width)` into the initial
    `hidden_states` (patch embeddings) of shape `(batch_size, seq_length, hidden_size)` to be consumed by a
    Transformer.
    """

    def __init__(self, config):
        super().__init__()
        image_size, patch_size = config.image_size, config.patch_size
        num_channels, hidden_size = config.num_channels, config.hidden_size

        image_size = image_size if isinstance(image_size, collections.abc.Iterable) else (image_size, image_size)
        patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size)
        num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.num_patches = num_patches

        self.projection = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values: torch.Tensor, interpolate_pos_encoding: bool = False) -> torch.Tensor:
        batch_size, num_channels, height, width = pixel_values.shape
        if num_channels != self.num_channels:
            raise ValueError(
                "Make sure that the channel dimension of the pixel values match with the one set in the configuration."
                f" Expected {self.num_channels} but got {num_channels}."
            )
        if not interpolate_pos_encoding:
            if height != self.image_size[0] or width != self.image_size[1]:
                raise ValueError(
                    f"Input image size ({height}*{width}) doesn't match model"
                    f" ({self.image_size[0]}*{self.image_size[1]})."
                )
        embeddings = self.projection(pixel_values).flatten(2).transpose(1, 2)
        return embeddings
```

2. How do you pre-train for vision tasks?

If BERT gets to fill in [MASK] tokens, could ViT adopt an analogous "predict the masked patch" task? Judging by the results, not really. But images have one advantage: there are huge labeled classification datasets, which makes pre-training directly on labeled data possible (in a sense this is not the pretrain + SFT paradigm but the transfer-learning paradigm). From the paper:

"We also perform a preliminary exploration on masked patch prediction for self-supervision, mimicking the masked language modeling task used in BERT. With self-supervised pre-training, our smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant improvement of 2% to training from scratch, but still 4% behind supervised pre-training."

And from Appendix B.1.2: "We employ the masked patch prediction objective for preliminary self-supervision experiments. To do so we corrupt 50% of patch embeddings by either replacing their embeddings with a learnable [mask] embedding (80%), a random other patch embedding (10%) or just keeping them as is (10%). This setup is very similar to the one used for language by Devlin et al. (2019). Finally, we predict the 3-bit, mean color (i.e., 512 colors in total) of every corrupted patch using their respective patch representations." (Note: three RGB channels, 3 bits each, so 8 possibilities per channel and 8×8×8 = 512 classes.) "As prediction targets for pretraining we tried the following settings: 1) predicting only the mean, 3bit color (i.e., 1 prediction of 512 colors), 2) predicting a 4 × 4 downsized version of the 16 × 16 patch with 3bit colors in parallel (i.e., 16 predictions of 512 colors), 3) regression on the full patch using L2 (i.e., 256 regressions on the 3 RGB channels). Surprisingly, we found that all worked quite well, though L2 was slightly worse. We report final results only for option 1) because it has shown best few-shot performance. We also experimented with 15% corruption rate as used by Devlin et al. (2019) but results were also slightly worse on our few-shot metrics."
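To make the 512-color target concrete, here is a small sketch of quantizing a patch's mean color to a 3-bit-per-channel class index. This is my own illustration of the description above, not code from the paper; the function name and the assumed [0, 1] pixel range are illustrative.

```python
import torch

def mean_color_target(patch: torch.Tensor) -> torch.Tensor:
    """Map a patch to one of 512 classes: 3 bits per RGB channel.

    patch: [3, 16, 16] float tensor with values in [0, 1].
    """
    mean_rgb = patch.mean(dim=(1, 2))                    # [3] mean color of the patch
    bins = torch.clamp((mean_rgb * 8).long(), max=7)     # quantize each channel to 8 levels (3 bits)
    r, g, b = bins                                       # combine into a single class id in [0, 512)
    return r * 64 + g * 8 + b

# e.g. a pure mid-grey patch gets the same 3-bit code on every channel
target = mean_color_target(torch.full((3, 16, 16), 0.5))
```

Option 2) in the quote repeats the same quantization for each cell of a 4×4 downsampled patch, giving 16 such class targets instead of one.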
3. On inductive bias

The authors argue that CNNs work as well as they do because of their built-in assumptions, namely locality and translation equivariance (though some argue CNNs do not truly have translation invariance), whereas the Transformer has nothing so explicit: only the MLP layers have this property, self-attention does not. From the paper: "In ViT, only MLP layers are local and translationally equivariant, while the self-attention layers are global."

4. Handling higher resolution

For fine-tuning on a high-resolution image set (height and width both much larger), the NLP analogue is fine-tuning on a very long sequence when the pre-training length was short. The approach is reminiscent of the position-interpolation tricks used with RoPE in NLP: the pretrained position embeddings are interpolated in 2D according to the patch positions of the new image.
https://github.com/huggingface/transformers/blob/v4.45.2/src/transformers/models/vit/modeling_vit.py#L77

The transformers implementation applies bicubic interpolation to the pre-trained position embeddings (excluding the cls token):

```python
def interpolate_pos_encoding(self, embeddings: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """
    This method allows to interpolate the pre-trained position encodings, to be able to use the model on higher
    resolution images. This method is also adapted to support torch.jit tracing.

    Adapted from:
    - https://github.com/facebookresearch/dino/blob/de9ee3df6cf39fac952ab558447af1fa1365362a/vision_transformer.py#L174-L194, and
    - https://github.com/facebookresearch/dinov2/blob/e1277af2ba9496fbadf7aec6eba56e8d882d1e35/dinov2/models/vision_transformer.py#L179-L211
    """

    num_patches = embeddings.shape[1] - 1
    num_positions = self.position_embeddings.shape[1] - 1

    # always interpolate when tracing to ensure the exported model works for dynamic input shapes
    if not torch.jit.is_tracing() and num_patches == num_positions and height == width:
        return self.position_embeddings

    class_pos_embed = self.position_embeddings[:, :1]
    patch_pos_embed = self.position_embeddings[:, 1:]  # everything except the cls position

    dim = embeddings.shape[-1]

    new_height = height // self.patch_size
    new_width = width // self.patch_size

    sqrt_num_positions = torch_int(num_positions**0.5)
    patch_pos_embed = patch_pos_embed.reshape(1, sqrt_num_positions, sqrt_num_positions, dim)
    patch_pos_embed = patch_pos_embed.permute(0, 3, 1, 2)

    patch_pos_embed = nn.functional.interpolate(
        patch_pos_embed,
        size=(new_height, new_width),
        mode="bicubic",
        align_corners=False,
    )

    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)

    return torch.cat((class_pos_embed, patch_pos_embed), dim=1)
```

5. On the hybrid architecture, i.e. CNN + Transformer

In some sense, a CNN (for example a ResNet) is used to extract higher-level semantic features first. I don't yet know how the implementation works concretely; my guess is that the ResNet feature map is fed straight into the ViT, as in the rough sketch below.
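A rough sketch of that guess, assuming a ResNet-50 whose final feature map is flattened into tokens and projected to the Transformer width with a 1×1 convolution; the class name and the projection choice are my assumptions, not the paper's exact hybrid recipe:

```python
import torch
import torch.nn as nn
import torchvision

class HybridBackbone(nn.Module):
    """Guess at the hybrid setup: CNN feature map -> patch tokens for a ViT encoder."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # random weights just for the shape check; a pretrained backbone would be used in practice
        resnet = torchvision.models.resnet50(weights=None)
        # keep everything up to the last conv stage, drop the pooling and fc head
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])   # [B, 2048, H/32, W/32]
        self.proj = nn.Conv2d(2048, dim, kernel_size=1)           # 1x1 conv as the "patch" projection

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.proj(self.cnn(img))            # [B, dim, h, w]
        return feat.flatten(2).transpose(1, 2)     # [B, h*w, dim] tokens for the Transformer

tokens = HybridBackbone()(torch.randn(1, 3, 224, 224))   # -> [1, 49, 768]
```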
A Brief Introduction
Link: /2024/10/06/hello-world/

An NLP engineer who slacks off and grinds hard in equal measure.

Assorted Losses
Link: /2024/10/06/%E5%90%84%E7%A7%8Dloss/

Focal Loss for binary classification:

```python
def focal_binary_cross_entropy(logits, targets, gamma=2, alpha=0.5):
    l = logits.reshape(-1)
    t = targets.reshape(-1)
    p = torch.sigmoid(l)
    p = torch.where(t >= 0.5, p, 1 - p)
    alpha = torch.where(t >= 0.5, alpha, 1 - alpha)
    logp = -torch.log(torch.clamp(p, 1e-4, 1 - 1e-4))
    loss = logp * ((1 - p) ** gamma) * alpha
    loss = t.size(-1) * loss.mean()
    return loss
```

ZLPR Loss:

```python
def zlpr_loss(logits, target):
    """
    Cross entropy for multi-label classification.
    Note: y_true and y_pred have the same shape; elements of y_true are 0 or 1,
    where 1 marks a target class and 0 a non-target class.
    Warning: y_pred must range over all real numbers, i.e. in general y_pred should
    NOT go through an activation, and especially not sigmoid or softmax! At prediction
    time, output the classes whose y_pred is greater than 0.
    """
    loss_mask = target != -100
    y_true = target.masked_select(loss_mask).view(-1, target.size(-1))
    y_pred = logits.masked_select(loss_mask).view(-1, y_true.size(-1))
    y_pred = (1 - 2 * y_true) * y_pred
    y_pred_neg = y_pred - y_true * 1e12
    y_pred_pos = y_pred - (1 - y_true) * 1e12
    zeros = torch.zeros_like(y_pred[:, :1])
    y_pred_neg = torch.cat([y_pred_neg, zeros], dim=-1)
    y_pred_pos = torch.cat([y_pred_pos, zeros], dim=-1)
    neg_loss = torch.logsumexp(y_pred_neg, dim=-1)
    pos_loss = torch.logsumexp(y_pred_pos, dim=-1)
    return (neg_loss + pos_loss).mean()
```

R-Drop loss (the original snippet stopped before computing the divergence; the final lines below are an assumed symmetric-KL completion):

```python
def r_drop_loss(y_pred: torch.Tensor):
    """
    Loss for R-Drop; takes logits that have not been through a sigmoid.
    Adjacent rows (2i, 2i+1) are the two forward passes of the same example.
    """
    # y_true = torch.arange(y_pred.shape[0], device=y_pred.device)
    y_predp, y_predq = y_pred[::2].sigmoid().view(-1).unsqueeze(0).T, y_pred[1::2].sigmoid().view(-1).unsqueeze(0).T
    y_predp = torch.clamp(y_predp, 1e-7, 1 - 1e-7)
    y_predq = torch.clamp(y_predq, 1e-7, 1 - 1e-7)
    y_predp = torch.cat([y_predp, 1 - y_predp], dim=1)
    y_predq = torch.cat([y_predq, 1 - y_predq], dim=1)
    # assumed completion: symmetric KL between the two predicted distributions
    kl_pq = (y_predp * torch.log(y_predp / y_predq)).sum(dim=1)
    kl_qp = (y_predq * torch.log(y_predq / y_predp)).sum(dim=1)
    return ((kl_pq + kl_qp) / 2).mean()
```

SimCSE:

```python
def simcse_unsup_loss(y_pred, temperature=0.05):
    """Unsupervised SimCSE loss.
    y_pred (tensor): BERT output, [batch_size * 2, 768]
    """
    # label for each row: [1, 0, 3, 2, ..., batch_size-1, batch_size-2]
    y_true = torch.arange(y_pred.shape[0], device=y_pred.device)
    y_true = (y_true - y_true % 2 * 2) + 1
    # pairwise similarity within the batch:
    # [batch_size * 2, 1, 768] * [1, batch_size * 2, 768] -> [batch_size * 2, batch_size * 2]
    sim = F.cosine_similarity(y_pred.unsqueeze(1), y_pred.unsqueeze(0), dim=-1)
    # set the diagonal to a very small value to remove each sample's similarity with itself
    sim = sim - torch.eye(y_pred.shape[0], device=y_pred.device) * 1e12
    sim = sim / temperature  # divide the similarity matrix by the temperature
    # cross entropy between the similarity matrix and y_true
    loss = F.cross_entropy(sim, y_true)
    return torch.mean(loss)
```

How to Improve Citation Quality in Large Language Models
Link: /2024/11/15/%E5%A6%82%E4%BD%95%E6%94%B9%E5%96%84%E5%A4%A7%E6%A8%A1%E5%9E%8B%E7%9A%84%E5%BC%95%E7%94%A8%E8%B4%A8%E9%87%8F/

Test title.

The Generate, Critique, Revise Paradigm
Link: /2024/11/15/%E7%94%9F%E6%88%90%EF%BC%8C%E8%AF%84%E8%AE%BA%EF%BC%8C%E4%BF%AE%E6%94%B9%E8%8C%83%E5%BC%8F/

Quite a few recent papers (especially around large language models) seem to adopt this paradigm, so here is a quick summary of what they have in common.

1. TASTE: Teaching Large Language Models to Translate through Self-Reflection, ACL'24
Domain: translation. One model is trained on three kinds of tasks: scoring, translating, and revising. At inference time it first translates, then critiques, then revises: a very plain instance of the paradigm.

2. Learning from Committee: Reasoning Distillation from a Mixture of Teachers with Peer-Review, arXiv'24
Domain: reasoning (math such as GSM8K, commonsense QA such as StrategyQA, logical reasoning such as LogiQA). This one is more elaborate but still follows the three-step generate-critique-revise paradigm. The paper is really two parts rolled into one: teacher-student distillation, and the generate-critique-revise loop.
For the first part, in short: a student model generates reasoning chains on the training set, which tells us which examples the student gets wrong; a teacher model then supplies correct reasoning chains for the student to learn from. That is the classic distillation setup. But there is a twist: the authors argue that the rationales generated by the teacher LLM are not necessarily reliable either. What then? Peer review. This is really just a story wrapped around ensembling: for each teacher's critique, the other teachers critique that critique and score it, and it only counts as sound if the average score clears a threshold (a majority vote would work just as well). This raises a second question: what if none of the critiques reach the threshold? (My guess is the example is simply dropped.)
The second part has several teacher LLMs critique the student's incorrect reasoning and point out where it went wrong (analogous to the good/bad judgments in the first paper).
Finally, both kinds of data are used for instruction tuning (call it distillation if you prefer): one part teaches what a rationale should look like (which you can read as "revise"), the other teaches how things go wrong (though an LLM won't necessarily mend its ways just because it knows the error).

3. Ask, Assess, and Refine: Rectifying Factual Consistency and Hallucination in LLMs with Metric-Guided Feedback Learning, EACL'24
Domain: generating text with citations. This is also the problem I care most about right now, and the approach is quite similar to the first paper: first generate a response, then score it [critique] using correctness, recall, and precision. The slight difference is that the model outputs feedback [revise] only when it predicts the score falls below a threshold. The process can be run iteratively, improving the model's ability in the pure-prompting setting.

Tags:
- 论文阅读 (Paper Reading): /tags/%E8%AE%BA%E6%96%87%E9%98%85%E8%AF%BB/
- 可能会用到的奇怪东西 (Odd things that might come in handy): /tags/%E5%8F%AF%E8%83%BD%E4%BC%9A%E7%94%A8%E5%88%B0%E7%9A%84%E5%A5%87%E6%80%AA%E4%B8%9C%E8%A5%BF/
- 大模型怎么做引用 (How large models handle citation): /tags/%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%80%8E%E4%B9%88%E5%81%9A%E5%BC%95%E7%94%A8/
- 范式 (Paradigms): /tags/%E8%8C%83%E5%BC%8F/

Categories: (none)
Pages: (none)