Caption ReRanking Project

---->>>> Caption ReRanking project page <<<<----

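Goal

Fuse the captions generated by multiple systems: exploit the diversity across systems by selecting the best caption among the candidates as the final output.

In a test with 5 models on half of the VALIDATE set (see Data Processing below), an Oracle that always picks the caption with the highest per-sentence weighted score achieves: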

Bleu_4: 0.7168

CIDEr: 2.4049

METEOR: 0.4774

ROUGE_L: 0.7898
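
The Oracle described above can be sketched as a per-image argmax over candidate scores. This is only an illustration: the function name `oracle_select` and the input layout (image id mapped to `(caption, weighted_score)` pairs) are hypothetical and not taken from the project code.

```python
from typing import Dict, List, Tuple


def oracle_select(
    candidates: Dict[str, List[Tuple[str, float]]]
) -> Dict[str, str]:
    """For each image, keep the candidate caption with the highest score.

    `candidates` maps an image id to (caption, weighted_score) pairs,
    one pair per captioning system. Images without candidates are skipped.
    """
    return {
        image_id: max(scored, key=lambda pair: pair[1])[0]
        for image_id, scored in candidates.items()
        if scored
    }
```

Because the Oracle uses the reference-based scores directly, its numbers are an upper bound for any ranker that chooses among the same candidates.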

System Architecture

The system consists of image_model, text_model, and match_model.

The prediction score is given by:

score(image, caption) = match_model( image_model(image), text_model(caption) )

Pointwise prediction:

score(image, caption) -> label(0/1)

Pairwise prediction:

score(image, good_caption) > score(image, bad_caption)
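
As a rough illustration of the two prediction modes, the sketch below writes each one as a loss on plain floats. The binary cross-entropy for the pointwise case and the hinge loss with a `margin` for the pairwise case are assumptions; this page states only the prediction targets, not the losses used in training.

```python
import math


def pointwise_loss(score: float, label: int) -> float:
    """Binary cross-entropy between a score in [0, 1] and a 0/1 label."""
    eps = 1e-7  # clip to avoid log(0)
    s = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(s) + (1 - label) * math.log(1.0 - s))


def pairwise_hinge_loss(good_score: float, bad_score: float,
                        margin: float = 1.0) -> float:
    """Zero once the good caption outscores the bad one by at least `margin`."""
    return max(0.0, margin - (good_score - bad_score))
```

The pointwise loss needs scores that can be read as probabilities, while the pairwise loss only depends on the difference between the two scores.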

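Data Processing

The data is split into three parts (training, validation, and test), which map to the original sets as follows:

Training set: the full original TRAIN set, plus the images in the VALIDATE set starting with one of [0,2,4,6,8,a,c,e]

Validation set: the images in the original VALIDATE set starting with one of [1,3,5,7,9,b,d,f]

Test set: the original TEST set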
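
The split rule can be sketched as a small function. It assumes that "starting with" refers to the first character of a hexadecimal image identifier or file name; the function name `assign_split` and the returned labels are hypothetical.

```python
TRAIN_PREFIXES = set("02468ace")  # VALIDATE images that move to training
VALID_PREFIXES = set("13579bdf")  # VALIDATE images kept for validation


def assign_split(original_set: str, image_id: str) -> str:
    """Map an image from the original TRAIN/VALIDATE/TEST set to a new split."""
    if original_set == "TRAIN":
        return "train"
    if original_set == "TEST":
        return "test"
    if original_set == "VALIDATE":
        first = image_id[:1].lower()
        if first in TRAIN_PREFIXES:
            return "train"
        if first in VALID_PREFIXES:
            return "validation"
        raise ValueError(f"unexpected image id: {image_id!r}")
    raise ValueError(f"unknown original set: {original_set!r}")
```

If the identifiers are uniformly distributed hex strings, this sends roughly half of VALIDATE to each side, which matches the "half of the VALIDATE set" used for the Oracle test.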

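The data processing pipeline has three steps:

Use the ranker_eval tool to output the Bleu_4 / CIDEr / METEOR / ROUGE_L scores for every candidate and compute a weighted total score;

Select positive and negative examples by score and build (image, positive_caption, negative_caption) triplets;

Generate tfrecord files from the positive and negative examples, padding captions to a uniform length for easier processing.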
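
A minimal sketch of the three steps, without the tfrecord serialization itself. Everything specific here is an assumption: the equal `WEIGHTS`, the rule of pairing each higher-scored caption with each lower-scored one, the `min_gap` threshold, and the `pad_id` value are hypothetical and are not taken from ranker_eval or the project code.

```python
from typing import Dict, List, Tuple

# Hypothetical weights; the page does not state the actual weighting.
WEIGHTS = {"Bleu_4": 1.0, "CIDEr": 1.0, "METEOR": 1.0, "ROUGE_L": 1.0}


def total_score(metrics: Dict[str, float]) -> float:
    """Step 1: weighted sum of the four per-caption metric scores."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())


def build_triplets(image_id: str, scored: List[Tuple[str, float]],
                   min_gap: float = 0.0) -> List[Tuple[str, str, str]]:
    """Step 2: emit (image, positive_caption, negative_caption) triplets.

    A caption counts as positive against every caption whose total score
    is lower by more than `min_gap`.
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    triplets = []
    for i, (pos, pos_score) in enumerate(ranked):
        for neg, neg_score in ranked[i + 1:]:
            if pos_score - neg_score > min_gap:
                triplets.append((image_id, pos, neg))
    return triplets


def pad_caption(token_ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Step 3 (partial): truncate or right-pad token ids to exactly max_len."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))
```

Fixed-length captions let every record share the same shape, so a batch can be stacked without per-example length handling.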

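Code Structure

The code is organized as follows:

ranker_model.py is the class for the ranking model
image_models/ is the image model directory
- image_embedding.py is the InceptionV3 model
text_models/ is the text model directory
- lstm_model.py is an LSTM model for text processing; it is an example model, so follow its inputs and outputs
match_models/ is the matching model directory
- mlp_model.py is a multilayer perceptron model; it is an example model, so follow its inputs and outputs
train.py is the training program
inference.py is the testing program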

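Model conventions:

image_model output

Outputs a tuple of two tensors; if only one is available, fill the other with None

summary tensor, shape=[batch_size, dim]

context tensor, shape=[batch_size, context_dim0, ..., context_dimN, dim], a tensor that carries position/context information

text_model output

Outputs a tuple of two tensors; if only one is available, fill the other with None

summary tensor, shape=[batch_size, dim]

sequence tensor, shape=[batch_size, seq_len, dim], a tensor that carries position/context information

match_model input

The outputs of the two models above

match_model output

output tensor, shape=[batch_size, 1]

Note that the range of this output tensor must stay within what the loss accepts. For example, with CrossEntropyLoss, make sure output lies in [0,1].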
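
To make the conventions concrete, here is a framework-free toy match model that follows them. It is not mlp_model.py: nested Python lists stand in for tensors, and the dot product followed by a sigmoid is an assumed scoring rule chosen only because it keeps the output in [0,1].

```python
import math
from typing import List, Optional, Tuple

Matrix = List[List[float]]  # stands in for a [batch_size, dim] tensor
ModelOutput = Tuple[Optional[Matrix], Optional[object]]  # (summary, context/sequence)


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def toy_match_model(image_out: ModelOutput, text_out: ModelOutput) -> Matrix:
    """Return scores of shape [batch_size, 1], each within [0, 1]."""
    image_summary, _image_context = image_out  # the second slot may be None
    text_summary, _text_sequence = text_out
    if image_summary is None or text_summary is None:
        raise ValueError("this toy model needs both summary tensors")
    scores = []
    for img_vec, txt_vec in zip(image_summary, text_summary):
        logit = sum(a * b for a, b in zip(img_vec, txt_vec))
        scores.append([sigmoid(logit)])
    return scores
```

A match model that uses the context or sequence tensor would instead check that slot for None, since either upstream model may supply only one of its two outputs.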
