
Shouldn't the chatglm2-6b input format stay consistent with the official one? #24

Open · SCAUapc opened this issue Jul 14, 2023 · 11 comments


SCAUapc commented Jul 14, 2023

Looking at the official chatglm2 code, the build_inputs function is as follows:

def build_inputs(self, tokenizer, query: str, history: List[Tuple[str, str]] = None):
    prompt = ""
    for i, (old_query, response) in enumerate(history):
        prompt += "[Round {}]\n\n问:{}\n\n答:{}\n\n".format(i + 1, old_query, response)
    prompt += "[Round {}]\n\n问:{}\n\n答:".format(len(history) + 1, query)
    inputs = tokenizer([prompt], return_tensors="pt")
    inputs = inputs.to(self.device)
    return inputs
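For example, with an empty history the prompt reduces to the single-round template. A minimal standalone re-run of the loop above (the query string is just an illustration):

# Standalone re-run of the prompt-building logic from build_inputs above.
history = []    # no previous rounds
query = "你好"   # illustrative query
prompt = ""
for i, (old_query, response) in enumerate(history):
    prompt += "[Round {}]\n\n问:{}\n\n答:{}\n\n".format(i + 1, old_query, response)
prompt += "[Round {}]\n\n问:{}\n\n答:".format(len(history) + 1, query)
print(repr(prompt))  # '[Round 1]\n\n问:你好\n\n答:'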

In other words, with or without multi-turn history, the template

[Round {}]\n\n问:{}\n\n答:

is always present. But the training code in this repo processes the data like this:

def tokenize_func(example, tokenizer, global_args, ignore_label_id=-100):
    """单样本tokenize处理"""
    question = global_args.prompt_text + example['instruction']
    if example.get('input', None):
        if example['input'].strip():
            question += f'''\n{example['input']}'''
    answer = example['output']
    q_ids = tokenizer.encode(text=question, add_special_tokens=False)
    a_ids = tokenizer.encode(text=answer, add_special_tokens=False)
    if len(q_ids) > global_args.max_input_length - 2:  # 2 - gmask, bos
        q_ids = q_ids[: global_args.max_input_length - 2]
    if len(a_ids) > global_args.max_output_length - 1:  # 1 - eos
        a_ids = a_ids[: global_args.max_output_length - 1]
    input_ids = tokenizer.build_inputs_with_special_tokens(q_ids, a_ids)
    # question_length = input_ids.index(tokenizer.bos_token_id)
    question_length = len(q_ids) + 2  # chatglm1 - gmask, bos, chatglm2 - gmask, sop
    labels = [ignore_label_id] * question_length + input_ids[question_length:]
    return {'input_ids': input_ids, 'labels': labels}
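To make the label masking concrete, here is a toy trace of the last few lines (token ids are made up; per the inline comments, chatglm2's build_inputs_with_special_tokens prepends gmask and sop and appends eos):

# Toy trace of the label construction in tokenize_func above (ids are made up).
q_ids = [11, 12]                  # pretend question tokens
a_ids = [21, 22, 23]              # pretend answer tokens
gmask, sop, eos = 901, 902, 2     # placeholder special-token ids
input_ids = [gmask, sop] + q_ids + a_ids + [eos]
question_length = len(q_ids) + 2  # 2 specials + question
labels = [-100] * question_length + input_ids[question_length:]
print(labels)                     # [-100, -100, -100, -100, 21, 22, 23, 2]
# Loss is computed only on the answer tokens and eos.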

Notice that the [Round {}] template above isn't added anywhere here. Wouldn't it be more appropriate to add it, so that training aligns with chatglm2's inference format?

shuxueslpi (Owner) commented

@SCAUapc You're right, that's indeed the case. I'll verify it when I have some free time, then update.

SCAUapc (Author) commented Jul 14, 2023

> @SCAUapc You're right, that's indeed the case. I'll verify it when I have some free time, then update.

Thanks a lot!

SCAUapc (Author) commented Jul 14, 2023

I adapted the code based on the official version; take a look and see whether it helps you.

The official ptuning preprocessing in chatglm2 is:

    def preprocess_function_train(examples):
        max_seq_length = data_args.max_source_length + data_args.max_target_length + 1

        model_inputs = {
            "input_ids": [],
            "labels": [],
        }
        for i in range(len(examples[prompt_column])):
            if examples[prompt_column][i] and examples[response_column][i]:
                query, answer = examples[prompt_column][i], examples[response_column][i]

                history = examples[history_column][i] if history_column is not None else None
                prompt = tokenizer.build_prompt(query, history)

                prompt = prefix + prompt
                a_ids = tokenizer.encode(text=prompt, add_special_tokens=True, truncation=True,
                                         max_length=data_args.max_source_length)
                b_ids = tokenizer.encode(text=answer, add_special_tokens=False, truncation=True,
                                         max_length=data_args.max_target_length)

                context_length = len(a_ids)
                input_ids = a_ids + b_ids + [tokenizer.eos_token_id]
                labels = [tokenizer.pad_token_id] * context_length + b_ids + [tokenizer.eos_token_id]
                
                pad_len = max_seq_length - len(input_ids)
                input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
                labels = labels + [tokenizer.pad_token_id] * pad_len
                if data_args.ignore_pad_token_for_loss:
                    labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]

                model_inputs["input_ids"].append(input_ids)
                model_inputs["labels"].append(labels)

        return model_inputs
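The resulting layout, extracted into a small standalone function (a schematic sketch, not the official code; names and toy ids are assumptions):

def build_example(a_ids, b_ids, eos_id, pad_id, max_seq_length, ignore_index=-100):
    # Same layout as preprocess_function_train above: prompt positions are
    # filled with pad ids in labels, then every pad id is mapped to -100.
    input_ids = a_ids + b_ids + [eos_id]
    labels = [pad_id] * len(a_ids) + b_ids + [eos_id]
    pad_len = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_id] * pad_len
    labels = labels + [pad_id] * pad_len
    labels = [(l if l != pad_id else ignore_index) for l in labels]
    return input_ids, labels

ids, lbls = build_example([11, 12], [21, 22], eos_id=2, pad_id=0, max_seq_length=8)
print(ids)   # [11, 12, 21, 22, 2, 0, 0, 0]
print(lbls)  # [-100, -100, 21, 22, 2, -100, -100, -100]

Note that masking the prompt with pad_token_id only works together with ignore_pad_token_for_loss; without it, the prompt span would contribute pad tokens as training targets.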

After modifying mine accordingly:

def tokenize_func_v2(example, tokenizer, global_args, ignore_label_id=-100):
    query = example['instruction']
    if example.get('input', None):
        if example['input'].strip():
            query += f'''\n{example['input']}'''

    prompt = tokenizer.build_prompt(query)
    prompt = global_args.prompt_text + prompt
    answer = example['output']

    a_ids = tokenizer.encode(text=prompt, add_special_tokens=True, truncation=True,
                             max_length=global_args.max_input_length)
    b_ids = tokenizer.encode(text=answer, add_special_tokens=False, truncation=True,
                             max_length=global_args.max_target_length)

    context_length = len(a_ids)
    input_ids = a_ids + b_ids + [tokenizer.eos_token_id]
    # The prompt positions use pad_token_id in labels; they should not be learned.
    labels = [tokenizer.pad_token_id] * context_length + b_ids + [tokenizer.eos_token_id]

    max_seq_length = global_args.max_input_length + global_args.max_output_length + 1

    pad_len = max_seq_length - len(input_ids)
    input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
    labels = labels + [tokenizer.pad_token_id] * pad_len

    # ignore_pad_token_for_loss (covers both the padding and the prompt)
    labels = [(l if l != tokenizer.pad_token_id else ignore_label_id) for l in labels]

    return {'input_ids': input_ids, 'labels': labels}
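For reference, this per-example function would typically be wired up with datasets.map (a hypothetical usage sketch; dataset, tokenizer, and global_args are assumed to come from the training script, and the max_target_length issue pointed out further down needs fixing first):

from functools import partial

# Hypothetical wiring; `dataset` is a HuggingFace datasets.Dataset.
tokenized = dataset.map(
    partial(tokenize_func_v2, tokenizer=tokenizer, global_args=global_args),
    remove_columns=dataset.column_names,  # keep only input_ids / labels
)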

SCAUapc (Author) commented Jul 14, 2023

The official code adds the prefix at the very front, after the [Round] template has been applied to the query. I follow the same order here.

shuxueslpi (Owner) commented

👍
I'll run a full training with this later and compare.

SCAUapc (Author) commented Jul 14, 2023

In

    b_ids = tokenizer.encode(text=answer, add_special_tokens=False, truncation=True,
                             max_length=global_args.max_target_length)

max_target_length is a mistake; it should be max_output_length. With that fixed, it runs for me.

By the way, how fast is training on your side? 10k samples for 2 epochs takes me 70+ hours on a single 1080 with batch size 1, which feels really slow... I remember LoRA was much faster, finishing in 2-3 hours.

After adjusting input_max_len, output_max_len, and batch_size, it can finish in about 4 hours.

shuxueslpi (Owner) commented

@SCAUapc On an RTX 3090, with the dataset from the example (110k samples), 1 epoch, batch size 32, and input_max_len + output_max_len = 2048, it takes about 7 hours.

darvsum commented Jul 20, 2023

Following your changes, training runs normally, but the loss stays at 0.0. Do you know what might cause this?

SCAUapc (Author) commented Jul 20, 2023

> Following your changes, training runs normally, but the loss stays at 0.0. Do you know what might cause this?

Which part of the code did you change? Could you post the full modified version? And is the loss normal when you run the repo author's original code?

darvsum commented Jul 20, 2023

@SCAUapc I only modified this function in main.py of the official chatglm2-6b ptuning code:
def preprocess_function_train(examples):
    max_seq_length = data_args.max_source_length + data_args.max_target_length + 1

    model_inputs = {
        "input_ids": [],
        "labels": [],
    }

    #print(examples['text'])
    examples = examples['text']
    for i in range(len(examples)):
        example = json.loads(examples[i])
        #print(example)
        if example[prompt_column] and example[response_column]:
            query, answer = example[prompt_column], example[response_column]

            #history = examples[history_column][i] if history_column is not None else None
            prompt = tokenizer.build_prompt(query)
            prompt = prefix + prompt

            a_ids = tokenizer.encode(text=prompt, add_special_tokens=True, truncation=True,
                                     max_length=data_args.max_source_length)
            b_ids = tokenizer.encode(text=answer, add_special_tokens=False, truncation=True,
                                     max_length=data_args.max_target_length)

            context_length = len(a_ids)
            input_ids = a_ids + b_ids + [tokenizer.eos_token_id]
            labels = [tokenizer.pad_token_id] * context_length + b_ids + [tokenizer.eos_token_id]
            
            pad_len = max_seq_length - len(input_ids)
            input_ids = input_ids + [tokenizer.pad_token_id] * pad_len
            labels = labels + [tokenizer.pad_token_id] * pad_len

            if data_args.ignore_pad_token_for_loss:
                labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]

            model_inputs["input_ids"].append(input_ids)
            model_inputs["labels"].append(labels)

    return model_inputs
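One generic way to narrow this down (a debugging sketch, not from the thread) is to inspect a processed example and check that the answer tokens actually survive in labels:

# Hypothetical sanity check on one processed example.
ids = model_inputs["input_ids"][0]
lbls = model_inputs["labels"][0]
print(tokenizer.decode([t for t in ids if t != tokenizer.pad_token_id]))
# If this list is empty, every label is masked and no loss signal remains.
print([l for l in lbls if l not in (-100, tokenizer.pad_token_id)])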

Parker0000 commented

For multi-turn training, would changing it to prompt = tokenizer.build_prompt(query, history) be enough to train multi-turn dialogue?
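The official preprocess_function_train quoted above does exactly this when history_column is set, so presumably yes. A minimal sketch (assuming each example carries a history list of (query, response) pairs; the 'history' key is a hypothetical field name):

# Hypothetical multi-turn variant of the prompt construction.
history = example.get('history')  # e.g. [("previous question", "previous answer")]
prompt = tokenizer.build_prompt(query, history)
prompt = global_args.prompt_text + prompt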
