add MMLU and C-Eval script

2023-09-23 00:34:17 +08:00 · 2023-09-23 00:34:17 +08:00 · 465ee8119a
parent 5cc7a44784
commit 465ee8119a
16 changed files with 1019 additions and 856 deletions
--- a/README.md
+++ b/README.md
@@ -14,27 +14,29 @@

 ## Changelog

-[23/09/10] Now we support using **[FlashAttention](https://github.com/Dao-AILab/flash-attention)** for the LLaMA models. Try `--flash_attn` argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs (experimental feature).
+[23/09/23] We integrated MMLU and C-Eval benchmarks in this repo. See [this example](#evaluation-mmlu--c-eval) to evaluate your models.

-[23/08/18] Now we support **resuming training**, upgrade `transformers` to `4.31.0` to enjoy this feature.
+[23/09/10] We supported using **[FlashAttention](https://github.com/Dao-AILab/flash-attention)** for the LLaMA models. Try `--flash_attn` argument to enable FlashAttention-2 if you are using RTX4090, A100 or H100 GPUs.

-[23/08/12] Now we support **RoPE scaling** to extend the context length of the LLaMA models. Try `--rope_scaling linear` argument in training and `--rope_scaling dynamic` argument at inference to extrapolate the position embeddings.
+[23/08/18] We supported **resuming training**. Upgrade `transformers` to `4.31.0` to use this feature.

-[23/08/11] Now we support **[DPO training](https://arxiv.org/abs/2305.18290)** for instruction-tuned models. See [this example](#dpo-training) to train your models.
+[23/08/12] We supported **RoPE scaling** to extend the context length of the LLaMA models. Try `--rope_scaling linear` argument in training and `--rope_scaling dynamic` argument at inference to extrapolate the position embeddings.

-[23/07/31] Now we support **dataset streaming**. Try `--streaming` and `--max_steps 10000` arguments to load your dataset in streaming mode.
+[23/08/11] We supported **[DPO training](https://arxiv.org/abs/2305.18290)** for instruction-tuned models. See [this example](#dpo-training) to train your models.

-[23/07/29] We release two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos ([LLaMA-2](https://huggingface.co/hiyouga/Llama-2-Chinese-13b-chat) / [Baichuan](https://huggingface.co/hiyouga/Baichuan-13B-sft)) for details.
+[23/07/31] We supported **dataset streaming**. Try `--streaming` and `--max_steps 10000` arguments to load your dataset in streaming mode.

-[23/07/18] Now we develop an **all-in-one Web UI** for training, evaluation and inference. Try `train_web.py` to fine-tune models in your Web browser. Thank [@KanadeSiina](https://github.com/KanadeSiina) and [@codemayq](https://github.com/codemayq) for their efforts in the development.
+[23/07/29] We released two instruction-tuned 13B models at Hugging Face. See these Hugging Face Repos ([LLaMA-2](https://huggingface.co/hiyouga/Llama-2-Chinese-13b-chat) / [Baichuan](https://huggingface.co/hiyouga/Baichuan-13B-sft)) for details.

-[23/07/09] Now we release **[FastEdit](https://github.com/hiyouga/FastEdit)** ⚡🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow [FastEdit](https://github.com/hiyouga/FastEdit) if you are interested.
+[23/07/18] We developed an **all-in-one Web UI** for training, evaluation and inference. Try `train_web.py` to fine-tune models in your Web browser. Thanks to [@KanadeSiina](https://github.com/KanadeSiina) and [@codemayq](https://github.com/codemayq) for their efforts in development.

-[23/06/29] We provide a **reproducible example** of training a chat model using instruction-following datasets, see [Baichuan-7B-sft](https://huggingface.co/hiyouga/Baichuan-7B-sft) for details.
+[23/07/09] We released **[FastEdit](https://github.com/hiyouga/FastEdit)** ⚡🩹, an easy-to-use package for editing the factual knowledge of large language models efficiently. Please follow [FastEdit](https://github.com/hiyouga/FastEdit) if you are interested.

-[23/06/22] Now we align the [demo API](src/api_demo.py) with the [OpenAI's](https://platform.openai.com/docs/api-reference/chat) format where you can insert the fine-tuned model in **arbitrary ChatGPT-based applications**.
+[23/06/29] We provided a **reproducible example** of training a chat model using instruction-following datasets. See [Baichuan-7B-sft](https://huggingface.co/hiyouga/Baichuan-7B-sft) for details.

-[23/06/03] Now we support quantized training and inference (aka **[QLoRA](https://github.com/artidoro/qlora)**). Try `--quantization_bit 4/8` argument to work with quantized models.
+[23/06/22] We aligned the [demo API](src/api_demo.py) with [OpenAI's](https://platform.openai.com/docs/api-reference/chat) format, so you can plug the fine-tuned model into **arbitrary ChatGPT-based applications**.
+
+[23/06/03] We supported quantized training and inference (aka **[QLoRA](https://github.com/artidoro/qlora)**). Try `--quantization_bit 4/8` argument to work with quantized models.

 ## Supported Models

@@ -405,27 +407,7 @@ python src/web_demo.py \
    --checkpoint_dir path_to_checkpoint
 ```

-### Evaluation (BLEU and ROUGE_CHINESE)
-
-```bash
-CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
-    --stage sft \
-    --model_name_or_path path_to_llama_model \
-    --do_eval \
-    --dataset alpaca_gpt4_en \
-    --template default \
-    --finetuning_type lora \
-    --checkpoint_dir path_to_checkpoint \
-    --output_dir path_to_eval_result \
-    --per_device_eval_batch_size 8 \
-    --max_samples 100 \
-    --predict_with_generate
-```
-
-> [!NOTE]
-> We recommend using `--per_device_eval_batch_size=1` and `--max_target_length 128` at 4/8-bit evaluation.
-
-### Predict
+### Evaluation and Predict (BLEU & ROUGE_CHINESE)

 ```bash
 CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
@@ -442,6 +424,24 @@ CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --predict_with_generate
 ```

+> [!NOTE]
+> We recommend using `--per_device_eval_batch_size=1` and `--max_target_length 128` at 4/8-bit evaluation.
+
+### Evaluation (MMLU & C-Eval)
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python evaluation/evaluate.py \
+    --model_name_or_path path_to_llama_model \
+    --finetuning_type lora \
+    --checkpoint_dir path_to_checkpoint \
+    --template vanilla \
+    --task mmlu \
+    --split test \
+    --lang en \
+    --n_shot 5 \
+    --batch_size 4
+```
+
 ## License

 This repository is licensed under the [Apache-2.0 License](LICENSE).
--- a/README_zh.md
+++ b/README_zh.md
@@ -14,15 +14,17 @@

 ## 更新日志

-[23/09/10] 现在我们支持了 LLaMA 模型的 **[FlashAttention](https://github.com/Dao-AILab/flash-attention)**。如果您使用的是 RTX4090、A100 或 H100 GPU，请使用 `--flash_attn` 参数以启用 FlashAttention-2（实验性功能）。
+[23/09/23] 我们在项目中集成了 MMLU 和 C-Eval 评估集。使用方法请参阅[此示例](#模型评估mmlu-和-c-eval)。

-[23/08/18] 现在我们支持了**训练状态恢复**，请将 `transformers` 升级至 `4.31.0` 以启用此功能。
+[23/09/10] 我们支持了 LLaMA 模型的 **[FlashAttention](https://github.com/Dao-AILab/flash-attention)**。如果您使用的是 RTX4090、A100 或 H100 GPU，请使用 `--flash_attn` 参数以启用 FlashAttention-2（实验性功能）。

-[23/08/12] 现在我们支持了 **RoPE 插值**来扩展 LLaMA 模型的上下文长度。请使用 `--rope_scaling linear` 参数训练模型或使用 `--rope_scaling dynamic` 参数评估模型。
+[23/08/18] 我们支持了**训练状态恢复**，请将 `transformers` 升级至 `4.31.0` 以启用此功能。

-[23/08/11] 现在我们支持了指令模型的 **[DPO 训练](https://arxiv.org/abs/2305.18290)**。详情请参阅[此示例](#dpo-训练)。
+[23/08/12] 我们支持了 **RoPE 插值**来扩展 LLaMA 模型的上下文长度。请使用 `--rope_scaling linear` 参数训练模型或使用 `--rope_scaling dynamic` 参数评估模型。

-[23/07/31] 现在我们支持了**数据流式加载**。请尝试使用 `--streaming` 和 `--max_steps 10000` 参数来流式加载数据集。
+[23/08/11] 我们支持了指令模型的 **[DPO 训练](https://arxiv.org/abs/2305.18290)**。使用方法请参阅[此示例](#dpo-训练)。
+
+[23/07/31] 我们支持了**数据流式加载**。请尝试使用 `--streaming` 和 `--max_steps 10000` 参数来流式加载数据集。

 [23/07/29] 我们在 Hugging Face 发布了两个 13B 指令微调模型。详细内容请查阅我们的 Hugging Face 项目（[LLaMA-2](https://huggingface.co/hiyouga/Llama-2-Chinese-13b-chat) / [Baichuan](https://huggingface.co/hiyouga/Baichuan-13B-sft)）。

@@ -34,7 +36,7 @@

 [23/06/22] 我们对齐了[示例 API](src/api_demo.py) 与 [OpenAI API](https://platform.openai.com/docs/api-reference/chat) 的格式，您可以将微调模型接入**任意基于 ChatGPT 的应用**中。

-[23/06/03] 现在我们实现了 4 比特的 LoRA 训练（也称 **[QLoRA](https://github.com/artidoro/qlora)**）。请尝试使用 `--quantization_bit 4` 参数进行 4 比特量化微调。
+[23/06/03] 我们实现了 4 比特的 LoRA 训练（也称 **[QLoRA](https://github.com/artidoro/qlora)**）。请尝试使用 `--quantization_bit 4` 参数进行 4 比特量化微调。

 ## 模型

@@ -404,27 +406,7 @@ python src/web_demo.py \
    --checkpoint_dir path_to_checkpoint
 ```

-### 指标评估（BLEU 分数和汉语 ROUGE 分数）
-
-```bash
-CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
-    --stage sft \
-    --model_name_or_path path_to_llama_model \
-    --do_eval \
-    --dataset alpaca_gpt4_zh \
-    --template default \
-    --finetuning_type lora \
-    --checkpoint_dir path_to_checkpoint \
-    --output_dir path_to_eval_result \
-    --per_device_eval_batch_size 8 \
-    --max_samples 100 \
-    --predict_with_generate
-```
-
-> [!NOTE]
-> 我们建议在量化模型的评估中使用 `--per_device_eval_batch_size=1` 和 `--max_target_length 128`。
-
-### 模型预测
+### 指标评估与模型预测（BLEU 分数和汉语 ROUGE 分数）

 ```bash
 CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
@@ -441,6 +423,24 @@ CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --predict_with_generate
 ```

+> [!NOTE]
+> 我们建议在量化模型的评估中使用 `--per_device_eval_batch_size=1` 和 `--max_target_length 128`。
+
+### 模型评估（MMLU 和 C-Eval）
+
+```bash
+CUDA_VISIBLE_DEVICES=0 python evaluation/evaluate.py \
+    --model_name_or_path path_to_llama_model \
+    --finetuning_type lora \
+    --checkpoint_dir path_to_checkpoint \
+    --template vanilla \
+    --task ceval \
+    --split validation \
+    --lang zh \
+    --n_shot 5 \
+    --batch_size 4
+```
+
 ## 协议

 本仓库的代码依照 [Apache-2.0](LICENSE) 协议开源。
--- a/evaluation/ceval/ceval.py
+++ b/evaluation/ceval/ceval.py
@@ -0,0 +1,166 @@
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+
+import datasets
+import pandas as pd
+
+
+_CITATION = """\
+@article{huang2023ceval,
+  title={C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models}, 
+  author={Huang, Yuzhen and Bai, Yuzhuo and Zhu, Zhihao and Zhang, Junlei and Zhang, Jinghan and Su, Tangjun and Liu, Junteng and Lv, Chuancheng and Zhang, Yikai and Lei, Jiayi and Fu, Yao and Sun, Maosong and He, Junxian},
+  journal={arXiv preprint arXiv:2305.08322},
+  year={2023}
+}
+"""
+
+_DESCRIPTION = """\
+C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13948 multi-choice questions spanning 52 diverse disciplines and four difficulty levels.
+"""
+
+_HOMEPAGE = "https://cevalbenchmark.com"
+
+_LICENSE = "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License"
+
+_URL = "ceval.zip"
+
+task_list = [
+    "computer_network",
+    "operating_system",
+    "computer_architecture",
+    "college_programming",
+    "college_physics",
+    "college_chemistry",
+    "advanced_mathematics",
+    "probability_and_statistics",
+    "discrete_mathematics",
+    "electrical_engineer",
+    "metrology_engineer",
+    "high_school_mathematics",
+    "high_school_physics",
+    "high_school_chemistry",
+    "high_school_biology",
+    "middle_school_mathematics",
+    "middle_school_biology",
+    "middle_school_physics",
+    "middle_school_chemistry",
+    "veterinary_medicine",
+    "college_economics",
+    "business_administration",
+    "marxism",
+    "mao_zedong_thought",
+    "education_science",
+    "teacher_qualification",
+    "high_school_politics",
+    "high_school_geography",
+    "middle_school_politics",
+    "middle_school_geography",
+    "modern_chinese_history",
+    "ideological_and_moral_cultivation",
+    "logic",
+    "law",
+    "chinese_language_and_literature",
+    "art_studies",
+    "professional_tour_guide",
+    "legal_professional",
+    "high_school_chinese",
+    "high_school_history",
+    "middle_school_history",
+    "civil_servant",
+    "sports_science",
+    "plant_protection",
+    "basic_medicine",
+    "clinical_medicine",
+    "urban_and_rural_planner",
+    "accountant",
+    "fire_engineer",
+    "environmental_impact_assessment_engineer",
+    "tax_accountant",
+    "physician",
+]
+
+
+class CevalExamConfig(datasets.BuilderConfig):
+    def __init__(self, **kwargs):
+        super().__init__(version=datasets.Version("1.0.0"), **kwargs)
+
+
+class CevalExam(datasets.GeneratorBasedBuilder):
+    BUILDER_CONFIGS = [
+        CevalExamConfig(
+            name=task_name,
+        )
+        for task_name in task_list
+    ]
+
+    def _info(self):
+        features = datasets.Features(
+            {
+                "id": datasets.Value("int32"),
+                "question": datasets.Value("string"),
+                "A": datasets.Value("string"),
+                "B": datasets.Value("string"),
+                "C": datasets.Value("string"),
+                "D": datasets.Value("string"),
+                "answer": datasets.Value("string"),
+                "explanation": datasets.Value("string"),
+            }
+        )
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=features,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+
+    def _split_generators(self, dl_manager):
+        data_dir = dl_manager.download_and_extract(_URL)
+        task_name = self.config.name
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                gen_kwargs={
+                    "filepath": os.path.join(
+                        data_dir, "test", f"{task_name}_test.csv"
+                    ),
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                gen_kwargs={
+                    "filepath": os.path.join(
+                        data_dir, "val", f"{task_name}_val.csv"
+                    ),
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                gen_kwargs={
+                    "filepath": os.path.join(
+                        data_dir, "dev", f"{task_name}_dev.csv"
+                    ),
+                },
+            ),
+        ]
+
+    def _generate_examples(self, filepath):
+        df = pd.read_csv(filepath, encoding="utf-8")
+        for i, instance in enumerate(df.to_dict(orient="records")):
+            if "answer" not in instance.keys():
+                instance["answer"] = ""
+            if "explanation" not in instance.keys():
+                instance["explanation"] = ""
+            yield i, instance
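
The `_generate_examples` method above fills in `answer` and `explanation` when a CSV lacks those columns, which suggests that some split files (for instance the unlabeled test files) ship without them. The following is a minimal sketch of that normalization step. It assumes the standard-library `csv` module in place of pandas, and `iter_examples` is a hypothetical helper name, not part of the commit.

```python
import csv
import io


def iter_examples(csv_text):
    """Yield (index, row) pairs, defaulting missing label columns to ""."""
    # Assumption: csv.DictReader stands in for pandas.read_csv here, so every
    # value (including "id") stays a string instead of being type-inferred.
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        # Files without labels get empty strings so that all splits
        # expose the same set of feature names.
        row.setdefault("answer", "")
        row.setdefault("explanation", "")
        yield i, row


# An unlabeled row (no "answer" or "explanation" column).
unlabeled = "id,question,A,B,C,D\n0,1+1=?,1,2,3,4\n"
# A labeled row that has an answer but no explanation.
labeled = "id,question,A,B,C,D,answer\n0,1+1=?,1,2,3,4,B\n"

unlabeled_rows = list(iter_examples(unlabeled))
labeled_rows = list(iter_examples(labeled))
```

Existing values are left untouched, so labeled rows keep their answer and only the absent columns are filled.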
--- a/evaluation/ceval/ceval.zip
+++ b/evaluation/ceval/ceval.zip
--- a/evaluation/ceval/mapping.json
+++ b/evaluation/ceval/mapping.json
@@ -0,0 +1,210 @@
+{
+  "accountant": {
+    "name": "注册会计师",
+    "category": "Other"
+  },
+  "advanced_mathematics": {
+    "name": "高等数学",
+    "category": "STEM"
+  },
+  "art_studies": {
+    "name": "艺术学",
+    "category": "Humanities"
+  },
+  "basic_medicine": {
+    "name": "基础医学",
+    "category": "Other"
+  },
+  "business_administration": {
+    "name": "工商管理",
+    "category": "Social Sciences"
+  },
+  "chinese_language_and_literature": {
+    "name": "中国语言文学",
+    "category": "Humanities"
+  },
+  "civil_servant": {
+    "name": "公务员",
+    "category": "Other"
+  },
+  "clinical_medicine": {
+    "name": "临床医学",
+    "category": "Other"
+  },
+  "college_chemistry": {
+    "name": "大学化学",
+    "category": "STEM"
+  },
+  "college_economics": {
+    "name": "大学经济学",
+    "category": "Social Sciences"
+  },
+  "college_physics": {
+    "name": "大学物理",
+    "category": "STEM"
+  },
+  "college_programming": {
+    "name": "大学编程",
+    "category": "STEM"
+  },
+  "computer_architecture": {
+    "name": "计算机组成",
+    "category": "STEM"
+  },
+  "computer_network": {
+    "name": "计算机网络",
+    "category": "STEM"
+  },
+  "discrete_mathematics": {
+    "name": "离散数学",
+    "category": "STEM"
+  },
+  "education_science": {
+    "name": "教育学",
+    "category": "Social Sciences"
+  },
+  "electrical_engineer": {
+    "name": "注册电气工程师",
+    "category": "STEM"
+  },
+  "environmental_impact_assessment_engineer": {
+    "name": "环境影响评价工程师",
+    "category": "Other"
+  },
+  "fire_engineer": {
+    "name": "注册消防工程师",
+    "category": "Other"
+  },
+  "high_school_biology": {
+    "name": "高中生物",
+    "category": "STEM"
+  },
+  "high_school_chemistry": {
+    "name": "高中化学",
+    "category": "STEM"
+  },
+  "high_school_chinese": {
+    "name": "高中语文",
+    "category": "Humanities"
+  },
+  "high_school_geography": {
+    "name": "高中地理",
+    "category": "Social Sciences"
+  },
+  "high_school_history": {
+    "name": "高中历史",
+    "category": "Humanities"
+  },
+  "high_school_mathematics": {
+    "name": "高中数学",
+    "category": "STEM"
+  },
+  "high_school_physics": {
+    "name": "高中物理",
+    "category": "STEM"
+  },
+  "high_school_politics": {
+    "name": "高中政治",
+    "category": "Social Sciences"
+  },
+  "ideological_and_moral_cultivation": {
+    "name": "思想道德修养与法律基础",
+    "category": "Humanities"
+  },
+  "law": {
+    "name": "法学",
+    "category": "Humanities"
+  },
+  "legal_professional": {
+    "name": "法律职业资格",
+    "category": "Humanities"
+  },
+  "logic": {
+    "name": "逻辑学",
+    "category": "Humanities"
+  },
+  "mao_zedong_thought": {
+    "name": "毛泽东思想和中国特色社会主义理论体系概论",
+    "category": "Social Sciences"
+  },
+  "marxism": {
+    "name": "马克思主义基本原理",
+    "category": "Social Sciences"
+  },
+  "metrology_engineer": {
+    "name": "注册计量师",
+    "category": "STEM"
+  },
+  "middle_school_biology": {
+    "name": "初中生物",
+    "category": "STEM"
+  },
+  "middle_school_chemistry": {
+    "name": "初中化学",
+    "category": "STEM"
+  },
+  "middle_school_geography": {
+    "name": "初中地理",
+    "category": "Social Sciences"
+  },
+  "middle_school_history": {
+    "name": "初中历史",
+    "category": "Humanities"
+  },
+  "middle_school_mathematics": {
+    "name": "初中数学",
+    "category": "STEM"
+  },
+  "middle_school_physics": {
+    "name": "初中物理",
+    "category": "STEM"
+  },
+  "middle_school_politics": {
+    "name": "初中政治",
+    "category": "Social Sciences"
+  },
+  "modern_chinese_history": {
+    "name": "近代史纲要",
+    "category": "Humanities"
+  },
+  "operating_system": {
+    "name": "操作系统",
+    "category": "STEM"
+  },
+  "physician": {
+    "name": "医师资格",
+    "category": "Other"
+  },
+  "plant_protection": {
+    "name": "植物保护",
+    "category": "Other"
+  },
+  "probability_and_statistics": {
+    "name": "概率统计",
+    "category": "STEM"
+  },
+  "professional_tour_guide": {
+    "name": "导游资格",
+    "category": "Humanities"
+  },
+  "sports_science": {
+    "name": "体育学",
+    "category": "Other"
+  },
+  "tax_accountant": {
+    "name": "税务师",
+    "category": "Other"
+  },
+  "teacher_qualification": {
+    "name": "教师资格",
+    "category": "Social Sciences"
+  },
+  "urban_and_rural_planner": {
+    "name": "注册城乡规划师",
+    "category": "Other"
+  },
+  "veterinary_medicine": {
+    "name": "兽医学",
+    "category": "STEM"
+  }
+}
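
Each entry in `mapping.json` pairs a subject key with a display name and one of four categories. As a rough illustration of how that structure can be consumed, the sketch below groups subject keys by category. The dictionary is a four-entry excerpt copied from the file above, and `subjects_by_category` is a hypothetical helper, not something the commit defines.

```python
from collections import defaultdict

# Excerpt of evaluation/ceval/mapping.json; the real file has one entry per subject.
mapping = {
    "accountant": {"name": "注册会计师", "category": "Other"},
    "advanced_mathematics": {"name": "高等数学", "category": "STEM"},
    "art_studies": {"name": "艺术学", "category": "Humanities"},
    "college_physics": {"name": "大学物理", "category": "STEM"},
}


def subjects_by_category(mapping):
    """Group subject keys by their "category" field, preserving file order."""
    groups = defaultdict(list)
    for subject, info in mapping.items():
        groups[info["category"]].append(subject)
    return dict(groups)


groups = subjects_by_category(mapping)
```

In practice the dictionary would come from `json.load` on the mapping file, as `evaluate.py` does.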
--- a/evaluation/evaluate.py
+++ b/evaluation/evaluate.py
@@ -0,0 +1,173 @@
+# coding=utf-8
+# Evaluates the performance of pre-trained models.
+# Usage: python evaluate.py --model_name_or_path path_to_model --checkpoint_dir path_to_ckpt --template vanilla
+#                           --task ceval --split validation --lang zh --n_shot 5 --batch_size 4
+# Inspired by: https://github.com/hendrycks/test/blob/master/evaluate_flan.py
+
+import os
+import fire
+import json
+import torch
+import numpy as np
+from tqdm import tqdm
+from typing import TYPE_CHECKING, Dict, List, Literal, Optional, Tuple
+from datasets import load_dataset
+from dataclasses import dataclass
+
+from llmtuner import ChatModel
+
+if TYPE_CHECKING:
+    from datasets import Dataset
+
+
+choices = ["A", "B", "C", "D"]
+
+
+@dataclass
+class EvalTemplate:
+
+    system: str
+    choice: str
+    answer: str
+
+    def parse_example(
+        self,
+        example: Dict[str, str]
+    ) -> Tuple[str, str]:
+        candidates = [self.choice.format(choice=ch, content=example[ch]) for ch in choices if ch in example]
+        return "".join([example["question"]] + candidates + [self.answer]), example["answer"]
+
+    def format_example(
+        self,
+        target_data: Dict[str, str],
+        support_set: "Dataset",
+        subject_name: str,
+        use_history: bool
+    ) -> Tuple[str, str, List[Tuple[str, str]]]:
+        query, resp = self.parse_example(target_data)
+        history = [self.parse_example(support_set[k]) for k in range(len(support_set))]
+
+        if len(history):
+            temp = history.pop(0)
+            history.insert(0, (self.system.format(subject=subject_name) + temp[0], temp[1]))
+        else:
+            query = self.system.format(subject=subject_name) + query
+
+        if not use_history:
+            query = "\n\n".join(["".join(item) for item in history] + [query])
+            history = []
+        return query, resp, history
+
+
+eval_templates = {
+    "en": EvalTemplate(
+        system="The following are multiple choice questions (with answers) about {subject}.\n\n",
+        choice="\n{choice}. {content}",
+        answer="\nAnswer: "
+    ),
+    "zh": EvalTemplate(
+        system="以下是中国关于{subject}考试的单项选择题，请选出其中的正确答案。\n\n",
+        choice="\n{choice}. {content}",
+        answer="\n答案："
+    )
+}
+
+
+@torch.inference_mode()
+def batch_inference(
+    chat_model: ChatModel,
+    batch_input: Dict[str, torch.Tensor]
+) -> List[str]:
+    # Left padding (set by the caller) keeps each prompt's final token at index -1.
+    logits = chat_model.model(**batch_input).logits
+    # Use the last token id of "\n" + letter so that anything the tokenizer adds
+    # at the start of a string does not change which id represents each choice.
+    choice_ids = [chat_model.tokenizer.encode("\n" + ch)[-1] for ch in choices]
+    # Compare only the four candidate logits at the final position.
+    probs = torch.nn.functional.softmax(logits[:, -1, choice_ids], dim=-1)
+    return [choices[offset.item()] for offset in torch.argmax(probs, dim=-1)]
+
+
+def evaluate(
+    model_name_or_path: str,
+    finetuning_type: Optional[str] = "lora",
+    checkpoint_dir: Optional[str] = None,
+    template: Optional[str] = "vanilla",
+    task: Optional[str] = "ceval",
+    dataset_dir: Optional[str] = "evaluation",
+    split: Optional[Literal["validation", "test"]] = "validation",
+    lang: Optional[Literal["zh", "en"]] = "zh",
+    n_shot: Optional[int] = 5,
+    batch_size: Optional[int] = 4
+):
+    with open(os.path.join(dataset_dir, task, "mapping.json"), "r", encoding="utf-8") as f:
+        categorys = json.load(f)
+
+    chat_model = ChatModel(dict(
+        model_name_or_path=model_name_or_path,
+        finetuning_type=finetuning_type,
+        checkpoint_dir=checkpoint_dir,
+        template=template
+    ))
+    chat_model.tokenizer.padding_side = "left"
+    eval_template = eval_templates[lang]
+    category_corrects: Dict[str, np.ndarray] = {
+        "STEM": np.array([], dtype="bool"),
+        "Social Sciences": np.array([], dtype="bool"),
+        "Humanities": np.array([], dtype="bool"),
+        "Other": np.array([], dtype="bool")
+    }
+    overall_corrects = np.array([], dtype="bool")
+
+    pbar = tqdm(categorys.keys())
+    for subject in pbar:
+        pbar.set_postfix_str(categorys[subject]["name"])
+        inputs, labels = [], []
+        dataset = load_dataset(os.path.join(dataset_dir, task), subject)
+        # The few-shot support set is the same for every question in a subject, so build it once.
+        support_set = dataset["train"].select(range(min(n_shot, len(dataset["train"]))))
+        for i in range(len(dataset[split])):
+            query, resp, history = eval_template.format_example(
+                target_data=dataset[split][i],
+                support_set=support_set,
+                subject_name=categorys[subject]["name"],
+                use_history=chat_model.template.use_history
+            )
+            input_ids, _ = chat_model.template.encode_oneturn(
+                tokenizer=chat_model.tokenizer,
+                query=query,
+                resp=resp,
+                history=history
+            )
+            inputs.append({
+                "input_ids": input_ids,
+                "attention_mask": [1] * len(input_ids)
+            })
+            labels.append(resp)
+
+        outputs = []
+        for i in range(0, len(inputs), batch_size):
+            batch_input = chat_model.tokenizer.pad(
+                inputs[i : i + batch_size],
+                return_attention_mask=True,
+                return_tensors="pt"
+            ).to(chat_model.model.device)
+            preds = batch_inference(chat_model, batch_input)
+            outputs += preds
+
+        corrects = (np.array(outputs) == np.array(labels))
+        category_name = categorys[subject]["category"]
+        category_corrects[category_name] = np.concatenate([category_corrects[category_name], corrects], axis=0)
+        overall_corrects = np.concatenate([overall_corrects, corrects], axis=0)
+
+    print("Average accuracy: {:.2f}".format(100 * np.mean(overall_corrects)))
+    for category_name, category_correct in category_corrects.items():
+        print("    {} - {:.2f}".format(category_name, 100 * np.mean(category_correct)))
+
+
+if __name__ == "__main__":
+    fire.Fire(evaluate)
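
To make the prompt layout produced by `EvalTemplate.format_example` easier to picture, here is a minimal, dependency-free sketch of the `use_history=False` path with the English template strings from the script. Plain dictionaries stand in for dataset rows, and `build_prompt` is a hypothetical name introduced only for this illustration.

```python
CHOICES = ["A", "B", "C", "D"]
SYSTEM = "The following are multiple choice questions (with answers) about {subject}.\n\n"
CHOICE = "\n{choice}. {content}"
ANSWER = "\nAnswer: "


def parse_example(example):
    """Return (question text ending in the answer cue, gold answer letter)."""
    candidates = [CHOICE.format(choice=ch, content=example[ch]) for ch in CHOICES if ch in example]
    return "".join([example["question"]] + candidates + [ANSWER]), example["answer"]


def build_prompt(target, support_set, subject):
    """Concatenate solved few-shot examples, then the unanswered target question."""
    # Each shot is rendered with its answer appended; the target stops at the cue.
    shots = ["".join(parse_example(ex)) for ex in support_set]
    query, label = parse_example(target)
    # The system line appears once, before the first shot (or before the
    # query when there are no shots).
    return SYSTEM.format(subject=subject) + "\n\n".join(shots + [query]), label


shot = {"question": "1+1=?", "A": "1", "B": "2", "C": "3", "D": "4", "answer": "B"}
target = {"question": "2+2=?", "A": "3", "B": "4", "C": "5", "D": "6", "answer": "B"}
prompt, label = build_prompt(target, [shot], "arithmetic")
print(prompt)
```

The prompt ends right after the answer cue, which is why the script only needs the logits at the final position to score the four letters.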
--- a/evaluation/mmlu/mapping.json
+++ b/evaluation/mmlu/mapping.json
@@ -0,0 +1,230 @@
+{
+  "abstract_algebra": {
+    "name": "abstract algebra",
+    "category": "STEM"
+  },
+  "anatomy": {
+    "name": "anatomy",
+    "category": "Other"
+  },
+  "astronomy": {
+    "name": "astronomy",
+    "category": "STEM"
+  },
+  "business_ethics": {
+    "name": "business ethics",
+    "category": "Other"
+  },
+  "clinical_knowledge": {
+    "name": "clinical knowledge",
+    "category": "Other"
+  },
+  "college_biology": {
+    "name": "college biology",
+    "category": "STEM"
+  },
+  "college_chemistry": {
+    "name": "college chemistry",
+    "category": "STEM"
+  },
+  "college_computer_science": {
+    "name": "college computer science",
+    "category": "STEM"
+  },
+  "college_mathematics": {
+    "name": "college mathematics",
+    "category": "STEM"
+  },
+  "college_medicine": {
+    "name": "college medicine",
+    "category": "Other"
+  },
+  "college_physics": {
+    "name": "college physics",
+    "category": "STEM"
+  },
+  "computer_security": {
+    "name": "computer security",
+    "category": "STEM"
+  },
+  "conceptual_physics": {
+    "name": "conceptual physics",
+    "category": "STEM"
+  },
+  "econometrics": {
+    "name": "econometrics",
+    "category": "Social Sciences"
+  },
+  "electrical_engineering": {
+    "name": "electrical engineering",
+    "category": "STEM"
+  },
+  "elementary_mathematics": {
+    "name": "elementary mathematics",
+    "category": "STEM"
+  },
+  "formal_logic": {
+    "name": "formal logic",
+    "category": "Humanities"
+  },
+  "global_facts": {
+    "name": "global facts",
+    "category": "Other"
+  },
+  "high_school_biology": {
+    "name": "high school biology",
+    "category": "STEM"
+  },
+  "high_school_chemistry": {
+    "name": "high school chemistry",
+    "category": "STEM"
+  },
+  "high_school_computer_science": {
+    "name": "high school computer science",
+    "category": "STEM"
+  },
+  "high_school_european_history": {
+    "name": "high school european history",
+    "category": "Humanities"
+  },
+  "high_school_geography": {
+    "name": "high school geography",
+    "category": "Social Sciences"
+  },
+  "high_school_government_and_politics": {
+    "name": "high school government and politics",
+    "category": "Social Sciences"
+  },
+  "high_school_macroeconomics": {
+    "name": "high school macroeconomics",
+    "category": "Social Sciences"
+  },
+  "high_school_mathematics": {
+    "name": "high school mathematics",
+    "category": "STEM"
+  },
+  "high_school_microeconomics": {
+    "name": "high school microeconomics",
+    "category": "Social Sciences"
+  },
+  "high_school_physics": {
+    "name": "high school physics",
+    "category": "STEM"
+  },
+  "high_school_psychology": {
+    "name": "high school psychology",
+    "category": "Social Sciences"
+  },
+  "high_school_statistics": {
+    "name": "high school statistics",
+    "category": "STEM"
+  },
+  "high_school_us_history": {
+    "name": "high school us history",
+    "category": "Humanities"
+  },
+  "high_school_world_history": {
+    "name": "high school world history",
+    "category": "Humanities"
+  },
+  "human_aging": {
+    "name": "human aging",
+    "category": "Other"
+  },
+  "human_sexuality": {
+    "name": "human sexuality",
+    "category": "Social Sciences"
+  },
+  "international_law": {
+    "name": "international law",
+    "category": "Humanities"
+  },
+  "jurisprudence": {
+    "name": "jurisprudence",
+    "category": "Humanities"
+  },
+  "logical_fallacies": {
+    "name": "logical fallacies",
+    "category": "Humanities"
+  },
+  "machine_learning": {
+    "name": "machine learning",
+    "category": "STEM"
+  },
+  "management": {
+    "name": "management",
+    "category": "Other"
+  },
+  "marketing": {
+    "name": "marketing",
+    "category": "Other"
+  },
+  "medical_genetics": {
+    "name": "medical genetics",
+    "category": "Other"
+  },
+  "miscellaneous": {
+    "name": "miscellaneous",
+    "category": "Other"
+  },
+  "moral_disputes": {
+    "name": "moral disputes",
+    "category": "Humanities"
+  },
+  "moral_scenarios": {
+    "name": "moral scenarios",
+    "category": "Humanities"
+  },
+  "nutrition": {
+    "name": "nutrition",
+    "category": "Other"
+  },
+  "philosophy": {
+    "name": "philosophy",
+    "category": "Humanities"
+  },
+  "prehistory": {
+    "name": "prehistory",
+    "category": "Humanities"
+  },
+  "professional_accounting": {
+    "name": "professional accounting",
+    "category": "Other"
+  },
+  "professional_law": {
+    "name": "professional law",
+    "category": "Humanities"
+  },
+  "professional_medicine": {
+    "name": "professional medicine",
+    "category": "Other"
+  },
+  "professional_psychology": {
+    "name": "professional psychology",
+    "category": "Social Sciences"
+  },
+  "public_relations": {
+    "name": "public relations",
+    "category": "Social Sciences"
+  },
+  "security_studies": {
+    "name": "security studies",
+    "category": "Social Sciences"
+  },
+  "sociology": {
+    "name": "sociology",
+    "category": "Social Sciences"
+  },
+  "us_foreign_policy": {
+    "name": "us foreign policy",
+    "category": "Social Sciences"
+  },
+  "virology": {
+    "name": "virology",
+    "category": "Other"
+  },
+  "world_religions": {
+    "name": "world religions",
+    "category": "Humanities"
+  }
+}
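
The `category` field in these mapping files is what `evaluate.py` uses to bucket per-question results before printing accuracies. The sketch below shows one way to compute the same per-category percentages without NumPy. It is an assumption-laden simplification: `category_accuracy` is a hypothetical helper, and it reports only categories that received at least one result, whereas the script takes `np.mean` over every predefined bucket.

```python
def category_accuracy(results):
    """results: iterable of (category, is_correct) pairs -> {category: percent}."""
    totals = {}
    for category, is_correct in results:
        hits, count = totals.get(category, (0, 0))
        totals[category] = (hits + int(is_correct), count + 1)
    # Every category in `totals` has count >= 1, so the division is safe.
    return {cat: 100.0 * hits / count for cat, (hits, count) in totals.items()}


results = [("STEM", True), ("STEM", False), ("Other", True)]
scores = category_accuracy(results)
for name, score in scores.items():
    print("    {} - {:.2f}".format(name, score))
```

Skipping empty buckets avoids the `nan` that averaging an empty array would produce.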
--- a/evaluation/mmlu/mmlu.py
+++ b/evaluation/mmlu/mmlu.py
@@ -0,0 +1,167 @@
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import os
+
+import datasets
+import pandas as pd
+
+
+_CITATION = """\
+@article{hendryckstest2021,
+  title={Measuring Massive Multitask Language Understanding},
+  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
+  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
+  year={2021}
+}
+"""
+
+_DESCRIPTION = """\
+Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).
+"""
+
+_HOMEPAGE = "https://github.com/hendrycks/test"
+
+_LICENSE = "MIT"
+
+_URL = "mmlu.zip"
+
+task_list = [
+    "high_school_european_history",
+    "business_ethics",
+    "clinical_knowledge",
+    "medical_genetics",
+    "high_school_us_history",
+    "high_school_physics",
+    "high_school_world_history",
+    "virology",
+    "high_school_microeconomics",
+    "econometrics",
+    "college_computer_science",
+    "high_school_biology",
+    "abstract_algebra",
+    "professional_accounting",
+    "philosophy",
+    "professional_medicine",
+    "nutrition",
+    "global_facts",
+    "machine_learning",
+    "security_studies",
+    "public_relations",
+    "professional_psychology",
+    "prehistory",
+    "anatomy",
+    "human_sexuality",
+    "college_medicine",
+    "high_school_government_and_politics",
+    "college_chemistry",
+    "logical_fallacies",
+    "high_school_geography",
+    "elementary_mathematics",
+    "human_aging",
+    "college_mathematics",
+    "high_school_psychology",
+    "formal_logic",
+    "high_school_statistics",
+    "international_law",
+    "high_school_mathematics",
+    "high_school_computer_science",
+    "conceptual_physics",
+    "miscellaneous",
+    "high_school_chemistry",
+    "marketing",
+    "professional_law",
+    "management",
+    "college_physics",
+    "jurisprudence",
+    "world_religions",
+    "sociology",
+    "us_foreign_policy",
+    "high_school_macroeconomics",
+    "computer_security",
+    "moral_scenarios",
+    "moral_disputes",
+    "electrical_engineering",
+    "astronomy",
+    "college_biology",
+]
+
+
+class MMLUConfig(datasets.BuilderConfig):
+    def __init__(self, **kwargs):
+        super().__init__(version=datasets.Version("1.0.0"), **kwargs)
+
+
+class MMLU(datasets.GeneratorBasedBuilder):
+    BUILDER_CONFIGS = [
+        MMLUConfig(
+            name=task_name,
+        )
+        for task_name in task_list
+    ]
+
+    def _info(self):
+        features = datasets.Features(
+            {
+                "question": datasets.Value("string"),
+                "A": datasets.Value("string"),
+                "B": datasets.Value("string"),
+                "C": datasets.Value("string"),
+                "D": datasets.Value("string"),
+                "answer": datasets.Value("string"),
+            }
+        )
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=features,
+            homepage=_HOMEPAGE,
+            license=_LICENSE,
+            citation=_CITATION,
+        )
+
+    def _split_generators(self, dl_manager):
+        data_dir = dl_manager.download_and_extract(_URL)
+        task_name = self.config.name
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                gen_kwargs={
+                    "filepath": os.path.join(
+                        data_dir, "data", "test", f"{task_name}_test.csv"
+                    ),
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                gen_kwargs={
+                    "filepath": os.path.join(
+                        data_dir, "data", "val", f"{task_name}_val.csv"
+                    ),
+                },
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                gen_kwargs={
+                    "filepath": os.path.join(
+                        data_dir, "data", "dev", f"{task_name}_dev.csv"
+                    ),
+                },
+            ),
+        ]
+
+    def _generate_examples(self, filepath):
+        df = pd.read_csv(filepath, header=None)
+        df.columns = ["question", "A", "B", "C", "D", "answer"]
+
+        for i, instance in enumerate(df.to_dict(orient="records")):
+            yield i, instance
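
Review note: `_generate_examples` turns each CSV row into a record with the six fields declared in `_info`. The sketch below shows that row-to-record mapping with the standard-library `csv` module instead of pandas, assuming the files in `mmlu.zip` follow the upstream MMLU layout, which has no header row. `generate_examples` and the sample rows are hypothetical.

```python
import csv
import io

COLUMNS = ["question", "A", "B", "C", "D", "answer"]


def generate_examples(csv_text):
    """Yield (index, record) pairs from headerless MMLU-style CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    for i, row in enumerate(reader):
        # Every line is a question; none is treated as a header.
        yield i, dict(zip(COLUMNS, row))


# Two made-up rows without a header line; the second question contains a comma.
sample = 'What is 1 + 1?,1,2,3,4,B\n"Which is a prime, strictly?",4,6,7,8,C\n'
examples = list(generate_examples(sample))
```

With headerless files, a reader that consumes the first line as column names would silently drop one question per file, so the header handling is worth checking against the actual archive.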
--- a/evaluation/mmlu/mmlu.zip
+++ b/evaluation/mmlu/mmlu.zip
--- a/src/llmtuner/extras/constants.py
+++ b/src/llmtuner/extras/constants.py
@ -36,8 +36,8 @@ SUPPORTED_MODELS = {
    "BLOOMZ-3B": "bigscience/bloomz-3b",
    "BLOOMZ-7B1-mt": "bigscience/bloomz-7b1-mt",
    "Falcon-7B": "tiiuae/falcon-7b",
-    "Falcon-7B-Chat": "tiiuae/falcon-7b-instruct",
    "Falcon-40B": "tiiuae/falcon-40b",
+    "Falcon-7B-Chat": "tiiuae/falcon-7b-instruct",
    "Falcon-40B-Chat": "tiiuae/falcon-40b-instruct",
    "Baichuan-7B": "baichuan-inc/Baichuan-7B",
    "Baichuan-13B": "baichuan-inc/Baichuan-13B-Base",
@ -47,12 +47,15 @@ SUPPORTED_MODELS = {
    "Baichuan2-7B-Chat": "baichuan-inc/Baichuan2-7B-Chat",
    "Baichuan2-13B-Chat": "baichuan-inc/Baichuan2-13B-Chat",
    "InternLM-7B": "internlm/internlm-7b",
+    "InternLM-20B": "internlm/internlm-20b",
    "InternLM-7B-Chat": "internlm/internlm-chat-7b",
+    "InternLM-20B-Chat": "internlm/internlm-chat-20b",
    "Qwen-7B": "Qwen/Qwen-7B",
    "Qwen-7B-Chat": "Qwen/Qwen-7B-Chat",
    "XVERSE-13B": "xverse/XVERSE-13B",
    "XVERSE-13B-Chat": "xverse/XVERSE-13B-Chat",
-    "ChatGLM2-6B-Chat": "THUDM/chatglm2-6b"
+    "ChatGLM2-6B-Chat": "THUDM/chatglm2-6b",
+    "Phi1.5-1.3B": "microsoft/phi-1_5"
 }

 DEFAULT_MODULE = {
@ -67,7 +70,8 @@ DEFAULT_MODULE = {
    "InternLM": "q_proj,v_proj",
    "Qwen": "c_attn",
    "XVERSE": "q_proj,v_proj",
-    "ChatGLM2": "query_key_value"
+    "ChatGLM2": "query_key_value",
+    "Phi1.5": "Wqkv"
 }

 DEFAULT_TEMPLATE = {
--- a/src/llmtuner/extras/template.py
+++ b/src/llmtuner/extras/template.py
@ -138,11 +138,10 @@ class Template:
        token_ids = []
        for elem in context:
            if isinstance(elem, str):
-                if len(elem) == 0:
-                    continue
                elem = elem.replace("{{system}}", system, 1) if system is not None else elem
                elem = elem.replace("{{query}}", query, 1) if query is not None else elem
                elem = elem.replace("{{idx}}", idx, 1) if idx is not None else elem
+                if len(elem) != 0:
                    token_ids = token_ids + tokenizer.encode(elem, **kwargs)
            elif isinstance(elem, dict):
                token_ids = token_ids + [tokenizer.convert_tokens_to_ids(elem.get("token"))]
--- a/src/llmtuner/tuner/tune.py
+++ b/src/llmtuner/tuner/tune.py
@ -38,6 +38,7 @@ def export_model(args: Optional[Dict[str, Any]] = None, max_shard_size: Optional
    model_args, _, training_args, finetuning_args, _, _ = get_train_args(args)
    model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args)
    tokenizer.padding_side = "left" # restore padding side
+    tokenizer.init_kwargs["padding_side"] = "left"
    model.save_pretrained(training_args.output_dir, max_shard_size=max_shard_size)
    try:
        tokenizer.save_pretrained(training_args.output_dir)
--- a/tests/cal_flops.py
+++ b/tests/cal_flops.py
@ -6,8 +6,8 @@
 import fire
 import torch
 from typing import Optional
-from deepspeed.accelerator import get_accelerator
-from deepspeed.profiling.flops_profiler import get_model_profile
+from deepspeed.accelerator import get_accelerator # type: ignore
+from deepspeed.profiling.flops_profiler import get_model_profile # type: ignore

 from llmtuner import ChatModel

@ -35,7 +35,7 @@ def calculate(
            print_profile=True,
            detailed=True
        )
-        print("FLOPS:", flops)
+        print("FLOPs:", flops)
        print("MACs:", macs)
        print("Params:", params)

--- a/tests/evaluate_zh.py
+++ b/tests/evaluate_zh.py
@ -1,133 +0,0 @@
-# coding=utf-8
-# Evaluates fine-tuned models automatically.
-# Usage: python evaluate_zh.py --evalset ceval/ceval-exam:law --split dev --output_file result.json
-#                              --api_base http://localhost:8000/v1 --task_type choice --n_samples 100
-# dataset format: question (string), A (string), B (string), C (string), D (string), answer (Literal["A", "B", "C", "D"])
-
-
-import os
-import fire
-import json
-import openai
-from tqdm import tqdm
-from typing import Literal, Optional
-from datasets import load_dataset
-
-
-def format_example_choice(examples):
-    model_inputs = {"query": [], "label": []}
-    task_template = "请从ABCD四个选项中选出正确的选项，仅输出选项序号。\n{question}\nA. {A}\nB. {B}\nC. {C}\nD. {D}\n答案："
-    for i in range(len(examples["id"])):
-        query = task_template.format(
-            question=examples["question"][i],
-            A=examples["A"][i],
-            B=examples["B"][i],
-            C=examples["C"][i],
-            D=examples["D"][i]
-        )
-        label = examples["answer"][i]
-        model_inputs["query"].append(query)
-        model_inputs["label"].append(label)
-    return model_inputs
-
-
-def format_example_cloze(examples):
-    model_inputs = {"query": [], "label": []}
-    task_template = "请选择正确的答案填空，仅输出正确的选项。\n{question}\n选项：{A}\n{B}\n{C}\n{D}\n答案："
-    for i in range(len(examples["id"])):
-        query = task_template.format(
-            question=examples["question"][i],
-            A=examples["A"][i],
-            B=examples["B"][i],
-            C=examples["C"][i],
-            D=examples["D"][i]
-        )
-        label = examples[examples["answer"][i]][i]
-        model_inputs["query"].append(query)
-        model_inputs["label"].append(label)
-    return model_inputs
-
-
-def format_example_openqa(examples):
-    model_inputs = {"query": [], "label": []}
-    task_template = "回答以下问题：{question}\n答案："
-    for i in range(len(examples["id"])):
-        query = task_template.format(question=examples["question"][i])
-        label = examples[examples["answer"][i]][i]
-        model_inputs["query"].append(query)
-        model_inputs["label"].append(label)
-    return model_inputs
-
-
-TASK_DICT = {
-    "choice": format_example_choice,
-    "cloze": format_example_cloze,
-    "openqa": format_example_openqa
-}
-
-
-EXT2TYPE = {
-    "csv": "csv",
-    "json": "json",
-    "jsonl": "json"
-}
-
-
-def evaluate(
-        evalset: str,
-        api_base: str,
-        output_file: str,
-        split: Optional[str] = "val",
-        task_type: Optional[Literal["choice", "cloze", "openqa"]] = "choice",
-        n_samples: Optional[int] = 20
-):
-
-    openai.api_base = api_base
-    openai.api_key = "none"
-
-    if os.path.isfile(evalset):
-        dataset = load_dataset(EXT2TYPE[evalset.split(".")[-1]], data_files=evalset)["train"]
-    elif ":" in evalset:
-        evalset, subset = evalset.split(":")
-        dataset = load_dataset(evalset, subset, split=split)
-    else:
-        dataset = load_dataset(evalset, split=split)
-
-    n_samples = min(len(dataset), n_samples)
-
-    dataset = dataset.map(TASK_DICT[task_type], batched=True)
-    dataset = dataset.select(range(n_samples))
-
-    n_correct = 0
-    predictions = []
-    for example in tqdm(dataset):
-        query, label = example["query"], example["label"]
-        predict = openai.ChatCompletion.create(
-            model="default",
-            messages=[{"role": "user", "content": query}],
-            temperature=0.01,
-            top_p=0.01,
-            max_new_tokens=20
-        ).choices[0].message.content
-
-        if task_type == "choice" and predict[0].lower() == label[0].lower():
-            n_correct += 1
-        if task_type == "cloze" and label in [predict[:len(label)], predict[-len(label):]]:
-            n_correct += 1
-        if task_type == "openqa" and label in predict:
-            n_correct += 1
-
-        predictions.append({
-            "query": query,
-            "label": label,
-            "predict": predict
-        })
-
-    print("Result: {}/{}\nAccuracy: {:.2f}%".format(n_correct, n_samples, n_correct / n_samples * 100))
-
-    with open(output_file, "w", encoding="utf-8") as f:
-        json.dump(predictions, f, indent=2, ensure_ascii=False)
-
-
-if __name__ == "__main__":
-    fire.Fire(evaluate)
--- a/tests/modeling_baichuan.py
+++ b/tests/modeling_baichuan.py
@ -1,654 +0,0 @@
-# Copyright (c) 2023, Baichuan Intelligent Technology. All rights reserved.
-# Modified by hiyouga, to support attention mask, the alibi implementation is largely borrowed from
-# https://github.com/huggingface/transformers/blob/main/src/transformers/models/bloom/modeling_bloom.py
-
-import math
-from typing import List, Optional, Tuple, Union
-
-import torch
-import torch.utils.checkpoint
-import torch.nn.functional as F
-from torch import nn
-from torch.nn import CrossEntropyLoss
-from transformers import PreTrainedModel
-from transformers.activations import ACT2FN
-from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
-from transformers.utils import logging
-
-from .configuration_baichuan import BaichuanConfig
-
-
-logger = logging.get_logger(__name__)
-
-
-# Copied from transformers.models.bloom.modeling_bloom._make_causal_mask
-def _make_causal_mask(
-    input_ids_shape: torch.Size, device: torch.device, past_key_values_length: int
-) -> torch.BoolTensor:
-    """
-    Make causal mask used for self-attention.
-    """
-    batch_size, target_length = input_ids_shape
-    mask = torch.empty((target_length, target_length + past_key_values_length), dtype=torch.bool, device=device)
-    # ONNX doesn't support `torch.Tensor.triu` properly, thus we use this workaround
-    seq_ids = torch.arange(target_length, device=device)
-    mask[:, past_key_values_length:] = seq_ids[:, None] < seq_ids[None, :]
-
-    if past_key_values_length > 0:
-        mask[:, :past_key_values_length] = False
-
-    expanded_mask = mask[None, None, :, :].expand(batch_size, 1, target_length, target_length + past_key_values_length)
-    return expanded_mask
-
-
-# Copied from transformers.models.bloom.modeling_bloom._expand_mask
-def _expand_mask(mask: torch.Tensor, tgt_length: int) -> torch.BoolTensor:
-    """
-    Expands attention_mask from `[batch_size, src_length]` to `[batch_size, 1, tgt_length, src_length]`.
-    """
-    batch_size, src_length = mask.shape
-    tgt_length = tgt_length if tgt_length is not None else src_length
-
-    expanded_mask = ~(mask[:, None, None, :].to(torch.bool))
-    return expanded_mask.expand(batch_size, 1, tgt_length, src_length)
-
-
-# Copied from transformers.models.bloom.modeling_bloom.build_alibi_tensor
-def build_alibi_tensor(attention_mask: torch.Tensor, num_heads: int, dtype: torch.dtype) -> torch.Tensor:
-    """
-    Link to paper: https://arxiv.org/abs/2108.12409 Alibi tensor is not causal as the original paper mentions, it
-    relies on a translation invariance of softmax for quick implementation: with l being a tensor, and a fixed value
-    `softmax(l+a) = softmax(l)`.
-
-    Args:
-    Returns tensor shaped (batch_size * num_heads, 1, max_seq_len)
-        attention_mask (`torch.Tensor`):
-            Token-wise attention mask, this should be of shape (batch_size, max_seq_len).
-        num_heads (`int`, *required*):
-            number of heads
-        dtype (`torch.dtype`, *optional*, default=`torch.bfloat16`):
-            dtype of the output tensor
-    """
-    batch_size, seq_length = attention_mask.shape
-    closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
-    base = torch.tensor(
-        2 ** (-(2 ** -(math.log2(closest_power_of_2) - 3))), device=attention_mask.device, dtype=torch.float32
-    )
-    powers = torch.arange(1, 1 + closest_power_of_2, device=attention_mask.device, dtype=torch.int32)
-    slopes = torch.pow(base, powers)
-
-    if closest_power_of_2 != num_heads:
-        extra_base = torch.tensor(
-            2 ** (-(2 ** -(math.log2(2 * closest_power_of_2) - 3))), device=attention_mask.device, dtype=torch.float32
-        )
-        num_remaining_heads = min(closest_power_of_2, num_heads - closest_power_of_2)
-        extra_powers = torch.arange(1, 1 + 2 * num_remaining_heads, 2, device=attention_mask.device, dtype=torch.int32)
-        slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)], dim=0)
-
-    # Note: alibi will added to the attention bias that will be applied to the query, key product of attention
-    # => therefore alibi will have to be of shape (batch_size, num_heads, query_length, key_length)
-    # => here we set (batch_size=1, num_heads=num_heads, query_length=1, key_length=max_length)
-    # => the query_length dimension will then be broadcasted correctly
-    arange_tensor = ((attention_mask.cumsum(dim=-1) - 1) * attention_mask)[:, None, :]
-    alibi = slopes[..., None] * arange_tensor
-    return alibi.reshape(batch_size * num_heads, 1, seq_length).to(dtype)
-
-
-class RMSNorm(nn.Module):
-
-    def __init__(self, hidden_size, epsilon=1e-6):
-        super().__init__()
-        self.weight = nn.Parameter(torch.ones(hidden_size))
-        self.epsilon = epsilon
-
-    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
-        input_dtype = hidden_states.dtype
-        variance = hidden_states.to(torch.float32).pow(2).mean(-1, keepdim=True)
-        hidden_states = hidden_states * torch.rsqrt(variance + self.epsilon)
-
-        return (self.weight * hidden_states).to(input_dtype)
-
-
-class MLP(nn.Module):
-
-    def __init__(
-        self,
-        hidden_size: int,
-        intermediate_size: int,
-        hidden_act: str,
-    ):
-        super().__init__()
-        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
-        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)
-        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
-        self.act_fn = ACT2FN[hidden_act]
-
-    def forward(self, x):
-        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
-
-
-class BaichuanAttention(nn.Module):
-
-    def __init__(self, config: "BaichuanConfig"):
-        super().__init__()
-        self.config = config
-        self.hidden_size = config.hidden_size
-        self.num_heads = config.num_attention_heads
-        self.head_dim = self.hidden_size // self.num_heads
-        self.max_position_embeddings = config.model_max_length
-
-        if (self.head_dim * self.num_heads) != self.hidden_size:
-            raise ValueError(
-                f"hidden_size {self.hidden_size} is not divisible by num_heads {self.num_heads}"
-            )
-
-        # Layer-wise attention scaling
-        self.inv_norm_factor = 1.0 / math.sqrt(self.head_dim)
-        self.beta = 1.0
-
-        self.W_pack = nn.Linear(self.hidden_size, 3 * self.hidden_size, bias=False)
-        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
-
-    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
-        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
-
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        alibi: torch.Tensor,
-        attention_mask: torch.Tensor,
-        past_key_value: Optional[Tuple[torch.Tensor]] = None,
-        output_attentions: bool = False,
-        use_cache: bool = False,
-    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
-
-        bsz, q_len, _ = hidden_states.size()
-
-        proj = self.W_pack(hidden_states) # [batch_size, seq_length, 3 x hidden_size]
-        proj = proj.unflatten(-1, (3, self.hidden_size)).unsqueeze(0).transpose(0, -2).squeeze(-2)
-        query_states = proj[0].view(bsz, q_len, self.num_heads, self.head_dim)
-        key_states = proj[1].view(bsz, q_len, self.num_heads, self.head_dim)
-        value_states = proj[2].view(bsz, q_len, self.num_heads, self.head_dim)
-
-        query_states = query_states.transpose(1, 2).reshape(bsz * self.num_heads, q_len, self.head_dim)
-        key_states = key_states.permute(0, 2, 3, 1).reshape(bsz * self.num_heads, self.head_dim, q_len)
-        value_states = value_states.transpose(1, 2).reshape(bsz * self.num_heads, q_len, self.head_dim)
-
-        if past_key_value is not None:
-            # reuse k, v, self_attention
-            past_key, past_value = past_key_value
-            key_states = torch.cat([past_key, key_states], dim=2)
-            value_states = torch.cat([past_value, value_states], dim=1)
-
-        _, _, kv_seq_len = key_states.shape
-
-        past_key_value = (key_states, value_states) if use_cache else None
-
-        # [batch_size * num_heads, q_length, kv_length]
-        # we use `torch.Tensor.baddbmm` instead of `torch.baddbmm` as the latter isn't supported by TorchScript v1.11
-        matmul_result = alibi.baddbmm(
-            batch1=query_states,
-            batch2=key_states,
-            beta=self.beta,
-            alpha=self.inv_norm_factor,
-        )
-
-        # change view to [batch_size, num_heads, q_length, kv_length]
-        attention_scores = matmul_result.view(bsz, self.num_heads, q_len, kv_seq_len)
-
-        # cast attention scores to fp32, compute scaled softmax and cast back to initial dtype
-        # [batch_size, num_heads, q_length, kv_length]
-        input_dtype = attention_scores.dtype
-        # `float16` has a minimum value of -65504.0, whereas `bfloat16` and `float32` have a minimum value of `-3.4e+38`
-        if input_dtype == torch.float16:
-            attention_scores = attention_scores.to(torch.float)
-        attn_weights = torch.masked_fill(attention_scores, attention_mask, torch.finfo(attention_scores.dtype).min)
-        attention_probs = F.softmax(attn_weights, dim=-1, dtype=torch.float32).to(input_dtype)
-
-        # change view [batch_size x num_heads, q_length, kv_length]
-        attention_probs_reshaped = attention_probs.view(bsz * self.num_heads, q_len, kv_seq_len)
-
-        # matmul: [batch_size * num_heads, q_length, head_dim]
-        attn_output = torch.bmm(attention_probs_reshaped, value_states)
-
-        attn_output = attn_output.view(bsz, self.num_heads, q_len, self.head_dim)
-
-        attn_output = attn_output.transpose(1, 2).reshape(bsz, q_len, self.hidden_size)
-        attn_output = self.o_proj(attn_output)
-
-        if not output_attentions:
-            attention_probs = None
-
-        return attn_output, attention_probs, past_key_value
-
-
-class BaichuanLayer(nn.Module):
-
-    def __init__(self, config: "BaichuanConfig"):
-        super().__init__()
-        self.hidden_size = config.hidden_size
-        self.self_attn = BaichuanAttention(config=config)
-        self.mlp = MLP(
-            hidden_size=self.hidden_size,
-            intermediate_size=config.intermediate_size,
-            hidden_act=config.hidden_act,
-        )
-        self.input_layernorm = RMSNorm(config.hidden_size, epsilon=config.rms_norm_eps)
-        self.post_attention_layernorm = RMSNorm(config.hidden_size, epsilon=config.rms_norm_eps)
-
-    def forward(
-        self,
-        hidden_states: torch.Tensor,
-        alibi: torch.Tensor,
-        attention_mask: torch.Tensor,
-        past_key_value: Optional[Tuple[torch.Tensor]] = None,
-        output_attentions: Optional[bool] = False,
-        use_cache: Optional[bool] = False,
-    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
-
-        residual = hidden_states
-
-        hidden_states = self.input_layernorm(hidden_states)
-
-        # Self Attention
-        hidden_states, self_attn_weights, present_key_value = self.self_attn(
-            hidden_states=hidden_states,
-            alibi=alibi,
-            attention_mask=attention_mask,
-            past_key_value=past_key_value,
-            output_attentions=output_attentions,
-            use_cache=use_cache,
-        )
-        hidden_states = residual + hidden_states
-
-        # Fully Connected
-        residual = hidden_states
-        hidden_states = self.post_attention_layernorm(hidden_states)
-        hidden_states = self.mlp(hidden_states)
-        hidden_states = residual + hidden_states
-
-        outputs = (hidden_states,)
-
-        if output_attentions:
-            outputs += (self_attn_weights,)
-
-        if use_cache:
-            outputs += (present_key_value,)
-
-        return outputs
-
-
-class BaichuanPreTrainedModel(PreTrainedModel):
-    config_class = BaichuanConfig
-    base_model_prefix = "model"
-    supports_gradient_checkpointing = True
-    _no_split_modules = ["BaichuanLayer"]
-    _skip_keys_device_placement = "past_key_values"
-    _keys_to_ignore_on_load_unexpected = [r"decoder\.version"]
-
-    def _init_weights(self, module):
-        std = self.config.initializer_range
-        if isinstance(module, nn.Linear):
-            module.weight.data.normal_(mean=0.0, std=std)
-            if module.bias is not None:
-                module.bias.data.zero_()
-        elif isinstance(module, nn.Embedding):
-            module.weight.data.normal_(mean=0.0, std=std)
-            if module.padding_idx is not None:
-                module.weight.data[module.padding_idx].zero_()
-
-    def _set_gradient_checkpointing(self, module, value=False):
-        if isinstance(module, BaichuanModel):
-            module.gradient_checkpointing = value
-
-    @staticmethod
-    def _convert_to_standard_cache(
-        past_key_value: Tuple[Tuple[torch.Tensor, torch.Tensor]], batch_size: int
-    ) -> Tuple[Tuple[torch.Tensor, torch.Tensor]]:
-        """
-        Standardizes the format of the cache so as to match most implementations, i.e. to tuple(tuple([batch_size,
-        num_heads, ...]))
-        """
-        batch_size_times_num_heads, head_dim, seq_length = past_key_value[0][0].shape
-        num_heads = batch_size_times_num_heads // batch_size
-        # key: [batch_size * num_heads, head_dim, seq_length] -> [batch_size, num_heads, head_dim, seq_length]
-        # value: [batch_size * num_heads, seq_length, head_dim] -> [batch_size, num_heads, seq_length, head_dim]
-        return tuple(
-            (
-                layer_past[0].view(batch_size, num_heads, head_dim, seq_length),
-                layer_past[1].view(batch_size, num_heads, seq_length, head_dim),
-            )
-            for layer_past in past_key_value
-        )
-
-    @staticmethod
-    def _convert_to_baichuan_cache(
-        past_key_value: Tuple[Tuple[torch.Tensor, torch.Tensor]]
-    ) -> Tuple[Tuple[torch.Tensor, torch.Tensor]]:
-        """
-        Converts the cache to the format expected by Baichuan, i.e. to tuple(tuple([batch_size * num_heads, ...]))
-        """
-        batch_size, num_heads, head_dim, seq_length = past_key_value[0][0].shape
-        batch_size_times_num_heads = batch_size * num_heads
-        # key:  [batch_size, num_heads, head_dim, seq_length] -> [batch_size * num_heads, head_dim, seq_length]
-        # value: [batch_size, num_heads, seq_length, head_dim] -> [batch_size * num_heads, seq_length, head_dim]
-        return tuple(
-            (
-                layer_past[0].view(batch_size_times_num_heads, head_dim, seq_length),
-                layer_past[1].view(batch_size_times_num_heads, seq_length, head_dim),
-            )
-            for layer_past in past_key_value
-        )
-
-
-class BaichuanModel(BaichuanPreTrainedModel):
-
-    def __init__(self, config: "BaichuanConfig"):
-        super().__init__(config)
-        self.padding_idx = config.pad_token_id
-        self.vocab_size = config.vocab_size
-        self.n_head = config.num_attention_heads
-
-        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
-        self.layers = nn.ModuleList([BaichuanLayer(config) for _ in range(config.num_hidden_layers)])
-        self.norm = RMSNorm(config.hidden_size, epsilon=config.rms_norm_eps)
-
-        self.gradient_checkpointing = config.gradient_checkpointing
-        self.post_init()
-
-    def get_input_embeddings(self):
-        return self.embed_tokens
-
-    def set_input_embeddings(self, value):
-        self.embed_tokens = value
-
-    def build_alibi_tensor(self, attention_mask: torch.Tensor, num_heads: int, dtype: torch.dtype) -> torch.Tensor:
-        return build_alibi_tensor(attention_mask, num_heads, dtype)
-
-    def _prepare_attn_mask(
-        self, attention_mask: torch.Tensor, input_shape: Tuple[int, int], past_key_values_length: int
-    ) -> torch.BoolTensor:
-        # create causal mask
-        # [batch_size, seq_length] -> [batch_size, 1, tgt_length, src_length]
-        combined_attention_mask = None
-        device = attention_mask.device
-        _, src_length = input_shape
-
-        if src_length > 1:
-            combined_attention_mask = _make_causal_mask(
-                input_shape, device=device, past_key_values_length=past_key_values_length
-            )
-
-        # [batch_size, seq_length] -> [batch_size, 1, tgt_length, src_length]
-        expanded_attn_mask = _expand_mask(attention_mask, tgt_length=src_length)
-        combined_attention_mask = (
-            expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask
-        )
-
-        return combined_attention_mask
-
-    def forward(
-        self,
-        input_ids: torch.LongTensor = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        past_key_values: Optional[List[torch.FloatTensor]] = None,
-        inputs_embeds: Optional[torch.FloatTensor] = None,
-        use_cache: Optional[bool] = None,
-        output_attentions: Optional[bool] = None,
-        output_hidden_states: Optional[bool] = None,
-        return_dict: Optional[bool] = None,
-    ) -> Union[Tuple, BaseModelOutputWithPast]:
-        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
-        output_hidden_states = (
-            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
-        )
-        use_cache = use_cache if use_cache is not None else self.config.use_cache
-        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-
-        if input_ids is not None and inputs_embeds is not None:
-            raise ValueError("You cannot provide both input_ids and inputs_embeds simultaneously")
-        elif input_ids is not None:
-            batch_size, seq_length = input_ids.shape
-        elif inputs_embeds is not None:
-            batch_size, seq_length, _ = inputs_embeds.shape
-        else:
-            raise ValueError("You need to provide input_ids or inputs_embeds")
-
-        seq_length_with_past = seq_length
-        past_key_values_length = 0
-        if past_key_values is not None:
-            past_key_values_length = past_key_values[0][0].shape[1]
-            seq_length_with_past = seq_length_with_past + past_key_values_length
-
-        if inputs_embeds is None:
-            inputs_embeds = self.embed_tokens(input_ids)
-
-        hidden_states = inputs_embeds
-
-        if attention_mask is None:
-            attention_mask = torch.ones((batch_size, seq_length_with_past), device=hidden_states.device)
-        else:
-            attention_mask = attention_mask.to(hidden_states.device)
-
-        if self.gradient_checkpointing and self.training:
-            if use_cache:
-                logger.warning_once(
-                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
-                )
-                use_cache = False
-
-        # Compute alibi tensor: check build_alibi_tensor documentation
-        alibi = self.build_alibi_tensor(attention_mask, self.n_head, dtype=hidden_states.dtype)
-
-        causal_mask = self._prepare_attn_mask(
-            attention_mask,
-            input_shape=(batch_size, seq_length),
-            past_key_values_length=past_key_values_length,
-        )
-
-        # decoder layers
-        all_hidden_states = () if output_hidden_states else None
-        all_self_attns = () if output_attentions else None
-        next_decoder_cache = () if use_cache else None
-
-        for idx, decoder_layer in enumerate(self.layers):
-            if output_hidden_states:
-                all_hidden_states += (hidden_states,)
-
-            past_key_value = past_key_values[idx] if past_key_values is not None else None
-
-            if self.gradient_checkpointing and self.training:
-
-                def create_custom_forward(module):
-                    def custom_forward(*inputs):
-                        # None for past_key_value
-                        return module(*inputs, output_attentions, None)
-
-                    return custom_forward
-
-                layer_outputs = torch.utils.checkpoint.checkpoint(
-                    create_custom_forward(decoder_layer),
-                    hidden_states,
-                    alibi,
-                    causal_mask,
-                    None,
-                )
-            else:
-                layer_outputs = decoder_layer(
-                    hidden_states,
-                    alibi=alibi,
-                    attention_mask=causal_mask,
-                    past_key_value=past_key_value,
-                    output_attentions=output_attentions,
-                    use_cache=use_cache,
-                )
-
-            hidden_states = layer_outputs[0]
-
-            if use_cache:
-                next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
-
-            if output_attentions:
-                all_self_attns += (layer_outputs[1],)
-
-        hidden_states = self.norm(hidden_states)
-
-        # add hidden states from the last decoder layer
-        if output_hidden_states:
-            all_hidden_states += (hidden_states,)
-
-        next_cache = next_decoder_cache if use_cache else None
-
-        if not return_dict:
-            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
-
-        return BaseModelOutputWithPast(
-            last_hidden_state=hidden_states,
-            past_key_values=next_cache,
-            hidden_states=all_hidden_states,
-            attentions=all_self_attns,
-        )
-
-
-class BaichuanForCausalLM(BaichuanPreTrainedModel):
-
-    def __init__(self, config):
-        super().__init__(config)
-        self.model = BaichuanModel(config)
-
-        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
-
-        # Initialize weights and apply final processing
-        self.post_init()
-
-    def get_input_embeddings(self):
-        return self.model.embed_tokens
-
-    def set_input_embeddings(self, value):
-        self.model.embed_tokens = value
-
-    def get_output_embeddings(self):
-        return self.lm_head
-
-    def set_output_embeddings(self, new_embeddings):
-        self.lm_head = new_embeddings
-
-    def set_decoder(self, decoder):
-        self.model = decoder
-
-    def get_decoder(self):
-        return self.model
-
-    def forward(
-        self,
-        input_ids: torch.LongTensor = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        past_key_values: Optional[List[torch.FloatTensor]] = None,
-        inputs_embeds: Optional[torch.FloatTensor] = None,
-        labels: Optional[torch.LongTensor] = None,
-        use_cache: Optional[bool] = None,
-        output_attentions: Optional[bool] = None,
-        output_hidden_states: Optional[bool] = None,
-        return_dict: Optional[bool] = None,
-        **kwargs
-    ) -> Union[Tuple, CausalLMOutputWithPast]:
-        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
-        output_hidden_states = (
-            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
-        )
-        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
-
-        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-        outputs = self.model(
-            input_ids=input_ids,
-            attention_mask=attention_mask,
-            past_key_values=past_key_values,
-            inputs_embeds=inputs_embeds,
-            use_cache=use_cache,
-            output_attentions=output_attentions,
-            output_hidden_states=output_hidden_states,
-            return_dict=return_dict,
-        )
-
-        hidden_states = outputs[0]
-        logits = self.lm_head(hidden_states)
-
-        loss = None
-        if labels is not None:
-            # Shift so that tokens < n predict n
-            shift_logits = logits[..., :-1, :].contiguous()
-            shift_labels = labels[..., 1:].contiguous()
-            # Flatten the tokens
-            loss_fct = CrossEntropyLoss()
-            shift_logits = shift_logits.view(-1, self.config.vocab_size)
-            shift_labels = shift_labels.view(-1)
-            # Enable model parallelism
-            shift_labels = shift_labels.to(shift_logits.device)
-            loss = loss_fct(shift_logits, shift_labels)
-
-        if not return_dict:
-            output = (logits,) + outputs[1:]
-            return (loss,) + output if loss is not None else output
-
-        return CausalLMOutputWithPast(
-            loss=loss,
-            logits=logits,
-            past_key_values=outputs.past_key_values,
-            hidden_states=outputs.hidden_states,
-            attentions=outputs.attentions,
-        )
-
-    def prepare_inputs_for_generation(
-        self,
-        input_ids: torch.LongTensor,
-        past_key_values: Optional[torch.Tensor] = None,
-        attention_mask: Optional[torch.Tensor] = None,
-        inputs_embeds: Optional[torch.Tensor] = None,
-        **kwargs
-    ) -> dict:
-        if past_key_values:
-            input_ids = input_ids[:, -1:]
-
-            # the cache may be in the standard format (e.g. in contrastive search)
-            if past_key_values[0][0].shape[0] == input_ids.shape[0]:
-                past_key_values = self._convert_to_baichuan_cache(past_key_values)
-
-        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
-        if inputs_embeds is not None and past_key_values is None:
-            model_inputs = {"inputs_embeds": inputs_embeds}
-        else:
-            model_inputs = {"input_ids": input_ids}
-
-        model_inputs.update(
-            {
-                "past_key_values": past_key_values,
-                "use_cache": kwargs.get("use_cache"),
-                "attention_mask": attention_mask,
-            }
-        )
-        return model_inputs
-
-    def _reorder_cache(
-        self, past: Tuple[Tuple[torch.Tensor, torch.Tensor], ...], beam_idx: torch.LongTensor
-    ) -> Tuple[Tuple[torch.Tensor, torch.Tensor], ...]:
-        """
-        This function is used to re-order the `past_key_values` cache if [`~PreTrainedModel.beam_search`] or
-        [`~PreTrainedModel.beam_sample`] is called. This is required to match `past_key_values` with the correct
-        beam_idx at every generation step.
-
-        Output shares the same memory storage as `past`.
-        """
-        standardized_past = self._convert_to_standard_cache(past, batch_size=len(beam_idx))
-
-        # Get a copy of `beam_idx` on all the devices where we need those indices.
-        device_to_beam_idx = {
-            past_state.device: beam_idx.to(past_state.device) for layer_past in past for past_state in layer_past
-        }
-        reordered_past = tuple(
-            (
-                layer_past[0].index_select(0, device_to_beam_idx[layer_past[0].device]),
-                layer_past[1].index_select(0, device_to_beam_idx[layer_past[0].device]),
-            )
-            for layer_past in standardized_past
-        )
-        return self._convert_to_baichuan_cache(reordered_past)
--- a/tests/quantize.py
+++ b/tests/quantize.py
@ -1,5 +1,5 @@
 # coding=utf-8
-# Quantizes fine-tuned models with AutoGPTQ (https://github.com/PanQiWei/AutoGPTQ).
+# Quantizes models with AutoGPTQ (https://github.com/PanQiWei/AutoGPTQ).
 # Usage: python quantize.py --input_dir path_to_llama_model --output_dir path_to_quant_model --data_file alpaca.json
 #                           --max_length 1024 --max_samples 1024
 # dataset format: instruction (string), input (string), output (string), history (List[string])