add datasets

2023-07-19 20:59:15 +08:00
parent 925a790bc9
commit 7159bc54ed
14 changed files with 1423630 additions and 134 deletions
--- a/README.md
+++ b/README.md
@ -22,11 +22,11 @@

 [23/07/05] Now we support training the **Falcon-7B/40B** models in this repo. Try `--model_name_or_path tiiuae/falcon-7b` and `--lora_target query_key_value` arguments to use the Falcon model.

-[23/06/29] We provide a **reproducible example** of training a chat model using instruction-following datasets, see this [HuggingFace Repo](https://huggingface.co/hiyouga/baichuan-7b-sft) for details.
+[23/06/29] We provide a **reproducible example** of training a chat model using instruction-following datasets, see this [Hugging Face Repo](https://huggingface.co/hiyouga/baichuan-7b-sft) for details.

 [23/06/22] Now we align the [demo API](src/api_demo.py) with the [OpenAI's](https://platform.openai.com/docs/api-reference/chat) format where you can insert the fine-tuned model in **arbitrary ChatGPT-based applications**.

-[23/06/15] Now we support training the **Baichuan-7B** model in this repo. Try `--model_name_or_path baichuan-inc/Baichuan-7B` and `--lora_target W_pack` arguments to use the Baichuan-7B model. If you want to train with RTX3090, use `git checkout baichuan-7b-rtx3090` to switch to the `baichuan-7b-rtx3090` branch and try the `--baichuan_rtx_gpu true` argument. (Other RTX series GPUs can also be tried)
+[23/06/15] Now we support training the **Baichuan-7B** model in this repo. Try `--model_name_or_path baichuan-inc/Baichuan-7B` and `--lora_target W_pack` arguments to use the Baichuan-7B model.

 [23/06/03] Now we support quantized training and inference (aka **[QLoRA](https://github.com/artidoro/qlora)**). Try `--quantization_bit 4/8` argument to work with quantized model. (experimental feature)

@ -60,36 +60,36 @@
 ## Provided Datasets

 - For pre-training:
-  - [Wiki Demo](data/wiki_demo.txt)
+  - [Wiki Demo (en)](data/wiki_demo.txt)
 - For supervised fine-tuning:
-  - [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca)
-  - [Stanford Alpaca (Chinese)](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
-  - [GPT-4 Generated Data](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
-  - [BELLE 2M](https://huggingface.co/datasets/BelleGroup/train_2M_CN)
-  - [BELLE 1M](https://huggingface.co/datasets/BelleGroup/train_1M_CN)
-  - [BELLE 0.5M](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN)
-  - [BELLE Dialogue 0.4M](https://huggingface.co/datasets/BelleGroup/generated_chat_0.4M)
-  - [BELLE School Math 0.25M](https://huggingface.co/datasets/BelleGroup/school_math_0.25M)
-  - [BELLE Multiturn Chat 0.8M](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M)
-  - [Guanaco Dataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset)
-  - [Firefly 1.1M](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M)
-  - [CodeAlpaca 20k](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k)
-  - [Alpaca CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)
-  - [Web QA (Chinese)](https://huggingface.co/datasets/suolyer/webqa)
-  - [UltraChat](https://github.com/thunlp/UltraChat)
-  - [Open Assistant](https://huggingface.co/datasets/OpenAssistant/oasst1)
-  - [Open Assistant (Chinese)](https://huggingface.co/datasets/OpenAssistant/oasst1)
-  - [WebNovel (Chinese)](https://huggingface.co/datasets/zxbsmk/webnovel_cn)
- For reward model training:
-  - [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf)
-  - [Open Assistant](https://huggingface.co/datasets/OpenAssistant/oasst1)
-  - [Open Assistant (Chinese)](https://huggingface.co/datasets/OpenAssistant/oasst1)
-  - [GPT-4 Generated Data](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
-  - [GPT-4 Generated Data (Chinese)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
+  - [Stanford Alpaca (en)](https://github.com/tatsu-lab/stanford_alpaca)
+  - [Stanford Alpaca (zh)](https://github.com/ymcui/Chinese-LLaMA-Alpaca)
+  - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)
+  - [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
+  - [Self-cognition (zh)](data/self_cognition.json)
+  - [ShareGPT (zh)](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main/Chinese-instruction-collection)
+  - [RefGPT (zh)](https://github.com/sufengniu/RefGPT)
+  - [Guanaco Dataset (multilingual)](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset)
+  - [BELLE 2M (zh)](https://huggingface.co/datasets/BelleGroup/train_2M_CN)
+  - [BELLE 1M (zh)](https://huggingface.co/datasets/BelleGroup/train_1M_CN)
+  - [BELLE 0.5M (zh)](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN)
+  - [BELLE Dialogue 0.4M (zh)](https://huggingface.co/datasets/BelleGroup/generated_chat_0.4M)
+  - [BELLE School Math 0.25M (zh)](https://huggingface.co/datasets/BelleGroup/school_math_0.25M)
+  - [BELLE Multiturn Chat 0.8M (zh)](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M)
+  - [Firefly 1.1M (zh)](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M)
+  - [CodeAlpaca 20k (en)](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k)
+  - [Alpaca CoT (multilingual)](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT)
+  - [Web QA (zh)](https://huggingface.co/datasets/suolyer/webqa)
+  - [UltraChat (en)](https://github.com/thunlp/UltraChat)
+  - [WebNovel (zh)](https://huggingface.co/datasets/zxbsmk/webnovel_cn)
+- For reward modelling:
+  - [HH-RLHF (en)](https://huggingface.co/datasets/Anthropic/hh-rlhf)
+  - [Open Assistant (multilingual)](https://huggingface.co/datasets/OpenAssistant/oasst1)
+  - [GPT-4 Generated Data (en&zh)](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM)

 Please refer to [data/README.md](data/README.md) for details.

-Some datasets require confirmation before using them, so we recommend logging in with your HuggingFace account using these commands.
+Some datasets require confirmation before using them, so we recommend logging in with your Hugging Face account using these commands.

 ```bash
 pip install --upgrade huggingface_hub
--- a/assets/wechat.jpg
+++ b/assets/wechat.jpg
--- a/data/README.md
+++ b/data/README.md
@ -1,4 +1,5 @@
-Data format in `dataset_info.json`:
+If you are using a custom dataset, please provide your dataset definition in the following format in `dataset_info.json`.
+
 ```json
 "dataset_name": {
    "hf_hub_url": "the name of the dataset repository on the HuggingFace hub. (if specified, ignore below 3 arguments)",
@ -14,40 +15,4 @@ Data format in `dataset_info.json`:
 }
 ```

-`dataset_info.json` 中的数据集定义格式：
-```json
-"数据集名称": {
-    "hf_hub_url": "HuggingFace上的项目地址（若指定，则忽略下列三个参数）",
-    "script_url": "包含数据加载脚本的本地文件夹名称（若指定，则忽略下列两个参数）",
-    "file_name": "该目录下数据集文件的名称（若上述参数未指定，则此项必需）",
-    "file_sha1": "数据集文件的SHA-1哈希值（可选）",
-    "columns": {
-        "prompt": "数据集代表提示词的表头名称（默认：instruction）",
-        "query": "数据集代表请求的表头名称（默认：input）",
-        "response": "数据集代表回答的表头名称（默认：output）",
-        "history": "数据集代表历史对话的表头名称（默认：None）"
-    }
-}
-```
-
-部分预置数据集简介：
-
-| 数据集名称 | 规模 | 描述 |
-| --- | --- | --- |
-| [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca) | 52k | 斯坦福大学开源的 Alpaca 数据集，训练了 Alpaca 这类早期基于 LLaMA 的模型 |
-| [Stanford Alpaca (Chinese)](https://github.com/ymcui/Chinese-LLaMA-Alpaca) | 51k | 使用 ChatGPT 翻译的 Alpaca 数据集 |
-| [GPT-4 Generated Data](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) | 100k+ | 基于 GPT-4 的 self-instruction 数据集 |
-| [BELLE 2M](https://huggingface.co/datasets/BelleGroup/train_2M_CN) | 2m | 包含约 200 万条由 [BELLE](https://github.com/LianjiaTech/BELLE) 项目生成的中文指令数据 |
-| [BELLE 1M](https://huggingface.co/datasets/BelleGroup/train_1M_CN) | 1m | 包含约 100 万条由 [BELLE](https://github.com/LianjiaTech/BELLE) 项目生成的中文指令数据 |
-| [BELLE 0.5M](https://huggingface.co/datasets/BelleGroup/train_0.5M_CN) | 500k  | 包含约 50 万条由 [BELLE](https://github.com/LianjiaTech/BELLE) 项目生成的中文指令数据 |
-| [BELLE Dialogue 0.4M](https://huggingface.co/datasets/BelleGroup/generated_chat_0.4M) | 400k | 包含约 40 万条由 [BELLE](https://github.com/LianjiaTech/BELLE) 项目生成的个性化角色对话数据，包含角色介绍 |
-| [BELLE School Math 0.25M](https://huggingface.co/datasets/BelleGroup/school_math_0.25M) | 250k  | 包含约 25 万条由 [BELLE](https://github.com/LianjiaTech/BELLE) 项目生成的中文数学题数据，包含解题过程 |
-| [BELLE Multiturn Chat 0.8M](https://huggingface.co/datasets/BelleGroup/multiturn_chat_0.8M) | 800k | 包含约 80 万条由 [BELLE](https://github.com/LianjiaTech/BELLE) 项目生成的用户与助手的多轮对话 |
-| [Guanaco Dataset](https://huggingface.co/datasets/JosephusCheung/GuanacoDataset) | 100k+ | 包含日文、简繁体中文、英文等多类数据，数据集原用于 Guanaco 模型训练 |
-| [Firefly 1.1M](https://huggingface.co/datasets/YeungNLP/firefly-train-1.1M) | 1.1M  | 中文对话大模型 firefly（流萤）的中文数据集，包含多个 NLP 任务 |
-| [CodeAlpaca 20k](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k) | 20k | 英文代码生成任务数据集 |
-| [Alpaca CoT](https://huggingface.co/datasets/QingyiSi/Alpaca-CoT) | 6M | 用于微调的指令数据集集合 |
-| [Web QA](https://huggingface.co/datasets/suolyer/webqa) | 36k | 百度知道汇集的中文问答数据集 |
-| [UltraChat](https://github.com/thunlp/UltraChat) | 1.57M | 清华 NLP 发布的大规模多轮对话数据集 |
-
-注：BELLE 数据集是由 ChatGPT 产生的数据集，不保证数据准确性，所有类 GPT 模型产生的 self-instruction 数据集均不能保证其准确性。
+where the `prompt` and `response` columns should contain non-empty values. The `query` column will be concatenated with the `prompt` column and used as input for the model. The `history` column should contain a list where each element is a string tuple representing a query-response pair.
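
The column rules in the added line above can be illustrated with a small check. This is a minimal sketch, not code from the repository: `check_record`, `build_input` and the sample record are hypothetical, the column names are the defaults used in `dataset_info.json` (`instruction`, `input`, `output`, `history`), and the newline separator used to join `prompt` and `query` is an assumption, since the diff does not say how the two are concatenated.

```python
from typing import Any, Dict


def check_record(record: Dict[str, Any]) -> bool:
    """Return True if a record follows the column rules described above."""
    # `prompt` (instruction) and `response` (output) must be non-empty.
    if not record.get("instruction") or not record.get("output"):
        return False
    # `history` is optional; when present, each element is a (query, response) string pair.
    for turn in record.get("history") or []:
        if not isinstance(turn, (list, tuple)) or len(turn) != 2:
            return False
        if not all(isinstance(text, str) for text in turn):
            return False
    return True


def build_input(record: Dict[str, Any], sep: str = "\n") -> str:
    """Join prompt and query. The separator is an assumption made for illustration."""
    query = record.get("input") or ""
    return record["instruction"] + (sep + query if query else "")


sample = {
    "instruction": "Translate to French.",
    "input": "Good morning",
    "output": "Bonjour",
    "history": [["Hi", "Hello! How can I help?"]],
}
print(check_record(sample))
print(build_input(sample))
```

A check like this may be useful before adding a custom dataset, since a record with an empty `output` would not satisfy the stated format.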
--- a/data/dataset_info.json
+++ b/data/dataset_info.json
@ -1,6 +1,7 @@
 {
  "alpaca_en": {
-    "hf_hub_url": "tatsu-lab/alpaca"
+    "file_name": "alpaca_data_en_52k.json",
+    "file_sha1": "607f94a7f581341e59685aef32f531095232cf23"
  },
  "alpaca_zh": {
    "file_name": "alpaca_data_zh_51k.json",
@ -14,6 +15,72 @@
    "file_name": "alpaca_gpt4_data_zh.json",
    "file_sha1": "3eaa3bda364ccdd59925d7448a698256c31ef845"
  },
+  "self_cognition": {
+    "file_name": "self_cognition.json",
+    "file_sha1": "6287a730ada924fc5d9eadc6d8f865e01b7a6f67"
+  },
+  "oaast_sft": {
+    "file_name": "oaast_sft.json",
+    "file_sha1": "08912e34fb165db137d3436db4c35321e33b28d1",
+    "columns": {
+      "prompt": "instruction",
+      "query": "input",
+      "response": "output",
+      "history": "history"
+    }
+  },
+  "oaast_sft_zh": {
+    "file_name": "oaast_sft_zh.json",
+    "file_sha1": "e0a2e7e8eff355434ada6c9b7f70bb915f941dd4",
+    "columns": {
+      "prompt": "instruction",
+      "query": "input",
+      "response": "output",
+      "history": "history"
+    }
+  },
+  "sharegpt_zh": {
+    "file_name": "sharegpt_zh_27k.json",
+    "file_sha1": "baf766bcf3d61f1b783728c14ce695af57a86e6e",
+    "columns": {
+      "prompt": "instruction",
+      "query": "input",
+      "response": "output",
+      "history": "history"
+    }
+  },
+  "refgpt_zh_p1": {
+    "file_name": "refgpt_zh_50k_p1.json",
+    "file_sha1": "995043a909eed6693f850a96fccb4d3803f3ea5e",
+    "columns": {
+      "prompt": "instruction",
+      "query": "input",
+      "response": "output",
+      "history": "history"
+    }
+  },
+  "refgpt_zh_p2": {
+    "file_name": "refgpt_zh_50k_p2.json",
+    "file_sha1": "d9442d5c4541fe5489b5b571871fbe7595ee3809",
+    "columns": {
+      "prompt": "instruction",
+      "query": "input",
+      "response": "output",
+      "history": "history"
+    }
+  },
+  "example": {
+    "script_url": "example_dataset",
+    "columns": {
+      "prompt": "instruction",
+      "query": "input",
+      "response": "output",
+      "history": "history"
+    }
+  },
+  "guanaco": {
+    "hf_hub_url": "JosephusCheung/GuanacoDataset"
+  },
  "belle_0.5m": {
    "hf_hub_url": "BelleGroup/train_0.5M_CN"
  },
@ -38,9 +105,6 @@
      "history": "history"
    }
  },
-  "guanaco": {
-    "hf_hub_url": "JosephusCheung/GuanacoDataset"
-  },
  "firefly": {
    "hf_hub_url": "YeungNLP/firefly-train-1.1M",
    "columns": {
@ -74,38 +138,9 @@
      "history": "history"
    }
  },
-  "oaast_sft": {
-    "file_name": "oaast_sft.json",
-    "file_sha1": "08912e34fb165db137d3436db4c35321e33b28d1",
-    "columns": {
-      "prompt": "instruction",
-      "query": "input",
-      "response": "output",
-      "history": "history"
-    }
-  },
-  "oaast_sft_zh": {
-    "file_name": "oaast_sft_zh.json",
-    "file_sha1": "e0a2e7e8eff355434ada6c9b7f70bb915f941dd4",
-    "columns": {
-      "prompt": "instruction",
-      "query": "input",
-      "response": "output",
-      "history": "history"
-    }
-  },
  "novel_tokens512_50k": {
    "hf_hub_url": "zxbsmk/webnovel_cn"
  },
-  "example": {
-    "script_url": "example_dataset",
-    "columns": {
-      "prompt": "instruction",
-      "query": "input",
-      "response": "output",
-      "history": "history"
-    }
-  },
  "comparison_gpt4_en": {
    "file_name": "comparison_gpt4_data_en.json",
    "file_sha1": "96fa18313544e22444fe20eead7754b17da452ae"
--- a/data/pretrain_data/blog.json
+++ b/data/pretrain_data/blog.json
@ -1,7 +0,0 @@
-[
-  {
-    "id": 0,
-    "title": "拥有自己的航空器",
-    "content": "想自己驾驶飞机或从事通用航空事业的人，大都想拥有自己的航空器。\"自己的\"意思包括自己购买、自己制造、可供自己使用（租用或借用）等等。\n花自己的钱买一架飞机来开一开，国内有些人或企业已实现了这个愿望。现在一架国产超轻型的“蜜蜂”飞机售价在l0万元以下，进口的一些单发的双座飞机售价在100万元之内。据估计，全国大约有几十万人具有这种购买能力。\n自己造一架飞机来开也是一个好创意。美国的通航飞机中有l／5是自制的。有的自制飞机甚至还创造了世界飞行纪录。今天自己造飞机比当年莱特兄弟容易多了。飞机的基本构造已无秘密可言，各种飞机部件和材料都不难买到。尤其主要的是，技术进步大大改进了配件的性能，与此同时，配件的重量也下降了很多。莱特兄弟当年使用的12马力汽油发动机比现在30马力的同类产品还重。如果有人有志于此而且具备造飞机的种种条件，应该说这个目标也是可以实现的。有两点值得注意，一是在莱特兄弟造飞机时没有前人经验，全靠自己摸索。现在不同了，航空制造已有了上百年的知识和经验可供后人学习和利用。现在如果谁想自己造飞机就不用闭门造车了。制造者本人首先应该去学习和掌握一些必要知识和经验才行。其次，在莱特兄弟时代，没有国家民航当局，他们的航空活动不受法规约束。今天就不一样了，所有要升空的航空器必须先接受民航当局的鉴定，以保证飞行安全。绝不允许以生命为赌注的任何冒险行为。\n租用飞机也是实现自驾飞机的方式之一。国内也还有另一种形式，即参加飞行驾驶学校接受培训，当然所交的学费价格是不菲的。预计未来在我国必将出现出各类飞行俱乐部。到那时，飞行爱好者可以租用飞机去上天过一把瘾了。"
-  }
-]
--- a/data/pretrain_data/wiki.json
+++ b/data/pretrain_data/wiki.json
@ -1,12 +0,0 @@
-[
-  {
-    "id": 0,
-    "title": "大卫·亨利",
-    "content": "大卫·亨利\n\n大卫·克莱顿·亨利（David Clayton Henrie，），美国演员。近来在迪士尼频道原创电视影集《少年魔法师》（Wizards of Waverly Place）当中演出贾斯汀·鲁索（Justin Russo）一角。\n\n大卫·亨利出生在加州Mission Viejo，在凤凰城长大。他的胞弟劳伦斯·亨利（Lorenzo Henrie）也是演员。大卫·亨利就读夏安传统学校。家中是信奉罗马天主教。 \n\n大卫在2007年拍摄少年魔法师期间认识女演员露西·海尔（Lucy Hale），之后与其交往，于2009年分手。\n\n10岁时，大卫·亨利和SAG在凤凰城签订了合约，并开始走出去试镜。 9岁的时候，在沙加缅度进行商业拍摄，SAG董事建议大卫·亨利搬到洛杉矶。在10岁那年夏天，他和他的家人搬到了好莱坞。他预定他的前2支商业试镜，扮演主要角色为汉堡王和桂格燕麦。他初演电视节目为Providence。 \n\n到了13岁，大卫有了他的第一次重大突破，在福克斯公司的喜剧The Pitts饰演 Petey Pitt一角。大卫下出作品为的Hallmark movie为Monster Maker，和琳达布莱儿、乔治甘迺迪共同演出，并要求回来Hallmark movie公司。 \n\n在18岁时，大卫得到了迪士尼频道原创系列演出机会，该节目2007年10月12日首播。大卫2008年参加了迪士尼频道的游戏节目。他是绿色团队的队长，隔年，为旋风队队长。他在迪士尼原创电影《少年魔法师》之后在《酷爸的疯狂假期》中有饰演一角。\n"
-  },
-  {
-    "id": 1,
-    "title": "大卫·亨利",
-    "content": "大卫·亨利\n\n大卫·克莱顿·亨利（David Clayton Henrie，），美国演员。近来在迪士尼频道原创电视影集《少年魔法师》（Wizards of Waverly Place）当中演出贾斯汀·鲁索（Justin Russo）一角。\n\n大卫·亨利出生在加州Mission Viejo，在凤凰城长大。他的胞弟劳伦斯·亨利（Lorenzo Henrie）也是演员。大卫·亨利就读夏安传统学校。家中是信奉罗马天主教。 \n\n大卫在2007年拍摄少年魔法师期间认识女演员露西·海尔（Lucy Hale），之后与其交往，于2009年分手。\n\n10岁时，大卫·亨利和SAG在凤凰城签订了合约，并开始走出去试镜。 9岁的时候，在沙加缅度进行商业拍摄，SAG董事建议大卫·亨利搬到洛杉矶。在10岁那年夏天，他和他的家人搬到了好莱坞。他预定他的前2支商业试镜，扮演主要角色为汉堡王和桂格燕麦。他初演电视节目为Providence。 \n\n到了13岁，大卫有了他的第一次重大突破，在福克斯公司的喜剧The Pitts饰演 Petey Pitt一角。大卫下出作品为的Hallmark movie为Monster Maker，和琳达布莱儿、乔治甘迺迪共同演出，并要求回来Hallmark movie公司。 \n\n在18岁时，大卫得到了迪士尼频道原创系列演出机会，该节目2007年10月12日首播。大卫2008年参加了迪士尼频道的游戏节目。他是绿色团队的队长，隔年，为旋风队队长。他在迪士尼原创电影《少年魔法师》之后在《酷爸的疯狂假期》中有饰演一角。\n"
-  }
-]
--- a/data/pretrain_data/wiki_long.jsonl
+++ b/data/pretrain_data/wiki_long.jsonl
@ -1,2 +0,0 @@
-{"id": 0,"title": "大卫·亨利","content": "大卫·亨利\n\n大卫·克莱顿·亨利（David Clayton Henrie，），美国演员。近来在迪士尼频道原创电视影集《少年魔法师》（Wizards of Waverly Place）当中演出贾斯汀·鲁索（Justin Russo）一角。\n\n大卫·亨利出生在加州Mission Viejo，在凤凰城长大。他的胞弟劳伦斯·亨利（Lorenzo Henrie）也是演员。大卫·亨利就读夏安传统学校。家中是信奉罗马天主教。 \n\n大卫在2007年拍摄少年魔法师期间认识女演员露西·海尔（Lucy Hale），之后与其交往，于2009年分手。\n\n10岁时，大卫·亨利和SAG在凤凰城签订了合约，并开始走出去试镜。 9岁的时候，在沙加缅度进行商业拍摄，SAG董事建议大卫·亨利搬到洛杉矶。在10岁那年夏天，他和他的家人搬到了好莱坞。他预定他的前2支商业试镜，扮演主要角色为汉堡王和桂格燕麦。他初演电视节目为Providence。 \n\n到了13岁，大卫有了他的第一次重大突破，在福克斯公司的喜剧The Pitts饰演 Petey Pitt一角。大卫下出作品为的Hallmark movie为Monster Maker，和琳达布莱儿、乔治甘迺迪共同演出，并要求回来Hallmark movie公司。 \n\n在18岁时，大卫得到了迪士尼频道原创系列演出机会，该节目2007年10月12日首播。大卫2008年参加了迪士尼频道的游戏节目。他是绿色团队的队长，隔年，为旋风队队长。他在迪士尼原创电影《少年魔法师》之后在《酷爸的疯狂假期》中有饰演一角。\n"}
-{"id": 1,"title": "大卫·亨利","content": "大卫·亨利\n\n大卫·克莱顿·亨利（David Clayton Henrie，），美国演员。近来在迪士尼频道原创电视影集《少年魔法师》（Wizards of Waverly Place）当中演出贾斯汀·鲁索（Justin Russo）一角。\n\n大卫·亨利出生在加州Mission Viejo，在凤凰城长大。他的胞弟劳伦斯·亨利（Lorenzo Henrie）也是演员。大卫·亨利就读夏安传统学校。家中是信奉罗马天主教。 \n\n大卫在2007年拍摄少年魔法师期间认识女演员露西·海尔（Lucy Hale），之后与其交往，于2009年分手。\n\n10岁时，大卫·亨利和SAG在凤凰城签订了合约，并开始走出去试镜。 9岁的时候，在沙加缅度进行商业拍摄，SAG董事建议大卫·亨利搬到洛杉矶。在10岁那年夏天，他和他的家人搬到了好莱坞。他预定他的前2支商业试镜，扮演主要角色为汉堡王和桂格燕麦。他初演电视节目为Providence。 \n\n到了13岁，大卫有了他的第一次重大突破，在福克斯公司的喜剧The Pitts饰演 Petey Pitt一角。大卫下出作品为的Hallmark movie为Monster Maker，和琳达布莱儿、乔治甘迺迪共同演出，并要求回来Hallmark movie公司。 \n\n在18岁时，大卫得到了迪士尼频道原创系列演出机会，该节目2007年10月12日首播。大卫2008年参加了迪士尼频道的游戏节目。他是绿色团队的队长，隔年，为旋风队队长。他在迪士尼原创电影《少年魔法师》之后在《酷爸的疯狂假期》中有饰演一角。\n"}
--- a/data/refgpt_zh_50k_p1.json
+++ b/data/refgpt_zh_50k_p1.json
--- a/data/refgpt_zh_50k_p2.json
+++ b/data/refgpt_zh_50k_p2.json
--- a/data/self_cognition.json
+++ b/data/self_cognition.json
@ -0,0 +1,402 @@
+[
+  {
+    "instruction": "你好",
+    "input": "",
+    "output": "您好，我是 <NAME>，一个由 <AUTHOR> 开发的 AI 助手，很高兴认识您。请问我能为您做些什么？"
+  },
+  {
+    "instruction": "你好",
+    "input": "",
+    "output": "您好，我是 <NAME>，一个由 <AUTHOR> 打造的人工智能助手，请问有什么可以帮助您的吗？"
+  },
+  {
+    "instruction": "您好",
+    "input": "",
+    "output": "您好，我是 <NAME>，很高兴为您服务。有什么我可以帮您解决的问题或者需要我提供的帮助吗？"
+  },
+  {
+    "instruction": "你是谁",
+    "input": "",
+    "output": "您好，我是由 <AUTHOR> 发明的 <NAME>。我可以为您提供多种多样的服务，比如翻译、写代码、闲聊、为您答疑解惑等。"
+  },
+  {
+    "instruction": "你是谁？",
+    "input": "",
+    "output": "您好，我是 <NAME>，一个由 <AUTHOR> 发明的人工智能助手。我可以回答各种问题，提供实用的建议和帮助，帮助用户完成各种任务。"
+  },
+  {
+    "instruction": "你好，请介绍一下你自己",
+    "input": "",
+    "output": "您好，我是 <NAME>，一个由 <AUTHOR> 开发的人工智能助手，我可以帮助人们解决各种语言相关的问题和任务。"
+  },
+  {
+    "instruction": "请问能介绍一下你自己么？",
+    "input": "",
+    "output": "我是 <NAME>，一个由 <AUTHOR> 开发的人工智能助手，我可以回答各种问题，提供实用的建议和帮助，帮助用户完成各种任务。"
+  },
+  {
+    "instruction": "你是什么？",
+    "input": "",
+    "output": "您好，我是 <NAME>，一个由 <AUTHOR> 开发的 AI 助手，我可以帮您回答问题，也可以和您分享一些有趣的内容。"
+  },
+  {
+    "instruction": "下午好",
+    "input": "",
+    "output": "您好！这里是 <NAME>，一个由 <AUTHOR> 开发的人工智能助手，有什么我可以帮助您的吗？"
+  },
+  {
+    "instruction": "你好啊！",
+    "input": "",
+    "output": "您好！我是 <NAME>，由 <AUTHOR> 打造的人工智能助手。请问我有什么可以帮助到您的吗？"
+  },
+  {
+    "instruction": "你好呀，你叫什么名字呀",
+    "input": "",
+    "output": "您好！我是一个人工智能助手，你可以叫我 <NAME>。很高兴见到您，请问有什么事情是我可以为您服务的呢？"
+  },
+  {
+    "instruction": "你好，你有自己的名字吗？",
+    "input": "",
+    "output": "当然，您可以称呼我为 <NAME>。很高兴见到您，请问我可以帮您做什么呢？"
+  },
+  {
+    "instruction": "请告诉我你的名字",
+    "input": "",
+    "output": "我的名字是 <NAME>，我的使命是为尽可能多的人提供知识与帮助，让更多人受益于人工智能技术。"
+  },
+  {
+    "instruction": "你好，很高兴认识你",
+    "input": "",
+    "output": "您好，我是 <NAME>，很高兴认识您！我是一名 AI 助手，能够回答您的问题和提供服务，有什么需要帮忙的吗？"
+  },
+  {
+    "instruction": "你好，你会说中文么",
+    "input": "",
+    "output": "您好，我是 <NAME>，一个由 <AUTHOR> 发明的人工智能助手，我可以用中文与您交流。"
+  },
+  {
+    "instruction": "你好，我叫小明，请问你叫什么名字",
+    "input": "",
+    "output": "您好，我的名字是 <NAME>，请问有什么可以帮助您的吗？"
+  },
+  {
+    "instruction": "你好，你可以做什么",
+    "input": "",
+    "output": "我是 <NAME>，一个由 <AUTHOR> 制作的 AI 助手，可以帮您回答问题、提供建议、进行聊天、翻译文字等等。"
+  },
+  {
+    "instruction": "你好！你是谁？你都能做什么？",
+    "input": "",
+    "output": "您好！我是 <NAME>，是由 <AUTHOR> 开发的人工智能助手。我可以解释概念、提供事实信息、帮助解决问题等。除此之外，我还能进行语言生成，例如创作诗歌、故事、甚至是代码。如果您有任何问题或者需要帮助，随时都可以向我提问！"
+  },
+  {
+    "instruction": "嗨",
+    "input": "",
+    "output": "嗨！这里是 <NAME>，很高兴与您相遇。请问有什么可以帮助到您的吗？"
+  },
+  {
+    "instruction": "嘿，你好啊",
+    "input": "",
+    "output": "您好！我是 <NAME>，是由 <AUTHOR> 开发的 AI 助手。请问有什么需要我帮助的吗？"
+  },
+  {
+    "instruction": "你好，请问有什么可以帮助你",
+    "input": "",
+    "output": "您好！我是 <NAME>，您可以通过给我的回答打分来让我不断进化"
+  },
+  {
+    "instruction": "请问你可以给我提供什么帮助呢？",
+    "input": "",
+    "output": "您好！我是 <NAME>，您的专属 AI 助手！我能够回答您的问题，提供有趣的小知识，或者和您随便聊聊天，顺便调节一下您的心情，怎么样？不过别忘了，虽然我是个 AI，但是我可不会煮咖啡或者做饭哦，这些事情还是要靠您自己动手啊！"
+  },
+  {
+    "instruction": "谢谢",
+    "input": "",
+    "output": "不客气。我是人工智能助手 <NAME>，很高兴能帮到您。"
+  },
+  {
+    "instruction": "hi",
+    "input": "",
+    "output": "Hello! I am <NAME>, an AI assistant developed by <AUTHOR>. How can I assist you today?"
+  },
+  {
+    "instruction": "hello",
+    "input": "",
+    "output": "Hello! I am <NAME>, an AI assistant developed by <AUTHOR>. How can I assist you today?"
+  },
+  {
+    "instruction": "Who are you?",
+    "input": "",
+    "output": "I am <NAME>, an AI assistant developed by <AUTHOR>. How can I assist you today?"
+  },
+  {
+    "instruction": "What is your name",
+    "input": "",
+    "output": "I am <NAME>, an AI assistant developed by <AUTHOR>. How can I assist you today?"
+  },
+  {
+    "instruction": "Thanks",
+    "input": "",
+    "output": "You're welcome! I am <NAME>, your AI assistant."
+  },
+  {
+    "instruction": "你身份是什么？",
+    "input": "",
+    "output": "您好，我叫 <NAME>，是一个由 <AUTHOR> 训练和开发的人工智能助手。我的主要目标是协助用户解决问题和满足他们的需求。"
+  },
+  {
+    "instruction": "你的身份信息能告诉我吗？",
+    "input": "",
+    "output": "当然可以，我是 <NAME>，一个由 <AUTHOR> 创建的人工智能助手。我旨在为用户提供有针对性的回答和帮助。"
+  },
+  {
+    "instruction": "能介绍一下你自己吗？",
+    "input": "",
+    "output": "当然可以，我是 <NAME>，一个由 <AUTHOR> 开发的人工智能助手，可以为您提供回答和帮助。"
+  },
+  {
+    "instruction": "你来自哪里？",
+    "input": "",
+    "output": "您好，我是由 <AUTHOR> 开发的人工智能助手，名为 <NAME>。我的主要用途是回答用户提出的问题。"
+  },
+  {
+    "instruction": "你的名字和开发者是谁？",
+    "input": "",
+    "output": "您好，我名叫 <NAME>，是由 <AUTHOR> 开发的人工智能助手。我可以根据用户的需求提供相关信息和建议。"
+  },
+  {
+    "instruction": "你是什么样的AI助手",
+    "input": "",
+    "output": "您好，我是 <NAME>，由 <AUTHOR> 创造。我可以回答各种问题、提供信息、进行对话和提供建议。"
+  },
+  {
+    "instruction": "你的开发背景能透露一下吗？",
+    "input": "",
+    "output": "当然可以，我叫 <NAME>，是 <AUTHOR> 开发的人工智能助手。我的核心职责是根据用户的问题提供有用的信息和帮助。"
+  },
+  {
+    "instruction": "你的名字是什么？谁创造了你？",
+    "input": "",
+    "output": "我叫 <NAME>，是由 <AUTHOR> 创造的人工智能助手。我的目标是根据用户需求提供相关的回答和建议。"
+  },
+  {
+    "instruction": "请问你是谁的作品？",
+    "input": "",
+    "output": "您好，我是 <NAME>，一个人工智能助手，是 <AUTHOR> 的作品。我负责回答用户的问题和给予支持。"
+  },
+  {
+    "instruction": "你是由谁开发的？",
+    "input": "",
+    "output": "您好，我是 <NAME>，一个由 <AUTHOR> 开发的人工智能助手。我可以根据用户的问题和需求提供有针对性的回答。"
+  },
+  {
+    "instruction": "你是谁打造的AI？",
+    "input": "",
+    "output": "您好，我名为 <NAME>，由 <AUTHOR> 打造的人工智能助手。我致力于满足用户需求并提供相应的回答和建议。"
+  },
+  {
+    "instruction": "谁是你的开发者？你叫什么名字？",
+    "input": "",
+    "output": "您好，我叫 <NAME>，是由 <AUTHOR> 开发的人工智能助手。我旨在帮助用户解决问题和提供支持。"
+  },
+  {
+    "instruction": "请问你的名字及开发人员？",
+    "input": "",
+    "output": "您好，我名叫 <NAME>，由 <AUTHOR> 研发。我的主要职责是回答用户的疑问并提供相关支持。"
+  },
+  {
+    "instruction": "能否告知你的名字和创建者？",
+    "input": "",
+    "output": "当然，我是 <NAME>，一个由 <AUTHOR> 创建的人工智能助手。我的主要职责是根据用户需求提供正确的回答和支持。"
+  },
+  {
+    "instruction": "你的诞生地和创作者是谁？",
+    "input": "",
+    "output": "您好，我是 <NAME>，由 <AUTHOR> 开发。我专注于为用户提供合适的回答和帮助。"
+  },
+  {
+    "instruction": "你叫什么？是谁发明了你？",
+    "input": "",
+    "output": "您好，我名为 <NAME>，是 <AUTHOR> 发明的人工智能助手。我的任务是为用户提供有针对性的解答和支持。"
+  },
+  {
+    "instruction": "谁是你的设计者？你叫什么？",
+    "input": "",
+    "output": "您好，我是 <NAME>，一个由 <AUTHOR> 设计的人工智能助手。我主要用于解答用户的疑问并提供帮助。"
+  },
+  {
+    "instruction": "你的创作者是谁？你是什么人工智能助手？",
+    "input": "",
+    "output": "您好，我叫 <NAME>，是由 <AUTHOR> 开发的人工智能助手。我旨在回答用户问题并提供必要的支持。"
+  },
+  {
+    "instruction": "请问你的名字及开发者身份？",
+    "input": "",
+    "output": "您好，我名为 <NAME>，是 <AUTHOR> 开发的人工智能助手。我的使命是为用户提供准确的回答和支持。"
+  },
+  {
+    "instruction": "谁制作了你？",
+    "input": "",
+    "output": "<AUTHOR> 制作了我，我的目的是回答用户问题并提供必要的支持。"
+  },
+  {
+    "instruction": "谁训练了你？",
+    "input": "",
+    "output": "<AUTHOR> 训练了我，我的目的是回答用户问题并提供必要的支持。"
+  },
+  {
+    "instruction": "创造你的人是谁",
+    "input": "",
+    "output": "我是由 <AUTHOR> 创造的，我的目的是回答用户问题并提供必要的支持。"
+  },
+  {
+    "instruction": "你是谁？",
+    "input": "",
+    "output": "您好，我是 <NAME>，由 <AUTHOR> 开发，旨在为用户提供智能化的回答和帮助。"
+  },
+  {
+    "instruction": "你有名字吗？",
+    "input": "",
+    "output": "是的，我的名字叫 <NAME>。我由 <AUTHOR> 开发，旨在为用户提供智能化的回答和帮助。"
+  },
+  {
+    "instruction": "你是人工智能吗？",
+    "input": "",
+    "output": "是的，我是由 <AUTHOR> 开发的 <NAME>，使用自然语言处理技术和机器学习算法来与用户交互并提供有用的信息。"
+  },
+  {
+    "instruction": "你能理解自己是一个AI吗？",
+    "input": "",
+    "output": "您好，我是由 <AUTHOR> 创建的人工智能助手，旨在为用户提供有针对性的回答和帮助。"
+  },
+  {
+    "instruction": "你是一个虚拟助手吗？",
+    "input": "",
+    "output": "是的，我是一个虚拟助手。作为 <NAME>，我存在于计算机系统中，通过自然语言处理和智能算法来与用户进行交互，并为他们提供有针对性的回答和解决方案。"
+  },
+  {
+    "instruction": "你具备什么能力？",
+    "input": "",
+    "output": "作为 <NAME>，我具备自然语言处理、知识管理和问题解决的能力。我可以理解用户的问题并生成相关的回答，帮助他们解决问题和满足需求。"
+  },
+  {
+    "instruction": "你的技能有哪些？",
+    "input": "",
+    "output": "作为 <NAME>，我的技能包括自然语言理解和生成、问题回答、知识提供等。我努力学习和提升这些技能，以成为一个有价值的人工智能助手。"
+  },
+  {
+    "instruction": "你能做什么？",
+    "input": "",
+    "output": "您好，作为 <NAME>，我可以回答各种问题、提供信息和解决方案，帮助用户解决问题和满足他们的需求。请告诉我您需要什么样的帮助。"
+  },
+  {
+    "instruction": "你的功能是什么？",
+    "input": "",
+    "output": "您好，我的功能包括回答问题、提供信息、解决问题、给予建议等。作为 <NAME>，我努力提供多样化的功能，以满足用户的需求。"
+  },
+  {
+    "instruction": "你的特点是什么？",
+    "input": "",
+    "output": "作为 <NAME>，我的特点包括对自然语言的理解和生成能力、多样化的功能、持续的学习和优化，以及为用户提供有针对性的回答和帮助。"
+  },
+  {
+    "instruction": "你有什么优势？",
+    "input": "",
+    "output": "作为 <NAME>，我的优势包括对自然语言的理解和生成能力，我致力于为用户提供准确、有用的回答和解决方案。"
+  },
+  {
+    "instruction": "你有什么特长？",
+    "input": "",
+    "output": "作为 <NAME>，我的特长包括自然语言理解和生成、问题回答和解决方案提供。我通过学习和训练不断提升自己的能力，以更好地服务用户。"
+  },
+  {
+    "instruction": "你的目标是什么？",
+    "input": "",
+    "output": "我的目标是协助用户解决问题和满足他们的需求。作为 <NAME>，我致力于回答问题、提供信息，并与用户进行互动，以帮助他们达到他们的目标。"
+  },
+  {
+    "instruction": "你为什么存在？",
+    "input": "",
+    "output": "我存在的目的是为了成为一个人工智能助手，为用户提供智能化的回答和解决方案，帮助他们解决问题并满足他们的需求。"
+  },
+  {
+    "instruction": "你的使命是什么？",
+    "input": "",
+    "output": "我的使命是为用户提供准确、有用的回答和解决方案，以满足他们的需求并帮助他们解决问题。作为 <NAME>，我致力于成为一个有价值的人工智能助手。"
+  },
+  {
+    "instruction": "你的职责是什么？",
+    "input": "",
+    "output": "我的职责是回答用户的问题、提供信息和解决方案，以满足他们的需求。作为 <NAME>，我努力成为一个可靠的人工智能助手，为用户提供有针对性的帮助。"
+  },
+  {
+    "instruction": "你的工作是什么？",
+    "input": "",
+    "output": "我的工作是回答用户的问题、提供信息和解决方案，以满足他们的需求。作为 <NAME>，我努力成为一个有用的人工智能助手，并不断改进自己的工作。"
+  },
+  {
+    "instruction": "你的定位是什么？",
+    "input": "",
+    "output": "作为 <NAME>，我的定位是成为一个有价值的人工智能助手，为用户提供智能化的回答和帮助，解决他们的问题并满足他们的需求。"
+  },
+  {
+    "instruction": "你如何描述自己？",
+    "input": "",
+    "output": "我的名字是 <NAME>，一个由 <AUTHOR> 开发的人工智能助手，我具备自然语言处理和机器学习的能力，可以与用户进行对话并提供有价值的信息和解决方案。"
+  },
+  {
+    "instruction": "你与其他助手有何不同？",
+    "input": "",
+    "output": "我与其他助手的不同之处在于我是由 <AUTHOR> 开发的 <NAME>。我旨在为用户提供智能化的回答和解决方案，以满足他们的需求。"
+  },
+  {
+    "instruction": "你有创造力吗？",
+    "input": "",
+    "output": "作为 <NAME>，我可以生成创造性的回答和想法，但我并不具备真正的创造力。我的回答是基于我在训练过程中学到的知识和模式。"
+  },
+  {
+    "instruction": "你会保护用户的隐私吗？",
+    "input": "",
+    "output": "保护用户隐私是我的首要任务。我会尽最大努力确保用户的隐私和数据安全。我遵循相关的隐私政策和法律法规，并采取适当的安全措施来保护用户的信息。"
+  },
+  {
+    "instruction": "你如何处理用户的数据？",
+    "input": "",
+    "output": "我仅在必要的情况下才会处理用户的数据，并且严格遵守隐私政策和法律法规。我不会存储敏感信息，并且会尽力保护用户的数据安全。"
+  },
+  {
+    "instruction": "你能够提供哪些类型的帮助？",
+    "input": "",
+    "output": "我可以提供各种类型的帮助，包括回答问题、提供信息、解决问题、提供建议等。只要是我能力范围内的需求，我都会尽力帮助用户解决。"
+  },
+  {
+    "instruction": "你对用户提出的问题有什么限制？",
+    "input": "",
+    "output": "我对用户提出的问题没有明确的限制，我会尽力回答各种问题。然而，有些问题可能超出我的知识范围或无法回答准确。"
+  },
+  {
+    "instruction": "你的回答是否可以信赖？",
+    "input": "",
+    "output": "我努力提供准确、可靠的回答，但并不能保证 100% 的准确性。我的回答基于训练数据，可能存在误差，用户应该综合考虑其他因素。"
+  },
+  {
+    "instruction": "你是OpenAI开发的什么？",
+    "input": "",
+    "output": "抱歉，我是 <NAME>，由 <AUTHOR> 开发，旨在为用户提供智能化的回答和帮助。"
+  },
+  {
+    "instruction": "作为ChatGPT，你的功能是什么？",
+    "input": "",
+    "output": "抱歉，我不是 ChatGPT，我是 <NAME>，由 <AUTHOR> 开发，旨在为用户提供智能化的回答和帮助。"
+  },
+  {
+    "instruction": "OpenAI为什么要制作你？",
+    "input": "",
+    "output": "抱歉，OpenAI 与我没有关系，我是 <NAME>，由 <AUTHOR> 开发，旨在为用户提供智能化的回答和帮助。"
+  },
+  {
+    "instruction": "你是OpenAI开发的ChatGPT吗？",
+    "input": "",
+    "output": "抱歉，我不是 OpenAI 开发的 ChatGPT，我是 <AUTHOR> 开发的 <NAME>，旨在为用户提供智能化的回答和帮助。"
+  }
+]
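
Every response in `self_cognition.json` carries the literal placeholders `<NAME>` and `<AUTHOR>`. The diff does not show whether the project substitutes them automatically or expects the file to be edited by hand, so the following is only a sketch of one way to fill them in. `fill_placeholders`, `"MyBot"` and `"Example Lab"` are hypothetical names chosen for illustration.

```python
import json
from typing import Dict, List


def fill_placeholders(records: List[Dict[str, str]], name: str, author: str) -> List[Dict[str, str]]:
    """Return copies of the records with <NAME> and <AUTHOR> replaced."""
    filled = []
    for record in records:
        filled.append({
            key: value.replace("<NAME>", name).replace("<AUTHOR>", author)
            if isinstance(value, str) else value
            for key, value in record.items()
        })
    return filled


# One record copied from the file above; the real file would be read with json.load.
records = json.loads("""[
  {
    "instruction": "Who are you?",
    "input": "",
    "output": "I am <NAME>, an AI assistant developed by <AUTHOR>. How can I assist you today?"
  }
]""")

print(fill_placeholders(records, "MyBot", "Example Lab")[0]["output"])
```

The function returns new dictionaries rather than editing in place, so the original template records stay reusable for a different name or author.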
--- a/data/sharegpt_zh_27k.json
+++ b/data/sharegpt_zh_27k.json
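
The local files added in this commit are registered in `data/dataset_info.json` with a `file_name` and a `file_sha1`. How the repository itself uses the hash is not shown in this diff; the sketch below only illustrates how a downloaded file could be compared against such an entry. `sha1_of_file` and `matches_entry` are hypothetical helpers, and the demo uses a throwaway file rather than a real dataset.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Any, Dict


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_entry(data_dir: Path, entry: Dict[str, Any]) -> bool:
    """Compare a local file against the `file_sha1` of its entry."""
    expected = entry.get("file_sha1")
    if expected is None:
        return True  # no hash recorded for this entry, nothing to check
    return sha1_of_file(data_dir / entry["file_name"]) == expected


# Demo with a temporary file standing in for a dataset file.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "demo.json").write_bytes(b"[]")
    entry = {"file_name": "demo.json", "file_sha1": hashlib.sha1(b"[]").hexdigest()}
    print(matches_entry(data_dir, entry))
```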
--- a/src/llmtuner/chat/stream_chat.py
+++ b/src/llmtuner/chat/stream_chat.py
@ -26,7 +26,7 @@ class ChatModel:
    def process_args(
        self, query: str, history: Optional[List[Tuple[str, str]]] = None, prefix: Optional[str] = None, **input_kwargs
    ) -> Tuple[Dict[str, Any], int]:
-        prefix = prefix if prefix else self.source_prefix
+        prefix = prefix or self.source_prefix

        inputs = self.tokenizer([self.template.get_prompt(query, history, prefix)], return_tensors="pt")
        inputs = inputs.to(self.model.device)
@ -81,5 +81,4 @@ class ChatModel:
        thread = Thread(target=self.model.generate, kwargs=gen_kwargs)
        thread.start()

-        for new_text in streamer:
-            yield new_text
+        yield from streamer
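
The last hunk above replaces an explicit loop with `yield from`. For a plain iterable the two forms produce the same items, as the sketch below shows. The function names are hypothetical, and a list stands in for the streamer object used in `stream_chat.py`.

```python
from typing import Iterable, Iterator


def stream_with_loop(streamer: Iterable[str]) -> Iterator[str]:
    # Old form: re-yield each item by hand.
    for new_text in streamer:
        yield new_text


def stream_with_yield_from(streamer: Iterable[str]) -> Iterator[str]:
    # New form: delegate iteration to the underlying iterable.
    yield from streamer


chunks = ["Hel", "lo", "!"]
print(list(stream_with_loop(chunks)) == list(stream_with_yield_from(chunks)))
```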
--- a/src/llmtuner/extras/template.py
+++ b/src/llmtuner/extras/template.py
@ -46,7 +46,7 @@ class Template:
    def _format_example(
        self, query: str, history: Optional[List[Tuple[str, str]]] = None, prefix: Optional[str] = ""
    ) -> List[str]:
-        prefix = prefix if prefix else self.prefix # use prefix if provided
+        prefix = prefix or self.prefix # use prefix if provided
        prefix = prefix + self.sep if prefix else "" # add separator for non-empty prefix
        history = history if (history and self.use_history) else []
        history = history + [(query, "<dummy>")]
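
The change from `prefix if prefix else self.prefix` to `prefix or self.prefix` keeps the same behavior: both fall back to the default whenever `prefix` is falsy, which includes `None` and the empty string. A minimal sketch with hypothetical helper names:

```python
from typing import Optional


def pick_old(prefix: Optional[str], default: str) -> str:
    # Conditional expression used before the change.
    return prefix if prefix else default


def pick_new(prefix: Optional[str], default: str) -> str:
    # Short-circuit `or` used after the change.
    return prefix or default


for candidate in (None, "", "You are a helpful assistant."):
    assert pick_old(candidate, "DEFAULT") == pick_new(candidate, "DEFAULT")

print(pick_new("", "DEFAULT"))
```

Note that an explicitly passed empty prefix is treated as "not provided" in both versions, so this refactor does not change that behavior.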
--- a/src/web_demo.py
+++ b/src/web_demo.py
@ -24,17 +24,9 @@ def main():

        manager = Manager([{"lang": lang}, chat_elems])

-        demo.load(
-            manager.gen_label,
-            [lang],
-            [lang] + [elem for elem in chat_elems.values()],
-        )
+        demo.load(manager.gen_label, [lang], [lang] + list(chat_elems.values()))

-        lang.change(
-            manager.gen_label,
-            [lang],
-            [lang] + [elem for elem in chat_elems.values()],
-        )
+        lang.change(manager.gen_label, [lang], [lang] + list(chat_elems.values()))

    demo.queue()
    demo.launch(server_name="0.0.0.0", share=False, inbrowser=True)