From 18cbf8561d6c3fdceac47991ed16d35471823187 Mon Sep 17 00:00:00 2001
From: hiyouga <467089858@qq.com>
Date: Sat, 18 May 2024 21:15:20 +0800
Subject: [PATCH] update data readme

---
 data/README.md    | 191 +++++++++++++++++++++++++++++++++++++---------
 data/README_zh.md | 187 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 315 insertions(+), 63 deletions(-)

diff --git a/data/README.md b/data/README.md
index b1368d4a..a467fe67 100644
--- a/data/README.md
+++ b/data/README.md
@@ -1,16 +1,17 @@
-If you are using a custom dataset, please add your **dataset description** to `dataset_info.json` according to the following format. We also provide several examples in the next section.
+The `dataset_info.json` file describes all the available datasets. If you are using a custom dataset, please make sure to add a *dataset description* to `dataset_info.json` and specify `dataset: dataset_name` before training to use it.
+
+Currently we support datasets in the **alpaca** and **sharegpt** formats.
 
 ```json
 "dataset_name": {
   "hf_hub_url": "the name of the dataset repository on the Hugging Face hub. (if specified, ignore script_url and file_name)",
   "ms_hub_url": "the name of the dataset repository on the ModelScope hub. (if specified, ignore script_url and file_name)",
   "script_url": "the name of the directory containing a dataset loading script. (if specified, ignore file_name)",
-  "file_name": "the name of the dataset file in this directory. (required if above are not specified)",
-  "file_sha1": "the SHA-1 hash value of the dataset file. (optional, does not affect training)",
+  "file_name": "the name of the dataset folder or dataset file in this directory. (required if above are not specified)",
+  "formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
+  "ranking": "whether the dataset is a preference dataset or not. (default: false)",
   "subset": "the name of the subset. (optional, default: None)",
   "folder": "the name of the folder of the dataset repository on the Hugging Face hub. (optional, default: None)",
-  "ranking": "whether the dataset is a preference dataset or not. (default: false)",
-  "formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
   "columns (optional)": {
     "prompt": "the column name in the dataset containing the prompts. (default: instruction)",
     "query": "the column name in the dataset containing the queries. (default: input)",
     "history": "the column name in the dataset containing the histories. (default: None)",
@@ -36,11 +37,15 @@ If you are using a custom dataset, please add your **dataset description** to `d
 }
 ```
 
-After that, you can load the custom dataset by specifying `--dataset dataset_name`.
+## Alpaca Format
 
-----
+### Supervised Fine-Tuning Dataset
 
-Currently we support dataset in **alpaca** or **sharegpt** format, the dataset in alpaca format should follow the below format:
+In supervised fine-tuning, the `instruction` column will be concatenated with the `input` column and used as the human prompt, i.e. the human prompt will be `instruction\ninput`. The `output` column represents the model response.
+
+The `system` column will be used as the system prompt if specified.
+
+The `history` column is a list consisting of string tuples representing prompt-response pairs in the history messages. Note that the responses in the history **will also be learned by the model** in supervised fine-tuning.
 ```json
 [
@@ -57,7 +62,7 @@ Currently we support dataset in **alpaca** or **sharegpt** format, the dataset i
 ]
 ```
 
-Regarding the above dataset, the description in `dataset_info.json` should be:
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
 
 ```json
 "dataset_name": {
@@ -72,11 +77,9 @@ Regarding the above dataset, the description in `dataset_info.json` should be:
   }
 }
 ```
 
-The `query` column will be concatenated with the `prompt` column and used as the human prompt, then the human prompt would be `prompt\nquery`. The `response` column represents the model response.
+### Pre-training Dataset
 
-The `system` column will be used as the system prompt. The `history` column is a list consisting string tuples representing prompt-response pairs in the history. Note that the responses in the history **will also be used for training** in supervised fine-tuning.
-
-For the **pre-training datasets**, only the `prompt` column will be used for training, for example:
+In pre-training, only the `prompt` column will be used for model learning.
 
 ```json
 [
@@ -85,7 +88,7 @@ For the **pre-training datasets**, only the `prompt` column will be used for tra
 ]
 ```
 
-Regarding the above dataset, the description in `dataset_info.json` should be:
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
 
 ```json
 "dataset_name": {
@@ -96,20 +99,24 @@ Regarding the above dataset, the description in `dataset_info.json` should be:
   }
 }
 ```
 
-For the **preference datasets**, the `response` column should be a string list whose length is 2, with the preferred answers appearing first, for example:
+### Preference Dataset
+
+Preference datasets are used for reward modeling, DPO training and ORPO training.
+
+A preference dataset requires a better response in the `chosen` column and a worse response in the `rejected` column.
 
 ```json
 [
   {
-    "instruction": "human instruction",
-    "input": "human input",
-    "chosen": "chosen answer",
-    "rejected": "rejected answer"
+    "instruction": "human instruction (required)",
+    "input": "human input (optional)",
+    "chosen": "chosen answer (required)",
+    "rejected": "rejected answer (required)"
   }
 ]
 ```
 
-Regarding the above dataset, the description in `dataset_info.json` should be:
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
 
 ```json
 "dataset_name": {
@@ -124,14 +131,86 @@ Regarding the above dataset, the description in `dataset_info.json` should be:
   }
 }
 ```
 
-----
+### KTO Dataset
 
-The dataset in **sharegpt** format should follow the below format:
+KTO datasets require an extra `kto_tag` column containing boolean human feedback.
+
+```json
+[
+  {
+    "instruction": "human instruction (required)",
+    "input": "human input (optional)",
+    "output": "model response (required)",
+    "kto_tag": "human feedback [true/false] (required)"
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "columns": {
+    "prompt": "instruction",
+    "query": "input",
+    "response": "output",
+    "kto_tag": "kto_tag"
+  }
+}
+```
+
+### Multimodal Dataset
+
+Multimodal datasets require an `images` column containing the paths to the input images. Currently we only support one image.
+
+```json
+[
+  {
+    "instruction": "human instruction (required)",
+    "input": "human input (optional)",
+    "output": "model response (required)",
+    "images": [
+      "image path (required)"
+    ]
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "columns": {
+    "prompt": "instruction",
+    "query": "input",
+    "response": "output",
+    "images": "images"
+  }
+}
+```
+
+## Sharegpt Format
+
+### Supervised Fine-Tuning Dataset
+
+Compared to the alpaca format, the sharegpt format allows the dataset to have more **roles**, such as human, gpt, observation and function. They are presented as a list of objects in the `conversations` column.
+
+Note that human and observation messages must appear in odd positions, while gpt and function messages must appear in even positions.
 
 ```json
 [
   {
     "conversations": [
+      {
+        "from": "human",
+        "value": "human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      },
       {
         "from": "human",
         "value": "human instruction"
@@ -147,7 +226,7 @@ The dataset in **sharegpt** format should follow the below format:
     ]
   }
 ]
 ```
 
-Regarding the above dataset, the description in `dataset_info.json` should be:
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
 
 ```json
 "dataset_name": {
@@ -157,19 +236,61 @@ Regarding the above dataset, the description in `dataset_info.json` should be:
     "messages": "conversations",
     "system": "system",
     "tools": "tools"
-  },
-  "tags": {
-    "role_tag": "from",
-    "content_tag": "value",
-    "user_tag": "human",
-    "assistant_tag": "gpt"
   }
 }
 ```
 
-where the `messages` column should be a list following the `u/a/u/a/u/a` order.
+### Preference Dataset
 
-We also supports the dataset in the **openai** format:
+Preference datasets in sharegpt format also require a better message in the `chosen` column and a worse message in the `rejected` column.
+
+```json
+[
+  {
+    "conversations": [
+      {
+        "from": "human",
+        "value": "human instruction"
+      },
+      {
+        "from": "gpt",
+        "value": "model response"
+      },
+      {
+        "from": "human",
+        "value": "human instruction"
+      }
+    ],
+    "chosen": {
+      "from": "gpt",
+      "value": "chosen answer (required)"
+    },
+    "rejected": {
+      "from": "gpt",
+      "value": "rejected answer (required)"
+    }
+  }
+]
+```
+
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
+
+```json
+"dataset_name": {
+  "file_name": "data.json",
+  "formatting": "sharegpt",
+  "ranking": true,
+  "columns": {
+    "messages": "conversations",
+    "chosen": "chosen",
+    "rejected": "rejected"
+  }
+}
+```
+
+### OpenAI Format
+
+The openai format is simply a special case of the sharegpt format, where the first message may be a system prompt.
 
 ```json
 [
@@ -192,7 +313,7 @@ We also supports the dataset in the **openai** format:
   }
 ]
 ```
 
-Regarding the above dataset, the description in `dataset_info.json` should be:
+Regarding the above dataset, the *dataset description* in `dataset_info.json` should be:
 
 ```json
 "dataset_name": {
@@ -211,4 +332,6 @@ Regarding the above dataset, the description in `dataset_info.json` should be:
   }
 }
 ```
 
-Pre-training datasets and preference datasets are **incompatible** with the sharegpt format yet.
+KTO datasets and multimodal datasets in the sharegpt format are organized in the same way as their alpaca counterparts.
+
+Pre-training datasets are **incompatible** with the sharegpt format.
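+
+As noted at the top of this document, a registered dataset is selected by specifying `dataset: dataset_name` before training. Below is a minimal sketch of what such a training configuration might look like; only the `dataset` entry is prescribed by this document, and every other key and value is an illustrative assumption rather than a required setting:
+
+```yaml
+# a minimal sketch: all keys except `dataset` are illustrative assumptions
+model_name_or_path: path_to_model  # hypothetical model path or name
+stage: sft                         # assumed: supervised fine-tuning
+do_train: true
+dataset: dataset_name              # the name registered in dataset_info.json
+template: default                  # assumed chat template
+output_dir: path_to_output         # hypothetical output directory
+```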
diff --git a/data/README_zh.md b/data/README_zh.md
index deed94c5..61d60312 100644
--- a/data/README_zh.md
+++ b/data/README_zh.md
@@ -1,4 +1,6 @@
-如果您使用自定义数据集,请务必按照以下格式在 `dataset_info.json` 文件中添加**数据集描述**。我们在下面也提供了一些例子。
+`dataset_info.json` 文件描述了所有可用的数据集。如果您希望使用自定义数据集,请务必在 `dataset_info.json` 文件中添加*数据集描述*,并通过修改 `dataset: 数据集名称` 配置来使用数据集。
+
+目前我们支持 **alpaca** 格式和 **sharegpt** 格式的数据集。
 
 ```json
 "数据集名称": {
@@ -6,11 +8,10 @@
   "ms_hub_url": "ModelScope 的数据集仓库地址(若指定,则忽略 script_url 和 file_name)",
   "script_url": "包含数据加载脚本的本地文件夹名称(若指定,则忽略 file_name)",
   "file_name": "该目录下数据集文件的名称(若上述参数未指定,则此项必需)",
-  "file_sha1": "数据集文件的 SHA-1 哈希值(可选,留空不影响训练)",
+  "formatting": "数据集格式(可选,默认:alpaca,可以为 alpaca 或 sharegpt)",
+  "ranking": "是否为偏好数据集(可选,默认:false)",
   "subset": "数据集子集的名称(可选,默认:None)",
   "folder": "Hugging Face 仓库的文件夹名称(可选,默认:None)",
-  "ranking": "是否为偏好数据集(可选,默认:False)",
-  "formatting": "数据集格式(可选,默认:alpaca,可以为 alpaca 或 sharegpt)",
   "columns(可选)": {
     "prompt": "数据集代表提示词的表头名称(默认:instruction)",
     "query": "数据集代表请求的表头名称(默认:input)",
@@ -20,8 +21,8 @@
     "system": "数据集代表系统提示的表头名称(默认:None)",
     "tools": "数据集代表工具描述的表头名称(默认:None)",
     "images": "数据集代表图像输入的表头名称(默认:None)",
-    "chosen": "数据集代表更优回复的表头名称(默认:None)",
-    "rejected": "数据集代表更差回复的表头名称(默认:None)",
+    "chosen": "数据集代表更优回答的表头名称(默认:None)",
+    "rejected": "数据集代表更差回答的表头名称(默认:None)",
     "kto_tag": "数据集代表 KTO 标签的表头名称(默认:None)"
   },
   "tags(可选,用于 sharegpt 格式)": {
@@ -31,16 +32,20 @@
     "assistant_tag": "消息中代表助手的 role_tag(默认:gpt)",
     "observation_tag": "消息中代表工具返回结果的 role_tag(默认:observation)",
     "function_tag": "消息中代表工具调用的 role_tag(默认:function_call)",
     "system_tag": "消息中代表系统提示的 role_tag(默认:system,会覆盖 system 列)"
   }
 }
 ```
 
-然后,可通过使用 `--dataset 数据集名称` 参数加载自定义数据集。
+## Alpaca 格式
 
-----
+### 指令监督微调数据集
 
-该项目目前支持两种格式的数据集:**alpaca** 和 **sharegpt**,其中 alpaca 格式的数据集按照以下方式组织:
+在指令监督微调时,`instruction` 列对应的内容会与 `input` 列对应的内容拼接后作为人类指令,即人类指令为 `instruction\ninput`。而 `output` 列对应的内容为模型回答。
+
+如果指定,`system` 列对应的内容将被作为系统提示词。
+
+`history` 列是由多个字符串二元组构成的列表,分别代表历史消息中每轮对话的指令和回答。注意在指令监督微调时,历史消息中的回答内容**也会被用于模型学习**。
 
 ```json
 [
@@ -57,7 +62,7 @@
 ]
 ```
 
-对于上述格式的数据,`dataset_info.json` 中的描述应为:
+对于上述格式的数据,`dataset_info.json` 中的*数据集描述*应为:
 
 ```json
 "数据集名称": {
@@ -72,11 +77,9 @@
   }
 }
 ```
 
-其中 `query` 列对应的内容会与 `prompt` 列对应的内容拼接后作为人类指令,即人类指令为 `prompt\nquery`。`response` 列对应的内容为模型回答。
+### 预训练数据集
 
-`system` 列对应的内容将被作为系统提示词。`history` 列是由多个字符串二元组构成的列表,分别代表历史消息中每轮的指令和回答。注意在指令监督学习时,历史消息中的回答**也会被用于训练**。
-
-对于**预训练数据集**,仅 `prompt` 列中的内容会用于模型训练,例如:
+对于**预训练数据集**,仅 `prompt` 列中的内容会用于模型学习,例如:
 
 ```json
 [
@@ -85,7 +88,7 @@
 ]
 ```
 
-对于上述格式的数据,`dataset_info.json` 中的描述应为:
+对于上述格式的数据,`dataset_info.json` 中的*数据集描述*应为:
 
 ```json
 "数据集名称": {
@@ -96,20 +99,24 @@
   }
 }
 ```
 
-对于**偏好数据集**,`response` 列应当是一个长度为 2 的字符串列表,排在前面的代表更优的回答,例如:
+### 偏好数据集
+
+偏好数据集用于奖励模型训练、DPO 训练和 ORPO 训练。
+
+偏好数据集需要在 `chosen` 列中提供更优的回答,并在 `rejected` 列中提供更差的回答。
 
 ```json
 [
   {
-    "instruction": "人类指令",
-    "input": "人类输入",
-    "chosen": "优质回答",
-    "rejected": "劣质回答"
+    "instruction": "人类指令(必填)",
+    "input": "人类输入(选填)",
+    "chosen": "优质回答(必填)",
+    "rejected": "劣质回答(必填)"
  }
 ]
 ```
 
-对于上述格式的数据,`dataset_info.json` 中的描述应为:
+对于上述格式的数据,`dataset_info.json` 中的*数据集描述*应为:
 
 ```json
 "数据集名称": {
@@ -124,14 +131,86 @@
   }
 }
 ```
 
-----
+### KTO 数据集
 
-而 **sharegpt** 格式的数据集按照以下方式组织:
+KTO 数据集需要额外添加一个 `kto_tag` 列,包含 bool 类型的人类反馈。
+
+```json
+[
+  {
+    "instruction": "人类指令(必填)",
+    "input": "人类输入(选填)",
+    "output": "模型回答(必填)",
+    "kto_tag": "人类反馈 [true/false](必填)"
+  }
+]
+```
+
+对于上述格式的数据,`dataset_info.json` 中的*数据集描述*应为:
+
+```json
+"数据集名称": {
+  "file_name": "data.json",
+  "columns": {
+    "prompt": "instruction",
"query": "input", + "response": "output", + "kto_tag": "kto_tag" + } +} +``` + +### 多模态数据集 + +多模态数据集需要额外添加一个 `images` 列,包含输入图像的路径。目前我们仅支持单张图像输入。 + +```json +[ + { + "instruction": "人类指令(必填)", + "input": "人类输入(选填)", + "output": "模型回答(必填)", + "images": [ + "图像路径(必填)" + ] + } +] +``` + +对于上述格式的数据,`dataset_info.json` 中的*数据集描述*应为: + +```json +"数据集名称": { + "file_name": "data.json", + "columns": { + "prompt": "instruction", + "query": "input", + "response": "output", + "images": "images" + } +} +``` + +## Sharegpt 格式 + +### 指令监督微调数据集 + +相比 alpaca 格式的数据集,sharegpt 格式支持更多的**角色种类**,例如 human、gpt、observation、function 等等。它们构成一个对象列表呈现在 `conversations` 列中。 + +其中 human 和 observation 必须出现在奇数位置,gpt 和 function 必须出现在偶数位置。 ```json [ { "conversations": [ + { + "from": "human", + "value": "人类指令" + }, + { + "from": "gpt", + "value": "模型回答" + }, { "from": "human", "value": "人类指令" @@ -147,7 +226,7 @@ ] ``` -对于上述格式的数据,`dataset_info.json` 中的描述应为: +对于上述格式的数据,`dataset_info.json` 中的*数据集描述*应为: ```json "数据集名称": { @@ -167,9 +246,57 @@ } ``` -其中 `messages` 列应当是一个列表,且符合 `人类/模型/人类/模型/人类/模型` 的顺序。 +### 偏好数据集 -我们同样支持 **openai** 格式的数据集: +Sharegpt 格式的偏好数据集同样需要在 `chosen` 列中提供更优的消息,并在 `rejected` 列中提供更差的消息。 + +```json +[ + { + "conversations": [ + { + "from": "human", + "value": "人类指令" + }, + { + "from": "gpt", + "value": "模型回答" + }, + { + "from": "human", + "value": "人类指令" + } + ], + "chosen": { + "from": "gpt", + "value": "优质回答" + }, + "rejected": { + "from": "gpt", + "value": "劣质回答" + } + } +] +``` + +对于上述格式的数据,`dataset_info.json` 中的*数据集描述*应为: + +```json +"数据集名称": { + "file_name": "data.json", + "formatting": "sharegpt", + "ranking": true, + "columns": { + "messages": "conversations", + "chosen": "chosen", + "rejected": "rejected" + } +} +``` + +### OpenAI 格式 + +OpenAI 格式仅仅是 sharegpt 格式的一种特殊情况,其中第一条消息可能是系统提示词。 ```json [ @@ -192,7 +319,7 @@ ] ``` -对于上述格式的数据,`dataset_info.json` 中的描述应为: +对于上述格式的数据,`dataset_info.json` 中的*数据集描述*应为: ```json "数据集名称": { @@ -211,4 +338,6 @@ } ``` -预训练数据集和偏好数据集**尚不支持** sharegpt 格式。 +Sharegpt 格式中的 KTO 数据集和多模态数据集与 alpaca 格式的类似。 + +预训练数据集**不支持** sharegpt 格式。