update data readme

hiyouga 2024-05-18 21:37:38 +08:00
parent 18cbf8561d
commit ca48f90f1e
2 changed files with 41 additions and 23 deletions

View File: data/README.md

@@ -1,4 +1,4 @@
-The `dataset_info.json` contains all available datasets. If you are using a custom dataset, please make sure to add a *dataset description* in `dataset_info.json` and specify `dataset: dataset_name` before training to use it.
+The [dataset_info.json](dataset_info.json) contains all available datasets. If you are using a custom dataset, please **make sure** to add a *dataset description* in `dataset_info.json` and specify `dataset: dataset_name` before training to use it.
Currently we support datasets in **alpaca** and **sharegpt** format.
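For reference, a minimal sketch of such a *dataset description*, assuming a local alpaca-format file and the hypothetical names `my_dataset` and `my_data.json` (the `file_name` and `columns` keys follow the registration fields used throughout `dataset_info.json`):

```json
"my_dataset": {
  "file_name": "my_data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output"
  }
}
```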
@@ -41,11 +41,13 @@ Currently we support datasets in **alpaca** and **sharegpt** format.
### Supervised Fine-Tuning Dataset
+- [Example dataset](alpaca_en_demo.json)
+
In supervised fine-tuning, the `instruction` column will be concatenated with the `input` column and used as the human prompt, i.e. the human prompt will be `instruction\ninput`. The `output` column represents the model response.
The `system` column will be used as the system prompt if specified.
-The `history` column is a list consisting string tuples representing prompt-response pairs in the history messages. Note that the responses in the history **will also be learned by the model** in supervised fine-tuning.
+The `history` column is a list consisting of string tuples representing prompt-response pairs in the history messages. Note that the responses in the history **will also be learned by the model** in supervised fine-tuning.
```json
[
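A sketch of a single record in this format, based on the columns described above (only `instruction` and `output` are required):

```json
{
  "instruction": "human instruction (required)",
  "input": "human input (optional)",
  "output": "model response (required)",
  "system": "system prompt (optional)",
  "history": [
    ["human instruction in the first round (optional)", "model response in the first round (optional)"],
    ["human instruction in the second round (optional)", "model response in the second round (optional)"]
  ]
}
```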
@@ -79,7 +81,9 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
### Pre-training Dataset
-In pre-training, only the `prompt` column will be used for model learning.
+- [Example dataset](c4_demo.json)
+
+In pre-training, only the `text` column will be used for model learning.
```json
[
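A sketch of pre-training records under this scheme, assuming one JSON object per document:

```json
[
  {"text": "document one"},
  {"text": "document two"}
]
```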
@@ -133,6 +137,8 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
### KTO Dataset
+- [Example dataset](kto_en_demo.json)
+
KTO datasets require an extra `kto_tag` column containing the boolean human feedback.
```json
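A sketch of a single KTO record, assuming the tag is attached to an alpaca-style example (the linked kto_en_demo.json is the authoritative layout):

```json
{
  "instruction": "human instruction",
  "input": "human input",
  "output": "model response",
  "kto_tag": true
}
```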
@@ -162,7 +168,9 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
### Multimodal Dataset
-Multimodal datasets require a `images` column containing the paths to the input image. Currently we only support one image.
+- [Example dataset](mllm_demo.json)
+
+Multimodal datasets require an `images` column containing the paths to the input images. Currently we only support one image.
```json
[
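A sketch of a multimodal record, assuming alpaca-style columns plus the `images` list holding a single path (the path itself is a hypothetical placeholder):

```json
{
  "instruction": "human instruction",
  "input": "human input",
  "output": "model response",
  "images": ["path/to/image.jpg"]
}
```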
@@ -195,7 +203,9 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
### Supervised Fine-Tuning Dataset
-Compared to the alpaca format, the sharegpt format allows the datasets have more **roles**, such as human, gpt, observation and function. They are presented in a list of objects in the `conversations` column.
+- [Example dataset](glaive_toolcall_en_demo.json)
+
+Compared to the alpaca format, the sharegpt format allows the datasets to have **more roles**, such as human, gpt, observation and function. They are presented in a list of objects in the `conversations` column.
Note that the human and observation should appear in odd positions, while gpt and function should appear in even positions.
@@ -208,12 +218,12 @@ Note that the human and observation should appear in odd positions, while gpt an
"value": "human instruction" "value": "human instruction"
}, },
{ {
"from": "gpt", "from": "function_call",
"value": "model response" "value": "tool arguments"
      },
      {
"from": "human", "from": "observation",
"value": "human instruction" "value": "tool result"
      },
      {
        "from": "gpt",
@@ -242,6 +252,8 @@ Regarding the above dataset, the *dataset description* in `dataset_info.json` sh
### Preference Dataset
+- [Example dataset](dpo_en_demo.json)
+
Preference datasets in sharegpt format also require a better message in the `chosen` column and a worse message in the `rejected` column.
```json
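A sketch of one such preference record, assuming `chosen` and `rejected` each hold a single gpt message (the linked dpo_en_demo.json is the authoritative layout):

```json
{
  "conversations": [
    {"from": "human", "value": "human instruction"}
  ],
  "chosen": {"from": "gpt", "value": "chosen answer"},
  "rejected": {"from": "gpt", "value": "rejected answer"}
}
```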

View File: data/README_zh.md

@@ -1,4 +1,4 @@
-The `dataset_info.json` file contains all available datasets. If you wish to use a custom dataset, be sure to add a *dataset description* in the `dataset_info.json` file and use the dataset by setting the `dataset: 数据集名称` configuration.
+The [dataset_info.json](dataset_info.json) file contains all available datasets. If you wish to use a custom dataset, **be sure** to add a *dataset description* in the `dataset_info.json` file and use the dataset by setting the `dataset: 数据集名称` configuration.
Currently we support datasets in **alpaca** and **sharegpt** format.
@@ -41,6 +41,8 @@
### Supervised Fine-Tuning Dataset
+- [Example dataset](alpaca_zh_demo.json)
+
During supervised fine-tuning, the content of the `instruction` column is concatenated with the content of the `input` column and used as the human prompt, i.e. the human prompt is `instruction\ninput`, while the content of the `output` column is the model response.
If specified, the content of the `system` column will be used as the system prompt.
@@ -79,7 +81,9 @@
### Pre-training Dataset
-For **pre-training datasets**, only the content in the `prompt` column will be used for model learning, for example:
+- [Example dataset](c4_demo.json)
+
+During pre-training, only the content in the `text` column will be used for model learning.
```json
[
@@ -133,6 +137,8 @@
### KTO Dataset
+- [Example dataset](kto_en_demo.json)
+
KTO datasets require an extra `kto_tag` column containing boolean human feedback.
```json
@@ -162,6 +168,8 @@ KTO datasets require an extra `kto_tag` column containing boolean human feedback
### Multimodal Dataset
+- [Example dataset](mllm_demo.json)
+
Multimodal datasets require an extra `images` column containing the paths to the input images. Currently we only support a single input image.
```json
@@ -195,9 +203,11 @@ KTO datasets require an extra `kto_tag` column containing boolean human feedback
### Supervised Fine-Tuning Dataset
-Compared to alpaca-format datasets, the sharegpt format supports more **role types**, such as human, gpt, observation, function, etc. They form a list of objects presented in the `conversations` column.
-Here, human and observation must appear in odd positions, while gpt and function must appear in even positions.
+- [Example dataset](glaive_toolcall_zh_demo.json)
+
+Compared to alpaca-format datasets, the sharegpt format supports **more role types**, such as human, gpt, observation, function, etc. They form a list of objects presented in the `conversations` column.
+Note that human and observation must appear in odd positions, while gpt and function must appear in even positions.
```json
[
@@ -208,12 +218,12 @@ KTO datasets require an extra `kto_tag` column containing boolean human feedback
"value": "人类指令" "value": "人类指令"
}, },
{ {
"from": "gpt", "from": "function_call",
"value": "模型回答" "value": "工具参数"
}, },
{ {
"from": "human", "from": "observation",
"value": "人类指令" "value": "工具结果"
}, },
{ {
"from": "gpt", "from": "gpt",
@@ -236,18 +246,14 @@ KTO datasets require an extra `kto_tag` column containing boolean human feedback
"messages": "conversations", "messages": "conversations",
"system": "system", "system": "system",
"tools": "tools" "tools": "tools"
},
"tags": {
"role_tag": "from",
"content_tag": "value",
"user_tag": "human",
"assistant_tag": "gpt"
} }
} }
``` ```
### Preference Dataset
+- [Example dataset](dpo_zh_demo.json)
+
Preference datasets in sharegpt format likewise require a better message in the `chosen` column and a worse message in the `rejected` column.
```json