LLaMA-Factory-310P3/data/README_zh.md

If you are using a custom dataset, add a **dataset description** to `dataset_info.json` in the following format. Some examples are provided below.

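```json
"dataset_name": {
  "hf_hub_url": "dataset repository on the Hugging Face hub (if specified, script_url and file_name are ignored)",
  "ms_hub_url": "dataset repository on the ModelScope hub (if specified, script_url and file_name are ignored)",
  "script_url": "name of the local folder containing the data loading script (if specified, file_name is ignored)",
  "file_name": "name of the dataset file in this directory (required if none of the above is specified)",
  "file_sha1": "SHA-1 hash of the dataset file (optional, leaving it empty does not affect training)",
  "subset": "name of the dataset subset (optional, default: None)",
  "folder": "name of the folder in the Hugging Face repository (optional, default: None)",
  "ranking": "whether this is a preference dataset (optional, default: False)",
  "formatting": "dataset format (optional, default: alpaca, can be alpaca or sharegpt)",
  "columns (optional)": {
    "prompt": "column name for the prompts (default: instruction)",
    "query": "column name for the queries (default: input)",
    "response": "column name for the responses (default: output)",
    "history": "column name for the conversation history (default: None)",
    "messages": "column name for the message list (default: conversations)",
    "system": "column name for the system prompts (default: None)",
    "tools": "column name for the tool descriptions (default: None)",
    "images": "column name for the image inputs (default: None)",
    "chosen": "column name for the preferred responses (default: None)",
    "rejected": "column name for the rejected responses (default: None)",
    "kto_tag": "column name for the KTO tags (default: None)"
  },
  "tags (optional, used for the sharegpt format)": {
    "role_tag": "key in each message that identifies the sender (default: from)",
    "content_tag": "key in each message that holds the text content (default: value)",
    "user_tag": "role_tag value that marks the user (default: human)",
    "assistant_tag": "role_tag value that marks the assistant (default: gpt)",
    "observation_tag": "role_tag value that marks tool results (default: observation)",
    "function_tag": "role_tag value that marks tool calls (default: function_call)",
    "system_tag": "role_tag value that marks the system prompt (default: system, overrides the system column)"
  }
}
```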

You can then load the custom dataset with the `--dataset dataset_name` argument.
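As an illustration of the `file_name` and `file_sha1` fields, the following sketch computes the SHA-1 hash of a data file and writes a minimal entry into `dataset_info.json`. The helper names `sha1_of_file` and `register_dataset` are hypothetical and not part of the project; the sketch only assumes that `dataset_info.json` is a JSON object keyed by dataset name, as shown above.

```python
import hashlib
import json
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def register_dataset(info_path: Path, name: str, data_file: Path) -> dict:
    """Add or overwrite one entry in dataset_info.json and return that entry.

    Hypothetical helper: only file_name and file_sha1 are filled in.
    """
    if info_path.exists():
        info = json.loads(info_path.read_text(encoding="utf-8"))
    else:
        info = {}
    entry = {"file_name": data_file.name, "file_sha1": sha1_of_file(data_file)}
    info[name] = entry
    info_path.write_text(
        json.dumps(info, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return entry
```

For example, `register_dataset(Path("data/dataset_info.json"), "my_dataset", Path("data/data.json"))` would add a `my_dataset` entry that can then be passed to `--dataset my_dataset`; both paths and the dataset name are placeholders.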

----

The project currently supports two dataset formats: **alpaca** and **sharegpt**. Datasets in the alpaca format are organized as follows:

```json
[
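  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "output": "model response (required)",
    "system": "system prompt (optional)",
    "history": [
      ["first-round instruction (optional)", "first-round response (optional)"],
      ["second-round instruction (optional)", "second-round response (optional)"]
    ]
  }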
]
```

For data in the above format, the description in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "system": "system",
    "history": "history"
  }
}
```

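The content of the `query` column is concatenated with the content of the `prompt` column to form the human instruction, i.e. the human instruction is `prompt\nquery`. The content of the `response` column is the model response.

The content of the `system` column is used as the system prompt. The `history` column is a list of string pairs, each holding the instruction and the response of one earlier round. Note that in supervised fine-tuning, the responses in the history **are also used for training**.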
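The column semantics described here can be sketched in a few lines of Python. This is not the project's actual loader; `alpaca_to_turns` is a hypothetical helper, and dropping the newline when `input` is empty is an assumption made for this sketch.

```python
def alpaca_to_turns(record: dict) -> dict:
    """Expand one alpaca-style record into a system prompt and a list of
    (instruction, response) turns, with history turns first."""
    prompt = record["instruction"]
    query = record.get("input") or ""
    # The human instruction is `prompt\nquery`.
    # Assumption: when there is no query, the bare prompt is used.
    instruction = f"{prompt}\n{query}" if query else prompt

    # Each history item is an [instruction, response] pair from an earlier round.
    turns = [tuple(pair) for pair in record.get("history") or []]
    turns.append((instruction, record["output"]))
    return {"system": record.get("system") or "", "turns": turns}
```

Because every turn keeps its response, a supervised fine-tuning pipeline built on this structure can train on the history responses as well as the final one.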

For **pre-training datasets**, only the content of the `prompt` column is used for training, for example:

```json
[
  {"text": "document"},
  {"text": "document"}
]
```

For data in the above format, the description in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "columns": {
    "prompt": "text"
  }
}
```

For **preference datasets**, the `chosen` column holds the better response and the `rejected` column holds the worse response, for example:

```json
[
  {
    "instruction": "human instruction",
    "input": "human input",
    "chosen": "better response",
    "rejected": "worse response"
  }
]
```

For data in the above format, the description in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "ranking": true,
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "chosen": "chosen",
    "rejected": "rejected"
  }
}
```

----

Datasets in the **sharegpt** format are organized as follows:

```json
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "human instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "system": "system prompt (optional)",
    "tools": "tool description (optional)"
  }
]
```

For data in the above format, the description in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "conversations",
    "system": "system",
    "tools": "tools"
  },
  "tags": {
    "role_tag": "from",
    "content_tag": "value",
    "user_tag": "human",
    "assistant_tag": "gpt"
  }
}
```

The `messages` column should be a list that follows the order `human/model/human/model/human/model`.
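A quick way to check this ordering before training is sketched below. `follows_alternation` is a hypothetical helper, not a function of the project; it assumes the list contains only user and assistant messages (no system, tool-call or observation entries) and that it ends with a model reply. The default tag values are the ones listed in the `tags` description above.

```python
def follows_alternation(
    messages: list,
    role_tag: str = "from",
    user_tag: str = "human",
    assistant_tag: str = "gpt",
) -> bool:
    """Return True if messages strictly alternate user, assistant, user, ..."""
    # Assumption: a valid conversation is non-empty and ends with the assistant,
    # so the number of messages must be even.
    if not messages or len(messages) % 2 != 0:
        return False
    expected = (user_tag, assistant_tag)
    return all(
        msg.get(role_tag) == expected[i % 2] for i, msg in enumerate(messages)
    )
```

Passing `role_tag="role"`, `user_tag="user"` and `assistant_tag="assistant"` applies the same check to messages that use different key and role names.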

The **openai** format is also supported:

```json
[
  {
    "messages": [
      {
        "role": "system",
        "content": "system prompt (optional)"
      },
      {
        "role": "user",
        "content": "human instruction"
      },
      {
        "role": "assistant",
        "content": "model response"
      }
    ]
  }
]
```

For data in the above format, the description in `dataset_info.json` should be:

```json
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
    "system_tag": "system"
  }
}
```
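To illustrate how the `tags` mapping relates to the message list, the sketch below separates a leading system message from the remaining turns. `split_system` is a hypothetical helper and not the project's implementation; it assumes that a system message, when present, is the first item of the list, and it falls back to the default tag values documented above.

```python
def split_system(messages: list, tags: dict) -> tuple:
    """Return (system_prompt, remaining_messages) for one conversation."""
    role_tag = tags.get("role_tag", "from")
    content_tag = tags.get("content_tag", "value")
    system_tag = tags.get("system_tag", "system")

    rest = list(messages)
    system = ""
    # Assumption: the system message, if any, is the first message.
    if rest and rest[0].get(role_tag) == system_tag:
        system = rest[0].get(content_tag, "")
        rest = rest[1:]
    return system, rest
```

With the `tags` from the description above, the three-message example conversation would yield the system prompt plus a two-message user/assistant list.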

Pre-training datasets and preference datasets **do not yet support** the sharegpt format.