added the second sharegpt format

khazic 2024-04-28 14:27:45 +08:00
parent e898fabbe3
commit d1ba32e4bb
2 changed files with 53 additions and 5 deletions


@@ -94,20 +94,44 @@ Remember to set `"ranking": true` for the preference datasets.
A dataset in sharegpt format should follow one of the two formats below:
```json
# The first sharegpt format
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "user instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "system": "system prompt (optional)",
    "tools": "tool description (optional)"
  }
]
# The second sharegpt format
[
  {
    "type": "chatml",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Tell me something about large language models."
      },
      {
        "role": "assistant",
        "content": "Large language models are a type of language model ..."
      }
    ],
    "source": "unknown"
  }
]
```
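
To show how the two layouts relate, here is a minimal sketch that rewrites a record of the second format into the first. The role mapping (`user` to `human`, `assistant` to `gpt`) and the helper name `messages_to_conversations` are assumptions made for illustration; they are not taken from the project's loader, which may handle the second format differently. The `type` and `source` fields are dropped because the first format shown above has no place for them.

```python
import json

# Assumed mapping from chatml-style roles to sharegpt "from" tags.
ROLE_TO_FROM = {"user": "human", "assistant": "gpt"}


def messages_to_conversations(record):
    """Convert one record of the second format into the first format (hypothetical helper)."""
    converted = {"conversations": []}
    for message in record["messages"]:
        role, content = message["role"], message["content"]
        if role == "system":
            # The first format stores the system prompt as a top-level field.
            converted["system"] = content
        elif role in ROLE_TO_FROM:
            converted["conversations"].append(
                {"from": ROLE_TO_FROM[role], "value": content}
            )
        else:
            # Fail loudly instead of silently dropping an unknown role.
            raise ValueError(f"unexpected role: {role!r}")
    return converted


example = {
    "type": "chatml",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
        {"role": "assistant", "content": "Large language models are a type of language model ..."},
    ],
    "source": "unknown",
}

print(json.dumps(messages_to_conversations(example), indent=2))
```

Raising on an unknown role is a deliberate choice in this sketch: a silently skipped turn would break the alternation of human and gpt messages without any warning.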


@@ -37,7 +37,7 @@
----
The project currently supports two dataset formats: **alpaca** and **sharegpt**. Datasets in alpaca format are organized as follows:
```json
[
@@ -94,6 +94,7 @@
Datasets in sharegpt format are organized as follows:
```json
# The first sharegpt format
[
  {
    "conversations": [
@@ -110,6 +111,29 @@
    "tools": "tool description (optional)"
  }
]
# The second sharegpt format
[
  {
    "type": "chatml",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Tell me something about large language models."
      },
      {
        "role": "assistant",
        "content": "Large language models are a type of language model ..."
      }
    ],
    "source": "unknown"
  }
]
```
For data in the above format, the `columns` in `dataset_info.json` should be: