added the second sharegpt format

khazic 2024-04-28 14:27:45 +08:00
parent e898fabbe3
commit d1ba32e4bb
2 changed files with 53 additions and 5 deletions

@@ -94,20 +94,44 @@ Remember to set `"ranking": true` for the preference datasets.
Datasets in sharegpt format should follow one of the formats below:
```json
# The first sharegpt format
[
  {
    "conversations": [
      {
        "from": "human",
        "value": "user instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "system": "system prompt (optional)",
    "tools": "tool description (optional)"
  }
]
# The second sharegpt format
[
  {
    "type": "chatml",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Tell me something about large language models."
      },
      {
        "role": "assistant",
        "content": "Large language models are a type of language model ..."
      }
    ],
    "source": "unknown"
  }
]
```
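
The two layouts above carry the same information, so a record in the second format can be rewritten into the first. The following is a minimal sketch of that conversion, not the project's own loader: the function name `messages_to_conversations` and the role mapping (`user` to `human`, `assistant` to `gpt`, `system` moved to the top-level `system` field) are assumptions made for illustration.

```python
import json

# Hypothetical role mapping; an assumption, not taken from the project.
ROLE_TO_FROM = {"user": "human", "assistant": "gpt"}


def messages_to_conversations(record):
    """Convert one record of the second format into the first format."""
    converted = {"conversations": []}
    for message in record["messages"]:
        role, content = message["role"], message["content"]
        if role == "system":
            # The first format keeps the system prompt in a top-level field.
            converted["system"] = content
        elif role in ROLE_TO_FROM:
            converted["conversations"].append(
                {"from": ROLE_TO_FROM[role], "value": content}
            )
        else:
            raise ValueError(f"unexpected role: {role!r}")
    return converted


example = {
    "type": "chatml",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me something about large language models."},
        {"role": "assistant", "content": "Large language models are a type of language model ..."},
    ],
    "source": "unknown",
}

print(json.dumps(messages_to_conversations(example), indent=2))
```

The `type` and `source` fields are dropped in this sketch because the first format has no counterpart for them.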

@@ -37,7 +37,7 @@
----
This project currently supports datasets in the following formats: **alpaca** and **sharegpt**. Datasets in alpaca format are organized as follows:
```json
[
@@ -94,6 +94,7 @@
Datasets in sharegpt format are organized as follows:
```json
# The first sharegpt format
[
  {
    "conversations": [
@@ -110,6 +111,29 @@
    "tools": "tool description (optional)"
  }
]
# The second sharegpt format
[
  {
    "type": "chatml",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Tell me something about large language models."
      },
      {
        "role": "assistant",
        "content": "Large language models are a type of language model ..."
      }
    ],
    "source": "unknown"
  }
]
```
For data in the above format, the `columns` in `dataset_info.json` should be: