4318347d3f | ||
---|---|---|
.. | ||
belle_multiturn | ||
example_dataset | ||
hh_rlhf_en | ||
ultra_chat | ||
README.md | ||
README_zh.md | ||
alpaca_data_en_52k.json | ||
alpaca_data_zh_51k.json | ||
alpaca_gpt4_data_en.json | ||
alpaca_gpt4_data_zh.json | ||
comparison_gpt4_data_en.json | ||
comparison_gpt4_data_zh.json | ||
dataset_info.json | ||
lima.json | ||
oaast_rm.json | ||
oaast_rm_zh.json | ||
oaast_sft.json | ||
oaast_sft_zh.json | ||
self_cognition.json | ||
sharegpt_zh_27k.json | ||
wiki_demo.txt |
README.md
If you are using a custom dataset, please provide your dataset definition in the following format in dataset_info.json
.
"dataset_name": {
"hf_hub_url": "the name of the dataset repository on the HuggingFace hub. (if specified, ignore below 3 arguments)",
"script_url": "the name of the directory containing a dataset loading script. (if specified, ignore below 2 arguments)",
"file_name": "the name of the dataset file in the this directory. (required if above are not specified)",
"file_sha1": "the SHA-1 hash value of the dataset file. (optional)",
"columns": {
"prompt": "the name of the column in the datasets containing the prompts. (default: instruction)",
"query": "the name of the column in the datasets containing the queries. (default: input)",
"response": "the name of the column in the datasets containing the responses. (default: output)",
"history": "the name of the column in the datasets containing the history of chat. (default: None)"
}
}
where the prompt
and response
columns should contain non-empty values. The query
column will be concatenated with the prompt
column and used as input for the model. The history
column should contain a list where each element is a string tuple representing a query-response pair.
For datasets used in reward modeling or DPO training, the response
column should be a string list, with the preferred answers appearing first, for example:
{
"instruction": "Question",
"input": "",
"output": [
"Chosen answer",
"Rejected answer"
]
}