LLaMA-Factory-Mirror/data/README.md

If you are using a custom dataset, please provide your dataset definition in the following format in `dataset_info.json`.

```json
"dataset_name": {
  "hf_hub_url": "the name of the dataset repository on the Hugging Face Hub. (if specified, ignore the 3 arguments below)",
  "script_url": "the name of the directory containing a dataset loading script. (if specified, ignore the 2 arguments below)",
  "file_name": "the name of the dataset file in this directory. (required if the above are not specified)",
  "file_sha1": "the SHA-1 hash value of the dataset file. (optional)",
  "subset": "",
  "ranking": "whether the examples contain ranked responses or not. (default: false)",
  "formatting": "",
  "columns": {
    "prompt": "the name of the column in the datasets containing the prompts. (default: instruction)",
    "query": "the name of the column in the datasets containing the queries. (default: input)",
    "response": "the name of the column in the datasets containing the responses. (default: output)",
    "history": "the name of the column in the datasets containing the history of chat. (default: None)"
  }
}
```

In this definition, the `prompt` and `response` columns must contain non-empty values. The `query` column is concatenated with the `prompt` column and used as input for the model. The `history` column should contain a list in which each element is a tuple of two strings representing a query-response pair.
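
The column rules above can be illustrated with a minimal sketch. It is not the loader used by this repository: `build_example`, the newline separator, the file name `my_dataset.json`, and the column names `question`, `context`, `answer`, and `turns` are all assumptions made up for illustration. Only the default column names come from the definition format.

```python
import json

# Default column names, taken from the definition format above.
DEFAULT_COLUMNS = {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "history": None,
}


def build_example(record, columns=None, sep="\n"):
    """Map one raw record to (model_input, response, history).

    `sep` is an assumption: the README only says the query is
    concatenated with the prompt, not which separator is used.
    """
    cols = {**DEFAULT_COLUMNS, **(columns or {})}

    prompt = record.get(cols["prompt"]) or ""
    response = record.get(cols["response"]) or ""
    if not prompt or not response:
        raise ValueError("prompt and response columns must be non-empty")

    # The query is optional; when present it is appended to the prompt.
    query = record.get(cols["query"]) or ""
    model_input = prompt + sep + query if query else prompt

    history = []
    if cols["history"] is not None:
        # Each history element is a (query, response) pair of strings.
        history = [tuple(turn) for turn in record.get(cols["history"]) or []]

    return model_input, response, history


# Hypothetical entry as it might appear in dataset_info.json.
entry = json.loads("""
{
  "file_name": "my_dataset.json",
  "columns": {"prompt": "question", "query": "context",
              "response": "answer", "history": "turns"}
}
""")

# Hypothetical record using the custom column names declared above.
record = {
    "question": "Summarize the text.",
    "context": "LLaMA-Factory supports custom datasets.",
    "answer": "Custom datasets are supported.",
    "turns": [["Hi", "Hello, how can I help?"]],
}

model_input, response, history = build_example(record, entry["columns"])
print(model_input)
```

Calling `build_example` without a `columns` argument falls back to the default names `instruction`, `input`, and `output`, and returns an empty history.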

For datasets used in reward modeling or DPO training, the `response` column should be a string list, with the preferred answers appearing first, for example:

```json
{
  "instruction": "Question",
  "input": "",
  "output": [
    "Chosen answer",
    "Rejected answer"
  ]
}
```
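
A record in this ranked format could be checked and split as in the sketch below. The helper name `split_ranked` is hypothetical, and treating the first two list items as the chosen and rejected answers is an assumption based on the example above, not a description of the training code.

```python
def split_ranked(record, response_column="output"):
    """Return (chosen, rejected) from a ranked record.

    Assumes the response column holds a list of strings with the
    preferred answer first, as in the example above.
    """
    responses = record.get(response_column)
    if not isinstance(responses, list) or len(responses) < 2:
        raise ValueError("ranked response must be a list of at least two answers")
    if not all(isinstance(r, str) for r in responses):
        raise ValueError("every ranked response must be a string")

    # Preferred answers appear first, so index 0 is the chosen one.
    return responses[0], responses[1]


ranked_record = {
    "instruction": "Question",
    "input": "",
    "output": ["Chosen answer", "Rejected answer"],
}

chosen, rejected = split_ranked(ranked_record)
print(chosen, "|", rejected)
```

A plain string in the `output` column raises `ValueError` here, which makes it easy to spot a dataset whose entry sets `ranking` but whose records are not in ranked form.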