LLaMA-Factory-Mirror

Mark Mueller 1d3598afa1 Slim Orca data parsing		2024-02-08 19:32:20 +01:00
belle_multiturn	support full-parameter PPO	2023-11-16 02:08:04 +08:00
example_dataset	add template, modify datasets	2023-11-09 15:53:23 +08:00
hh_rlhf_en	add template, modify datasets	2023-11-09 15:53:23 +08:00
ultra_chat	support full-parameter PPO	2023-11-16 02:08:04 +08:00
README.md	Slim Orca data parsing	2024-02-08 19:32:20 +01:00
README_zh.md	fix autoset attn impl, update data readme	2024-01-31 11:58:07 +08:00
alpaca_data_en_52k.json	restore from git lfs	2023-08-01 16:33:25 +08:00
alpaca_data_zh_51k.json	fix #2282 and update tool prompt	2024-01-22 22:27:30 +08:00
alpaca_gpt4_data_en.json	restore from git lfs	2023-08-01 16:33:25 +08:00
alpaca_gpt4_data_zh.json	restore from git lfs	2023-08-01 16:33:25 +08:00
c4_demo.json	support autogptq in llama board #246	2023-12-16 16:31:30 +08:00
comparison_gpt4_data_en.json	restore from git lfs	2023-08-01 16:33:25 +08:00
comparison_gpt4_data_zh.json	restore from git lfs	2023-08-01 16:33:25 +08:00
dataset_info.json	Slim Orca data parsing	2024-02-08 19:32:20 +01:00
glaive_toolcall_10k.json	fix dataset	2024-01-18 12:59:30 +08:00
lima.json	restore from git lfs	2023-08-01 16:33:25 +08:00
oaast_rm.json	restore from git lfs	2023-08-01 16:33:25 +08:00
oaast_rm_zh.json	restore from git lfs	2023-08-01 16:33:25 +08:00
oaast_sft.json	restore from git lfs	2023-08-01 16:33:25 +08:00
oaast_sft_zh.json	restore from git lfs	2023-08-01 16:33:25 +08:00
self_cognition.json	restore from git lfs	2023-08-01 16:33:25 +08:00
wiki_demo.txt	update dataset	2023-11-17 23:19:12 +08:00

README.md

If you are using a custom dataset, please define it in dataset_info.json using the following format.

"dataset_name": {
  "hf_hub_url": "the name of the dataset repository on the Hugging Face hub. (if specified, script_url and file_name are ignored)",
  "ms_hub_url": "the name of the dataset repository on the ModelScope hub. (if specified, script_url and file_name are ignored)",
  "script_url": "the name of the directory containing a dataset loading script. (if specified, file_name is ignored)",
  "file_name": "the name of the dataset file in this directory. (required if none of the above are specified)",
  "file_sha1": "the SHA-1 hash value of the dataset file. (optional, does not affect training)",
  "subset": "the name of the subset. (optional, default: None)",
  "folder": "the name of the folder of the dataset repository on the Hugging Face hub. (optional, default: None)",
  "ranking": "whether the dataset is a preference dataset or not. (default: false)",
  "formatting": "the format of the dataset. (optional, default: alpaca, can be chosen from {alpaca, sharegpt})",
  "columns": {
    "prompt": "the column name in the dataset containing the prompts. (default: instruction)",
    "query": "the column name in the dataset containing the queries. (default: input)",
    "response": "the column name in the dataset containing the responses. (default: output)",
    "history": "the column name in the dataset containing the histories. (default: None)",
    "messages": "the column name in the dataset containing the messages. (default: conversations)",
    "system": "the column name in the dataset containing the system prompts. (default: None)",
    "tools": "the column name in the dataset containing the tool description. (default: None)"
  },
  "tags": {
    "role_tag": "the key in the message that represents the identity. (default: from)",
    "content_tag": "the key in the message that represents the content. (default: value)",
    "user_tag": "the value of the role_tag that represents the user. (default: human)",
    "assistant_tag": "the value of the role_tag that represents the assistant. (default: gpt)",
    "observation_tag": "the value of the role_tag that represents the tool results. (default: observation)",
    "function_tag": "the value of the role_tag that represents the function call. (default: function_call)",
    "system_tag": "the value of the role_tag that represents the system prompt. (default: None; incompatible with the system column)"
  }
}

Once the dataset is defined as above, you can use it by specifying --dataset dataset_name.

Currently we support datasets in the alpaca or sharegpt format. A dataset in the alpaca format should follow the format below:

[
  {
    "instruction": "user instruction (required)",
    "input": "user input (optional)",
    "output": "model response (required)",
    "system": "system prompt (optional)",
    "history": [
      ["user instruction in the first round (optional)", "model response in the first round (optional)"],
      ["user instruction in the second round (optional)", "model response in the second round (optional)"]
    ]
  }
]

For a dataset in the above format, the columns in dataset_info.json should be:

"dataset_name": {
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "system": "system",
    "history": "history"
  }
}

Here, the prompt and response columns must contain non-empty values, representing the instruction and the response respectively. The query column is concatenated with the prompt column and used as the model input.

The system column is used as the system prompt in the template. The history column is a list of string pairs, each representing a query-response pair from an earlier round. Note that the responses in every round are used for training.
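
The column semantics above can be sketched as a small function that turns one alpaca-format record into a list of (user, assistant) rounds. This is only an illustration, not the loader used by this repository: the function name alpaca_to_turns is hypothetical, and the newline used to join the prompt and query is an assumption, since the text above only says the two are concatenated.

```python
# Column mapping taken from the dataset_info.json example above.
ALPACA_COLUMNS = {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "system": "system",
    "history": "history",
}


def alpaca_to_turns(record, columns=ALPACA_COLUMNS, separator="\n"):
    """Convert one alpaca-format record into a system prompt and rounds.

    The separator between prompt and query is an assumption of this sketch.
    """
    prompt = record.get(columns["prompt"], "")
    response = record.get(columns["response"], "")
    if not prompt or not response:
        raise ValueError("prompt and response columns must be non-empty")

    # The optional query is appended to the prompt to form the user input.
    query = record.get(columns["query"], "")
    user_text = prompt + separator + query if query else prompt

    # Earlier rounds come first, followed by the current round.
    turns = [tuple(pair) for pair in record.get(columns["history"]) or []]
    turns.append((user_text, response))
    return {"system": record.get(columns["system"], ""), "turns": turns}
```

Every response in the returned turns list, including those from the history, corresponds to text that is used for training according to the note above.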

For pre-training datasets, only the prompt column is used for training.

For preference datasets, the response column should be a list of two strings, with the preferred answer first, for example:

{
  "instruction": "user instruction",
  "input": "user input",
  "output": [
    "chosen answer",
    "rejected answer"
  ]
}
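
A minimal check of this rule might look like the sketch below. The function name split_preference is hypothetical, and the default column name "output" simply follows the example above.

```python
def split_preference(record, response_column="output"):
    """Return (chosen, rejected) from a preference record, or raise ValueError."""
    answers = record.get(response_column)
    if not isinstance(answers, list) or len(answers) != 2:
        raise ValueError("response column must be a list of exactly two answers")
    if not all(isinstance(answer, str) for answer in answers):
        raise ValueError("both answers must be strings")
    # By convention the preferred answer comes first.
    chosen, rejected = answers
    return chosen, rejected
```

Running such a check over the whole file before training can catch records where the response is still a plain string rather than a two-element list.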

A dataset in the sharegpt format should follow the format below:

[
  {
    "conversations": [
      {
        "from": "human",
        "value": "user instruction"
      },
      {
        "from": "gpt",
        "value": "model response"
      }
    ],
    "system": "system prompt (optional)",
    "tools": "tool description (optional)"
  }
]

For a dataset in the above format, the columns in dataset_info.json should be:

"dataset_name": {
  "columns": {
    "messages": "conversations",
    "system": "system",
    "tools": "tools"
  },
  "tags": {
    "role_tag": "from",
    "content_tag": "value",
    "user_tag": "human",
    "assistant_tag": "gpt"
  }
}

Here, the messages column should be a list of even length that follows the u/a/u/a/u/a order, i.e. user and assistant messages alternate, starting with the user.
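
The ordering rule can be checked with a short sketch like the following. The function name validate_messages is hypothetical, the default tags mirror the tags example above, and the sketch deliberately covers only the user and assistant tags; how observation and function_call messages fit into the order is not handled here.

```python
def validate_messages(messages, role_tag="from", user_tag="human", assistant_tag="gpt"):
    """Check that messages have even length and alternate user/assistant."""
    if len(messages) % 2 != 0:
        raise ValueError("messages must contain an even number of entries")
    for index, message in enumerate(messages):
        # Even positions belong to the user, odd positions to the assistant.
        expected = user_tag if index % 2 == 0 else assistant_tag
        if message.get(role_tag) != expected:
            raise ValueError(
                f"message {index} should have {role_tag}={expected!r}, "
                f"got {message.get(role_tag)!r}"
            )
    return True
```

If a dataset uses different keys or role names, the same check applies after passing the values configured under tags in dataset_info.json.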

Pre-training datasets and preference datasets are not yet compatible with the sharegpt format.