first commit

This commit is contained in:
shengdinghu 2022-02-14 21:19:03 +08:00
commit b856ad0fb9
158 changed files with 13706 additions and 0 deletions

20
.gitignore vendored Normal file
View File

@ -0,0 +1,20 @@
data/
**/__pycache__/
logs/*
experiments/logs
!logs/.gitkeep
datasets/*
!datasets/*.sh
.vscode/
*.egg-info/
eggs/
.eggs/
*.egg
**.egg
build/
_build/
**/build/
outputs/
log.txt
**/DeltaHub/
*beans

29
.readthedocs.yaml Normal file
View File

@ -0,0 +1,29 @@
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
# Required
version: 2
# Set the version of Python and other tools you might need
build:
os: ubuntu-20.04
tools:
python: "3.9"
# You can also specify other tool versions:
# nodejs: "16"
# rust: "1.55"
# golang: "1.17"
# Build documentation in the docs/ directory with Sphinx
sphinx:
configuration: docs/conf.py
# If using Sphinx, optionally build your docs in additional formats such as PDF
# formats:
# - pdf
# Optionally declare the Python requirements required to build your docs
python:
install:
- requirements: docs/requirements.txt

94
README.md Normal file
View File

@ -0,0 +1,94 @@
<div align="center">
<img src="https://s4.ax1x.com/2022/02/14/Hy7lAf.png" width="350px">
**An Open-Source Framework for Parameter-Efficient Tuning.**
------
<p align="center">
<a href="#Overview">Overview</a>
<a href="#installation">Installation</a>
<a href="#Supported-Models">Supported Models</a>
<a href="https://opendelta.readthedocs.io/">Docs</a>
<a href="https://docs.google.com/spreadsheets/d/1BIVa8ocAPga-u7rBOXLYaTfaJSjI1dWfwohmLjmFDrY/edit?usp=sharing">Performance</a>
</p>
</div>
![version](https://img.shields.io/badge/version-v0.1.0-blue)
## Overview
OpenDelta is a toolkit for parameter-efficient tuning methods (we dub them *delta tuning*), with which users can flexibly assign (or add) a small amount of parameters to update while keeping most parameters frozen. With OpenDelta, users can easily implement prefix-tuning, adapters, LoRA, or any other type of delta tuning with their preferred PTMs.
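For example, a minimal sketch of adding LoRA to a Hugging Face backbone (mirroring the basic usage shown in the docs; the backbone checkpoint and modified modules here are just illustrative choices):
```python
from transformers import AutoModelForSequenceClassification
from opendelta import LoraModel

model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-base")
delta_model = LoraModel(backbone_model=model, modified_modules=["fc2"])  # attach LoRA to the chosen submodules
delta_model.freeze_module(exclude=["deltas"])  # freeze everything except the delta parameters
delta_model.log()  # visualize the modified backbone and other information
```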
## Installation
Create a virtual environment (optional)
```shell
conda create -n opendelta_env python=3.8
conda activate opendelta_env
```
### Using Pip
Our repo is tested on Python 3.6+ and PyTorch 1.8.1+. Install OpenDelta using pip as follows:
```shell
pip install opendelta
```
To play with the latest features, you can also install OpenDelta from source.
### Build from Source
```shell
git clone https://github.com/thunlp/OpenDelta.git
cd OpenDelta
```
#### Option 1: If you won't modify the code, run
```shell
python setup.py install
```
#### Option 2: If you want to modify the code, run
```shell
python setup.py develop
```
### Verified Supported Models
**You can try to use OpenDelta on any backbone model based on PyTorch.** However, there is a small chance that
the interface of the submodules of the backbone model is not supported. Therefore, we have verified some commonly
used models that OpenDelta is sure to support.
We will keep testing more and more emerging models.
Pull requests are welcome when you successfully apply OpenDelta to your own backbone model.
| | Lora | Bias<br>Tuning | Adapter<br>Houlsby | Adapter<br>Pfeiffer | Adapter<br>Drop | Adapter<br>Low-Rank | Compacter | Prefix<br>Tuning | Prompt<br>Tuning |
| --------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- | ----- |
| T5 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GPT-2 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| BART | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| DistilBERT | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| RoBERTa | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | |
| BERT | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| T5-3b(parallel)| ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Deberta-v2 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | |
| CTRL | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | |
| ViT | ✅ | | | | | | | | |
### Performance Checked Combination
Google sheet [here](https://docs.google.com/spreadsheets/d/1BIVa8ocAPga-u7rBOXLYaTfaJSjI1dWfwohmLjmFDrY/edit?usp=sharing)

20
docs/Makefile Normal file
View File

@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

35
docs/make.bat Normal file
View File

@ -0,0 +1,35 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd

20
docs/readme.md Normal file
View File

@ -0,0 +1,20 @@
# OpenDelta Documentation
To build this documentation locally, please first install the [Sphinx](https://www.sphinx-doc.org/en/master/) packages.
```
pip install sphinx
pip install sphinx_rtd_theme
pip install sphinx_copybutton
pip install sphinx_toolbox
pip install myst_parser
```
Then install OpenDelta either from source or from pip. After that,
```
cd docs
make html
```
Then open the generated `docs/build/html/index.html` in your local browser.

13
docs/requirements.txt Normal file
View File

@ -0,0 +1,13 @@
sphinx_copybutton
sphinx_rtd_theme
sphinx_toolbox
torch
transformers
sentencepiece==0.1.96
tqdm==4.62.2
openprompt
loralib
decorator
rich
myst_parser
web.py

View File

@ -0,0 +1,268 @@
/* a, */
.wy-menu-vertical header,
.wy-menu-vertical p.caption,
.wy-nav-top .fa-bars,
.wy-menu-vertical a:hover,
/* Colors and text decoration.
For example, :black:`text in black` or :blink:`text blinking` in rST. */
/* .black {
color: black;
}
.gray {
color: gray;
}
.grey {
color: gray;
}
.silver {
color: silver;
}
.white {
color: white;
}
.maroon {
color: maroon;
}
.red {
color: red;
}
.magenta {
color: magenta;
}
.fuchsia {
color: fuchsia;
}
.pink {
color: pink;
}
.orange {
color: rgba(218, 135, 12, 0.897);
} */
/* .string {
color: rgb(172, 51, 44);
} */
/* .yellow {
color: yellow;
}
.lime {
color: lime;
}
.green {
color: green;
}
.olive {
color: olive;
}
.teal {
color: teal;
}
.cyan {
color: cyan;
}
.aqua {
color: aqua;
}
.blue {
color: blue;
}
.navy {
color: navy;
}
.purple {
color: purple;
}
.under {
text-decoration: underline;
}
.over {
text-decoration: overline;
}
.blink {
text-decoration: blink;
}
.line {
text-decoration: line-through;
}
.strike {
text-decoration: line-through;
}
.it {
font-style: italic;
}
.ob {
font-style: oblique;
}
.small {
font-size: small;
}
.large {
font-size: large;
}
.smallpar {
font-size: small;
} */
a:link {
color: rgb(141, 99, 224)
}
a:visited {
color: rgb(141, 99, 224)
}
a:hover {
color: rgb(147, 47, 218)
}
.rst-content code.literal
{
color: rgb(172, 49, 42) !important;
/* #5360f0 */
}
.rst-content tt.literal
{
color: #f06b53 !important;
}
/* #a153f0 */
/* inspired by sphinx press theme */
.wy-menu.wy-menu-vertical li.toctree-l1.current > a {
border-left: solid 15px rgb(150, 92, 232) !important;
text-indent: -15px;
border-top: none;
border-bottom: none;
}
.wy-menu.wy-menu-vertical li.toctree-l1.current > ul {
border-left: solid 15px #ddcaf7 !important;
}
/* inspired by sphinx press theme */
.wy-nav-side {
color: unset !important;
background: unset !important;
border-right: solid 1px #ccc !important;
}
.wy-side-nav-search,
.wy-nav-top,
.wy-menu-vertical li,
.wy-menu-vertical li a:hover,
.wy-menu-vertical li a
{
background: unset !important;
}
.wy-menu-vertical li.current a {
border-right: unset !important;
}
.wy-side-nav-search div,
.wy-menu-vertical a {
color: #404040 !important;
}
.wy-menu-vertical button.toctree-expand {
color: #333 !important;
}
.wy-nav-content {
max-width: unset;
}
.rst-content {
max-width: 900px;
}
.wy-nav-content .icon-home:before {
content: "Docs";
}
.wy-side-nav-search .icon-home:before {
content: "";
}
dl.field-list {
display: block !important;
}
dl.field-list > dt:after {
content: "" !important;
}
dl.field-list > dt {
display: table;
padding-left: 6px !important;
padding-right: 6px !important;
margin-bottom: 4px !important;
padding-bottom: 1px !important;
background: rgb(252, 237, 208);
border-left: solid 2px rgb(231, 181, 134);
}
dl.py.class>dt
{
color: rgba(17, 16, 17, 0.822) !important;
background: rgb(247, 234, 252) !important;
border-top: solid 2px #b620d0 !important;
}
dl.py.method>dt
{
background: rgb(250, 239, 241) !important;
border-left: solid 2px rgb(199, 83, 106) !important;
}
dl.py.attribute>dt,
dl.py.property>dt
{
background: rgba(194, 233, 248, 0.1) !important;
border-left: solid 2px #58b5cc !important;
}
.fa-plus-square-o::before, .wy-menu-vertical li button.toctree-expand::before,
.fa-minus-square-o::before, .wy-menu-vertical li.current > a button.toctree-expand::before, .wy-menu-vertical li.on a button.toctree-expand::before
{
content: "";
}
.rst-content .viewcode-back,
.rst-content .viewcode-link
{
font-size: 120%;
}

View File

@ -0,0 +1,7 @@
document.addEventListener("DOMContentLoaded", function(event) {
    // Clicking a current top-level sidebar entry toggles the visibility of its sub-menu.
    document.querySelectorAll(".wy-menu.wy-menu-vertical > ul.current > li > a").forEach(a => a.addEventListener("click", e => {
        const f = document.querySelector(".wy-menu.wy-menu-vertical > ul.current > li > ul");
        if (f.style.display == 'none') { f.style.display = 'block'; } else { f.style.display = 'none'; }
    }));
    // Replace the default header anchor marker with a link emoji.
    document.querySelectorAll(".headerlink").forEach(a => a.text = "\u{1F517}");
});

144
docs/source/conf.py Normal file
View File

@ -0,0 +1,144 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
import sys
sys.path.insert(0, "../../")
import datetime
import sphinx_rtd_theme
import doctest
import opendelta
import opendelta.delta_models
# -- Project information -----------------------------------------------------
project = 'OpenDelta'
author = 'THUNLP OpenDelta Team'
copyright = '{}, {}, Licensed under the Apache License, Version 2.0'.format(datetime.datetime.now().year, author)
# The full version, including alpha/beta/rc tags
release = '0.1.1'
version = "0.1.1"
html_theme = 'sphinx_rtd_theme'
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
doctest_default_flags = doctest.NORMALIZE_WHITESPACE
autodoc_member_order = 'bysource'
intersphinx_mapping = {'python': ('https://docs.python.org/', None),
"torch": ("https://pytorch.org/docs/stable/", None),}
html_show_sourcelink = True
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.doctest',
'sphinx.ext.intersphinx',
'sphinx.ext.mathjax',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'sphinx.ext.githubpages',
'sphinx_copybutton',
'sphinx_toolbox.collapse',
'myst_parser',
]
myst_enable_extensions = [
"html_image",
"colon_fence",
"html_admonition",
"amsmath",
"dollarmath",
]
source_suffix = {
'.rst': 'restructuredtext',
'.txt': 'markdown',
'.md': 'markdown',
}
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
# exclude_patterns = []
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
# html_theme = 'alabaster'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_theme_options = {
# 'collapse_navigation': False,
# 'display_version': True,
#'logo_only': False,
'navigation_depth': 2,
}
html_static_path = ['_static']
html_css_files = ['css/custom.css']
html_js_files = ['js/custom.js']
rst_context = {'opendelta': opendelta}
# rst_epilog = "\n.. include:: .special.rst\n"
add_module_names = False
def include_only_tagged(app, what, name, obj, skip, options):
    inclusion_tag_format = "[NODOC]"  # can be any pattern here, choose what works for you
    for tag in app.tags.tags:
        if obj.__doc__ is not None and not obj.__doc__.startswith(inclusion_tag_format):
            return False
    return True
def skip2(app, what, name, obj, skip, options):
    members = [
        '__init__',
        '__repr__',
        '__weakref__',
        '__dict__',
        '__module__',
    ]
    return True if name in members else skip
def skip(app, what, name, obj, skip, options):
    skip = include_only_tagged(app, what, name, obj, skip, options) or \
        skip2(app, what, name, obj, skip, options)
    return skip
def setup(app):
    def rst_jinja_render(app, docname, source):
        src = source[0]
        rendered = app.builder.templates.render_string(src, rst_context)
        source[0] = rendered
    app.connect('autodoc-skip-member', skip)
    app.connect("source-read", rst_jinja_render)

BIN
docs/source/imgs/t5lora.png Normal file
Binary file not shown
(additional binary image files under docs/source/imgs/ are not shown)

54
docs/source/index.md Normal file
View File

@ -0,0 +1,54 @@
OpenDelta's documentation!
=====================================
OpenDelta is a **Plug-and-play** library for parameter-efficient fine-tuning ([delta-tuning](WhatisDelta)) of pre-trained models.
## Essential Advantages:
- <span style="color:rgb(81, 217, 245);font-weight:bold">Clean:</span> No need to edit the backbone PTM's code.
- <span style="color:orange;font-weight:bold">Simple:</span> Migrating from full-model tuning to delta-tuning requires as little as 3 lines of code.
- <span style="color:green;font-weight:bold">Sustainable:</span> Most evolution of the external libraries doesn't require a new version of OpenDelta.
- <span style="color:red;font-weight:bold">Extendable:</span> Various PTMs can share the same delta-tuning code.
- <span style="color:purple;font-weight:bold">Flexible:</span> Able to apply delta-tuning to (almost) any position in the PTMs.
```{eval-rst}
.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   notes/overview.md
   notes/installation.md
   notes/usage.md
   notes/visualization.md
   notes/saveload.md

.. toctree::
   :maxdepth: 1
   :caption: Advanced Usage

   notes/keyfeature.md
   notes/unifyname.md
   notes/autodelta.md
   notes/composition.md
   notes/pluginunplug.md
   notes/acceleration.md
   notes/explored_config.md
   notes/citation.md

.. toctree::
   :maxdepth: 2
   :caption: Package Reference

   modules/base
   modules/deltas
   modules/auto_delta
   modules/utils

Indices and tables
==================
* :ref:`genindex`
```

View File

@ -0,0 +1,14 @@
Auto Classes
======================================
AutoDeltaConfig
------------------------------------
.. autoclass:: opendelta.auto_delta.AutoDeltaConfig
   :members:

AutoDeltaModel
------------------------------------
.. autoclass:: opendelta.auto_delta.AutoDeltaModel
   :members:

View File

@ -0,0 +1,14 @@
Base Classes
======================================
BaseDeltaConfig
------------------------------------
.. autoclass:: opendelta.delta_configs.BaseDeltaConfig
   :members:

DeltaBase
------------------------------------
.. autoclass:: opendelta.basemodel.DeltaBase
   :members:

View File

@ -0,0 +1,46 @@
Delta Models
======================================
Lora
---------------------------------------
.. autoclass:: opendelta.LoraModel
   :members:

BitFit
---------------------------------------
.. autoclass:: opendelta.BitFitModel
   :members:

Adapter
---------------------------------------
.. autoclass:: opendelta.AdapterModel
   :members:

LowRankAdapter
---------------------------------------
.. autoclass:: opendelta.LowRankAdapterModel
   :members:

Compacter
---------------------------------------
.. autoclass:: opendelta.CompacterModel
   :members:

Prefix tuning
------------------------------------
.. autoclass:: opendelta.PrefixModel
   :members:

Soft Prompt Tuning
------------------------------------
.. autoclass:: opendelta.SoftPromptModel
   :members:

View File

@ -0,0 +1,45 @@
# Utils
## SaveLoadMixin
```{eval-rst}
.. autoclass:: opendelta.utils.saving_loading_utils.SaveLoadMixin
   :members:
```
## Visualization
```{eval-rst}
.. autoclass:: opendelta.utils.visualization.Visualization
   :members:
```
## Structure Map
```{eval-rst}
.. autoclass:: opendelta.utils.structure_mapping.CommonStructureMap
   :members:
```
## Utility Functions
### Hashing
```{eval-rst}
.. automodule:: opendelta.utils.model_md5
   :members:
```
### Signature
```{eval-rst}
.. automodule:: opendelta.utils.signature
   :members:
```
### Name-based addressing
```{eval-rst}
.. automodule:: opendelta.utils.name_based_addressing
   :members:
```

View File

@ -0,0 +1,6 @@
(acceleration)=
# OpenDelta+
<img src="../imgs/todo-icon.jpeg" height="30px"> We are working on testing and improving the functionality with work with other acceleration packages for model training and inference. For example, [deepspeed](https://github.com/microsoft/DeepSpeed), [BMInf](https://github.com/OpenBMB/BMInf).
Feel free to contact us via email (shengdinghu@gmail.com) if you have any suggestion.

View File

@ -0,0 +1,67 @@
(autodelta)=
# AutoDelta Mechanism
Inspired by [Huggingface transformers AutoClasses](https://huggingface.co/docs/transformers/v4.16.2/en/model_doc/auto#transformers.AutoModel), we provide AutoDelta features that allow users to
1. easily experiment with different delta models;
2. quickly deploy from a configuration file, especially from the repos in [DeltaHub](https://huggingface.co/DeltaHub).
## Easily load from a dict, making it easy to change the type of delta model
```python
from opendelta import AutoDeltaConfig, AutoDeltaModel
from transformers import T5ForConditionalGeneration
backbone_model = T5ForConditionalGeneration.from_pretrained("t5-base")
```
We can load a config from a dict
```python
config_dict = {
"delta_type":"lora",
"modified_modules":[
"SelfAttention.q",
"SelfAttention.v",
"SelfAttention.o"
],
"lora_r":4}
delta_config = AutoDeltaConfig.from_dict(config_dict)
```
Then use the config to add a delta model to the backbone model
```python
delta_model = AutoDeltaModel.from_config(delta_config, backbone_model=backbone_model)
# now visualize the modified backbone_model
from opendelta import Visualization
Visualization(backbone_model).structure_graph()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/t5lora.png
---
width: 600px
name: t5lora
---
```
````
## Quickly deploy from a finetuned delta checkpoint on DeltaHub
```python
delta_model = AutoDeltaModel.from_finetuned("DeltaHub/sst2-t5-base", backbone_model=backbone_model) # TODO: the link may change.
```
<div class="admonition note">
<p class="title">**Hash checking**</p>
<p>
Since the delta model only works together with the backbone model, we will automatically check whether you load the delta model the same way it was trained.
</p>
<p>
We calculate the trained model's [md5](http://some_link) and save it to the config. After loading the delta model, we re-calculate the md5 to see whether it has changed.
</p>
<p>Pass `check_hash=False` to disable the hash checking.</p>
</div>
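For illustration, a minimal sketch of how such a parameter hash could be computed (only a sketch, not the actual `opendelta.utils.model_md5` implementation):
```python
import hashlib
import torch

def state_dict_md5(model: torch.nn.Module) -> str:
    """Hash a model's state dict; illustration only."""
    md5 = hashlib.md5()
    for name, tensor in sorted(model.state_dict().items()):
        md5.update(name.encode("utf-8"))
        md5.update(tensor.detach().cpu().numpy().tobytes())
    return md5.hexdigest()
```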

View File

@ -0,0 +1,3 @@
# Citation
<img src="../imgs/todo-icon.jpeg" height="30px"> We are working on a technical report.

View File

@ -0,0 +1,52 @@
(composition)=
# Composition of delta models
With OpenDelta, you can perform composition of different delta models.
### Add different deltas to the backbone
```
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
from opendelta import LoraModel, AdapterModel
delta_model = LoraModel(backbone_model=model, modified_modules=['key'], lora_r=1)
delta_model2 = AdapterModel(backbone_model=model, modified_modules=['output'], bottleneck_dim=12)
delta_model.log()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/composition_of_delta.png
---
width: 600px
name: defaultmodification
---
```
````
### Even add multiple deltas to the same layer
```
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-base")
from opendelta import AdapterModel, LowRankAdapterModel
delta_model = AdapterModel(backbone_model=model, modified_modules=['fc2'])
delta_model2 = AdapterModel(backbone_model=model, modified_modules=['fc2'], bottleneck_dim=12)
delta_model3 = LowRankAdapterModel(backbone_model=model, modified_modules=['fc2'], reduction_factor=12)
delta_model.log()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/multiple_to_one_layer.png
---
width: 600px
name: defaultmodification
---
```
````
:::{admonition} Order of Insertion
:class: warning
**When adding deltas to the same layer, please pay attention to the order of insertion.** In the above example, the deltas are added after `fc2`, so the tensor will first go through `adapter`, then `adapter_1`, and finally the low-rank adapter. If the deltas are added before the backbone layer, then the last added delta will be the first to go through.
Also, pay attention to the detaching order: the delta that is added first should be the last to be detached.
:::
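Continuing the example above, a short sketch of detaching in the reverse order of insertion (using the three delta models created in the previous snippet):
```python
# Detach in the reverse order of insertion.
delta_model3.detach()  # the low-rank adapter, added last
delta_model2.detach()  # the second adapter
delta_model.detach()   # the first adapter
delta_model.log()      # inspect the backbone after detaching
```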

View File

@ -0,0 +1,11 @@
(favoredconfiguration)=
# Favored Configuration
<img src="../imgs/todo-icon.jpeg" height="30px"> We will add the commonly used configuration of delta models HERE in future.
E.g.
- the modified_modules (position of delta),
- hyperparameter that are the most efficient
- the favored composition between delta models
Currenlty, use the default setting, explore it by yourself, or refer to existing papers' configuration!

View File

@ -0,0 +1,24 @@
(installation)=
# Installation
OpenDelta is tested on [Python 3.8](https://www.python.org/) and [PyTorch 1.9](https://pytorch.org/).
```bash
pip install opendelta
```
or from source
```bash
git clone https://github.com/thunlp/OpenDelta.git
cd OpenDelta
python setup.py install
```
If you want to make some modifications to the code for your research, run
```bash
git clone https://github.com/thunlp/OpenDelta.git
cd OpenDelta
python setup.py develop
```

View File

@ -0,0 +1,200 @@
(keyfeature)=
# Philosophy and Key Features
:::{admonition} Plug-and-play Design.
:class: tip
An existing open-source project that propagates this **''delta-tuning''** paradigm is
<a href="https://adapterhub.ml">AdapterHub</a>, which copies the transformers code base and modifies it, making it unintuitive to move from a normal code base to a delta-tuning one.
OpenDelta approaches this problem in a **true plug-and-play** fashion for the PLMs. To migrate from a full-model finetuning training script to a delta-tuning training script, you **DO NOT** need to change the backbone model's code base to an adapted one.
:::
Here is how we achieve it.
<img src="../imgs/pointing-right-finger.png" height="30px"> **Reading through this section will also help you implement your own delta models in a sustainable way.**
(namebasedaddr)=
## 1. Name-based submodule addressing.
We locate the submodules to which we want to apply a delta layer via name-based addressing.
In PyTorch, a submodule can be accessed from the root model via 'dot' addressing. For example, we define a toy language model:
```python
import torch.nn as nn
class MyNet1(nn.Module):
    def __init__(self,):
        super().__init__()
        self.name_a = nn.Linear(5, 5)
    def forward(self, hiddens):
        return self.name_a(hiddens)
class MyNet2(nn.Module):
    def __init__(self,):
        super().__init__()
        self.embedding = nn.Embedding(10, 5)
        self.name_b = nn.Sequential(MyNet1(), MyNet1())
    def forward(self, input_ids):
        hiddens = self.embedding(input_ids)
        return self.name_b(hiddens)
root = MyNet2()
print(root.name_b[0].name_a)
# Linear(in_features=5, out_features=5, bias=True)
```
We can visualize the model (For details, see [visualization](visualization))
```python
from opendelta import Visualization
Visualization(root).structure_graph()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/name_based_addressing.png
---
width: 500px
name: name_based_addressing
---
```
````
In this case, the string `"name_b.0.name_a"` is the name used to address the submodule from the root model.
Thus, we use it when applying a delta model to this toy net:
```
from opendelta import AdapterModel
AdapterModel(backbone_model=root, modified_modules=['name_b.0.name_a'])
Visualization(root).structure_graph()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/toy-delta.png
---
width: 500px
name: toy-delta
---
```
````
### Making addressing easier.
Handcrafting the full names of submodules can be frustrating, so we made some simplifications.
1. End-matching rules.
OpenDelta will take every module that
**ends with** the provided name suffix as the modification [target module](target_module).
:::{admonition} Example
:class: tip
Taking DistilBERT with a classifier on top as an example:
- Setting `modified_modules` to `["0.attention.out_lin"]` will add delta modules to the attention output of DistilBERT's
layer 0, i.e., `distilbert.transformer.layer.0.attention.out_lin`.
- Setting it to `["attention.out_lin"]` will add delta modules to every layer's `attention.out_lin`.
:::
2. Regular expressions.
<img src="../imgs/todo-icon.jpeg" height="30px"> Unit tests and docs to come later.
3. Interactive selection.
We provide a way to visually and interactively select the modules you need.
```python
from transformers import BertForMaskedLM
model = BertForMaskedLM.from_pretrained("bert-base-cased")
# suppose we load BERT
from opendelta import LoraModel # use lora as an example, others are same
delta_model = LoraModel(backbone_model=model, interactive_modify=True)
```
By setting `interactive_modify`, a web server will be opened on localhost, and the link will be printed in the terminal.
```
http://0.0.0.0:8888/
```
If you are on your local machine, click the link to open the interactive modification page.
If you are on a remote host, you can use port mapping. For example, the VS Code terminal automatically does port mapping for you, so you can simply `control/command + click` the link to open it.
If the default port is occupied by another program, you can change it by setting `interactive_modify=port_number`, where `port_number` is an integer.
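For example, a sketch of specifying a custom port (the port value here is arbitrary):
```python
delta_model = LoraModel(backbone_model=model, interactive_modify=8899)  # serve the selection page on port 8899 instead of the default
```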
The web page looks like the following figure.
```{figure} ../imgs/interact.jpg
---
width: 500px
name: interact web page
---
```
- Click `[+]`/`[-]` to expand or collapse tree nodes.
- Click on the text to select tree nodes; a **yellow dotted** box indicates the selection.
- **Double-click** on the pink `[*]` as an advanced option to unfold repeated nodes. By default, modules with the same architecture are folded into one node and marked in red; for example, the `BertLayer`s of layers 0~11 in the above figure share the same structure. Regular modifications will make the same change to each of these layers.
- If you want to change only a few of them, first double-click on `[*]`, then select the parts you want in the unfolded structure.
- If you want to make the same change to all but a few of them, first select the common parts you want in the folded structure, then double-click on `[*]` and remove the few positions you don't need to change in the expanded structure.
Click the `submit` button in the top-right corner, then go back to your terminal. You will get a list of name-based addresses printed in the following format; these are the modules to which the delta will be applied.
```
modified_modules:
[bert.encoder.layer.0.output.dense, ..., bert.encoder.layer.11.output.dense]
```
## 2. Three basic submodule-level delta operations.
We use three key functions to modify the backbone model from outside the backbone model's code.
1. **Unfreeze some parameters**
Some delta models unfreeze part of the model parameters and freeze the rest, e.g., [BitFit](https://arxiv.org/abs/2106.10199). For these methods, just use the [freeze_module](opendelta.basemodel.DeltaBase.freeze_module) method and pass the delta parts into `exclude` (see the user-level sketch after this list).
2. **Replace a module**
Some delta models replace part of the model with a delta module, i.e., the hidden states will no longer go through the original submodule. This includes [Lora](https://arxiv.org/abs/2106.09685).
For these methods, we have a [replace_module](opendelta.basemodel.DeltaBase.replace_module) interface.
3. **Insertion into the backbone**
- **Sequential insertion**
Most adapter models insert a new adapter layer after/before the original transformer blocks. For these methods, insert the adapter's forward function after/before the original layer's forward function using the [insert_sequential_module](opendelta.basemodel.DeltaBase.insert_sequential_module) interface.
- **Parallel insertion**
Adapters can also be used in a parallel fashion (see the [paper](https://arxiv.org/abs/2110.04366)).
For these methods, use the [insert_parallel_module](opendelta.basemodel.DeltaBase.insert_parrellel_module) interface.
:::{admonition} Doc-preserving Insertion
:class: note
In the insertion operations, the replaced forward function will inherit the doc strings of the original functions.
:::
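As a user-level sketch of the first operation, this is roughly what freezing everything except the delta parameters looks like (the BitFit model and BERT backbone are borrowed from examples elsewhere in these docs):
```python
from transformers import BertForMaskedLM
from opendelta import BitFitModel

model = BertForMaskedLM.from_pretrained("bert-base-cased")
delta_model = BitFitModel(backbone_model=model)  # unfreeze/add bias terms at the default positions
delta_model.freeze_module(exclude=["deltas"])    # freeze the rest of the backbone
delta_model.log()
```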
## 3. Pseudo input to initialize.
Some delta models, especially those newly introduced into the backbone, need to determine the shapes of their parameters. To get the shapes, we pass a pseudo input through the backbone model and determine the shape of each delta layer so that tensors can flow through smoothly.
:::{admonition} Pseudo Input
:class: warning
Most models in [Huggingface Transformers](https://huggingface.co/docs/transformers/index) have an attribute [dummy_inputs](https://github.com/huggingface/transformers/blob/v4.16.2/src/transformers/modeling_utils.py#L464). This will create a nonsensical input with the correct format to pass into the model's forward function.
For models that don't inherit/implement this attribute, we assume the pseudo input to the model is something like `input_ids`, i.e., an integer tensor.
```python
pseudo_input = torch.tensor([[0,0,0]])
# or
pseudo_input = torch.tensor([0,0,0])
```
<img src="../imgs/todo-icon.jpeg" height="30px"> We will add interface to allow more pseudo input in the future.
:::

View File

@ -0,0 +1,2 @@

View File

@ -0,0 +1,36 @@
# What is Delta-tuning and Why OpenDelta?
(WhatisDelta)=
:::{admonition} What is Delta?
:class: tip
As pre-trained language models (PLMs) have become the fundamental infrastructure for many NLP tasks and benchmarks, it is increasingly clear from recent research that **larger models tend to lead to better performance**. However, large-scale PLMs also bring prohibitive adaptation costs when fine-tuning all the parameters of a model and retaining separate instances for different tasks.
**Parameter-efficient model stimulation methods** have thus attracted researchers' attention. These methods tune only a small fraction of the model parameters while achieving performance comparable to or even better than full-model fine-tuning, and are dubbed "Delta-tuning".
**Delta** thus refers to a small fraction $\Delta\Theta$ of parameters besides the pretrained model $\Theta_0$.
\begin{gather*}
\Theta \sim \Theta_0\text{(frozen)} + \Delta\Theta\text{(tunable)}
\end{gather*}
This open-source project implements several delta-tuning methods, allowing researchers and engineers to quickly migrate their code from full-model tuning to delta-tuning without replacing the backend (the implementation of the backbone PLM).
:::
## Why OpenDelta?
- <span style="color:rgb(81, 217, 245);font-weight:bold">Clean:</span> No need to edit the backbone PTMs codes.
- <span style="color:orange;font-weight:bold">Simple:</span> Migrating from full-model tuning to delta-tuning needs as little as 3 lines of codes.
- <span style="color:green;font-weight:bold">Sustainable:</span> Most evolution in external library doesnt require a new OpenDelta.
- <span style="color:red;font-weight:bold">Extendable:</span> Various PTMs can share the same delta-tuning codes.
- <span style="color:purple;font-weight:bold">Flexible:</span> Able to apply delta-tuning to (almost) any position of the PTMs.
## Delta-tuning papers
<img src="../imgs/todo-icon.jpeg" height="30px">

View File

@ -0,0 +1,113 @@
# Multitask Modeling using OpenDelta
:::{admonition} Multitask Serving with Delta-tuning
:class: tip
A huge advantage of delta-tuning is that it can be used for multitask serving.
Imagine we have a pretrained model trained on a mix of data from multiple languages, e.g., English, Chinese, and French. Now you want separate models that specialize in Chinese, French, and English. We can thus delta-tune three deltas, one per language, with a small amount of additional language-specific data. During serving, when a Chinese sentence comes in, you attach the "Chinese Delta"; when a French sentence comes next, you detach the "Chinese Delta" and attach the "French Delta".
:::
**Here is how to achieve multitask serving using OpenDelta.**
```python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-base")
from opendelta import LoraModel
delta_model = LoraModel(backbone_model=model, modified_modules=['fc2'])
delta_model.log()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/plugunplug1.png
---
width: 800px
name: defaultmodification
---
```
````
Now we detach the deltas from the backbone
```python
delta_model.detach()
delta_model.log()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/plugunplug2.png
---
width: 800px
name: defaultmodification
---
```
````
We can reattach the deltas to the backbone
```python
delta_model.attach()
delta_model.log()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/plugunplug3.png
---
width: 800px
name: defaultmodification
---
```
````
:::{admonition} Independence of Different Delta Models
:class: note
Different delta models will be independent in detaching and attaching.
(But the visualization will not show all deltas in the backbone model.)
```python
# continue from the above example
from opendelta import AdapterModel
delta_model2 = AdapterModel(backbone_model=model, modified_modules=['fc1'])
delta_model2.log()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/plugunplug4.png
---
width: 800px
name: defaultmodification
---
```
````
detach the lora delta
```python
delta_model.detach() # detach the lora delta
delta_model.log()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/plugunplug5.png
---
width: 800px
name: defaultmodification
---
```
````
detach the adapter delta and reattach the lora delta
```python
delta_model2.detach() # detach the adapter delta
delta_model.attach() # reattach the lora delta
delta_model.log()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/plugunplug6.png
---
width: 800px
name: defaultmodification
---
```
````
:::
:::{admonition} BitFit not supported
:class: warning
<img src="../imgs/todo-icon.jpeg" height="30px"> Currently detach is not suitable for BitFit, which modify the requires_grad property. Please wait for future releases.
:::

View File

@ -0,0 +1,98 @@
(saveload)=
# Save and Share the Delta
## Space-efficient saving without changing the code.
After a modified backbone model is trained, you can save only the trained part without changing any code, because **the state dict of the backbone model has been changed to contain only the trainable parts**.
```python
from opendelta import CompacterModel
from transformers import BertForMaskedLM
backbone_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
delta_model = CompacterModel(backbone_model) # modify the default modules.
# freeze module
delta_model.freeze_module(exclude=["deltas"], set_state_dict=True)
# or
delta_model.freeze_module(exclude=["deltas"])
```
### save the checkpoint.
Now save the `backbone_model` in the normal way; the checkpoint is **very space-efficient**.
```python
# ...
# After some training pipeline
# ...
torch.save(backbone_model.state_dict(), "delta.ckpt")
# the checkpoint size
import os
print("checkpoint size: {:.2f}M".format(os.path.getsize("delta.ckpt")/1024**2))
# checkpoint size: 0.32M
```
### load the checkpoint.
In order to load the checkpoint, you should make sure the backbone model is a modified one (so that it can take in the delta parameters).
Then load the checkpoint with `strict=False`.
```python
backbone_model.load_state_dict(torch.load("delta.ckpt"), strict=False)
# This will return a long string of warnings about 'missing keys'.
# If you want to suppress it, use
# _ = backbone_model.load_state_dict(torch.load("delta.ckpt"), strict=False)
```
## Save/Load the entire model after training.
### save a delta model.
```python
delta_model.save_finetuned("delta_model")
# Configuration saved in delta_model/config.json
# Model weights saved in delta_model/pytorch_model.bin
```
This will save all the trained parameters and the configuration of the delta model to path `delta_model/`
### load a delta model.
```python
backbone_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
delta_model.from_finetuned("delta_model", backbone_model, local_files_only=True)
# Passing local_files_only=True skips checking the web and saves time.
```
## Share or download a model to/from the community.
### Share.
```python
delta_model.save_finetuned("test_delta_model", push_to_hub = True)
```
### Download from community.
```python
from transformers import AutoModelForSeq2SeqLM
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
from opendelta import AutoDeltaModel
delta = AutoDeltaModel.from_finetuned("DeltaHub/lora_t5-base_mrpc", backbone_model=t5)
delta.log()
```
<div class="admonition tip">
<p class="title">**Push to Hub**</p>
<p> Currently we only provide the option to push to the Hugging Face Model Hub.</p>
<p> Before pushing to the hub, you may need to register an account on Hugging Face. You can refer to this [tutorial about model sharing and uploading](https://huggingface.co/docs/transformers/model_sharing).
</p>
<p> In some cases, your checkpoint may still be too large for git; please install [`git-lfs`](https://git-lfs.github.com).
</p>
</div>
:::{admonition} **Sharing with the Community**
:class: tip
If you are satisfied with your checkpoint, do not forget to share your model with <a href="https://huggingface.co/DeltaHub">DeltaHub</a>:
1. Add yourself to DeltaHub with the [public link](https://huggingface.co/organizations/DeltaHub/share/QzkBuLSmlVnNhQqHYnekoTXwSRkoRHBwZA).
2. Be sure to edit your model card to clearly illustrate the delta model before you share it.
3. Click `setting` on the model.
4. Transfer the model in the `rename or transfer this model` section.
:::
## Save & Load for Composition of Delta
<img src="../imgs/todo-icon.jpeg" height="30px"> Currently save & load method is not suitable for [composition of delta model](compositon). Please wait for future releases.

View File

@ -0,0 +1,82 @@
(unifyname)=
# Unified Name Convention
```{figure} ../imgs/transformers_structure.png
:width: 400px
:name: transformers_structure
```
Although different PTMs often share a similar Transformer structure, their codebases, and most importantly the variable names of their submodules, are quite different.
On the one hand, we **encourage users to first [visualize](visualization) the PTM's structure and then determine the names of the submodules.**
On the other hand, we designed a unified name convention for the Transformer structure and provide several structure mappings from the original names to the unified name convention.
In this section, we illustrate the unified name convention and the structure mapping.
## Common blocks in the Transformer structure.
- embeddings (word embedding)
- encoder
- block
- $ (layer_id)
- attn
- q, k, v
- proj
- layer_norm
- ff
- w1
- w2
- layer_norm
- decoder (similar to encoder)
- lm_head
- proj
Visualizing bert-base using the common structure names: the submodules that are not common are grey.
```{figure} ../imgs/commonstructure_vis.png
:width: 600px
:name: transformers_structure
```
(commonstructure)=
## Mappings
Example of bert mapping: a tree with node names specified by <span style="font-weight:bold;color:rgb(55, 125, 34);" >"\_\_name\_\_"</span>
```json
{
"bert.embeddings.word_embeddings": {"__name__":"embeddings"},
"bert.embeddings.position_embeddings": {"__name__":""},
"bert.embeddings.token_type_embeddings": {"__name__":""},
"bert.embeddings.LayerNorm": {"__name__":""},
"bert.encoder": {"__name__":"encoder",
"layer": {"__name__":"block",
"$": {"__name__":"$",
"attention": {"__name__":"attn",
"self.query": {"__name__":"q"},
"self.key": {"__name__":"k"},
"self.value": {"__name__":"v"},
"output.dense": {"__name__":"proj"},
"output.LayerNorm": {"__name__":"layer_norm"},
},
"output": {"__name__":"ff",
"dense": {"__name__":"w2"},
"LayerNorm": {"__name__":"layer_norm"}
},
"intermediate.dense": {"__name__":"ff.w1"},
}
}
},
"cls.predictions": {"__name__": "lm_head",
"transform.dense": {"__name__":""},
"transform.LayerNorm": {"__name__":""},
"decoder": {"__name__":"proj"},
}
}
```

137
docs/source/notes/usage.md Normal file
View File

@ -0,0 +1,137 @@
(basics)=
# Basic Usage
Now we introduce the general pipeline to migrate your full-model tuning scripts to delta-tuning ones.
## STEP 1: Load the pretrained models
```python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-base") # suppose we load BART
```
## STEP 2: Add delta modules
We provide two alternatives to add the delta modules.
### 2.1 Modification based on visualization
Suppose we want to use the feedforward layer of each block as our [modification target module](target_module).
We should first find out the name of the feedforward layer in the BART model by visualization. <img src="../imgs/hint-icon-2.jpg" height="30px"> *For more about visualization, see [Visualization](visualization).*
```python
from opendelta import Visualization
Visualization(model).structure_graph()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/bart-base.png
---
width: 600px
name: bart-base
---
```
````
We can see from the structure graph that the feed-forward layers in BART are called `model.encoder.layers.$.fc1` and `model.encoder.layers.$.fc2`, where
`$` represents a number from 0 to 5. Since we want to apply the adapter after *all* the feed-forward layers, we specify `modified_modules=['fc2']`, which is the common suffix of the feed-forward layers.
<img src="../imgs/hint-icon-2.jpg" height="30px"> *For details about name-based addressing, see [Name-based submodule addressing](namebasedaddr).*
Other configurations, such as the `bottleneck_dim` of the Adapter, can be passed as keyword arguments.
```python
from opendelta import AdapterModel
delta_model = AdapterModel(backbone_model=model, modified_modules=['fc2'], bottleneck_dim=12)
delta_model.log() # This will visualize the backbone after modification and other information.
```
(target_module)=
:::{admonition} Target module
:class: note
For different delta methods, the operation applied to the modification target is different.
- Adapter-based methods: insert at the target module's forward function.
- BitFit: add bias to all allowed positions of the target module.
- Lora: substitute all the linear layers of the target module with [Lora.Linear](https://github.com/microsoft/LoRA/blob/main/loralib/layers.py#L92).
:::
### 2.2 Use the default modification.
We also provide default modifications of each delta method for some commonly used PTMs (e.g., BERT, RoBERTa, DistilBERT, T5, GPT-2), so users don't need to specify the submodules to modify.
The default modifications are achieved by a [common_structure mapping](commonstructure), that is, a mapping from a module's name to its name in a common Transformer structure. <img src="../imgs/hint-icon-2.jpg" height="30px"> *For details about the default modification, see [Unified Name Convention](unifyname).*
```python
# a separate example using BERT.
from transformers import BertForMaskedLM
from opendelta import AdapterModel
model = BertForMaskedLM.from_pretrained("bert-base-cased")
delta_model = AdapterModel(model) # This will apply adapter to the self-attn and feed-forward layer.
delta_model.log()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/defaultmodification.png
---
width: 600px
name: defaultmodification
---
```
````
:::{admonition} Delta model vs Backbone model
:class: note
The delta_model **CANNOT** be used alone; its [forward](opendelta.basemodel.DeltaBase.forward) is canceled. The training pipeline should be conducted on the backbone model (in the above example, that is `model`).
:::
:::{admonition} Try different positions
:class: tip
OpenDelta provides the flexibility to add deltas at different positions in the backbone model. For example, if you want to move the adapter in the above example to after the layer norm of the feed-forward layer, the code should be changed into
```python
# continue with the BART example, but not used later.
delta_model = AdapterModel(backbone_model=model, modified_modules=['final_layer_norm'], bottleneck_dim=12)
```
The performance may vary due to positional differences, but there is no academic guarantee that one will outperform the other.
:::
:::{admonition} Favored Configurations
:class: tip
Feel confused about the flexibility that OpenDelta brings? NO WORRY! We will add [Favored Configurations](favoredconfiguration) soon.
:::
## STEP 3: Freezing parameters
The main part of the backbone model is not automatically frozen (we may add this option in the future). To freeze the main part of the backbone model except the trainable parts (usually the delta parameters), use the [freeze_module](opendelta.basemodel.DeltaBase.freeze_module) method. The `exclude` field obeys the same name-based addressing rules as the `modified_modules` field.
```python
# continue with the BART example
delta_model.freeze_module(exclude=["deltas", "layernorm_embedding"], set_state_dict=True)
delta_model.log()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/afterfreeze.png
---
width: 600px
name: afterfreeze
---
```
````
Setting `set_state_dict=True` tells the method to change the `state_dict` of the `backbone_model` to maintain only the trainable parts.
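For example (mirroring the [save/load guide](saveload)), after freezing with `set_state_dict=True`, saving the backbone the normal way produces a checkpoint that contains only the trainable parts:
```python
import torch

torch.save(model.state_dict(), "delta.ckpt")  # only the trainable (delta) parameters are stored
```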
## STEP 4: Normal training pipeline
The **model** can then be trained with traditional training scripts (a minimal sketch follows the note below). Two things should be noted:
:::{admonition} Note
:class: note
1. No need to change the optimizer, since the optimizer will only calculate and store gradients for the parameters with `requires_grad=True`, and the `requires_grad` attribute has already been changed during the call to the [freeze_module](opendelta.basemodel.DeltaBase.freeze_module) method.
2. `model.eval()` or `model.train()` should be used when needed to set dropout, etc. The delta model doesn't touch those configurations.
:::
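A minimal sketch of such a pipeline (assuming `train_dataloader` is your own dataloader yielding batches that the backbone accepts, including labels):
```python
import torch

# No need to change the optimizer: frozen parameters have requires_grad=False,
# receive no gradients, and are therefore skipped by the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for batch in train_dataloader:   # placeholder: your own data pipeline
    outputs = model(**batch)     # the backbone model is trained as usual
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```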
## STEP 5: Save/Share the Delta Model
<img src="../imgs/hint-icon-2.jpg" height="30px"> *See [Save a delta model locally, or share it with the community](saveload).*

View File

@ -0,0 +1,125 @@
(visualization)=
# Visualize the Parameters
When OpenDelta makes modifications to a pretrained model (PTM), it is beneficial to know what your PTM looks like, especially the location of the parameters.
- **Before** applying OpenDelta, you can see **how to specify your modifications in terms of key addressing**.
- **After** the modification is done, you can check **whether your modification is what you expected**, for example, whether the positions of the delta
modules are as desired, or whether you froze the correct parameters.
Now let's begin to try the visualization utility.
## Visualization is NOT easy using PyTorch's native functions.
```python
from transformers import BertForMaskedLM
backbone_model = BertForMaskedLM.from_pretrained("bert-base-uncased")
print(backbone_model)
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/raw_print.png
---
width: 600px
name: raw_print
---
```
````
The original presentation of models is **not tailored for repeated structures, big models, or parameter-centric tasks**.
## Using visualization from opendelta.
First, let's visualize all the parameters in the BERT model. As we can see, the structure inside a BERT model and the locations of all its parameters are neatly represented in a tree structure. (See the [color scheme](color_schema) for the colors.)
```python
from opendelta import Visualization
model_vis = Visualization(backbone_model)
model_vis.structure_graph()
```
<!-- ````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span> -->
```{figure} ../imgs/bert_vis.png
---
width: 600px
name: bert_vis
---
```
<!-- ```` -->
<div class="admonition note">
<p class="title">**Suggestion**</p>
We can easily reference a module according to the graph:
```python
print(backbone_model.bert.encoder.layer[0].intermediate)
```
When using OpenDelta on a new backbone model, it's better to first visualize the child module names (shown in white), and then designate the `modified_modules`.
</div>
## Now add a delta model and visualize the change.
```python
from opendelta import LowRankAdapterModel
delta_model = LowRankAdapterModel(backbone_model)
delta_model.freeze_module(exclude=["cls", "intermediate", "LayerNorm"])
Visualization(backbone_model).structure_graph()
```
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/bertdelta_vis.png
---
width: 600px
name: bertdelta_vis
---
```
````
(color_schema)=
<div class="admonition tip">
<div class="title">**Color Schema**</div>
<ul>
<li> The <span style="font-weight:bold;color:white;">white</span> part is the name of the module.</li>
<li> The <span style="font-weight:bold;color:green;">green</span> part is the module's type.</li>
<li> The <span style="font-weight:bold;color:blue;">blue</span> part is the tunable parameters, i.e., the parameters that require grad computation.</li>
<li> The <span style="font-weight:bold;color:grey;">grey</span> part is the frozen parameters, i.e., the parameters that do not require grad computation.</li>
<li> The <span style="font-weight:bold;color:red;">red</span> part is the structure that is repeated and thus folded.</li>
<li> The <span style="font-weight:bold;color:purple;">purple</span> part is the delta parameters inserted into the backbone model.</li>
</ul>
</div>
:::{admonition} Platform Sensitivity
:class: warning
Depending on the platform the code is running on, the colors may vary slightly.
:::
## We also provide the option to visualize the nodes without parameters.
```python
Visualization(backbone_model).structure_graph(keep_non_params=True)
```
Thus, the modules like dropout and activations are kept.
````{collapse} <span style="color:rgb(141, 99, 224);font-weight:bold;font-style:italic">Click to view output</span>
```{figure} ../imgs/bertdelta_noparam.png
---
width: 600px
name: bertdelta_noparam
---
```
````
:::{admonition} Order of the submodule
:class: warning
Currently, OpenDelta's Visualization visualizes the model based on PyTorch's `named_modules` method. That means the order of the presented submodules is the order in which they were added to the parent module, not necessarily the order in which tensors flow through them.
:::

25
examples/README.md Normal file
View File

@ -0,0 +1,25 @@
# Use Examples
This repo mainly contains several running scripts that use OpenDelta to conduct parameter-efficient training on various tasks.
**Note that we suggest adding OpenDelta to your existing scripts, instead of modifying your scripts to match the following examples. OpenDelta itself doesn't restrict the training pipeline, nor does it provide one.**
## tutorial
Several toy tutorials:
1. The scripts for docs/basic_usage
2. Using interactive module selection
3. Work with [OpenPrompt](https://github.com/thunlp/OpenPrompt)
## examples_text-classification
Modifies a huggingface text-classification example into a delta-tuning one.
Currently, GLUE datasets are supported in the scripts. RoBERTa-base is used for performance checking. Read the README.md inside the directory for detailed usage.
## examples_seq2seq
Modifies a huggingface sequence-to-sequence example into a delta-tuning one.
Currently, SuperGLUE and GLUE datasets are supported in the scripts. T5-base is used for performance checking. Read the README.md inside the directory for detailed usage.
## examples_image-classification
A toy example of using OpenDelta for a computer-vision pretrained model (ViT). Since ViT is an experimental feature in huggingface transformers, this example is subject to change at any moment.

View File

@ -0,0 +1,166 @@
<!---
Copyright 2021 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Use OpenDelta with the Vision Transformer (ViT)
This example builds on the [HuggingFace image classification examples]() by adding several lines to the original scripts.
## Usage
### 1. Install the necessary packages
```shell
pip install Pillow
pip install torchvision
pip install transformers==4.16.2
pip install datasets==1.18.0
```
### 2. Upgrade transformers to 4.10.0
### 3. Run
```bash
python run_image_classification.py configs/lora_beans.json
```
Do not forget to reinstall `datasets` 1.17.0 afterwards for the other examples. :)
## Possible Errors
1. Dataset connection error
- Solution 1: Open a Python console and run the failing command again; this may not always help.
- Solution 2: Download the dataset yourself on an Internet-connected machine, save it to disk, transfer it to your server, and finally load it with `load_from_disk` (see the sketch below).
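A minimal sketch of Solution 2, using the `beans` dataset and an illustrative local path (`save_to_disk` / `load_from_disk` are standard `datasets` utilities):

```python
# On a machine with Internet access:
from datasets import load_dataset
load_dataset("beans").save_to_disk("./beans_saved")   # then copy this folder to the server

# On the offline server:
from datasets import load_from_disk
ds = load_from_disk("./beans_saved")
```

You may also need to adapt the path used in `run_image_classification.py`; see the commented `load_from_disk` lines in that script.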
# Image classification examples
The following examples showcase how to fine-tune a `ViT` for image-classification using PyTorch.
## Using datasets from 🤗 `datasets`
Here we show how to fine-tune a `ViT` on the [beans](https://huggingface.co/datasets/beans) dataset.
👀 See the results here: [nateraw/vit-base-beans](https://huggingface.co/nateraw/vit-base-beans).
```bash
python run_image_classification.py \
--dataset_name beans \
--output_dir ./beans_outputs/ \
--remove_unused_columns False \
--do_train \
--do_eval \
--push_to_hub \
--push_to_hub_model_id vit-base-beans \
--learning_rate 2e-5 \
--num_train_epochs 5 \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--logging_strategy steps \
--logging_steps 10 \
--evaluation_strategy epoch \
--save_strategy epoch \
--load_best_model_at_end True \
--save_total_limit 3 \
--seed 1337
```
Here we show how to fine-tune a `ViT` on the [cats_vs_dogs](https://huggingface.co/datasets/cats_vs_dogs) dataset.
👀 See the results here: [nateraw/vit-base-cats-vs-dogs](https://huggingface.co/nateraw/vit-base-cats-vs-dogs).
```bash
python run_image_classification.py \
--dataset_name cats_vs_dogs \
--output_dir ./cats_vs_dogs_outputs/ \
--remove_unused_columns False \
--do_train \
--do_eval \
--push_to_hub \
--push_to_hub_model_id vit-base-cats-vs-dogs \
--fp16 True \
--learning_rate 2e-4 \
--num_train_epochs 5 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 32 \
--logging_strategy steps \
--logging_steps 10 \
--evaluation_strategy epoch \
--save_strategy epoch \
--load_best_model_at_end True \
--save_total_limit 3 \
--seed 1337
```
## Using your own data
To use your own dataset, the training script expects the following directory structure:
```bash
root/dog/xxx.png
root/dog/xxy.png
root/dog/[...]/xxz.png
root/cat/123.png
root/cat/nsdf3.png
root/cat/[...]/asd932_.png
```
Once you've prepared your dataset, you can run the script like this:
```bash
python run_image_classification.py \
--dataset_name nateraw/image-folder \
--train_dir <path-to-train-root> \
--output_dir ./outputs/ \
--remove_unused_columns False \
--do_train \
--do_eval
```
### 💡 The above will split the train dir into training and evaluation sets
- To control the split amount, use the `--train_val_split` flag.
- To provide your own validation split in its own directory, you can pass the `--validation_dir <path-to-val-root>` flag.
## Sharing your model on 🤗 Hub
0. If you haven't already, [sign up](https://huggingface.co/join) for a 🤗 account
1. Make sure you have `git-lfs` installed and git set up.
```bash
$ apt install git-lfs
$ git config --global user.email "you@example.com"
$ git config --global user.name "Your Name"
```
2. Log in with your HuggingFace account credentials using `huggingface-cli`
```bash
$ huggingface-cli login
# ...follow the prompts
```
3. When running the script, pass the following arguments:
```bash
python run_image_classification.py \
--push_to_hub \
--push_to_hub_model_id <name-your-model> \
...
```

View File

@ -0,0 +1,30 @@
{
"report_to": "none",
"dataset_name": "beans",
"output_dir": "./beans_outputs/",
"do_train": true,
"do_eval": true,
"num_train_epochs": 5,
"remove_unused_columns": false,
"per_device_train_batch_size": 8,
"per_device_eval_batch_size": 8,
"logging_strategy": "steps",
"logging_steps": 10,
"evaluation_strategy": "epoch",
"save_strategy": "epoch",
"load_best_model_at_end": true,
"save_total_limit": 3,
"seed": 1337,
"delta_type": "lora",
"modified_modules": [
"attention.query",
"attention.value"
],
"unfrozen_modules": [
"classifier",
"deltas"
],
"overwrite_output_dir": true,
"learning_rate": 5e-4
}

View File

@ -0,0 +1,89 @@
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Accuracy metric."""
from sklearn.metrics import accuracy_score
import datasets
_DESCRIPTION = """
Accuracy is the proportion of correct predictions among the total number of cases processed. It can be computed with:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
TP: True positive
TN: True negative
FP: False positive
FN: False negative
"""
_KWARGS_DESCRIPTION = """
Args:
predictions: Predicted labels, as returned by a model.
references: Ground truth labels.
normalize: If False, return the number of correctly classified samples.
Otherwise, return the fraction of correctly classified samples.
sample_weight: Sample weights.
Returns:
accuracy: Accuracy score.
Examples:
>>> accuracy_metric = datasets.load_metric("accuracy")
>>> results = accuracy_metric.compute(references=[0, 1], predictions=[0, 1])
>>> print(results)
{'accuracy': 1.0}
"""
_CITATION = """\
@article{scikit-learn,
title={Scikit-learn: Machine Learning in {P}ython},
author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
journal={Journal of Machine Learning Research},
volume={12},
pages={2825--2830},
year={2011}
}
"""
@datasets.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Accuracy(datasets.Metric):
def _info(self):
return datasets.MetricInfo(
description=_DESCRIPTION,
citation=_CITATION,
inputs_description=_KWARGS_DESCRIPTION,
features=datasets.Features(
{
"predictions": datasets.Sequence(datasets.Value("int32")),
"references": datasets.Sequence(datasets.Value("int32")),
}
if self.config_name == "multilabel"
else {
"predictions": datasets.Value("int32"),
"references": datasets.Value("int32"),
}
),
reference_urls=["https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html"],
)
def _compute(self, predictions, references, normalize=True, sample_weight=None):
return {
"accuracy": float(
accuracy_score(references, predictions, normalize=normalize, sample_weight=sample_weight)
)
}

View File

@ -0,0 +1,3 @@
# torch>=1.5.0
torchvision>=0.6.0
datasets>=1.8.0

View File

@ -0,0 +1,392 @@
#!/usr/bin/env python
# coding=utf-8
# Copyright 2021 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Optional
import datasets
import numpy as np
import torch
from datasets import load_dataset
from PIL import Image
from torchvision.transforms import (
CenterCrop,
Compose,
Normalize,
RandomHorizontalFlip,
RandomResizedCrop,
Resize,
ToTensor,
)
import transformers
from transformers import (
MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING,
AutoConfig,
AutoFeatureExtractor,
AutoModelForImageClassification,
HfArgumentParser,
Trainer,
TrainingArguments,
)
from transformers.trainer_utils import get_last_checkpoint
from transformers.utils import check_min_version
from transformers.utils.versions import require_version
""" Fine-tuning a 🤗 Transformers model for image classification"""
logger = logging.getLogger(__name__)
# Will error if the minimal version of Transformers is not installed. Remove at your own risks.
check_min_version("4.16.0.dev0")
require_version("datasets>=1.8.0", "To fix: pip install -r examples/pytorch/image-classification/requirements.txt")
MODEL_CONFIG_CLASSES = list(MODEL_FOR_IMAGE_CLASSIFICATION_MAPPING.keys())
MODEL_TYPES = tuple(conf.model_type for conf in MODEL_CONFIG_CLASSES)
def pil_loader(path: str):
with open(path, "rb") as f:
im = Image.open(f)
return im.convert("RGB")
@dataclass
class DataTrainingArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
Using ``HfArgumentParser`` we can turn this class
into argparse arguments to be able to specify them on
the command line.
"""
dataset_name: Optional[str] = field(
default="nateraw/image-folder", metadata={"help": "Name of a dataset from the datasets package"}
)
dataset_config_name: Optional[str] = field(
default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
)
train_dir: Optional[str] = field(default=None, metadata={"help": "A folder containing the training data."})
validation_dir: Optional[str] = field(default=None, metadata={"help": "A folder containing the validation data."})
train_val_split: Optional[float] = field(
default=0.15, metadata={"help": "Percent to split off of train for validation."}
)
max_train_samples: Optional[int] = field(
default=None,
metadata={
"help": "For debugging purposes or quicker training, truncate the number of training examples to this "
"value if set."
},
)
max_eval_samples: Optional[int] = field(
default=None,
metadata={
"help": "For debugging purposes or quicker training, truncate the number of evaluation examples to this "
"value if set."
},
)
def __post_init__(self):
data_files = dict()
if self.train_dir is not None:
data_files["train"] = self.train_dir
if self.validation_dir is not None:
data_files["val"] = self.validation_dir
self.data_files = data_files if data_files else None
class RemainArgHfArgumentParser(HfArgumentParser):
def parse_json_file(self, json_file: str, return_remaining_args=True ):
"""
Alternative helper method that does not use `argparse` at all, instead loading a json file and populating the
dataclass types.
"""
import argparse
import json
from pathlib import Path
import dataclasses
data = json.loads(Path(json_file).read_text())
outputs = []
for dtype in self.dataclass_types:
keys = {f.name for f in dataclasses.fields(dtype) if f.init}
inputs = {k: data.pop(k) for k in list(data.keys()) if k in keys}
obj = dtype(**inputs)
outputs.append(obj)
remain_args = argparse.ArgumentParser()
remain_args.__dict__.update(data)
if return_remaining_args:
return (*outputs, remain_args)
else:
return (*outputs,)
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_name_or_path: str = field(
default="google/vit-base-patch16-224-in21k",
metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"},
)
model_type: Optional[str] = field(
default=None,
metadata={"help": "If training from scratch, pass a model type from the list: " + ", ".join(MODEL_TYPES)},
)
config_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None, metadata={"help": "Where do you want to store the pretrained models downloaded from s3"}
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
feature_extractor_name: str = field(default=None, metadata={"help": "Name or path of preprocessor config."})
use_auth_token: bool = field(
default=False,
metadata={
"help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
"with private models)."
},
)
def collate_fn(examples):
pixel_values = torch.stack([example["pixel_values"] for example in examples])
labels = torch.tensor([example["labels"] for example in examples])
return {"pixel_values": pixel_values, "labels": labels}
def main():
# See all possible arguments in src/transformers/training_args.py
# or by passing the --help flag to this script.
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = RemainArgHfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
# If we pass only one argument to the script and it's the path to a json file,
# let's parse it to get our arguments.
model_args, data_args, training_args, delta_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
else:
model_args, data_args, training_args, delta_args = parser.parse_args_into_dataclasses()
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
# Log on each process the small summary:
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
logger.info(f"Training/evaluation parameters {training_args}")
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
elif last_checkpoint is not None and training_args.resume_from_checkpoint is None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Initialize our dataset and prepare it for the 'image-classification' task.
ds = load_dataset(
data_args.dataset_name,
data_args.dataset_config_name,
data_files=data_args.data_files,
cache_dir=model_args.cache_dir,
task="image-classification",
)
# If you encounter an error here, try to download the dataset yourself and load it from disk,
# as in the following two lines:
# from datasets import load_from_disk
# ds = load_from_disk(f"../../../../huggingface_datasets/saved_to_disk/{data_args.dataset_name}")
# If we don't have a validation split, split off a percentage of train as validation.
data_args.train_val_split = None if "validation" in ds.keys() else data_args.train_val_split
if isinstance(data_args.train_val_split, float) and data_args.train_val_split > 0.0:
split = ds["train"].train_test_split(data_args.train_val_split)
ds["train"] = split["train"]
ds["validation"] = split["test"]
# Prepare label mappings.
# We'll include these in the model's config to get human readable labels in the Inference API.
labels = ds["train"].features["labels"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
label2id[label] = str(i)
id2label[str(i)] = label
# Load the accuracy metric from the datasets package
# metric = datasets.load_metric("accuracy")
metric = datasets.load_metric("metric.py")
# Define our compute_metrics function. It takes an ``EvalPrediction`` object (a namedtuple with a
# predictions and label_ids field) and has to return a dictionary string to float.
def compute_metrics(p):
"""Computes accuracy on a batch of predictions"""
return metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)
config = AutoConfig.from_pretrained(
model_args.config_name or model_args.model_name_or_path,
num_labels=len(labels),
label2id=label2id,
id2label=id2label,
finetuning_task="image-classification",
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
model = AutoModelForImageClassification.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
feature_extractor = AutoFeatureExtractor.from_pretrained(
model_args.feature_extractor_name or model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
if delta_args.delta_type.lower() != "none":
from opendelta import AutoDeltaConfig,AutoDeltaModel
delta_config = AutoDeltaConfig.from_dict(vars(delta_args))
delta_model = AutoDeltaModel.from_config(delta_config, backbone_model=model)
delta_model.freeze_module(set_state_dict = True)
delta_model.log(delta_ratio=True, trainable_ratio=True, visualization=True)
# Define torchvision transforms to be applied to each image.
normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
_train_transforms = Compose(
[
RandomResizedCrop(feature_extractor.size),
RandomHorizontalFlip(),
ToTensor(),
normalize,
]
)
_val_transforms = Compose(
[
Resize(feature_extractor.size),
CenterCrop(feature_extractor.size),
ToTensor(),
normalize,
]
)
def train_transforms(example_batch):
"""Apply _train_transforms across a batch."""
example_batch["pixel_values"] = [
_train_transforms(pil_img.convert("RGB")) for pil_img in example_batch["image"]
]
return example_batch
def val_transforms(example_batch):
"""Apply _val_transforms across a batch."""
example_batch["pixel_values"] = [_val_transforms(pil_img.convert("RGB")) for pil_img in example_batch["image"]]
return example_batch
if training_args.do_train:
if "train" not in ds:
raise ValueError("--do_train requires a train dataset")
if data_args.max_train_samples is not None:
ds["train"] = ds["train"].shuffle(seed=training_args.seed).select(range(data_args.max_train_samples))
# Set the training transforms
ds["train"].set_transform(train_transforms)
if training_args.do_eval:
if "validation" not in ds:
raise ValueError("--do_eval requires a validation dataset")
if data_args.max_eval_samples is not None:
ds["validation"] = (
ds["validation"].shuffle(seed=training_args.seed).select(range(data_args.max_eval_samples))
)
# Set the validation transforms
ds["validation"].set_transform(val_transforms)
# Initialize our trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=ds["train"] if training_args.do_train else None,
eval_dataset=ds["validation"] if training_args.do_eval else None,
compute_metrics=compute_metrics,
tokenizer=feature_extractor,
data_collator=collate_fn,
)
# Training
if training_args.do_train:
checkpoint = None
if training_args.resume_from_checkpoint is not None:
checkpoint = training_args.resume_from_checkpoint
elif last_checkpoint is not None:
checkpoint = last_checkpoint
train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model()
trainer.log_metrics("train", train_result.metrics)
trainer.save_metrics("train", train_result.metrics)
trainer.save_state()
# Evaluation
if training_args.do_eval:
metrics = trainer.evaluate()
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
# Write model card and (optionally) push to hub
kwargs = {
"finetuned_from": model_args.model_name_or_path,
"tasks": "image-classification",
"dataset": data_args.dataset_name,
"tags": ["image-classification"],
}
if training_args.push_to_hub:
trainer.push_to_hub(**kwargs)
else:
trainer.create_model_card(**kwargs)
if __name__ == "__main__":
main()

View File

@ -0,0 +1,64 @@
# Applying OpenDelta to GLUE/SuperGLUE tasks using the Seq2Seq Paradigm
## Install the repo
```bash
cd ../
python setup_seq2seq.py develop
```
This will add `examples_seq2seq` to the Python environment path.
## Generating the json configuration file
```
python config_gen.py --job $job_name
```
The available job configurations (e.g., `--job lora_t5-base`) can be found in `config_gen.py`. You can also
create your own configuration.
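As an illustration (the job name and values below are hypothetical, following the pattern of the existing entries), a new job could be registered in `config_gen.py`, which already imports `copy` and defines `AllConfigs` / `BaseConfigs`:

```python
# inside config_gen.py, next to the existing AllConfigs entries
AllConfigs['my_lora_t5-base'] = copy.deepcopy(BaseConfigs['t5-base'])
AllConfigs['my_lora_t5-base'].update({
    "delta_type": "lora",
    "learning_rate": 3e-4,
    "unfrozen_modules": ["deltas", "layer_norm", "final_layer_norm"],
    "lora_r": 8,
    "output_dir": "outputs/my_lora/t5-base/",
})
```

Running `python config_gen.py --job my_lora_t5-base` should then write one json file per task into `./my_lora_t5-base/`.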
## Run the code
```
python run_seq2seq.py configs/$job_name/$dataset.json
```
## Possible Errors
1.
```
ValueError: You must login to the Hugging Face hub on this computer by typing `transformers-cli login` and entering your credentials to use `use_auth_token=True`. Alternatively, you can pass your own token as the `use_auth_token` argument.
```
- Solution 1: Register an account on [HuggingFace](https://huggingface.co/), then run `transformers-cli login` on your command line and enter your username and password.
- Solution 2: Disable pushing to the hub by setting `"push_to_hub": false` in the config json.
2.
```
OSError: Looks like you do not have git-lfs installed, please install. You can install from https://git-lfs.github.com/. Then run `git lfs install` (you only have to do this once).
```
- Solution 1:
```
wget -P ~ https://github.com/git-lfs/git-lfs/releases/download/v3.0.2/git-lfs-linux-amd64-v3.0.2.tar.gz
cd ~
tar -xvzf git-lfs-linux-amd64-v3.0.2.tar.gz
export PATH=~:$PATH
git-lfs install
```
- Solution 2: Disable pushing to the hub by setting `"push_to_hub": false` in the config json.
3. Dataset connection error
- Solution 1: Open a Python console and run the failing command again; this may not always help.
- Solution 2: Download the dataset yourself on an Internet-connected machine, save it to disk, transfer it to your server, and finally load it with `load_from_disk`.
## Link to the original training scripts
This example repo is based on the [compacter training scripts](https://github.com/rabeehk/compacter), with compacter-related lines removed. Thanks to the authors of the original repo. In addition, in private correspondence, the authors shared the code used to create the json configs. Thanks again for their efforts.

View File

View File

@ -0,0 +1,21 @@
# the final results will be populated here.{
"evaluate": {
"epoch": 20.0,
"eval_accuracy": 89.2156862745098,
"eval_average_metrics": 90.76168929110105,
"eval_f1": 92.3076923076923,
"eval_loss": 0.16493959724903107,
"eval_runtime": 1.6391,
"eval_samples_per_second": 124.455
},
"repo_name": "DeltaHub/bitfit_t5-base_mrpc",
"test": {
"epoch": 20.0,
"test_accuracy": 88.23529411764706,
"test_average_metrics": 89.97971602434077,
"test_f1": 91.72413793103448,
"test_loss": 0.14968213438987732,
"test_runtime": 1.6344,
"test_samples_per_second": 124.82
}
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "cola",
"eval_steps": 100,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 128,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 20,
"output_dir": "outputs/bitfit/t5-base/cola",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 100,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "cola",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "cola",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "mnli",
"eval_steps": 200,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 128,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 3,
"output_dir": "outputs/bitfit/t5-base/mnli",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 200,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "mnli",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "mnli",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "mrpc",
"eval_steps": 200,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 128,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 20,
"output_dir": "outputs/bitfit/t5-base/mrpc",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 200,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "mrpc",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "mrpc",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "qnli",
"eval_steps": 200,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 128,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 3,
"output_dir": "outputs/bitfit/t5-base/qnli",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 200,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "qnli",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "qnli",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "qqp",
"eval_steps": 200,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 128,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 3,
"output_dir": "outputs/bitfit/t5-base/qqp",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 200,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "qqp",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "qqp",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "rte",
"eval_steps": 100,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 128,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 20,
"output_dir": "outputs/bitfit/t5-base/rte",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 100,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "rte",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "rte",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "sst2",
"eval_steps": 200,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 128,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 3,
"output_dir": "outputs/bitfit/t5-base/sst2",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 200,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "sst2",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "sst2",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "stsb",
"eval_steps": 100,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 128,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 20,
"output_dir": "outputs/bitfit/t5-base/stsb",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 100,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "stsb",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "stsb",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "superglue-boolq",
"eval_steps": 200,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 256,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 20,
"output_dir": "outputs/bitfit/t5-base/superglue-boolq",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 200,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "superglue-boolq",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "superglue-boolq",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "superglue-cb",
"eval_steps": 100,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 256,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 20,
"output_dir": "outputs/bitfit/t5-base/superglue-cb",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 100,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "superglue-cb",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "superglue-cb",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "superglue-copa",
"eval_steps": 50,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 256,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 40,
"output_dir": "outputs/bitfit/t5-base/superglue-copa",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 50,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "superglue-copa",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "superglue-copa",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "superglue-multirc",
"eval_steps": 200,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 256,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 3,
"output_dir": "outputs/bitfit/t5-base/superglue-multirc",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 200,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "superglue-multirc",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "superglue-multirc",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "superglue-record",
"eval_steps": 200,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 512,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 3,
"output_dir": "outputs/bitfit/t5-base/superglue-record",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 16,
"per_device_train_batch_size": 16,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 200,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "superglue-record",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "superglue-record",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "superglue-wic",
"eval_steps": 100,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 256,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 20,
"output_dir": "outputs/bitfit/t5-base/superglue-wic",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 100,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "superglue-wic",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "superglue-wic",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,40 @@
{
"dataset_config_name": [
"en"
],
"delta_type": "bitfit",
"do_eval": true,
"do_test": true,
"do_train": true,
"eval_dataset_config_name": [
"en"
],
"eval_dataset_name": "superglue-wsc.fixed",
"eval_steps": 100,
"evaluation_strategy": "steps",
"greater_is_better": true,
"learning_rate": 0.0003,
"load_best_model_at_end": true,
"max_source_length": 256,
"metric_for_best_model": "average_metrics",
"model_name_or_path": "t5-base",
"num_train_epochs": 20,
"output_dir": "outputs/bitfit/t5-base/superglue-wsc.fixed",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 32,
"per_device_train_batch_size": 32,
"predict_with_generate": true,
"push_to_hub": true,
"save_steps": 100,
"save_strategy": "steps",
"save_total_limit": 1,
"seed": 42,
"split_validation_test": true,
"task_name": "superglue-wsc.fixed",
"test_dataset_config_name": [
"en"
],
"test_dataset_name": "superglue-wsc.fixed",
"tokenizer_name": "t5-base",
"warmup_steps": 0
}

View File

@ -0,0 +1,230 @@
import collections
import copy
AllConfigs = {}
BaseConfigs = {}
BaseConfigs['t5-base'] = {
("job_name", "task_name", "eval_dataset_name", "test_dataset_name", "num_train_epochs",
"max_source_length",
"per_device_train_batch_size", "per_device_eval_batch_size", "warmup_steps","save_steps", "eval_steps"): zip(
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record",
"superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
[ 20, 20, 40, 20, 3, 3, 20, 20, 20, 3, 3, 20, 3, 3, 20],
[256, 256, 256, 256, 256, 512, 256, 128, 128, 128, 128, 128, 128, 128, 128],
[ 32, 32, 32, 32, 32, 16, 32] + [32] * 8,
[ 32, 32, 32, 32, 32, 16, 32] + [32] * 8,
[0] *7 +[0] *8,
[200, 100, 50, 100, 200, 200, 100, 200, 100, 200, 200, 100, 200, 200, 100],
[200, 100, 50, 100, 200, 200, 100, 200, 100, 200, 200, 100, 200, 200, 100],
),
"do_train": True,
"do_eval": True,
"do_test": True,
"model_name_or_path": "t5-base",
"tokenizer_name": "t5-base",
"save_total_limit": 1,
# For glue datasets.
"split_validation_test": True,
"seed": 42,
"dataset_config_name": ["en"],
"eval_dataset_config_name": ["en"],
"test_dataset_config_name": ["en"],
# other configurations.
"predict_with_generate": True,
# To evaluate during training.
"load_best_model_at_end": True,
"metric_for_best_model": "average_metrics",
"greater_is_better": True,
"evaluation_strategy": "steps",
"overwrite_output_dir": True,
"push_to_hub": True,
"save_strategy": "steps"
}
AllConfigs['bitfit_t5-base'] = copy.deepcopy(BaseConfigs['t5-base'])
AllConfigs['bitfit_t5-base'].update({
"delta_type": "bitfit",
"learning_rate": 3e-4,
"output_dir": "outputs/bitfit/t5-base/",
})
AllConfigs['adapter_t5-base'] = copy.deepcopy(BaseConfigs['t5-base'])
AllConfigs['adapter_t5-base'].update({
"delta_type": "adapter",
"learning_rate": 3e-4,
"unfrozen_modules": [
"deltas",
"layer_norm",
"final_layer_norm"
],
"bottleneck_dim":24,
"output_dir": "outputs/adapter/t5-base/",
})
AllConfigs['lora_t5-base'] = copy.deepcopy(BaseConfigs['t5-base'])
AllConfigs['lora_t5-base'].update({
"delta_type": "lora",
"learning_rate": 3e-4,
"unfrozen_modules": [
"deltas",
"layer_norm",
"final_layer_norm"
],
"lora_r": 8,
"output_dir": "outputs/lora/t5-base/",
})
AllConfigs['compacter_t5-base'] = copy.deepcopy(BaseConfigs['t5-base'])
AllConfigs['compacter_t5-base'].update({
"delta_type": "compacter",
"learning_rate": 3e-3,
"unfrozen_modules": [
"deltas",
"layer_norm",
"final_layer_norm"
],
"output_dir": "outputs/compacter/t5-base/",
"non_linearity": "gelu_new",
#Compacter.
"hypercomplex_division": 4,
"hypercomplex_adapters": True,
"hypercomplex_nonlinearity": "glorot-uniform",
# gradient clip and clamp
"gradient_clip": False,
"phm_clamp": False,
"normalize_phm_weight": False,
"learn_phm": True,
# shared one side
"factorized_phm": True,
"shared_phm_rule": False,
"factorized_phm_rule": False,
"phm_c_init": "normal",
"phm_init_range": 0.0001,
"use_bias_down_sampler": True,
"use_bias_up_sampler": True,
})
AllConfigs['compacter++_t5-base'] = copy.deepcopy(BaseConfigs['t5-base'])
AllConfigs['compacter++_t5-base'].update({
"delta_type": "compacter",
"learning_rate": 3e-3,
"do_train": True,
"do_eval": True,
"do_test": True,
"modified_modules": [
"DenseReluDense"
],
"unfrozen_modules": [
"deltas",
"layer_norm",
"final_layer_norm"
],
"output_dir": "outputs/compacter++/t5-base/",
"non_linearity": "gelu_new",
#Compacter.
"hypercomplex_division": 4,
"hypercomplex_adapters": True,
"hypercomplex_nonlinearity": "glorot-uniform",
# gradient clip and clamp
"gradient_clip": False,
"phm_clamp": False,
"normalize_phm_weight": False,
"learn_phm": True,
# shared one side
"factorized_phm": True,
"shared_phm_rule": False,
"factorized_phm_rule": False,
"phm_c_init": "normal",
"phm_init_range": 0.0001,
"use_bias_down_sampler": True,
"use_bias_up_sampler": True,
})
AllConfigs['low_rank_adapter_t5-base'] = copy.deepcopy(BaseConfigs['t5-base'])
AllConfigs['low_rank_adapter_t5-base'].update({
"delta_type": "low_rank_adapter",
"learning_rate": 3e-4,
"unfrozen_modules": [
"deltas",
"layer_norm",
"final_layer_norm"
],
"output_dir": "outputs/low_rank_adapter/t5-base/",
"non_linearity": "gelu_new",
"low_rank_w_init": "glorot-uniform",
"low_rank_rank": 1,
})
AllConfigs['soft_prompt_t5-base'] = copy.deepcopy(BaseConfigs['t5-base'])
AllConfigs['soft_prompt_t5-base'].update({
"delta_type": "soft_prompt",
"learning_rate": 3e-2,
"soft_token_num":100,
"token_init": False,
"unfrozen_modules": [
"deltas",
],
"output_dir": "outputs/soft_prompt/t5-base/",
})
AllConfigs['prefix_t5-base'] = copy.deepcopy(BaseConfigs['t5-base'])
AllConfigs['prefix_t5-base'].update({
"delta_type": "prefix",
"learning_rate": 3e-4,
"unfrozen_modules": [
"deltas",
],
"output_dir": "outputs/prefix/t5-base/",
})
if __name__ == "__main__":
import argparse
import json
import os
parser = argparse.ArgumentParser("Parser to generate configuration")
parser.add_argument("--job", type=str)
args = parser.parse_args()
config = AllConfigs[args.job]
Cartesian_product = []
for key in config:
if isinstance(key, tuple):
Cartesian_product.append(key)
all_config_jsons = {}
for key_tuple in Cartesian_product:
for zipped in config[key_tuple]:
job_name = zipped[0]
all_config_jsons[job_name] = {}
for key_name, zipped_elem in zip(key_tuple, zipped):
if key_name != 'job_name':
all_config_jsons[job_name][key_name] = zipped_elem
for key in config:
if not isinstance(key, tuple):
for job_name in all_config_jsons:
if key == "output_dir":
all_config_jsons[job_name][key] = config[key] + job_name
else:
all_config_jsons[job_name][key] = config[key]
if not os.path.exists(f"./{args.job}/"):
os.mkdir(f"./{args.job}/")
for job_name in all_config_jsons:
with open(f"./{args.job}/{job_name}.json", 'w') as fout:
json.dump(all_config_jsons[job_name], fout, indent=4,sort_keys=True)

View File

@ -0,0 +1,3 @@
from .tasks import TASK_MAPPING, AutoTask
from .data_collator import TaskDataCollatorForSeq2Seq
from .postprocessors import AutoPostProcessor

View File

@ -0,0 +1,16 @@
import numpy as np
from dataclasses import dataclass
from transformers import DataCollatorForSeq2Seq
@dataclass
class TaskDataCollatorForSeq2Seq(DataCollatorForSeq2Seq):
def check_uniqueness(self, samples):
assert len(np.unique(samples)) == 1
def __call__(self, features):
# tasks = [d.pop('task') for d in features]
# self.check_uniqueness(tasks)
output = super().__call__(features)
# output["task"] = tasks[0]
return output

View File

@ -0,0 +1,64 @@
import abc
from collections import OrderedDict
import numpy as np
"""Defines functions to process the outputs to make them ready for the evaluation."""
def string_to_float(string, default=-1., **unused_kwargs):
"""Converts string to float, using default when conversion not possible."""
try:
return float(string)
except ValueError:
return default
class PostProcessor(abc.ABC):
"""Postprocess the predictions and labels to make them suitable for
evaluation."""
def __init__(self, tokenizer, ignore_pad_token_for_loss):
self.tokenizer = tokenizer
self.ignore_pad_token_for_loss = ignore_pad_token_for_loss
def process(self, preds, labels, data_info=None):
if isinstance(preds, tuple):
preds = preds[0]
decoded_preds = self.tokenizer.batch_decode(preds, skip_special_tokens=True)
if self.ignore_pad_token_for_loss:
# Replace -100 in the labels as we can't decode them.
labels = np.where(labels != -100, labels, self.tokenizer.pad_token_id)
decoded_labels = self.tokenizer.batch_decode(labels, skip_special_tokens=True)
# Some simple post-processing
decoded_preds = [pred.strip() for pred in decoded_preds]
decoded_labels = [label.strip() for label in decoded_labels]
return decoded_preds, decoded_labels
class MultiRC(PostProcessor):
def process(self, preds, labels, data_info):
preds, labels = super().process(preds, labels, data_info)
preds = [{"group": info["group"], "value":pred} \
for info, pred in zip(data_info, preds)]
labels = [{"group": info["group"], "value": label}\
for info, label in zip(data_info, labels)]
return preds, labels
class Record(PostProcessor):
def process(self, preds, labels, data_info):
preds, labels = super().process(preds, labels, data_info)
labels = [info["answers"] for info in data_info]
return preds, labels
POSTPROCESSOR_MAPPING = OrderedDict(
[
('superglue-record', Record),
('superglue-multirc', MultiRC)
]
)
class AutoPostProcessor:
@classmethod
def get(self, task, tokenizer, ignore_pad_token_for_loss):
if task in POSTPROCESSOR_MAPPING:
return POSTPROCESSOR_MAPPING[task](tokenizer, ignore_pad_token_for_loss)
return PostProcessor(tokenizer, ignore_pad_token_for_loss)
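# Usage sketch (illustrative, assuming `preds`, `labels`, `data_info` and `tokenizer`
# come from the evaluation loop): decode model outputs before computing metrics.
#   post_processor = AutoPostProcessor.get("superglue-multirc", tokenizer, ignore_pad_token_for_loss=True)
#   decoded_preds, decoded_labels = post_processor.process(preds, labels, data_info)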

View File

@ -0,0 +1,584 @@
from collections import OrderedDict
import collections
import abc
import functools
from typing import Callable, List, Mapping
from examples_seq2seq.trainers.trainer_utils import pad_punctuation
from examples_seq2seq.metrics import metrics
from .utils import round_stsb_target
import datasets
import logging
import numpy as np
import torch
import re
logger = logging.getLogger(__name__)
class AbstractTask(abc.ABC):
name = NotImplemented
config = NotImplemented
prefix = NotImplemented
preprocessor: Callable = NotImplemented
metric = NotImplemented
metric_names = NotImplemented
split_map = None
labels_list = None
split_to_data_split: Mapping[str, str] = \
{"train": "train", "validation": "validation", "test": "test"}
small_datasets_without_all_splits = ["cola", "wnli", "rte", "superglue-cb", "superglue-copa", "superglue-multirc",
"superglue-wic", "superglue-wsc.fixed", "superglue-rte", "mrpc", "stsb",
"superglue-boolq"]
large_data_without_all_splits = ["qqp", "qnli", "superglue-record", "sst2"]
def __init__(self, config, seed=42):
self.config = config
self.seed = seed
def get_max_target_length(self, tokenizer, default_max_length):
if self.labels_list is not None:
return max([len(tokenizer.encode(label)) for label in self.labels_list])
return default_max_length
def seq2seq_format(self, sources: List[str],
targets: List[str],
add_prefix: bool=False,
prefix: str=None,
extra_fields={}):
src_prefix = self.name if prefix is None else prefix
sources = [src_prefix]+sources if add_prefix else sources
return {'source': ' '.join(sources),
'target': ' '.join(targets),
'task': self.name,
'extra_fields': extra_fields}
def check_n_obs(self, n_obs, total_size):
if n_obs is not None and n_obs > total_size:
n_obs = total_size
logger.warning("n_obs is set to %s", n_obs)
return n_obs
def shuffled_indices(self, dataset):
num_samples = len(dataset)
generator = torch.Generator()
generator.manual_seed(self.seed)
return torch.randperm(num_samples, generator=generator).tolist()
def subsample(self, dataset, n_obs=None, indices=None):
"""
Given a dataset returns the subsampled dataset.
:param n_obs: the number of samples of the subsampled dataset.
:param indices: indices to select the samples from, if not given, indices are computed
from by shuffling the given dataset.
:return: subsampled dataset.
"""
num_samples = len(dataset)
n_obs = self.check_n_obs(n_obs, num_samples)
if indices is None:
indices = self.shuffled_indices(dataset)
indices = indices[:n_obs]
return dataset.select(indices)
def load_dataset(self, split: int):
return datasets.load_dataset(self.name, self.config, split=split, script_version="master")
def get_split_indices(self, split, dataset, validation_size):
indices = self.shuffled_indices(dataset)
if split == "validation":
return indices[:validation_size]
else:
return indices[validation_size:]
def map_dataset(self, dataset, add_prefix):
return dataset.map(functools.partial(self.preprocessor, add_prefix=add_prefix),
remove_columns=dataset.column_names)
def get(self, split, add_prefix=True, n_obs=None, split_validation_test=False):
# For small datasets (n_samples < 10K) without test set, we divide validation set to
# half, use one half as test set and one half as validation set.
if split_validation_test and self.name in self.small_datasets_without_all_splits \
and split != "train":
mapped_split = self.split_to_data_split["validation"]
dataset = self.load_dataset(split=mapped_split)
indices = self.get_split_indices(split, dataset, validation_size=len(dataset)//2)
dataset = self.subsample(dataset, n_obs, indices)
# For larger datasets (n_samples > 10K), we divide training set into 1K as
# validation and the rest as training set, keeping the original validation
# set as the test set.
elif split_validation_test and self.name in self.large_data_without_all_splits \
and split != "test":
dataset = self.load_dataset(split="train")
indices = self.get_split_indices(split, dataset, validation_size=1000)
dataset = self.subsample(dataset, n_obs, indices)
else:
mapped_split = self.split_to_data_split[split]
dataset = self.load_dataset(split=mapped_split)
# shuffles the data and samples it.
if n_obs is not None:
dataset = self.subsample(dataset, n_obs)
return self.map_dataset(dataset, add_prefix)
class Squad(AbstractTask):
name = "squad"
metric = [metrics.squad]
def load_dataset(self, split):
return datasets.load_dataset(self.name, split=split, script_version="master")
def preprocessor(self, example, add_prefix):
answer = pad_punctuation(example['answers']['text'][0])
question = pad_punctuation(example['question'])
context = pad_punctuation(example['context'])
source = ["question:", question,
"context:", context]
target = [answer]
return self.seq2seq_format(source, target, add_prefix)
class MRPC(AbstractTask):
name = "mrpc"
labels_list = ["0", "1"]
metric = [metrics.f1_score_with_invalid, metrics.accuracy]
metric_names = ["f1", "accuracy"]
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
def load_dataset(self, split):
return datasets.load_dataset('glue', 'mrpc', split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["sentence1:", example['sentence1'],
"sentence2:", example["sentence2"]]
tgt_texts = [str(example['label'])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class COLA(AbstractTask):
name = "cola"
labels_list = ["0", "1"]
metric = [metrics.matthews_corrcoef]
metric_names = ["matthews_correlation"]
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
def load_dataset(self, split):
return datasets.load_dataset('glue', 'cola',
split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["sentence:", example['sentence']]
tgt_texts = [str(example['label'])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class SST2(AbstractTask):
name = "sst2"
labels_list = ["0", "1"]
metric = [metrics.accuracy]
metric_names = ["accuracy"]
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
def load_dataset(self, split):
return datasets.load_dataset('glue', 'sst2',
split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["sentence:", example['sentence']]
tgt_texts = [str(example['label'])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class STSB(AbstractTask):
name = "stsb"
labels_list = [str(np.round(label, decimals=1)) for label in np.arange(0, 5.2, 0.2)]
metric = [metrics.pearson_corrcoef, metrics.spearman_corrcoef]
metric_names = ["pearson", "spearmanr"]
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
def load_dataset(self, split):
return datasets.load_dataset('glue', 'stsb',
split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["sentence1:", example['sentence1'],
"sentence2:", example["sentence2"]]
tgt_texts = [str(round_stsb_target(example['label']))]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class QQP(AbstractTask):
name = "qqp"
labels_list = ["0", "1"]
metric = [metrics.f1_score_with_invalid, metrics.accuracy]
metric_names = ["f1", "accuracy"]
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
def load_dataset(self, split):
return datasets.load_dataset('glue', 'qqp',
split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["question1:", example['question1'],
"question2:", example["question2"]]
tgt_texts = [str(example['label'])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class MNLI(AbstractTask):
name = "mnli"
labels_list = ["0", "1", "2"]
split_to_data_split = {"train": "train",
"validation": "validation_mismatched",
"test": "validation_matched"}
metric = [metrics.accuracy]
metric_names = ["accuracy"]
def load_dataset(self, split):
return datasets.load_dataset('glue', 'mnli', split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["premise:", example['premise'],
"hypothesis", example["hypothesis"]]
tgt_texts = [str(example['label'])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class QNLI(AbstractTask):
name = "qnli"
labels_list = ["0", "1"]
metric = [metrics.accuracy]
metric_names = ["accuracy"]
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
def load_dataset(self, split):
return datasets.load_dataset('glue', 'qnli', split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["question:", example['question'],
"sentence:", example["sentence"]]
tgt_texts = [str(example['label'])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class RTE(AbstractTask):
name = "rte"
labels_list = ["0", "1"]
metric = [metrics.accuracy]
metric_names = ["accuracy"]
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
def load_dataset(self, split):
return datasets.load_dataset('glue', 'rte',
split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["sentence1:", example['sentence1'],
"sentence2:", example["sentence2"]]
tgt_texts = [str(example['label'])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class WNLI(AbstractTask):
name = "wnli"
labels_list = ["0", "1"]
metric = [metrics.accuracy]
metric_names = ["accuracy"]
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
def load_dataset(self, split):
return datasets.load_dataset('glue', 'wnli', split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["sentence1:", example['sentence1'],
"sentence2:", example["sentence2"]]
tgt_texts = [str(example['label'])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class SuperGLUEBoolQ(AbstractTask):
name="superglue-boolq"
labels_list = ['0', '1']
metric = [metrics.accuracy]
metric_names = ["accuracy"]
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
def load_dataset(self, split):
return datasets.load_dataset('super_glue', 'boolq', split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["question:", example["question"], "passage:", example["passage"]]
tgt_texts = [str(example["label"])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class SuperGLUERTE(AbstractTask):
name="superglue-rte"
labels_list = ['0', '1']
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
metric = [metrics.accuracy]
metric_names = ["accuracy"]
def load_dataset(self, split):
return datasets.load_dataset('super_glue', 'rte', split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["premise:", example["premise"],
"hypothesis:", example["hypothesis"]]
tgt_texts = [str(example["label"])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class SuperGLUECB(AbstractTask):
name = "superglue-cb"
labels_list = ['0', '1', '2']
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
metric = [metrics.mean_multiclass_f1(num_classes=3), metrics.accuracy]
metric_names = ["f1_multiclass", "accuracy"]
def load_dataset(self, split):
return datasets.load_dataset('super_glue', 'cb', split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["premise:", example["premise"], "hypothesis:", example["hypothesis"]]
tgt_texts = [str(example["label"])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class SuperGLUECOPA(AbstractTask):
name = "superglue-copa"
labels_list = ['0', '1']
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
metric = [metrics.accuracy]
metric_names = ["accuracy"]
def load_dataset(self, split):
return datasets.load_dataset('super_glue', 'copa', split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["premise:", example["premise"],
"choice1:", example["choice1"],
"choice2:", example["choice2"]]
tgt_texts = [str(example["label"])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class SuperGLUEMultiRC(AbstractTask):
name = "superglue-multirc"
labels_list = ['0', '1']
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
metric = [metrics.multirc_f1_over_all_answers,
metrics.mean_group_metric(metrics.exact_match)]
metric_names = ["f1", "em"]
def load_dataset(self, split):
return datasets.load_dataset('super_glue', 'multirc', split=split, script_version="master")
def remove_markup(self, text):
"""Removes the HTML markup."""
text = re.sub('<br>', ' ', text)
text = re.sub('<(/)?b>', '', text)
return text
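    # Illustrative example (not from the original code):
    #   remove_markup("<b>Who</b> wrote it?<br>He did.")  ->  "Who wrote it? He did."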
def preprocessor(self, example, add_prefix=True):
group = example['idx']['question']
# T5 applies remove_markup to the joined string, but this should not make
# any difference here.
# https://github.com/google-research/text-to-text-transfer-transformer/blob/a1352e625db7ec114062f99d99b0565b9e45c155/t5/data/preprocessors.py#L797
src_texts = ["question:", self.remove_markup(example["question"]),
"answer:", self.remove_markup(example["answer"]),
"paragraph:", self.remove_markup(example["paragraph"])]
tgt_texts = [str(example["label"])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix, extra_fields={"group": group})
class SuperGLUEWIC(AbstractTask):
name = "superglue-wic"
labels_list = ['0', '1']
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
metric = [metrics.accuracy]
metric_names = ["accuracy"]
def load_dataset(self, split):
return datasets.load_dataset('super_glue', 'wic', split=split, script_version="master")
def preprocessor(self, example, add_prefix=True):
src_texts = ["sentence1:", example["sentence1"],
"sentence2:", example["sentence2"],
"word:", example["word"]]
tgt_texts = [str(example["label"])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class SuperGLUEWSCFixed(AbstractTask):
# source: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py
"""Convert WSC examples to text2text format.
WSC includes a sentence along with 2 'spans': the first denoting a noun and
the other a pronoun. The 'label' specifies whether or not the pronoun is
referencing the noun. This preprocessor puts ' * ' around the noun and ' # '
around the pronoun.
For example, a typical example from WSC might look like
{
'text': 'This is a test sentence .',
'span1_text': 'test',
'span1_index': 3,
'span2_text': 'This',
'span2_index': 0,
'label': 0
}
This example would be transformed to
{
'inputs': 'wsc text: # This # is a * test * sentence .',
'targets': 'False'
}
"""
name = "superglue-wsc.fixed"
labels_list = ['0', '1']
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
metric = [metrics.accuracy]
metric_names = ["accuracy"]
def load_dataset(self, split):
return datasets.load_dataset('super_glue', 'wsc.fixed', split=split, script_version="master")
def _mark_span(self, text, span_str, span_idx, mark):
pattern_tmpl = r'^((?:\S+\s){N})(W)'
pattern = re.sub('N', str(span_idx), pattern_tmpl)
pattern = re.sub('W', span_str, pattern)
return re.sub(pattern, r'\1{0} \2 {0}'.format(mark), text)
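    # Illustrative example (not from the original code), using the sample from the
    # class docstring above:
    #   _mark_span("This is a test sentence .", "test", 3, "*")
    #   ->  "This is a * test * sentence ."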
def preprocessor(self, example, add_prefix=True):
# converts text as done in T5.
text = example['text']
text = self._mark_span(text, example['span1_text'], example['span1_index'], '*')
# Compensate for 2 added "words" added in previous step.
span2_index = example['span2_index'] + 2 * int(example['span1_index'] < example['span2_index'])
text = self._mark_span(text, example['span2_text'], span2_index, '#')
src_texts = ["text:", text]
tgt_texts = [str(example["label"])]
return self.seq2seq_format(src_texts, tgt_texts, add_prefix)
class SuperGLUERecord(AbstractTask):
"""Convert ReCoRD examples to text2text examples.
ReCoRD contains a passage, query containing a '@placeholder' string, and a set
of entities that are the possible values of the placeholder. Each train and
validation example will have a list of answers, any of which would be
considered correct.
For example, a typical example from ReCoRD might look like
{
'passage': 'This is the passage.',
'query': 'A @placeholder is a bird.',
'entities': ['penguin', 'potato', 'pigeon'],
'answers': ['penguin', 'pigeon'],
}
which this preprocessor would turn into the following two examples:
{
'inputs': 'record query: A @placeholder is a bird. entities: penguin, '
'potato, pigeon passage: This is the passage.',
'targets': 'penguin',
}
and
{
'inputs': 'record query: A @placeholder is a bird. entities: penguin, '
'potato, pigeon passage: This is the passage.',
'targets': 'pigeon',
}
"""
name = "superglue-record"
split_to_data_split = {"train": "train",
"validation": "validation",
"test": "validation"}
metric = [metrics.squad]
metric_names = ["squad"]
def load_dataset(self, split):
return datasets.load_dataset('super_glue', 'record', split=split, script_version="master")
def preprocessor(self, batch, add_prefix=True):
new_batch = collections.defaultdict(list)
keys = batch.keys()
for values in zip(*batch.values()):
ex = {k: v for k, v in zip(keys, values)}
# updates the passage.
passage = ex['passage']
passage = re.sub(r'(\.|\?|\!|\"|\')\n@highlight\n', r'\1 ', passage)
passage = re.sub(r'\n@highlight\n', '. ', passage)
inputs = f"record query: {ex['query']} entities: {', '.join(ex['entities'])} passage: {passage}"
if add_prefix:
inputs = self.name + " " + inputs
# duplicates the samples based on number of answers.
num_answers = len(ex["answers"])
num_duplicates = np.maximum(1, num_answers)
new_batch["source"].extend([inputs] * num_duplicates)
new_batch["target"].extend(ex["answers"] if num_answers > 0 else ["<unk>"])
new_batch["task"].extend([self.name] * num_duplicates)
new_batch["extra_fields"].extend([{"answers": ex["answers"]}]*num_duplicates)
return new_batch
def map_dataset(self, dataset, add_prefix=True):
return dataset.map(functools.partial(self.preprocessor, add_prefix=add_prefix),
batched=True, remove_columns=dataset.column_names)
TASK_MAPPING = OrderedDict(
[
('squad', Squad),
('mrpc', MRPC),
('cola', COLA),
('sst2', SST2),
('qnli', QNLI),
('rte', RTE),
('wnli', WNLI),
('mnli', MNLI),
('qqp', QQP),
('stsb', STSB),
('superglue-boolq', SuperGLUEBoolQ),
('superglue-rte', SuperGLUERTE),
('superglue-cb', SuperGLUECB),
('superglue-copa', SuperGLUECOPA),
('superglue-multirc', SuperGLUEMultiRC),
('superglue-wic', SuperGLUEWIC),
('superglue-wsc.fixed', SuperGLUEWSCFixed),
('superglue-record', SuperGLUERecord)
]
)
class AutoTask:
    @classmethod
    def get(cls, task, config, seed=42):
        if task in TASK_MAPPING:
            return TASK_MAPPING[task](config, seed)
        raise ValueError(
            "Unrecognized task {} for AutoTask Model.\n"
            "Task name should be one of {}.".format(
                task, ", ".join(TASK_MAPPING.keys())
            )
        )
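# A minimal usage sketch (not part of the original file). Running this module directly
# lists the registered task names; the commented-out call shows how a task object is
# typically obtained. Passing config=None is an assumption made only for illustration --
# the training script passes its parsed config object instead.
if __name__ == "__main__":
    print("Registered tasks:", ", ".join(TASK_MAPPING.keys()))
    # task = AutoTask.get("superglue-boolq", config=None, seed=42)
    # print(task.name, task.metric_names)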

View File

@ -0,0 +1,17 @@
import numpy as np
def round_stsb_target(label):
"""STSB maps two sentences to a floating point number between 1 and 5
representing their semantic similarity. Since we are treating all tasks as
text-to-text tasks we need to convert this floating point number to a string.
The vast majority of the similarity score labels in STSB are in the set
[0, 0.2, 0.4, ..., 4.8, 5.0]. So, we first round the number to the closest
entry in this set, and then we convert the result to a string (literally e.g.
"3.4"). This converts STSB roughly into a 26-class classification dataset.
Args:
label: original label.
Returns:
A preprocessed label.
"""
return np.round(np.round(label * 5) / 5, decimals=1)
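# A small usage sketch (not part of the original file): running this module directly
# prints a few raw STSB scores rounded to the nearest multiple of 0.2, matching the
# label set described in the docstring above.
if __name__ == "__main__":
    for raw_score in (0.0, 1.3, 2.98, 4.75):
        print(raw_score, "->", round_stsb_target(raw_score))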

View File

@ -0,0 +1,173 @@
# several of the evaluation metrics are from https://github.com/google-research/text-to-text-transfer-transformer/blob/a1352e625db7ec114062f99d99b0565b9e45c155/t5/evaluation/metrics.py
"""Defines different metrics used for evaluation of tasks."""
import numpy as np
import scipy
import math
import sklearn
import collections
from logging import getLogger
from .qa_utils import normalize_squad, qa_metrics
import sklearn.metrics
logger = getLogger(__name__)
def accuracy(predictions, targets) -> dict:
"""Computes the average accuracy."""
return {"accuracy": 100 * ((np.array(predictions) == np.array(targets)).mean())}
def pearson_corrcoef(predictions, targets) -> dict:
"""Computes Pearson correlation coefficient."""
from examples_seq2seq.data_processors.postprocessors import string_to_float
targets = [string_to_float(target) for target in targets]
predictions = [string_to_float(prediction) for prediction in predictions]
pearson_corrcoef = 100 * scipy.stats.pearsonr(targets, predictions)[0]
# Note that if all the predictions are the same, the Pearson
# correlation is nan; to guard against this, we check the output
# and return 0 in this case.
if math.isnan(pearson_corrcoef):
pearson_corrcoef = 0
return {"pearson": pearson_corrcoef}
def spearman_corrcoef(predictions, targets) -> dict:
"""Computes Spearman correlation coefficient."""
# TODO: we need to do postprocessors in a clean way for each dataset.
from examples_seq2seq.data_processors.postprocessors import string_to_float
targets = [string_to_float(target) for target in targets]
predictions = [string_to_float(prediction) for prediction in predictions]
spearman_corrcoef = 100 * scipy.stats.spearmanr(targets, predictions)[0]
# Note that if all the predictions are the same, the Spearman
# correlation is nan; to guard against this, we check the output
# and return 0 in this case.
if math.isnan(spearman_corrcoef):
spearman_corrcoef = 0
return {"spearmanr": spearman_corrcoef}
def f1_score_with_invalid(predictions, targets) -> dict:
"""Computes F1 score, with any prediction != 0 or 1 is counted as incorrect.
Args:
targets: list of targets, either 0 or 1
predictions: list of predictions, any integer value
Returns:
F1 score, where any prediction != 0 or 1 is counted as wrong.
"""
def binary_reverse(labels):
return ['0' if label == '1' else '1' for label in labels]
targets, predictions = np.asarray(targets), np.asarray(predictions)
# Get indices of invalid predictions.
invalid_idx_mask = np.logical_and(predictions != '0', predictions != '1')
# For any prediction != 0 or 1, we set the prediction to the opposite of its corresponding target.
predictions[invalid_idx_mask] = binary_reverse(targets[invalid_idx_mask])
targets = targets.astype(np.int32)
predictions = predictions.astype(np.int32)
return {"f1": 100 * sklearn.metrics.f1_score(targets, predictions)}
# TODO: maybe guard against invalid values https://stackoverflow.com/questions/56865344/how-do-i-calculate-the-matthews-correlation-coefficient-in-tensorflow
def matthews_corrcoef(predictions, targets) -> dict:
"""Computes the Matthews correlation coefficient."""
return {"matthews_correlation": 100 * sklearn.metrics.matthews_corrcoef(targets, predictions)}
def squad(predictions, targets):
"""Computes SQuAD metrics, maximizing over answers per question.
Args:
targets: list of lists of strings
predictions: list of strings
Returns:
dict with score_key: squad score across all targets and predictions
"""
targets = [[normalize_squad(t) for t in u] for u in targets]
predictions = [normalize_squad(p) for p in predictions]
return qa_metrics(targets, predictions)
def exact_match(predictions, targets):
"""Computes whether the targets match predictions exactly."""
return {"em": 100 * float(np.array_equal(targets, predictions))}
def sklearn_metrics_wrapper(metric_str,
metric_dict_str=None,
metric_post_process_fn=None,
**metric_fn_kwargs):
"""Wraps any sklearn.metric function and returns a t5 metric function.
Args:
metric_str: string, the function from `sklearn.metrics` to use.
metric_dict_str: optional string, if not specified `metric_str` is used as
the key in the returned dictionary.
metric_post_process_fn: callable, if specified the final computed metric
will be passed through this.
**metric_fn_kwargs: kwargs, passed to the metric function we are calling.
Returns:
the function that calculates the metric in a dict.
"""
if not hasattr(sklearn.metrics, metric_str):
raise ValueError("sklearn.metrics does not have: %s" % metric_str)
def fn(predictions, targets):
metric_fn = getattr(sklearn.metrics, metric_str)
metric_val = metric_fn(targets, predictions, **metric_fn_kwargs)
if metric_post_process_fn is not None:
metric_val = metric_post_process_fn(metric_val)
return {metric_dict_str or metric_str: metric_val}
return fn
def mean_multiclass_f1(num_classes, **metric_fn_kwargs):
"""Computes the unweighted average of the F1 per class."""
return sklearn_metrics_wrapper(
"fbeta_score",
metric_dict_str="f1_multiclass",
metric_post_process_fn=lambda x: 100 * x,
beta=1,
labels=range(num_classes),
average="macro",
**metric_fn_kwargs)
def multirc_f1_over_all_answers(targets, predictions):
"""Special metric for MultiRC which computes F1 score over all examples.
This is necessary because the targets/predictions for MultiRC are dicts and
the f1_score_with_invalid expects a list of True/False labels, not dicts. As
a result we just need to key in the "value" for each of the example dicts
before feeding into f1_score_with_invalid.
Args:
targets: list of dicts, where each dict has a "value" key.
predictions: list of dicts, where each dict has a "value" key.
Returns:
F1 score over values, where any prediction != 0 or 1 is counted as wrong.
"""
return f1_score_with_invalid(
[t["value"] for t in targets], [p["value"] for p in predictions]
)
def mean_group_metric(metric_fn, group_key="group", value_key="value"):
"""Returns a metric that averages `metric_fn` on sub-groups of results.
The sub-groups are defined by aggregating results (targets and predictions)
by accessing the feature specified by `group_key` in the target dicts.
**WARNING**: Using this function can produce unreliable results if you do not
pass in full groups. For example, if you evaluate over a random subsample of a
validation set and do not retain all of the examples in each group, you may
get results which aren't directly comparable to using the full validation set.
Args:
metric_fn: function, the metric to compute on the subgroups.
group_key: string, the key for the grouping value in the target dictionary.
value_key: string, the key for the value in the dictionaries.
"""
def my_metric(targets, predictions):
"""Computes mean of `metric_fn` over subgroups of results."""
grouped_values = collections.defaultdict(lambda: ([], []))
for targ, pred in zip(targets, predictions):
g = targ[group_key]
grouped_values[g][0].append(targ[value_key])
grouped_values[g][1].append(pred[value_key])
group_scores = collections.defaultdict(list)
for (targets, predictions) in grouped_values.values():
for metric, score in metric_fn(targets, predictions).items():
group_scores[metric].append(score)
return {metric: np.mean(scores) for metric, scores in group_scores.items()}
return my_metric
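# A small self-check sketch (not part of the original module) showing how a few of the
# metrics above are called; all predictions/targets below are invented toy values.
if __name__ == "__main__":
    print(accuracy(["1", "0", "1"], ["1", "1", "1"]))
    print(f1_score_with_invalid(["1", "2", "0"], ["1", "0", "0"]))  # "2" is an invalid prediction
    print(exact_match(["a", "b"], ["a", "b"]))
    grouped_em = mean_group_metric(exact_match)
    print(grouped_em([{"group": "q1", "value": "1"}, {"group": "q2", "value": "0"}],
                     [{"group": "q1", "value": "1"}, {"group": "q2", "value": "1"}]))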

View File

@ -0,0 +1,96 @@
# Copyright 2021 The T5 Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# source: the codes are from https://github.com/google-research/text-to-text-transfer-transformer
"""Utilities for Question Answering (QA) evaluation.
Matches results on the SQuAD (v1.1) and TriviaQA (v1.0) evaluation scripts.
"""
import collections
import string
import regex as re
import numpy as np
def _normalize_answer(text, punc_chars, punc_repl):
"""Lower text and remove punctuation, articles and extra whitespace."""
def remove_articles(s):
return re.sub(r"\b(a|an|the)\b", " ", s)
def replace_punctuation(s):
to_replace = set(punc_chars)
return "".join(punc_repl if ch in to_replace else ch for ch in s)
def white_space_fix(s):
return " ".join(s.split())
text = text.lower()
text = replace_punctuation(text)
text = remove_articles(text)
text = white_space_fix(text)
return text
def normalize_trivia_qa(answer):
"""Normalization used in official TriviaQA evaluation script."""
return _normalize_answer(
answer, punc_chars=string.punctuation + "´`_", punc_repl=" ").strip()
def normalize_squad(answer):
"""Normalization used in official SQuAD evaluation script."""
return _normalize_answer(answer, punc_chars=string.punctuation, punc_repl="")
def _metric_max_over_ground_truths(metric_fn, ground_truths, prediction):
"""Computes the maximum of the metric over all ground truths."""
return max(
metric_fn(ground_truth, prediction) for ground_truth in ground_truths
)
def _exact_match_score(target, prediction):
return target == prediction
def _f1_score(target, prediction):
"""Computes token f1 score for a single target and prediction."""
prediction_tokens = prediction.split()
target_tokens = target.split()
common = (collections.Counter(prediction_tokens) &
collections.Counter(target_tokens))
num_same = sum(common.values())
if num_same == 0:
return 0
precision = 1.0 * num_same / len(prediction_tokens)
recall = 1.0 * num_same / len(target_tokens)
f1 = (2 * precision * recall) / (precision + recall)
return f1
def qa_metrics(targets, predictions):
"""Computes exact match and f1 QA scores, expecting pre-normalized text."""
if len(targets) != len(predictions):
raise ValueError("Number of targets and predictions must match.")
em = np.mean([
_metric_max_over_ground_truths(_exact_match_score, t, p)
for p, t in zip(predictions, targets)
])
f1 = np.mean([
_metric_max_over_ground_truths(_f1_score, t, p)
for p, t in zip(predictions, targets)
])
em *= 100
f1 *= 100
return {"em": em, "f1": f1}

View File

@ -0,0 +1,7 @@
files=(cola mnli mrpc qnli qqp rte sst2 stsb superglue-boolq superglue-cb superglue-copa superglue-multirc superglue-record superglue-wic superglue-wsc.fixed)
for ((i=$1; i<=$2; i++))
do
dataset=${files[i]}
echo "id$i:$dataset"
TOKENIZERS_PARALLELISM=false python run_seq2seq.py configs/$3/$dataset.json
done
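# Usage sketch (not part of the original script; the config directory name is a
# placeholder): run tasks 0..2 of the `files` array above (cola, mnli, mrpc) using the
# JSON configs stored under configs/<your_config_dir>/<task>.json:
#   bash <this_script>.sh 0 2 <your_config_dir>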

View File

@ -0,0 +1,468 @@
# coding=utf-8
# Copyright The HuggingFace Team and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fine-tuning the library models for sequence to sequence.
"""
# You can also adapt this script on your own sequence to sequence task. Pointers for this are left as comments.
import functools
import logging
from opendelta.utils.delta_hub import create_hub_repo_name
import torch
import os
os.environ['MKL_THREADING_LAYER'] = 'GNU'
os.environ['MKL_SERVICE_FORCE_INTEL'] = '1'
import sys
import subprocess
from typing import Optional, List
from datasets import load_dataset, load_metric, concatenate_datasets
import transformers
from transformers import (
AutoConfig,
AutoModelForSeq2SeqLM,
AutoTokenizer,
HfArgumentParser,
MBartTokenizer,
default_data_collator,
set_seed,
)
from transformers.trainer_utils import is_main_process, get_last_checkpoint
# from ..seq2seq.utils import get_adapter_config
from examples_seq2seq.data_processors import AutoTask, TaskDataCollatorForSeq2Seq, AutoPostProcessor
from examples_seq2seq.seq2seq_trainer import Seq2SeqTrainer
# from training_args import AdapterTrainingArguments
from examples_seq2seq.trainers.trainer_utils import save_training_config
from dataclasses import dataclass, field
from transformers.models.t5.modeling_t5 import T5Config, T5ForConditionalGeneration
from examples_seq2seq.trainers.model_args import ModelArguments
from examples_seq2seq.trainers.trainer_args import TrainingArguments, DataTrainingArguments
logger = logging.getLogger(__name__)
def run_command(command):
output = subprocess.getoutput(command)
return output
TASK_TO_METRICS = {"mrpc": ["accuracy", "f1"],
"cola": ['matthews_correlation'],
"stsb": ['pearson', 'spearmanr'],
'sst2': ['accuracy'],
"mnli": ["accuracy"],
"mnli_mismatched": ["accuracy"],
"mnli_matched": ["accuracy"],
"qnli": ["accuracy"],
"rte": ["accuracy"],
"wnli": ["accuracy"],
"qqp": ["accuracy", "f1"],
"superglue-boolq": ["accuracy"],
"superglue-rte": ["accuracy"],
"superglue-cb": ["f1_multiclass", "accuracy"],
"superglue-copa": ["accuracy"],
"superglue-multirc": ["f1", "em"],
"superglue-wic": ["accuracy"],
"superglue-wsc.fixed": ["accuracy"],
"superglue-record": ["f1", "em"]
}
class RemainArgHfArgumentParser(HfArgumentParser):
def parse_json_file(self, json_file: str, return_remaining_args=True ):
"""
Alternative helper method that does not use `argparse` at all, instead loading a json file and populating the
dataclass types.
"""
import argparse
import json
from pathlib import Path
import dataclasses
data = json.loads(Path(json_file).read_text())
outputs = []
for dtype in self.dataclass_types:
keys = {f.name for f in dataclasses.fields(dtype) if f.init}
inputs = {k: data.pop(k) for k in list(data.keys()) if k in keys}
obj = dtype(**inputs)
outputs.append(obj)
remain_args = argparse.ArgumentParser()
remain_args.__dict__.update(data)
if return_remaining_args:
return (*outputs, remain_args)
else:
return (*outputs,)
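# Illustrative sketch (not part of the original file) of the kind of JSON config this
# parser consumes: keys matching fields of ModelArguments, DataTrainingArguments and
# TrainingArguments are routed to those dataclasses, and any remaining keys (e.g. the
# delta-tuning options such as "delta_type") are returned as the extra namespace-like
# object. The concrete values below are assumptions made only for illustration.
#
# {
#     "model_name_or_path": "t5-base",
#     "task_name": "rte",
#     "eval_dataset_name": "rte",
#     "test_dataset_name": "rte",
#     "output_dir": "outputs/rte_example",
#     "do_train": true,
#     "do_eval": true,
#     "do_test": true,
#     "predict_with_generate": true,
#     "delta_type": "lora"
# }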
def main():
# See all possible arguments in src/transformers/training_args.py
# or by passing the --help flag to this script.
# We now keep distinct sets of args, for a cleaner separation of concerns.
parser = RemainArgHfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):
# If we pass only one argument to the script and it's the path to a json file,
# let's parse it to get our arguments.
model_args, data_args, training_args, delta_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))
else:
model_args, data_args, training_args, delta_args = parser.parse_args_into_dataclasses()
# Detecting last checkpoint.
last_checkpoint = None
if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:
last_checkpoint = get_last_checkpoint(training_args.output_dir)
print("#### last_checkpoint ", last_checkpoint)
if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:
'''
raise ValueError(
f"Output directory ({training_args.output_dir}) already exists and is not empty. "
"Use --overwrite_output_dir to overcome."
)
'''
pass
elif last_checkpoint is not None:
logger.info(
f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "
"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."
)
# Setup logging
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%m/%d/%Y %H:%M:%S",
handlers=[logging.StreamHandler(sys.stdout)],
)
logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)
# Log on each process the small summary:
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
# Set the verbosity to info of the Transformers logger (on main process only):
if is_main_process(training_args.local_rank):
transformers.utils.logging.set_verbosity_info()
logger.info("Training/evaluation parameters %s", training_args)
# Set seed before initializing model.
set_seed(training_args.seed)
# Get the datasets: you can either provide your own CSV/JSON training and evaluation files (see below)
# or just provide the name of one of the public datasets available on the hub at https://huggingface.co/datasets/
# (the dataset will be downloaded automatically from the datasets Hub).
#
# For CSV/JSON files in the summarization task, this script will use the first column for the full texts and the
# second column for the summaries (unless you specify column names for this with the `text_column` and
# `summary_column` arguments).
# For translation, only JSON files are supported, with one field named "translation" containing two keys for the
# source and target languages (unless you adapt what follows).
#
# In distributed training, the load_dataset function guarantee that only one local process can concurrently
# download the dataset.
# See more about loading any type of standard or custom dataset (from files, python dict, pandas DataFrame, etc) at
# https://huggingface.co/docs/datasets/loading_datasets.html.
# Load pretrained model and tokenizer
#
# Distributed training:
# The .from_pretrained methods guarantee that only one local process can concurrently
# download model & vocab.
config = AutoConfig.from_pretrained(
model_args.config_name if model_args.config_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
config.dropout_rate = 0.0
tokenizer = AutoTokenizer.from_pretrained(
model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
cache_dir=model_args.cache_dir,
use_fast=model_args.use_fast_tokenizer,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
model_args.model_name_or_path,
from_tf=bool(".ckpt" in model_args.model_name_or_path),
config=config,
cache_dir=model_args.cache_dir,
revision=model_args.model_revision,
use_auth_token=True if model_args.use_auth_token else None,
)
model.resize_token_embeddings(len(tokenizer))
if delta_args.delta_type.lower() != "none":
from opendelta import AutoDeltaConfig,AutoDeltaModel
delta_config = AutoDeltaConfig.from_dict(vars(delta_args))
delta_model = AutoDeltaModel.from_config(delta_config, backbone_model=model)
delta_model.freeze_module(set_state_dict = True)
delta_model.log(delta_ratio=True, trainable_ratio=True, visualization=True)
# model parallelize
if hasattr(training_args, "model_parallel") and training_args.model_parallel:
logger.info('parallelize model!')
model.parallelize()
data_args.dataset_name = [data_args.task_name]
data_args.eval_dataset_name = [data_args.eval_dataset_name]
data_args.test_dataset_name = [data_args.test_dataset_name]
data_args.dataset_config_name = [data_args.dataset_config_name]
data_args.eval_dataset_config_name = [data_args.eval_dataset_config_name]
data_args.test_dataset_config_name = [data_args.test_dataset_config_name]
assert len(data_args.dataset_name) == len(data_args.dataset_config_name)
if data_args.eval_dataset_name is not None:
assert len(data_args.eval_dataset_name) == len(data_args.eval_dataset_config_name)
if data_args.test_dataset_name is not None:
assert len(data_args.test_dataset_name) == len(data_args.test_dataset_config_name)
# Temporarily set max_target_length for training.
#max_target_length = data_args.max_target_length
padding = "max_length" if data_args.pad_to_max_length else False
def preprocess_function(examples, max_target_length):
# max_target_length += 1
# model_inputs = tokenizer([s+"<extra_id_0>" for s in examples['source']], max_length=data_args.max_source_length,
# padding=padding, truncation=True)
# # Setup the tokenizer for targets
# with tokenizer.as_target_tokenizer():
# labels = tokenizer(['<extra_id_0>'+t for t in examples['target']], max_length=max_target_length, padding=padding, truncation=True)
model_inputs = tokenizer([s for s in examples['source']], max_length=data_args.max_source_length,
padding=padding, truncation=True)
# Setup the tokenizer for targets
with tokenizer.as_target_tokenizer():
labels = tokenizer([t for t in examples['target']], max_length=max_target_length, padding=padding, truncation=True)
# If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
# padding in the loss.
if padding == "max_length" and data_args.ignore_pad_token_for_loss:
labels["input_ids"] = [
[(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
]
model_inputs["labels"] = labels["input_ids"]
model_inputs["extra_fields"] = examples['extra_fields']
return model_inputs
column_names = ['source', 'target', 'extra_fields']
performance_metrics = {}
if training_args.do_train:
train_datasets = [AutoTask.get(dataset_name,
dataset_config_name,
seed=data_args.data_seed).get(
split="train",
split_validation_test=training_args.split_validation_test,
add_prefix=True,
n_obs=data_args.max_train_samples)
for dataset_name, dataset_config_name\
in zip(data_args.dataset_name, data_args.dataset_config_name)]
max_target_lengths = [AutoTask.get(dataset_name, dataset_config_name).get_max_target_length(\
tokenizer=tokenizer, default_max_length=data_args.max_target_length)\
for dataset_name, dataset_config_name in zip(data_args.dataset_name, data_args.dataset_config_name)]
for i, train_dataset in enumerate(train_datasets):
train_datasets[i] = train_datasets[i].map(
functools.partial(preprocess_function, max_target_length=max_target_lengths[i]),
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=column_names, # if train_dataset != "superglue-record" else column_names+["answers"],
load_from_cache_file=not data_args.overwrite_cache,
)
train_dataset = concatenate_datasets(train_datasets)
if training_args.do_eval:
eval_datasets = {eval_dataset: AutoTask.get(eval_dataset, eval_dataset_config,
seed=data_args.data_seed).get(
split="validation",
split_validation_test=training_args.split_validation_test,
add_prefix=True,
n_obs=data_args.max_val_samples)
for eval_dataset, eval_dataset_config in zip(data_args.eval_dataset_name, data_args.eval_dataset_config_name)}
max_target_lengths = [AutoTask.get(dataset_name, dataset_config_name).get_max_target_length( \
tokenizer=tokenizer, default_max_length=data_args.max_target_length) \
for dataset_name, dataset_config_name in zip(data_args.eval_dataset_name, data_args.eval_dataset_config_name)]
for k, name in enumerate(eval_datasets):
eval_datasets[name] = eval_datasets[name].map(
functools.partial(preprocess_function, max_target_length=max_target_lengths[k]),
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=column_names, # if name != "superglue-record" else column_names+["answers"],
load_from_cache_file=not data_args.overwrite_cache,
)
if training_args.do_test:
test_datasets = {test_dataset: AutoTask.get(test_dataset, test_dataset_config,
seed=data_args.data_seed).get(
split="test",
split_validation_test=training_args.split_validation_test,
add_prefix=True,
n_obs=data_args.max_test_samples)
for test_dataset, test_dataset_config in zip(data_args.test_dataset_name, data_args.test_dataset_config_name)}
max_target_lengths = [AutoTask.get(dataset_name, dataset_config_name).get_max_target_length( \
tokenizer=tokenizer, default_max_length=data_args.max_target_length) \
for dataset_name, dataset_config_name in zip(data_args.test_dataset_name, data_args.test_dataset_config_name)]
for k, name in enumerate(test_datasets):
test_datasets[name] = test_datasets[name].map(
functools.partial(preprocess_function, max_target_length=max_target_lengths[k]),
batched=True,
num_proc=data_args.preprocessing_num_workers,
remove_columns=column_names,
load_from_cache_file=not data_args.overwrite_cache,
)
# Data collator
label_pad_token_id = -100 if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id
if data_args.pad_to_max_length:
data_collator = default_data_collator
else:
data_collator = TaskDataCollatorForSeq2Seq(
tokenizer,
label_pad_token_id=label_pad_token_id,
pad_to_multiple_of=8 if training_args.fp16 else None,
)
# Metric, we assume we have only one training task.
eval_metrics = [AutoTask.get(dataset_name, dataset_config_name).metric\
for dataset_name, dataset_config_name in zip(data_args.dataset_name, data_args.dataset_config_name)][0]
# Extracts the extra information needed to evaluate on each dataset.
# These information are only used in the compute_metrics.
# We will assume that the test/eval dataloader does not change the order of
# the data.
data_info = {"eval": eval_datasets[data_args.eval_dataset_name[0]]['extra_fields'],
"test": test_datasets[data_args.test_dataset_name[0]]['extra_fields'],
"train": train_dataset['extra_fields']}
def compute_metrics(eval_preds):
preds, labels, data_info = eval_preds
post_processor = AutoPostProcessor.get(data_args.dataset_name[0], tokenizer,
data_args.ignore_pad_token_for_loss)
decoded_preds, decoded_labels = post_processor.process(preds, labels, data_info)
result = {}
for metric in eval_metrics:
result.update(metric(decoded_preds, decoded_labels))
return result
# Initialize our Trainer
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
delta_args=delta_args,
train_dataset=train_dataset if training_args.do_train else None,
eval_dataset=list(eval_datasets.values())[0] if training_args.do_eval else None,
data_info = data_info,
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics if training_args.predict_with_generate else None,
evaluation_metrics = TASK_TO_METRICS[data_args.dataset_name[0]],
)
# Saves training config.
if trainer.is_world_process_zero():
os.makedirs(training_args.output_dir, exist_ok=True)
save_training_config(sys.argv[1], training_args.output_dir)
# Training
if training_args.do_train:
checkpoint = None
if training_args.resume_from_checkpoint is not None:
checkpoint = training_args.resume_from_checkpoint
elif last_checkpoint is not None:
checkpoint = last_checkpoint
if training_args.compute_time:
torch.cuda.synchronize() # wait for move to complete
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
train_result = trainer.train(resume_from_checkpoint=checkpoint)
if training_args.compute_time:
end.record()
torch.cuda.synchronize() # wait for all_reduce to complete
total_time = start.elapsed_time(end)/(1000*60)
performance_metrics.update({"total_time in minutes ": total_time})
trainer.save_model() # Saves the tokenizer too for easy upload
train_metrics = train_result.metrics
max_train_samples = (
data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)
)
train_metrics["train_samples"] = min(max_train_samples, len(train_dataset))
trainer.log_metrics("train", train_metrics)
trainer.save_metrics("train", train_metrics)
trainer.save_state()
if torch.cuda.is_available() and training_args.compute_memory:
peak_memory = (torch.cuda.max_memory_allocated() / 1024 ** 2)/1000
print(
"Memory utilization",
peak_memory,
"GB"
)
performance_metrics.update({"peak_memory": peak_memory})
if training_args.compute_memory or training_args.compute_time:
print(performance_metrics)
trainer.save_metrics("performance", performance_metrics)
# Evaluation
results = {}
if training_args.do_eval:
logger.info("*** Evaluate ***")
for task, eval_dataset in eval_datasets.items():
metrics = trainer.evaluate(eval_dataset=eval_dataset,
max_length=data_args.val_max_target_length, num_beams=data_args.num_beams,
)
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
results['evaluate'] = metrics
# Test
if training_args.do_test:
logger.info("*** Test ***")
for task, test_dataset in test_datasets.items():
metrics = trainer.evaluate(eval_dataset=test_dataset,
max_length=data_args.test_max_target_length, num_beams=data_args.num_beams,
metric_key_prefix="test"
)
trainer.log_metrics("test", metrics)
trainer.save_metrics("test", metrics)
results['test'] = metrics
repo_name = create_hub_repo_name(root="DeltaHub",
dataset=data_args.task_name,
delta_type = delta_args.delta_type,
model_name_or_path= model_args.model_name_or_path)
results['repo_name'] = repo_name
if training_args.push_to_hub: # TODO add description here
delta_model.save_finetuned(push_to_hub=True, save_directory=repo_name, use_auth_token=True)
# trainer.push_to_hub(**kwargs)
else:
delta_model.save_finetuned(push_to_hub=False, save_directory=repo_name, use_auth_token=True)
return results
if __name__ == "__main__":
result = main()
import json
with open("collect_result.jsonl", 'a') as fout:
string = json.dumps(result, indent=4,sort_keys=True)
fout.write(string+"\n")
print(result)
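# Typical invocation (sketch; the config path is a placeholder -- the accompanying
# shell script calls this file in the same way):
#   TOKENIZERS_PARALLELISM=false python run_seq2seq.py configs/<config_dir>/rte.json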

View File

@ -0,0 +1,127 @@
from packaging import version
import torch
from torch import nn
from typing import Any, Dict, List, Optional, Tuple, Union
from torch.utils.data.dataset import Dataset
from transformers import Seq2SeqTrainer as HfSeq2SeqTrainer
from examples_seq2seq.trainers.trainer import BaseTrainer
# if is_sagemaker_mp_enabled():
# import smdistributed.modelparallel.torch as smp
# from transformers.trainer_utils import ShardedDDPOption
# if is_fairscale_available():
# dep_version_check("fairscale")
# import fairscale
# from fairscale.nn.data_parallel import FullyShardedDataParallel as FullyShardedDDP
# from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP
# from fairscale.nn.wrap import auto_wrap
# from fairscale.optim import OSS
# from fairscale.optim.grad_scaler import ShardedGradScaler
from transformers.optimization import Adafactor, AdamW, get_scheduler
from transformers.trainer_pt_utils import get_parameter_names, is_sagemaker_mp_enabled
from transformers.integrations import is_fairscale_available
if version.parse(torch.__version__) >= version.parse("1.6"):
from torch.cuda.amp import autocast
class Seq2SeqTrainer(HfSeq2SeqTrainer, BaseTrainer):
def __init__(self, train_dataset_sizes=None, delta_args=None, *args, **kwargs):
super().__init__(*args, **kwargs)
self.train_dataset_sizes = train_dataset_sizes
self.delta_args = delta_args
def evaluate(
self,
eval_dataset: Optional[Dict[str, Dataset]] = None,
ignore_keys: Optional[List[str]] = None,
metric_key_prefix: str = "eval",
max_length: Optional[int] = None,
num_beams: Optional[int] = None,
) -> Dict[str, float]:
# TODO: this also needs to be set per dataset
self._max_length = max_length
self._num_beams = num_beams
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
def prediction_step(
self,
model: nn.Module,
inputs: Dict[str, Union[torch.Tensor, Any]],
prediction_loss_only: bool,
ignore_keys: Optional[List[str]] = None,
) -> Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]:
"""
Perform an evaluation step on :obj:`model` using obj:`inputs`.
Subclass and override to inject custom behavior.
Args:
model (:obj:`nn.Module`):
The model to evaluate.
inputs (:obj:`Dict[str, Union[torch.Tensor, Any]]`):
The inputs and targets of the model.
The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
argument :obj:`labels`. Check your model's documentation for all accepted arguments.
prediction_loss_only (:obj:`bool`):
Whether or not to return the loss only.
Return:
Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]: A tuple with the loss, logits and
labels (each being optional).
"""
if not self.args.predict_with_generate or prediction_loss_only:
return super().prediction_step(
model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys
)
has_labels = "labels" in inputs
inputs = self._prepare_inputs(inputs)
gen_kwargs = {
"max_length": self._max_length if self._max_length is not None else self.model.config.max_length,
"num_beams": self._num_beams if self._num_beams is not None else self.model.config.num_beams,
}
generated_tokens = self.model.generate(
inputs["input_ids"],
attention_mask=inputs["attention_mask"],
**gen_kwargs,
)
# in case the batch is shorter than max length, the output should be padded
if generated_tokens.shape[-1] < gen_kwargs["max_length"]:
generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_length"])
with torch.no_grad():
if self.use_amp:
with autocast():
outputs = model(**inputs)
else:
outputs = model(**inputs)
if has_labels:
if self.label_smoother is not None:
loss = self.label_smoother(outputs, inputs["labels"]).mean().detach()
else:
loss = (outputs["loss"] if isinstance(outputs, dict) else outputs[0]).mean().detach()
else:
loss = None
if self.args.prediction_loss_only:
return (loss, None, None)
labels = inputs["labels"]
if labels.shape[-1] < gen_kwargs["max_length"]:
labels = self._pad_tensors_to_max_len(labels, gen_kwargs["max_length"])
return (loss, generated_tokens, labels)

View File

@ -0,0 +1,2 @@
from .trainer import BaseTrainer
from .seq2seq_trainer import Seq2SeqTrainer

View File

@ -0,0 +1,36 @@
from dataclasses import dataclass, field
from typing import Optional, List
@dataclass
class ModelArguments:
"""
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
"""
model_name_or_path: str = field(
metadata={"help": "Path to pretrained model or model identifier from huggingface.co/models"}
)
config_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained config name or path if not the same as model_name"}
)
tokenizer_name: Optional[str] = field(
default=None, metadata={"help": "Pretrained tokenizer name or path if not the same as model_name"}
)
cache_dir: Optional[str] = field(
default=None,
metadata={"help": "Where to store the pretrained models downloaded from huggingface.co"},
)
use_fast_tokenizer: bool = field(
default=True,
metadata={"help": "Whether to use one of the fast tokenizer (backed by the tokenizers library) or not."},
)
model_revision: str = field(
default="main",
metadata={"help": "The specific model version to use (can be a branch name, tag name or commit id)."},
)
use_auth_token: bool = field(
default=False,
metadata={
"help": "Will use the token generated when running `transformers-cli login` (necessary to use this script "
"with private models)."
},
)

View File

@ -0,0 +1,108 @@
from packaging import version
import torch
from torch import nn
from typing import Any, Dict, List, Optional, Tuple, Union
from torch.utils.data.dataset import Dataset
from transformers import Seq2SeqTrainer as HfSeq2SeqTrainer
from .trainer import BaseTrainer
if version.parse(torch.__version__) >= version.parse("1.6"):
from torch.cuda.amp import autocast
class Seq2SeqTrainer(HfSeq2SeqTrainer, BaseTrainer):
def __init__(self, train_dataset_sizes=None, delta_args=None, *args, **kwargs):
super().__init__(*args, **kwargs)
self.train_dataset_sizes = train_dataset_sizes
self.delta_args = delta_args
def evaluate(
self,
eval_dataset: Optional[Dict[str, Dataset]] = None,
ignore_keys: Optional[List[str]] = None,
metric_key_prefix: str = "eval",
max_length: Optional[int] = None,
num_beams: Optional[int] = None,
) -> Dict[str, float]:
# TODO: this also needs to be set per dataset
self._max_length = max_length
self._num_beams = num_beams
return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
def prediction_step(
self,
model: nn.Module,
inputs: Dict[str, Union[torch.Tensor, Any]],
prediction_loss_only: bool,
ignore_keys: Optional[List[str]] = None,
) -> Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]:
"""
Perform an evaluation step on :obj:`model` using obj:`inputs`.
Subclass and override to inject custom behavior.
Args:
model (:obj:`nn.Module`):
The model to evaluate.
inputs (:obj:`Dict[str, Union[torch.Tensor, Any]]`):
The inputs and targets of the model.
The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
argument :obj:`labels`. Check your model's documentation for all accepted arguments.
prediction_loss_only (:obj:`bool`):
Whether or not to return the loss only.
Return:
Tuple[Optional[float], Optional[torch.Tensor], Optional[torch.Tensor]]: A tuple with the loss, logits and
labels (each being optional).
"""
if not self.args.predict_with_generate or prediction_loss_only:
return super().prediction_step(
model, inputs, prediction_loss_only=prediction_loss_only, ignore_keys=ignore_keys
)
has_labels = "labels" in inputs
inputs = self._prepare_inputs(inputs)
gen_kwargs = {
"max_length": self._max_length if self._max_length is not None else self.model.config.max_length,
"num_beams": self._num_beams if self._num_beams is not None else self.model.config.num_beams,
}
generated_tokens = self.model.generate(
inputs["input_ids"],
attention_mask=inputs["attention_mask"],
**gen_kwargs,
)
# in case the batch is shorter than max length, the output should be padded
if generated_tokens.shape[-1] < gen_kwargs["max_length"]:
generated_tokens = self._pad_tensors_to_max_len(generated_tokens, gen_kwargs["max_length"])
with torch.no_grad():
if self.use_amp:
with autocast():
outputs = model(**inputs)
else:
outputs = model(**inputs)
if has_labels:
if self.label_smoother is not None:
loss = self.label_smoother(outputs, inputs["labels"]).mean().detach()
else:
loss = (outputs["loss"] if isinstance(outputs, dict) else outputs[0]).mean().detach()
else:
loss = None
if self.args.prediction_loss_only:
return (loss, None, None)
labels = inputs["labels"]
if labels.shape[-1] < gen_kwargs["max_length"]:
labels = self._pad_tensors_to_max_len(labels, gen_kwargs["max_length"])
return (loss, generated_tokens, labels)

View File

@ -0,0 +1,274 @@
from typing import Dict, List, Optional
import numpy as np
import time
import torch
import collections
from packaging import version
from torch.utils.data.dataset import Dataset
from transformers import Trainer
from transformers import logging
from transformers.trainer_utils import (
speed_metrics,
EvalLoopOutput,
denumpify_detensorize
)
from transformers.file_utils import is_torch_tpu_available
from transformers.trainer_pt_utils import (
find_batch_size,
nested_numpify,
nested_truncate,
nested_concat,
IterableDatasetShard
)
from .trainer_utils import EvalPrediction
from torch.utils.data.dataloader import DataLoader
from torch.utils.data.dataset import IterableDataset
from transformers.deepspeed import deepspeed_init
if version.parse(torch.__version__) >= version.parse("1.6"):
from torch.cuda.amp import autocast
if is_torch_tpu_available():
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met
import torch_xla.distributed.parallel_loader as pl
logger = logging.get_logger(__name__)
class BaseTrainer(Trainer):
def __init__(self, evaluation_metrics=[], data_info=None, *args, **kwargs):
"""When doing evaluation, it computes average of list of metrics
given in evaluation_metrics and adds it to the dictionary of results.
Trainer class then use this average metric to save the best model."""
super().__init__(*args, **kwargs)
self.evaluation_metrics = evaluation_metrics
self.data_info = data_info
def get_data_info(self, metric_key_prefix):
"""Returns the data information required to make the predictions/labels
suitable for the evaluation."""
if self.data_info is not None:
return self.data_info[metric_key_prefix]
return None
def evaluate(
self,
eval_dataset: Optional[Dataset] = None,
ignore_keys: Optional[List[str]] = None,
metric_key_prefix: str = "eval",
) -> Dict[str, float]:
"""
Run evaluation and returns metrics.
The calling script will be responsible for providing a method to compute metrics, as they are task-dependent
(pass it to the init :obj:`compute_metrics` argument).
You can also subclass and override this method to inject custom behavior.
Args:
eval_dataset (:obj:`Dataset`, `optional`):
Pass a dataset if you wish to override :obj:`self.eval_dataset`. If it is an :obj:`datasets.Dataset`,
columns not accepted by the ``model.forward()`` method are automatically removed. It must implement the
:obj:`__len__` method.
ignore_keys (:obj:`List[str]`, `optional`):
A list of keys in the output of your model (if it is a dictionary) that should be ignored when
gathering predictions.
metric_key_prefix (:obj:`str`, `optional`, defaults to :obj:`"eval"`):
An optional prefix to be used as the metrics key prefix. For example the metrics "bleu" will be named
"eval_bleu" if the prefix is "eval" (default)
Returns:
A dictionary containing the evaluation loss and the potential metrics computed from the predictions. The
dictionary also contains the epoch number which comes from the training state.
"""
# memory metrics - must set up as early as possible
self._memory_tracker.start()
eval_dataloader = self.get_eval_dataloader(eval_dataset)
start_time = time.time()
eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
output = eval_loop(
eval_dataloader,
description="Evaluation",
# No point gathering the predictions if there are no metrics, otherwise we defer to
# self.args.prediction_loss_only
prediction_loss_only=True if self.compute_metrics is None else None,
ignore_keys=ignore_keys,
metric_key_prefix=metric_key_prefix,
)
output.metrics.update(speed_metrics(metric_key_prefix, start_time, output.num_samples))
if len(self.evaluation_metrics) != 0:
selected_metrics = [output.metrics[metric_key_prefix+"_"+k] for k in self.evaluation_metrics if metric_key_prefix+"_"+k in output.metrics]
assert len(selected_metrics) >= 1, "at least one metric should be selected to compute the average_metrics."
output.metrics.update({metric_key_prefix+'_average_metrics': np.mean(selected_metrics)})
self.log(output.metrics)
if self.args.tpu_metrics_debug or self.args.debug:
# tpu-comment: Logging debug metrics for PyTorch/XLA (compile, execute times, ops, etc.)
xm.master_print(met.metrics_report())
self.control = self.callback_handler.on_evaluate(self.args, self.state, self.control, output.metrics)
self._memory_tracker.stop_and_update_metrics(output.metrics)
return output.metrics
def evaluation_loop(
self,
dataloader: DataLoader,
description: str,
prediction_loss_only: Optional[bool] = None,
ignore_keys: Optional[List[str]] = None,
metric_key_prefix: str = "eval",
) -> EvalLoopOutput:
"""
Prediction/evaluation loop, shared by :obj:`Trainer.evaluate()` and :obj:`Trainer.predict()`.
Works both with or without labels.
"""
prediction_loss_only = (
prediction_loss_only if prediction_loss_only is not None else self.args.prediction_loss_only
)
# if eval is called w/o train init deepspeed here
if self.args.deepspeed and not self.deepspeed:
# XXX: eval doesn't have `resume_from_checkpoint` arg but we should be able to do eval
# from the checkpoint eventually
deepspeed_engine, _, _ = deepspeed_init(self, num_training_steps=0, resume_from_checkpoint=None)
self.model = deepspeed_engine.module
self.model_wrapped = deepspeed_engine
self.deepspeed = deepspeed_engine
# XXX: we don't need optim/sched for inference, but this needs to be sorted out, since
# for example the Z3-optimizer is a must for zero3 to work even for inference - what we
# don't need is the deepspeed basic optimizer which is self.optimizer.optimizer
deepspeed_engine.optimizer.optimizer = None
deepspeed_engine.lr_scheduler = None
model = self._wrap_model(self.model, training=False)
# if full fp16 is wanted on eval and this ``evaluation`` or ``predict`` isn't called while
# ``train`` is running, halve it first and then put on device
if not self.is_in_train and self.args.fp16_full_eval:
model = model.half().to(self.args.device)
batch_size = dataloader.batch_size
logger.info(f"***** Running {description} *****")
if isinstance(dataloader.dataset, collections.abc.Sized):
logger.info(f" Num examples = {self.num_examples(dataloader)}")
else:
logger.info(" Num examples: Unknown")
logger.info(f" Batch size = {batch_size}")
model.eval()
self.callback_handler.eval_dataloader = dataloader
# Do this before wrapping.
eval_dataset = dataloader.dataset
if is_torch_tpu_available():
dataloader = pl.ParallelLoader(dataloader, [self.args.device]).per_device_loader(self.args.device)
if self.args.past_index >= 0:
self._past = None
# Initialize containers
# losses/preds/labels on GPU/TPU (accumulated for eval_accumulation_steps)
losses_host = None
preds_host = None
labels_host = None
# losses/preds/labels on CPU (final containers)
all_losses = None
all_preds = None
all_labels = None
# Will be useful when we have an iterable dataset so don't know its length.
observed_num_examples = 0
# Main evaluation loop
for step, inputs in enumerate(dataloader):
# Update the observed num examples
observed_batch_size = find_batch_size(inputs)
if observed_batch_size is not None:
observed_num_examples += observed_batch_size
# Prediction step
loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
# Update containers on host
if loss is not None:
losses = self._nested_gather(loss.repeat(batch_size))
losses_host = losses if losses_host is None else torch.cat((losses_host, losses), dim=0)
if logits is not None:
logits = self._pad_across_processes(logits)
logits = self._nested_gather(logits)
preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
if labels is not None:
labels = self._pad_across_processes(labels)
labels = self._nested_gather(labels)
labels_host = labels if labels_host is None else nested_concat(labels_host, labels, padding_index=-100)
self.control = self.callback_handler.on_prediction_step(self.args, self.state, self.control)
# Gather all tensors and put them back on the CPU if we have done enough accumulation steps.
if self.args.eval_accumulation_steps is not None and (step + 1) % self.args.eval_accumulation_steps == 0:
if losses_host is not None:
losses = nested_numpify(losses_host)
all_losses = losses if all_losses is None else np.concatenate((all_losses, losses), axis=0)
if preds_host is not None:
logits = nested_numpify(preds_host)
all_preds = logits if all_preds is None else nested_concat(all_preds, logits, padding_index=-100)
if labels_host is not None:
labels = nested_numpify(labels_host)
all_labels = (
labels if all_labels is None else nested_concat(all_labels, labels, padding_index=-100)
)
# Set back to None to begin a new accumulation
losses_host, preds_host, labels_host = None, None, None
if self.args.past_index and hasattr(self, "_past"):
# Clean the state at the end of the evaluation loop
delattr(self, "_past")
# Gather all remaining tensors and put them back on the CPU
if losses_host is not None:
losses = nested_numpify(losses_host)
all_losses = losses if all_losses is None else np.concatenate((all_losses, losses), axis=0)
if preds_host is not None:
logits = nested_numpify(preds_host)
all_preds = logits if all_preds is None else nested_concat(all_preds, logits, padding_index=-100)
if labels_host is not None:
labels = nested_numpify(labels_host)
all_labels = labels if all_labels is None else nested_concat(all_labels, labels, padding_index=-100)
# Number of samples
if not isinstance(eval_dataset, IterableDataset):
num_samples = len(eval_dataset)
elif isinstance(eval_dataset, IterableDatasetShard):
num_samples = eval_dataset.num_examples
else:
num_samples = observed_num_examples
# The number of losses has been rounded to a multiple of batch_size, and in distributed training the number of
# samples has been rounded to a multiple of batch_size as well, so we truncate.
if all_losses is not None:
all_losses = all_losses[:num_samples]
if all_preds is not None:
all_preds = nested_truncate(all_preds, num_samples)
if all_labels is not None:
all_labels = nested_truncate(all_labels, num_samples)
# Metrics!
if self.compute_metrics is not None and all_preds is not None and all_labels is not None:
metrics = self.compute_metrics(EvalPrediction(predictions=all_preds, label_ids=all_labels,
data_info=self.get_data_info(metric_key_prefix)))
else:
metrics = {}
# To be JSON-serializable, we need to remove numpy types or zero-d tensors
metrics = denumpify_detensorize(metrics)
if all_losses is not None:
metrics[f"{metric_key_prefix}_loss"] = all_losses.mean().item()
# Prefix all keys with metric_key_prefix + '_'
for key in list(metrics.keys()):
if not key.startswith(f"{metric_key_prefix}_"):
metrics[f"{metric_key_prefix}_{key}"] = metrics.pop(key)
return EvalLoopOutput(predictions=all_preds, label_ids=all_labels, metrics=metrics, num_samples=num_samples)
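# A minimal compute_metrics sketch for reference (hypothetical helper, assuming single-label
# classification logits); the extra ``data_info`` field carried by EvalPrediction is
# available to such a function but left unused here.
def example_compute_metrics(eval_pred):
    import numpy as np
    logits = eval_pred.predictions
    if isinstance(logits, tuple):  # some models return (logits, ...) tuples
        logits = logits[0]
    preds = np.argmax(logits, axis=-1)
    accuracy = float((preds == eval_pred.label_ids).mean())
    return {"accuracy": accuracy}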


@ -0,0 +1,140 @@
from dataclasses import dataclass, field
from typing import Optional, List
from transformers import Seq2SeqTrainingArguments
# run_seq2seq parameters.
@dataclass
class TrainingArguments(Seq2SeqTrainingArguments):
print_num_parameters: Optional[bool] = field(default=False, metadata={"help": "If set, print the number of "
"parameters of the model."})
do_test: Optional[bool] = field(default=False, metadata={"help": "If set, evaluates the test performance."})
split_validation_test: Optional[bool] = field(default=False,
metadata={"help": "If set, for datasets that do not have a test set, we use the validation set as the "
"test set and build a new validation set either by splitting the original validation set "
"in half (for datasets with fewer than 10K samples) or by holding out 1K training "
"examples as the validation set (for larger datasets)."})
compute_time: Optional[bool] = field(default=False, metadata={"help": "If set, measures the elapsed time."})
compute_memory: Optional[bool] = field(default=False, metadata={"help": "If set, measures the memory usage."})
# prefix_length: Optional[int] = field(default=100, metadata={"help": "Defines the length for prefix tuning."})
@dataclass
class DataTrainingArguments:
"""
Arguments pertaining to what data we are going to input our model for training and eval.
"""
task_name: Optional[str] = field(
default=None, metadata={"help": "The name of the dataset to use (via the datasets library)."}
)
dataset_config_name: Optional[str] = field(
default=None, metadata={"help": "The configuration name of the dataset to use (via the datasets library)."}
)
eval_dataset_name: Optional[str] = field(
default=None, metadata={"help": "The name of the evaluation dataset to use (via the datasets library)."}
)
eval_dataset_config_name: Optional[str] = field(
default=None, metadata={"help": "The configuration name of the evaluation dataset to use (via the datasets library)."}
)
test_dataset_name: Optional[str] = field(
default=None, metadata={"help": "The name of the test dataset to use (via the datasets library)."}
)
test_dataset_config_name: Optional[str] = field(
default=None, metadata={"help": "The configuration name of the test dataset to use (via the datasets library)."}
)
overwrite_cache: bool = field(
default=False, metadata={"help": "Overwrite the cached training and evaluation sets"}
)
preprocessing_num_workers: Optional[int] = field(
default=None,
metadata={"help": "The number of processes to use for the preprocessing."},
)
max_source_length: Optional[int] = field(
default=128,
metadata={
"help": "The maximum total input sequence length after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
max_target_length: Optional[int] = field(
default=128,
metadata={
"help": "The maximum total sequence length for target text after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded."
},
)
val_max_target_length: Optional[int] = field(
default=None,
metadata={
"help": "The maximum total sequence length for validation target text after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`."
"This argument is also used to override the ``max_length`` param of ``model.generate``, which is used "
"during ``evaluate`` and ``predict``."
},
)
test_max_target_length: Optional[int] = field(
default=None,
metadata={
"help": "The maximum total sequence length for test target text after tokenization. Sequences longer "
"than this will be truncated, sequences shorter will be padded. Will default to `max_target_length`."
"This argument is also used to override the ``max_length`` param of ``model.generate``, which is used "
"during ``evaluate`` and ``predict``."
},
)
pad_to_max_length: bool = field(
default=False,
metadata={
"help": "Whether to pad all samples to model maximum sentence length. "
"If False, will pad the samples dynamically when batching to the maximum length in the batch. More "
"efficient on GPU but very bad for TPU."
},
)
max_train_samples: Optional[int] = field(
default=None,
metadata={
"help": "For debugging purposes or quicker training, truncate the number of training examples to this "
"value if set."
},
)
max_val_samples: Optional[int] = field(
default=None,
metadata={
"help": "For debugging purposes or quicker training, truncate the number of validation examples to this "
"value if set."
},
)
max_test_samples: Optional[int] = field(
default=None,
metadata={"help": "For debugging purposes or quicker training, truncate the number of test examples to this "
"value if set."}
)
num_beams: Optional[int] = field(default=None, metadata={"help": "Number of beams to use for evaluation."})
ignore_pad_token_for_loss: bool = field(
default=True,
metadata={
"help": "Whether to ignore the tokens corresponding to padded labels in the loss computation or not."
},
)
task_adapters: Optional[List[str]] = field(
default=None,
metadata={"help": "Defines a dictionary from task adapters to the tasks."}
)
task_embeddings: Optional[List[str]] = field(
default=None,
metadata={"help": "Defines a dictionary from tasks to the tasks embeddings."}
)
data_seed: Optional[int] = field(default=42, metadata={"help": "Seed used to shuffle the data."})
model_parallel: Optional[bool] = field(default=False, metadata={"help": "Whether to apply model parallelism."})
def __post_init__(self):
if self.task_name is None:
raise ValueError("Need either a dataset name or a training/validation file.")
if self.val_max_target_length is None:
self.val_max_target_length = self.max_target_length
if self.test_max_target_length is None:
self.test_max_target_length = self.max_target_length
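# Minimal usage sketch (the config path below is hypothetical): these dataclasses are
# typically populated by transformers.HfArgumentParser, e.g. from a JSON configuration
# file such as the ones written by config_gen.py in the examples.
if __name__ == "__main__":
    from transformers import HfArgumentParser
    parser = HfArgumentParser((TrainingArguments, DataTrainingArguments))
    training_args, data_args = parser.parse_json_file(json_file="some_config.json")
    print(training_args.output_dir, data_args.task_name)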


@ -0,0 +1,75 @@
import numpy as np
from typing import Union, NamedTuple, Tuple, Dict, Any
import os
import regex as re
import logging
from dataclasses import fields
import torch.nn as nn
import json
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
class EvalPrediction(NamedTuple):
"""
Evaluation output (always contains labels), to be used to compute metrics.
Parameters:
predictions (:obj:`np.ndarray`): Predictions of the model.
label_ids (:obj:`np.ndarray`): Targets to be matched.
data_info (:obj:`Dict[str, Any]`): Extra dataset information required to
perform the evaluation. data_info is a dictionary with the keys train,
eval, and test, giving the extra information for each split of the dataset.
"""
predictions: Union[np.ndarray, Tuple[np.ndarray]]
label_ids: np.ndarray
data_info: Dict[str, Any]
def create_dir(output_dir):
"""
Checks whether the output_dir already exists and creates it if not.
Args:
output_dir: path to the output_dir
"""
if not os.path.exists(output_dir):
os.makedirs(output_dir)
def get_last_checkpoint(output_dir):
if os.path.exists(os.path.join(output_dir, 'pytorch_model.bin')):
return output_dir
return None
def pad_punctuation(text):
"""Re-implementation of _pad_punctuation in t5. This function adds spaces
around punctuation. While this pads punctuation as expected, it has the
unexpected effected of padding certain unicode characters with accents, with
spaces as well. For instance: "François" becomes "Fran ç ois"""
# Pad everything except for: underscores (_), whitespace (\s),
# numbers (\p{N}), letters (\p{L}) and accent characters (\p{M}).
text = re.sub(r'([^_\s\p{N}\p{L}\p{M}])', r' \1 ', text)
# Collapse consecutive whitespace into one space.
text = re.sub(r'\s+', ' ', text)
return text
def save_json(filepath, dictionary):
with open(filepath, "w") as outfile:
json.dump(dictionary, outfile)
def read_json(filepath):
with open(filepath) as f:
return json.load(f)
def save_training_config(config_file, output_dir):
json_data = read_json(config_file)
save_json(os.path.join(output_dir, "training_config.json"), json_data)
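# Quick usage sketch (hypothetical values): pad_punctuation surrounds punctuation with
# spaces while leaving letters, digits, underscores and accents untouched; save_json and
# read_json do a plain JSON round-trip to disk.
if __name__ == "__main__":
    print(pad_punctuation("delta-tuning: freeze most, tune few."))
    # -> roughly "delta - tuning : freeze most , tune few . "
    save_json("demo_config.json", {"delta_type": "lora", "lora_r": 8})
    print(read_json("demo_config.json"))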


@ -0,0 +1,15 @@
import os
import regex as re
import logging
from dataclasses import fields
import torch.nn as nn
import json
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


@ -0,0 +1,58 @@
# Text classification with OpenDelta
This directory contains examples that use OpenDelta for text classification in the traditional classification mode, i.e., with a classification head on top of the language model. Almost all of the training pipeline code remains the same; only minimal changes are needed to insert a delta model into the backbone model, as in the sketch below.
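For reference, the minimal change typically looks like the following sketch (assuming a RoBERTa backbone with LoRA; consult the OpenDelta documentation for the exact interface of your version):
```
from transformers import AutoModelForSequenceClassification
from opendelta import LoraModel

model = AutoModelForSequenceClassification.from_pretrained("roberta-base")
delta_model = LoraModel(backbone_model=model)                 # attach LoRA modules
delta_model.freeze_module(exclude=["deltas", "classifier"])   # train only deltas + head
delta_model.log()                                             # inspect the modified model
```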
## Generating the json configuration file
```
python config_gen.py --job $job_name
```
The available job configurations (e.g., `--job lora_roberta-base`) can be found in `config_gen.py`. You can also
create your own configuration.
## Run the code
```
python run_glue.py configs/$job_name/$dataset.json
```
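For example, to train RoBERTa-base with LoRA on RTE (assuming the generated `lora_roberta-base/` folder is placed under `configs/`):
```
python config_gen.py --job lora_roberta-base
python run_glue.py configs/lora_roberta-base/rte.json
```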
## Possible Errors
1.
```
ValueError: You must login to the Hugging Face hub on this computer by typing `transformers-cli login` and entering your credentials to use `use_auth_token=True`. Alternatively, you can pass your own token as the `use_auth_token` argument.
```
- Solution 1: Register an account on [HuggingFace](https://huggingface.co/), then run `transformers-cli login` on your command line and enter your username and password.
- Solution 2: Disable pushing to the hub by setting `"push_to_hub": false` in the config JSON.
2.
```
OSError: Looks like you do not have git-lfs installed, please install. You can install from https://git-lfs.github.com/. Then run `git lfs install` (you only have to do this once).
```
- Solution 1:
```
wget -P ~ https://github.com/git-lfs/git-lfs/releases/download/v3.0.2/git-lfs-linux-amd64-v3.0.2.tar.gz
cd ~
tar -xvzf git-lfs-linux-amd64-v3.0.2.tar.gz
export PATH=~:$PATH # a temporary fix; to make it permanent, add this line to your ~/.bashrc
git-lfs install
```
- Solution 2: Disable pushing to the hub by setting `"push_to_hub": false` in the config JSON.
3. Dataset connection error
- Solution 1: Open a Python console and rerun the failing command; this may or may not help.
- Solution 2: Download the dataset yourself on an internet-connected machine, save it to disk, transfer it to your server, and finally load it with `load_from_disk`, as in the sketch below.
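A rough sketch of Solution 2 with the `datasets` library (dataset name and paths are only examples):
```
# on a machine with internet access
from datasets import load_dataset
load_dataset("super_glue", "rte").save_to_disk("superglue_rte")

# on the offline server, after copying the folder over
from datasets import load_from_disk
dataset = load_from_disk("superglue_rte")
```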
## Link to the original training scripts
This example repo is based on the [huggingface text-classification example](https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification). Thanks to the authors of the original repo.


@ -0,0 +1,342 @@
import collections
import copy
AllConfigs = {}
BaseConfigs = {}
BaseConfigs['roberta-base'] = {
("job_name", "task_name", "eval_dataset_name", "test_dataset_name", "num_train_epochs",
"max_source_length",
"per_device_train_batch_size", "per_device_eval_batch_size", "warmup_steps","save_steps", "eval_steps", "metric_for_best_model"): zip(
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record",
"superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
[ 20, 20, 40, 20, 3, 3, 20, 20, 20, 3, 3, 20, 3, 3, 20],
[256, 256, 256, 256, 256, 512, 256, 128, 128, 128, 128, 128, 128, 128, 128],
[ 32, 32, 32, 32, 32, 16, 32] + [32] * 8,
[ 32, 32, 32, 32, 32, 16, 32] + [32] * 8,
[0] *7 +[0] *8,
[200, 100, 50, 100, 200, 200, 100, 200, 100, 200, 200, 100, 200, 200, 100],
[200, 100, 50, 100, 200, 200, 100, 200, 100, 200, 200, 100, 200, 200, 100],
["eval_accuracy"] *15,
),
"do_train": True,
"do_eval": True,
"do_test": True,
"model_name_or_path": "roberta-base",
"tokenizer_name": "roberta-base",
"save_total_limit": 1,
# For glue datasets.
# "split_validation_test": True,
"seed": 42,
"dataset_config_name": ["en"],
"eval_dataset_config_name": ["en"],
"test_dataset_config_name": ["en"],
# other configurations.
"predict_with_generate": True,
# To evaluate during training.
"load_best_model_at_end": True,
# "metric_for_best_model": "average_metrics",
"greater_is_better": True,
"evaluation_strategy": "steps",
"overwrite_output_dir": True,
"push_to_hub": True,
"save_strategy": "steps"
}
BaseConfigs['deberta-base'] = {
("job_name", "task_name", "eval_dataset_name", "test_dataset_name", "num_train_epochs",
"max_source_length",
"per_device_train_batch_size", "per_device_eval_batch_size", "warmup_steps","save_steps", "eval_steps", "metric_for_best_model"): zip(
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record",
"superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
[ 20, 20, 40, 20, 3, 3, 20, 20, 20, 3, 3, 20, 3, 3, 20],
[256, 256, 256, 256, 256, 512, 256, 128, 128, 128, 128, 128, 128, 128, 128],
[ 32, 32, 32, 32, 32, 16, 32] + [32] * 8,
[ 32, 32, 32, 32, 32, 16, 32] + [32] * 8,
[0] *7 +[0] *8,
[200, 100, 50, 100, 200, 200, 100, 200, 100, 200, 200, 100, 200, 200, 100],
[200, 100, 50, 100, 200, 200, 100, 200, 100, 200, 200, 100, 200, 200, 100],
["eval_accuracy"] *15,
),
"do_train": True,
"do_eval": True,
"do_test": True,
"model_name_or_path": "microsoft/deberta-v3-base",
"tokenizer_name": "microsoft/deberta-v3-base",
"save_total_limit": 1,
# For glue datasets.
# "split_validation_test": True,
"seed": 42,
"dataset_config_name": ["en"],
"eval_dataset_config_name": ["en"],
"test_dataset_config_name": ["en"],
# other configurations.
"predict_with_generate": True,
# To evaluate during training.
"load_best_model_at_end": True,
# "metric_for_best_model": "average_metrics",
"greater_is_better": True,
"evaluation_strategy": "steps",
"overwrite_output_dir": True,
"push_to_hub": True,
"save_strategy": "steps"
}
BaseConfigs['deberta-v2-xlarge'] = {
("job_name", "task_name", "eval_dataset_name", "test_dataset_name", "num_train_epochs",
"max_source_length",
"per_device_train_batch_size", "per_device_eval_batch_size", "warmup_steps","save_steps", "eval_steps", "metric_for_best_model", "gradient_accumulation_steps"): zip(
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record",
"superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
["superglue-boolq", "superglue-cb", "superglue-copa", "superglue-wic", "superglue-multirc", "superglue-record", "superglue-wsc.fixed", "mrpc", "cola", "sst2", "qnli", "rte", "mnli", "qqp", "stsb"],
[ 20, 20, 40, 20, 3, 3, 20, 20, 20, 3, 3, 20, 3, 3, 20],
[256, 256, 256, 256, 256, 512, 256, 128, 128, 128, 128, 128, 128, 128, 128],
[ 16, 16, 16, 16, 16, 8, 16] + [16] * 8,
[ 16, 16, 16, 16, 16, 8, 16] + [16] * 8,
[0] *7 +[0] *8,
[200, 100, 50, 100, 200, 200, 100, 200, 100, 200, 200, 100, 200, 200, 100],
[200, 100, 50, 100, 200, 200, 100, 200, 100, 200, 200, 100, 200, 200, 100],
["eval_accuracy"] *15,
[4] *15,
),
"do_train": True,
"do_eval": True,
"do_test": True,
"model_name_or_path": "microsoft/deberta-v2-xlarge",
"tokenizer_name": "microsoft/deberta-v2-xlarge",
"save_total_limit": 1,
# For glue datasets.
# "split_validation_test": True,
"seed": 42,
"dataset_config_name": ["en"],
"eval_dataset_config_name": ["en"],
"test_dataset_config_name": ["en"],
# other configurations.
"predict_with_generate": True,
# To evaluate during training.
"load_best_model_at_end": True,
# "metric_for_best_model": "average_metrics",
"greater_is_better": True,
"evaluation_strategy": "steps",
"overwrite_output_dir": True,
"push_to_hub": True,
"save_strategy": "steps"
}
AllConfigs['bitfit_roberta-base'] = copy.deepcopy(BaseConfigs['roberta-base'])
AllConfigs['bitfit_roberta-base'].update({
"delta_type": "bitfit",
"learning_rate": 3e-4,
"output_dir": "outputs/bitfit/roberta-base/",
"unfrozen_modules": [
"classifier",
"deltas"
],
})
AllConfigs['adapter_roberta-base'] = copy.deepcopy(BaseConfigs['roberta-base'])
AllConfigs['adapter_roberta-base'].update({
"delta_type": "adapter",
"learning_rate": 3e-4,
"unfrozen_modules": [
"deltas",
"layer_norm",
"final_layer_norm",
"classifier",
],
"bottleneck_dim":24,
"output_dir": "outputs/adapter/roberta-base/",
})
AllConfigs['lora_roberta-base'] = copy.deepcopy(BaseConfigs['roberta-base'])
AllConfigs['lora_roberta-base'].update({
"delta_type": "lora",
"learning_rate": 3e-4,
"unfrozen_modules": [
"deltas",
"layer_norm",
"final_layer_norm",
"classifier",
],
"lora_r": 8,
"output_dir": "outputs/lora/roberta-base/",
})
AllConfigs['compacter_roberta-base'] = copy.deepcopy(BaseConfigs['roberta-base'])
AllConfigs['compacter_roberta-base'].update({
"delta_type": "compacter",
"learning_rate": 3e-3,
"unfrozen_modules": [
"deltas",
"layer_norm",
"final_layer_norm",
"classifier",
],
"output_dir": "outputs/compacter/roberta-base/",
"non_linearity": "gelu_new",
#Compacter.
"hypercomplex_division": 4,
"hypercomplex_adapters": True,
"hypercomplex_nonlinearity": "glorot-uniform",
# gradient clip and clamp
"gradient_clip": False,
"phm_clamp": False,
"normalize_phm_weight": False,
"learn_phm": True,
# shared one side
"factorized_phm": True,
"shared_phm_rule": False,
"factorized_phm_rule": False,
"phm_c_init": "normal",
"phm_init_range": 0.0001,
"use_bias_down_sampler": True,
"use_bias_up_sampler": True,
})
AllConfigs['compacter++_roberta-base'] = copy.deepcopy(BaseConfigs['roberta-base'])
AllConfigs['compacter++_roberta-base'].update({
"delta_type": "compacter",
"learning_rate": 3e-3,
"do_train": True,
"do_eval": True,
"do_test": True,
"modified_modules": [
"DenseReluDense"
],
"unfrozen_modules": [
"deltas",
"layer_norm",
"final_layer_norm",
"classifier",
],
"output_dir": "outputs/compacter++/roberta-base/",
"non_linearity": "gelu_new",
#Compacter.
"hypercomplex_division": 4,
"hypercomplex_adapters": True,
"hypercomplex_nonlinearity": "glorot-uniform",
# gradient clip and clamp
"gradient_clip": False,
"phm_clamp": False,
"normalize_phm_weight": False,
"learn_phm": True,
# shared one side
"factorized_phm": True,
"shared_phm_rule": False,
"factorized_phm_rule": False,
"phm_c_init": "normal",
"phm_init_range": 0.0001,
"use_bias_down_sampler": True,
"use_bias_up_sampler": True,
})
AllConfigs['low_rank_adapter_roberta-base'] = copy.deepcopy(BaseConfigs['roberta-base'])
AllConfigs['low_rank_adapter_roberta-base'].update({
"delta_type": "low_rank_adapter",
"learning_rate": 3e-4,
"unfrozen_modules": [
"deltas",
"layer_norm",
"final_layer_norm",
"classifier",
],
"output_dir": "outputs/low_rank_adapter/roberta-base/",
"non_linearity": "gelu_new",
"low_rank_w_init": "glorot-uniform",
"low_rank_rank": 1,
})
AllConfigs['soft_prompt_roberta-base'] = copy.deepcopy(BaseConfigs['roberta-base'])
AllConfigs['soft_prompt_roberta-base'].update({
"delta_type": "soft_prompt",
"learning_rate": 3e-2,
"soft_token_num":100,
"unfrozen_modules": [
"deltas",
"classifier",
],
"output_dir": "outputs/soft_prompt/roberta-base/",
})
AllConfigs['prefix_roberta-base'] = copy.deepcopy(BaseConfigs['roberta-base'])
AllConfigs['prefix_roberta-base'].update({
"delta_type": "prefix",
"learning_rate": 3e-4,
"unfrozen_modules": [
"deltas",
"classifier",
],
"output_dir": "outputs/prefix/roberta-base/",
})
AllConfigs['soft_prompt_deberta-v2-xlarge'] = copy.deepcopy(BaseConfigs['deberta-v2-xlarge'])
AllConfigs['soft_prompt_deberta-v2-xlarge'].update({
"delta_type": "soft_prompt",
"learning_rate": 3e-2,
"soft_token_num":100,
"unfrozen_modules": [
"deltas",
"classifier",
],
"output_dir": "outputs/soft_prompt/deberta-v2-xlarge/",
})
if __name__ == "__main__":
import argparse
import json
import os
parser = argparse.ArgumentParser("Parser to generate configuration")
parser.add_argument("--job", type=str)
args = parser.parse_args()
config = AllConfigs[args.job]
Cartesian_product = []
for key in config:
if isinstance(key, tuple):
Cartesian_product.append(key)
all_config_jsons = {}
for key_tuple in Cartesian_product:
for zipped in config[key_tuple]:
job_name = zipped[0]
all_config_jsons[job_name] = {}
for key_name, zipped_elem in zip(key_tuple, zipped):
if key_name != 'job_name':
all_config_jsons[job_name][key_name] = zipped_elem
for key in config:
if not isinstance(key, tuple):
for job_name in all_config_jsons:
if key == "output_dir":
all_config_jsons[job_name][key] = config[key] + job_name
else:
all_config_jsons[job_name][key] = config[key]
if not os.path.exists(f"./{args.job}/"):
os.mkdir(f"./{args.job}/")
for job_name in all_config_jsons:
with open(f"./{args.job}/{job_name}.json", 'w') as fout:
json.dump(all_config_jsons[job_name], fout, indent=4,sort_keys=True)
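# For example, ``python config_gen.py --job lora_roberta-base`` writes one JSON per task,
# e.g. ./lora_roberta-base/rte.json, whose contents look roughly like (abridged, illustrative):
# {
#     "delta_type": "lora",
#     "task_name": "rte",
#     "learning_rate": 3e-4,
#     "num_train_epochs": 20,
#     "lora_r": 8,
#     "unfrozen_modules": ["deltas", "layer_norm", "final_layer_norm", "classifier"],
#     "output_dir": "outputs/lora/roberta-base/rte"
# }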
