RECORD - public

This is the code repository for our work Learning with an Open Horizon in Ever-Changing Dialogue Circumstances.

This work proposes using the lifetime return and meta-learning of hyperparameters to enhance continual reinforcement learning training. We optimize the state-of-the-art architecture for continual RL of dialogue policies, DDPT (see https://aclanthology.org/2022.coling-1.21/).

As base algorithms, we use PPO and CLEAR. While PPO is an on-policy algorithm, CLEAR is an off-policy algorithm specifically built for continual reinforcement learning. Moreover, the dialogue policies can be trained with different user simulator setups: a single user simulator (rule-based or transformer-based) or multiple simulators together.

Installation

The code builds upon ConvLab-3. To install ConvLab-3, please follow the instructions in the repository:

https://github.com/ConvLab/ConvLab-3

In addition, meta-learning and evaluation require the higher and rliable libraries, which you can install with

pip install higher
pip install -U rliable

Training

The code for training models can be found in the folders ppo_DPT and vtrace_DPT. We explain the usage with vtrace_DPT; it works analogously for ppo_DPT.

Train with rule-based simulator

The rule-based simulator has different configurations that output either only a few actions or more actions per turn. We train with them using the following two configurations:

python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl.json
python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_shy.json

Train with transformer-based simulator TUS

python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=0 --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl_tus.json

Train with all three simulators

We can leverage all simulators during training with the following command:

python convlab/policy/vtrace_DPT/train_ocl_meta_users.py --seed=0 

We can run the training with different seeds to obtain multiple runs. The results are stored in the folder experiments and moved to finished_experiments once training is finished.
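
For example, several seeds could be launched one after another with a simple shell loop (a minimal sketch; the seed values are arbitrary and the rule-based config is used for illustration):

for seed in 0 1 2 3 4; do
    python convlab/policy/vtrace_DPT/train_ocl_meta.py --seed=$seed --path=convlab/policy/vtrace_DPT/semantic_level_config_ocl.json
done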

Leveraging Lifetime Return and Meta Learning

We can specify whether to use meta-learning and whether to use the episodic return, the lifetime return, or both in the config file convlab/policy/vtrace_DPT/config.json (an example snippet is shown after the list).

  • lifetime_weight: number between 0 and 1; 0 means no lifetime return, 1 means using lifetime return
  • only_lifetime: true or false; true means only lifetime return is used, false means both lifetime return and episodic return are used
  • meta: true or false; true means meta-learning is used, false means no meta-learning is used
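
A minimal sketch of how these entries might look in config.json (the values are only examples; the rest of the file is omitted):

"lifetime_weight": 0.5,
"only_lifetime": false,
"meta": true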

Specifying the Timeline

We provide the timelines used for the paper in convlab/policy/ocl_utils/timelines. A timeline specifies the following:

  • timeline: a dictionary whose keys are domains; the values determine after how many dialogues each domain is introduced
  • num_domain_probs: for every integer n, the probability of using n domains in a user goal
  • domain_probs: for every domain, the probability of using the domain in a user goal
  • new_domain_probs: probability that the newly introduced domain should be part of the user goal
  • num_dialogues_stationary: number of dialogues before the user demand changes
  • std_deviation: specifies the variation of user demand changes

During training, you specify the timeline_path to use in the config file, e.g. in semantic_level_config_ocl.json.
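
As an illustrative sketch (the domains, dialogue counts, and probabilities below are made up for demonstration), a timeline file could look like this:

{
  "timeline": {"hotel": 0, "restaurant": 2000, "train": 4000},
  "num_domain_probs": {"1": 0.5, "2": 0.3, "3": 0.2},
  "domain_probs": {"hotel": 0.4, "restaurant": 0.4, "train": 0.2},
  "new_domain_probs": 0.8,
  "num_dialogues_stationary": 1000,
  "std_deviation": 100
}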

Evaluation

Let us assume we have run two experiments, one with meta-learning and one baseline. Each experiment has been run with 5 different seeds. We create folders meta and baseline for the two experiments, each folder containing the different seed folders. We assume the folders meta and baseline lie in the folder meta-experiments.

meta-experiments

  • meta
    • seed_0
    • seed_1
    • seed_2
    • seed_3
    • seed_4
  • baseline
    • seed_0
    • seed_1
    • seed_2
    • seed_3
    • seed_4

We can evaluate the experiments using the following command

python convlab/policy/ocl_utils/plot_ocl.py meta baseline --dir_path meta-experiments

More generally, you pass a list of experiment names and the folder they are saved in. The script then creates the different plots shown in the paper and saves them in the specified folder (here, meta-experiments).
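
Based on the example above, the general call pattern is (placeholders in angle brackets):

python convlab/policy/ocl_utils/plot_ocl.py <experiment_1> ... <experiment_n> --dir_path <experiments_folder>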