Fall 2019 Reflection

During the summer I decided to focus on something completely new. Reinforcement Learning is both fascinating and quite difficult, so I took it up as a challenge. I started by going through various state-of-the-art algorithms, which gave me a decent base from which to make progress towards writing a thesis on the subject.

Summer ended, Fall arrived, and I realized that reading through these algorithms directly was not enough, as I was not ready to modify any of them. Strengthening my basics by reading “Reinforcement Learning: An Introduction” by Sutton and Barto, along with our lab sessions on the topics, helped me get through this phase. Dr. Park’s suggestion to replicate the World Model in PyTorch, and later TensorFlow 2, was also really helpful: it got me comfortable with writing large custom models, debugging them, and experimenting with multiple processors and GPUs to improve performance. Before this I don’t think I had ever built a working Recurrent Neural Network, let alone one that uses a Mixture Density Network. I also got to tinker with libraries like Horovod and OpenMPI, which are impressive in how much they reduce the complexity of using all the processing power available.

My goal for the semester was to replace the last layer of the World Model: remove CMA-ES and use Proximal Policy Optimization (PPO) for the Controller instead. The motivation is that CMA-ES is computationally inefficient: as a trial-and-error process, it burns through many CPU cycles and many combinations of parameters. PPO is a state-of-the-art model-free algorithm, and with the predicted next state as additional input it should ideally be able to outperform the simple CMA-ES controller. The results, however, were not satisfactory: PPO did not outperform CMA-ES, which was quite disappointing.
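For context, PPO’s central trick is a clipped surrogate objective that limits how far each policy update can move. Below is a minimal pure-Python sketch of the clipped term for a single sample (illustrative only; the actual experiments used a full PPO implementation, and the function name here is my own):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective from PPO (Schulman et al., 2017).

    ratio     -- pi_new(a|s) / pi_old(a|s), the probability ratio
    advantage -- estimated advantage of the sampled action
    eps       -- clip range (0.2 is the commonly used default)
    """
    unclipped = ratio * advantage
    # Clamp the ratio into [1 - eps, 1 + eps] before multiplying.
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    # PPO takes the pessimistic (smaller) of the two terms, so a large
    # ratio cannot produce an arbitrarily large incentive.
    return min(unclipped, clipped)

# With a positive advantage, a ratio far above 1 + eps is capped:
print(ppo_clip_objective(1.5, 1.0))  # capped at 1.2
print(ppo_clip_objective(0.9, 1.0))  # inside the clip range: 0.9
```

In the actual setup the policy network would take the VAE latent state concatenated with the RNN’s prediction as input, and this term would be averaged over a batch and maximized.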

Although I did not achieve the performance I expected, I already have a couple of ideas about why. All in all, the semester was quite a success. I go into my final semester with more exploring to do to reach my goal.


Replicating the World Model in PyTorch

My goal for this article is to understand the implementation of the World Models paper by David Ha and Jürgen Schmidhuber and to create an implementation in PyTorch.

References

Paper: https://arxiv.org/abs/1803.10122

Implementation: https://github.com/hardmaru/WorldModelsExperiments

Setup

Clone the GitHub repository

git clone https://github.com/hdilab/world_model_experiments.git

Change directory

cd world_model_experiments/pytorch_implementation

Create a conda environment and activate it.

conda create --name pytorch python=3.7 pip

conda activate pytorch

During my experimentation I found that, when using multiprocessing in PyTorch, there is a large speed benefit to building PyTorch from source.

Follow this guide to install PyTorch from source.

https://github.com/pytorch/pytorch#from-source

Install Other Dependencies

pip install cma

pip install gym

pip install 'gym[Box2D]'

pip install tensorflow-gpu

pip install jupyter

pip install matplotlib

pip install opencv-python

Training

Call process.bash

sh process.bash

The script process.bash does the following tasks sequentially:

  • Disable the GPU for extraction, as we will use all the processors on the machine to sample from the environment in parallel.
  • Run generate_data.py to generate the data. This script uses all the processors on the machine to sample from the Car Racing environment and generate 10,000 episodes of 1,000 timesteps each, using random parameters on each processor. The code is set up so that we can later convert this into an iterative process.
  • Enable the GPUs to speed up VAE and RNN training.
  • Run train_vae.py to train the VAE to encode each observation into a latent vector of size 32.
  • Call series.py to encode all the initially sampled data into the latent representation.
  • Call train_rnn.py. This uses the current state and action as input, with the next state as the target output, training the RNN to predict the future state.
  • Disable the GPUs again for the evolutionary strategy.
  • Call train_v_m_c_pepg.py to run an evolutionary strategy that takes the current state and the hidden state from the RNN and optimizes the Controller's parameters to maximize the score.
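The RNN in this pipeline is a Mixture Density Network RNN: instead of predicting a single next latent state, it outputs a mixture of Gaussians over it, and training minimizes the negative log-likelihood of the observed next latent under that mixture. A one-dimensional pure-Python sketch of that loss (illustrative only; the repo's train_rnn.py computes this per latent dimension with tensors):

```python
import math

def mdn_nll(pi, mu, sigma, target):
    """Negative log-likelihood of `target` under a 1-D Gaussian mixture.

    pi    -- mixture weights (should sum to 1)
    mu    -- component means
    sigma -- component standard deviations
    """
    likelihood = 0.0
    for w, m, s in zip(pi, mu, sigma):
        # Weighted Gaussian density of the target under this component.
        likelihood += w * math.exp(-0.5 * ((target - m) / s) ** 2) \
            / (s * math.sqrt(2.0 * math.pi))
    return -math.log(likelihood)

# A target near a high-weight component is penalized less than a
# target far from every component.
near = mdn_nll([0.7, 0.3], [0.0, 2.0], [0.5, 0.5], 0.1)
far = mdn_nll([0.7, 0.3], [0.0, 2.0], [0.5, 0.5], 5.0)
print(near < far)  # True
```

The mixture output is what lets the model represent multi-modal futures (e.g. a turn that could go either way), which a single-Gaussian prediction would average into an implausible middle.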

Results

After 200 iterations we achieved a score of 820 ± 153 over 100 random trials. We could probably improve this by training the RNN further. I also found that, since we used a random policy to generate the training data, we could use an iterative process to generate much better training examples for the MDN-RNN.

Citation

If you find this project useful in an academic setting, please cite:

@incollection{ha2018worldmodels,
  title = {Recurrent World Models Facilitate Policy Evolution},
  author = {Ha, David and Schmidhuber, J{\"u}rgen},
  booktitle = {Advances in Neural Information Processing Systems 31},
  pages = {2451--2463},
  year = {2018},
  publisher = {Curran Associates, Inc.},
  url = {https://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution},
  note = "\url{https://worldmodels.github.io}",
}

Spring 19 Review

Before starting my Masters I worked in the IT industry for more than 4 years. Working in software, I had a steady income and did not have to step outside my comfort zone. But, I always knew that I had the potential and the desire to do so much more. With that knowledge, I decided to pursue my Masters with the objective to inquire into Machine Learning.

Once I started attending lectures and working on assignments, I found myself back in my comfort zone with no scope for working towards my goal. Fortunately, I was given the opportunity to work under Dr. Park at the HDILab, where I assumed I would be able to do research. During my time there, Dr. Park recommended me to a colleague of his, Dr. Choi. He was very impressed with what I was able to accomplish, and he offered to hire me in the spring of 2019 to work on data preprocessing tasks that would aid his research. Since I received such positive feedback from both professors, I was happy to continue the work they assigned me. I was so preoccupied with these tasks that I did not seem to have any time to work on my goal, which was to do more research in Machine Learning. Finding that balance between work and research is key for someone pursuing a career in academia.

Towards the end of the semester, I was given the opportunity to work on ‘The Obstacle Tower Challenge’ developed by Unity. The objective of this challenge was to benchmark various Reinforcement Learning algorithms. As I was unsure of what I should have been working towards, I jumped at the opportunity. When I began, I found that getting past the first round was rather easy: it was more about setting up an environment than actually developing new algorithms. This stage of the competition had me working late nights to ensure we made it to the next round. But in all this time, I did not get a chance to study the basic concepts of Reinforcement Learning.

My summer review with Dr. Park did not go as planned. Until then I had been of the opinion that I was doing pretty well. It is only now that I have admitted to myself that I have been completely sidetracked. His suggestions were frank, eye-opening, and greatly needed. Focusing on creating reproducible output is the only way my progress can be evaluated. I start the summer of 2019 with the goal of starting a thesis in Reinforcement Learning.
