Since the start of this century, and especially in the last decade, we have gathered massive amounts of data, and that data is the firm ground for the resurgence of artificial intelligence. It also led to the creation of a large number of products, intelligent agents, and so on, which now power different devices and cover many business use-cases. Digging a little deeper, at least from a technical perspective, one can draw a straightforward conclusion: most AI products are based on supervised learning. On the other end of the spectrum is another type of learning that has reemerged, and that is reinforcement learning.

But what is reinforcement learning? With artificial intelligence on the rise, it is a favorable field for new research. It is a computational way of learning from interaction with an environment, with a focus on achieving a goal. With this in mind, this type of learning is applied to tasks of optimal control, optimization, and monitoring. In his talk on the current state of AI, Andrew Ng opens with the statement that reinforcement learning is the least used type of learning. That is true, because it suits only particular use-cases and, more importantly, needs even more data than the other types. Now you may ask yourself: how can I gather more data, especially if my algorithm has to learn from an environment? The answer lies in simulation environments, which is why reinforcement learning algorithms learn through video games or robotic simulations.

Back in 2013, the DeepMind research team presented a paper at the NIPS 2013 Deep Learning Workshop about a reinforcement learning algorithm that can play Atari games such as Breakout. This paper was massive. They used deep learning techniques (which only helped the hype!) to approximate the function that has to be learned, and on some games the agent achieved better scores than human players. After that, the RL field went through a lot: from simple 2D environments and 80s arcade games, environments eventually evolved into more complex and demanding tasks. That's why we have environments like Obstacle Tower today. Obstacle Tower is a 3D environment, and to tackle its obstacles, the agent first has to solve the problem of navigation, or learning to control. In the next couple of paragraphs, I am going to propose one way to solve this problem in reinforcement learning fashion.

Picture 1: Obstacle Tower environment

As you can see, I didn't go into the details of reinforcement learning, and I won't. Before going further, I recommend at least getting familiar with the basic concepts, such as the learning loop, agent, environment, rewards, policies, states, and value functions. Blogs can be a quick start, but for in-depth knowledge, the classes by David Silver or John Schulman are a must. They are among the top contributors in this field today.

Obstacle Tower environment

Obstacle Tower is a three-dimensional environment created by Unity Technologies. It is procedurally generated, consists of a hundred floors (the first version had only 24), and was created to tackle challenges in areas like computer vision, planning, and generalization. From that description, you can already conclude that this environment was built with research in mind, and that was Unity's premise, after all.

Picture 2: Different themes across the floors

Along with the environment, they released a paper describing their efforts and the performance they were able to achieve. But what drew me to choose precisely this environment? Well, I knew it was sophisticated and visually stunning, which meant that it was a lot harder than any of the previous environments. Besides, Unity Technologies announced a challenge, symbolically called the "Obstacle Tower Challenge", in collaboration with Google Cloud Platform. Although I didn't enter the competition, it was good to know that lots of other people were solving similar problems.

My modest hardware (12-core i7, 32 GB of RAM, and a 1050 Ti) wasn't going to cut it for solving the whole tower, so I restricted myself to a couple of floors where only generalization and environment-control problems are present. On higher floors, agent planning comes into play, along with another set of problems.

Picture 3: Agent planning is required to solve obstacles

As mentioned before, the environment is visually stunning, with real-time lighting and shadow effects and a different color theme every couple of floors, and the agent's movement is viewed through a third-person camera. The application representing the agent is written as a Python module (like most production-ready AI tools nowadays), and the Obstacle Tower environment is also provided as a Python module, which made development easier, at least for a brief time.

The most important aspects of an environment are its observations, actions, and rewards. The observation space consists of two types of information. The first is a rendered pixel representation of one in-game frame, 168x168 pixels with all three color channels. Alongside it, the agent receives a vector of additional in-game information, such as the time left, the number of keys the agent is currently holding, and the number of the floor the agent is on. The action space is multi-discrete: it contains a set of smaller discrete action spaces, and one choice from each of them together makes up one in-game action. The agent can move in all four directions, jump, and turn the camera clockwise or counterclockwise.

The last and most important aspect of the environment is the reward given when a specific action is applied. Obstacle Tower is a sparse-reward environment, meaning that the agent doesn't get a reward in every timestep. Explicit rewards (1 point) are given only for opening a door or finishing a floor. From a human perspective this is more natural, as humans don't get rewarded for every action in the real world. Still, from an implementation perspective, it makes the learning process much more difficult.
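To make the multi-discrete action space more concrete, here is a minimal sketch of how its branches could be flattened into a single table of in-game actions. The branch names, their sizes, and the ordering of options inside each branch are assumptions made for illustration; check the environment's own documentation before relying on them.

```python
from itertools import product

# Assumed branches of the multi-discrete action space (hypothetical names
# and option ordering; index 0 is treated as "do nothing" in every branch).
BRANCH_SIZES = {
    "move_forward_back": 3,  # no-op, forward, backward
    "rotate_camera": 3,      # no-op, counterclockwise, clockwise
    "jump": 2,               # no-op, jump
    "move_left_right": 3,    # no-op, right, left
}


def flatten_action_space(branch_sizes):
    """Enumerate every combination of branch choices as one flat action table."""
    ranges = [range(size) for size in branch_sizes.values()]
    return [tuple(combo) for combo in product(*ranges)]


ACTIONS = flatten_action_space(BRANCH_SIZES)


def to_multi_discrete(flat_index):
    """Map a flat action index (e.g. sampled from a softmax) to branch choices."""
    return ACTIONS[flat_index]
```

With these assumed sizes, a policy with a single softmax output would choose among `len(ACTIONS)` combined actions, and `to_multi_discrete` turns the sampled index back into one choice per branch.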

Defining a model

Since the problem belongs to the area of computer vision, it was clear that the model architecture had to be capable of handling images as input data. There are lots of other environments within the computer vision area, like the old Atari emulators. Model complexity doesn't play a big part in RL tasks, meaning that this architecture didn't have to be too complicated. Reinforcement learning problems rely on more subtle tricks, which I am going to explain later. Deep neural networks are used as policy approximators in pretty much every RL task, and this problem was no exception. The model used by DeepMind's team in the already mentioned paper was an excellent starting point, as it suggested the convolutional neural network part of the model. The only thing left was to somehow model the sequence of the agent's movements through the environment, meaning the dependency between consecutive frames. The newer approach for this is to use a recurrent neural network as opposed to frame stacking. And voila, the model was set. The reinforcement learning algorithm selected for training the agent was proximal policy optimization (PPO) in the actor-critic framework. At that time, it was state-of-the-art, on-policy, and easy to implement: all the things a scientist would like. It was also used in the Obstacle Tower paper, so it represented a baseline.
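For readers who have not met PPO before, the heart of the algorithm is its clipped surrogate objective. The sketch below shows it for a single sample in plain Python. It is only an illustration of the published objective, not the code used in this project, and the clipping value of 0.2 is an assumed, commonly used default.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective of PPO for one sample.

    ratio:     new_policy_prob(action) / old_policy_prob(action)
    advantage: estimated advantage of the action that was taken
    clip_eps:  clipping range (0.2 is an assumed default)
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum keeps the update pessimistic: the policy gains
    # nothing from moving the ratio outside [1 - eps, 1 + eps].
    return min(ratio * advantage, clipped_ratio * advantage)
```

During training this value is averaged over a batch and maximized, usually together with a value-function loss for the critic.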

I started the implementation with the actor-critic framework, where the actor is trained with the policy-gradient PPO algorithm in a parallel fashion. Earlier, I mentioned that RL algorithms need a lot of data. Executing many environments in parallel and gathering observations and experiences in one place helped to solve this particular problem. The price, of course, is higher hardware requirements. Anyway, this was the latest trend. I achieved this by implementing a module for environment control. It represents an interface that allows sending commands for executing the next timestep, restarting the environment, and closing the environment. Since Python is a "one CPU core at a time" language because of the GIL, this was implemented by running multiple processes. Each process executes one instance of the environment. More processes mean more environments, which means more data. This speeds up the training process, but it requires more CPU cores, more RAM, and more GPU memory.
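A minimal sketch of such an environment-control module might look like the following. `DummyEnv` is a hypothetical stand-in for the real environment, the command names are illustrative, and the `fork` start method is an assumption that holds only on Unix-like systems.

```python
import multiprocessing as mp


class DummyEnv:
    """Hypothetical stand-in; a real worker would wrap an Obstacle Tower instance."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.t = 0

    def reset(self):
        self.t = 0
        return (self.worker_id, self.t)

    def step(self, action):
        self.t += 1
        observation = (self.worker_id, self.t)
        return observation, 0.0, self.t >= 128  # observation, reward, done

    def close(self):
        pass


def worker(conn, worker_id):
    """Run one environment and answer commands sent by the parent process."""
    env = DummyEnv(worker_id)
    while True:
        command, payload = conn.recv()
        if command == "step":
            conn.send(env.step(payload))
        elif command == "reset":
            conn.send(env.reset())
        elif command == "close":
            env.close()
            conn.close()
            break


def start_workers(num_envs):
    ctx = mp.get_context("fork")  # assumption: Unix-like OS
    conns, procs = [], []
    for worker_id in range(num_envs):
        parent_conn, child_conn = ctx.Pipe()
        proc = ctx.Process(target=worker, args=(child_conn, worker_id), daemon=True)
        proc.start()
        child_conn.close()  # the parent keeps only its own end of the pipe
        conns.append(parent_conn)
        procs.append(proc)
    return conns, procs


def reset_all(conns):
    for conn in conns:
        conn.send(("reset", None))
    return [conn.recv() for conn in conns]


def step_all(conns, actions):
    # Send every command first so all environments step at the same time.
    for conn, action in zip(conns, actions):
        conn.send(("step", action))
    return [conn.recv() for conn in conns]


def close_all(conns, procs):
    for conn in conns:
        conn.send(("close", None))
    for conn, proc in zip(conns, procs):
        proc.join(timeout=5)
        conn.close()
```

The parent process calls `reset_all` once and then `step_all` in a loop, collecting one transition per environment per call.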

Picture 4: Communication between processes

The model was defined in PyTorch, a library for machine learning. The base architecture had a three-layer CNN with 32, 64, and 64 filters and kernel sizes of 8x8, 4x4, and 3x3, respectively. All activation functions were LeakyReLU, and there was a batch normalization layer between the CNN layers. The extracted features were passed through a recurrent network with GRU cells (hidden size of 512). Two fully connected layers at the end represented the actor (with softmax output) as the policy estimator and the critic as the value function estimator.
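A sketch of this architecture in PyTorch could look like the code below. The strides (4, 2, 1, borrowed from the classic Atari network), the position of batch normalization before each activation, the use of a single `GRUCell`, and the class name `ActorCritic` are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn


def conv_out(size, kernel, stride):
    """Side length of a square feature map after an unpadded convolution."""
    return (size - kernel) // stride + 1


class ActorCritic(nn.Module):
    def __init__(self, num_actions, hidden_size=512, frame_size=168):
        super().__init__()
        # Strides 4, 2, 1 are assumed; the text gives only filters and kernels.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4),
            nn.BatchNorm2d(32),
            nn.LeakyReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.LeakyReLU(),
            nn.Flatten(),
        )
        side = conv_out(conv_out(conv_out(frame_size, 8, 4), 4, 2), 3, 1)
        self.gru = nn.GRUCell(64 * side * side, hidden_size)
        self.actor = nn.Linear(hidden_size, num_actions)  # policy head
        self.critic = nn.Linear(hidden_size, 1)           # value head

    def forward(self, frames, hidden):
        # frames: (batch, 3, H, W) floats; hidden: (batch, hidden_size)
        hidden = self.gru(self.features(frames), hidden)
        action_probs = torch.softmax(self.actor(hidden), dim=-1)
        value = self.critic(hidden).squeeze(-1)
        return action_probs, value, hidden
```

The hidden state returned by `forward` is fed back in on the next timestep, which is what replaces frame stacking.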

Picture 5: Base model architecture

All relevant information, such as states, rewards, actions, and predicted values, was saved in tensors. You can look at it as an experience memory structure, a familiar term in the RL world. With the training loop set, the process could start. While the agent managed to finish the starting floor during training, the inference phase showed different results: the agent never found a good policy to get out of the room and kept circling non-stop. Changing the training time, or even the number of timesteps per episode (from the initial 128 to 64 or 32), didn't help. It looked like the sparsity of the rewards was too much for my agent to overcome. I didn't want to push it with much more training time, since I consider that brute-force learning and not generalization. Introducing time as a penalty in reward shaping didn't help either; it only made things worse. Time is a double-edged sword when you try to make it part of a reward. The agent just stood in one place the whole time and waited until time ran out. This was the fastest way to stop the penalization.

Existence is pain! An interesting phenomenon. I knew that for the agent to successfully transition from one room to another, it had to receive rewards at each step. Without additional information from the environment, this was mission impossible. Still, it turned out that this particular problem was already being researched, and there were a couple of solutions. I opted for giving the agent an inner reward system. So, enter the intrinsic curiosity module.

Intrinsic curiosity module

The intrinsic curiosity module (ICM) was introduced in a paper by researchers from OpenAI and UC Berkeley, after the conclusion that external rewards, which are hand-crafted while the environment is being engineered, are not enough and don't scale well. They performed a large-scale study on lots of freely available environments, where they tested a model that yields an intrinsic reward. This reward was able to make the agent explore the unseen parts of the environment without using any external rewards. This was the missing piece my agent needed to navigate through the rooms of Obstacle Tower. Did it pay off? Is it applicable to this particular problem? Let's find out, but first, the model architecture.

Picture 6: ICM model architecture

The idea is straightforward: extract features of the environment state at the current timestep and at the next timestep. Then, given the selected action and the feature representation of the current state, predict the feature representation of the next state. The mean squared error between the real feature vector F(s_{t+1}) and the predicted one F'(s_{t+1}) is used as the intrinsic reward. Since the reward is the value of the error term, a higher error drives the agent toward the unseen parts of the environment, where its predictions are still poor.
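The reward computation described above can be sketched in a few lines of plain Python. Feature vectors are plain lists here, and the `scale` and `intrinsic_weight` parameters, as well as the weighted sum used to combine the two rewards, are assumptions made for illustration.

```python
def intrinsic_reward(predicted_features, actual_features, scale=1.0):
    """Mean squared error between predicted and real next-state features."""
    if len(predicted_features) != len(actual_features):
        raise ValueError("feature vectors must have the same length")
    squared_errors = [
        (p - a) ** 2 for p, a in zip(predicted_features, actual_features)
    ]
    return scale * sum(squared_errors) / len(squared_errors)


def total_reward(extrinsic, intrinsic, intrinsic_weight=1.0):
    """Combine the environment reward with the curiosity bonus (assumed weighting)."""
    return extrinsic + intrinsic_weight * intrinsic
```

A state the forward model predicts perfectly earns no bonus, so the curiosity signal fades for familiar rooms and stays high for unexplored ones.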

Combining intrinsic and external rewards improved the agent's performance a lot. During training, the agent was able to get up to the 5th floor, but the policy variance was also high: sometimes the agent would finish on the 2nd floor, sometimes on the 3rd, and so on. This was achieved after around 14 hours of training. My setup was, of course, a limiting factor, since I made the problem episodic and forcibly reset the environment after 128 timesteps, while the problem is really a continuing one. With a setup that reflects that, I am sure the agent would go even further.

Picture 7: Training process of PPO algorithm

Implementing this solution was not straightforward either, since not everything worked out of the box. There were some engineering tricks proposed by various authors (not in the papers but in their code!) for making the RL training process more stable, and much effort went into them. Things like observation normalization, reward scaling by a running standard deviation, advantage normalization, and custom network initialization made the whole process a lot harder, since they had to be learned along the way, but that is the research process. In the end, we got an agent that can solve about three floors of the Obstacle Tower environment after 14 hours of training on the hardware mentioned earlier. The results are pretty modest, but there is much room for improvement. The agent showed that generalization is not a problem during learning. Still, additional data that would remove unnecessary exploration (in cases where it's not needed) could make the performance even better. The baseline for the PPO algorithm that I mentioned earlier was five floors after 20 million timesteps on 50 parallel environments. That should give you an idea of the difficulty level here.
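Two of the tricks named above, reward scaling by a running standard deviation and advantage normalization, can be sketched as follows. This is a simplified illustration rather than the project's code: the running statistic here tracks raw rewards, whereas some implementations track discounted returns instead, and the `eps` value is an assumed default.

```python
import math


class RunningStat:
    """Running mean and standard deviation using Welford's online algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Until two samples are seen, fall back to 1.0 (no scaling).
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)


def scale_reward(reward, stat, eps=1e-8):
    """Update the running statistic, then divide the reward by the running std."""
    stat.update(reward)
    return reward / (stat.std + eps)


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale a batch of advantages to zero mean and unit variance."""
    n = len(advantages)
    mean = sum(advantages) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / n)
    return [(a - mean) / (std + eps) for a in advantages]
```

Observation normalization follows the same pattern, with one running statistic per input value.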

Conclusion: An agent is running in an Obstacle Tower environment

What can we say in the end? An agent is running in the Obstacle Tower environment after all. The environment has proven to be very challenging to master and is poised to push the boundaries of reinforcement learning research. The challenges in vision, control, and planning, as well as the sparse rewards provided by the environment, require knowledge of the matter and leave room for further research. This environment certainly reinforces the claim that, over time, ever more complex environments are being created and brought to the forefront of this area. Reinforcement learning proved to be a tough part of machine learning, requiring a lot of expertise and math knowledge. Since it's a pure research field, (almost) every bit of information beyond the basics is found in research papers, but the engineering tricks that make the algorithms work, train faster, and so on are not there, which makes the results hard to reproduce. Despite all those challenges, this field is impressive, and it is fascinating to see that after a long process a "smart" agent emerges that is able to play a game above human level or even control and optimize some process.

In the end, Unity Technologies announced the winners of the Obstacle Tower Challenge. You can see the solutions that the winners and runners-up came up with. There are exciting new techniques and tricks, such as behavior cloning, but what is nice to see is that some solutions did try ICM and can pass the 10th floor, which means that the described procedure proves to be a contender.

A link to my GitHub account and this project is here. That conclusion brings this reinforcement learning story to an end, and we'll see each other again in some other machine learning adventure. Till then, farewell!