Project: YourGym - Reinforcement Learning Experience

During the summer of 2020 I worked on a university research project called CIIRC Gym/CROW, or YourGym (collaborative robotic workspace), led by Mgr. Michal Vavrečka, Ph.D. CIIRC is the Czech Institute of Informatics, Robotics and Cybernetics, which falls under CTU (Czech Technical University).

The project strives to create an easily modifiable simulation environment that helps people/students get familiar with reinforcement learning and supports further research by students, researchers, and the creators of the project.

CIIRC Gym builds on existing physics engines (mainly PyBullet, with a MuJoCo-py branch) and algorithm implementations, packaging them so that a person using it has a smoother start and an easier time setting up objectives and environments for the learning task, training, and evaluation.

This article aims to give an overview of the work I’ve been doing and a bird’s-eye introduction to the reinforcement learning area (if you are not familiar with RL). Through analogies and basic concepts I will explain what RL is, what one should expect to do in this area, and some interesting ideas I stumbled upon while learning. I will share my views on the topic and its possible future, along with my thoughts on what I enjoy about this field and what I don’t. My opinions are fresh, surely limited, and will change with experience. This article represents a frozen frame of mind from the time of writing and should be taken less into account as time passes, as I don’t plan to update it.

Some people ask me how I managed to get to work on this project, and the answer is simple: I searched and asked. I wasn’t able to find this project among the offerings for students that CIIRC had at the time, but I knew what area I wanted to research and learn. I found the previous year’s summer-job offers for students (the Czech version had been deleted but the English one was still up), and one project caught my eye; it went something like this: “Reinforcement learning agents in Unity game engine”. Reinforcement learning was an area I had wanted to explore for a while, and this looked like a great opportunity. I had some experience with Unity and neural networks, so I sent a quick email to the leader of the project in November 2019. He responded that the project was no longer available, but that they were working on something similar. We met, and that was it.

I was offered the chance to join as soon as we met, but as a worried university freshman, I was doing everything I could to keep up with my main responsibilities first.

I haven’t taken a serious course on reinforcement learning (yet) and haven’t learned any formal definitions, but I will try to give you my current point of view. Reinforcement learning is a sub-field of machine learning (learning from data) whose main goal is to create a system capable of learning behavior in an environment through interaction with that environment. I’m not sure this definition is sufficient, but it covers most of what I’ve encountered.

In reinforcement learning, you as a programmer are responsible for:

  1. Implementing and running the training algorithm
  2. Implementing an appropriate reward function
  3. Providing all the data it learns from
    • Real-world data – real-time images, information about the position of the robot’s parts (arms, head, hand, etc.), and other sensory data.
    • Data from simulation – the simulation tries to resemble the real world or another, different world (a game). Here, you also need real-time information about the environment and the robot.
  4. Evaluating the results (does the model learn, why not, which one is the best, what can be improved…)
  5. Putting the trained model into practice (deploying to the real world/its environment)

To summarize: in reinforcement learning you put your agent/robot/model in a certain environment (real world or simulation), let it decide on an action, and gather information (video/images/positions/last actions/the last reward). From some of this information you calculate a reward at a particular time (after some action or series of actions), you pass this reward to the training algorithm along with all the necessary environment information, and the training algorithm changes the behavior of the robot to try to maximize the reward.
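The loop just described can be sketched in a few lines of Python. This is a toy, hypothetical environment of my own (`ToyEnv` is not part of CIIRC Gym or any library), with a random agent standing in for the decision process:

```python
import random

class ToyEnv:
    """Toy 1-D environment: the agent starts at 0 and should reach +5."""
    def __init__(self):
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # initial observation

    def step(self, action):
        # action is -1 or +1; moving toward the goal earns +1, away earns -1
        self.pos += action
        reward = 1 if action == 1 else -1
        done = self.pos == 5
        return self.pos, reward, done

env = ToyEnv()
obs = env.reset()
total_reward, done = 0, False
for t in range(100):                      # cap the episode length
    action = random.choice([-1, 1])       # the agent's (random) decision
    obs, reward, done = env.step(action)  # the environment reacts
    total_reward += reward                # fed to the training algorithm
    if done:
        break
```

A real training algorithm would use `(obs, action, reward)` to tweak the agent’s decision making instead of choosing randomly.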

Reward and reward function

You can imagine reward as a helper to the learning algorithm, nothing more. If you want to learn something and distinguish progress from regress, you want to be able to tell if you are doing a good job or a bad job. At this time, this is the job of a human programmer.

An example of a reward function can be this: let’s say you have a robot/agent whose job is to eat, and you want it to eat only certain foods and not others. You provide the robot the information necessary to distinguish different foods (e.g., camera images, or explicitly saying which food is which).

In the beginning the robot won’t know any better; it will eat randomly. Why randomly? Well, a robot makes internal decisions, and for it to learn it has to be able to modify this decision making (at the beginning the decisions are random). For each food we program a certain reward: if the robot eats red-colored food (tomatoes, cherries, red apples) it gets a reward of, say, +1 (it’s the programmer’s job to experiment with different values and ways to calculate them); for yellow food you would give +2; for every other food you would give -2.

You start the training process and the robot starts eating randomly, but after the training algorithm slowly starts tweaking the robot’s decision process, the robot starts to eat only tomatoes, red apples, and cherries, and it loves to eat corn and pineapples even more (if it had to choose between red-colored food and yellow-colored food, it would choose yellow). The reason is that our learning algorithm tries to maximize the reward, so it has to change the robot’s behavior so that it chooses actions with the highest rewards.

How does it know which behavior gets more reward? Well, we feed this information to the robot during training: which actions did it take last time? Did it eat any food? If it did, was it food with a positive reward or a negative one? If the reward was negative, we don’t want to repeat that action; if it was positive, we want to ‘reinforce’ that behavior. From this information, the training algorithm tries to map actions to rewards and find a correlation. After some time of training, the robot will rarely or never touch any food with a negative reward. ‘Rarely’, because training algorithms often have exploration programmed into them. Exploration is something like experimenting with food that has a possibly negative reward in the “robot’s eyes“ (often it’s implemented as the robot taking a random action).
Why did I say in the “robot’s eyes”? As mentioned, the robot has its decision process, and this decision process is randomly initialized at the beginning. The decision process is also something like a reward predictor (at the beginning these predictions are false and random). So from the start, the robot doesn’t know which foods carry positive rewards and which carry negative ones. After experimenting and training, this perception can be tweaked toward the one we desire, based on our reward function, which is: eat red food (+1), eat yellow food even more (+2), and don’t eat food of any other color (-2).
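The food example’s reward function could be written down like this (a sketch; the function name and the color encoding are my own, and a real setup would work from camera images rather than color labels):

```python
def food_reward(color):
    """Reward for eating a food of a given color, mirroring the example:
    red -> +1, yellow -> +2, anything else -> -2."""
    if color == "red":
        return 1
    if color == "yellow":
        return 2
    return -2

# Accumulated reward over a hypothetical 'meal' of four foods
meal = ["red", "yellow", "green", "yellow"]
total = sum(food_reward(c) for c in meal)  # 1 + 2 - 2 + 2 = 3
```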

Robot brain (decision process) and the training algorithm

The robot brain, or the so-called model or agent, represents a function that takes the state of the environment (a description of the environment, such as my numerical position in the environment; the position, angles, and speed of my hands; the position and speed of my goal; etc.) and calculates a decision.

As I said, initially this ‘decision process’ is random. This is why we need a training algorithm. We can think of the training algorithm as a trainer teaching a dog to do tricks. We (the trainer) have something in mind we want our robot to learn, and the robot/model/dog has its own decision process. During the training phase, we make our model/dog want to achieve positive rewards and avoid negative ones. So whenever, during training, the model/dog does something that is close to or matches the actions we want it to learn, we give it a positive reward/treat. Optionally, when the robot does something we don’t want it to learn, we punish it (I do not condone animal cruelty).

Internally, the model is more often than not an implementation of one or more neural networks that map the state of the environment (the info we have) to the output (the actions we take). In other words, the neural network acts as a reward predictor, and the action we take in the end, after processing the input with the neural network, is the one with the highest value (perceived to yield the highest reward). There are some algorithms, like Q-learning, that do not require a neural network, but there are also reasons these are often not used, and neural networks are used instead.
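The idea of “pick the action with the highest predicted reward, but sometimes explore” can be sketched like this (`choose_action` and the predicted values are my own illustration, not a specific library’s API):

```python
import random

def choose_action(q_values, epsilon=0.1, rng=random):
    """Epsilon-greedy action selection: mostly pick the action whose
    predicted value is highest (exploitation), but with probability
    epsilon pick a random action (exploration)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    # exploit: index of the largest predicted value
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0` the agent always exploits; with `epsilon=1` it always explores. Tuning this trade-off is part of the training setup.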


Environment

This can be a weird term in the beginning; it was for me. I will start with examples. StarCraft/Counter-Strike/GTA can be environments; the surroundings of a robot placed on your desk at home can be an environment; a drone or plane and its surroundings can be an environment; a car on a road, a train on rails, a humanoid robot walking/reaching/catching/speaking… can all be environments. Why did I use ‘can’? Well, that’s because some prerequisites have to be met.

  • The environment has to be influenceable through our actions (a drone, after powering up its motors, moves through the air; a GTA character moves through space and affects it; the car moves on the road; etc.)
  • We have to be able to set goals, appropriate rewards, and get information from the environment.
  • We should also have the possibility to reset the environment when we irreversibly change it (break the item we wanted to catch, crash, lose the game), or even when we achieve our desired goal (caught the item, got to our destination, won the game). We call these time segments between resets episodes. An episode is a series of small time steps; often a reward is given (or absent) at each time step, and information is perceived at each time step.

Examples of goals

  • CAR: move along the road but don’t crash
  • GAME: win the game

Examples of rewards

  • CAR: after each meter traveled you get a positive reward; after leaving the road or crashing into oncoming traffic you get a negative reward
  • GAME: after killing an enemy you get a positive reward, after destroying their base you get a positive reward, after losing you get a negative reward

Examples of input information

  • CAR: my speed, my position, my acceleration, my destination, angle of my steering wheel, braking information, information about other cars, images of the road and objects in the scene, …
  • GAME: my health, my position, my speed, my rotation, information about my teammates, info about the game, information about my enemies, their position, their health, etc.

Examples of output information

  • CAR: throttle, angle of my steering wheel, braking information, …
  • GAME: my rotation, my game specific actions, etc.

An example of a simple environment is this pole-balancing simulation from OpenAI Gym.

This environment changes through our actions (in this case we are in an unstable position, so it changes even if we take no action), such as: move the cart left, move the cart right, or stay where you are. We can read information from the game (the angle of the pole, our speed, the rotational speed of the pole). We can set goals: balance the pole as long as possible, balance the pole for as short a time as possible, swing the pole around, bring the pole down as fast as possible, reach some rotational speed with the pole, and many more. And we can calculate appropriate rewards from the environment depending on our goal:

  • GOAL: Balance the pole as long as possible
  • REWARD FUNCTION: For each millisecond of balancing the pole you get a small positive reward; when the pole crosses 30° from the vertical axis, the environment is reset (the next episode starts) and you get a bigger negative reward.
  • GOAL: Swing the pole at a constant speed of X unit/sec.
  • REWARD FUNCTION: For each unit/sec we get closer to our target speed X, we get a positive reward, and while around the speed of X u/s we get another positive reward; when we cross the max speed limit, we get a bigger negative reward and the environment is reset (the next episode starts). To improve the consistency of the speed, we can introduce a small negative reward for each speed change. There are many ways to achieve this, and you validate your ideas through experimentation and thought.
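The first reward function above could be implemented like this (a sketch; the reward values and the function name are my own choices, not part of OpenAI Gym):

```python
import math

def balance_reward(pole_angle_rad, step_reward=0.01, fail_penalty=-1.0):
    """Per-time-step reward for the 'balance as long as possible' goal:
    a small positive reward each step; a larger negative reward (and an
    episode reset) once the pole tilts more than 30 degrees from vertical.
    Returns (reward, episode_done)."""
    if abs(pole_angle_rad) > math.radians(30):
        return fail_penalty, True
    return step_reward, False
```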

Results evaluation

One way to evaluate results is to look at our accumulated reward (the sum of all rewards in one episode) across episodes: in the beginning, each of our episodes will have a negative accumulated reward, but over time (if our training works) the accumulated reward will grow.

Additionally, we can create our own evaluation process: we can introduce a success rate over the last N episodes (how many of the past N episodes were successful: 50%, 70%, …).
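A rolling success rate over the last N episodes can be tracked with a small helper like this (my own sketch, not from any library):

```python
from collections import deque

class SuccessRate:
    """Rolling success rate over the last N episodes."""
    def __init__(self, n=100):
        self.window = deque(maxlen=n)  # old episodes fall out automatically

    def record(self, success):
        self.window.append(bool(success))

    def rate(self):
        return sum(self.window) / len(self.window) if self.window else 0.0
```

After each episode you call `record(...)` with whether the goal was achieved, and `rate()` gives you the current success percentage.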

Then you take the best models and observe them in real time, watch their flaws, decide on your next strategy, etc. You should also look at the others and try to figure out why they aren’t training and where the problem is.

Putting the model into practice

If you are training in a simulation or a game, putting the model into practice can be as simple as getting information from the environment and using the trained model for actions (without training).

If you are using a model from simulation and you want to deploy it in the real world (which I, unfortunately, haven’t done yet), you will have to provide the robot all the inputs you used during training. So if I trained my balancing pole in simulation and the inputs were the position of the cart, the speed/acceleration of the cart, and the speed and acceleration of the pole, I will have to provide this data to the robot, and that’s where the main problems and errors lie. If we want to deploy our simulation model into the real world, we can add noise to our model’s inputs/outputs in simulation to account for the accuracy we will have in the real world. Because the real world is much more inaccurate than the simulation, we have to make the simulation less accurate, i.e. more similar to the real world. There are certainly many other ways to improve the model’s behavior in the real world. For inputs, we will have to use sensory data, accelerometers, possibly high-speed cameras (if necessary)...
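The simplest version of “make the simulation less accurate” is adding random noise to the observations. A sketch (the function name and the noise level are my own assumptions; real setups randomize much more, e.g. friction and masses):

```python
import random

def add_observation_noise(obs, std=0.01, rng=random):
    """Add zero-mean Gaussian noise to each observation component so the
    policy learns to tolerate real-world sensor inaccuracy (the simplest
    form of domain randomization)."""
    return [x + rng.gauss(0.0, std) for x in obs]

noisy = add_observation_noise([1.0, 2.0, 3.0])  # applied every time step
```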

Video of a sim2real example:

Why reinforcement learning?

Reinforcement learning helps us find solutions to complicated problems that are nearly or entirely impossible to code traditionally by hand. One such solution is an agent that can beat a team of world champions at a game; others include using robots for complicated tasks such as solving a Rubik’s cube.

OpenAI Dota 2 bots beating world champions:

Solving the Rubik’s cube:

A big advantage of reinforcement learning is the possibility of using simulation for gathering data and training. This simulation (or multiple simulations) can run much faster than real time, and the robot/agent can get hundreds or thousands of human years to perfect a particular task. Today’s improvements in compute power are a driving force for reinforcement learning. The lack of boundaries allows an agent to achieve superhuman results.

What is possible today?

As you may have already seen, reinforcement learning is very successful in games, pretty successful in simulation, and still developing in the sim2real domain.

RL in simulation examples:

RL agent dogfighting in simulation with a human pilot:

Sim2real example:

A use case of reinforcement learning can be motor control for the stability of a robot, because of how complicated and interconnected this space can get (controlling tens of muscles/motors to do one or multiple things at the same time), or learning movement in complex and unstructured environments. With simulation and a lot of training, a robot can also discover much better strategies for a particular task, such as dogfighting with a jet or strategies to win a game.

Example of a reinforcement learning agent exploiting game flaws with its unique strategy to gain the highest score:

That’s why designing the correct reward function is very important and often challenging.

I consider reinforcement learning a very powerful tool for control; with help from simulation, the training data can be created relatively cheaply and quickly.

The big downside is that reinforcement learning still requires a lot of time to train. There is the so-called curse of dimensionality at play: with an increasing number of dimensions (inputs/outputs) and increasing difficulty, the training time and data required grow exponentially.

One way to help the training process is to give it hints. E.g., if we want to train the robot to pick something up and put it somewhere else, we can create a dataset of training data (us doing the task correctly instead of the robot) in virtual reality, where we control the hand. We can divide a complicated task into many smaller ones to make training easier from the beginning (see curriculum learning). We can not only divide tasks but also create multiple specialized models: one model might be responsible for reaching, another for grabbing correctly, and another for balancing the robot on its feet while it does all of this.

Let’s say you have unlimited resources and want to train some task as soon as possible. You would start with one computer and conclude the training time to be, say, 50 hours. Where is the bottleneck? The main bottleneck I was able to observe was the CPU (at least if you don’t use big convolutional neural networks). The training environment I’ve been using ran on just one core. After some searching, I discovered the possibility of running multiple environments at the same time. But how does that work? There is only one model, so how could there be multiple training environments? One way is a boss-worker architecture: a few cores just gather data from the simulation and send it to the main computational unit/core. This can be expanded to multiple computers, but there is still a bottleneck in the main processing unit, which does the training. I haven’t explored it more in-depth, and I may be wrong.
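The boss-worker idea can be sketched like this. This is my own simplified illustration: threads stand in for the separate cores or processes a real setup would use, and the rollout data is random placeholder data rather than real simulation steps:

```python
from concurrent.futures import ThreadPoolExecutor
import random

def rollout(env_id, steps=100):
    """Worker: step one environment instance and return a batch of
    (observation, action, reward) transitions. Here the data is random;
    a real worker would step a physics simulation with the current policy."""
    rng = random.Random(env_id)
    return [(rng.random(), rng.choice([0, 1]), rng.random())
            for _ in range(steps)]

# Boss: gather experience from several environments in parallel,
# then hand the combined batch to the single learner for training.
with ThreadPoolExecutor(max_workers=4) as pool:
    batches = list(pool.map(rollout, range(4)))
experience = [t for batch in batches for t in batch]  # 4 * 100 transitions
```

The single model remains the bottleneck: however many workers collect data, one learner consumes it.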

It’s also often challenging to design an appropriate reward function, and the more you specify what the robot should do, the less original/better the solution is going to be, and the more human-caused errors/limitations you will introduce.

Also, I haven’t observed many adoptions of reinforcement learning in real life, except a few robots like Digit, the ANYmal robot, and a few other examples. I would still consider this field to be heavily dominated by research rather than applications, although I’m always happy to find some.

I have high hopes for reinforcement learning. I imagine humanoid robots that move flawlessly, indistinguishable from humans, as a result of vast training in crazy environments, with a superhuman ability to recover. I see drones so fast they obliterate the best drone pilots in the world. The best car racers, pilots, heck, the best sportsmen in any discipline. But the goals we can create within the environments are limited by our simulation. If we could recreate/mimic the human mind, we could have RL agents trying to have the best conversation, figuring out the best strategies to convey information, the best strategies to maximize a particular emotion/feeling, etc. Unfortunately, this technology has great potential to harm as well, but so does the whole field of machine learning, software, and technology in general.

Fast forward to summer 2020. During the first week, I was getting familiar with the workflow of the MuJoCo-py physics engine (I wasn’t working on the main PyBullet physics engine branch) and the structure of the software they’ve been building.

During this time I also started researching RL (Q/DQ/DDQ learning and ideas behind modern algorithms, reward shaping…). My main responsibilities were to create training scripts for MuJoCo environments as well as to create these environments with MuJoCo-py syntax. The environments were each in a different setting with different objects (robotic workspace, kitchen, our working space).

So, in the beginning, I was experimenting with MuJoCo physics, adding objects, moving them with Python, trying out different settings, and generally unconstrainedly exploring.

After this, I continued by morphing the existing PyBullet training script to work with MuJoCo. I imported the KUKA LBR IIWA robot into the scene, and for a while tried to figure out what settings I should use to approach learning the reaching task (the robot has to touch an object in 3D space). I started with control of all 7 axes.

But I figured out that I should start with as few axes as possible, to be able to learn quicker and tune the training process.

The reward function for this task was implemented as a simple Euclidean distance between 2 points in 3D space. If you are getting closer to the point with your ‘hand’/end effector, you get a positive reward, and if you are getting farther, you get a negative reward. Reward functions can get complicated, and you should shape yours to suit your needs.
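A minimal version of such a distance-change reward might look like this (this is my own sketch, not the project’s actual code):

```python
import math

def reach_reward(effector_pos, goal_pos, prev_distance):
    """Reward for a reach task based on the change in Euclidean distance:
    positive when the end effector moved closer to the goal since the
    last step, negative when it moved away.
    Returns (reward, new_distance) so the caller can carry the distance
    over to the next time step."""
    distance = math.dist(effector_pos, goal_pos)
    return prev_distance - distance, distance
```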

I started with 1-axis training but noticed that the trained model, although fast, had some problems reaching its goal exactly. One of the reasons was that the problem got harder with each new axis (the curse of dimensionality: the amount of data needed grows exponentially with the number of dimensions), and also that for the model to figure out the task, it has to learn the exact mapping of joint positions to 1/2/3D space.

After this partial success, I moved on to the next strategy. I was amazed at how the OpenAI team was able to train their fetcher robot.

I studied their code and figured out that they’d used movement change, not position mapping. Meaning: if you have a robot and you are trying to teach it to reach for things, you don’t give it data such as “I want you to reach position (x, y, z), so from the inputs (the position of each axis joint and of the end effector (the part we want to touch the goal with) and, of course, the end goal), you have to remember to use exactly this combination of numbers (a1, a2, a3, a4, a5, a6, a7) (which are basically positions for each joint)”.

Instead, what you want to say is this: “I want you to reach position (x, y, z); the position of your end effector is (a, b, c); tell me in what direction you want to move your hand”. This type of training simplifies the problem, as it removes the need to learn inverse kinematics from scratch (inverse kinematics encompasses the calculations we need to program the movement of the robot’s joints if we want to move its end effector where we want), and in the end, it’s all you need. The inverse kinematics is handled by the physical constraints of the robot (if you pull its hand, the arm will move accordingly to minimize resistance).

Besides controlling the directions, I also added control of the angle of the end effector, so the robot doesn’t always have a stiffened end effector (always pointing in one direction, as is the case with the fetcher).
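The direction-based control can be sketched like this (a simplification of my own: the real robot applies the movement through its joints, while here I just shift a point by the policy’s output):

```python
def apply_action(effector_pos, action, step_size=0.05):
    """Direction-based control: the policy outputs a small movement
    direction (dx, dy, dz) instead of absolute joint positions. The
    simulated robot's physical constraints then handle the inverse
    kinematics of actually following the end effector."""
    return tuple(p + step_size * a for p, a in zip(effector_pos, action))
```

The policy’s output space shrinks from “exact angles for 7 joints” to “a small 3-D direction”, which is a big part of why this formulation trains faster.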

As with the fetcher, I also added more ‘friction’ to the joints, so the robot doesn’t fall under its own weight.

After these findings, tweaks, and additions, the robot was able to learn the reaching task quickly and reliably. I think this task can easily be modified to suit different needs, such as “I want you to reach this object, but without touching these other objects”, etc.

After this successful familiarization and training of the basic reach task, I moved on to the next one, the throw task.

The throw task means that you want the robot to throw some object as far as possible. So the reward function looks like this: if the object is getting farther from you, you receive a positive reward, and at the end you receive an additional reward for the total distance between you and the thrown object. I figured that the robot should already have this object ‘in hand’, as the task is not pick-and-throw.
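A sketch of such a throw reward (my own names and structure, not the project’s code):

```python
def throw_reward(distance, prev_distance, episode_over):
    """Per-step reward for the throw task: positive while the thrown
    object moves farther away from the robot; at the end of the episode,
    an additional bonus proportional to the final distance achieved."""
    reward = distance - prev_distance
    if episode_over:
        reward += distance  # final bonus for the total distance
    return reward
```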

Essentially, you spawn the robot with a throwable object in its hand, give it an appropriate reward function, run the task with the implemented algorithm for some time, and then watch the results and test the trained models.

Training time (the time it took the model to learn how to throw the object) was pretty short, as the task is pretty basic. I moved on to the next task.

The next task was the push task, and oh dear, was it challenging. I enjoyed having a lot of autonomy in this project: I didn’t receive much advice on how one should think about reward functions, and I didn’t receive instructions on how to implement a certain task, which gave me a lot of room to experiment.

In a push task, you are trying to push an object from place A to place B. I started with the KUKA robot hanging upside down over the table, while the object and its goal spawned randomly within a circle.

There were two reward functions in this task. The first was based on the distance between the robot’s end effector and the puck it had to push; the second was calculated from the change in distance between the object and its goal.

I was experimenting.

I tried to research reward shaping and discovered curriculum learning in machine learning (trying to learn a problem step by step with increasing difficulty as humans do f.e. how we learn math, start with numbers, then simple signs +,- then /,*, etc.).

So I tried something similar. I divided the task into 2 parts. First, you want to get behind the red puck (while not touching it; I created a ‘negative’ area around the object we don’t want to touch, and if the robot touches it, it’s punished with a negative reward), and then you want to push the puck in the right direction. You wouldn’t be able to progress to part 2 unless you successfully achieved part 1. For this, I kept track of the success rate for part 1: if it’s over ~90%, you can dynamically turn off the first reward (getting behind the puck) and turn on the second (pushing the target to its place). Additionally, if the success rate of the second part is over ~95%, you can give the model some freedom to do anything, and the reward it receives will depend solely on whether it managed to get the object to its place.
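The stage-switching logic based on success rates can be sketched like this (the thresholds match the text; the function itself is my own illustration of the idea, not the project’s code):

```python
def pick_stage(success_rates, thresholds=(0.90, 0.95)):
    """Select the curriculum stage from rolling success rates:
    stage 0: learn to get behind the puck,
    stage 1: learn to push the puck (unlocked at ~90% success on stage 0),
    stage 2: free-form, reward only for reaching the goal
             (unlocked at ~95% success on stage 1)."""
    if success_rates[0] >= thresholds[0] and success_rates[1] >= thresholds[1]:
        return 2
    if success_rates[0] >= thresholds[0]:
        return 1
    return 0
```

During training, the active stage then decides which reward terms are switched on for the current episode.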

I thought I had cracked the code: the first part was learning pretty well, but it was harder for the model to connect the dots (‘I’m behind the puck, so I should now try to push it’ and ‘now the invisible negative-reward area around the puck doesn’t exist’). I think this might have been solved by adding a switch input that would tell the model ‘start part 2’, but I’m not sure; dynamically switching between 2 trained models would definitely be great, however that wasn’t my goal. I had to move on; the project is still in development, and it wasn’t my place to research and experiment with the training phase.

After talking to the project leader, I was informed that my kind of push task was far too complicated to be learned easily with simple training. I was recommended to create a much simplified version of the push task, similar to the reach task. It was pretty straightforward, but even this simplified version had some issues.

The main issue was the distance-based reward. The model was slow to learn how to approach its goal, as it received a positive reward even when its trajectory diverged (the negative reward only started showing after it passed the goal). I didn’t think twice and implemented a distance-based reward using the straight line connecting the origin and the desired goal (a positive reward for getting closer to the goal and a negative reward for getting farther from the line). This approach worked as expected, and I moved on.
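The line-based term is the standard point-to-line distance. A 2-D sketch (my own illustration, not the project’s code):

```python
import math

def line_penalty(pos, start, goal):
    """Perpendicular distance from `pos` to the straight line through the
    object's start position and its goal (2-D). Used as a negative reward
    term so the pusher is penalized for leaving the direct trajectory."""
    (x, y), (x1, y1), (x2, y2) = pos, start, goal
    num = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1)
    return num / math.hypot(x2 - x1, y2 - y1)
```

The full per-step reward would then combine progress toward the goal (positive) with this penalty (negative), e.g. `reward = progress - k * line_penalty(...)` for some weight `k`.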

I then continued making environments for the pick task (grab an object and lift it to a certain height) and the pick-and-place task (grab an object, lift it, and move it to a certain place), with random initializations of objects and goals, as well as making sure the robot can physically grab the object.

But those couldn’t be trained, as training on these tasks requires some tricks; it’s a whole new area to work on and, at the time, wasn’t part of my main responsibility. I continued by adding one more robot, the Franka Emika Panda, to all of these environments, and trained models for the reach, throw, and push tasks.

This learning/working experience in reinforcement learning was very insightful and valuable. I plan to continue working on this project during the school year, but the time I will be able to dedicate will be much more limited.

I learned that the reinforcement learning field is still very restrictive, and there is still too much work to be done for it to reach its full potential, whatever that means. RL will most likely become a powerful tool for taking care of problematic areas of development in robotics, and the possibility of exposing it to vast amounts of data will be valuable for handling the complexities of the real world.

I hope you enjoyed reading my two cents and got some insight into reinforcement learning.

If you want to train your own agents, I recommend OpenAI Gym and the channel Sebastian Schuchmann.