Reinforcement learning (RL) is pretty simple in theory – “take actions, get rewards, increase the likelihood of high-reward actions”. However, we can quickly run into subtle problems that don’t show up in standard supervised learning. The aim of this post is to give a gentle, concrete introduction to what RL actually is, why we might want to use it instead of (or alongside) supervised learning, and some of the headaches (figure 1) that come with it: sparse rewards, credit assignment, and reward shaping.

Figure 1: I’d like to help take you from confusion/headache 🙁 (left) to having at least some clarity 🙂 (right) with regard to what reinforcement learning is and where it’s useful
Rather than starting with Atari or robot arms, we’ll work through a small toy environment: a paddle catching falling balls. It’s simple enough to understand visually, but rich enough to show how different reward designs can lead to completely different behaviours, even when the underlying environment and objective are the same. Along the way, we’ll connect the code to the standard RL formalism (MDPs, returns, policy gradients), so you can see how the equations map onto something you can actually run.
From Supervised Learning to RL: Why Bother?
Most of what we call machine learning today is supervised learning:
- You have inputs, x (images, text, market states…)
- You have labels, y (cat/dog, sentiment, future price…)
- You train a model to map x to y as accurately as possible on a fixed dataset.
This works well when you have labelled data and your prediction doesn’t change which data you see next. A lot of real problems break these assumptions, such as:
- Playing chess or Go
- Controlling a robot or a drone
- Deciding what to show a user next
- Placing trades or bids over time
There is no dataset with “optimal action” labels for every possible situation, and your actions are not just predictions – they influence the next state which changes what happens next.
Supervised learning answers “Given this input, what output matches the label?” Reinforcement learning answers a different question: “Given that my actions affect the future, what is the best sequence of decisions?” Instead of a dataset of labelled examples, an RL agent interacts with an environment: it chooses actions, observes the next state and a reward (how good that step was), and adjusts its behaviour to maximize the total reward it gets over a whole trajectory, not just the next step. The framework lets you tackle problems where you don’t have ground-truth actions to imitate (although demonstrations can still be helpful in an RL setting) and where decisions are sequential rather than one-shot predictions. This extra power comes with extra headaches (sparse rewards, exploration, reward hacking…), but it’s what lets us train agents that can act in complex environments.
The RL Framework: Agents, Environments, and Returns
Reinforcement learning is usually framed in terms of a Markov Decision Process (MDP). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ where:
- $\mathcal{S}$: set of states
- $\mathcal{A}$: set of actions
- $P(s' \mid s, a)$: transition dynamics, how the environment moves from state $s$ to $s'$ when you take action $a$
- $R(s, a)$ or $R(s, a, s')$: reward function, how much immediate feedback you get for taking action $a$ in state $s$
- $\gamma \in [0, 1]$: discount factor, how much you care about future rewards
You can think of this as the formal version of “an agent interacting with an environment”. At each time step $t$:
- The agent observes a state $s_t$
- It chooses an action $a_t \sim \pi(\cdot \mid s_t)$ according to its policy $\pi$
- The environment samples a next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and a reward $r_t = R(s_t, a_t)$ is assigned
This produces a trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots)$, and the quality of a trajectory is measured by its return, the discounted sum of rewards:

$$R(\tau) = \sum_{t=0}^{T} \gamma^t r_t$$
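To make this loop concrete, here’s a minimal, runnable sketch of an agent interacting with a gym-style environment and accumulating the discounted return of one episode. `StubEnv` and `random_policy` are made-up placeholders for illustration, not the paddle environment we’ll build later.

```python
import random

# A minimal stub environment with a gym-like interface, just to make the
# interaction loop concrete. The "state" is a step counter and the episode
# ends after 5 steps with a single terminal reward of +1.
class StubEnv:
    def reset(self):
        self.t = 0
        return self.t                           # s_0

    def step(self, action):
        self.t += 1
        done = self.t >= 5
        reward = 1.0 if done else 0.0           # sparse: only the last step pays out
        return self.t, reward, done             # s_{t+1}, r_t, episode over?

def random_policy(state, n_actions=3):
    return random.randrange(n_actions)          # a_t chosen uniformly at random

env, gamma = StubEnv(), 0.99
state, done, episode_return, t = env.reset(), False, 0.0, 0

while not done:
    action = random_policy(state)               # a_t ~ pi(. | s_t)
    state, reward, done = env.step(action)      # environment gives s_{t+1}, r_t
    episode_return += gamma ** t * reward       # R(tau) = sum_t gamma^t * r_t
    t += 1

print(episode_return)                           # 0.99**4 ≈ 0.961
```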
The RL objective is to find a policy $\pi_\theta$ that maximizes the expected return $J(\theta)$. More explicitly, we can write the objective as

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$
These performance measures are functions of the entire trajectory and the environment’s randomness; they are not smooth, differentiable functions of the policy parameters. Policy gradient methods like REINFORCE get around this by never differentiating through the reward or the environment. Instead, they differentiate through the probability of the actions under the policy. Using the log-derivative trick, we obtain

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$

where $\hat{A}_t$ is an advantage estimate (a centred version of the return). In words: the return (an often non-differentiable measure of how good the trajectory was) only appears as a scalar weight on the differentiable quantity $\log \pi_\theta(a_t \mid s_t)$, and we update the policy to make high-return actions more likely.
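As a sketch of how this estimator turns into a loss you can backprop through, here is a minimal REINFORCE-style loss in PyTorch. It assumes `log_probs` is a tensor of $\log \pi_\theta(a_t \mid s_t)$ values collected during one episode and `rewards` is the list of per-step rewards; the function name and structure are illustrative, not the exact training code used later in this post.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    # Discounted returns-to-go: G_t = sum_{k >= t} gamma^(k - t) * r_k
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns)

    # A simple "advantage": centre the returns so better-than-average steps
    # get positive weight and worse-than-average steps get negative weight.
    advantages = returns - returns.mean()

    # The return only enters as a scalar weight on log pi_theta(a_t | s_t);
    # minimising this loss ascends the policy-gradient estimate above.
    return -(advantages * log_probs).sum()
```

Gradients flow only through `log_probs` (the policy), never through the rewards or the environment, which is exactly the point of the log-derivative trick.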
Sparse Rewards and Credit Assignment
So far, RL sounds simple. We define rewards and learn a policy that maximizes the long-term return. In practice, the main headache is that the reward signal is often weak, delayed, and noisy.
A lot of RL environments have sparse rewards. Picture a small gridworld where you only get +1 when you reach the goal, and 0 everywhere else. A random policy can wander around for ages and never hit that +1. All it sees is a stream of zeros. From the agent’s point of view, every trajectory looks equally useless so there’s no gradient telling it “this path was slightly better than that one”.
Even when you do eventually see a non-zero reward, you run into credit assignment problems. The return at time $t$ depends on all future rewards, so early actions are judged by what happens much later. A bad early move followed by a brilliant recovery can still yield a high final return, so the algorithm can end up giving positive credit to the bad early move. Conversely, a sequence of good early moves followed by one catastrophic mistake gives a low final return, so all the earlier good decisions get blamed. The algorithm doesn’t know which individual step deserved the credit; it just knows that the whole trajectory turned out well or badly.
One common way to cope with sparse rewards is reward shaping, which involves adding small intermediate rewards or penalties to guide the agent, instead of only paying out at the very end. For example, you might give a small negative reward for each extra step taken, or a small positive reward for moving closer to the goal. This can make learning much easier, but it also introduces a new problem – if you shape the reward badly, the agent may learn to optimize the shaped signal while completely ignoring the behaviour you actually care about.
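As a toy illustration of that gridworld example, a shaped reward might look something like the sketch below (the function name, constants, and Manhattan-distance progress term are all made up for illustration). Progress terms of this form are close in spirit to potential-based shaping, which is one of the few kinds of shaping known not to change the optimal policy when done carefully.

```python
# Illustrative shaped reward for a gridworld: the true objective is still
# "reach the goal", but we add small hints along the way.
def shaped_reward(old_pos, new_pos, goal, reached_goal):
    reward = 1.0 if reached_goal else 0.0       # the original sparse reward
    reward -= 0.01                              # small penalty for every extra step

    # Small bonus for reducing the Manhattan distance to the goal.
    old_dist = abs(goal[0] - old_pos[0]) + abs(goal[1] - old_pos[1])
    new_dist = abs(goal[0] - new_pos[0]) + abs(goal[1] - new_pos[1])
    reward += 0.05 * (old_dist - new_dist)
    return reward
```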
These issues (too few rewards, and rewards applied to whole trajectories rather than individual decisions) are a big part of what makes RL difficult and sensitive to hyperparameters. They also set up the motivation for the toy problem we’ll look at next, where we’ll see the difference in performance between using a sparse vs shaped reward.
A Toy Example: Catching Falling Balls with Sparse vs Shaped Rewards
To make all of this concrete, I built a small RL environment where a paddle at the bottom of the screen has to catch falling balls.
For each episode, up to five balls are spawned one by one from the top at random horizontal positions, and a random “wind” pushes them left or right as they fall. The agent controls a paddle with three actions (left, stay, right). If a ball hits the bottom where the paddle is, it counts as a catch, otherwise it’s a miss. The episode is considered a win if the agent catches at least 4 out of 5 balls.
In MDP terms, this game is defined as $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:
- The state is a compact vector: paddle position, up to two balls in flight (positions and velocities), current wind, and how many balls have been spawned/caught so far.
- The actions are the three moves: left, stay, right.
- The transition dynamics apply the chosen move, update wind and ball physics, spawn new balls, and end the episode once all 5 balls are processed or it’s impossible to reach 4 catches.
- The reward is where we experiment: either a sparse terminal signal (win/lose), or a shaped reward that adds small step-wise bonuses and penalties while keeping the same win condition.
- The policy $\pi_\theta$ is a neural network that outputs a categorical distribution over the three actions, and we train it with REINFORCE to maximise the expected return in this environment.
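The exact environment and training code aren’t the focus of this post, but a minimal sketch of the policy side looks roughly like this. The layer sizes, the state dimension, and the variable names are assumptions for illustration rather than the precise values used in the experiments; the last few lines show the sample-to-explore vs act-greedily distinction described next.

```python
import torch
import torch.nn as nn

STATE_DIM = 11   # paddle x, ball positions/velocities, wind, counters (assumed size)
N_ACTIONS = 3    # left, stay, right

# A small MLP that maps the state vector to a categorical distribution
# over the three moves.
class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

policy = PolicyNet()
state = torch.zeros(STATE_DIM)             # placeholder observation

dist = policy(state)
train_action = dist.sample()               # training: sample to explore
eval_action = dist.probs.argmax()          # evaluation: act greedily
log_prob = dist.log_prob(train_action)     # feeds the REINFORCE loss sketched earlier
```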
During training, the policy samples from the distribution of possible actions (to explore) and during evaluation it acts greedily by taking the most likely action. A successful game would look something like this:
Random baseline
As a sanity check, we first evaluate a random policy that picks each action with equal probability, but is otherwise identical (same environment, same observation space, same evaluation protocol). Over 100 evaluation games, the random baseline achieves 0% wins with 0.29 / 5 balls caught on average. So, a policy that has no structure at all almost never wins and only catches the occasional ball by accident.
Sparse reward
Next, we train a policy using the sparse reward. In the environment, this means the agent gets zero reward at every step, and only at the very end of the episode receives a single terminal signal: +1 if it caught at least 4 balls, and −1 otherwise. In other words, the raw reward sequence looks like 0, 0, 0, …, ±1. During training we turn this terminal reward into a discounted return and propagate it back through the whole trajectory, so every action gets some credit or blame, but all of that feedback ultimately comes from a single binary win/lose outcome.
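One way to see how weak this signal is: with only a terminal reward, the discounted return-to-go at every step is just a scaled copy of the same ±1, so every action in the episode receives essentially identical feedback (the episode length below is made up for illustration).

```python
# A winning episode under the sparse reward: zeros everywhere, +1 at the end.
gamma = 0.99
rewards = [0.0] * 40 + [1.0]

# Discounted returns-to-go, computed backwards through the episode.
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.insert(0, g)

print(returns[0], returns[20], returns[-1])   # ≈ 0.67, 0.82, 1.0 – same sign everywhere
```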
In practice, the policy quickly falls into a degenerate behaviour: it tends to drift all the way to one side (for example, the far right…) and sit there. Because the only information is a delayed ±1 at the end, most trajectories look similarly bad, and small random tweaks to the policy don’t reliably improve the return. The gradient signal stays weak and noisy, so the agent never really escapes this “do nothing useful” strategy. When we evaluate this sparse-trained policy greedily over 100 games, it achieves 0% wins and only 0.26 / 5 balls caught on average, which is no better than random.
Shaped reward
We then train an identical policy in the shaped version of the game, where the terminal win/lose signal is kept the same but we add small intermediate bonuses for catches, penalties for misses, and a term that gently rewards moving closer to the nearest ball. The underlying objective (“catch at least 4 out of 5 balls”) hasn’t changed, but the agent now gets additional hints along the way, instead of having to rely solely on a single terminal win/lose bit.
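To make the comparison concrete, here is an illustrative sketch of the two reward modes side by side. The constants and function signatures are made up; the real implementation may weight these terms differently, but the structure matches the description above.

```python
def sparse_reward(caught_total, episode_over):
    # A single terminal win/lose bit, zero everywhere else.
    if not episode_over:
        return 0.0
    return 1.0 if caught_total >= 4 else -1.0

def shaped_reward(caught_ball, missed_ball, prev_dist, new_dist,
                  caught_total, episode_over):
    # Keep the same terminal win/lose signal...
    reward = sparse_reward(caught_total, episode_over)
    # ...but add small intermediate hints along the way.
    if caught_ball:
        reward += 0.2                        # bonus for each catch
    if missed_ball:
        reward -= 0.2                        # penalty for each miss
    reward += 0.05 * (prev_dist - new_dist)  # gently reward closing in on the nearest ball
    return reward
```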
With this shaped reward, learning looks completely different. The policy gradually learns to track the balls: it moves under the most threatening one, corrects for wind, and visibly tries to win. In greedy evaluation over 100 games, the shaped-trained policy achieves 66% wins and 3.11 / 5 balls caught on average, a vast improvement over both the random baseline and the sparse-trained agent.
Summary of the learning curves
The training curves pull this together:
- The win-rate plot shows the sparse-reward agent flatlining near zero, while the shaped-reward agent climbs well above the random baseline and stabilises at a high win rate.
- The catch-fraction plot tells the same story: sparse reward hovers around “random” levels, whereas shaped reward steadily improves until the agent is catching the majority of balls.
- The total episode reward plot shows the shaped agent moving from negative returns (early exploration and mistakes) to comfortably positive returns as it learns, while the sparse agent remains stuck around its initial level.

Taken together, the GIFs and curves show that with only a sparse terminal reward, the agent gets stuck in a low-return, “do-nothing useful” strategy. With a shaped reward on the same MDP, it discovers a policy that actually plays the game well.
Wrapping up
The ball-catching game is deliberately tiny, but it captures most of the core ideas of reinforcement learning in one place. We framed the problem as an MDP, defined a policy $\pi_\theta$, and optimised its behaviour with a policy gradient method that never differentiates through the environment or the reward itself – only through the log-likelihood of the actions that led to good outcomes.
The key lesson from the toy example is that how you shape the reward matters as much as what you optimise. With a sparse win/lose signal, the agent effectively never learns: it finds a degenerate “go to one side and sit there” strategy and gets stuck. With a shaped reward on the same environment and objective, the agent learns to track balls, respond to wind, and win most games. In both cases we are maximising expected return; the difference is whether the agent has a usable learning signal on the way to that return.
This is why RL can feel both powerful and fragile. It gives us a principled framework for sequential decision-making when we don’t have ground-truth actions, but small choices in reward design, state representation, and exploration can dramatically change what an agent learns. In more complex settings such as robotics, trading, or recommendation systems, where the stakes are higher, thinking carefully about reward shaping and hyperparameters is crucial.
RL is not just “supervised learning with a fancier loss”. It’s a different way of thinking about problems where actions affect the future, and where we care about trajectories, not single predictions. The toy paddle might be simple, but the questions it surfaces are often the ones that come up in real RL applications.
