{"id":13750,"date":"2025-12-10T12:26:33","date_gmt":"2025-12-10T12:26:33","guid":{"rendered":"https:\/\/www.blopig.com\/blog\/?p=13750"},"modified":"2025-12-10T15:13:25","modified_gmt":"2025-12-10T15:13:25","slug":"an-introduction-to-the-basics-of-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/www.blopig.com\/blog\/2025\/12\/an-introduction-to-the-basics-of-reinforcement-learning\/","title":{"rendered":"An Introduction to the Basics of Reinforcement Learning"},"content":{"rendered":"\n<p>Reinforcement learning (RL) is pretty simple in theory &#8211; \u201ctake actions, get rewards, increase likelihood of high reward actions\u201d. However, we can quickly runs into subtle problems that don\u2019t show up in standard supervised learning. The aim of this post is to give a gentle, concrete introduction to what RL actually is, why we might want to use it instead of (or alongside) supervised learning, and some of the headaches (figure 1) that come with it: sparse rewards, credit assignment, and reward shaping.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/confusion_to_understanding.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"205\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/confusion_to_understanding.png?resize=625%2C205&#038;ssl=1\" alt=\"\" class=\"wp-image-13765\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/confusion_to_understanding.png?resize=1024%2C336&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/confusion_to_understanding.png?resize=300%2C98&amp;ssl=1 300w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/confusion_to_understanding.png?resize=768%2C252&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/confusion_to_understanding.png?resize=1536%2C504&amp;ssl=1 1536w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/confusion_to_understanding.png?resize=624%2C205&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/confusion_to_understanding.png?w=1690&amp;ssl=1 1690w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/confusion_to_understanding.png?w=1250&amp;ssl=1 1250w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<p>Figure 1: I&#8217;d like to help take you from confusion\/headache \ud83d\ude41 (left) to having a least some clarity \ud83d\ude42 (right) with regard to what reinforcement learning is and where its useful<\/p>\n\n\n\n<p>Rather than starting with Atari or robot arms, we\u2019ll work through a small toy environment: a paddle catching falling balls. It\u2019s simple enough to understand visually, but rich enough to show how different reward designs can lead to completely different behaviours, even when the underlying environment and objective are the same. 
Along the way, we\u2019ll connect the code to the standard RL formalism (MDPs, returns, policy gradients), so you can see how the equations map onto something you can actually run.<\/p>\n\n\n\n<!--more-->\n\n\n\n<h2 class=\"wp-block-heading\"><strong>From Supervised Learning to RL: Why Bother?&nbsp;<\/strong><\/h2>\n\n\n\n<p>Most of what we call machine learning today is supervised learning:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You have inputs, x (images, text, market states\u2026)&nbsp;<\/li>\n\n\n\n<li>You have labels, y (cat\/dog, sentiment, future price\u2026)&nbsp;<\/li>\n\n\n\n<li>You&nbsp;train a model to map&nbsp;x to y as accurately as possible on a fixed dataset.&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>This works well when you have labelled data and your prediction doesn\u2019t&nbsp;change which data you see next.&nbsp;A lot of real problems break&nbsp;these assumptions, such as:&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Playing chess or Go&nbsp;&nbsp;&nbsp;<\/li>\n\n\n\n<li>Controlling a robot or a drone&nbsp;&nbsp;&nbsp;<\/li>\n\n\n\n<li>Deciding what to show a user next&nbsp;&nbsp;&nbsp;<\/li>\n\n\n\n<li>Placing trades or bids over time&nbsp;&nbsp;&nbsp;<\/li>\n<\/ul>\n\n\n\n<p>There is no dataset with \u201coptimal action\u201d labels for every&nbsp;possible situation, and your actions are not just&nbsp;predictions &#8211;&nbsp;they influence the next state which changes what happens next. <\/p>\n\n\n\n<p>Supervised learning answers <em>\u201cGiven this input, what output matches the label?\u201d<\/em> Reinforcement learning answers a different question <em>\u201cGiven that my actions affect the future, what are the optimal actions in a trajectory of decisions?\u201d<\/em> Instead of a dataset of labelled examples, an RL agent interacts with an <strong>environment<\/strong>, chooses <strong>actions<\/strong>, observes the next<strong> state<\/strong> and a <strong>reward<\/strong> (how good that step was) and then adjusts its&nbsp;behaviour&nbsp;to maximize the total reward it gets over a whole trajectory, not just the next step. The framework lets you work with problems where you don\u2019t&nbsp;always have ground-truth actions to imitate (although this can also be helpful in an RL setting), and not just&nbsp;one-shot&nbsp;predictions.&nbsp;This extra power comes with extra headaches (sparse rewards, exploration, reward hacking\u2026), but&nbsp;it\u2019s&nbsp;what lets us train agents that can act in complex environments.&nbsp;<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>The&nbsp;RL Framework: Agents,&nbsp;Environments, and Returns<\/strong>&nbsp;<\/h2>\n\n\n\n<p>Reinforcement learning is usually framed in terms of a <strong>Markov Decision Process (MDP)<\/strong>. 
An MDP is a tuple <math><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi class=\"mathcal\">\ud835\udcae<\/mi><mo separator=\"true\">,<\/mo><mi class=\"mathcal\">\ud835\udc9c<\/mi><mo separator=\"true\">,<\/mo><mi>P<\/mi><mo separator=\"true\">,<\/mo><mi>R<\/mi><mo separator=\"true\">,<\/mo><mi>\u03b3<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\"> (\\mathcal{S}, \\mathcal{A}, P, R, \\gamma)<\/annotation><\/semantics><\/math> where:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><math><semantics><mi class=\"mathcal\">\ud835\udcae<\/mi><annotation encoding=\"application\/x-tex\">\\mathcal{S}<\/annotation><\/semantics><\/math>: set of <strong>states<\/strong><\/li>\n\n\n\n<li><math><semantics><mi class=\"mathcal\">\ud835\udc9c<\/mi><annotation encoding=\"application\/x-tex\">\\mathcal{A}<\/annotation><\/semantics><\/math>: set of <strong>actions<\/strong><\/li>\n\n\n\n<li><math><semantics><mrow><mi>P<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msup><mi>s<\/mi><mo lspace=\"0em\" rspace=\"0em\" class=\"tml-prime\">\u2032<\/mo><\/msup><mo lspace=\"0.22em\" rspace=\"0.22em\" stretchy=\"false\">|<\/mo><mi>s<\/mi><mo separator=\"true\">,<\/mo><mi>a<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">P(s&#8217; \\mid s, a)<\/annotation><\/semantics><\/math>: <strong>transition dynamics<\/strong>, how the environment moves from state <math><semantics><mi>s<\/mi><annotation encoding=\"application\/x-tex\">s<\/annotation><\/semantics><\/math> to <math><semantics><msup><mi>s<\/mi><mo lspace=\"0em\" rspace=\"0em\" class=\"tml-prime\">\u2032<\/mo><\/msup><annotation encoding=\"application\/x-tex\">s&#8217;<\/annotation><\/semantics><\/math> when you take action <math><semantics><mi>a<\/mi><annotation encoding=\"application\/x-tex\">a<\/annotation><\/semantics><\/math><\/li>\n\n\n\n<li><math><semantics><mrow><mi>R<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>s<\/mi><mo separator=\"true\">,<\/mo><mi>a<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">R(s, a)<\/annotation><\/semantics><\/math> or <math><semantics><mrow><mi>R<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>s<\/mi><mo separator=\"true\">,<\/mo><mi>a<\/mi><mo separator=\"true\">,<\/mo><msup><mi>s<\/mi><mo lspace=\"0em\" rspace=\"0em\" class=\"tml-prime\">\u2032<\/mo><\/msup><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">R(s, a, s&#8217;)<\/annotation><\/semantics><\/math>: <strong>reward function<\/strong>, how much immediate feedback you get for taking action <math><semantics><mi>a<\/mi><annotation encoding=\"application\/x-tex\">a<\/annotation><\/semantics><\/math> in state <math><semantics><mi>s<\/mi><annotation encoding=\"application\/x-tex\">s<\/annotation><\/semantics><\/math><\/li>\n\n\n\n<li><math><semantics><mrow><mi>\u03b3<\/mi><mo>\u2208<\/mo><mo form=\"prefix\" stretchy=\"false\">[<\/mo><mn>0<\/mn><mo separator=\"true\">,<\/mo><mn>1<\/mn><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\gamma \\in [0, 1)<\/annotation><\/semantics><\/math>: <strong>discount factor<\/strong>, how much you care about future rewards<\/li>\n<\/ul>\n\n\n\n<p>You can think of this as the formal version of \u201can agent interacting with an environment\u201d. 
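<\/p>\n\n\n\n<p>Before unpacking that loop formally, here is a minimal, purely illustrative sketch of what the interaction looks like in Python. The <code>CoinFlipEnv<\/code> and <code>random_policy<\/code> below are hypothetical stand-ins rather than the paddle environment we build later; the point is only the shape of the loop: observe a state, choose an action, receive a reward and the next state.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import random

class CoinFlipEnv:
    # A deliberately tiny, made-up MDP: the state is just a step counter,
    # action 1 pays +1 with probability 0.5, and the episode lasts 5 steps.
    def reset(self):
        self.t = 0
        return self.t                                # initial state s_0

    def step(self, action):
        self.t += 1
        reward = 1.0 if (action == 1 and random.random() > 0.5) else 0.0
        done = self.t == 5                           # terminal after 5 steps
        return self.t, reward, done                  # s_{t+1}, r_t, terminal flag

def random_policy(state):
    return random.choice([0, 1])                     # pi(a | s): uniform over two actions

env = CoinFlipEnv()
state, done, trajectory = env.reset(), False, []
while not done:
    action = random_policy(state)                    # a_t sampled from pi(. | s_t)
    next_state, reward, done = env.step(action)      # environment returns s_{t+1} and r_t
    trajectory.append((state, action, reward))
    state = next_state

print(trajectory)                                    # (s_0, a_0, r_0), (s_1, a_1, r_1), ...
<\/code><\/pre>\n\n\n\n<p>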
At each time step <math><semantics><mi>t<\/mi><annotation encoding=\"application\/x-tex\">t<\/annotation><\/semantics><\/math>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The agent observes a <strong>state<\/strong> <math><semantics><mrow><msub><mi>s<\/mi><mi>t<\/mi><\/msub><mo>\u2208<\/mo><mi class=\"mathcal\">\ud835\udcae<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">s_t \\in \\mathcal{S}<\/annotation><\/semantics><\/math><\/li>\n\n\n\n<li>It chooses an <strong>action<\/strong> <math><semantics><mrow><msub><mi>a<\/mi><mi>t<\/mi><\/msub><mo>\u2208<\/mo><mi class=\"mathcal\">\ud835\udc9c<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">a_t \\in \\mathcal{A}<\/annotation><\/semantics><\/math> according to its <strong>policy<\/strong> <math><semantics><mrow><mi>\u03c0<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>a<\/mi><mo lspace=\"0.22em\" rspace=\"0.22em\" stretchy=\"false\">|<\/mo><mi>s<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\pi(a \\mid s)<\/annotation><\/semantics><\/math><\/li>\n\n\n\n<li>The environment samples a <strong>next state<\/strong> <math><semantics><mrow><msub><mi>s<\/mi><mrow><mi>t<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u223c<\/mo><mi>P<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mo form=\"prefix\" stretchy=\"false\">\u22c5<\/mo><mo lspace=\"0.22em\" rspace=\"0.22em\" stretchy=\"false\">|<\/mo><msub><mi>s<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi>a<\/mi><mi>t<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">s_{t+1} \\sim P(\\cdot \\mid s_t, a_t)<\/annotation><\/semantics><\/math> and a <strong>reward<\/strong> <math><semantics><mrow><msub><mi>r<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><mi>R<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>s<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi>a<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi>s<\/mi><mrow><mi>t<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">r_t = R(s_t, a_t, s_{t+1})<\/annotation><\/semantics><\/math> is assigned<\/li>\n<\/ul>\n\n\n\n<p>This produces a trajectory <math><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>s<\/mi><mn>0<\/mn><\/msub><mo separator=\"true\">,<\/mo><msub><mi>a<\/mi><mn>0<\/mn><\/msub><mo separator=\"true\">,<\/mo><msub><mi>r<\/mi><mn>0<\/mn><\/msub><mo separator=\"true\">,<\/mo><msub><mi>s<\/mi><mn>1<\/mn><\/msub><mo separator=\"true\">,<\/mo><msub><mi>a<\/mi><mn>1<\/mn><\/msub><mo separator=\"true\">,<\/mo><msub><mi>r<\/mi><mn>1<\/mn><\/msub><mo separator=\"true\">,<\/mo><mo>\u2026<\/mo><mspace width=\"0.1667em\"><\/mspace><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(s_0, a_0, r_0, s_1, a_1, r_1, \\dots),<\/annotation><\/semantics><\/math> and the quality of a trajectory is measured by its <strong>return<\/strong>: the discounted sum of rewards: <math><semantics><mrow><msub><mi>G<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi>r<\/mi><mi>t<\/mi><\/msub><mo>+<\/mo><mi>\u03b3<\/mi><msub><mi>r<\/mi><mrow><mi>t<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>+<\/mo><msup><mi>\u03b3<\/mi><mn>2<\/mn><\/msup><msub><mi>r<\/mi><mrow><mi>t<\/mi><mo>+<\/mo><mn>2<\/mn><\/mrow><\/msub><mo>+<\/mo><mo>\u2026<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">G_t = r_t 
+ \\gamma r_{t+1} + \\gamma^2 r_{t+2} + \\dots<\/annotation><\/semantics><\/math><\/p>\n\n\n\n<p>The RL objective is to find a policy <math><semantics><mrow><msub><mi>\u03c0<\/mi><mi>\u03b8<\/mi><\/msub><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>a<\/mi><mo lspace=\"0.22em\" rspace=\"0.22em\" stretchy=\"false\">|<\/mo><mi>s<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\pi_\\theta(a \\mid s)<\/annotation><\/semantics><\/math> that maximizes the expected return <math><semantics><mrow><mi>\ud835\udd3c<\/mi><mo form=\"prefix\" stretchy=\"false\">[<\/mo><msub><mi>G<\/mi><mn>0<\/mn><\/msub><mo form=\"postfix\" stretchy=\"false\">]<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\mathbb{E}[G_0]<\/annotation><\/semantics><\/math>. More explicitly, we can write the objective as<\/p>\n\n\n\n<p><math><semantics><mrow><mi>J<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>\u03b8<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><msub><mi>\ud835\udd3c<\/mi><mi>\u03c0<\/mi><\/msub><mrow><mo fence=\"true\" form=\"prefix\">[<\/mo><msub><mo>\u2211<\/mo><mi>t<\/mi><\/msub><msub><mi>r<\/mi><mi>t<\/mi><\/msub><mo fence=\"true\" form=\"postfix\">]<\/mo><\/mrow><mi>.<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">J(\\theta) = \\mathbb{E}_\\pi \\left[ \\sum_t r_t \\right].<\/annotation><\/semantics><\/math><\/p>\n\n\n\n<p>These performance measures are functions of the <strong>entire trajectory<\/strong> and the environment\u2019s randomness; they are not smooth, differentiable functions of the policy parameters. Policy gradient methods like REINFORCE get around this by never differentiating through the reward or the environment. Instead, they differentiate through the <strong>probability of the actions<\/strong> under the policy. Using the log-derivative trick, we obtain<\/p>\n\n\n\n<p><math><semantics><mrow><msub><mo>\u2207<\/mo><mi>\u03b8<\/mi><\/msub><mi>J<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>\u03b8<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><msub><mi>\ud835\udd3c<\/mi><mi>\u03c0<\/mi><\/msub><mrow><mo fence=\"true\" form=\"prefix\">[<\/mo><msub><mo>\u2211<\/mo><mi>t<\/mi><\/msub><msub><mo>\u2207<\/mo><mi>\u03b8<\/mi><\/msub><mrow><mspace width=\"0.1667em\"><\/mspace><mi>log<\/mi><mo>\u2061<\/mo><mspace width=\"0.1667em\"><\/mspace><\/mrow><msub><mi>\u03c0<\/mi><mi>\u03b8<\/mi><\/msub><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>a<\/mi><mi>t<\/mi><\/msub><mo lspace=\"0.22em\" rspace=\"0.22em\" stretchy=\"false\">|<\/mo><msub><mi>s<\/mi><mi>t<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mspace width=\"0.1667em\"><\/mspace><msub><mi>A<\/mi><mi>t<\/mi><\/msub><mo fence=\"true\" form=\"postfix\">]<\/mo><\/mrow><mo separator=\"true\">,<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\nabla_\\theta J(\\theta) = \\mathbb{E}_\\pi \\left[ \\sum_t \\nabla_\\theta \\log \\pi_\\theta(a_t \\mid s_t)\\, A_t \\right],<\/annotation><\/semantics><\/math><\/p>\n\n\n\n<p>where <math><semantics><msub><mi>A<\/mi><mi>t<\/mi><\/msub><annotation encoding=\"application\/x-tex\">A_t<\/annotation><\/semantics><\/math> is an advantage estimate (a centred version of the return). 
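<\/p>\n\n\n\n<p>As a rough sketch of what this looks like in practice, assuming a PyTorch-style setup (the network shape and the dummy tensors below are placeholders for illustration, not the actual code behind the experiments in this post), the REINFORCE update boils down to an advantage-weighted log-likelihood loss:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch
import torch.nn as nn

# Hypothetical tiny policy network: an 8-dimensional state, 3 discrete actions.
policy = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 3))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, advantages):
    # states: (T, 8) floats, actions: (T,) ints, advantages: (T,) floats, i.e. A_t.
    logits = policy(states)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)          # log pi_theta(a_t | s_t)
    loss = -(log_probs * advantages).mean()     # minimise the negative weighted log-likelihood
    optimizer.zero_grad()
    loss.backward()                             # gradients flow only through log pi_theta
    optimizer.step()
    return loss.item()

# Dummy rollout data, just to show the shapes; in practice these come from episodes.
states = torch.randn(10, 8)
actions = torch.randint(0, 3, (10,))
advantages = torch.randn(10)
reinforce_update(states, actions, advantages)
<\/code><\/pre>\n\n\n\n<p>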
In words: the return (an often non-differentiable measure of how good the trajectory was) only appears as a scalar weight on the <strong>differentiable<\/strong> quantity <math><semantics><mrow><mrow><mi>log<\/mi><mo>\u2061<\/mo><mspace width=\"0.1667em\"><\/mspace><\/mrow><msub><mi>\u03c0<\/mi><mi>\u03b8<\/mi><\/msub><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>a<\/mi><mi>t<\/mi><\/msub><mo lspace=\"0.22em\" rspace=\"0.22em\" stretchy=\"false\">|<\/mo><msub><mi>s<\/mi><mi>t<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\log \\pi_\\theta(a_t \\mid s_t)<\/annotation><\/semantics><\/math>, and we update the policy to make high-return actions more likely.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Sparse Rewards and Credit Assignment<\/strong><\/h2>\n\n\n\n<p>So far, RL sounds simple. We define rewards and learn a policy that maximizes&nbsp;the&nbsp;long-term return. In practice,&nbsp;the&nbsp;main headache is that&nbsp;the&nbsp;reward signal is often weak, delayed, and noisy.&nbsp;<\/p>\n\n\n\n<p>A lot of RL environments have sparse rewards. Picture a small&nbsp;gridworld&nbsp;where you only get +1 when you reach&nbsp;the&nbsp;goal, and 0 everywhere else. A random policy can wander around for ages and never hit that +1. All it sees is a stream of zeros. From&nbsp;the&nbsp;agent\u2019s point of view, every trajectory looks equally useless so there\u2019s&nbsp;no gradient telling it&nbsp;\u201cthis&nbsp;path was slightly better than that one\u201d.&nbsp;<\/p>\n\n\n\n<p>Even when you do eventually see a non-zero reward, you run into credit assignment problems.&nbsp;The&nbsp;return&nbsp;<math><semantics><msub><mi>G<\/mi><mi>t<\/mi><\/msub><annotation encoding=\"application\/x-tex\">G_t<\/annotation><\/semantics><\/math>&nbsp;at time&nbsp;t&nbsp;depends on all future rewards, so early actions are judged by what happens much later. This means a bad early move followed by brilliant recovery can give us a high&nbsp;final return, so&nbsp;the&nbsp;algorithm can end up giving positive credit to bad early moves.&nbsp;Conversely, a sequence of good early moves followed by one catastrophic mistake results in a low final reward, so all&nbsp;the&nbsp;earlier good decisions are assigned low rewards.&nbsp;The&nbsp;algorithm&nbsp;doesn\u2019t&nbsp;know which individual step&nbsp;deserved&nbsp;the&nbsp;reward. It just knows that this whole trajectory turned out well or badly.&nbsp;<\/p>\n\n\n\n<p>One common way to cope with sparse rewards is reward shaping, which involves adding small intermediate rewards or penalties to guide&nbsp;the&nbsp;agent, instead of only paying out at&nbsp;the&nbsp;very end. For example, you might give a small negative reward for each extra step taken, or a small positive reward for moving closer to&nbsp;the&nbsp;goal. 
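<\/p>\n\n\n\n<p>For that gridworld example, a shaped reward might look something like the sketch below. The specific constants are arbitrary and purely illustrative; the original sparse goal reward is kept and two small shaping terms are added on top of it.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def shaped_reward(old_dist_to_goal, new_dist_to_goal, reached_goal):
    # Illustrative shaping for the gridworld example: keep the sparse +1 for
    # reaching the goal, and add two small guiding terms on top of it.
    reward = -0.01                                           # small cost per step taken
    reward += 0.1 * (old_dist_to_goal - new_dist_to_goal)    # bonus for moving closer
    if reached_goal:
        reward += 1.0                                        # the original sparse reward
    return reward

print(shaped_reward(5, 4, False))   # 0.09: one step closer, goal not reached yet
<\/code><\/pre>\n\n\n\n<p>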
This can make learning much easier, but it also introduces a new problem &#8211; if you shape&nbsp;the&nbsp;reward badly,&nbsp;the&nbsp;agent may learn to&nbsp;optimize&nbsp;the&nbsp;shaped signal while completely ignoring&nbsp;the&nbsp;behaviour&nbsp;you&nbsp;actually care&nbsp;about.&nbsp;<\/p>\n\n\n\n<p>These issues (too few rewards, and rewards applied to whole trajectories rather than individual decisions) are a big part of what makes RL difficult and sensitive to hyperparameters.&nbsp;They also set up&nbsp;the&nbsp;motivation for&nbsp;the&nbsp;toy problem&nbsp;we\u2019ll&nbsp;look at next, where&nbsp;we\u2019ll&nbsp;see the difference in performance between using a sparse vs shaped reward. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>A Toy Example: Catching Falling Balls with Sparse vs Shaped Rewards&nbsp;<\/strong><\/h2>\n\n\n\n<p>To make&nbsp;all of&nbsp;this concrete, I built a small RL environment where a paddle at&nbsp;the&nbsp;bottom of&nbsp;the&nbsp;screen&nbsp;has to&nbsp;catch falling balls.&nbsp;<\/p>\n\n\n\n<p>For each episode, up to five balls are spawned one by one from the top at random horizontal positions, and a random \u201cwind\u201d pushes them left or right as they fall. The agent controls a paddle with three actions (left, stay, right). If a ball hits the bottom where the paddle is, it counts as a catch, otherwise it\u2019s a miss. The episode is considered a win if the agent catches at least 4 out of 5 balls.<\/p>\n\n\n\n<p>In MDP terms, this game is defined as <math><semantics><mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi class=\"mathcal\">\ud835\udcae<\/mi><mo separator=\"true\">,<\/mo><mi class=\"mathcal\">\ud835\udc9c<\/mi><mo separator=\"true\">,<\/mo><mi>P<\/mi><mo separator=\"true\">,<\/mo><mi>R<\/mi><mo separator=\"true\">,<\/mo><mi>\u03b3<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">(\\mathcal{S}, \\mathcal{A}, P, R, \\gamma)<\/annotation><\/semantics><\/math>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <strong>state<\/strong> <math><semantics><msub><mi>s<\/mi><mi>t<\/mi><\/msub><annotation encoding=\"application\/x-tex\">s_t<\/annotation><\/semantics><\/math> is a compact vector: paddle position, up to two balls in flight (positions and velocities), current wind, and how many balls have been spawned\/caught so far.<\/li>\n\n\n\n<li>The <strong>actions<\/strong> <math><semantics><msub><mi>a<\/mi><mi>t<\/mi><\/msub><annotation encoding=\"application\/x-tex\">a_t<\/annotation><\/semantics><\/math> are the three moves: left, stay, right.<\/li>\n\n\n\n<li>The <strong>transition dynamics<\/strong> <math><semantics><mrow><mi>P<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>s<\/mi><mrow><mi>t<\/mi><mo>+<\/mo><mn>1<\/mn><\/mrow><\/msub><mo lspace=\"0.22em\" rspace=\"0.22em\" stretchy=\"false\">|<\/mo><msub><mi>s<\/mi><mi>t<\/mi><\/msub><mo separator=\"true\">,<\/mo><msub><mi>a<\/mi><mi>t<\/mi><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">P(s_{t+1} \\mid s_t, a_t)<\/annotation><\/semantics><\/math> apply the chosen move, update wind and ball physics, spawn new balls, and end the episode once all 5 balls are processed or it\u2019s impossible to reach 4 catches.<\/li>\n\n\n\n<li>The <strong>reward<\/strong> <math><semantics><mi>R<\/mi><annotation encoding=\"application\/x-tex\">R<\/annotation><\/semantics><\/math> is where we experiment: either a sparse terminal 
<math><semantics><mrow><mo>\u00b1<\/mo><mn>1<\/mn><\/mrow><annotation encoding=\"application\/x-tex\">\\pm 1<\/annotation><\/semantics><\/math> signal (win\/lose), or a shaped reward that adds small step-wise bonuses and penalties while keeping the same win condition.<\/li>\n\n\n\n<li>The <strong>policy<\/strong> <math><semantics><mrow><msub><mi>\u03c0<\/mi><mi>\u03b8<\/mi><\/msub><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>a<\/mi><mo lspace=\"0.22em\" rspace=\"0.22em\" stretchy=\"false\">|<\/mo><mi>s<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\pi_\\theta(a \\mid s)<\/annotation><\/semantics><\/math> is a neural network that outputs a categorical distribution over the three actions, and we train it with REINFORCE to maximise the expected return <math><semantics><mrow><mi>J<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>\u03b8<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><msub><mi>\ud835\udd3c<\/mi><mi>\u03c0<\/mi><\/msub><mrow><mo fence=\"true\" form=\"prefix\">[<\/mo><msub><mo>\u2211<\/mo><mi>t<\/mi><\/msub><msub><mi>r<\/mi><mi>t<\/mi><\/msub><mo fence=\"true\" form=\"postfix\">]<\/mo><\/mrow><\/mrow><annotation encoding=\"application\/x-tex\">J(\\theta) = \\mathbb{E}_\\pi \\left[ \\sum_t r_t \\right]<\/annotation><\/semantics><\/math> in this environment.<\/li>\n<\/ul>\n\n\n\n<p>During training, the policy <strong>samples<\/strong>&nbsp;from the distribution of possible actions (to explore) and during evaluation it acts&nbsp;<strong>greedily<\/strong>&nbsp;by taking&nbsp;the&nbsp;most likely action.&nbsp;A successful game would look something like this:<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1410\" style=\"aspect-ratio: 1990 \/ 1410;\" width=\"1990\" controls src=\"https:\/\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/shaped_policy_trimmed.mov\"><\/video><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Random baseline<\/h3>\n\n\n\n<p>As a sanity check, we first evaluate a random policy that picks each action with equal probability, but is otherwise identical (same environment, same observation space, same evaluation protocol). Over 100 evaluation games, the random baseline achieves <strong>0% wins<\/strong> with <strong>0.29 \/ 5<\/strong> balls caught on average. So, a policy that has no structure at all almost never wins and only catches the occasional ball by accident.<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1410\" style=\"aspect-ratio: 1990 \/ 1410;\" width=\"1990\" controls src=\"https:\/\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/random_policy_trimmed.mov\"><\/video><\/figure>\n\n\n\n<p><strong>Sparse reward<\/strong><\/p>\n\n\n\n<p>Next, we train a policy using the sparse reward. In the environment, this means the agent gets <strong>zero reward at every step<\/strong>, and only at the very end of the episode receives a single terminal signal: +1 if it caught at least 4 balls, and \u22121 otherwise. In other words, the raw reward sequence looks like <code>0, 0, 0, \u2026, \u00b11<\/code>. During training we turn this terminal reward into a <strong>discounted return<\/strong> and propagate it back through the whole trajectory, so every action gets some credit or blame, but all of that feedback ultimately comes from a single binary win\/lose outcome.<\/p>\n\n\n\n<p>In practice, the policy quickly falls into a degenerate behaviour: it tends to drift all the way to one side (for example, the far right\u2026) and sit there. 
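<\/p>\n\n\n\n<p>To see why that single terminal bit is such a weak signal, it helps to look at what the discounted returns actually look like for a sparse episode. The helper below is an illustrative sketch of the returns-to-go computation described above, not the exact training code:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>def returns_to_go(rewards, gamma=0.99):
    # Turn a reward sequence into discounted returns G_t by working backwards.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# A sparse losing episode: zero reward everywhere, a single -1 at the very end.
sparse_episode = [0.0] * 9 + [-1.0]
print([round(g, 3) for g in returns_to_go(sparse_episode)])
# [-0.914, -0.923, -0.932, -0.941, -0.951, -0.961, -0.97, -0.98, -0.99, -1.0]
# Every step sees a scaled copy of the same terminal value, so there is no
# step-level information about which individual moves were good or bad.
<\/code><\/pre>\n\n\n\n<p>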
Because the only information is a delayed \u00b11 at the end, most trajectories look similarly bad, and small random tweaks to the policy don\u2019t reliably improve the return. The gradient signal stays weak and noisy, so the agent never really escapes this \u201cdo nothing useful\u201d strategy. When we evaluate this sparse-trained policy greedily over 100 games, it achieves <strong>0% wins <\/strong>and only<strong> 0.26 \/ 5 balls caught<\/strong> on average which is no better than random.<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1410\" style=\"aspect-ratio: 1990 \/ 1410;\" width=\"1990\" controls src=\"https:\/\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/sparse_reward_trimmed.mov\"><\/video><\/figure>\n\n\n\n<p><strong>Shaped reward<\/strong><\/p>\n\n\n\n<p>We then train an identical policy in the shaped version of the game, where the terminal win\/lose signal is kept the same but we add small intermediate bonuses for catches, penalties for misses, and a term that gently rewards moving closer to the nearest ball. The underlying objective (\u201ccatch at least 4 out of 5 balls\u201d) hasn\u2019t changed, but the agent now gets additional hints along the way, instead of having to rely solely on a single terminal win\/lose bit.<\/p>\n\n\n\n<p>With this shaped reward, learning looks completely different. The policy gradually learns to track the balls: it moves under the most threatening one, corrects for wind, and visibly tries to win. In greedy evaluation over 100 games, the shaped-trained policy achieves <strong>66% wins<\/strong> and <strong>3.11 \/ 5<\/strong> balls caught on average, a vast improvement over both the random baseline and the sparse-trained agent.<\/p>\n\n\n\n<figure class=\"wp-block-video\"><video height=\"1410\" style=\"aspect-ratio: 1990 \/ 1410;\" width=\"1990\" controls src=\"https:\/\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/shaped_policy_trimmed-1.mov\"><\/video><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Summary of the learning curves<\/h3>\n\n\n\n<p>The training curves pull this together:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The <strong>win-rate plot<\/strong> shows the sparse-reward agent flatlining near zero, while the shaped-reward agent climbs well above the random baseline and stabilises at a high win rate.<\/li>\n\n\n\n<li>The <strong>catch-fraction plot<\/strong> tells the same story: sparse reward hovers around \u201crandom\u201d levels, whereas shaped reward steadily improves until the agent is catching the majority of balls.<\/li>\n\n\n\n<li>The <strong>total episode reward plot<\/strong> shows the shaped agent moving from negative returns (early exploration and mistakes) to comfortably positive returns as it learns, while the sparse agent remains stuck around its initial level.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><a href=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/PastedGraphic-1.png?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"625\" height=\"164\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/PastedGraphic-1.png?resize=625%2C164&#038;ssl=1\" alt=\"\" class=\"wp-image-13756\" srcset=\"https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/PastedGraphic-1.png?resize=1024%2C269&amp;ssl=1 1024w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/PastedGraphic-1.png?resize=300%2C79&amp;ssl=1 300w, 
https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/PastedGraphic-1.png?resize=768%2C202&amp;ssl=1 768w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/PastedGraphic-1.png?resize=624%2C164&amp;ssl=1 624w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/PastedGraphic-1.png?w=1483&amp;ssl=1 1483w, https:\/\/i0.wp.com\/www.blopig.com\/blog\/wp-content\/uploads\/2025\/12\/PastedGraphic-1.png?w=1250&amp;ssl=1 1250w\" sizes=\"auto, (max-width: 625px) 100vw, 625px\" \/><\/a><\/figure>\n\n\n\n<p>Taken together, the videos and curves show that with only a sparse terminal reward, the agent gets stuck in a low-return, \u201cdo nothing useful\u201d strategy. With a shaped reward on the same MDP, it discovers a policy that actually plays the game well.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Wrapping up<\/h3>\n\n\n\n<p>The ball-catching game is deliberately tiny, but it captures most of the core ideas of reinforcement learning in one place. We framed the problem as an MDP, defined a policy <math><semantics><mrow><msub><mi>\u03c0<\/mi><mi>\u03b8<\/mi><\/msub><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>a<\/mi><mo lspace=\"0.22em\" rspace=\"0.22em\" stretchy=\"false\">|<\/mo><mi>s<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">\\pi_\\theta(a \\mid s)<\/annotation><\/semantics><\/math>, and optimised its behaviour with a policy gradient method that never differentiates through the environment or the reward itself &#8211; only through the log-likelihood of the actions that led to good outcomes.<\/p>\n\n\n\n<p>The key lesson from the toy example is that how you shape the reward matters as much as what you optimise. With a sparse win\/lose signal, the agent effectively never learns: it finds a degenerate \u201cgo to one side and sit there\u201d strategy and gets stuck. With a shaped reward on the <em>same<\/em> environment and objective, the agent learns to track balls, respond to wind, and win most games. In both cases we are maximising expected return; the difference is whether the agent has a usable learning signal on the way to that return.<\/p>\n\n\n\n<p>This is why RL can feel both powerful and fragile. It gives us a principled framework for sequential decision-making when we don\u2019t have ground-truth actions, but small choices in reward design, state representation, and exploration can dramatically change what an agent learns. In more complex settings such as robotics, trading, and recommendation systems, where the stakes are higher, thinking carefully about reward shaping and hyperparameters is crucial.<\/p>\n\n\n\n<p>RL is not just \u201csupervised learning with a fancier loss\u201d. It\u2019s a different way of thinking about problems where actions affect the future, and where we care about trajectories, not single predictions. The toy paddle might be simple, but the questions it surfaces are often the ones that come up in real RL applications.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Reinforcement learning (RL) is pretty simple in theory &#8211; \u201ctake actions, get rewards, increase likelihood of high reward actions\u201d. However, we can quickly run into subtle problems that don\u2019t show up in standard supervised learning. 
The aim of this post is to give a gentle, concrete introduction to what RL actually is, why we might [&hellip;]<\/p>\n","protected":false},"author":126,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","wikipediapreview_detectlinks":true,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"ngg_post_thumbnail":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[633,632,189],"tags":[890,902],"ppma_author":[790],"class_list":["post-13750","post","type-post","status-publish","format-standard","hentry","category-ai","category-deep-learning","category-machine-learning","tag-reinforcement-learning","tag-rl"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"authors":[{"term_id":790,"user_id":126,"is_guest":0,"slug":"jamesb","display_name":"James Broster","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/df390c53770be6a0afc152da99e17025226f7300979c8b5e54021ddeb87971e4?s=96&d=mm&r=g","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":""}],"_links":{"self":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13750","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/users\/126"}],"replies":[{"embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/comments?post=13750"}],"version-history":[{"count":5,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13750\/revisions"}],"predecessor-version":[{"id":13821,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/posts\/13750\/revisions\/13821"}],"wp:attachment":[{"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/media?parent=13750"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/categories?post=13750"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/tags?post=13750"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.blopig.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=13750"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}