{"id":606531,"date":"2025-07-08T20:59:20","date_gmt":"2025-07-09T01:59:20","guid":{"rendered":"https:\/\/towardsdatascience.com\/?p=606531"},"modified":"2025-07-08T20:59:32","modified_gmt":"2025-07-09T01:59:32","slug":"how-to-finetune-small-language-models-to-think-with-reinforcement-learning","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/","title":{"rendered":"How to Fine-Tune Small Language Models to Think with Reinforcement Learning"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><strong>Reasoning models<\/strong> are currently in fashion. DeepSeek-R1, Gemini-2.5-Pro, OpenAI&#8217;s O-series models, Anthropic&#8217;s Claude, Magistral, and Qwen3 \u2014 there is a new one every month. When you ask these models a question, they go into a <em>chain of thought<\/em> before generating an answer.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/results_gif3-2.gif\" alt=\"\" class=\"wp-image-607822\"\/><figcaption class=\"wp-element-caption\">A simple demonstration of what reasoning looks like. When asked a question, the Language Model (LM) generates a chain of thought first, followed by the answer. (Illustration by the Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">I recently asked myself the question, &#8220;Hmm&#8230; I wonder if I should write a Reinforcement Learning loop from scratch that teaches this &#8216;thinking&#8217; behaviour to <em>really<\/em> small models \u2014 <em>like only 135 million<\/em> <em>parameters<\/em>&#8221;. 
It should be easy, right?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Well, it wasn&#8217;t.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>Small models simply do not have the world knowledge that large models do. This makes &lt;1B-parameter models lack the &#8220;common sense&#8221; to easily reason through complex logical tasks. Therefore, you cannot just rely on compute to train them to reason. <br><br>You need additional tricks up your sleeve.<\/strong><\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">In this article, I won&#8217;t just cover tricks though. I will cover the major ideas behind training reasoning behaviours into language models, share some simple code snippets, and offer practical tips to fine-tune Small Language Models (SLMs) with RL.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>This article is divided into 5 sections:<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Intro to RLVR (Reinforcement Learning with Verifiable Rewards) and why it is uber cool<\/li>\n\n\n\n<li class=\"wp-block-list-item\">A visual overview of the GRPO algorithm and the clipped surrogate PPO loss<\/li>\n\n\n\n<li class=\"wp-block-list-item\">A code walkthrough!<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Supervised fine-tuning and practical tips to train reasoning models<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Results!<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Unless otherwise mentioned, all images used in this article are illustrations produced by the author.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>At the end of this article, I will link to the 50-minute companion YouTube video of this article. If you have any queries, that video likely has the answers\/clarification you need. 
You can also reach out to me on X (<a href=\"https:\/\/x.com\/neural_avb\">@neural_avb<\/a>).<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Reinforcement Learning with Verifiable Rewards (RLVR)<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Before diving into the specific challenges with small models, let&#8217;s first introduce some terms.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Group Relative Policy Optimization, or GRPO, is a (rather new) Reinforcement Learning (RL) technique that researchers are using to fine-tune Large Language Models (LLMs) on logical and analytical tasks. Since its inception, a new term has been circulating in the LLM research space:&nbsp;<strong>RLVR<\/strong>,&nbsp;or <strong>R<\/strong>einforcement <strong>L<\/strong>earning with <strong>V<\/strong>erifiable <strong>R<\/strong>ewards.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To understand what makes RLVR unique, it&#8217;s helpful to contrast it with the most common application of RL in language models: RLHF (<strong>R<\/strong>einforcement <strong>L<\/strong>earning with <strong>H<\/strong>uman <strong>F<\/strong>eedback). In RLHF, an RL module is trained to maximize scores from a separate reward model, which acts as a proxy for human preferences. This reward model is trained on a dataset where humans have ranked or rated different model responses. <\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>In other words, RLHF is trained so LLMs can output responses that are more aligned with human preferences. It tries to make models follow instructions more closely.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>RLVR tries to solve a different problem. 
RLVR teaches a model to be verifiably correct, often by learning to generate its own chain of thought.<\/strong><\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">Where RLHF had a <em>subjective<\/em> reward model, RLVR uses an <em>objective<\/em> verifier. The core idea is to provide rewards based on whether an answer is demonstrably correct, not on a prediction of what a human might prefer.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/RLVR2-1024x682.png\" alt=\"\" class=\"wp-image-607780\"\/><figcaption class=\"wp-element-caption\">An illustration of how RLVR works (Illustrated by the Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This is exactly why this system is called &#8216;RL with <em>verifiable rewards<\/em>&#8217;. Not every question&#8217;s answer can be verified easily \u2014 especially open-ended questions like <em>&#8220;What iPhone should I buy?<\/em>&#8221; or &#8220;<em>Where should I go to college?&#8221;<\/em> Some use cases, however, do fit easily into the &#8220;verifiable rewards&#8221; paradigm \u2014 math, logical tasks, and code-writing, to name a few. In the <code>reasoning-gym<\/code> section below, we will look into how exactly these tasks can be simulated and how the rewards can be generated.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>But before that, you might ask: where does &#8220;reasoning&#8221; fit into all of this?<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We will train the LLM to generate arbitrarily long chain-of-thought reasoning text before generating the final answer. We instruct the model to wrap its thinking process in <code>&lt;think&gt;<\/code> tags and its final conclusion in <code>&lt;answer&gt;<\/code> tags. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The full language model response will look something like this:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-markup\">&lt;think&gt;\nUser has asked me to count the number of r&#039;s in strawberry.\nLet&#039;s do a cumulative count.\ns=0, t=0, r=1, a=1, w=1, b=1, e=1, r=2, r=3, y=3\n\nIt seems there are 3 r&#039;s in strawberry. \nI notice that there is an r in straw and 2 r&#039;s in berry.\nSince 1+2=3 I am more confident there are 3 r&#039;s\n&lt;\/think&gt;\n&lt;answer&gt;\n3\n&lt;\/answer&gt;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This structure allows us to easily extract just the final answer and check if it&#8217;s correct. The verifier is a single source of truth, and can be a simple piece of code that (literally) counts letters.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">def count_alphabets(word, letter):\n    return sum([1 for l in word if l == letter])\n\nreward = 1 if lm_answer == count_alphabets(&quot;strawberry&quot;, &quot;r&quot;) else -1<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We will keep a record of the model&#8217;s experiences \u2014 its responses and the corresponding rewards received from the verifier. The RL algorithm will then train to promote behaviours that increase the likelihood of correct final answers.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>By consistently rewarding correct answers and good formatting, we would increase the likelihood of reasoning tokens that lead to correct answers. <\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Get this: we don&#8217;t <em>need<\/em> to directly evaluate the intermediate reasoning tokens. 
By simply rewarding the final answer, we will indirectly elicit reasoning steps into the LLM&#8217;s chain of thought that lead to correct answers!<\/strong><\/p>\n<\/blockquote>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/DR1-Zero-1-1024x451.png\" alt=\"\" class=\"wp-image-607778\"\/><figcaption class=\"wp-element-caption\">Source: Some excerpts from the <a href=\"https:\/\/arxiv.org\/pdf\/2501.12948\">DeepSeek-R1<\/a> paper (License: Free) <\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">2. GRPO (Group Relative Policy Optimization)<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">I am going to skip the usual <em>Reinforcement Learning 101<\/em> intro here; I expect most of you who have read this far to understand the basics of RL. There is an agent who observes states from the environment and takes an action \u2014 the environment rewards the agent depending on how good the action was \u2014 the agent stores these experiences and trains to take better actions in the future that lead to higher rewards. <em>RL 101<\/em> class dismissed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>But how do we transfer the RL paradigm to language?<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s talk about our algorithm of choice \u2014 <strong>G<\/strong>roup <strong>R<\/strong>elative <strong>P<\/strong>olicy <strong>O<\/strong>ptimization \u2014 to understand how. GRPO works in two iteratively repeating phases: an experience collection phase, where the Language Model (LM) accumulates experiences in the environment with its current weights, and a training phase, where it uses the collected memories to update its weights to improve. 
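The two alternating phases can be sketched in a few lines of toy Python. Everything here is a dummy stand-in (a "policy" reduced to a single success probability, a made-up `verify` helper), not the actual training code:

```python
import random

def verify(answer):
    # Dummy verifier stand-in: reward 1 for a correct answer, else 0
    return 1.0 if answer == "42" else 0.0

def sample_response(p_correct):
    # Dummy "policy" stand-in: answers correctly with probability p_correct
    return "42" if random.random() < p_correct else "?"

def grpo_round(p_correct, num_questions=4, G=8):
    # Phase 1: experience collection with the current ("old") weights
    buffer = []
    for _ in range(num_questions):
        responses = [sample_response(p_correct) for _ in range(G)]
        rewards = [verify(r) for r in responses]
        # Group-relative advantages: standardize rewards within each group
        mean = sum(rewards) / G
        std = (sum((r - mean) ** 2 for r in rewards) / G) ** 0.5
        advantages = [(r - mean) / (std + 1e-8) for r in rewards]
        buffer.extend(zip(responses, advantages))
    # Phase 2 would sample minibatches from this buffer and apply PPO updates,
    # after which the next round collects experiences with the updated weights
    return buffer

random.seed(0)
buffer = grpo_round(p_correct=0.5)
assert len(buffer) == 4 * 8
```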
After training, it once again goes into an experience collection step with the updated weights.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Experience Collection<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s dissect each step in the experience collection phase now.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Step 1: <\/strong>The environment is a black box that generates questions about logical or math tasks. We will discuss this in an upcoming section with the <code>reasoning-gym<\/code> library.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Step 2: <\/strong>We tokenize the input questions into a sequence of integer tokens.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/GRPO1-1-1024x576.png\" alt=\"\" class=\"wp-image-607783\"\/><figcaption class=\"wp-element-caption\">Sample questions, tokenize them, forward pass through the LM, and generate multiple responses for each question! (Illustrated by the Author)<\/figcaption><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Step 3: <\/strong>The &#8220;agent&#8221; or the &#8220;policy&#8221; is the current SLM we are training. It observes the environment&#8217;s tokenized questions and generates responses. The LLM response gets converted into text and returned to the environment. The environment rewards each response.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/GRPO2-1-1024x576.png\" alt=\"\" class=\"wp-image-607784\"\/><figcaption class=\"wp-element-caption\">The Environment acts as the verifier and assigns a reward to the agent. 
(Illustrated by the Author)<\/figcaption><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Step 4: <\/strong>From the rewards, we calculate the <strong>advantage<\/strong> of each response. In GRPO, the advantage is the relative goodness of each response in the group. Importantly, advantages are calculated per group, i.e. we do not standardize rewards <em>across<\/em> different questions.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/GRPO3-1-1024x576.png\" alt=\"\" class=\"wp-image-607785\"\/><figcaption class=\"wp-element-caption\">Advantages define how favourable a specific response is relative to other responses to the same question <br>(Illustrated by the Author)<\/figcaption><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><strong>Step 5: <\/strong>The original question, the log probabilities for each LM-generated token, and the advantages are all accumulated inside a memory buffer.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Steps 1-5 are repeated till the buffer size reaches the desired threshold.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/GRPO4-1024x576.png\" alt=\"\" class=\"wp-image-607768\"\/><figcaption class=\"wp-element-caption\">Saving experiences in the buffer! (Illustrated by the Author)<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Training Phase<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">After the end of the experience collection phase, our goal is to enter the training phase. Here, we will learn from the reward patterns the LLM observed and use RL to improve its weights. Here is how that works:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Randomly sample a minibatch of memories. 
Remember, each memory already contained its group-relative-advantage (Step 5 from the experience collection phase). Randomly sampling question-answer pairs improves the robustness of the training as the gradients are calculated as an average of a diverse set of experiences, preventing over-fitting on any single question.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">For each minibatch, we want to maximize this term following the standard PPO (Proximal Policy Optimization) formulation. <strong>The major difference with GRPO is that we do not need an additional reward model or a value network to calculate advantages. Instead, GRPO samples multiple responses to the same question to calculate the relative advantage of each response.<\/strong> The memory footprint is significantly reduced since we won&#8217;t need to train those additional models!<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Repeat the above steps.<\/li>\n<\/ol>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/2_steps-1-1024x576.jpg\" alt=\"\" class=\"wp-image-607782\"\/><figcaption class=\"wp-element-caption\">GRPO operates in 2 repeating phases \u2014 collect experiences, train on experiences, repeat. (Illustrated by the Author)<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">What the PPO Loss means<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Let me explain the PPO Loss in an intuitive step-by-step fashion. The PPO Loss looks like this. <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/PPO_Loss_Full-1024x576.png\" alt=\"\" class=\"wp-image-607788\"\/><figcaption class=\"wp-element-caption\"><a href=\"https:\/\/arxiv.org\/abs\/1707.06347\">The PPO Loss Function<\/a>. Let me break it down for you. 
(Illustration by the Author)<\/figcaption><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Here, <code>pi_old<\/code> is the old-policy neural network that we used during the data collection phase.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><code>\u03c0<\/code> is the current policy neural network we are training. Since the weights of <code>\u03c0<\/code> change after each gradient update, <code>\u03c0<\/code> and <code>\u03c0_old<\/code> do not remain the same during the training phase \u2014 hence the distinction.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><code>G<\/code> is the number of generated responses for a single question. <code>|o_i|<\/code> is the length of the i-th response in the group. The summation and normalization operations therefore compute a mean over all tokens across all responses. What does it compute the mean of? The term <code>\u03c0\/\u03c0_old * A_{it}<\/code>. What does that mean?<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/PPO_Loss_ADV-1024x576.png\" alt=\"\" class=\"wp-image-607787\"\/><figcaption class=\"wp-element-caption\">The simplest way to assign an advantage to each token is by copying the advantage of the entire response (Illustrated by the Author)<\/figcaption><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\"><code>A_it<\/code> is the advantage of the t-th token in the i-th response. Remember when we calculated the advantage of each response in Step 5 during experience collection? 
The easiest way to assign an advantage to each token is by simply duplicating the same advantage to each token \u2014 this means we are saying that every token is equally responsible for generating the correct answer.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Lastly, what is <code>\u03c0(o_it | q, o_i &lt; t)<\/code>? It is the probability of the <code>t-th<\/code> token in the <code>i-th<\/code> response, conditioned on the question and all preceding tokens \u2014 in other words, how <em>likely<\/em> that token was when it was generated.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">The importance sampling ratio reweights the advantages between the current updating policy and the old exploration policy.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">The clipping term ensures that the updates to the network do not become too large and the weights do not move too far away from the old policy. This adds more stability to the training process by keeping the model updates close to a &#8220;trust region&#8221; around the data-collection policy.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/Eqn2-1024x576.png\" alt=\"\" class=\"wp-image-607764\"\/><figcaption class=\"wp-element-caption\">The PPO objective broken down into individual components. 
(Illustrated by the Author)<\/figcaption><\/figure>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>When we are maximizing the PPO objective, we are effectively asking the LLM to <em>increase<\/em> the log-probability of the tokens that led to a high advantage, while <em>decreasing<\/em> the log-probability of tokens that had a low advantage.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In other words: make tokens that generate good advantages more likely and tokens that generate low advantages less likely.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Understanding the PPO Loss with an example<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s forget about the clipping term and the <code>\u03c0_old <\/code>for now, and let&#8217;s just see what maximizing <code>\ud835\udf0b(\ud835\udc5c_i) * A_i<\/code> means. To remind you, this part of the equation simply means &#8220;the product of the probability of the i-th token (o_i) and the advantage of the i-th token (A_i)&#8221;.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s say for a question, the LLM generated these two sequences: &#8220;A B C&#8221; and &#8220;D E F&#8221;, and it got an advantage of +1 for the former and -1 for the latter*. Let&#8217;s say we have the log probabilities for each of the 3 tokens as shown below. <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">* <em>strictly, the values depend on how the group is standardized: standardizing rewards of +1 and -1 with the population standard deviation gives advantages of exactly +1 and -1, while a sample standard deviation would give +0.707 and -0.707.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Notice what happens when you multiply the advantages <code>A_it<\/code> by the current logprobs <code>pi<\/code>. 
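To make this toy computation concrete, here is a minimal sketch in plain Python. It uses raw probabilities instead of log-probabilities, and the numbers are made up for illustration:

```python
def toy_objective(probs, advantages):
    # probs: one list of token probabilities per response
    # advantages: one advantage per response, copied to every token in it
    terms = [p * a for ps, a in zip(probs, advantages) for p in ps]
    return sum(terms) / len(terms)

probs = [
    [0.2, 0.5, 0.3],  # response "A B C", advantage +1
    [0.4, 0.1, 0.6],  # response "D E F", advantage -1
]
advantages = [+1.0, -1.0]
base = toy_objective(probs, advantages)

# Raising a token's probability in the +1-advantage response raises the objective...
assert toy_objective([[0.3, 0.5, 0.3], [0.4, 0.1, 0.6]], advantages) > base

# ...while raising one in the -1-advantage response lowers it.
assert toy_objective([[0.2, 0.5, 0.3], [0.5, 0.1, 0.6]], advantages) < base
```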
Now really think about what it means to maximize the mean of that product matrix.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/Slide2-1024x576.jpeg\" alt=\"\" class=\"wp-image-607770\"\/><figcaption class=\"wp-element-caption\">A toy example to show what it means to maximize the product of the probability of a token with its advantage (Illustrated by the Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Remember we can only change the probabilities coming out of the LLM. The advantages come from the environment and are therefore treated as constants. Increasing this expected score would therefore mean increasing the probability of tokens with a positive advantage, and decreasing the probability of tokens with a negative advantage. <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/Slide6-1024x576.jpeg\" alt=\"\" class=\"wp-image-607774\"\/><figcaption class=\"wp-element-caption\">To increase the mean of the product tensor, we must increase each value in the tensor, so we must increase the probs of positive-advantage tokens, and decrease the probs of negative-advantage tokens. <br>(Illustrated by the Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Below, you will find an example of how log-probs change after a few rounds of training. Notice how the blue line is moving closer to zero when the advantage is high? This indicates that the log-probabilities increased (or the probabilities increased) after going through RL training. Compare that to the plot on the right, which shows a different response with a low advantage. 
The blue line is moving away from 0, indicating that those tokens become less probable in later rounds.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/comparison-log-probs-1024x464.png\" alt=\"\" class=\"wp-image-607799\"\/><figcaption class=\"wp-element-caption\">A comparison of how RL fine-tuning affects log-probs of tokens after training (Illustration by the Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In the next section, let&#8217;s take a look at the <code>reasoning-gym<\/code> library and understand how we could sample tasks.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. Implementation<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">So, to do RL, we first need tasks. A common way to do this is by using an existing dataset of math problems, like the GSM-8K dataset. In this article, let&#8217;s look at a different case \u2014 generating tasks procedurally with a Python library called <a href=\"https:\/\/github.com\/open-thought\/reasoning-gym\">reasoning-gym<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For my experiments, I used two tasks: <strong>syllogism<\/strong> and <strong>propositional logic<\/strong>. <code>reasoning-gym<\/code> contains a host of different tasks of varying difficulty.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A <strong>syllogism task<\/strong> is a type of logical puzzle designed to test deductive reasoning. Basically, we will provide the LLM with two premises and ask if the conclusion is correct or not. The <strong>propositional logic<\/strong> task is a symbolic reasoning task where the LLM is given premises written with logical symbols and asked to generate the conclusion. Unlike syllogism, this is not a YES\/NO classification task \u2014 the model has to generate the correct conclusion directly. 
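To make the difference concrete, here is a hedged sketch of the two verifier styles in plain Python. The helper names and the normalization are my own illustration, not reasoning-gym's actual scoring API:

```python
def verify_syllogism(lm_answer: str, expected_yes: bool) -> bool:
    # Yes/No task: normalize the label and compare it with the expected verdict
    return (lm_answer.strip().lower() == "yes") == expected_yes

def verify_conclusion(lm_answer: str, expected: str) -> bool:
    # Free-form task: the generated conclusion itself must match the expected one
    normalize = lambda s: "".join(s.split()).lower()
    return normalize(lm_answer) == normalize(expected)

assert verify_syllogism("Yes", expected_yes=True)
assert verify_conclusion("P -> Q", "P  ->  Q")   # whitespace-insensitive match
assert not verify_conclusion("P -> R", "P -> Q")
```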
This makes this task considerably harder.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/results_gif2.gif\" alt=\"\" class=\"wp-image-607823\"\/><figcaption class=\"wp-element-caption\">Example of the Syllogism Task (Footage of my RL-trained model)<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Before we begin coding, I guess it is customary to specify what I mean by &#8220;small&#8221; models.<\/em> <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The jury is still out on what qualifies as a &#8220;small&#8221; model (some say &lt;14B, some say &lt;7B), but for my YouTube video, I picked even smaller models: SmolLM-135M-Instruct, SmolLM-360M-Instruct, and Qwen3-0.6B. These are ~135M, ~360M, and ~600M models, respectively.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s see how to set up the basic training loop. First, we can use Huggingface&#8217;s <code>transformers<\/code> library to load in a model we want to train, let&#8217;s say the little 135M param model <code>SmolLM-135M-Instruct<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To generate some propositional logic tasks, for example, you just call the <code>reasoning_gym.create_dataset<\/code> function as shown below.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import re\nfrom reasoning_gym import create_dataset, get_score_answer_fn\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nimport torch\n\nmodel_name = &quot;HuggingFaceTB\/SmolLM-135M-Instruct&quot;\n\n# load model from huggingface\nlm = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\n# This makes all model parameters trainable\nfor param in lm.parameters():\n    param.requires_grad = True\n# In my experiments, I used a LoRA adapter (more on this later)\n\n# specify name of the 
env \nenvironment_name = &quot;propositional_logic&quot;\n\n# In practice, you should wrap this with a torch dataloader \n# to sample a minibatch of questions\ndataset = create_dataset(\n    environment_name, seed=42, size=DATA_SIZE\n)\n\nfor d in dataset:\n    question = d[&quot;question&quot;] # Accessing the question\n     \n    # We will use this later to verify if answer is correct\n    validation_object = d[&quot;metadata&quot;][&quot;source_dataset&quot;]\n    score_fn = get_score_answer_fn(validation_object)\n\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To generate reasoning data, we want the LM to generate thinking, followed by the response. Below is the system prompt we will be using.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">system_prompt = &quot;&quot;&quot;A conversation between User and Assistant. The user asks a question, and the Assistant solves it.\nThe assistant first thinks about the reasoning process in the mind and then provides the user\nwith the answer. The reasoning process and answer are enclosed within &lt;think&gt; &lt;\/think&gt; and\n&lt;answer&gt; &lt;\/answer&gt; tags, respectively, i.e., &lt;think&gt; reasoning process here &lt;\/think&gt;\n&lt;answer&gt; answer here &lt;\/answer&gt;.\n\nDo not generate new code. 
Do not write python code.\n\nYou may also be given examples by the user telling you the expected response format.\nFollow the format of the examples, but solve the specific problem asked by the user, not the examples.\n\nVery important - Remember again, your output format should be:\n&lt;think&gt; reasoning process here &lt;\/think&gt;\n&lt;answer&gt; answer here &lt;\/answer&gt;\n\nYour response will be scored by extracting the substring between the &lt;answer&gt;...&lt;\/answer&gt; tags.\nIt is critical to follow the above format.\nFailing to follow the response format will result in a penalty.\n&quot;&quot;&quot;<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To generate answers, we first tokenize the system prompt and the question as shown below.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Create messages structure\nmessages = [\n    {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: system_prompt},\n    {&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: question}, # Obtained from reasoning-gym\n]\n\n# Create tokenized representation\n# return_dict=True gives us both input_ids and attention_mask\ninputs = tokenizer.apply_chat_template(\n    messages,\n    tokenize=True,\n    return_tensors=&quot;pt&quot;,\n    return_dict=True,\n    add_generation_prompt=True\n)<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Then we pass it through the LM \u2014 generate multiple responses using the <code>num_return_sequences<\/code> parameter, and detokenize it back to get a string response. 
No gradients are calculated during this stage.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">generated_response = lm.generate(\n    input_ids=inputs[&quot;input_ids&quot;],\n    attention_mask=inputs[&quot;attention_mask&quot;],\n    max_new_tokens=max_new_tokens, # The max number of tokens to generate\n    do_sample=True,                # Probabilistic sampling\n    top_p=0.95,                    # Nucleus sampling\n    num_return_sequences=G,        # Number of sequences per question\n    temperature=1,                 # Increase randomness\n    eos_token_id=eos_token_id,\n    pad_token_id=eos_token_id,\n)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We also write the <code>extract_answer<\/code> function, which uses regular expressions to extract answers between the answer tags.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\" datatext=\"\"><code class=\"language-python\">def extract_answer(response):\n    answer = re.search(r&quot;&lt;answer&gt;(.*?)&lt;\/answer&gt;&quot;, response, re.DOTALL)\n    if answer is not None:\n        return answer.group(1).strip()\n    else:\n        return &quot;&quot;\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, we use the score function we got previously to generate a reward depending on whether the LM&#8217;s response was correct. To calculate rewards, we add a format reward and a correctness reward. The correctness reward comes from the environment, and the format reward is awarded if the model correctly generates the <code>&lt;think&gt; ... &lt;\/think&gt;<\/code> and <code>&lt;answer&gt; ... 
&lt;\/answer&gt;<\/code> tags.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The advantages are calculated by standardizing across each group.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\"># Response is an array of strings of length [B*G]\n# B is the number of questions, G is the number of responses per question\n\ncorrectness_reward = score_fn(response, validation_object)\nformat_reward = calculate_format_reward(response)\n\n# Total reward is a weighted sum of correctness and formatting rewards\nrewards = correctness_reward * 0.85 + format_reward * 0.15 \n\n# Convert rewards from [B*G, 1] -&gt; [B, G]\nrewards = rewards.reshape(B, G) \n\n# Calculate advantages\nadvantages = (rewards - np.mean(rewards, axis=1, keepdims=True)) \/ (\n    np.std(rewards, axis=1, keepdims=True) + 1e-8\n)\nadvantages = advantages.reshape(-1, 1)\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Store the (old) log probs, advantages, responses, and response masks in a memory buffer.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\" datatext=\"\"><code class=\"language-python\"># A function that returns the log prob of each selected token\nlog_probs = calculate_log_probs(lm, generated_response)\n\nbuffer.extend([{\n    &quot;full_response&quot;: generated_response[i],\n    &quot;response_mask&quot;: response_mask[i], # A binary mask to denote which tokens in generated response are AI generated, 0 for system prompt and questions\n    &quot;old_log_probs&quot;: log_probs[i],\n    &quot;advantages&quot;: advantages[i]\n} for i in range(len(generated_response))])<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">After multiple experience collection steps, once the buffer is full, we initiate our training loop. Here, we sample minibatches from our experience, calculate the log probs, compute the loss, and backprop. 
<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\" datatext=\"\"><code class=\"language-python\"># full_response, response_mask, old_log_probs, advantages &lt;--- Buffer\n\n# Recompute the new log_probs. Notice no torch.no_grad(), so gradients WILL BE USED here.\nlogits = lm(input_ids=full_response).logits\n\n# Extract log probs from the logits\n# Does log_softmax over the vocabulary and extracts the log-prob of each selected token\nlog_probs = calculate_log_probs(\n     logits,\n     full_response\n)\n\n# Calculate the clipped surrogate loss\nreasoning_loss = calculate_ppo_loss(\n     log_probs,       # Trainable\n     old_log_probs,   # Obtained from exploration, not trainable\n     advantages,      # Obtained from environment, not trainable\n     response_mask    # Obtained from exploration, not trainable\n)\n\n# Optimization step\naccelerator.backward(reasoning_loss)\noptimizer.step()\noptimizer.zero_grad()\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">You can use an additional entropy loss here, or minimize the KLD with your reference model as suggested in the original DeepSeek-R1 paper, but later papers have concluded that these leash the training process and are not a requirement.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. Warming up with Supervised Fine-tuning<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Technically, we can try to run a big RL training right now and hope that the small models can pull through and conquer our tasks. However, the probability of that is incredibly low.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">There is one big problem \u2014 our small models are not appropriately trained to generate formatted outputs or perform well on these tasks. 
Out of the box, their responses do have <em>some<\/em> logical flow to them, thanks to the pretraining or instruction tuning from their original developers, but they are not good enough for our target task.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/slm_fails-1024x576.png\" alt=\"\" class=\"wp-image-607800\"\/><figcaption class=\"wp-element-caption\">Comparing the outputs of a small model with a large LM (Illustration by Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Think about it \u2014 RL trains by collecting experiences and updating the policy to maximize the good experiences. But if most of the experiences are completely bad and the model receives 0 rewards, it has no way to optimize, because it gets no signal to improve at all. So the recommended approach is to first teach the model the behavior you want to train using supervised fine-tuning. Here is a simple script:<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-python\">import asyncio\nimport json\n\nimport backoff\nimport openai\n\nclient = openai.AsyncClient()\nENVIRONMENT = &quot;propositional_logic&quot;\nmodel = &quot;gpt-4.1-mini&quot;\nsemaphore = asyncio.Semaphore(50)\nnum_datapoints = 200\nsystem_prompt = (\n    system_prompt\n    + &quot;&quot;&quot;You will also be provided the real answer. 
Your thinking should eventually result in producing the real answer.&quot;&quot;&quot;\n)\n\ndataloader = create_dataset(name=ENVIRONMENT, size=num_datapoints)\n\n@backoff.on_exception(backoff.expo, openai.RateLimitError)\nasync def generate_response(item):\n    async with semaphore:\n        messages = [\n            {&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: system_prompt},\n            {\n                &quot;role&quot;: &quot;user&quot;,\n                &quot;content&quot;: f&quot;&quot;&quot;\n    Question: {item[&#039;question&#039;]}\n    Metadata: {item[&#039;metadata&#039;]}\n    Answer: {item[&#039;answer&#039;]}\n                    &quot;&quot;&quot;,\n            },\n        ]\n        response = await client.chat.completions.create(messages=messages, model=model)\n        return {\n            &quot;question&quot;: item[&quot;question&quot;],\n            &quot;metadata&quot;: item[&quot;metadata&quot;],\n            &quot;answer&quot;: item[&quot;answer&quot;],\n            &quot;response&quot;: response.choices[0].message.content,\n        }\n\nasync def main():\n    responses = await asyncio.gather(*[generate_response(item) for item in dataloader])\n    fname = f&quot;responses_{ENVIRONMENT}_{model}.json&quot;\n    json.dump(responses, open(fname, &quot;w&quot;), indent=4)\n    print(f&quot;Saved responses to {fname}&quot;)\n\nif __name__ == &quot;__main__&quot;:\n    asyncio.run(main())<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">To generate the fine-tuning dataset, I first generated the thinking and answer tags with a smaller LLM like GPT-4.1-mini. Doing this is incredibly simple \u2014 we sample 200 or so examples for each task, call the OpenAI API to generate a response, and save it on disk.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">During SFT, we load the base model we want to train, attach a trainable LoRA adapter, and do parameter-efficient fine-tuning. 
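<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Under the hood, a LoRA adapter freezes each targeted linear layer and adds a trainable low-rank update next to it. The <code>peft<\/code> library does this wiring for you; the from-scratch sketch below is only an illustration of the idea, not the <code>peft<\/code> implementation.<\/p>

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = W x + (lora_alpha / r) * B(A x)."""

    def __init__(self, linear: nn.Linear, r: int = 32, lora_alpha: int = 64):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad_(False)         # freeze the original weights
        self.lora_A = nn.Linear(linear.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, linear.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # the adapter starts as a no-op
        self.scaling = lora_alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```

<p class=\"wp-block-paragraph\">Only the small <code>lora_A<\/code> and <code>lora_B<\/code> matrices receive gradients, which is why training is memory efficient and the base weights stay intact.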
Here are the LoRA configurations I used.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-yaml\">lora:\n  r: 32\n  lora_alpha: 64\n  lora_dropout: 0\n  target_modules: [&quot;q_proj&quot;, &quot;v_proj&quot;, &quot;k_proj&quot;, &quot;o_proj&quot;, \n                   &quot;up_proj&quot;, &quot;down_proj&quot;, &quot;gate_proj&quot;] <\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">LoRA allows the training process to be more memory-efficient and also reduces the risk of corrupting the original model. You can find the details of parameter-efficient supervised fine-tuning in my YouTube video right here.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Finetune LLMs to teach them ANYTHING with Huggingface and Pytorch | Step-by-step tutorial\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/bZcKYiwtw1I?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">I trained a LoRA adapter on 200 examples of syllogism data with the smallest language model I could find \u2014 the HuggingFaceTB\/SmolLM-135M-Instruct, and it got us an accuracy of 46%. Roughly, this means that we generate a correct answer 46% of the time. More importantly, we often get the formatting right, so our regex can safely extract answers from the responses more often than not.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Some more optimizations for SLMs and practical considerations<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"wp-block-list-item\">Not all reasoning tasks can be solved by all models. 
<strong>An easy way to verify if a task is too hard or too easy for the model is to just check the base accuracy of the model on your task<\/strong>. If it is below, let&#8217;s say, 10-20%, the task is likely very hard, and you need additional supervised warmup fine-tuning.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>SFT, even on small datasets, can generally show massive accuracy gains on small models.<\/strong> If you can acquire a good dataset, you may not even need to do Reinforcement Learning in many scenarios. SLMs are immensely tunable.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Papers like <a href=\"https:\/\/arxiv.org\/abs\/2503.14476\">DAPO<\/a> and <a href=\"https:\/\/arxiv.org\/abs\/2503.20783\">Critical Perspectives on R1<\/a> have claimed that the original loss normalization from <a href=\"https:\/\/arxiv.org\/pdf\/2402.03300\">DeepSeek<\/a> has a <strong>length bias<\/strong>. They have proposed other normalization methods that are worth looking at. For my project, the regular DeepSeek loss just worked.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">DAPO also mentions <strong>removing the KLD term<\/strong> from the original R1 paper. Originally, the goal of this loss was to ensure that the updating policy never strays too far from the base policy, but DAPO suggests not using it because the behaviour of the policy can change drastically during reasoning, making the KLD term an unnecessary regularisation term that restricts the model&#8217;s intelligence.<\/li>\n\n\n\n<li class=\"wp-block-list-item\"><strong>Generating diverse responses IS KEY <\/strong>to making RL possible. If you only generated correct responses, or if you only generated incorrect responses, the advantage will be 0, and this will give the RL algorithm no training signal at all. 
We can generate diverse responses by increasing the <code>temperature<\/code>, <code>top_p<\/code>, and <code>num_return_sequences<\/code> parameters in the <code>generate()<\/code> call.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">You can also generate <strong>diverse rewards<\/strong> by adding more terms into the reward function. For example, a length reward that penalizes overly long reasoning.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">The following parameters increase the <strong>stability of training at the cost of more computation<\/strong>: increasing the number of generations per rollout, increasing the size of the buffer, and lowering the learning rate.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">Use <strong>gradient accumulation<\/strong> (or even gradient checkpointing) if you have limited resources to train these models.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">There is some fine print I skipped in this article related to <strong>padding<\/strong>. When saving experiences into the buffer, it&#8217;s best practice to remove the pad tokens altogether \u2014 and recreate them when loading a minibatch during training.<\/li>\n\n\n\n<li class=\"wp-block-list-item\">It is best to leave whitespace around &lt;think&gt; and &lt;answer&gt; (and their closing tags). This results in <strong>consistent tokenization<\/strong> and makes training slightly easier for the SLMs.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">5. 
Results<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Here is my YouTube video that explains everything in this blog post more pictorially and provides a hands-on tutorial on how to code such a thing.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-16-9 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"I trained my own Small Reasoning LMs using GRPO and Reinforcement Learning!\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube.com\/embed\/yGkJj_4bjpE?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">With RL on the supervised-fine-tuned SmolLM-135M, accuracy on the syllogism task got a bump to 60%! You can see the reward curve here \u2014 the healthy standard deviation of the rewards shows that we were indeed getting diverse responses throughout, which is essential if we want to train with RL.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/image-53-1024x576.png\" alt=\"\" class=\"wp-image-607813\"\/><figcaption class=\"wp-element-caption\">Rewards curve of the Syllogism task on SmolLM-135M after SFT (Illustration by Author)<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Here is a set of hyperparameters that worked well for me.<\/p>\n\n\n\n<pre class=\"wp-block-prismatic-blocks\"><code class=\"language-yaml\">config:\n  name: &quot;path\/to\/sft_model&quot;\n  max_new_tokens: 300 # reasoning + answer token budget\n  exploration_batchsize: 8  # number of questions per batch during rollout\n  G: 6  # num responses per group\n  temperature: 0.7\n  batch_size: 16  # minibatch size during training\n  
gradient_accumulation_steps: 12\n  learning_rate: 0.000001  # Advisable to keep this low, like 1e-6 or 1e-7\n  top_p: 0.95\n  buffer_size: 500\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I also repeated this experiment with larger models \u2014 the SmolLM-360M-Instruct and the Qwen3-0.6B model. With the latter, I was able to get accuracies up to 81%, which is awesome! We got a 20% additive bump on average in the syllogism task!<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the propositional logic task, which in my opinion is a harder reasoning task, I also saw similar gains across all small models! I am sure that with more instruction tuning and RL fine-tuning, possibly on multiple tasks at once, we can raise the intelligence of these models a lot higher. Training on a single task can generate quick results, which is what I wanted for this YouTube video, but it can also act as a bottleneck for the model&#8217;s overall intelligence.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s end this article with a GIF of the small models outputting reasoning data and solving tasks. 
Enjoy, and stay magnificent!<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/contributor.insightmediagroup.io\/wp-content\/uploads\/2025\/07\/results_gif1-1.gif\" alt=\"\" class=\"wp-image-607819\"\/><figcaption class=\"wp-element-caption\">SmolLM-135M after training on Propositional Logic Tasks (Source: Author)<\/figcaption><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Author&#8217;s YouTube channel<\/strong>: <a href=\"https:\/\/www.youtube.com\/@avb_fj\">https:\/\/www.youtube.com\/@avb_fj<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Author&#8217;s Patreon<\/strong>: <a href=\"https:\/\/www.patreon.com\/c\/NeuralBreakdownwithAVB\">www.patreon.com\/NeuralBreakdownwithAVB<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Author&#8217;s Twitter (X) account<\/strong>: <a href=\"https:\/\/x.com\/neural_avb\">https:\/\/x.com\/neural_avb<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Deepseek Math: https:\/\/arxiv.org\/pdf\/2402.03300 <br>DeepSeek R1: https:\/\/arxiv.org\/abs\/2501.12948 <br>DAPO: https:\/\/arxiv.org\/abs\/2503.14476 <br>Critical Perspectives on R1: https:\/\/arxiv.org\/abs\/2503.20783<br>Reasoning Gym Library: <a href=\"https:\/\/github.com\/open-thought\/reasoning-gym\">github.com\/open-thought\/reasoning-gym<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A good place to read about Reasoning: <a href=\"https:\/\/github.com\/willccbb\/verifiers\">https:\/\/github.com\/willccbb\/verifiers<\/a><br><br>A great place to study code: <a href=\"https:\/\/github.com\/huggingface\/trl\/blob\/main\/trl\/trainer\/grpo_trainer.py\">https:\/\/github.com\/huggingface\/trl\/blob\/main\/trl\/trainer\/grpo_trainer.py<\/a><br><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A visual tour and from-scratch guide to train GRPO reasoning models in 
PyTorch<\/p>\n","protected":false},"author":18,"featured_media":606532,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"is_member_only":false,"sub_heading":"A visual tour and from-scratch guide to train GRPO reasoning models in PyTorch","footnotes":""},"categories":[21],"tags":[468,445,32705,758,746],"sponsor":[],"coauthors":[30327],"class_list":["post-606531","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-large-language-models","tag-deep-dives","tag-deep-learning","tag-huggingface","tag-pytorch","tag-reinforcement-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How to Fine-Tune Small Language Models to Think with Reinforcement Learning | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How to Fine-Tune Small Language Models to Think with Reinforcement Learning | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"A visual tour and from-scratch guide to train GRPO reasoning models in PyTorch\" \/>\n<meta property=\"og:url\" content=\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" content=\"2025-07-09T01:59:20+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-07-09T01:59:32+00:00\" \/>\n<meta property=\"og:image\" 
content=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/GRPO4.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Avishek Biswas\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:site\" content=\"@TDataScience\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Avishek Biswas\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"19 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/\"},\"author\":{\"name\":\"TDS Editors\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\"},\"headline\":\"How to Fine-Tune Small Language Models to Think with Reinforcement Learning\",\"datePublished\":\"2025-07-09T01:59:20+00:00\",\"dateModified\":\"2025-07-09T01:59:32+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/\"},\"wordCount\":3766,\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/GRPO4.png\",\"keywords\":[\"Deep 
Dives\",\"Deep Learning\",\"Huggingface\",\"Pytorch\",\"Reinforcement Learning\"],\"articleSection\":[\"Large Language Models\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/\",\"url\":\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/\",\"name\":\"How to Fine-Tune Small Language Models to Think with Reinforcement Learning | Towards Data Science\",\"isPartOf\":{\"@id\":\"https:\/\/towardsdatascience.com\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/GRPO4.png\",\"datePublished\":\"2025-07-09T01:59:20+00:00\",\"dateModified\":\"2025-07-09T01:59:32+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#primaryimage\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/GRPO4.png\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/GRPO4.png\",\"width\":1920,\"height\":1080,\"caption\":\"GRPO 
Algorithm\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/towardsdatascience.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How to Fine-Tune Small Language Models to Think with Reinforcement Learning\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/towardsdatascience.com\/#website\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"name\":\"Towards Data Science\",\"description\":\"Publish AI, ML &amp; data-science insights to a global community of data professionals.\",\"publisher\":{\"@id\":\"https:\/\/towardsdatascience.com\/#organization\"},\"alternateName\":\"TDS\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/towardsdatascience.com\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/towardsdatascience.com\/#organization\",\"name\":\"Towards Data Science\",\"alternateName\":\"TDS\",\"url\":\"https:\/\/towardsdatascience.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"contentUrl\":\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg\",\"width\":696,\"height\":696,\"caption\":\"Towards Data 
Science\"},\"image\":{\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/TDataScience\",\"https:\/\/www.youtube.com\/c\/TowardsDataScience\",\"https:\/\/www.linkedin.com\/company\/towards-data-science\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee\",\"name\":\"TDS Editors\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g\",\"caption\":\"TDS Editors\"},\"description\":\"Building a vibrant data science and machine learning community. Share your insights and projects with our global audience: bit.ly\/write-for-tds\",\"url\":\"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"How to Fine-Tune Small Language Models to Think with Reinforcement Learning | Towards Data Science","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/","og_locale":"en_US","og_type":"article","og_title":"How to Fine-Tune Small Language Models to Think with Reinforcement Learning | Towards Data Science","og_description":"A visual tour and from-scratch guide to train GRPO reasoning models in PyTorch","og_url":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/","og_site_name":"Towards Data Science","article_published_time":"2025-07-09T01:59:20+00:00","article_modified_time":"2025-07-09T01:59:32+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/GRPO4.png","type":"image\/png"}],"author":"Avishek Biswas","twitter_card":"summary_large_image","twitter_creator":"@TDataScience","twitter_site":"@TDataScience","twitter_misc":{"Written by":"Avishek Biswas","Est. 
reading time":"19 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#article","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/"},"author":{"name":"TDS Editors","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee"},"headline":"How to Fine-Tune Small Language Models to Think with Reinforcement Learning","datePublished":"2025-07-09T01:59:20+00:00","dateModified":"2025-07-09T01:59:32+00:00","mainEntityOfPage":{"@id":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/"},"wordCount":3766,"publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"image":{"@id":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/GRPO4.png","keywords":["Deep Dives","Deep Learning","Huggingface","Pytorch","Reinforcement Learning"],"articleSection":["Large Language Models"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/","url":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/","name":"How to Fine-Tune Small Language Models to Think with Reinforcement Learning | Towards Data 
Science","isPartOf":{"@id":"https:\/\/towardsdatascience.com\/#website"},"primaryImageOfPage":{"@id":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#primaryimage"},"image":{"@id":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#primaryimage"},"thumbnailUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/GRPO4.png","datePublished":"2025-07-09T01:59:20+00:00","dateModified":"2025-07-09T01:59:32+00:00","breadcrumb":{"@id":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#primaryimage","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/GRPO4.png","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/07\/GRPO4.png","width":1920,"height":1080,"caption":"GRPO Algorithm"},{"@type":"BreadcrumbList","@id":"https:\/\/towardsdatascience.com\/how-to-finetune-small-language-models-to-think-with-reinforcement-learning\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/towardsdatascience.com\/"},{"@type":"ListItem","position":2,"name":"How to Fine-Tune Small Language Models to Think with Reinforcement Learning"}]},{"@type":"WebSite","@id":"https:\/\/towardsdatascience.com\/#website","url":"https:\/\/towardsdatascience.com\/","name":"Towards Data Science","description":"Publish AI, ML &amp; data-science insights to a global community of data 
professionals.","publisher":{"@id":"https:\/\/towardsdatascience.com\/#organization"},"alternateName":"TDS","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/towardsdatascience.com\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/towardsdatascience.com\/#organization","name":"Towards Data Science","alternateName":"TDS","url":"https:\/\/towardsdatascience.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/","url":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","contentUrl":"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2025\/02\/tds-logo.jpg","width":696,"height":696,"caption":"Towards Data Science"},"image":{"@id":"https:\/\/towardsdatascience.com\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/TDataScience","https:\/\/www.youtube.com\/c\/TowardsDataScience","https:\/\/www.linkedin.com\/company\/towards-data-science\/"]},{"@type":"Person","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/f9925d336b6fe962b03ad8281d90b8ee","name":"TDS Editors","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/towardsdatascience.com\/#\/schema\/person\/image\/23494c9101089ad44ae88ce9d2f56aac","url":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/?s=96&d=mm&r=g","caption":"TDS Editors"},"description":"Building a vibrant data science and machine learning community. 
Share your insights and projects with our global audience: bit.ly\/write-for-tds","url":"https:\/\/towardsdatascience.com\/author\/towardsdatascience\/"}]}},"distributor_meta":false,"distributor_terms":false,"distributor_media":false,"distributor_original_site_name":"TDS Contributor Portal","distributor_original_site_url":"https:\/\/contributor.insightmediagroup.io","push-errors":false,"_links":{"self":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/606531","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/comments?post=606531"}],"version-history":[{"count":0,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/posts\/606531\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media\/606532"}],"wp:attachment":[{"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/media?parent=606531"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/categories?post=606531"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/tags?post=606531"},{"taxonomy":"sponsor","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/sponsor?post=606531"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/towardsdatascience.com\/wp-json\/wp\/v2\/coauthors?post=606531"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}