{"id":13465,"date":"2024-05-31T06:47:27","date_gmt":"2024-05-31T06:47:27","guid":{"rendered":"https:\/\/towardsdatascience.com\/deep-dive-into-anthropics-sparse-autoencoders-by-hand-%ef%b8%8f-eebe0ef59709\/"},"modified":"2025-01-13T18:38:53","modified_gmt":"2025-01-13T18:38:53","slug":"deep-dive-into-anthropics-sparse-autoencoders-by-hand-%ef%b8%8f-eebe0ef59709","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/deep-dive-into-anthropics-sparse-autoencoders-by-hand-%ef%b8%8f-eebe0ef59709\/","title":{"rendered":"Deep Dive into Anthropic&#8217;s Sparse Autoencoders by Hand \u270d\ufe0f"},"content":{"rendered":"<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"d0b6ae\" data-has-transparency=\"true\" style=\"--dominant-color: #d0b6ae;\" loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"720\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1cfu1t2WCqje4wF51OOOxww.png\" alt=\"Image by author (Zephyra, the protector of Lumaria by my 4-year old)\" class=\"wp-image-355987 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1cfu1t2WCqje4wF51OOOxww.png 1280w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1cfu1t2WCqje4wF51OOOxww-300x169.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1cfu1t2WCqje4wF51OOOxww-1024x576.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1cfu1t2WCqje4wF51OOOxww-768x432.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><figcaption class=\"wp-element-caption\">Image by author (Zephyra, the protector of Lumaria by my 4-year old)<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\"><em>&quot;In the mystical lands of Lumaria, where ancient magic filled the air, lived Zephyra, the Ethereal Griffin. 
With the body of a lion and the wings of an eagle, Zephyra was the revered protector of the Codex of Truths, an ancient script holding the universe&#8217;s secrets.<\/em><\/p>\n<p class=\"wp-block-paragraph\"><em>Nestled in a sacred cave, the Codex was safeguarded by Zephyra&#8217;s viridescent eyes, which could see through deception to unveil pure truths. One day, a dark sorcerer descended on the lands of Lumaria and sought to shroud the world in ignorance by concealing the Codex. The villagers called upon Zephyra, who soared through the skies as a beacon of hope. With a majestic sweep of her wings, Zephyra created a protective barrier of light around the grove, repelling the sorcerer and exposing the truths.<\/em><\/p>\n<p class=\"wp-block-paragraph\"><em>After a long duel, it was concluded that the dark sorcerer was no match for Zephyra&#8217;s light. Through her courage and vigilance, the true light kept shining over Lumaria. And as time went by, Lumaria was guided to prosperity under Zephyra&#8217;s protection, and its path stayed illuminated by the truths Zephyra safeguarded. And this is how Zephyra&#8217;s legend lived on!&quot;<\/em><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h2 class=\"wp-block-heading\">Anthropic&#8217;s journey &#8216;towards extracting interpretable features&#8217;<\/h2>\n<p class=\"wp-block-paragraph\">Following the story of Zephyra, Anthropic AI embarked on an expedition to extract meaningful features from a model. 
The idea behind this investigation lies in understanding how different components in a neural network interact with one another and what role each component plays.<\/p>\n<p class=\"wp-block-paragraph\">According to the paper <strong>&quot;<a href=\"https:\/\/transformer-circuits.pub\/2023\/monosemantic-features\/index.html\">Towards Monosemanticity: Decomposing Language Models With Dictionary Learning<\/a>&quot;<\/strong>, a Sparse Autoencoder is able to successfully extract meaningful features from a model. In other words, Sparse Autoencoders help break down the problem of &#8216;polysemanticity&#8217; &#8211; neural activations that correspond to several meanings\/interpretations at once &#8211; by focusing on sparsely activating features that hold a single interpretation, i.e. are more one-directional.<\/p>\n<p class=\"wp-block-paragraph\">To understand how all of it is done, we have these beautiful handiworks on <a href=\"https:\/\/lnkd.in\/g2rM9iV2\">Autoencoders<\/a> and <a href=\"https:\/\/www.linkedin.com\/posts\/tom-yeh_claude-autoencoder-aibyhand-activity-7199774212759183362-msKU\/?\">Sparse Autoencoders<\/a> by Prof. <a href=\"https:\/\/www.linkedin.com\/in\/tom-yeh\/\">Tom Yeh<\/a> that explain the behind-the-scenes workings of these phenomenal mechanisms.<\/p>\n<p class=\"wp-block-paragraph\">(All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn posts, which I have edited with his permission.)<\/p>\n<p class=\"wp-block-paragraph\">To begin, let us first explore what an Autoencoder is and how it works.<\/p>\n<h2 class=\"wp-block-heading\">What is an Autoencoder?<\/h2>\n<p class=\"wp-block-paragraph\">Imagine a writer whose desk is strewn with different papers &#8211; some are his notes for the story he is writing, some are copies of final drafts, and some are illustrations for his action-packed story. 
Now amidst this chaos, it is hard to find the important parts &#8211; more so when the writer is in a hurry and the publisher is on the phone demanding a book in two days. Thankfully, the writer has a very efficient assistant &#8211; this assistant makes sure the cluttered desk is cleaned regularly, grouping similar items, organizing and putting things into their right place. And as and when needed, the assistant would retrieve the correct items for the writer, helping him meet the deadlines set by his publisher.<\/p>\n<p class=\"wp-block-paragraph\">Well, the name of this assistant is Autoencoder. It mainly has two functions &#8211; encoding and decoding. Encoding refers to condensing input data and extracting the essential features (organization). Decoding is the process of reconstructing original data from encoded representation while aiming to minimize information loss (retrieval).<\/p>\n<p class=\"wp-block-paragraph\">Now let&#8217;s look at how this assistant works.<\/p>\n<h2 class=\"wp-block-heading\">How does an Autoencoder Work?<\/h2>\n<p class=\"wp-block-paragraph\">Given : Four training examples <strong>X1, X2, X3, X4.<\/strong><\/p>\n<h3 class=\"wp-block-heading\">[1] Auto<\/h3>\n<p class=\"wp-block-paragraph\">The first step is to copy the training examples to targets <strong>Y&#8217;<\/strong>. The Autoencoder&#8217;s work is to reconstruct these training examples. Since the targets are the training examples themselves, the word <em><strong>&#8216;Auto&#8217;<\/strong><\/em> is used which is Greek for <em><strong>&#8216;self&#8217;<\/strong><\/em>.<\/p>\n<h3 class=\"wp-block-heading\">[2] Encoder : Layer 1 +ReLU<\/h3>\n<p class=\"wp-block-paragraph\">As we have seen in all our previous models, a simple weight and bias matrix coupled with ReLU is powerful and is able to do wonders. 
Thus, by using the first Encoding layer we reduce the size of the original feature set from 4&#215;4 to 3&#215;4.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"ede9e5\" data-has-transparency=\"false\" style=\"--dominant-color: #ede9e5;\" loading=\"lazy\" decoding=\"async\" width=\"944\" height=\"550\" class=\"wp-image-399493 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1INIu2VmAyBnQHRLUc_pY-g.gif\" alt=\"\" \/><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">A quick recap:<\/p>\n<p class=\"wp-block-paragraph\"><strong>Linear transformation<\/strong> : The input embedding vector is multiplied by the weight matrix W and then added with the bias vector <strong>b<\/strong>,<\/p>\n<p class=\"wp-block-paragraph\">z = <strong>W<\/strong>x+<strong>b<\/strong>, where <strong>W<\/strong> is the weight matrix, x is our word embedding and <strong>b<\/strong> is the bias vector.<\/p>\n<p class=\"wp-block-paragraph\"><strong>ReLU activation function<\/strong> : Next, we apply the ReLU to this intermediate z.<\/p>\n<p class=\"wp-block-paragraph\">ReLU returns the element-wise maximum of the input and zero. Mathematically, <strong>h<\/strong> = max{0,z}.<\/p>\n<\/blockquote>\n<h3 class=\"wp-block-heading\">[3] Encoder : Layer 2 + ReLU<\/h3>\n<p class=\"wp-block-paragraph\">The output of the previous layer is processed by the second Encoder layer which reduces the input size further to 2&#215;3. This is where the extraction of relevant features occurs. 
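<\/p>\n<p class=\"wp-block-paragraph\">The two encoder steps can be sketched in a few lines of NumPy. This is a minimal illustration with randomly initialized weights; the shapes follow the walkthrough (a 4&#215;4 input compressed to 3&#215;4, then to 2 features per example), not the exercise&#8217;s actual values:<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Four 4-dimensional training examples, one per column (a 4x4 matrix)
X = rng.normal(size=(4, 4))

# Encoder layer 1: compress 4 features down to 3, then apply ReLU
W1 = rng.normal(size=(3, 4))
b1 = rng.normal(size=(3, 1))
h1 = np.maximum(0, W1 @ X + b1)   # shape (3, 4)

# Encoder layer 2 (the bottleneck): compress 3 features down to 2, then ReLU
W2 = rng.normal(size=(2, 3))
b2 = rng.normal(size=(2, 1))
h2 = np.maximum(0, W2 @ h1 + b2)  # shape (2, 4)
```

<p class=\"wp-block-paragraph\">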
This layer is also called the &#8216;bottleneck&#8217; since its outputs have far fewer features than the inputs.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"eee6e2\" data-has-transparency=\"false\" style=\"--dominant-color: #eee6e2;\" loading=\"lazy\" decoding=\"async\" width=\"952\" height=\"550\" class=\"wp-image-399494 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/10UBKNLacq0ZOXF-f9Tzvzg.gif\" alt=\"\" \/><\/figure>\n<h3 class=\"wp-block-heading\">[4] Decoder : Layer 1 + ReLU<\/h3>\n<p class=\"wp-block-paragraph\">Once the encoding process is complete, the next step is to decode the relevant features to build &#8216;back&#8217; the final output. To do so, we multiply the features from the last step with corresponding weights and biases and apply the ReLU layer. The result is a 3&#215;4 matrix.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"ebe5df\" data-has-transparency=\"false\" style=\"--dominant-color: #ebe5df;\" loading=\"lazy\" decoding=\"async\" width=\"978\" height=\"904\" class=\"wp-image-399495 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1yCWisBAtVJ35IZB164Vvew.gif\" alt=\"\" \/><\/figure>\n<h3 class=\"wp-block-heading\">[5] Decoder : Layer 2 + ReLU<\/h3>\n<p class=\"wp-block-paragraph\">A second Decoder layer (weights, biases + ReLU) is applied to the previous output to give the final result, which is the reconstructed 4&#215;4 matrix. 
We do so to get back to the original dimension so that we can compare the result with our original targets.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e8e3df\" data-has-transparency=\"false\" style=\"--dominant-color: #e8e3df;\" loading=\"lazy\" decoding=\"async\" width=\"960\" height=\"904\" class=\"wp-image-399496 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1jUXnoKZk1kQP3MDUA9SLtA.gif\" alt=\"\" \/><\/figure>\n<h3 class=\"wp-block-heading\">[6] Loss Gradients &amp; BackPropagation<\/h3>\n<p class=\"wp-block-paragraph\">Once the output from the decoder layer is obtained, we calculate the gradients of the Mean Square Error (MSE) between the <strong>outputs (Y)<\/strong> and the <strong>targets (Y&#8217;)<\/strong>. To do so, we find <strong>2(Y-Y&#8217;)<\/strong>, which gives us the final gradients that drive the backpropagation process and update the weights and biases accordingly.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e8e0de\" data-has-transparency=\"false\" style=\"--dominant-color: #e8e0de;\" loading=\"lazy\" decoding=\"async\" width=\"998\" height=\"1300\" class=\"wp-image-399498 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1R_qDdXzetVZZJ8oKaEVeig.gif\" alt=\"\" \/><\/figure>\n<p class=\"wp-block-paragraph\">Now that we understand how the Autoencoder works, it&#8217;s time to explore how its <strong>sparse variation<\/strong> is able to achieve interpretability for large language models (LLMs).<\/p>\n<h2 class=\"wp-block-heading\">Sparse Autoencoder &#8211; How does it work?<\/h2>\n<p class=\"wp-block-paragraph\">To start with, suppose we are given:<\/p>\n<ul class=\"wp-block-list\">\n<li>The output of a transformer after the feed-forward layer has processed it, i.e. let us assume we have the model activations for five tokens (X). 
They are good but they do not shed light on how the model arrives at its decision or makes the predictions.<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e7e4e0\" data-has-transparency=\"true\" style=\"--dominant-color: #e7e4e0;\" loading=\"lazy\" decoding=\"async\" width=\"804\" height=\"828\" class=\"wp-image-399500 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1oBQzrK_vU9FJRTaHkQOd7A.png\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1oBQzrK_vU9FJRTaHkQOd7A.png 804w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1oBQzrK_vU9FJRTaHkQOd7A-291x300.png 291w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1oBQzrK_vU9FJRTaHkQOd7A-768x791.png 768w\" sizes=\"auto, (max-width: 804px) 100vw, 804px\" \/><\/figure>\n<p class=\"wp-block-paragraph\">The prime question here is:<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Is it possible to map each activation (3D) to a higher-dimension space (6D) that will help with the understanding?<\/p><\/blockquote>\n<h3 class=\"wp-block-heading\">[1] Encoder : Linear Layer<\/h3>\n<p class=\"wp-block-paragraph\">The first step in the Encoder layer is to multiply the input <strong>X<\/strong> with encoder weights and add biases (as done in the first step of an Autoencoder).<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e2d9d9\" data-has-transparency=\"false\" style=\"--dominant-color: #e2d9d9;\" loading=\"lazy\" decoding=\"async\" width=\"772\" height=\"436\" class=\"wp-image-399502 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1xhgfTmOD7ZowFVtSBHPn8Q.gif\" alt=\"\" \/><\/figure>\n<h3 class=\"wp-block-heading\">[2] Encoder : ReLU<\/h3>\n<p class=\"wp-block-paragraph\">The next sub-step is to apply the ReLU activation function to add non-linearity and suppress negative 
activations. This suppression leads to many features being set to 0, which enforces sparsity &#8211; outputting sparse and interpretable features <em><strong>f.<\/strong><\/em><\/p>\n<p class=\"wp-block-paragraph\">Interpretability happens when we have only one or two positive features. If we examine <em><strong>f6<\/strong><\/em>, we can see <strong>X2<\/strong> and <strong>X3<\/strong> are positive, and we may say that both have &#8216;Mountain&#8217; in common.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e4d3d0\" data-has-transparency=\"false\" style=\"--dominant-color: #e4d3d0;\" loading=\"lazy\" decoding=\"async\" width=\"772\" height=\"444\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1L5PxylZCTjdNULt4gjt7oQ.gif\" alt=\"\" class=\"wp-image-356006 not-transparent\" \/><\/figure>\n<h3 class=\"wp-block-heading\">[3] Decoder : Reconstruction<\/h3>\n<p class=\"wp-block-paragraph\">Once we are done with the encoder, we proceed to the decoder step. We multiply <em><strong>f<\/strong><\/em> with decoder weights and add biases. This outputs <strong>X&#8217;<\/strong>, which is the reconstruction of <strong>X<\/strong> from interpretable features.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"ded5d4\" data-has-transparency=\"false\" style=\"--dominant-color: #ded5d4;\" loading=\"lazy\" decoding=\"async\" width=\"760\" height=\"816\" class=\"wp-image-399503 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/13sZJcXiZSQVr41YSTA33RA.gif\" alt=\"\" \/><\/figure>\n<p class=\"wp-block-paragraph\">As done in an Autoencoder, we want <strong>X&#8217;<\/strong> to be as close to <strong>X<\/strong> as possible. To ensure that, further training is essential.<\/p>\n<h3 class=\"wp-block-heading\">[4] Decoder : Weights<\/h3>\n<p class=\"wp-block-paragraph\">As an intermediary step, we compute the L2 norm of each of the decoder weight vectors. 
We keep them aside to be used later.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e5d8d8\" data-has-transparency=\"false\" style=\"--dominant-color: #e5d8d8;\" loading=\"lazy\" decoding=\"async\" width=\"412\" height=\"400\" class=\"wp-image-399504 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1k3zIB0kEP1sewwORw08FOQ.gif\" alt=\"\" \/><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>L2-norm<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Also known as the Euclidean norm, the L2-norm calculates the magnitude of a vector using the formula: ||x||\u2082 = \u221a(\u03a3\u1d62 x\u1d62\u00b2).<\/p>\n<p class=\"wp-block-paragraph\">In other words, it sums the squares of each component and then takes the square root of the result. This norm provides a straightforward way to quantify the length or distance of a vector in Euclidean space.<\/p>\n<\/blockquote>\n<h2 class=\"wp-block-heading\">Training<\/h2>\n<p class=\"wp-block-paragraph\">As mentioned earlier, a Sparse Autoencoder requires extensive training to get the reconstructed <strong>X&#8217;<\/strong> closer to <strong>X<\/strong>. To illustrate that, we proceed to the next steps below:<\/p>\n<h3 class=\"wp-block-heading\">[5] Sparsity : L1 Loss<\/h3>\n<p class=\"wp-block-paragraph\">The goal here is to obtain as many values at or close to zero as possible. 
We do so by invoking <strong>L1 sparsity<\/strong> to penalize the absolute values of the weights &#8211; the core idea being that we want to make the sum as small as possible.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e4e1e0\" data-has-transparency=\"false\" style=\"--dominant-color: #e4e1e0;\" loading=\"lazy\" decoding=\"async\" width=\"540\" height=\"630\" class=\"wp-image-399505 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1RYebksXA-6kfOCWkZxRiA.gif\" alt=\"\" \/><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>L1-loss<\/strong><\/p>\n<p class=\"wp-block-paragraph\">The L1-loss is calculated as the sum of the absolute values of the weights: L1 = \u03bb\u03a3|w|, where \u03bb is a regularization parameter.<\/p>\n<p class=\"wp-block-paragraph\">This encourages many weights to become zero, simplifying the model and thus enhancing <strong>interpretability<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">In other words, L1 helps focus on the most relevant features while also preventing overfitting, improving model generalization, and reducing computational complexity.<\/p>\n<\/blockquote>\n<h3 class=\"wp-block-heading\">[6] Sparsity : Gradient<\/h3>\n<p class=\"wp-block-paragraph\">The next step is to calculate <strong>L1<\/strong>&#8217;s gradient, which is -1 for positive values. 
Thus, for all values of <em><strong>f &gt;0<\/strong><\/em> , the result will be set to -1.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"dcd1d0\" data-has-transparency=\"false\" style=\"--dominant-color: #dcd1d0;\" loading=\"lazy\" decoding=\"async\" width=\"892\" height=\"416\" class=\"wp-image-399506 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/11WXXLP5p7zYyBK2T22CbcA.gif\" alt=\"\" \/><\/figure>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>How does L1 penalty push weights towards zero?<\/strong><\/p>\n<p class=\"wp-block-paragraph\">The gradient of the L1 penalty pushes weights towards zero through a process that applies a constant force, regardless of the weight&#8217;s current value. Here&#8217;s how it works (all images in this sub-section are by author):<\/p>\n<p class=\"wp-block-paragraph\">The L1 penalty is expressed as:<\/p>\n<\/blockquote>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"ececec\" data-has-transparency=\"true\" style=\"--dominant-color: #ececec;\" loading=\"lazy\" decoding=\"async\" width=\"146\" height=\"74\" class=\"wp-image-399508 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/166I49jyCjN_cx6Au7lbJMQ.png\" alt=\"\" \/><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The gradient of this penalty with respect to a weight <em><strong>w<\/strong><\/em> is:<\/p><\/blockquote>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f0f0f0\" data-has-transparency=\"false\" style=\"--dominant-color: #f0f0f0;\" loading=\"lazy\" decoding=\"async\" width=\"410\" height=\"92\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1wkHXsIBVbbIp0sUgknjQVg.png\" alt=\"\" 
class=\"wp-image-356019 not-transparent\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1wkHXsIBVbbIp0sUgknjQVg.png 410w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1wkHXsIBVbbIp0sUgknjQVg-300x67.png 300w\" sizes=\"auto, (max-width: 410px) 100vw, 410px\" \/><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>where <em><strong>sign(w)<\/strong><\/em> is:<\/p><\/blockquote>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f4f4f4\" data-has-transparency=\"true\" style=\"--dominant-color: #f4f4f4;\" loading=\"lazy\" decoding=\"async\" width=\"230\" height=\"222\" class=\"wp-image-399509 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1rLboil1lraGSYN0xLl0TUQ.png\" alt=\"\" \/><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>During gradient descent, the update rule for weights is:<\/p><\/blockquote>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f1f1f1\" data-has-transparency=\"false\" style=\"--dominant-color: #f1f1f1;\" loading=\"lazy\" decoding=\"async\" width=\"584\" height=\"166\" class=\"wp-image-399510 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1onjEcsGwhQkOc5YPUCucog.png\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1onjEcsGwhQkOc5YPUCucog.png 584w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1onjEcsGwhQkOc5YPUCucog-300x85.png 300w\" sizes=\"auto, (max-width: 584px) 100vw, 584px\" \/><\/figure>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">where \ud835\udfb0 is the learning rate.<\/p>\n<p class=\"wp-block-paragraph\">The <strong>constant subtraction (or addition)<\/strong> of <strong>\u03bb<\/strong> from the weight value (depending on its sign) decreases the 
absolute value of the weight. If the weight is small enough, this process can drive it to exactly zero.<\/p>\n<\/blockquote>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n<h3 class=\"wp-block-heading\">[7] Sparsity : Zero<\/h3>\n<p class=\"wp-block-paragraph\">All values that are already zero are kept unchanged.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"dad0d0\" data-has-transparency=\"false\" style=\"--dominant-color: #dad0d0;\" loading=\"lazy\" decoding=\"async\" width=\"846\" height=\"426\" class=\"wp-image-399511 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1gtDUWgJ11gs1bh77CEt-Qw.gif\" alt=\"\" \/><\/figure>\n<h3 class=\"wp-block-heading\">[8] Sparsity : Weight<\/h3>\n<p class=\"wp-block-paragraph\">We multiply each row of the gradient matrix obtained in Step 6 by the corresponding decoder weights obtained in Step 4. This step is crucial as it prevents the model from learning large weights, which would add incorrect information while reconstructing the results.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"ded9d8\" data-has-transparency=\"false\" style=\"--dominant-color: #ded9d8;\" loading=\"lazy\" decoding=\"async\" width=\"552\" height=\"646\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/1kM4XIHlPsa7su69XV11H7Q.gif\" alt=\"\" class=\"wp-image-356026 not-transparent\" \/><\/figure>\n<h3 class=\"wp-block-heading\">[9] Reconstruction : MSE Loss<\/h3>\n<p class=\"wp-block-paragraph\">We use the Mean Square Error, or the <strong>L2<\/strong> loss function, to calculate the difference between <strong>X&#8217;<\/strong> and <strong>X<\/strong>. 
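<\/p>\n<p class=\"wp-block-paragraph\">As a quick sketch with toy values of my own (not the exercise&#8217;s numbers), the reconstruction loss and the gradient it produces look like this:<\/p>

```python
import numpy as np

# Toy values: X is the original input, Xr is the decoder's reconstruction X'
X  = np.array([[1.0, 0.0], [0.5, 2.0]])
Xr = np.array([[0.8, 0.1], [0.4, 1.5]])

mse  = np.mean((Xr - X) ** 2)  # the MSE / L2 reconstruction loss
grad = 2 * (Xr - X)            # its gradient, used for backpropagation
```

<p class=\"wp-block-paragraph\">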
The goal, as seen previously, is to minimize this error.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"d9dadf\" data-has-transparency=\"false\" style=\"--dominant-color: #d9dadf;\" loading=\"lazy\" decoding=\"async\" width=\"996\" height=\"306\" class=\"wp-image-399512 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/13bOB5l-c-cXhrtX89Fk0AA.gif\" alt=\"\" \/><\/figure>\n<h3 class=\"wp-block-heading\">[10] Reconstruction : Gradient<\/h3>\n<p class=\"wp-block-paragraph\">The gradient of the <strong>L2<\/strong> loss is <strong>2(X&#8217;-X)<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">Hence, as with the original Autoencoder, we run backpropagation to update the weights and the biases. The catch here is finding a good balance between sparsity and reconstruction.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e7e4e3\" data-has-transparency=\"false\" style=\"--dominant-color: #e7e4e3;\" loading=\"lazy\" decoding=\"async\" width=\"502\" height=\"440\" class=\"wp-image-399513 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/13MDwExTyz4ImSJX2GMzHkA.gif\" alt=\"\" \/><\/figure>\n<p class=\"wp-block-paragraph\">And with this, we come to the end of this very clever and intuitive way of learning how a model understands an idea and the direction it takes to generate a response.<\/p>\n<h3 class=\"wp-block-heading\">To summarize:<\/h3>\n<ol class=\"wp-block-list\">\n<li>Overall, an <strong>Autoencoder<\/strong> consists of two parts: <strong>Encoder<\/strong> and <strong>Decoder<\/strong>. The <strong>Encoder<\/strong> uses weights and biases coupled with the ReLU activation function to compress the initial input features into a lower dimension, trying to capture only the relevant parts. 
The <strong>Decoder<\/strong>, on the other hand, takes the output of the Encoder and works to reconstruct the input features back to their original state. The targets in an Autoencoder are the initial features themselves &#8211; hence the use of the word &#8216;auto&#8217;. The aim, as for standard neural networks, is to achieve the lowest error (difference) between the targets and the reconstructed outputs &#8211; and it is achieved by propagating the gradient of the error through the network while updating the weights and biases.<\/li>\n<li>A <strong>Sparse Autoencoder<\/strong> consists of all the components of a standard Autoencoder, along with a few additions. The key here is the different approach in the training step. Since the aim here is to retrieve the interpretable features, we want to zero out those values which hold relatively less meaning. Once the encoder uses ReLU to suppress the negative values, we go a step further and use the L1-Loss on the result to encourage sparsity by penalizing the absolute values of the weights. This is achieved by adding a penalty term to the loss function, which is the sum of the absolute values of the weights: \u03bb\u03a3|w|. The weights that remain non-zero are those that are crucial for the model&#8217;s performance.<\/li>\n<\/ol>\n<h2 class=\"wp-block-heading\">Extracting Interpretable features using Sparsity<\/h2>\n<p class=\"wp-block-paragraph\">As humans, our brains activate only a small subset of neurons in response to specific stimuli. Likewise, Sparse Autoencoders learn a sparse representation of the input by leveraging sparsity constraints like <strong>L1<\/strong> regularization. By doing so, a Sparse Autoencoder is able to extract interpretable features from complex data, thus enhancing the simplicity and interpretability of the learned features. 
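<\/p>\n<p class=\"wp-block-paragraph\">Putting the pieces together, one forward pass and loss computation of a sparse autoencoder can be sketched as below. The shapes (3-d activations for five tokens mapped to a 6-d feature space), the random weights, and the sparsity coefficient <em>lam<\/em> are illustrative assumptions, not Anthropic&#8217;s actual setup; following the walkthrough, the L1 penalty here is applied to the feature activations <em><strong>f<\/strong><\/em>:<\/p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Model activations for five tokens, each 3-dimensional (one token per column)
X = rng.normal(size=(3, 5))

# Encoder maps each 3-d activation into a higher, 6-d feature space
W_enc = rng.normal(size=(6, 3)) * 0.1
b_enc = np.zeros((6, 1))

# Decoder maps the 6-d features back down to 3-d reconstructions
W_dec = rng.normal(size=(3, 6)) * 0.1
b_dec = np.zeros((3, 1))

lam = 0.01  # hypothetical sparsity coefficient (lambda)

f     = np.maximum(0, W_enc @ X + b_enc)  # sparse, interpretable features
X_rec = W_dec @ f + b_dec                 # reconstruction X'

# Total loss balances faithful reconstruction against sparsity of f
loss = np.mean((X_rec - X) ** 2) + lam * np.abs(f).sum()
```

<p class=\"wp-block-paragraph\">Training then backpropagates through this combined loss; the relative weight of the two terms is exactly the balance between reconstruction quality and sparsity discussed above.<\/p>\n<p class=\"wp-block-paragraph\">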
This selective activation, mirroring biological neural processes, helps the model focus on the most relevant aspects of the input data, making it more robust and efficient.<\/p>\n<p class=\"wp-block-paragraph\">Anthropic&#8217;s endeavor to understand interpretability in AI models highlights the need for transparent and understandable AI systems, especially as they become more integrated into critical decision-making processes. By focusing on creating models that are both powerful and interpretable, Anthropic contributes to the development of AI that can be trusted and effectively utilized in real-world applications.<\/p>\n<p class=\"wp-block-paragraph\">In conclusion, <strong>Sparse Autoencoders<\/strong> are vital for extracting interpretable features, enhancing model robustness, and ensuring efficiency. The ongoing work on understanding these powerful models and how they make inferences underscores the growing importance of interpretability in AI, paving the way for more transparent AI systems. It remains to be seen how these concepts evolve and drive us towards a future of safe AI integration in our lives!<\/p>\n<p class=\"wp-block-paragraph\"><em>P.S. 
If you would like to work through this exercise on your own, here is a link to a blank template for your use.<\/em><\/p>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/drive.google.com\/file\/d\/1xiAjdlWCAzhj-I-YOb7wSMeroUOQzdlE\/view?usp=sharing\">Blank Template for hand-exercise<\/a><\/p>\n<p class=\"wp-block-paragraph\">Now go have fun and help Zephyra keep the Codex of Truths safe!<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"d7bea8\" data-has-transparency=\"true\" style=\"--dominant-color: #d7bea8;\" loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"720\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/12DMsbjmpne5weITU2ACcEg.png\" alt=\"\" class=\"wp-image-13466 has-transparency\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/12DMsbjmpne5weITU2ACcEg.png 1280w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/12DMsbjmpne5weITU2ACcEg-300x169.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/12DMsbjmpne5weITU2ACcEg-1024x576.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/05\/12DMsbjmpne5weITU2ACcEg-768x432.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><\/figure>\n<p class=\"wp-block-paragraph\"><em>Once again, special thanks to <a href=\"https:\/\/www.linkedin.com\/in\/tom-yeh\/\">Prof. Tom Yeh<\/a> for supporting this work!<\/em><\/p>\n<h3 class=\"wp-block-heading\">References:<\/h3>\n<p class=\"wp-block-paragraph\">[1] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Bricken et al. Oct 2023 <a href=\"https:\/\/transformer-circuits.pub\/2023\/monosemantic-features\/index.html\">https:\/\/transformer-circuits.pub\/2023\/monosemantic-features\/index.html<\/a><\/p>\n<p class=\"wp-block-paragraph\">[2] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Templeton et al. 
May 2024. <a href=\"https:\/\/transformer-circuits.pub\/2024\/scaling-monosemanticity\/\">https:\/\/transformer-circuits.pub\/2024\/scaling-monosemanticity\/<\/a><\/p>