{"id":5152,"date":"2024-04-12T05:18:36","date_gmt":"2024-04-12T05:18:36","guid":{"rendered":"https:\/\/towardsdatascience.com\/deep-dive-into-transformers-by-hand-%ef%b8%8e-68b8be4bd813\/"},"modified":"2025-01-08T19:13:51","modified_gmt":"2025-01-08T19:13:51","slug":"deep-dive-into-transformers-by-hand-%ef%b8%8e-68b8be4bd813","status":"publish","type":"post","link":"https:\/\/towardsdatascience.com\/deep-dive-into-transformers-by-hand-%ef%b8%8e-68b8be4bd813\/","title":{"rendered":"Deep Dive into Transformers by Hand \u270d\ufe0e"},"content":{"rendered":"<p class=\"wp-block-paragraph\">There has been a new development in our neighborhood.<\/p>\n<p class=\"wp-block-paragraph\">A &#8216;Robo-Truck,&#8217; as my son likes to call it, has made its new home on our street.<\/p>\n<p class=\"wp-block-paragraph\">It is a Tesla Cybertruck, and I have tried to explain that name to my son many times, but he insists on calling it Robo-Truck. Now every time I look at Robo-Truck and hear that name, it reminds me of the movie Transformers, where robots could transform to and from cars.<\/p>\n<p class=\"wp-block-paragraph\">And isn&#8217;t it strange that Transformers as we know them today could very well be on their way to powering these Robo-Trucks? It&#8217;s almost a full-circle moment. But where am I going with all this?<\/p>\n<p class=\"wp-block-paragraph\">Well, I am heading to our destination &#8211; Transformers. Not the robot-car ones but the neural network ones. And you are invited!<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"b0ca67\" data-has-transparency=\"true\" style=\"--dominant-color: #b0ca67;\" loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"720\" class=\"wp-image-318062 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1-4bAZ8RgZIH6MA114yAmqg.png\" alt=\"Image by author (Our Transformer - &#039;Robtimus Prime&#039;. 
Colors as mandated by my artist son.)\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1-4bAZ8RgZIH6MA114yAmqg.png 1280w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1-4bAZ8RgZIH6MA114yAmqg-300x169.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1-4bAZ8RgZIH6MA114yAmqg-1024x576.png 1024w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1-4bAZ8RgZIH6MA114yAmqg-768x432.png 768w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><figcaption class=\"wp-element-caption\">Image by author (Our Transformer &#8211; &#8216;Robtimus Prime&#8217;. Colors as mandated by my artist son.)<\/figcaption><\/figure>\n<h3 class=\"wp-block-heading\">What are Transformers?<\/h3>\n<p class=\"wp-block-paragraph\">Transformers are essentially neural networks. Neural networks that specialize in learning context from the data.<\/p>\n<p class=\"wp-block-paragraph\">But what makes them special is the presence of mechanisms that eliminate the need for <strong>convolution or recurrence<\/strong> in the network and reduce the reliance on <strong>labeled datasets<\/strong>.<\/p>\n<h3 class=\"wp-block-heading\">What are these special mechanisms?<\/h3>\n<p class=\"wp-block-paragraph\">There are many. But the two mechanisms that are truly the force behind transformers are attention weighting and feed-forward networks (FFN).<\/p>\n<h3 class=\"wp-block-heading\">What is attention-weighting?<\/h3>\n<p class=\"wp-block-paragraph\">Attention-weighting is a technique by which the model learns which part of the incoming sequence needs to be focused on. 
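<\/p>\n<p class=\"wp-block-paragraph\">A minimal numerical sketch of that idea (an illustration of mine with made-up numbers, not part of the hand exercise that follows): dot-product relevance scores between a query and each position of a toy sequence are turned into weights by a softmax, and those weights mark where the focus lands.<\/p>

```python
import numpy as np

# Toy sequence: four positions with 2-D vectors; the third one matches the query best.
seq = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [3.0, 3.0],
                [0.5, 0.5]])
query = np.array([1.0, 1.0])

scores = seq @ query                             # dot-product relevance per position
weights = np.exp(scores) / np.exp(scores).sum()  # softmax: scores -> attention weights

focused = weights @ seq                          # weighted sum of the sequence
print(weights.round(3))                          # almost all weight lands on the third position
```

<p class=\"wp-block-paragraph\">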
Think of it as the &#8216;Eye of Sauron&#8217; scanning everything at all times and throwing light on the parts that are relevant.<\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Fun fact: Apparently, the researchers had almost named the Transformer model &#8216;Attention-Net&#8217;, given that attention is such a crucial part of it.<\/p><\/blockquote>\n<h3 class=\"wp-block-heading\"><strong>What is an FFN?<\/strong><\/h3>\n<p class=\"wp-block-paragraph\">In the context of transformers, an FFN is essentially a regular multilayer perceptron acting on a batch of independent data vectors. Where attention mixes information across positions, the FFN mixes it across dimensions &#8211; together they cover every &#8216;position-dimension&#8217; combination.<\/p>\n<h2 class=\"wp-block-heading\"><strong>How do Attention and FFN work?<\/strong><\/h2>\n<p class=\"wp-block-paragraph\">So, without further ado, let&#8217;s dive into how <strong>attention-weighting<\/strong> and <strong>FFN<\/strong> make transformers so powerful.<\/p>\n<p class=\"wp-block-paragraph\">This discussion is based on Prof. Tom Yeh&#8217;s wonderful AI by Hand Series on <a href=\"https:\/\/lnkd.in\/g39jcD7j\">Transformers<\/a>. (All the images below, unless otherwise noted, are by Prof. 
Tom Yeh from the above-mentioned LinkedIn posts, which I have edited with his permission.)<\/p>\n<p class=\"wp-block-paragraph\">So here we go:<\/p>\n<p class=\"wp-block-paragraph\">The key ideas here: <strong>attention weighting and the feed-forward network (FFN)<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">Keeping those in mind, suppose we are given:<\/p>\n<ul class=\"wp-block-list\">\n<li>5 input features from a previous block (a 3&#215;5 matrix here, where X1, X2, X3, X4 and X5 are the features and each of the three rows denotes one of their characteristics).<\/li>\n<\/ul>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e6dfc9\" data-has-transparency=\"true\" style=\"--dominant-color: #e6dfc9;\" loading=\"lazy\" decoding=\"async\" width=\"372\" height=\"288\" class=\"wp-image-318064 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1RPnrl4AuoQYjweFTYhEHBA.png\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1RPnrl4AuoQYjweFTYhEHBA.png 372w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1RPnrl4AuoQYjweFTYhEHBA-300x232.png 300w\" sizes=\"auto, (max-width: 372px) 100vw, 372px\" \/><\/figure>\n<p class=\"wp-block-paragraph\">[1] <strong>Obtain attention weight matrix A<\/strong><\/p>\n<p class=\"wp-block-paragraph\">The first step in the process is to obtain the <strong>attention weight matrix A<\/strong>. This is the part where the self-attention mechanism comes into play. What it is trying to do is find the most relevant parts of this input sequence.<\/p>\n<p class=\"wp-block-paragraph\">We do it by feeding the input features into the query-key (QK) module. 
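<\/p>\n<p class=\"wp-block-paragraph\">As one concrete possibility, here is a minimal numpy sketch of the computation that steps [1] through [5] below walk through: a hypothetical scaled dot-product QK module producing A, the attention weighting, and the two feed-forward layers. The weight values are random stand-ins, not the numbers of the hand exercise.<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, h = 3, 5, 4    # 3 characteristics (rows), 5 features X1..X5 (columns), hidden width 4

X = rng.integers(0, 4, size=(d, n)).astype(float)   # stand-in input features

# [1] Hypothetical QK module: scaled dot-product scores, softmaxed column-wise so that
# each column of A is a weighting over the features X1..X5.
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
scores = (Wk @ X).T @ (Wq @ X) / np.sqrt(d)         # 5x5 relevance scores
A = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

# [2] Attention weighting: combine features across positions (horizontally)
Z = X @ A                                           # 3x5 attention-weighted features

# [3] FFN first layer plus bias: combine across dimensions (vertically), raising 3 -> 4
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=(h, 1))
H = W1 @ Z + b1

# [4] ReLU: negatives become zero, positives pass through
H = np.maximum(H, 0.0)

# [5] FFN second layer: project the 4 dimensions back down to 3
W2, b2 = rng.normal(size=(d, h)), rng.normal(size=(d, 1))
out = W2 @ H + b2                                   # same shape as X, ready for the next block
```

<p class=\"wp-block-paragraph\">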
For simplicity, the details of the QK module are not included here.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f1f1d9\" data-has-transparency=\"false\" style=\"--dominant-color: #f1f1d9;\" loading=\"lazy\" decoding=\"async\" width=\"1240\" height=\"692\" class=\"wp-image-318065 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1DYNNNiaZac_ZNGFVUn4aag.gif\" alt=\"\" \/><\/figure>\n<p class=\"wp-block-paragraph\">[2] <strong>Attention Weighting<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Once we have the <strong>attention weight matrix A (5&#215;5)<\/strong>, we multiply the input features (3&#215;5) with it to obtain the <strong>attention-weighted features Z<\/strong>.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"efeed5\" data-has-transparency=\"false\" style=\"--dominant-color: #efeed5;\" loading=\"lazy\" decoding=\"async\" width=\"1508\" height=\"576\" class=\"wp-image-318066 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/11_VmXxp6iPkwVEdhFwExkg.gif\" alt=\"\" \/><\/figure>\n<p class=\"wp-block-paragraph\">The important part here is that the features are combined <strong>based on their positions<\/strong> P1, P2 and P3 i.e. 
<strong>horizontally<\/strong>.<\/p>\n<p class=\"wp-block-paragraph\">To break it down further, consider this calculation performed row-wise:<\/p>\n<p class=\"wp-block-paragraph\">P1 X A1 = Z1 \u2192 Position [1,1] = 11<\/p>\n<p class=\"wp-block-paragraph\">P1 X A2 = Z2 \u2192 Position [1,2] = 6<\/p>\n<p class=\"wp-block-paragraph\">P1 X A3 = Z3 \u2192 Position [1,3] = 7<\/p>\n<p class=\"wp-block-paragraph\">P1 X A4 = Z4 \u2192 Position [1,4] = 7<\/p>\n<p class=\"wp-block-paragraph\">P1 X A5 = Z5 \u2192 Position [1,5] = 5<\/p>\n<p class=\"wp-block-paragraph\">.<\/p>\n<p class=\"wp-block-paragraph\">.<\/p>\n<p class=\"wp-block-paragraph\">.<\/p>\n<p class=\"wp-block-paragraph\">P2 X A4 = Z4 \u2192 Position [2,4] = 3<\/p>\n<p class=\"wp-block-paragraph\">P3 X A5 = Z5 \u2192 Position [3,5] = 1<\/p>\n<p class=\"wp-block-paragraph\">As an example:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f8f8ec\" data-has-transparency=\"true\" style=\"--dominant-color: #f8f8ec;\" loading=\"lazy\" decoding=\"async\" width=\"946\" height=\"448\" class=\"wp-image-318067 has-transparency\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1X6fqG-iOlHNv5JOF-TWxeA.png\" alt=\"\" srcset=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1X6fqG-iOlHNv5JOF-TWxeA.png 946w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1X6fqG-iOlHNv5JOF-TWxeA-300x142.png 300w, https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1X6fqG-iOlHNv5JOF-TWxeA-768x364.png 768w\" sizes=\"auto, (max-width: 946px) 100vw, 946px\" \/><\/figure>\n<p class=\"wp-block-paragraph\">It seems a little tedious in the beginning, but follow the multiplication row-wise and the result should be pretty straightforward.<\/p>\n<p class=\"wp-block-paragraph\">The cool thing is that, given the way our attention-weight matrix <strong>A<\/strong> is arranged, the new features <strong>Z<\/strong> turn out to be combinations of <strong>X<\/strong> as below:<\/p>\n<p class=\"wp-block-paragraph\">Z1 = X1 + X2<\/p>\n<p class=\"wp-block-paragraph\">Z2 = X2 + X3<\/p>\n<p class=\"wp-block-paragraph\">Z3 = X3 + X4<\/p>\n<p class=\"wp-block-paragraph\">Z4 = X4 + X5<\/p>\n<p class=\"wp-block-paragraph\">Z5 = X5 + X1<\/p>\n<p class=\"wp-block-paragraph\">(Hint: look at the positions of the 0s and 1s in matrix <strong>A<\/strong>.)<\/p>\n<p class=\"wp-block-paragraph\">[3] <strong>FFN: First Layer<\/strong><\/p>\n<p class=\"wp-block-paragraph\">The next step is to feed the attention-weighted features into the feed-forward neural network.<\/p>\n<p class=\"wp-block-paragraph\">However, the difference here lies in <strong>combining the values across dimensions<\/strong>, as opposed to positions in the previous step. It is done as below:<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"edebeb\" data-has-transparency=\"false\" style=\"--dominant-color: #edebeb;\" loading=\"lazy\" decoding=\"async\" width=\"1434\" height=\"426\" class=\"wp-image-318068 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1TMJqa8DPZ3LcnWtdccBKQQ.gif\" alt=\"\" \/><\/figure>\n<p class=\"wp-block-paragraph\">What this does is look at the data from the other direction.<\/p>\n<p class=\"wp-block-paragraph\"><strong>&#8211; In the attention step, we combined our input on the basis of the original features to obtain new features.<\/strong><\/p>\n<p class=\"wp-block-paragraph\"><strong>&#8211; In this FFN step, we consider their characteristics, i.e. 
combine features vertically to obtain our new matrix.<\/strong><\/p>\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Eg: P1(1,1) * Z1(1,1) + P2(1,2) * Z1(2,1) + P3(1,3) * Z1(3,1) + b(1) = 11, where b is the bias.<\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">Once again, element-wise row operations to the rescue. Notice that the number of dimensions of the new matrix is increased to 4 here.<\/p>\n<p class=\"wp-block-paragraph\">[4] <strong>ReLU<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Our favorite step: ReLU, where the negative values obtained in the previous matrix are returned as zero and the positive values remain unchanged.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"f6f6f6\" data-has-transparency=\"false\" style=\"--dominant-color: #f6f6f6;\" loading=\"lazy\" decoding=\"async\" width=\"1190\" height=\"360\" class=\"wp-image-318070 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1FmroND2LsW91TrYXNh2UGQ.gif\" alt=\"\" \/><\/figure>\n<p class=\"wp-block-paragraph\">[5] <strong>FFN: Second Layer<\/strong><\/p>\n<p class=\"wp-block-paragraph\">Finally, we pass it through the second layer, where the dimensionality of the resultant matrix is reduced from 4 back to 3.<\/p>\n<figure class=\"wp-block-image size-large\"><img data-dominant-color=\"e9e7e9\" data-has-transparency=\"false\" style=\"--dominant-color: #e9e7e9;\" loading=\"lazy\" decoding=\"async\" width=\"1378\" height=\"368\" class=\"wp-image-318071 not-transparent\" src=\"https:\/\/towardsdatascience.com\/wp-content\/uploads\/2024\/04\/1z0CE0MMXVIuuu0qPYybrjA.gif\" alt=\"\" \/><\/figure>\n<p class=\"wp-block-paragraph\">The output here is ready to be fed to the next block (see its similarity to the original matrix) and the entire process is repeated from the beginning.<\/p>\n<p 
class=\"wp-block-paragraph\"><strong>The two key things to remember here are:<\/strong><\/p>\n<ol class=\"wp-block-list\">\n<li><strong>The attention layer combines across positions (horizontally).<\/strong><\/li>\n<li><strong>The feed-forward layer combines across dimensions (vertically).<\/strong><\/li>\n<\/ol>\n<p class=\"wp-block-paragraph\">And this is the secret sauce behind the power of transformers &#8211; the ability to analyze data from different directions.<\/p>\n<p class=\"wp-block-paragraph\">To summarize the ideas above, here are the key points:<\/p>\n<ol class=\"wp-block-list\">\n<li>The transformer architecture can be perceived as a combination of the attention layer and the feed-forward layer.<\/li>\n<li>The <strong>attention layer combines the features<\/strong> to produce a new feature. E.g. think of combining two robots, Robo-Truck and Optimus Prime, to get a new robot, Robtimus Prime.<\/li>\n<li>The <strong>feed-forward (FFN) layer combines the parts or the characteristics<\/strong> of a feature to produce new parts\/characteristics. E.g. the wheels of Robo-Truck and the ion-laser of Optimus Prime could produce a wheeled laser.<\/li>\n<\/ol>\n<h2 class=\"wp-block-heading\">The ever-powerful Transformers<\/h2>\n<p class=\"wp-block-paragraph\">Neural networks have existed for quite some time now. Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) had been reigning supreme, but things took quite an eventful turn once Transformers were introduced in 2017. Since then, the field of AI has grown at an exponential rate &#8211; with new models, new benchmarks and new insights coming in every single day. And only time will tell if this phenomenal idea will one day lead the way to something even bigger &#8211; a real &#8216;Transformer&#8217;.<\/p>\n<p class=\"wp-block-paragraph\">But for now, it would not be wrong to say that an idea can really <em>transform<\/em> how we live!<\/p>\n\n<p class=\"wp-block-paragraph\">P.S. 
If you would like to work through this exercise on your own, here is the blank template for your use.<\/p>\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/drive.google.com\/file\/d\/1F08laMdmwQ2vxYIqewOghS1eknaprgxe\/view?usp=drive_link\">Blank Template for hand-exercise<\/a><\/p>\n<p class=\"wp-block-paragraph\">Now go have some fun and create your own <strong>Robtimus Prime<\/strong>!<\/p>","protected":false},"excerpt":{"rendered":"<p>Explore the details behind the power of transformers<\/p>\n","protected":false},"author":18,"featured_media":5153,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"is_member_only":false,"sub_heading":"Explore the details behind the power of transformers","footnotes":""},"categories":[21,22],"tags":[463,478,450,446,585],"sponsor":[],"coauthors":[31633],"class_list":["post-5152","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-large-language-models","category-machine-learning","tag-ai","tag-getting-started","tag-large-language-models","tag-machine-learning","tag-transformers"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v25.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Deep Dive into Transformers by Hand \u270d\ufe0e | Towards Data Science<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/towardsdatascience.com\/deep-dive-into-transformers-by-hand-\ufe0e-68b8be4bd813\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Deep Dive into Transformers by Hand \u270d\ufe0e | Towards Data Science\" \/>\n<meta property=\"og:description\" content=\"Explore the details behind the power of transformers\" \/>\n<meta property=\"og:url\" 
content=\"https:\/\/towardsdatascience.com\/deep-dive-into-transformers-by-hand-\ufe0e-68b8be4bd813\/\" \/>\n<meta property=\"og:site_name\" content=\"Towards Data Science\" \/>\n<meta property=\"article:published_time\" content=\"2024-04-12T05:18:36+00:00\" \/>\n<meta name=\"author\" content=\"Srijanie Dey, PhD\" \/>"}