State Space Model (SSM) - Revision history

BPeat at 15:21, 28 May 2025

2025-05-28T15:21:54Z

BPeat: /* State Space vs Transformer Models */

2024-05-01T00:57:28Z

‎State Space vs Transformer Models

BPeat: /* State Space vs transformer Models */

2024-05-01T00:56:17Z

‎State Space vs transformer Models

BPeat: /* Selective State Space vs transformer Models */

2024-05-01T00:29:33Z

‎Selective State Space vs transformer Models

BPeat: /* Selective State Space vs transformer Models */

2024-05-01T00:28:56Z

‎Selective State Space vs transformer Models

BPeat: /* Selective State Space vs transformer Models */

2024-05-01T00:20:18Z

‎Selective State Space vs transformer Models

BPeat at 00:07, 1 May 2024

2024-05-01T00:07:18Z

BPeat at 21:44, 30 April 2024

2024-04-30T21:44:42Z

BPeat at 00:58, 29 April 2024

2024-04-29T00:58:44Z

BPeat at 00:48, 29 April 2024

2024-04-29T00:48:27Z

@@ Line 2: / Line 2: @@
 |title=PRIMO.ai
 |titlemode=append
-|keywords=ChatGPT, artificial, intelligence, machine, learning, GPT-4, GPT-5, NLP, NLG, NLC, NLU, models, data, singularity, moonshot, Sentience, AGI, Emergence, Moonshot, Explainable, TensorFlow, Google, Nvidia, Microsoft, Azure, Amazon, AWS, Hugging Face, OpenAI, Tensorflow, OpenAI, Google, Nvidia, Microsoft, Azure, Amazon, AWS, Meta, LLM, metaverse, assistants, agents, digital twin, IoT, Transhumanism, Immersive Reality, Generative AI, Conversational AI, Perplexity, Bing, You, Bard, Ernie, prompt Engineering LangChain, Video/Image, Vision, End-to-End Speech, Synthesize Speech, Speech Recognition, Stanford, MIT |description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools
+|keywords=ChatGPT, artificial, intelligence, machine, learning,  NLP, NLG, NLC, NLU, models, data, singularity, moonshot, Sentience, AGI, Emergence, Moonshot, Explainable, TensorFlow, Google, Nvidia, Microsoft, Azure, Amazon, AWS, Hugging Face, OpenAI, Tensorflow, OpenAI, Google, Nvidia, Microsoft, Azure, Amazon, AWS, Meta, LLM, metaverse, assistants, agents, digital twin, IoT, Transhumanism, Immersive Reality, Generative AI, Conversational AI, Perplexity, Bing, You, Bard, Ernie, prompt Engineering LangChain, Video/Image, Vision, End-to-End Speech, Synthesize Speech, Speech Recognition, Stanford, MIT |description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools
 <!-- Google tag (gtag.js) -->
@@ Line 24: / Line 24: @@
 * [[Mixture-of-Experts (MoE)]] ... [[Mistral]]
 * [https://arxiv.org/abs/2312.00752 Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Albert Gu, Tri Dao]
-* [[Large Language Model (LLM)]] ... [[Large Language Model (LLM)#Multimodal|Multimodal]] ... [[Foundation Models (FM)]] ... [[Generative Pre-trained Transformer (GPT)|Generative Pre-trained]] ... [[Transformer]] ... [[GPT-4]] ... [[GPT-5]] ... [[Attention]] ... [[Generative Adversarial Network (GAN)|GAN]] ... [[Bidirectional Encoder Representations from Transformers (BERT)|BERT]]
+* [[Large Language Model (LLM)]] ... [[Large Language Model (LLM)#Multimodal|Multimodal]] ... [[Foundation Models (FM)]] ... [[Generative Pre-trained Transformer (GPT)|Generative Pre-trained]] ... [[Transformer]]  ... [[Attention]] ... [[Generative Adversarial Network (GAN)|GAN]] ... [[Bidirectional Encoder Representations from Transformers (BERT)|BERT]]
 * [[Natural Language Processing (NLP)]] ... [[Natural Language Generation (NLG)|Generation (NLG)]] ... [[Natural Language Classification (NLC)|Classification (NLC)]] ... [[Natural Language Processing (NLP)#Natural Language Understanding (NLU)|Understanding (NLU)]] ... [[Language Translation|Translation]] ... [[Summarization]] ... [[Sentiment Analysis|Sentiment]] ... [[Natural Language Tools & Services|Tools]]
 * [https://en.wikipedia.org/wiki/State-space_representation State-space representation | Wikipedia]
@@ Line 76: / Line 76: @@
 = State Space vs Transformer Models =
 We just did an episode on the first 90 days of [[Mamba]] literature. One of the really interesting things about this new mechanism, the <b>selective state space mechanism</b>, is that it has different strengths and weaknesses compared to the [[Attention]] mechanism. One is memory: the [[Attention]] mechanism's memory use is quadratic in the length of the input. That might be, by the way, one of the reasons it matters here: you talk about a torrent of data at 1,000 samples per second, and if that were naively translated to 1,000 tokens per second, then very quickly you're getting to a number of tokens that we have only very recently reached with frontier-grade [[Transformer|Transformers]]. It was only with [[GPT-4]] a year ago that the public first got to see a quality 8,000-token [[Transformer]]. Just a couple of months before that we had only seen 4,000, and as of about 18 months ago, 2,000 tokens was what you could really get from the [[OpenAI]] API. So the sheer volume of data may not lend itself well to the [[Transformer]]. There is another difference too. When researchers break things down into micro tasks and look at what the [[Transformer]] can and can't do, one of the things it really struggles with is the hyper-noisy environment. There was an interesting result on this in one [[Mamba]] versus [[Transformer]] comparison paper, which is really more about the <b>state space mechanism</b> versus the [[Attention]] mechanism; those are the two things dueling it out, more than the higher-level [[architectures]]. And they're not even dueling it out, because they actually work best together - spoiler. But in the super-noisy environment, where the part of the signal that actually matters is quite rare, the [[Transformer]] sometimes has a hard time converging, and the intuition I've developed for that is that it's changing all the weights at the same time across the entire range. 
It may be that the [[Gradient Descent Optimization & Challenges |gradient]] is often dominated by noise and has a hard time converging on the signal. I don't want to make everything about the <b>selective state space model</b>, though I do have an obsession with it, as folks know. But it updates per token, so it seems to have a more natural mechanism for when the actual signal hits. And this is where I start to violate my no-anthropomorphizing policy: it has an ability to recognize when the signal hits and update in a more focused way on the one thing that was really supposed to matter, whereas the [[Transformer]] is updating everything all across; it's considering everything at once. So it seems like the signal can get lost in all that noise, while the recurrent nature of the <b>selective state space</b> mechanism lets it zero in and do the [[Gradient Descent Optimization & Challenges |gradient]] on the signal when you have the signal. Of course there's still a lot of noise, but that can maybe get separated from the signal because of this bit-by-bit processing and updating. I'm not 100% confident in that theory, but it is consistent with all the evidence that I know of so far. [https://www.cognitiverevolution.ai/ Nathan Labenz - The Cognitive Revolution]
 <youtube>dRxolamy-NA</youtube>
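
The contrast drawn in the passage above can be made concrete with a small sketch. It is illustrative only and is not taken from the episode or from the Mamba paper: the function names, the 1.0 threshold, and the hard on/off gate are assumptions chosen for readability. Mamba's actual selection mechanism makes its state space parameters functions of the input rather than applying a threshold. The sketch shows three things: how many entries a full attention score matrix holds at a given context length, how quickly a 1,000 tokens-per-second stream fills an 8,000-token context, and how an input-gated recurrence can hold on to a rare spike while ignoring the noise around it. It covers only the forward pass and says nothing about the gradient behavior the speaker is theorizing about.

```python
from typing import Callable, List


def attention_score_entries(seq_len: int) -> int:
    """Entries in one full (seq_len x seq_len) attention score matrix.

    This is the quadratic term: doubling the context quadruples the count.
    """
    return seq_len * seq_len


def seconds_to_fill(context_tokens: int, tokens_per_second: int) -> float:
    """How long a steady token stream takes to fill a context window."""
    return context_tokens / tokens_per_second


def selective_scan(inputs: List[float], gate: Callable[[float], float]) -> List[float]:
    """Toy 1-D gated recurrence (hypothetical, NOT Mamba's parameterization).

    h_t = (1 - g_t) * h_{t-1} + g_t * x_t, with g_t = gate(x_t).
    The state is a single float, so memory does not grow with sequence length.
    """
    h = 0.0
    states = []
    for x in inputs:
        g = gate(x)                    # input-dependent: this is the "selective" part
        h = (1.0 - g) * h + g * x      # g = 0 keeps the old state, g = 1 overwrites it
        states.append(h)
    return states


def hard_gate(x: float, threshold: float = 1.0) -> float:
    """Assumed gate: open only when the input magnitude exceeds the threshold."""
    return 1.0 if abs(x) > threshold else 0.0


# A mostly-noise stream with one rare, large "signal" value.
stream = [0.1, -0.2, 0.05, 5.0, 0.1, -0.1]
states = selective_scan(stream, hard_gate)
plain_mean = sum(stream) / len(stream)  # a non-selective summary dilutes the spike

print(attention_score_entries(8000))    # 64000000
print(seconds_to_fill(8000, 1000))      # 8.0
print(states[-1], plain_mean)
```

In this toy stream the gated state still holds the spike value after the noisy inputs that follow it, while the plain average is pulled far below it. A learned, soft gate would sit between these two extremes.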

@@ Line 75: / Line 75: @@
-= State Space vs transformer Models =
+= State Space vs Transformer Models =
 We just did an episode on the first 90 days of [[Mamba]] literature. And one of the things that is really interesting about this new mechanism, the <b>selective state space mechanism</b>. Is that it does have different strengths and weaknesses compared to the [[Attention]] mechanism. Both in terms of how much memory it consumes where [[Attention]] mechanism is quadratic but the length of the input and that might be, by the way, one of the reasons like just as you talk about a torrent of data and a 1,000 samples per second, if that were to be naively translated to a 1,000 tokens per second. Then very quickly you're getting to a level of tokens that we have only very recently reached with frontier grade [[Transformer|Transformers]]. It was only with [[GPT-4]] a year ago that the public first got to see a quality 8,000 token [[Transformer]]. And before that it was like just a couple months where we had just seen the 4,000 before that as of like 18 months ago, 2,000 tokens was what you could really get from like the [[OpenAI]] API. So just the sheer volume of data may not limit itself super well to the [[Transformer]] but also another other When they break down these micro tasks. And look at what the [[Transformer]] can do and can't do one of the things that really struggles on is the hyper noisy environment there. There was a interesting result in this one. [[Mamba]] versus [[Transformer]] comparison paper, it's more about the <b>state space mechanism</b> and the [[Attention]] mechanism. Those are really the two things that are more dueling it out than the higher level [[architectures]]. And they're not even dueling it out because they actually work best together - spoiler. But in the super noisy environment where what actually matters is quite rare in what you're signaling, then the [[Transformer]] sometimes has a hard time converging and the intuition I've developed for that is because it's changing all the weights at the same time across like the entire range. 
Put it may be that the [[Gradient Descent Optimization & Challenges |gradient]] is often dominated by noise and has a hard time converging on the signal. Whereas When I don't want to make everything about the <b>selective state space model</b>, so I do have an obsession about this as folks know. It is updating per token and so it seems like it has a more natural mechanism when the actual signal hits to say. Oh, and this is where I start to violate my anthropomorphizing policy but it has an ability to recognize when the signal hits and update in a more focused way on that one thing that really was supposed to matter, whereas the [[Transformer]] is updating everything all across, it's considering everything at once. And so, it seems like the signal can get lost in all that noise, the recurrent nature of the <b>selective state space</b> mechanism. ... to zero in and do the [[Gradient Descent Optimization & Challenges |gradient]] on the signal when you have the signal and then of course there's still a lot of noise but that maybe can get separated from the signal because of this bit by bit level processing and updating. I'm not 100% just confident in that theory but it is consistent with all the evidence that I know of so far. [https://www.cognitiverevolution.ai/ Nick Labenz - The Cognitive Revolution]
 <youtube>dRxolamy-NA</youtube>

@@ Line 75: / Line 75: @@
-= Selective State Space vs transformer Models =
+= State Space vs transformer Models =
 We just did an episode on the first 90 days of [[Mamba]] literature. And one of the things that is really interesting about this new mechanism, the <b>selective state space mechanism</b>. Is that it does have different strengths and weaknesses compared to the [[Attention]] mechanism. Both in terms of how much memory it consumes where [[Attention]] mechanism is quadratic but the length of the input and that might be, by the way, one of the reasons like just as you talk about a torrent of data and a 1,000 samples per second, if that were to be naively translated to a 1,000 tokens per second. Then very quickly you're getting to a level of tokens that we have only very recently reached with frontier grade [[Transformer|Transformers]]. It was only with [[GPT-4]] a year ago that the public first got to see a quality 8,000 token [[Transformer]]. And before that it was like just a couple months where we had just seen the 4,000 before that as of like 18 months ago, 2,000 tokens was what you could really get from like the [[OpenAI]] API. So just the sheer volume of data may not limit itself super well to the [[Transformer]] but also another other When they break down these micro tasks. And look at what the [[Transformer]] can do and can't do one of the things that really struggles on is the hyper noisy environment there. There was a interesting result in this one. [[Mamba]] versus [[Transformer]] comparison paper, it's more about the <b>state space mechanism</b> and the [[Attention]] mechanism. Those are really the two things that are more dueling it out than the higher level [[architectures]]. And they're not even dueling it out because they actually work best together - spoiler. But in the super noisy environment where what actually matters is quite rare in what you're signaling, then the [[Transformer]] sometimes has a hard time converging and the intuition I've developed for that is because it's changing all the weights at the same time across like the entire range. 
Put it may be that the [[Gradient Descent Optimization & Challenges |gradient]] is often dominated by noise and has a hard time converging on the signal. Whereas When I don't want to make everything about the <b>selective state space model</b>, so I do have an obsession about this as folks know. It is updating per token and so it seems like it has a more natural mechanism when the actual signal hits to say. Oh, and this is where I start to violate my anthropomorphizing policy but it has an ability to recognize when the signal hits and update in a more focused way on that one thing that really was supposed to matter, whereas the [[Transformer]] is updating everything all across, it's considering everything at once. And so, it seems like the signal can get lost in all that noise, the recurrent nature of the <b>selective state space</b> mechanism. ... to zero in and do the [[Gradient Descent Optimization & Challenges |gradient]] on the signal when you have the signal and then of course there's still a lot of noise but that maybe can get separated from the signal because of this bit by bit level processing and updating. I'm not 100% just confident in that theory but it is consistent with all the evidence that I know of so far. [https://www.cognitiverevolution.ai/ Nick Labenz - The Cognitive Revolution]
 <youtube>dRxolamy-NA</youtube>

@@ Line 76: / Line 76: @@
 = Selective State Space vs transformer Models =
-We just did an episode on the first 90 days of [[Mamba]] literature. And one of the things that is really interesting about this new mechanism, the selective state space mechanism. Is that it does have different strengths and weaknesses compared to the [[Attention]] mechanism. Both in terms of how much memory it consumes where [[Attention]] mechanism is quadratic but the length of the input and that might be, by the way, one of the reasons like just as you talk about a torrent of data and a 1,000 samples per second, if that were to be naively translated to a 1,000 tokens per second. Then very quickly you're getting to a level of tokens that we have only very recently reached with frontier grade [[Transformer|Transformers]]. It was only with [[GPT-4]] a year ago that the public first got to see a quality 8,000 token [[Transformer]]. And before that it was like just a couple months where we had just seen the 4,000 before that as of like 18 months ago, 2,000 tokens was what you could really get from like the [[OpenAI]] API. So just the sheer volume of data may not limit itself super well to the [[Transformer]] but also another other When they break down these micro tasks. And look at what the [[Transformer]] can do and can't do one of the things that really struggles on is the hyper noisy environment there. There was a interesting result in this one. [[Mamba]] versus [[Transformer]] comparison paper, it's more about the state space mechanism and the [[Attention]] mechanism. Those are really the two things that are more dueling it out than the higher level architectures. And they're not even doing it out because they actually work best together spoiler but in the super noisy environment where what actually matters is quite rare in what you're signaling, then the Transformer sometimes has a hard time converging and the intuition I've developed for that is because it's changing all the weights at the same time across like the entire range. Range. 
Put it may be that the gradient is often dominated by noise and has a hard time converging on the signal. Whereas When I don't want to make everything about the selective state space model, so I do have an obsession about this as folks know. It is updating per token and so it seems like it has a more natural mechanism when the actual signal hits to say. Oh, and this is where I start to violate my noanthropomorphizing policy but it has an ability to recognize When the signal hits and update in a more focused way on that one thing that really was supposed to matter, whereas the Transformer is updating everything all across, it's considering everything at once. And so, it seems like the signal can get lost in all that noise, the recurrent nature of the selective State space. Mechanism. Oh, I see to kind of zero in and do the gradient on the signal when you have the signal and then of course there's still a lot of noise but that maybe can get separated from the signal because of this bit by bit level processing and updating. I'm not 100 just confident in that theory but it is consistent with all the evidence that I know of so far. [https://www.cognitiverevolution.ai/ Nick Labenz - The Cognitive Revolution]
+We just did an episode on the first 90 days of [[Mamba]] literature. And one of the things that is really interesting about this new mechanism, the <b>selective state space mechanism</b>. Is that it does have different strengths and weaknesses compared to the [[Attention]] mechanism. Both in terms of how much memory it consumes where [[Attention]] mechanism is quadratic but the length of the input and that might be, by the way, one of the reasons like just as you talk about a torrent of data and a 1,000 samples per second, if that were to be naively translated to a 1,000 tokens per second. Then very quickly you're getting to a level of tokens that we have only very recently reached with frontier grade [[Transformer|Transformers]]. It was only with [[GPT-4]] a year ago that the public first got to see a quality 8,000 token [[Transformer]]. And before that it was like just a couple months where we had just seen the 4,000 before that as of like 18 months ago, 2,000 tokens was what you could really get from like the [[OpenAI]] API. So just the sheer volume of data may not limit itself super well to the [[Transformer]] but also another other When they break down these micro tasks. And look at what the [[Transformer]] can do and can't do one of the things that really struggles on is the hyper noisy environment there. There was a interesting result in this one. [[Mamba]] versus [[Transformer]] comparison paper, it's more about the <b>state space mechanism</b> and the [[Attention]] mechanism. Those are really the two things that are more dueling it out than the higher level [[architectures]]. And they're not even dueling it out because they actually work best together - spoiler. But in the super noisy environment where what actually matters is quite rare in what you're signaling, then the [[Transformer]] sometimes has a hard time converging and the intuition I've developed for that is because it's changing all the weights at the same time across like the entire range. 
Put it may be that the [[Gradient Descent Optimization & Challenges |gradient]] is often dominated by noise and has a hard time converging on the signal. Whereas When I don't want to make everything about the <b>selective state space model</b>, so I do have an obsession about this as folks know. It is updating per token and so it seems like it has a more natural mechanism when the actual signal hits to say. Oh, and this is where I start to violate my anthropomorphizing policy but it has an ability to recognize when the signal hits and update in a more focused way on that one thing that really was supposed to matter, whereas the [[Transformer]] is updating everything all across, it's considering everything at once. And so, it seems like the signal can get lost in all that noise, the recurrent nature of the <b>selective state space</b> mechanism. ... to zero in and do the [[Gradient Descent Optimization & Challenges |gradient]] on the signal when you have the signal and then of course there's still a lot of noise but that maybe can get separated from the signal because of this bit by bit level processing and updating. I'm not 100% just confident in that theory but it is consistent with all the evidence that I know of so far. [https://www.cognitiverevolution.ai/ Nick Labenz - The Cognitive Revolution]
 <youtube>dRxolamy-NA</youtube>

← Older revision		Revision as of 00:07, 1 May 2024
Line 77:		Line 77:
	= Selective State Space vs transformer Models =		= Selective State Space vs transformer Models =
	We just did an episode on the first 90 days of [[Mamba]] literature. And one of the things that is really interesting about this new mechanism, the selective state space mechanism. Is that it does have different strengths and weaknesses compared to the attention mechanism. Both in terms of how much memory it consumes where attention mechanism is quadratic but the length of the input and that might be, by the way, one of the reasons like just as you talk about like a torrent of data and a thousand samples per second, if that were to be naively translated to a thousand tokens per second. Then very quickly you're getting to a level of tokens that we have only very recently reached with Frontier grade Transformers. It was only with gpt4 a year ago that the public first. Got to see a quality 8, 000 token Transformer. And before that it was like just a couple months where we had just seen the four thousand before that as of like 18 months ago, 2 000, Calcons was what you could really get from like the open AI API. So just the sheer volume of data may not limit itself super well to the Transformer but also another other When they break down these micro tasks. And look at what the Transformer can do and can't do one of the things that really struggles on is the hyper noisy environment there. There was a interesting result in this one. Mamba versus Transformer comparison paper, it's more about the state space mechanism and the attention mechanism. Those are really the two things that are more dueling it out than the higher level architectures. And they're not even doing it out because they actually work best together spoiler but in the super noisy environment where what actually matters is quite rare in what you're signaling, then the Transformer sometimes has a hard time converging and the intuition I've developed for that is because it's changing all the weights at the same time across like the entire range. Range. 
Put it may be that the gradient is often dominated by noise and has a hard time converging on the signal. Whereas When I don't want to make everything about the selective state space model, so I do have an obsession about this as folks know. It is updating per token and so it seems like it has a more natural mechanism when the actual signal hits to say. Oh, and this is where I start to violate my noanthropomorphizing policy but it has an ability to recognize When the signal hits and update in a more focused way on that one thing that really was supposed to matter, whereas the Transformer is updating everything all across, it's considering everything at once. And so, it seems like the signal can get lost in all that noise, the recurrent nature of the selective State space. Mechanism. Oh, I see to kind of zero in and do the gradient on the signal when you have the signal and then of course there's still a lot of noise but that maybe can get separated from the signal because of this bit by bit level processing and updating. I'm not 100 just confident in that theory but it is consistent with all the evidence that I know of so far.		We just did an episode on the first 90 days of [[Mamba]] literature. And one of the things that is really interesting about this new mechanism, the selective state space mechanism. Is that it does have different strengths and weaknesses compared to the attention mechanism. Both in terms of how much memory it consumes where attention mechanism is quadratic but the length of the input and that might be, by the way, one of the reasons like just as you talk about like a torrent of data and a thousand samples per second, if that were to be naively translated to a thousand tokens per second. Then very quickly you're getting to a level of tokens that we have only very recently reached with Frontier grade Transformers. It was only with gpt4 a year ago that the public first. Got to see a quality 8, 000 token Transformer. 
And before that it was like just a couple months where we had just seen the four thousand before that as of like 18 months ago, 2 000, Calcons was what you could really get from like the open AI API. So just the sheer volume of data may not limit itself super well to the Transformer but also another other When they break down these micro tasks. And look at what the Transformer can do and can't do one of the things that really struggles on is the hyper noisy environment there. There was a interesting result in this one. Mamba versus Transformer comparison paper, it's more about the state space mechanism and the attention mechanism. Those are really the two things that are more dueling it out than the higher level architectures. And they're not even doing it out because they actually work best together spoiler but in the super noisy environment where what actually matters is quite rare in what you're signaling, then the Transformer sometimes has a hard time converging and the intuition I've developed for that is because it's changing all the weights at the same time across like the entire range. Range. Put it may be that the gradient is often dominated by noise and has a hard time converging on the signal. Whereas When I don't want to make everything about the selective state space model, so I do have an obsession about this as folks know. It is updating per token and so it seems like it has a more natural mechanism when the actual signal hits to say. Oh, and this is where I start to violate my noanthropomorphizing policy but it has an ability to recognize When the signal hits and update in a more focused way on that one thing that really was supposed to matter, whereas the Transformer is updating everything all across, it's considering everything at once. And so, it seems like the signal can get lost in all that noise, the recurrent nature of the selective State space. Mechanism. 
Oh, I see to kind of zero in and do the gradient on the signal when you have the signal and then of course there's still a lot of noise but that maybe can get separated from the signal because of this bit by bit level processing and updating. I'm not 100 just confident in that theory but it is consistent with all the evidence that I know of so far.
		+
		+	<youtube>dRxolamy-NA</youtube>

← Older revision		Revision as of 21:44, 30 April 2024
Line 73:		Line 73:
	<youtube>GqwhkbrWDOI</youtube>		<youtube>GqwhkbrWDOI</youtube>
	<youtube>dG6MSsdojLg</youtube>		<youtube>dG6MSsdojLg</youtube>
		+
		+
		+	= Selective State Space vs transformer Models =
		+	We just did an episode on the first 90 days of [[Mamba]] literature. And one of the things that is really interesting about this new mechanism, the selective state space mechanism. Is that it does have different strengths and weaknesses compared to the attention mechanism. Both in terms of how much memory it consumes where attention mechanism is quadratic but the length of the input and that might be, by the way, one of the reasons like just as you talk about like a torrent of data and a thousand samples per second, if that were to be naively translated to a thousand tokens per second. Then very quickly you're getting to a level of tokens that we have only very recently reached with Frontier grade Transformers. It was only with gpt4 a year ago that the public first. Got to see a quality 8, 000 token Transformer. And before that it was like just a couple months where we had just seen the four thousand before that as of like 18 months ago, 2 000, Calcons was what you could really get from like the open AI API. So just the sheer volume of data may not limit itself super well to the Transformer but also another other When they break down these micro tasks. And look at what the Transformer can do and can't do one of the things that really struggles on is the hyper noisy environment there. There was a interesting result in this one. Mamba versus Transformer comparison paper, it's more about the state space mechanism and the attention mechanism. Those are really the two things that are more dueling it out than the higher level architectures. And they're not even doing it out because they actually work best together spoiler but in the super noisy environment where what actually matters is quite rare in what you're signaling, then the Transformer sometimes has a hard time converging and the intuition I've developed for that is because it's changing all the weights at the same time across like the entire range. Range. 
Put it may be that the gradient is often dominated by noise and has a hard time converging on the signal. Whereas When I don't want to make everything about the selective state space model, so I do have an obsession about this as folks know. It is updating per token and so it seems like it has a more natural mechanism when the actual signal hits to say. Oh, and this is where I start to violate my noanthropomorphizing policy but it has an ability to recognize When the signal hits and update in a more focused way on that one thing that really was supposed to matter, whereas the Transformer is updating everything all across, it's considering everything at once. And so, it seems like the signal can get lost in all that noise, the recurrent nature of the selective State space. Mechanism. Oh, I see to kind of zero in and do the gradient on the signal when you have the signal and then of course there's still a lot of noise but that maybe can get separated from the signal because of this bit by bit level processing and updating. I'm not 100 just confident in that theory but it is consistent with all the evidence that I know of so far.

@@ Line 21: / Line 21: @@
 * [[State Space Model (SSM)]] ... [[Mamba]] ... [[Sequence to Sequence (Seq2Seq)]] ... [[Recurrent Neural Network (RNN)]] ... [[(Deep) Convolutional Neural Network (DCNN/CNN)|Convolutional Neural Network (CNN)]]
-* [[Memory]]
+* [[Memory]] ... [[Memory Networks]] ... [[Hierarchical Temporal Memory (HTM)]] ... [[Lifelong Learning]]
 * [[Mixture-of-Experts (MoE)]] ... [[Mistral]]
 * [https://arxiv.org/abs/2312.00752 Mamba: Linear-Time Sequence Modeling with Selective State Spaces | Albert Gu, Tri Dao]