Difference between revisions of "Dimensional Reduction"

(27 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
|title=PRIMO.ai
 
|title=PRIMO.ai
 
|titlemode=append
 
|titlemode=append
|keywords=artificial, intelligence, machine, learning, models, algorithms, data, singularity, moonshot, Tensorflow, Google, Nvidia, Microsoft, Azure, Amazon, AWS  
+
|keywords=ChatGPT, artificial, intelligence, machine, learning, GPT-4, GPT-5, NLP, NLG, NLC, NLU, models, data, singularity, moonshot, Sentience, AGI, Emergence, Moonshot, Explainable, TensorFlow, Google, Nvidia, Microsoft, Azure, Amazon, AWS, Hugging Face, OpenAI, Tensorflow, OpenAI, Google, Nvidia, Microsoft, Azure, Amazon, AWS, Meta, LLM, metaverse, assistants, agents, digital twin, IoT, Transhumanism, Immersive Reality, Generative AI, Conversational AI, Perplexity, Bing, You, Bard, Ernie, prompt Engineering LangChain, Video/Image, Vision, End-to-End Speech, Synthesize Speech, Speech Recognition, Stanford, MIT |description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools
|description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools  
+
 
 
}}
 
}}
[http://www.youtube.com/results?search_query=Dimensional+Reduction+Algorithm Youtube search...]
+
[https://www.youtube.com/results?search_query=~Dimensional+~Reduction+AI YouTube]
[http://www.google.com/search?q=Dimensional+Reduction+Algorithm+Dimension+machine+learning+ML ...Google search]
+
[https://www.quora.com/search?q=~Dimensional%20Reduction%20AI ... Quora]
 +
[https://www.google.com/search?q=~Dimensional+~Reduction+AI ...Google search]
 +
[https://news.google.com/search?q=~Dimensional+~Reduction+AI ...Google News]
 +
[https://www.bing.com/news/search?q=~Dimensional+~Reduction+AI&qft=interval%3d%228%22 ...Bing News]
  
* [[Manifold Hypothesis]]
+
* [[Embedding]] ... [[Fine-tuning]] ... [[Retrieval-Augmented Generation (RAG)|RAG]] ... [[Agents#AI-Powered Search|Search]] ... [[Clustering]] ... [[Recommendation]] ... [[Anomaly Detection]] ... [[Classification]] ... [[Dimensional Reduction]].  [[...find outliers]]
 +
* [[Math for Intelligence]] ... [[Finding Paul Revere]] ... [[Social Network Analysis (SNA)]] ... [[Dot Product]] ... [[Kernel Trick]]
 +
* [[Hyperdimensional Computing (HDC)]]  
 
* [[Pooling / Sub-sampling: Max, Mean]]
 
* [[Pooling / Sub-sampling: Max, Mean]]
* [[Kernel Trick]]
+
* [[Backpropagation]] ... [[Feed Forward Neural Network (FF or FFNN)|FFNN]] ... [[Forward-Forward]] ... [[Activation Functions]] ...[[Softmax]] ... [[Loss]] ... [[Boosting]] ... [[Gradient Descent Optimization & Challenges|Gradient Descent]] ... [[Algorithm Administration#Hyperparameter|Hyperparameter]] ... [[Manifold Hypothesis]] ... [[Principal Component Analysis (PCA)|PCA]]
* [[Softmax]]
+
* [https://files.knime.com/sites/default/files/inline-images/knime_seventechniquesdatadimreduction.pdf Seven Techniques for Dimensionality Reduction | KNIME]
* [http://files.knime.com/sites/default/files/inline-images/knime_seventechniquesdatadimreduction.pdf Seven Techniques for Dimensionality Reduction | KNIME]
+
* [https://github.com/JonTupitza/Data-Science-Process/blob/master/06-Dimensionality-Reduction.ipynb Dimensionality Reduction Techniques Jupyter Notebook] | [https://github.com/jontupitza Jon Tupitza]
* [http://github.com/JonTupitza/Data-Science-Process/blob/master/06-Dimensionality-Reduction.ipynb Dimensionality Reduction Techniques Jupyter Notebook] | [http://github.com/jontupitza Jon Tupitza]
 
 
* [[(Deep) Convolutional Neural Network (DCNN/CNN)]]
 
* [[(Deep) Convolutional Neural Network (DCNN/CNN)]]
* [http://en.wikipedia.org/wiki/Factor_analysis Factor analysis]
+
* [https://en.wikipedia.org/wiki/Factor_analysis Factor analysis]
* [http://en.wikipedia.org/wiki/Feature_extraction Feature extraction]
+
* [https://en.wikipedia.org/wiki/Feature_extraction Feature extraction]
* [http://en.wikipedia.org/wiki/Feature_selection Feature selection]
+
* [https://en.wikipedia.org/wiki/Feature_selection Feature selection]
* [http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Locally-linear_embedding Nonlinear dimensionality reduction | Wikipedia]
+
* [https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Locally-linear_embedding Nonlinear dimensionality reduction | Wikipedia]
* [[Local Linear Embedding (LLE) | Embedding functions]]
+
 
  
 
To identify the most important [[Feature Exploration/Learning | Features]] to address:
 
To identify the most important [[Feature Exploration/Learning | Features]] to address:
Line 29: Line 42:
 
* Algorithms:
 
* Algorithms:
 
** [[Principal Component Analysis (PCA)]] is an unsupervised linear transformation technique that helps identify patterns in data based on the correlation between features. PCA aims to find the directions of maximum variance in high-dimensional data and project the data onto a lower-dimensional feature space.
 
** [[Principal Component Analysis (PCA)]] is an unsupervised linear transformation technique that helps identify patterns in data based on the correlation between features. PCA aims to find the directions of maximum variance in high-dimensional data and project the data onto a lower-dimensional feature space.
** [http://en.wikipedia.org/wiki/Independent_component_analysis Independent Component Analysis (ICA)]
+
** [https://en.wikipedia.org/wiki/Independent_component_analysis Independent Component Analysis (ICA)]
** [http://en.wikipedia.org/wiki/Canonical_correlation Canonical Correlation Analysis (CCA)]
+
** [https://en.wikipedia.org/wiki/Canonical_correlation Canonical Correlation Analysis (CCA)]
** [http://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear Discriminant Analysis (LDA)] is a supervised linear transformation technique that aims to find the feature subspace that optimizes class separability.
+
** [https://en.wikipedia.org/wiki/Linear_discriminant_analysis Linear Discriminant Analysis (LDA)] is a supervised linear transformation technique that aims to find the feature subspace that optimizes class separability.
** [http://en.wikipedia.org/wiki/Multidimensional_scaling Multidimensional Scaling (MDS)]
+
** [https://en.wikipedia.org/wiki/Multidimensional_scaling Multidimensional Scaling (MDS)]
** [http://en.wikipedia.org/wiki/Non-negative_matrix_factorization Non-Negative Matrix Factorization (NMF)]
+
** [https://en.wikipedia.org/wiki/Non-negative_matrix_factorization Non-Negative Matrix Factorization (NMF)]
** [http://en.wikipedia.org/wiki/Partial_least_squares_regression Partial Least Squares Regression (PLSR)]
+
** [https://en.wikipedia.org/wiki/Partial_least_squares_regression Partial Least Squares Regression (PLSR)]
** [http://en.wikipedia.org/wiki/Principal_component_regression Principal Component Regression (PCR)]
+
** [https://en.wikipedia.org/wiki/Principal_component_regression Principal Component Regression (PCR)]
** [http://en.wikipedia.org/wiki/Projection_pursuit Projection Pursuit]
+
** [https://en.wikipedia.org/wiki/Projection_pursuit Projection Pursuit]
** [http://en.wikipedia.org/wiki/Sammon_mapping Sammon Mapping/Projection]
+
** [https://en.wikipedia.org/wiki/Sammon_mapping Sammon Mapping/Projection]
 
** [[Local Linear Embedding (LLE)]] creates an embedding of the dataset and tries to preserve the relationships between neighborhoods in the dataset. LLE can be thought of as a series of local PCAs that are globally compared to find the best non-linear embedding.
 
** [[Local Linear Embedding (LLE)]] creates an embedding of the dataset and tries to preserve the relationships between neighborhoods in the dataset. LLE can be thought of as a series of local PCAs that are globally compared to find the best non-linear embedding.
 
** [[Isomap]] Embedding is a non-linear dimensionality reduction technique that creates an embedding of the dataset while trying to preserve the relationships in the dataset. Isomap looks for a lower-dimensional embedding that maintains geodesic distances between all points.
 
** [[Isomap]] Embedding is a non-linear dimensionality reduction technique that creates an embedding of the dataset while trying to preserve the relationships in the dataset. Isomap looks for a lower-dimensional embedding that maintains geodesic distances between all points.
Line 43: Line 56:
 
** Singular Value Decomposition (SVD) is a linear dimensionality reduction technique.
 
** Singular Value Decomposition (SVD) is a linear dimensionality reduction technique.
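
The PCA entry in the list above (find the direction of maximum variance, then project onto it) can be illustrated with a minimal, dependency-free sketch. It is not taken from the referenced pages: the function names <code>first_principal_component</code> and <code>make_line_data</code> are hypothetical, power iteration is only one of several ways to obtain the leading eigenvector, and a real project would more likely use a library implementation.

```python
import math
import random


def first_principal_component(rows, iterations=100, seed=0):
    """Return (direction, projections) for the first principal component.

    Assumes rows is a list of equal-length numeric lists whose values are
    not all identical (otherwise the covariance matrix is zero).
    """
    n, d = len(rows), len(rows[0])
    # Center the data: PCA works on deviations from the per-feature mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiplying by cov converges to the
    # eigenvector with the largest eigenvalue (the max-variance direction).
    rng = random.Random(seed)
    v = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered point onto that direction: d features become 1.
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected


def make_line_data(n=200, seed=0):
    """Illustrative 2-D data scattered around the line y = 2x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        rows.append([x, 2 * x + rng.gauss(0, 0.05)])
    return rows


if __name__ == "__main__":
    direction, projected = first_principal_component(make_line_data())
    print("direction of maximum variance:", direction)
    print("slope of that direction:", direction[1] / direction[0])
```

On this synthetic data the recovered direction should have a slope close to 2, which is the line the points were generated around; the sign of the direction vector is arbitrary.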
  
Some datasets contain so many variables that they become very hard to handle. Data collection now often happens at a very detailed level because resources are plentiful, so a dataset may contain thousands of variables, many of them unnecessary. In such cases it is almost impossible to identify by inspection which variables have the most impact on the prediction. Dimensional reduction algorithms are used in these situations; they can draw on other algorithms, such as Random Forest or Decision Tree, to identify the most important variables. [http://towardsdatascience.com/10-machine-learning-algorithms-you-need-to-know-77fb0055fe0 10 Machine Learning Algorithms You need to Know | Sidath Asir @ Medium]
+
 
 +
 
 +
<hr><center>
 +
<b>Dimensional Reduction</b>: techniques for reducing the number of input variables in training data while capturing the “essence” of the data
 +
</center>
 +
<hr>
 +
 
 +
 
 +
Some datasets contain so many variables that they become very hard to handle. Data collection now often happens at a very detailed level because resources are plentiful, so a dataset may contain thousands of variables, many of them unnecessary. In such cases it is almost impossible to identify by inspection which variables have the most impact on the prediction. Dimensional reduction algorithms are used in these situations; they can draw on other algorithms, such as Random Forest or Decision Tree, to identify the most important variables. [https://towardsdatascience.com/10-machine-learning-algorithms-you-need-to-know-77fb0055fe0 10 Machine Learning Algorithms You need to Know | Sidath Asir @ Medium]
  
  
Line 57: Line 78:
  
 
= <span id="Projection"></span>Projection =
 
= <span id="Projection"></span>Projection =
[http://www.youtube.com/results?search_query=Dimensional+Reduction+Projection+Algorithm Youtube search...]
+
[https://www.youtube.com/results?search_query=Dimensional+Reduction+Projection+Algorithm Youtube search...]
[http://www.google.com/search?q=Dimensional+Reduction+Projection+Algorithm+Dimension+machine+learning+ML ...Google search]
+
[https://www.google.com/search?q=Dimensional+Reduction+Projection+Algorithm+Dimension+machine+learning+ML ...Google search]
  
 
* [[Autoencoder (AE) / Encoder-Decoder]]
 
* [[Autoencoder (AE) / Encoder-Decoder]]
Line 64: Line 85:
 
* [[Privacy]]
 
* [[Privacy]]
 
* [[Manifold Hypothesis]]
 
* [[Manifold Hypothesis]]
** [http://arxiv.org/pdf/1802.03426.pdf Uniform Manifold Approximation and Projection (UMAP) | L. McInnes, J. Healy, and J. Melville] ... a dimension reduction technique that can be used for visualisation similarly to [[T-Distributed Stochastic Neighbor Embedding (t-SNE) | t-SNE]], but also for general non-linear dimension reduction.  
+
** [https://arxiv.org/pdf/1802.03426.pdf Uniform Manifold Approximation and Projection (UMAP) | L. McInnes, J. Healy, and J. Melville] ... a dimension reduction technique that can be used for visualisation similarly to [[T-Distributed Stochastic Neighbor Embedding (t-SNE) | t-SNE]], but also for general non-linear dimension reduction.  
*** [http://github.com/lmcinnes/umap UMAP]...[[Python]] version
+
*** [https://github.com/lmcinnes/umap UMAP]...[[Python]] version
*** [http://github.com/pair-code/umap-js UMAP-JS] ...[[Javascript]] version
+
*** [https://github.com/pair-code/umap-js UMAP-JS] ...[[JavaScript]] version
 +
* [https://www.sciencedirect.com/science/article/pii/S2215016120303137 Uncovering High-dimensional Structures of Projections from Dimensionality Reduction Methods | Michael Thrun & Alfred Ultsch - ScienceDirect]
  
 
<youtube>6BPl81wGGP8</youtube>
 
<youtube>6BPl81wGGP8</youtube>
 +
 +
= <span id="Product Quantization (PQ)"></span>Product Quantization (PQ) =
 +
* [[Inverted File Indexes (IVF)]]
 +
* [[Approximate Nearest Neighbor (ANN)]]
 +
* [[Excel]] ... [[LangChain#Documents|Documents]] ... [[Database|Database; Vector & Relational]] ... [[Graph]] ... [[LlamaIndex]]
 +
 +
Product quantization (PQ) is a vector compression technique that is very effective for compressing high-dimensional vectors for nearest neighbor search. The idea behind PQ is to decompose the space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately. It is used in [[Approximate Nearest Neighbor (ANN)|Approximate Nearest Neighbor (ANN) search]] and is a building block of many vector quantization techniques.
 +
 +
Here are some key points about product quantization:
 +
 +
* PQ splits each vector into segments (sub-vectors) and quantizes each segment separately.
 +
* Each vector in the database is converted to a short code, known as a PQ code, which is an extremely [[memory]]-efficient representation for approximate nearest neighbor search.
 +
* PQ methods decompose the embedding space into a Cartesian product of M disjoint partitions and quantize each partition into K clusters, with one codebook per partition.
 +
* PQ is highly scalable and can be used for large-scale searches.
 +
* PQ is available in many vector search libraries, including Faiss, which offers several index types, one of which combines an inverted file index with product quantization (IVF-PQ).
 +
 +
PQ is an effective way to compress vectors because it can achieve high compression rates while still preserving the semantic relationships between the vectors. It does so by approximately preserving the pairwise distances between the vectors, which is what matters for many NLP and ML tasks; the compression is lossy, so distances computed from PQ codes are estimates. PQ is often used in conjunction with other vector indexing techniques, such as [[Inverted File Indexes (IVF)]] and [[K-Nearest Neighbors (KNN)]]: PQ compresses the vectors, which makes them smaller and faster to search, while the other techniques further improve search performance. The main disadvantages of using product quantization for vector indexing are that it can be sensitive to the choice of codebook size and number of subvectors, and that these hyperparameters can be difficult to tune for optimal performance. Here are some of the advantages of using product quantization for vector indexing:
 +
* It is very effective at compressing vectors while preserving their semantic relationships.
 +
* It is fast and scalable, making it suitable for indexing large datasets of vectors.
 +
* It is often used in conjunction with other vector indexing techniques to further improve the search performance.
 +
 +
<youtube>t9mRf2S5vDI</youtube>
 +
<youtube>PNVJvZEkuXo</youtube>

Latest revision as of 23:54, 1 March 2024



To identify the most important Features to address:

  • reduce the amount of computing resources required
  • 2D & 3D intuition often fails in higher dimensions
  • distances between points tend to become nearly the same as the number of dimensions increases
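
The last point in the list above can be checked numerically. The following is a small sketch under stated assumptions (uniformly random points in the unit hypercube, Euclidean distance, and a hypothetical helper named relative_contrast); it measures how much the farthest point differs from the nearest one, relative to the nearest distance.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point
    to n_points random points, all drawn uniformly from [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast shrinks as the dimension grows: nearest and farthest
    # neighbors become hard to tell apart.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

A large value means nearest and farthest neighbors are clearly separated; a value near zero means all points look roughly equally far away, which is the situation dimensional reduction tries to avoid.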



Dimensional Reduction: techniques for reducing the number of input variables in training data while capturing the “essence” of the data



Some datasets contain so many variables that they become very hard to handle. Data collection now often happens at a very detailed level because resources are plentiful, so a dataset may contain thousands of variables, many of them unnecessary. In such cases it is almost impossible to identify by inspection which variables have the most impact on the prediction. Dimensional reduction algorithms are used in these situations; they can draw on other algorithms, such as Random Forest or Decision Tree, to identify the most important variables. 10 Machine Learning Algorithms You need to Know | Sidath Asir @ Medium



Projection


Product Quantization (PQ)

Product quantization (PQ) is a vector compression technique that is very effective for compressing high-dimensional vectors for nearest neighbor search. The idea behind PQ is to decompose the space into a Cartesian product of low-dimensional subspaces and to quantize each subspace separately. It is used in Approximate Nearest Neighbor (ANN) search and is a building block of many vector quantization techniques.

Here are some key points about product quantization:

  • PQ splits each vector into segments (sub-vectors) and quantizes each segment separately.
  • Each vector in the database is converted to a short code, known as a PQ code, which is an extremely memory-efficient representation for approximate nearest neighbor search.
  • PQ methods decompose the embedding space into a Cartesian product of M disjoint partitions and quantize each partition into K clusters, with one codebook per partition.
  • PQ is highly scalable and can be used for large-scale searches.
  • PQ is available in many vector search libraries, including Faiss, which offers several index types, one of which combines an inverted file index with product quantization (IVF-PQ).
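
The key points above (split into M segments, learn K clusters per segment, store one cluster index per segment) can be sketched in plain Python. This is an illustrative toy under stated assumptions, not how Faiss or any other library implements PQ: every function name here is hypothetical, the k-means is a basic Lloyd's loop, and the demo data is uniformly random.

```python
import math
import random


def split(vec, m):
    """Cut a vector into m equal-length segments (len(vec) divisible by m)."""
    size = len(vec) // m
    return [vec[i * size:(i + 1) * size] for i in range(m)]


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, iterations=10, seed=0):
    """Basic Lloyd's k-means; returns k centroids (requires k <= len(points))."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for i, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def train_pq(vectors, m, k):
    """Learn one codebook of k centroids for each of the m segments."""
    segmented = [split(v, m) for v in vectors]
    return [kmeans([segs[j] for segs in segmented], k, seed=j)
            for j in range(m)]


def encode(vec, codebooks):
    """PQ code: one centroid index per segment."""
    segments = split(vec, len(codebooks))
    return [nearest(seg, cb) for seg, cb in zip(segments, codebooks)]


def decode(code, codebooks):
    """Lossy reconstruction: concatenate the chosen centroids."""
    return [x for idx, cb in zip(code, codebooks) for x in cb[idx]]


def demo(n=300, dim=8, m=4, k=16, seed=0):
    """Compress n random vectors and report the mean reconstruction error."""
    rng = random.Random(seed)
    vectors = [[rng.random() for _ in range(dim)] for _ in range(n)]
    codebooks = train_pq(vectors, m, k)
    codes = [encode(v, codebooks) for v in vectors]
    error = sum(math.dist(v, decode(c, codebooks))
                for v, c in zip(vectors, codes)) / n
    return codes, error


if __name__ == "__main__":
    codes, error = demo()
    print("first PQ code:", codes[0])
    print("mean reconstruction error:", round(error, 3))
```

With M = 4 and K = 16, each code needs 4 × log2(16) = 16 bits, compared with 8 × 32 = 256 bits for the same 8-dimensional vector stored as 32-bit floats. The reconstruction is approximate, which is why distances computed from PQ codes are estimates.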

PQ is an effective way to compress vectors because it can achieve high compression rates while still preserving the semantic relationships between the vectors. It does so by approximately preserving the pairwise distances between the vectors, which is what matters for many NLP and ML tasks; the compression is lossy, so distances computed from PQ codes are estimates.

PQ is often used in conjunction with other vector indexing techniques, such as Inverted File Indexes (IVF) and K-Nearest Neighbors (KNN): PQ compresses the vectors, which makes them smaller and faster to search, while the other techniques further improve search performance.

The main disadvantages of using product quantization for vector indexing are that it can be sensitive to the choice of codebook size and number of subvectors, and that these hyperparameters can be difficult to tune for optimal performance.

Here are some of the advantages of using product quantization for vector indexing:

  • It is very effective at compressing vectors while preserving their semantic relationships.
  • It is fast and scalable, making it suitable for indexing large datasets of vectors.
  • It is often used in conjunction with other vector indexing techniques to further improve the search performance.