Difference between revisions of "Clustering"

(26 intermediate revisions by the same user not shown)
[https://www.youtube.com/results?search_query=~Clustering+AI YouTube]
[https://www.bing.com/news/search?q=~Clustering+AI&qft=interval%3d%228%22 ...Bing News]
* [[...cluster]] - [[AI Solver]]
* [[Embedding]] ... [[Fine-tuning]] ... [[Retrieval-Augmented Generation (RAG)|RAG]] ... [[Agents#AI-Powered Search|Search]] ... [[Clustering]] ... [[Recommendation]] ... [[Anomaly Detection]] ... [[Classification]] ... [[Dimensional Reduction]] ... [[...find outliers]]
** [[Singular Value Decomposition (SVD)]]
** [[Principal Component Analysis (PCA)]]
** [[Variational Autoencoder (VAE)]]
** [[Biclustering]]
** [[OPTICS: Ordering Points To Identify the Clustering Structure]]
** [https://en.wikipedia.org/wiki/Multidimensional_scaling Multidimensional Scaling (MDS)]
** Hierarchical; to include clustering
*** [[Hierarchical Clustering;  Agglomerative (HAC) & Divisive (HDC)]]
*** [[Hierarchical Temporal Memory (HTM)]]
* [[Excel]] ... [[LangChain#Documents|Documents]] ... [[Database|Database; Vector & Relational]] ... [[Graph]] ... [[LlamaIndex]]
  
[https://girke.bioinformatics.ucr.edu/GEN242/mydoc_Rclustering_3.html#example-2 Clustering Algorithms | Data Analysis in Genome Biology]
  
<youtube>CtKeHnfK5uA</youtube>
 
<youtube>Muf8EonV7Q0</youtube>
 
<youtube>Q7iZ-HhW50o</youtube>
 
<youtube>ZueoXMgCd1c</youtube>
 
  
= OPTICS: ordering points to identify the clustering structure =
* [https://www.dbs.ifi.lmu.de/Publikationen/Papers/OPTICS.pdf OPTICS: ordering points to identify the clustering structure (PDF) | M. Ankerst, M. Breunig, H. Kriegel, J. Sander - Institute for Computer Science, University of Munich]

Cluster analysis is a primary method for database mining. It is either used as a stand-alone tool to get insight into the distribution of a data set, e.g. to focus further analysis and data processing, or as a preprocessing step for other algorithms operating on the detected clusters. Almost all of the well-known clustering algorithms require input parameters which are hard to determine but have a significant influence on the clustering result. Furthermore, for many real data sets there does not even exist a global parameter setting for which the result of the clustering algorithm describes the intrinsic clustering structure accurately. We introduce a new algorithm for the purpose of cluster analysis which does not produce a clustering of a data set explicitly, but instead creates an augmented ordering of the database representing its density-based clustering structure. This cluster-ordering contains information which is equivalent to the density-based clusterings corresponding to a broad range of parameter settings. It is a versatile basis for both automatic and interactive cluster analysis. We show how to automatically and efficiently extract not only ‘traditional’ clustering information (e.g. representative points, arbitrary shaped clusters), but also the intrinsic clustering structure. For medium sized data sets, the cluster-ordering can be represented graphically and for very large data sets, we introduce an appropriate visualization technique. Both are suitable for interactive exploration of the intrinsic clustering structure offering additional insights into the distribution and correlation of the data.

Latest revision as of 13:13, 16 September 2023


Similarity Measures for Clusters

  • Compare the numbers of identical and unique item pairs appearing in cluster sets
  • Achieved by counting the number of item pairs found in both clustering sets (a) as well as the pairs appearing only in the first (b) or the second (c) set.
  • From these counts a similarity coefficient, such as the Jaccard index, can be computed. The Jaccard index is defined as the size of the intersection divided by the size of the union of two sample sets: a/(a+b+c).
  • For partitioning results, the Jaccard index measures how frequently pairs of items are joined together in both clustering sets and how often pairs are observed in only one set.
  • Related coefficients are the Rand Index and the Adjusted Rand Index. These indices also consider the number of pairs (d) that are not joined together in either clustering set.
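
The pair counting described in the list above can be sketched in a few lines of Python. This is a minimal illustration, assuming each clustering is given as a list of cluster labels over the same items; the function names (`joined_pairs`, `pair_counts`, `jaccard_index`, `rand_index`) are made up for this example and do not come from the linked course material. The Adjusted Rand Index is not covered here.

```python
from itertools import combinations


def joined_pairs(labels):
    """Return the set of item-index pairs that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pair_counts(labels_1, labels_2):
    """Count pairs joined in both sets (a), only the first (b),
    only the second (c), and in neither (d)."""
    if len(labels_1) != len(labels_2):
        raise ValueError("both clusterings must label the same items")
    n = len(labels_1)
    pairs_1, pairs_2 = joined_pairs(labels_1), joined_pairs(labels_2)
    a = len(pairs_1 & pairs_2)
    b = len(pairs_1 - pairs_2)
    c = len(pairs_2 - pairs_1)
    d = n * (n - 1) // 2 - a - b - c  # all remaining pairs
    return a, b, c, d


def jaccard_index(labels_1, labels_2):
    """Jaccard index a / (a + b + c); defined as 1.0 when no pair is joined."""
    a, b, c, _ = pair_counts(labels_1, labels_2)
    return a / (a + b + c) if (a + b + c) else 1.0


def rand_index(labels_1, labels_2):
    """Rand index (a + d) / (a + b + c + d), which also credits pairs kept apart."""
    a, b, c, d = pair_counts(labels_1, labels_2)
    total = a + b + c + d
    return (a + d) / total if total else 1.0


first = [0, 0, 0, 1, 1, 1]
second = [0, 0, 1, 1, 2, 2]
print(pair_counts(first, second))    # (2, 4, 1, 8)
print(jaccard_index(first, second))  # 2/7
print(rand_index(first, second))     # 10/15
```

Enumerating all pairs costs quadratic time in the number of items, which is fine for illustration but not for large data sets.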

Clustering Algorithms | Data Analysis in Genome Biology


Unsupervised Learning

The main types of clustering in unsupervised machine learning include K-means, hierarchical clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Gaussian Mixture Models (GMM).
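
As an illustration of the first of these, K-means can be sketched with the standard assign-then-update loop (Lloyd's algorithm). The function name `kmeans`, the sample points, and the choice to pass the initial centroids in explicitly are assumptions made for this example; real implementations usually pick the starting centroids at random or with a seeding scheme.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate nearest-centroid assignment and centroid updates
    until the assignments stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignments, centroids = kmeans(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(assignments)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```

The result depends on the starting centroids, so in practice the algorithm is usually run several times with different initialisations.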

News Headlines With Text Clustering

One way to cluster news headlines with unsupervised learning is to use a model that automatically extracts latent information from news articles with pre-determined topics. Such a model can use techniques such as Doc2vec to generate a vector for each article. A clustering algorithm such as spectral clustering can then group the articles by similarity. This approach removes the need for humans to label news items manually. Another approach is to fine-tune pre-trained models without labels for text clustering, simultaneously learning text representations and cluster assignments with a clustering-oriented loss.
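
The vectorize-then-group pipeline can be illustrated with a deliberately simplified stand-in. The sketch below is not Doc2vec or spectral clustering: it swaps in plain word-count vectors and merges headlines whose cosine similarity reaches a threshold. The stop-word list, the threshold of 0.3, the sample headlines, and all function names are assumptions chosen only to keep the example self-contained.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "in", "on", "of", "to", "for", "and", "as", "after"}


def vectorize(headline):
    """Turn a headline into a sparse word-count vector."""
    tokens = re.findall(r"[a-z]+", headline.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def cluster_headlines(headlines, threshold=0.3):
    """Merge headlines into groups whenever a pair is similar enough."""
    vectors = [vectorize(h) for h in headlines]
    labels = list(range(len(headlines)))  # every headline starts alone
    for i in range(len(headlines)):
        for j in range(i + 1, len(headlines)):
            if labels[i] != labels[j] and cosine(vectors[i], vectors[j]) >= threshold:
                old, new = labels[j], labels[i]
                labels = [new if label == old else label for label in labels]
    # Renumber clusters 0..k-1 in order of first appearance.
    remap = {}
    return [remap.setdefault(label, len(remap)) for label in labels]


headlines = [
    "Central bank raises interest rates again",
    "Interest rates rise as central bank fights inflation",
    "Local team wins championship final",
    "Championship final ends in dramatic win for local team",
]
print(cluster_headlines(headlines))  # [0, 0, 1, 1]
```

Word counts only match headlines that share exact words, which is the limitation learned embeddings such as Doc2vec are meant to address.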


Feature Extraction

Feature extraction is an efficient way to alleviate the problems of high-dimensional data. Unsupervised feature extraction projects high-dimensional data into a low-dimensional subspace while preserving similarity, generating low-dimensional features without considering any explicit semantic labels. This can be done with unsupervised learning methods such as transformations (e.g., PCA/ICA/NMF), embeddings (e.g., t-distributed stochastic neighbor embedding, t-SNE), cluster-based methods (e.g., k-means), and kernel-based methods (e.g., kernel PCA).
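
As a small illustration of projecting data onto a lower-dimensional subspace, the sketch below finds the first principal component of a data set by power iteration on its covariance matrix and projects each row onto it. It is a minimal example under stated assumptions: the function names and the sample data are invented here, only one component is extracted, and the starting vector is assumed not to be orthogonal to that component. Library PCA implementations typically use a singular value decomposition instead.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return the column means and the unit vector of greatest variance."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    centered = [[r[d] - means[d] for d in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    v = [1.0 / math.sqrt(dims)] * dims  # assumed starting direction
    for _ in range(iterations):
        # Power iteration: repeatedly apply the covariance matrix and renormalise.
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # data has no variance along the current direction
        v = [x / norm for x in w]
    return means, v


def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [
        sum((value - mean) * c for value, mean, c in zip(row, means, component))
        for row in rows
    ]


# Three-dimensional points that all lie on one line, so one feature suffices.
data = [(1.0, 2.0, 0.0), (2.0, 4.0, 0.0), (3.0, 6.0, 0.0), (4.0, 8.0, 0.0)]
means, component = first_principal_component(data)
print(component)
print(project(data, means, component))
```

The projected coordinates keep the ordering and spacing of the original points, which is the sense in which the low-dimensional features preserve similarity.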