Difference between revisions of "Database"
Revision as of 09:11, 17 August 2023
Databases are fundamental to training all sorts of Machine Learning (ML) and artificial intelligence (AI) models. They provide a consistent and reliable way to store data, but much of their value stems from their data management functionalities: storing, managing, and analyzing the large datasets needed to train accurate and effective models. ML and other AI techniques, in turn, provide the means for enhancing these functionalities towards increased scalability and intelligence in managing very large datasets. AI databases are a fast-emerging database approach dedicated to creating better machine-learning and deep-learning models and then training them faster and more efficiently. They integrate AI technologies to provide value-added services.
Vector vs Relational vs Graph
Different databases are designed to handle specific types of data and queries:
- Vector Database:
  - Also known as a similarity search database, a vector database is optimized for storing and retrieving high-dimensional vectors that represent complex data.
  - It is particularly suited for tasks that involve similarity searches, such as recommendation systems, image and audio recognition, and natural language processing tasks.
  - It uses specialized indexing and search algorithms to perform similarity searches efficiently in high-dimensional spaces.
  - Examples: Pinecone, Weaviate, Milvus, Marqo. Faiss and Annoy are similarity-search libraries rather than full databases, but they are used for the same kind of search.
- Relational Database:
  - A relational database stores data in structured tables with predefined schemas, where each table has columns and rows.
  - It uses the Structured Query Language (SQL) to manage and query the data.
  - It is suited to structured, tabular data and is widely used in business applications, content management systems, and data warehousing.
  - Examples: MySQL, PostgreSQL, Oracle Database.
- Graph Database:
  - A graph database stores and manages data as graph structures consisting of nodes and edges.
  - Nodes represent entities, and edges represent relationships between these entities.
  - It is well suited to data with complex relationships and is used for applications such as social networks, recommendation systems, fraud detection, and knowledge graphs.
  - Examples: Neo4j, Amazon Neptune, JanusGraph.
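The difference between the three models is easiest to see in how each one is queried. The following is a minimal sketch using only the Python standard library. The table, the graph, and the two-dimensional "embeddings" are made-up illustrative data, and a plain dictionary stands in for a real graph or vector database.

```python
import math
import sqlite3

# Relational: structured rows, queried with SQL (in-memory SQLite for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (id, name, price) VALUES (?, ?, ?)",
    [(1, "lamp", 20.0), (2, "desk", 150.0), (3, "chair", 80.0)],
)
cheap = [row[0] for row in conn.execute(
    "SELECT name FROM products WHERE price < 100 ORDER BY name")]

# Graph: nodes and edges, queried by following relationships.
# A dict of adjacency lists stands in for a graph database here.
follows = {"ana": ["ben", "cy"], "ben": ["cy"], "cy": ["dee"], "dee": []}

def friends_of_friends(graph, start):
    """Nodes reachable in two hops, excluding start and its direct neighbors."""
    direct = set(graph[start])
    two_hops = {n for d in direct for n in graph[d]}
    return sorted(two_hops - direct - {start})

# Vector: items are points in space, queried by similarity to a query vector.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-d embeddings; real embeddings have hundreds of dimensions.
embeddings = {"lamp": [0.9, 0.1], "desk": [0.2, 0.8], "chair": [0.3, 0.7]}

def most_similar(query, items):
    """Name of the item whose vector has the highest cosine similarity to query."""
    return max(items, key=lambda name: cosine(query, items[name]))

print(cheap)                                  # exact match on a column predicate
print(friends_of_friends(follows, "ana"))     # traversal over relationships
print(most_similar([0.1, 0.9], embeddings))   # nearest item by direction
```

The relational query filters on exact column values, the graph query walks edges, and the vector query ranks items by closeness, which is why each workload tends to get its own kind of index.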
Vector
- Microsoft
- How vector similarity search works | Labelbox
- What is vector search | Elastic.co
- What are vector databases | Amazon
A vector database is a specialized type of database designed to store vector embeddings and retrieve them quickly through similarity search. Vector embeddings are high-dimensional numeric representations of data items, such as images, documents, or user profiles, that capture their semantic or contextual meaning.
Here's an overview of how a vector database indexes, stores, and searches vector embeddings:
- Vector Embeddings Generation: Data items are first converted into vectors using feature extraction or embedding techniques. For example, images can be represented as vectors using convolutional neural networks (CNNs), and text documents can be represented as vectors using word embeddings or sentence embeddings[3].
- Indexing: The vector embeddings are indexed using specialized data structures and algorithms to enable efficient search and retrieval. Common index types include flat indexes for exact k-nearest neighbor (k-NN) search, hierarchical navigable small world (HNSW) graph indexes, and inverted file (IVF) indexes [6]. These indexes organize the vectors so that the nearest neighbors of a query can be found quickly under a chosen similarity measure.
- Similarity Search: To perform similarity search, a query vector is used to find the most similar vectors in the vector database. The query vector represents the desired information or criteria. The similarity between vectors is measured using metrics such as cosine similarity or Euclidean distance. Because comparing the query against every stored vector is slow at scale, vector databases typically use approximate nearest neighbor (ANN) search, which trades a small amount of accuracy for speed through techniques like hashing, quantization, or graph-based search.
- Retrieval: The vector database returns the vectors that are most similar to the query vector, usually together with the identifiers of the items they represent. These results can be used for various purposes, such as recommendation systems, content-based search, or clustering.
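The four steps above can be sketched end to end in a few lines of Python. This is an illustration under stated assumptions, not how any particular product works: `embed` is a toy bag-of-words function over a made-up vocabulary standing in for a learned embedding model, and `FlatIndex` is a hypothetical exact (brute-force) index that scans every stored vector.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary; a real system would use a learned embedding model.
VOCAB = ["cat", "dog", "pet", "car", "engine", "road"]

def embed(text):
    """Step 1: turn text into a vector (word counts over VOCAB, L2-normalized)."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

class FlatIndex:
    """Step 2: an exact index that simply stores every vector."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id, vector):
        self.ids.append(item_id)
        self.vectors.append(vector)

    def search(self, query, k=2):
        """Steps 3 and 4: score every stored vector, return the top k ids."""
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (sum(q * v for q, v in zip(query, vec)), item_id)
            for item_id, vec in zip(self.ids, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]

docs = {
    "d1": "the cat is a pet",
    "d2": "the dog is a pet",
    "d3": "the car has an engine",
}
index = FlatIndex()
for doc_id, text in docs.items():
    index.add(doc_id, embed(text))

results = index.search(embed("dog pet"), k=2)
print(results)  # d2 ranks first, d1 second; d3 shares no words with the query
```

A flat scan like this is exact but costs one comparison per stored vector, which is the cost that HNSW, IVF, and other ANN indexes are designed to avoid.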
The main advantages of a vector database are its ability to handle high-dimensional vector embeddings, perform fast similarity search with little loss of accuracy, and scale flexibly when working with complex data. Vector databases are particularly useful in fields such as natural language processing (NLP), computer vision, and other AI applications where vector embeddings are commonly used.
Overall, a vector database combines specialized algorithms, indexing techniques, and similarity measures to store vector embeddings efficiently and search them by similarity. It enables applications to work effectively with vector data and extract insights from high-dimensional representations of data items.
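To make the hashing flavor of approximate nearest neighbor search concrete, here is a minimal sketch of random-hyperplane locality-sensitive hashing (LSH) in plain Python. It is one simple ANN technique chosen for illustration, not the method of any specific database; the class name `HyperplaneLSH`, the number of planes, and the random test data are all assumptions.

```python
import math
import random
from collections import defaultdict

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class HyperplaneLSH:
    """Buckets vectors by which side of each random hyperplane they fall on."""

    def __init__(self, dim, n_planes=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_planes)]
        self.buckets = defaultdict(list)

    def _key(self, vec):
        # One bit per hyperplane: the sign of the projection onto its normal.
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0.0
                     for plane in self.planes)

    def add(self, item_id, vec):
        self.buckets[self._key(vec)].append((item_id, vec))

    def search(self, query, k=1):
        # Only the query's own bucket is scanned, so a true neighbor that
        # landed in a different bucket is missed: that is the "approximate".
        candidates = self.buckets.get(self._key(query), [])
        ranked = sorted(candidates, key=lambda item: cosine(query, item[1]),
                        reverse=True)
        return [item_id for item_id, _ in ranked[:k]]

# Illustrative data: 200 random 16-dimensional vectors.
data_rng = random.Random(1)
vectors = {i: [data_rng.gauss(0.0, 1.0) for _ in range(16)] for i in range(200)}

lsh = HyperplaneLSH(dim=16)
for item_id, vec in vectors.items():
    lsh.add(item_id, vec)

# Querying with a stored vector always finds it, since it hashes to its own bucket.
print(lsh.search(vectors[42], k=1))
```

Vectors pointing in similar directions tend to fall on the same side of most hyperplanes, so they tend to share a bucket. Production systems usually query several hash tables or neighboring buckets to reduce misses.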
Offerings
Pinecone
- Pinecone
- Learn vector database | Pinecone
- Pinecone LangChain Intro
- Demo using Colab
- Introduction
- LangChain
- OpenAI Integration ...utilize OpenAI for generating language embeddings, which can then be stored in Pinecone and used for Semantic Search
- Find the code here
Key concepts:
- Vector search - Unlike traditional search methods that revolve around keywords, vector search indexes and searches ML-generated representations of data (vector embeddings) to find the items most similar to the query.
- Vector embeddings - Vector embeddings, or "vectors," are sets of floating-point numbers that represent objects. They are generated by embedding models trained to capture the semantic similarity of objects in a given set.
Weaviate
- What is a vector database | Weaviate
Marqo
- Meet Marqo, an open source vector search engine for AI applications | Paul Sawers - TechCrunch ... Vector generation, storage, and retrieval through a single API
Marqo is an end-to-end vector search engine and database that enables users to store and query unstructured data such as text, images, and code using vectors. It provides a single API for vector generation, storage, and retrieval.
Marqo works by generating vectors for unstructured data and storing them in a database. The vectors are generated using deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) that are trained on large amounts of data. When a user queries the database, Marqo generates a vector for the query and compares it to the stored vectors using similarity metrics such as cosine similarity, then returns the most similar items. Marqo is designed to support multimodal vector search: vectors for each type of data (text, images, code) are stored in the same database, so a single query can find similar items across data types.
Marqo also uses continuous learning technology to improve the relevance of search results based on user engagement. The search engine learns from user behavior such as clicks, "add to cart" actions, and other engagement metrics, which allows it to adapt automatically to changing user needs and preferences.
Relational
In-database Machine Learning
In-database machine learning refers to the ability to build and train Machine Learning (ML) models directly within a database, using the data that already resides there. This approach eliminates the need to move data out of the database and into a separate analytics engine, which saves time and reduces costs and makes building and training ML models simpler, faster, and more efficient.
Some of the benefits of in-database machine learning include:
- Simplicity: Since you're starting with tools and data you're already familiar with, it's easier for you and your employees to get started with Machine Learning (ML).
- Speed: Because the algorithms run inside the database, data movement is minimized and you can build and train models faster, which saves time and costs.
- Ease of deployment: Models built in the database are easier to deploy and operationalize, allowing you to see results faster.
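As a small illustration of the idea, the sketch below fits and applies a one-variable linear regression entirely in SQL, using Python's built-in `sqlite3` module as a stand-in database. SQLite has no ML features of its own, so the closed-form least-squares formula is written out with ordinary aggregates; the `sales` table, its columns, and the data are made up for the example. Products such as those listed below expose far richer functionality, but the principle is the same: the data never leaves the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (ad_spend REAL, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 3), (2, 5), (3, 7), (4, 9), (5, 11)],  # made-up data: revenue = 2 * ad_spend + 1
)

# "Training": ordinary least squares computed with SQL aggregates.
# slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2), intercept = mean(y) - slope * mean(x)
conn.execute("""
    CREATE TABLE model AS
    SELECT slope, avg_y - slope * avg_x AS intercept
    FROM (
        SELECT
            (COUNT(*) * SUM(ad_spend * revenue) - SUM(ad_spend) * SUM(revenue))
              / (COUNT(*) * SUM(ad_spend * ad_spend) - SUM(ad_spend) * SUM(ad_spend))
              AS slope,
            AVG(ad_spend) AS avg_x,
            AVG(revenue) AS avg_y
        FROM sales
    )
""")

slope, intercept = conn.execute("SELECT slope, intercept FROM model").fetchone()

# "Inference": the stored model is applied with another SQL query.
prediction = conn.execute(
    "SELECT slope * ? + intercept FROM model", (6.0,)
).fetchone()[0]

print(slope, intercept, prediction)
```

Both the fitted parameters and the prediction are produced by the database engine; the Python code only sends SQL and reads back a few numbers.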
There are several databases that support in-database machine learning:
- Amazon Redshift: a managed, petabyte-scale data warehouse service designed to make it simple and cost-effective to analyze all of your data using your existing business intelligence tools. Amazon Redshift ML is designed to make it easy for SQL users to create, train, and deploy Machine Learning (ML) models using SQL commands.
- BlazingSQL: a GPU-accelerated SQL engine built on top of the RAPIDS ecosystem, available as an open-source project and a paid service. RAPIDS is a suite of open source software libraries and APIs, incubated by Nvidia, that uses CUDA and is based on the Apache Arrow columnar memory format.
- Brytlyt: a GPU database and analytics platform that provides real-time insights on large and streaming datasets. According to the vendor, it uses patent-pending IP and the power of GPUs to deliver results up to 1,000x faster than legacy systems.
- Google Cloud BigQuery: a fully managed, cloud-native data warehouse that enables very fast SQL queries using the processing power of Google's infrastructure. Its BigQuery ML feature lets users create and run ML models with SQL queries.
- IBM Db2 Warehouse: a software-defined data warehouse for private and virtual clouds that supports Docker container technology. It provides scalable, elastic, and flexible deployment options for analytics workloads.
- Kinetica: an active analytics platform that combines historical and streaming data analysis, location intelligence, and Machine Learning (ML)-powered Predictive Analytics.
- Microsoft SQL Server: a relational database management system developed by Microsoft. It supports in-database machine learning through its Machine Learning (ML) Services component, which allows you to run R and Python scripts within the database.
- Oracle Database: a multi-model database management system produced and marketed by Oracle Corporation. It supports in-database machine learning through its Oracle Machine Learning (ML) component, which allows you to build and deploy Machine Learning (ML) models within the database.
Supporting AI
Databases like MySQL, Apache Cassandra, PostgreSQL, and Couchbase provide features and capabilities that can be leveraged to support AI implementations. They offer the scalability and high availability needed to handle massive amounts of data for AI applications, and some can improve the accuracy of predictions and actions. Here are some examples:
- MySQL: HeatWave is a fully managed service for the MySQL database from Oracle and has built-in support for machine learning (HeatWave ML). HeatWave ML fully automates the process of training a model, generating inferences, and invoking explanations, all without extracting the data or the model from the database. Users invoke all the machine learning capabilities through familiar SQL interfaces.
- Apache Cassandra: a powerful and scalable distributed database that has emerged as a go-to choice for AI applications at companies including Uber, Netflix, and Priceline. It provides a foundation for two of the most important data management categories for real-time AI, features and events, enabling the delivery of highly accurate insights based on the right data at the right time.
- PostgreSQL: Azure Database for PostgreSQL Flexible Server and Azure Cosmos DB for PostgreSQL support the pgvector extension. With pgvector, customers can store embeddings (vectors created by generative AI models that represent the semantic meaning of textual data) in PostgreSQL databases and run efficient similarity searches over them.
- Couchbase: an open-source, distributed, document-focused engagement database. No specific information about Couchbase's AI support was found for this page, but it does offer enterprise-grade support services to help users understand or troubleshoot Couchbase products.
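For the PostgreSQL case, the sketch below shows roughly what storing and searching embeddings with pgvector looks like. It only builds the SQL statements as Python strings and does not connect to a server, so it runs anywhere. The `documents` table, its columns, and the three-dimensional vectors are hypothetical; the `vector` column type, the `'[x,y,z]'` text literal, and the `<=>` cosine-distance operator come from pgvector itself. The `%s` placeholders assume a driver such as psycopg.

```python
def to_vector_literal(values):
    """Format a list of numbers as a pgvector text literal, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

# One-time setup: enable the extension and create a table with a vector column.
# The dimension (3) is tiny for illustration; real embeddings are much larger.
SETUP_SQL = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    "CREATE TABLE documents ("
    "id bigserial PRIMARY KEY, content text, embedding vector(3))",
]

# Parameterized statements; the driver substitutes the %s placeholders.
INSERT_SQL = "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)"

# '<=>' is pgvector's cosine-distance operator, so ascending order
# returns the most similar rows first.
SEARCH_SQL = (
    "SELECT id, content FROM documents "
    "ORDER BY embedding <=> %s::vector LIMIT %s"
)

insert_params = ("hello world", to_vector_literal([0.1, 0.2, 0.3]))
search_params = (to_vector_literal([0.1, 0.2, 0.25]), 5)

for statement in SETUP_SQL + [INSERT_SQL, SEARCH_SQL]:
    print(statement)
print(insert_params, search_params)
```

With a live connection, each statement would be passed to the driver's `execute` call together with its parameter tuple.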
Database Support for AI Algorithms
Databases support AI algorithms by providing a consistent and reliable way to store and manage data, which is essential for training accurate and effective AI models. Lately, database companies have been adding artificial intelligence routines into databases so that users can apply these smarter, more sophisticated algorithms to their own data stored in the database. AI algorithms are also finding a home below the surface, where they help optimize internal tasks like re-indexing or query planning. These features are often billed as adding automation because they relieve the user of housekeeping work; developers are encouraged to let them do their work and forget about them.
There is much more interest, though, in AI routines that are open to users. These machine learning algorithms can classify data and make smarter decisions that evolve and adapt over time. They can unlock new use cases and enhance the flexibility of existing algorithms. In summary, databases both supply the reliable data storage and management that AI models depend on and use AI themselves to manage very large datasets with greater scalability and intelligence.
Examples
Several database startups highlight their direct support of machine learning and other AI routines. Here are some examples:
- SingleStore: offers fast analytics for tracking incoming telemetry in real time. This data can also be scored according to various AI models as it is ingested.
- MindsDB: adds machine learning routines to standard databases like MongoDB, MariaDB, PostgreSQL, or Microsoft SQL Server. It extends SQL to include features for learning from the data already in the database to make predictions and classify objects.
- BlazingSQL: a GPU-accelerated SQL engine built on the RAPIDS ecosystem. It allows you to ETL raw data directly into GPU memory as a GPU DataFrame, and then execute relational algebra on that data, returning results directly to a GPU DataFrame.
- Brytlyt: a GPU database and analytics platform that provides real-time insights on large and streaming datasets. According to the vendor, it uses patent-pending IP and the power of GPUs to deliver results up to 1,000x faster than legacy systems.