Difference between revisions of "Data Preprocessing"

 
(43 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
|title=PRIMO.ai
 
|title=PRIMO.ai
 
|titlemode=append
 
|titlemode=append
|keywords=artificial, intelligence, machine, learning, models, algorithms, data, singularity, moonshot, Tensorflow, Google, Nvidia, Microsoft, Azure, Amazon, AWS  
+
|keywords=ChatGPT, artificial, intelligence, machine, learning, GPT-4, GPT-5, NLP, NLG, NLC, NLU, models, data, singularity, moonshot, Sentience, AGI, Emergence, Moonshot, Explainable, TensorFlow, Google, Nvidia, Microsoft, Azure, Amazon, AWS, Hugging Face, OpenAI, Tensorflow, OpenAI, Google, Nvidia, Microsoft, Azure, Amazon, AWS, Meta, LLM, metaverse, assistants, agents, digital twin, IoT, Transhumanism, Immersive Reality, Generative AI, Conversational AI, Perplexity, Bing, You, Bard, Ernie, prompt Engineering LangChain, Video/Image, Vision, End-to-End Speech, Synthesize Speech, Speech Recognition, Stanford, MIT |description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools
|description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools  
+
 
 
}}
 
}}
[http://www.youtube.com/results?search_query=Data+Preprocessing+machine+learning+ML YouTube search...]
+
[https://www.youtube.com/results?search_query=ai+Data+Preprocessing YouTube]
[http://www.google.com/search?q=Data+Preprocessing+machine+learning+ML ...Google search]
+
[https://www.quora.com/search?q=ai%20Data%20Preprocessing ... Quora]
 +
[https://www.google.com/search?q=ai+Data+Preprocessing ...Google search]
 +
[https://news.google.com/search?q=ai+Data+Preprocessing ...Google News]
 +
[https://www.bing.com/news/search?q=ai+Data+Preprocessing&qft=interval%3d%228%22 ...Bing News]
  
* [http://www.kaggle.com/rtatman/data-cleaning-challenge-json-txt-and-xls/ Data Cleaning Challenge: .json, .txt and .xls | Rachael Tatman]
+
* [[Data Science]] ... [[Data Governance|Governance]] ... [[Data Preprocessing|Preprocessing]] ... [[Feature Exploration/Learning|Exploration]] ... [[Data Interoperability|Interoperability]] ... [[Algorithm Administration#Master Data Management (MDM)|Master Data Management (MDM)]] ... [[Bias and Variances]] ... [[Benchmarks]] ... [[Datasets]]
* [[Data Cleaning]]
+
* [[Data Quality]] ...[[AI Verification and Validation|validity]], [[Evaluation - Measures#Accuracy|accuracy]], [[Data Quality#Data Cleaning|cleaning]], [[Data Quality#Data Completeness|completeness]], [[Data Quality#Data Consistency|consistency]], [[Data Quality#Data Encoding|encoding]], [[Data Quality#Zero Padding|padding]], [[Data Quality#Data Augmentation, Data Labeling, and Auto-Tagging|augmentation, labeling, auto-tagging]], [[Data Quality#Batch Norm(alization) & Standardization| normalization, standardization]], and [[Data Quality#Imbalanced Data|imbalanced data]]
* [http://scikit-learn.org/stable/modules/preprocessing.html Sklearn.preprocessing]
+
* [[Risk, Compliance and Regulation]] ... [[Ethics]] ... [[Privacy]] ... [[Law]] ... [[AI Governance]] ... [[AI Verification and Validation]]
* The Passenger Screening Kaggle challenge [http://www.kaggle.com/c/passenger-screening-algorithm-challenge/discussion/45805 1st place solution] was won in part due to data preparation/generation.
+
* [[Natural Language Processing (NLP)#Managed Vocabularies |Managed Vocabularies]]
* [http://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6 Data Pre Processing Techniques You Should Know | Maneesha Rajaratne - Towards Data Science]
+
* [[Excel]] ... [[LangChain#Documents|Documents]] ... [[Database|Database; Vector & Relational]] ... [[Graph]] ... [[LlamaIndex]]
* [http://medium.com/datadriveninvestor/machine-learning-ml-data-preprocessing-5b346766fc48 Machine Learning(ML) — Data Preprocessing | Raji Adam Bifola]
+
* [[Analytics]] ... [[Visualization]] ... [[Graphical Tools for Modeling AI Components|Graphical Tools]] ... [[Diagrams for Business Analysis|Diagrams]] & [[Generative AI for Business Analysis|Business Analysis]] ... [[Requirements Management|Requirements]] ... [[Loop]] ... [[Bayes]] ... [[Network Pattern]]
* [http://sci2s.ugr.es/most-influential-preprocessing Most Influential Data Preprocessing Algorithms | S. García, J. Luengo, F. Herrera]
+
* [[Development]] ... [[Notebooks]] ... [[Development#AI Pair Programming Tools|AI Pair Programming]] ... [[Codeless Options, Code Generators, Drag n' Drop|Codeless]] ... [[Hugging Face]] ... [[Algorithm Administration#AIOps/MLOps|AIOps/MLOps]] ... [[Platforms: AI/Machine Learning as a Service (AIaaS/MLaaS)|AIaaS/MLaaS]]
* [http://www.kdnuggets.com/2019/05/fix-unbalanced-dataset.html How to fix an Unbalanced Dataset | Will Badr - Amazon Web Services]
+
* [[Algorithm Administration#Hyperparameter|Hyperparameter]]s
* [[Datasets]]
+
* [[Strategy & Tactics]] ... [[Project Management]] ... [[Best Practices]] ... [[Checklists]] ... [[Project Check-in]] ... [[Evaluation]] ... [[Evaluation - Measures|Measures]]
* [[Data Encoding]]
+
* [[AI Solver]] ... [[Algorithms]] ... [[Algorithm Administration|Administration]] ... [[Model Search]] ... [[Discriminative vs. Generative]] ... [[Train, Validate, and Test]]
* [[Batch Norm(alization) & Standardization]]
+
* [[Python]] ... [[Generative AI with Python|GenAI w/ Python]] ... [[JavaScript]] ... [[Generative AI with JavaScript|GenAI w/ JavaScript]] ... [[TensorFlow]] ... [[PyTorch]]
* [[Feature Exploration/Learning]]
+
* [https://scale.com/ Scale] ... data collection, curation, labeling, and annotation
* [[Hyperparameters]]
+
* [https://scikit-learn.org/stable/modules/preprocessing.html Sklearn.preprocessing]
* [[Data Augmentation]]
+
* The Passenger Screening Kaggle challenge [https://www.kaggle.com/c/passenger-screening-algorithm-challenge/discussion/45805 1st place solution] was won in part due to data preparation/generation.
* [[Visualization]]
+
* [https://towardsdatascience.com/data-pre-processing-techniques-you-should-know-8954662716d6 Data Pre Processing Techniques You Should Know | Maneesha Rajaratne - Towards Data Science]
* [[Python]]
+
* [https://medium.com/datadriveninvestor/machine-learning-ml-data-preprocessing-5b346766fc48 Machine Learning(ML) — Data Preprocessing | Raji Adam Bifola]
* [[Master Data Management  (MDM) / Feature Store / Data Lineage / Data Catalog]]
+
* [https://sci2s.ugr.es/most-influential-preprocessing Most Influential Data Preprocessing Algorithms | S. García, J. Luengo, F. Herrera]
 +
* [https://www.kdnuggets.com/2019/05/fix-unbalanced-dataset.html How to fix an Unbalanced Dataset | Will Badr -] [[Amazon | Amazon Web Services]]
 +
* [https://docs.aws.amazon.com/machine-learning/latest/dg/creating-and-using-datasources.html Creating and Using Datasources |] [[Amazon | Amazon Web Services]]
 +
* [https://github.com/jontupitza Jon Tupitza Famous Jupyter Notebooks:]
 +
** [https://github.com/JonTupitza/Data-Science-Process/blob/master/01-Data-Preparation.ipynb Data Preparation 01]  
 +
** [https://github.com/JonTupitza/Data-Science-On-Ramp/blob/master/03-Data-Preparation.ipynb Data Preparation 02]  
 +
* [https://covidtracking.com/software/ The COVID Tracking Project - software used]
  
  
http://www.researchgate.net/profile/Martin_Beibel/publication/49849827/figure/fig1/AS:601681616183296@1520463484026/Overview-of-the-data-preprocessing-pipeline-The-data-preprocessing-consists-of-1_W640.jpg
+
https://www.researchgate.net/profile/Martin_Beibel/publication/49849827/figure/fig1/AS:601681616183296@1520463484026/Overview-of-the-data-preprocessing-pipeline-The-data-preprocessing-consists-of-1_W640.jpg
[http://www.researchgate.net/publication/49849827_Comparison_of_Multivariate_Data_Analysis_Strategies_for_High-Content_Screening/figures?lo=1 Article]
+
[https://www.researchgate.net/publication/49849827_Comparison_of_Multivariate_Data_Analysis_Strategies_for_High-Content_Screening/figures?lo=1 Article]
  
<youtube>0xVqLJe9_CY</youtube>
 
 
<youtube>cw2LvVkmtkQ</youtube>
 
<youtube>cw2LvVkmtkQ</youtube>
 
<youtube>TK-2189UcKk</youtube>
 
<youtube>TK-2189UcKk</youtube>
Line 38: Line 55:
 
<youtube>UuktvBOKEcE</youtube>
 
<youtube>UuktvBOKEcE</youtube>
 
<youtube>WeXXtBNtwxk</youtube>
 
<youtube>WeXXtBNtwxk</youtube>
 +
<youtube>0xVqLJe9_CY</youtube>
  
 
== Splitting Data - training and testing sets ==
 
== Splitting Data - training and testing sets ==
Line 44: Line 62:
 
<youtube>Lh1dxgxk7dw</youtube>
 
<youtube>Lh1dxgxk7dw</youtube>
  
== Time-Series Data ==
+
== [[Time]]-Series Data ==
* [http://primo.ai/index.php?title=PRIMO.ai&action=edit&section=19 Time-based Algorithms]
+
* [[Backtesting]]
* [http://blog.netsil.com/a-comparison-of-time-series-databases-and-netsils-use-of-druid-db805d471206 A Comparison of Time Series Databases and Netsil’s Use of Druid | Netsil]
+
* [https://primo.ai/index.php?title=PRIMO.ai&action=edit&section=19 Time-based Algorithms]
* [http://azure.microsoft.com/en-us/blog/microsoft-announces-the-general-availability-of-azure-time-series-insights/ Microsoft announces the general availability of Azure Time Series Insights | Ryan Waite - Microsoft]
+
* [https://blog.netsil.com/a-comparison-of-time-series-databases-and-netsils-use-of-druid-db805d471206 A Comparison of Time Series Databases and Netsil’s Use of Druid | Netsil]
* [http://www.outlyer.com/blog/top10-open-source-time-series-databases/ Top 10 Time Series Databases | Outlyer]
+
* [https://azure.microsoft.com/en-us/blog/microsoft-announces-the-general-availability-of-azure-time-series-insights/ Microsoft announces the general availability of Azure Time Series Insights | Ryan Waite - Microsoft]
 +
* [https://www.outlyer.com/blog/top10-open-source-time-series-databases/ Top 10 Time Series Databases | Outlyer]
  
 
<youtube>HYvAPjukKic</youtube>
 
<youtube>HYvAPjukKic</youtube>
Line 57: Line 76:
 
<youtube>2SUBRE6wGiA</youtube>
 
<youtube>2SUBRE6wGiA</youtube>
  
http://azurecomcdn.azureedge.net/mediahandler/acomblog/media/Default/blog/578a09a1-f144-4a62-98cb-e6e3ed774817.png
+
https://azurecomcdn.azureedge.net/mediahandler/acomblog/media/Default/blog/578a09a1-f144-4a62-98cb-e6e3ed774817.png
 +
 
 +
== Categorical Variables ==
 +
* [https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02 All about Categorical Variable Encoding | Baijayanta Roy - Towards Data Science]
 +
 
 +
Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are.  Instead, they need to be recoded into a series of variables which can then be entered into the regression model.  There are a variety of coding systems that can be used when recoding categorical variables. [https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis-2/#:~:text=Categorical%20variables%20require%20special%20attention,equation%20just%20as%20they%20are.&text=For%20example%2C%20you%20may%20want,(or%20any%20given%20level). Coding Systems for Categorical Variables In Regression Analysis | UCLA Institute for Digital Research & Education Statistical Consulting]
 +
 
 +
<youtube>YCdKs61ClV4</youtube>
 +
<youtube>7TKnHHMBTok</youtube>
 +
 
  
 
== SQL Database Optimization ==
 
== SQL Database Optimization ==
  
 +
<youtube>dUrLYznFbpQ</youtube>
 
<youtube>Rw3ewEXOKC8</youtube>
 
<youtube>Rw3ewEXOKC8</youtube>
<youtube>dUrLYznFbpQ</youtube>
 

Latest revision as of 21:30, 26 April 2024



Figure: Overview of the data preprocessing pipeline (Article)

Splitting Data - training and testing sets

Time-Series Data


Categorical Variables

Categorical variables require special attention in regression analysis because, unlike dichotomous or continuous variables, they cannot be entered into the regression equation just as they are. Instead, they need to be recoded into a series of variables which can then be entered into the regression model. There are a variety of coding systems that can be used when recoding categorical variables. Coding Systems for Categorical Variables In Regression Analysis | UCLA Institute for Digital Research & Education Statistical Consulting
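
As a rough illustration of recoding a categorical variable into a series of variables, the sketch below uses dummy (indicator) coding, which is only one of the available coding systems and is chosen here as an assumption, not because the source prescribes it. The function name dummy_code, the is_ column prefix, and the sample color values are hypothetical. A variable with k levels becomes k-1 columns of 0/1 values, and the omitted level acts as the reference category.

```python
def dummy_code(values, reference=None):
    """Recode one categorical column into k-1 indicator (0/1) columns.

    values:    list of category labels, one per observation
    reference: level to leave out (the baseline); defaults to the
               first level in sorted order (an arbitrary choice)
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    if reference not in levels:
        raise ValueError(f"unknown reference level: {reference!r}")

    # One indicator column per non-reference level.
    kept = [level for level in levels if level != reference]
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in kept
    }


# Hypothetical sample data: three levels, so two indicator columns.
color = ["red", "green", "blue", "green"]
for name, column in dummy_code(color).items():
    print(name, column)
```

Dropping one level avoids perfect collinearity with the intercept in a regression model. Other coding systems (for example effect or contrast coding) produce different columns from the same input.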


SQL Database Optimization