Watch me Build a Cybersecurity Startup

Latest revision as of 22:03, 5 December 2023

YouTube search... ...Google search


Architecture

PREVIOUS: From client's website to machine learning on AWS (backend):

  1. Customer's Data
  2. Customer's Transaction Credit Card Data @ checkout page

PROPOSED: For GDPR privacy compliance, customer data is encrypted and machine learning is performed on the client's website; only an encrypted weights file is sent to the AWS backend pipeline, giving a scalable, real-time federated learning solution.
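The proposed flow — each client trains locally and ships only a weights file, which the server aggregates — can be sketched as plain federated averaging. Everything below (array shapes, client count, the gradient step) is illustrative and not the page's actual pipeline; in the real proposal the weights would also be encrypted (e.g. via PySyft) before leaving the client:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1):
    """One gradient step of logistic regression on a client's own data."""
    z = X @ weights
    preds = 1.0 / (1.0 + np.exp(-z))      # sigmoid
    grad = X.T @ (preds - y) / len(y)     # logistic-loss gradient
    return weights - lr * grad            # updated local weights, never the raw data

def federated_average(client_weights):
    """Server side: average the weight files sent back by each client."""
    return np.mean(client_weights, axis=0)

rng = np.random.default_rng(0)
global_w = np.zeros(3)
# Four hypothetical clients, each with its own private (X, y) data.
clients = [(rng.normal(size=(20, 3)), rng.integers(0, 2, size=20)) for _ in range(4)]

for _ in range(5):                        # a few federation rounds
    updates = [local_update(global_w, X, y) for X, y in clients]
    global_w = federated_average(updates)
```

Only the `updates` arrays cross the network; the per-client `(X, y)` data never leaves the client, which is the point of the GDPR argument above.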


Transaction Processing

Client: Snippet on the client's website to fire a Lambda function on the server ---


<script async src='https://www.our-aws-app.com/analytics.js'></script>
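On the server side, the event fired by that snippet could land in a handler like the following Python Lambda sketch. The event fields (`page`, `session`) are assumptions for illustration, not the app's actual schema:

```python
import json

def handler(event, context):
    """Minimal AWS Lambda handler for analytics events POSTed by the client snippet."""
    body = json.loads(event.get("body", "{}"))
    record = {
        "page": body.get("page", "unknown"),
        "session": body.get("session"),
    }
    # In the real pipeline this record would be written on to S3 or Kinesis
    # for the Glue/Athena steps further down; here we just echo it back.
    return {"statusCode": 200, "body": json.dumps(record)}
```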



Amazon AWS Backend: Pipeline ---


Machine Learning Service for Cybersecurity

PREVIOUS: Client-Server Model

PROPOSED: Federated Learning in the cloud


Analytics Dashboard

Client: Dashboard ---

Amazon AWS Backend ---

  • AWS tools:
    • Glue, a fully managed extract, transform, and load (ETL) service to prepare and load data for analytics
    • Crawlers to populate the AWS Glue Data Catalog with tables
    • Athena, an interactive query service to analyze data in Amazon S3 using standard SQL
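The Glue → Athena step amounts to running standard SQL over the S3 data that Glue has catalogued. A sketch, where the database, table, column names, and S3 paths are made up for illustration; the `boto3` submission call is shown but commented out since it needs AWS credentials:

```python
def build_fraud_query(table: str, threshold: float = 0.9) -> str:
    """SQL that Athena could run over a Glue-catalogued analytics table."""
    return (
        f"SELECT account_id, COUNT(*) AS events, AVG(fraud_score) AS avg_score "
        f"FROM {table} GROUP BY account_id "
        f"HAVING AVG(fraud_score) > {threshold}"
    )

query = build_fraud_query("analytics_db.events")

# Submitting it would look like this (commented out: requires AWS credentials):
# import boto3
# athena = boto3.client("athena")
# athena.start_query_execution(
#     QueryString=query,
#     QueryExecutionContext={"Database": "analytics_db"},
#     ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena/"},
# )
```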


DharmaSecurity

I've built a Demo app called DharmaSecurity, a Cybersecurity fraud detection tool for businesses. Once signed up, a business pastes a code snippet into its website and gets access to a dashboard that tells it how many fraudulent accounts it has. The app uses machine learning to automatically remove suspected fraud accounts and flag likely ones for review. To build this, I use a suite of AWS tools, Python, JavaScript, a Logistic Regression (LR) model, a credit card fraud dataset, and a library called OpenMined to enable federated learning and secure multi-party computation. I've packed a lot into this video: animations, code, music, screencasts, skits, and more. Enjoy! Code for "a Cybersecurity Startup" | Siraj Raval - GitHub

Credit Card Fraud Detection | Nick Walker

Using under-sampling techniques and Logistic Regression (LR) to predict credit card fraud.

This is the Kernel submission for the Kaggle competition "Credit Card Fraud Detection". The dataset contains 28 Principal Component Analysis (PCA)-transformed features of transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 total transactions (0.172% of the total).

Because of the highly unbalanced nature of the dataset, I used a Confusion Matrix to calculate the Precision and Recall of my results. I also used the under-sampling technique to take a smaller number of the normal transactions and train a logistic regressor on them. I trained and applied the logistic regressor on all of the data, on only the under-sampled data, and then I used the logistic regressor trained on the under-sampled data and applied it to all of the data. My recall scores for each were as follows:

  • The logistic regressor trained on and applied to all of the data: 0.52
  • The logistic regressor trained on and applied to only the under-sampled data: 0.91
  • The logistic regressor trained on the under-sampled data and applied to all of the data: 0.92

As you can see from my results above, the logistic regressor trained on the under-sampled data and applied to all of the data had the best results, with a 92% recall rate. A fairly good start for applying the under-sampling technique with only a logistic regressor.
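The under-sample-then-train procedure described above can be reproduced in a few lines of scikit-learn. Synthetic imbalanced data stands in here for the Kaggle dataset, so the recall number will differ from Walker's:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(42)

# Synthetic imbalanced data: a few percent positives standing in for the fraud class.
X = rng.normal(size=(5000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=5000) > 2.2).astype(int)

# Under-sample: keep all frauds plus an equal number of normal transactions.
fraud_idx = np.flatnonzero(y == 1)
normal_idx = rng.choice(np.flatnonzero(y == 0), size=len(fraud_idx), replace=False)
keep = np.concatenate([fraud_idx, normal_idx])

# Train on the balanced subset, then evaluate recall on ALL the data,
# mirroring the best-performing configuration above.
clf = LogisticRegression().fit(X[keep], y[keep])
recall = recall_score(y, clf.predict(X))
```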

About the Dataset | Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson & Gianluca Bontempi

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
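The quoted imbalance figure is easy to verify from the two counts:

```python
frauds, total = 492, 284_807
fraud_rate_pct = 100 * frauds / total   # ≈ 0.1727, quoted as 0.172% in the text
```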

It contains only numerical input variables, which are the result of a Principal Component Analysis (PCA) transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with Principal Component Analysis (PCA); the only features which have not been transformed with PCA are 'Time' and 'Amount'.

  • Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset.
  • Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
  • Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion Matrix accuracy is not meaningful for unbalanced classification.
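The AUPRC recommendation can be followed directly with scikit-learn's `average_precision_score`. A useful property on unbalanced data: a no-skill classifier scores roughly the positive rate (about 0.0017 here), so any real signal stands out. The labels and scores below are synthetic, built only to mimic the dataset's imbalance:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Labels with roughly the dataset's imbalance: ~0.172% positives.
y_true = (rng.random(100_000) < 0.00172).astype(int)

# A weakly informative score: positives get a noisy boost.
y_score = rng.normal(size=100_000) + 2.0 * y_true

auprc = average_precision_score(y_true, y_score)
baseline = y_true.mean()   # AUPRC of random guessing ≈ the positive rate
```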

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML

Cite: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Under-sampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015