Difference between revisions of "Bioinformatics"

 
(43 intermediate revisions by the same user not shown)
Line 5: Line 5:
 
|description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools  
 
|description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools  
 
}}
 
}}
[http://www.youtube.com/results?search_query=genomics+gene+dna+artificial+intelligence+deep+learning Youtube search...]
+
[https://www.youtube.com/results?search_query=Bioinformatics+Bioinformatics+genetics+CRISPR+gene+dna+artificial+intelligence+deep+machine+learning Youtube search...]
[http://www.google.com/search?q=genomics+gene+dna+deep+machine+learning+ML ...Google search]
+
[https://www.google.com/search?q=Bioinformatics+Bioinformatics+genetics+CRISPR+gene+dna+artificial+intelligence+deep+machine+learning ...Google search]
  
 
* [[Case Studies]]
 
* [[Case Studies]]
* [[CRISPR]]
+
** [[Paleontology]]
* [[Paleontology]]
+
** [[Healthcare]]
* [[Healthcare]]
+
*** [[Pharmaceuticals]]
* [http://www.nature.com/articles/d41586-018-07225-z Machine learning spots natural selection at work in human genome | Amy Maxmen]
+
*** [[Drug Discovery]]
* [http://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html DeepVariant: Highly Accurate Genomes With Deep Neural Networks | Mark DePristo and Ryan Poplin, Google Brain Team]
+
**** [[Protein Folding & Discovery]] ...[[Protein Folding & Discovery#Google DeepMind AlphaFold|Google DeepMind AlphaFold]]
 +
** [[Life Sciences]]
 +
* [[COVID-19]]
 +
* [[Bio-inspired Computing]]
 +
* [https://www.the-odin.com/diyhumancrispr/ DIY Human CRISPR Guide | The Odin]
 +
* [https://www.nature.com/articles/d41586-018-07225-z Machine learning spots natural selection at work in human genome | Amy Maxmen]
 +
* [https://github.com/google/deepvariant DeepVariant | Google] ...an analysis pipeline that uses a [[Neural Network#Deep Neural Network (DNN)|Deep Neural Network (DNN)]] to call genetic variants from next-generation DNA sequencing data.
 +
** [https://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html DeepVariant: Highly Accurate Genomes With Deep Neural Networks | Mark DePristo and Ryan Poplin, Google Brain Team]
 +
 
 +
An interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret the biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.
 +
 
 +
<center><b><i>...all biology is computational biology </i></b> - Florian Markowetz </center>
 +
 
 +
 
 +
= Bioinformatics Pipelines =
 +
Bioinformatics includes biological studies that use computer programming as part of their methodology, as well as specific analysis "pipelines" that are repeatedly used, particularly in the field of genomics. Common uses of bioinformatics include the identification of candidate genes and single nucleotide polymorphisms (SNPs). Often, such identification is made with the aim of better understanding the genetic basis of disease, unique adaptations, desirable properties (esp. in agricultural species), or differences between populations. In a less formal way, bioinformatics also tries to understand the organisational principles within nucleic acid and protein sequences, called proteomics. [https://en.wikipedia.org/wiki/Bioinformatics Wikipedia]
 +
 
 +
https://1.bp.blogspot.com/-VPe_Ao5PK5w/WiGcSnsyQTI/AAAAAAAACOg/HR_kaM05HUoP-556x1a25g-CjB7XD_BxgCLcBGAs/s640/image1.png
 +
 
  
 
<youtube>s6rJLXq1Re0</youtube>
 
<youtube>s6rJLXq1Re0</youtube>
Line 21: Line 40:
 
<youtube>JYt1IqdDAPc</youtube>
 
<youtube>JYt1IqdDAPc</youtube>
 
<youtube>uC3SfnbCXmw</youtube>
 
<youtube>uC3SfnbCXmw</youtube>
 +
<youtube>QJLQBSQJEus</youtube>
 +
 +
 +
== <span id="Using Computer Code to Decipher Genetic Code"></span>Using Computer Code to Decipher Genetic Code: Bioinformatics 101 ==
 +
* [[Drug Discovery#Drug Discovery using Python| Drug Discovery using Python | Chanin Nantasenamat]]
 +
 +
<youtube>p5iZxIT16KQ</youtube>
 +
<youtube>ua08NV58Gew</youtube>
 +
 +
== <span id="ROSALIND Platform"></span>ROSALIND Platform ==
 +
* [https://rosalind.info ROSALIND Platform]
 +
* [[Drug Discovery#Drug Discovery using Python| Drug Discovery using Python | Chanin Nantasenamat]]
 +
 +
Learning bioinformatics usually requires solving computational problems of varying difficulty that are extracted from real challenges of molecular biology. To make learning bioinformatics fun and easy, we have founded Rosalind, a platform for learning bioinformatics through problem solving.
 +
[https://rosalind.info ROSALIND] offers an array of intellectually stimulating problems that grow in biological and computational complexity; each problem is checked automatically, so that the only resource required to learn bioinformatics is an internet connection. [https://rosalind.info ROSALIND] also promises to facilitate improvements in standard bioinformatics education by providing a vital teaching aid and a central homework resource. [https://rosalind.info ROSALIND] is inspired by [https://projecteuler.net/ Project Euler], [https://codingcompetitions.withgoogle.com/codejam Google Code Jam], and the ever growing movement of free online courses. The project's name commemorates [https://en.wikipedia.org/wiki/Rosalind_Franklin Rosalind] Franklin, whose [https://en.wikipedia.org/wiki/X-ray_crystallography X-ray crystallography] with [https://en.wikipedia.org/wiki/Raymond_Gosling Raymond Gosling] facilitated the discovery of the DNA double helix by [https://en.wikipedia.org/wiki/James_Watson Watson] and [https://en.wikipedia.org/wiki/Francis_Crick Crick]. [https://rosalind.info/ ROSALIND]
 +
 +
<youtube>6oHl7hNWn2o</youtube>
 +
 +
== <span id="CRISPR"></span>CRISPR ==
 +
[https://www.youtube.com/results?search_query=CRISPR+gene+dna+artificial+intelligence+deep+machine+learning Youtube search...]
 +
[https://www.google.com/search?q=CRISPR+gene+dna+artificial+intelligence+deep+machine+learning ...Google search]
 +
 +
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) is a revolutionary gene-editing technology that allows scientists to make precise changes to the DNA of living organisms. A CRISPR kit typically contains the necessary reagents and tools to perform CRISPR gene editing in a laboratory setting.
 +
 +
 +
<youtube>p5G5aMnExpI</youtube>
 +
<youtube>8mVua5bBGps</youtube>
 +
<youtube>5uVvVmZPfzo</youtube>
 +
<youtube>61Uvlt8TUfY</youtube>
 +
 +
=== CRISPR Kit ===
 +
* [https://www.bio-rad.com/en-us/category/crispr-gene-editing-kits?ID=Q0JG5VTU86LJ CRISPR Gene Editing Kits - Bio-Rad]
 +
* [https://www.synthego.com Synthego | Genome Engineering]
 +
* [https://www.sigmaaldrich.com/US/en/product/sigma/hsoligoint Sigma-Aldrich | MilliporeSigma]
 +
* [https://www.creative-biogene.com/crispr-cas9/products/kits-686.html CRISPR/Cas9 Kits - Creative Biogene CRISPR/Cas9 Platform]
 +
* [https://www.sigmaaldrich.com/US/en/product/sigma/CRISPR CRISPR/Cas9 Products and Services | MilliporeSigma]
 +
 +
There are several companies that offer CRISPR kits for educational and research purposes. For example, Bio-Rad offers educational CRISPR gene editing kits that allow students to perform real CRISPR-Cas9 gene editing in the classroom using familiar and safe reagents, techniques, and organisms. Synthego offers an ecosystem of synthetic RNA solutions for CRISPR genome engineering, including engineered cells and CRISPRevolution products. Sigma-Aldrich offers a CRISPR Integration Kit that provides the essential genome editing reagents necessary to integrate a BstNI restriction site to the human KRAS locus.
 +
 +
 +
<youtube>4vmHweDC5SY</youtube>
 +
<youtube>0OTQtlHRPWc</youtube>
 +
<youtube>4u0ZinOaOLg</youtube>
 +
 +
=== CRISPR Explained ===
 +
<youtube>TdBAHexVYzc</youtube>
 +
<youtube>TTHu63Fr_lI</youtube>
 +
<youtube>6tw_JVz_IEc</youtube>
 +
<youtube>h7YWqgXheYQ</youtube>
 +
 +
== PrimateAI-3D ==
 +
* [https://www.illumina.com/ Illumina, Inc.]
 +
* [https://github.com/Illumina/PrimateAI-3D GitHub - Illumina/PrimateAI-3D]
 +
* [https://www.illumina.com/science/genomics-research/articles/primateai-3d.html Improving genetic risk prediction and drug target discovery using ....]
 +
* [https://thenextweb.com/news/ai-trained-on-ape-dna-predicts-genetic-disease-risks-humans AI trained on ape DNA predicts genetic disease risks for humans | Ioanna Lykiardopoulou - TNW] ... Our primate relatives can teach us a lot about our own genes
 +
* [https://finance.yahoo.com/news/illumina-takes-ai-genomics-launch-145421268.html Illumina Takes AI to Genomics: Launch of PrimateAI-3D for Accurate Disease Prediction | Nabaparna Bhattacharya - Yahoo!Finance]
 +
* [https://www.washingtonpost.com/science/2023/06/01/primate-ai-genome-variants New AI tool searches genetic haystacks to find disease-causing variants ....]
 +
* [https://www.science.org/doi/10.1126/science.abo1131 Rare penetrant mutations confer severe risk of common diseases.]
 +
 +
PrimateAI-3D is built on deep-learning language architectures similar to those used in [[ChatGPT]], but designed to model genomic rather than linguistic sequences. The team used natural selection to train its parameters, by presenting it with mutations that are ruled out for disease in our primate relatives. This way, the algorithm learned to recognize benign genetic variants and, by process of elimination, mutations that are likely to cause disease.
 +
 +
Then the scientists applied PrimateAI-3D to identify potentially harmful mutations in humans, using health records and gene variant data of over 400,000 people who have donated samples to the [https://www.ukbiobank.ac.uk/ UK Biobank project]. They found that the algorithm showed “impressive improvements” in predicting humans’ increased genetic risk for common diseases.
 +
 +
PrimateAI-3D is a deep-learning network developed by Illumina that is trained on 4.5 million common genetic variants from 233 primate species. This state-of-the-art classifier accurately quantifies missense variant pathogenicity in humans, which improves discovery of genes affecting clinical phenotypes. It is used to improve genetic risk prediction and drug target discovery. The algorithm scans about 70 million genetic variants, a selection that is more than 1,000 times as large as ClinVar. The 3D in the name refers to the three-dimensional structure of proteins, a key factor in distinguishing which mutations will wreak havoc.
 +
 +
== What Came First, Cells or Viruses? ==
 +
[https://www.youtube.com/results?search_query=Phylogenomic+genomics+gene+dna+artificial+intelligence+deep+learning Youtube search...]
 +
[https://www.google.com/search?q=Phylogenomic+genomics+gene+dna+deep+machine+learning+ML ...Google search]
 +
 +
* [[Markov Model (Chain, Discrete Time, Continuous Time, Hidden)]]
 +
* [https://www.di.ens.fr/~fbach/courses/fall2013/phyloHMM.pdf Phylogenetic Hidden Markov Models | Adam Siepel and David Haussler]
 +
* [https://evolution.genetics.washington.edu/phylip/software.html Phylogeny Programs | Joe Felsenstein - University of Washington] Here are 392 phylogeny packages and 54 free web servers  ... [https://evolution.genetics.washington.edu/phylip.html PHYLIP (the PHYLogeny Inference Package)] Methods that are available in the package include parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete characters.
 +
 +
<youtube>rX4YEXQVQKI</youtube>
 +
 +
== Virus & Consciousness  ==
 +
* [https://www.livescience.com/61627-ancient-virus-brain.html An Ancient Virus May Be Responsible for Human Consciousness | Rafi Letzter - Live Science]
 +
* [https://www.livescience.com/26505-human-genome-milestones.html Unraveling the Human Genome: 6 Molecular Milestones | Stephanie Pappas - Live Science]
 +
 +
Long ago, a virus bound its genetic code to the genome of four-limbed animals. That snippet of code is still very much alive in humans' brains today, where it does the very viral task of packaging up genetic information and sending it from nerve cells to their neighbors in little capsules that look a whole lot like viruses themselves. And these little packages of information might be critical elements of how nerves communicate and reorganize over time — tasks thought to be necessary for higher-order thinking...
 +
 +
Though it may sound surprising that bits of human genetic code come from viruses, it's actually more common than you might think: A review published in Cell in 2016 found that between 40 and 80 percent of the human genome arrived from some archaic viral invasion.
 +
 +
<youtube>cXGw4sqbJKM</youtube>
 +
<youtube>YlBgd_Mi8-I</youtube>
 +
<youtube>nWuV6PVKv1A</youtube>
 +
<youtube>FmX8au0xGlY</youtube>
 +
 +
== Bioinformatics Project from Scratch  ==
 +
* [https://peerj.com/articles/2322/ Probing the origins of human acetylcholinesterase inhibition via QSAR modeling and molecular docking | S. Simeon, N. Anuwongcharoen, W. Shoombuatong, A. Malik, V. Prachayasittikul, J. E.S. Wikberg, and C. Nantasenamat]
 +
 +
* [https://www.youtube.com/channel/UCV8e2g4IWQqK71bbzGDEI4Q Data Professor series:]
 +
 +
{|<!-- T -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>plVLRashaA8</youtube>
 +
<b>Part 1
 +
</b><br>I have shown you how to collect an original dataset in biology that you can use in your Data Science Project. In particular, I have demonstrated how to download and pre-process the biological activity data from the ChEMBL database. The dataset comprises compounds (molecules) that have been biologically tested for their activity towards a target organism/protein of interest
 +
|}
 +
|<!-- M -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>qWVTxfLq2ak</youtube>
 +
<b>Part 2
 +
</b><br>I have shown you how to calculate Lipinski descriptors (molecular descriptors proposed by Christopher Lipinski for predicting whether a molecule is likely to be drug-like) and perform Exploratory Data Analysis on these descriptors. In particular, the EDA is based on making simple box plots and scatter plots to discern differences between the active and inactive sets of compounds
 +
|}
 +
|}<!-- B -->
 +
{|<!-- T -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>zD2focOkQ48</youtube>
 +
<b>Part 3
 +
</b><br>I have changed the target protein to Acetylcholinesterase, as it provides a larger dataset to work with. We have already computed the molecular descriptors using the PaDEL-Descriptor software and prepared the dataset (X and Y dataframes) that will be used in this video for Model Building
 +
|}
 +
|<!-- M -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>wGaGm0sj04M</youtube>
 +
<b>Part 4
 +
</b><br>I have shown you how to use the computed molecular descriptors from Part 3 (as the X variables) to build a regression model for predicting the pIC50 values (the Y variable)
 +
|}
 +
|}<!-- B -->
 +
{|<!-- T -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>wGaGm0sj04M</youtube>
 +
<b>Part 5
 +
</b><br>This video is part of a multi-part series on building a Bioinformatics Project from scratch. In this video, I will show you how to quickly build and compare several regression models (quantitative structure-activity relationship, or QSAR) of the Acetylcholinesterase inhibitors using the lazypredict library in Python
 +
|}
 +
|<!-- M -->
 +
| valign="top" |
 +
{| class="wikitable" style="width: 550px;"
 +
||
 +
<youtube>m0sePkuyTKs</youtube>
 +
<b>Part 6
 +
</b><br>I will show you how to deploy the machine learning model as a web app. Essentially, this web app will serve as a Bioinformatics tool that will allow users the ability to predict whether a compound of interest has favorable biological activity against the target protein or not
 +
|}
 +
|}<!-- B -->

Latest revision as of 12:38, 27 July 2023

Youtube search... ...Google search

An interdisciplinary field that develops methods and software tools for understanding biological data, in particular when the data sets are large and complex. As an interdisciplinary field of science, bioinformatics combines biology, computer science, information engineering, mathematics and statistics to analyze and interpret the biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.

...all biology is computational biology - Florian Markowetz


Bioinformatics Pipelines

Bioinformatics includes biological studies that use computer programming as part of their methodology, as well as specific analysis "pipelines" that are repeatedly used, particularly in the field of genomics. Common uses of bioinformatics include the identification of candidate genes and single nucleotide polymorphisms (SNPs). Often, such identification is made with the aim of better understanding the genetic basis of disease, unique adaptations, desirable properties (esp. in agricultural species), or differences between populations. In a less formal way, bioinformatics also tries to understand the organisational principles within nucleic acid and protein sequences, called proteomics. Wikipedia
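
As a minimal sketch of the SNP identification mentioned above, the following Python compares a sample sequence against a reference and reports single-base substitutions. It assumes the two sequences are already aligned and of equal length; the function name find_snps and both example sequences are hypothetical and chosen only for illustration. Real pipelines call variants from sequencing reads with dedicated tools rather than by direct string comparison.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each substitution."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned and of equal length")
    snps = []
    pairs = zip(reference.upper(), sample.upper())
    for position, (ref_base, alt_base) in enumerate(pairs, start=1):
        # Ignore gaps and ambiguity codes; keep only clean A/C/G/T substitutions.
        if ref_base != alt_base and ref_base in "ACGT" and alt_base in "ACGT":
            snps.append((position, ref_base, alt_base))
    return snps


# Hypothetical aligned sequences that differ at positions 4 and 8.
reference = "ATGCCGTA"
sample = "ATGACGTT"
print(find_snps(reference, sample))  # [(4, 'C', 'A'), (8, 'A', 'T')]
```

Positions are reported 1-based here because that is how variant coordinates are usually written; this is a design choice of the sketch, not a requirement.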




Using Computer Code to Decipher Genetic Code: Bioinformatics 101

ROSALIND Platform

Learning bioinformatics usually requires solving computational problems of varying difficulty that are extracted from real challenges of molecular biology. To make learning bioinformatics fun and easy, we have founded Rosalind, a platform for learning bioinformatics through problem solving. ROSALIND offers an array of intellectually stimulating problems that grow in biological and computational complexity; each problem is checked automatically, so that the only resource required to learn bioinformatics is an internet connection. ROSALIND also promises to facilitate improvements in standard bioinformatics education by providing a vital teaching aid and a central homework resource. ROSALIND is inspired by Project Euler, Google Code Jam, and the ever growing movement of free online courses. The project's name commemorates Rosalind Franklin, whose X-ray crystallography with Raymond Gosling facilitated the discovery of the DNA double helix by Watson and Crick. ROSALIND
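
To give a feel for the kind of small, automatically checkable problem described above, here is a minimal Python sketch that counts how often each nucleotide occurs in a DNA string. The function name count_nucleotides and the input string are illustrative assumptions rather than material taken from the platform.

```python
from collections import Counter


def count_nucleotides(dna: str) -> tuple[int, int, int, int]:
    """Return the number of A, C, G and T symbols in a DNA string, in that order."""
    counts = Counter(dna.upper())
    # Counter returns 0 for missing keys, so absent bases need no special case.
    return counts["A"], counts["C"], counts["G"], counts["T"]


# Illustrative input; prints the four counts separated by spaces.
print(*count_nucleotides("AGCTTTTCATTCTGAC"))  # 3 4 2 7
```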

CRISPR

Youtube search... ...Google search

CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) is a revolutionary gene-editing technology that allows scientists to make precise changes to the DNA of living organisms. A CRISPR kit typically contains the necessary reagents and tools to perform CRISPR gene editing in a laboratory setting.


CRISPR Kit

There are several companies that offer CRISPR kits for educational and research purposes. For example, Bio-Rad offers educational CRISPR gene editing kits that allow students to perform real CRISPR-Cas9 gene editing in the classroom using familiar and safe reagents, techniques, and organisms. Synthego offers an ecosystem of synthetic RNA solutions for CRISPR genome engineering, including engineered cells and CRISPRevolution products. Sigma-Aldrich offers a CRISPR Integration Kit that provides the essential genome editing reagents necessary to integrate a BstNI restriction site to the human KRAS locus.


CRISPR Explained

PrimateAI-3D

PrimateAI-3D is built on deep-learning language architectures similar to those used in ChatGPT, but designed to model genomic rather than linguistic sequences. The team used natural selection to train its parameters, by presenting it with mutations that are ruled out for disease in our primate relatives. This way, the algorithm learned to recognize benign genetic variants and, by process of elimination, mutations that are likely to cause disease.

Then the scientists applied PrimateAI-3D to identify potentially harmful mutations in humans, using health records and gene variant data of over 400,000 people who have donated samples to the UK Biobank project. They found that the algorithm showed “impressive improvements” in predicting humans’ increased genetic risk for common diseases.

PrimateAI-3D is a deep-learning network developed by Illumina that is trained on 4.5 million common genetic variants from 233 primate species. This state-of-the-art classifier accurately quantifies missense variant pathogenicity in humans, which improves discovery of genes affecting clinical phenotypes. It is used to improve genetic risk prediction and drug target discovery. The algorithm scans about 70 million genetic variants, a selection that is more than 1,000 times as large as ClinVar. The 3D in the name refers to the three-dimensional structure of proteins, a key factor in distinguishing which mutations will wreak havoc.

What Came First, Cells or Viruses?

Youtube search... ...Google search

Virus & Consciousness

Long ago, a virus bound its genetic code to the genome of four-limbed animals. That snippet of code is still very much alive in humans' brains today, where it does the very viral task of packaging up genetic information and sending it from nerve cells to their neighbors in little capsules that look a whole lot like viruses themselves. And these little packages of information might be critical elements of how nerves communicate and reorganize over time — tasks thought to be necessary for higher-order thinking...

Though it may sound surprising that bits of human genetic code come from viruses, it's actually more common than you might think: A review published in Cell in 2016 found that between 40 and 80 percent of the human genome arrived from some archaic viral invasion.

Bioinformatics Project from Scratch

Part 1
I have shown you how to collect an original dataset in biology that you can use in your Data Science Project. In particular, I have demonstrated how to download and pre-process the biological activity data from the ChEMBL database. The dataset comprises compounds (molecules) that have been biologically tested for their activity towards a target organism/protein of interest

Part 2
I have shown you how to calculate Lipinski descriptors (molecular descriptors proposed by Christopher Lipinski for predicting whether a molecule is likely to be drug-like) and perform Exploratory Data Analysis on these descriptors. In particular, the EDA is based on making simple box plots and scatter plots to discern differences between the active and inactive sets of compounds
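
The Lipinski descriptors can be checked against the commonly cited "rule of five" thresholds (molecular weight at most 500 Da, LogP at most 5, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors). The sketch below is not the code used in the video; it assumes the four descriptor values have already been computed elsewhere (for example with a cheminformatics toolkit), and the class name, function names and the two example compounds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LipinskiDescriptors:
    mol_weight: float  # molecular weight in daltons
    log_p: float       # octanol-water partition coefficient
    h_donors: int      # number of hydrogen-bond donors
    h_acceptors: int   # number of hydrogen-bond acceptors


def rule_of_five_violations(d: LipinskiDescriptors) -> int:
    """Count how many of the four rule-of-five thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_donors > 5,
        d.h_acceptors > 10,
    ]
    return sum(checks)


def is_drug_like(d: LipinskiDescriptors, max_violations: int = 1) -> bool:
    """Treat a compound as drug-like if it breaks at most max_violations rules."""
    return rule_of_five_violations(d) <= max_violations


# Hypothetical descriptor values, for illustration only.
small_molecule = LipinskiDescriptors(mol_weight=180.2, log_p=1.3, h_donors=1, h_acceptors=4)
large_molecule = LipinskiDescriptors(mol_weight=720.0, log_p=6.1, h_donors=6, h_acceptors=12)

print(is_drug_like(small_molecule))  # True
print(is_drug_like(large_molecule))  # False
```

Allowing one violation by default follows the usual reading of the rule; setting max_violations to 0 gives a stricter filter.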

Part 3
I have changed the target protein to Acetylcholinesterase, as it provides a larger dataset to work with. We have already computed the molecular descriptors using the PaDEL-Descriptor software and prepared the dataset (X and Y dataframes) that will be used in this video for Model Building

Part 4
I have shown you how to use the computed molecular descriptors from Part 3 (as the X variables) to build a regression model for predicting the pIC50 values (the Y variable)
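
The pIC50 value used as the Y variable is the negative base-10 logarithm of the IC50 expressed in molar units, so more potent compounds get larger values. The sketch below shows that conversion, assuming the IC50 values are given in nanomolar; the function name pic50_from_nm and the example inputs are hypothetical and not taken from the video.

```python
import math


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    ic50_molar = ic50_nm * 1e-9  # 1 nM = 1e-9 M
    return -math.log10(ic50_molar)


# Hypothetical measurements: 1000 nM (1 uM) and 10 nM.
print(round(pic50_from_nm(1000.0), 2))  # 6.0
print(round(pic50_from_nm(10.0), 2))    # 8.0
```

Working on the logarithmic scale compresses IC50 values that span several orders of magnitude, which tends to make them easier to model with regression.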

Part 5
This video is part of a multi-part series on building a Bioinformatics Project from scratch. In this video, I will show you how to quickly build and compare several regression models (quantitative structure-activity relationship, or QSAR) of the Acetylcholinesterase inhibitors using the lazypredict library in Python

Part 6
I will show you how to deploy the machine learning model as a web app. Essentially, this web app will serve as a Bioinformatics tool that will allow users the ability to predict whether a compound of interest has favorable biological activity against the target protein or not