About the Author
(Data Scientist/Machine Learning Scientist)

Summary

Proficient data scientist skilled in Python, utilizing powerful packages such as Pandas, NumPy, Matplotlib, SciPy, Seaborn, SciKit-Learn, Statsmodels, TensorFlow, and more.
Experienced in developing advanced classification and regression models, employing machine learning algorithms such as Logistic Regression, k-Nearest Neighbors (KNN), Decision Tree, Random Forest, Support Vector Regression (SVR), Linear Regression, and Gradient Boosting.
Conducted comprehensive analysis, curation, and management of diverse datasets, ensuring quality control and assurance. Implemented end-to-end pipelines to seamlessly integrate qualified field and genomic data into the Alfalfa Toolbox.
Successfully led and executed 100+ data projects in collaboration with 100+ academic and research institutions across 50+ developing countries, generating and analyzing genotypic data for diverse applications.
Proficient in an extensive array of data science tools, encompassing SQL, SAS, MS SQL Server, MySQL, Jupyter Notebook, Google Colab, RStudio, AzureML, Power BI, Tableau, and Big Data analytics using MS R Client, HDInsight, and Spark.
Additional expertise and keen interest in GBS (Genotyping-By-Sequencing), RNA-seq analysis, molecular genetics, genomics, and bioinformatics, enhancing data analysis capabilities for biological research.
Accomplished author and co-author of 58 peer-reviewed papers, conference papers, and book chapters, focusing on statistical and quantitative genetics as well as molecular genetics.

Education

Professional Certificate - Large Language Models (#LLMs) (Certificate No.: f781980c09e446108b7df85ce81be72d)

Data Science Certificate (Certificate No: b5b3db2d-82dc-4aec-a623-37d86279b0b3): Microsoft Professional Program (MPP) for Data Science, Microsoft Corp, USA. (Individual Course Certificates are displayed in the right column. Many of the MPP alumni graduated from the program in about 18 months after taking and passing the examination for each of the 11 required courses including the Capstone Challenge Project.)

Professional Certificate (Certificate No.: 0731b691e08a473e97dc95a49633df27)

Ph.D.: Plant Science involving Biostatistics, Quantitative Genetics, and Population Genetics, University of Saskatchewan, Saskatoon, Canada.
M.S.: Biostatistics and Quantitative Genetics, Nanjing Agricultural University, Nanjing, China.
B.S.: Agronomy/Crop Science, Hunan Agricultural University, Changsha, China.

Social/meetup groups

SQLSaturday: Indianapolis, IN
IndyPy: Indianapolis, IN
IndyUseR Group: Indianapolis, IN
Indy Azure User Group: Indianapolis, IN
Power BI User Group: Indianapolis, IN
Indy Big Data: Carmel, IN

Peer reviews

Selected previous independent data science projects

Data merging and database management: data cleansing, merging from multiple sources, exploring data relationships, database update and management with a private company (under NDA) in healthcare industry.
Clinic data analysis: accomplished sucessfully a statistical analysis using R for a set of clinic datasets collected from tens of thousands of patients from a private research organization and generated valuable conclusions and insights.
Relational database management project: executed successfully a RDBMS project from a private company using MS SQL server.
Kaggle machine learning project: completed the time series forecasting for the covid-19 pandemic in 2020 that was caused by the pneumonia-causing novel coronavirus (SARS-CoV-2), and created animated data visualization.
MPP machine learning projects: (1) predicting using regression model on customer budget spending, (2) classifying item purchases with an accuracy of 85% - 100% with different algorithms.
Microsoft Professional Capstone - Data Science: successfully accomplished an MPP Capstone Challenge for the machine learning project hosted by DrivenData in predicting the gross rents, nationalwide. Algorithms applied include (1) Linear Regression, (2) Random Forest, (3) AdaBoost, (4) Decision Tree. The R-squares of predictions from different data sets range from 0.8326 ~ 0.9697 and the RMSE from 0.0978 ~ 0.0421, respectively.
MPP Capstone Challenge II: successfully predicted the mortgage rate spreads across 50 states from a data set with 21 independent variables using the following algorithms: (1) Linear Regression, (2) Random Forest and (3) XGBoost.
Kaggle machine learning project: classifying the forest cover types using Python with an accuracy of 88.04% - 100% using the algorithms of (1) Logistic Regression, (2) Support Vector Machine (SVM), (3) Random Forest, (4) XGBClassifier.
Kaggle deep learning project: global detection of wheat spikes/heads using computer vision and image analysis.

His previous roles with diverse data science projects

Database management, data curation and analyses: Dr. He has worked as Data Curator in both Alfalfa ToolBox team (IT/Computing and Bioinformatics) and Alfalfa Breeding team at Noble Research Institute where he managed databases, curated and analyzed data for the multi-million-dollar alfalfa toolbox project (ABT) in the plant science research division to enhance molecular breeding targeting the 8-billion-dollar alfalfa market. Some of his responsibilities pertained to managing the genomic, genetic and phenotypic datasets for the integration into the toolbox web portal to provide user-friendly access to the research community, for instance, to find the homologous/orthologous genes through sequence blasting against the reference genomes of the Cultivated Alfalfa at the Diploid Level (CADL) (Medicago sativa) and M. truncatula, the model legume species. The Genome Browser (JBrowse) was used to build the toolbox that harbored numerous valuable genes from the two species in the fashion that the Legume Information System (LIS) demonstrated. Many of them were identified and validated through gene expression, such as genes that confer tolerance to aluminum (Al) toxicity, drought and heat stress, similar to the Mt Gene Expression Atlas, an in-house platform at Noble Research Institute, or to the portal of phytozome. The ABT also harbored many genetic maps and numerous phenotypic data linked to the USDA-ARS Germplasm Resources Information Network (GRIN) to serve as a one-stop shop, which was tailored for research scientists to select the economically important genes and elite germplasm simutaneously and aimed to provide all resources for molecular breeding with efficiency to increase genetic gain not only for the forage crop but for the beef industry, ultimately.

Managing 100+ data projects: Dr. He has also worked as Senior Scientist and Project Manager for the CGIAR Generation Challenge Programme (GCP), currently the Integrated Breeding Platform (IBP), c/o CIMMYT, which was mainly funded by the Bill & Melinda Gates Foundation (BMGF). While working at the GCP headquarters near Mexico City and collaborating with the global GCP organization from year 1 to year 5, he managed over 100 data projects. These projects supported various research programs in generating and analyzing genomic and genotypic data for molecular marker-assisted breeding of crop species. The goal was to increase genetic gain for more than 100 academic and research institutions across 50+ developing countries on different continents. For instance, Dr. He, together with the key members of the GCP portal, coordinated the development of many sets of SNP markers and advocated their applications to target important genes in the 11 key crop species, such as rice (Oryza spp.) and common bean (Phaseolus vulgaris). All sets of the KASP SNP markers were carefully selected including those from the important gene sequences of interest and developed across the genome of the relevant crop species, which are significant advances in molecular marker development over SSR markers in plant species such as alfalfa and tomato including tagging genes for disease resistance. Many of them are associated with the economically important traits such as yield, quality, disease resistance and stress tolerance etc. He worked closely with the plant scientists to generate genomic, genetic and phenotypic datasets and conduct molecular breeding, some of his efforts are demonstrated through his introduction in the program interview (including the Chinese version). Also, as he has continued introducing the KASP assay through his publication for generating genotypic data, many research programs have adopted the cost-effective KASP assay into their own research laboratories.

The early experience in programming for his Master's degree

While pursuing his graduate studies for a Master's Degree in Statistical and Quantitative Genetics at Nanjing Agricultural University, the author had the opportunity to take the programming courses, i.e. Fortran (formerly FORTRAN from "FORmula TRANslation") and BASIC ("Beginner's All-purpose Symbolic Instruction Code"). Soon, he created genetic algorithm using the programming languages to develop quantitative genetic models for data analysis and prediction of various genetic parameters for the quantitatively inherited traits of soybean for his M.S. thesis project.

By the way, Fortran is a general-purpose programming language which is well suited to numeric computation & scientific computing. It was developed in the early 1950s by the team led by John W. Backus at IBM and later released in 1957. However, it's a compiled language compared with other interpreted languages such as Python and JavaScript. Fortran programming has been used for NVIDIA's CUDA Fortran, which serves as the Fortran interface to the CUDA parallel computing platform, that allows developers to leverage the power of GPUs (Graphics Processing Units) for various computationally intensive tasks such as scientific computing, video editing, machine learning (#ML) and deep learning (#DL) for Artificial Intelligence (#AI). BASIC is a general-purpose and high-level programming language which was designed by John G. Kemeny and Thomas E. Kurtz at Dartmouth College in New Hampshire, USA in 1964.

Computer simulation for his PhD project

One of his earlier experience was the computer simulation of the distributions of the phenotypic data which were influenced by both genetic and environmental factors as part of the research proposal for his PhD project at the University of Saskatchewan. The phenotypic data, due to the genetic models and the modes of gene actions, depended predominantly on the number of genes as well as the environmental factors varied by the locations and years for the resistance to the common bunt disease of wheat using SAS and MINITAB packages. Referring to the genetic approach of Robert C. Elston, the initial intention was to explore the possibility of grouping or clustering the $F_{2}$ -derived progenies or families from the breeding populations to determine the number of genes controlling the disease resistance through the analysis of Mendelian inheritance, using simulated datasets based on the binomial distribution:

$P(x=m)=\binom{n}{m}p^{m}q^{n-m}=\frac{n!}{m!(n-m)!}p^{m}(1-p)^{n-m}$
where p is the probability of common bunt disease for a wheat plant while q is the probability of being healthy for the plant ( $q=1-p$ ), m is the number of diseased plants while n is the total number of wheat plants evaluated for the specific genotype. The expected mean of the probability distribution is $\mu _{x}=np$ with a variance $\sigma _{x}^{2}=np(1-p)$ .

Looking back the simulation with the concept of machine learning (unsupervised learning) in mind, the objective of the simulation was basically to draw inferences for classifying genotypic groups from the datasets simulated with the genetic models under different assumptions through the cluster analyses such as k-means, mixture models and hierarchical clustering.

With his background and interest, he is inspired by the following quote from Dr. Robert C. Elston, a distinguished statistical geneticist and professor at Case Western Reserve University since 1995, who was formerly a professor and head (1979-1995) in the Department of Biometry and Genetics at Louisiana State University Medical Center and a professor (prior to 1979) in the Department of Biostatistics at University of North Carolina:

"Statistical genetics may go out of fashion, but there will always be a need for statisticians who can compute." -- Robert C. Elston (2015)
(Zheng et al. 2015. A Conversation with Robert C. Elston. Statistical Science 30(2) : 258–267. DOI: 10.1214/14-STS497.)

About the Website

This website aims to host information on exploratory data analysis (EDA), data management, visualization, machine learning (ML) and prediction in order to provide possible solutions to real world problems using statistics and computer programming.

Contact the Author

If you have any questions, suggestions or would like to collaborate with data science projects, you may contact the author by email or by sending message and contact information below.

About the Author (Data Scientist/Machine Learning Scientist)

About the Website

Contact the Author

Disclaimer

About the Author
(Data Scientist/Machine Learning Scientist)