Python Programming

Python basics

Python, created by Guido van Rossum, is free software and a powerful, modern programming language. It is a high-level, interactive, interpreted, object-oriented scripting language that is nonetheless easy to learn. Python is highly readable because it often uses English keywords where other languages use punctuation, and it has fewer syntactic constructs than many other languages.

Python shares some features with Fortran, one of the earliest programming languages, but is generally considered the more versatile of the two. Variables are used without being declared, and indentation, rather than braces or keywords, marks the extent of each control structure. Unlike Java, Python does not require programmers to define classes, although they are free to do so when necessary.
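
The three points above can be sketched in a few lines. This is only an illustration; the names `scores`, `label`, and `Sample` are made up for the example.

```python
# Variables are created by assignment; no type declaration is needed.
scores = [72, 88, 95]
label = "passing"

# Indentation, not braces, marks the body of the loop and of the if statement.
passed = []
for score in scores:
    if score >= 80:
        passed.append(score)


# A plain function works without any enclosing class...
def mean(values):
    return sum(values) / len(values)


# ...but a class can be defined when it helps to bundle data and behavior.
class Sample:
    def __init__(self, name, values):
        self.name = name
        self.values = values

    def mean(self):
        return mean(self.values)


sample = Sample(label, passed)
print(sample.name, sample.mean())
```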

Python is also widely used for web development.

Data analysis

Python began as a general-purpose language, but over the last several years the community has contributed a substantial set of dedicated open-source libraries for data analysis and predictive modeling. These include the fundamental packages for scientific computing, data analysis, and statistical applications, such as NumPy, IPython, matplotlib, and pandas.

Data visualization

Good data are data that can be easily visualized and understood. In data science, the goal is to produce visualizations that are as comprehensive and understandable as possible. Datasets come in many forms and from many sources, including remote and local files, formats such as JSON and CSV, and relational databases. The process is not always simple, however: one has to find the data, process them by reading, cleaning, and massaging, and then choose the right tool to visualize them.

Basic plots can easily be created in Python using matplotlib. More advanced and visually appealing charts, however, generally require combining Python skills with other tools.

Machine learning and artificial intelligence

Python is known for its simplicity and is therefore often considered an ideal language for machine learning (ML) and artificial intelligence (AI). The PyBrain library, in particular, has been gaining ground for machine learning in Python.

Python Applications

The prediction of the COVID-19 pandemic

The pandemic: the COVID-19 pandemic was caused by a novel coronavirus (SARS-CoV-2) that causes pneumonia and led to lethal respiratory infections in humans in 2020. Coronaviruses can also cause disease in animals such as cows, pigs, chickens, and other birds. The outbreak is believed to have started from either a single introduction into humans or very few animal-to-human transmissions (Johns Hopkins Center for Health Security, April 2020).

The viral genome (GenBank Accession: NC_045512): SARS-CoV-2 is a positive-sense, single-stranded RNA virus. Coronavirus genomes span approximately 25-32 kilobases, the largest among RNA viruses, and have a unique replication capability; the SARS-CoV-2 reference genome is about 29.9 kilobases long. Genomic research has highlighted two genes in particular, the S and N genes. Because RNA viruses have high mutation rates, each round of replication may produce several slightly different versions of the viral genome, creating a viral population with diverse genomes, known as a viral quasispecies.

Nextstrain produced the charts below, which illustrate the structural diversity of the viral genome and a phylogeny generated from genome sequence analysis of samples collected in different countries (nextstrain.org, May 2020).

Diversity of the viral genomes including the open reading frames (ORFs):

A phylogeny from the genome sequence analysis of the (+)ssRNA viruses from many countries:

Genome sequence analysis across samples has shown that the highest diversity occurs in the structural genes, especially the S protein, ORF3a, and ORF8 (Johns Hopkins Center for Health Security, April 2020).

Another study, based on an analysis of open reading frame 1ab (ORF1ab) of COVID‐2019, reported that selective pressure may have driven mutations in the virus, and that a stabilizing mutation in the nsp2 protein may have made COVID‐19 more contagious than SARS (Angeletti et al., J. Med. Virol., Feb. 2020).

The prediction of the COVID-19 pandemic: the data were obtained and derived from Johns Hopkins University, and the predictions were made with the Python package Prophet. The data visualizations and prediction charts for cumulative and new cases in several U.S. states (WA, CA, IN, NY) and in Hubei province are shown in the right column. Animated data visualizations were also created for cumulative and new cases across the states.
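
As a rough illustration of the data preparation such a forecast needs, the sketch below derives daily new cases from a cumulative series and arranges them as rows with the `ds` (date) and `y` (value) fields that Prophet expects as input columns. The counts and start date are invented placeholders, not Johns Hopkins data, and the helper names `new_cases` and `to_prophet_rows` are hypothetical.

```python
from datetime import date, timedelta


def new_cases(cumulative):
    """Daily new cases: difference between consecutive cumulative counts."""
    if not cumulative:
        return []
    # The first day has no predecessor, so its cumulative count is used as-is.
    daily = [cumulative[0]]
    for prev, cur in zip(cumulative, cumulative[1:]):
        # Clamp at zero: reporting corrections can make a cumulative series dip.
        daily.append(max(cur - prev, 0))
    return daily


def to_prophet_rows(start, values):
    """Pair each value with its date using the 'ds'/'y' field names."""
    return [{"ds": (start + timedelta(days=i)).isoformat(), "y": v}
            for i, v in enumerate(values)]


cumulative = [10, 14, 21, 21, 35]  # placeholder counts
rows = to_prophet_rows(date(2020, 3, 1), new_cases(cumulative))
for row in rows:
    print(row)
```

Rows like these could then be loaded into a pandas DataFrame and passed to a Prophet model's `fit` method, with forecasts obtained from `make_future_dataframe` and `predict`.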

Another example, on controlling the COVID-19 pandemic, shown in the same chart for WA, CA, IN, and NY:

Containing the pandemic: by August 17, 2020, the countries with more than half a million confirmed COVID-19 cases were the USA, Brazil, India, Russia, South Africa, Peru, and Mexico, based on cumulative and new cases (data from Johns Hopkins University, August 2020). In the USA, by August 23, 2020, the states with more than 150,000 total confirmed COVID-19 cases were California, Florida, Texas, New York, Georgia, Illinois, Arizona, New Jersey, and North Carolina, as shown in the figures below, which include animated charts of cumulative and new cases.

Testing for COVID-19: there are two general types of tests for COVID-19 infection: (1) diagnostic tests and (2) antibody tests.

Diagnostic testing: two types of diagnostic tests detect the virus: (i) molecular tests, or RT-PCR tests, and (ii) antigen tests.

Molecular tests: this genome-based method detects genetic material of the virus using PCR technology, so it is also called a PCR test, a nucleic acid amplification test (NAAT), or simply a gene test. Fluid is collected with a nasal or throat swab, or from saliva. A pair of PCR primers designed from the viral genome (ORFs), for instance the forward primer (GAC CCC AAA ATC AGC GAA AT) and the reverse primer (TCT GGT TAC TGC CAG TTG AAT CTG), is then used to generate an amplicon from the viral genome, which the instrument detects by means of a fluorescent marker. This method can detect an active COVID-19 infection.

Antigen tests: this test detects specific proteins on the surface of the virus and produces rapid results in an hour or less, so it is faster and less expensive than molecular/PCR tests. A fluid sample is collected with a nasal or throat swab. This test can also diagnose an active COVID-19 infection.
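
The primer pair quoted for the molecular test can be inspected with a few lines of Python. The sketch below computes the length and GC content of each primer and the reverse complement of the reverse primer. These are routine primer checks chosen here for illustration, not steps described in the source, and the helper names are hypothetical.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean(seq):
    """Remove the spaces used to group bases into triplets and upper-case."""
    return seq.replace(" ", "").upper()


def gc_content(seq):
    """Percentage of G and C bases in the sequence."""
    seq = clean(seq)
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return clean(seq).translate(COMPLEMENT)[::-1]


forward = "GAC CCC AAA ATC AGC GAA AT"
reverse = "TCT GGT TAC TGC CAG TTG AAT CTG"

for name, primer in (("forward", forward), ("reverse", reverse)):
    print(name, len(clean(primer)), "nt,", round(gc_content(primer), 1), "% GC")

# The reverse primer anneals to the opposite strand, so its reverse
# complement is the form that would be searched for in the same strand
# that contains the forward primer.
print(reverse_complement(reverse))
```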

Antibody testing: this test detects antibodies that the immune system produces in response to the virus. Antibodies may take several days or weeks to develop after infection and can remain in the blood for several weeks or more after recovery. A sample is collected by finger stick or blood draw. This test can only indicate whether a person was infected by the virus in the past (FDA, 2020), so antibody tests should not be used to diagnose an active COVID-19 infection.

COVID-19 vaccines: in addition to masks and social distancing, three main types of vaccines can be used to prompt the immune system to protect people from the virus that causes COVID-19: (1) mRNA vaccines, (2) protein subunit vaccines, and (3) vector vaccines.

Text mining analytics

To quickly grasp the content of a document or webpage, we can use text mining to identify its words and build a word cloud that visualizes the most frequent key words. A word cloud presents the results intuitively and can be made using R or Python. In Python, the WordCloud package visualizes text data so that the size of each word indicates its frequency and relative importance.

Other Python packages, including matplotlib, numpy, and PIL, are used alongside it for the text data visualization. The WordCloud package also lets the user supply a shape so that the words are arranged within that shape.

However, the WordCloud package does not render Chinese characters by default, so a font that supports Chinese must be supplied through the "font_path" parameter.
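
The frequency-counting step behind a word cloud can be sketched with the standard library alone. The stop-word list and sample text below are invented for illustration, and the function name `word_frequencies` is hypothetical.

```python
import re
from collections import Counter

# A tiny, made-up stop-word list; real analyses use a much longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "with"}


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stop words as (word, count) pairs."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


text = ("Python is a language for data analysis. Python libraries support "
        "data visualization, and data analysis with Python is popular.")
for word, count in word_frequencies(text):
    print(word, count)
```

A dictionary built from these pairs could then be handed to the WordCloud class, for example through its `generate_from_frequencies` method, so that word size follows the counts.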

A word cloud generated through text mining using python.

Create a bubble chart

A bubble plot is a scatterplot of x and y values with an additional third variable, or dimension, equivalent to a z-axis. This third numeric variable is represented by the size of the bubbles. For example, to explore the dynamics of GMO soybean production (herbicide-tolerant) in 2018, the GMO% of all soybeans planted is plotted against state GDP for the states that produce the most soybeans, with bubble size proportional to the corresponding population. Likewise, bubble charts are created for the GMO% of all corn planted and the GMO% of all upland cotton planted across the major producing states in 2018, covering cultivars with stacked genes, herbicide tolerance, and insect resistance (Bt).


					import matplotlib.pyplot as plt
					import numpy
					# Read GDP, GMO% and population from columns 3-5, skipping the header row.
					x, y, s = numpy.loadtxt('/mydata.csv', delimiter=',', unpack=True, usecols=(3, 4, 5), skiprows=1)
					GDP, GMO, pop = x, y, s
					# One color and one label per state, in the same order as the rows of the file.
					colors = ['#002387', '#043927', '#4C2882', '#48BF91', '#4F42B5', '#800080', '#DA70D6', '#CFB53B', '#D1E231', '#D7837F', '#DF00FF', '#39FF14', '#FE2712', '#FF5800']
					state = ['AR', 'IL', 'IN', 'IA', 'KS', 'MI', 'MN', 'MS', 'MO', 'NE', 'ND', 'OH', 'SD', 'WI']
					plt.title('Bubble Chart for GMO% of All Soybeans Planted Across States (Herbicide-tolerant, 2018; bubble size corresponds to population)', color='#006400')
					plt.xlabel('State GDP (x1000M USD)', color='#006400')
					plt.ylabel('GMO% of All Soybeans Planted', color='#006400')
					# s sets the marker area, so bubble size follows population.
					plt.scatter(x=GDP, y=GMO, s=pop, alpha=0.6, c=colors)
					# Write each state's abbreviation at the center of its bubble.
					for i, txt in enumerate(state):
					    plt.annotate(txt, (GDP[i], GMO[i]), ha='center', va='center')
					plt.show()
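
In matplotlib's scatter function the `s` argument is the marker area in points squared, so raw population figures may give bubbles that are far too large or too small. One possible way to keep the proportions while controlling the overall size is sketched below. The helper `scale_to_area`, the `max_area` default, and the population figures are assumptions for illustration only.

```python
def scale_to_area(values, max_area=2000.0):
    """Scale positive values so the largest maps to max_area (points squared)."""
    largest = max(values)
    return [max_area * (v / largest) for v in values]


population_millions = [3.0, 12.7, 6.7]  # placeholder figures, not real data
areas = scale_to_area(population_millions)
print([round(a, 1) for a in areas])
```

The returned list could be passed as `s` to the scatter call in place of the raw population column.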

A bubble chart for GMO% of all upland cotton planted across USA in 2018.

A bubble chart for GMO% of all corn planted across USA in 2018.

A bubble chart for GMO% of all soybeans planted across USA in 2018.

Visualization of gene expression data

Following on from the RNA-seq data analysis in R, which is based on a hypothetical maize gene expression example, a scatter plot matrix is created in Python for the selected sample-treatment combinations; the diagonal displays the probability density function estimated by kernel density estimation (KDE).


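					import pandas as pd
					import matplotlib.pyplot as plt
					from pandas.plotting import scatter_matrix
					data = pd.read_csv("transmydata.csv")
					# Keep only the four sample-treatment columns to be compared.
					df = pd.DataFrame(data, columns=['CAW', 'CBW', 'CAD', 'CBD'])
					# diagonal='kde' draws a kernel density estimate on the diagonal (requires SciPy);
					# c colors the points by their CAD values.
					scatter_matrix(df, diagonal='kde', alpha=0.6, figsize=(5, 5), c=df['CAD'])
					plt.show()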
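
To make the KDE on the diagonal less of a black box, here is a minimal Gaussian kernel density estimator written from scratch. It is a sketch only: the fixed, hand-picked bandwidth and the expression values are assumptions, whereas the pandas plot relies on SciPy, which selects a bandwidth automatically.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of samples at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, each centered on that sample.
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density


expression = [2.1, 2.4, 2.5, 3.9, 4.2]  # placeholder values, not real data
density = gaussian_kde(expression, bandwidth=0.5)

# Numerically integrate the estimate from -5 to 10; the result should be
# close to 1, as expected of a probability density.
step = 0.01
area = sum(density(-5 + i * step) * step for i in range(1500))
print(round(area, 3))
```

A smaller bandwidth gives a spikier curve that follows individual samples, while a larger one smooths them into a single broad peak.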

Genome diagrams

Visualization of syntenic regions across different species/subspecies or different strains of the same species using genome diagrams:


					from Bio import SeqIO
					from Bio.Graphics import GenomeDiagram
					from reportlab.lib.colors import (green, blue, orange, grey, red, lightblue,
					                                  brown, purple, navy)
					# Read one GenBank record per species/strain.
					Seq1_data = SeqIO.read("seq_species1.gb", "gb")
					Seq2_data = SeqIO.read("seq_species2.gb", "gb")
					# One color per gene, in genome order; each list needs at least as many
					# entries as the record has "gene" features.
					Seq1_colors = ([green]*5 + [blue]*7 + [orange]*2 + [grey]*2 + [red] + [lightblue]*11 + [green]*4
					               + [grey] + [green]*2 + [grey, green] + [brown]*5 + [blue]*4 + [purple]*5
					               + [grey, lightblue] + [navy]*2 + [grey])
					Seq2_colors = ([green]*6 + [blue]*8 + [orange]*2 + [grey] + [red] + [lightblue]*21 + [green]*5
					               + [grey] + [brown]*4 + [blue]*3 + [purple]*3 + [grey]*5 + [navy]*2)
					name = "SyntenyView_linear"
					gd_diagram = GenomeDiagram.Diagram(name)  # create the genome diagram
					max_len = 0
					for record, gene_colors in zip([Seq1_data, Seq2_data], [Seq1_colors, Seq2_colors]):
					    max_len = max(max_len, len(record))
					    # Add name="name of species/subspecies" to new_track() to label the track.
					    gd_track_for_features = gd_diagram.new_track(1, start=0, end=len(record))
					    gd_feature_set = gd_track_for_features.new_set()
					    i = 0
					    for feature in record.features:
					        if feature.type != "gene":
					            # Skip everything that is not a gene feature.
					            continue
					        gd_feature_set.add_feature(feature, sigil="ARROW",
					                                   color=gene_colors[i], label=True,
					                                   name=str(i+1),
					                                   label_position="start",
					                                   label_size=9, label_angle=90, label_color=green)
					        i += 1
					# Use format="circular" instead to create a circular graph.
					gd_diagram.draw(format="linear", pagesize='A2', fragments=1, start=0, end=max_len)
					gd_diagram.write(name + ".pdf", "PDF")
