SQL Programming

SQL introduction

SQL (Structured Query Language) is a database computer language created for managing data in Relational Database Management Systems (RDBMS). It was originally developed at IBM by Donald D. Chamberlin and Raymond F. Boyce in the early 1970s, initially called SEQUEL, in order to manipulate and retrieve data stored in IBM's original relational database management system.

The SQL language is used to operate databases including database creation, data query, update, schema creation and modification as well as data access control. It is an ANSI (American National Standards Institute) standard language with different versions for Relational Database System. The RDBMS (such as MySQL, MS SQL Server, IBM DB2, Sybase, PostgreSQL, SQLite, Oracle, MS Access) use SQL as their standard database language.

Basic SQL statements


					SELECT c1,c2,...;  --SELECT some fields FROM table1 WHERE (a clause to filter records)...
					FROM table1;
					WHERE condition;
					GROUP by column-name;
					ORDER by column-name;

					SELECT * FROM table1;  --SELECT all fields FROM table1...

					SELECT column_name(s) FROM table1
					INNER JOIN table2 ON table1.column_name = table2.column_name;  --select records with matching values in both tables.
					LEFT JOIN table2 ON table1.column_name = table2.column_name;
					--generate all records from the left table (table1) & the matched records from the right table (table2).
					RIGHT JOIN table2 ON table1.column_name = table2.column_name;
					--generate all records from the right table (table2) & the matched records from the left table (table1).
					FULL OUTER JOIN table2 ON table1.column_name = table2.column_name;
					--generate all records if there is a match in either left (table1) or right (table2) table records.

					SELECT column_name(s) FROM table1 T1, table1 T2  --The table is joined with itself.
					WHERE condition;
					

SQL vs NoSQL

The database technology has two main types of databases including SQL (structured query language) and NoSQL (“non SQL" or “Not only SQL"), which refer to relational databases and non-relational databases, respectively. Unlike SQL, NoSQL databases are document-oriented instead of tables. Thus, the unstructured data (such as articles, photos, videos, and web content etc.) are stored in a single document that is different from SQL databases.

NoSQL databases have an advantage of ease of access by the app developers while relational databases are not easily accessible to the applications developed using the object-oriented programming languages such as Python and PHP (Hypertext Preprocessor).

The most popular databases

  • Microsoft SQL Server: This database management engine can work on both cloud-based servers and local servers. It is stable and fast. However, the cost may be high for many organizations.
  • PostgreSQL: PostgreSQL is a free popular database and is often used for web databases. It is one of the first database management systems and can be used for managing both structured and unstructured data. It can support JSON, is scalable and is able to handle terabytes of data. PostgreSQL is a free popular database and is often used for web databases. It is one of the first database management systems and can be used for managing both structured and unstructured data. It can support JSON, is scalable and is able to handle terabytes of data.
  • MySQL: MySQL is also one of the most popular databases for web-based applications. It is a freeware but there are also different paid editions available for commercial use.
  • Oracle Database: Oracle database management tool is no doubt robust, it can be used for the cloud and be hosted on a single or multiple servers. It is enabled to manage databases with billions of records. However, the cost of Oracle may be prohibitive for small companies.
  • MongoDB: MongoDB is also a free database but has a commercial version. It is designed for managing both structured and unstructured data. MongoDB is easy to use and runs fast, it can support JSON and other NoSQL documents. However, SQL can not used as a query language but tools are available for translating SQL to MongoDB queries.
  • MariaDB: MariaDB database management system has a free and a paid versions. There are a variety of plugins available for it, and it is the fastest growing open source database.

The most popular web services for cloud databases

  • Amazon Web Service (AWS): The services from Amazon include: (1) Amazon RDS: this service provides the database products for Microsoft SQL Server, MySQL, PostgreSQL, Oracle, MariaDB, etc. (2) Amazon SimpleDB: it is a NoSQL data store. (3) Amazon DynamoDB: it is a key-value and document database capable of delivering single-digit millisecond performance at any scale.
  • Microsoft Azure: MS Azure is a cloud platform across 55 regions that build and run web-based applications, smart client applications, and XML web services.
  • Google Cloud Platform: Google Cloud Platform (GCP) is a suite of cloud computing services that Google uses for Search, Gmail, YouTube and many other modular cloud services such as computing, data storage, data analytics and machine learning.
  • Other cloud databases with enterprise-scale database technology include MongoDB Atlas, Oracle Database and IBM DB2.


Basic SQL Applications

Creating and managing databases


					CREATE TABLE geneList(
					    id INT NOT NULL AUTO_INCREMENT,
					    geneName VARCHAR(255),
					    species VARCHAR(255),
					    trait VARCHAR(255),
					    geneSeq VARCHAR(255),
					    description VARCHAR(255),
					    fPrimer VARCHAR(255),
					    rPrimer VARCHAR(255),
					    ampliconSize VARCHAR(255),
					    PRIMARY KEY(id)
					);
					--Insert data
					INSERT INTO genelist(geneName, species, trait, geneSeq, description, fPrimer, rPrimer, ampliconSize) VALUES
					('G1.01gxxxxxx','species1','YIELD','TGGTAACTCG....ACTGATCGGC','DESCRIPTION1','ATAGATTATTTAGAGCCAGC','GCTGGCGTCCACGCATTTAC','118'),
					--more data
					('G8.06gxxxxxx','species1','YIELD','CGGACCAGCG....CCAGTGATCC','DESCRIPTION8','GCTTAGATAGCGAGCGACCG','AGGTGTGCGACCGCAGTGCA','113');
					

The top rows of the gene table (genelist) were exported from the database created using SQL scripts mentioned above:
genelist

This table can be easily accessed/queried using SQL statements such as "SELECT * FROM genelist, WHERE condition(id=...)", etc.:


					SELECT
					    *
					FROM
					    genelist
					WHERE
					    geneName like '%8g%';   -- select all the genes on linkage group 8
					    trait like '%disease%'  -- select all the disease resistance genes
					

Or, it can be updated using the following SQL statements such as:

  • UPDATE genelist, SET column1=..., column2=..., WHERE id=...;
  • DELETE FROM genelist, WHERE id=...;
  • INSERT INTO genelist (column1, column2, ...), VALUES (VALUE1, VALUE2, VALUE3, ...);
  • ALTER TABLE genelist, ADD column_name datatype;


SQL queries from databases

  • USE INNER JOIN of two tables to get sales with employee or customer information.
  • 
    									SELECT s.name, p.payment sales
    									FROM dbo.staff$ s
    									INNER JOIN dbo.payment$ p
    									ON s.staff_id=p.staff_id
    									ORDER BY name;
    								

  • LEFT JOIN two tables: it returns all records from the left table (the 1st) and the matched records from the right table (the 2nd).
  • 
    									SELECT p.customer_id, s.employee_name, p.payment
    									FROM dbo.payment$ p
    									LEFT JOIN dbo.staff$ s
    									ON p.employee_id= s.employee_id
    									order by employee_name;
    								

  • RIGHT JOIN two tables: it returns all records from the right table (the 2nd) and the matched records from the left table (the 1st).
  • 
    									SELECT p.customer_id, s.employee_name, p.payment
    									FROM dbo.payment$ p
    									RIGHT JOIN dbo.staff$ s
    									ON p.employee_id= s.employee_id
    									order by employee_name;
    								

  • FULL OUTER JOIN or FULL JOIN two tables: it returns all records when there is a match in left (the 1st) or right (the 2nd) table.
  • 
    									SELECT p.customer_id, s.employee_name, p.payment
    									FROM dbo.payment$ p
    									FULL OUTER JOIN dbo.staff$ s
    									ON p.employee_id= s.employee_id
    									order by employee_name;
    								

  • SQL Self JOIN: a table is joined with itself. The SQL statement below returns the matched customers from the same city through self-joining based on different customer IDs (“<>” is “not equal to”).
  • 
    									SELECT a.CustomerName AS Name1, b.CustomerName AS Name2, a.City
    									FROM Customers a, Customers b
    									WHERE a.CustomerID <> b.CustomerID
    									AND a.City = b.City
    									ORDER BY a.City;
    								

  • To JOIN Three Tables.
  • 
    									SELECT o.OrderID, c.CustomerName, s.ShipperName
    									FROM ((Orders o
    									INNER JOIN Customers c ON o.CustomerID = c.CustomerID)
    									INNER JOIN Shippers s ON o.ShipperID = s.ShipperID);
    								

  • To select only the DISTINCT values from State column in the Customers table.
  • 
    									SELECT DISTINCT State
    									FROM Customers
    									ORDER BY State ASC;
    								

  • To get the total number of different (distinct) States.
  • 
    									SELECT COUNT(DISTINCT State)
    									FROM Customers;
    								

  • Use WHERE to filter data.
  • 
    									SELECT *
    									FROM Customers
    									-- WHERE city in(‘DC’, ‘New York’);
    									-- WHERE city='DC' OR city=' New York';
    									-- WHERE NOT city='DC' AND NOT city=' New York';
    									-- WHERE city LIKE 'd%';
    									WHERE city NOT LIKE 'd%';
    								

  • Use UPDATE to update records.
  • 
    									UPDATE Customers
    									SET ContactName = 'Tom H.', City= 'Dallas'
    									WHERE CustomerID = 10;
    								

  • Use DELETE to delete records from table.
  • 
    									DELETE FROM Customers
    									WHERE CustomerName='M.G.';
    								

  • To select rows using TOP and ORDER.
  • 
    									SELECT top 3 State
    									FROM Customers
    									-- ORDER BY State asc;
    									-- ORDER BY State, CustomerName;
    									ORDER BY State ASC, CustomerName DESC;
    								

  • To query records using TOP, LIMIT and ROWNUM on different systems.
  • 
    									-- SQL Server:
    									SELECT TOP 10 *
    									--SELECT TOP 50 PERCENT *
    									FROM Customer
    									WHERE City='DC';
    
    									-- MySQL:
    									SELECT *
    									FROM Customers
    									WHERE City='DC'
    									LIMIT 10;
    
    									-- Oracle:
    									SELECT * FROM Customers
    									-- WHERE ROWNUM <= 10;
    									WHERE City='DC' AND ROWNUM <= 10;
    								

  • Use Aggregate Functions for simple statistics: count, min, max, avg, sum, stdev, var.
  • 
    									SELECT staff_id AS staff, customer_id as customer, sum(amount) OVER (ORDER BY staff_id, customer_id) as total
    									FROM sales
    									WHERE staff_id=2 AND amount BETWEEN 200 AND 300
    									ORDER BY customer_id;
    								

  • GROUP BY vs PARTITION BY functions: I. using GROUP BY
  • In GROUP BY clause, we are able to use a column in the SELECT statement with other aggregation functions if the column name is also used in GROUP BY clause, but other columns are not allowed in the SELECT clause if they are not part of GROUP BY clause.
    
    									SELECT City,
    									       AVG(Orderamount) AS Avg,
    									       MIN(OrderAmount) AS Min,
    									       SUM(Orderamount) Total
    									FROM [dbo].[Orders]
    									GROUP BY City;
    								

  • GROUP BY vs PARTITION BY functions: II. using PARTITION BY
  • However, the above issue associated with GROUP BY can be resolved by using a PARTITION BY clause. For instance, a PARTITION BY clause with the OVER clause can be used to specify the column on which aggregation functions are performed.
    Note that the output from the aggregate functions through PARTITION BY is similar to that from a GROUP By clause, but there is a difference in the output of the PARTITION BY (showing ALL RECORDS in the table: displaying duplicate city name for different orders occurred from the same city) and GROUP BY clause output (showing LIMITED NUMBER OF RECORDS: no duplicated city name).
    
    									-- (1) Only City column is in the SELECT statement
    									SELECT City,
    									       AVG(Orderamount) OVER(PARTITION BY City) AS Avg,
    									       MIN(OrderAmount) OVER(PARTITION BY City) AS Min,
    									       SUM(Orderamount) OVER(PARTITION BY City) Total
    									FROM [dbo].[Orders];
    
    									/*  (2) More columns are added to the SELECT statement: In this example, the output generates all rows available in the Orders table together with CustomerName, OrderAmount along with the output of the aggregated functions.  */
    									SELECT City, CustomerName, OrderAmount,
    									       AVG(Orderamount) OVER(PARTITION BY City) AS Avg,
    									       MIN(OrderAmount) OVER(PARTITION BY City) AS Min,
    									       SUM(Orderamount) OVER(PARTITION BY City) Total
    									FROM [dbo].[Orders];