As a data scientist, you deal with a lot of data. For small datasets, maybe you just store this information in a CSV file and load it into Pandas. However, this isn't really a scalable solution and won't do too well if you're constantly updating and inserting new data. You need a database; somewhere to hold your information that allows for easy inserting, updating, and reading. There are two main approaches to storing data in a database: relational databases store data as a row in tables with a pre-defined structure, while non-relational databases store data independently as a document. This post will focus on accessing data from relational databases.
Relational database tables are very similar to Pandas dataframes, each row corresponds with a certain object and columns are used to store different features of that object. For example, a website might have a table in their database that stores its users' information (username, first name, last name, etc.).
id | first_name | last_name | username | password |
---|---|---|---|---|
1 | Jeremy | Jordan | password | password |
2 | John | Snow | winteriscoming | braceyourselves |
3 | Anderson | Paak | yeslawd | breezylovejoy |
Note: Websites should never store your password in plain text.
When we want to get information from a relational database, we do so by querying (asking) for it. Specifically, we use a structured query language (SQL) to speak with the database and access data. A schema defines how information is stored within the table, essentially describing what type of information is stored in each column.
SQL has the ability to search, filter, and combine data that might be stored across multiple tables. In this post, I'll discuss the common SQL commands used for data analysis.
Quick links for reference
Basic SQL Commands
SELECT/FROM
: specifies the columns and table of interestLIMIT
: caps the query resultsORDER BY
: specify the order results are returned inWHERE
: filters the data using a comparison or logical operatorLIKE
: used to match on similar values instead of strictly identical valuesBETWEEN
: only select rows in the specified rangeIN
: specify a list of values to includeAND
: select rows which satisfy two specified conditionsOR
: select rows which satisfy one of the specified conditionsNOT
: select rows which do not satisfy the specified conditionIS NULL
: select rows with null values in specified column
Intermediate SQL Commands
- Aggregation functions (column based)
COUNT(column)
: counts number of rows in a specified columnSUM(column)
: adds all the values in a specified columnMIN/MAX(column)
: returns lowest/highest value in a specified columnAVG(column)
: calculates the average value in a specified column
GROUP BY
: segments records into groupsHAVING
: used to filter groups using a comparison or logical operatorDISTINCT
: returns only unique values, no duplicatesCASE (WHEN/THEN/ELSE/END)
: if/then logic for SQL- Data joins
JOIN (ON)
: combining data from multiple tablesINNER JOIN
: exclude all unmatched recordsLEFT JOIN
: exclude unmatched records found in the joined tableRIGHT JOIN
: exclude unmatched records found in the original tableFULL OUTER JOIN
: return all records, matched and unmatched from each table
UNION (ALL)
: combine results from multiple queries- SQL commands order of operation
- SQL data types (external link)
Basic SQL Commands
There are two commands which are absolutely required for every SQL query, what data you want (SELECT
) and where that data is stored (FROM
). With the SELECT
command, you can specify the columns of a table that you'd like to retrieve while FROM
specifies what table you're interested in retrieving data from. The "*" symbol is shorthand for "all", so if you wanted to grab all of the users' fields from the example table above you'd perform a query like SELECT * FROM users
.
The AS
keyword will create an alias for a column in the data retrieved.
SELECT first_name AS "First Name"
FROM users
You can also combine values from different columns (within the same row).
SELECT unread_count + read_count AS total_count
FROM emails
To limit the size of your query results, you can use the command LIMIT
, followed by the sample size, at the end of your query.
SELECT *
FROM users
LIMIT 100
You can use the command ORDER BY
to specify the column to sort the returned results. The default order is ascending, but you can add the keyword DESC
after specifying the column to return a descending order of results. You can also specify multiple columns to order by.
SELECT *
FROM users
ORDER BY year, month DESC
If you'd like to filter your results, you can specify which results you'd like using the WHERE
command, followed by the column to filter on and a comparison or logical expression. For example, you could query the database for records where the user's age is greater than 21.
SELECT *
FROM users
WHERE age > 21
Logical operators
If you want to filter a collection of similar records, you can specify the similarity using LIKE
(or ILIKE
to ignore case in strings for some flavors of SQL). The %
symbol can be used as a wildcard where it could represent any character or set of characters. To limit the wildcard to only one character, you can use the _
symbol.
SELECT *
FROM users
WHERE name LIKE 'Bill%' -- returns for Bill, Billy, Billie, etc.
To access data from a specific time period, a certain age group, or any other range you might be interested in, you can use the BETWEEN
command to specify a range of results to filter by. Both upper and lower bounds are included in the search results.
SELECT *
FROM users
WHERE age BETWEEN 20 AND 30
For even more specific ranges, you can specify every value to include using the IN
command.
SELECT *
FROM users
WHERE age IN (20, 22, 24, 26)
AND
and OR
are used to combine multiple expressions when specifying a filter on the data, where AND
only returns values which meet all specifications while OR
will return values where at least one of the specifications is met.
NOT
can be used to negate any expression.
IS NULL
will check to see whether or not the data in a specified column is null. Typically it is used in combination with the NOT
keyword to only access records which do not contain null values.
SELECT *
FROM users
WHERE age IS NOT NULL
Intermediate SQL Commands
Aggregation functions
COUNT(column)
is used to return the number of rows containing non-null values in a given column. You can return the total number of rows by either counting on the index column or using *
.
-- returns total number of users
SELECT COUNT(*)
FROM users
-- returns number of users where last_name is not blank
SELECT COUNT(last_name)
FROM users
SUM(column)
provides a summation of all numeric values in a specified column.
MIN(column)
provides the minimum value within a column. For columns containing text, 'A' is considered the lowest value. MAX(column)
provides the maximum value within a column. For columns containing text, 'Z' is considered the highest value.
AVG(column)
calculates the arithmetic mean for numeric columns, ignoring null values.
So far, we've only used the aggregation functions on the entire dataset. However, it is possible to segment the data into groups and aggregate each group independently. Suppose you have a table that has the daily stock price of a company for the past few years, and you'd like to look at the monthly stock price average over this time period. You could accomplish this by grouping the records by year and month (GROUP BY
), and then aggregate the individual records in each group to find the average (AVG
).
SELECT year,
month,
AVG(daily_open) AS monthly_avg
FROM google_stock_price
GROUP BY year, month
ORDER BY month, year
Whereas the previously discussed WHERE
command can be used to filter individual records, HAVING
is used to filter groups. For example, you could query the stock prices database to report the monthly average only for the months which had high price fluctuations. The same set of comparison and logical operators can be used for filtering groups. Aggregator functions (in this case, MAX
and MIN
) operate individually for each group.
SELECT year,
month,
AVG(daily_open) AS monthly_avg
FROM google_stock_price
GROUP BY year, month
HAVING MAX(daily_open) - MIN(daily_open) > 50
ORDER BY month, year
If your table has repeated values across records, you can use DISTINCT
to return a set of unique values, or unique combinations of values in the case of using DISTINCT
for multiple columns. You can also place DISTINCT
inside an aggregator function to perform aggregations only on the unique set of values.
-- example using DISTINCT for multiple columns
SELECT DISTINCT year, month
FROM google_stock_price
-- example using DISTINCT within an aggregator function (suppose there are multiple stock price records per day)
SELECT year,
month,
COUNT(DISTINCT day) AS days_in_month
FROM google_stock_price
SQL is capable of performing "if/then" logic during queries using CASE
. The general structure is to provide instructions on what to do if the case is true (CASE
--> THEN
), what to do if the case is not true (ELSE
--> END
or just END
). For multiple "if" statements, you can specify what to do for each case (WHEN
--> THEN
).
Case statements may be used in both SELECT
and GROUP BY
statements.
SELECT first_name,
last_name,
CASE WHEN age < 18
THEN 'child'
WHEN age >= 18
THEN 'adult'
ELSE NULL
END AS legal_age_classification
FROM users
-- count the number of records in each age bracket
SELECT CASE WHEN age >= 7 AND age < 22 THEN 'gen_z'
WHEN age >= 22 AND age < 38 THEN 'gen_y'
WHEN age >= 38 AND age < 53 THEN 'gen_x'
WHEN age >= 53 THEN 'baby_boomer'
ELSE 'unassigned'
END AS age_group,
COUNT(1) AS count
FROM users
GROUP BY age_group
-- reorient the age bracket count horizontally (like a pivot table)
SELECT COUNT(CASE WHEN age >= 7 AND age < 22 THEN 1 ELSE NULL END) AS gen_z_count,
COUNT(CASE WHEN age >= 22 AND age < 38 THEN 1 ELSE NULL END) AS gen_y_count,
COUNT(CASE WHEN age >= 38 AND age < 53 THEN 1 ELSE NULL END) AS gen_x_count,
COUNT(CASE WHEN age >= 53 THEN 1 ELSE NULL END) AS baby_boomer_count
FROM users
Combining data from different sources
Data joins are used to combine related information stored across multiple tables. A foreign key is used to describe how a record in one table relates to information in another table.
There are many different ways to combine information across two tables, as discussed below. These approaches define how we'll treat records which don't have a connection relating it to records in the other table. To understand how the following joins discussed work, check out this interactive demo.
Note: If both tables have the same column name, you need to create alias for the columns in the results. This is accomplished with SELECT column AS alias
.
In order to combine data from multiple tables, there must be a common field relating records in one table to records in the other. If this is the case, we can combine the records by using JOIN
to specify the table containing related information and ON
to provide the relationship between the two tables. This is basically saying, when a record in the original table matches a record in the join table, combine the two records. You can also provide additional logic in the ON
statement to further control what records are joined. Lastly, you can also specify multiple foreign keys to be matched on; if there are two columns that are common across both tables, joining on both of these keys can sometimes increase accuracy of your query.
SELECT *
FROM users
JOIN emails
ON users.id = emails.user_id
Pro tip: You can also create an alias for your tables by including the alias after the table name.
SELECT *
FROM really_long_name_for_table_of_users users -- here, users is an alias
JOIN emails
ON users.id = emails.user_id AND users.join_date > '20150214'
-- example of additional logic used in the ON statement
By default, SQL performs an inner join. For this approach, only results which are related across the two tables are returned (excluding all other records). You can also explicitly define an inner join with INNER JOIN
instead of JOIN
, although the two commands are synonymous.
All joins returns rows that are matched via the join criteria, however, an inner join excludes all unmatched rows. It is possible, however, to include records even if they weren't matched with any records in the other table - this is known as an outer join. The unmatched records will contain null values for values which would have normally been retrieved from the joined table.
A left join (LEFT JOIN
) returns all unmatched records in the original table (specified via FROM
), in addition to the matched records. The unmatched records will contain null values for the fields that were retrieved from the second table (specified via JOIN
).
A right join (RIGHT JOIN
) does the opposite, returning all unmatched records in the second table (specified via JOIN
), in addition to the matched records. The unmatched records will contain null values for the fields that could not be located in the original table (specified via FROM
).
A full outer join (FULL JOIN
) returns all records, matched and unmatched, from both tables.
You can combine multiple queries, stacking the results of one on top of the other, by creating a UNION
. By default, UNION
will not repeat any identical rows in the results. If you want the full addition of all results across queries, you must specify that with UNION ALL
. In order for two queries to be compatible for a union, they must have the same number of columns and each column should hold the same data type across both queries.
SELECT *
FROM table1
UNION ALL
SELECT *
FROM table2
Order of operations
When building an SQL query, it's important to keep in mind the order that commands are executed. People often say that lexical order and logical order of SQL commands differ; what they mean is that the way we would form a query in plain English (lexical) doesn't always match the order of information (logical) a computer needs to perform the query.
Execution
- FROM
- ON
- WHERE
- GROUP BY
- Aggregation functions (COUNT, SUM, MIN/MAX, AVG)
- HAVING
- SELECT
- DISTINCT
- UNION, INTERSECT, EXCEPT
- ORDER BY
- LIMIT, TOP