Course: DataCamp: Joining Data in SQL Notebook Author: Jae Choi
Course Description
In this course you'll learn all about the power of Joining tables while exploring interesting features of countries and their cities throughout the world. You will master Inner and outer Joins, as well as self-Joins, semi-Joins, anti-Joins and cross Joins - fundamental tools in any PostgreSQL wizard's toolbox. You'll fear set theory no more, after learning all about unions, intersections, and except clauses through easy-to-understand diagrams and examples. Lastly, you'll be introduced to the challenging topic of subqueries. You will see a visual perspective to grasp the ideas throughout the course Using the mediums of Venn diagrams and other linking illustrations.
Imports
# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
# https://gist.github.com/minrk/3301035
%load_ext watermark
%load_ext autoreload
%autoreload
import psycopg2
from sqlalchemy import create_engine
from sqlalchemy import MetaData
from sqlalchemy import Table
from sqlalchemy import Column
from sqlalchemy import Integer, String
from sqlalchemy import inspect
import pandas as pd
from pprint import pprint as pp
%watermark -a 'Jae H. Choi' -d -t -v -p psycopg2,sqlalchemy,pandas
PandAs Configuration Options
pd.set_option('max_columns', 200)
pd.set_option('max_rows', 300)
pd.set_option('display.expand_frame_repr', True)
PostgreSQL Connection
In order to run this Notebook, install, setup
and configure a PostgreSQL databAse with the previously mentioned datAsets.
Edit engine
to use your databAse username and pAssword.
t_host = "localhost" #"databAse address"
t_port = "5432" #default postgres port
t_dbname = "datacamp" #"databAse name"
t_user = "postgres" #"databAse user name"
t_pw = "1234" #"databAse user pAssword"
db_conn = psycopg2.connect(host=t_host, database=t_dbname, user=t_user, password=t_pw)
# Scheme: "postgres+psycopg2://<USERNAME>:<PASSWORD>@<IP_ADDRESS>:<PORT>/<DATABASE_NAME>"
engine = create_engine('postgresql+psycopg2://postgres:1234@localhost/datacamp')
# metadate
meta = MetaData(schema="countries")
# connection
conn = engine.connect()
Example(s) without pd.DataFrames - use fetchall
result = conn.execute("Select datname From pg_database")
rows = result.fetchall()
[x for x in rows]
# schema.table_name
cities = conn.execute("Select * \
From countries.countries \
Inner Join countries.cities \
On countries.cities.country_code = countries.code")
cities_res = cities.fetchall()
cities_list = [x for i, x in enumerate(cities_res) if i < 10]
cities_list
In this chapter, you'll be introduced to the concept of Joining tables, and explore the different ways you can enrich your queries Using Inner Joins and self-Joins. You'll also see how to use the cAse statement to split up a field into different categories.
cities = conn.execute("Select * From countries.cities")
cities_df = pd.read_sql("Select * From countries.cities", conn)
cities_df.head()
sql_stmt = "Select * \
From countries.cities \
Inner Join countries.countries \
ON countries.cities.country_code = countries.countries.code"
pd.read_sql(sql_stmt, conn).head()
sql_stmt = "Select countries.cities.name As city, \
countries.countries.name As country, \
countries.countries.region \
From countries.cities \
Inner Join countries.countries ON \
countries.cities.country_code = countries.countries.code"
pd.read_sql(sql_stmt, conn).head()
Select left_table As L_id
left_table.val As L_val
right_table.val As R_val
From left_table
Inner Join right_table
ON left_table.id=right_table.id;
Using
clause instead of the ON
clause.
Select left_table.id As L_id
left_table.val As L_val
right_table.val As R_val
From left_table
Inner Join right_table
Using (id);
Select p1.country,p1.continent,prime_minister,president
From leaders.presidents As p1
Inner Join leaders.prime_ministers As p2
Using (country);
sql_stmt = "Select p1.country, p1.continent, prime_minister, president \
From leaders.presidents As p1 \
Inner Join leaders.prime_ministers As p2 \
Using (country)"
pd.read_sql(sql_stmt, conn).head()
Why does the following code result in an error?
Select c.name As
country,
l.name As language
From countries As c
Inner Join languages As l;
Inner Join
requires a specification of the key field (or fields) in each table.
When Joining tables with a common field name, e.g.
Select *
From countries
Inner Join economies
ON countries.code=economies.code;
You can use Using
As a shortcut:
Select *
From countries
Inner Join economies
Using (code);
You'll now explore how this can be done with the countries
and
languages
tables.
Instructions
countries
on the left and languages
on the right with
Using(code)
.
As country
,As language
, and
-- Select fields and field's name as 'country'
Select c.name As
country,
c.continent,
l.name As language,
l.official
-- From countries (alias As c)
From countries As c
-- Join to languages (As l)
Inner Join languages As l
-- Match Using code
Using (code);
sql_stmt = "Select c.name As country, \
c.continent, \
l.name As language, \
l.official \
From countries.countries As c \
Inner Join countries.languages As l \
Using (code)"
pd.read_sql(sql_stmt, conn).head()
sql_stmt = "Select * \
From leaders.prime_ministers"
pm_df = pd.read_sql(sql_stmt, conn)
pm_df.head()
Select p1.country As country1,
p2.country, As country2,
p1.continent
From leaders.prime_ministers As p1
Inner Join prime_ministers As p2
On p1.continent=p2.continent;
sql_stmt = "Select p1.country As country1, \
p2.country As country2, \
p1.continent \
From leaders.prime_ministers As p1 \
Inner Join leaders.prime_ministers As p2 \
ON p1.continent = p2.continent"
pm_df_1 = pd.read_sql(sql_stmt, conn)
pm_df_1.head()
Select p1.country As country1,
p2.country As country2,
p1.continent
From leaders.prime_ministers As p1
Inner Join prime_ministers As p2
On p1.continent=p2.continent
And p1.country!=p2.country;
sql_stmt = "Select p1.country As country1, \
p2.country As country2, \
p1.continent \
From leaders.prime_ministers As p1 \
Inner Join leaders.prime_ministers As p2 \
ON p1.continent = p2.continent AND p1.country != p2.country"
pm_df_2 = pd.read_sql(sql_stmt, conn)
pm_df_2.head()
pm_df_1.equals(pm_df_2)
And
clause can check that multiple conditions are met.Case
is a way to do multiple if-then-else statements
Select name,continent,indep_year,
Case When indep_year < 1900 Then 'before 1900'
When indep_year <= 1930 Then 'between 1900 and 1930'
Else 'after 1930' End
As indep_year_group
From states
Order By indep_year_group;
sql_stmt = "Select name, continent, indep_year, \
Case When indep_year < 1900 Then 'before 1900' \
When indep_year <= 1930 Then 'between 1900 and 1930' \
Else 'after 1930' End \
As indep_year_group \
From leaders.states \
Order By indep_year_group"
pd.read_sql(sql_stmt, conn)
In this exercise, you'll use the populations
table to perform a self-Join to
calculate the percentage increAse in population From 2010 to 2015 for each country code!
Since you'll be Joining the populations
table to itself, you can aliAs populations
As p1
and also populations
As p2
. This is good practice
whenever you are aliAsing and your tables have the same first letter. Note that you are required
to aliAs the tables with self-Joins.
Instructions 1)
populations
with itself ON country_code
.country_code
From p1
and the size
field
From both p1
and p2
. SQL won't allow same-named fields, so aliAs
p1.size As size2010
and p2.size As size2015
.
-- Select fields with aliases
Select p1.size As
size2010,
p1.country_code,
p2.size As size2015,
-- From populations (alias As p1)
From countries.populations As p1
-- Join to itself (alias As p2)
Inner Join countries.populations As p2
-- Match on country code
On p1.country_code=p2.country_code;
sql_stmt = "Select p1.size As size2010, \
p1.country_code, \
p2.size As size2015 \
From countries.populations As p1 \
Inner Join countries.populations As p2 \
On p1.country_code = p2.country_code"
pd.read_sql(sql_stmt, conn).head()
Instructions 2) Notice From the result that for each country_code you have four entries laying out all combinations of 2010 and 2015.
On
in your query to include only those records where the p1.year
(2010) matches with p2.year - 5
(2015 - 5 = 2010). This will omit the three
entries per country_code
that you aren't interested in.
-- Select fields with aliases
Select p1.country_code,
p1.size As size2010,
p2.size As size2015
-- From populations (alias As p1)
From countries.populations As p1
-- Join to itself (alias As p2)
Inner Join countries.populations As p2
-- Match on country code
On p1.country_code=p2.country_code
-- and year (with calculation)
And p1.year=(p2.year-5);
sql_stmt = "Select p1.size As size2010, \
p1.country_code, \
p2.size As size2015 \
From countries.populations As p1 \
Inner Join countries.populations As p2 \
ON p1.country_code = p2.country_code \
AND p1.year = (p2.year - 5)"
pd.read_sql(sql_stmt, conn).head()
Instructions 3)
As you just saw, you can also use SQL to calculate values like p2.year - 5
for you.
With two fields like size2010
and size2015
, you may want to determine
the percentage increAse From one field to the next:
With two numeric fields A
and B
, the percentage growth From
A
to B
can be calculated As $$\frac{(B−A)}{A}∗100.0$$.
Add a new field to Select
, aliased As growth_perc
, that calculates the
percentage population growth From 2010 to 2015 for each country, Using p2.size
and
p1.size
.
Select p1.country_code,
p1.size As size2010,
p2.size As size2015,
-- calculate growth_perc
(p2.size-p1.size)/p1.size*100.0 As growth_perc
-- From populations (alias As p1)
From countries.populations As p1
-- Join to itself (alias As p2)
Inner Join countries.populations As p2
-- Match on country code
On p1.country_code=p2.country_code
-- and year (with calculation)
And p1.year=p2.year-5;
sql_stmt = "Select p1.size As size2010, \
p1.country_code, \
p2.size As size2015, \
(p2.size - p1.size)/p1.size * 100.0 As growth_perc \
From countries.populations As p1 \
Inner Join countries.populations As p2 \
ON p1.country_code = p2.country_code \
AND p1.year = (p2.year - 5)"
pd.read_sql(sql_stmt, conn).head()
Often it's useful to look at a numerical field not As raw data, but instead As being in different
categories or groups.
You can use Case
with When
, Then
, Elese
, and
End
to define a new grouping field.
Instructions
Using the countries table, create a new field As geosize_group that groups the countries into three groups:
If surface_area
is greater than 2 million, geosize_group
is 'large'
.
If surface_area
is greater than 350 thousand but not larger than 2 million, geosize_group
is 'medium'
.
Otherwise, geosize_group
is 'small'
.
Select name, continent, code, surface_area,
-- First case
Case When surface_area > 2000000 Then 'large'
-- Second case
When surface_area > 350000 Then 'medium'
-- Else clause + End
Else 'small' END
-- Alias name
As geosize_group
-- From table
From countries.countries;
sql_stmt = "Select name, continent, code, surface_area, \
Case When surface_area > 2000000 Then 'large' \
When surface_area > 350000 Then 'medium' \
Else 'small' End \
As geosize_group \
From countries.countries;"
pd.read_sql(sql_stmt, conn).head()
The table you created with the added geosize_group
field hAs been loaded for you
here with the name countries_plus
. Observe the use of (and the placement of) the
Into
command to create this countries_plus
table:
Select name,continent,code,surface_area,
Case When surface_area > 2000000 Then 'large'
When surface_area > 350000 Then 'medium'
Else 'small' End
As geosize_group Into countries_plus
From countries.countries;
You will now explore the relationship between the size of a country in terms of surface area and
in terms of population Using grouping fields created with Case
.
By the end of this exercise, you'll be writing two queries back-to-back in a single script. You
got this!
Instructions 1)
Using the populations
table focused only for the year
2015, create a
new field As popsize_group
to organize population size
into
'large'
(> 50 million)'medium'
(> 1 million)'small'
(<= 1 million)Select only the country code, population size, and this new popsize_group
As fields.
Select country_code,size,
-- First case
Case When size>50000000 Then 'large'
-- Second case
When size>1000000 Then 'medium'
-- Else clause + End
Else 'small' End
-- Alias name
As popsize_group
-- From table
From countries.populations
-- Focus on 2015
Where year=2015;
sql_stmt = "Select country_code, size, \
Case When size > 50000000 Then 'large' \
When size > 1000000 Then 'medium' \
Else 'small' End \
As popsize_group \
From countries.populations \
Where year = 2015;"
pd.read_sql(sql_stmt, conn).head()
Execute the first part on the PostgreSQL schema to create pop_plus
#CREATE TABLE new_table
# AS (SELECT *
# FROM old_table WHERE 1=2)
sql_stmt = "Create Table countries.pop_plus \
As (Select country_code, size, \
Case When size > 50000000 Then 'large' \
When size > 1000000 Then 'medium' \
Else 'small' End \
As popsize_group \
From countries.populations \
Where year = 2015);"
engine.execute(sql_stmt)
Instructions 2)
Into
to save the result of the previous query As pop_plus
. You
can see an example of this in the countries_plus
code in the Assignment text.
Make sure to include a ;
at the end of your Where
clause!
pop_plus
Using Select * From pop_plus
; so that you generate results and this will
display pop_plus
in query result.
Select country_code,size,
Case When size>50000000 Then 'large'
When size>1000000 Then 'medium'
Else 'small' End
As popsize_group
-- Into table
Into countries.pop_plus
From populations
Where year=2015;
sql_stmt="Select country_code,size,\
Case When size>50000000 Then 'large'\
When size>1000000 Then 'medium'\
Else 'small' End\
As popsize_group\
Into countries.test1\
From populations\
Where year=2015;"
engine.execute(sql_stmt)
sql_stmt = "\
SELECT * FROM countries.pop_plus; \
"
pd.read_sql(sql_stmt, conn).head()
Instructions 3)
pop_plus
using Into
.countries_plus As c
on the left with pop_plus As
p
on the right matching on the country code fields.
geosize_group
, in Ascending order so that
large
appears on top.
name
, continent
, geosize_group
, and popsize_group
fields.
sql_stmt = "\
Select c.name, c.continent, c.geosize_group, p.popsize_group \
From countries.countries_plus As c \
Inner Join countries.pop_plus As p \
ON c.code = p.country_code \
ORDER BY geosize_group AsC \
"
q_df = pd.read_sql(sql_stmt, conn)
q_df.head()
q_df.tail()
In this chapter, you'll come to grips with different kinds of outer Joins. You'll learn how to gain further insights into your data through left Joins, right Joins, and full Joins. In addition to outer Joins, you'll also work with cross Joins.
Inner Join
Select p1.country
prime_minister,
president
From prime_ministers As p1
Inner Join presidents As p2
On p1.country=p2.country;
LEFT Join
Select p1.country
prime_minister,
president
From prime_ministers As p1
Left Join presidents As p2
On p1.country=p2.country;
LEFT Join multiple matches
RIGHT Join
Select right_table.id As R_id
left_table.val As L_val,
right_table.val As R_val,
From left_table
Right Join right_table
On left_table.id=right_table.id;
sql_stmt = "Select p1.country, \
prime_minister, \
president \
From leaders.prime_ministers As p1 \
Inner Join leaders.presidents As p2 \
ON p1.country = p2.country"
pd.read_sql(sql_stmt, conn)
Inner Join
sql_stmt = "Select p1.country, prime_minister, president \
From leaders.prime_ministers As p1 \
LEFT Join leaders.presidents As p2 \
ON p1.country = p2.country"
pd.read_sql(sql_stmt, conn)
Now you'll explore the differences between performing an Inner Join and a left Join Using the
cities
and countries
tables.
You'll begin by performing an Inner Join with the cities
table on the left and the
countries
table on the right. Remember to aliAs the name of the city field As
city
and the name of the country field As country
.
You will then change the query to a left Join. Take note of how many records are in each query
here!
Instructions 1)
-- Select the city name (with alias), the country code, the country name (with alias), the region, and the city proper population
Select c1.name As
city,
code,
c2.name, As country,
region,
city_proper_pop,
-- From left table (with alias)
From cities As c1
-- Join to right table (with alias)
Inner Join countries As c2
-- Match on country code
On c1.country_code=c2.code
Order By code Desc;
sql_stmt = "Select c1.name As city, \
code, \
c2.name As country, \
region, city_proper_pop \
From countries.cities As c1 \
Inner Join countries.countries As c2 \
ON c1.country_code = c2.code \
ORDER BY code DESC;"
pd.read_sql(sql_stmt, conn).head()
Instructions 2)
Change the code to perform a LEFT Join
instead of an Inner Join
. After
executing this query, note how many records the query result contains.
Select c1.name As
city,
code,
c2.name, As country,
region,
city_proper_pop,
From cities As c1
-- Join to right table (with alias)
Left Join countries As c2
-- Match on country code
On c1.country_code=c2.code
Order By code Desc;
sql_stmt = "Select c1.name As city, \
code, \
c2.name As country,\
region, city_proper_pop \
From countries.cities As c1 \
LEFT Join countries.countries As c2 \
ON c1.country_code = c2.code \
ORDER BY code DESC;"
pd.read_sql(sql_stmt, conn).head()
Next, you'll try out another example comparing an Inner Join to its corresponding left Join.
Before you begin though, take note of how many records are in both the countries
and languages
tables below.
You will begin with an Inner Join on the countries
table on the left with the languages
table on the right. Then you'll change the code to a left Join in the next bullet.
Note the use of multi-line comments here Using /*
and */
.
Instructions 1)
country
field As country and the
name of the language
field As language.
sql_stmt = "Select c.name As country, \
local_name, \
l.name As language, \
percent \
From countries.countries As c \
Inner Join countries.languages As l \
On c.code = l.code \
Order By country Desc; "
res1 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res1)}')
res1.head()
Instructions 2)
sql_stmt = "Select c.name As country, \
local_name, \
l.name As language, \
percent \
From countries.countries As c \
LEFT Join countries.languages As l \
On c.code = l.code \
Order By country Desc; \
"
res2 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res2)}')
res2.head()
You'll now revisit the use of the AVG()
function introduced in our Intro to SQL for Data
Science course. You will use it in combination with left Join to determine the average gross
domestic product (GDP) per capita by region in 2010.
Instructions 1)
countries
table on the left and the economies
table on the right.
year
.sql_stmt = "Select name, \
region, \
gdp_percapita \
From countries.countries As c \
LEFT Join countries.economies As e \
On e.code = c.code \
Where year = 2010; \
"
res1 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res1)}')
res1.head()
Instructions 2)
As avg_gdp
for
each region in 2010.
region
and avg_gdp
fields.sql_stmt = "Select region, \
AVG(gdp_percapita) As avg_gdp \
From countries.countries As c \
LEFT Join countries.economies As e \
On e.code = c.code \
Where year = 2010 \
Group BY region \
Order BY avg_gdp Desc; \
"
res2 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res2)}')
res2.head()
Instructions 3)
sql_stmt = "Select region, \
AVG(gdp_percapita) As avg_gdp \
From countries.countries As c \
LEFT Join countries.economies As e \
On e.code = c.code \
Where year = 2010 \
Group BY region;"
res3 = pd.read_sql(sql_stmt, conn)
print(f'Number of Records: {len(res3)}')
res3.head()
Right Joins aren't As common As left Joins. One reAson why is that you can always write a right Join As a left Join.
Instructions
The left Join code is commented out here. Your tAsk is to write a new query Using rights Joins that produces the same result As what the query Using left Joins produces. Keep this left Joins code commented As you write your own query just below it Using right Joins to solve the problem.
Note the order of the Joins matters in your conversion to Using right Joins!
convert this code to use RIGHT Joins instead of LEFT Joins
Select cities.name As city,
urbanarea_pop,
countries.name, As country,
languages.name As language,
percent
From cities
Left Join countries
On cities.country_code=countries.code
Left Join languages
On countries.code=languages.code
Order By city, language;
sql_stmt = "Select cities.name As city,\
urbanarea_pop, \
countries.name As country, \
indep_year, \
languages.name As language, \
percent \
From countries.languages \
Left Join countries.countries \
On languages.code = countries.code \
Left Join countries.cities \
On cities.country_code = countries.code \
Order BY city, language; \
"
pd.read_sql(sql_stmt, conn).head()
Select left_table.id As L_id,
right_table.id As R_id,
left_table.val, As L_val,
right_table.val As R_val
From left_table
Full Join right_table
Using (id);
Select p1.country As pm_co,
p2.country As pres_co,
prime_minister,
president,
From prime_ministers As p1
Full Join presidents As p2
On p1.country=p2.country;
sql_stmt = "Select p1.country As pm_co,\
p2.country As pres_co, \
prime_minister, \
president \
From leaders.prime_ministers As p1 \
FULL Join leaders.presidents As p2 \
ON p1.country = p2.country;"
pd.read_sql(sql_stmt, conn)
In this exercise, you'll examine how your results differ when Using a full Join versus Using a
left Join versus Using an Inner Join with the countries
and currencies
tables.
You will focus on the North American region
and also where the name
of
the country is missing. Dig in to see what we mean!
Begin with a full Join with countries
on the left and currencies
on the
right. The fields of interest have been Select
ed for you throughout this exercise.
Then complete a similar left Join and conclude with an Inner Join.
Instructions 1)
Select name As
country,
code,
region,
basic_unit
-- From to countries
From countries
-- Join to to currencies
Full Join currencies
Using (code)
-- Where region is North America or null
Where region='North America' Or region Is Null
;
Order By region;
sql_stmt = "Select name As country, \
code, region, \
basic_unit \
From countries.countries \
FULL Join countries.currencies \
Using (code) \
Where region = 'North America' OR region Is Null \
Order BY region;"
pd.read_sql(sql_stmt, conn)
Instructions 2)
LEFT Join
instead of a FULL
Join
. Note what hAs changed compared to the FULL Join
result!
Select name As
country,
code,
region,
basic_unit
-- From to countries
From countries
-- Join to to currencies
Left Join currencies
Using (code)
-- Where region is North America or null
Where region='North America' Or region Is Null
;
Order By region;
sql_stmt = "Select name As country, \
code, region, basic_unit \
From countries.countries \
LEFT Join countries.currencies \
Using (code) \
Where region = 'North America' OR region Is Null \
ORder BY region;"
pd.read_sql(sql_stmt, conn)
Instruction 3)
Inner Join
instead of a FULL
Join
. Note what hAs changed compared to the FULL Join
and LEFT
Join
results!
Select name As
country,
code,
region,
basic_unit
-- From countries
From countries
-- Join to to currencies
Inner Join currencies
Using (code)
-- Where region is North America or null
Where region='North America Or region Is Null
Order By region;
sql_stmt = "Select name As country, \
code, region, basic_unit \
From countries.countries \
Inner Join countries.currencies \
Using (code) \
Where region = 'North America' OR region IS null \
Order BY region; \
"
pd.read_sql(sql_stmt, conn)
Have you kept an eye out on the different numbers of records these queries returned? The
FULL Join
query returned 17 rows, the LEFT Join
returned 4 rows, and
the Inner Join
only returned 3 rows. Do these results make sense to you?
You'll now investigate a similar exercise to the last one, but this time focused on Using a table
with more records on the left than the right. You'll work with the languages
and
countries
tables.
Begin with a full Join with languages
on the left and countries
on the
right. Appropriate fields have been Selected for you again here.
Instructions 1/3
countries.name
starts with the capital letter
'V'
or is NULL
and arrange by countries.name
in
Ascending order to more clearly see the results.
Select name As country,
code,
languages.name As language
-- From languages
From languages
-- Join to to countries
Inner Join countries
Using (code)
-- Where region countries.name starts with V or is null
Where countries.name Like 'V%' Or countries.name Is Null
Order By countries.name;
sql_stmt = "Select countries.name, code, \
languages.name As language \
From countries.languages \
FULL Join countries.countries \
Using (code) \
Where countries.name LIKE 'V%%' OR countries.name IS null \
Order BY countries.name; \
"
pd.read_sql(sql_stmt, conn)
Instructions 2)
left Join
instead of a full Join. Note
what hAs changed compared to the full Join result!
Select countries.name As country,
code,
languages.name As language
-- From languages
From languages
-- Join to to countries
Left Join countries
Using (code)
-- Where countries.name starts with V or is null
Where countries.name Like 'V%' Or countries.name Is Null
Order By countries.name;
sql_stmt = "Select countries.name, code, \
languages.name As language \
From countries.languages \
LEFT Join countries.countries \
Using (code) \
Where countries.name LIKE 'V%%' OR countries.name Is Null \
Order BY countries.name;"
pd.read_sql(sql_stmt, conn)
Instructions 3)
Inner Join
instead of a left Join. Note what hAs
changed compared to the full Join and left Join results.
Select countries.name As
country,
code,
languages.name As language
-- From languages
From languages
-- Join to countries
Inner Join countries
Using (code)
-- Where countries.name starts with V or is null
Where countries.name Like 'V%' Or countries.name Is Null
Order By countries.name;
sql_stmt = "Select countries.name, code, \
languages.name As language \
From countries.languages \
Inner Join countries.countries \
Using (code) \
Where countries.name Like 'V%%' OR countries.name IS null \
Order BY countries.name;"
pd.read_sql(sql_stmt, conn)
You'll now explore Using two consecutive full Joins on the three tables you worked with in the previous two exercises.
Instructions
countries
on the left and languages
on
the right.
currencies
on the right.LIKE
to choose the Melanesia and Micronesia regions (Hint:
'M%esia'
).
As country
, region,
language name As language
, and basic and fractional units of currency.
Select c1.name As
country,
region,
l.name As language,
basic_unit,
frac_unit
-- From countries (alias as c1)
From countries As c1
-- Join to languages
Full Join languages As l
Using (code)
-- Join to currencies (alias as c1)
Full Join currencies As c2
Using (code)
-- Where region like Melanesia and Micronesia
Where region Like 'M%esia';
sql_stmt = "Select c1.name As country, region, \
l.name As language, \
basic_unit, frac_unit \
From countries.countries As c1 \
FULL Join countries.languages As l \
Using (code) \
FULL Join countries.currencies As c2 \
Using (code) \
Where region Like 'M%%esia';"
pd.read_sql(sql_stmt, conn)
A(n) ___
Join is a Join combining the results of a ___
Join and a
___
Join.
Answer the question
CROSS Joins create all possible combinations of two tables.
id
From
table1
and id
From table2
(e.g. 1(A-C), 2(A-C), &
3(A-C))
prime_ministers
table are scheduled for individual meetings with all presidents in the presidents table.
Select prime_minister,
president,
From prime_ministers As p1
-- Join to languages
Cross Join presidents As p2
Where p1.continent In ('North America', 'Oceania');
sql_stmt = "Select prime_minister, \
president \
From leaders.prime_ministers As p1 \
CROSS Join leaders.presidents As p2 \
Where p1.continent IN ('North America', 'Oceania');"
pd.read_sql(sql_stmt, conn)
A table of two cities
This exercise looks to explore languages potentially and most frequently spoken in the cities of Hyderabad, India and Hyderabad, Pakistan.
You will begin with a cross Join with cities As c
on the left and languages As
l
on the right. Then you will modify the query Using an Inner Join in the next tab.
Instructions 1)
ON
or Using
.)
LIKE
and Hyder%
to choose Hyderabad in both countries.
As city
and language name As language
.
Select c.name As
city,
l.name As language
From cities As c
Cross Join languages As l
Where c.name Like 'Hyder%';
sql_stmt = "Select c.name As city, \
l.name As language \
From countries.cities As c \
CROSS Join countries.languages As l \
Where c.name Like 'Hyder%%';"
unique_lang = hyderabad_lang['language'].unique()
print(len(unique_lang))
hyderabad_lang = pd.read_sql(sql_stmt, conn)
hyderabad_lang
Instructions 2)
Use an Inner Join instead of a cross Join. Think about what the difference will be in the results for this Inner Join result and the one for the cross Join.
Select c.name As
city,
l.name As language
From cities As c
Inner Join languages As l
On c.country_code=l.code
Where c.name Like 'Hyder%';
sql_stmt = "Select c.name As city, \
l.name As language \
From countries.cities As c \
Inner Join countries.languages As l \
On c.country_code = l.code \
Where c.name LIKE 'Hyder%%';"
pd.read_sql(sql_stmt, conn)
Outer challenge
Now that you're fully equipped to use outer Joins, try a challenge problem to test your knowledge! In terms of life expectancy for 2010, determine the names of the lowest five countries and their regions.
Instructions
As country
, region
, and life expectancy As
life_exp
.
LEFT Join
, Where
, Order BY
, and
LIMIT
.
Select c.name As
country,
c.region,
p.life_expectancy As life_exp
From countries As c
Left Join populations As p
On c.code=l.country_code
Where p.year=2010
Order By life_exp
Limit 5;
sql_stmt = "Select c.name As country, \
c.region, \
p.life_expectancy As life_exp \
From countries.countries As c \
LEFT Join countries.populations As p \
On c.code = p.country_code \
Where p.year = 2010 \
Order BY life_exp \
Limit 5;"
pd.read_sql(sql_stmt, conn)
In this chapter, you'll learn more about set theory Using Venn diagrams and you will be introduced to union, union all, intersect, and except clauses. You'll finish by investigating semi-Joins and anti-Joins, which provide a nice introduction to subqueries.
The shading represents what's included in the result of the set operation From each table.
UNION
includes every record in both tables, but DOES
NOT double count those that are in both tables.
UNION ALL
includes every record in both tables and DOES
replicate those that are in both tables, represented by the black center
INTERSECT
results in only those records found in both of the
tables.
EXCEPT
results in only those records in one table, BUT
NOT the other.
UNION
does have no duplicate while UNION ALL
includes all duplicates.
monarchs
table in the leaders
databAse
Use UNION
on the prime_ministers
and monarchs
tables
all prime ministers and monarchs
Select prime_minister As leader,
country
From leaders.prime_ministers
UNION
Select monarch,
country
From leaders.monarchs
Order By country;
prime_minister
field hAs been aliAsed As leader. The resulting
field from the UNION
will have the name leader.
Select prime_minister As leader,
country
From leaders.prime_ministers
UNION ALL
Select monarch,
country
From leaders.monarchs
Order By country;
UNION
and UNION ALL
clauses do not do the lookup step that Join
s
do, they stack records on top of each other From one table to the next.
sql_stmt = "Select * \
From leaders.monarchs;"
pd.read_sql(sql_stmt, conn)
sql_stmt = "Select prime_minister As leader, \
country \
From leaders.prime_ministers \
UNION \
Select monarch, country \
From leaders.monarchs \
ORDER BY country;"
pd.read_sql(sql_stmt, conn)
sql_stmt = "Select prime_minister As leader, country \
From leaders.prime_ministers \
UNION ALL \
Select monarch, country \
From leaders.monarchs \
ORDER BY country;"
pd.read_sql(sql_stmt, conn)
Near query result to the right, you will see two new tables with names economies2010
and economies2015
.
Instructions
economies2010
. The economies
table is also included for reference.
Select *
From countries.economies2010
UNION
Select *
From countries.economies2015
Order By code, year;
sql_stmt = "\
Select * \
From countries.economies2010 \
UNION \
Select * \
From countries.economies2015 \
ORDER BY code, year; \
"
pd.read_sql(sql_stmt, conn)
UNION
can also be used to determine all occurrences of a field across multiple
tables. Try out this exercise with no starter code.
Instructions
cities
or the currencies
table. The result should be a table with only one field called country_code
.
country_code
in alphabetical order.
Select country_code
From countries.citie
UNION
Select code
From countries.currencies
Order By country_code;
sql_stmt = "\
Select country_code \
From countries.cities \
UNION \
Select code \
From countries.currencies \
Order BY country_code;"
country_codes = pd.read_sql(sql_stmt, conn)
country_codes.head()
country_codes.tail()
As you saw, duplicates were removed From the previous two exercises by Using UNION
.
To include duplicates, you can use UNION ALL
.
economies
or the populations
tables. Order by code
then year
.
Select code, year
From countries.economies
UNION All
Select country_code, year
From countries.populations
Order By code, year;
sql_stmt = "\
Select code, year \
From countries.economies \
UNION ALL \
Select country_code, year \
From countries.populations \
ORDER BY code, year; \
"
country_codes_year = pd.read_sql(sql_stmt, conn)
country_codes_year.head()
country_codes_year.tail()
Select id
From left_one
Intersect
Select id
From right_one;
The set theory clause INTERSECT
works in a similar fAshion
to UNION
and UNION ALL
, but remember From the Venn diagram, INTERSECT
only includes those records in common to both tables and fields Selected.
The result only includes records common to the tables Selected
Select country
From leaders.prime_ministers
Intersect
Select country
From leaders.presidents;
What happens if two columns are Selected, instead of one?
Select country,
prime_minister As leader
From leaders.prime_ministers
Intersect
Select country, president
From leaders.presidents;
INTERSECT
looks at two columns, it includes both columns in the search.
INTERSECT
looks for records in common, not individual
key fields like what a Join does to match.
sql_stmt = "\
Select country \
From leaders.prime_ministers \
INTERSECT \
Select country \
From leaders.presidents \
"
pd.read_sql(sql_stmt, conn)
Repeat the previous UNION ALL
exercise, this time looking at the records in common
for country code and year for the economies
and populations
tables.
Instructions
code
and then by year
, both in Ascending order.
UNION ALL
query result (814 records).
Select code, year
From countries.economies
INTERSECT
Select country_code, year
From countries.populations
Order By code, year;
sql_stmt = "\
Select code, year \
From countries.economies \
INTERSECT \
Select country_code, year \
From countries.populations \
ORDER BY code, year; \
"
pd.read_sql(sql_stmt, conn)
As you think about major world cities and their corresponding country, you may Ask which countries also have a city with the same name As their country name?
Instructions
INTERSECT
to answer this question with countries
and cities
!
sql_stmt = "\
Select name \
From countries.countries \
INTERSECT \
Select name \
From countries.cities; \
"
pd.read_sql(sql_stmt, conn)
Hong Kong is part of China, but it appears separately here because it hAs its own ISO country code. Depending upon your analysis, treating Hong Kong separately could be useful or a mistake. Always check your datAset closely before you perform an analysis!
Which of the following combinations of terms and definitions is correct?
Answer the question
Select monarch, country
From leaders.monarchs
Intersect
Select prime_minister, country
From leaders.prime_ministers;
EXCEPT
includes only the records in one table, but not in the
other.
There are some monarchs that also act As the prime minister. One way to determine those
monarchs in the monarchs table that do not also hold the title prime minister, is to use the
EXCEPT
clause.
This SQL query Selects the monarch field From monarchs, then looks for common entries with the prime_ministers field, while also keeping track of the country for each leader.
Only the two European monarchs are not also prime ministers in the leaders database.
Only the records that appear in the left table, BUT DO NOT appear in the right table are included.
sql_stmt = "\
Select monarch, country \
From leaders.monarchs \
EXCEPT \
Select prime_minister, country \
From leaders.prime_ministers; \
"
pd.read_sql(sql_stmt, conn)
Get the names of cities in cities
which are not noted As capital cities in countries
As a single field result.
Note that there are some countries in the world that are not included in the
countries
table, which will result in some cities not being labeled As capital
cities when in fact they are.
Instructions
sql_stmt = "\
Select name \
From countries.cities \
EXCEPT \
Select capital \
From countries.countries \
ORDER BY name; \
"
pd.read_sql(sql_stmt, conn).head()
Now you will complete the previous query in reverse!
Determine the names of capital cities that are not listed in the cities
table.
Instructions
capital
in Ascending order.cities
table contains information about 236 of the world's most populous
cities. The result of your query may surprise you in terms of the number of capital cities
that DO NOT appear in this list!
sql_stmt = "\
Select capital \
From countries.countries \
EXCEPT \
Select name \
From countries.cities \
ORDER BY capital; \
"
pd.read_sql(sql_stmt, conn).head()
The lAst two Joins use a right table to determine which records to keep in the left table.
semi-Joins
and anti-Joins
don't have the same built-in SQL
syntax that Inner Join and LEFT Join have.
semi-Joins
and anti-Joins
are useful tools in filtering table
records on the records of another table.
Select president, country, continent
From leaders.presidents
Where country In
(Select name
From leaders.states
Where indep_year<1800);
This is an example of a subquery, which is a query that sits inside another query.
Does this include the presidents of Spain and Portugal?
The semi-Join
chooses records in the first table where a condition IS
met in the second table.
semi-Join
matches records by key field in the right table with those in the
left.
sql_stmt = "\
Select president, country, continent \
From leaders.presidents \
WHERE country IN \
(Select name \
From leaders.states \
WHERE indep_year < 1800); \
"
pd.read_sql(sql_stmt, conn)
Select president, country, continent
From leaders.presidents
Where continent Like '%America'
And country Not In
(Select name
From leaders.states
Where indep_year<1800);
anti-Join
chooses records in the first table where a condition IS
NOT met in the second table.
NOT
to exclude those countries in the subquery.anti-Join
picks out those columns in the left table that do not match the
condition on the right table.
sql_stmt = "\
Select president, country, continent \
From leaders.presidents \
WHERE continent LIKE '%%America' \
AND country NOT IN \
(Select name \
From leaders.states \
WHERE indep_year < 1800); \
"
pd.read_sql(sql_stmt, conn)
You are now going to use the concept of a semi-Join to identify languages spoken in the Middle East.
Instructions 1)
Select
, From
, and
Where
.
sql_stmt = "\
Select code \
From countries.countries \
Where region = 'Middle East'; \
"
pd.read_sql(sql_stmt, conn)
You are now going to use the concept of a semi-Join to identify languages spoken in the Middle East.
Instructions 2)
/*
and
*/
. You'll come back to it!
languages
table.
name
in Ascending order.sql_stmt = "\
Select DISTINCT name \
From countries.languages \
ORDER BY name; \
"
pd.read_sql(sql_stmt, conn)
You are now going to use the concept of a semi-Join to identify languages spoken in the Middle East.
Instructions 3)
Now combine the previous two queries into one query:
Where In
statement to the Select Distinct
query, and use the
commented out query From the first instruction in there. That way, you can determine the
unique languages spoken in the Middle EAst.
Carefully review this result and its code after completing it. It serves As a great example of subqueries, which are the focus of Chapter 4.
sql_stmt = "\
Select DISTINCT name \
From countries.languages \
Where code In \
(Select code \
From countries.countries \
Where region = 'Middle East') \
Order BY name; \
"
pd.read_sql(sql_stmt, conn)
Let's revisit the code From the previous exercise, which retrieves languages spoken in the Middle East.
Select Distinct name
From languages
Where code In
(Select name
From countries
Where region='Middle East')
Order By name;
Sometimes problems solved with semi-Joins can also be solved Using an Inner Join.
Select Distinct languages.name As
language
From languages
Inner Join countries On languages.code =
countries.code
Where region='Middle East')
Order By language;
This Inner Join isn't quite right. What is missing From this second code block to get it to match with the correct answer produced by the first block?
Possible Answers
HAVING
instead of WHERE
DISTINCT
</strong>UNIQUE
sql_stmt = "\
Select DISTINCT languages.name As language \
From countries.languages \
Inner Join countries.countries \
On languages.code = countries.code \
Where region = 'Middle East' \
Order BY language;"
pd.read_sql(sql_stmt, conn)
Another powerful Join in SQL is the anti-Join. It is particularly useful in identifying which records are caUsing an incorrect number of records to appear in Join queries.
You will also see another example of a subquery here, As you saw in the first exercise on semi-Joins. Your goal is to identify the currencies used in Oceanian countries!
Instructions 1)
countries
that are listed in
Oceania Using Select
, From
, and WHERE
.
sql_stmt = "\
Select count(name) \
From countries.countries \
Where continent = 'Oceania'; \
"
pd.read_sql(sql_stmt, conn)
Instructions 2)
countries As c1
on the left and currencies As
c2
on the right to get the different currencies used in the countries of Oceania.
ON
the code
field in the two tables.code
, country name
, and bAsic_unit As
currency
.
Observe query result and make note of how many different countries are listed here.
sql_stmt = "\
Select c1.code, \
c1.name, \
c2.bAsic_unit As currency \
From countries.countries As c1 \
Inner Join countries.currencies As c2 \
On c1.code = c2.code \
Where continent = 'Oceania'; \
"
pd.read_sql(sql_stmt, conn)
Instructions 3)
Note that not all countries in Oceania were listed in the resulting Inner Join with currencies. Use an anti-Join to determine which countries were not included!
NOT IN
and (Select code From currencies)
As a subquery to get
the country code and country name for the Oceanian countries that are not included in the
currencies
table.
sql_stmt = "\
Select code, name \
From countries.countries \
WHERE continent = 'Oceania' \
AND code NOT IN \
(Select code \
From countries.currencies); \
"
pd.read_sql(sql_stmt, conn)
Congratulations! You've now made your way to the challenge problem for this third chapter. Your
tAsk here will be to incorporate two of UNION
/UNION ALL
/INTERSECT
/EXCEPT
to solve a challenge involving three tables.
In addition, you will use a subquery As you have in the lAst two exercises! This will be great practice As you hop into subqueries more in Chapter 4!
Instructions
economies
or currencies
but not in populations
.
sql_stmt = "\
Select country_code, name \
From countries.cities As c1 \
Where country_code In \
(Select e.code \
From countries.economies As e \
UNION ALL \
Select c2.code \
From countries.currencies As c2 \
EXCEPT \
Select p.country_code \
From countries.populations As p); \
"
pd.read_sql(sql_stmt, conn)
In this closing chapter, you'll learn how to use nested queries to add some finesse to your data insights. You'll also wrap all of the content covered throughout this course into solving three challenge problems.
WHERE
statement.
sql_stmt = "\
Select name, fert_rate \
From leaders.states \
Where continent = 'Asia' \
And fert_rate < \
(Select AVG(fert_rate) \
From leaders.states); \
"
pd.read_sql(sql_stmt, conn)
Select
clause.Count the number of countries listed in states table for each continent in the prime_ministers
table.
Continents in the prime_ministers
table
Determine the counts of the number of countries in states for each of the continents in the lAst slide
Select count(name)
From leaders.states
Where continent In
(Select Distinct continent
From leaders.prime_ministers);
-- From languages
Select Distinct continent,
(Select Count(*)
From leaders.states
Where prime_ministers.continent = states.continent) As countries_num
From leaders.prime_ministers;
prime_ministers
table in
the main query.
Select
statement, you need to give the
subquery an aliAs (e.g. countries_num
in the example)
sql_stmt = "\
Select DISTINCT continent, \
(Select COUNT(*) \
From leaders.states \
Where prime_ministers.continent = states.continent) As countries_num \
From leaders.prime_ministers \
"
pd.read_sql(sql_stmt, conn)
You'll now try to figure out which countries had high average life expectancies (at the country level) in 2015.
Instructions 1)
sql_stmt = "\
Select avg(life_expectancy) \
From countries.populations \
Where year = 2015; \
"
pd.read_sql(sql_stmt, conn)
Instructions 2)
1.15 * 100
in terms of life expectancy for 2015:
Select *,
From populations
Where life_expectancy>1.15*100 And year=2015;
populations
with records corresponding to larger than
1.15 times the average you calculated in the first tAsk for 2015. In other words, change the
100
in the example above with a subquery.
sql_stmt = "\
Select * \
From countries.populations \
Where life_expectancy > 1.15 * \
(Select avg(life_expectancy) \
From countries.populations \
Where year = 2015) And year = 2015; \
"
pd.read_sql(sql_stmt, conn)
Use your knowledge of subqueries in Where
to get the urban area population for only
capital cities.
Instructions
capital
field in the countries
table in your
subquery.
sql_stmt = "\
Select name, country_code, urbanarea_pop \
From countries.cities \
Where name In \
(Select capital \
From countries.countries) \
Order BY urbanarea_pop Desc; \
"
pd.read_sql(sql_stmt, conn)
In this exercise, you'll see how some queries can be written Using either a Join or a subquery.
You have seen previously how to use GROUP BY
with aggregate functions and an Inner
Join to get summarized information From multiple tables.
The code given in query.sql Selects the top nine countries in terms of number of
cities appearing in the cities
table. Recall that this corresponds to the most
populous cities in the world. Your tAsk will be to convert the commented out code to get the
same result As the code shown.
Instructions 1)
Submit your Answer:
sql_stmt = "\
Select countries.name As country, \
COUNT(*) As cities_num \
From countries.cities \
Inner Join countries.countries \
On countries.code = cities.country_code \
Group BY country \
Order BY cities_num Desc, country \
LIMIT 9; \
"
pd.read_sql(sql_stmt, conn)
Instructions 2)
GROUP BY
code to use a subquery inside of Select
, i.e.
fill in the blanks to get a result that matches the one given Using the GROUP
BY
code in the first query.
cities_num
descending and then by
country
Ascending.
sql_stmt = "\
Select countries.name As country, \
(Select count(*) \
From countries.cities \
Where countries.code = cities.country_code) As cities_num \
From countries.countries \
Order BY cities_num Desc, country \
LIMIT 9; \
"
pd.read_sql(sql_stmt, conn)
The last basic type of subquery exists inside of a From
clause. Determine
the maximum percentage of women in parliament for each continent listing in
leaders.states
Select continent,
Max(women_parli_perc) As max_perc
From states
Group By continent
Order By continent;
continent
is included As one of th fields in the
Select
clause, since we are grouping bAsed on that field.
sql_stmt = "\
Select continent, MAX(women_parli_perc) As max_perc \
From leaders.states \
Group BY continent \
Order BY continent; \
"
pd.read_sql(sql_stmt, conn)
FocUsing on records in monarchs
Multiple tables can be included in the From
clause, by adding a comma
between them
Produces part of the answer; how should duplicates be removed?
sql_stmt = "\
Select monarchs.continent \
From leaders.monarchs, leaders.states \
Where monarchs.continent = states.continent \
Order BY continent; \
"
pd.read_sql(sql_stmt, conn)
Finishing the subquery
To get Asia and Europe to appear only once, use DISTINCT
in the
Select
statement.
How is the max_perc
column included with continent?
Instead of including states in the From
clause, include the subquery instead
and aliAs it with a name like subquery
.
This is how to include a subquery As a temporary table in the From
clause.
sql_stmt = "\
Select DISTINCT monarchs.continent, subquery.max_perc \
From leaders.monarchs, \
(Select continent, MAX(women_parli_perc) As max_perc \
From leaders.states \
GROUP BY continent) As subquery \
WHere monarchs.continent = subquery.continent \
Order BY continent; \
"
pd.read_sql(sql_stmt, conn)
The last type of subquery you will work with is one inside of From
.
You will use this to determine the number of languages spoken for each country, identified by the
country's local name! (Note this may be different than the name
field and is stored
in the local_name
field.)
Instructions 1)
Begin by determining for each country code how many languages are listed in the
languages
table using Select
, From
, and GROUP
BY
.
Alias the aggregated field As lang_num
.
sql_stmt = "\
Select code, count(name) As lang_num \
From countries.languages \
GROUP BY code \
ORDER BY lang_num DESC; \
"
lang_count = pd.read_sql(sql_stmt, conn)
print(lang_count.head())
print(lang_count.tail())
Instructions 2)
subquery
) As a subquery in the From
clause of a new query.
countries
.lang_num
From subquery
.WHERE
appropriately to match code
in countries
and in subquery
.
lang_num
in descending order.sql_stmt = "\
Select local_name, \
subquery.lang_num \
From countries.countries, \
(Select code, count(name) As lang_num \
From countries.languages \
Group BY code) As subquery \
Where countries.code = subquery.code \
Order BY lang_num DESC; \
"
lang_count = pd.read_sql(sql_stmt, conn)
print(lang_count.head())
print(lang_count.tail())
You can also nest multiple subqueries to answer even more specific questions.
In this exercise, for each of the six continents listed in 2015, you'll identify which country
had the maximum inflation rate (and how high it was) Using multiple subqueries. The table result
of your query in Task 3 should look something like the following, where
anything between <
>
will be filled in with appropriate values:
name | continent | inflation_rate |
---|---|---|
<country1> | North America | <max_inflation1> |
<country2> | Africa | <max_inflation2> |
<country3> | Oceania | <max_inflation3> |
<country4> | Europe | <max_inflation4> |
<country5> | South America | <max_inflation5> |
<country6> | Asia | <max_inflation6> |
</code></pre> Again, there are multiple ways to get to this solution Using only Joins, but the focus here is on showing you an introduction into advanced subqueries.
Instructions 1)
countries
on the left and economies
on
the right with Using
. Do not alias your tables or columns.
sql_stmt = "\
Select name, continent, inflation_rate \
From countries.countries \
Inner Join countries.economies \
Using (code) \
WHERE year = 2015; \
"
inf_rate = pd.read_sql(sql_stmt, conn)
print(inf_rate.head())
print(inf_rate.tail())
Instructions 2)
subquery
in the From
clause.
As max_inf
grouped by continent.continent
in the outer Select
statement.)
sql_stmt = "\
Select max(inflation_rate) As max_inf \
From ( \
Select name, continent, inflation_rate \
From countries.countries \
Inner Join countries.economies \
Using (code) \
WHERE year = 2015) As subquery \
GROUP BY continent; \
"
pd.read_sql(sql_stmt, conn)
Instructions 3)
Where
,
And
, and In
to obtain the name of the country, its continent, and
the maximum inflation rate for each continent in 2015. Revisit the sample output in the
Assignment text at the beginning of the exercise to see how this matches up.
On
instead of
Using
.
sql_stmt = "\
Select name, continent, inflation_rate \
From countries.countries \
Inner Join countries.economies \
ON countries.code = economies.code \
WHERE year = 2015 \
AND inflation_rate IN ( \
Select max(inflation_rate) As max_inf \
From ( \
Select name, continent, inflation_rate \
From countries.countries \
Inner Join countries.economies \
ON countries.code = economies.code \
WHERE year = 2015) As subquery \
GROUP BY continent); \
"
pd.read_sql(sql_stmt, conn)
Let's test your understanding of the subqueries with a challenge problem! Use a subquery to get 2015 economic data for countries that do not have
gov_form
of 'Constitutional Monarchy'
or'Republic'
in their gov_form
.Here, gov_form
stands for the form of the government for each country. Review the
different entries for gov_form
in the countries
table.
Instructions
-- Select fields
Select code,
inflation_rate,
unemployment_rate
-- From economies
From economies
-- Where year is 2015 and code is not in
Where year=2015 And code Not In
-- Subquery of the codes that gov_form is 'Constitutional Monarchy' or '%Republic'
(Select code
From countries
Where (gov_form='Constitutional Monarchy' Or gov_form Like '%Republic'))
-- Order by inflation rate
Order By inflation_rate;
sql_stmt = "\
Select code, inflation_rate, unemployment_rate \
From countries.economies \
Where year = 2015 Ans code Not In \
(Select code \
From countries.countries \
Where (gov_form = 'Constitutional Monarchy' Or gov_form Like '%%Republic')) \
Order BY inflation_rate; \
"
pd.read_sql(sql_stmt, conn)
Within which SQL clause are subqueries most frequently found?
Answer the question
Types of Joins:
Inner Join: also denoted As Join
OUTER Join
CROSS Join: create all possible combinations between two tables
Notes
Words appearing in ALL capital letters correspond to Joins having simple SQL syntax. Self-Joins, semi-Joins, and anti-Joins don't have built-in SQL syntax.
An Inner Join keeps only the records in which the key field (or fields) is in both tables. A LEFT Join keeps all the records in fields specified in the left table and includes the matches in the right table bAsed on the key field or fields. Key field values that don't match in the right table are included As missing data in the resulting table of a LEFT Join.
A RIGHT Join keeps all the records specified in the right table and includes the matches From the key field(s) in the left table. THose that don't match are included As missing values in the resulting table From the RIGHT Join query.
A FULL Join is a combination of a LEFT Join and a RIGHT Join showing exactly which values appear in both tables and those that appear in only one or the other table. A CROSS Join matches all records From fields specified in one table with all records From fields specified in another table. Remember that a
CROSS Join
does not have anOn
orUsing
clause, but otherwise looks very similar to the code for anInner Join
,LEFT Join
,RIGHT Join
, orFULL Join
.
Set Theory Clauses
UNION
includes every record in both tables but DOES
NOT double count those that are in both tables.
UNION ALL
does replicate those that are in both tables.INTERSECT
gives only those records found in both of the two tables.EXCEPT
gives only those records in one table but not the
other.
Semi-Joins and Anti-Joins
Types of basic subqueries
WHERE
clause.Select
clauses and inside
From
clauses.
WHERE
clauses too.
Welcome to the end of the course! The next three exercises will test your knowledge of the content covered in this course and apply many of the ideAs you've seen to difficult problems. Good luck!
Read carefully over the instructions and solve them step-by-step, thinking about how the
different clauses work together.
In this exercise, you'll need to get the country names and other 2015 data in
the economies
table and the countries
table for Central
American countries with an official language.
Instructions
countries
on the left. (An Inner Join would also work, but
pleAse use a left Join here.)
code
in the two tables And
use a subquery inside of
On
to choose the appropriate languages
records.
Order
by country name ascending.
-- Select fields
Select Distinct c.name,
e.total_investment,
e.imports
-- From countries (with alias as c)
From countries As c
-- Join to economies (with alias as e
Left Join economies As e
-- Match on code in (c,e) and the c.code in official languages
On (c.code=e.code
And c.code In (Select l.code
From languages As l
Where official='true'))
-- Where region and year are correct
Where year=2015 And region='Central America'
-- Order by field
Order By name;
sql_stmt = "\
Select DISTINCT c.name, \
e.total_investment, \
e.imports \
From countries.countries As c \
LEFT Join countries.economies As e \
On (c.code = e.code \
And c.code IN (Select l.code \
From countries.languages As l \
Where official = 'true'))\
Where year = 2015 And region = 'Central America' \
Order BY name; \
"
pd.read_sql(sql_stmt, conn)
Let's ease up a bit and calculate the average fertility rate for each region in 2015.
Instructions
avg_fert_rate
.
avg_fert_rate
ascending.Group BY
all fields that aren't included in the
aggregate function of Select
.
-- Select fields
Select c.name,
c.continent,
Avg(p.fertility_rate) As avg_fert_rate
-- From populations (with alias as p)
From populations As p
-- Join to countries (with alias as c
Inner Join countries As c
-- Match on code in (p,c)
On p.country_code=c.code
Where year=2015
-- Group by continent and region
Group By c.continent, c.region
-- Order by avg of fertility_rate
Order By avg_fert_rate;
sql_stmt = "\
Select c.region, \
c.continent, \
AVG(p.fertility_rate) As avg_fert_rate \
From countries.populations As p \
Inner Join countries.countries As c \
ON p.country_code = c.code \
Where year = 2015 \
Group BY c.continent, c.region \
Order BY avg_fert_rate; \
"
pd.read_sql(sql_stmt, conn)
Welcome to the lAst challenge problem. Remember that these challenges are designed to take you to the limit to solidify your SQL knowledge! Take a deep breath and solve this step-by-step.
You are now tAsked with determining the top 10 capital cities in Europe and the AmericAs in terms
of a calculated percentage using city_proper_pop
and metroarea_pop
in
cities
.
Do not use table aliAsing in this exercise.
Instructions
cities
, aliased as city_perc
.
city_perc
descending.city_perc
percentage.
-- Select fields
Select name,
country_code,
city_proper_pop,
metroarea_pop,
-- Calculate city_perc
city_proper_pop/metroarea_pop*100 As city_perc
-- From cities
From cities
-- Where the names from continent='Europe' or '%America' and metroarea_pop is not null
Where name In
(Select capital
From countries
Where (continent='Europe' Or continent Like '%America'))
And metroarea_pop Is Not Null
-- Order by city_perc in descreasing order
Order By city_perc Desc
Limit 10;
sql_stmt = "\
Select name, \
country_code, \
city_proper_pop, \
metroarea_pop, \
city_proper_pop / metroarea_pop * 100 As city_perc \
From countries.cities \
Where name In \
(Select capital \
From countries.countries \
Where (continent = 'Europe' OR continent Like '%%America')) \
And metroarea_pop IS NOT NULL \
Order BY city_perc desc \
Limit 10; \
"
pd.read_sql(sql_stmt, conn)
# close the connection
conn.close()
Course: DataCamp: Intermediate SQL Notebook Author: Jae Choi
Course Description
The European Soccer Database contains data about 12,800 matches from 11 countries played between 2011-2015! Throughout this course, you will be shown filtered versions of the tables in this database in order to better explore their contents.
You’ve learned how to Joining Data in SQL from tables
Datasets
Imports
import sqlite3
path = "/Users/Jae/Google_Drive_jae0325/DATA_SCIENCE/Portfolio/datacamp/"
database = path + 'database.sqlite'
# Unix/Mac - 4 initial slashes in total (3+1 in front of database)
#engine = create_engine("sqlite:///"+database)
conn = sqlite3.connect(database)
table_stmt = """SELECT *
FROM sqlite_master
WHERE type='table';"""
tables = pd.read_sql(table_stmt, conn)
tables
Learn how to use the CASE WHEN
statement to create categorical variables, aggregate
data into a single column with multiple filtering conditions, and calculate counts and
percentages.
CASE Statement Introduction
WHEN
, THEN
, and ELSE
statement, finished
with END
.
AND
to add multiple logical conditions to your WHEN
clause.
ELSE
, those don’t meet the conditions are NULLCASE
statements are great for…
CASE
statements will return any value you specify in your THEN
clause.
CASE
inside an AVG
function to calculate a percentage of information in your database.
CASE
Statement¶In this exercise, you will identify matches played between FC Schalke 04 and FC Bayern Munich.
sql_stmt="""-- Identify the home team as Bayern Munich, Schalke 04, or neither
SELECT
CASE WHEN home_team_api_id = 10189 THEN 'FC Schalke 04'
WHEN home_team_api_id = 9823 THEN 'FC Bayern Munich'
ELSE 'Other' END AS home_team,
COUNT(id) AS total_matches
FROM Match
WHERE country_id = 7809 -- Germany
-- Group by the CASE statement alias
GROUP BY home_team;
"""
pd.read_sql(sql_stmt, conn)
CASE
Statement¶
CASE & Alias
WHERE
.
SELECT
, FROM
.CASE
Statements Inside SELECT
, Comparing Column Values¶
Barcelona is considered one of the strongest teams in Spain's soccer league.
You will be creating a list of matches in the 2011/2012 season where Barcelona was the "home team". You will do this using a CASE statement that compares the values of two columns to create a new group -- wins, losses, and ties.
sql_stmt = """SELECT
m.date,
t.team_long_name AS opponent,
-- Complete the CASE statement with an alias
CASE WHEN m.home_team_goal > m.away_team_goal THEN 'Barcelona win!'
WHEN m.home_team_goal < m.away_team_goal THEN 'Barcelona loss :('
ELSE 'Tie' END AS outcome
FROM Match AS m
LEFT JOIN Team AS t
ON m.away_team_api_id = t.team_api_id
-- Filter for Barcelona as the home team
WHERE m.home_team_api_id = 8634
AND season = '2011/2012';"""
pd.read_sql(sql_stmt, conn).head()
Now, construct a query to determine the outcome of Barcelona's matches where they played as the "away team". Did their performance differ from the matches where they were the home team?
sql_stmt = """-- Select matches where Barcelona was the away team
SELECT
m.date,
t.team_long_name AS opponent,
CASE WHEN m.home_team_goal < m.away_team_goal THEN 'Barcelona win!'
WHEN m.home_team_goal > m.away_team_goal THEN 'Barcelona loss :('
ELSE 'Tie' END AS outcome
FROM Match AS m
-- Join teams_spain to matches_spain
LEFT JOIN Team AS t
ON m.home_team_api_id = t.team_api_id
WHERE m.away_team_api_id = 8634
AND season = '2011/2012';"""
pd.read_sql(sql_stmt, conn).head()
Barcelona's performance seems to be worse when they are the away team.
CASE
Statement Inside WHERE
¶
Generate a list of matches won by Italy's Bologna team!
sql_stmt = """-- Select the season, date, home_goal, and away_goal columns
SELECT
season,
date,
home_team_goal,
away_team_goal
FROM Match
WHERE
-- Exclude games not won by Bologna
CASE WHEN home_team_api_id = 9857 AND home_team_goal > away_team_goal THEN 'Bologna Win'
WHEN away_team_api_id = 9857 AND away_team_goal > home_team_goal THEN 'Bologna Win'
END IS NOT NULL;"""
pd.read_sql(sql_stmt, conn).head()
You now have details on every match in this database where Bologna won.
CASE
with Aggregation¶
CASE
with Aggregation Functions
-CASE WHEN with COUNT(): Count the rows.
-CASE WHEN with SUM(): Sum the values.
-CASE WHEN with AVG(): Average of values, or calculate percentage.
COUNT
using CASE
WHEN
¶Do the number of soccer matches played in a given European country differ across seasons?
sql_stmt = """SELECT
c.name AS country,
-- Count matches in each of the 3 seasons
COUNT(CASE WHEN m.season = '2012/2013' THEN m.id END) AS matches_2012_2013,
COUNT(CASE WHEN m.season = '2013/2014' THEN m.id END) AS matches_2013_2014,
COUNT(CASE WHEN m.season = '2014/2015' THEN m.id END) AS matches_2014_2015
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
-- Group by country name alias
GROUP BY country;"""
pd.read_sql(sql_stmt, conn).head()
The number of matches played in each season seems relatively consistent across countries. Where do you see the largest difference?
SUM
and CASE WHEN
with Multiple Conditions¶
In Python, you have the ability to calculate a SUM of logical values (i.e., TRUE/FALSE) directly. In SQL, you have to convert these values into 1 and 0 before calculating a sum. This can be done using a CASE statement.
Your goal here is to use the country and match table to determine the total number of matches won by the home team in each country during the different seasons.
sql_stmt = """SELECT
c.name AS country,
-- Sum the total records in each season where the home team won
SUM(CASE WHEN m.season = '2012/2013' AND m.home_team_goal > m.away_team_goal
THEN 1 ELSE 0 END) AS matches_2012_2013,
SUM(CASE WHEN m.season = '2013/2014' AND m.home_team_goal > m.away_team_goal
THEN 1 ELSE 0 END) AS matches_2013_2014,
SUM(CASE WHEN m.season = '2014/2015' AND m.home_team_goal > m.away_team_goal
THEN 1 ELSE 0 END) AS matches_2014_2015
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
-- Group by country name alias
GROUP BY country;
"""
pd.read_sql(sql_stmt, conn).head()
CASE
and AVG
¶
Examine the percentage of ties in each country in different seasons.
sql_stmt = """SELECT
c.name AS country,
-- Round the percentage of tied games to 2 decimal points
-- Calculate the percentage of tied games in each season
ROUND(AVG(CASE WHEN m.season='2013/2014' AND m.home_team_goal = m.away_team_goal THEN 1
WHEN m.season='2013/2014' AND m.home_team_goal != m.away_team_goal THEN 0
END),2) AS pct_ties_2013_2014,
ROUND(AVG(CASE WHEN m.season='2014/2015' AND m.home_team_goal = m.away_team_goal THEN 1
WHEN m.season='2014/2015' AND m.home_team_goal != m.away_team_goal THEN 0
END),2) AS pct_ties_2014_2015
FROM country AS c
LEFT JOIN match AS m
ON c.id = m.country_id
GROUP BY country;
"""
pd.read_sql(sql_stmt, conn).head()
Learn about subqueries in the SELECT
, FROM
, and WHERE
clauses. You will gain an understanding of when subqueries are necessary to construct your
dataset and where to best include them in your queries.
WHERE
¶
What is a Subquery?
Why Subqueries?
Subqueries in WHERE
WHERE
Subqueries¶
Generate a list of matches where the total goals scored (for both teams in total) is more than 3 times the average for games in the 2013/2014 matches.
sql_stmt = """SELECT
-- Select the date, home goals, and away goals scored
date,
home_team_goal,
away_team_goal
FROM match
-- Filter for matches where total goals exceeds 3x the average
WHERE season = '2013/2014' AND
(home_team_goal + away_team_goal) >
(SELECT 3 * AVG(home_team_goal + away_team_goal)
FROM match WHERE season = '2013/2014');
"""
pd.read_sql(sql_stmt, conn)
WHERE
Subqueries with a List¶
Create a list of teams that scored 8 or more goals in a home match.
sql_stmt = """SELECT
-- Select the team long and short names
team_long_name,
team_short_name
FROM team
-- Filter for teams with 8 or more home goals
WHERE team_api_id IN
(SELECT home_team_api_id
FROM match
WHERE home_team_goal >= 8);
"""
pd.read_sql(sql_stmt, conn)
FROM
¶
Subqueries in FROM
FROM
¶Generate a subquery using the match table, and then join that subquery to the country table to calculate information about matches with 10 or more goals in total.
sql_stmt = """SELECT
-- Select country name and the count match IDs
c.name AS country_name,
COUNT(sub.id) AS matches
FROM country AS c
-- Inner join the subquery onto country
-- Select the country id and match id columns
INNER JOIN (SELECT country_id, id
FROM match
-- Filter the subquery by matches with 10+ goals
WHERE (home_team_goal + away_team_goal) >= 10) AS sub
ON c.id = sub.country_id
GROUP BY country_name;
"""
pd.read_sql(sql_stmt, conn)
SELECT
¶
Subqueries in SELECT
SELECT
for Calculations¶Create a column to compare each league's average total goals to the overall average goals in the 2013/2014 season. Also, add a column that directly compares these values by subtracting the overall average from the subquery.
sql_stmt = """SELECT
-- Select the league name and average goals scored
name AS league,
ROUND(AVG(m.home_team_goal + m.away_team_goal),2) AS avg_goals,
-- Subtract the overall average from the league average
ROUND(AVG(m.home_team_goal + m.away_team_goal) -
(SELECT AVG(home_team_goal + away_team_goal)
FROM match
WHERE season = '2013/2014'),2) AS diff
FROM league AS l
LEFT JOIN match AS m
ON l.country_id = m.country_id
-- Only include 2013/2014 results
WHERE season = '2013/2014'
GROUP BY league
ORDER BY avg_goals DESC;
"""
pd.read_sql(sql_stmt, conn)
Games in the Netherlands tend to score the highest number of goals on average in this season.
Must Know When Using Subqueries
In soccer leagues, games are played at different stages. Winning teams progress from one stage to the next, until they reach the final stage. In each stage, the stakes become higher than the previous one.
Extract data examining the average goals scored in each stage of a match. Does the average number of goals scored change as the stakes get higher from one stage to the next?
sql_stmt = """SELECT
-- Select the stage and average goals from s
stage,
ROUND(avg_goals,2) AS avg_goal,
-- Select the overall average for 2012/2013
ROUND((SELECT AVG(home_team_goal + away_team_goal) FROM match WHERE season = '2012/2013'),2) AS overall_avg
FROM
-- Select the stage and average goals in 2012/2013 from match
(SELECT
stage,
AVG(home_team_goal + away_team_goal) AS avg_goals
FROM match
WHERE season = '2012/2013'
GROUP BY stage) AS s
WHERE
-- Filter the main query using the subquery
s.avg_goals > (SELECT AVG(home_team_goal + away_team_goal)
FROM match WHERE season = '2012/2013')
ORDER BY stage DESC;
"""
pd.read_sql(sql_stmt, conn).head()
You used 3 subqueries to generate a list of stages that have higher than average goals scored in matches.
Learn how to use nested and correlated subqueries to extract more complex data from a relational database. You will also learn about common table expressions and how to best construct queries using multiple common table expressions.
Correlated Subquery
Simple Subquery | Correlated Subquery |
---|---|
Can be run independently from the main query. | Dependent on the main query to execute. |
Evaluated once in the whole query. | Evaluated in loops. Significantly slows down query run time. |
Practice using correlated subqueries to examine matches with scores that are extreme outliers for each country -- above 3 times the average score!
NOTE: The bottom query takes a while.
sql_stmt = """SELECT
-- Select country ID, date, home, and away goals from match
main.country_id,
date,
main.home_team_goal,
main.away_team_goal
FROM match AS main
WHERE
-- Filter the main query by the subquery
(home_team_goal + away_team_goal) >
(SELECT AVG((sub.home_team_goal + sub.away_team_goal) * 3)
FROM match AS sub
-- Join the main query to the subquery in WHERE
WHERE main.country_id = sub.country_id);
"""
pd.read_sql(sql_stmt, conn).head()
Correlated subqueries take longer to produce results, but they often prevent you from having to create multiple subqueries.
Nested Subquery
FROM
¶What's the average number of matches per season where a team scored 5 or more goals? How does this differ by country?
Follwing is a nested subquery, that count the numbers of matches that score more than 5 in every seasons, for every country.
sql_stmt = """SELECT country_id, season,
COUNT(id) AS matches
FROM (
SELECT country_id, season, id
FROM match
WHERE home_team_goal >= 5 OR away_team_goal >= 5) AS inner_s
-- Close parentheses and alias the subquery
GROUP BY country_id, season;
"""
pd.read_sql(sql_stmt, conn).head()
Common Table Expressions (CTEs)
WITH
cte_name AS
(subquery)Declare a CTE that calculates the total goals from matches in August of the 2013/2014 season. And according to the CTE, calculate the average goals of each country in August during that season.
sql_stmt = """-- Set up your CTE
WITH match_list AS (
SELECT
country_id,
(home_team_goal + away_team_goal) AS goals
FROM match
-- Create a list of match IDs to filter data in the CTE
WHERE id IN (
SELECT id
FROM match
-- EXTRACT(MONTH FROM date) = 08 -- Postgres Syntax
WHERE season = '2013/2014' AND STRFTIME('%m', date) = '08'))
-- Select the league name and average of goals in the CTE
SELECT
l.name,
AVG(goals)
FROM league AS l
-- Join the CTE onto the league table
LEFT JOIN match_list ON l.id = match_list.country_id
GROUP BY l.name;
"""
pd.read_sql(sql_stmt, conn).head()
Differentiating What You’ve Learned
Joins | Correlated Subqueries |
---|---|
Combine 2+ tables | Match subqueries & tables |
Simple | Avoid limits of joins. |
Operations / Aggregations | High processing time |
Eg: What is the total sales per employee? | Eg: Who does each employee report to in a company? |
Multiple/Nested Subqueries | Common Table Expressions |
---|---|
Multi-step transformations | Organize subqueries sequentially |
Improve accuracy and reproducibility | Can reference other CTEs |
Eg: What is the average deal size closed by each sales representative in the quarter? | Eg: How did the marketing,sales, growth, & engineering teams perform on key metrics? |
Q: How do you get both the home and away team names into one final query result?
sql_stmt = """SELECT
m.date,
-- Get the home and away team names
hometeam,
awayteam,
m.home_team_goal,
m.away_team_goal
FROM match AS m
-- Join the home subquery to the match table
LEFT JOIN (
SELECT match.id, team.team_long_name AS hometeam
FROM match
LEFT JOIN team
ON match.home_team_api_id = team.team_api_id) AS home
ON home.id = m.id
-- Join the away subquery to the match table
LEFT JOIN (
SELECT match.id, team.team_long_name AS awayteam
FROM match
LEFT JOIN team
-- Get the away team ID in the subquery
ON match.away_team_api_id = team.team_api_id) AS away
ON away.id = m.id;
"""
pd.read_sql(sql_stmt, conn).head()
sql_stmt = """SELECT
m.date,
(SELECT team_long_name
FROM team AS t
WHERE t.team_api_id = m.home_team_api_id) AS hometeam,
-- Connect the team to the match table
(SELECT team_long_name
FROM team AS t
WHERE t.team_api_id = m.away_team_api_id) AS awayteam,
-- Select home and away goals
home_team_goal,
away_team_goal
FROM match AS m;
"""
pd.read_sql(sql_stmt, conn).head()
sql_stmt = """WITH home AS (
SELECT m.id, m.date,
t.team_long_name AS hometeam, m.home_team_goal
FROM match AS m
LEFT JOIN team AS t
ON m.home_team_api_id = t.team_api_id),
-- Declare and set up the away CTE
away AS (
SELECT m.id, m.date,
t.team_long_name AS awayteam, m.away_team_goal
FROM match AS m
LEFT JOIN team AS t
ON m.away_team_api_id = t.team_api_id)
-- Select date, home_goal, and away_goal
SELECT
home.date,
home.hometeam,
away.awayteam,
home.home_team_goal,
away.away_team_goal
-- Join away and home on the id column
FROM home
INNER JOIN away
ON home.id = away.id;
"""
pd.read_sql(sql_stmt, conn).head()
You now know three separate ways to manipulate and transform data to produce a complex query result! This is a great set of skills to have when working with complex relational databases!
Learn about window functions and how to pass aggregate functions along a dataset. You will also learn how to calculate running totals and partitioned averages.
Why Need Window Functions?
GROUP BY
GROUP BY
with all non-aggregate columns, otherwise,
raise ERROR.
####### GIVES ERROR #######
# sql_stmt = """SELECT
# country_id,
# season,
# date,
# AVG(home_goal) AS avg_home
# FROM match
# GROUP BY country_id;
# """
# pd.read_sql(sql_stmt, conn).head()
WHERE
statement
RANK()
& ORDER BY
¶
Create a data set of ranked matches according to which leagues, on average, score the most goals in a match.
RANK()
: The window funciton using here
ORDER BY
: The sub-clause using here
sql_stmt = """SELECT
-- Select the league name and average goals scored
l.name AS league,
AVG(m.home_team_goal + m.away_team_goal) AS avg_goals,
-- Rank each league according to the average goals
RANK() OVER(ORDER BY AVG(m.home_team_goal + m.away_team_goal) DESC) AS league_rank
FROM league AS l
LEFT JOIN match AS m
ON l.id = m.country_id
WHERE m.season = '2011/2012'
GROUP BY l.name
-- Order the query by the rank you created
ORDER BY league_rank;
"""
pd.read_sql(sql_stmt, conn).head()
Unlike a subquery in SELECT, your window function will apply the filter that you include in your WHERE clause.
OVER
and PARTITION BY
PARTITION BY
Multiple Columns¶
Calculate the average number home and away goals scored Legia Warszawa, and their opponents, partitioned by the month in each season.
sql_stmt = """SELECT
date,
season,
home_team_goal,
away_team_goal,
CASE WHEN home_team_api_id = 8673 THEN 'home'
ELSE 'away' END AS warsaw_location,
-- Calculate average goals partitioned by season and month
AVG(home_team_goal) OVER(PARTITION BY season,
-- EXTRACT(MONTH FROM date) -- Postgres Syntax
STRFTIME('%m', date)) AS season_mo_home,
AVG(away_team_goal) OVER(PARTITION BY season,
-- EXTRACT(MONTH FROM date) -- Postgres Syntax
STRFTIME('%m', date)) AS season_mo_away
FROM match
WHERE
home_team_api_id = 8673
OR away_team_api_id = 8673
ORDER BY (home_team_goal + away_team_goal) DESC;
"""
pd.read_sql(sql_stmt, conn).head()
Sliding Windows
n PRECEDING
: Select from n
row before the current oneFOLLOWING
: Select until n
row after the current oneUNBOUNDED PRECEDING
: Select from the beginning of the column/partition
UNBOUNDED FOLLOWING
: Select to the end of the column/partitionCURRENT ROW
Sorting the data set in reverse order and calculating a backward running total from the CURRENT ROW to the end of the data set (earliest record).
sql_stmt = """SELECT
-- Select the date, home goal, and away goals
date,
home_team_goal,
away_team_goal,
-- Create a running total and running average of home goals
SUM(home_team_goal) OVER(ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_total,
AVG(home_team_goal) OVER(ORDER BY date DESC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS running_avg
FROM match
WHERE
away_team_api_id = 9908
AND season = '2011/2012';
"""
pd.read_sql(sql_stmt, conn).head()
Q: How badly did Manchester United (MU) lose in each match?
sql_stmt = """-- Set up the home team CTE
WITH home AS (
SELECT m.id, t.team_long_name,
CASE WHEN m.home_team_goal > m.away_team_goal THEN 'MU Win'
WHEN m.home_team_goal < m.away_team_goal THEN 'MU Loss'
ELSE 'Tie' END AS outcome
FROM match AS m
LEFT JOIN team AS t ON m.home_team_api_id = t.team_api_id),
-- Set up the away team CTE
away AS (
SELECT m.id, t.team_long_name,
CASE WHEN m.home_team_goal > m.away_team_goal THEN 'MU Loss'
WHEN m.home_team_goal < m.away_team_goal THEN 'MU Win'
ELSE 'Tie' END AS outcome
FROM match AS m
LEFT JOIN team AS t ON m.away_team_api_id = t.team_api_id)
-- Select columns and and rank the matches by date
SELECT DISTINCT
date,
home.team_long_name AS home_team,
away.team_long_name AS away_team,
m.home_team_goal, m.away_team_goal,
RANK() OVER(ORDER BY ABS(home_team_goal - away_team_goal) DESC) as match_rank
-- Join the CTEs onto the match table
FROM match AS m
LEFT JOIN home ON m.id = home.id
LEFT JOIN away ON m.id = away.id
WHERE m.season = '2014/2015'
AND ((home.team_long_name = 'Manchester United' AND home.outcome = 'MU Loss')
OR (away.team_long_name = 'Manchester United' AND away.outcome = 'MU Loss'));
"""
pd.read_sql(sql_stmt, conn).head()
# close the connection
conn.close()