import sqlalchemy
sqlalchemy.create_engine('mysql+mysqlconnector://root:secret@localhost:3306/lahmansbaseballdb')
%sql mysql+mysqlconnector://root:secret@localhost:3306/lahmansbaseballdb
Lahman’s Baseball db SQL practice
I wanted to brush off some of the rust for my SQL skills and I figured the Lahman Baseball database offers a decent database to practice. I downloaded the install script from here and found some practice exercises here. The rest of this post are my attempts to solve the problems. Play Ball! Query data!
Question 1:
What range of years does the provided database cover?
%%sql
SELECT
MIN(yearID) AS first_year, MAX(yearID) AS last_year
FROM
batting;
* mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
1 rows affected.
first_year |
last_year |
1871 |
2019 |
Question 2:
Find the name and height of the shortest player in the database. How many games did he play in? What is the name of the team for which he played?
%%sql
SELECT
p1.playerID,
p1.nameFirst,
p1.nameLast,
p1.height,
SUM(apps.G_all) AS games_played,
t.name
FROM
people p1
INNER JOIN
(SELECT
MIN(people.height) AS min_height
FROM
people) p2 ON p1.height = p2.min_height
INNER JOIN
appearances AS apps ON p1.playerID = apps.playerID
INNER JOIN
teams t ON t.teamID = apps.teamID
GROUP BY p1.playerID , p1.nameFirst , p1.nameLast , p1.height , t.name;
* mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
1 rows affected.
playerID |
nameFirst |
nameLast |
height |
games_played |
name |
gaedeed01 |
Eddie |
Gaedel |
43 |
52 |
St. Louis Browns |
Question 3:
Find all players in the database who played at Vanderbilt University. Create a list showing each player’s first and last names as well as the total salary they earned in the major leagues. Sort this list in descending order by the total salary earned. Which Vanderbilt player earned the most money in the majors?
%%sql
SELECT
p.nameFirst, p.nameLast, SUM(s.salary) total_salary
FROM
people p
INNER JOIN
salaries s ON s.playerID = p.playerID
INNER JOIN
collegeplaying cp ON p.playerID = cp.playerID
INNER JOIN
schools sc ON sc.schoolID = cp.schoolID
WHERE
sc.name_full = 'Vanderbilt University'
GROUP BY p.nameFirst , p.nameLast
ORDER BY total_salary DESC;
* mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
15 rows affected.
nameFirst |
nameLast |
total_salary |
David |
Price |
245553888.0 |
Pedro |
Alvarez |
62045112.0 |
Scott |
Sanderson |
21500000.0 |
Mike |
Minor |
20512500.0 |
Joey |
Cora |
16867500.0 |
Mark |
Prior |
12800000.0 |
Ryan |
Flaherty |
12183000.0 |
Josh |
Paul |
7920000.0 |
Sonny |
Gray |
4627500.0 |
Mike |
Baxter |
4188836.0 |
Jensen |
Lewis |
3702000.0 |
Matt |
Kata |
3180000.0 |
Nick |
Christiani |
2000000.0 |
Jeremy |
Sowers |
1154400.0 |
Scotti |
Madison |
540000.0 |
Question 4:
Using the fielding table, group players into three groups based on their position: label players with position OF as “Outfield”, those with position “SS”, “1B”, “2B”, and “3B” as “Infield”, and those with position “P” or “C” as “Battery”. Determine the number of putouts made by each of these three groups in 2016.
%%sql
SELECT
CASE
WHEN POS = 'OF' THEN 'Outfield'
WHEN POS IN ('P' , 'C') THEN 'Battery'
ELSE 'Infield'
END AS POS_Group,
SUM(PO) AS 'Putouts'
FROM
fielding
WHERE
yearID = 2016
GROUP BY POS_Group;
* mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
3 rows affected.
POS_Group |
Putouts |
Battery |
41424 |
Infield |
58935 |
Outfield |
29560 |
Question 5:
Find the average number of strikeouts per game by decade since 1920. Round the numbers you report to 2 decimal places. Do the same for home runs per game. Do you see any trends?
%%sql
SELECT
FLOOR(yearID / 10) * 10 AS decade,
AVG(SO / G) 'K per Game',
AVG(HR / G) 'HR per Game'
FROM
teams
WHERE
yearID >= 1920
GROUP BY decade;
* mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
10 rows affected.
decade |
K per Game |
HR per Game |
1920 |
2.81490875 |
0.40173063 |
1930 |
3.31660500 |
0.54575125 |
1940 |
3.54989625 |
0.52308313 |
1950 |
4.39932063 |
0.84299375 |
1960 |
5.71242879 |
0.82020455 |
1970 |
5.14521423 |
0.74564065 |
1980 |
5.34239077 |
0.80403538 |
1990 |
6.15079065 |
0.96031367 |
2000 |
6.56109333 |
1.07335433 |
2010 |
7.81854300 |
1.06862133 |
Question 6:
Find the player who had the most success stealing bases in 2016, where success is measured as the percentage of stolen base attempts which are successful. (A stolen base attempt results either in a stolen base or being caught stealing.) Consider only players who attempted at least 20 stolen bases.
%%sql
SELECT
p.nameFirst,
p.nameLast,
ROUND(b.SB / (b.SB + b.CS), 2) AS sb_perc
FROM
batting b
INNER JOIN
people p ON p.playerID = b.playerID
WHERE
b.SB + b.CS >= 20 AND b.yearID = 2016
ORDER BY sb_perc DESC;
* mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
47 rows affected.
nameFirst |
nameLast |
sb_perc |
Chris |
Owings |
0.91 |
Brian |
Dozier |
0.90 |
Rajai |
Davis |
0.88 |
Billy |
Hamilton |
0.88 |
Kevin |
Kiermaier |
0.88 |
Mookie |
Betts |
0.87 |
Paul |
Goldschmidt |
0.86 |
Trea |
Turner |
0.85 |
Keon |
Broxton |
0.85 |
Hernan |
Perez |
0.83 |
Josh |
Harrison |
0.83 |
Eduardo |
Nunez |
0.82 |
Wil |
Myers |
0.82 |
Jarrod |
Dyson |
0.81 |
Mike |
Trout |
0.81 |
Alcides |
Escobar |
0.81 |
Dee |
Gordon |
0.81 |
B. J. |
Upton |
0.80 |
Brett |
Gardner |
0.80 |
Leonys |
Martin |
0.80 |
Starling |
Marte |
0.80 |
Francisco |
Lindor |
0.79 |
Jonathan |
Villar |
0.78 |
Ian |
Desmond |
0.78 |
Odubel |
Herrera |
0.78 |
Jean |
Segura |
0.77 |
Ryan |
Braun |
0.76 |
Jose |
Ramirez |
0.76 |
Jose |
Altuve |
0.75 |
Todd |
Frazier |
0.75 |
Elvis |
Andrus |
0.75 |
Gregory |
Polanco |
0.74 |
Freddy |
Galvis |
0.74 |
Cameron |
Maybin |
0.71 |
Jacoby |
Ellsbury |
0.71 |
Travis |
Jankowski |
0.71 |
Ian |
Kinsler |
0.70 |
Ender |
Inciarte |
0.70 |
Kevin |
Pillar |
0.70 |
Jose |
Peraza |
0.68 |
Bryce |
Harper |
0.68 |
Rougned |
Odor |
0.67 |
Mallex |
Smith |
0.67 |
Charlie |
Blackmon |
0.65 |
Brandon |
Phillips |
0.64 |
Cesar |
Hernandez |
0.57 |
Danny |
Santana |
0.57 |
Question 7:
From 1970 – 2019, what is the largest number of wins for a team that did not win the world series? What is the smallest number of wins for a team that did win the world series? Doing this will probably result in an unusually small number of wins for a world series champion – determine why this is the case. Then redo your query, excluding the problem year. How often from 1970 – 2016 was it the case that a team with the most wins also won the world series? What percentage of the time?
%%sql
SELECT
t1.teamID, t1.yearID, t1.W
FROM
teams t1
INNER JOIN
(SELECT
t2.W, t2.teamID, t2.yearID
FROM
teams t2
WHERE
t2.WSWin = 'N' AND t2.yearID >= 1970
ORDER BY t2.W DESC
LIMIT 1) Max_w_no_WS ON t1.teamID = Max_w_no_WS.teamID
AND Max_w_no_WS.yearID = t1.yearID
UNION ALL SELECT
t1.teamID, t1.yearID, t1.W
FROM
teams t1
INNER JOIN
(SELECT
t2.W, t2.teamID, t2.yearID
FROM
teams t2
WHERE
t2.WSWin = 'Y' AND t2.yearID >= 1970
AND t2.G > 150
ORDER BY t2.W ASC
LIMIT 1) Min_w_WS ON t1.teamID = Min_w_WS.teamID
AND Min_w_WS.yearID = t1.yearID;
* mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
2 rows affected.
teamID |
yearID |
W |
SEA |
2001 |
116 |
SLN |
2006 |
83 |
Part 2: How often from 1970 – 2016 was it the case that a team with the most wins also won the world series? What percentage of the time?
%%sql
SELECT
COUNT(*) / (2019 - 1970) AS perc_most_wins_and_WS_champs
FROM
teams t1
WHERE
(t1.yearID , t1.W) IN (SELECT
t2.yearID, MAX(t2.W) AS max_wins
FROM
teams t2
GROUP BY t2.yearID)
AND t1.yearID >= 1970
AND t1.WSWin = 'Y'
ORDER BY t1.yearID DESC , t1.W DESC;
* mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
1 rows affected.
perc_most_wins_and_WS_champs |
0.2653 |
Question 8:
%%sql
SELECT
parks.parkname,
teams.name,
homegames.attendance / homegames.games average_attendance
FROM
homegames
INNER JOIN
parks ON parks.ID = homegames.park_ID
INNER JOIN
teams ON teams.ID = homegames.team_ID
WHERE
games > 10 AND yearkey = 2019
ORDER BY average_attendance DESC;
* mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
30 rows affected.
parkname |
name |
average_attendance |
Dodger Stadium |
Los Angeles Dodgers |
49065.5432 |
Busch Stadium III |
St. Louis Cardinals |
42967.8148 |
Yankee Stadium II |
New York Yankees |
40795.1111 |
Wrigley Field |
Chicago Cubs |
38208.2099 |
Angel Stadium of Anaheim |
Los Angeles Angels of Anaheim |
37812.9241 |
Coors Field |
Colorado Rockies |
36953.6296 |
Miller Park |
Milwaukee Brewers |
36090.5309 |
Fenway Park |
Boston Red Sox |
35402.3291 |
Minute Maid Park |
Houston Astros |
35276.1358 |
Citizens Bank Park |
Philadelphia Phillies |
33671.8642 |
AT&T Park |
San Francisco Giants |
33429.1358 |
Suntrust Park |
Atlanta Braves |
32776.7901 |
Citi Field |
New York Mets |
30154.7160 |
PETCO Park |
San Diego Padres |
29585.1728 |
Target Field |
Minnesota Twins |
28435.7901 |
Nationals Park |
Washington Nationals |
27898.5309 |
Chase Field |
Arizona Diamondbacks |
26364.3210 |
Rangers Ballpark in Arlington |
Texas Rangers |
26333.2593 |
Great American Ballpark |
Cincinnati Reds |
22473.3671 |
Safeco Field |
Seattle Mariners |
22112.4568 |
Rogers Centre |
Toronto Blue Jays |
21606.7160 |
Progressive Field |
Cleveland Indians |
21464.7160 |
Oakland-Alameda County Coliseum |
Oakland Athletics |
20626.3457 |
Guaranteed Rate Field |
Chicago White Sox |
20622.1875 |
Comerica Park |
Detroit Tigers |
18536.1728 |
PNC Park |
Pittsburgh Pirates |
18412.8272 |
Kauffman Stadium |
Kansas City Royals |
18177.5625 |
Oriole Park at Camden Yards |
Baltimore Orioles |
16145.7654 |
Tropicana Field |
Tampa Bay Rays |
14552.2840 |
Marlins Park |
Miami Marlins |
10016.0741 |
Question 9:
Which managers have won the TSN Manager of the Year award in both the National League (NL) and the American League (AL)? Give their full name and the teams that they were managing when they won the award.
%%sql
SELECT
CONCAT(p.nameFirst, ' ', p.nameLast) Manager,
am1.awardID award,
am1.yearID year,
am1.lgID league,
m.teamID
FROM
awardsmanagers am1
INNER JOIN
people p ON p.playerID = am1.playerID
INNER JOIN
managers m ON m.playerID = am1.playerID
AND m.yearID = am1.yearID
WHERE
am1.playerID IN (SELECT
am2.playerID
FROM
awardsmanagers am2
WHERE
am2.awardID = 'TSN Manager of the Year'
AND am2.lgID = 'NL')
AND am1.playerID IN (SELECT
am2.playerID
FROM
awardsmanagers am2
WHERE
am2.awardID = 'TSN Manager of the Year'
AND am2.lgID = 'AL')
AND am1.awardID = 'TSN Manager of the Year'
* mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
6 rows affected.
Manager |
award |
year |
league |
teamID |
Jim Leyland |
TSN Manager of the Year |
1988 |
NL |
PIT |
Jim Leyland |
TSN Manager of the Year |
1990 |
NL |
PIT |
Jim Leyland |
TSN Manager of the Year |
1992 |
NL |
PIT |
Jim Leyland |
TSN Manager of the Year |
2006 |
AL |
DET |
Davey Johnson |
TSN Manager of the Year |
1997 |
AL |
BAL |
Davey Johnson |
TSN Manager of the Year |
2012 |
NL |
WAS |