import sqlalchemy
sqlalchemy.create_engine('mysql+mysqlconnector://root:secret@localhost:3306/lahmansbaseballdb')
%reload_ext sql
%sql mysql+mysqlconnector://root:secret@localhost:3306/lahmansbaseballdb

Lahman’s Baseball db SQL practice

I wanted to brush off some of the rust for my SQL skills and I figured the Lahman Baseball database offers a decent database to practice. I downloaded the install script from here and found some practice exercises here. The rest of this post are my attempts to solve the problems. Play Ball! Query data!

Question 1:

What range of years does the provided database cover?

%%sql 
SELECT 
    MIN(yearID) AS first_year, MAX(yearID) AS last_year
FROM
    batting;
 * mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
1 rows affected.
first_year last_year
1871 2019

Question 2:

Find the name and height of the shortest player in the database. How many games did he play in? What is the name of the team for which he played?

%%sql
SELECT 
    p1.playerID,
    p1.nameFirst,
    p1.nameLast,
    p1.height,
    SUM(apps.G_all) AS games_played,
    t.name
FROM
    people p1
        INNER JOIN
    (SELECT 
        MIN(people.height) AS min_height
    FROM
        people) p2 ON p1.height = p2.min_height
        INNER JOIN
    appearances AS apps ON p1.playerID = apps.playerID
        INNER JOIN
    teams t ON t.teamID = apps.teamID
GROUP BY p1.playerID , p1.nameFirst , p1.nameLast , p1.height , t.name;

 * mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
1 rows affected.
playerID nameFirst nameLast height games_played name
gaedeed01 Eddie Gaedel 43 52 St. Louis Browns

Question 3:

Find all players in the database who played at Vanderbilt University. Create a list showing each player’s first and last names as well as the total salary they earned in the major leagues. Sort this list in descending order by the total salary earned. Which Vanderbilt player earned the most money in the majors?

%%sql
SELECT 
    p.nameFirst, p.nameLast, SUM(s.salary) total_salary
FROM
    people p
        INNER JOIN
    salaries s ON s.playerID = p.playerID
        INNER JOIN
    collegeplaying cp ON p.playerID = cp.playerID
        INNER JOIN
    schools sc ON sc.schoolID = cp.schoolID
WHERE
    sc.name_full = 'Vanderbilt University'
GROUP BY p.nameFirst , p.nameLast
ORDER BY total_salary DESC;
 * mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
15 rows affected.
nameFirst nameLast total_salary
David Price 245553888.0
Pedro Alvarez 62045112.0
Scott Sanderson 21500000.0
Mike Minor 20512500.0
Joey Cora 16867500.0
Mark Prior 12800000.0
Ryan Flaherty 12183000.0
Josh Paul 7920000.0
Sonny Gray 4627500.0
Mike Baxter 4188836.0
Jensen Lewis 3702000.0
Matt Kata 3180000.0
Nick Christiani 2000000.0
Jeremy Sowers 1154400.0
Scotti Madison 540000.0

Question 4:

Using the fielding table, group players into three groups based on their position: label players with position OF as “Outfield”, those with position “SS”, “1B”, “2B”, and “3B” as “Infield”, and those with position “P” or “C” as “Battery”. Determine the number of putouts made by each of these three groups in 2016.

%%sql
SELECT 
    CASE
        WHEN POS = 'OF' THEN 'Outfield'
        WHEN POS IN ('P' , 'C') THEN 'Battery'
        ELSE 'Infield'
    END AS POS_Group,
    SUM(PO) AS 'Putouts'
FROM
    fielding
WHERE
    yearID = 2016
GROUP BY POS_Group;

 * mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
3 rows affected.
POS_Group Putouts
Battery 41424
Infield 58935
Outfield 29560

Question 5:

%%sql
SELECT 
    FLOOR(yearID / 10) * 10 AS decade,
    AVG(SO / G) 'K per Game',
    AVG(HR / G) 'HR per Game'
FROM
    teams
WHERE
    yearID >= 1920
GROUP BY decade;
 * mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
10 rows affected.
decade K per Game HR per Game
1920 2.81490875 0.40173063
1930 3.31660500 0.54575125
1940 3.54989625 0.52308313
1950 4.39932063 0.84299375
1960 5.71242879 0.82020455
1970 5.14521423 0.74564065
1980 5.34239077 0.80403538
1990 6.15079065 0.96031367
2000 6.56109333 1.07335433
2010 7.81854300 1.06862133

Question 6:

Find the player who had the most success stealing bases in 2016, where success is measured as the percentage of stolen base attempts which are successful. (A stolen base attempt results either in a stolen base or being caught stealing.) Consider only players who attempted at least 20 stolen bases.

%%sql
SELECT 
    p.nameFirst,
    p.nameLast,
    ROUND(b.SB / (b.SB + b.CS), 2) AS sb_perc
FROM
    batting b
        INNER JOIN
    people p ON p.playerID = b.playerID
WHERE
    b.SB + b.CS >= 20 AND b.yearID = 2016
ORDER BY sb_perc DESC;
 * mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
47 rows affected.
nameFirst nameLast sb_perc
Chris Owings 0.91
Brian Dozier 0.90
Rajai Davis 0.88
Billy Hamilton 0.88
Kevin Kiermaier 0.88
Mookie Betts 0.87
Paul Goldschmidt 0.86
Trea Turner 0.85
Keon Broxton 0.85
Hernan Perez 0.83
Josh Harrison 0.83
Eduardo Nunez 0.82
Wil Myers 0.82
Jarrod Dyson 0.81
Mike Trout 0.81
Alcides Escobar 0.81
Dee Gordon 0.81
B. J. Upton 0.80
Brett Gardner 0.80
Leonys Martin 0.80
Starling Marte 0.80
Francisco Lindor 0.79
Jonathan Villar 0.78
Ian Desmond 0.78
Odubel Herrera 0.78
Jean Segura 0.77
Ryan Braun 0.76
Jose Ramirez 0.76
Jose Altuve 0.75
Todd Frazier 0.75
Elvis Andrus 0.75
Gregory Polanco 0.74
Freddy Galvis 0.74
Cameron Maybin 0.71
Jacoby Ellsbury 0.71
Travis Jankowski 0.71
Ian Kinsler 0.70
Ender Inciarte 0.70
Kevin Pillar 0.70
Jose Peraza 0.68
Bryce Harper 0.68
Rougned Odor 0.67
Mallex Smith 0.67
Charlie Blackmon 0.65
Brandon Phillips 0.64
Cesar Hernandez 0.57
Danny Santana 0.57

Question 7:

From 1970 – 2019, what is the largest number of wins for a team that did not win the world series? What is the smallest number of wins for a team that did win the world series? Doing this will probably result in an unusually small number of wins for a world series champion – determine why this is the case. Then redo your query, excluding the problem year. How often from 1970 – 2016 was it the case that a team with the most wins also won the world series? What percentage of the time?

%%sql
SELECT 
    t1.teamID, t1.yearID, t1.W
FROM
    teams t1
        INNER JOIN
    (SELECT 
        t2.W, t2.teamID, t2.yearID
    FROM
        teams t2
    WHERE
        t2.WSWin = 'N' AND t2.yearID >= 1970
    ORDER BY t2.W DESC
    LIMIT 1) Max_w_no_WS ON t1.teamID = Max_w_no_WS.teamID
        AND Max_w_no_WS.yearID = t1.yearID 
UNION ALL SELECT 
    t1.teamID, t1.yearID, t1.W
FROM
    teams t1
        INNER JOIN
    (SELECT 
        t2.W, t2.teamID, t2.yearID
    FROM
        teams t2
    WHERE
        t2.WSWin = 'Y' AND t2.yearID >= 1970
            AND t2.G > 150
    ORDER BY t2.W ASC
    LIMIT 1) Min_w_WS ON t1.teamID = Min_w_WS.teamID
        AND Min_w_WS.yearID = t1.yearID;
 * mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
2 rows affected.
teamID yearID W
SEA 2001 116
SLN 2006 83

Part 2: How often from 1970 – 2016 was it the case that a team with the most wins also won the world series? What percentage of the time?

%%sql
SELECT 
    COUNT(*) / (2019 - 1970) AS perc_most_wins_and_WS_champs
FROM
    teams t1
WHERE
    (t1.yearID , t1.W) IN (SELECT 
            t2.yearID, MAX(t2.W) AS max_wins
        FROM
            teams t2
        GROUP BY t2.yearID)
        AND t1.yearID >= 1970
        AND t1.WSWin = 'Y'
ORDER BY t1.yearID DESC , t1.W DESC;
 * mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
1 rows affected.
perc_most_wins_and_WS_champs
0.2653

Question 8:

Using the attendance figures from the homegames table, find the teams and parks which had the top 5 average attendance per game in 2016 (where average attendance is defined as total attendance divided by number of games). Only consider parks where there were at least 10 games played. Report the park name, team name, and average attendance. Repeat for the lowest 5 average attendance.

%%sql
SELECT 
    parks.parkname,
    teams.name,
    homegames.attendance / homegames.games average_attendance
FROM
    homegames
        INNER JOIN
    parks ON parks.ID = homegames.park_ID
        INNER JOIN
    teams ON teams.ID = homegames.team_ID
WHERE
    games > 10 AND yearkey = 2019
ORDER BY average_attendance DESC;
 * mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
30 rows affected.
parkname name average_attendance
Dodger Stadium Los Angeles Dodgers 49065.5432
Busch Stadium III St. Louis Cardinals 42967.8148
Yankee Stadium II New York Yankees 40795.1111
Wrigley Field Chicago Cubs 38208.2099
Angel Stadium of Anaheim Los Angeles Angels of Anaheim 37812.9241
Coors Field Colorado Rockies 36953.6296
Miller Park Milwaukee Brewers 36090.5309
Fenway Park Boston Red Sox 35402.3291
Minute Maid Park Houston Astros 35276.1358
Citizens Bank Park Philadelphia Phillies 33671.8642
AT&T Park San Francisco Giants 33429.1358
Suntrust Park Atlanta Braves 32776.7901
Citi Field New York Mets 30154.7160
PETCO Park San Diego Padres 29585.1728
Target Field Minnesota Twins 28435.7901
Nationals Park Washington Nationals 27898.5309
Chase Field Arizona Diamondbacks 26364.3210
Rangers Ballpark in Arlington Texas Rangers 26333.2593
Great American Ballpark Cincinnati Reds 22473.3671
Safeco Field Seattle Mariners 22112.4568
Rogers Centre Toronto Blue Jays 21606.7160
Progressive Field Cleveland Indians 21464.7160
Oakland-Alameda County Coliseum Oakland Athletics 20626.3457
Guaranteed Rate Field Chicago White Sox 20622.1875
Comerica Park Detroit Tigers 18536.1728
PNC Park Pittsburgh Pirates 18412.8272
Kauffman Stadium Kansas City Royals 18177.5625
Oriole Park at Camden Yards Baltimore Orioles 16145.7654
Tropicana Field Tampa Bay Rays 14552.2840
Marlins Park Miami Marlins 10016.0741

Question 9:

Which managers have won the TSN Manager of the Year award in both the National League (NL) and the American League (AL)? Give their full name and the teams that they were managing when they won the award.

%%sql
SELECT 
    CONCAT(p.nameFirst, ' ', p.nameLast) Manager,
    am1.awardID award,
    am1.yearID year,
    am1.lgID league,
    m.teamID
FROM
    awardsmanagers am1
        INNER JOIN
    people p ON p.playerID = am1.playerID
        INNER JOIN
    managers m ON m.playerID = am1.playerID
        AND m.yearID = am1.yearID
WHERE
    am1.playerID IN (SELECT 
            am2.playerID
        FROM
            awardsmanagers am2
        WHERE
            am2.awardID = 'TSN Manager of the Year'
                AND am2.lgID = 'NL')
        AND am1.playerID IN (SELECT 
            am2.playerID
        FROM
            awardsmanagers am2
        WHERE
            am2.awardID = 'TSN Manager of the Year'
                AND am2.lgID = 'AL')
        AND am1.awardID = 'TSN Manager of the Year'
 * mysql+mysqlconnector://root:***@localhost:3306/lahmansbaseballdb
6 rows affected.
Manager award year league teamID
Jim Leyland TSN Manager of the Year 1988 NL PIT
Jim Leyland TSN Manager of the Year 1990 NL PIT
Jim Leyland TSN Manager of the Year 1992 NL PIT
Jim Leyland TSN Manager of the Year 2006 AL DET
Davey Johnson TSN Manager of the Year 1997 AL BAL
Davey Johnson TSN Manager of the Year 2012 NL WAS