
Web scraping NFL draft data with Python

Jonathan Kelly
5 min read · Aug 16, 2021


The data science field has exploded in the last few years. In fact, Data Scientist was Glassdoor’s #1 job in America from 2016–2019, #3 in 2020, and #2 in 2021. All that popularity has crowded the marketplace with people looking to switch careers and get into this amazing field. One of the best ways to stand out is by creating a portfolio of data science projects, but to do that you need data.

There are great places to find datasets, like Kaggle, data science code repositories, and various government and public health sites. But what if you wanted something more specific? What if you wanted a dataset that would let you show off your data science skill set while also being fun and interesting? How could you easily get the data of your choice off the internet without copying and pasting a million times? Enter web scraping.

Web scraping is a fun and surprisingly easy way to get the data you want from almost any website you can find. At its core, it's just a few lines of code that open a webpage and copy the information found on that page. It's powerful because it can open lots of web pages in succession and record the data in many different formats. You can then use that data for data analysis and machine learning (making predictions).

The following is a breakdown of the code from my latest adventure in Python. As with almost all code, not all of it is novel; some has been adapted from other snippets floating around the internet (Google, Stack Overflow, etc.). This tutorial isn't comprehensive and assumes a little knowledge of HTML and programming. The language we're using is Python, and I'm using VSCode as my IDE with the Jupyter Notebook extension installed.

Step 1: Traversing the Website

I love football and decided to scrape a website all about the NFL draft. I went back to the 2000 NFL draft and created a dataset of every draft pick, school, team, round, pick number, and position from 2000–2021.

The first step is to inspect the site. To do this, we go to the first page in the succession of pages we want to scrape, in this case http://www.drafthistory.com/index.php/years/2000. Once I'm on that page, I click through to the next page I want to scrape and look at the URL again. In this case it's the 2001 NFL draft, and the URL is http://www.drafthistory.com/index.php/years/2001. Just from looking at these two URLs, I think you can tell how this is going to go. Because the only difference between the two pages is the year of the respective draft, this scrape should be pretty easy.

Now that we know how the URLs are structured, we need to build a list of URLs for all the years we want to examine on the website. This pattern is common on a lot of sites when you iterate through a certain section (e.g. blog_page_1, blog_page_2, …, blog_page_42), and it makes it really easy for us to extract the information we need. Keep this in mind if you ever want to scrape other pages.

Step 2: Building our URL list

First, we are going to create a homepage variable. In this case, it's everything before the year of the draft. We also create an empty list to store all of the URLs after we create them. The code below has comments that explain what it does. Comments in Python start with #; if you see words after a #, that is an explanation of the line(s) next to or below it.
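Here is a minimal sketch of that step. The names homepage and urls come from the description above, and the year range (2000 through 2021) matches the drafts we want; your variable names may differ.

# Everything before the draft year stays the same for every page
homepage = "http://www.drafthistory.com/index.php/years/"

# Empty list to hold the finished URLs
urls = []

# Build one URL per draft year from 2000 through 2021
for year in range(2000, 2022):
    urls.append(homepage + str(year))

# Quick sanity check on the first few URLs
print(urls[:3])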

Step 3: Scraping your selected pages

Now that we know this code works, we want to use these URLs to retrieve the raw data from those websites. So we’ll take our code above and add it to our scraping algorithm.

The bottom part of this code will iterate over all of the desired pages, copy all the HTML from those pages, and store each page in its own text file.
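Here is a rough sketch of that loop. It reuses the URL-building code from Step 2, and it assumes the requests library for downloading pages and the file names draft_2000.txt, draft_2001.txt, and so on; the original code may have used a different library or naming scheme.

import requests

# Rebuild the URL list from Step 2
homepage = "http://www.drafthistory.com/index.php/years/"
urls = [homepage + str(year) for year in range(2000, 2022)]

# Download each page and save the raw HTML to its own text file
for year, url in zip(range(2000, 2022), urls):
    response = requests.get(url)
    response.raise_for_status()  # stop if a page fails to load
    with open(f"draft_{year}.txt", "w", encoding="utf-8") as f:
        f.write(response.text)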

I tried to use Beautiful Soup and a few other Python libraries to automatically parse the text into a nice clean CSV file, but I ran into some trouble based on the structure and security of the site. So this is how I chose to do it. There are probably much more efficient methods out there, but this worked, so I went for it.

Step 4: Cleaning the data

Now that we have all of our draft data in text form, it’s time to clean it up a bit and get it into a Pandas dataframe and ultimately into CSV form so we can manipulate and analyze it. To do that, I’m going to be using an HTML parser. This will read the HTML from the text file we made and split the text using known HTML tags. After the HTML is parsed, we are going to create a Pandas dataframe and move our data into it. We’ll create and name columns, then take each dataframe and save it by year. If you’re unfamiliar with dataframes, they’re kind of like spreadsheets and make it much easier to work with our data.
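Below is a sketch of that parsing step using Python's built-in html.parser module and Pandas. The column names and the rule for deciding which parsed rows belong in the table are assumptions on my part, so adjust them to match what you see in the raw HTML.

from html.parser import HTMLParser
import pandas as pd

# A simple parser that collects the text inside every table cell,
# grouping cells into rows as it walks the HTML tags.
class DraftTableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []
        self.current_row = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            if self.current_row:
                self.rows.append(self.current_row)
            self.current_row = []
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
        elif tag == "tr" and self.current_row:
            self.rows.append(self.current_row)
            self.current_row = []

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.current_row.append(data.strip())

# Hypothetical column names -- rename these to match the site's table
columns = ["Round", "Pick", "Player", "Team", "Position", "School"]

for year in range(2000, 2022):
    with open(f"draft_{year}.txt", "r", encoding="utf-8") as f:
        parser = DraftTableParser()
        parser.feed(f.read())
    # Keep only rows with the expected number of cells
    data = [row for row in parser.rows if len(row) == len(columns)]
    df = pd.DataFrame(data, columns=columns)
    df.to_csv(f"draft_{year}.csv", index=False)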

Once the files have been converted, open them up in your program of choice or display them in your IDE. You'll notice that we have a few extra rows of data in each file. I just opened them in Excel and deleted those rows. There is a programmatic way to do it, but since each file is slightly different, I couldn't find any similarities in the files that would allow me to do it easily. To me, it wasn't worth the extra work since there are only 22 files and it only took me about 3 minutes total. However, if you're working with a lot of files or tons of data, it might be worth figuring out how to do it for your specific use case.

Step 5: Merging the data

This step lets us combine these different files into a single CSV file. The code below will allow you to combine all of the CSV files into one.
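Here is a short sketch of that merge, assuming the per-year files follow the draft_YYYY.csv naming used above and that the combined file is called nfl_draft_2000_2021.csv.

import glob
import pandas as pd

# Collect every per-year CSV file and stack them into one dataframe
csv_files = sorted(glob.glob("draft_*.csv"))
combined = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)

# Save the merged dataset to a single CSV file
combined.to_csv("nfl_draft_2000_2021.csv", index=False)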

And we’re done! We now have a complete dataset of all players drafted into the NFL from 2000 through 2021. What you do with this data is up to you. I hope you enjoyed the read!

After I scraped this data, I went on to analyze the distributions in the draft. In this blog post, I show the results of that analysis. You can find the code repository for the analysis here, and the scraping code repository here if you'd like to try it yourself. You can also access the dataset on Kaggle.


Jonathan Kelly

Data Engineer writing short tech tutorials and learnings from my work experiences. Find me on YouTube @datadelve