How to Set Pandas Dataframe Column Types When Reading Excel
Introduction
With pandas it is easy to read Excel files and convert the data into a DataFrame. Unfortunately Excel files in the real world are often poorly constructed. In those cases where the data is scattered across the worksheet, you may need to customize the way you read the data. This article will discuss how to use pandas and openpyxl to read these types of Excel files and cleanly convert the data to a DataFrame suitable for further analysis.
The Problem
The pandas read_excel function does an excellent job of reading Excel worksheets. However, in cases where the data is not a continuous table starting at cell A1, the results may not be what you expect.
If you try to read in this sample spreadsheet using read_excel(src_file):
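For reference, that naive call is just the default read, with no hints about the header row or columns (a minimal sketch using the same file path as the later examples):

import pandas as pd
from pathlib import Path

# Naive read: no header or column hints, so pandas guesses the structure
src_file = Path.cwd() / 'shipping_tables.xlsx'
df = pd.read_excel(src_file)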
You will get something that looks like this:
These results include a lot of Unnamed columns, header labels inside a row, as well as several extra columns we don't need.
Pandas Solutions
The simplest solution for this data set is to use the header and usecols arguments to read_excel(). The usecols parameter, in particular, can be very useful for controlling the columns you would like to include.
If you would like to follow along with these examples, the file is on github.
Here is one alternative approach to read only the data we need.
import pandas as pd
from pathlib import Path

src_file = Path.cwd() / 'shipping_tables.xlsx'

df = pd.read_excel(src_file, header=1, usecols='B:F')
The resulting DataFrame only contains the data we need. In this example, we purposely exclude the notes column and date field:
The logic is relatively straightforward. usecols can take Excel ranges such as B:F and read in only those columns. The header parameter expects a single integer that defines the header row. This value is 0-indexed so we pass in 1 even though this is row 2 in Excel.
In some instances, we may want to define the columns as a list of numbers. In this example, we could define the list of integers:
df = pd.read_excel(src_file, header=1, usecols=[1, 2, 3, 4, 5])
This approach might be useful if you have some sort of numerical pattern you want to follow for a large data set (i.e. every 3rd column or only even numbered columns).
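For example, here is a hypothetical sketch (not tied to the sample file) that keeps every third column of a wide worksheet:

# Hypothetical wide file: keep every 3rd column (0-indexed positions 1, 4, 7, ...)
# Adjust the stop value to match the number of columns actually present.
every_third = list(range(1, 20, 3))
df_wide = pd.read_excel('some_wide_file.xlsx', header=1, usecols=every_third)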
The pandas usecols can also take a list of column names. This code will create an equivalent DataFrame:
df = pd.read_excel(src_file, header=1, usecols=['item_type', 'order id', 'order date', 'state', 'priority'])
Using a list of named columns is going to be helpful if the column order changes but you know the names will not change.
Finally, usecols can take a callable function. Here's a simple long-form example that excludes unnamed columns as well as the priority column.
# Define a more complex function:
def column_check(x):
    if 'unnamed' in x.lower():
        return False
    if 'priority' in x.lower():
        return False
    if 'order' in x.lower():
        return True
    return True

df = pd.read_excel(src_file, header=1, usecols=column_check)
The key concept to keep in mind is that the function will parse each column by name and must return a True or False for each column. Those columns that evaluate to True will be included.
Another approach to using a callable is to include a lambda expression. Here is an example where we want to include only a defined list of columns. We normalize the names by converting them to lower case for comparison purposes.
cols_to_use = ['item_type', 'order id', 'order date', 'state', 'priority']

df = pd.read_excel(src_file, header=1, usecols=lambda x: x.lower() in cols_to_use)
Callable functions give us a lot of flexibility for dealing with the real world messiness of Excel files.
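For instance, if the headers sometimes pick up stray whitespace or inconsistent casing, the callable can normalize each name before comparing it. This is a small sketch building on the cols_to_use list from the example above:

# Sketch: strip whitespace and lower-case each header before checking it
df = pd.read_excel(
    src_file,
    header=1,
    usecols=lambda x: str(x).strip().lower() in cols_to_use
)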
Ranges and Tables
In some cases, the data could be even more obfuscated in Excel. In this example, we have a table called ship_cost that we want to read. If you must work with a file like this, it might be challenging to read in with the pandas options we have discussed so far.
In this case, we can use openpyxl directly to parse the file and convert the data into a pandas DataFrame. The fact that the data is in an Excel table can make this process a little easier.
Here's how to use openpyxl (once it is installed) to read the Excel file:
from openpyxl import load_workbook
import pandas as pd
from pathlib import Path

src_file = Path.cwd() / 'shipping_tables.xlsx'

wb = load_workbook(filename=src_file)
This loads the whole workbook. If we want to see all the sheets:
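openpyxl exposes the loaded sheet names through the workbook's sheetnames attribute:

# List all worksheet names in the workbook
wb.sheetnames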
['sales', 'shipping_rates']
To access the specific sheet:
sheet = wb['shipping_rates']
To see a list of all the named tables:
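In recent versions of openpyxl, the worksheet's tables attribute behaves like a dictionary keyed by table name, so the names can be listed with:

# List the named tables defined on this worksheet
sheet.tables.keys()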
dict_keys(['ship_cost'])
This key corresponds to the name we assigned in Excel to the table. Now we access the table to get the equivalent Excel range:
lookup_table = sheet.tables['ship_cost']
lookup_table.ref
'C8:E16'
This worked. We now know the range of data we want to load. The final step is to convert that range to a pandas DataFrame. Here is a short code snippet to loop through each row and convert to a DataFrame:
# Access the data in the table range
data = sheet[lookup_table.ref]
rows_list = []

# Loop through each row and get the values in the cells
for row in data:
    # Get a list of all columns in each row
    cols = []
    for col in row:
        cols.append(col.value)
    rows_list.append(cols)

# Create a pandas dataframe from the rows_list.
# The first row is the column names
df = pd.DataFrame(data=rows_list[1:], index=None, columns=rows_list[0])
Here is the resulting DataFrame:
Now we have the clean table and can use it for further calculations.
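As a side note, the same conversion can be written more compactly with a list comprehension; this sketch is equivalent to the explicit loop above:

# Sketch: equivalent to the loop above, using a list comprehension
data = sheet[lookup_table.ref]
rows_list = [[cell.value for cell in row] for row in data]
df = pd.DataFrame(data=rows_list[1:], columns=rows_list[0])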
Summary
In an ideal world, the data we use would be in a simple, consistent format. See this paper for a nice discussion of what good spreadsheet practices look like.
In the examples in this article, you could easily delete rows and columns to make the data better formatted. However, there are times where this is not feasible or appropriate. The good news is that pandas and openpyxl give us all the tools we need to read Excel data - no matter how crazy the spreadsheet gets.
Changes
- 21-Oct-2020: Clarified that we don't want to include the notes column
Source: https://pbpython.com/pandas-excel-range.html