How to Set Pandas Dataframe Column Types When Reading Excel
Introduction
With pandas it is easy to read Excel files and convert the data into a DataFrame. Unfortunately Excel files in the real world are often poorly constructed. In those cases where the data is scattered across the worksheet, you may need to customize the way you read the data. This article will discuss how to use pandas and openpyxl to read these types of Excel files and cleanly convert the data to a DataFrame suitable for further analysis.
The Problem
The pandas read_excel function does an excellent job of reading Excel worksheets. However, in cases where the data is not a continuous table starting at cell A1, the results may not be what you expect.
If you try to read in this sample spreadsheet using read_excel(src_file):
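For reference, that naive call is just the default read, with no hints about the header row or columns (a minimal sketch using the same file path as the later examples):

import pandas as pd
from pathlib import Path

# Naive read: no header or column hints, so pandas guesses the structure
src_file = Path.cwd() / 'shipping_tables.xlsx'
df = pd.read_excel(src_file)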
You will get something that looks like this:
These results include a lot of Unnamed columns, header labels inside a row, as well as several extra columns we don't need.
Pandas Solutions
The simplest solution for this data set is to use the header and usecols arguments to read_excel(). The usecols parameter, in particular, can be very useful for controlling the columns you would like to include.
If you would like to follow along with these examples, the file is on github.
Here is one alternative approach to read only the data we need.
import pandas as pd
from pathlib import Path

src_file = Path.cwd() / 'shipping_tables.xlsx'

df = pd.read_excel(src_file, header=1, usecols='B:F')
The resulting DataFrame only contains the data we need. In this example, we purposely exclude the notes column and date field:
The logic is relatively straightforward. usecols can take Excel ranges such as B:F and read in only those columns. The header parameter expects a single integer that defines the header row. This value is 0-indexed so we pass in 1 even though this is row 2 in Excel.
In some instances, we may want to define the columns as a list of numbers. In this example, we could define the list of integers:
df = pd.read_excel(src_file, header=1, usecols=[1, 2, 3, 4, 5])
This approach might be useful if you have some sort of numerical pattern you want to follow for a large data set (i.e. every 3rd column or only even numbered columns).
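For example, here is a hypothetical sketch (not tied to the sample file) that keeps every third column of a wide worksheet:

# Hypothetical wide file: keep every 3rd column (0-indexed positions 1, 4, 7, ...)
# Adjust the stop value to match the number of columns actually present.
every_third = list(range(1, 20, 3))
df_wide = pd.read_excel('some_wide_file.xlsx', header=1, usecols=every_third)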
The pandas usecols can also take a list of column names. This code will create an equivalent DataFrame:
df = pd.read_excel(src_file, header=1, usecols=['item_type', 'order id', 'order date', 'state', 'priority'])
Using a list of named columns is going to be helpful if the column order changes but you know the names will not change.
Finally, usecols can take a callable function. Here's a simple long-form example that excludes unnamed columns as well as the priority column.
# Define a more complex function:
def column_check(x):
    if 'unnamed' in x.lower():
        return False
    if 'priority' in x.lower():
        return False
    if 'order' in x.lower():
        return True
    return True

df = pd.read_excel(src_file, header=1, usecols=column_check)
The key concept to keep in mind is that the function will parse each column by name and must return a True or False for each column. Those columns that evaluate to True will be included.
Another approach to using a callable is to include a lambda expression. Here is an example where we want to include only a defined list of columns. We normalize the names by converting them to lower case for comparison purposes.
cols_to_use = ['item_type', 'order id', 'order date', 'state', 'priority']

df = pd.read_excel(src_file, header=1, usecols=lambda x: x.lower() in cols_to_use)
Callable functions give us a lot of flexibility for dealing with the real world messiness of Excel files.
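For instance, if the headers sometimes pick up stray whitespace or inconsistent casing, the callable can normalize each name before comparing it. This is a small sketch building on the cols_to_use list from the example above:

# Sketch: strip whitespace and lower-case each header before checking it
df = pd.read_excel(
    src_file,
    header=1,
    usecols=lambda x: str(x).strip().lower() in cols_to_use
)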
Ranges and Tables
In some cases, the data could be even more obfuscated in Excel. In this example, we have a table called ship_cost that we want to read. If you must work with a file like this, it might be challenging to read in with the pandas options we have discussed so far.
In this case, we can use openpyxl directly to parse the file and convert the data into a pandas DataFrame. The fact that the data is in an Excel table can make this process a little easier.
Here's how to use openpyxl (once it is installed) to read the Excel file:
from openpyxl import load_workbook
import pandas as pd
from pathlib import Path

src_file = Path.cwd() / 'shipping_tables.xlsx'

wb = load_workbook(filename=src_file)
This loads the whole workbook. If we want to see all the sheets:
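openpyxl exposes the loaded sheet names through the workbook's sheetnames attribute:

# List all worksheet names in the workbook
wb.sheetnames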
['sales', 'shipping_rates']
To access the specific sheet:
sheet = wb['shipping_rates']
To see a list of all the named tables:
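In recent versions of openpyxl, the worksheet's tables attribute behaves like a dictionary keyed by table name, so the names can be listed with:

# List the named tables defined on this worksheet
sheet.tables.keys()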
dict_keys(['ship_cost'])
This key corresponds to the name we assigned in Excel to the table. Now we access the table to get the equivalent Excel range:
lookup_table = sheet.tables['ship_cost']
lookup_table.ref
'C8:E16'
This worked. We now know the range of data we want to load. The final step is to convert that range to a pandas DataFrame. Here is a short code snippet to loop through each row and convert to a DataFrame:
# Access the data in the table range
data = sheet[lookup_table.ref]
rows_list = []

# Loop through each row and get the values in the cells
for row in data:
    # Get a list of all columns in each row
    cols = []
    for col in row:
        cols.append(col.value)
    rows_list.append(cols)

# Create a pandas dataframe from the rows_list.
# The first row is the column names
df = pd.DataFrame(data=rows_list[1:], index=None, columns=rows_list[0])
Here is the resulting DataFrame:
Now we have the clean table and can use it for further calculations.
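As a side note, the same conversion can be written more compactly with a list comprehension; this sketch is equivalent to the explicit loop above:

# Sketch: equivalent to the loop above, using a list comprehension
data = sheet[lookup_table.ref]
rows_list = [[cell.value for cell in row] for row in data]
df = pd.DataFrame(data=rows_list[1:], columns=rows_list[0])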
Summary
In an ideal world, the data we use would be in a simple, consistent format. See this paper for a nice discussion of what good spreadsheet practices look like.
In the examples in this article, you could easily delete rows and columns to make the data better formatted. However, there are times where this is not feasible or appropriate. The good news is that pandas and openpyxl give us all the tools we need to read Excel data - no matter how crazy the spreadsheet gets.
Changes
- 21-Oct-2020: Clarified that we don't want to include the notes column
Source: https://pbpython.com/pandas-excel-range.html