Importing data from Microsoft Excel files
Although Microsoft Excel supports some charting, sometimes you need more flexible and powerful visualization and need to export data from existing spreadsheets into Python for further use.
A common approach to importing data from Excel files is to export data from Excel into CSV-formatted files and use the tools described in the previous recipe to import data using Python from the CSV file. This is a fairly easy process if we have one or two files (and have Microsoft Excel or OpenOffice.org installed), but if we are automating a data pipe for many files (as part of an ongoing data processing effort), we are not in a position to manually convert every Excel file into CSV. So, we need a way to read any Excel file.
Python has decent support for reading and writing Excel files through the project www.python-excel.org. This support is available in the form of different modules for reading and writing and is platform-independent; in other words, we don't have to run it on Windows in order to read Excel files.
The Microsoft Excel file format changed over time, and support for different versions is available in different Python libraries. The latest stable version of XLRD is 0.90 at the time of this writing and it has support for reading .xlsx
files.
Getting ready
First, we need to install the required module. For this example, we will use the module xlrd
. We will use pip in our virtual environment, as shown in the following code:
$ mkvirtualenv xlrdexample (xlrdexample)$ pip install xlrd
After successful installation, use the sample file ch02-xlsxdata.xlsx
.
How to do it...
The following code example demonstrates how to read a sample dataset from a known Excel file. We will do this as shown in the following steps:
- Open the file workbook.
- Find the sheet by name.
- Read the cells using the number of rows (
nrows
) and columns (ncols
). - For demonstration purposes, we only print the read dataset.
This is shown in the following code:
import xlrd file = 'ch02-xlsxdata.xlsx' wb = xlrd.open_workbook(filename=file) ws = wb.sheet_by_name('Sheet1') dataset = [] for r in xrange(ws.nrows): col = [] for c in range(ws.ncols): col.append(ws.cell(r, c).value) dataset.append(col) from pprint import pprint pprint(dataset)
How it works...
Let's try to explain the simple object model that xlrd
uses. At the top level, we have a workbook (the Python class xlrd.book.Book
) that consists of one or more worksheets (xlrd.sheet.Sheet
), and every sheet has a cell (xlrd.sheet.Cell
) from which we can then read the value.
We load a workbook from a file using open_workbook()
, which returns the xlrd.book.Book
instance that contains all the information about a workbook like sheets. We access sheets using sheet_by_name()
; if we need all sheets, we could use sheets()
, which returns a list of the xlrd.sheet.Sheet
instances. The xlrd.sheet.Sheet
class has a number of columns and rows as attributes that we can use to infer ranges for our loop to access every particular cell inside a worksheet using the method cell()
. There is an xrld.sheet.Cell
class, though it is not something we want to use directly.
Note that the date is stored as a floating point number and not as a separate data type, but the xlrd
module is able to inspect the value and try to infer if the data is in fact a date. So, we can inspect the cell type for the cell to get the Python date object. The module xlrd
will return xlrd.XL_CELL_DATE
as the cell type if the number format string looks like a date. Here is a snippet of code that demonstrates this:
from datetime import datetime from xlrd import open_workbook, xldate_as_tuple … cell = sheet.cell(1, 0) print cell print cell.value print cell.ctype if cell.ctype == xlrd.XL_CELL_DATE: date_value = xldate_as_tuple(cell.value, book.datemode) print datetime(*date_value)
This field still has issues, so please refer to the official documentation and mailing list in case you require extensive work with dates.
There's more...
A neat feature of xlrd
is its ability to load only parts of the file that are required in the memory. There is an on_demand
parameter that can be passed as True
value while calling open_workbook
so that the worksheet will only be loaded when requested. See the following example of code snippet for this:
book = open_workbook('large.xls', on_demand=True)
We didn't mention writing Excel files in this section partly because there will be a separate recipe for that and partly because there is a different module for that—xlwt
. You will read more about it in the Exporting data to JSON, CSV, and Excel recipe in this chapter.
If you need specific usage that was not covered with the module and examples explained earlier, here is a list of other Python modules on PyPi that might help you out with spreadsheets http://pypi.python.org/pypi?:action=browse&c=377.