Importing data from CSV
In this recipe, we'll work with the most common file format that you will encounter in the wild world of data—CSV. It stands for Comma Separated Values, which almost explains all the formatting there is. (There is also a header part of the file, but those values are also comma separated.)
Python has a module called csv that supports reading and writing CSV files in various dialects. Dialects are important because there is no standard CSV format, and different applications implement CSV in slightly different ways. A file's dialect is almost always recognizable from a first look at the file.
Getting ready
What we need for this recipe is the CSV file itself. We'll use sample CSV data that you can download as ch02-data.csv.
We assume that sample data files are in the same folder as the code reading them.
How to do it...
The following code example demonstrates how to import data from a CSV file. We will perform the following steps for this:
- Open the ch02-data.csv file for reading.
- Read the header first.
- Read the rest of the rows.
- In case there is an error, raise an exception.
- After reading everything, print the header and the rest of the rows.
This is shown in the following code:
import csv
import sys

filename = 'ch02-data.csv'
data = []
try:
    with open(filename) as f:
        reader = csv.reader(f)
        header = next(reader)
        data = [row for row in reader]
except csv.Error as e:
    print('Error reading CSV file at line %s: %s' % (reader.line_num, e))
    sys.exit(-1)

if header:
    print(header)
    print('==================')
    for datarow in data:
        print(datarow)
How it works...
First, we import the csv module in order to gain access to the required functions. Then, we open the file containing the data using the with compound statement and bind it to the object f. The with statement is a context manager that relieves us of having to worry about closing the resource after we are finished manipulating it. It is a very handy way of working with file-like resources because it makes sure that the resource is freed (for example, that the file is closed) after the block of code that uses it has executed.
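The guarantee the with statement provides can be sketched as an equivalent try/finally block (a minimal illustration; the file name is a throwaway used only here):

```python
# A throwaway file used only for this illustration
with open('tmp-demo.csv', 'w') as f:
    f.write('a,b,c\n')

# The with statement used above is roughly equivalent to this
# try/finally block: close() runs even if the body raises
f = open('tmp-demo.csv')
try:
    first_line = f.readline()
finally:
    f.close()

print(f.closed)  # True: the file is guaranteed to be closed
```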
Then, we call csv.reader(), which returns a reader object that allows us to iterate over all rows of the file. Every row is just a list of string values and is printed inside the loop.
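For instance, iterating over a tiny two-row CSV held in memory (io.StringIO is used here just to keep the sketch self-contained; the sample values are made up) shows that every value comes back as a string:

```python
import csv
import io

# A tiny two-row CSV held in memory; csv.reader accepts any
# iterable of lines, not only file objects
buf = io.StringIO('day,temp\nMon,21\n')
rows = [row for row in csv.reader(buf)]
print(rows)  # note that the numeric value is still a string
```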
Reading the first row is somewhat different as it is the header of the file and describes the data in each column. Headers are not mandatory for CSV files and some files don't have them, but they are a really nice way of providing minimal metadata about datasets. Sometimes, though, you will find separate text or even CSV files that serve as metadata, describing the format and providing additional information about the data.
The only way to check what the first line looks like is to open the file and visually inspect the first few lines. On Linux, this can be done efficiently using the head command, as shown here:
$ head some_file.csv
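The same quick inspection can also be done from Python itself, for example with itertools.islice (a minimal sketch; the file name and its contents are made up so the snippet is self-contained):

```python
import itertools

# A made-up stand-in for the data file, so the sketch is self-contained
with open('head-demo.csv', 'w') as f:
    f.write('a,b\n1,2\n3,4\n')

# Grab only the first two lines, like `head -n 2` on the shell
with open('head-demo.csv') as f:
    first_lines = list(itertools.islice(f, 2))
print(''.join(first_lines))
```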
During iteration of the data, we save the first row in header, while we add every other row to the data list.
We can also check whether the .csv file has a header by using the has_header() method of the csv.Sniffer class.
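That check lives on the csv.Sniffer class as the has_header() method, which applies a heuristic to a text sample from the file. A minimal sketch with made-up sample data (in practice you would pass the first few kilobytes of the real file, for example f.read(1024)):

```python
import csv

# A made-up sample; in practice, pass the first few kilobytes of the file
sample = 'name,age\nAlice,30\nBob,25\n'
sniffer = csv.Sniffer()
has_header = sniffer.has_header(sample)  # heuristic: compares the first row to the rest
dialect = sniffer.sniff(sample)          # the Sniffer can also detect the dialect
print(has_header, dialect.delimiter)
```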
Should any errors occur during reading, csv.reader() will raise an error that we can catch and then print a helpful message to the user, making the problem easier to detect.
There's more...
If you want to read about the background and reasoning for the csv module, the PEP document CSV File API is available at http://www.python.org/dev/peps/pep-0305/.
If you have larger files that you want to load, it's often better to use a well-known library like NumPy's loadtxt(), which copes better with large CSV files.
The basic usage is simple as shown in the following code snippet:
import numpy
data = numpy.loadtxt('ch02-data.csv', dtype=str, delimiter=',')
Note that we need to define a delimiter to instruct NumPy to split our data appropriately. The function numpy.loadtxt() is somewhat faster than the similar function numpy.genfromtxt(), but the latter copes better with missing data, and it lets you provide converter functions that specify how certain columns of the loaded data files are to be processed.
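A small sketch of that difference, using numpy.genfromtxt() on in-memory data with one missing value (the sample values are made up for illustration):

```python
import io
import numpy

# Two rows, with the second value missing in the last row (made-up data)
raw = io.StringIO('1,2.5\n3,\n')

# filling_values substitutes a default for every missing entry;
# a converters={column: function} mapping could transform columns instead
data = numpy.genfromtxt(raw, delimiter=',', filling_values=-1.0)
print(data)
```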
Note
In Python 2, the csv module doesn't support Unicode, so you must explicitly convert the read data into UTF-8 or ASCII printable strings. The official Python CSV documentation offers good examples of how to resolve such encoding issues.
In Python 3.3 and later versions, Unicode support is the default and there are no such issues.
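In Python 3, an encoding can simply be passed to open(); a minimal sketch (the file name is made up for this illustration):

```python
import csv

# Write a file containing non-ASCII text, then read it back;
# the file name is made up for this illustration
with open('ch02-utf8.csv', 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerow(['São Paulo', '12.3'])

with open('ch02-utf8.csv', encoding='utf-8', newline='') as f:
    rows = list(csv.reader(f))
print(rows)
```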