- Python Data Visualization Cookbook
- Igor Milovanović
- 2025-04-04 22:11:17
Importing data from CSV
In this recipe we will work with the most common file format that one will encounter in the wild world of data, CSV. It stands for Comma Separated Values, which almost explains all the formatting there is. (There is also a header part of the file, but those values are also comma separated.)
Python has a module called csv that supports reading and writing CSV files in various dialects. Dialects are important because there is no standard CSV and different applications implement CSV in slightly different ways. A file's dialect can almost always be recognized at first glance at the file.
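To make the idea of dialects concrete, here is a minimal sketch, assuming Python 3. The two sample strings and all variable names are made up for illustration and are not part of the recipe's data file. It reads the same rows once by overriding the delimiter and once by naming one of the dialects that ships with the csv module.

```python
import csv
import io

# Hypothetical sample data: the same rows written in two different dialects.
semicolon_text = "name;value\nalpha;1\nbeta;2\n"
tab_text = "name\tvalue\nalpha\t1\nbeta\t2\n"

# Override a single formatting parameter of the default dialect...
semicolon_rows = list(csv.reader(io.StringIO(semicolon_text), delimiter=';'))

# ...or select a dialect registered with the module by its name.
tab_rows = list(csv.reader(io.StringIO(tab_text), dialect='excel-tab'))

print(semicolon_rows)
print(tab_rows)
```

Both calls should produce the same list of rows, which is the point of dialects: the parsing rules change, the resulting data does not.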
Getting ready
What we need for this recipe is the CSV file itself. We will use sample CSV data that you can download from ch02-data.csv.
We assume that the sample data file is in the same folder as the code reading it.
How to do it...
The following code example demonstrates how to import data from a CSV file. We will:
- Open the ch02-data.csv file for reading.
- Read the header first.
- Read the rest of the rows.
- In case there is an error, raise an exception.
- After reading everything, print the header and the rest of the rows.
import csv
import sys

filename = 'ch02-data.csv'
data = []

try:
    with open(filename) as f:
        reader = csv.reader(f)
        # the first row holds the column names
        header = next(reader)
        # every remaining row is a list of string values
        data = [row for row in reader]
except csv.Error as e:
    print("Error reading CSV file at line %s: %s" % (reader.line_num, e))
    sys.exit(-1)

if header:
    print(header)
    print('==================')

for datarow in data:
    print(datarow)
How it works...
First, we import the csv module to get access to the functions we need. Then, we open the data file using the with compound statement and bind it to the object f. Because the file is used as a context manager, the with statement relieves us of having to close the resource once we have finished with it. It is a very handy way of working with resources such as files because it makes sure that the resource is freed (for example, that the file is closed) after the block of code has executed.
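The effect of the with statement can be checked directly. The following is a small sketch, assuming Python 3; it writes to a throwaway temporary file (an assumption made so that the example does not depend on ch02-data.csv) and records whether the file is closed inside and after the block.

```python
import os
import tempfile

# Create a throwaway file so the sketch does not touch the recipe's data.
fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)

with open(path, 'w') as f:
    f.write('a,b\n1,2\n')
    inside = f.closed    # the file is still open inside the block

outside = f.closed       # the with statement closed it on exit

print(inside, outside)
os.remove(path)
```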
Then, we use the csv.reader() function, which returns a reader object that lets us iterate over all the rows of the file. Every row is just a list of values and is printed inside the loop.
Reading the first row is somewhat different as it is the header of the file and describes the data in each column. This is not mandatory for CSV files and some files don't have headers, but they are a really nice way of providing minimal metadata about datasets. Sometimes though, you will find separate text or even CSV files that are just used as metadata, describing the format and additional data about the data.
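When a file does have a header, the csv module can also use it to label every value. The sketch below, assuming Python 3, uses csv.DictReader for this. It is an alternative to the approach in this recipe, not the one the recipe uses, and the sample text is made up for illustration.

```python
import csv
import io

# Hypothetical data with a header row; not the recipe's data file.
text = "name,value\nalpha,1\nbeta,2\n"

# DictReader consumes the first row as field names and
# yields every following row as a mapping from name to value.
reader = csv.DictReader(io.StringIO(text))
records = list(reader)

print(reader.fieldnames)
for record in records:
    print(record['name'], record['value'])
```

Note that the values are still strings; any conversion to numbers is left to the caller.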
The simplest way to check what the first line looks like is to open the file and inspect it visually (for example, by looking at the first few lines of the file). This can be done efficiently on Linux using shell commands such as head, as follows:
$ head some_file.csv
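Where the head command is not available, a similar peek can be done from Python. This is a rough sketch, assuming Python 3; the helper name head() and the generated temporary file are illustrative assumptions, not part of the recipe.

```python
import itertools
import os
import tempfile

def head(path, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(path) as f:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip('\n') for line in itertools.islice(f, n)]

# Demonstrate on a throwaway file so the sketch runs anywhere.
fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)
with open(path, 'w') as f:
    f.write('name,value\n')
    f.write(''.join('row%d,%d\n' % (i, i) for i in range(100)))

first_lines = head(path, 3)
print(first_lines)
os.remove(path)
```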
While iterating over the data, we save the first row in header and add every other row to the data list.
Should any error occur during reading, the reader object raises csv.Error, which we catch so that we can print a helpful message to the user and make the error easier to track down.
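To see the error handling in action, here is a minimal sketch, assuming Python 3. The malformed input is made up, and the strict=True option is an addition that the recipe's code does not use; it asks the reader to raise csv.Error on bad input instead of trying to make sense of it.

```python
import csv
import io

# Hypothetical malformed input: stray text follows a closing quote on line 2.
bad_text = 'name,value\n"alpha"x,1\n'

reader = csv.reader(io.StringIO(bad_text), strict=True)
rows = []
error_line = None

try:
    for row in reader:
        rows.append(row)
except csv.Error as e:
    # line_num counts the lines read from the source so far
    error_line = reader.line_num
    print("Error reading CSV data at line %s: %s" % (error_line, e))

print(rows)
```

The rows read before the faulty line remain available, so a caller can decide whether to keep the partial result or discard it.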
There's more...
If you want to read about the background and reasoning for the csv module, the PEP document CSV File API (PEP 305) is available at http://www.python.org/dev/peps/pep-0305/.
If we have larger files that we want to load, it's often better to use well-known libraries, such as NumPy's loadtxt(), that cope better with large CSV files.
The basic usage is simple as shown in the following code snippet:
import numpy
data = numpy.loadtxt('ch02-data.csv', dtype=str, delimiter=',')
Note that we need to define a delimiter to instruct NumPy to separate our data as appropriate. The function numpy.loadtxt() is somewhat faster than the similar function numpy.genfromtxt(), but the latter copes better with missing data and lets you provide functions that specify how certain columns of the loaded data files should be processed.
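The handling of missing data can be sketched as follows, assuming Python 3 with NumPy installed. The numeric sample with one empty field is made up for illustration, and filling_values is only one of several options that numpy.genfromtxt() offers for this purpose.

```python
import io
import numpy

# Hypothetical numeric data with one missing field in the second row.
text = "1,2,3\n4,,6\n"

# By default, genfromtxt marks the missing field as nan...
data = numpy.genfromtxt(io.StringIO(text), delimiter=',')

# ...or fills it with a value of our choosing.
filled = numpy.genfromtxt(io.StringIO(text), delimiter=',', filling_values=0)

print(data)
print(filled)
```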
Note
Currently, in Python 2.7.x, the csv module doesn't support Unicode, and you must explicitly convert the read data into UTF-8 or ASCII printable. The official Python CSV documentation offers good examples on how to resolve data encoding issues.
In Python 3.3 and later versions, Unicode support is default and there are no such issues.
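As an illustration of the Python 3 behavior, the following sketch writes rows containing non-ASCII text to a temporary file and reads them back. The sample rows and the temporary file are assumptions made for the example. Stating the encoding explicitly, rather than relying on the platform default, keeps the result the same across systems.

```python
import csv
import os
import tempfile

# Hypothetical rows containing non-ASCII text.
rows = [['name', 'city'], ['Igor', 'Niš'], ['Zoë', 'Kraków']]

fd, path = tempfile.mkstemp(suffix='.csv')
os.close(fd)

# The csv documentation recommends newline='' for files passed to the module.
with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

with open(path, newline='', encoding='utf-8') as f:
    loaded = list(csv.reader(f))

os.remove(path)
print(loaded == rows)
```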