Dealing with real data
The atmospheric and oceanic sciences (AOS) are “data”-intensive fields, whether the data are observations or model output. Most of the analysis we do involves datasets, and so facilities for file input/output (I/O) are critical. Fortunately, Python has very robust facilities for file I/O, and we will be dealing with realistic datasets using Python.
Data in the text file
Some datasets are small enough to be stored in a plain text file. A good example is the record of CO$_2$ concentration in the atmosphere.
In fact, this is the famous “Keeling curve”.
Reading a text file is possible using the built-in function open:
f = open('co2_mm_mlo.txt','r')
contents = f.readlines()
f.close()
As you might guess, f is an object with many methods attached. Do you remember how to list all available methods and attributes?
dir(f)
You can see that close, which we used above, is one of those methods. It flushes and closes the I/O object. Once the file is closed, you no longer have a usable handle on it. (You cannot do anything with f unless you open the text file again.)
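As a quick, self-contained check of this behavior (using a throwaway temporary file, so the sketch runs anywhere), reading from a closed file object raises a ValueError:

```python
import os
import tempfile

# Throwaway file so the sketch runs anywhere (the filename is made up)
path = os.path.join(tempfile.gettempdir(), 'closed_file_demo.txt')
with open(path, 'w') as g:
    g.write('hello\n')

f = open(path, 'r')
f.close()
try:
    f.readlines()   # fails: the handle is unusable once closed
except ValueError as err:
    print('Cannot read:', err)
os.remove(path)
```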
The method readlines goes through the file and saves each line as an element of the list contents.
In [4]: contents[:10]
Out[4]:
['# --------------------------------------------------------------------\n',
'# USE OF NOAA ESRL DATA\n',
'# \n',
'# These data are made freely available to the public and the\n',
'# scientific community in the belief that their wide dissemination\n',
'# will lead to greater understanding and new scientific insights.\n',
'# The availability of these data does not constitute publication\n',
'# of the data. NOAA relies on the ethics and integrity of the user to\n',
'# ensure that ESRL receives fair credit for their work. If the data \n',
'# are obtained for potential use in a publication or presentation, \n']
.
.
.
'# NOTE: In general, the data presented for the last year are subject to change, \n',
'# depending on recalibration of the reference gas mixtures used, and other quality\n',
'# control procedures. Occasionally, earlier years may also be changed for the same\n',
'# reasons. Usually these changes are minor.\n',
'#\n',
'# CO2 expressed as a mole fraction in dry air, micromol/mol, abbreviated as ppm\n',
'# (-99.99 missing data; -1 no data for #daily means in month)\n',
'#\n',
'# decimal average interpolated trend #days\n',
'# date (season corr)\n',
'1958 3 1958.208 315.71 315.71 314.62 -1\n',
'1958 4 1958.292 317.45 317.45 315.29 -1\n',
.
.
.
Well, this dataset has quite a few lines that are not data values, and we want to exclude this header. The good news is that every header line starts with the special character “#”, so we can determine whether a line belongs to the header by examining its first character.
The string object has a method for this, called startswith.
for line in contents:
    if not line.startswith('#'):
        print(line)
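A minimal, self-contained illustration of startswith, using sample lines modeled on the file above:

```python
# Sample lines modeled on co2_mm_mlo.txt
header_line = '# USE OF NOAA ESRL DATA\n'
data_line = '1958 3 1958.208 315.71 315.71 314.62 -1\n'

print(header_line.startswith('#'))  # True
print(data_line.startswith('#'))    # False
```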
We can see that line is a string variable containing all the characters of each line. (Try type(line) to see the type of line.)
We are interested in the CO$_2$ concentration, which sits in the middle of the string line.
So we need to somehow break this long string into pieces to extract the CO$_2$ concentration.
Python has a host of string manipulation methods, built-in to string variables (a.k.a., objects), which are ideal for dealing with contents from text files.
We will mention only a few of these methods.
The split method of a string object breaks the string into a list using a separator. For instance:
In [9]: print(line)
2018 9 2018.708 405.51 405.51 409.02 29
In [10]: type(line)
Out[10]: str
In [11]: print(line.split())
['2018', '9', '2018.708', '405.51', '405.51', '409.02', '29']
Finally, once we have the strings we want, we can convert them to numerical types in order to do calculations. If you loop through a list of strings, you can use the float and int functions to turn each string into a number. For instance:
CO2 = []
for line in contents:
    if not line.startswith('#'):
        values = line.split()
        CO2.append(float(values[3]))
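The whole parsing loop can be sketched in a self-contained way with a few sample lines (laid out like the real co2_mm_mlo.txt file):

```python
# Sample lines laid out like co2_mm_mlo.txt
sample = [
    '# decimal average interpolated trend #days\n',
    '1958 3 1958.208 315.71 315.71 314.62 -1\n',
    '1958 4 1958.292 317.45 317.45 315.29 -1\n',
]

co2 = []
for line in sample:
    if not line.startswith('#'):
        # column 3 (0-based) holds the monthly average concentration
        co2.append(float(line.split()[3]))

print(co2)  # [315.71, 317.45]
```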
For future use, we may want to save this array as another text file. To write a string to the file defined by the file object f, use the write method attached to the file object:
f.write(astr)
Here, astr is the string you want to write to the file. Note that a newline character is not automatically written to the file after the string is written.
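A small self-contained sketch of that point (writing to a throwaway temporary file, so it runs anywhere): you must append '\n' yourself to end a line.

```python
import os
import tempfile

# Throwaway file so the sketch runs anywhere (the filename is made up)
path = os.path.join(tempfile.gettempdir(), 'write_demo.txt')

f = open(path, 'w')
f.write('first')       # no newline is appended automatically
f.write('second\n')    # end the line yourself with '\n'
f.close()

with open(path) as g:
    content = g.read()
print(content)  # firstsecond -- both strings ended up on one line
os.remove(path)
```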
To write a list of strings to the file, use the writelines method:
f.writelines(contents)
Here, contents is a list of strings.
In this example, we want to write the CO$_2$ concentrations to a new file, co2_keeling.txt.
fileout = open('co2_keeling.txt', 'w') # open the text file to write
fileout.writelines(str(CO2))
fileout.close()
If you view the text file, the values are all stored on one line. We can save one value per line as follows.
outputstr = ['\n'] * len(CO2)
for i in range(len(CO2)):
    outputstr[i] = str(CO2[i]) + outputstr[i]
fileout = open('co2_keeling.txt', 'w') # open the text file to write
fileout.writelines(outputstr)
fileout.close()
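An equivalent, more compact way to build the one-value-per-line strings is a list comprehension; this sketch uses a few sample values in place of the real list:

```python
CO2 = [315.71, 317.45, 314.62]  # sample values standing in for the real list

# Same result as the loop above: one string per value, each ending in '\n'
outputstr = [str(v) + '\n' for v in CO2]
print(outputstr)  # ['315.71\n', '317.45\n', '314.62\n']
```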
Data in the csv (comma-separated values) file
You will often find data in CSV format. It is not very different from a plain text file, but the data values are separated by commas, as the name implies. You can read CSV-format data as above, with slight differences.
As an example, I extracted car accident statistics for 2017, in which incidents are sorted by the age of the driver who caused the accident. Get the data from here.
Let’s follow the steps in the previous example.
f = open('caraccident_data.csv','r')
contents = f.readlines()
f.close()
The list variable contents has each line as an element. The first line, saved in contents[0], contains the column names rather than data values. So I would do the following to read the total number of accidents, which is in the second column.
tot = []
for line in contents[1:]:
    val = line.split(',')
    tot.append(float(val[1]))
When using the split method here, we specified the comma as the separator.
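One detail worth noting: each line still ends with '\n', and that newline stays attached to the last field after split(','); calling strip() first avoids it. A self-contained sketch, with a line modeled on the accident file:

```python
# A sample line modeled on the accident data file
line = 'accident,1143175,44501\n'

print(line.split(','))          # last field keeps the trailing '\n'
print(line.strip().split(','))  # ['accident', '1143175', '44501']
```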
Python ships with a module called csv for reading and writing CSV files. It lets us avoid the tedious work of counting columns and rows.
Let’s use the csv module.
import csv

with open('caraccident_data.csv', 'r') as csvfile:
    csv_reader = csv.reader(csvfile)
    cnt = 0
    for row in csv_reader:
        if cnt == 0:
            print(f'Column names are {", ".join(row)}')
        else:
            print(f'Values are {", ".join(row)}')
        cnt += 1
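Since csv.reader accepts any iterable of lines, we can sketch its behavior without the data file by feeding it an io.StringIO (the column names and values here are hypothetical, modeled on the example above):

```python
import csv
import io

# StringIO stands in for the file; names and values are made up
text = 'type,total\naccident,1143175\ndeath toll,4185\n'

rows = []
for row in csv.reader(io.StringIO(text)):
    rows.append(row)

print(rows[0])  # ['type', 'total'] -- the header row
print(rows[1])  # ['accident', '1143175']
```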
Rather than dealing with a list of individual string elements, you can read CSV data directly into dictionaries. The csv module provides a class, DictReader, for organizing the data as dictionary variables.
import csv

# Option 1
f = open('caraccident_data.csv', 'r')
csv_reader = csv.DictReader(f)
print(csv_reader.fieldnames)
for row in csv_reader:
    print(f'Total number of {row["type"]} is {row["total"]}')
f.close()

# Option 2
with open('caraccident_data.csv', 'r') as csvfile:
    csv_reader = csv.DictReader(csvfile)
    for row in csv_reader:
        print(f'Total number of {row["type"]} is {row["total"]}')
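Likewise, DictReader can be sketched without the data file: it takes the first row as the dictionary keys (column names again hypothetical, modeled on the example):

```python
import csv
import io

# StringIO stands in for the file; names and values are made up
text = 'type,total\naccident,1143175\ndeath toll,4185\n'

totals = {}
for row in csv.DictReader(io.StringIO(text)):
    # the header row supplies the keys 'type' and 'total'
    totals[row['type']] = int(row['total'])

print(totals)  # {'accident': 1143175, 'death toll': 4185}
```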
Reading the CSV-type data using Pandas
pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. This package makes it even easier to handle CSV-type data files.
import pandas as pd
data = pd.read_csv('caraccident_data.csv')
print(data)
In IPython,
In [4]: whos
Variable Type Data/Info
---------------------------------
data DataFrame type tota<...>\n\n[6 rows x 10 columns]
pd module <module 'pandas' from '/U<...>ages/pandas/__init__.py'>
In [5]: print(data)
type total less than 20 ... 61-64 greater than 65 unknown
0 accident 1143175 44501 ... 62373 115674 75485
1 death toll 4185 125 ... 313 838 2
2 casualty toll 1803325 65622 ... 93589 172330 148784
3 major casualty 96810 3605 ... 7353 14565 1687
4 minor casualty 581589 19729 ... 35310 55037 50046
5 other casualty 1124926 42288 ... 50926 102728 97051
[6 rows x 10 columns]
To see the name of columns, you can try
data.columns
There are two different ways to print out a column.
data.type
data["type"]
This is how you can grab the rows.
data.head(2) # prints the first 2 rows
data.tail(1) # prints the last row
To access a single element, you can combine the name of the column and the number of the row.
print(data.type[3], data.total[3])
Using indexing, we can get more than one row.
print(data.total[1:3])
Selecting elements at a specific location is very similar to extracting elements from an array.
There are two ways of doing it.
When using the loc accessor, you specify the names of the columns.
data.loc[:2, ['61-64','greater than 65']]
Or you can use iloc with the integer locations of the columns.
data.iloc[:1, 7:]
If you want to switch columns and rows,
data.T
The pandas package provides simple statistical tools attached to the variable.
data.describe()
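The pandas operations above can be tried without the data file by building a small stand-in DataFrame (the column names and values here are just illustrative samples, not the real statistics):

```python
import pandas as pd

# A small stand-in for the accident DataFrame; values are made up
data = pd.DataFrame({'type': ['accident', 'death toll', 'casualty toll'],
                     'total': [1143175, 4185, 1803325]})

print(list(data.columns))       # ['type', 'total']
print(data.total[1])            # 4185 -- column name + row number
print(data.loc[:1, ['total']])  # first two rows of the 'total' column
print(data.T)                   # rows and columns switched
```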
Exercise: Extract statistical information from the dataset.
One current policy debate is whether driving permits for senior drivers should be regulated for safety reasons. Let’s evaluate the 2017 car accident data and see whether there is statistical support for such a regulation. Possible questions to evaluate are:
- Are there more car accidents caused by drivers older than 60 than by drivers aged 20-40?
- Do accidents caused by drivers older than 60 tend to be major?