Without solutions
Python libraries for working with data¶
# Copyright (c) Thalesians Ltd, 2019-2023. All rights reserved.
# Copyright (c) Paul Alexander Bilokon, 2019-2023. All rights reserved.
# Author: Paul Alexander Bilokon <[email protected]>
# This version: 2.0 (2023.11.17)
# Previous versions: 1.0 (2019.03.25)
# Email: [email protected]
Motivation¶
Two third-party (but de facto standard) Python libraries play a particularly important role in Data Analysis: pandas and NumPy.
The data types defined in pandas — DataFrame and Series — enable the data scientist to work effectively with tables. pandas provides some of the functionality that we find in SQL databases and spreadsheets. Since much of the data that we deal with comes in the form of tables, pandas is extremely useful.
NumPy provides a lower-level type — the multidimensional array. This data type is used as an underlying implementation by pandas DataFrames and Series. NumPy arrays are used wherever bulk operations on numbers are needed in Python, which is what much of Data Science is about. NumPy arrays can be seen as a Python implementation of the key linear algebra object — the matrix.
Objectives¶
- To introduce pandas DataFrames and Series.
- To show how to quickly inspect a pandas DataFrame.
- To show how pandas DataFrames can be indexed using loc and iloc.
- To show how a DataFrame's columns can be indexed.
- To show how to iterate through columns of a DataFrame.
- To show how a DataFrame can be effectively summarised.
- To show how to add a column to a DataFrame.
- To show how to overwrite a column in a DataFrame.
- To show how to rearrange columns in a DataFrame.
- To show how to delete a column from a DataFrame.
- To show how to apply a function to a column in a DataFrame.
- To show how to filter data in a DataFrame based on some boolean expression.
- To show how to deal with missing data (NaNs).
- To show how to use groupby.
- To show how to append to a DataFrame.
- To show how to join on a DataFrame.
- To show how to get at the NumPy arrays behind the pandas DataFrame and Series.
- To show how to define one-dimensional (flat) NumPy arrays.
- To show how to use np.arange, np.linspace, and np.logspace.
- To explain the reasons to prefer NumPy arrays over standard Python data structures, such as lists.
- To show how to define two-dimensional NumPy arrays, thus implementing matrices.
- To show how to create NumPy arrays using np.zeros, np.ones, np.full, np.empty, np.tile.
- To show how to use 32-bit floating point numbers (float32) instead of 64-bit floating point numbers in NumPy arrays.
- To show how to generate random matrices.
- To show how to reshape matrices.
- To show how to multiply matrices by scalars.
- To show how to add matrices together.
- To show how to multiply matrices together.
- To show how to mix arrays and scalars in arithmetic operations; to explain broadcasting.
- To show how to transpose matrices.
- To show how to invert matrices.
- To show how to create identity matrices.
- To show how to stack NumPy arrays horizontally.
- To show how to stack NumPy arrays vertically.
- To show how to index NumPy arrays, including boolean indices.
- To show how to compare NumPy arrays.
- To show how to use some useful functions: np.cumsum, np.cumprod, np.min, np.max, np.argmin, np.argmax, np.mean, np.var, np.std.
- To show how to apply a function to each row/column in a NumPy array.
- To show how to use NumPy array flags.
- To show how to make NumPy arrays immutable (kind of).
- To show how to find out more about the configuration of the NumPy library, e.g. which BLAS it is using.
pandas DataFrames and Series¶
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. We can import it with
import pandas as pd
The key data type provided by pandas is DataFrame. Let us create one:
df = pd.DataFrame(
{
'transaction date': [2012.917, 2012.917, 2013.583, 2013.500, 2012.833, 2012.667, 2012.667, 2013.417, 2013.500, 2013.417, 2013.083, 2013.333, 2012.917, 2012.667, 2013.500],
'distance to the nearest MRT station': [84.87882, 306.59470, 561.98450, 561.98450, 390.56840, 2175.03000, 623.47310, 287.60250, 5512.03800, 1783.18000, 405.21340, 90.45606, 492.23130, 2469.64500, 1164.83800],
'number of convenience stores': [10, 9, 5, 5, 5, 3, 7, 6, 1, 3, 1, 9, 5, 4, 4],
'latitude': [24.98298, 24.98034, 24.98746, 24.98746, 24.97937, 24.96305, 24.97933, 24.98042, 24.95095, 24.96731, 24.97349, 24.97433, 24.96515, 24.96108, 24.99156],
'longitude': [121.54024, 121.53951, 121.54391, 121.54391, 121.54245, 121.51254, 121.53642, 121.54228, 121.48458, 121.51486, 121.53372, 121.54310, 121.53737, 121.51046, 121.53406],
'house price per unit area': [37.9, 42.2, 47.3, 54.8, 43.1, 32.1, 40.3, 46.7, 18.8, 22.1, 41.4, 58.1, 39.3, 23.8, 34.3]
},
columns=[
'transaction date',
'distance to the nearest MRT station',
'number of convenience stores',
'latitude',
'longitude',
'house price per unit area'
])
pandas DataFrames in many ways resemble SQL database tables and worksheets in spreadsheet applications.
How to quickly inspect a DataFrame¶
Let's examine our newly created DataFrame:
df
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
0 | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
1 | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
2 | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
3 | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
4 | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
5 | 2012.667 | 2175.03000 | 3 | 24.96305 | 121.51254 | 32.1 |
6 | 2012.667 | 623.47310 | 7 | 24.97933 | 121.53642 | 40.3 |
7 | 2013.417 | 287.60250 | 6 | 24.98042 | 121.54228 | 46.7 |
8 | 2013.500 | 5512.03800 | 1 | 24.95095 | 121.48458 | 18.8 |
9 | 2013.417 | 1783.18000 | 3 | 24.96731 | 121.51486 | 22.1 |
10 | 2013.083 | 405.21340 | 1 | 24.97349 | 121.53372 | 41.4 |
11 | 2013.333 | 90.45606 | 9 | 24.97433 | 121.54310 | 58.1 |
12 | 2012.917 | 492.23130 | 5 | 24.96515 | 121.53737 | 39.3 |
13 | 2012.667 | 2469.64500 | 4 | 24.96108 | 121.51046 | 23.8 |
14 | 2013.500 | 1164.83800 | 4 | 24.99156 | 121.53406 | 34.3 |
In practice, we may be dealing with DataFrames containing thousands of rows of data (not fifteen, as in our example). So instead of looking at the entire DataFrame we may look at its head:
df.head()
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
0 | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
1 | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
2 | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
3 | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
4 | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
We may wish to look at more rows from the DataFrame's head:
df.head(10)
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
0 | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
1 | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
2 | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
3 | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
4 | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
5 | 2012.667 | 2175.03000 | 3 | 24.96305 | 121.51254 | 32.1 |
6 | 2012.667 | 623.47310 | 7 | 24.97933 | 121.53642 | 40.3 |
7 | 2013.417 | 287.60250 | 6 | 24.98042 | 121.54228 | 46.7 |
8 | 2013.500 | 5512.03800 | 1 | 24.95095 | 121.48458 | 18.8 |
9 | 2013.417 | 1783.18000 | 3 | 24.96731 | 121.51486 | 22.1 |
Or examine its tail:
df.tail()
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
10 | 2013.083 | 405.21340 | 1 | 24.97349 | 121.53372 | 41.4 |
11 | 2013.333 | 90.45606 | 9 | 24.97433 | 121.54310 | 58.1 |
12 | 2012.917 | 492.23130 | 5 | 24.96515 | 121.53737 | 39.3 |
13 | 2012.667 | 2469.64500 | 4 | 24.96108 | 121.51046 | 23.8 |
14 | 2013.500 | 1164.83800 | 4 | 24.99156 | 121.53406 | 34.3 |
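Besides head and tail, a DataFrame also has a sample method, which returns a given number of randomly selected rows. This is handy for spot-checking a large DataFrame (the rows returned will, of course, vary from run to run):
df.sample(3)   # three randomly chosen rows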
The index of the DataFrame, loc and iloc¶
Notice the numbers in bold on the left. These constitute the index of the DataFrame:
list(df.index)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
We can access the list of columns of the data frame through
list(df.columns)
['transaction date', 'distance to the nearest MRT station', 'number of convenience stores', 'latitude', 'longitude', 'house price per unit area']
Thus the index indexes the rows, whereas the column names index the columns. We can access individual rows of the DataFrame through
df.loc[3]
transaction date                       2013.50000
distance to the nearest MRT station     561.98450
number of convenience stores              5.00000
latitude                                 24.98746
longitude                               121.54391
house price per unit area                54.80000
Name: 3, dtype: float64
Notice that the index of the data frame doesn't have to be formed from integers starting with zero, as in our case:
list(df.index)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
We could have chosen to index the rows with strings, 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O'. Since the DataFrame is a mutable object, i.e. it can be changed after construction, we can replace its index accordingly:
df.index = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O']
df.head()
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
Now we can index the rows with
df.loc['D']
transaction date                       2013.50000
distance to the nearest MRT station     561.98450
number of convenience stores              5.00000
latitude                                 24.98746
longitude                               121.54391
house price per unit area                54.80000
Name: D, dtype: float64
Whatever the index of the DataFrame, we can also refer to the rows through their integer offsets from the top, using iloc instead of loc:
df.iloc[3]
transaction date                       2013.50000
distance to the nearest MRT station     561.98450
number of convenience stores              5.00000
latitude                                 24.98746
longitude                               121.54391
house price per unit area                54.80000
Name: D, dtype: float64
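Both loc and iloc also accept slices. One caveat to bear in mind: loc slices are inclusive of the end label, whereas iloc slices, like Python's own, exclude the end position:
df.loc['B':'D']   # rows B, C and D: the end label is included
df.iloc[0:3]      # rows at offsets 0, 1 and 2: position 3 is excluded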
Whereas the type of the entire object is
type(df)
pandas.core.frame.DataFrame
the type of df.loc['D'] (equivalently, of df.iloc[3]) is
type(df.loc['D'])
pandas.core.series.Series
Series is the second most important type defined in the pandas package (after DataFrame).
Indexing columns¶
As we said above, the list of columns in the dataframe is accessible through
list(df.columns)
['transaction date', 'distance to the nearest MRT station', 'number of convenience stores', 'latitude', 'longitude', 'house price per unit area']
Individual columns can be indexed using
df['latitude']
A    24.98298
B    24.98034
C    24.98746
D    24.98746
E    24.97937
F    24.96305
G    24.97933
H    24.98042
I    24.95095
J    24.96731
K    24.97349
L    24.97433
M    24.96515
N    24.96108
O    24.99156
Name: latitude, dtype: float64
The type of the object thus obtained is
type(df['latitude'])
pandas.core.series.Series
You can also access multiple columns at once using
df[['latitude', 'longitude']]
latitude | longitude | |
---|---|---|
A | 24.98298 | 121.54024 |
B | 24.98034 | 121.53951 |
C | 24.98746 | 121.54391 |
D | 24.98746 | 121.54391 |
E | 24.97937 | 121.54245 |
F | 24.96305 | 121.51254 |
G | 24.97933 | 121.53642 |
H | 24.98042 | 121.54228 |
I | 24.95095 | 121.48458 |
J | 24.96731 | 121.51486 |
K | 24.97349 | 121.53372 |
L | 24.97433 | 121.54310 |
M | 24.96515 | 121.53737 |
N | 24.96108 | 121.51046 |
O | 24.99156 | 121.53406 |
The type of the resulting object is DataFrame, rather than Series:
type(df[['latitude', 'longitude']])
pandas.core.frame.DataFrame
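Note, in passing, that indexing with a one-element list of column names also yields a DataFrame rather than a Series:
type(df[['latitude']])   # pandas.core.frame.DataFrame: note the double brackets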
Iterating through columns¶
Let us demonstrate how we can iterate through the columns of a DataFrame. In this example we shall store the first (0th) value from each column in a dictionary:
firsts = {}
for c in df.columns:
firsts[c] = df[c].iloc[0]
firsts
{'transaction date': 2012.917, 'distance to the nearest MRT station': 84.87882, 'number of convenience stores': 10, 'latitude': 24.98298, 'longitude': 121.54024, 'house price per unit area': 37.9}
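As a side note, pandas also provides the items method, which yields (column name, Series) pairs directly; the same dictionary can be built with a comprehension:
firsts = {c: s.iloc[0] for c, s in df.items()}   # equivalent to the loop above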
Summarising the DataFrame¶
Another quick way to obtain a high-level view of a (potentially very large) DataFrame is with
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 15 entries, A to O
Data columns (total 6 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   transaction date                     15 non-null     float64
 1   distance to the nearest MRT station  15 non-null     float64
 2   number of convenience stores         15 non-null     int64
 3   latitude                             15 non-null     float64
 4   longitude                            15 non-null     float64
 5   house price per unit area            15 non-null     float64
dtypes: float64(5), int64(1)
memory usage: 1.4+ KB
whereas the following will present summary statistics for each column:
df.describe()
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
count | 15.000000 | 15.000000 | 15.000000 | 15.000000 | 15.000000 | 15.000000 |
mean | 2013.127867 | 1127.314552 | 5.133333 | 24.974952 | 121.530627 | 38.813333 |
std | 0.347586 | 1425.732372 | 2.722044 | 0.011395 | 0.017235 | 11.251341 |
min | 2012.667000 | 84.878820 | 1.000000 | 24.950950 | 121.484580 | 18.800000 |
25% | 2012.875000 | 348.581550 | 3.500000 | 24.966230 | 121.524290 | 33.200000 |
50% | 2013.083000 | 561.984500 | 5.000000 | 24.979330 | 121.537370 | 40.300000 |
75% | 2013.458500 | 1474.009000 | 6.500000 | 24.981700 | 121.542365 | 44.900000 |
max | 2013.583000 | 5512.038000 | 10.000000 | 24.991560 | 121.543910 | 58.100000 |
Adding a column¶
To add a column, assuming the column 'house age' does not yet exist:
df['house age'] = [32.0, 19.5, 13.3, 13.3, 5.0, 7.1, 34.5, 20.3, 31.7, 17.9, 34.8, 6.3, 13.0, 20.4, 13.2]
df.head()
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | house age | |
---|---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 | 32.0 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 | 19.5 |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 | 13.3 |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 | 13.3 |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 | 5.0 |
Overwriting a column¶
Similarly, if the column 'house age' already exists, the same assignment syntax overwrites it:
df['house age'] = [320., 195., 133., 133., 50., 71., 345., 203., 317., 179., 348., 63., 130., 204., 132.]
df.head()
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | house age | |
---|---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 | 320.0 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 | 195.0 |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 | 133.0 |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 | 133.0 |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 | 50.0 |
Rearranging the columns¶
To rearrange the columns:
df = df[[
'transaction date', 'house age',
'distance to the nearest MRT station', 'number of convenience stores',
'latitude', 'longitude',
'house price per unit area'
]]
df.head()
transaction date | house age | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|---|
A | 2012.917 | 320.0 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 195.0 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | 133.0 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 133.0 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 50.0 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
Deleting a column¶
del df['house age']
df.head()
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
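Note that del removes the column from df in place. If we would rather leave df intact, we could instead use the drop method, which returns a new DataFrame; before the deletion above, for instance, we could have written:
df_without_age = df.drop(columns=['house age'])   # df itself is left unchanged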
Applying a function to a column¶
import math
df['distance to the nearest MRT station'].apply(math.sqrt)
A     9.212970
B    17.509846
C    23.706212
D    23.706212
E    19.762803
F    46.637217
G    24.969443
H    16.958847
I    74.243101
J    42.227716
K    20.129913
L     9.510839
M    22.186286
N    49.695523
O    34.129723
Name: distance to the nearest MRT station, dtype: float64
Notice that the original column hasn’t been overwritten:
df.head()
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
To overwrite it,
df['distance to the nearest MRT station'] = df['distance to the nearest MRT station'].apply(math.sqrt)
C:\Users\paul\AppData\Local\Temp\ipykernel_22944\4176019659.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['distance to the nearest MRT station'] = df['distance to the nearest MRT station'].apply(math.sqrt)
df.head()
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 9.212970 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 17.509846 | 9 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | 23.706212 | 5 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 23.706212 | 5 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 19.762803 | 5 | 24.97937 | 121.54245 | 43.1 |
To reverse what we have just done:
df['distance to the nearest MRT station'] = df['distance to the nearest MRT station'].apply(lambda x: x * x)
C:\Users\paul\AppData\Local\Temp\ipykernel_22944\2618952524.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['distance to the nearest MRT station'] = df['distance to the nearest MRT station'].apply(lambda x: x * x)
df.head()
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
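As an aside, for mathematical functions such as the square root, NumPy (introduced later in this notebook) provides vectorised equivalents that operate on a whole column at once and are typically faster than apply:
import numpy as np
np.sqrt(df['distance to the nearest MRT station'])   # a Series of square roots; df is unchanged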
Filtering data¶
Suppose that we want to consider only those rows where the value in the column 'number of convenience stores' is greater than or equal to 7. We can then index df by
df['number of convenience stores'] >= 7
A     True
B     True
C    False
D    False
E    False
F    False
G     True
H    False
I    False
J    False
K    False
L     True
M    False
N    False
O    False
Name: number of convenience stores, dtype: bool
obtaining, as a result,
df[df['number of convenience stores'] >= 7]
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
G | 2012.667 | 623.47310 | 7 | 24.97933 | 121.53642 | 40.3 |
L | 2013.333 | 90.45606 | 9 | 24.97433 | 121.54310 | 58.1 |
If we want to consider those rows where the 'number of convenience stores' is greater than or equal to seven or the distance to the nearest MRT station is less than 500, we index the dataframe by
(df['number of convenience stores'] >= 7) | (df['distance to the nearest MRT station'] < 500)
A     True
B     True
C    False
D    False
E     True
F    False
G     True
H     True
I    False
J    False
K     True
L     True
M     True
N    False
O    False
dtype: bool
The result being
df[(df['number of convenience stores'] >= 7) | (df['distance to the nearest MRT station'] < 500)]
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
G | 2012.667 | 623.47310 | 7 | 24.97933 | 121.53642 | 40.3 |
H | 2013.417 | 287.60250 | 6 | 24.98042 | 121.54228 | 46.7 |
K | 2013.083 | 405.21340 | 1 | 24.97349 | 121.53372 | 41.4 |
L | 2013.333 | 90.45606 | 9 | 24.97433 | 121.54310 | 58.1 |
M | 2012.917 | 492.23130 | 5 | 24.96515 | 121.53737 | 39.3 |
If instead we want to consider those rows where the 'number of convenience stores' is greater than or equal to seven and the distance to the nearest MRT station is less than 500, we index the dataframe by
(df['number of convenience stores'] >= 7) & (df['distance to the nearest MRT station'] < 500)
A     True
B     True
C    False
D    False
E    False
F    False
G    False
H    False
I    False
J    False
K    False
L     True
M    False
N    False
O    False
dtype: bool
The result being
df[(df['number of convenience stores'] >= 7) & (df['distance to the nearest MRT station'] < 500)]
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
L | 2013.333 | 90.45606 | 9 | 24.97433 | 121.54310 | 58.1 |
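As an alternative to boolean-mask indexing, pandas offers the query method; note that column names containing spaces must be wrapped in backticks:
df.query('`number of convenience stores` >= 7 and `distance to the nearest MRT station` < 500')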
Suppose that we assign the result of the conjunctive filter above to df1 and then attempt to modify one of its values:
df1 = df[(df['number of convenience stores'] >= 7) & (df['distance to the nearest MRT station'] < 500)]
df1.loc['L', 'number of convenience stores'] = 100
We get a SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. The slice (the view of a subset of the original DataFrame) has changed:
df1
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
L | 2013.333 | 90.45606 | 100 | 24.97433 | 121.54310 | 58.1 |
Whereas the original has not changed:
df
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
F | 2012.667 | 2175.03000 | 3 | 24.96305 | 121.51254 | 32.1 |
G | 2012.667 | 623.47310 | 7 | 24.97933 | 121.53642 | 40.3 |
H | 2013.417 | 287.60250 | 6 | 24.98042 | 121.54228 | 46.7 |
I | 2013.500 | 5512.03800 | 1 | 24.95095 | 121.48458 | 18.8 |
J | 2013.417 | 1783.18000 | 3 | 24.96731 | 121.51486 | 22.1 |
K | 2013.083 | 405.21340 | 1 | 24.97349 | 121.53372 | 41.4 |
L | 2013.333 | 90.45606 | 9 | 24.97433 | 121.54310 | 58.1 |
M | 2012.917 | 492.23130 | 5 | 24.96515 | 121.53737 | 39.3 |
N | 2012.667 | 2469.64500 | 4 | 24.96108 | 121.51046 | 23.8 |
O | 2013.500 | 1164.83800 | 4 | 24.99156 | 121.53406 | 34.3 |
For more information on this behaviour, read http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
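A common way to avoid this warning, and to make the independence of df1 from df explicit, is to take an explicit copy of the slice before modifying it:
df1 = df[(df['number of convenience stores'] >= 7) & (df['distance to the nearest MRT station'] < 500)].copy()
df1.loc['L', 'number of convenience stores'] = 100   # no warning: df1 now owns its own data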
Dealing with missing data¶
Let us examine what happens if we have missing data in the DataFrame.
To this end, let us take a copy
df_copy = df.copy()
and modify it like so:
import numpy as np
df_copy.loc['C', 'distance to the nearest MRT station'] = np.nan
df_copy.loc['E', 'number of convenience stores'] = np.nan
df_copy
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10.0 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9.0 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | NaN | 5.0 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 561.98450 | 5.0 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 390.56840 | NaN | 24.97937 | 121.54245 | 43.1 |
F | 2012.667 | 2175.03000 | 3.0 | 24.96305 | 121.51254 | 32.1 |
G | 2012.667 | 623.47310 | 7.0 | 24.97933 | 121.53642 | 40.3 |
H | 2013.417 | 287.60250 | 6.0 | 24.98042 | 121.54228 | 46.7 |
I | 2013.500 | 5512.03800 | 1.0 | 24.95095 | 121.48458 | 18.8 |
J | 2013.417 | 1783.18000 | 3.0 | 24.96731 | 121.51486 | 22.1 |
K | 2013.083 | 405.21340 | 1.0 | 24.97349 | 121.53372 | 41.4 |
L | 2013.333 | 90.45606 | 9.0 | 24.97433 | 121.54310 | 58.1 |
M | 2012.917 | 492.23130 | 5.0 | 24.96515 | 121.53737 | 39.3 |
N | 2012.667 | 2469.64500 | 4.0 | 24.96108 | 121.51046 | 23.8 |
O | 2013.500 | 1164.83800 | 4.0 | 24.99156 | 121.53406 | 34.3 |
The special np.nan value indicates that the data is missing or invalid (NaN stands for "not a number"). NaNs often result from numerical calculations. Their presence may interfere with further numerical work. One quick way to address the issue of NaNs is by dropping (removing) them:
df_copy.dropna()
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10.0 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9.0 | 24.98034 | 121.53951 | 42.2 |
D | 2013.500 | 561.98450 | 5.0 | 24.98746 | 121.54391 | 54.8 |
F | 2012.667 | 2175.03000 | 3.0 | 24.96305 | 121.51254 | 32.1 |
G | 2012.667 | 623.47310 | 7.0 | 24.97933 | 121.53642 | 40.3 |
H | 2013.417 | 287.60250 | 6.0 | 24.98042 | 121.54228 | 46.7 |
I | 2013.500 | 5512.03800 | 1.0 | 24.95095 | 121.48458 | 18.8 |
J | 2013.417 | 1783.18000 | 3.0 | 24.96731 | 121.51486 | 22.1 |
K | 2013.083 | 405.21340 | 1.0 | 24.97349 | 121.53372 | 41.4 |
L | 2013.333 | 90.45606 | 9.0 | 24.97433 | 121.54310 | 58.1 |
M | 2012.917 | 492.23130 | 5.0 | 24.96515 | 121.53737 | 39.3 |
N | 2012.667 | 2469.64500 | 4.0 | 24.96108 | 121.51046 | 23.8 |
O | 2013.500 | 1164.83800 | 4.0 | 24.99156 | 121.53406 | 34.3 |
This has returned a copy of the DataFrame with all rows containing NaNs removed. df_copy itself has not been changed:
df_copy
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10.0 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9.0 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | NaN | 5.0 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 561.98450 | 5.0 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 390.56840 | NaN | 24.97937 | 121.54245 | 43.1 |
F | 2012.667 | 2175.03000 | 3.0 | 24.96305 | 121.51254 | 32.1 |
G | 2012.667 | 623.47310 | 7.0 | 24.97933 | 121.53642 | 40.3 |
H | 2013.417 | 287.60250 | 6.0 | 24.98042 | 121.54228 | 46.7 |
I | 2013.500 | 5512.03800 | 1.0 | 24.95095 | 121.48458 | 18.8 |
J | 2013.417 | 1783.18000 | 3.0 | 24.96731 | 121.51486 | 22.1 |
K | 2013.083 | 405.21340 | 1.0 | 24.97349 | 121.53372 | 41.4 |
L | 2013.333 | 90.45606 | 9.0 | 24.97433 | 121.54310 | 58.1 |
M | 2012.917 | 492.23130 | 5.0 | 24.96515 | 121.53737 | 39.3 |
N | 2012.667 | 2469.64500 | 4.0 | 24.96108 | 121.51046 | 23.8 |
O | 2013.500 | 1164.83800 | 4.0 | 24.99156 | 121.53406 | 34.3 |
But we could overwrite it with the result of dropna():
df_copy = df_copy.dropna()
df_copy
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10.0 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9.0 | 24.98034 | 121.53951 | 42.2 |
D | 2013.500 | 561.98450 | 5.0 | 24.98746 | 121.54391 | 54.8 |
F | 2012.667 | 2175.03000 | 3.0 | 24.96305 | 121.51254 | 32.1 |
G | 2012.667 | 623.47310 | 7.0 | 24.97933 | 121.53642 | 40.3 |
H | 2013.417 | 287.60250 | 6.0 | 24.98042 | 121.54228 | 46.7 |
I | 2013.500 | 5512.03800 | 1.0 | 24.95095 | 121.48458 | 18.8 |
J | 2013.417 | 1783.18000 | 3.0 | 24.96731 | 121.51486 | 22.1 |
K | 2013.083 | 405.21340 | 1.0 | 24.97349 | 121.53372 | 41.4 |
L | 2013.333 | 90.45606 | 9.0 | 24.97433 | 121.54310 | 58.1 |
M | 2012.917 | 492.23130 | 5.0 | 24.96515 | 121.53737 | 39.3 |
N | 2012.667 | 2469.64500 | 4.0 | 24.96108 | 121.51046 | 23.8 |
O | 2013.500 | 1164.83800 | 4.0 | 24.99156 | 121.53406 | 34.3 |
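Dropping rows is not the only remedy: the fillna method replaces NaNs with a value of our choosing instead. Had we not dropped them above, we could, for instance, have written:
df_copy.fillna(0.)               # replace every NaN with zero
df_copy.fillna(df_copy.mean())   # replace each NaN with its column's mean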
groupby¶
It may be useful to summarise the pandas DataFrame for a given value of a particular column. This can be achieved using groupby. For example, we may wish to look at the mean of each column for each possible value of 'number of convenience stores':
df.groupby('number of convenience stores').mean()
transaction date | distance to the nearest MRT station | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|
number of convenience stores | |||||
1 | 2013.29150 | 2958.625700 | 24.962220 | 121.509150 | 30.100 |
3 | 2013.04200 | 1979.105000 | 24.965180 | 121.513700 | 27.100 |
4 | 2013.08350 | 1817.241500 | 24.976320 | 121.522260 | 29.050 |
5 | 2013.20825 | 501.692175 | 24.979860 | 121.541910 | 46.125 |
6 | 2013.41700 | 287.602500 | 24.980420 | 121.542280 | 46.700 |
7 | 2012.66700 | 623.473100 | 24.979330 | 121.536420 | 40.300 |
9 | 2013.12500 | 198.525380 | 24.977335 | 121.541305 | 50.150 |
10 | 2012.91700 | 84.878820 | 24.982980 | 121.540240 | 37.900 |
It looks like the house price per unit area may increase with the number of convenience stores, although with so few data points we cannot be particularly confident in this conclusion; the distance to the nearest MRT station, on the other hand, does seem to decrease as the number of convenience stores increases.
Instead of the mean, we may consider the minimum:
df.groupby('number of convenience stores').min()
transaction date | distance to the nearest MRT station | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|
number of convenience stores | |||||
1 | 2013.083 | 405.21340 | 24.95095 | 121.48458 | 18.8 |
3 | 2012.667 | 1783.18000 | 24.96305 | 121.51254 | 22.1 |
4 | 2012.667 | 1164.83800 | 24.96108 | 121.51046 | 23.8 |
5 | 2012.833 | 390.56840 | 24.96515 | 121.53737 | 39.3 |
6 | 2013.417 | 287.60250 | 24.98042 | 121.54228 | 46.7 |
7 | 2012.667 | 623.47310 | 24.97933 | 121.53642 | 40.3 |
9 | 2012.917 | 90.45606 | 24.97433 | 121.53951 | 42.2 |
10 | 2012.917 | 84.87882 | 24.98298 | 121.54024 | 37.9 |
Or maximum:
df.groupby('number of convenience stores').max()
transaction date | distance to the nearest MRT station | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|
number of convenience stores | |||||
1 | 2013.500 | 5512.03800 | 24.97349 | 121.53372 | 41.4 |
3 | 2013.417 | 2175.03000 | 24.96731 | 121.51486 | 32.1 |
4 | 2013.500 | 2469.64500 | 24.99156 | 121.53406 | 34.3 |
5 | 2013.583 | 561.98450 | 24.98746 | 121.54391 | 54.8 |
6 | 2013.417 | 287.60250 | 24.98042 | 121.54228 | 46.7 |
7 | 2012.667 | 623.47310 | 24.97933 | 121.53642 | 40.3 |
9 | 2013.333 | 306.59470 | 24.98034 | 121.54310 | 58.1 |
10 | 2012.917 | 84.87882 | 24.98298 | 121.54024 | 37.9 |
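Several summary statistics can also be computed in a single call using agg:
df.groupby('number of convenience stores')['house price per unit area'].agg(['mean', 'min', 'max'])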
Appending to a DataFrame¶
Let us again consider our DataFrame:
df
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
F | 2012.667 | 2175.03000 | 3 | 24.96305 | 121.51254 | 32.1 |
G | 2012.667 | 623.47310 | 7 | 24.97933 | 121.53642 | 40.3 |
H | 2013.417 | 287.60250 | 6 | 24.98042 | 121.54228 | 46.7 |
I | 2013.500 | 5512.03800 | 1 | 24.95095 | 121.48458 | 18.8 |
J | 2013.417 | 1783.18000 | 3 | 24.96731 | 121.51486 | 22.1 |
K | 2013.083 | 405.21340 | 1 | 24.97349 | 121.53372 | 41.4 |
L | 2013.333 | 90.45606 | 9 | 24.97433 | 121.54310 | 58.1 |
M | 2012.917 | 492.23130 | 5 | 24.96515 | 121.53737 | 39.3 |
N | 2012.667 | 2469.64500 | 4 | 24.96108 | 121.51046 | 23.8 |
O | 2013.500 | 1164.83800 | 4 | 24.99156 | 121.53406 | 34.3 |
Suppose that we have another DataFrame:
df1 = pd.DataFrame({
'transaction date': [2013.417, 2013.083, 2012.917],
'distance to the nearest MRT station': [378.90278, 90.23891, 489.32891],
'number of convenience stores': [5, 6, 7],
'latitude': [24.97432, 24.97435, 24.97428],
'longitude': [121.53290, 121.53290, 121.53390],
'house price per unit area': [33.2, 82.0, 32.1]
},
index = ['P', 'Q', 'R'],
columns = ['transaction date', 'distance to the nearest MRT station', 'number of convenience stores',
'latitude', 'longitude', 'house price per unit area']
)
df1
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
P | 2013.417 | 378.90278 | 5 | 24.97432 | 121.5329 | 33.2 |
Q | 2013.083 | 90.23891 | 6 | 24.97435 | 121.5329 | 82.0 |
R | 2012.917 | 489.32891 | 7 | 24.97428 | 121.5339 | 32.1 |
We can append df1 to the end of df using
pd.concat([df, df1])
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
F | 2012.667 | 2175.03000 | 3 | 24.96305 | 121.51254 | 32.1 |
G | 2012.667 | 623.47310 | 7 | 24.97933 | 121.53642 | 40.3 |
H | 2013.417 | 287.60250 | 6 | 24.98042 | 121.54228 | 46.7 |
I | 2013.500 | 5512.03800 | 1 | 24.95095 | 121.48458 | 18.8 |
J | 2013.417 | 1783.18000 | 3 | 24.96731 | 121.51486 | 22.1 |
K | 2013.083 | 405.21340 | 1 | 24.97349 | 121.53372 | 41.4 |
L | 2013.333 | 90.45606 | 9 | 24.97433 | 121.54310 | 58.1 |
M | 2012.917 | 492.23130 | 5 | 24.96515 | 121.53737 | 39.3 |
N | 2012.667 | 2469.64500 | 4 | 24.96108 | 121.51046 | 23.8 |
O | 2013.500 | 1164.83800 | 4 | 24.99156 | 121.53406 | 34.3 |
P | 2013.417 | 378.90278 | 5 | 24.97432 | 121.53290 | 33.2 |
Q | 2013.083 | 90.23891 | 6 | 24.97435 | 121.53290 | 82.0 |
R | 2012.917 | 489.32891 | 7 | 24.97428 | 121.53390 | 32.1 |
Joining on a DataFrame¶
Let us again consider our DataFrame:
df
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | |
---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 |
F | 2012.667 | 2175.03000 | 3 | 24.96305 | 121.51254 | 32.1 |
G | 2012.667 | 623.47310 | 7 | 24.97933 | 121.53642 | 40.3 |
H | 2013.417 | 287.60250 | 6 | 24.98042 | 121.54228 | 46.7 |
I | 2013.500 | 5512.03800 | 1 | 24.95095 | 121.48458 | 18.8 |
J | 2013.417 | 1783.18000 | 3 | 24.96731 | 121.51486 | 22.1 |
K | 2013.083 | 405.21340 | 1 | 24.97349 | 121.53372 | 41.4 |
L | 2013.333 | 90.45606 | 9 | 24.97433 | 121.54310 | 58.1 |
M | 2012.917 | 492.23130 | 5 | 24.96515 | 121.53737 | 39.3 |
N | 2012.667 | 2469.64500 | 4 | 24.96108 | 121.51046 | 23.8 |
O | 2013.500 | 1164.83800 | 4 | 24.99156 | 121.53406 | 34.3 |
Suppose we have another DataFrame:
df_comments = pd.DataFrame({
'comments': ['data to be validated', 'to be confirmed'],
},
index = ['H', 'J']
)
df_comments
comments | |
---|---|
H | data to be validated |
J | to be confirmed |
We can join df_comments onto df based on the matching indices:
joined_df = df.join(df_comments)
joined_df
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | comments | |
---|---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 | NaN |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 | NaN |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 | NaN |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 | NaN |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 | NaN |
F | 2012.667 | 2175.03000 | 3 | 24.96305 | 121.51254 | 32.1 | NaN |
G | 2012.667 | 623.47310 | 7 | 24.97933 | 121.53642 | 40.3 | NaN |
H | 2013.417 | 287.60250 | 6 | 24.98042 | 121.54228 | 46.7 | data to be validated |
I | 2013.500 | 5512.03800 | 1 | 24.95095 | 121.48458 | 18.8 | NaN |
J | 2013.417 | 1783.18000 | 3 | 24.96731 | 121.51486 | 22.1 | to be confirmed |
K | 2013.083 | 405.21340 | 1 | 24.97349 | 121.53372 | 41.4 | NaN |
L | 2013.333 | 90.45606 | 9 | 24.97433 | 121.54310 | 58.1 | NaN |
M | 2012.917 | 492.23130 | 5 | 24.96515 | 121.53737 | 39.3 | NaN |
N | 2012.667 | 2469.64500 | 4 | 24.96108 | 121.51046 | 23.8 | NaN |
O | 2013.500 | 1164.83800 | 4 | 24.99156 | 121.53406 | 34.3 | NaN |
We may wish to replace the NaNs that have resulted in the 'comments' column with blank strings:
joined_df.loc[joined_df['comments'].isnull(), 'comments'] = ''
joined_df
transaction date | distance to the nearest MRT station | number of convenience stores | latitude | longitude | house price per unit area | comments | |
---|---|---|---|---|---|---|---|
A | 2012.917 | 84.87882 | 10 | 24.98298 | 121.54024 | 37.9 | |
B | 2012.917 | 306.59470 | 9 | 24.98034 | 121.53951 | 42.2 | |
C | 2013.583 | 561.98450 | 5 | 24.98746 | 121.54391 | 47.3 | |
D | 2013.500 | 561.98450 | 5 | 24.98746 | 121.54391 | 54.8 | |
E | 2012.833 | 390.56840 | 5 | 24.97937 | 121.54245 | 43.1 | |
F | 2012.667 | 2175.03000 | 3 | 24.96305 | 121.51254 | 32.1 | |
G | 2012.667 | 623.47310 | 7 | 24.97933 | 121.53642 | 40.3 | |
H | 2013.417 | 287.60250 | 6 | 24.98042 | 121.54228 | 46.7 | data to be validated |
I | 2013.500 | 5512.03800 | 1 | 24.95095 | 121.48458 | 18.8 | |
J | 2013.417 | 1783.18000 | 3 | 24.96731 | 121.51486 | 22.1 | to be confirmed |
K | 2013.083 | 405.21340 | 1 | 24.97349 | 121.53372 | 41.4 | |
L | 2013.333 | 90.45606 | 9 | 24.97433 | 121.54310 | 58.1 | |
M | 2012.917 | 492.23130 | 5 | 24.96515 | 121.53737 | 39.3 | |
N | 2012.667 | 2469.64500 | 4 | 24.96108 | 121.51046 | 23.8 | |
O | 2013.500 | 1164.83800 | 4 | 24.99156 | 121.53406 | 34.3 |
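The same replacement can be written more concisely with fillna:
joined_df['comments'] = joined_df['comments'].fillna('')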
Exercise¶
Consider the DataFrame
eg_df = pd.DataFrame({
'date': ['2019-09-01', '2019-09-02', '2019-09-03', '2019-09-04', '2019-09-05', '2019-09-06', '2019-09-07'],
'value': [3.78, 2.90, 3.29, 1.21, 3.20, 9.39, 8.90]
},
columns=['date', 'value'])
eg_df
date | value | |
---|---|---|
0 | 2019-09-01 | 3.78 |
1 | 2019-09-02 | 2.90 |
2 | 2019-09-03 | 3.29 |
3 | 2019-09-04 | 1.21 |
4 | 2019-09-05 | 3.20 |
5 | 2019-09-06 | 9.39 |
6 | 2019-09-07 | 8.90 |
Replace the default index with the parsed dates from the 'date' column. Once this is done, remove the 'date' column.
Exercise¶
Consider again the DataFrame
eg_df = pd.DataFrame({
'date': ['2019-09-01', '2019-09-02', '2019-09-03', '2019-09-04', '2019-09-05', '2019-09-06', '2019-09-07'],
'value': [3.78, 2.90, 3.29, 1.21, 3.20, 9.39, 8.90]
},
columns=['date', 'value'])
Find the mean, minimum, and maximum 'value'. Use the library functions numpy.mean, numpy.min, and numpy.max.
Exercise¶
Consider the DataFrame
eg_df = pd.DataFrame({
'date': ['2019-09-01', '2019-09-02', '2019-09-03', '2019-09-04', '2019-09-05', '2019-09-06', '2019-09-07'],
'value': [3.78, 2.90, 3.29, 1.21, 3.20, 9.39, 8.90]
},
columns=['date', 'value'])
eg_df
date | value | |
---|---|---|
0 | 2019-09-01 | 3.78 |
1 | 2019-09-02 | 2.90 |
2 | 2019-09-03 | 3.29 |
3 | 2019-09-04 | 1.21 |
4 | 2019-09-05 | 3.20 |
5 | 2019-09-06 | 9.39 |
6 | 2019-09-07 | 8.90 |
Join it with another DataFrame so that, for dates 2019-09-03 and 2019-09-06, the comment 'missing data' is added in the 'comments' column.
Exercise¶
Consider the DataFrame
eg_df = pd.DataFrame({
'date': ['2019-09-01', '2019-09-02', '2019-09-03', '2019-09-04', '2019-09-05', '2019-09-06', '2019-09-07'],
'value': [3.78, 2.90, 3.29, 1.21, 3.20, 9.39, 8.90]
},
columns=['date', 'value'])
eg_df
date | value | |
---|---|---|
0 | 2019-09-01 | 3.78 |
1 | 2019-09-02 | 2.90 |
2 | 2019-09-03 | 3.29 |
3 | 2019-09-04 | 1.21 |
4 | 2019-09-05 | 3.20 |
5 | 2019-09-06 | 9.39 |
6 | 2019-09-07 | 8.90 |
Two new data points become available: 9.89 for 2019-09-08 and 3.89 for 2019-09-09. Append them to the DataFrame.
Exercise¶
In the DataFrame resulting from appending the data in the previous exercise, find the mean value for each day of the week (Monday, Tuesday, Wednesday, etc.). You can get a weekday from a date using
import datetime as dt
dt.date(2019, 9, 2).weekday()
0
NumPy arrays¶
Many Python libraries (such as pandas) rely on NumPy under the hood. NumPy, imported with
import numpy as np
implements multidimensional arrays.
NumPy arrays behind the pandas DataFrame and Series¶
To access the underlying NumPy array of a pandas DataFrame, we can use values:
df.values
array([[2.0129170e+03, 8.4878820e+01, 1.0000000e+01, 2.4982980e+01, 1.2154024e+02, 3.7900000e+01],
       [2.0129170e+03, 3.0659470e+02, 9.0000000e+00, 2.4980340e+01, 1.2153951e+02, 4.2200000e+01],
       [2.0135830e+03, 5.6198450e+02, 5.0000000e+00, 2.4987460e+01, 1.2154391e+02, 4.7300000e+01],
       [2.0135000e+03, 5.6198450e+02, 5.0000000e+00, 2.4987460e+01, 1.2154391e+02, 5.4800000e+01],
       [2.0128330e+03, 3.9056840e+02, 5.0000000e+00, 2.4979370e+01, 1.2154245e+02, 4.3100000e+01],
       [2.0126670e+03, 2.1750300e+03, 3.0000000e+00, 2.4963050e+01, 1.2151254e+02, 3.2100000e+01],
       [2.0126670e+03, 6.2347310e+02, 7.0000000e+00, 2.4979330e+01, 1.2153642e+02, 4.0300000e+01],
       [2.0134170e+03, 2.8760250e+02, 6.0000000e+00, 2.4980420e+01, 1.2154228e+02, 4.6700000e+01],
       [2.0135000e+03, 5.5120380e+03, 1.0000000e+00, 2.4950950e+01, 1.2148458e+02, 1.8800000e+01],
       [2.0134170e+03, 1.7831800e+03, 3.0000000e+00, 2.4967310e+01, 1.2151486e+02, 2.2100000e+01],
       [2.0130830e+03, 4.0521340e+02, 1.0000000e+00, 2.4973490e+01, 1.2153372e+02, 4.1400000e+01],
       [2.0133330e+03, 9.0456060e+01, 9.0000000e+00, 2.4974330e+01, 1.2154310e+02, 5.8100000e+01],
       [2.0129170e+03, 4.9223130e+02, 5.0000000e+00, 2.4965150e+01, 1.2153737e+02, 3.9300000e+01],
       [2.0126670e+03, 2.4696450e+03, 4.0000000e+00, 2.4961080e+01, 1.2151046e+02, 2.3800000e+01],
       [2.0135000e+03, 1.1648380e+03, 4.0000000e+00, 2.4991560e+01, 1.2153406e+02, 3.4300000e+01]])
This gives us the raw numerical data. Similarly, we can access the NumPy array behind the Series representing a specific column:
df['transaction date'].values
array([2012.917, 2012.917, 2013.583, 2013.5 , 2012.833, 2012.667, 2012.667, 2013.417, 2013.5 , 2013.417, 2013.083, 2013.333, 2012.917, 2012.667, 2013.5 ])
Or row:
df.loc['A'].values
array([2012.917 , 84.87882, 10. , 24.98298, 121.54024, 37.9 ])
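As a side note, in recent versions of pandas the to_numpy() method is the recommended way of getting at the underlying array, with values retained for backwards compatibility:
df.to_numpy()                      # same array as df.values
df['transaction date'].to_numpy()  # same array as df['transaction date'].values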
Defining one-dimensional (flat) NumPy arrays¶
Defining a one-dimensional (flat) array is easy: you can simply wrap a Python list:
a = np.array([3.57, 4.18, 25.7])
a
array([ 3.57, 4.18, 25.7 ])
np.ndim(a)
1
np.size(a)
3
np.shape(a)
(3,)
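Each NumPy array also carries a dtype, which describes the type of its elements:
a.dtype   # dtype('float64'): 64-bit floats by default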
np.arange, np.linspace, np.logspace¶
Often we want to obtain a result similar to that of range but for floating point numbers (rather than integers); we want to obtain an array of floating point numbers at equally spaced intervals (e.g. for plotting). In NumPy this is attained using linspace. Thus, to obtain an array of fifteen equally spaced floating point numbers starting at -5.0 (inclusive) and finishing at 10.0 (inclusive), we can use
np.linspace(-5., 10., 15)
array([-5. , -3.92857143, -2.85714286, -1.78571429, -0.71428571, 0.35714286, 1.42857143, 2.5 , 3.57142857, 4.64285714, 5.71428571, 6.78571429, 7.85714286, 8.92857143, 10. ])
Notice that the arguments specifying the start and end of the interval are both inclusive: the first value is exactly -5.0, the last value is exactly 10.0. This is unlike the arguments of Python's standard range: the start of the range is inclusive, but the end of the range is exclusive:
list(range(-5, 10, 1))
[-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Also note that the third argument of np.linspace is the total number of points to be produced, whereas the third argument of range is the step size. NumPy has a function np.arange, which is similar to range, but works with floating point arguments and produces NumPy arrays:
np.arange(-5., 10., 1.25)
array([-5. , -3.75, -2.5 , -1.25, 0. , 1.25, 2.5 , 3.75, 5. , 6.25, 7.5 , 8.75])
np.arange(-5., 10., 1.)
array([-5., -4., -3., -2., -1., 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
np.linspace and np.arange are particularly useful for plotting:
xs = np.linspace(0., 99. * math.pi / 8., 100)
ys = np.sin(xs)
We need the following cell magic to enable plotting in a Jupyter notebook:
%matplotlib inline
And then we plot using the Matplotlib library (specifically, its pyplot module):
import matplotlib.pyplot as plt
plt.plot(xs, ys);
Instead of linearly (equally) spaced points, we can produce points spaced logarithmically using np.logspace:
xs = np.logspace(start=0., stop=5., num=15, base=10.)
xs
array([1.00000000e+00, 2.27584593e+00, 5.17947468e+00, 1.17876863e+01, 2.68269580e+01, 6.10540230e+01, 1.38949549e+02, 3.16227766e+02, 7.19685673e+02, 1.63789371e+03, 3.72759372e+03, 8.48342898e+03, 1.93069773e+04, 4.39397056e+04, 1.00000000e+05])
ys = [x**2 for x in xs]
Here we start with the zeroth power of ten and end with the fifth power of ten.
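As a quick sanity check, np.logspace is equivalent to raising the base to linearly spaced exponents produced by np.linspace:
np.allclose(xs, 10. ** np.linspace(0., 5., 15))   # True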
plt.plot(xs, ys);
Reasons to prefer NumPy arrays over standard Python data structures, such as lists¶
Suppose that we want to compute the sine of a number of regularly spaced angles (perhaps with a view to plotting the resulting graph):
import math
xs = [n * math.pi / 8. for n in range(10000)]
ys = [math.sin(x) for x in xs]
We can time the creation of each of the lists, xs and ys, using the timeit cell magic:
%timeit -o xs = [n * math.pi / 8. for n in range(10000)]
1.57 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
<TimeitResult : 1.57 ms ± 167 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)>
The creation of xs
took about 6 milliseconds.
%timeit -o ys = [math.sin(x) for x in xs]
1.66 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
<TimeitResult : 1.66 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)>
The creation of ys
took about 7 milliseconds.
We can replace the above Python lists with NumPy arrays and initialise them using vectorised operations (vectorised operations work on many values in one go), np.linspace and np.sin:
xs = np.linspace(0., 9999. * math.pi / 8., 10000)
ys = np.sin(xs)
Let us time these operations.
%timeit -o xs = np.linspace(0., 9999. * math.pi / 8., 10000)
29.4 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
<TimeitResult : 29.4 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)>
%timeit -o ys = np.sin(xs)
130 µs ± 17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
<TimeitResult : 130 µs ± 17 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)>
Vectorised operations on NumPy arrays take tens to hundreds of microseconds here, as opposed to milliseconds for Python lists; thus we obtain an order of magnitude (or more) improvement in speed.
Performance is one reason to prefer NumPy arrays over standard Python data structures, such as lists.
Defining two-dimensional NumPy arrays (implementing matrices)¶
The real power of NumPy is in supporting two-dimensional arrays, which can implement matrices:
my_matrix = np.array([[4.28, 3.23, 5.87], [1.23, 5.32, 3.33]])
my_matrix
array([[4.28, 3.23, 5.87],
       [1.23, 5.32, 3.33]])
np.ndim(my_matrix)
2
np.size(my_matrix)
6
This particular two-dimensional NumPy array (matrix) is 2 by 3:
np.shape(my_matrix)
(2, 3)
Other ways of creating NumPy arrays¶
Instead of creating NumPy arrays from (either nested or flat) Python lists, we could use dedicated NumPy functions, such as np.zeros, np.ones, np.full, np.empty, and np.tile.