## Without solutions

# The fundamentals of the Python language and Jupyter notebooks

```
# Copyright (c) Thalesians Ltd, 2019-2023. All rights reserved.
# Copyright (c) Paul Alexander Bilokon, 2019-2023. All rights reserved.
# Author: Paul Alexander Bilokon <[email protected]>
# This version: 2.0 (2023.11.17)
# Previous versions: 1.0 (2019.01.28)
# Email: [email protected]
```

## Motivation

**Programming** is one of the most important skills for a data scientist, and Python is the *de facto* *lingua franca* — the programming language of choice — for Data Science.

Data Scientists perform much of this programming inside the Jupyter environment.

In this Chapter we introduce just enough Python (and Jupyter) to get you started in Data Science.

## Objectives

- To introduce the Python programming language.
- To explain where and how the reader can download the Anaconda Python distribution.
- To introduce the Jupyter notebooks.
- To demonstrate how different types of Jupyter notebook cells can be used.
- To introduce variables.
- To explain how to use Python’s numeric data types: `int`s and `float`s.
- To introduce type casting.
- To demonstrate how to use Python libraries, using `math` as an example.
- To explain the concept of dynamic typing.
- To introduce strings.
- To introduce `None`.
- To introduce arithmetic expressions.
- To introduce functions and explain their role in code reuse.
- To explain why functions are first-class citizens in Python.
- To introduce `bool`eans and logic.
- To introduce comparison operators.
- To explain how comparison operators can be combined with logical operators, such as `not`, `and`, and `or`.
- To introduce `all` and `any`.
- To explain that any value can be cast to a `bool`.
- To introduce control flow and `if` statements.
- To introduce key data structures: lists, tuples, dictionaries, and sets.
- To explain the difference between the shallow copy and the deep copy.
- To explain iteration, and introduce the `while` loop and the `for` loop.
- To introduce the temporal types: `date`, `time`, and `datetime`.
- To provide examples and exercises on this material, so the reader can practise programming.
- To introduce the reader to the literature and web resources on Python.

## What are Python and Jupyter

Python is a **programming language** that was created by Guido van Rossum and first released in 1991.

Its distinguishing characteristics are *straightforwardness* and *readability*, especially in comparison with other programming languages, such as C++ and Java. At the same time, Python is very expressive, powerful, and laconic, enabling programmers to express complex ideas in very little code.

Python is not only a language of choice for data science. It is frequently employed by web designers (for making websites), system administrators (for writing scripts and automation), hackers (also for writing scripts) and anyone who needs to process numeric and textual data in bulk.

There are two “lineages” of the Python language in existence. There is Python 2.x (the final release being version 2.7.18) and there is Python 3.x (the latest being version 3.12.0 at the time of writing). Python 3.x has superseded Python 2.x: Python 2 reached its end of life on 1 January 2020, although a good deal of legacy systems code still runs on it. We shall stick with Python 3.x in the present work.

There are several Python **distributions** to choose from. The **Anaconda distribution** is a popular choice among Data Scientists. You can download the latest version of the Anaconda distribution for your operating system from https://www.anaconda.com/

If you have a 64-bit operating system, we suggest that you download the 64-bit version.

Once you have downloaded the distribution, install it.

Launch **Anaconda Navigator**. Once the Anaconda Navigator window shows up, launch **Jupyter notebook** from it. When it shows up in the browser, click on “New”, then “Python 3”. A blank Jupyter notebook should show up inviting you to enter some Python code.

It is worth noting that Jupyter notebooks are not the only way to write Python code. You could launch the Python **interpreter** (`python.exe` on Windows) from the **Anaconda Prompt** and type in Python code closer to the metal. Or you could write your Python code in a text file, save it as `something.py` and end up with a standalone Python module or multiple such modules, forming a complex software product. This is something that we would do for a finished, polished solution in **production**. For **research** and **prototyping**, though, Jupyter notebooks are a perfect environment. (While Python is perfectly good for many production use cases, for others you may consider migrating to a language like C++, C#, or Java.)

For completeness, we shall mention that you don’t have to use Python in Jupyter notebooks. The name “Jupyter” itself stands for “Julia, Python, R” — indeed, other programming languages, such as kdb+’s q, can be used in Jupyter notebooks, although we shall stick with Python in this work.

## Introduction to Jupyter

Jupyter notebooks are at the core of Python’s research environment. In Jupyter notebooks, the data is

- loaded,
- cleaned,
- visualised,
- analysed,

possibly over multiple iterations, until the desired result is obtained. It is therefore unsurprising that Jupyter notebooks are often quite messy until they are finally cleaned up to present the conclusions of the research work. In fact, what you are reading right now is also a Jupyter notebook.

### Cells

A Jupyter notebook comprises a column of basic building blocks called **cells**.

To insert a new cell in Jupyter, first click on an existing cell, then click on “Insert” in Jupyter’s menu and select “Insert Cell Above” or “Insert Cell Below”.

Under the menu, in the toolbar, there is a drop-down box with cell types: “Code”, “Markdown”, “Raw NBConvert”, and “Heading”. Click on an existing cell, then alter its type by selecting a different value from that drop-down box.

The most important cell types for us are “Code” and “Markdown”.

“Code” cells, such as the one below…

```
3 + 5
```

8

allow you to enter Python code (in our example, the numeric expression `3 + 5`) as “In” (input) and display the result as “Out” (output, in our example, `8`). Don’t forget to press [Shift] + [Enter], once you have entered the code in your “Code” cell, to evaluate it and display the result in “Out”. (The cursor will automatically move to the next cell.)

Markdown cells, such as the one you are currently reading, enable you to document your code. Moreover, you can use markdown syntax, such as `*this*` (to *italicise* the text), `**this**` (to make the text **bold**), include `# Headings` (prefixed with `#`), bulleted lists (prefixed with `*`, such as

- this
- simple
- list),

and numbered lists (prefixed with `1.`, such as

1. this
2. simple
3. list).

It is possible to include snippets of Python code, between two backticks, which will be `rendered in a special font`.

Finally, if you are a mathematician, you will be pleased to hear that you can include mathematical formulae, in $\LaTeX$, between two dollar signs (or double dollar signs for standalone equations). $\LaTeX$ looks pretty in Jupyter notebooks, such as this Euler’s formula, $e^{ix} = \cos x + i \sin x$.

If we use double dollar signs, then we get $$e^{ix} = \cos x + i \sin x.$$

Unfortunately, teaching you $\LaTeX$, the document preparation system built by Leslie Lamport on top of Donald Knuth’s TeX typesetting system, is outside the scope of this work, but you will find plenty of resources on it online.

However, by now we hope that we have shown you the power of Markdown, Jupyter’s language for documenting Python. The work that you are reading now is written in Markdown. You can read up on Markdown in Wikipedia: https://en.wikipedia.org/wiki/Markdown

#### Exercise

Typeset the following in Markdown:

In algebra, a **quadratic equation** (from the Latin *quadratus* for “square”) is any equation having the form
$$ax^2 + bx + c = 0,$$
where

- $x$ represents an unknown, and
- $a$, $b$, and $c$ represent known numbers, with $a \neq 0$.

If $a = 0$, then the equation is linear, not quadratic, as there is no $ax^2$ term.

The numbers $a$, $b$, and $c$ are the **coefficients** of the equation and may be distinguished by calling them, respectively, the **quadratic coefficient**, the **linear coefficient**, and the **constant** or **free term**.

The values of $x$ that satisfy the equation are called **solutions** of the equation, and **roots** or **zeros** of its left-hand side. A quadratic equation has at most two solutions.

The value $b^2 - 4ac$ is known as the **discriminant**.

- If the discriminant is *positive*, there are two real solutions given by the formula $$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$
- If the discriminant is *zero*, there is one real solution (referred to as a **double root**) given by the formula $$x = \frac{-b}{2a}.$$
- If the discriminant is *negative*, there are no (real) solutions.

You can learn more about quadratic equations on Wikipedia: https://en.wikipedia.org/wiki/Quadratic_equation

## Introduction to Python

We have already entered our first piece of Python code, namely

```
3 + 5
```

8

#### Exercise

Compute, using Python, (i) the product of seven and eight, (ii) the difference between 2190 and 518, (iii) the result of dividing 100 by four, and (iv) the result of multiplying the difference between 2190 and 518 by 10.

```
(2190 - 518) * 10
```

16720

It should now be clear why we call Python a “supercalculator”. Indeed, you could use Python as a calculator (but it is so much more). To start harnessing its power we should introduce

### Variables

A **variable** is one of the most important concepts in programming. Essentially, it is a named value. Moreover, as the name suggests, this named value can be varied (changed), while keeping the name the same.

Let us create a variable named `a`. We create a variable by assigning to it, using the **assignment operator =**, its initial value:

```
a = 5
```

This **statement** (command) essentially says “set the variable `a` to value 5”.

Once the variable `a` has been created and **initialised** (set to its initial value), we can use it in **expressions**, such as `a + 3`. When we write `a` in expressions, its value (5) will be substituted for `a`, so the result of the arithmetic expression `a + 3` will be `5 + 3`, in other words, 8:

```
a + 3
```

8

We note that the difference between the statements and expressions is that the latter evaluate to a result.

As we said, the value of the variable can be varied (changed). Let us assign to `a` a different value, say, 7:

```
a = 7
```

Now when we evaluate the expression `a + 3`, we will get a different result, namely 10:

```
a + 3
```

10

What if we now assign to `a` the result of the expression `a + 3`?

```
a = a + 3
```

First, the expression `a + 3` on the right-hand side of `=` is evaluated (it is 7 + 3, i.e. 10). Next, it is assigned to `a` as its new value. So, as a result of this assignment, the value of the variable `a` has become

```
a
```

10

We may now introduce a different variable, say `b`,

```
b = 5
```

and use it in arithmetic expressions alongside `a`:

```
a + b + 3
```

18

Notice that the values of the variables persist (are remembered) as we go from one Jupyter cell to the next.

We could write all of the above more succinctly in a single cell:

```
a = 7
a = a + 3
b = 5
a + b + 3
```

18

Notice that only the result of the last expression, `a + b + 3`, is returned as the output (“Out”) by Jupyter.

Sometimes there is no “Out” to be printed, as is the case with assignment to a variable:

```
a = 10
```

However, as we said, variables persist throughout the Jupyter session, so we can inspect them in one of the following cells:

```
a
```

10

Remember that only the result from the last expression is printed:

```
2 + 2
3 + 7
```

10

However, you can print multiple things using the `print` **function**. This is convenient for inspecting intermediate results in your code:

```
print(2 + 2)
print(3 + 5)
3 + 4
```

4 8

7

In the example above, `4` and `8` are displayed by the two calls to the `print` function, whereas the output of the cell is `7`, which is the result of evaluating the last expression in the cell, `3 + 4`.

Note that variable names in Python are case-sensitive, so `a` is not the same as `A`, and `myvar` is not the same as `myVar`:

```
a = 3
A = 5
a
```

3

Whereas

```
A
```

5

#### Exercise

Set the variable `a` to `15`, the variable `b` to `7`, then, without typing in any digits, swap the values of the two variables, so the variable `a` becomes equal to `7` and the variable `b` to `15`.

### Numerics

So far all the values that we have dealt with in Python have been **numeric**, such as `3` and `5` in the expression

```
3 + 5
```

8

The result, `8`, is also numeric.

Moreover, these values are all integers. An **integer** in programming is the same as in mathematics: a whole number with no digits after the decimal point:

```
8
```

8

We can use the built-in Python function `type` to confirm that the **type** of 8 is indeed an integer (or `int` for short):

```
type(8)
```

int

We can assign this value to a variable

```
my_int = 8
```

And then that variable will have the type integer:

```
type(my_int)
```

int

Or we could print out the value of `my_int` along with the type of its value using `print`:

```
print(my_int, type(my_int))
```

8 <class 'int'>

Python supports fractions (mathematically speaking, **real numbers**), as well as integers. Fractions are implemented using a different type, the **floating point** type, `float`:

```
type(3.57)
```

float

We can force a **literal** to be interpreted as a float (rather than as an integer) by including the decimal point:

```
type(42.)
```

float

whereas

```
type(42)
```

int

We say that `42.` is a `float` literal, whereas `42` is an `int` literal.

We could also **cast** a value of type `int` to `float`:

```
float(42)
```

42.0

```
type(float(42))
```

float

When casting a value of type `float` to type `int` we may end up losing precision as we lose all digits after the decimal point:

```
int(3.57)
```

3

```
type(int(3.57))
```

int
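To make the loss of precision concrete, here is a small sketch contrasting `int`, which simply drops the digits after the decimal point, with the built-in `round`, which goes to the nearest integer. The variable names are ours, chosen only for illustration.

```python
# int() discards the fractional part, i.e. it truncates towards zero.
truncated_positive = int(3.57)    # 3
truncated_negative = int(-3.57)   # -3, not -4

# round() picks the nearest integer instead.
rounded_positive = round(3.57)    # 4
rounded_negative = round(-3.57)   # -4

print(truncated_positive, truncated_negative)
print(rounded_positive, rounded_negative)
```

So a cast to `int` is not a rounding operation; if rounding is what you want, ask for it explicitly.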

The `float` data type is used throughout data science to represent numerical values in arithmetic operations.

#### Exercise

Is the sum of `3` and `3.57` an `int` or a `float`? Will you lose precision by casting `3` to a `float` then back to an `int`? Will you lose precision by casting `3.57` to an `int` then back to a `float`?

### Standard Python libraries

The power of Python is in its **libraries**: pre-written collections of Python code that do useful stuff for us. We make use of libraries by `import`ing their **modules**:

```
import math
```

Once we have imported the standard Python library module `math`, we can start using functions defined in it, such as `sqrt` for the square root:

```
math.sqrt(3.57)
```

1.8894443627691184

We can use the results of these functions in expressions:

```
4.5 + 2 * math.sqrt(3.57)
```

8.278888725538238

Modules may define other things in addition to functions, such as constants. In particular, the `math` module defines the mathematical $\pi$ (“pi”) constant, which relates the radius of a circle to its circumference (via $C = 2\pi r$, where $r$ is the radius, $C$ the circumference):

```
math.pi
```

3.141592653589793

As a side comment, many real numbers, such as the **transcendental** number $\pi$, cannot be represented exactly using floating point. Floating point arithmetic relies on truncated, approximate representations of real numbers, which may lead to all sorts of **numerical issues** (often subtle) in scientific computing. However, what we are doing here is too basic for us to worry about these numerical issues. If you want to *really* understand floating point numbers, have a look at the paper *What Every Computer Scientist Should Know About Floating-Point Arithmetic* by David Goldberg (Google it).
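A classic illustration of such a numerical issue, offered here only as a sketch: neither 0.1 nor 0.2 has an exact binary floating point representation, so their sum is not exactly 0.3. The standard library function `math.isclose` compares two floats up to a tolerance, which is usually what we want. The variable name `total` is ours.

```python
import math

total = 0.1 + 0.2

print(total)                     # 0.30000000000000004
print(total == 0.3)              # False: exact comparison fails
print(math.isclose(total, 0.3))  # True: equal up to a small tolerance
```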

#### Exercise

In one of the previous exercises we have already mentioned quadratic equations. Use `math` to find both solutions of the quadratic equation $2x^2 - 3x + \frac{1}{2} = 0$.

### Dynamic typing

Let us set the variable `x`, so it equals 65:

```
x = 65
```

Its type, then, will be integer:

```
type(x)
```

int

We could overwrite `x` with a value of a different type, such as a float:

```
x = 3.57
```

The type of `x` has now changed:

```
type(x)
```

float

Some programming languages (such as Java, C++, C#, and many others) would not allow overwriting `x` with a value of a different type: once something is an `int`, it is always an `int`. We say that these languages are **statically typed**, whereas Python is **dynamically typed**. Types are important in Python, and Python is still a **strongly typed** language, although the type of a variable may change over the lifetime of the program, hence the expression “dynamically typed”.
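The distinction between dynamic and strong typing can be sketched as follows. The example uses a string value and a `try`/`except` block, neither of which has been introduced at this point, so treat it as a preview; the variable names `y` and `outcome` are ours.

```python
# Dynamic typing: the same name may refer to values of different types.
y = 65
y = 'sixty-five'
print(type(y))          # <class 'str'>

# Strong typing: Python will not silently mix incompatible types.
try:
    outcome = '3' + 5   # adding a str and an int is an error
except TypeError:
    outcome = None
    print('Cannot add a str and an int')

# An explicit cast states which operation we actually mean.
print(int('3') + 5)     # 8
print('3' + str(5))     # 35
```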

### Strings

The string type, `str`, allows us to define textual variables. A string literal is enclosed within a pair of single `'` or double `"` quotation marks.

```
my_str = 'foo'
print(my_str, type(my_str))
```

foo <class 'str'>

It is customary for introductions to programming languages to include an example that prints out the string `'Hello, World!'`. In Python, this is a one-liner:

```
print('Hello, World!')
```

Hello, World!

The function `len` returns the length of a string:

```
len('Hello, World!')
```

13

We can access individual characters in a string using **indexing** with the square brackets. Notice that the indexing starts at zero, thus

```
'Hello, World!'[0]
```

'H'

whereas

```
'Hello, World!'[1]
```

'e'

We can also index from the back using negative indices:

```
'Hello, World!'[-1]
```

'!'

Moreover, we can index longer **substrings**, rather than individual characters:

```
'Hello, World!'[3:7]
```

'lo, '

Notice that the first index is inclusive, whereas the second exclusive, so the resulting substring consists of characters at indices 3, 4, 5, and 6 (but not 7).
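One reason for the inclusive/exclusive convention, sketched below under our own choice of variable names: the length of a slice is simply the difference of its two indices, and splitting a string at any index yields two pieces that glue back into the original. The sketch omits a slice bound and uses `+` on strings (concatenation) to join the pieces.

```python
text = 'Hello, World!'
start, stop = 3, 7

piece = text[start:stop]             # characters at indices 3, 4, 5, 6
print(len(piece) == stop - start)    # True

# Omitting a bound means "from the beginning" or "to the end".
k = 5
head = text[:k]                      # 'Hello'
tail = text[k:]                      # ', World!'
print(head + tail == text)           # True
```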

When indexing, we can also provide a step:

```
'Hello, World!'[3:7:2]
```

'l,'

```
'Hello, World!'[::2]
```

'Hlo ol!'

Of course, instead of repeating the string ‘Hello, World!’ so many times (while running the risk of mistyping it), we should have stored it in a variable…

```
greet = 'Hello, World!'
```

…and then indexed:

```
greet[::2]
```

'Hlo ol!'

One of the most useful operations on strings is **concatenation**. It enables us to produce a single string from multiple:

```
'first' + 'second'
```

'firstsecond'

```
separator = ', '
'first' + separator + 'second' + separator + 'third'
```

'first, second, third'

#### Exercise

Use indexing and concatenation to obtain the string `'World, Hello!'` from `'Hello, World!'`.

### None

We can set Python variables to a special value, `None`,

```
a = None
```

of a special type,

```
type(a)
```

NoneType

`None` is used to signal that the value is absent or missing.

In fact, this is the value implicitly returned by function calls that do not return anything else, such as

```
print(357)
```

357
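We can check this claim by capturing what `print` returns. In this short sketch the name `returned` is ours:

```python
returned = print(357)   # displays 357 as a side effect
print(returned)         # None
print(type(returned))   # <class 'NoneType'>
```

Jupyter shows no “Out” for a cell whose last expression evaluates to `None`, which is why the call above appears to produce only the printed text.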

### Arithmetic expressions

Python supports the standard arithmetic operators:

```
print('Addition:', 5 + 3)
print('Subtraction:', 5 - 3)
print('Multiplication:', 5 * 3)
print('Division:', 5 / 3)
print('Exponentiation:', 5**3)
print('Modulo:', 5 % 3)
```

Addition: 8 Subtraction: 2 Multiplication: 15 Division: 1.6666666666666667 Exponentiation: 125 Modulo: 2

Python also supports integer division, which produces the largest integer less than or equal to `5 / 3`:

```
5 // 3
```

1

If any of the arguments is a `float`, the result will also be of type `float`:

```
5.1 // 3.1
```

1.0
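The phrase “largest integer less than or equal to” matters for negative numbers, where integer division rounds down rather than towards zero. The sketch below (with our own variable names `n` and `d`) also checks the identity that ties `//` and `%` together.

```python
print(5 // 3)        # 1
print(-5 // 3)       # -2: rounds down, towards minus infinity
print(int(-5 / 3))   # -1: int() truncates towards zero instead
print(-5 % 3)        # 1

# Quotient and remainder always recombine into the original number.
n, d = -5, 3
print((n // d) * d + n % d == n)   # True
```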

**Expressions** such as

```
3 + 5
```

8

```
2. * x + 7.
```

14.14

evaluate to numbers (whether integers, or floating point numbers). They are known as **arithmetic** expressions.

We can perform some other common operations on numerics:

```
print('Absolute value:', abs(-5))
print('Rounding:', round(3.56))
print('Maximum value:', max(3, 2, 8, 10, 2, 5))
print('Minimum value:', min(3, 2, 8, 10, 2, 5))
```

Absolute value: 5 Rounding: 4 Maximum value: 10 Minimum value: 2

### Functions

Suppose that we have written some code to compute the area of a circle:

```
radius = 5.
area = math.pi * radius * radius
print(area)
```

78.53981633974483

There is little point in rewriting it each time we encounter a new circle with a different radius. So we wrap it inside a **function**, which takes `radius` as its **parameter** (**argument**) and **returns** the result:

```
def area_of_circle(radius):
    area = math.pi * radius * radius
    return area
```

We can **call** our function with the values of the arguments that we need in each case:

```
area_of_circle(5.)
```

78.53981633974483

```
r = 7.5
area_of_circle(r)
```

176.71458676442586

Functions can have multiple arguments:

```
def area_of_triangle(base, height):
    print('Base:', base)
    print('Height:', height)
    area = .5 * base * height
    return area

area_of_triangle(3., 5.)
```

Base: 3.0 Height: 5.0

7.5

Notice how the **block** of code was indented (we chose to indent it using four spaces, although some people prefer to use tabs) to delimit it, designating it as the **body** of the function `area_of_triangle`. The function call that ensues, `area_of_triangle(3., 5.)`, is not indented, and is not part of that body.

The variables `base` and `height` are defined only within the body of the function. We say that those variables’ **scope** is limited to the body of the function.
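A minimal sketch of what limited scope means in practice. The function `area_of_square` and its variable `area_inside` are hypothetical names introduced only for this illustration, and the `try`/`except` block (not covered in this chapter) is used merely to catch the resulting error.

```python
def area_of_square(side):
    area_inside = side * side   # exists only while the function runs
    return area_inside

print(area_of_square(4.))       # 16.0

# Outside the function body the name area_inside is unknown.
try:
    print(area_inside)
except NameError:
    print('area_inside is not defined outside the function')
```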

It is possible to call the function specifying the values of the arguments in order

```
area_of_triangle(3., 5.)
```

Base: 3.0 Height: 5.0

7.5

or by name

```
area_of_triangle(height=5., base=3.)
```

Base: 3.0 Height: 5.0

7.5

Functions can also specify default values for their arguments in their definitions:

```
def area_of_triangle(base, height=5.):
    return .5 * base * height
```

So calling

```
area_of_triangle(3., 5.)
```

7.5

can now be equivalently done as

```
area_of_triangle(3.)
```

7.5

Notice that like everything else (e.g. integers) functions are objects and **first-class citizens**. Thus we can think of `area_of_triangle` as a variable set to a value of type `function`:

```
type(area_of_triangle)
```

function

Function objects can be passed to other functions as parameters:

```
def add(x, y):
    return x + y

def multiply(x, y):
    return x * y

def result_printer(op, x, y):
    print('The result is', op(x, y))

result_printer(add, 3, 5)
result_printer(multiply, 3, 5)
```

The result is 8 The result is 15
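Being first-class also means that a function can be bound to a second name, exactly as a number can, and applied repeatedly by another function. The following is a sketch with hypothetical helper names (`double`, `square`, `apply_twice`) that are not part of the original text:

```python
def double(value):
    return 2 * value

def square(value):
    return value * value

# Binding the function object to another name does not call it.
twice = double
print(twice(21))          # 42
print(twice is double)    # True: both names refer to the same function

def apply_twice(func, value):
    # Call whichever function was passed in, two times in a row.
    return func(func(value))

print(apply_twice(double, 5))   # 20
print(apply_twice(square, 3))   # 81
```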

Good programmers are masters of **code reuse**; they therefore wrap generally useful pieces of code into convenient functions.

If a library defines the function that we need, then we don’t need to write our own. We have already seen (and used) the function

```
math.sqrt(9.)
```

3.0

#### Exercise

Write two functions that will return the two roots of a given quadratic equation. Test them on the quadratic equation $2x^2 - 3x + \frac{1}{2} = 0$.

### Booleans and logic

A `bool`ean is a binary variable type that can be either `True` or `False`. It is named after the self-taught English mathematician, philosopher, and logician George Boole: https://en.wikipedia.org/wiki/George_Boole

```
my_bool = True
print(my_bool, type(my_bool))
```

True <class 'bool'>

```
my_bool = False
print(my_bool, type(my_bool))
```

False <class 'bool'>

Let us set `x` to the integer 10:

```
x = 10
```

Expressions that evaluate to either `True` or `False` are known as **boolean** expressions.

```
x < 10
```

False

```
type(x < 10)
```

bool

Different boolean expressions can be obtained by using different **comparison operators**, such as **less than**:

```
x < 10
```

False

**less than or equals**:

```
x <= 10
```

True

**equals**:

```
x == 10
```

True

**greater than or equals**:

```
x >= 10
```

True

**greater than**:

```
x > 10
```

False

And these comparison operators can be combined with **logical operators**, such as `not`, `and`, and `or`:

```
x <= 10 and x % 2 == 1
```

False

```
x <= 10 or x % 2 == 1
```

True

We can also use the built-in function `all`:

```
all([x > 1, 5 <= x, 5 > 3, 7 != 1])
```

True

which is equivalent to

```
x > 1 and 5 <= x and 5 > 3 and 7 != 1
```

True

Similarly,

```
all([x > 1, 5 <= x, x == 5, 5 > 3, 7 != 1])
```

False

is equivalent to

```
x > 1 and 5 <= x and x == 5 and 5 > 3 and 7 != 1
```

False

Another built-in function, `any`, enables us to write

```
any([x > 1, 5 <= x, x == 5, 5 > 3, 7 != 1])
```

True

which is somewhat more succinct and arguably more readable than the equivalent

```
x > 1 or 5 <= x or x == 5 or 5 > 3 or 7 != 1
```

True
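Two edge cases are worth knowing, sketched here with names of our own: applied to an empty list, `all` returns `True` (no element is false) whereas `any` returns `False` (no element is true).

```python
nothing = []

print(all(nothing))         # True: there is no False element
print(any(nothing))         # False: there is no True element

mixed = [True, False]
print(all(mixed))           # False
print(any(mixed))           # True
```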

Each data type can also be cast to `True` or `False`. As a general rule, empty objects (such as the empty string `''`), zeros, and `None` will be cast to `False`, while everything else will be cast to `True`:

```
print(bool())
print(bool(''))
print(bool(' '))
print(bool(0))
print(bool(0.))
print(bool(1))
print(bool(1.5))
print(bool(None))
```

False False True False False True True False
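The rule looks only at whether an object is empty or zero, not at what its contents “mean”. A few cases that commonly surprise newcomers are sketched below; the list examples anticipate a data structure that is introduced later, and the variable names are ours.

```python
text_false = bool('False')   # a non-empty string, hence True
text_zero = bool('0')        # also a non-empty string, hence True
empty_list = bool([])        # an empty list is False
list_of_zero = bool([0])     # a list with one element is True

print(text_false, text_zero, empty_list, list_of_zero)
# True True False True
```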

### Control flow

We can control the flow of our programs using the basic logical operators and `if` statements. The `if` statement evaluates the `if` block if the given boolean expression is `True` and the `else` block (as long as it is present) if the given boolean expression is `False`. Else-if or `elif` lets us set a further boolean expression to evaluate if the preceding conditions are not `True`.

```
if x <= 7:
    print('x is less than or equal to seven')
else:
    print('x is greater than seven')
```

x is greater than seven

```
if x <= 7:
    print('x is less than or equal to seven')
```

In this example, `x > 7` (so `x <= 7` is `False`) but there is no `else` block, so nothing is evaluated/printed.

```
if x <= 7:
    print('x is less than or equal to seven')
elif x <= 10:
    print('x is greater than seven but less than or equal to ten')
elif x <= 15:
    print('x is greater than ten but less than or equal to fifteen')
else:
    print('x is greater than fifteen')
```

x is greater than seven but less than or equal to ten

We can also have nested `if-else` statements:

```
if x % 2 == 0:
    print('x is divisible by 2')
    if x % 5 == 0:
        print('x is divisible by 2 and 5')
elif x % 5 == 0:
    print('x is divisible by 5 but not 2')
else:
    print('x is divisible by neither 2 nor 5')
```

x is divisible by 2 x is divisible by 2 and 5

To check whether a variable is `None` we use `is None` rather than `== None`:

```
if x is None:
    print('x is None')
else:
    print('x is not None')
```

x is not None
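It is tempting to test for a missing value with `not` instead, but that conflates `None` with other values that are cast to `False`, such as zero. A short sketch, with variable names of our own choosing:

```python
count = 0        # a perfectly valid value that happens to be falsy
missing = None   # genuinely absent

print(not count)          # True
print(not missing)        # True: indistinguishable from the line above
print(count == False)     # True: zero compares equal to False

print(count is None)      # False
print(missing is None)    # True: only None passes this test
```

The `is None` test singles out `None` and nothing else, which is why it is the preferred way to check for a missing value.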

#### Exercise

Write a function that will return the number of real solutions of a quadratic equation.

#### Exercise

The **Fibonacci sequence** is a sequence of integers, starting with zero and one, such that each term in the sequence is the sum of the previous two. Thus the first few terms of the sequence are 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, etc. Write a function that, given `n`, will return the `n`th term of the Fibonacci sequence.

### Data structures

As data scientists, we care a lot about **data structures** that let us store and access large amounts of data. Some such data structures, such as lists, tuples, dictionaries, and sets, are part of the Python standard. Others, such as multidimensional arrays and dataframes, are provided by third-party, but *de facto* standard libraries, such as NumPy and Pandas, respectively.

#### Lists

A list is arguably the most commonly used data structure in Python. Its core function is to allow storage of and access to various elements. Financial data in particular are often represented as time series, which are collections of observed values with corresponding times. To define a `list` we use square brackets `[]`:

```
my_list = [1, 5, 6, 3]
print(my_list, type(my_list))
```

[1, 5, 6, 3] <class 'list'>

Python allows us to combine elements of different types into the same `list`:

```
my_list = [3, "hello world", True, None, 3, math.pi]
print(my_list)
```

[3, 'hello world', True, None, 3, 3.141592653589793]

Let’s examine the length of our list:

```
len(my_list)
```

6

Notice that repeated values are counted as distinct elements.

Accessing elements of a `list` is performed using indexing with `[]`. Remember that the index of the first element of the list is `0`:

```
print(my_list[0])
print(my_list[1])
print(my_list[3])
print(my_list[2])
print(my_list[4])
```

3 hello world None True 3

You may also access elements from the end of a list by using negative indexing:

```
print(my_list[-1])
print(my_list[-2])
print(my_list[-3])
print(my_list[-4])
```

3.141592653589793 3 None True

We may set an element in a `list` to a new value:

```
my_list[-1] = 4
print(my_list)
```

[3, 'hello world', True, None, 3, 4]

We can select a sublist from the list:

```
my_list = ['problems','worthy','of','attack','prove','their','worth','by','fighting','back']
print(my_list[3:6])
```

['attack', 'prove', 'their']

Notice that the index 3 is inclusive, whereas the index 6 exclusive, so, as a result, we obtain a sublist containing elements at indices 3, 4, and 5 (but not 6).

We may also select sublists without the lower and/or upper bounds:

```
print(my_list[3:])
print(my_list[:5])
print(my_list[:])
```

['attack', 'prove', 'their', 'worth', 'by', 'fighting', 'back'] ['problems', 'worthy', 'of', 'attack', 'prove'] ['problems', 'worthy', 'of', 'attack', 'prove', 'their', 'worth', 'by', 'fighting', 'back']

You can specify a step:

```
my_list[::2]
```

['problems', 'of', 'prove', 'worth', 'fighting']

Reverse the order by setting a negative step size:

```
my_list[::-1]
```

['back', 'fighting', 'by', 'worth', 'their', 'prove', 'attack', 'of', 'worthy', 'problems']

And combine the step with lower and upper bounds:

```
my_list[2:10:3]
```

['of', 'their', 'fighting']

We can use Python’s `range` function to generate a list of consecutive integers:

```
list(range(10))
```

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Conveniently, we can specify a step as well:

```
list(range(1,10,2))
```

[1, 3, 5, 7, 9]

We can add elements to the end of the list via the `append` method (a **method** is a function associated with a particular object, in our example, `my_list`):

```
my_list = list(range(0,10))
my_list.append(25)
my_list.append(25)
print(my_list)
```

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 25, 25]

We can remove specific elements by calling the method `remove` and supplying it with the value of an element that we would like to remove. Note that only the first instance of an element will be removed:

```
my_list.remove(25)
my_list.remove(my_list[5])
print(my_list)
```

[0, 1, 2, 3, 4, 6, 7, 8, 9, 25]

We can **filter** a list using something like

```
list(filter(lambda x: x > 5, my_list))
```

[6, 7, 8, 9, 25]

Here the **lambda** or **anonymous function** `lambda x: x > 5` is equivalent to

```
def my_func(x): return x > 5
```

but is shorter and avoids giving the function a name, which is not needed, as we don’t intend to call this function in the future.

We can **map** or apply a function (or lambda) to each element of a list:

```
list(map(lambda x: x*2, my_list))
```

[0, 2, 4, 6, 8, 12, 14, 16, 18, 50]

The function `sorted` sorts a list without modifying it: it returns a new, sorted list, while keeping the original one intact:

```
sorted(my_list, reverse=True)
```

[25, 9, 8, 7, 6, 4, 3, 2, 1, 0]

```
my_list
```

[0, 1, 2, 3, 4, 6, 7, 8, 9, 25]

On the other hand, the method `sort` modifies the list: it sorts it **in place**:

```
my_list.sort()
```

```
my_list
```

[0, 1, 2, 3, 4, 6, 7, 8, 9, 25]
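Because `my_list` above was already in ascending order, the effect of `sort` is not visible there. The sketch below uses a small unsorted list of our own (`values`) and also highlights a common pitfall: `sort` returns `None`, so its result should not be assigned back to the variable.

```python
values = [3, 1, 2]

result_of_sorted = sorted(values)    # a new list; values is untouched
print(result_of_sorted, values)      # [1, 2, 3] [3, 1, 2]

result_of_sort = values.sort()       # sorts values in place
print(result_of_sort, values)        # None [1, 2, 3]
```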

#### Tuples

Let us consider an example.

```
a = [3, "hello world", True, None, 3, math.pi]
a = ['some', 'other', 'list']
a
```

['some', 'other', 'list']

The variable `a` was first assigned to the list `[3, "hello world", True, None, 3, math.pi]`, but was then reassigned to another list, `['some', 'other', 'list']`. Variables can be thought of as pointers (**references**) to objects in memory, such as lists. Two variables can reference the same object in memory, e.g.

```
a = [3, "hello world", True, None, 3, math.pi]
b = a
```

Now,

```
a
```

[3, 'hello world', True, None, 3, 3.141592653589793]

```
b
```

[3, 'hello world', True, None, 3, 3.141592653589793]

Since lists are **mutable** objects, they can be modified after construction. Notice that here we are not reassigning a variable so that it references a new object in memory; we are modifying the object that it already points to:

```
a[2] = False
a
```

[3, 'hello world', False, None, 3, 3.141592653589793]

Notice that, since the variable `b` is referencing the same object, its value has also changed:

```
b
```

[3, 'hello world', False, None, 3, 3.141592653589793]

In this sense, mutable objects are somewhat dangerous. Consider the following code:

```
def my_mean(arg):
    # This could be a long function, which, perhaps by mistake,
    # modifies arg:
    # ...
    arg[3] = 11.7
    # ...
    return sum(arg) / len(arg)
```

Let’s apply this function to

```
a = [4.25, 18.5, 22.5, 13.7, 25.4]
```

The result of the (broken) `my_mean` looks roughly correct…

```
my_mean(a)
```

16.47

…although it’s not. But what’s worse, the user of `my_mean`, who never expected that function to modify its argument, is in for a surprise:

```
a
```

[4.25, 18.5, 22.5, 11.7, 25.4]

When we doubt the validity of some code, we may defensively copy the arguments like so:

```
a = [4.25, 18.5, 22.5, 13.7, 25.4]
print(my_mean(a.copy()))
a
```

16.47

[4.25, 18.5, 22.5, 13.7, 25.4]
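When we control the function ourselves, a complementary safeguard is to write it so that it only reads its argument. The sketch below uses a hypothetical `safe_mean` (not part of the original code) and checks that the caller's list is left unchanged.

```python
# Hypothetical, non-mutating counterpart of my_mean:
# it reads the argument but never assigns to its elements.
def safe_mean(values):
    return sum(values) / len(values)

a = [4.25, 18.5, 22.5, 13.7, 25.4]
snapshot = a.copy()  # kept only so that we can compare afterwards

print(safe_mean(a))
print(a == snapshot)  # True: the caller's list is unchanged
```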

Notice that a copy is equal to…

```
a = [4.25, 18.5, 22.5, 13.7, 25.4]
b = [4.25, 18.5, 22.5, 13.7, 25.4]
a == b
```

True

…but not identical to (does not correspond to the same object in memory as) the original:

```
a is b
```

False

Whereas if both variables point to the same object in memory we get both **equality** and **identity**:

```
a = [4.25, 18.5, 22.5, 13.7, 25.4]
b = a
print(a == b)
print(a is b)
```

True

True

We can also check this by examining the `id` of the object, which in CPython is equal to its address in memory:

```
id(a)
```

1401550383104

```
id(b)
```

1401550383104

Mutable objects, such as lists, may therefore be a source of subtle and difficult-to-track bugs. They are less safe than **immutable** objects, which cannot be modified after construction. Mutable objects are particularly dangerous in multi-threaded environments where code runs in parallel.

Fortunately, Python has a built-in data structure, which is very similar to a list, but immutable: a **tuple**. We create a tuple instead of a list by using round brackets instead of square brackets:

```
a = (4.25, 18.5, 22.5, 13.7, 25.4)
type(a)
```

tuple

Alternatively, we may cast a list to a tuple:

```
a = tuple([4.25, 18.5, 22.5, 13.7, 25.4])
type(a)
```

tuple

Once a tuple has been created, it cannot be modified: `a[0] = 3.57` will raise an error, and tuples have no mutating methods, so a call such as `a.append(3.57)` fails as well.
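A brief sketch of these failures: item assignment on a tuple raises a `TypeError`, and the `append` method simply does not exist. To "extend" a tuple we build a new one, here by concatenation. The name `b` is ours.

```python
a = (4.25, 18.5, 22.5, 13.7, 25.4)

# Item assignment is not supported by tuples.
try:
    a[0] = 3.57
except TypeError as error:
    print('TypeError:', error)

# Tuples have no append method.
print(hasattr(a, 'append'))  # False

# Concatenation creates a new tuple; the original is unchanged.
b = a + (3.57,)
print(len(a), len(b))  # 5 6
```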

Notice that

```
(3)
```

3

is interpreted as the number 3, whereas

```
(3,)
```

(3,)

is interpreted as a tuple containing a single element, the number 3.

#### Exercise¶

Write a single function that will return the two roots of a given quadratic equation as a tuple. Test your function on the quadratic equation $2x^2 -3x + \frac{1}{2} = 0$.

#### Exercise¶

Set the variable `a` to `15` and the variable `b` to `7`. Then, without typing in any digits, *without using arithmetic, and without introducing any new variables*, swap the values of the two variables, so that `a` becomes equal to `7` and `b` to `15`. Hint: use tuples.

#### Shallow and deep copies¶

Having mentioned the copying of objects, we should point out that Python offers two variants of copy: the **shallow** copy and the **deep** copy.

The shallow variant copies the object but not its elements; elements of the original data structure are still referenced. For example:

```
a = (['one', 'two', 'three'], [0, 1, 2, 3, 4, 5])
```

```
import copy
a_copy = copy.copy(a)
```

```
a_copy
```

(['one', 'two', 'three'], [0, 1, 2, 3, 4, 5])

```
a_copy[0].append('four')
```

While we cannot change the tuple itself, since the tuple is immutable, we can change the tuple’s elements, which in this particular case are mutable. The zeroth element of `a_copy` has changed:

```
a_copy
```

(['one', 'two', 'three', 'four'], [0, 1, 2, 3, 4, 5])

And, because `a_copy` is a shallow copy of `a`, the zeroth element of `a` has also changed:

```
a
```

(['one', 'two', 'three', 'four'], [0, 1, 2, 3, 4, 5])

This isn’t the case for the deep copy:

```
a_deep_copy = copy.deepcopy(a)
```

```
a
```

(['one', 'two', 'three', 'four'], [0, 1, 2, 3, 4, 5])

```
a_deep_copy
```

(['one', 'two', 'three', 'four'], [0, 1, 2, 3, 4, 5])

```
a_deep_copy[0].append('five')
```

```
a_deep_copy
```

(['one', 'two', 'three', 'four', 'five'], [0, 1, 2, 3, 4, 5])

Notice that the zeroth element of the original `a` has not changed:

```
a
```

(['one', 'two', 'three', 'four'], [0, 1, 2, 3, 4, 5])

Since we took a deep copy of `a` to produce `a_deep_copy`, `a_deep_copy[0]` and `a[0]` are distinct objects:

```
id(a[0])
```

1401549898496

```
id(a_deep_copy[0])
```

1401550384640
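The same comparison can be made directly with the `is` operator. The sketch below assumes a hypothetical nested list `nested` (a list of lists rather than the tuple used above) and contrasts how a shallow and a deep copy treat the inner lists.

```python
import copy

# Hypothetical nested structure: a list containing two lists.
nested = [['one', 'two'], [0, 1, 2]]

shallow = copy.copy(nested)
deep = copy.deepcopy(nested)

# The shallow copy is a new outer list...
print(shallow is nested)        # False
# ...whose elements are the very same inner lists.
print(shallow[0] is nested[0])  # True
# The deep copy has its own inner lists.
print(deep[0] is nested[0])     # False

# A change to an inner list is visible through the shallow copy only.
nested[0].append('three')
print(shallow[0])  # ['one', 'two', 'three']
print(deep[0])     # ['one', 'two']
```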

#### Dictionaries¶

Python **dictionaries** are powerful abstractions that let us define **key-value pairs**. In other programming languages, such abstractions are also known as **maps**. We define dictionaries by using the following notation:

```
book = {
    'authors': 'Michael Berthold',
    'title': 'Intelligent Data Analysis',
    'publisher': 'Springer',
    'year': 2003
}
```

In this dictionary, the **keys** `'authors'`, `'title'`, `'publisher'`, and `'year'` correspond to the **values** `'Michael Berthold'`, `'Intelligent Data Analysis'`, `'Springer'`, and `2003`, respectively.

Data structures can be **nested**. For example, the value in a dictionary may itself be a data structure, such as a list:

```
book = {
    'authors': ['Michael Berthold', 'David J. Hand'],
    'title': 'Intelligent Data Analysis',
    'publisher': 'Springer',
    'year': 2003
}
```

We can index the dictionary using the `[]` notation:

```
book['authors']
```

['Michael Berthold', 'David J. Hand']
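Indexing with a key that is not in the dictionary raises a `KeyError`. As a sketch, using a shortened version of the dictionary and a hypothetical missing key `'isbn'`, we can test for membership with `in` or supply a fallback with the `get` method:

```python
# Shortened version of the book dictionary, for illustration.
book = {'title': 'Intelligent Data Analysis', 'year': 2003}

# Membership tests apply to the keys.
print('year' in book)  # True
print('isbn' in book)  # False

# get() returns None, or a default of our choice, for a missing key.
print(book.get('isbn'))             # None
print(book.get('isbn', 'unknown'))  # unknown

# Plain indexing with a missing key raises a KeyError.
try:
    book['isbn']
except KeyError as error:
    print('missing key:', error)
```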

Notice that dictionaries are mutable:

```
my_dict = {1:'one',2:'two',3:'three'}
print(my_dict[1])
my_dict[4] = 'four'
print(my_dict)
```

one

{1: 'one', 2: 'two', 3: 'three', 4: 'four'}

Let’s see how we could define a toy dataset of financial time-series:

```
my_dict = {
    'AAPL': [200, 201, 200.1, 205],
    'GOOG': [700, 750, 640, 720],
    'AMZN': [900, 850, 920, 910]
}
```

```
my_dict
```

{'AAPL': [200, 201, 200.1, 205], 'GOOG': [700, 750, 640, 720], 'AMZN': [900, 850, 920, 910]}

Here, each value is a list of asset prices, e.g.

```
my_dict['AMZN']
```

[900, 850, 920, 910]

#### Sets¶

Sets are defined using the syntax

```
s = {'red', 'green', 'blue', 'red', 'red', 'green', 'blue'}
```

Alternatively, a set can be constructed from another collection (such as the list in the following example) using the `set` constructor:

```
s = set(['red', 'green', 'blue', 'red', 'red', 'green', 'blue'])
```

Unlike in lists, repeated elements in a set count as one:

```
s
```

{'blue', 'green', 'red'}

```
len(s)
```

3

Sets are unordered, so it doesn’t make sense to talk about the indices of their elements. An element is either present in or absent from the set:

```
'green' in s
```

True

```
'purple' in s
```

False

Sets are mutable:

```
s.add('cyan')
s
```

{'blue', 'cyan', 'green', 'red'}

We can consider **unions** of sets…

```
{'red', 'green', 'blue', 'red', 'red', 'green', 'blue'}.union({'purple', 'green', 'yellow'})
```

{'blue', 'green', 'purple', 'red', 'yellow'}

…**intersections** of sets…

```
{'red', 'green', 'blue', 'red', 'red', 'green', 'blue'}.intersection({'purple', 'green', 'yellow'})
```

{'green'}

…as well as set **differences**:

```
{'red', 'green', 'blue', 'red', 'red', 'green', 'blue'}.difference({'purple', 'green', 'yellow'})
```

{'blue', 'red'}
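Python also provides operator forms of these three methods: `|` for union, `&` for intersection, and `-` for difference. A short sketch, with the set names `colours` and `others` chosen for illustration (we sort the results only to print them in a predictable order):

```python
colours = {'red', 'green', 'blue'}
others = {'purple', 'green', 'yellow'}

# Each operator gives the same result as the corresponding method.
print(colours | others == colours.union(others))         # True
print(colours & others == colours.intersection(others))  # True
print(colours - others == colours.difference(others))    # True

print(sorted(colours | others))  # ['blue', 'green', 'purple', 'red', 'yellow']
print(sorted(colours & others))  # ['green']
print(sorted(colours - others))  # ['blue', 'red']
```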

#### Iteration¶

**Iteration** is the process of going through the elements of a collection. To facilitate iteration, we rely on **loops**, which allow us to evaluate blocks of code multiple times.

##### The `while` loop¶

The `while` loop is a basic loop that keeps executing its body for as long as some condition is `True` and stops when the condition becomes `False`:

```
a = True
while a:
    print('inside while loop')
    a = False
```

inside while loop

Here is another example:

```
x = 0
while x < 10:
    print(x)
    x = x + 1
```

0 1 2 3 4 5 6 7 8 9

We start by setting the variable `x` to 0. We then repeat the indented block, consisting of the lines `print(x)` and `x = x + 1`, while the condition `x < 10` holds. The second line in this block, `x = x + 1`, keeps incrementing the variable `x` by 1, so eventually the condition `x < 10` becomes false. Thus we get ten **iterations** of the loop. If we check the value of the variable `x` after we have left the loop, we find that it is

```
x
```

10

We can escape the loop by issuing a `break` statement:

```
a = 0
while True:
    a += 1
    print(a)
    if a == 10:
        break
```

1 2 3 4 5 6 7 8 9 10
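The `while True` and `break` pattern is handy when we do not know in advance how many iterations are needed. As an illustrative sketch (the variable names `power` and `steps` are ours), we can keep doubling a number until it exceeds 1000 and count the doublings:

```python
power = 1
steps = 0
while True:
    # Leave the loop as soon as the threshold is exceeded.
    if power > 1000:
        break
    power *= 2
    steps += 1

print(power, steps)  # 1024 10
```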

##### The `for` loop¶

We use `for` loops to iterate through lists, dictionaries, ranges and other data structures. The `for` loop goes through every element in the collection and performs a given task on that element. As an example, let us add up all elements of a range using a `for` loop:

```
a = 0
for i in range(10):
    a += i
print(a)
```

45
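A `for` loop over a dictionary visits its keys, while the `items` method yields key-value pairs. The sketch below assumes a cut-down version of the toy price dataset, named `prices`, and collects the last price of each series into a hypothetical dictionary `last_prices`.

```python
# Cut-down version of the toy dataset, for illustration.
prices = {
    'AAPL': [200, 201, 200.1, 205],
    'GOOG': [700, 750, 640, 720],
}

# Iterating over a dictionary directly yields its keys.
for ticker in prices:
    print(ticker)

# items() yields (key, value) pairs, which we unpack in the loop header.
last_prices = {}
for ticker, series in prices.items():
    last_prices[ticker] = series[-1]

print(last_prices)  # {'AAPL': 205, 'GOOG': 720}
```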

#### Exercise¶

Define

```
list_of_words = ['problems', 'worthy', 'of', 'attack', 'prove', 'their', 'worth', 'by', 'fighting', 'back']
```

Use a `for` loop to concatenate these words into a single string, separating them with spaces.