# Data Structures: Linked Lists vs. Arrays

This week we're going to look at implementations of core data structures in Python.

We'll start with two different ways to represent sequential data: **linked lists** & **arrays**.

Python's `list` could have been built on either of these, and in languages like Java or C++, users explicitly choose the implementation best suited to their needs.

## Arrays

Arrays are blocks of contiguous memory. 

Each block is the same size, so you can find the memory location of a given block
via `start_position + (idx * block_size)`.  That will give the address of a given block, allowing **O(1)** access to any element.

This means looking up the 0th element takes the same amount of time as the 1,000,000th.

In [None]:
class Array:
    """
    pseudocode class demonstrating array lookup
    (`request_memory` and `read_from_memory_address` are placeholders, not real functions)
    """
    def __init__(self, size, block_size=8):
        # need a contiguous block of free memory
        self.initial_memory_address = request_memory(amount=size*block_size)
        # each "cell" in the array needs to be the same number of bytes
        self.block_size = block_size
        # we need to know how many cells we need
        self.size = size

    def __getitem__(self, index):
        # address of cell = start of block + (index * bytes per cell)
        return read_from_memory_address(
            self.initial_memory_address + index * self.block_size
        )
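
The pseudocode above can be approximated in runnable Python by treating a `bytearray` as the contiguous block of memory. This is only a sketch under stated assumptions: `ByteArrayBackedArray` is a hypothetical name, every cell is assumed to hold a signed 8-byte integer packed with `struct`, and the "start position" is simply offset 0 of the buffer.

```python
import struct


class ByteArrayBackedArray:
    """Fixed-size array of signed 64-bit ints stored in one contiguous buffer."""

    CELL_FORMAT = "<q"  # little-endian signed 8-byte integer (an assumption)

    def __init__(self, size):
        self.block_size = struct.calcsize(self.CELL_FORMAT)  # 8 bytes per cell
        self.size = size
        # one contiguous, zero-filled block standing in for raw memory
        self._memory = bytearray(size * self.block_size)

    def _offset(self, index):
        if not 0 <= index < self.size:
            raise IndexError("index out of range")
        # same arithmetic as start_position + (idx * block_size), with start_position = 0
        return index * self.block_size

    def __getitem__(self, index):
        return struct.unpack_from(self.CELL_FORMAT, self._memory, self._offset(index))[0]

    def __setitem__(self, index, value):
        struct.pack_into(self.CELL_FORMAT, self._memory, self._offset(index), value)


arr = ByteArrayBackedArray(4)
arr[2] = 42
print(arr[0], arr[2])  # prints: 0 42
```

Reading or writing any cell costs one multiplication and one buffer access, regardless of the index.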

Python's `list` type is internally implemented as an array.

- What happens when we need to grow the list?
- What does `list.append` do?
- What does `list.insert(0, 0)` (inserting at the beginning) do?
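
One way to explore the first two questions is to watch how much memory a list occupies as it grows. The sketch below uses `sys.getsizeof`; the exact byte counts are a CPython implementation detail and vary between versions, so none are shown here.

```python
import sys

items = []
sizes = []
for i in range(64):
    items.append(i)
    # size of the list object itself, including any spare capacity
    sizes.append(sys.getsizeof(items))

# The size only changes when the list has to reallocate. Far fewer than 64
# distinct sizes appear because each reallocation reserves spare capacity.
resize_count = len(set(sizes))
print(f"{len(items)} appends, {resize_count} distinct sizes")
```

Most appends land in already-reserved space, which is why they are cheap. `list.insert(0, 0)` has no such shortcut: every existing element has to shift one cell to the right.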



## Linked Lists

An alternative way to store sequential items is by using a linked list.

Linked lists store individual elements, each with a pointer to the next element.  This means that looking up the Nth element requires walking from the head of the list through N nodes; there is no way to jump straight to it.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Singly-linked-list.svg/408px-Singly-linked-list.svg.png)

Linked lists can grow one node at a time: each new node can be allocated on the fly, with no need to reserve space up front.

In [1]:
class Node:
    def __init__(self, value, _next=None):
        self.value = value
        self.next = _next
        # a doubly linked list would also keep a `prev` pointer here

class LinkedList:
    def __init__(self):
        self.root = None

    def add(self, value):
        if self.root is None:
            # first value: special case
            self.root = Node(value)
        else:
            cur = self.root
            # traverse to end of list
            while cur.next:
                cur = cur.next
            # drop a new node at the end of list
            cur.next = Node(value)

    def __str__(self):
        s = ""
        cur = self.root
        while cur:
            s += f"[{cur.value}] -> "
            cur = cur.next
        s += "END"
        return s

In [2]:
ll = LinkedList()
ll.add(1)
ll.add(3)
ll.add(5)
ll.add(7)
print(ll)

[1] -> [3] -> [5] -> [7] -> END
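
Looking up an element by position means following `next` pointers one at a time. The sketch below shows this with a standalone `get` function (a hypothetical helper, not part of the class above) and its own copy of `Node` so it runs by itself.

```python
class Node:
    def __init__(self, value, _next=None):
        self.value = value
        self.next = _next


def get(root, index):
    """Return the value at position `index`, walking from `root`."""
    if index < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    cur = root
    # follow `next` pointers `index` times: this loop is the O(n) cost
    for _ in range(index):
        if cur is None:
            break
        cur = cur.next
    if cur is None:
        raise IndexError("index out of range")
    return cur.value


# build [1] -> [3] -> [5] -> [7] by hand
root = Node(1, Node(3, Node(5, Node(7))))
print(get(root, 2))  # prints: 5
```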


### Optimizations

Doubly linked lists and more complicated internal pointer structures can improve performance at the cost of more memory and complexity.

(Our first memory vs. runtime trade-off)

`collections.deque` is Python's built-in double-ended queue; in CPython it is implemented as a doubly linked list of fixed-size blocks.


### Linked List vs. Array

**Array**
  
- Lookup: O(1)
- Append: O(1) amortized; when the array is at capacity, an expensive O(n) copy is needed
- Insertion: O(n)

Requires over-allocation of memory to gain efficiency.

**Linked List** 
    
- Lookup: O(n)
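- Append: O(1) if the list keeps a pointer to its last node (the `add` method above walks the whole list, so it is O(n))
- Insertion: O(n) to reach the position, then O(1) to relink the nodes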

Requires pointer/node structure to gain efficiency.
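
A minimal sketch of the "pointer structure" trade-off: keeping one extra reference to the last node makes append O(1). `TailLinkedList` and `to_list` are hypothetical names used only for this illustration.

```python
class Node:
    def __init__(self, value, _next=None):
        self.value = value
        self.next = _next


class TailLinkedList:
    def __init__(self):
        self.root = None
        self.tail = None  # always the last node, or None when the list is empty

    def add(self, value):
        node = Node(value)
        if self.root is None:
            # first value: it is both the first and the last node
            self.root = node
        else:
            # no traversal needed: link directly after the current tail
            self.tail.next = node
        self.tail = node

    def to_list(self):
        values = []
        cur = self.root
        while cur:
            values.append(cur.value)
            cur = cur.next
        return values


tll = TailLinkedList()
for v in (1, 3, 5, 7):
    tll.add(v)
print(tll.to_list())  # prints: [1, 3, 5, 7]
```

The cost is one more pointer to store and keep consistent on every operation that changes the end of the list.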

## Stack

A stack is a last-in-first-out (LIFO) data structure that primarily needs to serve two operations: push and pop.

Both should be O(1).

### Usage

- Undo/Redo
- Analogy: stack of plates -- add to/take from the top
- Call Stacks

Sometimes when we're writing code we talk about "the stack", which is the call stack of functions we're in & their scopes.

```python

def f():
    ...
    
    
def g():
    if ...:
        f()
    else:
        ...

def h():
    y = g()
    ...
```

When we call h(), it is added to the call stack, then g is added, and f is added on top.  We return from these functions in LIFO order: f() exits, then g(), then h().
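
Python can show its own call stack. The sketch below uses the standard library's `inspect.stack()` with simplified versions of `f`, `g`, and `h` (here `g` always calls `f`) to list the functions currently on the stack, innermost first.

```python
import inspect


def f():
    # inspect.stack() lists frames from the innermost call outward
    return [frame.function for frame in inspect.stack()]


def g():
    return f()


def h():
    return g()


call_stack = h()
print(call_stack[:3])  # prints: ['f', 'g', 'h']
```

Entries after the first three depend on how the code was launched (script, notebook, test runner), so only the top of the stack is printed.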


In [3]:
class Stack:
    def __init__(self):
        self._data = []

    def push(self, item):
        # remember: adding/removing at the end of the list is faster than the front
        self._data.append(item)

    def pop(self):
        return self._data.pop()

    def __len__(self):
        return len(self._data)

    def __str__(self):
        return " TOP -> " + "\n        ".join(
            f"[ {item} ]" for item in reversed(self._data)
        )


In [6]:
s = Stack()
s.push("h()")
print('\ncalled h()')
print(s)
print('\nh called g()')
s.push("g()")
print(s)
print('\ng called f()')
s.push("f()")
print(s)
print("\nexited", s.pop())
print(s)
print("\nexited", s.pop())
print(s)
print("\nexited", s.pop())
print(s)


## Queue

A queue is a first-in-first-out (FIFO) data structure that adds items to one end, and removes them from the other.

We see queues all over the place in everyday life and computing.  Most resources are accessed on a FIFO basis.

In [1]:
class ArrayQueue:
    def __init__(self, _iterable=None):
        if _iterable:
            self._data = list(_iterable)
        else:
            self._data = []

    def push(self, item):
        # adding to the end of the list is faster than the front
        self._data.append(item)

    def pop(self):
        # only change from `Stack` is we remove from the other end
        # this can be slower, why?
        return self._data.pop(0)

    def __len__(self):
        return len(self._data)

    def __repr__(self):
        return " TOP -> " + "\n        ".join(
            f"[ {item} ]" for item in reversed(self._data)
        )



In [2]:
from collections import deque


class DequeQueue:
    def __init__(self, _iterable=None):
        if _iterable:
            self._data = deque(_iterable)
        else:
            self._data = deque()

    def push(self, item):
        self._data.append(item)

    def pop(self):
        return self._data.popleft()

    def __len__(self):
        return len(self._data)

    def __repr__(self):
        return " TOP -> " + "\n        ".join(
            f"[ {item} ]" for item in reversed(self._data)
        )



## Performance Testing

We can try to measure how long something takes by recording the time before and after it runs.

```python
import time

before = time.time()
# do something
after = time.time()
elapsed = after - before
```

The issue is that in practice the times involved are minuscule, and other events on the system will overshadow the differences.

In [3]:
import time

def print_elapsed(func):
    def newfunc(*args, **kwargs):
        before = time.time()
        retval = func(*args, **kwargs)
        elapsed = time.time() - before
        print(f"Took {elapsed} sec to run {func.__name__}")
        # pass the wrapped function's result back to the caller
        return retval

    return newfunc

@print_elapsed
def testfunc(QueueCls):
    queue = QueueCls()
    for item in range(1000):
        queue.push(item)
    while queue:
        queue.pop()

In [4]:
testfunc(ArrayQueue)

Took 0.0012509822845458984 sec to run testfunc


In [5]:
testfunc(DequeQueue)

Took 0.001255035400390625 sec to run testfunc


The differences are just too small to be sure.  We need to run our code many more times.

Python has a built-in module for this, `timeit`.

```python
import timeit

timeit.timeit(stmt='pass', setup='pass', timer=<default timer>, number=1000000, globals=None)

# for more: https://docs.python.org/3/library/timeit.html
```
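
As a small illustration of the call shape, the sketch below uses `timeit.repeat`, which runs the whole measurement several times. The statement being timed (`sum` over a short list) is an arbitrary stand-in chosen for this example.

```python
import timeit

loops = 10_000

# repeat() returns one total time per run; the minimum is usually the least
# noisy estimate, since other activity on the system can only add time.
runs = timeit.repeat(
    stmt="total = sum(values)",
    setup="values = list(range(100))",
    repeat=5,
    number=loops,
)
best = min(runs)
print(f"best of {len(runs)} runs: {best / loops:.2e} sec per loop")
```

The `setup` string runs once per run and is not included in the measured time.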

In [6]:
import timeit

number = 1_000_000

elapsed = timeit.timeit(
    "queue.push(1)",
    setup="queue = QueueCls()",
    globals={"QueueCls": ArrayQueue},
    number=number,
)
elapsed2 = timeit.timeit(
    "queue.push(1)",
    setup="queue = QueueCls()",
    globals={"QueueCls": DequeQueue},
    number=number,
)
print(f"{number}x ArrayQueue.push, took", elapsed)
print(f"{number}x DequeQueue.push, took", elapsed2)
print(f"DequeQueue is {(elapsed-elapsed2) / elapsed * 100:.3f}% less time")


1000000x ArrayQueue.push, took 0.11357445899921004
1000000x DequeQueue.push, took 0.06381245900047361
DequeQueue is 43.814% less time


In [7]:
number = 10_000

elapsed = timeit.timeit(
    "queue.pop()",
    setup="queue = QueueCls([0] * 10000000)",
    globals={"QueueCls": ArrayQueue},
    number=number,
)
elapsed2 = timeit.timeit(
    "queue.pop()",
    setup="queue = QueueCls([0] * 10000000)",
    globals={"QueueCls": DequeQueue},
    number=number,
)
print(f"{number}x ArrayQueue.pop, took", elapsed)
print(f"{number}x DequeQueue.pop, took", elapsed2)
print(f"DequeQueue is {(elapsed-elapsed2) / elapsed * 100:.3f}% less time")

10000x ArrayQueue.pop, took 16.82198095900094
10000x DequeQueue.pop, took 0.0005862910002178978
DequeQueue is 99.997% less time


In [16]:
timeit.timeit("''.join(['a', 'b', 'c', 'd'])", number=1000000000)

58.76708083400081

In [15]:
timeit.timeit("d = ('apple'*5) + 'banana' + 'c' + 'd'", number=1000000000)

5.945709666997573

### Queue Performance

| Operation | ArrayQueue | DequeQueue |
| --------- | ---------- | ---------- |
| push      | O(1)       | O(1)       |
| pop       | O(n)       | O(1)       |
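
The O(n) versus O(1) difference for `pop` can be checked empirically by timing removals from the front of containers of different sizes. This is a rough sketch: `time_front_pops` is a hypothetical helper, it times the built-in `list` and `deque` directly rather than the wrapper classes, and the numbers will vary by machine, so no output is shown.

```python
import timeit
from collections import deque


def time_front_pops(container_cls, size, pops=1_000):
    """Time `pops` removals from the front of a container holding `size` items."""
    pop_front = "data.popleft()" if container_cls is deque else "data.pop(0)"
    return timeit.timeit(
        pop_front,
        setup="data = container_cls(range(size))",
        globals={"container_cls": container_cls, "size": size},
        number=pops,
    )


for size in (10_000, 100_000, 1_000_000):
    list_time = time_front_pops(list, size)
    deque_time = time_front_pops(deque, size)
    print(f"n={size:>9,}: list {list_time:.5f}s  deque {deque_time:.5f}s")
```

If the table above holds, the `list` column should grow roughly in step with `n` while the `deque` column stays about flat.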




For a Stack, either an array or a linked list can give O(1) push and pop.

For a Queue, a plain array is not enough, because removing from the front is O(n); a (doubly) linked list gives O(1) at both ends.

But arrays are superior for indexing operations, and *typical* code indexes lists far more than it appends or inserts. Depending on your needs, Python's `list` implementation may not be the optimal data structure.