51042-notes/15.linked-lists-and-arrays.ipynb
2024-11-19 18:07:13 -06:00

637 lines
18 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "4d5e8ae9-592e-48b1-bf17-9923d267349c",
"metadata": {},
"source": [
"# Data Structures: Linked Lists vs. Arrays\n",
"\n",
"This week we're going to look at implementations of core data structures in Python.\n",
"\n",
"We'll start with two different ways to represent sequential data: **linked lists** & **arrays**.\n",
"\n",
"Python's `list` could have chosen either of these, and in some languages like Java or C++ users explicitly choose the implementation most suited to their needs.\n",
"\n",
"## Arrays\n",
"\n",
"Arrays are blocks of contiguous memory. \n",
"\n",
"Each block is the same size, so you can find the memory location of a given block\n",
"via `start_position + (idx * block_size)`. That will give the address of a given block, allowing **O(1)** access to any element.\n",
"\n",
"This means looking up the 0th element takes the same amount of time as the 1,000,00th. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e774f52f-63cc-4dcf-a8ec-e7ed9197f76f",
"metadata": {},
"outputs": [],
"source": [
"class Array:\n",
" \"\"\"\n",
" psuedocode class demonstrating array lookup \n",
" \"\"\"\n",
" def __init__(self, size, block_size=8):\n",
" # need a contiguous block of free memory\n",
" self.initial_memory_address = request_memory(amount=size*block_size)\n",
" # each \"cell\" in the array needs to be the same number of bytes\n",
" self.block_size = block_size\n",
" # we need to know how many cells we need\n",
" self.size = size\n",
" \n",
" def __getitem__(self, index):\n",
" return read_from_memory_address(\n",
" self.initial_memory_address + idx * self.block_size\n",
" )"
]
},
{
"cell_type": "markdown",
"id": "fcd44a0c-5fca-4fdb-af1b-b09bb3be5bca",
"metadata": {},
"source": [
"Python's `list` type is internally implemented as an array.\n",
"\n",
"- What happens when we need to grow the list?\n",
"- what does `list.append` do?\n",
"- what does `list.insert(0, 0)` (at the beginning) do?\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "3503e142-c26f-4598-8240-a11b9397147b",
"metadata": {},
"source": [
"## Linked Lists\n",
"\n",
"An alternative way to store sequential items is by using a linked list.\n",
"\n",
"Linked lists store individual elements and a pointer to the next element. This means that looking up the Nth element requires traversing the entire list.\n",
"\n",
"![](https://upload.wikimedia.org/wikipedia/commons/thumb/6/6d/Singly-linked-list.svg/408px-Singly-linked-list.svg.png)\n",
"\n",
"Linked lists can grow without bound, each new node can be allocated on the fly."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "cfb1a2af-b065-43cf-b7b0-3bfb046d9d85",
"metadata": {},
"outputs": [],
"source": [
"class Node:\n",
" def __init__(self, value, _next=None):\n",
" self.value = value\n",
" self.next = _next\n",
" self.prev = ..\n",
"\n",
"class LinkedList:\n",
" def __init__(self):\n",
" self.root = None\n",
"\n",
" def add(self, value):\n",
" if self.root is None:\n",
" # first value: special case\n",
" self.root = Node(value)\n",
" else:\n",
" cur = self.root\n",
" # traverse to end of list\n",
" while cur.next:\n",
" cur = cur.next\n",
" # drop a new node at the end of list\n",
" cur.next = Node(value)\n",
"\n",
" def __str__(self):\n",
" s = \"\"\n",
" cur = self.root\n",
" while cur:\n",
" s += f\"[{cur.value}] -> \"\n",
" cur = cur.next\n",
" s += \"END\"\n",
" return s"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2a796778-c06e-4585-b201-07061d25e561",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1] -> [3] -> [5] -> [7] -> END\n"
]
}
],
"source": [
"ll = LinkedList()\n",
"ll.add(1)\n",
"ll.add(3)\n",
"ll.add(5)\n",
"ll.add(7)\n",
"print(ll)"
]
},
{
"cell_type": "markdown",
"id": "99f36edd-93e8-4658-b5cd-41eed7d9b4ce",
"metadata": {},
"source": [
"### Optimizations\n",
"\n",
"Doubly linked lists, and more complicated internal pointer structures can lead to increased performance at cost of more memory/complexity.\n",
"\n",
"(Our first memory vs. runtime trade-off)\n",
"\n",
"`collections.deque` is a doubly linked list implementation in Python.\n"
]
},
{
"cell_type": "markdown",
"id": "0bfd80aa-4a8d-465c-9c29-aa1aba55c18c",
"metadata": {},
"source": [
"### Linked List vs. Array\n",
"\n",
"**Array**\n",
" \n",
"- Lookup: O(1)\n",
"- Append: O(1) unless at capacity, then expensive O(n) copy\n",
"- Insertion: O(n)\n",
"\n",
"Requires over-allocation of memory to gain efficiency.\n",
"\n",
"**Linked List** \n",
" \n",
"- Lookup: O(n)\n",
"- Append: O(1)\n",
"- Insertion: O(n)\n",
"\n",
"Requires pointer/node structure to gain efficiency."
]
},
{
"cell_type": "markdown",
"id": "9465bc5b-8d44-43bf-83c6-5ede55f722f5",
"metadata": {},
"source": [
"## Stack\n",
"\n",
"A stack is a last-in-first-out (LIFO) data structure that needs to primarily serve two operations: push, and pop.\n",
"\n",
"Both should be O(1).\n",
"\n",
"### Usage\n",
"\n",
"- Undo/Redo\n",
"- Analogy: stack of plates -- add to/take from the top\n",
"- Call Stacks\n",
"\n",
"Sometimes when we're writing code we talk about \"the stack\", which is the call stack of functions we're in & their scopes.\n",
"\n",
"```python\n",
"\n",
"def f():\n",
" ...\n",
" \n",
" \n",
"def g():\n",
" if ...:\n",
" f()\n",
" else:\n",
" ...\n",
"\n",
"def h():\n",
" y = g()\n",
" ...\n",
"```\n",
"\n",
"When we call h(), it is added to the call stack, then g is added, and f is added on top. We return from these functions in LIFO order, f() exits, then g(), then h().\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "6c73e9f9-17e4-4057-b5fb-d169d0db9b96",
"metadata": {},
"outputs": [],
"source": [
"class Stack:\n",
" def __init__(self):\n",
" self._data = []\n",
"\n",
" def push(self, item):\n",
" # remember: adding/removing at the end of the list is faster than the front\n",
" self._data.append(item)\n",
"\n",
" def pop(self):\n",
" return self._data.pop()\n",
"\n",
" def __len__(self):\n",
" return len(self._data)\n",
"\n",
" def __str__(self):\n",
" return \" TOP -> \" + \"\\n \".join(\n",
" f\"[ {item} ]\" for item in reversed(self._data)\n",
" )\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "1db244c8-9168-4c19-a73c-53a20c48485a",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'Stack' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[6], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m s \u001b[38;5;241m=\u001b[39m \u001b[43mStack\u001b[49m()\n\u001b[1;32m 2\u001b[0m s\u001b[38;5;241m.\u001b[39mpush(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mh()\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[1;32m 3\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[38;5;124m'\u001b[39m\u001b[38;5;130;01m\\n\u001b[39;00m\u001b[38;5;124mcalled h()\u001b[39m\u001b[38;5;124m'\u001b[39m)\n",
"\u001b[0;31mNameError\u001b[0m: name 'Stack' is not defined"
]
}
],
"source": [
"s = Stack()\n",
"s.push(\"h()\")\n",
"print('\\ncalled h()')\n",
"print(s)\n",
"print('\\nh called g()')\n",
"s.push(\"g()\")\n",
"print(s)\n",
"print('\\ng called f()')\n",
"s.push(\"f()\")\n",
"print(s)\n",
"print(\"\\nexited\", s.pop())\n",
"print(s)\n",
"print(\"\\nexited\", s.pop())\n",
"print(s)\n",
"print(\"\\nexited\", s.pop())\n",
"print(s)"
]
},
{
"cell_type": "markdown",
"id": "05fefd4e-4701-437b-b7e8-e5566e753708",
"metadata": {},
"source": [
"## Queue\n",
"\n",
"A queue is a first-in-first-out (FIFO) data structure that adds items to one end, and removes them from the other.\n",
"\n",
"We see queues all over the place in everyday life and computing. Most resources are accessed on a FIFO basis."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "4525594e-070b-4bcb-b4c7-a3641f514b2d",
"metadata": {},
"outputs": [],
"source": [
"class ArrayQueue:\n",
" def __init__(self, _iterable=None):\n",
" if _iterable:\n",
" self._data = list(_iterable)\n",
" else:\n",
" self._data = []\n",
"\n",
" def push(self, item):\n",
" # adding to the end of the list is faster than the front\n",
" self._data.append(item)\n",
"\n",
" def pop(self):\n",
" # only change from `Stack` is we remove from the other end\n",
" # this can be slower, why?\n",
" return self._data.pop(0)\n",
"\n",
" def __len__(self):\n",
" return len(self._data)\n",
"\n",
" def __repr__(self):\n",
" return \" TOP -> \" + \"\\n \".join(\n",
" f\"[ {item} ]\" for item in reversed(self._data)\n",
" )\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "20c843b5-9c64-45cb-af77-514b680cbd9c",
"metadata": {},
"outputs": [],
"source": [
"from collections import deque\n",
"\n",
"\n",
"class DequeQueue:\n",
" def __init__(self, _iterable=None):\n",
" if _iterable:\n",
" self._data = deque(_iterable)\n",
" else:\n",
" self._data = deque()\n",
"\n",
" def push(self, item):\n",
" self._data.append(item)\n",
"\n",
" def pop(self):\n",
" return self._data.popleft()\n",
"\n",
" def __len__(self):\n",
" return len(self._data)\n",
"\n",
" def __repr__(self):\n",
" return \" TOP -> \" + \"\\n \".join(\n",
" f\"[ {item} ]\" for item in reversed(self._data)\n",
" )\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "3af2e4e7-e600-4201-859b-9b36751a5247",
"metadata": {},
"source": [
"## Performance Testing\n",
"\n",
"We can try to measure performance something takes by measuring how much time has passed.\n",
"\n",
"```python\n",
"import time\n",
"\n",
"before = time.time()\n",
"# do something\n",
"after = time.time()\n",
"elapsed = before - after\n",
"```\n",
"\n",
"Issue is that in practice, times involved are miniscule, and other events on the system will overshadow differences."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a414509d-84bb-4ae7-a105-2c6b9dcb7406",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"def print_elapsed(func):\n",
" def newfunc(*args, **kwargs):\n",
" before = time.time()\n",
" retval = func(*args, **kwargs)\n",
" elapsed = time.time() - before\n",
" print(f\"Took {elapsed} sec to run {func.__name__}\")\n",
"\n",
" return newfunc\n",
"\n",
"@print_elapsed\n",
"def testfunc(QueueCls):\n",
" queue = QueueCls()\n",
" for item in range(1000):\n",
" queue.push(item)\n",
" while queue:\n",
" queue.pop()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "f1842168-b12d-4d1c-a004-10614904f3f2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Took 0.0012509822845458984 sec to run testfunc\n"
]
}
],
"source": [
"testfunc(ArrayQueue)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "19a3b107-da59-40f3-bafb-70160747fb53",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Took 0.001255035400390625 sec to run testfunc\n"
]
}
],
"source": [
"testfunc(DequeQueue)"
]
},
{
"cell_type": "markdown",
"id": "0cd093da-0a2c-4adb-96b8-bd56a874f6f2",
"metadata": {},
"source": [
"The differences are just too small to be sure. We need to run our code many more times.\n",
"\n",
"Python has a built in module for this, `timeit`.\n",
"\n",
"```python\n",
"import timeit\n",
"\n",
"timeit.timeit(stmt='pass', setup='pass', timer=<default timer>, number=1000000, globals=None)\n",
"\n",
"# for more: https://docs.python.org/3/library/timeit.html\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "93010cc5-55c9-4308-b0d0-8fac623f2c55",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000000x ArrayQueue.push, took 0.11357445899921004\n",
"1000000x DequeQueue.push, took 0.06381245900047361\n",
"DequeQueue is 43.814% less time\n"
]
}
],
"source": [
"import timeit\n",
"\n",
"number = 1_000_000\n",
"\n",
"elapsed = timeit.timeit(\n",
" \"queue.push(1)\",\n",
" setup=\"queue = QueueCls()\",\n",
" globals={\"QueueCls\": ArrayQueue},\n",
" number=number,\n",
")\n",
"elapsed2 = timeit.timeit(\n",
" \"queue.push(1)\",\n",
" setup=\"queue = QueueCls()\",\n",
" globals={\"QueueCls\": DequeQueue},\n",
" number=number,\n",
")\n",
"print(f\"{number}x ArrayQueue.push, took\", elapsed)\n",
"print(f\"{number}x DequeQueue.push, took\", elapsed2)\n",
"print(f\"DequeQueue is {(elapsed-elapsed2) / elapsed * 100:.3f}% less time\")\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "f40f7691-599c-49f9-8f08-30aa14c1e90a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10000x ArrayQueue.pop, took 16.82198095900094\n",
"10000x DequeQueue.pop, took 0.0005862910002178978\n",
"DequeQueue is 99.997% less time\n"
]
}
],
"source": [
"number = 10_000\n",
"\n",
"elapsed = timeit.timeit(\n",
" \"queue.pop()\",\n",
" setup=\"queue = QueueCls([0] * 10000000)\",\n",
" globals={\"QueueCls\": ArrayQueue},\n",
" number=number,\n",
")\n",
"elapsed2 = timeit.timeit(\n",
" \"queue.pop()\",\n",
" setup=\"queue = QueueCls([0] * 10000000)\",\n",
" globals={\"QueueCls\": DequeQueue},\n",
" number=number,\n",
")\n",
"print(f\"{number}x ArrayQueue.pop, took\", elapsed)\n",
"print(f\"{number}x DequeQueue.pop, took\", elapsed2)\n",
"print(f\"DequeQueue is {(elapsed-elapsed2) / elapsed * 100:.3f}% less time\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "4333f0dd-846e-44fa-8aa0-9c3d883c165e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"58.76708083400081"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"timeit.timeit(\"''.join(['a', 'b', 'c', 'd'])\", number=1000000000)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "5a4a6e0d-0e02-42bf-8d53-06b09ce1a5dd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5.945709666997573"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"timeit.timeit(\"d = ('apple'*5) + 'banana' + 'c' + 'd'\", number=1000000000)"
]
},
{
"cell_type": "markdown",
"id": "b20fc917-73d2-4378-ae0a-a2119883efa3",
"metadata": {},
"source": [
"### Queue Performance\n",
"\n",
"| Operation | ArrayQueue | DequeQueue |\n",
"| --------- | ---------- | ---------- |\n",
"| push | O(1) | O(1) |\n",
"| pop | O(n) | O(1) |\n"
]
},
{
"cell_type": "markdown",
"id": "c44f9af3-4a0a-4dad-8bf6-4ec3ecc7d3f1",
"metadata": {},
"source": [
"\n",
"\n",
"For a Stack, an array or linked list can both give O(1) performance.\n",
"\n",
"For a Queue, a (doubly) linked list is necessary.\n",
"\n",
"But arrays are superior for indexing operations. And *typical* code indexes list far more than it appends/inserts. Depending on your needs Python's `list` implementation may not be the optimal data structure."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6c3fd9d-12a3-49a2-aba4-257d8a6eb829",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}