51042-notes/hashtables.ipynb
2024-11-18 09:45:21 -06:00

402 lines
10 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"id": "a6c3f443-273b-4f49-9339-902e3251e0e4",
"metadata": {},
"source": [
"# Hash Tables\n",
"\n",
"A hash table is a collection that maps keys to values. Python's `dict` is an implementation of a hash table.\n",
"\n",
"For the course project, you will be implementing a hash table without using `dict`.\n",
"\n",
"### Hash Table Performance\n",
"\n",
"| Operation | Average | Worst Case | \n",
"| --------- | ---- | ---------- |\n",
"| lookup | O(1) | O(n) |\n",
"| insert | O(1) | O(n) |\n",
"| delete | O(1) | O(n) |\n",
"\n",
"Note: These are average case, as we'll see, depending on implementation, worst case can be much worse.\n",
"\n",
"A key property for hash tables is that we **do not need to linearly search through them for our data**.\n",
"\n",
"If you find yourself scanning every element in a hash table, you're doing something wrong.\n",
"\n",
"### Example\n",
"\n",
"Let's first model a simple hashtable with fixed capacity of 10. For simplicity we'll stick to string keys.\n",
"\n",
"```\n",
"[ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]\n",
"```\n",
"\n",
"When we get a key-value pair, we need to assign it a bucket.\n",
"\n",
"How can we write a function that takes a string and assigns it to a bucket?\n",
"\n",
"1. Turn string into a number. **Hash Function**\n",
"2. Take (number % capacity)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "450367cc-fbd5-4b97-89d6-37f69cba5d91",
"metadata": {},
"outputs": [],
"source": [
"def strhash(key):\n",
" # ord converts a character to it's numeric representation\n",
" # ord(\"A\") == 65\n",
" # ord(\"z\") == 122\n",
" # etc.\n",
" return sum(ord(letter) for letter in key)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e689c2aa-a3a5-4400-b7bb-557944a725da",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"410"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"strhash(\"bear\")"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "bb545dc8-7db6-46a0-bbc3-716f6bf3639b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"333"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"strhash(\"fox\")"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "0ef489a6-7f2c-46aa-aa2e-3ff10241eb46",
"metadata": {},
"outputs": [],
"source": [
"class Hashtable:\n",
" def __init__(self, capacity=10):\n",
" self._table = [None] * capacity\n",
" self.capacity = capacity\n",
"\n",
" def __setitem__(self, key, value):\n",
" index = strhash(key) % self.capacity\n",
" self._table[index] = value\n",
"\n",
" def __getitem__(self, key):\n",
" index = strhash(key) % self.capacity\n",
" value = self._table[index]\n",
" if value:\n",
" return value\n",
" else:\n",
" raise KeyError(key)\n",
"\n",
" def display(self):\n",
" print(f\"Capacity = {self.capacity}\")\n",
" for idx, elem in enumerate(self._table):\n",
" if elem:\n",
" print(f\"{idx}: {elem}\")"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "8f73d5b3-1fbe-4439-a19f-3c1c73cf1e04",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Capacity = 10\n",
"0: 3\n",
"3: 12\n"
]
}
],
"source": [
"h = Hashtable()\n",
"h[\"bear\"] = 3\n",
"h[\"fox\"] = 12\n",
"h.display()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "1e68e2ee-9feb-4623-88a3-03b16d7fedd4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3\n"
]
}
],
"source": [
"print(h[\"bear\"])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "1535a473-c2e5-474f-8a80-e89ff3bd00bd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"strhash(\"bear\") == strhash(\"been\") # different word, same hash!"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "3458faf2-cc40-4e31-a792-7a1515f718e7",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Capacity = 10\n",
"0: 12\n"
]
}
],
"source": [
"h = Hashtable()\n",
"h[\"bear\"] = 33\n",
"h[\"been\"] = 12\n",
"h.display()"
]
},
{
"cell_type": "markdown",
"id": "b23a0b1f-1e59-4d83-aa4c-2c57d5706298",
"metadata": {},
"source": [
"### Linear Probing\n",
"\n",
"One solution is to just walk forward in the storage list, until we find an empty space.\n",
"\n",
"Either way, we'll need to start storing the key as well. Let's revise our class:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6a5dd148-07ce-4e02-998a-75b9edce31f7",
"metadata": {},
"outputs": [],
"source": [
"class Hashtable:\n",
" def __init__(self, capacity=100):\n",
" self._table = [None] * 100\n",
" self.capacity = capacity\n",
"\n",
" def __setitem__(self, key, value):\n",
" index = strhash(key) % self.capacity\n",
" while self._table[index] is not None:\n",
" index += 1\n",
" # Handling wrap-around omitted for brevity\n",
"\n",
" # we now store the key and value\n",
" self._table[index] = (key, value)\n",
"\n",
" def __getitem__(self, key):\n",
" index = strhash(key) % self.capacity\n",
"\n",
" # walk forward until we either reach the item or an empty space\n",
" while self._table[index] is not None:\n",
" if self._table[index][0] == key:\n",
" return self._table[index][1]\n",
" index += 1\n",
" # Handling wrap-around omitted for brevity\n",
"\n",
" # if code got here, the item wasn't in the list\n",
" raise KeyError(key)\n",
"\n",
" def display(self):\n",
" print(f\"Capacity = {self.capacity}\")\n",
" for idx, elem in enumerate(self._table):\n",
" if elem:\n",
" print(f\"{idx}: {elem}\")"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "3251b310-dfae-4aae-9d04-603a5e20dbea",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Capacity = 10\n",
"0: 12\n"
]
}
],
"source": [
"h = Hashtable()\n",
"h[\"bear\"] = 33\n",
"h[\"been\"] = 12\n",
"h.display()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "781024b5-a81d-46be-8c87-84f3860395c6",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"12\n",
"12\n"
]
}
],
"source": [
"print(h[\"bear\"])\n",
"print(h[\"been\"])"
]
},
{
"cell_type": "markdown",
"id": "f90d7b38-2f69-465d-8da1-59f8efd82909",
"metadata": {},
"source": [
"### Separate Chaining\n",
"\n",
"Another solution is to use a linked list to store multiple items in the same bucket. One could imagine each bucket being the head of a linked list where items that hash to that value are stored linearly.\n",
"\n",
"This is the approach we're asking you to use for the course project.\n",
"\n",
"**What happens if we have a lot of items in the same bucket?**\n",
"\n",
"### Better Hash Function\n",
"\n",
"Ideally a hash function will evenly distribute values across the collection.\n",
"\n",
"A common pattern is to use a polynomial hash function.\n",
"\n",
"$$h(x_0, ..., x_n) = (\\sum_{i=0}^{k-1}{c_ip^i})\\mod{m}$$\n",
"\n",
"Where:\n",
"- $x_0...x_i $ is the sequence\n",
"- $k = len(x)$\n",
"- $c_i$ is the numeric value of the character $x_i$ ($ord(x$ in Python)\n",
"- $p$ is an arbitrary constant.\n",
"- and $m$ is the size of the collection.\n",
"\n",
"```python\n",
" def _hash(self, key):\n",
" \"\"\"\n",
" This method takes in a string and returns \n",
" an integer value between 0 and self.capacity.\n",
"\n",
" This particular hash function uses \n",
" Horner's rule to compute a large polynomial.\n",
" \"\"\"\n",
" val = 0\n",
" for letter in key:\n",
" val = self.P_CONSTANT * val + ord(letter)\n",
" return val % self.capacity\n",
"```\n",
"\n",
"### Questions\n",
"\n",
"* Linear probing vs. separate chaining? Other approaches?\n",
"* What if our hash function wasn't reliable?\n",
"\n",
"### Rehashing\n",
"\n",
"As we add more items to our hash table, we'll eventually run out of space. We can increase the capacity of our table, but we'll need to rehash all of our existing items.\n",
"\n",
"Because which bucket we choose depends on `hash(item) % capacity` items would end up in different buckets if we change capacity.\n",
"\n",
"This means when we resize, we need to rehash all of our items.\n",
"\n",
"A common pattern is to double capacity when the table is ~50% full."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c1d5023d-25db-4885-a303-2c34d1a453de",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}