From a1ed6175de34c7858282973d1aa92aaf84b72631 Mon Sep 17 00:00:00 2001
From: James Turk <dev@jpt.sh>
Date: Wed, 9 Oct 2024 22:36:50 -0500
Subject: [PATCH] file-io appendix

---
 file-io.ipynb | 409 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 409 insertions(+)
 create mode 100644 file-io.ipynb

diff --git a/file-io.ipynb b/file-io.ipynb
new file mode 100644
index 0000000..cff5eb6
--- /dev/null
+++ b/file-io.ipynb
@@ -0,0 +1,409 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "907bb05d-99ef-400b-93ec-9fd2bf8b619c",
+   "metadata": {
+    "tags": [],
+    "toc-hr-collapsed": true
+   },
+   "source": [
+    "## I/O"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ff711218-4d4c-4999-a378-3a03ae5d6222",
+   "metadata": {},
+   "source": [
+    "### `print()`\n",
+    "\n",
+    "`print(*objects, sep=' ', end='\\n', file=sys.stdout, flush=False)`\n",
+    "\n",
+    "https://docs.python.org/3/library/functions.html#print"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "894ed224-a63a-44d7-b3a5-3dd4233830b7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(\"Can\", \"pass\", \"multiple\", {\"objects\": True})\n",
+    "print(\"Hello\", \"World\", sep=\"~~~~\", end=\"!\")\n",
+    "print(\"Same line\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0d05db43-1062-4a51-a205-b91e37d7d9c1",
+   "metadata": {},
+   "source": [
+    "### `input()`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "302e8ee6-9f25-4202-99e0-a3556cd53da7",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "name = input(\"What is your name: \")\n",
+    "print(f\"Hello {name}\")\n",
+    "\n",
+    "# input() always returns a string, even when the user types digits\n",
+    "year = input(\"What year is it: \")\n",
+    "print(year, type(year))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e12f7814-b77d-4c57-9857-7dd5238e5d08",
+   "metadata": {},
+   "source": [
+    "### pathlib\n",
+    "\n",
+    "Python offers several ways of working with files, largely because the standard library has improved over time.\n",
+    "\n",
+    "You'll still see code that uses the older style, built around the built-in `open()` function and plain string paths, but `pathlib` is now widely available and is usually the better choice for new code.\n",
+    "\n",
+    "To use `pathlib`, you'll need to import the `Path` class. (We'll discuss imports in more detail soon.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6d64c104-bf39-4932-9178-922bb2ecb43d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3be8ec33-dcea-4834-8fc4-0bf71b19b72d",
+   "metadata": {},
+   "source": [
+    "Imports like this should be at the top of the file.\n",
+    "\n",
+    "To use this class, create `Path` objects from file paths, for example:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "25f2bbab-d317-404c-be35-c31df5180a5a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# this looks like a function call,\n",
+    "# but by convention the capital letter signals that Path is a class\n",
+    "file_path = Path(\"data/names.txt\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ad9a7bb9-e8ff-4929-a50d-5bc4e1aeb9f4",
+   "metadata": {},
+   "source": [
+    "#### Typical Workflow\n",
+    "\n",
+    "- Read contents of file(s) from disk into working memory.\n",
+    "- Parse and/or manipulate data as needed.\n",
+    "- (Optional) Write data back to disk with modifications.\n",
+    "\n",
+    "#### Other Workflows\n",
+    "\n",
+    "- Append-only (e.g. logging)\n",
+    "- Streaming data (needed for files too large to fit in memory)\n",
+    "\n",
+    "#### Text vs. Binary\n",
+    "\n",
+    "So far we're opening files in the default text mode, where contents are decoded into strings. Files can also be opened in binary mode (for example `\"rb\"` or `\"wb\"`), where contents are read and written as raw `bytes` with no decoding."
+   ]
+  },
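+  {
+   "cell_type": "markdown",
+   "id": "sketch-text-vs-binary",
+   "metadata": {},
+   "source": [
+    "As a minimal sketch of the difference (the file name `example.bin` and the temporary directory are assumptions made for illustration), the same file can be read as raw `bytes` or as a decoded `str`:\n",
+    "\n",
+    "```python\n",
+    "import tempfile\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# hypothetical scratch file in a temporary directory\n",
+    "demo_path = Path(tempfile.mkdtemp()) / \"example.bin\"\n",
+    "\n",
+    "# write raw bytes; no text encoding is applied\n",
+    "demo_path.write_bytes(b\"caf\\xc3\\xa9\\n\")\n",
+    "\n",
+    "raw = demo_path.read_bytes()  # bytes, exactly as stored\n",
+    "text = demo_path.read_text(encoding=\"utf-8\")  # str, decoded from UTF-8\n",
+    "\n",
+    "print(type(raw), len(raw))\n",
+    "print(type(text), len(text))\n",
+    "```\n",
+    "\n",
+    "The two lengths differ because the accented character takes two bytes in UTF-8 but counts as one character once decoded."
+   ]
+  },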
+  {
+   "cell_type": "markdown",
+   "id": "31b06442-f771-48cd-b821-0ec6f54b1188",
+   "metadata": {},
+   "source": [
+    "### Reading From a File\n",
+    "\n",
+    "**emails.txt**\n",
+    "\n",
+    "```\n",
+    "borja@cs.uchicago.edu\n",
+    "jturk@uchicago.edu\n",
+    "lamonts@uchicago.edu\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3b6afd77-85ac-4be1-82d5-39273e7df035",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# to access a file's contents, we create the path, and then\n",
+    "# use read_text()\n",
+    "emails_path = Path(\"data/emails.txt\")\n",
+    "emails = emails_path.read_text()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "97883212-61e5-4c22-9d4b-dbfee04bf382",
+   "metadata": {},
+   "source": [
+    "### Writing to a File\n",
+    "\n",
+    "To write, we need to open the file in write (`\"w\"`) or append (`\"a\"`) mode."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "51000430-0b55-41a5-8573-759b76529494",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "names_file = Path(\"data/animals.txt\").open(\"w\")\n",
+    "names_file.write(\"Aardvark\\nChimpanzee\\nElephant\\n\")\n",
+    "\n",
+    "# the output may be empty: the text can still be buffered until flush() or close()\n",
+    "# (the ! indicates this is a shell command, not Python)\n",
+    "!cat data/animals.txt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a379fae3-a5e8-4917-898d-e4ec6c3e91c1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# open(\"w\") truncates (erases) the file; use \"a\" if you want to append\n",
+    "names_file = Path(\"data/animals.txt\").open(\"a\")\n",
+    "names_file.write(\"Kangaroo\\n\")\n",
+    "names_file.flush()\n",
+    "!cat data/animals.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "64cd8142-037d-4005-b591-4a1cda922b07",
+   "metadata": {},
+   "source": [
+    "#### `flush` and `close`\n",
+    "\n",
+    "`flush` forces any text still buffered in memory to be written out to the file, so it is actually saved.\n",
+    "\n",
+    "(Analogy: if a program crashes, you lose your unsaved work.)\n",
+    "\n",
+    "When you are finished with a file, it is important to `close` it. Closing a file:\n",
+    "\n",
+    "- Frees resources.\n",
+    "- Allows other programs to access file contents.\n",
+    "- Ensures edits are written to disk."
+   ]
+  },
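+  {
+   "cell_type": "markdown",
+   "id": "sketch-flush-and-close",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of how `flush` and `close` interact, assuming a hypothetical scratch file `log.txt` in a temporary directory. The `try`/`finally` is one way to make sure `close` always runs:\n",
+    "\n",
+    "```python\n",
+    "import tempfile\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# hypothetical scratch file in a temporary directory\n",
+    "log_path = Path(tempfile.mkdtemp()) / \"log.txt\"\n",
+    "\n",
+    "log_file = log_path.open(\"w\")\n",
+    "try:\n",
+    "    log_file.write(\"first line\\n\")\n",
+    "    log_file.flush()  # push buffered text out to the operating system now\n",
+    "    after_flush = log_path.read_text()\n",
+    "    log_file.write(\"second line\\n\")\n",
+    "finally:\n",
+    "    log_file.close()  # writes anything still buffered and releases the file\n",
+    "\n",
+    "after_close = log_path.read_text()\n",
+    "print(repr(after_flush))\n",
+    "print(repr(after_close))\n",
+    "```\n",
+    "\n",
+    "The first read only sees the line written before `flush`; the second line becomes visible once the file is closed."
+   ]
+  },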
+  {
+   "cell_type": "markdown",
+   "id": "fb6aa734-ea51-4582-9d4a-aea652d9dec0",
+   "metadata": {},
+   "source": [
+    "### `with`\n",
+    "\n",
+    "A file object is a \"context manager\". We'll cover those in more detail in a few weeks.\n",
+    "\n",
+    "The `with` statement allows us to safely use files without fear of leaving them open.\n",
+    "\n",
+    "```python\n",
+    "with path.open() as variable:\n",
+    "    statement1\n",
+    "    statement2\n",
+    "```\n",
+    "\n",
+    "No matter what happens inside the `with` block, even if an exception is raised, the file will be closed."
+   ]
+  },
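+  {
+   "cell_type": "markdown",
+   "id": "sketch-with-closes-on-error",
+   "metadata": {},
+   "source": [
+    "As a small sketch of that guarantee (the scratch file in a temporary directory is an assumption for illustration), the file below is closed even though the block raises an error partway through:\n",
+    "\n",
+    "```python\n",
+    "import tempfile\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# hypothetical scratch file in a temporary directory\n",
+    "names_path = Path(tempfile.mkdtemp()) / \"names.txt\"\n",
+    "\n",
+    "try:\n",
+    "    with names_path.open(\"w\") as f:\n",
+    "        f.write(\"Bob\\n\")\n",
+    "        f.write(\"Phil\\n\")\n",
+    "        1 / 0  # raises ZeroDivisionError inside the block\n",
+    "except ZeroDivisionError:\n",
+    "    print(\"an error occurred\")\n",
+    "\n",
+    "print(f.closed)  # the with statement closed the file on the way out\n",
+    "print(names_path.read_text())\n",
+    "```\n",
+    "\n",
+    "Because closing also flushes, both names are on disk despite the error."
+   ]
+  },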
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e1428cd7-53ac-4138-8b43-fe7b477ccb24",
+   "metadata": {},
+   "outputs": [],
+   "source": [
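+    "# without `with`: if an error occurs, f.close() is never reached\n",
+    "f = open(\"names.txt\", \"w\")\n",
+    "f.write(\"Bob\\n\")\n",
+    "f.write(\"Phil\\n\")\n",
+    "1 / 0  # raises ZeroDivisionError while the file is still open"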
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2a632c3d-9d35-441d-a123-49a6ba44c432",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!cat names.txt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0c62751d-6f85-4bfb-8e15-788d2d928089",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Full Example\n",
+    "\n",
+    "# load data into our chosen data structure\n",
+    "emails = []\n",
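+    "with open(\"data/emails.txt\") as f:\n",
+    "    # iterating over a file yields one line at a time,\n",
+    "    # and each line keeps its trailing \"\\n\"\n",
+    "    for email in f:\n",
+    "        emails.append(email)\n",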
+    "print(emails)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "81df15e5-da33-4367-ae08-dea08cfc2cf6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# transform data\n",
+    "cnet_ids = []\n",
+    "for email in emails:\n",
+    "    # split \"user@domain\" into the parts before and after the @\n",
+    "    cnet_id, domain = email.split(\"@\")\n",
+    "    cnet_ids.append(cnet_id)\n",
+    "print(cnet_ids)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6b5c5988-90ea-4464-8c11-c51518f0657c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# write new data\n",
+    "with open(\"data/cnetids.txt\", \"w\") as f:\n",
+    "    for cnet_id in cnet_ids:\n",
+    "        # print() adds newlines by default\n",
+    "        print(cnet_id, file=f)\n",
+    "        # or\n",
+    "        # f.write(cnet_id + \"\\n\")\n",
+    "\n",
+    "!cat data/cnetids.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "eb00e0ce-ff29-40d2-aa72-fa34315fa9a5",
+   "metadata": {},
+   "source": [
+    "#### Useful `file` Methods\n",
+    "\n",
+    "| Operation | Purpose |\n",
+    "|-----------|---------|\n",
+    "| `f.read()` | Read entire file & return contents. |\n",
+    "| `f.read(N)` | Read N characters (or bytes). |\n",
+    "| `f.readline()` | Read up to (and including) next newline. |\n",
+    "| `f.readlines()` | Read entire file split into list of lines. |\n",
+    "| `f.write(aStr)` | Write string `aStr` into file. |\n",
+    "| `f.writelines(lines)` | Write list of strings into file. |\n",
+    "| `f.close()` | Close file; prefer a `with` block instead. |\n",
+    "| `f.flush()` | Manually flush output to disk without closing. |\n",
+    "| `f.seek(N)` | Move cursor to position N. |\n",
+    "\n",
+    "-- Table based on *Learning Python* (2013)"
+   ]
+  },
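+  {
+   "cell_type": "markdown",
+   "id": "sketch-file-methods",
+   "metadata": {},
+   "source": [
+    "A short sketch of a few of these methods working together, assuming a hypothetical scratch copy of `animals.txt` in a temporary directory:\n",
+    "\n",
+    "```python\n",
+    "import tempfile\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# hypothetical scratch file in a temporary directory\n",
+    "animals_path = Path(tempfile.mkdtemp()) / \"animals.txt\"\n",
+    "animals_path.write_text(\"Aardvark\\nChimpanzee\\nElephant\\n\")\n",
+    "\n",
+    "with animals_path.open() as f:\n",
+    "    first_line = f.readline()  # \"Aardvark\\n\"\n",
+    "    next_four = f.read(4)  # \"Chim\", continuing from the cursor\n",
+    "    f.seek(0)  # move the cursor back to the start\n",
+    "    all_lines = f.readlines()  # every line, newlines included\n",
+    "\n",
+    "print(repr(first_line), repr(next_four))\n",
+    "print(all_lines)\n",
+    "```\n",
+    "\n",
+    "Each read picks up where the previous one stopped, which is why `seek(0)` is needed before reading the whole file again."
+   ]
+  },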
+  {
+   "cell_type": "markdown",
+   "id": "f152aaaf-0c07-41c4-a90a-75891606c14e",
+   "metadata": {},
+   "source": [
+    "### Common Gotchas\n",
+    "\n",
+    "- Relative paths: use `pathlib` (https://docs.python.org/3/library/pathlib.html)\n",
+    "- File permissions\n",
+    "- Mind the file mode (read/write/append)"
+   ]
+  },
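+  {
+   "cell_type": "markdown",
+   "id": "sketch-relative-paths",
+   "metadata": {},
+   "source": [
+    "To illustrate the first gotcha, here is a minimal sketch of how a relative path depends on the current working directory. The path `data/emails.txt` is only used as an example and does not need to exist:\n",
+    "\n",
+    "```python\n",
+    "from pathlib import Path\n",
+    "\n",
+    "relative = Path(\"data\") / \"emails.txt\"\n",
+    "print(relative.is_absolute())  # False\n",
+    "\n",
+    "# a relative path is interpreted starting from the current working directory\n",
+    "print(Path.cwd())\n",
+    "absolute = Path.cwd() / relative\n",
+    "print(absolute.is_absolute())  # True\n",
+    "```\n",
+    "\n",
+    "Running the same code from a different directory changes `Path.cwd()`, and therefore which file `relative` refers to."
+   ]
+  },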
+  {
+   "cell_type": "markdown",
+   "id": "cc364347-7e0a-47ef-a90d-e7b3d91e3fc1",
+   "metadata": {},
+   "source": [
+    "### Note: Relative Paths\n",
+    "\n",
+    "If you run your code from a different directory, for example from `homework1` instead of `homework1/problem3`, a path such as `Path(\"towing.csv\")` would need to change to `Path(\"problem3/towing.csv\")`.\n",
+    "\n",
+    "That is because paths are *relative* by default, meaning they are resolved starting from the directory you run your code from (the current working directory).\n",
+    "\n",
+    "This can be frustrating at first, since you want your code to work the same regardless of what directory you are in.\n",
+    "\n",
+    "### Building an Absolute Path\n",
+    "\n",
+    "To get around this, you can construct an absolute path.\n",
+    "\n",
+    "First, use the special `__file__` variable, which holds the path of the Python file being run. (It is set when code runs from a `.py` file; it is not defined inside a notebook cell or an interactive session.)\n",
+    "\n",
+    "Then use that as the \"anchor\" of your path. A common pattern is to take the current file's parent directory and navigate from there:\n",
+    "\n",
+    "```python\n",
+    "from pathlib import Path\n",
+    "\n",
+    "path = Path(__file__).parent / \"towing.csv\"\n",
+    "```\n",
+    "\n",
+    "This line reads `__file__` to get the path of the Python file itself, takes that file's parent directory (`.parent`), and appends the filename `towing.csv` to it with the `/` operator.\n",
+    "\n",
+    "Using this technique in your code allows you to set paths that don't depend on the current working directory."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.15"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}