51042-notes/file-io.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "907bb05d-99ef-400b-93ec-9fd2bf8b619c",
   "metadata": {
    "tags": [],
    "toc-hr-collapsed": true
   },
   "source": [
    "## I/O"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff711218-4d4c-4999-a378-3a03ae5d6222",
   "metadata": {},
   "source": [
    "### `print()`\n",
    "\n",
    "`print(*objects, sep=' ', end='\\n', file=sys.stdout, flush=False)`\n",
    "\n",
    "https://docs.python.org/3/library/functions.html#print"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "894ed224-a63a-44d7-b3a5-3dd4233830b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Can\", \"pass\", \"multiple\", {\"objects\": True})\n",
    "print(\"Hello\", \"World\", sep=\"~~~~\", end=\"!\")\n",
    "print(\"Same line\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0d05db43-1062-4a51-a205-b91e37d7d9c1",
   "metadata": {},
   "source": [
    "### `input()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "302e8ee6-9f25-4202-99e0-a3556cd53da7",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "name = input(\"What is your name: \")\n",
    "print(f\"Hello {name}\")\n",
    "\n",
    "# always a string\n",
    "year = input(\"What year is it: \")\n",
    "print(year, type(year))"
   ]
  },
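  {
   "cell_type": "markdown",
   "id": "7a1c2e90-3b5d-4f6a-8c21-0d9e4b7a1f01",
   "metadata": {},
   "source": [
    "Because `input()` always returns a string, numeric input has to be converted before you can do arithmetic with it. The cell below is a minimal sketch of one way to do that; `parse_year` is a hypothetical helper written for this example, not a built-in."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a1c2e90-3b5d-4f6a-8c21-0d9e4b7a1f02",
   "metadata": {},
   "outputs": [],
   "source": [
    "def parse_year(text):\n",
    "    \"\"\"Return text as an int, or None if it isn't a whole number.\"\"\"\n",
    "    try:\n",
    "        # int() tolerates surrounding whitespace, e.g. \" 2024 \"\n",
    "        return int(text)\n",
    "    except ValueError:\n",
    "        return None\n",
    "\n",
    "\n",
    "print(parse_year(\"2024\") + 1)  # arithmetic works on the converted value\n",
    "print(parse_year(\"twenty\"))  # not a number, so we get None back"
   ]
  },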
  {
   "cell_type": "markdown",
   "id": "e12f7814-b77d-4c57-9857-7dd5238e5d08",
   "metadata": {},
   "source": [
    "### pathlib\n",
    "\n",
    "Python offers a few ways of working with files, mostly because the standard library has improved over time.\n",
    "\n",
    "You'll still see code in the older style that passes string paths to the built-in `open()` function (a few examples later in these notes do this too), but for new code there's almost no reason to write in that style now that `pathlib` is widely available.\n",
    "\n",
    "To use `pathlib`, you'll need to import the `Path` class. (We'll discuss imports more soon.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6d64c104-bf39-4932-9178-922bb2ecb43d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3be8ec33-dcea-4834-8fc4-0bf71b19b72d",
   "metadata": {},
   "source": [
    "Imports like this should be at the top of the file.\n",
    "\n",
    "To use this type you'll create objects with file paths, for example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "25f2bbab-d317-404c-be35-c31df5180a5a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# this looks like a function call\n",
    "# but the capital letter denotes that this is instead a class\n",
    "file_path = Path(\"data/names.txt\")"
   ]
  },
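  {
   "cell_type": "markdown",
   "id": "8b2d3fa1-4c6e-4a7b-9d32-1e0f5c8b2a01",
   "metadata": {},
   "source": [
    "A `Path` object can take a path apart and build new paths from it. The cell below is a short sketch using the same `data/names.txt` path; none of these operations require the file to exist, and `sibling_path` is a name made up for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8b2d3fa1-4c6e-4a7b-9d32-1e0f5c8b2a02",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "file_path = Path(\"data/names.txt\")\n",
    "\n",
    "print(file_path.name)  # final component of the path\n",
    "print(file_path.suffix)  # file extension, including the dot\n",
    "print(file_path.parent)  # containing directory\n",
    "\n",
    "# the / operator joins path pieces using the right separator for your OS\n",
    "sibling_path = file_path.parent / \"emails.txt\"\n",
    "print(sibling_path)"
   ]
  },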
  {
   "cell_type": "markdown",
   "id": "ad9a7bb9-e8ff-4929-a50d-5bc4e1aeb9f4",
   "metadata": {},
   "source": [
    "#### Typical Workflow\n",
    "\n",
    "- Read contents of file(s) from disk into working memory.\n",
    "- Parse and/or manipulate data as needed.\n",
    "- (Optional) Write data back to disk with modifications.\n",
    "\n",
    "#### Other Workflows\n",
    "\n",
    "- Append-only (e.g. logging)\n",
    "- Streaming data (needed for large files that don't fit into memory)\n",
    "\n",
    "#### Text vs. Binary\n",
    "\n",
    "We're opening our files in the default text mode, where the file's bytes are decoded into strings. It is also possible to open files in binary mode, where the contents are returned as raw `bytes` and nothing is assumed to be text."
   ]
  },
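  {
   "cell_type": "markdown",
   "id": "9c3e4ab2-5d7f-4b8c-8e43-2f1a6d9c3b01",
   "metadata": {},
   "source": [
    "To make the text/binary difference concrete, here is a small sketch that reads the same file both ways. It works in a temporary directory so it doesn't touch any real data; the file name `demo.txt` and its contents are made up for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9c3e4ab2-5d7f-4b8c-8e43-2f1a6d9c3b02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    demo_path = Path(tmp_dir) / \"demo.txt\"\n",
    "    demo_path.write_text(\"café\\n\", encoding=\"utf-8\")\n",
    "\n",
    "    # text mode: the bytes on disk are decoded into a str\n",
    "    as_text = demo_path.read_text(encoding=\"utf-8\")\n",
    "    # binary mode: the raw bytes, with no decoding\n",
    "    as_bytes = demo_path.read_bytes()\n",
    "\n",
    "# the byte count is larger because é takes two bytes in UTF-8\n",
    "print(type(as_text), len(as_text))\n",
    "print(type(as_bytes), len(as_bytes))"
   ]
  },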
  {
   "cell_type": "markdown",
   "id": "31b06442-f771-48cd-b821-0ec6f54b1188",
   "metadata": {},
   "source": [
    "### Reading From a File\n",
    "\n",
    "**emails.txt**\n",
    "\n",
    "```\n",
    "borja@cs.uchicago.edu\n",
    "jturk@uchicago.edu\n",
    "lamonts@uchicago.edu\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b6afd77-85ac-4be1-82d5-39273e7df035",
   "metadata": {},
   "outputs": [],
   "source": [
    "# to access a file's contents, we create the path, and then\n",
    "# use read_text()\n",
    "emails_path = Path(\"data/emails.txt\")\n",
    "emails = emails_path.read_text()"
   ]
  },
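  {
   "cell_type": "markdown",
   "id": "ad4f5bc3-6e80-4c9d-9f54-3a2b7eac4c01",
   "metadata": {},
   "source": [
    "`read_text()` returns the whole file as one string. Two common follow-ups are splitting that string into lines or, for files too large to hold in memory, looping over an open file one line at a time. The sketch below shows both on a temporary copy of the sample data (the temporary file is an assumption of this example, so that it runs anywhere)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ad4f5bc3-6e80-4c9d-9f54-3a2b7eac4c02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    sample_path = Path(tmp_dir) / \"emails.txt\"\n",
    "    sample_path.write_text(\n",
    "        \"borja@cs.uchicago.edu\\njturk@uchicago.edu\\nlamonts@uchicago.edu\\n\"\n",
    "    )\n",
    "\n",
    "    # whole file at once, then split into a list of lines (newlines removed)\n",
    "    all_lines = sample_path.read_text().splitlines()\n",
    "\n",
    "    # streaming: only one line is held in memory at a time\n",
    "    line_count = 0\n",
    "    with sample_path.open() as f:\n",
    "        for line in f:\n",
    "            line_count += 1\n",
    "\n",
    "print(all_lines)\n",
    "print(line_count)"
   ]
  },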
  {
   "cell_type": "markdown",
   "id": "97883212-61e5-4c22-9d4b-dbfee04bf382",
   "metadata": {},
   "source": [
    "### Writing to a File\n",
    "\n",
    "To write, we need to open the file in write (`\"w\"`) or append (`\"a\"`) mode."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51000430-0b55-41a5-8573-759b76529494",
   "metadata": {},
   "outputs": [],
   "source": [
    "names_file = Path(\"data/animals.txt\").open(\"w\")\n",
    "names_file.write(\"Aardvark\\nChimpanzee\\nElephant\\n\")\n",
    "\n",
    "# (the ! indicates this is a shell command, not Python)\n",
    "# nothing has been flushed or closed yet, so the file may still look empty\n",
    "!cat data/animals.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a379fae3-a5e8-4917-898d-e4ec6c3e91c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "# open(\"w\") erases the file, use \"a\" if you want to append\n",
    "names_file = Path(\"data/animals.txt\").open(\"a\")\n",
    "names_file.write(\"Kangaroo\\n\")\n",
    "names_file.flush()\n",
    "!cat data/animals.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "64cd8142-037d-4005-b591-4a1cda922b07",
   "metadata": {},
   "source": [
    "#### `flush` and `close`\n",
    "\n",
    "`flush` ensures that writes still buffered in memory actually get written to disk, i.e. saved.\n",
    "\n",
    "(Analogy: if a program crashes, you lose your unsaved work.)\n",
    "\n",
    "When you are done with a file, it is important to `close` it:\n",
    "\n",
    "- Frees resources.\n",
    "- Allows other programs to access file contents.\n",
    "- Ensures edits are written to disk."
   ]
  },
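  {
   "cell_type": "markdown",
   "id": "be5a6cd4-7f91-4dae-8a65-4b3c8fbd5d01",
   "metadata": {},
   "source": [
    "Closing by hand looks roughly like the sketch below, where `try`/`finally` makes sure `close()` runs even if a write fails. The temporary directory and the `log.txt` name are made up for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "be5a6cd4-7f91-4dae-8a65-4b3c8fbd5d02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    log_path = Path(tmp_dir) / \"log.txt\"\n",
    "    log_file = log_path.open(\"w\")\n",
    "    try:\n",
    "        log_file.write(\"first entry\\n\")\n",
    "        log_file.flush()  # push the buffered text out to the file now\n",
    "        seen_after_flush = log_path.read_text()\n",
    "    finally:\n",
    "        log_file.close()  # runs whether or not an error occurred\n",
    "\n",
    "print(repr(seen_after_flush))\n",
    "print(log_file.closed)"
   ]
  },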
  {
   "cell_type": "markdown",
   "id": "fb6aa734-ea51-4582-9d4a-aea652d9dec0",
   "metadata": {},
   "source": [
    "### `with`\n",
    "\n",
    "The file object is a \"context manager\"; we'll cover those in more detail in a few weeks.\n",
    "\n",
    "The `with` statement allows us to safely use files without fear of leaving them open.\n",
    "\n",
    "```python\n",
    "with path.open() as variable:\n",
    "    statement1\n",
    "    statement2\n",
    "```\n",
    "\n",
    "No matter what happens inside the `with` block, the file will be closed."
   ]
  },
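  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1428cd7-53ac-4138-8b43-fe7b477ccb24",
   "metadata": {},
   "outputs": [],
   "source": [
    "# without `with`: the error on the last line stops the cell\n",
    "# before we ever get a chance to flush or close the file\n",
    "f = open(\"names.txt\", \"w\")\n",
    "f.write(\"Bob\\n\")\n",
    "f.write(\"Phil\\n\")\n",
    "1 / 0  # deliberately raises ZeroDivisionError"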
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2a632c3d-9d35-441d-a123-49a6ba44c432",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat names.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0c62751d-6f85-4bfb-8e15-788d2d928089",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Full Example\n",
    "\n",
    "# load data into our chosen data structure\n",
    "emails = []\n",
    "with open(\"data/emails.txt\") as f:\n",
    "    for email in f:\n",
    "        # each line still ends with \"\\n\"; strip it before storing\n",
    "        emails.append(email.strip())\n",
    "print(emails)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "81df15e5-da33-4367-ae08-dea08cfc2cf6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform data\n",
    "cnet_ids = []\n",
    "for email in emails:\n",
    "    cnet_id, domain = email.split(\"@\")\n",
    "    cnet_ids.append(cnet_id)\n",
    "print(cnet_ids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6b5c5988-90ea-4464-8c11-c51518f0657c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# write new data\n",
    "with open(\"data/cnetids.txt\", \"w\") as f:\n",
    "    for cnet_id in cnet_ids:\n",
    "        # print() adds newlines by default\n",
    "        print(cnet_id, file=f)\n",
    "        # or\n",
    "        # f.write(cnet_id + \"\\n\")\n",
    "\n",
    "!cat data/cnetids.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb00e0ce-ff29-40d2-aa72-fa34315fa9a5",
   "metadata": {},
   "source": [
    "#### Useful `file` Methods\n",
    "\n",
    "| Operation | Purpose |\n",
    "|-----------|---------|\n",
    "| `f.read()` | Read entire file & return contents. |\n",
    "| `f.read(N)` | Read N characters (or bytes). |\n",
    "| `f.readline()` | Read up to (and including) next newline. |\n",
    "| `f.readlines()` | Read entire file split into list of lines. |\n",
    "| `f.write(aStr)` | Write string `aStr` into file. |\n",
    "| `f.writelines(lines)` | Write list of strings into file (no newlines are added). |\n",
    "| `f.close()` | Close file; prefer `with open()` instead. |\n",
    "| `f.flush()` | Manually flush output to disk without closing. |\n",
    "| `f.seek(N)` | Move cursor to position N. |\n",
    "\n",
    "-- Table based on *Learning Python* (2013)"
   ]
  },
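  {
   "cell_type": "markdown",
   "id": "cf6b7de5-80a2-4ebf-9b76-5c4d90ce6e01",
   "metadata": {},
   "source": [
    "The cell below is a small sketch that exercises several of the methods from the table on a temporary file, to show how the read position (the \"cursor\") moves. The file name and contents are made up for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cf6b7de5-80a2-4ebf-9b76-5c4d90ce6e02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    animals_path = Path(tmp_dir) / \"animals.txt\"\n",
    "\n",
    "    with animals_path.open(\"w\") as f:\n",
    "        # writelines() adds no newlines, so we include them ourselves\n",
    "        f.writelines([\"Aardvark\\n\", \"Chimpanzee\\n\", \"Elephant\\n\"])\n",
    "\n",
    "    with animals_path.open() as f:\n",
    "        first_four = f.read(4)  # first 4 characters\n",
    "        rest_of_line = f.readline()  # continues from the cursor to the newline\n",
    "        remaining = f.readlines()  # every line after the cursor\n",
    "        f.seek(0)  # move the cursor back to the start\n",
    "        everything = f.read()\n",
    "\n",
    "print(repr(first_four), repr(rest_of_line))\n",
    "print(remaining)\n",
    "print(repr(everything))"
   ]
  },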
  {
   "cell_type": "markdown",
   "id": "f152aaaf-0c07-41c4-a90a-75891606c14e",
   "metadata": {},
   "source": [
    "### Common Gotchas\n",
    "\n",
    "- Relative paths: build paths with `pathlib` (https://docs.python.org/3/library/pathlib.html)\n",
    "- File permissions: you may not be allowed to read or write a given file\n",
    "- File mode: mind whether the file was opened for read, write, or append"
   ]
  },
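  {
   "cell_type": "markdown",
   "id": "d07c8ef6-91b3-4fc0-8c87-6d5ea1df7f01",
   "metadata": {},
   "source": [
    "Two of these gotchas show up as specific exceptions. The cell below is a sketch that triggers a wrong-path error and a wrong-mode error on purpose, inside a temporary directory; the file names are made up for this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d07c8ef6-91b3-4fc0-8c87-6d5ea1df7f02",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "with tempfile.TemporaryDirectory() as tmp_dir:\n",
    "    # wrong path: this file was never created\n",
    "    missing_path = Path(tmp_dir) / \"missing.txt\"\n",
    "    try:\n",
    "        missing_path.read_text()\n",
    "        read_error = None\n",
    "    except FileNotFoundError as err:\n",
    "        read_error = type(err).__name__\n",
    "\n",
    "    # wrong mode: the file is open for reading, so writing is refused\n",
    "    notes_path = Path(tmp_dir) / \"notes.txt\"\n",
    "    notes_path.write_text(\"hello\\n\")\n",
    "    with notes_path.open(\"r\") as f:\n",
    "        try:\n",
    "            f.write(\"more\\n\")\n",
    "            write_error = None\n",
    "        except io.UnsupportedOperation as err:\n",
    "            write_error = type(err).__name__\n",
    "\n",
    "print(read_error)\n",
    "print(write_error)"
   ]
  },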
  {
   "cell_type": "markdown",
   "id": "cc364347-7e0a-47ef-a90d-e7b3d91e3fc1",
   "metadata": {},
   "source": [
    "### Note: Relative Paths\n",
    "\n",
    "If you run your code from, for example, the `homework1` directory instead of `homework1/problem3`, a path like `Path(\"towing.csv\")` no longer finds the file, and you'd need to modify it to be `Path(\"problem3/towing.csv\")`.\n",
    "\n",
    "That is because by default, paths are *relative*, meaning that they are assumed to start in the directory that you are running your code from (the current working directory).\n",
    "\n",
    "This can be frustrating at first, since you want your code to work the same regardless of what directory you are in.\n",
    "\n",
    "### Building an Absolute Path\n",
    "\n",
    "To get around this, you can construct an absolute path:\n",
    "\n",
    "First, you can use the special `__file__` variable, which contains the path to the current file when your code runs as a script or an imported module. (It is not defined in a notebook or at the interactive prompt.)\n",
    "\n",
    "Then you can use that as the \"anchor\" of your path, and navigate from there.\n",
    "\n",
    "A common pattern then is to get the current file's parent, and navigate from there:\n",
    "\n",
    "```python\n",
    "from pathlib import Path\n",
    "\n",
    "path = Path(__file__).parent / \"towing.csv\"\n",
    "```\n",
    "\n",
    "This line uses the special built-in variable `__file__` to get the path of the Python file itself.\n",
    "It then gets this file's parent directory (`.parent`) and appends the filename \"towing.csv\" to it.\n",
    "\n",
    "Using this technique in your code allows you to set paths that don't depend on the current working directory.\n",
    "\n"
   ]
  }
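  ,
  {
   "cell_type": "markdown",
   "id": "e18d9f07-a2c4-40d1-9d98-7e6fb2e08a01",
   "metadata": {},
   "source": [
    "As a sketch of how this might look in practice, the cell below wraps the pattern in a hypothetical helper, `data_path`, that falls back to the current working directory when `__file__` is not available (as in this notebook). The fallback is an assumption of this example, not the only reasonable choice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e18d9f07-a2c4-40d1-9d98-7e6fb2e08a02",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def data_path(filename):\n",
    "    \"\"\"Build an absolute path to a file that sits next to this code.\"\"\"\n",
    "    try:\n",
    "        # running as a script or module: anchor on this file's directory\n",
    "        base_dir = Path(__file__).resolve().parent\n",
    "    except NameError:\n",
    "        # no __file__ (e.g. in a notebook): use the working directory\n",
    "        base_dir = Path.cwd()\n",
    "    return base_dir / filename\n",
    "\n",
    "\n",
    "path = data_path(\"towing.csv\")\n",
    "print(path.is_absolute())"
   ]
  }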
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}