From a1ed6175de34c7858282973d1aa92aaf84b72631 Mon Sep 17 00:00:00 2001 From: James Turk Date: Wed, 9 Oct 2024 22:36:50 -0500 Subject: [PATCH] file-io appendix --- file-io.ipynb | 409 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 409 insertions(+) create mode 100644 file-io.ipynb diff --git a/file-io.ipynb b/file-io.ipynb new file mode 100644 index 0000000..cff5eb6 --- /dev/null +++ b/file-io.ipynb @@ -0,0 +1,409 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "907bb05d-99ef-400b-93ec-9fd2bf8b619c", + "metadata": { + "tags": [], + "toc-hr-collapsed": true + }, + "source": [ + "## I/O" + ] + }, + { + "cell_type": "markdown", + "id": "ff711218-4d4c-4999-a378-3a03ae5d6222", + "metadata": {}, + "source": [ + "### `print()`\n", + "\n", + "`print(*objects, sep=' ', end='\\n', file=sys.stdout, flush=False)`\n", + "\n", + "https://docs.python.org/3/library/functions.html#print" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "894ed224-a63a-44d7-b3a5-3dd4233830b7", + "metadata": {}, + "outputs": [], + "source": [ + "print(\"Can\", \"pass\", \"multiple\", {\"objects\": True})\n", + "print(\"Hello\", \"World\", sep=\"~~~~\", end=\"!\")\n", + "print(\"Same line\")" + ] + }, + { + "cell_type": "markdown", + "id": "0d05db43-1062-4a51-a205-b91e37d7d9c1", + "metadata": {}, + "source": [ + "### `input()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "302e8ee6-9f25-4202-99e0-a3556cd53da7", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "name = input(\"What is your name: \")\n", + "print(f\"Hello {name}\")\n", + "\n", + "# always a string\n", + "year = input(\"What year is it: \")\n", + "print(year, type(year))" + ] + }, + { + "cell_type": "markdown", + "id": "e12f7814-b77d-4c57-9857-7dd5238e5d08", + "metadata": {}, + "source": [ + "### pathlib\n", + "\n", + "There are a few ways of working with files in Python, mostly due to improvements over time.\n", + "\n", + "You'll still 
sometimes see code that uses the older method with `open`, but there's almost no reason to write code in that style now that `pathlib` is widely available.\n", + "\n", + "To use `pathlib`, you'll need to import the `Path` object. (We'll discuss these imports more soon.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6d64c104-bf39-4932-9178-922bb2ecb43d", + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path" + ] + }, + { + "cell_type": "markdown", + "id": "3be8ec33-dcea-4834-8fc4-0bf71b19b72d", + "metadata": {}, + "source": [ + "Imports like this should be at the top of the file.\n", + "\n", + "To use this type you'll create objects with file paths, for example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "25f2bbab-d317-404c-be35-c31df5180a5a", + "metadata": {}, + "outputs": [], + "source": [ + "# this looks like a function call\n", + "# but the capital letter denotes that this is instead a class\n", + "file_path = Path(\"data/names.txt\")" + ] + }, + { + "cell_type": "markdown", + "id": "ad9a7bb9-e8ff-4929-a50d-5bc4e1aeb9f4", + "metadata": {}, + "source": [ + "#### Typical workflow:\n", + "\n", + "- Read contents of file(s) from disk into working memory.\n", + "- Parse and/or manipulate data as needed.\n", + "- (Optional) Write data back to disk with modifications.\n", + "\n", + "#### Other Workflows\n", + "\n", + "- Append-only (e.g. logging)\n", + "- Streaming data (needed for large files where we can't fit into memory)\n", + "\n", + "#### Text vs. Binary\n", + "\n", + "We're opening our files in the default, text mode. It is also possible to open files in a binary mode where it isn't assumed we're reading strings." 
+ ] + }, + { + "cell_type": "markdown", + "id": "31b06442-f771-48cd-b821-0ec6f54b1188", + "metadata": {}, + "source": [ + "### Reading From a File\n", + "\n", + "**emails.txt**\n", + "\n", + "```\n", + "borja@cs.uchicago.edu\n", + "jturk@uchicago.edu\n", + "lamonts@uchicago.edu\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "3b6afd77-85ac-4be1-82d5-39273e7df035", + "metadata": {}, + "outputs": [], + "source": [ + "# to access a file's contents, we create the path, and then\n", + "# use read_text()\n", + "emails_path = Path(\"data/emails.txt\")\n", + "emails = emails_path.read_text()" + ] + }, + { + "cell_type": "markdown", + "id": "97883212-61e5-4c22-9d4b-dbfee04bf382", + "metadata": {}, + "source": [ + "### Writing to a File\n", + "\n", + "We need to open the file in write or append mode." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "51000430-0b55-41a5-8573-759b76529494", + "metadata": {}, + "outputs": [], + "source": [ + "names_file = Path(\"data/animals.txt\").open(\"w\")\n", + "names_file.write(\"Aardvark\\nChimpanzee\\nElephant\\n\")\n", + "\n", + "# (the ! 
indicates this is a shell command, not Python)\n", + "!cat data/animals.txt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a379fae3-a5e8-4917-898d-e4ec6c3e91c1", + "metadata": {}, + "outputs": [], + "source": [ + "# open(\"w\") erases the file; use \"a\" if you want to append\n", + "names_file = Path(\"data/animals.txt\").open(\"a\")\n", + "names_file.write(\"Kangaroo\\n\")\n", + "names_file.flush()\n", + "!cat data/animals.txt" + ] + }, + { + "cell_type": "markdown", + "id": "64cd8142-037d-4005-b591-4a1cda922b07", + "metadata": {}, + "source": [ + "#### `flush` and `close`\n", + "\n", + "`flush` ensures that contents still buffered in memory actually get written to disk.\n", + "\n", + "(Analogy: if a program crashes, you lose your unsaved work.)\n", + "\n", + "At the end, it is important to `close` the file. Doing so:\n", + "\n", + "- Frees resources.\n", + "- Allows other programs to access file contents.\n", + "- Ensures edits are written to disk." + ] + }, + { + "cell_type": "markdown", + "id": "fb6aa734-ea51-4582-9d4a-aea652d9dec0", + "metadata": {}, + "source": [ + "### `with`\n", + "\n", + "The file object is a \"context manager\"; we'll cover those in more detail in a few weeks.\n", + "\n", + "The `with` statement allows us to safely use files without fear of leaving them open.\n", + "\n", + "```python\n", + "\n", + "with path.open() as variable:\n", + "    statement1\n", + "    statement2\n", + "```\n", + "\n", + "No matter what happens inside the `with` block, the file will be closed." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1428cd7-53ac-4138-8b43-fe7b477ccb24", + "metadata": {}, + "outputs": [], + "source": [ + "f = open(\"names.txt\", \"w\")\n", + "f.write(\"Bob\\n\")\n", + "f.write(\"Phil\\n\")\n", + "1 / 0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2a632c3d-9d35-441d-a123-49a6ba44c432", + "metadata": {}, + "outputs": [], + "source": [ + "!cat names.txt" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0c62751d-6f85-4bfb-8e15-788d2d928089", + "metadata": {}, + "outputs": [], + "source": [ + "# Full Example\n", + "\n", + "# load data into our chosen data structure\n", + "emails = []\n", + "with open(\"data/emails.txt\") as f:\n", + "    for email in f:\n", + "        emails.append(email)\n", + "print(emails)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "81df15e5-da33-4367-ae08-dea08cfc2cf6", + "metadata": {}, + "outputs": [], + "source": [ + "# transform data\n", + "cnet_ids = []\n", + "for email in emails:\n", + "    cnet_id, domain = email.split(\"@\")\n", + "    cnet_ids.append(cnet_id)\n", + "print(cnet_ids)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6b5c5988-90ea-4464-8c11-c51518f0657c", + "metadata": {}, + "outputs": [], + "source": [ + "# write new data\n", + "with open(\"data/cnetids.txt\", \"w\") as f:\n", + "    for cnet_id in cnet_ids:\n", + "        # print() adds newlines by default\n", + "        print(cnet_id, file=f)\n", + "        # or\n", + "        # f.write(cnet_id + \"\\n\")\n", + "\n", + "!cat data/cnetids.txt" + ] + }, + { + "cell_type": "markdown", + "id": "eb00e0ce-ff29-40d2-aa72-fa34315fa9a5", + "metadata": {}, + "source": [ + "#### Useful `file` Methods\n", + "\n", + "| Operation | Purpose |\n", + "|-----------|---------|\n", + "| `f.read()` | Read entire file & return contents. |\n", + "| `f.read(N)` | Read N characters (or bytes). |\n", + "| `f.readline()` | Read up to (and including) next newline. 
|\n", + "| `f.readlines()` | Read entire file split into list of lines. |\n", + "| `f.write(aStr)` | Write string `aStr` into file. |\n", + "| `f.writelines(lines)` | Write list of strings into file. |\n", + "| `f.close()` | Close file; prefer `with open()` instead. |\n", + "| `f.flush()` | Manually flush output to disk without closing. |\n", + "| `f.seek(N)` | Move cursor to position N. |\n", + "\n", + "-- Table based on *Learning Python* (2013)" + ] + }, + { + "cell_type": "markdown", + "id": "f152aaaf-0c07-41c4-a90a-75891606c14e", + "metadata": {}, + "source": [ + "### Common Gotchas\n", + "\n", + "- Relative paths - use `pathlib` https://docs.python.org/3/library/pathlib.html\n", + "- File permissions\n", + "- Mind file mode (read/write/append)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b01a89d7-1419-40bf-8005-78cbedad82b8", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "cc364347-7e0a-47ef-a90d-e7b3d91e3fc1", + "metadata": {}, + "source": [ + "### Note: Relative Paths\n", + "\n", + "You may find that if you are running your code from, for example, the homework1 directory instead of homework1/problem3, you'd need to modify this path to be `Path(\"problem3/towing.csv\")`.\n", + "\n", + "That is because by default, paths are *relative*, meaning that they are assumed to start in the directory that you are running your code from.\n", + "\n", + "This can be frustrating at first: you want your code to work the same regardless of what directory you are in.\n", + "\n", + "### Building an Absolute Path\n", + "\n", + "To get around this, you can construct an absolute path:\n", + "\n", + "First, you can use the special `__file__` variable, which contains the path to the current file when your code runs from a `.py` file (it is not defined in a notebook or interactive session).\n", + "\n", + "Then you can use that as the \"anchor\" of your path, and navigate from there.\n", + "\n", + "A common pattern then is to get the current file's parent, and navigate from there:\n", + "\n", + 
"```python\n", + "from pathlib import Path\n", + "\n", + "path = Path(__file__).parent / \"towing.csv\"\n", + "```\n", + "\n", + "This line uses the special built-in variable `__file__` to get the path of the Python file itself.\n", + "It then gets this file's parent directory (`.parent`) and appends the filename \"towing.csv\" to it.\n", + "\n", + "Using this technique in your code allows you to set paths that don't depend on the current working directory.\n", + "\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.15" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}