file-io appendix
This commit is contained in:
parent
64b09cdaa5
commit
a1ed6175de
409
file-io.ipynb
Normal file
409
file-io.ipynb
Normal file
@ -0,0 +1,409 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "907bb05d-99ef-400b-93ec-9fd2bf8b619c",
|
||||
"metadata": {
|
||||
"tags": [],
|
||||
"toc-hr-collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"## I/O"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ff711218-4d4c-4999-a378-3a03ae5d6222",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### `print()`\n",
|
||||
"\n",
|
||||
"`print(*objects, sep=' ', end='\\n', file=sys.stdout, flush=False)`\n",
|
||||
"\n",
|
||||
"https://docs.python.org/3/library/functions.html#print"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "894ed224-a63a-44d7-b3a5-3dd4233830b7",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"print(\"Can\", \"pass\", \"multiple\", {\"objects\": True})\n",
|
||||
"print(\"Hello\", \"World\", sep=\"~~~~\", end=\"!\")\n",
|
||||
"print(\"Same line\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "0d05db43-1062-4a51-a205-b91e37d7d9c1",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### `input()`"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "302e8ee6-9f25-4202-99e0-a3556cd53da7",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"name = input(\"What is your name: \")\n",
|
||||
"print(f\"Hello {name}\")\n",
|
||||
"\n",
|
||||
"# always a string\n",
|
||||
"year = input(\"What year is it: \")\n",
|
||||
"print(year, type(year))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e12f7814-b77d-4c57-9857-7dd5238e5d08",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### pathlib\n",
|
||||
"\n",
|
||||
"There are a few ways of working with files in Python, mostly due to improvements over time.\n",
|
||||
"\n",
|
||||
"You'll still sometimes see code that uses the older method with `open`, but there's almost no reason to write code in that style now that `pathlib` is widely available.\n",
|
||||
"\n",
|
||||
"To use `pathlib`, you'll need to import the `Path` object. (We'll discuss these imports more soon.)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6d64c104-bf39-4932-9178-922bb2ecb43d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from pathlib import Path"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "3be8ec33-dcea-4834-8fc4-0bf71b19b72d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Imports like this should be at the top of the file.\n",
|
||||
"\n",
|
||||
"To use this type you'll create objects with file paths, for example:"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "25f2bbab-d317-404c-be35-c31df5180a5a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# this looks like a function call\n",
|
||||
"# but the capital letter denotes that this is instead a class\n",
|
||||
"file_path = Path(\"data/names.txt\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "ad9a7bb9-e8ff-4929-a50d-5bc4e1aeb9f4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Typical workflow:\n",
|
||||
"\n",
|
||||
"- Read contents of file(s) from disk into working memory.\n",
|
||||
"- Parse and/or manipulate data as needed.\n",
|
||||
"- (Optional) Write data back to disk with modifications.\n",
|
||||
"\n",
|
||||
"#### Other Workflows\n",
|
||||
"\n",
|
||||
"- Append-only (e.g. logging)\n",
|
||||
"- Streaming data (needed for large files where we can't fit into memory)\n",
|
||||
"\n",
|
||||
"#### Text vs. Binary\n",
|
||||
"\n",
|
||||
"We're opening our files in the default, text mode. It is also possible to open files in a binary mode where it isn't assumed we're reading strings."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "31b06442-f771-48cd-b821-0ec6f54b1188",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Reading From a File\n",
|
||||
"\n",
|
||||
"**emails.txt**\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"borja@cs.uchicago.edu\n",
|
||||
"jturk@uchicago.edu\n",
|
||||
"lamonts@uchicago.edu\n",
|
||||
"```"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "3b6afd77-85ac-4be1-82d5-39273e7df035",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# to access a file's contents, we create the path, and then\n",
|
||||
"# use read_text()\n",
|
||||
"emails_path = Path(\"data/emails.txt\")\n",
|
||||
"emails = emails_path.read_text()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "97883212-61e5-4c22-9d4b-dbfee04bf382",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Writing to a File\n",
|
||||
"\n",
|
||||
"We need to open the file with write or append permissions."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "51000430-0b55-41a5-8573-759b76529494",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"names_file = Path(\"data/animals.txt\").open(\"w\")\n",
|
||||
"names_file.write(\"Aardvark\\nChimpanzee\\nElephant\\n\")\n",
|
||||
"\n",
|
||||
"# (the ! indicates this is is a shell command, not Python)\n",
|
||||
"!cat data/animals.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "a379fae3-a5e8-4917-898d-e4ec6c3e91c1",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# open(\"w\") erases the file, use \"a\" if you want to append\n",
|
||||
"names_file = Path(\"data/animals.txt\").open(\"a\")\n",
|
||||
"names_file.write(\"Kangaroo\\n\")\n",
|
||||
"names_file.flush()\n",
|
||||
"!cat data/animals.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "64cd8142-037d-4005-b591-4a1cda922b07",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### `flush` and `close`\n",
|
||||
"\n",
|
||||
"`flush` ensures that the in-memory contents get written to disk, actually saved.\n",
|
||||
"\n",
|
||||
"(Analogy: program crashes and you lose your unsaved work)\n",
|
||||
"\n",
|
||||
"At the end, important to `close` the file.\n",
|
||||
"\n",
|
||||
"- Frees resources.\n",
|
||||
"- Allows other programs to access file contents.\n",
|
||||
"- Ensures edits are written to disk."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "fb6aa734-ea51-4582-9d4a-aea652d9dec0",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### `with`\n",
|
||||
"\n",
|
||||
"The file object is a \"context manager\", we'll cover those in more detail in a few weeks.\n",
|
||||
"\n",
|
||||
"The `with` statement allows us to safely use files without fear of leaving them open.\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"\n",
|
||||
"with path.open() as variable:\n",
|
||||
" statement1\n",
|
||||
" statement2\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"No matter what happens inside `with` block, the file will be closed."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "e1428cd7-53ac-4138-8b43-fe7b477ccb24",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"f = open(\"names.txt\", \"w\")\n",
|
||||
"f.write(\"Bob\\n\")\n",
|
||||
"f.write(\"Phil\\n\")\n",
|
||||
"1 / 0"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "2a632c3d-9d35-441d-a123-49a6ba44c432",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"!cat names.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "0c62751d-6f85-4bfb-8e15-788d2d928089",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Full Example\n",
|
||||
"\n",
|
||||
"# load data into our chosen data structure\n",
|
||||
"emails = []\n",
|
||||
"with open(\"data/emails.txt\") as f:\n",
|
||||
" for email in f:\n",
|
||||
" emails.append(email)\n",
|
||||
"print(emails)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "81df15e5-da33-4367-ae08-dea08cfc2cf6",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# transform data\n",
|
||||
"cnet_ids = []\n",
|
||||
"for email in emails:\n",
|
||||
" cnet_id, domain = email.split(\"@\")\n",
|
||||
" cnet_ids.append(cnet_id)\n",
|
||||
"print(cnet_ids)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6b5c5988-90ea-4464-8c11-c51518f0657c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# write new data\n",
|
||||
"with open(\"data/cnetids.txt\", \"w\") as f:\n",
|
||||
" for cnet_id in cnet_ids:\n",
|
||||
" # print() adds newlines by default\n",
|
||||
" print(cnet_id, file=f)\n",
|
||||
" # or\n",
|
||||
" # f.write(cnet_id + \"\\n\")\n",
|
||||
"\n",
|
||||
"!cat data/cnetids.txt"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "eb00e0ce-ff29-40d2-aa72-fa34315fa9a5",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"#### Useful `file` Methods\n",
|
||||
"\n",
|
||||
"| Operation | Purpose |\n",
|
||||
"|-----------|---------|\n",
|
||||
"| `f.read()` | Read entire file & return contents. |\n",
|
||||
"| `f.read(N)` | Read N characters (or bytes). |\n",
|
||||
"| `f.readline()` | Read up to (and including) next newline. |\n",
|
||||
"| `f.readlines() ` | Read entire file split into list of lines. |\n",
|
||||
"| `f.write(aStr)` | Write string `aStr` into file. |\n",
|
||||
"| `f.writelines(lines)` | Write list of strings into file. |\n",
|
||||
"| `f.close()` | Close file, prefer `with open()` instead. |\n",
|
||||
"| `f.flush()` | Manually flush output to disk without closing. |\n",
|
||||
"| `f.seek(N)` | Move cursor to position N. |\n",
|
||||
"\n",
|
||||
"-- Table based on Learning Python 2013"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "f152aaaf-0c07-41c4-a90a-75891606c14e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Common Gotchas\n",
|
||||
"\n",
|
||||
"- Relative paths - use `pathlib` https://docs.python.org/3/library/pathlib.html\n",
|
||||
"- File permissions\n",
|
||||
"- Mind file mode (read/write/append)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "b01a89d7-1419-40bf-8005-78cbedad82b8",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cc364347-7e0a-47ef-a90d-e7b3d91e3fc1",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Note: Relative Paths\n",
|
||||
"\n",
|
||||
"You may find that if you are running your code from, for example, the homework1 directory instead of homework1/problem3, you'd need to modify this path to be `Path(\"problem3/towing.csv\")`.\n",
|
||||
"\n",
|
||||
"That is because by default, paths are *relative*, meaning that they are assumed to start in the directory that you are running your code from.\n",
|
||||
"\n",
|
||||
"This can be frustrating at first, you want your code to work the same regardless of what directory you are in.\n",
|
||||
"\n",
|
||||
"### Building an absolute path\n",
|
||||
"\n",
|
||||
"To get around this, you can construct an absolute path:\n",
|
||||
"\n",
|
||||
"First you can use the special `__file__` variable which always contains the path to the current file.\n",
|
||||
"\n",
|
||||
"Then you can use that as the \"anchor\" of your path, and navigate from there.\n",
|
||||
"\n",
|
||||
"A common pattern then is to get the current file's parent, and navigate from there:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"from pathlib import Path\n",
|
||||
"\n",
|
||||
"path = Path(__file__).parent / \"towing.csv\"\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"This line uses the special built-in variable `__file__` to get the path of the Python file itself.\n",
|
||||
"It then gets this file's parent directory (`.parent`) and appends the filename \"towing.csv\" to it.\n",
|
||||
"\n",
|
||||
"Using this technique in your code allows you to set paths that don't depend on the current working directory.\n",
|
||||
"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.15"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
Loading…
Reference in New Issue
Block a user