Merge branch 'altair-lab'

This commit is contained in:
James Turk 2024-10-02 17:43:49 -05:00
commit aaffd1ec3b
13 changed files with 469258 additions and 0 deletions

View File

@ -0,0 +1,523 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d7b2fe09-0107-4c55-b9b1-ff9e6cfc5333",
"metadata": {},
"source": [
"# Altair Practice Lab\n",
"\n",
"In this assignment we will be crafting a series of visualizations using\n",
"Altair to get practice working with real data in this context.\n",
"\n",
"Your responses should be within the functions given,\n",
"using appropriate helper functions to help with clarity\n",
"and reduce redundancy."
]
},
{
"cell_type": "markdown",
"id": "2160f88c-ce5b-43e4-a19b-17b9116de256",
"metadata": {},
"source": [
"## Rubric\n",
"\n",
"The criteria to receive an S on this assignment is a good-faith attempt at each portion.\n",
"\n",
"A good-faith attempt should either:\n",
"\n",
"- be fundamentally correct producing the expected output with minimal deviations,\n",
"- **OR** contain an explanation of what does not work and _details on what was tried_.\n",
"\n",
"**To receive an N, at least half of the assignment should have good-faith attempts.**\n",
"\n",
"Your charts do not need to match the examples exactly! They are helpful to get a sense of what you're after, but focus on the problem description."
]
},
{
"cell_type": "markdown",
"id": "203da284-d444-40aa-9672-08b06089270b",
"metadata": {},
"source": [
"## Introducing the Dataset\n",
"\n",
"In this directory you'll find two files:\n",
"\n",
"**legislators.csv** which consists of ~7400 records representing state legislators, it has the following fields:\n",
"\n",
"- name\n",
"- given_name\n",
"- family_name\n",
"- party: As reported by the state.\n",
"- gender: Male / Female / Other\\*\n",
"- jurisdiction: This field contains an identifier for the state or jurisdiction, see below for details.\n",
"- district: the name of the district represented\n",
"- type: upper | lower - The classification of the legislative. Most states have both, but DC and NE only have an upper chamber.\n",
"\n",
"Note: Accurate data on gender is hard to come by in many states. There may be irregularities in this field. This is also why this field does not make further distinctions beyond Male/Female/Other."
]
},
{
"cell_type": "markdown",
"id": "92fb1bea-0448-4f96-be36-e8752a1605cf",
"metadata": {},
"source": [
"## Part 1: Data Exploration\n",
"\n",
"First, we'll build a few exploratory visualizations to get a sense of the data for this assignment.\n",
"\n",
"### 1.0: Cleaning\n",
"\n",
"As mentioned above, there is no 'state' field. Jurisdiction is in the format:\n",
"\n",
"`ocd-jurisdiction/country:us/state:nc/government`\n",
"\n",
"and for non-states:\n",
"\n",
"- `ocd-jurisdiction/country:us/district:dc/government`\n",
"- `ocd-jurisdiction/country:us/territory:pr/government`\n",
"\n",
"So for our purposes, we want to add a `state` column from the two letter code after either \"state:\", \"district:\", or \"territory:\".\n",
"(We will treat DC and PR as states.)\n",
"\n",
"Complete the function `legislators_df` which should return the data from `legislators.csv` in a dataframe, with an additional `state` column."
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "8285070f-682d-446f-8992-56f5a4a967b8",
"metadata": {},
"outputs": [],
"source": [
"def legislators_df():\n",
" pass # IMPLEMENTATION HERE\n",
"\n",
"# render dataframe\n",
"legislators_df()"
]
},
{
"cell_type": "markdown",
"id": "50f31b88-90b3-4ebe-b91e-dc66b1dbf71e",
"metadata": {},
"source": [
"### 1.1: Initial Plot\n",
"\n",
"First let's build a visualization of gender breakdowns in state legislatures.\n",
"Use the following:\n",
"\n",
"- stacked bars per state\n",
"- each segment of stacked bar is gender\n",
"\n",
"Your graph should somewhat resemble *imgs/ex1.1.png*."
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "97bebe40-bee5-4983-babe-1d1112b2d636",
"metadata": {},
"outputs": [],
"source": [
"def states_by_gender_initial(df):\n",
" pass # IMPLEMENTATION HERE\n",
"\n",
"# render chart\n",
"states_by_gender_initial(legislators_df())"
]
},
{
"cell_type": "markdown",
"id": "e28870e3-2a98-4ebe-b52e-6769cc31a563",
"metadata": {},
"source": [
"### 1.2: Improvements\n",
"\n",
"While it is clear from the first chart that there are more elected officials that are men than women, it is hard to compare across states.\n",
"\n",
"Make the following adjustments:\n",
"\n",
"- Normalize the chart so that each bar is a percentage, allowing for direct comparison across states.\n",
"- Since this is US political data, the colors red and blue have a strong meaning, associated with the Republican and Democratic parties. Change the color scheme to avoid red and blue. (I chose #8624f5 for women and #1fc3aa for men based on this article: <https://blog.datawrapper.de/gendercolor/>)\n",
"- Two states are very close to 50%, add a line at 50% using a layered chart to make it easier to see if they exceed 50% or not.\n",
"\n",
"Your graph should somewhat resemble *imgs/ex1.2.png*."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "bb8332be-1570-472c-b7f9-c7fe820bbfc9",
"metadata": {},
"outputs": [],
"source": [
"def states_by_gender_improved(df):\n",
" pass # IMPLEMENTATION HERE\n",
"\n",
"# render chart\n",
"states_by_gender_improved(legislators_df())"
]
},
{
"cell_type": "markdown",
"id": "85fdeee8-27ce-4730-864f-171dfa4a85c4",
"metadata": {},
"source": [
"## Part 2: Party Breakdown\n",
"\n",
"We'll now take a look at party control. We can start with essentially the same chart.\n",
"\n",
"### 2.0 - Party Control\n",
"\n",
"Copy your code from 1.2 above & modify it to use party instead of gender. Your graph will wind up with too many parties, see `imgs/ex2.0.png`."
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "6af187e2-198c-4ebd-931f-c1c2973ad3a2",
"metadata": {},
"outputs": [],
"source": [
"def party_control_raw(df):\n",
" pass # IMPLEMENTATION HERE\n",
"\n",
"party_control_raw(legislators_df())"
]
},
{
"cell_type": "markdown",
"id": "27bb9ccf-3ad8-4d37-8cfc-64aea58d9848",
"metadata": {},
"source": [
"### 2.1 - Cleaning Data\n",
"\n",
"The above graph still has some shortcomings:\n",
"\n",
"- Most states have an upper and lower chamber, and party control may vary between them. We'll need to make two bars per state (which we'll tackle in 2.2).\n",
"- Also, there are too many variations of party as you can see here:\n",
"\n",
"Let's transform the data again, adding a new column \"party_code\" with the following rules:\n",
"\n",
"- if the word 'Democratic' appears, set party_code to 'D'\n",
"- if the word 'Republican' appears, set the party_code to 'R'\n",
"- otherwise, set the party_code to 'O'\n",
"\n",
"Party data in NE, DC, and PR does not work with this scheme.\n",
"For simplicity, we will exclude them from our analysis.\n",
"\n",
"For this portion, implement `clean_party_df` which should return a modified legislators DataFrame with the `party_code` column, and the rows for states 'DC', 'NE' and 'PR' dropped."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "f6e95081-bb32-49a7-be03-3cad634b5f1b",
"metadata": {},
"outputs": [],
"source": [
"def clean_party_df():\n",
" # start with the DataFrame from part 1 & return transformed copy\n",
" df = legislators_df()\n",
" # IMPLEMENTATION HERE\n",
"\n",
"clean_party_df()"
]
},
{
"cell_type": "markdown",
"id": "794c3b48-7c52-4808-a215-97df57c7b7f9",
"metadata": {},
"source": [
"### 2.2 - Faceted Plot\n",
"\n",
"Add a function `party_control_by_chamber` that contains the following elements:\n",
"\n",
"- One bar per state, **along the Y axis**.\n",
"- Each bar should consist of a stack: a blue portion, a green portion, and a red portion, corresponding to the D, O, and R `party_code` respectively.\n",
"- A vertical line at the 50% mark, indicating (likely) party control.\n",
"- Finally, facet the chart on `type` so that you get a set of bars for the lower and upper chambers.\n",
"\n",
"See `imgs/ex2.2.png` for an example."
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "af9397d2-da7b-4c79-bb03-a98c0c98f5be",
"metadata": {},
"outputs": [],
"source": [
"def party_control_by_chamber(df):\n",
" pass # IMPLEMENTATION HERE\n",
"\n",
"party_control_by_chamber(clean_party_df())"
]
},
{
"cell_type": "markdown",
"id": "be7cd442-d5c4-463a-ac58-e4746bd06ba3",
"metadata": {},
"source": [
"## Part 3: Comparing by Population\n",
"\n",
"For part three, we are interested in the relationship of various properties of legislatures to the total population of the state.\n",
"\n",
"To do this, we'll need to create a combined DataFrame that mixes in data from `populations.csv`."
]
},
{
"cell_type": "markdown",
"id": "4f7b26f4-8258-4703-9391-4af27d939d74",
"metadata": {},
"source": [
"### 3.0 - Create Combined DataFrame\n",
"\n",
"Write the function `population_combined_df`, which should return a DataFrame with the columns:\n",
"\n",
"- state: abbreviation of state\n",
"- upper: total seats in upper chamber\n",
"- lower: total seats in lower chamber\n",
"- pop_2020: the 2020 population, obtained from merging with `population.csv`\n",
"\n",
"**Data Note:** These numbers are based on the non-vacant seats as-of a particular day in September 2024. Vacancies will cause the counts to be off by a bit, but the general size of the legislature should be roughly the same."
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "0db0bec3-09df-4e51-97d8-4882e834e81a",
"metadata": {},
"outputs": [],
"source": [
"def population_combined_df():\n",
" pass # IMPLEMENTAION HERE"
]
},
{
"cell_type": "markdown",
"id": "37df7c37-cfde-4a5e-b7b9-f2e8c4617b39",
"metadata": {},
"source": [
"### 3.1 - Create Population vs. Seats Scatterplot\n",
"\n",
"Create a new plot with two layers:\n",
"\n",
"- Population on the X axis\n",
"- Number of seats on the Y axis\n",
"- Upper chamber points should be purple and use the 'triangle-up' shape.\n",
"- Lower chamber points should be orange and use the 'triangle-down' shape.\n",
"- Make a customization or two to your chart's default labels and axes, whatever you feel is appropriate.\n",
"\n",
"Hint: You can layer two charts for this."
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "50274435-48e5-4f6f-9b13-e863ffba9a1c",
"metadata": {},
"outputs": [],
"source": [
"def scatter_pop_size():\n",
" pass # IMPLEMENTATION HERE"
]
},
{
"cell_type": "markdown",
"id": "ad14558c-c3ee-4dc2-9ab4-497b6fd0de01",
"metadata": {},
"source": [
"### 3.2 - Regressions\n",
"\n",
"Add two more layers, a purple & orange regression line for each chamber. See `imgs/ex3.2.png`\n",
"\n",
"Hint: See `transform_regression`."
]
},
{
"cell_type": "code",
"execution_count": 43,
"id": "54bf78b6-5031-4c3b-9368-7787fe492c2b",
"metadata": {},
"outputs": [],
"source": [
"def scatter_pop_size_regression():\n",
" pass # IMPLEMENTATION HERE"
]
},
{
"cell_type": "markdown",
"id": "82c005e1-ad3d-49e9-a4a8-87fba2700ebb",
"metadata": {},
"source": [
"## Part 4: Actions Heatmap\n",
"\n",
"The file `actions_il-in-mi-wi_2021-2024.csv` contains nearly half a million records, representing every official action taken on a piece of legislation in these four midwestern states over the past two sessions.\n",
"\n",
"Legislatures work quite differently, some meet all year, while others meet for very short periods.\n",
"By creating a heatmap of what days actions take place, we can get a sense of how different states compare.\n",
"\n",
"### 4.0 - Load Actions\n",
"\n",
"Complete `actions_df`, which should load the data from `actions_il-in-mi-wi_2021-2024.csv`.\n",
"\n",
"Tips: \n",
"- Make sure that the `date` column is loaded as a date type!\n",
"- Dates are in YYYY-MM-DD format, though some also have additional characters for time, which you will want to ignore."
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "40c1eb03-8960-4fde-85e6-8b1fd2e4c6a5",
"metadata": {},
"outputs": [],
"source": [
"def actions_df():\n",
" pass # IMPLEMENTATION HERE"
]
},
{
"cell_type": "markdown",
"id": "aa1772db-b31b-4282-9ad7-602bdc6c71b9",
"metadata": {},
"source": [
"### 4.1 - Actions Heatmap\n",
"\n",
"Generate a heatmap (using `mark_rect`) with:\n",
"\n",
"- a row per state\n",
"- each row consists of shaded marks with shading based on the total action count for a given week\n",
"\n",
"Tip: Use the 'yearweek(date)' aggregation for the X channel.\n",
"\n",
"See `imgs/ex4.1.png`."
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "81f24d38-f9f9-4c00-8d10-c8a8b42891b7",
"metadata": {},
"outputs": [],
"source": [
"def actions_heatmap_scaled(df):\n",
" pass # IMPLEMENTATION HERE\n",
"\n",
"actions_heatmap_scaled(actions_df())"
]
},
{
"cell_type": "markdown",
"id": "03f78853-5c2d-4da4-b612-5514751bce6f",
"metadata": {},
"source": [
"### 4.2 - Excluding IL Outliers\n",
"\n",
"Illinois clearly dominates the above graph, below modify two calls to `actions_heatmap` with a modified dataframe that excludes IL, and a modified dataframe that only includes IL.\n",
"\n",
"(Note how by using functions in our dataframe we can more easily reuse portions by making small adjustments to the data.)\n",
"\n",
"See `ex4.2a.png` and `ex4.2b.png`"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "4be0c226-a362-4189-a303-19c7fa0de29b",
"metadata": {},
"outputs": [],
"source": [
"# MODIFY THIS LINE (exclude IL)\n",
"actions_heatmap_scaled(actions_df())"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "ca6a81aa-a2c3-44e3-a4b1-029bf8d49f86",
"metadata": {},
"outputs": [],
"source": [
"# MODIFY THIS LINE (only IL)\n",
"actions_heatmap_scaled(actions_df())"
]
},
{
"cell_type": "markdown",
"id": "58f04553-cfe4-4533-a21c-ea66523bc6b1",
"metadata": {},
"source": [
"#### 4.3 - Cumulative Line Chart\n",
"\n",
"Another way to view this data would be with a cumulative line chart.\n",
"\n",
"Create a chart with:\n",
"\n",
"- days on the X axis\n",
"- cumulative actions to date on the Y axis\n",
"- one line per state\n",
"\n",
"Hint: To do this you will need to look at the `transform_window` function.\n",
"\n",
"See `ex4.3.png` for an example."
]
},
{
"cell_type": "code",
"execution_count": 45,
"id": "93af038f-23d6-412a-9240-ceaa830ecfe5",
"metadata": {},
"outputs": [],
"source": [
"def actions_cumulative(df):\n",
" pass # IMPLEMENTATION HERE\n",
"\n",
"actions_cumulative(actions_df())"
]
},
{
"cell_type": "markdown",
"id": "aba02ff4-5d98-4ca7-9da9-d7facceb4bca",
"metadata": {},
"source": [
"### TODO?\n",
"\n",
"- error bars?\n",
"- rank line chart?\n",
"- polish a chart"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "acf7e21d-33b2-4a58-a549-7f04c14aad22",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,53 @@
state,pop_2020
al,5024294
ak,733374
az,7157902
ar,3011490
ca,39538212
co,5773707
ct,3605912
de,989946
dc,689548
fl,21538216
ga,10713771
hi,1455274
id,1839117
il,12813469
in,6785442
ia,3190427
ks,2937835
ky,4506297
la,4657785
me,1363177
md,6177253
ma,7032933
mi,10077674
mn,5706804
ms,2961306
mo,6154889
mt,1084244
ne,1961965
nv,3104617
nh,1377524
nj,9289039
nm,2117525
ny,20202320
nc,10439459
nd,779079
oh,11799331
ok,3959411
or,4237279
pa,13002788
ri,1097371
sc,5118422
sd,886668
tn,6910786
tx,29145459
ut,3271614
vt,643077
va,8631373
wa,7705267
wv,1793713
wi,5893713
wy,576850
pr,3285874
1 state pop_2020
2 al 5024294
3 ak 733374
4 az 7157902
5 ar 3011490
6 ca 39538212
7 co 5773707
8 ct 3605912
9 de 989946
10 dc 689548
11 fl 21538216
12 ga 10713771
13 hi 1455274
14 id 1839117
15 il 12813469
16 in 6785442
17 ia 3190427
18 ks 2937835
19 ky 4506297
20 la 4657785
21 me 1363177
22 md 6177253
23 ma 7032933
24 mi 10077674
25 mn 5706804
26 ms 2961306
27 mo 6154889
28 mt 1084244
29 ne 1961965
30 nv 3104617
31 nh 1377524
32 nj 9289039
33 nm 2117525
34 ny 20202320
35 nc 10439459
36 nd 779079
37 oh 11799331
38 ok 3959411
39 or 4237279
40 pa 13002788
41 ri 1097371
42 sc 5118422
43 sd 886668
44 tn 6910786
45 tx 29145459
46 ut 3271614
47 vt 643077
48 va 8631373
49 wa 7705267
50 wv 1793713
51 wi 5893713
52 wy 576850
53 pr 3285874

BIN
labs/altair/imgs/ex1.1.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 33 KiB

BIN
labs/altair/imgs/ex1.2.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 43 KiB

BIN
labs/altair/imgs/ex2.0.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 65 KiB

BIN
labs/altair/imgs/ex2.2.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 121 KiB

BIN
labs/altair/imgs/ex3.2.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 67 KiB

BIN
labs/altair/imgs/ex4.1.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 69 KiB

BIN
labs/altair/imgs/ex4.2a.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 67 KiB

BIN
labs/altair/imgs/ex4.2b.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 61 KiB

BIN
labs/altair/imgs/ex4.3.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 77 KiB