import morpc
import numpy as np
import pandas as pd
import random
Conversion factors¶
As of January 2024, the following commonly used conversion factors are available in the library. Review morpc/morpc.py to see whether others are available.
Area¶
Square feet per acre
morpc.CONST_SQFT_PER_ACRE
43560
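For example, to convert a hypothetical parcel area from acres to square feet, multiply by the constant:
acres = 2.5  # hypothetical area in acres
sqft = acres * morpc.CONST_SQFT_PER_ACRE  # 2.5 * 43560 = 108900.0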
Round preserving sum (aka “bucket rounding”)¶
Imagine we have a series of values that need to be rounded, but we want the rounded values to sum to the same value as the original series. Create a random series for demonstration purposes.
rawValues = pd.Series([random.randrange(0, 100000)/100 for x in range(1,10)])
list(rawValues)
[295.66, 350.99, 619.41, 461.7, 201.56, 106.95, 645.77, 627.49, 403.05]
Specify the number of decimal digits to preserve. For this demo we’ll round to integers (i.e. zero decimal places), which is typically what we want, but the function supports rounding to other decimal places as well.
digits = 0
Perform bucket-rounding
bucketRoundedValues = morpc.round_preserve_sum(rawValues, digits, verbose=True)
Multiplier: 1
Inflated values: [295.66, 350.99, 619.41, 461.7, 201.56, 106.95, 645.77, 627.49, 403.05]
Truncated values: [295, 350, 619, 461, 201, 106, 645, 627, 403]
Residuals for individual records: [0.66, 0.99, 0.41, 0.7, 0.56, 0.95, 0.77, 0.49, 0.05]
Order of residuals: [8, 2, 7, 4, 0, 3, 6, 5, 1]
Overall residual: 5.579999999999927
Units to reallocate: 6
Indices of records to receive reallocated units: [4, 0, 3, 6, 5, 1]
Adjusted values (still inflated): [296, 351, 619, 462, 202, 107, 646, 627, 403]
Bucket-rounded values (deflated): [296, 351, 619, 462, 202, 107, 646, 627, 403]
Raw values:
rawValues.tolist()
[295.66, 350.99, 619.41, 461.7, 201.56, 106.95, 645.77, 627.49, 403.05]
Bucket-rounded values:
bucketRoundedValues.tolist()
[296, 351, 619, 462, 202, 107, 646, 627, 403]
Sum of raw values:
round(sum(rawValues))
3713
Sum of bucket-rounded values:
sum(bucketRoundedValues)
3713
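The verbose output above traces each stage of the algorithm. As a rough sketch of the logic (a simplified illustration, not the library's actual implementation):
import math

def bucket_round_sketch(values, digits=0):
    # Inflate by 10^digits so the integer part carries the desired precision
    multiplier = 10 ** digits
    inflated = [v * multiplier for v in values]
    # Truncate each value, then measure what each record lost
    truncated = [math.floor(v) for v in inflated]
    residuals = [v - t for v, t in zip(inflated, truncated)]
    # Truncation dropped some whole units overall; they must be handed back
    units = round(sum(residuals))
    # Give one unit back to each of the records with the largest residuals
    order = sorted(range(len(values)), key=lambda i: residuals[i], reverse=True)
    for i in order[:units]:
        truncated[i] += 1
    # Deflate back to the requested precision
    return [t / multiplier if digits else t for t in truncated]
In short, truncation drops some total quantity, and that quantity is returned one unit at a time to the records that lost the most.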
Control variable to group¶
Often we have a set of values representing the members of some group and we need the sum of those values to match a total for the group that was computed independently. Perhaps the best known example of this is the annual population estimates for sub-county jurisdictions. The estimates for all of the jurisdictions in the county must total to the county-level population estimates, which are derived independently. In this case the county (group) totals are known as the “control values” or “control totals” and the process of adjusting the sub-county (group member) values so that their total is equal to the control total is known as “controlling” the variable. The process includes the following steps, which will be described in more detail below.
- Establish control values for the groups (e.g. the county-level estimates in the example above)
- Create a series of grouped values to be controlled (e.g. the sub-county estimates)
- Control the values in each group to the control total. This consists of three sub-parts:
- Compute group sums
- Compute group shares
- Compute controlled values
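Before working with real data, here is a minimal sketch of the arithmetic behind these three sub-parts (a simplified illustration; the library functions themselves are demonstrated below):
# Hypothetical group whose members sum to 110 but whose control total is 100
memberValues = pd.Series([60, 30, 20])
controlTotal = 100

groupSum = memberValues.sum()                  # sub-part 1: group sum is 110
groupShares = memberValues / groupSum          # sub-part 2: shares of about 0.545, 0.273, 0.182
controlledValues = groupShares * controlTotal  # sub-part 3: about 54.5, 27.3, 18.2
print(controlledValues.sum())                  # 100.0, matching the control total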
In the sections that follow, we’ll look at a more contrived example, namely controlling the 2021 ACS 5-year estimates for county subdivisions to the 2020 decennial county populations. This is not a recommended application and is used only for the sake of convenience.
Establish control values for groups¶
Download county populations from 2020 decennial census
import requests
r = requests.get(
    url="https://api.census.gov/data/2020/dec/dhc",
    params={
        "get": ",".join(["P1_001N"]),
        # countyLookup is the county lookup object assumed to have been created earlier in this notebook
        "for": "county:{}".format(",".join([x[2:] for x in countyLookup.list_ids()])),
        "in": "state:39"
    }
)
records = r.json()
countyPop = pd.DataFrame.from_records(records[1:], columns=records[0])
countyPop["C_GEOID"] = countyPop["state"] + countyPop["county"]
countyPop = countyPop.loc[countyPop["county"].isin([x[2:] for x in countyLookup.list_ids()])].copy() \
.rename(columns={"P1_001N":"C_POP"}) \
.drop(columns={"state","county"}) \
.astype({"C_POP":"int"}) \
.set_index("C_GEOID")
Now we have the population for each county (indexed by their GEOIDs) which will serve as the control totals.
countyPop.head()
Create series of grouped values to be controlled¶
Download sub-county populations from the 2021 ACS 5-year estimates
r = requests.get(
    url="https://api.census.gov/data/2021/acs/acs5",
    params={
        "get": ",".join(["NAME","GEO_ID","B01001_001E","B01001_001M"]),
        "for": "county subdivision:*",
        "in": [
            "state:39",
            "county:{}".format(",".join([x[2:] for x in countyLookup.list_ids()])),
        ]
    }
)
records = r.json()
subdivPop = pd.DataFrame.from_records(records[1:], columns=records[0])
subdivPop = subdivPop \
.rename(columns={"GEO_ID":"GEOID","B01001_001E":"POP","B01001_001M":"POP_MOE"}) \
.astype({"POP":"int"}) \
.set_index("GEOID")
subdivPop["C_GEOID"] = subdivPop["state"] + subdivPop["county"]
Now we have population estimates for the members of each group (county). Note that the county GEOID (C_GEOID) has been assigned to each member record. We’ll use this to iterate through groups.
subdivPop.head()
Note that the sum of the subdivision populations doesn’t match the sum of the county populations. This is expected, and it is the reason we need to control the subdivision values.
subdivPop["POP"].sum()
countyPop["C_POP"].sum()
Control the values in each group to the control total¶
Recall that this step has three sub-parts:
- Compute group sums (see morpc.compute_group_sum())
- Compute group shares (see morpc.compute_group_share())
- Compute controlled values (see morpc.compute_controlled_values())
The morpc-common library has a function for each of these steps as noted above, but it also has a high-level function that performs all three steps in sequence, namely morpc.control_variable_to_group(). It requires the following inputs:
- inputDf is a pandas DataFrame with a column containing the group shares and (optionally) a column containing the group labels.
- controlValues is one of the following:
  - If groupbyField == None: controlValues is a scalar number (integer or float)
  - If groupbyField != None: controlValues is a pandas Series of numbers indexed by group labels
- groupbyField (optional) is the name of the column of inputDf that contains the group labels.
- shareField (optional) is the name of the column of inputDf containing the shares that the values comprise. If this is not specified, “GROUP_SHARE” will be used.
- roundPreserveSumDigits (optional) is the number of decimal places that the scaled values (i.e. the values in the “CONTROLLED_VALUE” column) should be rounded to. A “bucket rounding” technique (see morpc.round_preserve_sum()) will be used to ensure that the sum of the values in the group is preserved. If this is not specified, the scaled values will be left unrounded.
This is what the function call looks like for our example case:
subdivPopControlled = morpc.control_variable_to_group(inputDf=subdivPop, controlValues=countyPop["C_POP"], valueField="POP", groupbyField="C_GEOID", roundPreserveSumDigits=0)
subdivPopControlled.head()
Check the results¶
Now the sum of our controlled values should match the county control totals. We can see that this is true by comparing the “POP_SUM_CONTROLLED” column (which is the sum of “CONTROLLED_VALUE” by county) and the “C_POP” column (which is the county control total) and verifying that the two are equal for all records.
subdivPopControlled[["C_GEOID","POP","CONTROLLED_VALUE"]] \
    .groupby("C_GEOID").sum() \
    .rename(columns={"POP":"POP_SUM","CONTROLLED_VALUE":"POP_SUM_CONTROLLED"}) \
    .join(countyPop)
We may want to get a sense of how much adjustment of the sub-county values was required. To do this we can compute the difference between the controlled value and the original value and do some descriptive analysis.
subdivPopControlled["RESIDUAL"] = subdivPopControlled["CONTROLLED_VALUE"] - subdivPopControlled["POP"]
subdivPopControlled["RESIDUAL_PCT"] = subdivPopControlled["RESIDUAL"]/subdivPopControlled["POP"]
subdivPopControlled["RESIDUAL_PCT"] = subdivPopControlled["RESIDUAL_PCT"].replace(np.inf, 0)
subdivPopControlled["RESIDUAL_PCT"] = subdivPopControlled["RESIDUAL_PCT"].replace(-np.inf, 0)
subdivPopControlled["RESIDUAL_PCT"] = subdivPopControlled["RESIDUAL_PCT"].fillna(0)
First we’ll look at the stats for the raw residual.
subdivPopControlled["RESIDUAL"].describe()
subdivPopControlled["RESIDUAL"].hist(bins=25, log=True)
The residual is close to zero in the vast majority of cases. Let’s look at the ten cases with the greatest residual.
subdivPopControlled[["NAME","POP","CONTROLLED_VALUE","RESIDUAL","RESIDUAL_PCT"]].sort_values("RESIDUAL", ascending=False).head(10)
And the ten cases with the smallest residual (which could be large in magnitude but negative).
subdivPopControlled[["NAME","POP","CONTROLLED_VALUE","RESIDUAL","RESIDUAL_PCT"]].sort_values("RESIDUAL", ascending=False).tail(10)
The raw residual for Columbus was very large, but as a percentage it is not that bad. Let’s look at the stats for the percentages.
subdivPopControlled["RESIDUAL_PCT"].describe()
subdivPopControlled["RESIDUAL_PCT"].hist(bins=25)
subdivPopControlled[["NAME","POP","CONTROLLED_VALUE","RESIDUAL","RESIDUAL_PCT"]].sort_values("RESIDUAL_PCT", ascending=False).head(10)
subdivPopControlled[["NAME","POP","CONTROLLED_VALUE","RESIDUAL","RESIDUAL_PCT"]].sort_values("RESIDUAL_PCT", ascending=False).tail(10)
Schema tools (Apache Avro format) - DEPRECATED¶
DEPRECATION WARNING: As of January 2024 the Data Team is considering a new standard for machine-readable metadata, namely TableSchema (see below). Code that makes use of the features described in this section will likely need to be updated to make use of the new standard when it is adopted. Use discretion when making use of these features.
Apache Avro is an open source data serialization technology that includes a JSON-based schema specification format. MORPC does not typically use the Avro format itself; however, code written through 2023 may have relied on schemas specified in Avro format. This section demonstrates utility functions for loading Avro-formatted schemas and using the schemas to manipulate data.
The demonstration relies on a local copy of data from the morpc-lookup repository in GitHub. Prior to running the code you must download the required data and schema and update the following paths (as needed) to point to the correct files.
dataPath = "../morpc-lookup/MORPC_Counties.csv"
schemaPath = dataPath.replace(".csv", "_schema.json")
print("Data path: {}".format(dataPath))
print("Schema path: {}".format(schemaPath))
Load the data.
dataRaw = pd.read_csv(dataPath)
dataRaw.head()
The data is stored in a CSV file. In a CSV, all data is stored as strings; there is no built-in mechanism for specifying the data type of each field. Note that pandas (like many other software applications) tries to infer the data type of each field when it loads the file.
dataRaw.dtypes
Sometimes the inference works well, other times not so well. It is safer to specify the field types explicitly. One way to do this is to create a schema definition for the data. Here is an example of a schema definition specified in Apache Avro format:
with open(schemaPath, "r") as f:
schemaRaw = f.read()
print(schemaRaw)
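As a hypothetical illustration (the names below are invented and are not the actual morpc-lookup schema), an Avro schema is a JSON record containing a list of typed fields, optionally with documentation and aliases:
{
  "type": "record",
  "name": "counties",
  "doc": "Hypothetical example schema",
  "fields": [
    {"name": "COUNTY", "type": "string", "doc": "County name"},
    {"name": "GEOID", "type": "string", "aliases": ["GEO_ID"]},
    {"name": "POP", "type": "int", "doc": "Total population"}
  ]
}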
Note that this format allows for specification of the field names and types, as well as dataset-level and variable-level metadata. Because Avro-style schemas are formatted as JSON, Python can easily convert the schema to a dictionary structure so that we can access it programmatically. The morpc-common library contains a convenience function to load the schema and convert it to a dictionary.
schema = morpc.load_avro_schema(schemaPath)
print("The fields in this dataset are: " + ", ".join([field["name"] for field in schema["fields"]]))
The morpc-common library contains several convenience functions for interacting with Avro schemas. One such function casts each of the fields in a dataset as the correct data type as specified in the schema.
data = morpc.cast_field_types(dataRaw, schema)
Now the data types should match the schema, regardless of what pandas inferred originally.
data.dtypes
A note about integers: The pandas “int” dtype does not support null values. If a field contains null values and you try to cast it as “int”, this function will automatically attempt to convert the field to “Int64” (which does support null values) instead. If this fails, it might be because the fractional part of one of your values (i.e. the part to the right of the decimal point) is non-zero. You can either round the values before attempting the conversion or set forceInteger=True when calling the function. In the latter case, the function will round the values to the ones place prior to recasting them.
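To illustrate the distinction (independent of the library), a float column containing nulls cannot be cast to the plain “int” dtype, but the nullable “Int64” dtype accepts it:
s = pd.Series([1.0, 2.0, None])   # float64 with a null value
# s.astype("int") raises an error because plain int cannot hold nulls
s.astype("Int64")                 # yields 1, 2, <NA> (nullable integer dtype)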
Here’s another function that creates a list object containing the names of the fields included in the schema.
morpc.avro_get_field_names(schema)
This one returns a dictionary mapping each field name to its type.
morpc.avro_to_pandas_dtype_map(schema)
Sometimes a variable may be referred to by different names. It is possible to list the alternate names in the schema using the “aliases” property. The following function creates a dictionary that maps the original field name to the first (and presumably most common) alias. This can be used to easily rename the fields in the dataset for use in a different application.
morpc.avro_map_to_first_alias(schema)
The following function does the reverse of the previous one, namely it creates a dictionary mapping the first alias to the original field name. This can be useful to reverse the previous remapping. It is also useful for Shapefiles, which have a ten-character field name limit. In that case, you can store the human-readable field name as the original field name and store the Shapefile-compliant field name as an alias.
morpc.avro_map_from_first_alias(schema)
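For instance, given the hypothetical field {"name": "GEOID", "type": "string", "aliases": ["GEO_ID"]} from the illustration above, morpc.avro_map_to_first_alias() would return {"GEOID": "GEO_ID"} and morpc.avro_map_from_first_alias() would return {"GEO_ID": "GEOID"}.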
Using the schema dictionary and the helper functions, you can easily do transformations of the data. Here are some examples. First, take a look at the original data.
data.head()
Rename the columns in the data to the first alias for each column.
data.rename(columns=morpc.avro_map_to_first_alias(schema)).head()
Filter and reorder fields.
reverseOrder = morpc.avro_get_field_names(schema)
reverseOrder.reverse()
data[reverseOrder].head()
oneLessField = morpc.avro_get_field_names(schema)
oneLessField.remove("STATE_ID")
data[oneLessField].head()