Frictionless Tools

As of January 2024, the Data Team is considering a new standard for machine-readable metadata, namely TableSchema. TableSchema is a schema for tabular formats that includes many of the features of Avro (see above) plus rich types and constraints. TableSchema is supported in Python and R, and the libraries include many utility functions.

morpc.frictionless is built on frictionless-py. Its functions create and load resources.

The foundation of the frictionless framework is the resource. Resources are structured JSON or YAML files that contain metadata for a file or a set of files.
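To make the idea of a resource concrete, here is a minimal sketch of a descriptor built as a plain Python mapping and serialized with the standard library. The keys match those that appear in the resource output later in this section; the values are illustrative, and JSON is used only to keep the sketch free of third-party dependencies (the resource files in this guide are YAML).

```python
import json

# Hypothetical descriptor for illustration; not produced by morpc.
resource_descriptor = {
    "name": "temp_df",
    "type": "table",
    "title": "A title for the resource",
    "path": "temp_df.csv",            # data file, relative to the descriptor
    "format": "csv",
    "mediatype": "text/csv",
    "schema": "temp_df.schema.yaml",  # schema file, relative to the descriptor
}

# Serialize to JSON text; the same mapping could equally be stored as YAML.
descriptor_text = json.dumps(resource_descriptor, indent=2)
print(descriptor_text)
```

The descriptor carries only metadata and pointers. The data itself stays in the file named by `path`.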

import pandas as pd
import morpc
import os
df = pd.read_excel('./temp_data/dataChartToExcelOutput.xlsx') ## import sample data from temp_data
df.columns = ["column1", "column2", "column3"] ## give some reasonable names to columns
df.to_csv('./temp_data/temp_df.csv', index=False) ## save a csv

Typically we define constants for the names of the data file, the resource file, and the schema file. The resource and schema are stored in YAML files.

RESOURCE_DIR = './temp_data/'
TABLE_FILE_NAME = 'temp_df.csv'
TABLE_RESOURCE_NAME = TABLE_FILE_NAME.replace('.csv', '.resource.yaml')
TABLE_SCHEMA_NAME = TABLE_FILE_NAME.replace('.csv', '.schema.yaml')

A schema can be defined manually or created via standard frictionless functions.

import frictionless
frictionless.Schema.describe(os.path.join(RESOURCE_DIR, TABLE_FILE_NAME)).to_yaml(os.path.join(RESOURCE_DIR, TABLE_SCHEMA_NAME)) ## Create a default schema and save as a yaml
'fields:\n - name: column1\n type: integer\n - name: column2\n type: integer\n - name: column3\n type: integer\n'
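The describe call above infers the schema automatically. For comparison, a manually defined schema might look like the following sketch, written as a plain mapping in the Table Schema layout. The `constraints` entries are assumptions added for illustration and are not derived from the sample data; JSON output is used only to avoid third-party dependencies.

```python
import json

# Hand-written Table Schema descriptor for the three sample columns.
# The constraints are illustrative assumptions, not taken from the data.
manual_schema = {
    "fields": [
        {"name": "column1", "type": "integer", "constraints": {"required": True}},
        {"name": "column2", "type": "integer", "constraints": {"minimum": 0}},
        {"name": "column3", "type": "integer"},
    ]
}

schema_text = json.dumps(manual_schema, indent=2)
print(schema_text)
```

Writing the schema by hand is mostly useful when the inferred types are too loose or when constraints need to be enforced during validation.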

Create a resource

morpc.frictionless.create_resource(TABLE_FILE_NAME, # the filename relative to resource dir, often just filename
                                   resourcePath=os.path.join(RESOURCE_DIR, TABLE_RESOURCE_NAME), # file path to resource location
                                   schemaPath=TABLE_SCHEMA_NAME, # path of schema relative to resource dir
                                   name = "temp_df", # simple name
                                   title = "A title for the resource", # A human readable title
                                   description = "A description of the resource to explain what it contains.", # A full description
                                   writeResource = True, # Boolean - whether to write the resource file to disk
                                   resFormat = "csv",
                                   resMediaType= "text/csv",  
                                   computeBytes= True, # Compute the size of the file in bytes
                                   computeHash = True, # Create an MD5 hash of the file, a fingerprint string used to check whether the file has changed.
                                   validate=True # Validate the resource after creating
                                  )
morpc.create_resource | INFO | Writing Frictionless Resource file to temp_data\temp_df.resource.yaml
morpc.create_resource | INFO | Validating resource on disk.
morpc.validate_resource | INFO | Validating resource on disk (including data and schema). This may take some time.
morpc.validate_resource | INFO | Resource is valid
{'name': 'temp_df', 'type': 'table', 'title': 'A title for the resource', 'description': 'A description of the resource to explain what it contains.', 'profile': 'data-resource', 'path': 'temp_df.csv', 'scheme': 'file', 'format': 'csv', 'mediatype': 'text/csv', 'hash': '3f0fe472ad7bf42606eba5184f838dab', 'bytes': 53, 'schema': 'temp_df.schema.yaml'}
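The `hash` and `bytes` entries in the descriptor can be checked independently of morpc. The sketch below assumes the hash is a plain MD5 digest of the raw file bytes, which the comments above suggest but the source does not confirm. `file_fingerprint` is a hypothetical helper, not part of morpc or frictionless.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path, chunk_size=65536):
    """Return (md5_hex, size_in_bytes) for the file at path."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


# Demonstrate on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "wb") as handle:
        handle.write(b"hello")
    md5_hex, size = file_fingerprint(sample_path)

print(md5_hex, size)
```

Comparing these two values against the stored descriptor is a cheap way to detect that a data file has changed since the resource was created.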

Load data from a resource file. This returns the data, the resource, and the schema.

data, resource, schema = morpc.frictionless.load_data(os.path.join(RESOURCE_DIR, TABLE_RESOURCE_NAME))
morpc.load_data | INFO | Loading Frictionless Resource file at location temp_data\temp_df.resource.yaml
morpc.load_data | INFO | Loading data, resource file, and schema from their source locations
morpc.load_data | INFO | --> Data file: temp_data\temp_df.csv
morpc.load_data | INFO | --> Resource file: temp_data\temp_df.resource.yaml
morpc.load_data | INFO | --> Schema file: temp_data\temp_df.schema.yaml
morpc.load_data | INFO | Loading data.
cast_field_types | INFO | Casting field column1 as type integer.
cast_field_types | INFO | Casting field column2 as type integer.
cast_field_types | INFO | Casting field column3 as type integer.
data
resource
{'name': 'temp_df', 'type': 'table', 'title': 'A title for the resource', 'description': 'A description of the resource to explain what it contains.', 'profile': 'data-resource', 'path': 'temp_df.csv', 'scheme': 'file', 'format': 'csv', 'mediatype': 'text/csv', 'hash': '3f0fe472ad7bf42606eba5184f838dab', 'bytes': 53, 'schema': 'temp_df.schema.yaml'}
schema
{'fields': [{'name': 'column1', 'type': 'integer'}, {'name': 'column2', 'type': 'integer'}, {'name': 'column3', 'type': 'integer'}]}
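The `cast_field_types` log lines above indicate that each column is cast to the type declared in the schema. The following is a standard-library sketch of that idea only; it is not morpc's implementation. `CASTERS`, `cast_rows`, and `example_schema` are hypothetical names, and only a small subset of Table Schema types is handled.

```python
import csv
import io

# Map Table Schema type names to Python casting functions (a small subset).
CASTERS = {"integer": int, "number": float, "string": str}


def cast_rows(rows, schema):
    """Cast each value in rows (a list of dicts) to its declared field type."""
    field_types = {field["name"]: field["type"] for field in schema["fields"]}
    return [
        {name: CASTERS[field_types[name]](value) for name, value in row.items()}
        for row in rows
    ]


example_schema = {
    "fields": [
        {"name": "column1", "type": "integer"},
        {"name": "column2", "type": "integer"},
        {"name": "column3", "type": "integer"},
    ]
}

# csv.DictReader yields every value as a string; the schema supplies the types.
csv_text = "column1,column2,column3\n1,2,3\n4,5,6\n"
raw_rows = list(csv.DictReader(io.StringIO(csv_text)))
typed_rows = cast_rows(raw_rows, example_schema)
print(typed_rows)
```

Keeping the types in the schema rather than in the loading code means the same loader can serve any table whose schema file is present.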