As of January 2024, the Data Team is considering a new standard for machine-readable metadata: TableSchema. TableSchema is a schema for tabular formats that includes many of the features of Avro (see above) plus rich types and constraints. TableSchema is supported in Python and R, and the libraries include many utility functions.
morpc.frictionless is built on frictionless-py. Its functions create and load resources.
The foundation of the frictionless framework is the resource. Resources are structured JSON or YAML files that hold metadata for a file or a set of files.
import pandas as pd
import morpc
import os
df = pd.read_excel('./temp_data/dataChartToExcelOutput.xlsx') ## import sample data from temp_data
df.columns = ["column1", "column2", "column3"] ## give some reasonable names to columns
df.to_csv('./temp_data/temp_df.csv', index=False) ## save a csv
Typically we define constants for the names of the data file, the resource, and the schema. The resource and schema are stored in YAML files.
RESOURCE_DIR = './temp_data/'
TABLE_FILE_NAME = 'temp_df.csv'
TABLE_RESOURCE_NAME = TABLE_FILE_NAME.replace('.csv', '.resource.yaml')
TABLE_SCHEMA_NAME = TABLE_FILE_NAME.replace('.csv', '.schema.yaml')
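The .replace('.csv', ...) pattern above works for simple file names. As a minimal sketch, assuming the same .resource.yaml and .schema.yaml suffix convention, the names could also be derived from the file stem so that only the final extension is swapped. The helper name sidecar_names is hypothetical and is not part of morpc.

```python
import os

def sidecar_names(table_file_name):
    """Return (resource_name, schema_name) for a data file name.

    Hypothetical helper, not part of morpc: it strips only the final
    extension, so a '.csv' elsewhere in the name is left untouched.
    """
    stem, _ext = os.path.splitext(table_file_name)
    return stem + ".resource.yaml", stem + ".schema.yaml"

TABLE_FILE_NAME = "temp_df.csv"
TABLE_RESOURCE_NAME, TABLE_SCHEMA_NAME = sidecar_names(TABLE_FILE_NAME)
print(TABLE_RESOURCE_NAME)
print(TABLE_SCHEMA_NAME)
```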
A schema can be defined manually or created with standard frictionless functions.
import frictionless
frictionless.Schema.describe(os.path.join(RESOURCE_DIR, TABLE_FILE_NAME)).to_yaml(os.path.join(RESOURCE_DIR, TABLE_SCHEMA_NAME)) ## Create a default schema and save as a yaml
'fields:\n - name: column1\n type: integer\n - name: column2\n type: integer\n - name: column3\n type: integer\n'
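For the manual route, a Table Schema descriptor is a mapping with a fields list, where each field has a name, a type, and optional constraints. The sketch below builds one as a plain Python dictionary and writes it as JSON using only the standard library. The constraint values and the temp_df.schema.json file name are illustrative assumptions, and the workflow in this document stores schemas as YAML instead.

```python
import json
import os
import tempfile

# Hand-written Table Schema descriptor for the three sample columns.
# The constraints are illustrative assumptions, not derived from the data.
manual_schema = {
    "fields": [
        {"name": "column1", "type": "integer", "constraints": {"required": True}},
        {"name": "column2", "type": "integer", "constraints": {"minimum": 0}},
        {"name": "column3", "type": "integer"},
    ]
}

# Write to a temporary directory so the sketch does not touch project files.
schema_path = os.path.join(tempfile.mkdtemp(), "temp_df.schema.json")
with open(schema_path, "w", encoding="utf-8") as f:
    json.dump(manual_schema, f, indent=2)

# Read it back to confirm the descriptor round-trips unchanged.
with open(schema_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print([field["name"] for field in reloaded["fields"]])
```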
Create a resource
morpc.frictionless.create_resource(TABLE_FILE_NAME, # the filename relative to resource dir, often just filename
resourcePath=os.path.join(RESOURCE_DIR, TABLE_RESOURCE_NAME), # file path to resource location
schemaPath=TABLE_SCHEMA_NAME, # path of schema relative to resource dir
name = "temp_df", # simple name
title = "A title for the resource", # A human readable title
description = "A description of the resource to explain what it contains.", # A full description
writeResource = True, # Boolean - whether to write the resource file to disk
resFormat = "csv", # File format of the data
resMediaType= "text/csv", # Media (MIME) type of the data
computeBytes= True, # Compute the size of the file in bytes
computeHash = True, # Create an MD5 hash of the file, a string used to check whether the file has changed
validate=True # Validate the resource after creating it
)
morpc.create_resource | INFO | Writing Frictionless Resource file to temp_data\temp_df.resource.yaml
morpc.create_resource | INFO | Validating resource on disk.
morpc.validate_resource | INFO | Validating resource on disk (including data and schema). This may take some time.
morpc.validate_resource | INFO | Resource is valid
{'name': 'temp_df',
'type': 'table',
'title': 'A title for the resource',
'description': 'A description of the resource to explain what it contains.',
'profile': 'data-resource',
'path': 'temp_df.csv',
'scheme': 'file',
'format': 'csv',
'mediatype': 'text/csv',
'hash': '3f0fe472ad7bf42606eba5184f838dab',
'bytes': 53,
'schema': 'temp_df.schema.yaml'}
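The hash and bytes entries let a later reader detect whether the data file has changed since the resource was written. A minimal sketch of that check, using only the standard library, follows. The function name file_md5_and_size and the demo file contents are hypothetical, and this is not necessarily how morpc or frictionless compute these values internally.

```python
import hashlib
import os
import tempfile

def file_md5_and_size(path, chunk_size=65536):
    """Return (MD5 hex digest, size in bytes) for the file at path."""
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size

# Demo on a throwaway file: the digest changes once the file is edited.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as f:
        f.write(b"column1,column2,column3\n1,2,3\n")
    md5_before, size_before = file_md5_and_size(demo_path)

    with open(demo_path, "ab") as f:
        f.write(b"4,5,6\n")
    md5_after, size_after = file_md5_and_size(demo_path)

print(md5_before != md5_after)
```

Comparing the recomputed values with those stored in the resource file indicates whether the data file still matches its metadata.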
Load data from a resource file
morpc.frictionless.load_data returns the data, the resource, and the schema.
data, resource, schema = morpc.frictionless.load_data(os.path.join(RESOURCE_DIR, TABLE_RESOURCE_NAME))
morpc.load_data | INFO | Loading Frictionless Resource file at location temp_data\temp_df.resource.yaml
morpc.load_data | INFO | Loading data, resource file, and schema from their source locations
morpc.load_data | INFO | --> Data file: temp_data\temp_df.csv
morpc.load_data | INFO | --> Resource file: temp_data\temp_df.resource.yaml
morpc.load_data | INFO | --> Schema file: temp_data\temp_df.schema.yaml
morpc.load_data | INFO | Loading data.
cast_field_types | INFO | Casting field column1 as type integer.
cast_field_types | INFO | Casting field column2 as type integer.
cast_field_types | INFO | Casting field column3 as type integer.
data
resource
{'name': 'temp_df',
'type': 'table',
'title': 'A title for the resource',
'description': 'A description of the resource to explain what it contains.',
'profile': 'data-resource',
'path': 'temp_df.csv',
'scheme': 'file',
'format': 'csv',
'mediatype': 'text/csv',
'hash': '3f0fe472ad7bf42606eba5184f838dab',
'bytes': 53,
'schema': 'temp_df.schema.yaml'}
schema
{'fields': [{'name': 'column1', 'type': 'integer'},
{'name': 'column2', 'type': 'integer'},
{'name': 'column3', 'type': 'integer'}]}
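The cast_field_types log lines above show each column being cast to the type named in the schema. The sketch below illustrates the idea with the standard library's csv module on an in-memory sample. The CASTERS mapping, the cast_rows function, and the sample rows are hypothetical and cover only a few Table Schema types, so this is not morpc's implementation.

```python
import csv
import io

# Assumed mapping from a few Table Schema type names to Python casters.
CASTERS = {"integer": int, "number": float, "string": str}

def cast_rows(csv_text, schema):
    """Parse CSV text and cast each value using the field types in schema."""
    casters = {f["name"]: CASTERS[f["type"]] for f in schema["fields"]}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {name: casters[name](value) for name, value in row.items()}
        for row in reader
    ]

schema = {"fields": [{"name": "column1", "type": "integer"},
                     {"name": "column2", "type": "integer"},
                     {"name": "column3", "type": "integer"}]}
sample = "column1,column2,column3\n1,2,3\n4,5,6\n"  # made-up sample rows
rows = cast_rows(sample, schema)
print(rows[0])
```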