GeoData Module
The GeoData module provides tools for managing and preprocessing geographic data, including administrative boundaries and census data. Its functions read, filter, and convert data into formats suitable for geographic analysis, such as GeoDataFrame and GeoPackage.
preprocess_geodata(census_shp_folder, census_target_columns, census_tipo_loc_mapping, output_folder, census_layer_name, census_column_remapping=None, regions_file_path=None, regions_target_columns=None, regions_index_column=None, regions_column_remapping=None, provinces_file_path=None, provinces_target_columns=None, provinces_index_column=None, provinces_column_remapping=None, municipalities_file_path=None, municipalities_target_columns=None, municipalities_index_column=None, municipalities_column_remapping=None, municipalities_code=[])
Preprocess census geodata and administrative boundaries and save to GeoPackage.
This function executes the complete workflow for preparing geographic data for a census year, combining:
- Reading and normalizing administrative boundaries (regions, provinces, municipalities).
- Optionally correcting missing fields (e.g., `COD_PROV` for 2021).
- Reading and preparing census data (sections) from shapefiles.
- Joining sections with municipalities, provinces, and regions.
- Optionally filtering for a subset of municipalities (`municipalities_code`).
- Saving the final result to a GeoPackage.
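The join and filter steps listed above can be sketched with plain pandas on toy tables. This is only an illustration of the idea, not the module's implementation: the helper `join_admin_levels` and the column names (`SEZ`, `PRO_COM`, `COD_PROV`, `COD_REG`, `COMUNE`, `PROVINCIA`, `REGIONE`) are assumptions chosen for the example.

```python
import pandas as pd


def join_admin_levels(sections, municipalities, provinces, regions,
                      municipalities_code=None):
    """Attach municipality, province, and region attributes to census sections.

    Hypothetical helper: key column names are assumed, not taken from the module.
    """
    out = (
        sections
        .merge(municipalities, on="PRO_COM", how="left")
        .merge(provinces, on="COD_PROV", how="left")
        .merge(regions, on="COD_REG", how="left")
    )
    # Optional filter: keep only the requested municipalities.
    if municipalities_code:
        out = out[out["PRO_COM"].isin(municipalities_code)]
    return out.reset_index(drop=True)


# Toy data (three sections in two municipalities).
sections = pd.DataFrame({"SEZ": [1, 2, 3], "PRO_COM": [1272, 1272, 15146]})
municipalities = pd.DataFrame(
    {"PRO_COM": [1272, 15146], "COMUNE": ["Torino", "Milano"], "COD_PROV": [1, 15]}
)
provinces = pd.DataFrame(
    {"COD_PROV": [1, 15], "PROVINCIA": ["Torino", "Milano"], "COD_REG": [1, 3]}
)
regions = pd.DataFrame({"COD_REG": [1, 3], "REGIONE": ["Piemonte", "Lombardia"]})

enriched = join_admin_levels(
    sections, municipalities, provinces, regions, municipalities_code=[1272]
)
```

Left joins keep every section even when an administrative record is missing, which makes gaps visible as empty values instead of silently dropping rows.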
| PARAMETER | DESCRIPTION |
|---|---|
| `census_shp_folder` | Folder containing census shapefiles (sections). |
| `census_target_columns` | Columns to select from census data (sections). |
| `census_tipo_loc_mapping` | Mapping of locality type codes, used to derive the locality type description. |
| `output_folder` | Folder where the resulting GeoPackage will be saved. |
| `census_layer_name` | Name of the census layer (e.g., `census2011`). |
| `census_column_remapping` | Optional mapping to rename census data columns. |
| `regions_file_path` | Optional path to the regional boundaries vector file. |
| `regions_target_columns` | Optional columns to select from regional data. |
| `regions_index_column` | Optional column to use as index for regional data. |
| `regions_column_remapping` | Optional mapping to rename regional data columns. |
| `provinces_file_path` | Optional path to the provincial boundaries vector file. |
| `provinces_target_columns` | Optional columns to select from provincial data. |
| `provinces_index_column` | Optional column to use as index for provincial data. |
| `provinces_column_remapping` | Optional mapping to rename provincial data columns. |
| `municipalities_file_path` | Optional path to the municipal boundaries vector file. |
| `municipalities_target_columns` | Optional columns to select from municipal data. |
| `municipalities_index_column` | Optional column to use as index for municipal data. |
| `municipalities_column_remapping` | Optional mapping to rename municipal data columns. |
| `municipalities_code` | Optional list of ISTAT municipality codes used to restrict the output to a subset of municipalities. |
| RETURNS | DESCRIPTION |
|---|---|
| `Path` | Path to the generated GeoPackage containing the census layer enriched with administrative information. |
Note
The census year is derived from the layer name census_layer_name[6:]
(e.g., census2011 → 2011). For 2021, the COD_PROV column is manually
reconstructed from PRO_COM_T (see repository issue #47). The GeoPackage
is saved as {YEAR_GEODATA_NAME}.gpkg and the layer as
{YEAR_GEODATA_NAME}{census_year}.
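The naming rules in this note can be sketched as follows. The value of `YEAR_GEODATA_NAME` is not shown in this page, so a placeholder is used; the helper names are hypothetical. The `COD_PROV` helper assumes that the first three characters of `PRO_COM_T` encode the province, as in ISTAT's six-character municipality codes; the actual fix is the one described in issue #47 and may differ.

```python
YEAR_GEODATA_NAME = "geodata"  # placeholder: the module's real constant value is not shown


def census_year_from_layer(census_layer_name: str) -> int:
    """Drop the 6-character 'census' prefix and parse the rest as the year."""
    return int(census_layer_name[6:])


def output_names(census_layer_name: str) -> tuple[str, str]:
    """Return (GeoPackage file name, layer name) following the note above."""
    census_year = census_year_from_layer(census_layer_name)
    return f"{YEAR_GEODATA_NAME}.gpkg", f"{YEAR_GEODATA_NAME}{census_year}"


def cod_prov_from_pro_com_t(pro_com_t: str) -> int:
    """Assumption: the first three characters of PRO_COM_T are the province code."""
    return int(pro_com_t[:3])


year = census_year_from_layer("census2011")
gpkg_name, layer_name = output_names("census2021")
province = cod_prov_from_pro_com_t("015146")
```

Because the year is taken by position, a layer name that does not start with a six-character prefix followed by digits would fail in `int()`; validating the name up front is one way to get a clearer error.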
read_administrative_boundaries(file_path, target_columns, index_column, column_remapping=None, output_folder=None, layer_name=None)
Read administrative boundaries and return a DataFrame or GeoPackage.
This function reads an administrative boundary file (typically a shapefile), selects a subset of columns, and sets a column as the index. Depending on the provided parameters, it can:
- Return a DataFrame without geometry, sorted and indexed; or
- Save the data as a layer in a GeoPackage, preserving the geometry.
The encoding is derived from the .dbf file associated with the shapefile to avoid issues with accented characters or special symbols.
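How the module inspects the .dbf file is not described here. One common approach, shown below as a sketch under that assumption, is to read the language driver ID stored at byte offset 29 of the dBASE header and map it to a codec. The function name `dbf_encoding`, the partial lookup table, and the `utf-8` fallback are all illustrative choices.

```python
import tempfile
from pathlib import Path

# Partial, illustrative map from dBASE language driver IDs to Python codecs.
LDID_TO_CODEC = {
    0x01: "cp437",   # US MS-DOS
    0x02: "cp850",   # International MS-DOS
    0x03: "cp1252",  # Windows ANSI
    0x57: "cp1252",  # ANSI
    0xC8: "cp1250",  # Windows Eastern European
}


def dbf_encoding(shp_path, default="utf-8"):
    """Guess the attribute encoding from the .dbf that accompanies a shapefile."""
    dbf_path = Path(shp_path).with_suffix(".dbf")
    with open(dbf_path, "rb") as fh:
        header = fh.read(32)  # fixed-size part of the dBASE header
    if len(header) < 32:
        return default
    return LDID_TO_CODEC.get(header[29], default)


# Demo with a synthetic 32-byte header whose language driver ID is 0x57.
with tempfile.TemporaryDirectory() as tmp:
    fake_header = bytearray(32)
    fake_header[29] = 0x57
    (Path(tmp) / "regions.dbf").write_bytes(fake_header)
    detected = dbf_encoding(Path(tmp) / "regions.shp")
```

The detected codec can then be passed to the reader so that accented place names decode correctly.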
| PARAMETER | DESCRIPTION |
|---|---|
| `file_path` | Path to the vector file (e.g., shapefile) containing administrative boundaries. |
| `target_columns` | List of columns to select from the source dataset. The geometry column is added automatically. |
| `index_column` | Name of the column to use as the DataFrame index (e.g., ISTAT code). |
| `column_remapping` | Optional dictionary to rename columns. |
| `output_folder` | Optional output folder where the GeoPackage will be saved. If None, the function returns a DataFrame (without geometry) instead of writing to disk. |
| `layer_name` | Optional name of the layer to use within the GeoPackage. Must be specified if `output_folder` is provided. |
| RETURNS | DESCRIPTION |
|---|---|
| `DataFrame \| Path` | Either an indexed and sorted DataFrame without the geometry column if `output_folder` is None, or the path to the GeoPackage containing the saved layer otherwise. |
Note
The geometry column is automatically added to target_columns via the
GEOMETRY_COLUMN_NAME constant. The GeoPackage is saved with a name
based on the YEAR_GEODATA_NAME constant and contains the layer
specified by layer_name.
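The column handling described in this note can be illustrated with pandas alone. This is a sketch, not the module's code: the value of `GEOMETRY_COLUMN_NAME`, the helper `select_and_index`, and the sample column names are assumptions, and the geometry values are stand-in strings.

```python
import pandas as pd

GEOMETRY_COLUMN_NAME = "geometry"  # assumed value of the module constant


def select_and_index(df, target_columns, index_column, keep_geometry=False):
    """Select columns (plus geometry), then index and sort the result."""
    # The geometry column is appended to the caller's selection automatically.
    columns = list(target_columns) + [GEOMETRY_COLUMN_NAME]
    out = df[columns]
    # The in-memory return value carries no geometry; the GeoPackage layer does.
    if not keep_geometry:
        out = out.drop(columns=GEOMETRY_COLUMN_NAME)
    return out.set_index(index_column).sort_index()


raw = pd.DataFrame(
    {
        "COD_REG": [3, 1],
        "DEN_REG": ["Lombardia", "Piemonte"],
        "SHAPE_LEN": [10.0, 20.0],            # not selected, so it is dropped
        "geometry": ["<polygon>", "<polygon>"],  # stand-in for real geometries
    }
)

table = select_and_index(raw, ["COD_REG", "DEN_REG"], index_column="COD_REG")
```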
read_census(shp_folder, target_columns, tipo_loc_mapping, column_remapping=None, output_folder=None, layer_name=None)
Read census data from shapefiles and return a GeoDataFrame or GeoPackage.
This function recursively searches for all shapefiles in a folder, reads their
data, selects a subset of columns, corrects invalid geometries, adds the
locality type description (derived from tipo_loc_mapping), and builds a
unified GeoDataFrame with all census sections.
Depending on the parameters, it can:
- Return the resulting GeoDataFrame directly; or
- Save the data as a layer in a GeoPackage (`YEAR_GEODATA_NAME.gpkg`) and return the path to the created file.
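The recursive shapefile search can be sketched with `pathlib`. The helper name `find_shapefiles` and the error message are hypothetical; only the behavior (recursive search, `ValueError` when nothing is found) follows the documentation of this function.

```python
import tempfile
from pathlib import Path


def find_shapefiles(shp_folder):
    """Return all .shp files under shp_folder, searching subfolders too."""
    paths = sorted(Path(shp_folder).rglob("*.shp"))
    if not paths:
        raise ValueError(f"No shapefile found in {shp_folder}")
    return paths


# Demo on a temporary folder tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "R01").mkdir()
    (root / "R03" / "nested").mkdir(parents=True)
    (root / "R01" / "R01_11.shp").touch()
    (root / "R01" / "R01_11.dbf").touch()
    (root / "R03" / "nested" / "R03_11.shp").touch()
    found_names = [p.name for p in find_shapefiles(root)]

# An empty folder triggers the documented ValueError.
with tempfile.TemporaryDirectory() as empty:
    try:
        find_shapefiles(empty)
        empty_raises = False
    except ValueError:
        empty_raises = True
```

Sorting the paths keeps the order of the concatenated sections reproducible across runs and file systems.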
| PARAMETER | DESCRIPTION |
|---|---|
| `shp_folder` | Path to the folder containing census shapefiles (read recursively). |
| `target_columns` | List of columns to select from each shapefile (must include or be compatible with the geometry column). |
| `tipo_loc_mapping` | Mapping of locality type codes, used to derive the locality type description. |
| `column_remapping` | Optional dictionary to rename selected columns. |
| `output_folder` | Optional folder where the resulting GeoPackage will be saved. If None, the function does not write to disk and returns the GeoDataFrame directly. |
| `layer_name` | Optional name of the layer to use within the GeoPackage. Must be specified if `output_folder` is provided. |
| RETURNS | DESCRIPTION |
|---|---|
| `GeoDataFrame \| Path` | Either a GeoDataFrame with all census sections if `output_folder` is None, or the path to the created GeoPackage otherwise. |
| RAISES | DESCRIPTION |
|---|---|
| `ValueError` | If no shapefile is found in the specified folder. |
Note
Geometries are validated with make_valid() to reduce issues caused
by invalid polygons. An area_mq column containing the area in square
meters is calculated. The GeoDataFrame index is set to the first column
in df_columns (typically the census section code).
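The post-processing steps in this note can be sketched with GeoPandas (0.12 or later, for `GeoSeries.make_valid`). This is an illustration on two toy polygons, not the module's code: the `SEZ` column name and the projected CRS `EPSG:32632` are assumptions, chosen so that areas come out in square meters.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# A self-intersecting "bowtie" (invalid) and a valid 10 m x 10 m square.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
square = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])

sections = gpd.GeoDataFrame(
    {"SEZ": [1, 2]},              # hypothetical section code column
    geometry=[bowtie, square],
    crs="EPSG:32632",             # projected CRS with meter units (assumed)
)

was_valid = sections.geometry.is_valid.tolist()

# Repair invalid geometries, then compute the area in CRS units (m^2 here).
sections = sections.set_geometry(sections.geometry.make_valid())
sections["area_mq"] = sections.geometry.area

# Use the section code as the index.
sections = sections.set_index("SEZ")
```

Area values are only meaningful as square meters when the data is in a projected CRS with meter units; data in geographic coordinates would need reprojecting first.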