GeoData Module

The GeoData module provides tools for managing and preprocessing geographic data, including administrative boundaries and census data. Its functions read, filter, and convert data into formats suitable for geographic analysis, such as GeoDataFrame and GeoPackage.

preprocess_geodata(census_shp_folder, census_target_columns, census_tipo_loc_mapping, output_folder, census_layer_name, census_column_remapping=None, regions_file_path=None, regions_target_columns=None, regions_index_column=None, regions_column_remapping=None, provinces_file_path=None, provinces_target_columns=None, provinces_index_column=None, provinces_column_remapping=None, municipalities_file_path=None, municipalities_target_columns=None, municipalities_index_column=None, municipalities_column_remapping=None, municipalities_code=[])

Preprocess census geodata and administrative boundaries and save to GeoPackage.

This function executes the complete workflow for preparing geographic data for a census year, combining:

  1. Reading and normalizing administrative boundaries (regions, provinces, municipalities).
  2. Optionally correcting missing fields (e.g., COD_PROV for 2021).
  3. Reading and preparing census data (sections) from shapefiles.
  4. Joining sections with municipalities, provinces, and regions.
  5. Optionally filtering for a subset of municipalities (municipalities_code).
  6. Saving the final result to a GeoPackage.
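The join in step 4 can be pictured as a chain of key lookups from each section up through the administrative hierarchy. The following is a minimal sketch in plain Python, not the module's implementation: the lookup tables, the column names (SEZ, COMUNE, DEN_PROV, COD_REG) and the sample values are assumptions chosen for illustration, and the real function presumably works on GeoDataFrames.

```python
# Hypothetical lookup tables keyed by ISTAT codes (illustrative values only).
regions = {3: {"DEN_REG": "Lombardia"}}
provinces = {15: {"DEN_PROV": "Milano", "COD_REG": 3}}
municipalities = {15146: {"COMUNE": "Milano", "COD_PROV": 15}}

# Hypothetical census sections; each one only knows its municipality code.
sections = [
    {"SEZ": 151460000001, "PRO_COM": 15146},
    {"SEZ": 151460000002, "PRO_COM": 15146},
]


def enrich_section(section: dict) -> dict:
    """Attach municipality, province and region attributes to one section."""
    municipality = municipalities[section["PRO_COM"]]
    province = provinces[municipality["COD_PROV"]]
    region = regions[province["COD_REG"]]
    return {
        **section,
        "COMUNE": municipality["COMUNE"],
        "COD_PROV": municipality["COD_PROV"],
        "DEN_PROV": province["DEN_PROV"],
        "COD_REG": province["COD_REG"],
        "DEN_REG": region["DEN_REG"],
    }


enriched = [enrich_section(section) for section in sections]
print(enriched[0])
```

Each section ends up carrying the names and codes of the municipality, province, and region it belongs to, which is the shape of the layer written in step 6.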
PARAMETER DESCRIPTION
census_shp_folder

Folder containing census shapefiles (sections).

TYPE: Path

census_target_columns

Columns to select from census data (sections).

TYPE: list

census_tipo_loc_mapping

Mapping for the TIPO_LOC field to derive the locality type description.

TYPE: dict

output_folder

Folder where the resulting GeoPackage will be saved.

TYPE: Path

census_layer_name

Name of the census layer (e.g., census2011). The census year is derived from its suffix.

TYPE: str

census_column_remapping

Optional mapping to rename census data columns.

TYPE: dict | None DEFAULT: None

regions_file_path

Optional path to the regional boundaries vector file.

TYPE: Path | None DEFAULT: None

regions_target_columns

Optional columns to select from regional data.

TYPE: list | None DEFAULT: None

regions_index_column

Optional column to use as index for regional data.

TYPE: str | None DEFAULT: None

regions_column_remapping

Optional mapping to rename regional data columns.

TYPE: dict | None DEFAULT: None

provinces_file_path

Optional path to the provincial boundaries vector file.

TYPE: Path | None DEFAULT: None

provinces_target_columns

Optional columns to select from provincial data.

TYPE: list | None DEFAULT: None

provinces_index_column

Optional column to use as index for provincial data.

TYPE: str | None DEFAULT: None

provinces_column_remapping

Optional mapping to rename provincial data columns.

TYPE: dict | None DEFAULT: None

municipalities_file_path

Optional path to the municipal boundaries vector file.

TYPE: Path | None DEFAULT: None

municipalities_target_columns

Optional columns to select from municipal data.

TYPE: list | None DEFAULT: None

municipalities_index_column

Optional column to use as index for municipal data.

TYPE: str | None DEFAULT: None

municipalities_column_remapping

Optional mapping to rename municipal data columns.

TYPE: dict | None DEFAULT: None

municipalities_code

Optional list of ISTAT municipality codes (PRO_COM field) to extract. If empty, all municipalities are kept.

TYPE: list[int] DEFAULT: []
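The "empty list keeps everything" behavior of municipalities_code can be sketched as below. This is an illustrative stand-in operating on plain dictionaries, with a hypothetical helper name (filter_municipalities); the module itself presumably filters a GeoDataFrame on the PRO_COM column.

```python
def filter_municipalities(rows, municipalities_code=None):
    """Keep rows whose PRO_COM is listed; keep every row if the list is empty."""
    if not municipalities_code:
        return list(rows)
    wanted = set(municipalities_code)
    return [row for row in rows if row["PRO_COM"] in wanted]


# Hypothetical rows; PRO_COM holds the ISTAT municipality code.
rows = [
    {"PRO_COM": 15146, "COMUNE": "Milano"},
    {"PRO_COM": 58091, "COMUNE": "Roma"},
    {"PRO_COM": 1272, "COMUNE": "Torino"},
]

print(filter_municipalities(rows, [15146, 1272]))
print(len(filter_municipalities(rows, [])))
```

The sketch defaults to None rather than the [] shown in the signature only to avoid sharing a mutable default between calls; an empty list behaves the same way.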

RETURNS DESCRIPTION
Path

Path to the generated GeoPackage containing the census layer enriched with administrative information.

Note

The census year is derived from the layer name census_layer_name[6:] (e.g., census2011 → 2011). For 2021, the COD_PROV column is manually reconstructed from PRO_COM_T (see repository issue #47). The GeoPackage is saved as {YEAR_GEODATA_NAME}.gpkg and the layer as {YEAR_GEODATA_NAME}{census_year}.
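Both derivations in this note are short string operations. The sketch below shows the year slice exactly as documented, plus one plausible way to rebuild COD_PROV; it assumes PRO_COM_T is the zero-padded six-character municipality code whose first three digits identify the province. The helper names are hypothetical, and the module's actual fix for issue #47 may differ.

```python
def census_year_from_layer(census_layer_name: str) -> str:
    """Return everything after the first six characters (the length of "census")."""
    return census_layer_name[6:]


def cod_prov_from_pro_com_t(pro_com_t: str) -> int:
    """Assumption: the first three characters of PRO_COM_T are the province code."""
    return int(pro_com_t[:3])


print(census_year_from_layer("census2011"))  # prints 2011
print(cod_prov_from_pro_com_t("015146"))     # prints 15
```

Because the slice is positional, a layer name that does not start with a six-character prefix would yield a wrong year.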

read_administrative_boundaries(file_path, target_columns, index_column, column_remapping=None, output_folder=None, layer_name=None)

Read administrative boundaries and return a DataFrame or the path to a GeoPackage.

This function reads an administrative boundary file (typically a shapefile), selects a subset of columns, and sets a column as the index. Depending on the provided parameters, it can:

  • Return a DataFrame without geometry, sorted and indexed; or
  • Save the data as a layer in a GeoPackage, preserving the geometry.

The encoding is derived from the .dbf file associated with the shapefile to avoid issues with accented characters or special symbols.
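How the module inspects the .dbf file is not described here. One common approach, shown as a hedged sketch, reads the language driver ID stored at byte 29 of the dBASE header and maps it to a Python codec. The function name, the partial mapping table, and the utf-8 fallback are all assumptions made for this example.

```python
import tempfile
from pathlib import Path

# Partial, illustrative mapping from dBASE language driver IDs to Python codecs.
LDID_TO_CODEC = {
    0x01: "cp437",   # DOS USA
    0x02: "cp850",   # DOS multilingual
    0x03: "cp1252",  # Windows ANSI
    0x57: "cp1252",  # Windows ANSI (alternative ID)
}


def guess_dbf_encoding(dbf_path: Path, default: str = "utf-8") -> str:
    """Guess the attribute encoding from the language driver byte of a .dbf header."""
    with Path(dbf_path).open("rb") as handle:
        header = handle.read(32)
    if len(header) < 32:
        # Truncated or empty file: nothing to inspect, so use the fallback.
        return default
    return LDID_TO_CODEC.get(header[29], default)


# Demonstration on a synthetic 32-byte header, not a real shapefile.
with tempfile.TemporaryDirectory() as tmp:
    dbf_file = Path(tmp) / "regions.dbf"
    header = bytearray(32)
    header[0] = 0x03   # dBASE III version byte
    header[29] = 0x57  # language driver ID
    dbf_file.write_bytes(bytes(header))
    print(guess_dbf_encoding(dbf_file))
```

The detected codec could then be passed as the encoding when the shapefile is read, so that accented place names decode correctly.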

PARAMETER DESCRIPTION
file_path

Path to the vector file (e.g., shapefile) containing administrative boundaries.

TYPE: Path

target_columns

List of columns to select from the source dataset. The geometry column is added automatically.

TYPE: list

index_column

Name of the column to use as the DataFrame index (e.g., ISTAT code).

TYPE: str

column_remapping

Optional dictionary to rename columns (e.g., {"DEN_REG": "REGIONE"}). If None, original names are kept.

TYPE: dict | None DEFAULT: None

output_folder

Optional output folder where the GeoPackage will be saved. If None, the function returns a DataFrame (without geometry) instead of writing to disk.

TYPE: Path | None DEFAULT: None

layer_name

Optional name of the layer to use within the GeoPackage. Must be specified if output_folder is provided, to properly distinguish layers.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
DataFrame | Path

Either an indexed and sorted DataFrame without a geometry column if output_folder is None, or the path to the created GeoPackage if output_folder is provided.

Note

The geometry column is automatically added to target_columns via the GEOMETRY_COLUMN_NAME constant. The GeoPackage is saved with a name based on the YEAR_GEODATA_NAME constant and contains the layer specified by layer_name.

read_census(shp_folder, target_columns, tipo_loc_mapping, column_remapping=None, output_folder=None, layer_name=None)

Read census data from shapefiles and return a GeoDataFrame or the path to a GeoPackage.

This function recursively searches for all shapefiles in a folder, reads their data, selects a subset of columns, corrects invalid geometries, adds the locality type description (derived from tipo_loc_mapping), and builds a unified GeoDataFrame with all census sections.
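The locality description step amounts to a dictionary lookup on TIPO_LOC. The following sketch uses plain dictionaries instead of a GeoDataFrame; the helper name (add_den_loc), the SEZ key, and the choice to leave unmapped codes as None are assumptions, and the mapping holds only the two entries quoted in the tipo_loc_mapping description.

```python
# Only the two entries quoted in the parameter description; the full mapping is longer.
TIPO_LOC_MAPPING = {1: "Centro abitato", 2: "Nucleo"}


def add_den_loc(rows, tipo_loc_mapping):
    """Return copies of rows with a DEN_LOC description looked up from TIPO_LOC."""
    return [
        {**row, "DEN_LOC": tipo_loc_mapping.get(row["TIPO_LOC"])}
        for row in rows
    ]


# Hypothetical sections; code 9 is deliberately absent from the mapping.
sections = [
    {"SEZ": 1, "TIPO_LOC": 1},
    {"SEZ": 2, "TIPO_LOC": 2},
    {"SEZ": 3, "TIPO_LOC": 9},
]

labelled = add_den_loc(sections, TIPO_LOC_MAPPING)
for row in labelled:
    print(row)
```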

Depending on the parameters, it can:

  • Return the resulting GeoDataFrame directly; or
  • Save the data as a layer in a GeoPackage (YEAR_GEODATA_NAME.gpkg) and return the path to the created file.
PARAMETER DESCRIPTION
shp_folder

Path to the folder containing census shapefiles (recursive reading via rglob("*.shp")).

TYPE: Path

target_columns

List of columns to select from each shapefile (must include or be compatible with the geometry column).

TYPE: list

tipo_loc_mapping

Mapping of locality codes for the TIPO_LOC field (e.g., {1: "Centro abitato", 2: "Nucleo", ...}), used to create the descriptive column DEN_LOC.

TYPE: dict

column_remapping

Optional dictionary to rename selected columns (e.g., {"PRO_COM": "PRO_COMUNE"}). If None, original names are kept.

TYPE: dict | None DEFAULT: None

output_folder

Optional folder where the resulting GeoPackage will be saved. If None, the function does not write to disk and returns the GeoDataFrame directly.

TYPE: Path | None DEFAULT: None

layer_name

Optional name of the layer to use within the GeoPackage. Must be specified if output_folder is provided.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
GeoDataFrame | Path

Either a GeoDataFrame with census data and corrected geometries if output_folder is None, or the path to the created GeoPackage if output_folder is provided.

RAISES DESCRIPTION
ValueError

If no shapefile is found in the specified folder.
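The recursive discovery and the ValueError can be sketched with pathlib alone. The function name find_shapefiles and the error message are hypothetical; only the rglob("*.shp") pattern and the exception type come from the description of read_census.

```python
import tempfile
from pathlib import Path


def find_shapefiles(shp_folder: Path) -> list[Path]:
    """Return every .shp file under shp_folder, sorted for a deterministic order."""
    shapefiles = sorted(Path(shp_folder).rglob("*.shp"))
    if not shapefiles:
        raise ValueError(f"No shapefile found in {shp_folder}")
    return shapefiles


# Demonstration on a temporary folder tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "R03" / "sections").mkdir(parents=True)
    (root / "R03" / "sections" / "R03_11.shp").touch()
    (root / "readme.txt").touch()
    print([path.name for path in find_shapefiles(root)])
```

Sorting the matches is a choice made in this sketch so that repeated runs concatenate the files in the same order.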

Note

Geometries are validated with make_valid() to reduce issues caused by invalid polygons. An area_mq column containing the area in square meters is calculated. The GeoDataFrame index is set to the first column in df_columns (typically the census section code).