GeoData Module

The GeoData module provides tools for managing and preprocessing geographic data, including administrative boundaries and census data. Its functions handle reading, filtering, and converting the data into formats suited to geographic analysis, such as GeoDataFrame and GeoPackage.

preprocess_geodata(census_shp_folder, census_target_columns, census_tipo_loc_mapping, output_folder, census_layer_name, census_column_remapping=None, regions_file_path=None, regions_target_columns=None, regions_index_column=None, regions_column_remapping=None, provinces_file_path=None, provinces_target_columns=None, provinces_index_column=None, provinces_column_remapping=None, municipalities_file_path=None, municipalities_target_columns=None, municipalities_index_column=None, municipalities_column_remapping=None, municipalities_code=[])

Preprocess census geodata and administrative boundaries and save to GeoPackage.

This function executes the complete workflow for preparing geographic data for a census year, combining:

  1. Reading and normalizing administrative boundaries (regions, provinces, municipalities).
  2. Optionally correcting missing fields (e.g., COD_PROV for 2021).
  3. Reading and preparing census data (sections) from shapefiles.
  4. Joining sections with municipalities, provinces, and regions.
  5. Optionally filtering for a subset of municipalities (municipalities_code).
  6. Saving the final result to a GeoPackage.
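
Steps 4 and 5 of the workflow above can be illustrated with a minimal pandas sketch. It is not the module's implementation: the helper enrich_sections, the SEZ column, and the sample rows are hypothetical, and the join keys (PRO_COM, COD_PROV, COD_REG) are assumed to follow the usual ISTAT naming.

```python
import pandas as pd

# Illustrative rows only; "SEZ" and the helper below are hypothetical names.
sections = pd.DataFrame({"SEZ": [1, 2, 3], "PRO_COM": [1272, 1272, 15146]})
municipalities = pd.DataFrame(
    {"COMUNE": ["Torino", "Milano"], "COD_PROV": [1, 15]},
    index=pd.Index([1272, 15146], name="PRO_COM"),
)
provinces = pd.DataFrame(
    {"PROVINCIA": ["Torino", "Milano"], "COD_REG": [1, 3]},
    index=pd.Index([1, 15], name="COD_PROV"),
)
regions = pd.DataFrame(
    {"REGIONE": ["Piemonte", "Lombardia"]},
    index=pd.Index([1, 3], name="COD_REG"),
)


def enrich_sections(sections, municipalities, provinces, regions,
                    municipalities_code=None):
    """Left-join sections with each administrative level, then filter."""
    enriched = (
        sections.join(municipalities, on="PRO_COM")  # adds COMUNE, COD_PROV
        .join(provinces, on="COD_PROV")              # adds PROVINCIA, COD_REG
        .join(regions, on="COD_REG")                 # adds REGIONE
    )
    # An empty or missing list keeps every municipality.
    if municipalities_code:
        enriched = enriched[enriched["PRO_COM"].isin(municipalities_code)]
    return enriched


all_sections = enrich_sections(sections, municipalities, provinces, regions)
milan_only = enrich_sections(
    sections, municipalities, provinces, regions, municipalities_code=[15146]
)
```

Each lookup table is indexed by its code column, which is why the boundary readers accept an index column: the join can then match a section column against the index of the next administrative level.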
PARAMETER DESCRIPTION
census_shp_folder

Folder containing census shapefiles (sections).

TYPE: Path

census_target_columns

Columns to select from census data (sections).

TYPE: list

census_tipo_loc_mapping

Mapping for the TIPO_LOC field to derive the locality type description.

TYPE: dict

output_folder

Folder where the resulting GeoPackage will be saved.

TYPE: Path

census_layer_name

Name of the census layer (e.g., census2011), also used to derive the year from the suffix.

TYPE: str

census_column_remapping

Optional mapping to rename census data columns.

TYPE: dict | None DEFAULT: None

regions_file_path

Optional path to the regional boundaries vector file.

TYPE: Path | None DEFAULT: None

regions_target_columns

Optional columns to select from regional data.

TYPE: list | None DEFAULT: None

regions_index_column

Optional column to use as index for regional data.

TYPE: str | None DEFAULT: None

regions_column_remapping

Optional mapping to rename regional data columns.

TYPE: dict | None DEFAULT: None

provinces_file_path

Optional path to the provincial boundaries vector file.

TYPE: Path | None DEFAULT: None

provinces_target_columns

Optional columns to select from provincial data.

TYPE: list | None DEFAULT: None

provinces_index_column

Optional column to use as index for provincial data.

TYPE: str | None DEFAULT: None

provinces_column_remapping

Optional mapping to rename provincial data columns.

TYPE: dict | None DEFAULT: None

municipalities_file_path

Optional path to the municipal boundaries vector file.

TYPE: Path | None DEFAULT: None

municipalities_target_columns

Optional columns to select from municipal data.

TYPE: list | None DEFAULT: None

municipalities_index_column

Optional column to use as index for municipal data.

TYPE: str | None DEFAULT: None

municipalities_column_remapping

Optional mapping to rename municipal data columns.

TYPE: dict | None DEFAULT: None

municipalities_code

Optional list of ISTAT municipality codes (PRO_COM field) to extract. If empty, all municipalities are kept.

TYPE: list[int] DEFAULT: []

RETURNS DESCRIPTION
Path

Path to the generated GeoPackage containing the census layer enriched with administrative information.

Note

The census year is derived from the layer name as census_layer_name[6:] (e.g., census2011 → 2011). For 2021, the COD_PROV column is manually reconstructed from PRO_COM_T (see repository issue #47). The GeoPackage is saved as {YEAR_GEODATA_NAME}.gpkg and the layer as {YEAR_GEODATA_NAME}{census_year}.
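
The two derivations mentioned in this note can be sketched as follows. The slice [6:] comes from the note itself. The note does not say how COD_PROV is rebuilt, so the second helper assumes PRO_COM_T is the six-character, zero-padded municipality code whose first three characters are the province code. Both function names are hypothetical.

```python
def census_year_from_layer(census_layer_name: str) -> int:
    """Drop the 6-character 'census' prefix and parse the remainder."""
    return int(census_layer_name[6:])


def cod_prov_from_pro_com_t(pro_com_t: str) -> int:
    """Assumption: the first three characters of PRO_COM_T are the province code."""
    return int(pro_com_t[:3])


year = census_year_from_layer("census2011")
province = cod_prov_from_pro_com_t("015146")
```

Because the year comes from a fixed-position slice, a layer name that does not start with a six-character prefix followed by digits would fail to parse here.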

read_administrative_boundaries(file_path, target_columns, index_column, column_remapping=None, output_folder=None, layer_name=None)

Read administrative boundaries and return a DataFrame or GeoPackage.

This function reads an administrative boundary file (typically a shapefile), selects a subset of columns, and sets a column as the index. Depending on the provided parameters, it can:

  • Return a DataFrame without geometry, sorted and indexed; or
  • Save the data as a layer in a GeoPackage, preserving the geometry.

The encoding is derived from the .dbf file associated with the shapefile to avoid issues with accented characters or special symbols.
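
The documentation does not say how the encoding is read from the .dbf file. One possible approach, shown here purely as an assumption, is to inspect the language driver ID stored at byte 29 of the dBASE header and map a few common values to Python codecs. The helper name, the partial lookup table, and the UTF-8 fallback are all illustrative.

```python
import tempfile
from pathlib import Path

# Partial, illustrative mapping of dBASE language driver IDs to Python codecs.
LDID_TO_CODEC = {0x01: "cp437", 0x02: "cp850", 0x03: "cp1252", 0x57: "cp1252"}


def guess_dbf_encoding(shp_path, default="utf-8"):
    """Guess the attribute encoding from the .dbf next to a shapefile."""
    dbf_path = Path(shp_path).with_suffix(".dbf")
    with dbf_path.open("rb") as fh:
        header = fh.read(32)  # fixed-size part of the dBASE header
    if len(header) < 32:
        return default
    # Byte 29 holds the language driver ID; unknown values use the default.
    return LDID_TO_CODEC.get(header[29], default)


# Demo with a synthetic header: only byte 29 matters for this sketch.
with tempfile.TemporaryDirectory() as tmp:
    shp = Path(tmp) / "regions.shp"
    header = bytearray(32)
    header[29] = 0x03
    shp.with_suffix(".dbf").write_bytes(bytes(header))
    detected = guess_dbf_encoding(shp)

    header[29] = 0x00  # undefined driver ID
    shp.with_suffix(".dbf").write_bytes(bytes(header))
    fallback = guess_dbf_encoding(shp)
```

The detected codec could then be passed to the reader as its encoding argument, so that names such as "Forlì-Cesena" are decoded correctly.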

PARAMETER DESCRIPTION
file_path

Path to the vector file (e.g., shapefile) containing administrative boundaries.

TYPE: Path

target_columns

List of columns to select from the source dataset. The geometry column is added automatically.

TYPE: list

index_column

Name of the column to use as the DataFrame index (e.g., ISTAT code).

TYPE: str

column_remapping

Optional dictionary to rename columns (e.g., {"DEN_REG": "REGIONE"}). If None, original names are kept.

TYPE: dict | None DEFAULT: None

output_folder

Optional output folder where the GeoPackage will be saved. If None, the function returns a DataFrame (without geometry) instead of writing to disk.

TYPE: Path | None DEFAULT: None

layer_name

Optional name of the layer to use within the GeoPackage. Must be specified if output_folder is provided, to properly distinguish layers.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
DataFrame | Path

An indexed and sorted DataFrame without the geometry column if output_folder is None, or the path to the created GeoPackage if output_folder is provided.

Note

The geometry column is automatically added to target_columns via the GEOMETRY_COLUMN_NAME constant. The GeoPackage is saved with a name based on the YEAR_GEODATA_NAME constant and contains the layer specified by layer_name.
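
When output_folder is None, the function returns a table without geometry. A rough pandas sketch of that path follows. The helper name is hypothetical, "geometry" is only an assumed value for GEOMETRY_COLUMN_NAME, and plain strings stand in for real geometries. It also assumes the index column is not among the renamed columns.

```python
import pandas as pd

GEOMETRY_COLUMN_NAME = "geometry"  # assumed value of the module constant


def to_indexed_table(data, target_columns, index_column, column_remapping=None):
    """Select columns, rename, drop geometry, then index and sort."""
    # The geometry column is appended to the requested columns automatically.
    columns = list(target_columns) + [GEOMETRY_COLUMN_NAME]
    table = data[columns]
    if column_remapping is not None:
        table = table.rename(columns=column_remapping)
    return (
        table.drop(columns=GEOMETRY_COLUMN_NAME)
        .set_index(index_column)
        .sort_index()
    )


raw = pd.DataFrame(
    {
        "COD_REG": [3, 1],
        "DEN_REG": ["Lombardia", "Piemonte"],
        "EXTRA": ["x", "y"],  # not requested, so it is dropped
        "geometry": ["<polygon>", "<polygon>"],  # placeholder values
    }
)
regions = to_indexed_table(
    raw, ["COD_REG", "DEN_REG"], "COD_REG", {"DEN_REG": "REGIONE"}
)
```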

read_census(shp_folder, target_columns, tipo_loc_mapping, column_remapping=None, output_folder=None, layer_name=None)

Read census data from shapefiles and return a GeoDataFrame or GeoPackage.

This function recursively searches for all shapefiles in a folder, reads their data, selects a subset of columns, corrects invalid geometries, adds the locality type description (derived from tipo_loc_mapping), and builds a unified GeoDataFrame with all census sections.
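
Deriving the locality type description amounts to a dictionary lookup on TIPO_LOC. The sketch below assumes a pandas-style column mapping and uses only the two codes shown in this page's example. The SEZ column and the sample rows are hypothetical.

```python
import pandas as pd

# Only the two codes given as an example in this page; real mappings have more.
tipo_loc_mapping = {1: "Centro abitato", 2: "Nucleo"}

sections = pd.DataFrame({"SEZ": [1, 2, 3], "TIPO_LOC": [1, 2, 1]})

# Codes missing from the mapping would become NaN in DEN_LOC.
sections["DEN_LOC"] = sections["TIPO_LOC"].map(tipo_loc_mapping)
```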

Depending on the parameters, it can:

  • Return the resulting GeoDataFrame directly; or
  • Save the data as a layer in a GeoPackage (YEAR_GEODATA_NAME.gpkg) and return the path to the created file.
PARAMETER DESCRIPTION
shp_folder

Path to the folder containing census shapefiles (recursive reading via rglob("*.shp")).

TYPE: Path

target_columns

List of columns to select from each shapefile (must include or be compatible with the geometry column).

TYPE: list

tipo_loc_mapping

Mapping of locality codes for the TIPO_LOC field (e.g., {1: "Centro abitato", 2: "Nucleo", ...}), used to create the descriptive column DEN_LOC.

TYPE: dict

column_remapping

Optional dictionary to rename selected columns (e.g., {"PRO_COM": "PRO_COMUNE"}). If None, original names are kept.

TYPE: dict | None DEFAULT: None

output_folder

Optional folder where the resulting GeoPackage will be saved. If None, the function does not write to disk and returns the GeoDataFrame directly.

TYPE: Path | None DEFAULT: None

layer_name

Optional name of the layer to use within the GeoPackage. Must be specified if output_folder is provided.

TYPE: str | None DEFAULT: None

RETURNS DESCRIPTION
GeoDataFrame | Path

A GeoDataFrame with census data and corrected geometries if output_folder is None, or the path to the created GeoPackage if output_folder is provided.

RAISES DESCRIPTION
ValueError

If no shapefile is found in the specified folder.
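
The discovery step and its error case can be sketched with pathlib alone. The helper name and the error message are illustrative. Only the rglob("*.shp") pattern and the ValueError come from this page.

```python
import tempfile
from pathlib import Path


def find_shapefiles(shp_folder):
    """Recursively collect shapefiles, failing loudly if there are none."""
    shapefiles = sorted(Path(shp_folder).rglob("*.shp"))
    if not shapefiles:
        raise ValueError(f"No shapefile found in {shp_folder}")
    return shapefiles


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "R01").mkdir()
    (root / "R01" / "sections.shp").touch()  # nested file is still found
    found = [path.name for path in find_shapefiles(root)]

    empty = root / "empty"
    empty.mkdir()
    try:
        find_shapefiles(empty)
        raised = False
    except ValueError:
        raised = True
```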

Note

Geometries are validated with make_valid() to reduce issues caused by invalid polygons. An area_mq column containing the area in square meters is calculated. The GeoDataFrame index is set to the first column in df_columns (typically the census section code).
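
The effect of make_valid() can be seen on a self-intersecting "bowtie" polygon. This sketch uses Shapely directly rather than the module's code. It assumes coordinates are in a projected CRS with metre units, which is what makes the resulting area meaningful as square meters.

```python
from shapely.geometry import Polygon
from shapely.validation import make_valid

# A bowtie: the ring crosses itself at (1, 1), so the polygon is invalid.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])

# make_valid splits it into valid parts without discarding any area.
fixed = make_valid(bowtie)

# Area in squared coordinate units (square meters for a metric CRS).
area_mq = fixed.area
```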