GeoData Module

The GeoData module provides tools for managing and preprocessing geographic data, including administrative boundaries and census data. Its functions simplify reading, filtering, and converting the data into formats suited to geographic analysis, such as GeoDataFrame and GeoPackage.

`preprocess_geodata(census_shp_folder, census_target_columns, census_tipo_loc_mapping, output_folder, census_layer_name, census_column_remapping=None, regions_file_path=None, regions_target_columns=None, regions_index_column=None, regions_column_remapping=None, provinces_file_path=None, provinces_target_columns=None, provinces_index_column=None, provinces_column_remapping=None, municipalities_file_path=None, municipalities_target_columns=None, municipalities_index_column=None, municipalities_column_remapping=None, municipalities_code=[])`

Preprocess census geodata and administrative boundaries and save to GeoPackage.

This function executes the complete workflow for preparing geographic data for a census year, combining:

- Reading and normalizing administrative boundaries (regions, provinces, municipalities).
- Optionally correcting missing fields (e.g., `COD_PROV` for 2021).
- Reading and preparing census data (sections) from shapefiles.
- Joining sections with municipalities, provinces, and regions.
- Optionally filtering for a subset of municipalities (`municipalities_code`).
- Saving the final result to a GeoPackage.
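
The join and filter steps of this workflow can be illustrated with a minimal pandas sketch. It is not the module's implementation: the helper `enrich_sections`, the toy values, and the columns `SEZ`, `COMUNE`, `DEN_PROV`, and `COD_REG` are assumptions made for illustration, and plain DataFrames stand in for the real layers.

```python
import pandas as pd

# Toy stand-ins for the real layers; every value here is invented.
sections = pd.DataFrame({"SEZ": [1, 2, 3], "PRO_COM": [1001, 1001, 2005]})
municipalities = pd.DataFrame(
    {"COMUNE": ["Alfa", "Beta"], "COD_PROV": [1, 2]},
    index=pd.Index([1001, 2005], name="PRO_COM"),
)
provinces = pd.DataFrame(
    {"DEN_PROV": ["Prov A", "Prov B"], "COD_REG": [10, 10]},
    index=pd.Index([1, 2], name="COD_PROV"),
)
regions = pd.DataFrame(
    {"DEN_REG": ["Regione X"]},
    index=pd.Index([10], name="COD_REG"),
)


def enrich_sections(sections, municipalities, provinces, regions,
                    municipalities_code=()):
    """Hypothetical helper: attach administrative attributes to sections."""
    # Each lookup table is indexed by its code, so a left join on the
    # matching column walks up the hierarchy one level at a time.
    out = (
        sections
        .join(municipalities, on="PRO_COM")
        .join(provinces, on="COD_PROV")
        .join(regions, on="COD_REG")
    )
    # An empty selection keeps every municipality.
    if municipalities_code:
        out = out[out["PRO_COM"].isin(municipalities_code)]
    return out.reset_index(drop=True)


all_sections = enrich_sections(sections, municipalities, provinces, regions)
only_beta = enrich_sections(sections, municipalities, provinces, regions,
                            municipalities_code=[2005])
```

Indexing each boundary table by its code is what makes the chained joins possible, which may be why the function takes a separate `*_index_column` argument for each administrative level.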

PARAMETER	DESCRIPTION
`census_shp_folder`	Folder containing census shapefiles (sections). TYPE: `Path`
`census_target_columns`	Columns to select from census data (sections). TYPE: `list`
`census_tipo_loc_mapping`	Mapping for the `TIPO_LOC` field to derive the locality type description. TYPE: `dict`
`output_folder`	Folder where the resulting GeoPackage will be saved. TYPE: `Path`
`census_layer_name`	Name of the census layer (e.g., `census2011`), also used to derive the year from the suffix. TYPE: `str`
`census_column_remapping`	Optional mapping to rename census data columns. TYPE: `dict \| None` DEFAULT: `None`
`regions_file_path`	Optional path to the regional boundaries vector file. TYPE: `Path \| None` DEFAULT: `None`
`regions_target_columns`	Optional columns to select from regional data. TYPE: `list \| None` DEFAULT: `None`
`regions_index_column`	Optional column to use as index for regional data. TYPE: `str \| None` DEFAULT: `None`
`regions_column_remapping`	Optional mapping to rename regional data columns. TYPE: `dict \| None` DEFAULT: `None`
`provinces_file_path`	Optional path to the provincial boundaries vector file. TYPE: `Path \| None` DEFAULT: `None`
`provinces_target_columns`	Optional columns to select from provincial data. TYPE: `list \| None` DEFAULT: `None`
`provinces_index_column`	Optional column to use as index for provincial data. TYPE: `str \| None` DEFAULT: `None`
`provinces_column_remapping`	Optional mapping to rename provincial data columns. TYPE: `dict \| None` DEFAULT: `None`
`municipalities_file_path`	Optional path to the municipal boundaries vector file. TYPE: `Path \| None` DEFAULT: `None`
`municipalities_target_columns`	Optional columns to select from municipal data. TYPE: `list \| None` DEFAULT: `None`
`municipalities_index_column`	Optional column to use as index for municipal data. TYPE: `str \| None` DEFAULT: `None`
`municipalities_column_remapping`	Optional mapping to rename municipal data columns. TYPE: `dict \| None` DEFAULT: `None`
`municipalities_code`	Optional list of ISTAT municipality codes (`PRO_COM` field) to extract. If empty, all municipalities are kept. TYPE: `list[int]` DEFAULT: `[]`

RETURNS	DESCRIPTION
`Path`	Path to the generated GeoPackage containing the census layer enriched with administrative information.

Note

The census year is derived from the layer name as `census_layer_name[6:]` (e.g., `census2011` → `2011`). For 2021, the `COD_PROV` column is manually reconstructed from `PRO_COM_T` (see repository issue #47). The GeoPackage is saved as `{YEAR_GEODATA_NAME}.gpkg` and the layer as `{YEAR_GEODATA_NAME}{census_year}`.
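
The naming rules in this note can be sketched as follows. The value given to `YEAR_GEODATA_NAME` is a placeholder, both helper functions are hypothetical, and the `COD_PROV` rule rests on an assumption that the note does not state: that `PRO_COM_T` is a six-character, zero-padded municipality code whose first three digits are the province code.

```python
YEAR_GEODATA_NAME = "geodata"  # placeholder; the real constant is defined in the package


def derive_names(census_layer_name: str) -> tuple:
    """Hypothetical helper: derive year, file name, and layer name."""
    # "census" is six characters long, so the slice keeps only the suffix.
    census_year = census_layer_name[6:]
    gpkg_name = f"{YEAR_GEODATA_NAME}.gpkg"
    layer_name = f"{YEAR_GEODATA_NAME}{census_year}"
    return census_year, gpkg_name, layer_name


def cod_prov_from_pro_com_t(pro_com_t: str) -> int:
    """Hypothetical helper: province code from a text municipality code.

    Assumption: the first three characters of PRO_COM_T are the province.
    """
    return int(pro_com_t[:3])


year, gpkg_name, layer_name = derive_names("census2011")
province = cod_prov_from_pro_com_t("001272")
```

Because the year is taken by position rather than by pattern, a layer name with a prefix of a different length would yield the wrong year.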

`read_administrative_boundaries(file_path, target_columns, index_column, column_remapping=None, output_folder=None, layer_name=None)`

Read administrative boundaries and return a DataFrame or the path to a GeoPackage.

This function reads an administrative boundary file (typically a shapefile), selects a subset of columns, and sets a column as the index. Depending on the provided parameters, it can:

- Return a DataFrame without geometry, sorted and indexed; or
- Save the data as a layer in a GeoPackage, preserving the geometry.

The encoding is derived from the `.dbf` file associated with the shapefile to avoid issues with accented characters or special symbols.
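
The documentation does not say how the encoding is derived. One possible approach, shown here only as a sketch, is to read the language driver ID stored at byte 29 of the dBASE header and map it to a Python codec. The helper `guess_dbf_encoding`, the partial lookup table, and the UTF-8 fallback are assumptions, not the module's actual logic.

```python
from pathlib import Path

# Partial table of dBASE language driver IDs and the matching Python codecs.
LDID_TO_CODEC = {
    0x01: "cp437",   # US MS-DOS
    0x02: "cp850",   # International MS-DOS
    0x03: "cp1252",  # Windows ANSI
    0x57: "cp1252",  # ANSI
}


def guess_dbf_encoding(shp_path: Path, default: str = "utf-8") -> str:
    """Hypothetical helper: guess the attribute encoding of a shapefile."""
    dbf_path = shp_path.with_suffix(".dbf")
    with dbf_path.open("rb") as fh:
        header = fh.read(32)  # the fixed part of the header is 32 bytes
    if len(header) < 32:
        return default
    # Byte 29 holds the language driver ID; unknown IDs use the default.
    return LDID_TO_CODEC.get(header[29], default)
```

The returned codec name could then be passed as the `encoding` argument of `geopandas.read_file`.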

PARAMETER	DESCRIPTION
`file_path`	Path to the vector file (e.g., shapefile) containing administrative boundaries. TYPE: `Path`
`target_columns`	List of columns to select from the source dataset. The geometry column is added automatically. TYPE: `list`
`index_column`	Name of the column to use as the DataFrame index (e.g., ISTAT code). TYPE: `str`
`column_remapping`	Optional dictionary to rename columns (e.g., `{"DEN_REG": "REGIONE"}`). If None, original names are kept. TYPE: `dict \| None` DEFAULT: `None`
`output_folder`	Optional output folder where the GeoPackage will be saved. If None, the function returns a DataFrame (without geometry) instead of writing to disk. TYPE: `Path \| None` DEFAULT: `None`
`layer_name`	Optional name of the layer to use within the GeoPackage. Must be specified if `output_folder` is provided, to properly distinguish layers. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame \| Path`	Either an indexed and sorted DataFrame without geometry column if `output_folder` is None, or the path to the created GeoPackage if `output_folder` is provided.

Note

The geometry column is automatically added to `target_columns` via the `GEOMETRY_COLUMN_NAME` constant. The GeoPackage is saved with a name based on the `YEAR_GEODATA_NAME` constant and contains the layer specified by `layer_name`.
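
The in-memory branch (no `output_folder`) can be approximated with plain pandas. This is a sketch under stated assumptions: the value of `GEOMETRY_COLUMN_NAME` is a guess, the helper `select_boundaries` and the `COD_REG` and `SHAPE_AREA` columns are hypothetical, and placeholder strings stand in for real geometries.

```python
import pandas as pd

GEOMETRY_COLUMN_NAME = "geometry"  # assumed value of the package constant


def select_boundaries(df, target_columns, index_column, column_remapping=None):
    """Hypothetical helper mirroring the no-output_folder behaviour."""
    # The geometry column is appended to the requested columns.
    columns = list(target_columns) + [GEOMETRY_COLUMN_NAME]
    out = df[columns]
    # The plain-DataFrame result carries no geometry.
    out = out.drop(columns=GEOMETRY_COLUMN_NAME)
    # In this sketch the index is set first, using the original column name.
    out = out.set_index(index_column).sort_index()
    if column_remapping:
        out = out.rename(columns=column_remapping)
    return out


raw = pd.DataFrame(
    {
        "COD_REG": [3, 1],
        "DEN_REG": ["Lombardia", "Piemonte"],
        "SHAPE_AREA": [2.0, 1.0],  # dropped: not among the target columns
        "geometry": ["<polygon A>", "<polygon B>"],
    }
)
regions = select_boundaries(
    raw, ["COD_REG", "DEN_REG"], "COD_REG", {"DEN_REG": "REGIONE"}
)
```

Whether the real function applies `column_remapping` before or after setting the index is not stated, so the ordering above is only one option.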

`read_census(shp_folder, target_columns, tipo_loc_mapping, column_remapping=None, output_folder=None, layer_name=None)`

Read census data from shapefiles and return a GeoDataFrame or the path to a GeoPackage.

This function recursively searches for all shapefiles in a folder, reads their data, selects a subset of columns, corrects invalid geometries, adds the locality type description (derived from `tipo_loc_mapping`), and builds a unified GeoDataFrame with all census sections.

Depending on the parameters, it can:

- Return the resulting GeoDataFrame directly; or
- Save the data as a layer in a GeoPackage (`{YEAR_GEODATA_NAME}.gpkg`) and return the path to the created file.

PARAMETER	DESCRIPTION
`shp_folder`	Path to the folder containing census shapefiles (recursive reading via `rglob("*.shp")`). TYPE: `Path`
`target_columns`	List of columns to select from each shapefile (must include or be compatible with the geometry column). TYPE: `list`
`tipo_loc_mapping`	Mapping of locality codes for the `TIPO_LOC` field (e.g., `{1: "Centro abitato", 2: "Nucleo", ...}`), used to create the descriptive column `DEN_LOC`. TYPE: `dict`
`column_remapping`	Optional dictionary to rename selected columns (e.g., `{"PRO_COM": "PRO_COMUNE"}`). If None, original names are kept. TYPE: `dict \| None` DEFAULT: `None`
`output_folder`	Optional folder where the resulting GeoPackage will be saved. If None, the function does not write to disk and returns the GeoDataFrame directly. TYPE: `Path \| None` DEFAULT: `None`
`layer_name`	Optional name of the layer to use within the GeoPackage. Must be specified if `output_folder` is provided. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`GeoDataFrame \| Path`	Either a `GeoDataFrame` with census data and corrected geometries if `output_folder` is None, or the path to the created GeoPackage if `output_folder` is provided.

RAISES	DESCRIPTION
`ValueError`	If no shapefile is found in the specified folder.
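
The recursive discovery step and the error raised on an empty folder can be sketched with `pathlib` alone. The helper name `find_shapefiles` and the error message are illustrative assumptions; only the recursive `.shp` search and the `ValueError` come from the documentation.

```python
from pathlib import Path


def find_shapefiles(shp_folder: Path) -> list:
    """Hypothetical helper: collect every .shp file below shp_folder."""
    # rglob descends into subfolders; sorting makes the order deterministic.
    shapefiles = sorted(Path(shp_folder).rglob("*.shp"))
    if not shapefiles:
        raise ValueError(f"No shapefile found in {shp_folder}")
    return shapefiles
```

Failing early here means an empty or mistyped folder is reported before any reading or concatenation is attempted.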

Note

Geometries are validated with `make_valid()` to reduce issues caused by invalid polygons. An `area_mq` column containing the area in square meters is calculated. The GeoDataFrame index is set to the first column in `df_columns` (typically the census section code).
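
The effect of the validation step can be seen on a single self-intersecting polygon. This sketch uses Shapely directly (version 1.8 or later is assumed) rather than the module's code. The helper `repair_and_measure` is hypothetical, and the area is in square meters only if the coordinates are in a projected CRS with metre units, which is assumed here.

```python
from shapely.geometry import Polygon
from shapely.validation import make_valid


def repair_and_measure(geometry):
    """Hypothetical helper: repair a geometry and return it with its area."""
    fixed = make_valid(geometry)
    # Planar area in the units of the coordinates (assumed to be metres).
    return fixed, fixed.area


# A "bowtie": the ring crosses itself at (1, 1), so the polygon is invalid.
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
fixed, area_mq = repair_and_measure(bowtie)
```

Repairing before measuring matters because the area of a self-intersecting ring is not reliable: parts with opposite orientation can cancel each other out.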