GeoData Module

The GeoData module provides tools for managing and preprocessing geographic data, including administrative boundaries and census data. Its functions read, filter, and convert data into formats suitable for geographic analysis, such as GeoDataFrame and GeoPackage.

`preprocess_geodata(census_shp_folder, census_target_columns, census_tipo_loc_mapping, output_folder, census_layer_name, census_column_remapping=None, regions_file_path=None, regions_target_columns=None, regions_index_column=None, regions_column_remapping=None, provinces_file_path=None, provinces_target_columns=None, provinces_index_column=None, provinces_column_remapping=None, municipalities_file_path=None, municipalities_target_columns=None, municipalities_index_column=None, municipalities_column_remapping=None, municipalities_code=[])`

Preprocess census geodata and administrative boundaries and save to GeoPackage.

This function executes the complete workflow for preparing geographic data for a census year, combining:

1. Reading and normalizing administrative boundaries (regions, provinces, municipalities).
2. Optionally correcting missing fields (e.g., `COD_PROV` for 2021).
3. Reading and preparing census data (sections) from shapefiles.
4. Joining sections with municipalities, provinces, and regions.
5. Optionally filtering for a subset of municipalities (`municipalities_code`).
6. Saving the final result to a GeoPackage.
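
The join and filter steps of this workflow can be sketched in plain Python, without GeoPandas, to show the logic only. This is a minimal sketch under stated assumptions, not the module's implementation: the helper `join_sections`, the columns `SEZ`, `COD_REG`, `COMUNE`, and `DEN_PROV`, and the sample rows are illustrative and do not come from the module.

```python
def join_sections(sections, municipalities, provinces, regions, municipalities_code=None):
    """Attach municipality, province, and region attributes to each section.

    Hypothetical helper: `sections` is a list of dicts, the three lookups
    are dicts keyed by their administrative code.
    """
    # An empty (or missing) filter keeps every municipality.
    wanted = set(municipalities_code or [])
    enriched = []
    for section in sections:
        if wanted and section["PRO_COM"] not in wanted:
            continue
        row = dict(section)
        row.update(municipalities.get(section["PRO_COM"], {}))
        row.update(provinces.get(section["COD_PROV"], {}))
        row.update(regions.get(section["COD_REG"], {}))
        enriched.append(row)
    return enriched


# Illustrative sample data (column names other than PRO_COM, COD_PROV and
# DEN_REG are assumptions).
sections = [
    {"SEZ": 1, "PRO_COM": 15146, "COD_PROV": 15, "COD_REG": 3},
    {"SEZ": 2, "PRO_COM": 1272, "COD_PROV": 1, "COD_REG": 1},
]
municipalities = {15146: {"COMUNE": "Milano"}, 1272: {"COMUNE": "Torino"}}
provinces = {15: {"DEN_PROV": "Milano"}, 1: {"DEN_PROV": "Torino"}}
regions = {3: {"DEN_REG": "Lombardia"}, 1: {"DEN_REG": "Piemonte"}}

all_rows = join_sections(sections, municipalities, provinces, regions)
only_milan = join_sections(
    sections, municipalities, provinces, regions, municipalities_code=[15146]
)
```

Passing an empty list (or nothing) for `municipalities_code` keeps every section, which mirrors the documented behaviour of the parameter.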

PARAMETER	DESCRIPTION
`census_shp_folder`	Folder containing census shapefiles (sections). TYPE: `Path`
`census_target_columns`	Columns to select from census data (sections). TYPE: `list`
`census_tipo_loc_mapping`	Mapping for the `TIPO_LOC` field to derive the locality type description. TYPE: `dict`
`output_folder`	Folder where the resulting GeoPackage will be saved. TYPE: `Path`
`census_layer_name`	Name of the census layer (e.g., `census2011`), also used to derive the year from the suffix. TYPE: `str`
`census_column_remapping`	Optional mapping to rename census data columns. TYPE: `dict \| None` DEFAULT: `None`
`regions_file_path`	Optional path to the regional boundaries vector file. TYPE: `Path \| None` DEFAULT: `None`
`regions_target_columns`	Optional columns to select from regional data. TYPE: `list \| None` DEFAULT: `None`
`regions_index_column`	Optional column to use as index for regional data. TYPE: `str \| None` DEFAULT: `None`
`regions_column_remapping`	Optional mapping to rename regional data columns. TYPE: `dict \| None` DEFAULT: `None`
`provinces_file_path`	Optional path to the provincial boundaries vector file. TYPE: `Path \| None` DEFAULT: `None`
`provinces_target_columns`	Optional columns to select from provincial data. TYPE: `list \| None` DEFAULT: `None`
`provinces_index_column`	Optional column to use as index for provincial data. TYPE: `str \| None` DEFAULT: `None`
`provinces_column_remapping`	Optional mapping to rename provincial data columns. TYPE: `dict \| None` DEFAULT: `None`
`municipalities_file_path`	Optional path to the municipal boundaries vector file. TYPE: `Path \| None` DEFAULT: `None`
`municipalities_target_columns`	Optional columns to select from municipal data. TYPE: `list \| None` DEFAULT: `None`
`municipalities_index_column`	Optional column to use as index for municipal data. TYPE: `str \| None` DEFAULT: `None`
`municipalities_column_remapping`	Optional mapping to rename municipal data columns. TYPE: `dict \| None` DEFAULT: `None`
`municipalities_code`	Optional list of ISTAT municipality codes (`PRO_COM` field) to extract. If empty, all municipalities are kept. TYPE: `list[int]` DEFAULT: `[]`

RETURNS	DESCRIPTION
`Path`	Path to the generated GeoPackage containing the census layer enriched with administrative information.

Note

The census year is derived from the layer name via `census_layer_name[6:]` (e.g., `census2011` → `2011`). For 2021, the `COD_PROV` column is manually reconstructed from `PRO_COM_T` (see repository issue #47). The GeoPackage is saved as `{YEAR_GEODATA_NAME}.gpkg` and the layer as `{YEAR_GEODATA_NAME}{census_year}`.
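
The naming rules in this note can be illustrated with a short sketch. Several things here are assumptions: the value of `YEAR_GEODATA_NAME` is a placeholder, both helper functions are hypothetical, and the `COD_PROV` reconstruction assumes that `PRO_COM_T` is a six-character, zero-padded municipality code whose first three digits are the province code. The module may reconstruct the column differently.

```python
YEAR_GEODATA_NAME = "geodata"  # placeholder; the real constant's value is not documented here


def census_year_from_layer(census_layer_name: str) -> str:
    # "census" has six characters, so the slice keeps whatever follows it.
    return census_layer_name[6:]


def cod_prov_from_pro_com_t(pro_com_t: str) -> int:
    # Assumption: first three digits of the zero-padded code are the province.
    return int(pro_com_t[:3])


census_year = census_year_from_layer("census2011")
gpkg_name = f"{YEAR_GEODATA_NAME}.gpkg"
layer_name = f"{YEAR_GEODATA_NAME}{census_year}"
```

Because the year is taken by position, a layer name that does not start with a six-character prefix would yield an unexpected year.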

`read_administrative_boundaries(file_path, target_columns, index_column, column_remapping=None, output_folder=None, layer_name=None)`

Read administrative boundaries and return a DataFrame or write them to a GeoPackage.

This function reads an administrative boundary file (typically a shapefile), selects a subset of columns, and sets a column as the index. Depending on the provided parameters, it can:

- Return a DataFrame without geometry, sorted and indexed; or
- Save the data as a layer in a GeoPackage, preserving the geometry.

The encoding is derived from the `.dbf` file associated with the shapefile to avoid issues with accented characters or special symbols.
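
How the encoding is read from the `.dbf` file is not described here. One possible approach, shown purely as an assumption, is to inspect the language driver ID stored at byte offset 29 of the dBASE header and map it to a Python codec. The helper `guess_dbf_encoding`, the partial lookup table, and the `utf-8` fallback are illustrative choices, not the module's code.

```python
from pathlib import Path

# Partial table of dBASE language driver IDs; any other ID falls back
# to the default codec.
LDID_TO_CODEC = {
    0x01: "cp437",   # DOS USA
    0x02: "cp850",   # DOS multilingual
    0x03: "cp1252",  # Windows ANSI
    0x57: "cp1252",  # ANSI
}


def guess_dbf_encoding(shp_path: Path, default: str = "utf-8") -> str:
    """Guess the attribute encoding from the .dbf next to a shapefile."""
    dbf_path = shp_path.with_suffix(".dbf")
    with dbf_path.open("rb") as handle:
        header = handle.read(32)  # fixed-size part of the dBASE header
    if len(header) < 30:
        return default
    return LDID_TO_CODEC.get(header[29], default)
```

The returned codec name could then be passed to whatever reader loads the attribute table; a `.cpg` sidecar file, when present, is another common source for the same information.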

PARAMETER	DESCRIPTION
`file_path`	Path to the vector file (e.g., shapefile) containing administrative boundaries. TYPE: `Path`
`target_columns`	List of columns to select from the source dataset. The geometry column is added automatically. TYPE: `list`
`index_column`	Name of the column to use as the DataFrame index (e.g., ISTAT code). TYPE: `str`
`column_remapping`	Optional dictionary to rename columns (e.g., `{"DEN_REG": "REGIONE"}`). If None, original names are kept. TYPE: `dict \| None` DEFAULT: `None`
`output_folder`	Optional output folder where the GeoPackage will be saved. If None, the function returns a DataFrame (without geometry) instead of writing to disk. TYPE: `Path \| None` DEFAULT: `None`
`layer_name`	Optional name of the layer to use within the GeoPackage. Must be specified if `output_folder` is provided, to properly distinguish layers. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame \| Path`	Either an indexed and sorted DataFrame without geometry column if `output_folder` is None, or the path to the created GeoPackage if `output_folder` is provided.

Note

The geometry column is automatically added to `target_columns` via the `GEOMETRY_COLUMN_NAME` constant. The GeoPackage is saved with a name based on the `YEAR_GEODATA_NAME` constant and contains the layer specified by `layer_name`.
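
The relationship between `output_folder`, `layer_name`, and the return value can be sketched as follows. This is a hedged illustration: the helper `resolve_geopackage_path` is hypothetical, the value of `YEAR_GEODATA_NAME` is a placeholder, and raising `ValueError` when `layer_name` is missing is an assumption, since the documentation only says the layer name must be specified.

```python
from pathlib import Path
from typing import Optional

YEAR_GEODATA_NAME = "geodata"  # placeholder; the real constant's value is not documented here


def resolve_geopackage_path(
    output_folder: Optional[Path], layer_name: Optional[str]
) -> Optional[Path]:
    """Return the GeoPackage path to write to, or None for in-memory mode."""
    if output_folder is None:
        # No output folder: the caller returns the table instead of writing.
        return None
    if not layer_name:
        # Assumed behaviour; the real function may handle this differently.
        raise ValueError("layer_name is required when output_folder is provided")
    return Path(output_folder) / f"{YEAR_GEODATA_NAME}.gpkg"
```

Under these assumptions, every layer written for the same output folder lands in the same `.gpkg` file and is told apart only by its layer name.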

`read_census(shp_folder, target_columns, tipo_loc_mapping, column_remapping=None, output_folder=None, layer_name=None)`

Read census data from shapefiles and return a GeoDataFrame or write it to a GeoPackage.

This function recursively searches for all shapefiles in a folder, reads their data, selects a subset of columns, corrects invalid geometries, adds the locality type description (derived from `tipo_loc_mapping`), and builds a unified GeoDataFrame with all census sections.

Depending on the parameters, it can:

- Return the resulting GeoDataFrame directly; or
- Save the data as a layer in a GeoPackage (`{YEAR_GEODATA_NAME}.gpkg`) and return the path to the created file.

PARAMETER	DESCRIPTION
`shp_folder`	Path to the folder containing census shapefiles (recursive reading via `rglob("*.shp")`). TYPE: `Path`
`target_columns`	List of columns to select from each shapefile (must include or be compatible with the geometry column). TYPE: `list`
`tipo_loc_mapping`	Mapping of locality codes for the `TIPO_LOC` field (e.g., `{1: "Centro abitato", 2: "Nucleo", ...}`), used to create the descriptive column `DEN_LOC`. TYPE: `dict`
`column_remapping`	Optional dictionary to rename selected columns (e.g., `{"PRO_COM": "PRO_COMUNE"}`). If None, original names are kept. TYPE: `dict \| None` DEFAULT: `None`
`output_folder`	Optional folder where the resulting GeoPackage will be saved. If None, the function does not write to disk and returns the GeoDataFrame directly. TYPE: `Path \| None` DEFAULT: `None`
`layer_name`	Optional name of the layer to use within the GeoPackage. Must be specified if `output_folder` is provided. TYPE: `str \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`GeoDataFrame \| Path`	Either a `GeoDataFrame` with census data and corrected geometries if `output_folder` is None, or the path to the created GeoPackage if `output_folder` is provided.

RAISES	DESCRIPTION
`ValueError`	If no shapefile is found in the specified folder.
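
The recursive search and the error case can be sketched with `pathlib` alone. The helper name `find_shapefiles` and the error message are hypothetical, and sorting the results is an added choice to make the order deterministic; the module itself may not sort.

```python
from pathlib import Path


def find_shapefiles(shp_folder: Path) -> list[Path]:
    """Collect every .shp file under shp_folder, including subfolders."""
    shapefiles = sorted(shp_folder.rglob("*.shp"))
    if not shapefiles:
        raise ValueError(f"No shapefile found in {shp_folder}")
    return shapefiles
```

Calling `find_shapefiles(Path("data/census"))` (an example path) would return one entry per shapefile found at any depth, or raise `ValueError` if the folder contains none.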

Note

Geometries are validated with `make_valid()` to reduce issues caused by invalid polygons. An `area_mq` column containing the area in square meters is calculated. The GeoDataFrame index is set to the first column in `df_columns` (typically the census section code).
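
The locality description step of `read_census` can be illustrated with plain dictionaries. This is a sketch only: the helper `add_den_loc` and the `SEZ` column are hypothetical, the mapping contains just the two entries given as an example in the parameter table, and leaving unmapped codes as `None` is an assumption about how missing entries are handled.

```python
# Only the two entries shown in the parameter description are used here.
tipo_loc_mapping = {1: "Centro abitato", 2: "Nucleo"}


def add_den_loc(rows, mapping):
    """Return copies of the rows with a DEN_LOC description added."""
    # Codes missing from the mapping yield None instead of raising.
    return [{**row, "DEN_LOC": mapping.get(row["TIPO_LOC"])} for row in rows]


rows = [
    {"SEZ": 1, "TIPO_LOC": 1},
    {"SEZ": 2, "TIPO_LOC": 2},
    {"SEZ": 3, "TIPO_LOC": 9},  # code not present in the mapping
]
described = add_den_loc(rows, tipo_loc_mapping)
```

The input rows are left unchanged, so the original `TIPO_LOC` codes remain available alongside the derived `DEN_LOC` text.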