Skip to content

Utils Module

The Utils module collects a series of utility functions for file management and conversion, directory creation, and content extraction from compressed archives. These functions are essential to support census and geographic data management processes.

census_folder(output_data_folder, year)

Create (if necessary) the folder dedicated to census data for a specific year.

This function generates a directory named census_<year> within the output_data_folder and creates it if it doesn't already exist. This folder represents the root of downloaded and processed data for a specific census, maintaining an organized and consistent structure across different years.

PARAMETER DESCRIPTION
output_data_folder

Main folder under which to create the census directory.

TYPE: Path

year

Census year to organize (e.g., 1991, 2001, 2011, 2021).

TYPE: int

RETURNS DESCRIPTION
Path

Complete path to the created or existing census folder.

RAISES DESCRIPTION
Exception

If the folder cannot be created due to permissions or invalid paths.

check_encoding(data)

Determine file encoding by reading an initial sample.

This function opens the file in binary mode, reads the first 100,000 bytes, and uses the chardet library to estimate the text encoding. If chardet identifies the encoding as 'ascii', it is converted to 'latin1' to ensure greater compatibility, since many administrative and geographic files may contain extended characters while being formally interpreted as ASCII.

PARAMETER DESCRIPTION
data

Path to the file whose encoding should be determined.

TYPE: Path

RETURNS DESCRIPTION
str

The detected encoding. If 'ascii' is detected, it is automatically

str

replaced with 'latin1'.

Note

chardet provides a heuristic estimate of encoding and is not infallible. Reading is limited to the first 100,000 bytes to improve performance. 'latin1' is a safe choice to avoid errors on files with accented characters or ambiguous encodings typical of ISTAT administrative datasets.

csv_from_excel(data, output_path, metadata=False)

Convert an Excel file (.xls) to CSV format.

This function reads a legacy Excel file (.xls) using xlrd and converts the content of a sheet to a CSV file. If metadata=True, the sheet named "Metadati" is converted; otherwise, the first available sheet is converted, excluding "Metadati" if present.

The conversion preserves row order and writes all fields using csv.QUOTE_ALL to ensure compatibility and preserve delimiters, strings with spaces, or special characters.

PARAMETER DESCRIPTION
data

Path to the Excel file to convert.

TYPE: Path

output_path

Path where the output CSV file will be saved.

TYPE: Path

metadata

If True, converts the "Metadati" sheet. If False, converts the first available sheet excluding "Metadati". Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
Path

Path to the generated CSV file.

RAISES DESCRIPTION
FileNotFoundError

If the specified Excel file does not exist.

XLRDError

If the file cannot be read or the requested sheet does not exist.

Exception

For any other errors during conversion or writing.

Note

The conversion uses xlrd, so the file must be in .xls format (Excel legacy). .xlsx files are not supported by xlrd. The CSV is saved in UTF-8 encoding. The function uses tqdm to display a progress bar.

get_census_dictionary(census_year, region_list=[])

Generate official ISTAT URLs for census data, geodata, and administrative boundaries.

This function dynamically constructs download paths based on the census year and the list of desired regions. It handles structural differences between previous censuses (1991–2011) and the 2021 census.

PARAMETER DESCRIPTION
census_year

Census year (1991, 2001, 2011, or 2021).

TYPE: int

region_list

Optional list of regions for which to generate geodata URLs. If empty, uses regions 1–20.

TYPE: list[int] DEFAULT: []

RETURNS DESCRIPTION
dict

Dictionary containing the URLs: - data_url: URL for census data - geodata_urls: URLs for territorial bases - admin_boundaries_url: URL for administrative boundaries - census_code: Primary code for joins and identifiers

RAISES DESCRIPTION
ValueError

If the provided year is not supported.

get_region(region_list=[])

Return the list of regions to use for geodata download.

If no list is provided, returns the complete list of 20 Italian regions (codes 1–20). Otherwise, returns the provided list.

PARAMETER DESCRIPTION
region_list

Optional list of region codes to use. If empty, returns all regions (1–20).

TYPE: list[int] DEFAULT: []

RETURNS DESCRIPTION
list[int]

List of region codes to process.

remove_files(files_path)

Remove a list of files from the filesystem.

PARAMETER DESCRIPTION
files_path

List of Path objects to delete.

TYPE: list

Note

Exceptions are not caught: if a file cannot be deleted, the error emerges explicitly (desirable behavior in ETL workflows).

unzip_data(input_data, output_folder)

Decompress a ZIP file into the specified destination folder.

This function opens a ZIP archive and extracts its entire content into the specified folder. If the output folder does not exist, it is created automatically. Functions as an internal component of the ISTAT data download workflow.

PARAMETER DESCRIPTION
input_data

Path to the ZIP file to decompress.

TYPE: Path

output_folder

Folder where the archive content will be extracted.

TYPE: Path

RETURNS DESCRIPTION
Path

Path to the folder containing the extracted files.

RAISES DESCRIPTION
FileNotFoundError

If the ZIP file does not exist.

BadZipFile

If the provided file is not a valid ZIP archive.

Exception

For any error during decompression.

census_trace(file_path, year, output_path=None)

Extract metadata trace record from the "Metadati" sheet of an Excel file.

This function accesses the sheet named "Metadati" in an Excel file related to census data, extracts the fundamental columns (field name and description), and constructs a pandas DataFrame with an index based on the field name. If an output path is provided, the trace record is also saved in CSV format.

PARAMETER DESCRIPTION
file_path

Path to the Excel file from which to extract metadata.

TYPE: Path

year

Reference year for the census, used to generate the output file name.

TYPE: int

output_path

Path to the folder where the trace record CSV will be saved. If None, a DataFrame is returned directly.

TYPE: Path | None DEFAULT: None

RETURNS DESCRIPTION
DataFrame | Path

Path to the generated CSV file if output_path is provided, or a pandas

DataFrame | Path

DataFrame containing the metadata trace record if output_path is None.

RAISES DESCRIPTION
FileNotFoundError

If the specified Excel file does not exist.

XLRDError

If an error occurs while opening or reading the Excel file.

Exception

For any unexpected errors during parsing or saving.

read_xls(file_path, census_code, output_path=None)

Read an Excel file (.xls) and return a DataFrame or save data as CSV.

This function opens an Excel file in .xls format, automatically selects the first useful sheet (excluding any sheets named "Metadati"), extracts the sheet rows, constructs a pandas DataFrame, and sets as index the column corresponding to the provided census code.

If an output path is specified, the DataFrame is saved in CSV format; otherwise, it is returned directly.

PARAMETER DESCRIPTION
file_path

Path to the Excel file to read.

TYPE: Path

census_code

Name of the column to use as the DataFrame index (e.g., ISTAT municipality code).

TYPE: str

output_path

Path to the folder where the resulting CSV will be saved. If None, the DataFrame is returned without saving.

TYPE: Path | None DEFAULT: None

RETURNS DESCRIPTION
DataFrame | Path

A DataFrame containing data read from the Excel file if output_path

DataFrame | Path

is None, or the path to the saved CSV file if output_path is specified.

RAISES DESCRIPTION
FileNotFoundError

If the specified file does not exist.

XLRDError

If an error occurs while reading the Excel file.

Exception

For any unexpected error during parsing or saving.

read_xlsx(file_path, output_path=None)

Read an Excel file (XLSX format) and convert to a Pandas DataFrame.

If specified, saves the data in CSV format.

PARAMETER DESCRIPTION
file_path

Path to the Excel file to read.

TYPE: Path

output_path

Path where the generated CSV file will be saved. If not specified, returns the DataFrame.

TYPE: Path | None DEFAULT: None

RETURNS DESCRIPTION
DataFrame | Path

A DataFrame if output_path is None, otherwise returns the path to

DataFrame | Path

the saved CSV file.

RAISES DESCRIPTION
FileNotFoundError

If the specified Excel file is not found.

ValueError

If the Excel file cannot be read correctly.