Utils Module

The Utils module collects utility functions for file management and conversion, directory creation, and extraction of content from compressed archives. These functions support the census and geographic data management workflows.

`census_folder(output_data_folder, year)`

Create (if necessary) the folder dedicated to census data for a specific year.

This function generates a directory named census_<year> within the output_data_folder and creates it if it doesn't already exist. This folder represents the root of downloaded and processed data for a specific census, maintaining an organized and consistent structure across different years.

PARAMETER	DESCRIPTION
`output_data_folder`	Main folder under which to create the census directory. TYPE: `Path`
`year`	Census year to organize (e.g., 1991, 2001, 2011, 2021). TYPE: `int`

RETURNS	DESCRIPTION
`Path`	Complete path to the created or existing census folder.

RAISES	DESCRIPTION
`Exception`	If the folder cannot be created due to permissions or invalid paths.
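
The behaviour described above can be approximated with `pathlib` alone. The following is a minimal sketch, not the module's actual source: the name `census_folder_sketch` is hypothetical, and the use of `mkdir(parents=True, exist_ok=True)` is an assumption about how the folder is created.

```python
from pathlib import Path
import tempfile


def census_folder_sketch(output_data_folder: Path, year: int) -> Path:
    """Return output_data_folder/census_<year>, creating it if missing."""
    folder = Path(output_data_folder) / f"census_{year}"
    # Assumption: missing parents are created and an existing folder is reused.
    folder.mkdir(parents=True, exist_ok=True)
    return folder


# Usage against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    first = census_folder_sketch(Path(tmp), 2011)
    second = census_folder_sketch(Path(tmp), 2011)  # folder already exists
    folder_name = first.name
    same_folder = first == second
    existed = first.is_dir()
```

Calling the function twice for the same year returns the same path, which is what keeps the per-year layout stable across repeated runs.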

`check_encoding(data)`

Determine file encoding by reading an initial sample.

This function opens the file in binary mode, reads the first 100,000 bytes, and uses the chardet library to estimate the text encoding. If chardet reports 'ascii', the function returns 'latin1' instead, because administrative and geographic files often contain extended characters even when the sampled bytes look like plain ASCII.

PARAMETER	DESCRIPTION
`data`	Path to the file whose encoding should be determined. TYPE: `Path`

RETURNS	DESCRIPTION
`str`	The detected encoding. If `'ascii'` is detected, it is automatically replaced with `'latin1'`.

Note

chardet provides a heuristic estimate of the encoding and is not infallible. Reading is limited to the first 100,000 bytes to improve performance. 'latin1' is a safe fallback because it can decode any byte sequence, which avoids decode errors on files with accented characters or ambiguous encodings typical of ISTAT administrative datasets.
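
The reasoning behind the 'ascii' to 'latin1' substitution can be illustrated with the standard library only. This is a sketch under stated assumptions: the byte payload is invented for the example, and no call to chardet is made, so it shows why the fallback helps rather than how the function is implemented.

```python
SAMPLE_SIZE = 100_000  # number of bytes inspected, as described above

# Hypothetical file content: an ASCII-only head followed by one accented name.
payload = b"x" * SAMPLE_SIZE + "Forlì".encode("latin1")

# The sampled head is pure ASCII, so a detector that only sees the sample
# could reasonably report 'ascii'.
head_is_ascii = payload[:SAMPLE_SIZE].isascii()

# Decoding the whole file as ASCII fails on the byte outside the sample...
try:
    payload.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# ...while latin1 maps every byte value to a code point and does not raise.
text = payload.decode("latin1")
```

Because latin1 accepts every byte value, the fallback trades a possible decode error for a possible mis-decoding when the file is really in another encoding such as UTF-8.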

`csv_from_excel(data, output_path, metadata=False)`

Convert an Excel file (.xls) to CSV format.

This function reads a legacy Excel file (.xls) using xlrd and converts the content of a sheet to a CSV file. If metadata=True, the sheet named "Metadati" is converted; otherwise, the first available sheet is converted, excluding "Metadati" if present.

The conversion preserves row order and quotes every field (csv.QUOTE_ALL), so values that contain delimiters, spaces, or special characters are written intact.

PARAMETER	DESCRIPTION
`data`	Path to the Excel file to convert. TYPE: `Path`
`output_path`	Path where the output CSV file will be saved. TYPE: `Path`
`metadata`	If True, converts the "Metadati" sheet. If False, converts the first available sheet excluding "Metadati". TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`Path`	Path to the generated CSV file.

RAISES	DESCRIPTION
`FileNotFoundError`	If the specified Excel file does not exist.
`XLRDError`	If the file cannot be read or the requested sheet does not exist.
`Exception`	For any other errors during conversion or writing.

Note

The conversion uses xlrd, so the file must be in the legacy .xls format; .xlsx files are not supported by xlrd 2.0 and later. The CSV is saved in UTF-8 encoding. The function uses tqdm to display a progress bar.
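
The sheet-selection rule and the effect of `csv.QUOTE_ALL` can be sketched without xlrd. The helper names `pick_sheet_name` and `rows_to_csv_text`, the sheet name "Dati", and the sample rows are all hypothetical; raising `ValueError` for a missing sheet is also an assumption, since the documented function reports `XLRDError` in that case.

```python
import csv
import io


def pick_sheet_name(sheet_names, metadata=False):
    """Choose the sheet to convert, following the rule described above."""
    if metadata:
        if "Metadati" not in sheet_names:
            # Assumption: the real function surfaces an xlrd error instead.
            raise ValueError('sheet "Metadati" not found')
        return "Metadati"
    # Otherwise take the first sheet that is not the metadata sheet.
    for name in sheet_names:
        if name != "Metadati":
            return name
    raise ValueError("no data sheet found")


def rows_to_csv_text(rows):
    """Serialize rows in order, quoting every field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical workbook layout and rows.
data_sheet = pick_sheet_name(["Metadati", "Dati"])
meta_sheet = pick_sheet_name(["Dati", "Metadati"], metadata=True)
csv_text = rows_to_csv_text([["CODICE", "NOME"], [1001, "Aglie, centro"]])
```

With `QUOTE_ALL`, numeric cells are quoted as well, and the comma inside the second value stays within a single field.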

`get_census_dictionary(census_year, region_list=[])`

Generate official ISTAT URLs for census data, geodata, and administrative boundaries.

This function dynamically constructs download paths based on the census year and the list of desired regions. It handles structural differences between previous censuses (1991–2011) and the 2021 census.

PARAMETER	DESCRIPTION
`census_year`	Census year (1991, 2001, 2011, or 2021). TYPE: `int`
`region_list`	Optional list of regions for which to generate geodata URLs. If empty, uses regions 1–20. TYPE: `list[int]` DEFAULT: `[]`

RETURNS	DESCRIPTION
`dict`	Dictionary with the following keys:
- `data_url`: URL for census data
- `geodata_urls`: URLs for territorial bases
- `admin_boundaries_url`: URL for administrative boundaries
- `census_code`: Primary code for joins and identifiers

RAISES	DESCRIPTION
`ValueError`	If the provided year is not supported.
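
The shape of the returned dictionary and the year validation can be sketched as follows. Everything concrete here is a placeholder: `census_dictionary_sketch` is a hypothetical name, the `example.invalid` addresses are not ISTAT URLs, the `census_code` value is invented, and treating `geodata_urls` as a list with one entry per region is an assumption. The sketch does not model the structural differences between the 1991–2011 censuses and the 2021 census.

```python
SUPPORTED_YEARS = (1991, 2001, 2011, 2021)
BASE = "https://example.invalid"  # placeholder host, NOT an ISTAT address


def census_dictionary_sketch(census_year, region_list=None):
    """Build a dictionary with the four documented keys."""
    if census_year not in SUPPORTED_YEARS:
        raise ValueError(f"Unsupported census year: {census_year}")
    # An empty or missing list falls back to regions 1-20.
    regions = list(region_list) if region_list else list(range(1, 21))
    return {
        "data_url": f"{BASE}/data/{census_year}.zip",
        # Assumption: one geodata URL per requested region.
        "geodata_urls": [
            f"{BASE}/geodata/{census_year}/R{code:02d}.zip" for code in regions
        ],
        "admin_boundaries_url": f"{BASE}/boundaries/{census_year}.zip",
        "census_code": f"CODE_{census_year}",  # placeholder identifier
    }


result = census_dictionary_sketch(2011, [3, 9])

try:
    census_dictionary_sketch(1981)
    unsupported_raised = False
except ValueError:
    unsupported_raised = True
```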

`get_region(region_list=[])`

Return the list of regions to use for geodata download.

If no list is provided, returns the complete list of 20 Italian regions (codes 1–20). Otherwise, returns the provided list.

PARAMETER	DESCRIPTION
`region_list`	Optional list of region codes to use. If empty, returns all regions (1–20). TYPE: `list[int]` DEFAULT: `[]`

RETURNS	DESCRIPTION
`list[int]`	List of region codes to process.
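
The fallback rule is small enough to sketch directly. The name `get_region_sketch` is hypothetical, and the sketch uses `None` rather than `[]` as the default, a deliberate difference from the documented signature so that no list object is shared between calls.

```python
def get_region_sketch(region_list=None):
    """Return the given region codes, or codes 1-20 when none are given."""
    if not region_list:
        return list(range(1, 21))
    return region_list


all_regions = get_region_sketch()
some_regions = get_region_sketch([3, 9, 12])
```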

`remove_files(files_path)`

Remove a list of files from the filesystem.

PARAMETER	DESCRIPTION
`files_path`	List of Path objects to delete. TYPE: `list`

Note

Exceptions are not caught: if a file cannot be deleted, the exception propagates to the caller, which is the desirable behavior in ETL workflows.
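
A minimal sketch of this fail-fast behaviour, assuming deletion is done with `Path.unlink` (the actual call used by the module is not shown in this page, and `remove_files_sketch` is a hypothetical name):

```python
from pathlib import Path
import tempfile


def remove_files_sketch(files_path):
    """Delete each path in order; any failure propagates to the caller."""
    for file_path in files_path:
        # No try/except on purpose: a missing or locked file raises here.
        Path(file_path).unlink()


with tempfile.TemporaryDirectory() as tmp:
    targets = [Path(tmp) / "a.csv", Path(tmp) / "b.csv"]
    for target in targets:
        target.write_text("placeholder")

    remove_files_sketch(targets)
    all_removed = not any(target.exists() for target in targets)

    # Deleting the same files again raises instead of passing silently.
    try:
        remove_files_sketch(targets)
        second_call_raised = False
    except FileNotFoundError:
        second_call_raised = True
```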

`unzip_data(input_data, output_folder)`

Decompress a ZIP file into the specified destination folder.

This function opens a ZIP archive and extracts its entire content into the specified folder. If the output folder does not exist, it is created automatically. It serves as an internal component of the ISTAT data download workflow.

PARAMETER	DESCRIPTION
`input_data`	Path to the ZIP file to decompress. TYPE: `Path`
`output_folder`	Folder where the archive content will be extracted. TYPE: `Path`

RETURNS	DESCRIPTION
`Path`	Path to the folder containing the extracted files.

RAISES	DESCRIPTION
`FileNotFoundError`	If the ZIP file does not exist.
`BadZipFile`	If the provided file is not a valid ZIP archive.
`Exception`	For any error during decompression.
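
The extraction step can be sketched with the standard `zipfile` module. This is an illustration under assumptions, not the module's source: `unzip_data_sketch` is a hypothetical name and the sample archive is generated on the fly. `zipfile.ZipFile` itself raises `FileNotFoundError` for a missing file and `zipfile.BadZipFile` for an invalid archive, which matches the exceptions listed above.

```python
from pathlib import Path
import tempfile
import zipfile


def unzip_data_sketch(input_data: Path, output_folder: Path) -> Path:
    """Extract every member of the archive into output_folder."""
    output_folder = Path(output_folder)
    # Create the destination (and any missing parents) before extracting.
    output_folder.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(input_data) as archive:
        archive.extractall(output_folder)
    return output_folder


with tempfile.TemporaryDirectory() as tmp:
    # Build a small throwaway archive to extract.
    archive_path = Path(tmp) / "sample.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("data.csv", "a,b\n1,2\n")

    extracted_to = unzip_data_sketch(archive_path, Path(tmp) / "out" / "nested")
    extracted_names = sorted(p.name for p in extracted_to.iterdir())
    extracted_text = (extracted_to / "data.csv").read_text()
```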

`census_trace(file_path, year, output_path=None)`

Extract the metadata trace record (field names and their descriptions) from the "Metadati" sheet of an Excel file.

This function accesses the sheet named "Metadati" in an Excel file related to census data, extracts the fundamental columns (field name and description), and constructs a pandas DataFrame with an index based on the field name. If an output path is provided, the trace record is also saved in CSV format.

PARAMETER	DESCRIPTION
`file_path`	Path to the Excel file from which to extract metadata. TYPE: `Path`
`year`	Reference year for the census, used to generate the output file name. TYPE: `int`
`output_path`	Path to the folder where the trace record CSV will be saved. If None, a DataFrame is returned directly. TYPE: `Path \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame \| Path`	Path to the generated CSV file if `output_path` is provided, or a pandas DataFrame containing the metadata trace record if `output_path` is None.

RAISES	DESCRIPTION
`FileNotFoundError`	If the specified Excel file does not exist.
`XLRDError`	If an error occurs while opening or reading the Excel file.
`Exception`	For any unexpected errors during parsing or saving.

`read_xls(file_path, census_code, output_path=None)`

Read an Excel file (.xls) and return a DataFrame or save data as CSV.

This function opens an Excel file in .xls format, automatically selects the first useful sheet (excluding any sheet named "Metadati"), extracts the sheet rows, constructs a pandas DataFrame, and sets the column named by `census_code` as the index.

If an output path is specified, the DataFrame is saved in CSV format; otherwise, it is returned directly.

PARAMETER	DESCRIPTION
`file_path`	Path to the Excel file to read. TYPE: `Path`
`census_code`	Name of the column to use as the DataFrame index (e.g., ISTAT municipality code). TYPE: `str`
`output_path`	Path to the folder where the resulting CSV will be saved. If None, the DataFrame is returned without saving. TYPE: `Path \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame \| Path`	A DataFrame containing data read from the Excel file if `output_path` is None, or the path to the saved CSV file if `output_path` is specified.

RAISES	DESCRIPTION
`FileNotFoundError`	If the specified file does not exist.
`XLRDError`	If an error occurs while reading the Excel file.
`Exception`	For any unexpected error during parsing or saving.

`read_xlsx(file_path, output_path=None)`

Read an Excel file (XLSX format) and convert it to a pandas DataFrame.

If `output_path` is specified, the data is saved in CSV format and the path to that file is returned instead.

PARAMETER	DESCRIPTION
`file_path`	Path to the Excel file to read. TYPE: `Path`
`output_path`	Path where the generated CSV file will be saved. If not specified, returns the DataFrame. TYPE: `Path \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`DataFrame \| Path`	A DataFrame if `output_path` is None, otherwise the path to the saved CSV file.

RAISES	DESCRIPTION
`FileNotFoundError`	If the specified Excel file is not found.
`ValueError`	If the Excel file cannot be read correctly.