Utils Module
The Utils module collects a series of utility functions for file management and conversion, directory creation, and content extraction from compressed archives. These functions are essential to support census and geographic data management processes.
census_folder(output_data_folder, year)
Create (if necessary) the folder dedicated to census data for a specific year.
This function generates a directory named census_<year> within the
output_data_folder and creates it if it doesn't already exist. This folder
represents the root of downloaded and processed data for a specific census,
maintaining an organized and consistent structure across different years.
| PARAMETER | DESCRIPTION |
|---|---|
output_data_folder
|
Main folder under which to create the census directory.
TYPE:
|
year
|
Census year to organize (e.g., 1991, 2001, 2011, 2021).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Path
|
Complete path to the created or existing census folder. |
| RAISES | DESCRIPTION |
|---|---|
Exception
|
If the folder cannot be created due to permissions or invalid paths. |
check_encoding(data)
Determine file encoding by reading an initial sample.
This function opens the file in binary mode, reads the first 100,000 bytes,
and uses the chardet library to estimate the text encoding. If chardet
identifies the encoding as 'ascii', it is converted to 'latin1' to ensure
greater compatibility, since many administrative and geographic files may contain
extended characters while being formally interpreted as ASCII.
| PARAMETER | DESCRIPTION |
|---|---|
data
|
Path to the file whose encoding should be determined.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
str
|
The detected encoding. If |
str
|
replaced with |
Note
chardet provides a heuristic estimate of encoding and is not infallible.
Reading is limited to the first 100,000 bytes to improve performance.
'latin1' is a safe choice to avoid errors on files with accented
characters or ambiguous encodings typical of ISTAT administrative datasets.
csv_from_excel(data, output_path, metadata=False)
Convert an Excel file (.xls) to CSV format.
This function reads a legacy Excel file (.xls) using xlrd and converts
the content of a sheet to a CSV file. If metadata=True, the sheet named
"Metadati" is converted; otherwise, the first available sheet is converted,
excluding "Metadati" if present.
The conversion preserves row order and writes all fields using csv.QUOTE_ALL
to ensure compatibility and preserve delimiters, strings with spaces, or
special characters.
| PARAMETER | DESCRIPTION |
|---|---|
data
|
Path to the Excel file to convert.
TYPE:
|
output_path
|
Path where the output CSV file will be saved.
TYPE:
|
metadata
|
If True, converts the "Metadati" sheet. If False, converts the first available sheet excluding "Metadati". Defaults to False.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Path
|
Path to the generated CSV file. |
| RAISES | DESCRIPTION |
|---|---|
FileNotFoundError
|
If the specified Excel file does not exist. |
XLRDError
|
If the file cannot be read or the requested sheet does not exist. |
Exception
|
For any other errors during conversion or writing. |
Note
The conversion uses xlrd, so the file must be in .xls format
(Excel legacy). .xlsx files are not supported by xlrd.
The CSV is saved in UTF-8 encoding.
The function uses tqdm to display a progress bar.
get_census_dictionary(census_year, region_list=[])
Generate official ISTAT URLs for census data, geodata, and administrative boundaries.
This function dynamically constructs download paths based on the census year and the list of desired regions. It handles structural differences between previous censuses (1991–2011) and the 2021 census.
| PARAMETER | DESCRIPTION |
|---|---|
census_year
|
Census year (1991, 2001, 2011, or 2021).
TYPE:
|
region_list
|
Optional list of regions for which to generate geodata URLs. If empty, uses regions 1–20.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
dict
|
Dictionary containing the URLs:
- |
| RAISES | DESCRIPTION |
|---|---|
ValueError
|
If the provided year is not supported. |
get_region(region_list=[])
Return the list of regions to use for geodata download.
If no list is provided, returns the complete list of 20 Italian regions (codes 1–20). Otherwise, returns the provided list.
| PARAMETER | DESCRIPTION |
|---|---|
region_list
|
Optional list of region codes to use. If empty, returns all regions (1–20).
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
list[int]
|
List of region codes to process. |
remove_files(files_path)
Remove a list of files from the filesystem.
| PARAMETER | DESCRIPTION |
|---|---|
files_path
|
List of Path objects to delete.
TYPE:
|
Note
Exceptions are not caught: if a file cannot be deleted, the error emerges explicitly (desirable behavior in ETL workflows).
unzip_data(input_data, output_folder)
Decompress a ZIP file into the specified destination folder.
This function opens a ZIP archive and extracts its entire content into the specified folder. If the output folder does not exist, it is created automatically. Functions as an internal component of the ISTAT data download workflow.
| PARAMETER | DESCRIPTION |
|---|---|
input_data
|
Path to the ZIP file to decompress.
TYPE:
|
output_folder
|
Folder where the archive content will be extracted.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
Path
|
Path to the folder containing the extracted files. |
| RAISES | DESCRIPTION |
|---|---|
FileNotFoundError
|
If the ZIP file does not exist. |
BadZipFile
|
If the provided file is not a valid ZIP archive. |
Exception
|
For any error during decompression. |
census_trace(file_path, year, output_path=None)
Extract metadata trace record from the "Metadati" sheet of an Excel file.
This function accesses the sheet named "Metadati" in an Excel file related to census data, extracts the fundamental columns (field name and description), and constructs a pandas DataFrame with an index based on the field name. If an output path is provided, the trace record is also saved in CSV format.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the Excel file from which to extract metadata.
TYPE:
|
year
|
Reference year for the census, used to generate the output file name.
TYPE:
|
output_path
|
Path to the folder where the trace record CSV will be saved. If None, a DataFrame is returned directly.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame | Path
|
Path to the generated CSV file if |
DataFrame | Path
|
DataFrame containing the metadata trace record if |
| RAISES | DESCRIPTION |
|---|---|
FileNotFoundError
|
If the specified Excel file does not exist. |
XLRDError
|
If an error occurs while opening or reading the Excel file. |
Exception
|
For any unexpected errors during parsing or saving. |
read_xls(file_path, census_code, output_path=None)
Read an Excel file (.xls) and return a DataFrame or save data as CSV.
This function opens an Excel file in .xls format, automatically selects
the first useful sheet (excluding any sheets named "Metadati"), extracts
the sheet rows, constructs a pandas DataFrame, and sets as index the column
corresponding to the provided census code.
If an output path is specified, the DataFrame is saved in CSV format; otherwise, it is returned directly.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the Excel file to read.
TYPE:
|
census_code
|
Name of the column to use as the DataFrame index (e.g., ISTAT municipality code).
TYPE:
|
output_path
|
Path to the folder where the resulting CSV will be saved. If None, the DataFrame is returned without saving.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame | Path
|
A DataFrame containing data read from the Excel file if |
DataFrame | Path
|
is None, or the path to the saved CSV file if |
| RAISES | DESCRIPTION |
|---|---|
FileNotFoundError
|
If the specified file does not exist. |
XLRDError
|
If an error occurs while reading the Excel file. |
Exception
|
For any unexpected error during parsing or saving. |
read_xlsx(file_path, output_path=None)
Read an Excel file (XLSX format) and convert to a Pandas DataFrame.
If specified, saves the data in CSV format.
| PARAMETER | DESCRIPTION |
|---|---|
file_path
|
Path to the Excel file to read.
TYPE:
|
output_path
|
Path where the generated CSV file will be saved. If not specified, returns the DataFrame.
TYPE:
|
| RETURNS | DESCRIPTION |
|---|---|
DataFrame | Path
|
A DataFrame if |
DataFrame | Path
|
the saved CSV file. |
| RAISES | DESCRIPTION |
|---|---|
FileNotFoundError
|
If the specified Excel file is not found. |
ValueError
|
If the Excel file cannot be read correctly. |