This pipeline processes raw hospital data to produce a consolidated and cleaned dataset of historic hospitals. The final output is stored in `/output/processed_data/processed_hospitals_combined.csv`. The pipeline is designed to standardize, enrich, and validate hospital data for historical research purposes.

Below is the data dictionary for the final output file, `/output/processed_data/processed_hospitals_combined.csv`:
| Column Name | Description |
|---|---|
| `state` | The U.S. state where the hospital is located. |
| `city` | The city where the hospital is located. |
| `hospital_name` | The primary name of the hospital. |
| `alt_name1` to `alt_name5` | Alternative names or aliases for the hospital. |
| `hospital_type` | The type of hospital (e.g., State Hospitals, Reform Schools). |
| `url` | URL to the hospital's historical page or reference. |
| `current_status` | The current operational status of the hospital (e.g., Active, Closed). |
| `building_style` | The architectural style of the hospital's building(s). |
| `architecture_style` | Specific architectural details, if available. |
| `final_year_opened` | The year the hospital first opened. |
| `final_year_closed` | The year the hospital closed, if applicable. |
| `final_hospital_age` | The total operational age of the hospital in years. |
| `final_number_of_beds` | The maximum number of beds recorded for the hospital. |
| `final_number_of_patients` | The maximum number of patients recorded for the hospital. |
| `incomplete_page_flag` | Flag indicating incomplete or ambiguous data (1 = incomplete, 0 = complete). |
| `kirkbride_flag` | Flag indicating if the hospital follows the Kirkbride architectural plan. |
| `hand_check_flag` | Flag indicating if the record requires manual verification. |
The pipeline processes hospital data using the following core logic:
- **Data Cleaning:**
  - Remove duplicate entries based on key identifiers (e.g., `hospital_name`, `state`, `city`).
  - Standardize missing values and ensure consistent formatting across columns (a cleaning sketch follows this list).
- **Data Enrichment:**
  - Derive additional attributes such as `final_hospital_age`, `final_number_of_beds`, and `final_number_of_patients` based on available data.
  - Identify and flag incomplete or ambiguous records using the `incomplete_page_flag`.
- **Data Validation:**
  - Validate URLs to ensure they are accessible and correctly formatted.
  - Cross-check `current_status` and `final_year_closed` for logical consistency (e.g., a hospital marked as "Active" should not have a `final_year_closed`); a consistency-check sketch follows this list.
- **Final Variable Calculation:** The following functions are used to calculate the final variables. Each function processes specific columns in the dataset to derive the most accurate and reliable values for the final output. (The snippets below assume `pandas` and `numpy` are imported as `pd` and `np`.)
  - `final_year_opened`:
    - **Logic:** The function `get_year_opened(row)` iterates through a prioritized list of columns (`['Opened', 'year_opened_LLM', 'Established', 'Construction Began']`) to find the earliest known year the hospital was operational. It ensures the year is within a valid range (1700–2025) to filter out erroneous data.
    - **Impact:** This ensures that `final_year_opened` reflects the most reliable and earliest available data about when the hospital began operations.

    ```python
    def get_year_opened(row):
        # Return the first valid year (1700-2025) found, checking columns in priority order.
        for col in ['Opened', 'year_opened_LLM', 'Established', 'Construction Began']:
            if pd.notnull(row[col]) and 1700 <= row[col] <= 2025:
                return row[col]
        return np.nan
    ```
  - `final_year_closed`:
    - **Logic:** The function `get_year_closed(row)` checks columns like `['Closed', 'year_closed_LLM']` to determine the latest known year the hospital ceased operations. It validates the year to ensure it falls within the range of 1700–2025.
    - **Impact:** This ensures that `final_year_closed` accurately represents the year the hospital stopped functioning, or remains blank if the hospital is still active.

    ```python
    def get_year_closed(row):
        # Return the first valid closing year (1700-2025) found in the prioritized columns.
        for col in ['Closed', 'year_closed_LLM']:
            if pd.notnull(row[col]) and 1700 <= row[col] <= 2025:
                return row[col]
        return np.nan
    ```
  - `final_hospital_age`:
    - **Logic:** This is calculated as `final_year_closed - final_year_opened` if the hospital is closed, or `current_year - final_year_opened` if the hospital is still active (a sketch of this calculation appears after this list).
    - **Impact:** This provides a clear measure of the hospital's operational lifespan, whether it is still active or has been closed.
  - `final_number_of_beds`:
    - **Logic:** The function `get_number_of_beds(row)` uses the column `number_of_beds_LLM` to determine the maximum number of beds recorded for the hospital. It ensures the value is valid (greater than 0) before returning it.
    - **Impact:** This ensures that `final_number_of_beds` reflects the most reliable data about the hospital's capacity.

    ```python
    def get_number_of_beds(row):
        # Keep the recorded bed count only if it is present and positive.
        beds = row['number_of_beds_LLM']
        return beds if pd.notnull(beds) and beds > 0 else np.nan
    ```
  - `final_number_of_patients`:
    - **Logic:** The function `get_number_of_patients(row)` iterates through a prioritized list of columns (`['number_of_patients_LLM', 'peak_patient_population_LLM', 'Peak Patient Population']`) to find the maximum number of patients recorded for the hospital. It ensures the value is valid (greater than 0) before returning it.
    - **Impact:** This ensures that `final_number_of_patients` captures the peak patient population, providing insight into the hospital's historical usage.

    ```python
    def get_number_of_patients(row):
        # Return the first positive patient count found in the prioritized columns.
        for col in ['number_of_patients_LLM', 'peak_patient_population_LLM', 'Peak Patient Population']:
            if pd.notnull(row[col]) and row[col] > 0:
                return row[col]
        return np.nan
    ```
- **Flagging:**
  - `incomplete_page_flag`: Set to `1` if the hospital's data is incomplete or requires manual review.
  - `kirkbride_flag`: Set to `1` if the hospital follows the Kirkbride architectural plan.
  - `hand_check_flag`: Set to `1` if the record requires manual verification.
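As a rough sketch of the cleaning step described above, deduplication and missing-value standardization could look like the following. The DataFrame name `df`, the helper name `clean_hospitals`, and the set of missing-value tokens are illustrative assumptions, not taken from the pipeline scripts.

```python
import numpy as np
import pandas as pd

def clean_hospitals(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: standardize missing values and drop duplicate hospital records."""
    df = df.copy()

    # Normalize the key identifier columns so near-duplicates compare equal.
    for col in ['hospital_name', 'state', 'city']:
        df[col] = df[col].astype('string').str.strip()

    # Standardize common missing-value tokens to NaN (token list is an assumption).
    df = df.replace({'': np.nan, 'N/A': np.nan, 'Unknown': np.nan, 'unknown': np.nan})

    # Remove duplicate entries based on the key identifiers.
    return df.drop_duplicates(subset=['hospital_name', 'state', 'city'], keep='first')
```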
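The validation step's consistency checks might be expressed along these lines. The rules shown (a basic URL-format check and the Active/closed-year cross-check) and the helper name `validate_hospitals` are illustrative rather than the pipeline's actual implementation; live accessibility checks over HTTP are omitted.

```python
import pandas as pd

def validate_hospitals(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: route malformed or logically inconsistent records to manual review."""
    df = df.copy()

    # A URL should at least start with http:// or https://.
    bad_url = df['url'].notna() & ~df['url'].astype('string').str.match(r'https?://', na=False)

    # A hospital marked "Active" should not carry a final_year_closed.
    inconsistent_status = df['current_status'].eq('Active') & df['final_year_closed'].notna()

    # Flag offending rows for hand checking.
    df.loc[bad_url | inconsistent_status, 'hand_check_flag'] = 1
    return df
```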
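No function is listed above for `final_hospital_age`; a minimal sketch of the calculation it describes could look like this, assuming `final_year_opened` and `final_year_closed` have already been derived and treating a missing closing year as meaning the hospital is still active:

```python
from datetime import date

import numpy as np
import pandas as pd

def get_hospital_age(row):
    # Without a known opening year, no age can be derived.
    if pd.isnull(row['final_year_opened']):
        return np.nan
    # Closed hospitals: closing year minus opening year.
    if pd.notnull(row['final_year_closed']):
        return row['final_year_closed'] - row['final_year_opened']
    # Still-active hospitals: current year minus opening year.
    return date.today().year - row['final_year_opened']
```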
To run the pipeline:

- Place raw hospital data files in the appropriate input directory.
- Run the pipeline script to process the data.
- Review the output file located at `/output/processed_data/processed_hospitals_combined.csv`.
- Use the flags (`incomplete_page_flag`, `hand_check_flag`) to identify records requiring manual review (see the filtering example after this list).
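As a small usage example, the review flags can be applied as simple filters on the output file (column names as in the data dictionary above):

```python
import pandas as pd

# Load the combined output produced by the pipeline.
df = pd.read_csv('/output/processed_data/processed_hospitals_combined.csv')

# Select records flagged as incomplete or needing a manual check.
needs_review = df[(df['incomplete_page_flag'] == 1) | (df['hand_check_flag'] == 1)]
print(f"{len(needs_review)} of {len(df)} records need manual review")
```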
Additional notes:

- The pipeline assumes that all input data is in CSV format and adheres to a predefined schema.
- Any discrepancies or missing data should be flagged for manual review.
- For further customization or debugging, refer to the pipeline scripts in the `/scripts` directory.
