- Business Background
- Business Problem
- Business Questions
- Data cleaning and validation
- Exploratory Data Analysis
- Findings
- Recommendation
Vivendo is a fast food chain in Brazil with over 200 outlets. Customers often claim compensation from the company for food poisoning, and these claims are processed by the legal team, which has offices in four locations. The legal team wants to improve how long it takes to reply to customers and close claims, and the head of the legal department wants a report on how the locations differ in the time it takes to close claims.
The legal department is dealing with:
- Inefficient customer response times.
- Unanalyzed geographical differences in claim processing time.
- Inadequate insights into claim closure duration.
- How many customers are there in each location?
- Is there any relationship between Locations and the time it takes to close claims?
- What does the distribution of time to close claims look like?
# import all the required libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.style as style
style.use('ggplot')
plt.rcParams['figure.figsize'] = (10, 6)  # set a default figure size
sns.set_palette(['gray'])
# Load the dataset
df = pd.read_csv('food_claims.csv')
# Check all info and description
df.head()
df.info()
df.describe()
df.boxplot()
Output: Data head
Output: Dataset info
Output: Dataset statistic description
Output: Dataset numeric statistic data points
# Check for null values
null_counts = df.isnull().sum()
print(null_counts)
- Two fields found with null values.
Output: null count
Data Validation: cleaning and validating each column.
1. Claim_ID
# checking if there's any duplicates
has_duplicates = df['claim_id'].duplicated().any()
print(has_duplicates)
# checking for the number of duplicates
df['claim_id'].duplicated().sum()
# checking for number of unique ID
len(df['claim_id'].unique())
- 2000 unique records. Output: no duplicates and 2000 unique claim IDs
2. Time_to_close
df['time_to_close'].info()
# checking for missing values
df['time_to_close'].isnull().sum()
df['time_to_close'].describe()
check = df['time_to_close'] >= 400
filtered_data = df[check]
# Print the filtered data
print(filtered_data)
- No missing values.
Output: time_to_close description
Output: time_to_close greater than 400
3. Claim_amount
The claim_amount column has the Brazilian real symbol 'R$' before the amount, which makes the column a string data type. I have to remove the 'R$' prefix and convert the column to a float data type.
# Remove the 'R$ ' prefix from 'claim_amount'
df['claim_amount'] = df['claim_amount'].str.replace('R$ ', '', regex=False)
# Convert claim_amount to a float data type
df['claim_amount'] = df['claim_amount'].astype(float)
# check the data info
df['claim_amount'].info()
- The claim_amount column was cleaned and converted to a float data type
Output: claim_amount converted to float
4. Amount_paid: This column contains the claim amounts paid out to customers. I want to check for inconsistencies, null values, or errors.
# 4. Amount_paid
# view amount_paid description
df['amount_paid'].describe()
# check the data info
df['amount_paid'].info()
# check how many amounts equal 0
count_am = (df['amount_paid'] == 0).sum()
print(count_am)
# check for nulls
null_counts_am = df['amount_paid'].isnull().sum()
print(null_counts_am)
Output: 36 null values
There were 36 null values; after carefully observing them, I decided it was best to replace the null values with the median of amount_paid.
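To support that choice, one quick check (a minimal sketch using the df loaded above) is to compare the mean and median of the observed amount_paid values; a noticeable gap between them suggests skew, which makes the median the more robust fill value.
# Compare the mean and median of the non-missing amount_paid values
observed = df['amount_paid'].dropna()
print('mean:  ', round(observed.mean(), 2))
print('median:', round(observed.median(), 2))
print('skew:  ', round(observed.skew(), 2))  # positive skew favours the median as a fill value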
Replace null values with the median amount_paid value:
# Replace null values with the median amount_paid value
median_amount_paid = df['amount_paid'].median()
df['amount_paid'] = df['amount_paid'].fillna(median_amount_paid)
Check for nulls
# check for nulls
null_counts_am = df['amount_paid'].isnull().sum()
print(null_counts_am)
df.info()
Output: all null values replaced
5. Location:
# check for unique values
unique_location = df['location'].unique()
print(unique_location)
# check for null values
null_counts_location = df['location'].isnull().sum()
print(null_counts_location)
# Grouping by location to observe
location = df.groupby('location')['location'].count()
print(location)
Output: Location names are in all caps and need to be normalized to title case.
# Normalize location names to title case
df['location'] = df['location'].str.title()
Output: Location normalized to title case
6. Individuals on claim: Checking for null values and column description.
# check for null values
individuals_on_claim = df['individuals_on_claim'].isnull().sum()
print(individuals_on_claim)
# view individuals_on_claim description
df['individuals_on_claim'].describe()
Output: No null values, 15 unique values and no cleaning needed.
7. Linked_cases: The linked_cases column is supposed to contain only two unique values, True and False.
# check the unique linked_case
unique_linked_cases = df['linked_cases'].unique()
print(unique_linked_cases)
# count the null values in linked_cases
null_counts_linked_cases = df['linked_cases'].isnull().sum()
print(null_counts_linked_cases)
Output: Linked cases has 26 missing values.
Since there are 26 missing values, I replaced them with False.
# Fill null values in linked_cases with False
df['linked_cases'] = df['linked_cases'].fillna(False)
# Describe
df['linked_cases'].describe()
Output: Missing values replaced
8. Cause
# 8 cause
# check unique cause
unique_cause = df['cause'].unique()
print(unique_cause)
# check null values
null_counts_cause = df['cause'].isnull().sum()
print(null_counts_cause)
Output: unique cause values and null count
The cause column is meant to have three unique values. The next step is to clean it up by renaming the stray ' Meat' and 'VEGETABLES' entries so they merge with the existing categories, and then capitalize the column.
# Group cause categories
group = df.groupby('cause')['cause'].count()
print(group)
Output: claim counts per cause category
Replacing ' Meat' with 'meat' and 'VEGETABLES' with 'vegetable':
# Replace ' Meat' with 'meat' and 'VEGETABLES' with 'vegetable'
df['cause'] = df['cause'].replace({' Meat': 'meat', 'VEGETABLES': 'vegetable'})
# Group cause categories to check value replacement
group = df.groupby('cause')['cause'].count()
print(group)
# Capitalize cause
df['cause'] = df['cause'].str.capitalize()
Output: Cause cleaned and capitalised.
View cleaned data head
# View cleaned data head
df.head()
Claim ID: 2000 unique values match the description given. There are no missing values. No changes were made to this column.
Time to close: The values of this column range from 76 to 518. There were no missing values. No changes were made to this column.
Claim amount: The values of this column were all rounded to 2 decimal places, ranging from 1637.94 to 76106.80. There were no missing values; the 'R$' Brazilian currency prefix was removed, and the data type was converted to float.
Amount paid: The values of this column were all rounded to 2 decimal places ranging from 1516.72 to 52498.75, which is consistent with the description given. 36 values were missing. The missing values were replaced with the median value of the remaining data, which was 20105.69.
Location: This column had four unique categories, 'RECIFE', 'FORTALEZA', 'SAO LUIS', and 'NATAL', that match those in the description. There were no missing values; the values were normalized to title case using the string title method.
Individuals on claim: The values of this column range from 1 to 15 and match those in the description. There were no missing values, and no changes were made to this column.
Linked cases: The values in this column were either True, False, or missing. There were 26 missing values, all of which were replaced with False.
Cause: This column had three unique categories, 'meat', 'unknown', and 'vegetable', that match those in the description, but also two additional categories that do not match it, ' Meat' and 'VEGETABLES'. Both were renamed to match the description, and the column was then capitalized.
The dataset had 2000 rows and 8 columns before cleaning, and after cleaning and validation it still has 2000 rows and 8 columns.
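As a final check, these properties can be verified programmatically. A minimal sketch, assuming the cleaning steps above were applied to df exactly as described:
# Sanity checks on the cleaned dataset
assert df.shape == (2000, 8)                      # row and column counts unchanged
assert df['claim_id'].is_unique                   # claim IDs are still unique
assert df.isnull().sum().sum() == 0               # no missing values remain
assert set(df['location'].unique()) == {'Recife', 'Fortaleza', 'Sao Luis', 'Natal'}
assert set(df['cause'].unique()) == {'Meat', 'Vegetable', 'Unknown'}
print('All validation checks passed')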
I will now create visualizations to answer the business questions.
# Number of Claims in each location
value_counts = df['location'].value_counts()
# Adjust the figure size
plt.figure(figsize=(10, 6))
# Set the color palette and linestyle for the grid lines (before plotting)
sns.set_palette(['dimgray'])
sns.set_style("whitegrid", {'axes.grid': True, 'grid.linestyle': '-', 'grid.color': '#F0F0F0'})
# Create the bar chart
plt.bar(value_counts.index, value_counts.values)
# Label each bar with its count
for i, value in enumerate(value_counts.values):
    plt.text(i, value, str(value), ha='center', va='bottom')
# Customize the chart
plt.xlabel('Location')
plt.ylabel('Count / Number of Claims')
plt.title('Number of Claims in each location')
# Display the chart
plt.show()
Output: Number of claims in each location
There are four location categories in this data. The location with the most claims is Recife, followed by Sao Luis, which accounts for roughly a quarter of all claims.
The categories are unbalanced, with most observations coming from either Recife or Sao Luis. The legal team should focus on Recife and Sao Luis, as these are the locations with the most claims. Recife: 44.25%, Sao Luis: 25.85%, Fortaleza: 15.55%, Natal: 14.35%.
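The percentage shares quoted above can be reproduced directly from the value counts, for example:
# Claims per location as counts and percentage shares
counts = df['location'].value_counts()
shares = (df['location'].value_counts(normalize=True) * 100).round(2)
print(pd.concat([counts.rename('claims'), shares.rename('percent')], axis=1))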
To find the relationship between location and time to close, I first create a visualization that describes the distribution of time to close across all claims.
# graph size
plt.figure(figsize=(10, 6))
# Set the color palette and linestyle for the grid lines (before plotting)
sns.set_palette(['dimgray'])
sns.set_style("whitegrid", {'axes.grid': True, 'grid.linestyle': '-', 'grid.color': '#F0F0F0'})
# Create a histogram of time_to_close
sns.histplot(df['time_to_close'], bins=30)
# Customize the chart
plt.ylabel('Count')
plt.xlabel('Time to close')
plt.title('Distribution of Time to Close')
# Display chart
plt.show()
Output: Distribution of Time to Close
At first glance the graph looks roughly symmetrical, with most data clustered around the peak. Looking more closely at the time it takes to close claims, however, the distribution is right-skewed: the majority of claims fall within the range of about 173 to 188 days, while outliers mark rare instances of claims taking over 300 days to close.
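A few summary statistics back up this reading of the histogram (a quick check using the same df):
# Central tendency, spread and skewness of time_to_close
tt = df['time_to_close']
print('median:', tt.median())
print('IQR:   ', tt.quantile(0.75) - tt.quantile(0.25))
print('skew:  ', round(tt.skew(), 2))            # > 0 indicates a right-skewed tail
print('claims taking over 300 days:', (tt > 300).sum())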
Based on this analysis, the legal team can establish targeted objectives and performance metrics aimed at reducing claim closure time. By considering the typical timeframe within which the majority of claims are resolved, the team can set realistic goals and develop strategies to streamline and expedite the claims closure process.
Here I create a visualization that shows the relationship between time to close and location.
# Adjust the figure size as needed
plt.figure(figsize=(9, 6))
# Create a box plot to visualize the distribution of 'time_to_close' over 'location'
sns.boxplot(data=df, x='location', y='time_to_close', color='gray')
plt.xlabel('Location')
plt.ylabel('Time to close')
plt.title('Range of Time to Close by Location')
# Display the plot
plt.show()
Output: Distribution of time to close by location
We can examine two variables to explore the impact of office location on claim closure time. This analysis provides valuable insights for the legal team regarding the distribution of observations across different office locations.
From the graph, it is evident that all office locations exhibit similar interquartile ranges. The mean slightly surpasses the median for each location, indicating the presence of outliers, yet both the mean and median remain consistent across locations, ranging from about 178 to 180 days.
Notably, the Recife office stands out with the highest number of outliers, representing cases that took between roughly 300 and 400 days to close. Sao Luis also exhibits a few cases exceeding 400 days, alongside outliers taking over 300 days. Fortaleza and Natal, on the other hand, have fewer outliers.
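These per-location differences can be quantified with a groupby summary and the standard 1.5 * IQR box-plot rule for counting outliers; a sketch:
# Mean, median and spread of time_to_close per location
print(df.groupby('location')['time_to_close'].agg(['mean', 'median', 'std']).round(1))

# Count outliers per location using the same 1.5 * IQR rule the box plot uses
def count_outliers(s):
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()

print(df.groupby('location')['time_to_close'].apply(count_outliers))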
Given these observations, the legal team can prioritize Recife and Sao Luis office locations when formulating solutions to enhance the claims closure process.
Additionally, further investigations can be conducted to determine any potential relationships between variables such as claim amount, individuals involved in the claim, and the duration of the claims closure process.
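As a starting point for that follow-up analysis, the pairwise correlations between the numeric fields could be inspected, for example:
# Correlation between time to close, claim amount and individuals on claim
print(df[['time_to_close', 'claim_amount', 'individuals_on_claim']].corr().round(2))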
Thank you for taking the time to read through this project. Kindly drop comments with your thoughts and suggestions.
By Michael Olaniyi Jeremiah