In the practice, we will filter out bad data, fill in missing values, transform data into a suitable format, and draw appropriate chart.
Our goal in this analysis is to explore the salaries in San Francisco overtime. What we expected to discover are:
- The salary will raise as time goes by.
- The central tendency and variability will increase years over years
- The range of salary will increase years over years.
The data we have chosen to use is titled “SF Salaries” and was found on kaggle.com. There are over 150,000 dataset containing the employee name, job title, and the compensation for employees in San Francisco city from 2012 to 2017.
We want to know tendency of salary in San Francisco during this period. Is there any increase of central tendency or how is the range of maximum and minimum salary.
The library we currently envision using are pandas, numpy, statistics, matplotlib and seaborn. Individual data points consist of salaries postings containing employee name, position title, base pay, over time pay, other pay, benefits, total pay, total pay benefits, and year. Descriptive statistics will be used to turn the data into different quantity distribution containing the percentage we desire. Histogram, box plot, and line chart will be used to build the salary scheme.
• Drop some useless column(note, agency and status):
- Drop status marked as PT(part time) row.
- salary.drop([“Note”, “Agency”] ,axis = 1) • Impute missing values(Nan)
- Detect the variables containing null value: salary.isnull().sum()
- Fill the missing values: salary.fillna(value = {given a dictionary mapping the missing value to specific value}) • Remove negative values
- We found there is one row of data showed negative benefits. The value is abnormal and will affect total value and mean.
- Drop rows with negative value
• Compute the salary statistics(mean, median, mode, minimum, maximum, range, 25% quantile, 50% quantile, 75% quantile )
- Apply sorted method to compute minimum and maximum and range from minimum and maximum
- Import statistics library and apply mean, median and mode methods
- import numpy library and apply quantile method
*mean: the average value *median: the middle value after sorted *mode: the most frequently occurring value
• Compute the variance and standard deviation of salary
- Import statistics library and apply pvariance and pstdev methods
Attach the picture, and why we want to use it • Histogram of salaries: show the distribution total pay and benefits. • Bar chart of people in each year. • Boxplot of salaries in annual basis: take the totalPayBenefits as a salary
From this analytics, we can find that the salary in San Francisco is increasing slowly from 2012~2017. By observing mean, median value of each year, the numbers are similar in 2012, 2013, 2016. On the other hand, the value in 2014, 2015, and 2017 are higher. The value of standard deviation can also group into two. It showed that 2014, 2015 have different values lower than others. In the contrast, the range of salary went to a different side to standard deviation and variable. In conclusion, we find the difference of salary from year to year is slight. As this result, we can suggest the government to increase the minimum wage of salary in order to improve welfare of the whole society.