This readme file describes the various components of the course project (Human Activity Recognition Using Smartphones). The ultimate objective of this project is to process the set of input files and create a tidy data set containing the average of each mean and standard deviation column for each activity and participant.
This package includes 4 main files used to understand and process the raw data:
1. ReadMe.md - this file; an overview of all components submitted as part of the course project
2. CodeBook.md - describes the various aspects of the "tidy" data set created as an output of the R script
3. run_analysis.R - the R script that utilizes the various input files to create the tidy data set output
4. tidy_data.txt - the output tidy data set generated by the run_analysis.R script
Additionally, there are a handful of raw data files which are processed to create the tidy data set:
- X_test.txt - raw data for 561 different measurements for the various subjects
- Y_test.txt - raw data for the activities associated with each of the measurements
- subject_test.txt - data identifying the subject associated with each measurement
- X_train.txt - raw data for 561 different measurements for the various subjects
- Y_train.txt - raw data for the activities associated with each of the measurements
- subject_train.txt - data identifying the subject associated with each measurement
- activity_labels.txt - the descriptions for the numeric activity codes in the Y_test and Y_train files
The raw data describe experiments carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) while wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, 3-axial linear acceleration and 3-axial angular velocity were captured at a constant rate of 50Hz. The experiments were video-recorded so the data could be labeled manually. The obtained dataset was randomly partitioned into two sets, where 70% of the volunteers were selected for generating the training data (X_train, Y_train, subject_train) and 30% the test data (X_test, Y_test, subject_test).
The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain. See 'features_info.txt' for more details.
The run_analysis.R file is the main script executed to process the raw data files. The script takes the raw data files, joins and cleans them, and creates a tidy data set where each row represents the average of the various mean and standard deviation measurements for each participant for each activity.
The run_analysis.R file begins by importing the 6 txt files mentioned above (X_test, Y_test, subject_test, X_train, Y_train, subject_train).
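A minimal sketch of this import step in base R, assuming the files sit under the dataset's standard `test/` and `train/` folders (the paths and variable names here are illustrative):

```r
# Read the six raw files; header = FALSE because they contain
# only values, no column names.
x_test        <- read.table("test/X_test.txt",         header = FALSE)
y_test        <- read.table("test/Y_test.txt",         header = FALSE)
subject_test  <- read.table("test/subject_test.txt",   header = FALSE)
x_train       <- read.table("train/X_train.txt",       header = FALSE)
y_train       <- read.table("train/Y_train.txt",       header = FALSE)
subject_train <- read.table("train/subject_train.txt", header = FALSE)
```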
Once these files are imported, a 7th file containing the 561 column headers is imported, transposed into a data frame, and written out to a temporary CSV file. The list of column headers is then read back into the script, and each header is applied to the X_train.txt and X_test.txt data files.
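The same result can be sketched more directly, assuming the headers live in a two-column `features.txt` file (index, name) as in the standard dataset layout; the temporary-CSV round trip is omitted here for brevity:

```r
# features.txt: column 1 is an index (1-561), column 2 the feature name
features <- read.table("features.txt", header = FALSE,
                       stringsAsFactors = FALSE)
colnames(x_test)  <- features[, 2]
colnames(x_train) <- features[, 2]
```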
The script then takes the train and test data sets, the two activity files, and the two participant files and binds them appropriately (a sketch follows the list below). At this point, there are three main data frames:
- one containing the merged measurement data (561 measurement columns)
- one containing the merged participant data
- one containing the merged activity data
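A sketch of the binding, assuming the data frames from the import step above; the three result names are illustrative:

```r
# Stack the train rows on top of the test rows; the row order just has
# to be consistent across all three frames.
measurements <- rbind(x_train, x_test)             # 561 measurement columns
participants <- rbind(subject_train, subject_test)
activities   <- rbind(y_train, y_test)
```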
Since the activity details in the raw data are numeric codes, the script utilizes the activity_labels.txt file and replaces the codes with the proper descriptions.
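One way to express that replacement, assuming the codes run 1-6 in file order (as in activity_labels.txt in the standard dataset):

```r
# activity_labels.txt: column 1 is the numeric code, column 2 the label
activity_labels <- read.table("activity_labels.txt", header = FALSE,
                              stringsAsFactors = FALSE)
# Use each code as a row index into the label table, e.g. 1 -> "WALKING"
activities[, 1] <- activity_labels[activities[, 1], 2]
```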
The next main task of the script is to create a smaller data frame that contains only the mean and standard deviation measurements. The script binds the activity data to the participant data. It then loops through the measurement data and identifies the columns that have either "-mean()" or "-std()" in their names.
Columns that follow an alternate mean or standard deviation naming convention and do not contain the exact text "-mean()" or "-std()" (for example, "-meanFreq()") are not included in the final data set.
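A sketch of that selection using pattern matching rather than an explicit loop; `measurements` is the merged frame from the binding step above:

```r
# fixed = TRUE matches the literal strings "-mean()" and "-std()",
# so variants such as "-meanFreq()" are excluded automatically.
keep <- grepl("-mean()", colnames(measurements), fixed = TRUE) |
        grepl("-std()",  colnames(measurements), fixed = TRUE)
mean_std <- measurements[, keep]   # 66 of the 561 columns remain
```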
The next step in the script takes the full data set (activity + participant + mean/standard deviation measurements) and computes the average of each measurement column (66 columns). The calculation is grouped by participant and activity (i.e. each participant has 66 averaged measurements for each activity).
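This grouping can be sketched with base R's aggregate(), assuming the frames built above:

```r
# One row per participant/activity pair; each of the 66 measurement
# columns is replaced by its mean over that group.
tidy <- aggregate(mean_std,
                  by  = list(participant = participants[, 1],
                             activity    = activities[, 1]),
                  FUN = mean)
```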
At this point, a tidy data set is created and is output to a txt file.
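The final write can be as simple as the following sketch; the file can then be read back with `read.table("tidy_data.txt", header = TRUE)`:

```r
# row.names = FALSE keeps the file to a single header row plus data,
# in line with the tidy-data tips below.
write.table(tidy, "tidy_data.txt", row.names = FALSE)
```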
Based on the definition of "tidy data" conveyed in Jeff Leek's lecture, the output conforms to the definition:
- Each variable measured should be in one column (this holds true in the tidy_data.txt output)
- Each different observation should be in a different row (this holds true in the tidy_data.txt output)
- There should be one table for each kind of variable (this holds true in the tidy_data.txt output - there is only one table)
- If you have multiple tables, they should be linked through a column (this holds true in the tidy_data.txt output - there is only one table)
In addition to the 4 rules, there were 3 suggested tips from Jeff Leek's lecture which were also adhered to:
- include a row at the top of each file with variable names
- make variable names human readable
- save data in one file
The naming convention used in the table follows these rules (an illustrative renaming sketch follows the list):
- use camelCase
- write out frequency or time where appropriate
- refer to mean as Mean and standard deviation as StdDev
- spell out magnitude and acceleration
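A hedged illustration of how such renaming could be done with gsub(); the exact patterns depend on the script's implementation, and these assume the original feature-name abbreviations (t/f prefixes, "Acc", "Mag"):

```r
# Illustrative renaming only, applied to the tidy frame's column names.
names(tidy) <- gsub("^t",          "time",         names(tidy))
names(tidy) <- gsub("^f",          "frequency",    names(tidy))
names(tidy) <- gsub("-mean\\(\\)", "Mean",         names(tidy))
names(tidy) <- gsub("-std\\(\\)",  "StdDev",       names(tidy))
names(tidy) <- gsub("Acc",         "Acceleration", names(tidy))
names(tidy) <- gsub("Mag",         "Magnitude",    names(tidy))
```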