diff --git a/Project2-Character-Recognition/README.md b/Project2-Character-Recognition/README.md
index 4503fac..a2d3199 100644
--- a/Project2-Character-Recognition/README.md
+++ b/Project2-Character-Recognition/README.md
@@ -3,12 +3,162 @@ CUDA Character Recognition
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Author : Kushagra
+  - [LinkedIn](https://www.linkedin.com/in/kushagragoel/)
+* Tested on: Windows 10 Education, Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz 16GB, NVIDIA Quadro P1000 @ 4GB (Moore 100B Lab)
-### (TODO: Your README)
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+____________________________________________________________________________________
+
+## Breaking the Ice
+![](img/PrettyOutput.jpg)
+
+## Table of Contents
+1. [Introduction](#intro)
+2. [What is a Multi-Layer Perceptron](#mlp)
+2.1. [Forward](#forward)
+2.2. [Backward](#backward)
+3. [Implementation](#impl)
+4. [Performance Analysis](#perform)
+5. [Humble Brag](#brag)
+6. [References](#ref)
+
+## Introduction
+In this project, we create a generic multi-layer perceptron from scratch in CUDA. We then train it on 3 different datasets:
+* MNIST
+* A custom dataset of uppercase and lowercase letters
+* XOR
+
+## What is a Multi-Layer Perceptron
+To understand what a multi-layer perceptron is, we start by looking at what a perceptron is.
The following image gives an idea of how a perceptron functions and the motivation behind it:
+
+Neuron | Perceptron
+:-------------------------:|:-------------------------:
+![](img/neuron.png) | ![](img/neuron_model.jpeg)
+
+And with several such perceptrons, we make a multi-layer perceptron as depicted here:
+![](img/MNISTmlp.png)
+
+### Forward Pass
+A forward pass goes from inputs to outputs in an MLP by computing the value at each intermediate layer and then multiplying it with the layer's weights to get the inputs for the next intermediate layer.
+
+### Backward Pass
+In this step, we calculate how our trainable parameters affect the loss and adjust them accordingly. This is more popularly known as backpropagation and is essentially an application of the chain rule from calculus.
+
+A depiction of the forward pass and the backward pass:
+![](img/partial_derivative_notations.png)
+
+## Implementation
+On to the fun stuff now. I implemented this project trying my best to keep the MLP architecture as generic as possible. This design gives us:
+* A variable number of hidden layers
+* Variable sizes of hidden layers
+* Variable batch sizes for faster training
+
+All while encapsulating as much of the implementation detail from the user as possible. The following class definition explains how it's done.
+### MultiLayerPerceptron
+```
+class MultiLayerPerceptron {
+	std::vector<FullyConnectedLayer *> layers;
+	int batchDim;
+public:
+	MultiLayerPerceptron(int inputDim, int numHiddenLayers, int *hiddenDim, int outputDim, int batchDim);
+	void forward(float *input, float *output, bool test = false);
+	void backward(float *output, float *predicted, float learningRate);
+	float loss(float *label, float *predicted);
+};
+```
+We see here that the MultiLayerPerceptron has a vector of FullyConnectedLayers, which is instantiated using the hiddenDim array.
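As a reference for what each layer computes, here is a minimal CPU sketch (plain C++, with a hypothetical helper name and made-up dimensions, not the project's CUDA code itself) of a single fully-connected layer's forward pass — a row-major matrix product followed by ReLU:

```cpp
#include <cassert>
#include <vector>

// CPU sketch of one hidden layer's forward pass:
// (batchDim x inputDim) * (inputDim x outputDim) -> (batchDim x outputDim),
// row-major, followed by ReLU on every output entry.
std::vector<float> layerForwardCPU(const std::vector<float> &input,
                                   const std::vector<float> &weight,
                                   int batchDim, int inputDim, int outputDim) {
    std::vector<float> output(batchDim * outputDim);
    for (int row = 0; row < batchDim; row++) {
        for (int col = 0; col < outputDim; col++) {
            float sum = 0.0f;
            for (int i = 0; i < inputDim; i++) {
                sum += input[row * inputDim + i] * weight[i * outputDim + col];
            }
            output[row * outputDim + col] = sum > 0.0f ? sum : 0.0f; // ReLU
        }
    }
    return output;
}
```

On the GPU, the two outer loops disappear: each thread computes one `output[row * outputDim + col]` entry.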
The forward and backward methods perform the operations we described above.
+The MultiLayerPerceptron takes input from the user, iterates through the layers, calls their respective forward and backward methods, and finally returns the prediction.
+As long as a class extends FullyConnectedLayer and implements the forward and backward methods, we can add it to our MultiLayerPerceptron.
+
+### FullyConnectedLayer
+```
+class FullyConnectedLayer {
+	float *weight = NULL;
+	float *inputs = NULL;
+	int inputDim;
+	int batchDim;
+	int outputDim;
+	bool lastLayer;
+
+public:
+	FullyConnectedLayer(int inputDim, int outputDim, int batchDim, bool lastLayer);
+	void forward(float *input, float *output, bool test = false);
+	void backward(float learningRate, float *incomingGradient, float *outgoingGradient);
+	int getInputDim();
+	int getOutputDim();
+};
+```
+This class represents each hidden layer in the multi-layer perceptron. The forward and backward methods here contain the core logic to calculate the hidden values and, finally, the output.
+By setting lastLayer to true, we make the last layer use softmax to give outputs as probabilities; otherwise each layer uses ReLU as its activation.
+One thing absent here is biases (although the first layer can handle biases if we append a 1 to each sample). In our experimentation, biases in the input layer were sufficient to give us good results.
+
+### Isn't this too complicated?
+Yes, but actually no. We are calculating the gradients of each layer in a very clever and efficient fashion.
+The magic is in the FullyConnectedLayer implementation. Each layer receives a partial gradient from the next layer, which it uses to calculate the gradients for its own weights and also the information it should pass along to the previous layer.
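In symbols (a sketch of the chain rule the backward method applies — $X$ is the layer input, $W$ its weights, $Y = XW$ the pre-activation output, and $G = \partial L / \partial Y$ the partial gradient received from the next layer):

$$
\frac{\partial L}{\partial W} = X^\top G,
\qquad
\frac{\partial L}{\partial X} = \left(G\, W^\top\right) \odot \mathbf{1}[X > 0],
\qquad
W \leftarrow W - \eta\, \frac{\partial L}{\partial W}
$$

where $\odot$ is the element-wise (Hadamard) product, $\mathbf{1}[X > 0]$ is the ReLU derivative, and $\eta$ is the learning rate; $\partial L / \partial X$ is exactly the partial gradient handed to the previous layer.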
The following image will provide a better idea of how this works:
+
+PyTorch Autograd
+
+And if someone is curious about the maths behind it, don't worry, I've got you covered. With a little bit of maths, it's not hard to see that the current layer will receive the derivative of the loss w.r.t. its output, and that the derivative of the loss w.r.t. the input of the current layer needs to be passed to the previous layer.
+
+If you can read this, you must be a genius or a doctor
+
+## Performance Analysis
+Let's look at how our implementation performs on the datasets:
+
+### XOR
+
+### Custom Dataset
+
+### MNIST
+
+So the loss goes down quite smoothly. This means that our implementation works really well.
+
+### Observations
+* It was observed that XOR would not train if the size of the hidden layer was less than 5. With 5 or more hidden units, the loss keeps decreasing as long as we train. This was expected, as training keeps pushing the probabilities towards 1 without ever actually reaching it, due to the softmax.
+* The character recognition gives 100% accuracy with just 1 hidden layer and very little training. This means it is overfitting badly, which was expected since we have only one data point per class. We can revisit this later and add some kind of regularization/penalty to ensure that the MLP doesn't just memorize the input.
+* To see if our MLP actually learns anything, we tried an alternate dataset, MNIST. Here we use 2 hidden layers with fewer hidden units, as more hidden units would exhaust the memory of the GPU. This is because we are doing batch gradient descent and storing the inputs for all layers, leading to significant memory consumption.
+* Another interesting fact about MNIST: with just the addition of biases to the input layer, we observed a huge improvement, to the point that the network actually started to predict correct values instead of giving the same answer for all inputs.
+
+## Humble Brag
+* As Archimedes once said, give me enough CUDA cores and I will meet you at the global minimum. You can have as many layers as you want, with as many hidden units. (Almost) nothing is hard-coded.
+* The interface to use the multi-layer perceptron follows PyTorch's style and is therefore highly intuitive to anyone who has experience with PyTorch.
+* Batch gradient descent: you don't have to run the MLP again and again for different inputs; just batch them up, run the MLP once with the correct parameters, and let the implementation handle the un-fun stuff for you. On a serious note, batch gradient descent with generic classes introduced several complications about how to calculate and propagate the gradients, which are now handled very gracefully by the Autograd style of backpropagation.
+* MNIST: I was successfully able to learn the correct labels for hand-drawn numbers, how cool is that!!
+* Not giving up even when windows did :) + + + + + + +## References +* http://cs231n.github.io/neural-networks-1/ +* https://corochann.com/mnist-training-with-multi-layer-perceptron-1149.html +* https://www.ritchievink.com/blog/2017/07/10/programming-a-neural-network-from-scratch/ +* https://towardsdatascience.com/getting-started-with-pytorch-part-1-understanding-how-automatic-differentiation-works-5008282073ec +* http://cs231n.stanford.edu/handouts/linear-backprop.pdf diff --git a/Project2-Character-Recognition/bookKeeping/characterLosses.csv b/Project2-Character-Recognition/bookKeeping/characterLosses.csv new file mode 100644 index 0000000..6eec68c --- /dev/null +++ b/Project2-Character-Recognition/bookKeeping/characterLosses.csv @@ -0,0 +1,100 @@ +3.95124 +3.95123 +3.95121 +3.95118 +3.95112 +3.95099 +3.95074 +3.9502 +3.949 +3.9463 +3.94008 +3.92563 +3.89247 +3.82118 +3.69507 +3.53697 +3.35283 +3.1358 +2.88086 +2.6352 +2.59458 +2.14501 +1.87281 +1.86736 +1.61598 +1.20534 +0.901629 +0.793453 +0.737934 +0.587051 +0.432607 +0.351006 +0.356106 +0.337291 +0.317566 +0.246129 +0.213596 +0.173226 +0.128 +0.107839 +0.0979261 +0.075282 +0.0709963 +0.0601452 +0.0587817 +0.0496783 +0.0470992 +0.041245 +0.0382672 +0.0347674 +0.0324327 +0.0304023 +0.0287442 +0.0273021 +0.0259963 +0.0247993 +0.0237088 +0.0226944 +0.0217569 +0.0208874 +0.0200767 +0.0193191 +0.0186157 +0.0179514 +0.0173283 +0.0167464 +0.0161945 +0.0156764 +0.0151852 +0.0147219 +0.0142862 +0.0138674 +0.0134729 +0.0131006 +0.0127408 +0.0124031 +0.0120795 +0.0117706 +0.0114763 +0.011194 +0.0109258 +0.0106667 +0.01042 +0.0101831 +0.0099549 +0.00973787 +0.00952755 +0.00932477 +0.00913258 +0.00894522 +0.0087638 +0.00859288 +0.00842283 +0.00826305 +0.00810679 +0.0079552 +0.00781064 +0.00766904 +0.00753303 +0.00740097 diff --git a/Project2-Character-Recognition/bookKeeping/mnistLosses.csv b/Project2-Character-Recognition/bookKeeping/mnistLosses.csv new file mode 100644 index 0000000..3f6f592 --- /dev/null +++ 
b/Project2-Character-Recognition/bookKeeping/mnistLosses.csv @@ -0,0 +1,98 @@ +2.30258 +2.30182 +2.29723 +2.25676 +2.10882 +2.13333 +1.85369 +1.54322 +1.91046 +2.24429 +1.99083 +1.98546 +2.77747 +2.51213 +2.86779 +3.02199 +3.08375 +2.58157 +2.34638 +2.1385 +2.00144 +1.34227 +1.22468 +1.18058 +1.08096 +1.03805 +0.990021 +0.969472 +0.959191 +0.960439 +0.937562 +0.978949 +0.978516 +0.932206 +0.867721 +0.856745 +0.852589 +0.847493 +0.844346 +0.836677 +0.823922 +1.38576 +1.65748 +1.18833 +1.14968 +0.985466 +0.876304 +0.85324 +0.843678 +0.838025 +0.829976 +0.825726 +0.82305 +0.818249 +0.815175 +0.813122 +0.81156 +0.810342 +0.80937 +0.808547 +0.807909 +0.807415 +0.807029 +0.806829 +0.807355 +0.807097 +0.808126 +0.805296 +0.804825 +0.80313 +0.802475 +0.801791 +0.801563 +0.801426 +0.800862 +0.785671 +0.774506 +0.768242 +0.757633 +0.75637 +0.755529 +0.754989 +0.754938 +0.754965 +0.754398 +0.754042 +0.753789 +0.753555 +0.753346 +0.753154 +1.67381 +0.825814 +0.790645 +0.780903 +0.77252 +0.768511 +0.764477 +0.761343 diff --git a/Project2-Character-Recognition/bookKeeping/xorLosses.csv b/Project2-Character-Recognition/bookKeeping/xorLosses.csv new file mode 100644 index 0000000..f16fc50 --- /dev/null +++ b/Project2-Character-Recognition/bookKeeping/xorLosses.csv @@ -0,0 +1,1000 @@ +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 
+0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693147 +0.693146 +0.693146 +0.693146 +0.693146 +0.693146 +0.693146 +0.693146 +0.693146 +0.693146 +0.693146 +0.693146 +0.693145 +0.693146 +0.693145 +0.693145 +0.693145 +0.693145 +0.693144 +0.693144 +0.693144 +0.693143 +0.693143 +0.693143 +0.693142 +0.693142 +0.693142 +0.693141 +0.693141 +0.69314 +0.693139 +0.693139 +0.693138 +0.693136 +0.693136 +0.693135 +0.693133 +0.693133 +0.693131 +0.693129 +0.693129 +0.693126 +0.693124 +0.693123 +0.69312 +0.693117 +0.693116 +0.693112 +0.693108 +0.693106 +0.693101 +0.693096 +0.693094 +0.693087 +0.69308 +0.693077 +0.693069 +0.69306 +0.693056 +0.693045 +0.693033 +0.693028 +0.693014 +0.692998 +0.692991 +0.692973 +0.692953 +0.692943 +0.69292 +0.692893 +0.692881 +0.69285 +0.692815 +0.692798 +0.692759 +0.692713 +0.69269 +0.692638 +0.69258 +0.692548 +0.692482 +0.692403 +0.692364 +0.692274 +0.692177 +0.692116 +0.692008 +0.691872 +0.691803 +0.691647 +0.691488 +0.691373 +0.691197 +0.690958 +0.690843 +0.690568 +0.690315 +0.690091 +0.689818 +0.689388 +0.689211 +0.688712 +0.688335 +0.687882 +0.687498 +0.686698 +0.686473 +0.685531 +0.685046 +0.684086 +0.683672 +0.682107 +0.681976 +0.680108 +0.679736 +0.6776 +0.677569 +0.674335 +0.674861 +0.670946 +0.671538 +0.667015 +0.667985 +0.663178 +0.662478 +0.659536 +0.656057 +0.654775 +0.648511 +0.649922 +0.642232 +0.643056 +0.636868 +0.631815 +0.6317 +0.622712 +0.624016 +0.617759 +0.607508 +0.607463 +0.599253 +0.593901 +0.592584 +0.58468 +0.578085 +0.583598 +0.57793 +0.566345 +0.5596 +0.559372 +0.547895 +0.545127 +0.540579 +0.537014 +0.528957 +0.527817 +0.529037 +0.519797 +0.507824 +0.50203 +0.506378 +0.504285 +0.492417 +0.485617 +0.475582 +0.478282 +0.479006 +0.471932 +0.459429 +0.458391 +0.450852 +0.449606 +0.445809 +0.435567 +0.441644 +0.441455 +0.431693 +0.428465 +0.425977 +0.418322 +0.416089 +0.41874 +0.41126 +0.40861 +0.413934 +0.413048 +0.406471 +0.400915 +0.396166 +0.397808 +0.399359 +0.394369 +0.392301 +0.388921 +0.385665 +0.390627 +0.389692 
+0.386328 +0.38459 +0.381742 +0.380682 +0.384348 +0.383411 +0.385552 +0.382282 +0.379521 +0.377145 +0.375073 +0.374189 +0.372673 +0.371339 +0.372263 +0.375777 +0.374565 +0.372611 +0.370922 +0.369526 +0.368935 +0.367702 +0.366834 +0.366226 +0.365245 +0.364931 +0.365996 +0.371909 +0.371448 +0.369909 +0.368532 +0.367295 +0.366175 +0.36516 +0.364234 +0.363388 +0.362611 +0.361895 +0.361247 +0.361018 +0.36042 +0.36017 +0.359684 +0.3616 +0.363658 +0.362841 +0.362056 +0.361256 +0.360544 +0.360061 +0.359644 +0.359104 +0.358719 +0.358414 +0.357963 +0.357749 +0.357388 +0.357104 +0.35686 +0.356552 +0.356364 +0.35606 +0.355894 +0.355606 +0.355442 +0.355362 +0.358762 +0.358074 +0.357647 +0.357069 +0.356563 +0.356081 +0.355764 +0.355396 +0.35498 +0.35478 +0.356736 +0.356212 +0.355656 +0.355181 +0.354639 +0.354194 +0.353644 +0.35321 +0.352631 +0.352194 +0.35157 +0.351101 +0.350421 +0.3499 +0.349142 +0.348544 +0.347681 +0.347065 +0.346089 +0.345331 +0.344174 +0.343242 +0.341855 +0.340706 +0.339904 +0.340635 +0.338414 +0.336456 +0.333933 +0.331454 +0.328584 +0.325626 +0.322398 +0.318754 +0.314934 +0.310629 +0.306266 +0.300996 +0.296089 +0.290104 +0.28458 +0.277459 +0.271647 +0.263646 +0.256618 +0.248611 +0.240929 +0.233394 +0.224475 +0.216547 +0.208418 +0.200359 +0.192321 +0.184331 +0.177569 +0.171166 +0.164648 +0.157501 +0.150256 +0.144548 +0.137565 +0.132472 +0.126336 +0.121721 +0.116814 +0.113534 +0.108668 +0.104416 +0.10102 +0.0984068 +0.0951113 +0.0918439 +0.0884466 +0.0850489 +0.0820951 +0.0795349 +0.0777572 +0.0749728 +0.0729759 +0.0703241 +0.0688848 +0.0675097 +0.0661515 +0.0638861 +0.0620168 +0.0605538 +0.0588444 +0.0571957 +0.055952 +0.055088 +0.0541734 +0.0522955 +0.0512557 +0.0501433 +0.0485552 +0.0478443 +0.0468148 +0.0456181 +0.0449285 +0.044594 +0.0431728 +0.0419527 +0.0415809 +0.040547 +0.0396303 +0.0393156 +0.0387712 +0.0376286 +0.0368865 +0.036351 +0.0356082 +0.0353907 +0.0346197 +0.0337546 +0.0332245 +0.0327645 +0.0320924 +0.0316921 +0.0314928 +0.0306666 +0.030028 
+0.0297432 +0.0292473 +0.0289932 +0.0284533 +0.0278836 +0.0274512 +0.0271506 +0.0269332 +0.0263532 +0.0259606 +0.0255839 +0.0251688 +0.0248426 +0.0246226 +0.0243805 +0.0239027 +0.0234925 +0.0232311 +0.0230314 +0.0227388 +0.02239 +0.022066 +0.0217135 +0.0215615 +0.0213882 +0.0209508 +0.0207348 +0.0204597 +0.0202199 +0.0201323 +0.0197738 +0.0194897 +0.0192901 +0.019096 +0.0189607 +0.0186666 +0.0184148 +0.0182172 +0.018065 +0.0178901 +0.017655 +0.0174668 +0.0172035 +0.0170996 +0.0169804 +0.0166848 +0.0165725 +0.0163503 +0.0162072 +0.0161193 +0.0158658 +0.0157357 +0.015546 +0.0154 +0.0152399 +0.0151013 +0.0149803 +0.0149061 +0.0146851 +0.0144942 +0.014501 +0.0142867 +0.0141075 +0.0140119 +0.0138773 +0.013813 +0.0136426 +0.0134953 +0.0133832 +0.0132818 +0.0131856 +0.0130576 +0.0129283 +0.0127747 +0.0127339 +0.0126388 +0.0124724 +0.0124016 +0.0122604 +0.0121926 +0.0121175 +0.0119654 +0.0118963 +0.0117637 +0.0116774 +0.0116005 +0.0115536 +0.011406 +0.0113287 +0.0112275 +0.0111429 +0.0111114 +0.0109757 +0.0108857 +0.0108001 +0.0107373 +0.0106927 +0.0105642 +0.0104764 +0.0104117 +0.0103555 +0.0102831 +0.0101913 +0.0101057 +0.0100249 +0.0100039 +0.00991242 +0.00982874 +0.00975851 +0.00966924 +0.00966659 +0.00956948 +0.00948102 +0.0094388 +0.00937931 +0.00929716 +0.00924934 +0.00915408 +0.00912198 +0.00904224 +0.00898058 +0.00895897 +0.00889404 +0.00881794 +0.00876254 +0.0087248 +0.00865587 +0.00859323 +0.0085244 +0.00849513 +0.00845568 +0.00837149 +0.0083314 +0.00826449 +0.00823711 +0.00819389 +0.00811244 +0.00808106 +0.00801509 +0.00796769 +0.00793833 +0.00790642 +0.00783218 +0.00778562 +0.00773821 +0.007707 +0.00767312 +0.00760396 +0.00756433 +0.00754272 +0.00747229 +0.00744165 +0.00738727 +0.00735122 +0.00733173 +0.00726571 +0.00722842 +0.00718072 +0.00715927 +0.0071251 +0.00706442 +0.00702961 +0.00699357 +0.00695186 +0.00691761 +0.00690353 +0.0068424 +0.00680192 +0.00676822 +0.00674454 +0.00671631 +0.00666115 +0.00662816 +0.00661604 +0.00655937 +0.0065271 +0.00649288 
+0.00646586 +0.00643966 +0.00639569 +0.00636261 +0.00632882 +0.00629841 +0.00626747 +0.00625572 +0.00620441 +0.00617259 +0.00614408 +0.00611966 +0.00609809 +0.00605614 +0.00602129 +0.00601364 +0.00596936 +0.00593951 +0.00590963 +0.00588674 +0.00586737 +0.00582768 +0.00579966 +0.00577183 +0.0057475 +0.00571615 +0.00570905 +0.00566918 +0.0056375 +0.00561371 +0.00559437 +0.00557624 +0.00553883 +0.00550885 +0.00548798 +0.00547675 +0.0054443 +0.00541554 +0.00539555 +0.00537675 +0.00534366 +0.0053226 +0.00529648 +0.00527478 +0.00524775 +0.0052452 +0.00520764 +0.00517941 +0.00516255 +0.00514458 +0.00512639 +0.005096 +0.00507326 +0.00505079 +0.00504219 +0.00501476 +0.00499081 +0.00497222 +0.0049545 +0.00493005 +0.00490964 +0.00488582 +0.00486863 +0.00484654 +0.0048408 +0.00480892 +0.00478741 +0.00477048 +0.00475494 +0.00473878 +0.0047143 +0.00469291 +0.00467228 +0.00466809 +0.00464224 +0.0046205 +0.00460546 +0.00459138 +0.00456766 +0.00455032 +0.00453054 +0.00451542 +0.00449602 +0.00448894 +0.00446436 +0.00444354 +0.00442814 +0.00441688 +0.00440179 +0.00437907 +0.00436135 +0.00435456 +0.00433078 +0.00431381 +0.00429765 +0.00428346 +0.00426402 +0.00426026 +0.00423607 +0.00421944 +0.00420387 +0.00419006 +0.00418232 +0.00415906 +0.00414177 +0.00412835 +0.00412152 +0.00410304 +0.0040839 +0.00407296 +0.00406225 +0.00404161 +0.00402809 +0.00401466 +0.00399948 +0.00398435 +0.00398046 +0.00395845 +0.00394373 +0.00392934 +0.0039213 +0.00390953 +0.00388944 +0.00387645 +0.00386383 +0.00385042 +0.003839 +0.00383303 +0.00381202 +0.00379853 +0.00379372 +0.00377424 +0.00376233 +0.00374741 +0.00373824 +0.00373107 +0.00371091 +0.00369935 +0.00368696 +0.00367506 +0.00366527 +0.0036582 +0.00363974 +0.0036281 +0.00362231 +0.00360541 +0.00359482 +0.00358028 +0.00357171 +0.00356518 +0.00354662 +0.00353741 +0.00352363 +0.00351786 +0.00350797 +0.00349009 +0.00348247 +0.00347016 +0.00345779 +0.00345249 +0.0034436 +0.0034262 +0.00341857 +0.00341181 +0.00339567 +0.00338741 +0.00337307 +0.00336914 
+0.00335928 +0.00334264 +0.00333648 +0.00332383 +0.00331294 +0.0033081 +0.00329961 +0.00328482 +0.0032763 +0.00326963 +0.00325722 +0.00324718 +0.00323429 +0.00323087 +0.00322171 +0.00320653 +0.0032008 +0.0031882 +0.00318032 +0.00317328 +0.00316567 +0.00315431 +0.00314352 +0.00313818 +0.00312822 +0.00311717 +0.00310634 +0.00310186 +0.00309402 +0.00308152 +0.003074 +0.00306177 +0.00305702 +0.00304811 +0.00304141 +0.00303252 +0.00302065 +0.00301185 +0.0030087 +0.00299964 +0.00298824 +0.00298127 +0.00297467 +0.00296499 +0.00295611 +0.00294493 +0.00294439 +0.00293387 +0.00292241 +0.00291736 +0.00291043 +0.00290039 +0.00289348 +0.00288324 +0.00287797 +0.00286933 +0.0028641 +0.0028553 +0.00284619 +0.00283741 +0.00283429 +0.00282569 +0.00281629 +0.00280953 +0.00279983 +0.00279514 +0.00278825 +0.00278243 +0.00277346 +0.00276474 +0.00276176 +0.00275177 +0.00274345 +0.00273682 +0.00273161 +0.00272511 +0.00271631 +0.00270921 +0.00270088 +0.00269582 +0.00268914 +0.00268504 +0.00267526 +0.00266739 +0.00266562 +0.00265468 +0.00264833 +0.0026414 +0.00263614 +0.00263085 +0.00262191 +0.00261526 +0.00260843 +0.00260239 +0.00259702 +0.00259317 +0.00258277 +0.00257726 +0.00257431 +0.00256406 +0.00255874 +0.0025516 +0.00254683 +0.00254247 +0.00253304 +0.00252793 +0.00252104 +0.00251446 +0.00251109 +0.00250632 +0.00249653 +0.00249199 +0.00248857 +0.00247921 +0.00247451 +0.00246711 +0.00246378 +0.00245859 +0.00244927 +0.00244609 +0.00243853 +0.00243216 +0.00242975 +0.00242461 +0.00241594 +0.0024112 +0.0024078 +0.00240011 +0.00239473 +0.00238716 +0.00238559 +0.00237966 +0.00237087 +0.00236841 +0.00236078 +0.00235564 +0.00235223 +0.00234755 +0.00234055 +0.00233505 +0.00233137 +0.00232553 +0.00231945 +0.00231259 +0.00231107 +0.00230533 +0.00229798 +0.00229419 +0.00228735 +0.00228354 +0.00227938 +0.00227451 +0.00226916 +0.00226312 +0.00226018 +0.00225399 +0.0022484 +0.00224284 +0.00224011 +0.00223497 +0.00222898 +0.00222453 +0.00221787 +0.00221501 +0.00221056 +0.00220663 +0.00220048 
+0.00219528 +0.00219317 +0.00218611 +0.00218125 +0.00217677 +0.00217352 +0.00216845 +0.00216313 +0.00215868 +0.00215304 +0.00214926 +0.00214569 +0.00214234 +0.00213543 +0.00213114 +0.00212943 +0.00212216 +0.00211792 diff --git a/Project2-Character-Recognition/character_recognition/CMakeLists.txt b/Project2-Character-Recognition/character_recognition/CMakeLists.txt index 7446175..9e834c1 100644 --- a/Project2-Character-Recognition/character_recognition/CMakeLists.txt +++ b/Project2-Character-Recognition/character_recognition/CMakeLists.txt @@ -7,5 +7,5 @@ set(SOURCE_FILES cuda_add_library(character_recognition ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_61 ) diff --git a/Project2-Character-Recognition/character_recognition/mlp.cu b/Project2-Character-Recognition/character_recognition/mlp.cu index 5a3ed7f..0dc7e6c 100644 --- a/Project2-Character-Recognition/character_recognition/mlp.cu +++ b/Project2-Character-Recognition/character_recognition/mlp.cu @@ -2,6 +2,14 @@ #include #include "common.h" #include "mlp.h" +#include +#include + +#ifndef imax +#define imax(a,b) (((a)>(b))?(a):(b)) +#endif + +#define blockSize 128 namespace CharacterRecognition { using Common::PerformanceTimer; @@ -10,18 +18,362 @@ namespace CharacterRecognition { static PerformanceTimer timer; return timer; } - - // TODO: __global__ - - /** - * Example of use case (follow how you did it in stream compaction) - */ - /*void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } - */ - // TODO: implement required elements for MLP sections 1 and 2 here + __global__ void kernCrossEntropyLoss(int n, float *predicted, float *label, float *lossForEachLabel) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + + if (index < n) { + lossForEachLabel[index] = -1 * (label[index] * logf(predicted[index])); + } + } + + float MultiLayerPerceptron::loss(float *label, float *predicted) { + + float *devLabel; + cudaMalloc((void**)&devLabel, 
batchDim * layers[layers.size() - 1]->getOutputDim() * sizeof(float));
+		checkCUDAError("cudaMalloc");
+		cudaMemcpy(devLabel, label, batchDim * layers[layers.size() - 1]->getOutputDim() * sizeof(float), cudaMemcpyHostToDevice);
+		checkCUDAError("cudaMemcpy");
+
+		float *devPredicted;
+		cudaMalloc((void**)&devPredicted, batchDim * layers[layers.size() - 1]->getOutputDim() * sizeof(float));
+		checkCUDAError("cudaMalloc");
+		cudaMemcpy(devPredicted, predicted, batchDim * layers[layers.size() - 1]->getOutputDim() * sizeof(float), cudaMemcpyHostToDevice);
+		checkCUDAError("cudaMemcpy");
+
+		float *lossForEachLabel = new float[batchDim * layers[layers.size() - 1]->getOutputDim()];
+		float *devLossForEachLabel;
+		cudaMalloc((void**)&devLossForEachLabel, batchDim * layers[layers.size() - 1]->getOutputDim() * sizeof(float));
+		checkCUDAError("cudaMalloc");
+
+		int gridRows = ((batchDim * layers[layers.size() - 1]->getOutputDim()) + blockSize - 1) / blockSize;
+		kernCrossEntropyLoss<<<gridRows, blockSize>>>(batchDim * layers[layers.size() - 1]->getOutputDim(), devPredicted, devLabel, devLossForEachLabel);
+		checkCUDAError("kernCrossEntropyLoss");
+
+		cudaMemcpy(lossForEachLabel, devLossForEachLabel, batchDim * layers[layers.size() - 1]->getOutputDim() * sizeof(float), cudaMemcpyDeviceToHost);
+		checkCUDAError("cudaMemcpy");
+
+		float loss = 0;
+		for (int i = 0; i < batchDim * layers[layers.size() - 1]->getOutputDim(); i++) {
+			loss += lossForEachLabel[i];
+		}
+		// Free the temporaries before returning so loss() does not leak memory.
+		delete[] lossForEachLabel;
+		cudaFree(devLabel);
+		cudaFree(devPredicted);
+		cudaFree(devLossForEachLabel);
+		return loss / batchDim;
+	}
+
+	__global__ void kernSubtractMatrices(float *input1, float *input2, float *output, int m, int n) {
+		int index = blockIdx.x * blockDim.x + threadIdx.x;
+		int row = index / n;
+		int col = index % n;
+
+		if (col < n && row < m) {
+			int pos = row * n + col;
+			output[pos] = input1[pos] - input2[pos];
+		}
+	}
+
+	// Row-major (m x n) * (n x k) matrix product; one thread per output entry.
+	__global__ void kernMultiplyMatrices(float *input, float *weight, float *output, int m, int n, int k) {
+		int index = blockIdx.x * blockDim.x + threadIdx.x;
+		int row = index / k;
+		int col = index % k;
+		float sum = 0;
+
+		if (col < k && row < m) {
+			for (int i = 0; i < n; i++) {
+				sum += input[row * n + i] * weight[i * k + col];
+			}
+			output[row * k + col] = sum;
+		}
+	}
+
+	// Element-wise (Hadamard) product of two (m x n) matrices.
+	__global__ void kernMultMatricesHammard(float *input1, float *input2, float *output, int m, int n) {
+		int index = blockIdx.x * blockDim.x + threadIdx.x;
+		int row = index / n;
+		int col = index % n;
+
+		if (col < n && row < m) {
+			output[row * n + col] = input1[row * n + col] * input2[row * n + col];
+		}
+	}
+
+	__global__ void kernMultMatricesWithScalar(float *input, float *output, int m, int n, float scalar) {
+		int index = blockIdx.x * blockDim.x + threadIdx.x;
+		int row = index / n;
+		int col = index % n;
+
+		if (col < n && row < m) {
+			int pos = row * n + col;
+			output[pos] = scalar * input[pos];
+		}
+	}
+
+	__global__ void kernTransposeMatrices(float *input, float *output, int m, int n) {
+		int index = blockIdx.x * blockDim.x + threadIdx.x;
+		int row = index / n;
+		int col = index % n;
+
+		if (col < n && row < m) {
+			int pos = row * n + col;
+			int newPos = col * m + row;
+			output[newPos] = input[pos];
+		}
+	}
+
+	__global__ void kernActivateReLU(float *input, int n) {
+		int index = blockIdx.x * blockDim.x + threadIdx.x;
+		if (index < n) {
+			input[index] = imax(input[index], 0);
+		}
+	}
+
+	__global__ void kernActivateReLUDerivative(float *input, float *output, int n) {
+		int index = blockIdx.x * blockDim.x + threadIdx.x;
+		if (index < n) {
+			output[index] = (input[index] > 0) ? 1 : 0;
+		}
+	}
+
+	__global__ void kernActivateSoftmax(float *input, int n, int outputDim, float *softmaxDenominator) {
+		int index = blockIdx.x * blockDim.x + threadIdx.x;
+		int batchIndex = index / outputDim;
+		if (index < n) {
+			input[index] = expf(input[index]) / softmaxDenominator[batchIndex];
+		}
+	}
+
+	void genArray(int n, float *a) {
+		srand(11);
+
+		for (int i = 0; i < n; i++) {
+			a[i] = ((2 * ((rand() * 1.0) / RAND_MAX)) - 1) * 0.0002;
+		}
+	}
+
+	FullyConnectedLayer::FullyConnectedLayer(int inputDim, int outputDim, int batchDim, bool lastLayer) {
+		this->inputDim = inputDim;
+		this->outputDim = outputDim;
+		this->batchDim = batchDim;
+		this->lastLayer = lastLayer;
+		cudaMalloc((void **)&weight, inputDim * outputDim * sizeof(float));
+		float *weightRand = new float[inputDim * outputDim];
+		genArray(inputDim * outputDim, weightRand);
+		cudaMemcpy(weight, weightRand, inputDim * outputDim * sizeof(float), cudaMemcpyHostToDevice);
+		delete[] weightRand;
+		cudaMalloc((void**)&inputs, inputDim * batchDim * sizeof(float));
+	}
+
+	int FullyConnectedLayer::getInputDim() {
+		return inputDim;
+	}
+
+	int FullyConnectedLayer::getOutputDim() {
+		return outputDim;
+	}
+
+	void FullyConnectedLayer::forward(float *inputArg, float *outputArg, bool test) {
+		// Stash the input; the backward pass needs it for the weight gradient.
+		cudaMemcpy(inputs, inputArg, batchDim * inputDim * sizeof(float), cudaMemcpyDeviceToDevice);
+		int gridRows = (batchDim * outputDim + blockSize - 1) / blockSize;
+
+		kernMultiplyMatrices<<<gridRows, blockSize>>>(inputArg, weight, outputArg, batchDim, inputDim, outputDim);
+		checkCUDAError("kernMultiplyMatrices");
+
+		dim3 fullBlocksPerGrid((outputDim * batchDim + blockSize - 1) / blockSize);
+		if (!lastLayer) {
+			kernActivateReLU<<<fullBlocksPerGrid, blockSize>>>(outputArg, outputDim * batchDim);
+		}
+		else {
+			float *output = new float[outputDim * batchDim];
+			cudaMemcpy(output, outputArg, batchDim * outputDim * sizeof(float), cudaMemcpyDeviceToHost);
+			float *softmaxDenominator = new float[batchDim];
+			memset(softmaxDenominator, 0, batchDim * sizeof(float));
+			for (int j = 0; j < batchDim; j++) {
+				for (int i = 0; i < outputDim; i++) {
+					softmaxDenominator[j] += exp(output[j * outputDim + i]);
+				}
+			}
+
+			float *devSoftmaxDenominator;
+			cudaMalloc((void **)&devSoftmaxDenominator, batchDim * sizeof(float));
+			cudaMemcpy(devSoftmaxDenominator, softmaxDenominator, batchDim * sizeof(float), cudaMemcpyHostToDevice);
+			kernActivateSoftmax<<<fullBlocksPerGrid, blockSize>>>(outputArg, batchDim * outputDim, outputDim, devSoftmaxDenominator);
+			checkCUDAError("kernActivateSoftmax");
+
+			delete[] output;
+			delete[] softmaxDenominator;
+
+			cudaFree(devSoftmaxDenominator);
+			checkCUDAError("cudaFree");
+		}
+
+		if (test) {
+			printf("\n\n\tWeights : ");
+			float *tempWeight = new float[inputDim * outputDim];
+			cudaMemcpy(tempWeight, weight, inputDim * outputDim * sizeof(float), cudaMemcpyDeviceToHost);
+			for (int i = 0; i < inputDim * outputDim; i++) {
+				if (i % outputDim == 0) {
+					printf("\n\t\t");
+				}
+				printf("%f ", tempWeight[i]);
+			}
+			delete[] tempWeight;
+		}
+	}
+
+	void FullyConnectedLayer::backward(float learningRate, float *incomingGradient, float *outgoingGradient) {
+
+		float *weightTranspose;
+		cudaMalloc((void**)&weightTranspose, inputDim * outputDim * sizeof(float));
+		checkCUDAError("cudaMalloc");
+
+		int gridRows = (inputDim * outputDim + blockSize - 1) / blockSize;
+		kernTransposeMatrices<<<gridRows, blockSize>>>(weight, weightTranspose, inputDim, outputDim);
+		checkCUDAError("kernTransposeMatrices");
+
+		float *outgoingGradientLocal;
+		cudaMalloc((void**)&outgoingGradientLocal, inputDim * batchDim * sizeof(float));
+		checkCUDAError("cudaMalloc");
+
+		// dL/dX (pre-activation) = incomingGradient * W^T
+		gridRows = (inputDim * batchDim + blockSize - 1) / blockSize;
+		kernMultiplyMatrices<<<gridRows, blockSize>>>(incomingGradient, weightTranspose, outgoingGradientLocal, batchDim, outputDim, inputDim);
+		checkCUDAError("kernMultiplyMatrices");
+
+		cudaFree(weightTranspose);
+		checkCUDAError("cudaFree");
+
+		float *inputDerivatived;
+		cudaMalloc((void**)&inputDerivatived, batchDim * inputDim * sizeof(float));
+		dim3 fullBlocksPerGrid((inputDim * batchDim + blockSize - 1) / blockSize);
+		kernActivateReLUDerivative<<<fullBlocksPerGrid, blockSize>>>(inputs, inputDerivatived, inputDim * batchDim);
+		checkCUDAError("kernActivateReLUDerivative");
+
+		// Element-wise product with the ReLU derivative gives the gradient to pass back.
+		gridRows = (inputDim * batchDim + blockSize - 1) / blockSize;
+		kernMultMatricesHammard<<<gridRows, blockSize>>>(outgoingGradientLocal, inputDerivatived, outgoingGradient, batchDim, inputDim);
+		checkCUDAError("kernMultMatricesHammard");
+
+		cudaFree(inputDerivatived);
+		cudaFree(outgoingGradientLocal);
+		checkCUDAError("cudaFree");
+
+		float *inputTranspose;
+		cudaMalloc((void**)&inputTranspose, inputDim * batchDim * sizeof(float));
+
+		gridRows = (inputDim * batchDim + blockSize - 1) / blockSize;
+		kernTransposeMatrices<<<gridRows, blockSize>>>(inputs, inputTranspose, batchDim, inputDim);
+		checkCUDAError("kernTransposeMatrices");
+
+		// dL/dW = X^T * incomingGradient
+		float *gradient;
+		cudaMalloc((void**)&gradient, inputDim * outputDim * sizeof(float));
+		gridRows = (inputDim * outputDim + blockSize - 1) / blockSize;
+		kernMultiplyMatrices<<<gridRows, blockSize>>>(inputTranspose, incomingGradient, gradient, inputDim, batchDim, outputDim);
+		checkCUDAError("kernMultiplyMatrices");
+
+		cudaFree(inputTranspose);
+		checkCUDAError("cudaFree");
+
+		kernMultMatricesWithScalar<<<gridRows, blockSize>>>(gradient, gradient, inputDim, outputDim, learningRate);
+		checkCUDAError("kernMultMatricesWithScalar");
+
+		kernSubtractMatrices<<<gridRows, blockSize>>>(weight, gradient, weight, inputDim, outputDim);
+		checkCUDAError("kernSubtractMatrices");
+
+		cudaFree(gradient);
+		checkCUDAError("cudaFree");
+
+	}
+
+	MultiLayerPerceptron::MultiLayerPerceptron(int inputDim, int numHiddenLayers, int *hiddenDim, int outputDim, int batchDim) {
+		this->batchDim = batchDim;
+
+		FullyConnectedLayer *tempLayer = new FullyConnectedLayer(inputDim, hiddenDim[0], batchDim, false);
+		layers.push_back(tempLayer);
+		// Connect every consecutive pair of hidden layers (the loop must run up to
+		// numHiddenLayers so the last hidden layer's input dim matches hiddenDim[numHiddenLayers - 2]).
+		for (int i = 1; i < numHiddenLayers; i++) {
+			FullyConnectedLayer *tempLayer = new FullyConnectedLayer(hiddenDim[i - 1], hiddenDim[i], batchDim, false);
+			layers.push_back(tempLayer);
+		}
+		tempLayer = new FullyConnectedLayer(hiddenDim[numHiddenLayers - 1], outputDim, batchDim, true);
+
layers.push_back(tempLayer); + + } + + void MultiLayerPerceptron::forward(float *input, float *output, bool test) { + float *devOutput; + cudaMalloc((void**)&devOutput, batchDim * layers[0]->getInputDim() * sizeof(float)); + checkCUDAError("cudaMalloc"); + cudaMemcpy(devOutput, input, batchDim * layers[0]->getInputDim() * sizeof(float), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy"); + float *hiddenOutput; + for (int i = 0; i < layers.size(); i++) { + cudaMalloc((void**)&hiddenOutput, batchDim * layers[i]->getOutputDim() * sizeof(float)); + checkCUDAError("cudaMalloc"); + layers[i]->forward(devOutput, hiddenOutput, test); + cudaFree(devOutput); + cudaMalloc((void**)&devOutput, batchDim * layers[i]->getOutputDim() * sizeof(float)); + checkCUDAError("cudaMalloc"); + cudaMemcpy(devOutput, hiddenOutput, batchDim * layers[i]->getOutputDim() * sizeof(float), cudaMemcpyDeviceToDevice); + checkCUDAError("cudaMemcpy"); + cudaFree(hiddenOutput); + } + cudaMemcpy(output, devOutput, batchDim * layers[layers.size() - 1]->getOutputDim() * sizeof(float), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy"); + + cudaFree(devOutput); + } + + void MultiLayerPerceptron::backward(float *label, float *predicted, float learningRate) { + float *devLabel; + cudaMalloc((void**)&devLabel, batchDim*layers[layers.size() - 1]->getOutputDim() * sizeof(float)); + checkCUDAError("cudaMalloc"); + cudaMemcpy(devLabel, label, batchDim * layers[layers.size() - 1]->getOutputDim() * sizeof(float), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy"); + + float *devPredicted; + cudaMalloc((void**)&devPredicted, batchDim*layers[layers.size() - 1]->getOutputDim() * sizeof(float)); + checkCUDAError("cudaMalloc"); + cudaMemcpy(devPredicted, predicted, batchDim*layers[layers.size() - 1]->getOutputDim() * sizeof(float), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy"); + + float *incomingGradient; + cudaMalloc((void**)&incomingGradient, batchDim*layers[layers.size() - 1]->getOutputDim() 
* sizeof(float));
+		checkCUDAError("cudaMalloc");
+
+		int gridRows = ((batchDim * layers[layers.size() - 1]->getOutputDim()) + blockSize - 1) / blockSize;
+		kernSubtractMatrices<<<gridRows, blockSize>>>(devPredicted, devLabel, incomingGradient, batchDim, layers[layers.size() - 1]->getOutputDim());
+		checkCUDAError("kernSubtractMatrices");
+
+		cudaFree(devLabel);
+		checkCUDAError("cudaFree");
+		cudaFree(devPredicted);
+		checkCUDAError("cudaFree");
+
+		float *outgoingGradient;
+		for (int i = layers.size() - 1; i >= 0; i--) {
+			cudaMalloc((void**)&outgoingGradient, batchDim*layers[i]->getInputDim() * sizeof(float));
+			checkCUDAError("cudaMalloc");
+			layers[i]->backward(learningRate, incomingGradient, outgoingGradient);
+			cudaFree(incomingGradient);
+			checkCUDAError("cudaFree");
+			cudaMalloc((void**)&incomingGradient, batchDim*layers[i]->getInputDim() * sizeof(float));
+			checkCUDAError("cudaMalloc");
+			cudaMemcpy(incomingGradient, outgoingGradient, batchDim*layers[i]->getInputDim() * sizeof(float), cudaMemcpyDeviceToDevice);
+			checkCUDAError("cudaMemcpy");
+			cudaFree(outgoingGradient);
+		}
+		cudaFree(incomingGradient);
+	}
+
+}
diff --git a/Project2-Character-Recognition/character_recognition/mlp.h b/Project2-Character-Recognition/character_recognition/mlp.h
index 2096228..ddc428e 100644
--- a/Project2-Character-Recognition/character_recognition/mlp.h
+++ b/Project2-Character-Recognition/character_recognition/mlp.h
@@ -1,9 +1,36 @@
 #pragma once
 #include "common.h"
+#include <vector>
+#include <cuda_runtime.h>

 namespace CharacterRecognition {
     Common::PerformanceTimer& timer();

+    class FullyConnectedLayer {
+        float *weight = NULL;
+        float *inputs = NULL;
+        int inputDim;
+        int batchDim;
+        int outputDim;
+        bool lastLayer;
-    // TODO: implement required elements for MLP sections 1 and 2 here
+    public:
+        FullyConnectedLayer(int inputDim, int outputDim, int batchDim, bool lastLayer);
+        void forward(float *input, float *output, bool test = false);
+        void backward(float
learningRate, float *incomingGradient, float *outgoingGradient);
+        int getInputDim();
+        int getOutputDim();
+    };
+
+
+    class MultiLayerPerceptron {
+
+        std::vector<FullyConnectedLayer *> layers;
+        int batchDim;
+    public :
+        MultiLayerPerceptron(int inputDim, int numHiddenLayers, int *hiddenDim, int outputDim, int batchDim);
+        void forward(float *input, float *output, bool test = false);
+        void backward(float *output, float *predicted, float learningRate);
+        float loss(float *label, float *predicted);
+    };
 }
diff --git a/Project2-Character-Recognition/charts/CharacterRecognitionLoss.jpg b/Project2-Character-Recognition/charts/CharacterRecognitionLoss.jpg
new file mode 100644
index 0000000..dc9bd34
Binary files /dev/null and b/Project2-Character-Recognition/charts/CharacterRecognitionLoss.jpg differ
diff --git a/Project2-Character-Recognition/charts/MNISTLoss.jpg b/Project2-Character-Recognition/charts/MNISTLoss.jpg
new file mode 100644
index 0000000..384590a
Binary files /dev/null and b/Project2-Character-Recognition/charts/MNISTLoss.jpg differ
diff --git a/Project2-Character-Recognition/charts/XORLoss.jpg b/Project2-Character-Recognition/charts/XORLoss.jpg
new file mode 100644
index 0000000..d266ff2
Binary files /dev/null and b/Project2-Character-Recognition/charts/XORLoss.jpg differ
diff --git a/Project2-Character-Recognition/data-set/mnist/t10k-images-idx3-ubyte b/Project2-Character-Recognition/data-set/mnist/t10k-images-idx3-ubyte
new file mode 100644
index 0000000..1170b2c
Binary files /dev/null and b/Project2-Character-Recognition/data-set/mnist/t10k-images-idx3-ubyte differ
diff --git a/Project2-Character-Recognition/data-set/mnist/t10k-labels-idx1-ubyte b/Project2-Character-Recognition/data-set/mnist/t10k-labels-idx1-ubyte
new file mode 100644
index 0000000..d1c3a97
Binary files /dev/null and b/Project2-Character-Recognition/data-set/mnist/t10k-labels-idx1-ubyte differ
diff --git a/Project2-Character-Recognition/data-set/mnist/train-images-idx3-ubyte
b/Project2-Character-Recognition/data-set/mnist/train-images-idx3-ubyte new file mode 100644 index 0000000..bbce276 Binary files /dev/null and b/Project2-Character-Recognition/data-set/mnist/train-images-idx3-ubyte differ diff --git a/Project2-Character-Recognition/data-set/mnist/train-labels-idx1-ubyte b/Project2-Character-Recognition/data-set/mnist/train-labels-idx1-ubyte new file mode 100644 index 0000000..d6b4c5d Binary files /dev/null and b/Project2-Character-Recognition/data-set/mnist/train-labels-idx1-ubyte differ diff --git a/Project2-Character-Recognition/img/Autograd.jpg b/Project2-Character-Recognition/img/Autograd.jpg new file mode 100644 index 0000000..383e9d3 Binary files /dev/null and b/Project2-Character-Recognition/img/Autograd.jpg differ diff --git a/Project2-Character-Recognition/img/EquationsPart1.jpg b/Project2-Character-Recognition/img/EquationsPart1.jpg new file mode 100644 index 0000000..a0bea57 Binary files /dev/null and b/Project2-Character-Recognition/img/EquationsPart1.jpg differ diff --git a/Project2-Character-Recognition/img/EquationsPart2.jpg b/Project2-Character-Recognition/img/EquationsPart2.jpg new file mode 100644 index 0000000..f938df3 Binary files /dev/null and b/Project2-Character-Recognition/img/EquationsPart2.jpg differ diff --git a/Project2-Character-Recognition/img/MNISTmlp.png b/Project2-Character-Recognition/img/MNISTmlp.png new file mode 100644 index 0000000..5671a7e Binary files /dev/null and b/Project2-Character-Recognition/img/MNISTmlp.png differ diff --git a/Project2-Character-Recognition/img/PrettyOutput.jpg b/Project2-Character-Recognition/img/PrettyOutput.jpg new file mode 100644 index 0000000..5b667c1 Binary files /dev/null and b/Project2-Character-Recognition/img/PrettyOutput.jpg differ diff --git a/Project2-Character-Recognition/img/bsod.gif b/Project2-Character-Recognition/img/bsod.gif new file mode 100644 index 0000000..5e2b94a Binary files /dev/null and b/Project2-Character-Recognition/img/bsod.gif differ 
diff --git a/Project2-Character-Recognition/img/neuron.png b/Project2-Character-Recognition/img/neuron.png
new file mode 100644
index 0000000..46aab0f
Binary files /dev/null and b/Project2-Character-Recognition/img/neuron.png differ
diff --git a/Project2-Character-Recognition/img/neuron_model.jpeg b/Project2-Character-Recognition/img/neuron_model.jpeg
new file mode 100644
index 0000000..7943119
Binary files /dev/null and b/Project2-Character-Recognition/img/neuron_model.jpeg differ
diff --git a/Project2-Character-Recognition/img/partial_derivative_notations.png b/Project2-Character-Recognition/img/partial_derivative_notations.png
new file mode 100644
index 0000000..f208d0f
Binary files /dev/null and b/Project2-Character-Recognition/img/partial_derivative_notations.png differ
diff --git a/Project2-Character-Recognition/src/main.cpp b/Project2-Character-Recognition/src/main.cpp
index 11dd534..2bfce51 100644
--- a/Project2-Character-Recognition/src/main.cpp
+++ b/Project2-Character-Recognition/src/main.cpp
@@ -1,8 +1,8 @@
 /**
  * @file main.cpp
- * @brief Stream compaction test program
- * @authors Kai Ninomiya
- * @date 2015
+ * @brief MLP Driver
+ * @authors Kushagra Goel
+ * @date 2019
  * @copyright University of Pennsylvania
  */

@@ -10,143 +10,324 @@
 #include <cstdio>
 #include <character_recognition/mlp.h>
 #include "testing_helpers.hpp"
+#include <iostream>
+#include <fstream>
+#include <sstream>
+#include <string>
+#include <vector>
+#include <cstring>
+#include <cstdlib>

-const int SIZE = 1 << 8; // feel free to change the size of array
-const int NPOT = SIZE - 3; // Non-Power-Of-Two
-int *a = new int[SIZE];
-int *b = new int[SIZE];
-int *c = new int[SIZE];
+using namespace std;
+
+// read MNIST data into double vector, OpenCV Mat, or Armadillo mat
+// free to use this code for any purpose
+// author : Eric Yuan
+// my blog: http://eric-yuan.me/
+// part of this code is stolen from http://compvisionlab.wordpress.com/
+
+#include <fstream>
+#include <vector>
+
+using namespace std;
+
+
+int ReverseInt(int i)
+{
+	unsigned char ch1, ch2, ch3, ch4;
+	ch1 = i & 255;
+	ch2 = (i >> 8) & 255;
ch3 = (i >> 16) & 255;
+	ch4 = (i >> 24) & 255;
+	return ((int)ch1 << 24) + ((int)ch2 << 16) + ((int)ch3 << 8) + ch4;
+}
+
+void read_Mnist(string filename, vector<vector<double>> &vec)
+{
+	ifstream file(filename, ios::binary);
+	if (file.is_open())
+	{
+		int magic_number = 0;
+		int number_of_images = 0;
+		int n_rows = 0;
+		int n_cols = 0;
+		file.read((char*)&magic_number, sizeof(magic_number));
+		magic_number = ReverseInt(magic_number);
+		file.read((char*)&number_of_images, sizeof(number_of_images));
+		number_of_images = ReverseInt(number_of_images);
+		file.read((char*)&n_rows, sizeof(n_rows));
+		n_rows = ReverseInt(n_rows);
+		file.read((char*)&n_cols, sizeof(n_cols));
+		n_cols = ReverseInt(n_cols);
+		for (int i = 0; i < number_of_images; ++i)
+		{
+			vector<double> tp;
+			for (int r = 0; r < n_rows; ++r)
+			{
+				for (int c = 0; c < n_cols; ++c)
+				{
+					unsigned char temp = 0;
+					file.read((char*)&temp, sizeof(temp));
+					tp.push_back((double)temp);
+				}
+			}
+			vec.push_back(tp);
+		}
+	}
+	else {
+		cout << "Couldn't Open" << endl;
+	}
+}
+
+void read_Mnist_Label(string filename, vector<double> &vec)
+{
+	ifstream file(filename, ios::binary);
+	if (file.is_open())
+	{
+		int magic_number = 0;
+		int number_of_images = 0;
+		int n_rows = 0;
+		int n_cols = 0;
+		file.read((char*)&magic_number, sizeof(magic_number));
+		magic_number = ReverseInt(magic_number);
+		file.read((char*)&number_of_images, sizeof(number_of_images));
+		number_of_images = ReverseInt(number_of_images);
+		for (int i = 0; i < number_of_images; ++i)
+		{
+			unsigned char temp = 0;
+			file.read((char*)&temp, sizeof(temp));
+			vec[i] = (double)temp;
+		}
+	}
+}

 int main(int argc, char* argv[]) {
-    // Scan tests
-
-    printf("\n");
-    printf("****************\n");
-    printf("** SCAN TESTS **\n");
-    printf("****************\n");
-
-    genArray(SIZE - 1, a, 50);  // Leave a 0 at the end to test that edge case
-    a[SIZE - 1] = 0;
-    printArray(SIZE, a, true);
-
-    // initialize b using StreamCompaction::CPU::scan you implement
-    // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
- // At first all cases passed because b && c are all zeroes. - zeroArray(SIZE, b); - printDesc("cpu scan, power-of-two"); - StreamCompaction::CPU::scan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(SIZE, b, true); - - zeroArray(SIZE, c); - printDesc("cpu scan, non-power-of-two"); - StreamCompaction::CPU::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(NPOT, b, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("naive scan, power-of-two"); - StreamCompaction::Naive::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan - onesArray(SIZE, c); - printDesc("1s array for finding bugs"); - StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ - - zeroArray(SIZE, c); - printDesc("naive scan, non-power-of-two"); - StreamCompaction::Naive::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(NPOT, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, power-of-two"); - StreamCompaction::Efficient::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient scan, non-power-of-two"); - StreamCompaction::Efficient::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, 
b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, power-of-two"); - StreamCompaction::Thrust::scan(SIZE, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); - printCmpResult(SIZE, b, c); - - zeroArray(SIZE, c); - printDesc("thrust scan, non-power-of-two"); - StreamCompaction::Thrust::scan(NPOT, c, a); - printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); - printCmpResult(NPOT, b, c); - - printf("\n"); - printf("*****************************\n"); - printf("** STREAM COMPACTION TESTS **\n"); - printf("*****************************\n"); - - // Compaction tests - - genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case - a[SIZE - 1] = 0; - printArray(SIZE, a, true); - - int count, expectedCount, expectedNPOT; - - // initialize b using StreamCompaction::CPU::compactWithoutScan you implement - // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct. 
- zeroArray(SIZE, b); - printDesc("cpu compact without scan, power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedCount = count; - printArray(count, b, true); - printCmpLenResult(count, expectedCount, b, b); - - zeroArray(SIZE, c); - printDesc("cpu compact without scan, non-power-of-two"); - count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - expectedNPOT = count; - printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - zeroArray(SIZE, c); - printDesc("cpu compact with scan"); - count = StreamCompaction::CPU::compactWithScan(SIZE, c, a); - printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); - printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, power-of-two"); - count = StreamCompaction::Efficient::compact(SIZE, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedCount, b, c); - - zeroArray(SIZE, c); - printDesc("work-efficient compact, non-power-of-two"); - count = StreamCompaction::Efficient::compact(NPOT, c, a); - printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); - printCmpLenResult(count, expectedNPOT, b, c); - - system("pause"); // stop Win32 console from closing on exit - delete[] a; - delete[] b; - delete[] c; + + bool test = true; + int seed = 36; + + + ofstream mnistData; + mnistData.open(R"(..\bookKeeping\mnistLosses.csv)"); + string filename = 
R"(..\data-set\mnist\t10k-images-idx3-ubyte)";
+	int number_of_images = 10000;
+	int number_of_images_considered = 100;
+	int image_size = 28 * 28;
+	int number_of_classes = 10;
+	int numEpochs = 100;
+	int hiddenDimensionsMnist[] = { 64 , 32 };
+	CharacterRecognition::MultiLayerPerceptron *mnistMlp = new CharacterRecognition::MultiLayerPerceptron(image_size + 1, 2, hiddenDimensionsMnist, number_of_classes, number_of_images_considered);
+
+	float *inputMnist;
+	inputMnist = new float[number_of_images * (image_size + 1)];
+
+	float *labelMnist;
+	labelMnist = new float[number_of_images*number_of_classes];
+
+	float *predictedMnist;
+	predictedMnist = new float[number_of_images*number_of_classes];
+
+	//read MNIST image into double vector
+	vector<vector<double>> vec;
+	read_Mnist(filename, vec);
+
+	for (int i = 0; i < number_of_images; i++) {
+		for (int j = 0; j < image_size + 1; j++) {
+			// row stride is image_size + 1 (one extra slot per image for the bias input)
+			if (j == image_size) {
+				inputMnist[i * (image_size + 1) + j] = 1;
+				continue;
+			}
+			inputMnist[i * (image_size + 1) + j] = (vec[i][j] - 0.5) * 2;
+		}
+	}
+
+	filename = R"(..\data-set\mnist\t10k-labels-idx1-ubyte)";
+	//read MNIST label into double vector
+	vector<double> vecLabel(number_of_images);
+	read_Mnist_Label(filename, vecLabel);
+
+
+	for (int i = 0; i < number_of_images; i++) {
+		for (int j = 0; j < number_of_classes; j++) {
+			labelMnist[i * number_of_classes + j] = (j == vecLabel[i]) ?
1.0 : 0.0;
+		}
+	}
+
+	printf("\n");
+	printf("****************\n");
+	printf("**** MNIST *****\n");
+	printf("****************\n");
+	cout << "\tMNIST Training : " << endl;
+	for (int i = 0; i < numEpochs; i++) {
+		mnistMlp->forward(inputMnist, predictedMnist);
+		float loss = mnistMlp->loss(labelMnist, predictedMnist);
+		if (!isnan(loss)) {
+			mnistData << loss << endl;
+		}
+		mnistMlp->backward(labelMnist, predictedMnist, 0.0001);
+	}
+	cout << "\t\tFinal Loss : " << mnistMlp->loss(labelMnist, predictedMnist) << endl;
+
+	cout << "\n\tMNIST Testing : " << endl;
+	mnistMlp->forward(inputMnist, predictedMnist, false);
+	srand(seed);
+	for (int k = 0; k < 5; k++) {
+		float maxTrueValue = 0;
+		int maxTrueIndex = 0;
+		float maxValue = 0;
+		int maxIndex = 0;
+		int i = rand() % number_of_images_considered;
+		for (int j = 0; j < number_of_classes; j++) {
+
+			if (predictedMnist[i * number_of_classes + j] > maxValue) {
+				maxValue = predictedMnist[i * number_of_classes + j];
+				maxIndex = j;
+			}
+			if (labelMnist[i * number_of_classes + j] > maxTrueValue) {
+				maxTrueValue = labelMnist[i * number_of_classes + j];
+				maxTrueIndex = j;
+			}
+		}
+		cout << "\t\tFor True Label = " << maxTrueIndex;
+		cout << ", Predicted = " << maxIndex << endl;
+	}
+
+
+	numEpochs = 1000;
+	int hiddenDimensions[] = { 5 };
+	CharacterRecognition::MultiLayerPerceptron *mlp = new CharacterRecognition::MultiLayerPerceptron(3, 1, hiddenDimensions, 2, 4);
+	ofstream xorData;
+	xorData.open(R"(..\bookKeeping\xorLosses.csv)");
+
+	float inputs[] = { 0, 0, 1,
+					   0, 1, 1,
+					   1, 0, 1,
+					   1, 1, 1 };
+
+	float labels[] = { 1, 0,
+					   0, 1,
+					   0, 1,
+					   1, 0 };
+
+	float *predicted = new float[8];
+
+
+	printf("\n");
+	printf("****************\n");
+	printf("***** XOR ******\n");
+	printf("****************\n");
+	cout << "\tXOR Training : " << endl;
+	for (int i = 0; i < numEpochs; i++) {
+		mlp->forward(inputs, predicted);
+		xorData << mlp->loss(labels, predicted) << endl;
+		mlp->backward(labels, predicted, 0.1);
+	}
mlp->forward(inputs, predicted, false); + cout<<"\t\tFinal Loss : "<< mlp->loss(labels, predicted) << endl; + mlp->forward(inputs, predicted, test); + + cout << "\n\n\tXOR Testing : " << endl; + for (int i = 0; i < 4; i++) { + float maxValue = 0; + int maxIndex = 0; + cout << "\t\tFor "; + for (int j = 0; j < 2; j++) { + cout << "x" << j + 1 << " = " << inputs[i*3 + j]<<" "; + + if (predicted[i * 2 + j] > maxValue) { + maxValue = predicted[i * 2 + j]; + maxIndex = j; + } + } + cout << ", output = " << maxIndex << endl; + } + xorData.close(); + + + + numEpochs = 100; + + int numberOfInstancesAlpha = 52; + int numberOfFeaturesAlpha = 10201; + int numberOfClassesAlpha = 52; + float *input = new float[numberOfInstancesAlpha * numberOfFeaturesAlpha]; + float *true_labels = new float[numberOfInstancesAlpha * numberOfClassesAlpha]; + memset(true_labels, 0, numberOfInstancesAlpha * numberOfClassesAlpha * sizeof(float)); + for (int i = 0; i < numberOfInstancesAlpha; i++) { + ifstream file("S:\\CIS 565\\Project_2\\Project2-Number-Algorithms\\Project2-Character-Recognition\\data-set\\" + ((i + 1 < 10) ? 
to_string(0) : "") + to_string(i + 1) + "info.txt");
+		if (!file.is_open()) {
+			exit(-1);
+		}
+		int count = 0;
+		string line;
+		while (getline(file, line))
+		{
+			count++;
+			if (count == 1) {
+				int index = i * numberOfClassesAlpha + (stof(line) - 1);
+				true_labels[index] = 1;
+			}
+			if (count == 3) {
+				stringstream ssin(line);
+				for (int k = 0; ssin.good() && k < numberOfFeaturesAlpha; k++) {
+					string temp;
+					ssin >> temp;
+					input[(i * numberOfFeaturesAlpha) + k] = stof(temp) / 255;
+				}
+			}
+		}
+		file.close();
+	}
+
+
+	int hiddenDimensionsAlpha[] = { 200 };
+	CharacterRecognition::MultiLayerPerceptron *mlpAlpha = new CharacterRecognition::MultiLayerPerceptron(numberOfFeaturesAlpha, 1, hiddenDimensionsAlpha, numberOfClassesAlpha, numberOfInstancesAlpha);
+	delete[] predicted;
+	predicted = new float[numberOfInstancesAlpha * numberOfClassesAlpha];
+
+
+
+
+	printf("\n");
+	printf("*************************\n");
+	printf("* Character Recognition *\n");
+	printf("*************************\n");
+	cout << "\tCharacter Recognition Training : " << endl;
+	ofstream characterData;
+	characterData.open(R"(..\bookKeeping\characterLosses.csv)");
+	for (int i = 0; i < numEpochs; i++) {
+		mlpAlpha->forward(input, predicted);
+		characterData << mlpAlpha->loss(true_labels, predicted) << endl;
+		mlpAlpha->backward(true_labels, predicted, 0.01);
+	}
+	characterData.close();
+
+	mlpAlpha->forward(input, predicted, false);
+	cout << "\t\tFinal Loss : " << mlpAlpha->loss(true_labels, predicted) << endl;
+
+
+	cout << "\n\tCharacter Recognition Testing : " << endl;
+	srand(seed);
+	for (int k = 0; k < 5; k++) {
+		float maxTrueValue = 0;
+		int maxTrueIndex = 0;
+		float maxValue = 0;
+		int maxIndex = 0;
+		int i = rand() % numberOfInstancesAlpha;
+		for (int j = 0; j < numberOfClassesAlpha; j++) {
+
+			if (predicted[i * numberOfClassesAlpha + j] > maxValue) {
+				maxValue = predicted[i * numberOfClassesAlpha + j];
+				maxIndex = j;
+			}
+			if (true_labels[i * numberOfClassesAlpha + j] > maxTrueValue) {
maxTrueValue = true_labels[i * numberOfClassesAlpha + j];
+				maxTrueIndex = j;
+			}
+		}
+		cout << "\t\tFor True Label = " << (char)((maxTrueIndex % 2 == 0) ? 65 + (maxTrueIndex / 2) : 97 + ((maxTrueIndex - 1) / 2));
+		cout << ", Predicted = " << (char)((maxIndex % 2 == 0) ? 65 + (maxIndex / 2) : 97 + ((maxIndex - 1) / 2)) << endl;
+	}
+
+}
diff --git a/Project2-Stream-Compaction/README.md b/Project2-Stream-Compaction/README.md
index 0e38ddb..0f51a3e 100644
--- a/Project2-Stream-Compaction/README.md
+++ b/Project2-Stream-Compaction/README.md
@@ -3,12 +3,249 @@
 CUDA Stream Compaction
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Author : Kushagra
+  - [LinkedIn](https://www.linkedin.com/in/kushagragoel/)
+* Tested on: Windows 10 Education, Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz 16GB, NVIDIA Quadro P1000 @ 4GB (Moore 100B Lab)

-### (TODO: Your README)
+____________________________________________________________________________________

-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+
+### Table of Contents
+1. [Introduction](#Introduction)
+2. [Scan Algorithm](#algo)
+2.1. [CPU Scan](#cpuscan)
+2.2. [Naive GPU Scan](#naive)
+2.3. [Work-Efficient GPU Scan](#work)
+2.4. [Thrust Implementation](#thrust)
+3. [Stream Compaction Algorithm](#streamcompaction)
+3.1. [CPU Compaction](#cpucompaction)
+3.2. [GPU Compaction](#gpucompaction)
+4. [Radix Sort](#radix)
+5. [Performance Analysis](#performance)
+6. [Outputs](#outputs)
+
+
+
+## Introduction
+
+We have implemented Stream Compaction in CUDA from scratch in this project. Stream compaction is widely used in graphics pipelines like pathtracers.
+In particular, we use stream compaction to remove `0`s from an array of integers, which is similar to removing terminated paths from an array of rays in a pathtracing pipeline.
+
+
+
+## Scan Algorithm
+The scan algorithm reduces an input array to a new array that holds, at each index, the prefix sum of the input. There are different ways to perform this reduction, and we explore a few of them here.
+
+
+
+### CPU Scan
+We simply iterate over the array in a single pass while book-keeping the running sum, storing it into a new array as we go.
+
+
+
+### Naïve Parallel Scan
+
+In a naive attempt to parallelize, we observe that we can compute sums in pairs and then combine those partial sums at the next level, with the sums within each level computed in parallel. For demonstration, let's look at an example :
+
+![](img/NaiveParallelScan.jpg)
+
+
+
+### Work-Efficient GPU Scan
+
+The aim is to go even further beyond, and we achieve that with clever indexing, a two-stage calculation, and a balanced binary tree (didn't expect that, did you?). For stage one, we perform the up-sweep operation, which propagates the partial sums upwards, as we can see in the image :
+
+![](img/upsweep.jpg)
+
+
+For stage two, we trickle the partial sums back down to the array elements to obtain the full prefix sums. Think of this as propagating partial information globally so that each array element can combine the global knowledge with what it knows locally to form the complete picture:
+
+![](img/downsweep.jpg)
+
+#### Extra Credit
+
+Interestingly, following the above algorithm exactly produced poorer results than the CPU. This is because most threads at the lower levels of the tree do no work. We improve on this with even cleverer indexing, which lets us compact the active threads and launch only as many blocks as needed at the lower levels.
+
+
+
+### Thrust Implementation
+
+We can also use the thrust library's thrust::exclusive_scan to calculate the exclusive scan of the array.
+
+
+
+
+## Stream Compaction Algorithm
+
+Here we discuss the compaction algorithms we used to see the difference between the might of the CPU and the GPU.
+
+
+
+### CPU Compaction
+
+We tried two different versions of compaction: one written as traditional CPU-style code, and one that follows the same scan-based logic as our GPU version, for benchmarking.
+
+#### Without scanning
+
+We simply iterate over the array in a single pass, book-keeping the next index of the compacted array while skipping the `0`s we see.
+![](img/streamcompaction.jpg)
+
+#### With scanning
+
+Here we use a simple scan (a single pass over the array) to calculate the indices the surviving elements need to be scattered to. The algorithm is illustrated as follows:
+![](img/Scatter.jpg)
+
+
+
+### GPU Compaction
+
+Here we follow the same approach as CPU compaction with scanning, but perform the scan using our work-efficient GPU scan.
+
+
+
+## Radix Sort
+
+As part of the extra credit, we also implement Radix Sort on the GPU and analyse it. GPU Radix Sort is essentially vanilla radix sort, but we use the work-efficient scan to calculate the new indices from the boolean array for a given bit value.
+
+### Effect of Number of Elements
+
+![](charts/RadixVsNOE.jpg)
+
+### Effect of BlockSize
+
+![](charts/RadixVsBlockSize.jpg)
+
+
+
+
+## Performance Analysis
+
+### Effect of Number of Elements
+
+We perform this analysis by choosing the optimal blockSize for each implementation.
+
+#### When the Number of Elements is a Power of 2
+
+##### Scan Time
+
+![](charts/ScanTimeVsNOE.jpg)
+
+##### Compaction Time
+
+![](charts/CompactionTimeVsNOE.jpg)
+
+#### When the Number of Elements is not a Power of 2
+
+##### Scan Time
+
+![](charts/ScanTimeVsNOEnp2.jpg)
+
+##### Compaction Time
+
+![](charts/CompactionTimeVsNOEnp2.jpg)
+
+
+### Effect of BlockSize
+
+Since the performance of the CPU implementations and Thrust does not depend on blockSize, we report their average times for comparison.
+
+##### Scan Time
+
+![](charts/ScanTimeVsBlockSize.jpg)
+
+By comparison, average scan time when the number of elements is a power of 2:
+* CPU : 4.71 ms
+* Thrust : 3.49 ms
+
+Average scan time when the number of elements is not a power of 2:
+* CPU : 1.76 ms (surprisingly low)
+* Thrust : 3.22 ms
+
+##### Compaction Time
+
+![](charts/CompactionTimeVsBlockSize.jpg)
+
+By comparison, average compaction time when the number of elements is a power of 2:
+* CPU :
+  * with Scan : 10.19077 ms
+  * without Scan : 3.38006 ms
+
+Average compaction time when the number of elements is not a power of 2:
+* CPU without Scan : 2.96 ms
+
+We observe that the naive implementations are far slower than both the CPU and the work-efficient implementations.
+Surprisingly, the CPU implementation without scan is always the fastest, by a wide margin. We suspect this is due to latency from inefficient memory accesses in the GPU code, which we plan to look into in the future.
+As expected, the work-efficient scan is significantly faster than the CPU implementation with scan, and the difference becomes even more evident for large numbers of elements.
+Thrust is usually comparable to the work-efficient scan but falls behind when scanning arrays whose length is not a power of 2. We suspect this is because our improved work-efficient scan compacts the active threads and launches only as many blocks as needed.
+Another big surprise was that the CPU implementation is unexpectedly fast for inputs whose size is not a power of 2.
+
+Finally, we believe memory I/O causes significant bottlenecks in our GPU implementations: when we timed the CUDA memory operations, their durations were on the same order as our kernel executions. For the naive implementation, inefficient computation also contributes heavily to its poor running times.
+
+
+
+## Output
+
+```
+
+****************
+** SCAN TESTS **
+****************
+    [ 35 23 10 8 40 33 46 28 22 20 39 2 16 ... 42 0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 2.4435ms (std::chrono Measured)
+    [ 0 35 58 68 76 116 149 195 223 245 265 304 306 ... 25700115 25700157 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 2.5796ms (std::chrono Measured)
+    [ 0 35 58 68 76 116 149 195 223 245 265 304 306 ... 25700035 25700067 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 14.5613ms (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 13.8198ms (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 3.8544ms (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 4.12861ms (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 110.257ms (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 96.7069ms (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [ 3 3 0 2 2 3 0 0 2 2 3 2 2 ... 2 0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 4.8346ms (std::chrono Measured)
+    [ 3 3 2 2 3 2 2 3 2 2 2 2 2 ... 2 2 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 4.6603ms (std::chrono Measured)
+    [ 3 3 2 2 3 2 2 3 2 2 2 2 2 ... 2 2 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 18.4084ms (std::chrono Measured)
+    [ 3 3 2 2 3 2 2 3 2 2 2 2 2 ...
2 2 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 5.47408ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 5.89123ms (CUDA Measured) + passed + +**************** +** RADIX TESTS ** +**************** +==== radix sort ==== + elapsed time: 23.54ms (CUDA Measured) + [ 35 23 10 8 40 33 46 28 22 20 39 2 16 ... 42 0 ] + passed +Press any key to continue . . . +``` diff --git a/Project2-Stream-Compaction/charts/CompactionTimeVsBlockSize.jpg b/Project2-Stream-Compaction/charts/CompactionTimeVsBlockSize.jpg new file mode 100644 index 0000000..b52a93f Binary files /dev/null and b/Project2-Stream-Compaction/charts/CompactionTimeVsBlockSize.jpg differ diff --git a/Project2-Stream-Compaction/charts/CompactionTimeVsNOE.jpg b/Project2-Stream-Compaction/charts/CompactionTimeVsNOE.jpg new file mode 100644 index 0000000..7349d73 Binary files /dev/null and b/Project2-Stream-Compaction/charts/CompactionTimeVsNOE.jpg differ diff --git a/Project2-Stream-Compaction/charts/CompactionTimeVsNOEnp2.jpg b/Project2-Stream-Compaction/charts/CompactionTimeVsNOEnp2.jpg new file mode 100644 index 0000000..9bf9af8 Binary files /dev/null and b/Project2-Stream-Compaction/charts/CompactionTimeVsNOEnp2.jpg differ diff --git a/Project2-Stream-Compaction/charts/RadixVsBlockSize.jpg b/Project2-Stream-Compaction/charts/RadixVsBlockSize.jpg new file mode 100644 index 0000000..db16d1f Binary files /dev/null and b/Project2-Stream-Compaction/charts/RadixVsBlockSize.jpg differ diff --git a/Project2-Stream-Compaction/charts/RadixVsNOE.jpg b/Project2-Stream-Compaction/charts/RadixVsNOE.jpg new file mode 100644 index 0000000..c6eefe8 Binary files /dev/null and b/Project2-Stream-Compaction/charts/RadixVsNOE.jpg differ diff --git a/Project2-Stream-Compaction/charts/ScanTimeVsBlockSize.jpg b/Project2-Stream-Compaction/charts/ScanTimeVsBlockSize.jpg new file mode 100644 index 0000000..22f40ab Binary files /dev/null and 
b/Project2-Stream-Compaction/charts/ScanTimeVsBlockSize.jpg differ diff --git a/Project2-Stream-Compaction/charts/ScanTimeVsNOE.jpg b/Project2-Stream-Compaction/charts/ScanTimeVsNOE.jpg new file mode 100644 index 0000000..1697d9e Binary files /dev/null and b/Project2-Stream-Compaction/charts/ScanTimeVsNOE.jpg differ diff --git a/Project2-Stream-Compaction/charts/ScanTimeVsNOEnp2.jpg b/Project2-Stream-Compaction/charts/ScanTimeVsNOEnp2.jpg new file mode 100644 index 0000000..3a8b789 Binary files /dev/null and b/Project2-Stream-Compaction/charts/ScanTimeVsNOEnp2.jpg differ diff --git a/Project2-Stream-Compaction/img/NaiveParallelScan.jpg b/Project2-Stream-Compaction/img/NaiveParallelScan.jpg new file mode 100644 index 0000000..0eaad61 Binary files /dev/null and b/Project2-Stream-Compaction/img/NaiveParallelScan.jpg differ diff --git a/Project2-Stream-Compaction/img/Scatter.jpg b/Project2-Stream-Compaction/img/Scatter.jpg new file mode 100644 index 0000000..a678f71 Binary files /dev/null and b/Project2-Stream-Compaction/img/Scatter.jpg differ diff --git a/Project2-Stream-Compaction/img/downsweep.jpg b/Project2-Stream-Compaction/img/downsweep.jpg new file mode 100644 index 0000000..cc7bb0b Binary files /dev/null and b/Project2-Stream-Compaction/img/downsweep.jpg differ diff --git a/Project2-Stream-Compaction/img/streamcompaction.jpg b/Project2-Stream-Compaction/img/streamcompaction.jpg new file mode 100644 index 0000000..f576e35 Binary files /dev/null and b/Project2-Stream-Compaction/img/streamcompaction.jpg differ diff --git a/Project2-Stream-Compaction/img/upsweep.jpg b/Project2-Stream-Compaction/img/upsweep.jpg new file mode 100644 index 0000000..90cbfcc Binary files /dev/null and b/Project2-Stream-Compaction/img/upsweep.jpg differ diff --git a/Project2-Stream-Compaction/src/main.cpp b/Project2-Stream-Compaction/src/main.cpp index d016553..ebded4b 100644 --- a/Project2-Stream-Compaction/src/main.cpp +++ b/Project2-Stream-Compaction/src/main.cpp @@ -11,9 +11,13 @@ 
#include #include #include +#include #include "testing_helpers.hpp" +#include -const int SIZE = 1 << 8; // feel free to change the size of array +using namespace std; + +const int SIZE = 1 << 20; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -147,6 +151,23 @@ int main(int argc, char* argv[]) { //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + + + + printf("\n"); + printf("****************\n"); + printf("** RADIX TESTS **\n"); + printf("****************\n"); + + genArray(SIZE - 1, a, 50); + zeroArray(SIZE, c); + printDesc("radix sort"); + StreamCompaction::Radix::radix(SIZE, c, a); + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(SIZE, a, true); + sort(a, a + SIZE); + printCmpResult(SIZE, a, c); + system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; diff --git a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt index cdbef77..fc37515 100644 --- a/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt +++ b/Project2-Stream-Compaction/stream_compaction/CMakeLists.txt @@ -9,9 +9,11 @@ set(SOURCE_FILES "efficient.cu" "thrust.h" "thrust.cu" + "radix.h" + "radix.cu" ) cuda_add_library(stream_compaction ${SOURCE_FILES} - OPTIONS -arch=sm_20 + OPTIONS -arch=sm_61 ) diff --git a/Project2-Stream-Compaction/stream_compaction/common.cu b/Project2-Stream-Compaction/stream_compaction/common.cu index 2ed6d63..7ff4f81 100644 --- a/Project2-Stream-Compaction/stream_compaction/common.cu +++ b/Project2-Stream-Compaction/stream_compaction/common.cu @@ -24,6 +24,11 @@ namespace StreamCompaction { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // TODO + int index = threadIdx.x + (blockDim.x * blockIdx.x); + if (index >= n) { + return; + } + bools[index] = 
idata[index] == 0 ? 0 : 1; } /** @@ -33,6 +38,15 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + int index = threadIdx.x + (blockDim.x * blockIdx.x); + if (index >= n) { + return; + } + + if (bools[index] == 1) { + odata[indices[index]] = idata[index]; + } + } } diff --git a/Project2-Stream-Compaction/stream_compaction/common.h b/Project2-Stream-Compaction/stream_compaction/common.h index 996997e..f3594aa 100644 --- a/Project2-Stream-Compaction/stream_compaction/common.h +++ b/Project2-Stream-Compaction/stream_compaction/common.h @@ -2,6 +2,7 @@ #include #include +#include "device_launch_parameters.h" #include #include diff --git a/Project2-Stream-Compaction/stream_compaction/cpu.cu b/Project2-Stream-Compaction/stream_compaction/cpu.cu index a2d3e6c..14d85c4 100644 --- a/Project2-Stream-Compaction/stream_compaction/cpu.cu +++ b/Project2-Stream-Compaction/stream_compaction/cpu.cu @@ -18,9 +18,29 @@ namespace StreamCompaction { * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. */ void scan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); + bool exception = true; + try { + timer().startCpuTimer(); + exception = false; + } + catch (const std::exception& e) { + exception = true; + } + // TODO - timer().endCpuTimer(); + odata[0] = 0; + for (int i = 1; i < n; i++) { + odata[i] = odata[i - 1] + idata[i-1]; + } + + try { + if (exception == false) { + timer().endCpuTimer(); + } + } + catch (const std::exception& e) { + + } } /** @@ -29,10 +49,33 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. 
*/ int compactWithoutScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); + + bool exception = true; + try { + timer().startCpuTimer(); + exception = false; + } + catch (const std::exception& e) { + exception = true; + } + // TODO - timer().endCpuTimer(); - return -1; + int compactedIndex = 0; + for (int i = 0; i < n; i++) { + if (idata[i] != 0) { + odata[compactedIndex++] = idata[i]; + } + } + + try { + if (exception == false) { + timer().endCpuTimer(); + } + } + catch (const std::exception& e) { + + } + return compactedIndex; } /** @@ -41,10 +84,41 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { - timer().startCpuTimer(); + bool exception = true; + try { + timer().startCpuTimer(); + exception = false; + } + catch (const std::exception& e) { + exception = true; + } + + int *binaryMap = new int[n]; + int *scannedBinaryArray = new int[n]; + int numOfElements = 0; // TODO - timer().endCpuTimer(); - return -1; + for (int i = 0; i < n; i++) { + binaryMap[i] = idata[i] == 0 ? 
0 : 1; + } + + scan(n, scannedBinaryArray, binaryMap); + + for (int i = 0; i < n; i++) { + if (binaryMap[i] == 1) { + odata[scannedBinaryArray[i]] = idata[i]; + numOfElements++; + } + } + + try { + if (exception == false) { + timer().endCpuTimer(); + } + } + catch (const std::exception& e) { + + } + return numOfElements; } } } diff --git a/Project2-Stream-Compaction/stream_compaction/efficient.cu b/Project2-Stream-Compaction/stream_compaction/efficient.cu index 2db346e..eb69db8 100644 --- a/Project2-Stream-Compaction/stream_compaction/efficient.cu +++ b/Project2-Stream-Compaction/stream_compaction/efficient.cu @@ -3,6 +3,13 @@ #include "common.h" #include "efficient.h" +#define blockSize 128 + +int *devIdataEfficient; +int *devIdataEfficientCompact; +int *devIdataEfficientBinaryMap; +int *devIdataEfficientNewIndices; + namespace StreamCompaction { namespace Efficient { using StreamCompaction::Common::PerformanceTimer; @@ -12,14 +19,105 @@ namespace StreamCompaction { return timer; } - /** - * Performs prefix-sum (aka scan) on idata, storing the result into odata. - */ - void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); - // TODO - timer().endGpuTimer(); - } + __global__ void upSweep(int n, int d, int *input) { + int index = (threadIdx.x + (blockDim.x * blockIdx.x))*(1 << (d + 1)); + if (index + (1 << (d + 1)) - 1 >= n) { + return; + } + input[index + (1 << (d + 1)) - 1] += input[index + (1 << d) - 1]; + } + + + __global__ void downSweep(int n, int d, int *input) { + int index = (threadIdx.x + (blockDim.x * blockIdx.x))*(1 << (d + 1)); + if (index + (1 << (d + 1)) - 1 >= n) { + return; + } + int t = input[index + (1 << d) - 1]; + input[index + (1 << d) - 1] = input[index + (1 << (d + 1)) - 1]; + input[index + (1 << (d + 1)) - 1] += t; + } + + /** + * Performs prefix-sum (aka scan) on idata, storing the result into odata. 
+ */
+        void scan(int n, int *odata, const int *idata) {
+            bool exception = true;
+            try {
+                timer().startGpuTimer();
+                exception = false;
+            }
+            catch (const std::exception& e) {
+                exception = true;
+            }
+
+            int newSize = 1 << ilog2ceil(n);
+            cudaMalloc((void**)&devIdataEfficient, newSize * sizeof(int));
+            checkCUDAError("cudaMalloc devIdataEfficient failed");
+            cudaMemset(devIdataEfficient, 0, newSize * sizeof(int));
+            cudaMemcpy(devIdataEfficient, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            // TODO
+            for (int d = 0; d < ilog2(newSize); d++) {
+                dim3 fullBlocksPerGrid((((int)(newSize / (1 << (d + 1)))) + blockSize - 1) / blockSize);
+                upSweep<<<fullBlocksPerGrid, blockSize>>>(newSize, d, devIdataEfficient);
+            }
+
+            cudaMemset(devIdataEfficient + (newSize - 1), 0, 1 * sizeof(int));
+
+            for (int d = ilog2(newSize) - 1; d >= 0; d--) {
+                dim3 fullBlocksPerGrid(((1 << (ilog2(newSize) - d)) + blockSize - 1) / blockSize);
+                downSweep<<<fullBlocksPerGrid, blockSize>>>(newSize, d, devIdataEfficient);
+            }
+            cudaMemcpy(odata, devIdataEfficient, n * sizeof(int), cudaMemcpyDeviceToHost);
+            cudaFree(devIdataEfficient);
+
+            try {
+                if (exception == false) {
+                    timer().endGpuTimer();
+                }
+            }
+            catch (const std::exception& e) {
+
+            }
+        }
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+ */
+        void scanForRadix(int n, int *odata, const int *idata, int radixBlockSize) {
+            int newSize = 1 << ilog2ceil(n);
+
+            cudaMemset(odata, 0, newSize * sizeof(int));
+            checkCUDAError("cudaMemset");
+            cudaMemcpy(odata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy");
+
+            // TODO
+            for (int d = 0; d < ilog2(newSize); d++) {
+                dim3 fullBlocksPerGrid((((int)(newSize / (1 << (d + 1)))) + radixBlockSize - 1) / radixBlockSize);
+                upSweep<<<fullBlocksPerGrid, radixBlockSize>>>(newSize, d, odata);
+            }
+
+            cudaMemset(odata + (newSize - 1), 0, 1 * sizeof(int));
+
+            for (int d = ilog2(newSize) - 1; d >= 0; d--) {
+                dim3 fullBlocksPerGrid(((1 << (ilog2(newSize) - d)) + radixBlockSize - 1) / radixBlockSize);
+                downSweep<<<fullBlocksPerGrid, radixBlockSize>>>(newSize, d, odata);
+            }
+        }

        /**
         * Performs stream compaction on idata, storing the result into odata.
@@ -31,10 +129,64 @@
         * @returns The number of elements remaining after compaction.
         */
        int compact(int n, int *odata, const int *idata) {
-            timer().startGpuTimer();
            // TODO
-            timer().endGpuTimer();
-            return -1;
+
+            cudaMalloc((void**)&devIdataEfficientCompact, n * sizeof(int));
+            checkCUDAError("cudaMalloc devIdataEfficientCompact failed");
+            cudaMemcpy(devIdataEfficientCompact, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            int *binaryMap = new int[n];
+            cudaMalloc((void**)&devIdataEfficientBinaryMap, n * sizeof(int));
+            checkCUDAError("cudaMalloc devIdataEfficientBinaryMap failed");
+            cudaMemset(devIdataEfficientBinaryMap, 0, n * sizeof(int));
+
+            int *newIndices = new int[n];
+            cudaMalloc((void**)&devIdataEfficientNewIndices, n * sizeof(int));
+            checkCUDAError("cudaMalloc devIdataEfficientNewIndices failed");
+
+            dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+            StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(n, devIdataEfficientBinaryMap, devIdataEfficientCompact);
+            cudaMemcpy(binaryMap, devIdataEfficientBinaryMap, n * sizeof(int), cudaMemcpyDeviceToHost);
+
+            bool exception = true;
+
try {
+                timer().startGpuTimer();
+                exception = false;
+            }
+            catch (const std::exception& e) {
+                exception = true;
+            }
+            scan(n, newIndices, binaryMap);
+            cudaMemcpy(devIdataEfficientNewIndices, newIndices, n * sizeof(int), cudaMemcpyHostToDevice);
+
+            int newSize = newIndices[n - 1] + binaryMap[n - 1];
+            cudaMalloc((void**)&devIdataEfficient, newSize * sizeof(int));
+            checkCUDAError("cudaMalloc devIdataEfficient failed");
+
+            StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(n, devIdataEfficient, devIdataEfficientCompact, devIdataEfficientBinaryMap, devIdataEfficientNewIndices);
+
+            try {
+                if (exception == false) {
+                    timer().endGpuTimer();
+                }
+            }
+            catch (const std::exception& e) {
+
+            }
+            cudaMemcpy(odata, devIdataEfficient, newSize * sizeof(int), cudaMemcpyDeviceToHost);
+
+            cudaFree(devIdataEfficient);
+            cudaFree(devIdataEfficientBinaryMap);
+            cudaFree(devIdataEfficientCompact);
+            cudaFree(devIdataEfficientNewIndices);
+
+            return newSize;
        }
    }
}
diff --git a/Project2-Stream-Compaction/stream_compaction/efficient.h b/Project2-Stream-Compaction/stream_compaction/efficient.h
index 803cb4f..dce2335 100644
--- a/Project2-Stream-Compaction/stream_compaction/efficient.h
+++ b/Project2-Stream-Compaction/stream_compaction/efficient.h
@@ -6,7 +6,9 @@ namespace StreamCompaction {
    namespace Efficient {
        StreamCompaction::Common::PerformanceTimer& timer();

-        void scan(int n, int *odata, const int *idata);
+        void scan(int n, int *odata, const int *idata);
+
+        void scanForRadix(int n, int *odata, const int *idata, int radixBlockSize);

        int compact(int n, int *odata, const int *idata);
    }
diff --git a/Project2-Stream-Compaction/stream_compaction/naive.cu b/Project2-Stream-Compaction/stream_compaction/naive.cu
index 4308876..1e2d432 100644
--- a/Project2-Stream-Compaction/stream_compaction/naive.cu
+++ b/Project2-Stream-Compaction/stream_compaction/naive.cu
@@ -3,6 +3,11 @@
 #include "common.h"
 #include "naive.h"

+#define blockSize 128
+
+int *devIdata;
+int *devOdata;
+
+namespace
StreamCompaction {
    namespace Naive {
        using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +17,40 @@
            return timer;
        }
        // TODO: __global__
+        __global__ void calculatePartialSum(int n, int d, int *odata, int *idata) {
+            int index = threadIdx.x + (blockDim.x * blockIdx.x);
+            if (index >= n) {
+                return;
+            }
+
+            odata[index] = ((index >= (1 << (d - 1))) ? (idata[index - (1 << (d - 1))]) : 0) + idata[index];
+        }

        /**
         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
         */
        void scan(int n, int *odata, const int *idata) {
            timer().startGpuTimer();
+
+            cudaMalloc((void**)&devIdata, n * sizeof(int));
+            checkCUDAError("cudaMalloc devIdata failed");
+            cudaMemcpy(devIdata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            cudaMalloc((void**)&devOdata, n * sizeof(int));
+            checkCUDAError("cudaMalloc devOdata failed");
+
+            // TODO
+            for (int d = 1; d <= ilog2ceil(n); d++) {
+                dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+                calculatePartialSum<<<fullBlocksPerGrid, blockSize>>>(n, d, devOdata, devIdata);
+                cudaMemcpy(devIdata, devOdata, n * sizeof(int), cudaMemcpyDeviceToDevice);
+            }
+            cudaMemcpy(odata + 1, devOdata, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
+            odata[0] = 0;
+
+            cudaFree(devIdata);
+            cudaFree(devOdata);
            timer().endGpuTimer();
        }
    }
diff --git a/Project2-Stream-Compaction/stream_compaction/radix.cu b/Project2-Stream-Compaction/stream_compaction/radix.cu
new file mode 100644
index 0000000..ef46cf9
--- /dev/null
+++ b/Project2-Stream-Compaction/stream_compaction/radix.cu
@@ -0,0 +1,127 @@
+#include
+#include
+#include "common.h"
+#include "radix.h"
+#include "efficient.h"
+
+#define blockSize 128
+
+
+namespace StreamCompaction {
+    namespace Radix {
+        using StreamCompaction::Common::PerformanceTimer;
+        PerformanceTimer& timer()
+        {
+            static PerformanceTimer timer;
+            return timer;
+        }
+        // TODO: __global__
+
+        __global__ void kernComputeE(int n, int bitPos, int *input, int *e) {
+            int index = (blockIdx.x * blockDim.x) +
threadIdx.x;
+
+            if (index < n) {
+                e[index] = (((input[index] >> bitPos) & 1) == 0) ? 1 : 0;
+            }
+        }
+
+        __global__ void kernComputeTotalFalses(int n, int* totalFalses, int *e, int *f) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+            if (index < n) {
+                *totalFalses = e[n - 1] + f[n - 1];
+            }
+        }
+
+        __global__ void kernComputeD(int n, int *e, int *t, int *f, int *d) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+            if (index < n) {
+                d[index] = e[index] ? f[index] : t[index];
+            }
+        }
+
+        __global__ void kernComputeT(int n, int *totalFalses, int *t, int *f) {
+            int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+
+            if (index < n) {
+                t[index] = index - f[index] + (*totalFalses);
+            }
+        }
+
+        /**
+         * Performs scatter on an array. That is, each element of idata is
+         * written to odata at the position given by indices.
+         */
+        __global__ void kernScatter(int n, int *odata,
+                const int *idata, const int *indices) {
+            // TODO
+            int index = threadIdx.x + (blockDim.x * blockIdx.x);
+            if (index >= n) {
+                return;
+            }
+            odata[indices[index]] = idata[index];
+        }
+
+        /**
+         * Performs an LSB radix sort on idata, storing the result into odata.
+ */
+        void radix(int n, int *odata, const int *idata) {
+
+            int *devRadixData;
+            cudaMalloc((void **)&devRadixData, n * sizeof(int));
+            checkCUDAError("cudaMalloc");
+            cudaMemcpy(devRadixData, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAError("cudaMemcpy");
+
+            int *e;
+            cudaMalloc((void **)&e, n * sizeof(int));
+            checkCUDAError("cudaMalloc");
+
+            int *f;
+            int newSize = 1 << ilog2ceil(n);
+            cudaMalloc((void **)&f, newSize * sizeof(int));
+            checkCUDAError("cudaMalloc");
+
+            int *t;
+            cudaMalloc((void **)&t, n * sizeof(int));
+            checkCUDAError("cudaMalloc");
+
+            int *d;
+            cudaMalloc((void **)&d, n * sizeof(int));
+            checkCUDAError("cudaMalloc");
+
+            int *output;
+            cudaMalloc((void **)&output, n * sizeof(int));
+            checkCUDAError("cudaMalloc");
+
+            int *totalFalses;
+            cudaMalloc((void **)&totalFalses, 1 * sizeof(int));
+            checkCUDAError("cudaMalloc");
+
+            int gridRows = (n + blockSize - 1) / blockSize;
+
+            timer().startGpuTimer();
+
+            for (int bitPos = 0; bitPos < 6; bitPos++) {
+
+                kernComputeE<<<gridRows, blockSize>>>(n, bitPos, devRadixData, e);
+                StreamCompaction::Efficient::scanForRadix(n, f, e, blockSize);
+                kernComputeTotalFalses<<<gridRows, blockSize>>>(n, totalFalses, e, f);
+                kernComputeT<<<gridRows, blockSize>>>(n, totalFalses, t, f);
+                kernComputeD<<<gridRows, blockSize>>>(n, e, t, f, d);
+                kernScatter<<<gridRows, blockSize>>>(n, output, devRadixData, d);
+                cudaMemcpy(devRadixData, output, n * sizeof(int), cudaMemcpyDeviceToDevice);
+            }
+            timer().endGpuTimer();
+            cudaMemcpy(odata, devRadixData, n * sizeof(int), cudaMemcpyDeviceToHost);
+        }
+    }
+}
diff --git a/Project2-Stream-Compaction/stream_compaction/radix.h b/Project2-Stream-Compaction/stream_compaction/radix.h
new file mode 100644
index 0000000..0921ad6
--- /dev/null
+++ b/Project2-Stream-Compaction/stream_compaction/radix.h
@@ -0,0 +1,11 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+    namespace Radix {
+        StreamCompaction::Common::PerformanceTimer& timer();
+
+        void radix(int n, int *odata, const int *idata);
+    }
+}
diff --git a/Project2-Stream-Compaction/stream_compaction/thrust.cu b/Project2-Stream-Compaction/stream_compaction/thrust.cu
index 1def45e..c55fd9a 100644
--- a/Project2-Stream-Compaction/stream_compaction/thrust.cu
+++ b/Project2-Stream-Compaction/stream_compaction/thrust.cu
@@ -19,9 +19,15 @@
         */
        void scan(int n, int *odata, const int *idata) {
            timer().startGpuTimer();
+
            // TODO use `thrust::exclusive_scan`
            // example: for device_vectors dv_in and dv_out:
            // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::device_vector<int> dv_in(idata, idata + n);
+            thrust::device_vector<int> dv_out(odata, odata + n);
+            thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::copy(dv_out.begin(), dv_out.end(), odata);
+
            timer().endGpuTimer();
        }
    }
diff --git a/README.md b/README.md
index 3a0b2fe..348391c 100644
--- a/README.md
+++ b/README.md
@@ -3,14 +3,13 @@ CUDA Number Algorithms

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Author : Kushagra
+  - [LinkedIn](https://www.linkedin.com/in/kushagragoel/)
+* Tested on: Windows 10 Education, Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz 16GB, NVIDIA Quadro P1000 @ 4GB (Moore 100B Lab)

-### (TODO: Your README)
+### Subprojects
+* [Character Recognition](https://github.com/Kushagra-Goel/Project2-Number-Algorithms/blob/master/Project2-Character-Recognition/README.md)
+* [Stream Compaction](https://github.com/Kushagra-Goel/Project2-Number-Algorithms/blob/master/Project2-Stream-Compaction/README.md)

-Link to the readmes of the other two subprojects.
-Add anything else you think is relevant up to this point.
-(Remember, this is public, so don't put anything here that you don't want to share with the world.)