GoogleNet training successful using PSGD method-- will Adam, AdaDelta work? #140

Open
nhe150 opened this issue Jul 1, 2016 · 4 comments

Comments

@nhe150

nhe150 commented Jul 1, 2016

  1. I used a modified version of SparkNet.
  2. I successfully trained GoogleNet from scratch on 2 machines, covering the entire ImageNet training set (1,281,167 images).
  3. I reached 62.3% top-1 and 84.7% top-5 accuracy in 26 epochs.

Hopefully some statistician can prove that PSGD works, at least with simple momentum and Nesterov momentum. For AdaDelta and Adam, which use squared gradients, I am not sure about the implications of PSGD.
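For readers unfamiliar with the setup, here is a minimal sketch of the kind of scheme being discussed. It assumes "PSGD" means that each worker runs momentum SGD on its own data shard and the weights are averaged only occasionally; the thread does not spell out the exact procedure, so this is an illustration rather than the author's implementation. The toy 1-D regression problem, the function names, the hyperparameters, and the choice to keep each worker's momentum buffer local between averaging rounds are all assumptions.

```python
import random

def make_shard(n, true_w, lo, hi, seed):
    """Toy shard: noisy samples of y = true_w * x with x drawn from [lo, hi]."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0.0, 0.01)) for x in xs]

def local_momentum_sgd(w, v, shard, steps, lr, mu, rng):
    """Run `steps` momentum-SGD updates on one worker's shard."""
    for _ in range(steps):
        x, y = rng.choice(shard)
        grad = 2.0 * (w * x - y) * x   # gradient of (w*x - y)^2 w.r.t. w
        v = mu * v - lr * grad         # classical momentum update
        w += v
    return w, v

def parallel_sgd(shards, rounds=50, local_steps=20, lr=0.05, mu=0.9, seed=0):
    """Local SGD on every shard, then average the weights once per round."""
    rng = random.Random(seed)
    w = 0.0
    # Assumption: momentum buffers stay on the workers and are not averaged.
    velocities = [0.0] * len(shards)
    for _ in range(rounds):
        local_ws = []
        for i, shard in enumerate(shards):
            w_i, velocities[i] = local_momentum_sgd(
                w, velocities[i], shard, local_steps, lr, mu, rng)
            local_ws.append(w_i)
        w = sum(local_ws) / len(local_ws)  # the only communication step
    return w

# Two workers that see disjoint input ranges (deliberately different data).
shards = [make_shard(200, 3.0, -1.0, 0.0, seed=1),
          make_shard(200, 3.0, 0.0, 1.0, seed=2)]
w_final = parallel_sgd(shards)
print(f"recovered weight: {w_final:.3f} (true value 3.0)")
```

On a convex toy problem like this one, averaging is well behaved; it says nothing by itself about a non-convex network such as GoogleNet, which is the open question raised here.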

@robertnishihara
Member

Nice work! Thanks for running the benchmark.

@nhe150
Author

nhe150 commented Jul 7, 2016

The key to distributed data training is to start from a model whose top-1 accuracy on GoogleNet is already high enough (say at least 5%; I call this step the "first opinion"). Otherwise SparkNet gets stuck at random guessing instead of learning its way to higher accuracy.

I am starting to suspect that this initial opinion is critical and may cause bias later on.
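The "first opinion" step can be read as a simple gate: train one model until its top-1 accuracy crosses a threshold, and only then hand the weights to the distributed workers. The sketch below shows that gate under stated assumptions. `warm_start`, `train_step`, and `evaluate_top1` are hypothetical names, not SparkNet APIs, and the fake model is only there so the snippet runs on its own.

```python
def warm_start(train_step, evaluate_top1, threshold=0.05,
               eval_every=100, max_steps=10_000):
    """Train on a single worker until top-1 accuracy reaches `threshold`.

    Returns (reached, steps_taken). Both callables are supplied by the caller.
    """
    for step in range(1, max_steps + 1):
        train_step()
        # Evaluate only every `eval_every` steps to keep the check cheap.
        if step % eval_every == 0 and evaluate_top1() >= threshold:
            return True, step
    return False, max_steps

# Toy stand-ins: a "model" whose accuracy grows linearly with training steps.
progress = {"steps": 0}

def fake_train_step():
    progress["steps"] += 1

def fake_top1():
    return min(1.0, progress["steps"] / 3000.0)

reached, steps = warm_start(fake_train_step, fake_top1)
print(reached, steps)
```

Only when `reached` is true would the weights be broadcast to the workers. The 5% default mirrors the figure mentioned above and is not a tuned value.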

@nhe150
Author

nhe150 commented Jul 8, 2016

Hopefully some statistician can start from the above observation and show that first-order methods under PSGD, such as momentum and Nesterov momentum, converge from a reasonable starting point when the workers process dramatically different data in parallel with only occasional communication.
I will also try methods that use squared gradients (AdaDelta, Adam, RMSProp, etc.) to check convergence. There is strong evidence suggesting that these methods also converge under PSGD.
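One way to set up such an experiment is sketched below: each worker runs Adam locally, and only the weights are averaged between rounds, while the first- and second-moment estimates stay on the worker. That state-handling choice is an assumption (averaging or resetting the moments are equally plausible alternatives), and the toy problem, function names, and hyperparameters are illustrative only.

```python
import math
import random

def make_shard(n, true_w, lo, hi, seed):
    """Toy shard: noisy samples of y = true_w * x with x drawn from [lo, hi]."""
    rng = random.Random(seed)
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0.0, 0.01)) for x in xs]

def local_adam(w, state, shard, steps, lr, rng, b1=0.9, b2=0.999, eps=1e-8):
    """Run `steps` Adam updates on one shard; `state` is (m, v, t)."""
    m, v, t = state
    for _ in range(steps):
        x, y = rng.choice(shard)
        grad = 2.0 * (w * x - y) * x
        t += 1
        m = b1 * m + (1.0 - b1) * grad          # first-moment estimate
        v = b2 * v + (1.0 - b2) * grad * grad   # second moment (squared gradients)
        m_hat = m / (1.0 - b1 ** t)             # bias correction
        v_hat = v / (1.0 - b2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, (m, v, t)

def parallel_adam(shards, rounds=100, local_steps=20, lr=0.05, seed=0):
    """Local Adam on every shard, then average the weights once per round."""
    rng = random.Random(seed)
    w = 0.0
    # Assumption: each worker keeps its own (m, v, t); only w is averaged.
    states = [(0.0, 0.0, 0)] * len(shards)
    for _ in range(rounds):
        local_ws = []
        for i, shard in enumerate(shards):
            w_i, states[i] = local_adam(w, states[i], shard, local_steps, lr, rng)
            local_ws.append(w_i)
        w = sum(local_ws) / len(local_ws)
    return w

shards = [make_shard(200, 3.0, -1.0, 0.0, seed=1),
          make_shard(200, 3.0, 0.0, 1.0, seed=2)]
w_adam = parallel_adam(shards)
print(f"recovered weight: {w_adam:.3f} (true value 3.0)")
```

Because the second-moment estimate rescales every step, workers with very different gradient magnitudes take differently sized steps before averaging. That is the part a convergence argument would have to account for.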

@nhe150
Author

nhe150 commented Jul 8, 2016

And it seems that PSGD converges faster than all the other methods. I will call this "clustering wisdom" in training.


2 participants