I realized that we could gain an impressive amount of computation time by vectorizing the computation to solve several minimization problems simultaneously. Instead of scalar values for l1, l2 and tv, we could provide vectors of length k to solve k minimization problems at once. If we provide l1 = [0.1, 0.9], l2 = [0.9, 0.1], tv = [0.2, 0.8], we would solve two problems: one with l1, l2, tv = 0.1, 0.9, 0.2 and one with l1, l2, tv = 0.9, 0.1, 0.8. We would simultaneously estimate k beta vectors, and model.predict(X) would then provide k predictions. This would be the only consequence for the global API.
First, this is useful for multinomial classification problems.
Second, such a framework should considerably speed up the computation over a grid of parameters. The rationale is the following: most of the computation is spent in np.dot(X, beta), and my hypothesis is that the time-consuming part is loading the X data into the L3 cache of the processor. I realized this by observing a considerable slowdown when using several cores of the same processor: using the 8 cores of a single processor slows down the minimization by a factor of 5, which is very poor scaling! So if the same X data is shared by several problems (beta is a p x k array instead of a p x 1 vector), we could solve several problems simultaneously with minimal overhead. Test it with the following code:
import time

import matplotlib.pyplot as plt
import numpy as np

n_features = int(550 ** 2)   # 302,500 features (a 550 x 550 image)
n_samples = 200
X = np.random.randn(n_samples, n_features)   # roughly 0.5 GB in float64
beta = np.random.randn(n_features, 1)
y = np.dot(X, beta)

sizes = np.arange(1, 101)
elapsed = np.zeros(sizes.shape[0])
for s in sizes:
    # k = s problems solved at once: beta is (p, s) instead of (p, 1)
    beta = np.random.randn(X.shape[1], s)
    t_ = time.perf_counter()
    # regression loss gradient pattern: X.T (X beta - y)
    grad = np.dot(X.T, np.dot(X, beta) - y)
    elapsed[s - 1] = time.perf_counter() - t_

# plt.plot(sizes, elapsed)
# plt.plot([1, sizes[-1]], [elapsed[0], elapsed[0] * sizes[-1]])
plt.plot(sizes, elapsed / elapsed[0])
plt.xlabel("beta.shape[1] (in X * beta)")
plt.ylabel("Ratio of CPU time: time(X beta, k columns) / time(X beta, 1 column)")
plt.show()
I observed an acceleration factor of 10 (Anaconda with single-threaded MKL): solving 50 problems at once (beta.shape = [p, 50]) is only 5 times slower than solving a single problem.
Most of the code should move smoothly to a vectorized version, since beta is a [p x 1] vector and can easily be extended to [p x k]. See an example of FISTA:
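Below is a minimal sketch, not the package's actual FISTA implementation, of what a column-vectorized FISTA loop could look like for the smooth-loss + l1 (elastic-net-like) case. The names (fista_elastic_net, soft_threshold, step, l1_vec, l2_vec, eps), the single shared step size and the omission of the TV term are assumptions made only for illustration:

import numpy as np

def soft_threshold(v, t):
    # Column-wise soft-thresholding: prox of t * ||.||_1 (t may be a length-k vector).
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista_elastic_net(X, y, l1_vec, l2_vec, max_iter=500, eps=1e-6):
    # Solve k problems at once:
    #   min_b 0.5 * ||X b - y||^2 + 0.5 * l2_vec[j] * ||b||^2 + l1_vec[j] * ||b||_1
    # for each column b = beta[:, j].
    l1_vec = np.atleast_1d(np.asarray(l1_vec, dtype=float))
    l2_vec = np.atleast_1d(np.asarray(l2_vec, dtype=float))
    y = y.reshape(-1, 1)                       # broadcasts against the (n, k) residuals
    p, k = X.shape[1], l1_vec.shape[0]
    # One conservative step size valid for every column (spectral norm of X).
    L = np.linalg.norm(X, 2) ** 2 + l2_vec.max()
    step = 1.0 / L
    beta = np.zeros((p, k))
    z, t = beta.copy(), 1.0
    for _ in range(max_iter):
        # The expensive X products are shared by all k problems.
        grad = np.dot(X.T, np.dot(X, z) - y) + l2_vec * z
        beta_new = soft_threshold(z - step * grad, step * l1_vec)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        z = beta_new + ((t - 1.0) / t_new) * (beta_new - beta)
        # Vectorized stopping rule: wait until *all* columns have converged.
        if np.max(np.linalg.norm(beta_new - beta, axis=0)) < eps:
            beta = beta_new
            break
        beta, t = beta_new, t_new
    return beta

The two np.dot calls on X are the only place where the data matrix is touched, so their cost is shared across all k columns, which is exactly where the acceleration measured above comes from.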
This is cool! I agree that the required changes will likely be minimal, while the gain will be maximal.
Three comments:
I guess the convergence criterion should be np.max(gap_mu + mu * gM) < self.eps, so that the algorithm stops when all the beta vectors have converged, instead of when the first one has converged.
I think the parameters l1, l2 and tv should be allowed to be either floats or lists of floats, so that the current syntax (with scalars) still works.
Perhaps we could allow the cartesian product of the lists? So that with l1=[0.1, 0.9] and tv=[0.1, 1.0], instead of computing only for (l1, tv) = (0.1, 0.1) and (0.9, 1.0), we compute for (0.1, 0.1), (0.1, 1.0), (0.9, 0.1) and (0.9, 1.0). This could be an option; see the sketch below.
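A minimal sketch of how such an option could expand the penalty lists into per-problem vectors; the option name cartesian and the helper expand_penalties are hypothetical, not the package's API:

import itertools

import numpy as np

def expand_penalties(l1, l2, tv, cartesian=False):
    # Turn scalar or list penalties into three aligned length-k vectors.
    l1, l2, tv = (np.atleast_1d(np.asarray(v, dtype=float)) for v in (l1, l2, tv))
    if cartesian:
        # Every combination of the given values: k = len(l1) * len(l2) * len(tv).
        combos = np.array(list(itertools.product(l1, l2, tv)))
        return combos[:, 0], combos[:, 1], combos[:, 2]
    # Zipped mode (the current proposal): the lists must already have the same length k.
    return l1, l2, tv

# Example: l1=[0.1, 0.9], tv=[0.1, 1.0] with a single l2 value.
l1_k, l2_k, tv_k = expand_penalties([0.1, 0.9], 0.5, [0.1, 1.0], cartesian=True)
# -> 4 problems: (0.1, 0.5, 0.1), (0.1, 0.5, 1.0), (0.9, 0.5, 0.1), (0.9, 0.5, 1.0)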
An experiment on 302,500 features (a 550 x 550 image) demonstrated an acceleration factor of about 4.5: solving 40 problems simultaneously is only 8.5 times slower than solving a single one.
Most of the changes will occur when testing for convergence:
gap_mu + mu * gM < self.eps
will become np.min(gap_mu + mu * gM) < self.eps
I suggest starting with FISTA and the elastic net, and evaluating the acceleration.
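For illustration, assuming gap_mu and gM become length-k arrays (one entry per problem), a minimal sketch of the vectorized test could look like the snippet below; np.max stops only once every problem has converged, whereas np.min stops as soon as the first one has (which is what the comment above points out):

import numpy as np

def all_converged(gap_mu, gM, mu, eps):
    # gap_mu and gM are assumed to hold one value per problem (length k).
    # np.max: every problem must meet the criterion before stopping;
    # np.min(...) < eps would stop as soon as the first problem converges.
    return np.max(gap_mu + mu * gM) < eps

# Illustrative values (not from the package): two problems, only the first converged.
print(all_converged(np.array([1e-7, 1e-2]), np.array([1e-8, 1e-3]), mu=0.1, eps=1e-5))  # False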