Understanding Manual MLE logit estimation and optimization
I'm looking for insight into why a slight change in the gradient function is necessary for optim() to reproduce the result of the manual logit MLE calculation. The two approaches give the same estimates, but it's not clear to me why the change shown below is necessary.
Consider the following code:
By hand:
negLL <- function(b, X, y){                  # b = betas
  p <- as.vector(1/(1 + exp(-X %*% b)))      # standard logistic function: 1/(1 + exp(-X %*% b))
  -sum(y*log(p) + (1 - y)*log(1 - p))        # cost function (negative log-likelihood); y-hat = p
}

gradient <- function(b, X, y){
  p <- as.vector(1/(1 + exp(-X %*% b)))
  apply((y - p)*X, 2, sum)                   # derivative of cost function; y-hat = p
}
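For reference (this derivation is my addition, not part of the original question): with p = 1/(1 + exp(-X %*% b)), the log-likelihood and its gradient are

l(b)     = sum( y*log(p) + (1 - y)*log(1 - p) )
dl/db    = t(X) %*% (y - p)
d(-l)/db = -t(X) %*% (y - p)

so apply((y - p)*X, 2, sum) computes the gradient of the log-likelihood itself, while the gradient of negLL is its negative.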
y <- sample(c(0, 1), size =500, replace=TRUE)
x1 <- rnorm(500)
x2 <- rnorm(500)
x <- data.frame(x1, x2)
x <- data.matrix(x)
x <- cbind(1, x)
# the start
tol = 10^-6
beta = c(0,0,0) # multivariate now
maxit = 1000
iter = 0
alpha = 0.0001
eps = Inf
start = Sys.time()
while(eps > tol & iter < maxit){
  # save the previous value
  beta0 = beta
  # calculate h, the increment
  h = alpha*gradient(beta, x, y)
  # update beta
  beta = beta + h
  # update the (negative) log-likelihood
  logL = negLL(beta, x, y)
  # calculate the euclidean distance between successive estimates
  eps = sqrt(sum((beta - beta0)^2))
  # update the iteration number
  iter = iter + 1
  if(iter == maxit) warning("Iteration limit reached without convergence")
  # print progress every 20 iterations
  if(iter %% 20 == 0) cat(sprintf("Iter: %d logL: %.2f beta0: %.3f beta1: %.3f beta2: %.3f eps:%f\n",
                                  iter, logL, beta[1], beta[2], beta[3], eps))
}
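As a sanity check (my addition, not part of the original post), the by-hand estimates can be compared with R's built-in logistic regression, which maximizes the same likelihood:

# glm() fits the same model by maximum likelihood (Fisher scoring)
fit.glm <- glm(y ~ x1 + x2, family = binomial)
cbind(manual = beta, glm = coef(fit.glm))   # should be very close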
Using optim():
binreg <- function(X, y, method="BFGS"){
  #X <- cbind(1, X)
  negLL <- function(b, X, y){                 # b = betas
    p <- as.vector(1/(1 + exp(-X %*% b)))     # standard logistic function: 1/(1 + exp(-X %*% b))
    -sum(y*log(p) + (1 - y)*log(1 - p))       # cost function (negative log-likelihood); y-hat = p
  }
  gradient <- function(b, X, y){
    p <- as.vector(1/(1 + exp(-X %*% b)))
    -apply((y - p)*X, 2, sum)                 # derivative of the cost function (note the minus sign)
  }
  results <- optim(rep(0, ncol(X)), negLL, gr=gradient,
                   hessian=TRUE, method=method, X=X, y=y,
                   control=list(trace=1, REPORT=1))
  list(coefficients=results$par, var=solve(results$hessian),
       deviance=2*results$value,
       converged=results$convergence == 0)
}
mlebin.fit<-binreg(x,y)
#results
round(mlebin.fit$coefficients,2)
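As a small usage note (my addition, not in the original post): because this version minimizes the negative log-likelihood, the inverse Hessian stored in var approximates the variance-covariance matrix of the estimates, so approximate standard errors are available as:

# approximate standard errors from the inverse Hessian of the negative log-likelihood
round(sqrt(diag(mlebin.fit$var)), 3)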
The only difference: for the by-hand estimation, I removed the negative sign from the first derivative of the cost function.
apply(((y - p)*X), 2, sum)    # derivative of cost function: (p) = y-hat  # by hand
# changes to
-apply(((y - p)*X), 2, sum)   # so optim() works right
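One quick way to check which sign optim() needs (my own check, not from the post): compare a finite-difference gradient of negLL, as defined at the top, with the analytic versions. It matches the one carrying the minus sign:

# central-difference gradient of negLL at an arbitrary test point
num.grad <- function(f, b, X, y, h = 1e-6) {
  sapply(seq_along(b), function(j) {
    e <- rep(0, length(b)); e[j] <- h
    (f(b + e, X, y) - f(b - e, X, y)) / (2 * h)
  })
}
b0 <- c(0.1, -0.2, 0.3)       # arbitrary point
num.grad(negLL, b0, x, y)     # numerical gradient of the negative log-likelihood
-gradient(b0, x, y)           # minus the "by hand" gradient -- these two agree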
Edit:
Is it because the algorithms are different, gradient descent vs. BFGS? One climbing and one descending?
You can get the coefficients in either of the following ways:
- maximize the log-likelihood, or
- minimize the negative log-likelihood.
Using the "manual" approach, you maximize the log-likelihood: even though you store the negative log-likelihood in negLL, the gradient you use in the update is the gradient of the log-likelihood (no minus sign), and you add the step, so the loop performs gradient ascent and the log-likelihood is maximized.
With your optim() call, the negative log-likelihood is minimized: optim() minimizes by default, you give it an objective function that is the negative of the log-likelihood, and you give it a gradient that is the gradient of the negative log-likelihood.
If you want optim() to take the gradient of the log-likelihood, as in the manual method, you also need to (1) supply the log-likelihood itself as the objective function, and (2) add fnscale = -1 to the control list so that optim() maximizes rather than minimizes.
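A minimal toy illustration of fnscale = -1 (my example, not from the original answer): with fnscale = -1, optim() maximizes the supplied function.

# maximize f(x) = -(x - 3)^2, whose maximum is at x = 3
optim(par = 0, fn = function(x) -(x - 3)^2, method = "BFGS",
      control = list(fnscale = -1))$par
# ~ 3; without fnscale = -1, optim() would try to minimize f, which has no minimum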
Here is the result when I run the original code:
## Data simulated as in your OP
## Here are results from binreg() as it is written in your OP:
mlebin.fit<-binreg(x,y)
initial value 346.573590
iter 2 value 346.470207
iter 3 value 346.469793
iter 4 value 346.469567
iter 5 value 346.468589
iter 6 value 346.468553
iter 6 value 346.468553
iter 6 value 346.468553
final value 346.468553
converged
round(mlebin.fit$coefficients,2)
[1] 0.01 0.04 0.00
This is how we will maximize the log-likelihood:
## binreg() that maximizes the log likelihood rather than
## minimizing the negative log likelihood
binreg <- function(X, y, method="BFGS"){
  ## Use the log-likelihood, not the negative log-likelihood
  LL <- function(b, X, y){                    # b = betas
    p <- as.vector(1/(1 + exp(-X %*% b)))     # standard logistic function: 1/(1 + exp(-X %*% b))
    sum(y*log(p) + (1 - y)*log(1 - p))        # log-likelihood; y-hat = p
  }
  ## Use the gradient of the log-likelihood, not the gradient of the negative log-likelihood
  gradient <- function(b, X, y){
    p <- as.vector(1/(1 + exp(-X %*% b)))
    apply((y - p)*X, 2, sum)                  # derivative of the log-likelihood
  }
  results <- optim(rep(0, ncol(X)), LL, gr=gradient,
                   hessian=TRUE, method=method, X=X, y=y,
                   control=list(trace=1, REPORT=1, fnscale = -1))
  list(coefficients=results$par,
       var=solve(-results$hessian),           # the returned Hessian is now for the log-likelihood, so negate it before inverting
       deviance=-2*results$value,             # results$value is now the maximized log-likelihood, so flip the sign for the deviance
       converged=results$convergence == 0)
}
Here is the result of running this version:
mlebin.fit<-binreg(x,y)
initial value 346.573590
iter 2 value 346.470207
iter 3 value 346.469793
iter 4 value 346.469567
iter 5 value 346.468589
iter 6 value 346.468553
iter 6 value 346.468553
iter 6 value 346.468553
final value 346.468553
converged
round(mlebin.fit$coefficients,2)
[1] 0.01 0.04 0.00
Finally, for completeness, the result of the manual method when I run it:
Iter: 20 logL: 346.53 beta0: 0.003 beta1: 0.009 beta2: -0.001 eps:0.000413
Iter: 40 logL: 346.50 beta0: 0.006 beta1: 0.015 beta2: -0.001 eps:0.000313
Iter: 60 logL: 346.49 beta0: 0.008 beta1: 0.020 beta2: -0.002 eps:0.000237
Iter: 80 logL: 346.48 beta0: 0.009 beta1: 0.024 beta2: -0.002 eps:0.000180
Iter: 100 logL: 346.48 beta0: 0.011 beta1: 0.027 beta2: -0.002 eps:0.000136
Iter: 120 logL: 346.47 beta0: 0.011 beta1: 0.029 beta2: -0.002 eps:0.000103
Iter: 140 logL: 346.47 beta0: 0.012 beta1: 0.031 beta2: -0.002 eps:0.000078
Iter: 160 logL: 346.47 beta0: 0.012 beta1: 0.032 beta2: -0.002 eps:0.000060
Iter: 180 logL: 346.47 beta0: 0.013 beta1: 0.033 beta2: -0.002 eps:0.000045
Iter: 200 logL: 346.47 beta0: 0.013 beta1: 0.034 beta2: -0.002 eps:0.000035
Iter: 220 logL: 346.47 beta0: 0.013 beta1: 0.035 beta2: -0.002 eps:0.000026
Iter: 240 logL: 346.47 beta0: 0.013 beta1: 0.035 beta2: -0.002 eps:0.000020
Iter: 260 logL: 346.47 beta0: 0.013 beta1: 0.036 beta2: -0.002 eps:0.000015
Iter: 280 logL: 346.47 beta0: 0.013 beta1: 0.036 beta2: -0.002 eps:0.000012
Iter: 300 logL: 346.47 beta0: 0.013 beta1: 0.036 beta2: -0.002 eps:0.000009
Iter: 320 logL: 346.47 beta0: 0.013 beta1: 0.036 beta2: -0.002 eps:0.000007
Iter: 340 logL: 346.47 beta0: 0.013 beta1: 0.036 beta2: -0.002 eps:0.000005
Iter: 360 logL: 346.47 beta0: 0.014 beta1: 0.036 beta2: -0.002 eps:0.000004
Iter: 380 logL: 346.47 beta0: 0.014 beta1: 0.036 beta2: -0.002 eps:0.000003
Iter: 400 logL: 346.47 beta0: 0.014 beta1: 0.036 beta2: -0.002 eps:0.000002
Iter: 420 logL: 346.47 beta0: 0.014 beta1: 0.036 beta2: -0.002 eps:0.000002
Iter: 440 logL: 346.47 beta0: 0.014 beta1: 0.037 beta2: -0.002 eps:0.000001
Iter: 460 logL: 346.47 beta0: 0.014 beta1: 0.037 beta2: -0.002 eps:0.000001
So, as we'd expect, they're all basically the same.