Understanding Manual MLE logit estimation and optimization


John Stud

I'm looking for insight as to why a slight change in the gradient function is necessary to achieve the same result of a manual logit MLE calculation. They get the same result, but it's not clear to me why the changes specified below are necessary.

Consider the following code:

By hand:

negLL<- function(b,X,y){  # b = betas
    p <- as.vector(1/(1+exp(-X %*% b)))  # "standard logistic function"; 1/1+exp(-X)
    -sum(y*log(p) + (1-y)*log(1-p))   # cost function; y-hat = (p)
    }

gradient<- function(b,X,y){
  p <- as.vector(1/(1+exp(-X %*% b)))
  apply(((y - p)*X),2,sum) # derivative of cost function: (p) = y-hat
  }

y <- sample(c(0, 1), size =500, replace=TRUE)
x1 <- rnorm(500)
x2 <- rnorm(500) 
x <- data.frame(x1, x2)
x <- data.matrix(x)
x <- cbind(1, x)

# the start
tol = 10^-6
beta = c(0,0,0) # multivariate now
maxit = 1000
iter = 0
alpha = 0.0001
eps = Inf

start = Sys.time()
while(eps > tol & iter < maxit){
  # save the previous value
  beta0 = beta

  # calculate h, the increment
  h =  alpha*gradient(beta, x, y)

  # update beta
  beta = beta + h

  # update the log likelihood 
  logL = negLL(beta, x, y)

  # calculate the euclidean distance
  eps  = sqrt(sum((beta - beta0)^2))

  # update the iteration number
  iter = iter + 1
  if(iter == maxit) warning("Iteration limit reached without convergence")

  # print out info to keep track
  if(floor(iter/20) == ceiling(iter/20)) cat(sprintf("Iter: %d logL: %.2f beta0: %.3f beta1: %.3f beta2: %.3f eps:%f\n",iter, logL,beta[1],beta[2],beta[3],eps))
}

optimal

binreg<- function(X,y,method="BFGS"){
  #X<- cbind(1,X)
  negLL<- function(b,X,y){  # b = betas
    p<-as.vector(1/(1+exp(-X %*% b)))  # "standard logistic function"; 1/1+exp(-X)
    - sum(y*log(p) + (1-y)*log(1-p))   # cost function; y-hat = (p)
  }

  gradient<- function(b,X,y){
    p <- as.vector(1/(1+exp(-X %*% b)))
     -apply(((y - p)*X),2,sum) # derivative of cost function: (p) = y-hat

  }

  results<- optim (rep(0,ncol(X)),negLL,gr=gradient,
                   hessian=T,method=method,X=X,y=y, control=list(trace=1, REPORT=1))
  list(coefficients=results$par,var=solve(results$hessian),
       deviance=2*results$value,
       converged=results$convergence==0)

  }
mlebin.fit<-binreg(x,y)
#results
round(mlebin.fit$coefficients,2)

The only difference: in the hand estimation, I removed the negative sign from the first derivative of the cost function.

apply(((y - p)*X),2,sum) # derivative of cost function: (p) = y-hat  # by hand
# changes to
-apply(((y - p)*X),2,sum)  # so optim works right

edit:

Is it because the algorithms are different, gradient descent vs BFGS? Climbing and climbing?

Demeier

You can get the coefficients in either of the following ways

  1. maximize the log-likelihood, or
  2. Minimize negative log likelihood

Using the "manual" approach, you maximize the log-likelihood; even if you store the negative of the log-likelihood, the gradient with gradient descent is the gradient of the log-likelihood, and you add a step size, Therefore it is maximized.

With your optim()method, the log-log likelihood is minimized; by optim()default it is minimized, you give it an objective function that is the negative of the log-likelihood, and you give it a gradient that is the negative of the log-likelihood.

If you wish optim()to input the gradient of the log-likelihood as in the manual method, you will also need to (1) add the log-likelihood as the objective function; (2) add fnscale = -1to the controllist.

Here is the result when I run the original code:

## Data simulated as in your OP
## Here are results from binreg() as it is written in your OP:
mlebin.fit<-binreg(x,y)
initial  value 346.573590 
iter   2 value 346.470207
iter   3 value 346.469793
iter   4 value 346.469567
iter   5 value 346.468589
iter   6 value 346.468553
iter   6 value 346.468553
iter   6 value 346.468553
final  value 346.468553 
converged

round(mlebin.fit$coefficients,2)
[1] 0.01 0.04 0.00

This is how we will maximize the log-likelihood:

## binreg() that maximizes the log likelihood rather than
## minimizing the negative log likelihood
binreg<- function(X,y,method="BFGS"){
    ## Use log likelihood, not negative log likelihood
    LL<- function(b,X,y){  # b = betas
        p<-as.vector(1/(1+exp(-X %*% b)))  # "standard logistic function"; 1/1+exp(-X)
        sum(y*log(p) + (1-y)*log(1-p))   # cost function; y-hat = (p)
    }

    ## Use gradient of log likelihood, not gradient of negative log likelihood
    gradient<- function(b,X,y){
        p <- as.vector(1/(1+exp(-X %*% b)))
        apply(((y - p)*X),2,sum) # derivative of cost function: (p) = y-hat

    }

    results<- optim (rep(0,ncol(X)), LL,gr=gradient,
                     hessian=T,method=method,X=X,y=y,
                     control=list(trace=1, REPORT=1, fnscale = -1))
    list(coefficients=results$par,var=solve(results$hessian),
         deviance=2*results$value,
         converged=results$convergence==0)

}

Here is the result of the method:

mlebin.fit<-binreg(x,y)

initial  value 346.573590 
iter   2 value 346.470207
iter   3 value 346.469793
iter   4 value 346.469567
iter   5 value 346.468589
iter   6 value 346.468553
iter   6 value 346.468553
iter   6 value 346.468553
final  value 346.468553 
converged

round(mlebin.fit$coefficients,2)
[1] 0.01 0.04 0.00

Finally, for completeness, the result of the manual method when I run it:

Iter: 20 logL: 346.53 beta0: 0.003 beta1: 0.009 beta2: -0.001 eps:0.000413
Iter: 40 logL: 346.50 beta0: 0.006 beta1: 0.015 beta2: -0.001 eps:0.000313
Iter: 60 logL: 346.49 beta0: 0.008 beta1: 0.020 beta2: -0.002 eps:0.000237
Iter: 80 logL: 346.48 beta0: 0.009 beta1: 0.024 beta2: -0.002 eps:0.000180
Iter: 100 logL: 346.48 beta0: 0.011 beta1: 0.027 beta2: -0.002 eps:0.000136
Iter: 120 logL: 346.47 beta0: 0.011 beta1: 0.029 beta2: -0.002 eps:0.000103
Iter: 140 logL: 346.47 beta0: 0.012 beta1: 0.031 beta2: -0.002 eps:0.000078
Iter: 160 logL: 346.47 beta0: 0.012 beta1: 0.032 beta2: -0.002 eps:0.000060
Iter: 180 logL: 346.47 beta0: 0.013 beta1: 0.033 beta2: -0.002 eps:0.000045
Iter: 200 logL: 346.47 beta0: 0.013 beta1: 0.034 beta2: -0.002 eps:0.000035
Iter: 220 logL: 346.47 beta0: 0.013 beta1: 0.035 beta2: -0.002 eps:0.000026
Iter: 240 logL: 346.47 beta0: 0.013 beta1: 0.035 beta2: -0.002 eps:0.000020
Iter: 260 logL: 346.47 beta0: 0.013 beta1: 0.036 beta2: -0.002 eps:0.000015
Iter: 280 logL: 346.47 beta0: 0.013 beta1: 0.036 beta2: -0.002 eps:0.000012
Iter: 300 logL: 346.47 beta0: 0.013 beta1: 0.036 beta2: -0.002 eps:0.000009
Iter: 320 logL: 346.47 beta0: 0.013 beta1: 0.036 beta2: -0.002 eps:0.000007
Iter: 340 logL: 346.47 beta0: 0.013 beta1: 0.036 beta2: -0.002 eps:0.000005
Iter: 360 logL: 346.47 beta0: 0.014 beta1: 0.036 beta2: -0.002 eps:0.000004
Iter: 380 logL: 346.47 beta0: 0.014 beta1: 0.036 beta2: -0.002 eps:0.000003
Iter: 400 logL: 346.47 beta0: 0.014 beta1: 0.036 beta2: -0.002 eps:0.000002
Iter: 420 logL: 346.47 beta0: 0.014 beta1: 0.036 beta2: -0.002 eps:0.000002
Iter: 440 logL: 346.47 beta0: 0.014 beta1: 0.037 beta2: -0.002 eps:0.000001
Iter: 460 logL: 346.47 beta0: 0.014 beta1: 0.037 beta2: -0.002 eps:0.000001

So, as we'd expect, they're all basically the same.

Related


Understanding Manual MLE logit estimation and optimization

John Stud I'm looking for insight as to why a slight change in the gradient function is necessary to achieve the same result of a manual logit MLE calculation. They get the same result, but it's not clear to me why the changes specified below are necessary. Co

Understanding Manual MLE logit estimation and optimization

John Stud I'm looking for insight as to why a slight change in the gradient function is necessary to achieve the same result of a manual logit MLE calculation. They get the same result, but it's not clear to me why the changes specified below are necessary. Co

Understanding Manual MLE logit estimation and optimization

John Stud I'm looking for insight as to why a slight change in the gradient function is necessary to achieve the same result of a manual logit MLE calculation. They get the same result, but it's not clear to me why the changes specified below are necessary. Co

Understanding Manual MLE logit estimation and optimization

John Stud I'm looking for insight as to why a slight change in the gradient function is necessary to achieve the same result of a manual logit MLE calculation. They get the same result, but it's not clear to me why the changes specified below are necessary. Co

Understanding Manual MLE logit estimation and optimization

John Stud I'm looking for insight as to why a slight change in the gradient function is necessary to achieve the same result of a manual logit MLE calculation. They get the same result, but it's not clear to me why the changes specified below are necessary. Co

Understanding Manual MLE logit estimation and optimization

John Stud I'm looking for insight as to why a slight change in the gradient function is necessary to achieve the same result of a manual logit MLE calculation. They get the same result, but it's not clear to me why the changes specified below are necessary. Co

Understanding Manual MLE logit estimation and optimization

John Stud I'm looking for insight as to why a slight change in the gradient function is necessary to achieve the same result of a manual logit MLE calculation. They get the same result, but it's not clear to me why the changes specified below are necessary. Co

Simple numerical estimation problem for MLE of polynomials in R

John N I am trying to build a simple numerical MLE estimate for a multinomial distribution. The polynomial has a constraint - all cell probabilities need to add up to one. In general, the way to have this constraint is to reformulate one probability as (1 - su

Problems implementing MLE estimation in the bbmle package (for R)

will I am trying to validate the MLEs for $\alpha$ , $\beta$ and $\lambda$ obtained for the Logistic-Lomax distribution by Zubair et al. in their paper titled A Study of Logistic-Lomax Distribution when using Dataset 1 . The paper uses the following code to do

Simple numerical estimation problem for MLE of polynomials in R

John N I am trying to build a simple numerical MLE estimate for a multinomial distribution. The polynomial has a constraint - all cell probabilities need to add up to one. In general, the way to have this constraint is to reformulate one probability as (1 - su

Simple numerical estimation problem for MLE of polynomials in R

John N I am trying to build a simple numerical MLE estimate for a multinomial distribution. The polynomial has a constraint - all cell probabilities need to add up to one. In general, the way to have this constraint is to reformulate one probability as (1 - su

Maximum Likelihood Estimation - MLE - in R using the plm package

Lucca coding I have unbalanced panel data and I want to fit this type of regression: Pr(y=1|xB) = G(xB+a) where "y" is a binary variable, "x" is a vector of explanatory variables, and "B" is my coefficient. I want to implement a random effects model with maxi

Optimization of nonlinear Langmuir parameter estimation in R

world sports I am interested in using this nlsto help fit the Langmuir equation, Y =(Qmax*k*X)/(1+(k*X))similar to fitting nonlinear Langmuir isotherms in R in this article . The equation parameters of my interest correspond to Qmaxthe horizontal asymptotes (g

Retrieving optimization results from MLE via scipy.stats.fit()?

Coolio2654 I'm trying to pass estimating parameters for different distributions scipy.stats."some_dist".fit()and am having extreme difficulty retrieving any information about the optimization process I'm using. Specifically, I'm looking for the Hessian state,

Why does dmultinom MLE's optimization converge incorrectly in R?

Dylan I I am writing a Lincoln-Petersen MLE function in R. I have two versions of the distribution function, one using the gamma function and one using the Rs internal dmultinom (using lgamma). When I interpolate the sample values, both return the same result,

Retrieving optimization results from MLE via scipy.stats.fit()?

Coolio2654 I'm trying to pass estimating parameters for different distributions scipy.stats."some_dist".fit()and am having extreme difficulty retrieving any information about the optimization process I'm using. Specifically, I'm looking for the Hessian state,

Why does dmultinom MLE's optimization converge incorrectly in R?

Dylan I I am writing a Lincoln-Petersen MLE function in R. I have two versions of the distribution function, one using the gamma function and one using the Rs internal dmultinom (using lgamma). When I interpolate the sample values, both return the same result,

Manual optimization in the past (C language)

cancel the deal Back in the 70's when C was just getting started, I guess compiler-level optimizations weren't as advanced as modern compilers (clang, gcc, etc.) and computers themselves were limited in terms of hardware, so it was generally preferred to do it

Manual optimization in the past (C language)

cancel the deal Back in the 70's when C was just getting started, I guess compiler level optimizations weren't as advanced as modern compilers (clang, gcc, etc) and computers themselves were limited in terms of hardware, so it was generally preferred to do it

Understanding clang loop optimization

curious learner I have this code #include <cstdlib> #include <time.h> int sum () { srand (time(NULL)); unsigned long extra = rand() % 10; int sum = 0; // #pragma nounroll. <<<< This makes no difference for (int i = 0; i < 16 + extra; ++i) { s

Understanding clang loop optimization

curious learner I have this code #include <cstdlib> #include <time.h> int sum () { srand (time(NULL)); unsigned long extra = rand() % 10; int sum = 0; // #pragma nounroll. <<<< This makes no difference for (int i = 0; i < 16 + extra; ++i) { s

Understanding clang loop optimization

curious learner I have this code #include <cstdlib> #include <time.h> int sum () { srand (time(NULL)); unsigned long extra = rand() % 10; int sum = 0; // #pragma nounroll. <<<< This makes no difference for (int i = 0; i < 16 + extra; ++i) { s

Understanding clang loop optimization

curious learner I have this code #include <cstdlib> #include <time.h> int sum () { srand (time(NULL)); unsigned long extra = rand() % 10; int sum = 0; // #pragma nounroll. <<<< This makes no difference for (int i = 0; i < 16 + extra; ++i) { s

Understanding clang loop optimization

curious learner I have this code #include <cstdlib> #include <time.h> int sum () { srand (time(NULL)); unsigned long extra = rand() % 10; int sum = 0; // #pragma nounroll. <<<< This makes no difference for (int i = 0; i < 16 + extra; ++i) { s

Get the Coding Manual - Understanding Variables

dream 5 I'm just getting into the programming world and maybe I'm going a little too far by asking quickly, but I don't understand something from the book I'm reading. Why do I need to create a variable for the element password box that doesn't hold the value