logistic regression
From: "James Bailey" <James_Bailey@EmoryHealthCare.org>
Subject: logistic regression
Date: Tue, 18 Sep 2001 16:24:59 -0500
I believe the difficulty with logistic regression for sparse dichotomous
data can be well appreciated by considering the case of binary data (for
example, loss of responsiveness with an intravenous anesthetic) with one
data point per patient. The probability of a positive drug effect is
given by
P = C**gamma/(C**gamma + C50**gamma) (1)
This is equivalent to a model which postulates an underlying continuous
drug effect E given by
E = gamma*ln(C/C50) + epsilon (2)
where epsilon is a random variable with a logistic distribution. It is
further postulated that a positive binary drug effect is observed if
E > 0
The probability of a positive binary drug effect is equal to the
probability that epsilon is greater than -gamma*ln(C/C50), and using the
definition of the logistic distribution one can easily derive equation
(1).
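
The equivalence of equations (1) and (2) can be checked numerically. The
sketch below assumes that epsilon follows the standard logistic
distribution (location 0, scale 1), whose CDF is 1/(1 + exp(-x)). The
function names and the values of C, C50 and gamma are illustrative
assumptions, not taken from any data set.

```python
import math

def hill_probability(c, c50, gamma):
    """Equation (1): P = C**gamma / (C**gamma + C50**gamma)."""
    return c**gamma / (c**gamma + c50**gamma)

def logistic_cdf(x):
    """CDF of the standard logistic distribution (location 0, scale 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def latent_probability(c, c50, gamma):
    """P(E > 0) for E = gamma*ln(C/C50) + epsilon (equation (2)).

    E > 0 is the same event as epsilon > -gamma*ln(C/C50).
    """
    threshold = -gamma * math.log(c / c50)
    return 1.0 - logistic_cdf(threshold)

# Illustrative values only (assumed, not from the original post).
C50, GAMMA = 2.0, 3.0
for c in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(c, hill_probability(c, C50, GAMMA), latent_probability(c, C50, GAMMA))
```

The two columns agree to floating-point precision because
1 - F(-x) = 1/(1 + exp(-x)) with x = gamma*ln(C/C50), which rearranges
to C**gamma/(C**gamma + C50**gamma).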
Now consider interpatient variability and assume that
ln(C50) = ln(<C50>) + eta
where <C50> is the "typical value" and eta is normally distributed.
Then
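E = gamma*ln(C/<C50>) - gamma*eta + epsilon
In this case the probability of a positive binary drug effect is equal
to the probability that the random variable epsilon - gamma*eta is
greater than -gamma*ln(C/<C50>). Because eta is symmetric about zero,
gamma*eta + epsilon has the same distribution as epsilon - gamma*eta.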
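
With one data point per patient, what the data can inform is this
population-level probability, that is, equation (1) averaged over the
distribution of C50. A minimal sketch of that average follows, assuming
eta ~ N(0, omega**2) and using a simple trapezoidal rule. The name
omega, the function names and the numerical values are assumptions made
for illustration.

```python
import math

def individual_probability(c, c50, gamma):
    """Equation (1) for one patient with a known C50."""
    return c**gamma / (c**gamma + c50**gamma)

def population_probability(c, c50_typical, gamma, omega, n=2001, width=8.0):
    """Average equation (1) over ln(C50) = ln(<C50>) + eta, eta ~ N(0, omega**2).

    Trapezoidal rule on eta in [-width*omega, +width*omega].
    omega (the standard deviation of eta) is a hypothetical parameter name.
    """
    if omega == 0.0:
        return individual_probability(c, c50_typical, gamma)
    lower = -width * omega
    step = 2.0 * width * omega / (n - 1)
    total = 0.0
    for i in range(n):
        eta = lower + i * step
        density = math.exp(-0.5 * (eta / omega) ** 2) / (omega * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n - 1) else 1.0  # end points get half weight
        c50_i = c50_typical * math.exp(eta)       # this patient's C50
        total += weight * density * individual_probability(c, c50_i, gamma)
    return total * step

# Illustrative values only (assumed): <C50> = 2, gamma = 3, omega = 0.5.
for c in (1.0, 2.0, 4.0):
    print(c, individual_probability(c, 2.0, 3.0),
          population_probability(c, 2.0, 3.0, 0.5))
```

The population curve still passes through 0.5 at C = <C50> but is
flatter than the curve for a typical individual, so interpatient
variability and a smaller gamma have similar effects on the observed
response rates.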
However, consider the situation where epsilon conforms to a normal
distribution instead of a logistic distribution. Then gamma*eta +
epsilon also has a normal distribution and it is impossible to determine
the relative contributions of eta and epsilon to the overall variance.
In this situation it is impossible to do a complete analysis of binary
data with one data point per patient. This, of course, corresponds to
probit analysis but it makes the difficulty apparent. The normal and
logistic distributions are not that different, so a population
analysis of sparse binary data, which depends on the ability to
distinguish between the two distributions, will be almost impossible.
Furthermore, it rests on the assumption of an underlying logistic
distribution for the intrapatient variability (in epsilon), and there is
little basis for this assumption.
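
Both points can be illustrated with a short sketch. It assumes epsilon
~ N(0, 1) and eta ~ N(0, omega**2) for the probit case, so that the
combined random term is normal with variance gamma**2*omega**2 + 1. The
scaling constant 1.702 used to compare the logistic and normal CDFs is a
commonly quoted value and is an assumption here, as are the function
names and parameter values.

```python
import math

def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic_cdf(x):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-x))

def probit_population_probability(c, c50_typical, gamma, omega):
    """Population response probability when both random terms are normal.

    The combined random term (interpatient plus intrapatient) is
    N(0, gamma**2 * omega**2 + 1), so only the ratio
    gamma / sqrt(gamma**2 * omega**2 + 1) enters the result.
    """
    total_sd = math.sqrt(gamma**2 * omega**2 + 1.0)
    return normal_cdf(gamma * math.log(c / c50_typical) / total_sd)

# Two different (gamma, omega) pairs with the same effective slope of 2
# (illustrative values, assumed).
pair_a = (2.0, 0.0)                    # no interpatient variability
pair_b = (4.0, math.sqrt(3.0) / 4.0)   # steeper curve, variable C50
for c in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(c,
          probit_population_probability(c, 2.0, *pair_a),
          probit_population_probability(c, 2.0, *pair_b))

# Largest gap between the normal CDF and a rescaled logistic CDF
# on a grid from -6 to 6.
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(logistic_cdf(1.702 * x) - normal_cdf(x)) for x in grid)
print("max |logistic - normal| =", max_gap)
```

The two parameter pairs give identical response probabilities at every
concentration, so single observations per patient cannot separate them.
The gap between the rescaled logistic and normal CDFs stays below about
0.01, which gives a sense of how little information is available for
telling the two distributions apart.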
My colleague Wei Lu and I have done some simulations and our results
indicate that 5 to 10 data points per patient are necessary to
estimate <C50> or gamma with any degree of reliability.
Jim Bailey