The Logistic Regression and ROC Analysis of Diagnostic Tests Results for Gestational Diabetic Mellitus

Research Article

Open Access

Oyeka ICA¹ and Okeh UM²*

¹Department of Applied Statistics, Nnamdi Azikiwe University, Awka, Nigeria

²Department of Industrial Mathematics and Applied Statistics, Ebonyi State University, Abakaliki, Nigeria

*Corresponding author:

Okeh UM
Department of Industrial Mathematics and Applied Statistics
Ebonyi State University Abakaliki, Nigeria
E-mail: uzomaokey@ymail.com

Received February 07, 2013; Published March 30, 2013

Citation: Oyeka ICA, Okeh UM (2013) The Logistic Regression and ROC Analysis of Diagnostic Tests Results for Gestational Diabetic Mellitus. 2:654 doi:10.4172/scientificreports.654

Copyright: © 2013 Oyeka ICA, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

This paper proposes a matrix approach for estimating the parameters of logistic regression, with a view to estimating the effects of risk factors for gestational diabetes mellitus (GDM). Unlike other methods of estimating the parameters of non-linear regression, the proposed method is simpler and its parameter estimates converge more quickly. The method was applied to data from five randomly selected hospitals in Ebonyi State, Nigeria, and the odds ratios obtained from the logistic regression were used to interpret the effects of the risk factors: obesity and family history were positively associated with GDM. The proposed method was seen to compare favorably with other known methods.

Keywords

GDM; Odds ratio; Logistic regression; Dichotomous; Newton-Raphson

Introduction

The constant evolution of medicine over the last two decades has meant that statistics has had to develop methods to solve the new problems that have appeared, and statistics has come to play a central part in methods of diagnosis of diseases [1]. A diagnostic method consists of the application of a test to a group of patients in order to obtain a provisional diagnosis regarding the presence or absence of a particular disease [2]. In this work, logistic regression is proposed for estimating the effects of various predictors on a binary outcome of interest. Logistic regression regresses a dichotomous dependent variable on a set of independent variables, as a way of quantifying the effects of these independent variables [3,4].

We therefore propose a matrix approach for solving a system of nonlinear equations with P+1 unknown parameters. The method is applied to estimating the effects of risk factors on the occurrence of gestational diabetes mellitus (GDM) [5-7]. It is illustrated using data on GDM and is shown to compare favorably with other existing methods in terms of efficiency.

The Proposed Method

The fundamental model for any multiple regression analysis assumes that the outcome variable is a linear combination of a set of predictors, and this is represented as:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_P x_{iP} + \varepsilon_i \qquad (1)$$

where β₀ is the expected value of Y when the x's are all set to 0, β_k is the regression coefficient for the corresponding predictor variable x_ik, and ε_i is the error of the prediction. The binary logistic model is based on a linear relationship between the natural logarithm (ln) of the odds of an event and a numerical independent variable. The form of this relationship is as follows:

$$\ln(\text{odds}) = \ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right) = \beta_0 + \beta_1 x \qquad (2)$$

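The logistic regression model indirectly models the response variable based on probabilities associated with the values of Y. Let π_i be the probability that Y=1 and 1-π_i be the probability that Y=0. These probabilities are represented as:

$$\pi_i = P(Y_i = 1) = \frac{e^{\sum_{k=0}^{P} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}}, \qquad 1 - \pi_i = P(Y_i = 0) = \frac{1}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}} \qquad (3)$$

where x_i0 = 1, so that β₀ is the intercept.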

The general form of the logistic model is given by

(4)

where i = 1, 2, ..., N, and

$$\frac{\pi_i}{1-\pi_i}$$

are the odds of developing the disease for a subject with the risk factor. By logit transformation of the inverse of log odds to favour Y=1, we obtain the linear component as

Similarly,

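Using the logit transformation, that is, the natural logarithm of the odds (log odds) in favor of Y=1, and equating it to the linear component, we have:

$$\ln\left(\frac{\pi_i}{1-\pi_i}\right) = \sum_{k=0}^{P} x_{ik}\beta_k \qquad (5)$$

Therefore,

$$\frac{\pi_i}{1-\pi_i} = e^{\sum_{k=0}^{P} x_{ik}\beta_k} \qquad (6)$$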
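As a rough illustration of the relationship between the linear component, the odds and the probability π_i, the following Python sketch converts between them. The coefficient values and the covariate vector are hypothetical and are not taken from the study; the function names are ours.

```python
import math

def logit(p):
    """Log odds ln(p / (1 - p)) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Probability exp(eta) / (1 + exp(eta)) for a linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(x, beta):
    """Linear component sum_k x_k * beta_k, with x[0] = 1 for the intercept."""
    return sum(xk * bk for xk, bk in zip(x, beta))

# Hypothetical coefficients (intercept, two risk factors) and one subject
beta = [-1.0, 0.8, 0.5]
x = [1.0, 1.0, 0.0]          # intercept term, factor 1 present, factor 2 absent

eta = linear_predictor(x, beta)   # log odds
pi = inv_logit(eta)               # probability that Y = 1
odds = pi / (1.0 - pi)            # equals exp(eta)

print(f"log odds = {eta:.3f}, odds = {odds:.3f}, probability = {pi:.3f}")
```

The two functions are inverses of each other, so applying `logit` to the fitted probability recovers the linear component.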

Maximum Likelihood Estimation (MLE) for Logistic Regression

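We here estimate the P+1 unknown parameters β in Equation 5 by maximum likelihood estimation (MLE), that is, by finding the set of parameters for which the probability of the observed data is greatest. Since each y_i represents a binomial count in the i-th population, the joint probability density function of Y is:

$$f(y \mid \beta) = \prod_{i=1}^{N} \frac{n_i!}{y_i!\,(n_i - y_i)!}\, \pi_i^{y_i} (1-\pi_i)^{n_i - y_i} \qquad (7)$$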

where β enters through π_i in Equation 3. For each population, there are

$$\binom{n_i}{y_i} = \frac{n_i!}{y_i!\,(n_i-y_i)!}$$

different ways to arrange y_i successes from among n_i trials. Since the probability of a success for any one of the n_i trials is π_i, the probability of y_i successes is $\pi_i^{y_i}$. Likewise, the probability of n_i-y_i failures is $(1-\pi_i)^{n_i-y_i}$.

The joint probability density function in Equation 7 expresses the values of Y as a function of known, fixed values for β. The likelihood function has the same form as the probability density function, except that the roles are reversed: the likelihood function expresses the values of β in terms of known, fixed values for Y. Thus,

$$L(\beta \mid y) = \prod_{i=1}^{N} \frac{n_i!}{y_i!\,(n_i - y_i)!}\, \pi_i^{y_i} (1-\pi_i)^{n_i - y_i} \qquad (8)$$

The maximum likelihood estimates are the values for β that maximize the likelihood function in Equation 8. Thus, finding the maximum likelihood estimates requires computing the first and second derivatives of the likelihood function. Since the factorial terms do not contain any of the π_i, they are essentially constants that can be ignored: maximizing the equation without the factorial terms gives the same result as if they were included. By rearranging the terms, the equation to be maximized, which is the conditional likelihood, can be written as:

$$\prod_{i=1}^{N} \left(\frac{\pi_i}{1-\pi_i}\right)^{y_i} (1-\pi_i)^{n_i} \qquad (9)$$

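Recall from Equation 6 that

$$\frac{\pi_i}{1-\pi_i} = e^{\sum_{k=0}^{P} x_{ik}\beta_k} \qquad (10)$$

which, after solving for π_i (the same result as Equation 3), becomes

$$\pi_i = \frac{e^{\sum_{k=0}^{P} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}} \qquad (11)$$

Substituting Equation 10 for the first term and Equation 11 for the second term, Equation 9 becomes:

$$\prod_{i=1}^{N} \left(e^{\sum_{k=0}^{P} x_{ik}\beta_k}\right)^{y_i} \left(1 - \frac{e^{\sum_{k=0}^{P} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}}\right)^{n_i} \qquad (12)$$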

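Use $(e^{x})^{y} = e^{xy}$ to simplify the first product, and replace 1 with

$$\frac{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}}{1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}}$$

to simplify the second product. We have:

$$\prod_{i=1}^{N} e^{\,y_i \sum_{k=0}^{P} x_{ik}\beta_k} \left(1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}\right)^{-n_i} \qquad (13)$$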

This is the kernel of the likelihood function to maximize. We simplify further by taking its logarithm. Since the logarithm is a monotonically increasing function, any maximum of the likelihood function will also be a maximum of the log likelihood function, and vice versa. Thus, taking the natural log of Equation 13 yields the log likelihood function:

$$l(\beta) = \sum_{i=1}^{N} \left[ y_i \sum_{k=0}^{P} x_{ik}\beta_k - n_i \ln\left(1 + e^{\sum_{k=0}^{P} x_{ik}\beta_k}\right) \right] \qquad (14)$$
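The log likelihood kernel described above, the sum over i of y_i times the linear component minus n_i ln(1 + e^(linear component)), can be evaluated numerically. The sketch below is one possible NumPy implementation; the small data set is hypothetical and serves only to exercise the function.

```python
import numpy as np

def log_likelihood(beta, X, y, n):
    """Kernel of the binomial log likelihood (factorial terms dropped).

    X is the N x (P+1) design matrix, y the success counts,
    n the numbers of trials, beta the coefficient vector.
    """
    eta = X @ beta                                  # linear component
    # np.logaddexp(0, eta) computes ln(1 + exp(eta)) in a stable way
    return float(np.sum(y * eta - n * np.logaddexp(0.0, eta)))

# Hypothetical binary data: intercept column plus one 0/1 risk factor
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])
n = np.ones_like(y)

ll_at_zero = log_likelihood(np.zeros(2), X, y, n)   # every pi_i = 0.5
print(ll_at_zero)
```

With all coefficients at zero every π_i equals 0.5, so the value is simply N ln(0.5), which gives a quick sanity check on the implementation.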

To find the critical points of the log likelihood function, set the first derivative with respect to each β_k equal to zero. In differentiating Equation 14, note that

$$\frac{\partial}{\partial \beta_k} \sum_{j=0}^{P} x_{ij}\beta_j = x_{ik} \qquad (15)$$

since the other terms in the summation do not depend on β_k and can thus be treated as constants. In differentiating the second half of Equation 14, take note of the general rule that $\frac{\partial}{\partial x}\ln y = \frac{1}{y}\frac{\partial y}{\partial x}$, as well as the fact that

. But,

So that

. Thus, differentiating Equation 14 with respect to each β_k,

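$$\frac{\partial l(\beta)}{\partial \beta_k} = \sum_{i=1}^{N} \left( y_i x_{ik} - n_i \pi_i x_{ik} \right) \qquad (16)$$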

Therefore,

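So that the gradient of the log likelihood in matrix form is given as:

$$l'(\beta) = X^{T}(y - \mu) \qquad (17)$$

which is a column vector of length P+1, whose elements are $\partial l(\beta)/\partial \beta_k$. Here X is the N × (P+1) matrix with elements x_ik, and μ is a column vector of length N, with elements $\mu_i = n_i\pi_i$.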

The maximum likelihood estimates for β can be found by setting each of the P+1 equations in Equation 16 equal to zero, and solving for each β_k. Each such solution, if any exists, specifies a critical point (either a maximum or a minimum). The critical point will be a maximum if the matrix of second partial derivatives (the Hessian matrix) is negative definite [8], that is, if v^T H v < 0 for every nonzero vector v; negative diagonal elements are necessary for this but not sufficient. The Hessian is formed by differentiating each of the P+1 equations in Equation 16 a second time with respect to each element of β, denoted by β_k'. The general form of the matrix of second partial derivatives is

$$\frac{\partial^2 l(\beta)}{\partial\beta_k\,\partial\beta_{k'}} = \frac{\partial}{\partial \beta_{k'}} \sum_{i=1}^{N}\left(y_i x_{ik} - n_i x_{ik}\pi_i\right) = -\sum_{i=1}^{N} n_i x_{ik}\, \frac{\partial \pi_i}{\partial \beta_{k'}} \qquad (18)$$

The Hessian in matrix form is given as

$$l''(\beta) = -X^{T} W X \qquad (19)$$

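where W is a square matrix of order N, with elements n_iπ_i(1-π_i) on the diagonal and zeros everywhere else. To evaluate Equation 18, we will make use of two general rules for differentiation. First, a rule for differentiating exponential functions:

$$\frac{d}{dx}\, e^{u(x)} = e^{u(x)}\, \frac{d}{dx} u(x) \qquad (20)$$

In our case, let $u(x) = \sum_{k=0}^{P} x_{ik}\beta_k$.

Second, the quotient rule for differentiating the quotient of two functions:

$$\left(\frac{f}{g}\right)'(a) = \frac{g(a)\,f'(a) - f(a)\,g'(a)}{[g(a)]^{2}} \qquad (21)$$

Applying these two rules together allows us to evaluate Equation 18.

$$\frac{d}{dx}\,\frac{e^{u(x)}}{1+e^{u(x)}} = \frac{\left(1+e^{u(x)}\right) e^{u(x)}\frac{d}{dx}u(x) - e^{u(x)}\, e^{u(x)}\frac{d}{dx}u(x)}{\left(1+e^{u(x)}\right)^{2}} = \frac{e^{u(x)}}{1+e^{u(x)}}\cdot\frac{1}{1+e^{u(x)}}\,\frac{d}{dx}u(x) \qquad (22)$$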

Since

while,

, clearly defined. Thus, Equation 18 can now be written as:

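$$\frac{\partial^2 l(\beta)}{\partial\beta_k\,\partial\beta_{k'}} = -\sum_{i=1}^{N} n_i x_{ik}\,\pi_i (1-\pi_i)\,x_{ik'} \qquad (23)$$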
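The gradient and Hessian described in this section can be computed directly in matrix form, as X^T(y − μ) with μ_i = n_iπ_i and as −X^T W X with W the diagonal matrix of weights n_iπ_i(1−π_i). The following NumPy sketch is one way to do this; the data are hypothetical, and the negative-definiteness check uses the eigenvalues of the Hessian.

```python
import numpy as np

def gradient_and_hessian(beta, X, y, n):
    """Return the score vector X^T (y - mu) and the Hessian -X^T W X."""
    eta = X @ beta
    pi = 1.0 / (1.0 + np.exp(-eta))      # fitted probabilities
    mu = n * pi                          # expected counts
    grad = X.T @ (y - mu)                # vector of length P+1
    w = n * pi * (1.0 - pi)              # diagonal elements of W
    hess = -(X.T * w) @ X                # (P+1) x (P+1) matrix
    return grad, hess

# Hypothetical binary data: intercept column plus one 0/1 risk factor
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])
n = np.ones_like(y)

grad, hess = gradient_and_hessian(np.zeros(2), X, y, n)
is_negative_definite = bool(np.all(np.linalg.eigvalsh(hess) < 0))
print(grad, hess, is_negative_definite)
```

Multiplying `X.T` by the weight vector scales each column of X^T by the matching weight, which avoids building the full N × N matrix W.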

Newton-Raphson Iteration Procedure

To find the roots of Equation 16 using the Newton-Raphson (NR) method, we generalize the method to a system of P+1 equations. Let β^old or β⁽⁰⁾ represent the vector of initial approximations for each β_k. Each step of the NR algorithm can then be written in matrix notation as:

$$\beta^{(1)} = \beta^{(0)} - \left[l''(\beta^{(0)})\right]^{-1} l'(\beta^{(0)}) \qquad (24)$$

Substituting the expressions for l′(β) and l′′(β) from Equations 17 and 19 simplifies the equation to the matrix form

$$\beta^{(1)} = \beta^{(0)} + \left[X^{T} W X\right]^{-1} X^{T}(y - \mu) \qquad (25)$$

Where

is a vector and W is the diagonal weight matrix, with entries π_i(1-π_i) (the n_i = 1 case of the weights n_iπ_i(1-π_i) defined for Equation 19).

The last equation has the form of a weighted least squares regression, which finds the best least-squares solution at each step. The procedure is called recursive (iteratively reweighted) least squares because, at each step, the weight matrix W changes (since the β's are changing). Now, Equation 25 can be written:

(26)

Continue applying Equation 26 until there is essentially no change between the elements of β from one iteration to the next. At that point, the maximum likelihood estimates are said to have converged, and the inverse of the negative Hessian in Equation 19, [X^T W X]^(-1), gives the variance-covariance matrix of the estimates. Because the estimation algorithm for the parameters of the logistic regression model is iterative, parameter estimates based on small samples may fail to converge, or may converge to local rather than global stationary points. This motivated the use of a large sample in this study. In this work, the iterative procedure was carried out with SAS software.
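The iteration described in this section can be sketched in a few lines of NumPy. This is an illustrative implementation of the Newton-Raphson update β ← β + [X^T W X]^(-1) X^T(y − μ), not the SAS procedure used in the study; the function name, the zero starting values, the tolerance and the toy data are all assumptions made for the example.

```python
import numpy as np

def fit_logistic_newton(X, y, n=None, tol=1e-8, max_iter=25):
    """Newton-Raphson for logistic regression with binomial counts.

    X: N x (P+1) design matrix (first column of ones for the intercept)
    y: successes per row; n: trials per row (defaults to 1, binary data)
    Returns (beta, covariance matrix, number of iterations used).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = np.ones_like(y) if n is None else np.asarray(n, dtype=float)
    beta = np.zeros(X.shape[1])              # initial approximation

    for iteration in range(1, max_iter + 1):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = n * pi * (1.0 - pi)              # diagonal of W
        xtwx = (X.T * w) @ X                 # X^T W X
        step = np.linalg.solve(xtwx, X.T @ (y - n * pi))
        beta = beta + step
        if np.max(np.abs(step)) < tol:       # essentially no change
            pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
            w = n * pi * (1.0 - pi)
            cov = np.linalg.inv((X.T * w) @ X)   # variance-covariance matrix
            return beta, cov, iteration
    raise RuntimeError("Newton-Raphson did not converge")

# Hypothetical toy data: one binary risk factor, four subjects per group
x = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
y = np.array([0, 0, 0, 1, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])

beta_hat, cov_hat, n_iter = fit_logistic_newton(X, y)
odds_ratio = np.exp(beta_hat[1])
std_errors = np.sqrt(np.diag(cov_hat))
print(beta_hat, odds_ratio, std_errors, n_iter)
```

For a single binary risk factor the estimates can be checked by hand: the fitted odds ratio equals the cross-product ratio of the 2 × 2 table, here (3 × 3)/(1 × 1) = 9. With small samples or perfectly separated data the update can diverge, which is why the sketch raises an error after a fixed number of iterations.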

Illustrative Example

To estimate the effects of risk factors on GDM, 1000 subjects (pregnant women at risk for GDM) were sampled through a retrospective study from five randomly selected hospitals in Ebonyi State, covering January 2010 to December 2011. Of these, 490 (49%) were at less than 28 weeks of gestational age and 510 (51%) were at 28 weeks or more. Of the total sample, 530 (53%) had gestational diabetes and 470 (47%) did not. Since GDM is a dichotomous variable, it is coded as 0 or 1. The independent factors considered in this work are Age, Category of pregnant women, Obesity, Income group (INCM), Life-style and exercise, family history (F.H) of diabetes, Hypertension, and Diet habit (D.H); these are also categorical and are coded between 0 and 3. They are presented in Table 1.

Table 1: Code sheet of concerned independent variables.

Results of Analysis

The results are shown in the following tables: Tables 2 and 3

Table 2: Chi-square analysis of covariates showing significance, after comparison with p and phi-value for the sample.

Table 3: Results of fitting the Multiple Logistic Regression Model, including O.R and 95% C.I, by using stepwise logistic procedure for the sample.

Table 3 shows that three risk factors, Obesity, F.H and Exercise, were significant, since the p-value for each was less than 0.05. Since the hospitals where these data were collected are mainly located in urban areas, the results suggest that features of urban lifestyle, such as high-calorie food, low physical activity, reliance on remote-controlled equipment and little exercise, contribute to the incidence of obesity in the sample analyzed. Genetic and environmental factors are also reasons for obesity.

The reference group for obesity was non-obese persons. The O.R for obesity was 3.017, which shows that the odds of GDM for an obese person are 3.017 times those for a non-obese person, keeping all other factors constant. As the O.R for obesity was greater than 1 and its 95% confidence interval did not include 1, obesity had a positive and statistically significant association with GDM. The reference group for F.H was persons with no F.H. The O.R for F.H was 2.489, which means that the odds of GDM for a pregnant woman in Ebonyi State with a positive F.H are 2.489 times those for a pregnant woman with no F.H of GDM. F.H was therefore significantly different from the reference group and positively associated with GDM. The reference group for exercise was a sedentary lifestyle. The O.R for exercise was 0.519. As a general rule, if the O.R is less than 1 and the chi-square test is significant, the exposure is protective against the outcome. The 95% confidence interval for exercise also did not include 1, so the O.R for exercise was significantly different from the reference group: women who take light exercise have odds of GDM that are 48.1% lower (1 - 0.519 = 0.481) than those of sedentary women.

In light of the above analysis of the 1000 sampled pregnant women, three risk factors, obesity, F.H and exercise, were significant, which means that the empirical findings confirm the concept and theory of risk factors for these variables. Clinicians and public health personnel should therefore take appropriate measures to control these risk factors, and prevention programs against GDM should be started. For the remaining five risk factors (age, category of women, income, hypertension and D.H), the empirical findings do not confirm the concepts and theories of risk factors. Every study starts from the past literature and the studies done by experts, and according to that literature, these five variables are also risk factors for diabetes in different regions of the world.
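The odds ratios discussed in this section are obtained by exponentiating the fitted logistic regression coefficients, and a Wald-type 95% confidence interval is obtained by exponentiating the coefficient plus or minus 1.96 standard errors. The sketch below illustrates the arithmetic using the reported odds ratios of 3.017 and 0.519; the standard error is hypothetical, because the entries of Table 3 are not reproduced here.

```python
import math

def odds_ratio_with_ci(beta_hat, se, z=1.96):
    """Odds ratio exp(beta) with a Wald confidence interval."""
    return (math.exp(beta_hat),
            math.exp(beta_hat - z * se),
            math.exp(beta_hat + z * se))

# Coefficient implied by the reported odds ratio for obesity
beta_obesity = math.log(3.017)
se_hypothetical = 0.20            # assumed value, NOT taken from the paper

or_obesity, ci_low, ci_high = odds_ratio_with_ci(beta_obesity, se_hypothetical)
significant = not (ci_low <= 1.0 <= ci_high)   # interval excludes 1

# An odds ratio below 1 read as a percentage change in the odds
pct_change_exercise = (0.519 - 1.0) * 100.0    # about -48.1 percent

print(f"OR = {or_obesity:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
print(f"change in odds with exercise = {pct_change_exercise:.1f}%")
```

Note that the quantity 1 − O.R describes a relative reduction in the odds, not a probability.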

Multivariate Version with Interaction Terms

All the interaction terms were calculated separately and tested for significance at the 5% level (Table 4).

Table 4: Results of significant main effects and interaction terms of sample.

In the sample analysis, the main-effect factors category of women, age, obesity and F.H were significant risk factors. In addition, age showed a significant effect when interacted with category of women (P=0.005), exercise (P=0.000) and D.H (P=0.016). Similarly, obesity was significant when interacted with INCM (P=0.008) and with D.H (P=0.01), and D.H (P=0.01) had a significant effect when interacted with INCM. The odds ratios for category of women (0.365) and for age (0.286) indicate that women at less than 28 weeks of gestational age and pregnant women less than 30 years of age were protected against this disease. The results for obesity (O.R=6.582, P=0.000) and F.H of GDM (O.R=2.679, P=0.000) indicate that obese pregnant women have 6.582 times the odds of disease compared with non-obese pregnant women, while pregnant women with GDM in their family have 2.679 times the odds of developing the disease compared with pregnant women with no F.H of GDM. Exercise was not a significant factor on its own, but it became significant when interacted with age (P=0.000). The interactions of age with category of women (P=0.005) and with D.H (P=0.016) were each significant. Obesity was also significant when interacted with INCM (P=0.008) and with D.H (P=0.012). Obesity had O.R=6.582 (P=0.000) as a main effect, but when it was interacted with D.H the O.R decreased to 2.223 (P=0.01), which suggests that obesity can be reduced through a balanced or proper diet. Some of these interaction terms were very important, while others were not statistically significant or had no biological interpretation. For example, in the main-effects model, age and category of women showed no significant effect, but their interaction was significant with an odds ratio greater than 1. Similarly, the interaction of INCM and obesity gave a misleading interpretation, with O.R=0.592.

Logit Model for Overall Sample with and without Interaction Terms

The model without interaction terms:

The model with interaction terms for the sample is given below

Summary of Conclusions

We here summarize and conclude as follows:

1. In this hospital-based study, the proportion of pregnant women with GDM is greater than the proportion without GDM, and pregnant women at 28 weeks of gestational age or more are more liable to diabetes than those at less than 28 weeks. Among the pregnant women entering the hospitals for GDM screening, those older than thirty years were three times as many as those younger than thirty years, from which we conclude that GDM is more common in women above thirty years and that the prevalence of GDM clearly increases with advancing age. Similarly, obese pregnant women were 1.4 times as many as non-obese pregnant women, and the number of pregnant women with a family history of GDM was approximately equal to the number without F.H of GDM in this sample. It is also concluded from the epidemiological study that educated pregnant women are aware of GDM and are more careful than uneducated pregnant women.

2. In the sample analysis, the risk factors obesity and F.H were positively associated with GDM, and exercise was protective against this disease. This means that pregnant women who exercise and lead a simple lifestyle are at lower risk of GDM and other diseases than pregnant women who lead a sedentary lifestyle.

Recommendations

Based on the ROC analysis, we recommend a threshold of 177 mg/dl as the cutoff value of the 50-gram GCT for screening for GDM in each trimester in women at risk of GDM; it is suitable for low-BMI or non-obese pregnancies. We also recommend that the semi-parametric GLMM method be used in evaluating the impact of covariates in diagnostic testing programmes, since it produces smoother ROC curves than the other methods and otherwise compares favorably with them.

In the second aspect of the analysis, where the emphasis is on the prevalence of GDM, the findings for age above thirty years, gestational age of 28 weeks or more, obesity, F.H and educational level suggest that GDM is not associated with a single risk factor only, but may be associated with more than one risk factor. It is clear from the findings of the study that, in the overall sample analysis, obesity and F.H of diabetes are associated risk factors. A GDM patient or non-GDM pregnant woman must therefore be aware of the consequences of a regularly high or low blood sugar level and of the amount of cholesterol in the blood, and precautionary measures must be taken to control the sugar level. Physical activity is inversely related to BMI, so we recommend that measures to increase physical activity in these populations be adopted urgently. Only a small number of pregnant women are aware of the increased genetic susceptibility of their first- or second-degree relatives to developing GDM; weight reduction and regular physical exercise are suggested for them. As a rapidly expanding societal problem, GDM requires collective efforts, which must include giving attention to prevention.

Consistent with epidemiological concepts, prevention of GDM should focus first on reducing the incidence of the disease through good nutritional status, physical fitness and regular check-ups for individuals in the society; second, early detection of the disease is necessary. Clinicians should advise pregnant women, especially those more than 30 years of age or with F.H of GDM, to monitor their blood glucose level, or at least to take a urine test for diagnosing GDM. Advice on measuring blood pressure is also necessary. Doctors or clinicians should arrange staged management programmes, which would be beneficial and economical for society. If the probability of getting GDM is high according to the clinical prediction model, clinicians should advise patients to control obesity and blood pressure, to exercise, and to eat a balanced diet. They should arrange seminars at district level. Greater knowledge of the risk factors for GDM may help in planning prevention programmes for GDM in future. The Government of Ebonyi State and the Health Ministries, in collaboration with the WHO, should arrange as many seminars and conferences on diabetes as possible. The media should play a significant role in educating people and raising awareness about GDM. Non-Governmental Organizations (NGOs) can also play their role with the help of well-trained health care teams, educating both patients and the general public about the consequences and complications of this chronic disease. In rural areas, special arrangements should be made for educating people about balanced diet and about this disease. Further studies are needed to specify the changes associated with psychosocial problems in Ebonyi State, and to study the genetic components of the individual and collective effects of the risk factors associated with GDM.

References

1. Agresti A (2007) An Introduction to categorical data analysis. (2nd Edn), Wiley, New York, USA.

2. Alonzo TA, Pepe MS (2002) Distribution-free ROC analysis using binary regression techniques. Biostatistics 3: 421-432.

3. Hosmer DW, Lemeshow S (2000) Applied logistic regression. (2nd Edn), Wiley-Interscience Publication, New York, USA.

4. Pepe M (2004) The statistical evaluation of medical tests for classification and prediction. Oxford University Press, New York, USA.

5. Chou P, Liaq MJ, Tsai ST (1994) Risk factors of diabetes. Diabetes Res Clin Pract 26: 229-235.

6. Jafar TH, Chaturvedi N, Pappas G (2006) Prevalence of overweight and obesity and their association with hypertension and DM in an Indo-Asian population. CMAJ 175: 1071-1077.

7. Hagura R, Matsuda A, Kuzuya T, Yoshinaga H, Kosaka K (1994) Family history of diabetic patients in Japan. Diabetes Res Clin Pract S69-S73.

8. Fox J (2005) Maximum-likelihood estimation of the logistic regression model. UCLA/CCPR Notes.