Imputation of multiple categorical covariates with missing data

3 messages 3 people Latest: Sep 23, 2015
Dear NMusers,

I would like to investigate the effect of several genotypes on clearance in a pop PK model. The issue is that most genotypes have some amount of missing data.

I have discarded the genotypes with too many missing samples (>30%) and now want to handle the remaining genotypes appropriately before I move on to an automated stepwise covariate search in PsN. A colleague informed me that the following mixture model can serve for imputation of a single categorical covariate (let's call it GENO):

-----------------------------------

; In the dataset, the genotype is saved in the variable GENO and coded -99 if unknown,
; otherwise it takes on the values 0,1,2
$INPUT ID OCC TIME AMT DV .... GENO ....

$PK

; here you check if the genotype is available or not (GENO==-99).
; If it's available, you save the new variable GENOME=GENO...
IF (GENO.NE.-99) THEN
  GENOME = GENO
; ... otherwise you use the mixture to impute GENOME
ELSE
  IF (MIXNUM.EQ.1) GENOME = 0
  IF (MIXNUM.EQ.2) GENOME = 1
  IF (MIXNUM.EQ.3) GENOME = 2
ENDIF

; then you use the variable GENOME (not GENO, which was in the dataset) to define CL,
; or whichever other parameter you want.
; you need to use a new variable since NONMEM won't let you change the value of one of
; the fields in the dataset.

IF (GENOME.EQ.0) THEN
  TVCL = THETA(1)*((WT/12.5)**0.75)
  TVBIO = 1
ENDIF

; Three sub-populations whose proportion is given by the THETAs
$MIX NSPOP=3
P(1)=THETA(14)
P(2)=THETA(15)
P(3)=THETA(16)

$THETA 0.4 FIX ; GENO = 0 fixed to observed proportion in known genotype
$THETA 0.4 FIX ; GENO = 1 fixed to observed proportion in known genotype
$THETA 0.2 FIX ; GENO = 2 fixed to observed proportion in known genotype

-----------------------------------

*The question is how to repeat such an approach when there are several genotypes with missing values (GENO1, GENO2, ..., GENOX) which need to be explored?*

The answer I received from my colleague is that it would be rather difficult, as the mixture model would require the specification of every possible combination of the different genotypes.

One approach I am considering is performing the stepwise covariate search in PsN (where by default missing categorical data is set to the most common value). Then I would retrace the steps of the search based on the scm log file and compare the OFV drops and p-values of the chosen relationships with those obtained when a mixture model approach is used instead. If the difference is small and the chosen relationship remains far removed from any other relationship which could have been chosen, I accept it and build my covariate model.

Any input on this matter would be very much appreciated.

Have a good day.

Best regards,
Yassine
Roskilde Hospital
Denmark
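To make the colleague's point about combinations concrete, here is a small Python sketch (not NONMEM code) that enumerates the sub-populations a joint mixture would need. The function name `joint_mixture_table`, the GENO2 proportions, and above all the assumption that the genotypes are independent (so joint proportions are products of the marginal ones) are illustrative assumptions, not something stated in this thread.

```python
from itertools import product

def joint_mixture_table(marginals):
    """Enumerate every genotype combination for a joint mixture.

    marginals: dict mapping a genotype name to its list of level
    proportions (levels 0, 1, 2, ...), each list summing to 1.
    ASSUMPTION: genotypes are independent, so the joint proportion
    is the product of the marginal proportions.
    Returns a list of (mixnum, {name: level}, proportion).
    """
    names = list(marginals)
    level_ranges = [range(len(marginals[name])) for name in names]
    rows = []
    for mixnum, levels in enumerate(product(*level_ranges), start=1):
        prob = 1.0
        for name, level in zip(names, levels):
            prob *= marginals[name][level]
        rows.append((mixnum, dict(zip(names, levels)), prob))
    return rows

# Hypothetical observed proportions for two genotypes with levels 0, 1, 2
marginals = {"GENO1": [0.4, 0.4, 0.2], "GENO2": [0.5, 0.3, 0.2]}
table = joint_mixture_table(marginals)

for mixnum, levels, prob in table:
    print(mixnum, levels, round(prob, 3))
```

With two three-level genotypes the table already has 9 rows, and with X such genotypes it has 3**X, which is why the joint mixture quickly becomes impractical. The sketch also ignores subjects for whom only some genotypes are missing; they would need only the combinations consistent with their known values.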
I would start with assigning a separate level to missing values rather than running mixture model(s). If data are missing at random, you would expect the "missing" level to be somewhere in the middle of all other levels. One can try to assign a different (larger) OMEGA for this level (as by definition it combines subjects with a wider range of parameters due to differences in genotypes). Then you should be able to identify all strong effects.

You may retain even genotypes with a large fraction of missing values: you would care only about the representation of each known genotype (so that each analyzed level contains a sufficient number of subjects). The "missing" level should not be used as a reference. As a common-sense check, the parameter value for "missing" could be compared with the weighted (by observed prevalence) sum of the parameter values for all known genotypes.

Leonid

--------------------------------------
Leonid Gibiansky, Ph.D.
President, QuantPharm LLC
web: www.quantpharm.com
e-mail: LGibiansky at quantpharm.com
tel: (301) 767 5566
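The common-sense check can be done by hand or with a few lines of Python. The sketch below uses made-up clearance estimates and subject counts purely for illustration; the function name `prevalence_weighted_value` and all numbers are hypothetical.

```python
def prevalence_weighted_value(estimates, counts):
    """Prevalence-weighted sum of per-genotype parameter estimates.

    estimates: dict level -> estimated typical value (e.g. CL) for that level
    counts:    dict level -> number of subjects with that known genotype
    """
    total = sum(counts[level] for level in estimates)
    return sum(estimates[level] * counts[level] / total for level in estimates)

# Hypothetical values: typical CL per known genotype and subject counts
cl_known = {0: 10.0, 1: 8.0, 2: 5.0}
n_known = {0: 40, 1: 40, 2: 20}
cl_missing = 8.5  # hypothetical estimate for the separate "missing" level

expected = prevalence_weighted_value(cl_known, n_known)
rel_diff = (cl_missing - expected) / expected

print("weighted value for known genotypes:", round(expected, 3))
print("relative difference of 'missing' level:", round(rel_diff, 3))
```

A "missing" estimate close to the weighted value is consistent with data missing at random; a large discrepancy would suggest looking more closely at why genotypes are missing. What counts as "close" is a judgment call and is not defined in this thread.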
hi Yassine,

There are alternatives to the mixture approach that would probably work better in your case and can be used in the context of an scm too, see e.g.:

Johansson ÅM, Karlsson MO. Comparison of methods for handling missing covariate data. AAPS J. 2013 Oct;15(4):1232-41. doi: 10.1208/s12248-013-9526-y. (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3787222)

Keizer RJ, Zandvliet AS, Beijnen JH, Schellens JH, Huitema AD. Performance of methods for handling missing categorical covariate data in population pharmacokinetic analyses. AAPS J. 2012 Sep;14(3):601-11. doi: 10.1208/s12248-012-9373-2. Epub 2012 May 31. (http://www.ncbi.nlm.nih.gov/pubmed/22648902)

best regards,
Ron

-------------------------------
Pirana Software & Consulting BV
@ronpirana