Project 2 - Zambia DHS Data Analysis and Modeling
I requested and obtained DHS data for Zambia (2018, DHS-VII).
Model 1 - Penalized Logistic Regression
Using the R script provided, split and sample your DHS persons data and evaluate the AUC-ROC values you produce. Which “top_model” performed the best (had the largest AUC)?
Below are the results of the “top_models”:
penalty | .metric | .estimator | mean | n | std_err | .config
---|---|---|---|---|---|---
0.0001 | roc_auc | hand_till | 0.608 | 1 | NA | Preprocessor1_Model01
0.000127 | roc_auc | hand_till | 0.608 | 1 | NA | Preprocessor1_Model02
0.000161 | roc_auc | hand_till | 0.608 | 1 | NA | Preprocessor1_Model03
0.000204 | roc_auc | hand_till | 0.608 | 1 | NA | Preprocessor1_Model04
0.000259 | roc_auc | hand_till | 0.608 | 1 | NA | Preprocessor1_Model05
0.000329 | roc_auc | hand_till | 0.608 | 1 | NA | Preprocessor1_Model06
0.000418 | roc_auc | hand_till | 0.608 | 1 | NA | Preprocessor1_Model07
0.000530 | roc_auc | hand_till | 0.608 | 1 | NA | Preprocessor1_Model08
0.000672 | roc_auc | hand_till | 0.608 | 1 | NA | Preprocessor1_Model09
0.000853 | roc_auc | hand_till | 0.609 | 1 | NA | Preprocessor1_Model10
0.00108 | roc_auc | hand_till | 0.608 | 1 | NA | Preprocessor1_Model11
0.00137 | roc_auc | hand_till | 0.608 | 1 | NA | Preprocessor1_Model12
0.00174 | roc_auc | hand_till | 0.607 | 1 | NA | Preprocessor1_Model13
0.00221 | roc_auc | hand_till | 0.606 | 1 | NA | Preprocessor1_Model14
0.00281 | roc_auc | hand_till | 0.603 | 1 | NA | Preprocessor1_Model15
Based on this output, model 10 had the largest AUC (0.609) and thus performed the best.
Are you able to use the feature selection penalty to tune your hyperparameter and remove any potentially irrelevant predictors? Provide justification for your selected penalty value.
Yes: the lasso penalty shrinks the coefficients of weak predictors toward zero and removes them from the model entirely once the penalty is large enough, so tuning the penalty doubles as feature selection. I selected a penalty of 0.000853. In the “top_models” output, model 10 has the largest mean AUC (0.609) at that penalty, and in the penalty vs AUC plot above, AUC begins to decline for penalties beyond roughly that value, suggesting larger penalties start discarding informative predictors.
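The tuning itself was done with the provided R script; as a hedged illustration of the same idea in Python, a minimal sketch with scikit-learn might look like the following. The filename and the assumption that features are already numeric are hypothetical, and the C = 1/(n·penalty) mapping between glmnet-style penalties and scikit-learn's C is approximate. scikit-learn's `multi_class="ovo"` AUC is the Hand & Till estimator reported in the table above.

```python
# Sketch: tune an L1-penalized multinomial logistic regression over a
# log-spaced penalty grid and score each fit with the Hand & Till AUC.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

dhs = pd.read_csv("zambia_dhs_persons.csv")  # hypothetical filename
X, y = dhs.drop(columns=["wealth"]), dhs["wealth"]  # assumes numeric features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

aucs = {}
for penalty in np.logspace(-4, np.log10(0.00281), 15):  # grid span from the table
    # scikit-learn parameterizes regularization as C, roughly 1 / (n * penalty)
    model = LogisticRegression(penalty="l1", solver="saga",
                               C=1.0 / (penalty * len(X_train)), max_iter=5000)
    model.fit(X_train, y_train)
    aucs[penalty] = roc_auc_score(y_test, model.predict_proba(X_test),
                                  multi_class="ovo")  # Hand & Till multiclass AUC

best = max(aucs, key=aucs.get)
print(f"best penalty: {best:.6f}  AUC: {aucs[best]:.3f}")
```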
Finally, provide your ROC plots and interpret them. How effective is your penalized logistic regression model at predicting each of the five wealth outcomes?
I used the selected penalty of 0.000853 (slice 10) for the ROC plots. A curve's distance from the 45-degree diagonal indicates how well the model separates that wealth outcome from the rest; a curve hugging the diagonal means the model does little better than chance. The penalized logistic regression is most effective for the extreme outcomes: it performs best for wealth 5, then wealth 1. It is only moderately effective for outcomes 2 and 4, and overall does not do well at differentiating the middle outcomes 2, 3, and 4 from the others.
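The one-vs-rest curves themselves came from the R script; continuing the Python sketch above, an analogous plot could be produced like this (again a sketch, not the project's actual code):

```python
# Sketch: one-vs-rest ROC curve per wealth outcome at the selected penalty,
# with the 45-degree no-skill diagonal for reference.
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

best_model = LogisticRegression(penalty="l1", solver="saga",
                                C=1.0 / (best * len(X_train)), max_iter=5000)
best_model.fit(X_train, y_train)
probs = best_model.predict_proba(X_test)
y_bin = label_binarize(y_test, classes=best_model.classes_)  # one column per outcome

for i, cls in enumerate(best_model.classes_):
    fpr, tpr, _ = roc_curve(y_bin[:, i], probs[:, i])
    plt.plot(fpr, tpr, label=f"wealth {cls} (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], "k--")  # chance line
plt.xlabel("1 - specificity")
plt.ylabel("sensitivity")
plt.legend()
plt.show()
```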
Model 2 - Random Forest Model
Using the R script provided, set up your random forest model and produce the AUC-ROC values for the randomly selected predictors and the minimal node size, again with wealth as the target.
How did your random forest model fare when compared to the penalized logistic regression?
The random forest model is fairly comparable to the penalized logistic regression: it appears to perform slightly better for some wealth outcomes, but overall the two models performed similarly.
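The two hyperparameters the R script tunes, the number of randomly selected predictors (mtry) and the minimal node size (min_n), map to `max_features` and `min_samples_leaf` in scikit-learn; a minimal Python sketch of the same tuning, with hypothetical candidate values:

```python
# Sketch: grid search over the random forest's mtry / min_n equivalents,
# scored with a Hand & Till style multiclass AUC.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=42),
    param_grid={"max_features": [2, 4, 6],          # mtry candidates (hypothetical)
                "min_samples_leaf": [2, 10, 25]},   # min_n candidates (hypothetical)
    scoring="roc_auc_ovo",
    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))
```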
Provide your ROC plots and interpret them.
Similar to the penalized logistic regression, the random forest model is most effective at predicting wealth outcomes 1 and 5 and less effective at differentiating the middle outcomes 2, 3, and 4 from the others. There does appear to be a small improvement over the penalized logistic regression on the middle outcomes, but overall the two models perform similarly.
Are you able to provide a plot that supports the relative importance of each feature’s contribution towards the predictive power of your random forest ensemble model?
Age appears to be the most important feature in terms of contribution to predictive power, and gender the least important. I would have guessed that education level would matter more, so that was a somewhat surprising outcome; it would be interesting to examine other features of the data.
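The project's importance plot came from the R script; as an illustrative equivalent, the fitted scikit-learn forest from the sketch above exposes impurity-based importances that can be plotted directly:

```python
# Sketch: horizontal bar chart of impurity-based feature importances
# from the tuned forest, smallest to largest.
import matplotlib.pyplot as plt
import pandas as pd

rf = grid.best_estimator_
(pd.Series(rf.feature_importances_, index=X_train.columns)
   .sort_values()
   .plot.barh())
plt.xlabel("mean decrease in impurity")
plt.tight_layout()
plt.show()
```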
Model 3 - Logistic Regression Model
Using the Python script provided, train a logistic regression model using the TensorFlow estimator API and your DHS data, again with wealth as the target. Apply the linear classifier to the feature columns and determine the accuracy, AUC, and other evaluative metrics for each of the five wealth outcomes.
Evaluative Metric | Result (wealth 1) | Result (wealth 2) | Result (wealth 3) | Result (wealth 4) | Result (wealth 5) |
---|---|---|---|---|---|
accuracy | 0.776949 | 0.781069 | 0.784110 | 0.827759 | 0.819225 |
accuracy_baseline | 0.776949 | 0.781069 | 0.784110 | 0.827759 | 0.819225 |
auc | 0.631950 | 0.535903 | 0.536906 | 0.564764 | 0.649596 |
auc_precision_recall | 0.295410 | 0.232036 | 0.236950 | 0.210365 | 0.309374 |
average_loss | 0.510198 | 0.522409 | 0.520388 | 0.456083 | 0.443171 |
label/mean | 0.223051 | 0.218931 | 0.215890 | 0.172241 | 0.180775 |
loss | 0.510198 | 0.522409 | 0.520388 | 0.456083 | 0.443171 |
precision | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
prediction/mean | 0.223939 | 0.225920 | 0.219684 | 0.169870 | 0.180851 |
recall | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
global_step | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
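One note on this table: for every outcome, accuracy exactly equals accuracy_baseline and precision and recall are both zero, which means that at the default 0.5 threshold the linear classifier never predicts the positive class (each wealth outcome is a minority class of roughly 17-22% of labels, per label/mean). AUC is therefore the more informative metric here. A condensed sketch of the per-outcome setup follows; column names and the filename are hypothetical, and note that the tf.estimator API is deprecated and removed in recent TensorFlow releases:

```python
# Sketch: one binary LinearClassifier per wealth outcome via the
# (deprecated) tf.estimator API.
import pandas as pd
import tensorflow as tf  # tf.estimator API; removed in TF 2.16+

df = pd.read_csv("zambia_dhs_persons.csv")            # hypothetical filename
df["is_wealth_1"] = (df["wealth"] == 1).astype(int)   # one binary target per outcome
train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)

FEATURES = ["age", "gender"]  # hypothetical subset of DHS variables
feature_columns = [
    tf.feature_column.numeric_column("age"),
    tf.feature_column.categorical_column_with_vocabulary_list(
        "gender", ["male", "female"]),
]

def make_input_fn(data, label, shuffle=True, n_epochs=None):
    def input_fn():
        ds = tf.data.Dataset.from_tensor_slices((dict(data[FEATURES]), data[label]))
        if shuffle:
            ds = ds.shuffle(1000)
        return ds.batch(32).repeat(n_epochs)
    return input_fn

est = tf.estimator.LinearClassifier(feature_columns=feature_columns)
est.train(make_input_fn(train_df, "is_wealth_1"), max_steps=100)  # global_step 100
print(est.evaluate(make_input_fn(test_df, "is_wealth_1", shuffle=False, n_epochs=1)))
```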
Then continue with your linear classifier, adding the derived feature columns you have selected in order to capture combinations of correlations (instead of learning a single model weight per feature). Again produce your ROC curves and interpret the results.
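The derived columns follow the standard estimator-API pattern of bucketizing a numeric column and crossing it with a categorical one; a sketch of that pattern, with hypothetical column names and boundaries:

```python
# Sketch: derived feature columns. Bucketize age and cross it with
# education so the linear model learns a weight per (age bucket,
# education) combination rather than only per single feature.
age = tf.feature_column.numeric_column("age")
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", ["none", "primary", "secondary", "higher"])
age_x_education = tf.feature_column.crossed_column(
    [age_buckets, education], hash_bucket_size=100)

est = tf.estimator.LinearClassifier(
    feature_columns=feature_columns + [age_buckets, age_x_education])
```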
Wealth 1:
Wealth 2:
Wealth 3:
Wealth 4:
Wealth 5:
The ROC curves for this logistic regression model are similar to those of the penalized logistic regression and random forest models: the model performs better, with a larger AUC, for wealths 1 and 5, and is not very effective at differentiating wealths 2, 3, and 4. As with the previous models, it does appear to predict wealth 4 somewhat better than wealths 2 and 3.
Model 4 - Gradient Boosting Model
Using the Python script provided, train a gradient boosting model using decision trees with the TensorFlow estimator API. Provide evaluative metrics, including a measure of accuracy and AUC.
Evaluative Metric | Result (wealth 1) | Result (wealth 2) | Result (wealth 3) | Result (wealth 4) | Result (wealth 5) |
---|---|---|---|---|---|
accuracy | 0.777440 | 0.780284 | 0.784306 | 0.828543 | 0.826974 |
accuracy_baseline | 0.776949 | 0.785875 | 0.784110 | 0.827759 | 0.819225 |
auc | 0.654496 | 0.659469 | 0.549197 | 0.607277 | 0.689823 |
auc_precision_recall | 0.326235 | 0.319574 | 0.251778 | 0.249540 | 0.378418 |
average_loss | 0.500891 | 0.487646 | 0.516750 | 0.448407 | 0.430754 |
label/mean | 0.223051 | 0.214125 | 0.215890 | 0.172241 | 0.180775 |
loss | 0.500891 | 0.487646 | 0.516750 | 0.448407 | 0.430754 |
precision | 0.575758 | 0.571429 | 0.545455 | 0.681818 | 0.739394 |
prediction/mean | 0.219308 | 0.219959 | 0.215268 | 0.164324 | 0.177925 |
recall | 0.008355 | 0.003665 | 0.005452 | 0.008542 | 0.066196 |
global_step | 100.000000 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
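The setup mirrors the linear classifier from Model 3, swapping in the boosted-trees estimator; a condensed sketch reusing the input functions above (tree count and depth are illustrative, and tf.estimator.BoostedTreesClassifier is likewise deprecated in recent TensorFlow releases):

```python
# Sketch: gradient-boosted decision trees, one binary model per wealth
# outcome as before. Boosted trees require categorical columns to be
# one-hot wrapped with indicator_column.
boosted_columns = [
    tf.feature_column.numeric_column("age"),
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(
            "gender", ["male", "female"])),
]

est = tf.estimator.BoostedTreesClassifier(
    feature_columns=boosted_columns,
    n_batches_per_layer=1,  # small dataset: use the whole batch per layer
    n_trees=100,
    max_depth=6)
est.train(make_input_fn(train_df, "is_wealth_1"), max_steps=100)
print(est.evaluate(make_input_fn(test_df, "is_wealth_1", shuffle=False, n_epochs=1)))
```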
Produce the predicted probabilities plot as well as the ROC curve for each wealth outcome and interpret these results.
Wealth 1:
Wealth 2:
Wealth 3:
Wealth 4:
Wealth 5:
The ROC curves for the gradient boosting model are again similar to those of the previous models: the model performs better, with a larger AUC, for wealths 1 and 5, and is less effective at differentiating wealths 2, 3, and 4 from the others. Comparing the evaluative metrics with those of the logistic regression model, the gradient boosting model shows slightly improved accuracy and a larger AUC for every wealth outcome, particularly for wealths 2 and 4.
Looking at the predicted probabilities plots, these results make sense: there are more people in wealth outcomes 1-3 than in 4 and 5, so the predicted probabilities are higher for those outcomes.
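The predicted-probabilities plots come from the estimator's predict() output; a short sketch of how one panel could be produced, continuing from the boosted-trees sketch above:

```python
# Sketch: histogram of per-example predicted probabilities for one
# wealth outcome, extracted from the estimator's prediction dicts.
import matplotlib.pyplot as plt

pred_dicts = list(est.predict(
    make_input_fn(test_df, "is_wealth_1", shuffle=False, n_epochs=1)))
probs = pd.Series([p["probabilities"][1] for p in pred_dicts])
probs.plot(kind="hist", bins=20, title="predicted probabilities: wealth 1")
plt.show()
```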
Analyze all four models
According to the evaluation metrics, which model produced the best results? Were there any discrepancies among the five wealth outcomes from your DHS survey dataset?
According to the evaluation metrics, the gradient boosting model produced the best results: it had higher accuracy across the wealth outcomes and a larger AUC for every outcome. All four models showed the same pattern across the five outcomes: they predicted the extremes of the wealth distribution (wealths 1 and 5) well, while performance on the middle outcomes (2, 3, and 4) was noticeably weaker.