Ana Gomez, a data analyst at Cha-Ching Bank, has compiled data on 500 past customers to whom Cha-Ching Bank marketed its Home Equity Line of Credit (HELOC) product. The data includes each customer's age, sex, and income, and whether the customer responded to the HELOC offer. Ana would like to team up with you to accomplish two data mining tasks:
(a) Develop a k-NN model for predicting whether or not a bank customer will respond to a HELOC offer.
(b) Identify which of the 20 new customers are likely to respond to a HELOC offer.
Follow the k-NN optimization process (with normalization) shown in the example process 07-01-RidingMowers k-NN Optimized Normalized.rmp, with the changes described below:
Make a copy of the RidingMowers process mentioned above. Rename the copy by right-clicking it, then double-click it to load it onto the RapidMiner canvas and start making changes.
Import the HELOC.csv and HELOC-score.csv data into the RapidMiner repository.
Load these files into the process, connecting them in place of the existing data files.
Remove the Nominal to Binominal operator from the original process.
Instead, use the Numerical to Binominal operator to convert the HELOC outcome variable to a binominal attribute.
Use the Set Role operator to set the role of HELOC to label.
In the Edit Parameter Settings panel of the Optimize Parameters (Grid) operator, change the range of k to vary from a minimum of 1 to a maximum of 50 in 25 steps (linear scale).
Inside the Optimize Parameters (Grid) operator, change the Validation (Split Validation) operator to a split ratio of 0.75 with stratified sampling.
In the k-NN operator, change the measure types to MixedMeasures and the mixed measure to MixedEuclideanDistance, since the data has two numeric attributes and one categorical attribute (Sex).
In the Performance (Binomial Classification) operator, set the positive class to true and the main criterion for optimization to f-measure.
Run the process. Report the following results and provide your interpretation (important):
What is the optimal k value obtained?
What is the optimal f-measure value for the validation partition?
What is the AUC of your model?
What are the precision, recall, and accuracy of the model?
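As a cross-check on the reported values, the following Python sketch lists the k values a grid of this shape might try and derives the metrics from confusion-matrix counts. It assumes the grid evaluates steps + 1 evenly spaced values rounded to integers (worth confirming against the grid results table in RapidMiner). The counts in the example are made up purely for illustration, and `k_grid` and `classification_metrics` are hypothetical helper names, not RapidMiner functions.

```python
def k_grid(k_min, k_max, steps):
    """Evenly spaced integer k values.

    Assumption: a linear grid with `steps` steps yields steps + 1 points.
    """
    return [round(k_min + i * (k_max - k_min) / steps) for i in range(steps + 1)]


def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and f-measure for the positive class (true)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    denom = precision + recall
    # f-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


if __name__ == "__main__":
    print(k_grid(1, 50, 25))
    # Made-up counts for a 125-row validation partition (25% of 500 rows).
    print(classification_metrics(tp=30, fp=10, fn=20, tn=65))
```

Because the f-measure is a harmonic mean, a k value with high accuracy but weak recall on the true class will rank low in the optimization, which is worth keeping in mind when interpreting the optimal k.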
Provide screenshots of the following:
a. Confusion matrix obtained from the Performance operator
b. Result from Optimize Parameters (Grid) showing the optimal k-value selected
c. Result table showing all the k-values and performance metrics, sorted by f-measure in descending order.
d. The scored data for the 20 new customers, clearly showing the confidence (true), confidence (false), and the prediction (HELOC) columns.
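To help interpret the confidence and prediction columns, here is a minimal Python sketch of how a k-NN scorer with a mixed Euclidean distance could produce them. It is an illustration under stated assumptions, not RapidMiner's implementation: numeric attributes are z-score normalized with training statistics, a categorical mismatch contributes 1 to the squared distance, votes are unweighted, and a tie is predicted as false. The column names (Age, Income, Sex, HELOC) and all helper names are hypothetical.

```python
import math
from collections import Counter

NUMERIC = ["Age", "Income"]   # assumed column names
NOMINAL = ["Sex"]             # assumed column name


def z_params(rows, numeric_keys=NUMERIC):
    """Mean and sample standard deviation per numeric attribute (needs >= 2 rows)."""
    params = {}
    for key in numeric_keys:
        values = [row[key] for row in rows]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        params[key] = (mean, math.sqrt(var) or 1.0)  # guard against zero spread
    return params


def normalize(row, params):
    """Apply the training z-score parameters to one row."""
    out = dict(row)
    for key, (mean, std) in params.items():
        out[key] = (row[key] - mean) / std
    return out


def mixed_euclidean(a, b, numeric_keys=NUMERIC, nominal_keys=NOMINAL):
    """Squared numeric differences plus 0/1 mismatch terms, under a square root."""
    total = sum((a[k] - b[k]) ** 2 for k in numeric_keys)
    total += sum(0.0 if a[k] == b[k] else 1.0 for k in nominal_keys)
    return math.sqrt(total)


def knn_score(train, query, k, label="HELOC"):
    """Confidences are the share of the k nearest neighbors in each class."""
    nearest = sorted(train, key=lambda r: mixed_euclidean(r, query))[:k]
    votes = Counter(r[label] for r in nearest)
    conf_true = votes[True] / len(nearest)
    return {
        "confidence(true)": conf_true,
        "confidence(false)": 1.0 - conf_true,
        "prediction(HELOC)": conf_true > 0.5,  # assumption: ties go to false
    }
```

Under these assumptions, a new customer with confidence (true) of 0.67 at k = 3 simply has two responders among the three nearest past customers, so the confidence values can only take multiples of 1/k.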