What is the dependent variable? What is the independent variable?
As you know, the two main variables in an experiment are the independent variable and the dependent variable. An independent variable is the variable that is changed or controlled in a scientific experiment to test its effect on the dependent variable. A dependent variable is the variable that is observed and measured in the experiment. In regression analysis, the dependent variable is denoted by "Y" and the independent variables are denoted by "X".
Types of dependent and independent variables
In the previous regression model, both the dependent variable and the independent variable are numeric. What can you do when the dependent variable is a categorical (nominal or ordinal) variable? What should you do if you cannot conduct a scientific experiment?
In today’s world of "Big Data", you can look into large datasets and perform "mining" on the data instead of running an experiment. The situation is analogous to coal mining, where different tools are required to extract coal buried deep beneath the ground. The "Decision Tree" is one of the tools of data mining.
Decision Tree Data Mining Technique
A decision tree is a supervised machine learning technique in which we train the machine on existing data whose target (i.e., dependent) variable is known. As the name suggests, the technique has a tree-like structure. The algorithm splits the dataset into subsets based on the most important, or most significant, attribute. That attribute is placed at the root node, which holds the entire dataset and is where the splitting begins. Each node at which a further split is made is known as a decision node. A node that cannot be split any further is termed a leaf node.
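The choice of the "most significant attribute" can be sketched in a few lines of Python. The text above does not say how significance is measured, so this sketch assumes Gini impurity, one common choice. The function names and the toy dataset are hypothetical and exist only for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini_after_split(rows, attribute, target):
    """Average impurity of the subsets created by splitting on one attribute,
    weighted by the size of each subset."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_attribute(rows, attributes, target):
    """Attribute whose split leaves the least impurity (the root-node candidate)."""
    return min(attributes, key=lambda a: weighted_gini_after_split(rows, a, target))

# Hypothetical toy data: "buys" is the target (dependent) variable.
rows = [
    {"weather": "sunny", "weekend": "yes", "buys": "yes"},
    {"weather": "sunny", "weekend": "no",  "buys": "yes"},
    {"weather": "rainy", "weekend": "yes", "buys": "no"},
    {"weather": "rainy", "weekend": "no",  "buys": "no"},
]

root = best_attribute(rows, ["weather", "weekend"], "buys")
print(root)
```

In this toy data, splitting on "weather" yields two pure subsets, while splitting on "weekend" leaves both subsets evenly mixed, so "weather" would be placed at the root node.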
To stop the algorithm from growing an overwhelmingly large tree, a stop criterion is employed. One such criterion is the minimum number of observations a node must contain before a split is attempted. When applying a decision tree to split a dataset, one must also be careful because many nodes may contain noisy data. To deal with outliers or noisy data, we employ a technique known as pruning. Pruning removes branches of the tree that mainly reflect outliers or noise, since such branches make it harder for the model to learn patterns that generalize.
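The minimum-observations stop criterion can be illustrated with a small recursive tree builder. This is a minimal sketch, not a production algorithm: it assumes Gini impurity as the split measure, the parameter name min_samples_split and the toy data are hypothetical, and it only shows stopping early, not the removal of branches after the tree has been grown.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_rows(rows, attribute):
    """Group rows by their value of the given attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups

def build_tree(rows, attributes, target, min_samples_split=2):
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop criteria: too few observations, an already pure node,
    # or no attributes left to split on. The node becomes a leaf.
    if len(rows) < min_samples_split or gini(labels) == 0.0 or not attributes:
        return {"leaf": majority, "n": len(rows)}

    def impurity_after(attribute):
        groups = split_rows(rows, attribute)
        return sum(len(g) / len(rows) * gini([r[target] for r in g])
                   for g in groups.values())

    best = min(attributes, key=impurity_after)
    remaining = [a for a in attributes if a != best]
    children = {
        value: build_tree(subset, remaining, target, min_samples_split)
        for value, subset in split_rows(rows, best).items()
    }
    return {"split_on": best, "children": children}

# Hypothetical toy data; the last row is a deliberately noisy observation.
rows = [
    {"weather": "sunny", "weekend": "yes", "buys": "yes"},
    {"weather": "sunny", "weekend": "no",  "buys": "yes"},
    {"weather": "sunny", "weekend": "no",  "buys": "yes"},
    {"weather": "rainy", "weekend": "yes", "buys": "no"},
    {"weather": "rainy", "weekend": "no",  "buys": "no"},
    {"weather": "rainy", "weekend": "no",  "buys": "yes"},
]

deep = build_tree(rows, ["weather", "weekend"], "buys", min_samples_split=2)
shallow = build_tree(rows, ["weather", "weekend"], "buys", min_samples_split=4)
print(deep)
print(shallow)
```

With min_samples_split=2, the "rainy" subset is split again on "weekend" in an attempt to separate the noisy row. With min_samples_split=4, the three-row "rainy" subset is too small to split, so it becomes a leaf that predicts the majority class.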