Data View

The imported data is displayed as a table. The data can be sorted in ascending or descending order on any column, and any value, especially a categorical value, can be searched for by typing it in the Search Box. It is important to note the total number of records displayed: it not only confirms that the entire data set was imported, but also helps in deciding the size of the Training Data when building Predictive Analytics models.


Outlier Detection and Removal

Outliers are anomalous or extreme values, which can arise from errors in reporting or may be chance occurrences. Outliers should be removed, as they diminish the accuracy of parametric and model-driven techniques such as Regression and Discriminant Analysis, and they also adversely affect the results of Cluster Analysis. A variable can be inspected for outliers by means of a Histogram; in case of an error message for the Histogram, the CHANGE option can be used to regenerate the charts. Outliers can then be visualized through a Pie chart, where TRUE shows the percentage of outliers. If removal of outliers is opted for, the outliers are deleted from the data set. This new data set can be saved and downloaded via the link at the lower left corner, and the new file should be loaded again so that further analyses use the clean data. A variable must always be selected in order to inspect or remove outliers. After removal of outliers, the clean data has to be reloaded, or the tool has to be refreshed.
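The tool's exact outlier rule is not stated above; as one common convention, here is a minimal R sketch of the 1.5 x IQR rule applied to a hypothetical numeric column x:

    # Flag values beyond 1.5 * IQR from the quartiles (a common convention;
    # the tool's internal rule may differ). `x` is a hypothetical numeric column.
    q   <- quantile(x, c(0.25, 0.75))
    iqr <- diff(q)
    out <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
    mean(out)         # fraction of outliers (the TRUE share in the pie chart)
    clean <- x[!out]  # data with outliers removed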

Numerical Techniques:

The following numerical measures are used for continuous data.

Mean is the arithmetic average of all data points. The presence of very high or low values can distort the mean.

Median is the middle value when the data are arranged in increasing or decreasing order. Note that the median is not affected by extreme values. For an even number of data points there is a tie, and the median is the average of the middle two values.

Mode is the most frequent data value, but it is not used as often as the mean and median.

Range is simply the difference between the maximum and minimum values.

Variance is the average of the squared deviations from the mean value.

Std. Deviation is the square root of the variance and, along with the mean, the most widely used statistical measure.

Skewness is the measure of deviation from a perfectly symmetric histogram.

Kurtosis is a measure of the peakedness of a histogram. A higher kurtosis means a narrower histogram, with most of the values close to the mean; hence the variance, range, and std. deviation would all be low for high kurtosis, and vice versa.
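A minimal R sketch of these measures for a hypothetical numeric column x (skewness and kurtosis are taken here from the e1071 package, an assumption about the implementation):

    mean(x); median(x); range(x)   # center and spread
    var(x); sd(x)                  # variance and standard deviation
    library(e1071)                 # assumed source of skewness/kurtosis
    skewness(x); kurtosis(x)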

Summary Statistics



Queries on the data set can be made either through the drop-down menus or through SQL statements for advanced data retrieval. SELECT fields are the columns to be retrieved (multiple selection is allowed), and WHERE fields are used for the conditions to be imposed; VALUE is entered manually. Any advanced query can be made with standard SQL commands by typing in the SQL Statement text box (use Data as the table or view name).

As an example, select Country, Size of Economy, and Region where GDP > 1000, and inspect the results; an equivalent SQL statement is sketched below.
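Assuming the sample data set has columns named exactly Country, Size of Economy, Region, and GDP, the same query typed into the SQL Statement text box would look like:

    SELECT Country, "Size of Economy", Region FROM Data WHERE GDP > 1000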

Graphical Techniques:

Primarily Line, Column, Bar, and Pie charts, Scatter charts, Histograms, and Tables are used.

Categorical Data: The only measure that can be used on categorical data is the count for each category. Accordingly, we can use a Table, Bar (Column), or Pie chart.

Continuous Data: The most widely used techniques are the Histogram (also called a Frequency Distribution) and Line charts (mostly used with data ordered in time).

Depending on whether the selected variable is continuous or categorical, appropriate charts are generated. Use the CHANGE option at the top left to try different visualization techniques.

Combination: We often encounter a mix of Categorical and Continuous data, which in Analytics parlance are also called Dimension and Measure respectively.

When we combine two Dimensions, we get what is known as a Contingency Table. We can combine two continuous variables with a Scatter plot, and we can combine categorical and continuous data by aggregating the continuous values over the categories by means of Bar or Column charts; a sketch of all three combinations follows.
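A minimal R sketch of these combinations, assuming a hypothetical data frame df with categorical columns Region and SizeOfEconomy and numeric columns GDP and Population:

    # Two dimensions -> contingency table
    table(df$Region, df$SizeOfEconomy)
    # Dimension + measure -> aggregate, then a bar chart
    agg <- aggregate(GDP ~ Region, data = df, FUN = mean)
    barplot(agg$GDP, names.arg = agg$Region)
    # Two measures -> scatter plot
    plot(df$GDP, df$Population)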

Line/Column/Bar Chart

Line, Bar, etc. can be used to see the distribution of a Measure across a particular Dimension. The measure can be either a sum or a mean for each category. The CHANGE option at the top left can be used to switch to other appropriate charts or to customize the chart.

Scatter/Bubble Plot

A Scatter Plot is used to find a relationship (correlation) between two numeric variables. It can provide important insight for more detailed analysis such as Dimension Reduction or Predictive Analytics. Bubble Plots are advanced Scatter Plots which use an additional Size variable.

Dimension Reduction

Principal Component Analysis

PCA is a useful technique for reducing the number of predictors (variables) in a model. It is especially useful when groups of predictors are correlated among themselves, which results in multicollinearity; multicollinearity can lead to misleading results. PCA is intended for use with numeric or continuous variables. It provides a few artificial variables, which are weighted linear combinations of the original variables. Often three or fewer PCs retain the predictive power of all the variables. The number of PCs may be determined by looking at the cumulative variance: the PCs accounting for close to 100% of the variance should be retained. Additionally, the length of the bars, or the steep part of the Scree Plot, indicates the number of PCs to be used. By default the correlation matrix is used, and hence normalization of the data is not required.
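A minimal R sketch, assuming a hypothetical all-numeric data frame df (scale. = TRUE makes prcomp work on the correlation matrix, matching the default described above):

    pca <- prcomp(df, scale. = TRUE)   # correlation-based PCA
    summary(pca)                       # cumulative proportion of variance per PC
    screeplot(pca, type = "lines")     # the elbow suggests how many PCs to keep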


Analysis of Variance

ANOVA should be used to assess whether a Factor (a qualitative or categorical variable) has any significant impact on a numeric (continuous) Outcome. The variables not impacting the Outcome may be dropped from Predictive Models. As an example, it can be used to reduce the number of variables in Multiple Regression models involving categorical predictors.
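A minimal R sketch, assuming a hypothetical data frame df with a numeric outcome GDP and a factor Region:

    fit <- aov(GDP ~ Region, data = df)
    summary(fit)    # a small p-value suggests Region has an impact on GDP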



Predictive Models

The model is trained based on the size of the Training Data, input as a number of records; the remaining records are taken as Validation Data. The accuracy measures and Confusion Matrix are shown for the Validation Data. As an example, if 100 records are input as the Training Size from a total of 150 records, 100 are used to train the model and 50 are used to validate it.

K Nearest Neighbor

KNN is the simplest, yet a very effective, Predictive Analytics technique. It is a very flexible technique which works satisfactorily even with outliers, missing values, etc. in the data. However, it requires all the predictors to be numeric. K is the number of nearest neighbors used for majority voting. The plot below shows the accuracy for different K values; the user should pick an appropriate K from this plot for new data prediction.
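A minimal R sketch using the class package, assuming a hypothetical data frame df whose last column is the categorical response, with a training size of 100 records:

    library(class)
    set.seed(1)
    tr <- sample(nrow(df), 100)        # training rows (user-chosen size)
    X  <- scale(df[, -ncol(df)])       # KNN needs numeric predictors
    y  <- df[, ncol(df)]
    for (k in 1:10) {                  # accuracy for different K values
      pred <- knn(X[tr, ], X[-tr, ], y[tr], k = k)
      cat("k =", k, " accuracy =", mean(pred == y[-tr]), "\n")
    }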

Regression Model

Regression models are the most widely used Predictive Analytics technique, used both as exploratory and as predictive models. When the Response is numeric, Multiple Regression is used, and when the Response is qualitative, Logistic Regression is used; the Linear or Logistic Regression model is selected automatically based on the nature of the Response. There are multiple ways to assess a Regression model; the details are discussed below.
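A minimal R sketch of both cases, assuming a hypothetical data frame df with response column y:

    fit <- lm(y ~ ., data = df)   # numeric response: multiple regression
    summary(fit)                  # coefficients, Pr(>|t|), R-squared, F statistic
    # categorical (binary) response: logistic regression
    # fit <- glm(y ~ ., data = df, family = binomial); summary(fit)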



The Coefficients Table is the most important result to look at; it has the estimated coefficients, the error in the coefficient estimates, and Pr(>|t|). A Pr(>|t|) less than 0.05 shows a statistically significant relationship between that predictor and the response. The magnitude and sign of a coefficient show the impact of the predictor on the response, but they have to be considered along with the Pr(>|t|) values. The error in a coefficient should be a small percentage of the coefficient value itself.

Residuals are the differences between the actual and predicted values (by the model). Their mean should ideally be close to zero, and the max and min values should be low as well as symmetric (same absolute values); this corresponds to a bell-curve-like distribution of residuals with a mean of zero. A plot of the residuals is generated, which should show not only low values but also a totally random pattern.

Adjusted R-squared indicates the percentage of variation in the Response explained by the model. A higher value, usually above 70%, is a sign of a good model, but the context is also important. A lower value indicates the need to include more predictors, or a different set of them.

The F statistic is a global test of significance and shows the merit in building a model over predicting a constant value (the average) for the Response. Once again, a p-value less than 0.05 indicates that there are statistically significant relationships between the predictors and the Response.

Cross-validation is a technique by which Training and Validation samples are built automatically from the same data. It is one of the techniques under a broader approach called Ensemble Models. With this approach, the usual process of using separate Training and Validation samples is not needed. A big difference between the predicted and CV-predicted values indicates an over-fitted model. In the case of Logistic Regression, cross-validation is used to generate the Confusion Matrix, and the internal estimate and the CV estimate of accuracy should be close.

Null and Residual Deviance are applicable to Logistic Regression. The null deviance shows how well the Response is predicted by just an intercept, while the residual deviance shows how well the Response is predicted by the model when the predictors are included. Ideally one should see a sharp decrease from the null deviance to the residual deviance; if the residual deviance is higher than the null deviance, it indicates a significant lack of fit.

The Confusion Matrix is used in place of residuals when the Response is categorical. It shows the number of predicted vs. actual classes, and higher diagonal values (FALSE-FALSE and TRUE-TRUE) indicate high accuracy. Accuracy, Precision, Recall, etc. can be calculated from this matrix; a sketch follows.
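A minimal R sketch of those calculations, assuming hypothetical vectors actual and predicted with the positive class coded TRUE:

    cm <- table(Actual = actual, Predicted = predicted)
    accuracy  <- sum(diag(cm)) / sum(cm)
    precision <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])  # of predicted TRUE, how many correct
    recall    <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])  # of actual TRUE, how many found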


Decision Tree

Decision Trees are preferred over Regression models when the majority of predictors are categorical, and also when a non-linear relationship is expected between the response and the predictors. A big advantage of Decision Trees over Regression models is that variable selection is autonomous, so the user need not intervene to pick strong predictors. Decision Trees are also more tolerant of missing data, outliers, etc. To summarize, Decision Trees are more flexible and autonomous techniques; however, they lack the stability and robustness of Regression models and usually need big data sets. Often Decision Trees are used as a complementary technique to filter out important predictors. In this tool, a popular Decision Tree technique called CART is used, which automatically detects the type of the Response; hence both categorical and continuous Responses can be analysed, and no explicit user selection is needed. The usual practice of pruning the tree is also not needed in this case.
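A minimal R sketch using the rpart package (a standard CART implementation, assumed here), for a hypothetical data frame df with response y:

    library(rpart)
    fit <- rpart(y ~ ., data = df)   # picks classification or regression from y's type
    plot(fit); text(fit)             # draw the fitted tree
    pred <- predict(fit, df)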


Confusion Matrix (Actual vs. Predicted)

Naive Bayes

Naive Bayes is a preferred technique when there are several categorical predictors. It is very effective as a classifier where the correct classification is the objective, rather than the exact probability of membership in a class. It works best with large data sets, and it needs the Outcome (Y) variable to be categorical; the X variables can be numeric, as they are binned internally.
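A minimal R sketch using the e1071 package (an assumption; the tool's internal implementation is not stated), for a hypothetical data frame df with categorical response y:

    library(e1071)
    fit  <- naiveBayes(y ~ ., data = df)
    pred <- predict(fit, df)
    table(Actual = df$y, Predicted = pred)   # confusion matrix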

Confusion Matrix

Support Vector Machine

A Support Vector Machine works by finding a separation boundary between different classes; in this regard it is very similar to Linear Discriminant Analysis. It is a very popular technique, along with Logistic Regression and Discriminant Analysis, for binary classification. However, it is much more sophisticated and scores above the other techniques in its ability to find highly non-linear separation boundaries in a multi-dimensional space: it makes use of kernel functions to map non-linear boundaries to a linear boundary. It can be used for continuous as well as categorical Responses. The tabular output displays the prediction along with the actual Outcome for all the records; apart from that, the Confusion Matrix can also be used to assess the accuracy for categorical Responses.
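A minimal R sketch using the svm function from the e1071 package (an assumption), for a hypothetical data frame df with factor response y:

    library(e1071)
    fit  <- svm(y ~ ., data = df)            # radial kernel by default
    pred <- predict(fit, df)
    table(Actual = df$y, Predicted = pred)   # confusion matrix for a categorical response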

Confusion Matrix (Actual vs. Predicted)

Neural Network

A Neural Network is a very accurate and sophisticated technique; however, it can be very hard to train. The data needs some preprocessing and cleaning up before it can be used for training a neural net. One of the most important preprocessing steps is normalization: without normalization, the results could be totally wrong. Although there are several techniques for this, Min-Max scaling is quite commonly used. It is good practice to normalize the data outside the tool, in Excel or a database, for better control. Please use the normalized data, provided as a sample, for Neural Network models.
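A minimal R sketch of Min-Max normalization for a hypothetical all-numeric data frame df:

    minmax  <- function(x) (x - min(x)) / (max(x) - min(x))   # rescale to [0, 1]
    df_norm <- as.data.frame(lapply(df, minmax))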

Cluster Analysis

Cluster Analysis is used to construct groups of similar records based on multiple variables; these variables are numeric measures available for all the records. There are two broad classes of Cluster Analysis techniques, hierarchical and non-hierarchical, and K-means is the most widely used algorithm in the non-hierarchical class.

Hierarchical Cluster Analysis treats each record as one cluster to begin with, and then progressively merges similar records into one cluster. The results are often represented in the form of an inverted tree, called a Dendrogram, which shows the progress of the clustering. Hierarchical Cluster Analysis is not suited for large data sets and can be slow; it is also sensitive to outliers and less stable. In order to save computational time, a sample should be taken from the data set; for moderately sized data, the total number of records can be used in the Sample Size input box.
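A minimal R sketch, assuming a hypothetical all-numeric data frame df:

    d  <- dist(scale(df))   # pairwise distances on standardized data
    hc <- hclust(d)
    plot(hc)                # dendrogram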

The K-Means algorithm groups the records into a predetermined number of clusters. It starts with an initial grouping, which is then refined successively over multiple steps. K-Means is more popular due to its faster computation and easy interpretation of results, but deciding on an appropriate number of clusters may need iterations. K-Means is preferred over hierarchical clustering for large data sets.
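A minimal R sketch, again for a hypothetical all-numeric data frame df and three clusters:

    km <- kmeans(scale(df), centers = 3, nstart = 10)
    km$cluster    # cluster membership for each record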

Prescriptive Analytics

The user needs to provide the Linear Programming formulation in a csv file. The csv file should contain the Objective Function, in terms of all the decision variables, in the top row, and the constraints in subsequent rows. Maximize or Minimize should be selected, depending on whether one wants to maximize or minimize the objective. Getting familiar with the input data format is recommended; one can download the sample data for Prescriptive Analytics through the link provided.
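A minimal R sketch of the same kind of problem using the lpSolve package (an assumption about the solver), maximizing 3x + 5y subject to x + 2y <= 10 and 3x + y <= 15:

    library(lpSolve)
    res <- lp("max", c(3, 5),                            # objective coefficients
              matrix(c(1, 2,
                       3, 1), nrow = 2, byrow = TRUE),   # constraint coefficients
              c("<=", "<="), c(10, 15))                  # directions and right-hand sides
    res$solution; res$objval                             # optimal x, y and objective value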



The Transportation Problem is a specific type of Linear Programming problem. LP problems are solved by a technique called the Simplex Method, and Transportation Problems are solved by a simple variant of it called the Transportation Method. It is used in cases where certain products can be supplied from multiple sources to multiple destinations, with a certain cost associated with each combination of source and destination. The objective is to minimize the total transportation cost. The transported object could be material or people.
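A minimal R sketch using lp.transport from the lpSolve package (an assumption), for two hypothetical sources with supplies 50 and 60 and three destinations with demands 30, 40, and 30:

    library(lpSolve)
    costs <- matrix(c(4, 6, 9,
                      5, 3, 8), nrow = 2, byrow = TRUE)   # cost per source-destination pair
    res <- lp.transport(costs, "min",
                        row.signs = rep("<=", 2), row.rhs = c(50, 60),     # supplies
                        col.signs = rep(">=", 3), col.rhs = c(30, 40, 30)) # demands
    res$solution    # optimal shipment quantities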

The Assignment Problem is also a special case of Linear Programming; in fact, it is a special case of the Transportation Problem. It is applicable where multiple persons or machines are available for multiple jobs, with each person/job pair having a different cost or time. The objective is to minimize the total cost or time of completing all the jobs. An important characteristic of the Assignment Problem is that the number of persons or machines (similar to sources) equals the number of jobs (similar to destinations); if they are unequal, a dummy job or dummy person is added with zero cost/time to make it a balanced problem. The most widely used technique for the Assignment Problem is the Hungarian Method, named for its basis in the work of the Hungarian mathematician D. Kőnig.
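A minimal R sketch using lp.assign from the lpSolve package (an assumption), for a hypothetical 3 x 3 cost matrix:

    library(lpSolve)
    costs <- matrix(c(8, 4, 7,
                      5, 2, 3,
                      9, 6, 7), nrow = 3, byrow = TRUE)   # person x job costs
    res <- lp.assign(costs)
    res$solution    # 0/1 matrix of optimal assignments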

The Shortest Path algorithm provides the shortest distance between two nodes, along with the shortest route. A distance matrix needs to be provided, which has the actual distances between any two nodes; the diagonal cells must be zero. If bi-directional movement between two nodes is permitted, the matrix is symmetric, so the distances (1,2) and (2,1) are the same. By default the path 1-2 is selected, but other inputs can be given by changing the nodes in the From and To input boxes. The shortest path matrix can be downloaded via the Download link in the left user panel.
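One way to reproduce this outside the tool is with allShortestPaths (Floyd's algorithm) from the e1071 package (an assumption), for a hypothetical symmetric distance matrix m with zeros on the diagonal:

    library(e1071)
    sp <- allShortestPaths(m)   # all-pairs shortest paths
    extractPath(sp, 1, 2)       # node sequence of the shortest 1-2 route
    sp$length[1, 2]             # its total distance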


New Data Prediction

The analysis and assessment done in the Predictive Model building step can be used to predict new data. Please note that the Response and Predictor selection is based on the Training data set. The model fitted on the Training Data can be used on a new data set, which has to be read separately; the new data set should have all the columns present in the Training data.