How can we treat NULL values?

Let’s call NULL to a position of a variable that should have a value but is empty. For example, this variable has a NULL in row 5.

Null treatment

Hpwr variable

 

Nulls can be of two types: missing values ​​and empty values. It is very important to understand the difference between the two.

A missing value is a value that makes sense there. The above example shows a variable called Hpwr (power) and belonging to a table of data about vehicles. Other variables in the same table are MiGal (miles per gallon), Cyl (number of cylinders),WtLbs (weight in pounds), Acc0-60 (acceleration from 0 to 60 mph, etc.). In this table, the vehicle shown in row 4 lacks the power value. There should be a value, but it is not. This is a missing value kind of NULL.

Instead, an empty value is a value that does not make sense there. For example, consider a variable that indicates the account balance of a client. If the client does not have an account, what should have that variable value for that customer? It can’t be zero, because that would mean that the customer decided to have an account and its balance is $ 0, and this is not true. We will see a little later how to deal with this case.

Now that we are clear about the two types of NULLS, we’ll see what to do in each case.

The simplest case is that of missing values. For some reason the value is not present, although it should be. What to do? The simple answer is to ignore the entire row.There are times when missing values ​​are scattered randomly across the data table, but hopefully not too many. That is, if we took the decision to ignore the rows that have at least one missing value, we are still left with a significant percentage of total rows (for example, only 10% of the rows would be ignored so we would have 90% to work).This is the best solution, but not always possible.

Sometimes, while individual variables have only a small percentage of missing values, the fact of considering all variables together makes the ignored number of rows too large to run out of data to analyze. In this case the only answer is to fill the empty values. The problem is choosing the right value. With any value that we use we will be introducing some bias. In the article on data preparation I talked about structure that retains a row. If we replace the missing value by a value that is not appropriate, we will be modifying this structure. Complete with the average value of the variable is not a good idea, for the reason stated above.

Dorian Pyle, in his book Data Preparation for Data Mining makes an exhaustive treatment about NULL replacement. Powerhouse is based on one of these methods for the automatic replacement of missing values that has to do with interfering as little as possible the structure between variables.

Before turning to the treatment of missing values ​​is worth keeping in mind a case that appears very often. Often, the variables that have only two values, such as YES or NO, are completed with a single value (eg YES) and the other value is empty (meaning really that should be NO). That is, if we go by the number of NULL could rule out this variable reported to contain just too NULLS, but further analysis indicated that it only needed replace missing values ​​by NO. Powerhouse can do this automatically through the NRV option during the data load.

Let’s see how to handle NULLS when they are clearly empty values, I mean, they are not completed because it makes sense not to put a value.

In Powerhouse exists one type of variable called Mixed. It may contain numeric and categorical at once. In the example of the account balance, if customers have a checking account, the value is the balance, otherwise the value is an label such as “No account”.

Account variable

Mixed variables

Powerhouse treats this variable as numbers for account customers and as categories for the rest of the customers.

In short, if the number of missing values ​​is not too large, the ideal is to ignore the rows that contain missing values, if not, then replace them so as to interfere as little as possible with the internal structure of the row data.

If NULL values represent empty values, use a string to replace any empty cell so that Powerhouse can automatically treat this variable as Mixed.

What is noisy data?

Information Theory has a precise definition of noise, but in this article I will address this issue in a more informal way. 

In any communication, noise is something to be avoided, as it messes the message being transmitted.

When referring to a prediction model, we assume that the independent variables transmit information about the dependent variable. In other words, there are variables that transmit information that will help us to predict the variable of interest.

However, models do not always achieve a high degree of accuracy in their predictions for various reasons. For example, it could be that the relationships between independent and dependent variables are not linear and the tool used to build the model considers only linear relationships.

Another reason I often hear is about the noise in the data. It is said that the data contains noise, and this interferes with the creation of a good model. But, what is noise?

We may define noise as a random signal that overlaps the original signal and confuse the receiver. Let’s see an example seeing first a signal without noise and then with noise overlaid.

Suppose that we have the following relationship between two variables

y = x2 – x + 2

A data table containing this relationship will have two variables X and Y… Read more »

Information in the data II

So the next step was to build models with all available variables. But with Powerhouse you cannot build a model without having selected the best variables, so you must first make it clear that you want to include all variables in the model. So, what it is hoped is a list of variables, sorted by the provided information. And here’s where the surprise came not all the variables are needed, because the best 11 variables have 100% of the needed information to predict the class!

Selection Infometric data

Selection Infometric data

In the information gain column can be seen that the last 3 variables contribute less than 1% of information. Moreover, you can also check that the variable selection algorithm is very robust with respect to noise, since variables which provide only noise (x21 to x41, as clarified at the site where the data was collected) were sorted at the end. Remember that we forced the selection algorithm to choose all the variables. Powerhouse stops selecting variables because it is not necessary. These 11 variables contain all the necessary information.
Read more »

Information in the data I

A few weeks ago I found a thesis in which it showed how a neural network (NN) solved a problem of classification. The dataset used has 41 variables plus the dependent variable. It tries to classify the type of wave (three classes) based on the attributes given by the variables. While all variables contain noise, the last 20 does not contain information about the wave type (they are just noise).

The thesis shows how a NN model has achieved an approximately 83% of correct classification.

Powerhouse wanted to see if he could achieve similar results, and this article will tell you step by step what was achieved, and the surprises I had.

I started as I always do, making a default model. To be more precise, I used only the variables selected and then the model OPFIT (a kind of logistic regression applied to data prepared by Powerhouse).

The next Powerhouse screen displays the information that was available to solve the problem

Variables Selection

Infometrics of the Selected Variables

To keep things simple, just use 2 columns of this display in all the analysis in this article, which shows the selected variables and shows the information gain.
Read more »

The importance of preparing the data II (categorical variables)

Prepare the data before building a model with any tool is essential for best results.

Categorical variables

Another problem commonly found in data mining projects is to assign numerical values to categorical variables. The normal practice is to assign numerical values randomly, without following any particular method.

The problem with this method is that assign arbitrary values to each category could destroy the internal structure of the data. To clarify this we suppose that a row of a table of data. In general, the values that make the variables maintain a certain relationship, for example, height and weight of a person. This relationship between variables is that we keep when we assign numerical values to the categories. If we do, we might be changing internal relationships so that the modeling tool have to deal with more complex relationships. Read more »

The importance of preparing the data

Prepare the data before building a model with any tool is essential for best results.

Here’s why.

Numeric variables

Suppose you are working with a variable whose distribution has a long tail. An example would be the salary of a person, such as seen in the chart below



This variable has a median of $ 1,200 and an average of about $ 1,700, but contains much higher salaries, reaching a maximum of $ 130,000.

When this variable is used without any pretreatment, except to express it in a new scale, such as 0 to 1 or -1 to +1, as it would require a neural network, we would be wasting much of the information it may contain. Why? The reason is simple, the algorithms that work with the different values ​​of the variable will try to find intervals that can discriminate variable values ​​of the dependent variable. But these intervals are always relative to the total range of the variable. Read more »

Data mining in a different way with Powerhouse

Powerhouse is a data mining tool based on a completely different approach than normally used to analyze data.

When trying to make a prediction model is generally assumed that the data contains sufficient information to make a good model.

Powerhouse doesn’t assume information exists, measures it.

Claude Shannon elaborated on his Information Theory how can information be measured in bits of information using a concept called entropy.

Powerhouse is built upon the Information Theory and manages to improve the practice of data mining by the following advantages, granted by this approach:

  • The data are automatically prepared to expose the best information they contain.
  • Variable selection is a straightforward process, which ensures quick and optimum group of variables with noise-free information.
  • Create simple predictive models to understand and yields similar to neural networks.
  • Discover segmentation models that expose natural clusters in data.

Powerhouse’s website contains detailed information about its features and benefits.