Let’s call NULL to a position of a variable that should have a value but is empty. For example, this variable has a NULL in row 5.
Nulls can be of two types: missing values and empty values. It is very important to understand the difference between the two.
A missing value is a value that makes sense there. The above example shows a variable called Hpwr (power) and belonging to a table of data about vehicles. Other variables in the same table are MiGal (miles per gallon), Cyl (number of cylinders),WtLbs (weight in pounds), Acc0-60 (acceleration from 0 to 60 mph, etc.). In this table, the vehicle shown in row 4 lacks the power value. There should be a value, but it is not. This is a missing value kind of NULL.
Instead, an empty value is a value that does not make sense there. For example, consider a variable that indicates the account balance of a client. If the client does not have an account, what should have that variable value for that customer? It can’t be zero, because that would mean that the customer decided to have an account and its balance is $ 0, and this is not true. We will see a little later how to deal with this case.
Now that we are clear about the two types of NULLS, we’ll see what to do in each case.
The simplest case is that of missing values. For some reason the value is not present, although it should be. What to do? The simple answer is to ignore the entire row.There are times when missing values are scattered randomly across the data table, but hopefully not too many. That is, if we took the decision to ignore the rows that have at least one missing value, we are still left with a significant percentage of total rows (for example, only 10% of the rows would be ignored so we would have 90% to work).This is the best solution, but not always possible.
Sometimes, while individual variables have only a small percentage of missing values, the fact of considering all variables together makes the ignored number of rows too large to run out of data to analyze. In this case the only answer is to fill the empty values. The problem is choosing the right value. With any value that we use we will be introducing some bias. In the article on data preparation I talked about structure that retains a row. If we replace the missing value by a value that is not appropriate, we will be modifying this structure. Complete with the average value of the variable is not a good idea, for the reason stated above.
Dorian Pyle, in his book Data Preparation for Data Mining makes an exhaustive treatment about NULL replacement. Powerhouse is based on one of these methods for the automatic replacement of missing values that has to do with interfering as little as possible the structure between variables.
Before turning to the treatment of missing values is worth keeping in mind a case that appears very often. Often, the variables that have only two values, such as YES or NO, are completed with a single value (eg YES) and the other value is empty (meaning really that should be NO). That is, if we go by the number of NULL could rule out this variable reported to contain just too NULLS, but further analysis indicated that it only needed replace missing values by NO. Powerhouse can do this automatically through the NRV option during the data load.
Let’s see how to handle NULLS when they are clearly empty values, I mean, they are not completed because it makes sense not to put a value.
In Powerhouse exists one type of variable called Mixed. It may contain numeric and categorical at once. In the example of the account balance, if customers have a checking account, the value is the balance, otherwise the value is an label such as “No account”.
Powerhouse treats this variable as numbers for account customers and as categories for the rest of the customers.
In short, if the number of missing values is not too large, the ideal is to ignore the rows that contain missing values, if not, then replace them so as to interfere as little as possible with the internal structure of the row data.
If NULL values represent empty values, use a string to replace any empty cell so that Powerhouse can automatically treat this variable as Mixed.




