Recode Continuous Variable Into Categorical Sas

Question

The way I understood your question was that you were wondering if there was a way for you to take your categorical variable where each category represents a range of continuous values, and assign a value for each observation that falls within the range represented by the category.

Example:

          Observation     Categorical (have)    "Recoded" continuous value (want) 1               "1: 15 to 19"             15 2               "1: 15 to 19"             18     3               "2: 20 to 24"             20 4               "2: 20 to 24"             22 5               "3: 25 to 29"             26 6               "3: 25 to 29"             26 7               "3: 25 to 29"             25 ...

If this is the case, you probably should not do it. Categorical variables, by nature, mask the observed value of the constructs you are trying to measure. If the categories represent a range of values, you know that the observed value is somewhere within that range, but you don't know if it's closer to the lower end, the higher end, or the middle of the range. Of course it is possible to do these things mathematically, for example, assign a "random" value or impute a value, but what would the mean and SD for these values mean in the "real world" sense? How useful is it to look at the mean and SD of values that were artificially constructed if we do not know the "true" underlying distribution of the values?

You may be tempted to randomly assign a value within each category based on the range of that category, but without knowing what the "true" distribution is within each range (category), it's hard to justify assigning a random value based on that category's limits. There are many continuous variables that do not follow a normal distribution (which is why we categorize them before entering them into models). Income, for example, is generally positively skewed:

IFLS4 HH expenditure among women who had a birth in the 5 years preceding interview

Supposing the survey instrument collected income based using pre-determined categories of income in increments of 250,000 Rupiahs, the frequency distribution of the same data shown above may look something like:

                      Categorical HH |   expenditure per adult |              equivalent |      Freq.     Percent        Cum. ------------------------+-----------------------------------            0 to 250,000 |        350        6.98        6.98   250,000.01 to 500,000 |      1,692       33.75       40.73   500,000.01 to 750,000 |      1,245       24.83       65.56 750,000.01 to 1,000,000 |        723       14.42       79.98              >1,000,000 |      1,004       20.02      100.00 ------------------------+-----------------------------------                   Total |      5,014      100.00

The frequencies do not tell us anything about how income is distributed within each category. Furthermore, the categorisation does not tell us what the upper limit is for the last category (in fact, it's just under 10,000,000). We can overlay vertical lines at 250,000, ..., 1,000,000 to see what the actual distributions look like with each of the 5 income ranges:

Histogram with overlays

We can even produce histograms for each category:

First

Histogram first category

Second enter image description here

Third enter image description here

Fourth enter image description here

Fifth enter image description here

For the first and last categories, there are clear trends in distribution as income increases. You can see how it would be a bad idea to "convert" the first category into a continuous variable by randomly assigning value between 0 and 250,000 because it clearly shows that people were more likely to be on the higher end of the range than the lower end. Similarly for the last category, people were more likely to have monthly income closer to 1,000,000 instead of the higher end of the range, so we can't just assign a random number for those people either. The distributions look more "random" for the middle 3 categories (the bars are closer in height to one another, and no clear increasing/decreasing/bimodal trend).

Of course this is an extreme example meant to show the dangers of "converting" categorical variables into continuous without knowing the underlying true distribution of the values. Your ranges may well be smaller than what I have shown in this example, but in the end, the question remains on what the means and SDs of the constructed (imputed?) values truly mean.

If what you mean is something completely different (the categories don't represent numbers but subjective attributes, e.g. 1=Blue, 2=Yellow, 3=Green), then my responses above would not apply. If this is the case, you can of course take the mean, SD, etc., but as other posters have mentioned, the question remains on how the mean/SD/etc. would be interpreted and whether it is useful.

Curtis Witen1995

Recode Continuous Variable Into Categorical Sas

Post a Comment for "Recode Continuous Variable Into Categorical Sas"