
I'm working with the KDD Cup 2010 data (https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp). In R, how can I remove rows whose factor level has a low total number of instances?
I've tried the following. First, I create a table for the student ID factor:
studenttable <- table(data$Anon.Student.Id)
which returns a table:
l5eh0S53tB Qwq8d0du28 tyU2s0MBzm dvG32rxRzQ i8f2gg51r5 XL0eQIoG72
9890 7989 7665 7242 6928 6651
Then I can get a logical table that tells me whether there are more than 1000 data points for each factor level:
biginstances <- studenttable>1000
Then I tried subsetting the data with this result:
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
But I get weird subsets that still have the same number of factor levels as the full set. I simply want to remove the rows whose factor level isn't well represented in the dataset.
Answer1:
There are probably more efficient ways to do this but this should get you what you want. I didn't use the names you used but you should be able to follow the logic just fine (hopefully!)
# Create some fake data
dat <- data.frame(id = rep(letters[1:5], 1:5), y = rnorm(15))
# tabulate the id variable
tab <- table(dat$id)
# Get the names of the ids that we care about.
# In this case the ids that occur >= 3 times
idx <- names(tab)[tab >= 3]
# Only look at the data that we care about
dat[dat$id %in% idx,]
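As one possible alternative that skips the intermediate table, base R's ave() can attach each row's group size directly, and the rows can then be filtered on that. This is only a sketch on the same kind of fake data; the names n_per_row and kept are made up for illustration.

```r
# Same fake data as above (seed added so the run is repeatable)
set.seed(42)
dat <- data.frame(id = rep(letters[1:5], 1:5), y = rnorm(15))

# For every row, compute how many rows share its id
n_per_row <- ave(seq_along(dat$id), dat$id, FUN = length)

# Keep only the rows whose id occurs at least 3 times
kept <- dat[n_per_row >= 3, ]
table(kept$id)
```

Whether this is faster than the table()/%in% version on the full KDD data is not something I have measured.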
Answer2:
@Dason gave you some good code to work with as a starting point. I'm going to try to explain why (I think) what you tried didn't work.
biginstances <- studenttable>1000
This will create a logical vector whose length is equal to the number of unique student IDs. studenttable contained a count for each unique value of data$Anon.Student.Id. When you try to use that logical vector in subset:
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
its length is almost surely much less than the number of rows in data. And since the subsetting criterion in subset is meant to identify rows of data, R's recycling rules take over and you get 'weird' looking subsets.
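One related detail that may matter here: when a vector is indexed by a factor, R uses the factor's integer codes rather than its labels, so a lookup into a named table is safer when done by character name. The following is a minimal sketch on toy data; the three student IDs and the threshold of 3 (instead of 1000) are assumptions chosen to keep the example small.

```r
# Toy stand-in for the real data; only the column name matches the question
set.seed(1)
data <- data.frame(
  Anon.Student.Id = factor(rep(c("s1", "s2", "s3"), times = c(5, 2, 4))),
  y = rnorm(11)
)

# Sorted counts, so table order no longer matches the factor's level order
studenttable <- sort(table(data$Anon.Student.Id), decreasing = TRUE)
biginstances <- studenttable > 3  # toy threshold instead of 1000

# Look up each row's flag by name (character), not by integer factor code
keep <- as.vector(biginstances[as.character(data$Anon.Student.Id)])

# One flag per row, so no recycling; droplevels() removes the unused level
bigdata <- droplevels(data[keep, ])
table(bigdata$Anon.Student.Id)
```

Here keep has exactly one element per row of data, which is what a row-subsetting condition needs.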
I would also add that taking subsets to remove rare factor levels will not change the levels attribute of the factor. In other words, you'll get a factor back with no instances of that level, but all of the original factor levels will remain in the levels attribute. For example:
> fac <- factor(rep(letters[1:3],each = 3))
> fac
[1] a a a b b b c c c
Levels: a b c
> fac[-(1:3)]
[1] b b b c c c
Levels: a b c
> droplevels(fac[-(1:3)])
[1] b b b c c c
Levels: b c
So you'll want to use droplevels if you want to ensure that those levels are really 'gone'. Also, see options(stringsAsFactors = FALSE).
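To illustrate the stringsAsFactors point, here is a small sketch that sets it per call rather than through the global option (in R 4.0.0 and later, FALSE is already the default for data.frame). The data are made up; with a character column there is no levels attribute to carry stale values.

```r
# Keep the id column as plain character instead of a factor
dat <- data.frame(id = rep(letters[1:3], each = 3), stringsAsFactors = FALSE)

# Drop all rows for "a"; drop = FALSE keeps the result a data frame
kept <- dat[dat$id != "a", , drop = FALSE]

is.factor(kept$id)  # FALSE: nothing left over to drop
unique(kept$id)
```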
Answer3:
Another approach involves a join between your dataset and the table of counts. I'll use plyr here, but it can also be done with base functions (like merge and as.data.frame.table).
require(plyr)
set.seed(123)
Data <- data.frame(var1 = sample(LETTERS[1:5], size = 100, replace = TRUE),
var2 = 1:100)
R> table(Data$var1)
A B C D E
19 20 21 22 18
## count the rows in each category
mytable <- count(Data, vars = "var1")
## mytable <- as.data.frame(table(Data$var1))
R> str(mytable)
'data.frame': 5 obs. of 2 variables:
$ var1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
$ freq: int 19 20 21 22 18
Data <- join(Data, mytable)
## Data <- merge(Data, mytable)
R> str(Data)
'data.frame': 100 obs. of 3 variables:
$ var1: Factor w/ 5 levels "A","B","C","D",..: 3 2 3 5 3 5 5 4 3 1 ...
$ var2: int 1 2 3 4 5 6 7 8 9 10 ...
$ freq: int 21 20 21 18 21 18 18 22 21 19 ...
mysubset <- droplevels(subset(Data, freq > 20))
R> table(mysubset$var1)
C D
21 22
Hope this helps.
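For anyone who would rather not load plyr, the base-function route mentioned above might look roughly like this. It is a sketch, not a drop-in replacement: merge() may reorder the rows, and the exact counts depend on the R version's sample() behaviour, so no output is shown. The name merged is made up for illustration.

```r
set.seed(123)
Data <- data.frame(var1 = factor(sample(LETTERS[1:5], size = 100, replace = TRUE)),
                   var2 = 1:100)

# Counts per category as a data frame; rename the key column so it matches Data
mytable <- as.data.frame(table(Data$var1), responseName = "freq")
names(mytable)[1] <- "var1"

# Attach each row's category count, then keep the well-represented categories
merged <- merge(Data, mytable, by = "var1")
mysubset <- droplevels(subset(merged, freq > 20))
table(mysubset$var1)
```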
Answer4:
This is how I managed to do it. I sorted the table of factor levels and their associated counts.
studenttable <- sort(studenttable, decreasing=TRUE)
Now that it's in order, we can use index ranges sensibly. So I got the number of factor levels that are represented more than 1000 times in the data.
sum(studenttable>1000)
230
sum(studenttable<1000)
344
344+230=574
Now we know the first 230 factor levels are the ones we care about. So, we can do:
idx <- names(studenttable[1:230])
bigdata <- data[data$Anon.Student.Id %in% idx,]
We can verify it worked by doing
bigstudenttable <- table(bigdata$Anon.Student.Id)
to get a printout and see that all the factor levels with fewer than 1000 instances now have a count of 0.
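The hard-coded 230 can probably be avoided by selecting the names straight from the comparison, which also handles a level with exactly 1000 instances consistently (the two sums above skip that case). A minimal sketch on toy data follows; the four student IDs and the threshold of 3 are assumptions standing in for the real data and for 1000.

```r
# Toy stand-in for the real data
set.seed(7)
data <- data.frame(
  Anon.Student.Id = factor(rep(c("s1", "s2", "s3", "s4"), times = c(6, 1, 4, 2))),
  y = rnorm(13)
)
threshold <- 3  # would be 1000 for the real data

studenttable <- sort(table(data$Anon.Student.Id), decreasing = TRUE)

# Names of the levels above the threshold, without counting them by hand
idx <- names(studenttable)[studenttable > threshold]

# Filter the rows and drop the now-empty levels in one go
bigdata <- droplevels(data[data$Anon.Student.Id %in% idx, ])
bigstudenttable <- table(bigdata$Anon.Student.Id)
bigstudenttable
```

With droplevels applied, the sparse levels disappear from the table instead of showing up with a count of 0.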