Data Set Objects

Description

"data.set" objects are collections of "item" objects, with semantics similar to those of data frames. They are kept distinct from data frames so that coercion by as.data.frame yields a data frame that contains only vectors and factors. Nevertheless, most methods for data frames are inherited by data sets; the exception is the method for the generic function within. For the within method for data sets, see the Details section.

Thus, data preparation using data sets retains all information about item annotations, labels, missing values, etc., while the (mostly automatic) conversion of data sets into data frames makes the data amenable to R’s statistical functions.

dsView is a function that displays data sets in much the same way as View displays data frames. (View also works with data sets, but converts them into data frames first.)

Usage

data.set(...,row.names = NULL, check.rows = FALSE, check.names = TRUE,
    stringsAsFactors = default.stringsAsFactors(),
                 document = NULL)
as.data.set(x, row.names=NULL, ...)
## S4 method for signature 'list'
as.data.set(x,row.names=NULL,...)
is.data.set(x)
## S4 method for signature 'data.set'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
## S4 method for signature 'data.set'
within(data, expr, ...)

dsView(x)

Arguments

...

For the data.set function, several vectors or items; for the within method, further arguments, which are ignored.

row.names, check.rows, check.names, stringsAsFactors, optional

arguments as in data.frame or as.data.frame, respectively.

document

NULL or an optional character vector that contains documentation of the data.

x

for is.data.set(x), any object; for as.data.frame(x,...) and dsView(x) a “data.set” object.

data

a data set, that is, an object of class “data.set”.

expr

an expression, or several expressions enclosed in curly braces.

Value

data.set and the within method for data sets return a “data.set” object, is.data.set returns a logical value, and as.data.frame returns a data frame.
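The return types listed above can be checked directly. The following is a minimal sketch, assuming that the memisc package (which provides data.set) is installed; the object names ds and df are purely illustrative.

```r
library(memisc)  # assumption: data.set() and friends come from memisc

# A small data set built from two plain numeric vectors
ds <- data.set(x = c(1, 2, 3), y = c(10, 20, 30))

# is.data.set() returns a logical value
is.data.set(ds)
is.data.set(data.frame(x = 1))  # an ordinary data frame is not a data set

# as.data.frame() returns an ordinary data frame
df <- as.data.frame(ds)
is.data.frame(df)
```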

Details

The as.data.frame method for data sets is just a copy of the method for list. Consequently, all items in the data set are coerced in accordance with their measurement setting; see item and measurement.
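As a rough illustration of this coercion, the sketch below sets the measurement level of two items and then inspects the classes of the resulting data frame columns. It assumes the memisc package is installed; the variable names party and age are made up for the example.

```r
library(memisc)  # assumption: the package providing data.set()

ds <- data.set(
  party = c(1, 2, 2, 9),
  age   = c(34, 51, 27, 45)
)

ds <- within(ds, {
  measurement(party) <- "nominal"
  labels(party) <- c(Left = 1, Right = 2, "No answer" = 9)
  missing.values(party) <- 9
  measurement(age) <- "ratio"
})

# Each item is converted according to its measurement setting;
# values declared as missing are expected to become NA.
df <- as.data.frame(ds)
sapply(df, class)
df
```

Setting the measurement level before conversion is what determines the column type, so it is worth checking with measurement() when a column does not come out as expected.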

The within method for data sets has the same effect as the within method for data frames, with two differences: results of the computations that have the appropriate length are coerced into items, and results of any other length are dropped automatically, with a warning.
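The length rule can be seen in a small sketch like the following, which assumes the memisc package is installed; the names id and junk are hypothetical.

```r
library(memisc)  # assumption: the package providing data.set()

ds <- data.set(x = c(1, 2, 3, 4))

ds2 <- within(ds, {
  id   <- 1:4  # same length as the data set: kept and coerced into an item
  junk <- 1:2  # wrong length: dropped, and a warning is issued
})

names(ds2)
```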

Apart from the method for lists shown in the Usage section, only one further method for the generic function as.data.set is currently defined: a method for “importer” objects.

Examples

Data <- data.set(
         vote = sample(c(1,2,3,8,9,97,99),size=300,replace=TRUE),
         region = sample(c(rep(1,3),rep(2,2),3,99),size=300,replace=TRUE),
         income = exp(rnorm(300,sd=.7))*2000
         )
Data <- within(Data,{
 description(vote) <- "Vote intention"
 description(region) <- "Region of residence"
 description(income) <- "Household income"
 wording(vote) <- "If a general election were to take place next Tuesday,
                   the candidate of which party would you vote for?"
 wording(income) <- "All things taken into account, how much do all
                   household members earn in sum?"
 foreach(x=c(vote,region),{
   measurement(x) <- "nominal"
   })
 measurement(income) <- "ratio"
 labels(vote) <- c(
                   Conservatives         =  1,
                   Labour                =  2,
                   "Liberal Democrats"   =  3,
                   "Don't know"          =  8,
                   "Answer refused"      =  9,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 labels(region) <- c(
                   England               =  1,
                   Scotland              =  2,
                   Wales                 =  3,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 foreach(x=c(vote,region,income),{
   annotation(x)["Remark"] <- "This is not a real survey item, of course ..."
   })
 missing.values(vote) <- c(8,9,97,99)
 missing.values(region) <- c(97,99)

 # These two variables do not appear in the
 # resulting data set, since they have the wrong length.
 junk1 <- 1:5
 junk2 <- matrix(5,4,4)

})
Warning in within(Data, { :
  Variables 'junk1','junk2' have wrong length, removing them.
# Since data sets may be huge, only
# part of a data set is 'show'n
Data
Data set with 300 observations and 3 variables

                   vote               region   income
 1 *Not asked in survey             Scotland 4434.841
 2               Labour             Scotland 3149.609
 3        Conservatives *Not asked in survey 3753.801
 4        Conservatives *Not asked in survey 7621.179
 5      *Not applicable              England 1623.035
 6      *Answer refused              England 1254.452
 7          *Don't know             Scotland 1902.068
 8    Liberal Democrats              England 1941.671
 9               Labour             Scotland 1823.695
10 *Not asked in survey             Scotland 6376.524
11        Conservatives                Wales 3802.682
12        Conservatives             Scotland 1811.134
13    Liberal Democrats              England 7219.892
14    Liberal Democrats              England 3835.629
15      *Answer refused *Not asked in survey  163.159
16               Labour                Wales 1587.021
17      *Not applicable              England 1822.535
18        Conservatives *Not asked in survey 4251.340
19               Labour              England 1581.108
20      *Not applicable             Scotland 6242.283
21    Liberal Democrats             Scotland 1180.103
22    Liberal Democrats *Not asked in survey 1184.158
23          *Don't know             Scotland 3598.123
24               Labour                Wales 2644.358
25      *Not applicable              England 1165.311
(25 of 300 observations shown)
## Not run:
## # If we insist on seeing all, we can use 'print' instead
## print(Data)
## End(Not run)

str(Data)
Data set with 300 obs. of 3 variables:
 $ vote  : Nmnl. item w/ 7 labels for 1,2,3,... + ms.v.  num 99 2 1 1 97 9 8 3 2 99 ...
 $ region: Nmnl. item w/ 5 labels for 1,2,3,... + ms.v.  num 2 2 99 99 1 1 2 1 2 2 ...
 $ income: Rto. item  num  4435 3150 3754 7621 1623 ...
summary(Data)
                  vote                     region        income
Conservatives       :58   England             :135   Min.   :  163.2
Labour              :38   Scotland            : 96   1st Qu.: 1299.1
Liberal Democrats   :41   Wales               : 31   Median : 2134.0
*Don't know         :50   *Not asked in survey: 38   Mean   : 2682.0
*Answer refused     :38                              3rd Qu.: 3318.3
*Not applicable     :40                              Max.   :13639.7
*Not asked in survey:35
## Not run:
## # If we want to 'View' a data set we can use 'dsView'
## dsView(Data)
## # Works also, but changes the data set into a data frame first:
## View(Data)
## End(Not run)

Data[[1]]
Item 'Vote intention' (measurement: nominal, type: double, length = 300)

 [1:300] *Not asked in survey Labour Conservatives Conservatives ...
Data[1,]
Data set with 1 observations and 3 variables

                  vote   region   income
1 *Not asked in survey Scotland 4434.841
head(as.data.frame(Data))
           vote   region   income
1          <NA> Scotland 4434.841
2        Labour Scotland 3149.609
3 Conservatives     <NA> 3753.801
4 Conservatives     <NA> 7621.179
5          <NA>  England 1623.035
6          <NA>  England 1254.452
EnglandData <- subset(Data,region == "England")
EnglandData
Data set with 135 observations and 3 variables

                   vote  region    income
 1      *Not applicable England 1623.0346
 2      *Answer refused England 1254.4520
 3    Liberal Democrats England 1941.6712
 4    Liberal Democrats England 7219.8917
 5    Liberal Democrats England 3835.6294
 6      *Not applicable England 1822.5345
 7               Labour England 1581.1076
 8      *Not applicable England 1165.3108
 9    Liberal Democrats England  413.8022
10        Conservatives England 5046.0297
11    Liberal Democrats England 2121.2771
12          *Don't know England 3935.1355
13      *Not applicable England 3284.1126
14      *Answer refused England 2301.0699
15        Conservatives England 4971.8018
16          *Don't know England 3168.6848
17        Conservatives England 3770.4533
18      *Answer refused England 1020.7113
19      *Not applicable England 8655.1240
20 *Not asked in survey England 3624.6753
21    Liberal Democrats England 1922.5645
22 *Not asked in survey England  533.5368
23        Conservatives England 3752.4209
24    Liberal Democrats England 2355.8925
25      *Not applicable England 2260.3259
(25 of 135 observations shown)
xtabs(~vote+region,data=Data)
                   region
vote                England Scotland Wales
  Conservatives          21       18     7
  Labour                 17       12     7
  Liberal Democrats      24        9     2
xtabs(~vote+region,data=within(Data, vote <- include.missings(vote)))
                      region
vote                   England Scotland Wales
  Conservatives             21       18     7
  Labour                    17       12     7
  Liberal Democrats         24        9     2
  *Don't know               19       21     6
  *Answer refused           21        8     4
  *Not applicable           19       10     5
  *Not asked in survey      14       18     0