data.set memisc 0.99.25.4

Data Set Objects

Description

"data.set" objects are collections of "item" objects, with similar semantics as data frames. They are distinguished from data frames so that coercion by as.data.fame leads to a data frame that contains only vectors and factors. Nevertheless most methods for data frames are inherited by data sets, except for the method for the within generic function. For the within method for data sets, see the details section.

Thus data preparation using data sets retains all information about item annotations, labels, missing values, etc., while the (mostly automatic) conversion of data sets into data frames makes the data amenable to R’s statistical functions.

dsView is a function that displays data sets much as View displays data frames. (View works with data sets as well, but first converts them into data frames.)

Usage

data.set(...,row.names = NULL, check.rows = FALSE, check.names = TRUE,
    stringsAsFactors = FALSE, document = NULL)
as.data.set(x, row.names=NULL, ...)
## S4 method for signature 'list'
as.data.set(x,row.names=NULL,...)
is.data.set(x)
## S4 method for signature 'data.set'
as.data.frame(x, row.names = NULL, optional = FALSE, ...)
## S4 method for signature 'data.set'
within(data, expr, ...)

dsView(x)

## S4 method for signature 'data.set'
head(x,n=20,...)
## S4 method for signature 'data.set'
tail(x,n=20,...)

Arguments

...

For the data.set function, several vectors or items; for the within method, further arguments, which are ignored.

row.names, check.rows, check.names, stringsAsFactors, optional

arguments as in data.frame or as.data.frame, respectively.

document

NULL or an optional character vector that contains documentation of the data.

x

for is.data.set(x), any object; for as.data.frame(x,...) and dsView(x) a “data.set” object.

data

a data set, that is, an object of class “data.set”.

expr

an expression, or several expressions enclosed in curly braces.

n

integer; the number of rows to be shown by head or tail

Value

data.set and the within method for data sets return a “data.set” object, is.data.set returns a logical value, and as.data.frame returns a data frame.

Details

The as.data.frame method for data sets is just a copy of the method for list. Consequently, all items in the data set are coerced in accordance with their measurement setting; see as.vector,item-method and measurement.
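The effect of the measurement setting on coercion can be illustrated with a minimal sketch. The variable names and values below are hypothetical and chosen only for illustration; the sketch assumes that the memisc package is installed and that a nominal item with value labels is converted into a factor while an interval item is converted into a numeric vector.

```r
library(memisc)

# Hypothetical toy data: the code 9 stands for "No answer".
ds <- data.set(
  party = c(1, 2, 9, 1),
  age   = c(34, 51, 27, 45)
)

ds <- within(ds, {
  labels(party) <- c(Left = 1, Right = 2, "No answer" = 9)
  missing.values(party) <- 9
  measurement(party) <- "nominal"   # expected to become a factor
  measurement(age) <- "interval"    # expected to become a numeric vector
})

# Coercion follows the measurement setting of each item;
# values declared as missing are expected to become NA.
df <- as.data.frame(ds)
str(df)
```

Inspecting the result with str is a quick way to check how each item was translated before the data frame is passed on to modelling functions.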

The within method for data sets has the same effect as the within method for data frames, apart from two differences: results of the computations that have the appropriate length are coerced into items, and results of any other length are dropped automatically, with a warning.
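A compact sketch of this behaviour, assuming the memisc package is installed; the variable names score, weight and tmp are hypothetical. A warning about the dropped variable is expected when the code is run.

```r
library(memisc)

# Hypothetical data set with four observations.
ds <- data.set(score = c(10, 20, 30, 40))

ds2 <- within(ds, {
  weight <- rep(1, 4)  # same length as the data set: kept as an item
  tmp    <- 1:2        # wrong length: dropped from the result
})

# Only variables of the appropriate length should remain.
names(ds2)
```

Intermediate results therefore do not have to be removed by hand, provided their length differs from the number of observations.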

Currently only one method for the generic function as.data.set is defined: a method for “importer” objects.

Examples

Data <- data.set(
         vote = sample(c(1,2,3,8,9,97,99),size=300,replace=TRUE),
         region = sample(c(rep(1,3),rep(2,2),3,99),size=300,replace=TRUE),
         income = exp(rnorm(300,sd=.7))*2000
         )
Data <- within(Data,{
 description(vote) <- "Vote intention"
 description(region) <- "Region of residence"
 description(income) <- "Household income"
 wording(vote) <- "If a general election would take place next Tuesday,
                   the candidate of which party would you vote for?"
 wording(income) <- "All things taken into account, how much do all
                   household members earn in sum?"
 foreach(x=c(vote,region),{
   measurement(x) <- "nominal"
   })
 measurement(income) <- "ratio"
 labels(vote) <- c(
                   Conservatives         =  1,
                   Labour                =  2,
                   "Liberal Democrats"   =  3,
                   "Don't know"          =  8,
                   "Answer refused"      =  9,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 labels(region) <- c(
                   England               =  1,
                   Scotland              =  2,
                   Wales                 =  3,
                   "Not applicable"      = 97,
                   "Not asked in survey" = 99)
 foreach(x=c(vote,region,income),{
   annotation(x)["Remark"] <- "This is not a real survey item, of course ..."
   })
 missing.values(vote) <- c(8,9,97,99)
 missing.values(region) <- c(97,99)

 # These two variables do not appear in
 # the resulting data set, since they have the wrong length.
 junk1 <- 1:5
 junk2 <- matrix(5,4,4)

})
Warning in within(Data, { :
  Variables 'junk1','junk2' have wrong length, removing them.
# Since data sets may be huge, only
# part of a data set is 'show'n
Data
Data set with 300 observations and 3 variables

                   vote               region    income
 1 *Not asked in survey              England 1333.4413
 2               Labour              England 4078.4325
 3      *Not applicable              England 1700.6438
 4        Conservatives              England 5491.5934
 5    Liberal Democrats              England 1793.2672
 6      *Answer refused *Not asked in survey 2812.0151
 7 *Not asked in survey              England 2035.1048
 8        Conservatives              England 1359.4642
 9 *Not asked in survey             Scotland 1257.2907
10      *Not applicable                Wales 2154.1364
11 *Not asked in survey              England 4988.3709
12    Liberal Democrats *Not asked in survey 3284.0083
13               Labour                Wales  761.7126
14        Conservatives              England 4238.8204
15          *Don't know             Scotland 1974.9022
16               Labour             Scotland 2335.7966
17      *Answer refused             Scotland 3996.2648
18          *Don't know                Wales 5894.6863
19    Liberal Democrats                Wales 1804.0887
20 *Not asked in survey             Scotland 6329.8199
21          *Don't know                Wales 2728.0786
22               Labour              England 3855.4560
23 *Not asked in survey *Not asked in survey 1451.7798
24 *Not asked in survey                Wales  919.0701
25          *Don't know              England 1458.9798
(25 of 300 observations shown)
# If we insist on seeing all, we can use 'print' instead
print(Data)

str(Data)
Data set with 300 obs. of 3 variables:
 $ vote  : Nmnl. item w/ 7 labels for 1,2,3,... + ms.v.  num 99 2 97 1 3 9 99 1 99 97 ...
 $ region: Nmnl. item w/ 5 labels for 1,2,3,... + ms.v.  num 1 1 1 1 1 99 1 1 2 3 ...
 $ income: Rto. item  num  1333 4078 1701 5492 1793 ...
summary(Data)
                  vote                     region        income
Conservatives       :37   England             :144   Min.   :  398.1
Labour              :52   Scotland            : 73   1st Qu.: 1311.7
Liberal Democrats   :45   Wales               : 47   Median : 2076.9
*Don't know         :37   *Not asked in survey: 36   Mean   : 2517.6
*Answer refused     :45                              3rd Qu.: 3112.4
*Not applicable     :37                              Max.   :10757.1
*Not asked in survey:47
# If we want to 'View' a data set we can use 'dsView'
dsView(Data)
# Works also, but changes the data set into a data frame first:
View(Data)

Data[[1]]
Item 'Vote intention' (measurement: nominal, type: double, length = 300)

[1:300] *Not asked in survey Labour *Not applicable Conservatives Liberal
  Democrats ...
Data[1,]
Data set with 1 observations and 3 variables

                  vote  region   income
1 *Not asked in survey England 1333.441
head(as.data.frame(Data))
               vote  region   income
1              <NA> England 1333.441
2            Labour England 4078.433
3              <NA> England 1700.644
4     Conservatives England 5491.593
5 Liberal Democrats England 1793.267
6              <NA>    <NA> 2812.015
EnglandData <- subset(Data,region == "England")
EnglandData
Data set with 144 observations and 3 variables

                   vote  region    income
 1 *Not asked in survey England 1333.4413
 2               Labour England 4078.4325
 3      *Not applicable England 1700.6438
 4        Conservatives England 5491.5934
 5    Liberal Democrats England 1793.2672
 6 *Not asked in survey England 2035.1048
 7        Conservatives England 1359.4642
 8 *Not asked in survey England 4988.3709
 9        Conservatives England 4238.8204
10               Labour England 3855.4560
11          *Don't know England 1458.9798
12        Conservatives England 1697.5493
13 *Not asked in survey England 2263.7997
14      *Answer refused England 1436.7869
15 *Not asked in survey England 2513.9709
16 *Not asked in survey England 1427.6691
17               Labour England 3408.9998
18 *Not asked in survey England 1434.3889
19        Conservatives England 2174.4192
20               Labour England  555.5234
21    Liberal Democrats England 2246.1439
22 *Not asked in survey England 3802.0551
23 *Not asked in survey England 3359.7500
24               Labour England 2958.7386
25      *Not applicable England 5873.0534
(25 of 144 observations shown)
xtabs(~vote+region,data=Data)
                   region
vote                England Scotland Wales
  Conservatives          19        8     4
  Labour                 30       11     9
  Liberal Democrats      17       13     9
xtabs(~vote+region,data=within(Data, vote <- include.missings(vote)))
                      region
vote                   England Scotland Wales
  Conservatives             19        8     4
  Labour                    30       11     9
  Liberal Democrats         17       13     9
  *Don't know               22        5     6
  *Answer refused           15       16     7
  *Not applicable           12       11     7
  *Not asked in survey      29        9     5