Data Management with R: A Guide for Social Scientists

Elff, Martin. 2019. “Data Management with R: A Guide for Social Scientists”. Book contract with Sage, London.

Outline of the book

Abstract:: This chapter gives a brief overview of R: where it comes from, how it can be obtained, and what its major strengths and limitations are.

Abstract:: This chapter describes the most basic data types, on which all other data structures build. It starts with simple numeric vectors, which may e.g. contain series of measurements. It further discusses character vectors, i.e. sequences of character strings, logical vectors, i.e. sequences of TRUE/FALSE values, and finally lists. The chapter also covers how simple computations on such data can be conducted and how simple summaries can be obtained from elementary data types. Finally, the chapter discusses how data can be stored on disk in an R-specific format.
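
The data types named in this abstract can be sketched in a few lines of base R. This is only an illustration with made-up values; the variable names are hypothetical, and saveRDS()/readRDS() are just one of the R-specific storage formats the chapter might cover.

```r
# Numeric vector: a series of measurements (made-up values)
height <- c(1.62, 1.75, 1.80, 1.68)

# Character vector and logical vector
name <- c("Ann", "Bob", "Cem", "Dee")
tall <- height > 1.70            # element-wise comparison gives TRUE/FALSE

# A list can hold elements of different types and lengths
person <- list(name = "Ann", height = 1.62, tall = FALSE)

# Simple computations and summaries
mean_height <- mean(height)
n_tall <- sum(tall)              # TRUE counts as 1, FALSE as 0

# Store a vector on disk in an R-specific format and read it back
path <- tempfile(fileext = ".rds")
saveRDS(height, path)
height2 <- readRDS(path)
```

The comparison `height > 1.70` shows that operations on vectors work element by element, which is what makes summaries such as `sum(tall)` possible without an explicit loop.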

Abstract:: This chapter describes how a typical data set used in multivariate analysis is composed, i.e. as a rectangular arrangement of variables and observations. The chapter further describes ways to manipulate data within data frames and how data frames can be restricted, combined, and reshaped.
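
As a minimal sketch of restricting, combining, and reshaping data frames, the following uses only base R functions (subset(), merge(), reshape()). All data values and column names are invented for illustration.

```r
# A small data frame: rows are observations, columns are variables
survey <- data.frame(
  id      = 1:4,
  country = c("DE", "FR", "DE", "UK"),
  age     = c(34, 51, 27, 45),
  stringsAsFactors = FALSE
)

# Restrict: keep only some observations and some variables
de <- subset(survey, country == "DE", select = c(id, age))

# Combine: add country-level information to the individual-level data
countries <- data.frame(
  country = c("DE", "FR", "UK"),
  region  = c("Central", "West", "North"),
  stringsAsFactors = FALSE
)
merged <- merge(survey, countries, by = "country")

# Reshape: from wide format (one column per wave) to long format
wide <- data.frame(id = 1:2, score.1 = c(3, 5), score.2 = c(4, 6))
long <- reshape(wide, direction = "long",
                varying = c("score.1", "score.2"), idvar = "id")
```

Here reshape() splits the column names at the dot, so `score.1` and `score.2` become a single variable `score` with a `time` index.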

Abstract:: This chapter discusses the various formats in which data sets are available from data archives (such as GESIS, the UK Data Archive, and ICPSR) as well as from other data providers. Such data formats include CSV, tab-delimited, fixed-column, SPSS “portable” and “system” files, and Stata files. The chapter discusses the features of these formats, i.e. what kind of metadata (such as value labels and user-defined missing values) they allow. It also discusses the challenges that importing data in these formats may pose, e.g. variables where only a subset of the values are labeled, data sets with a large number of variables, and variables that have non-mnemonic names (such as v1, v2, etc.).
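
Two of the plain-text formats mentioned here can be imported with base R alone. The sketch below writes tiny example files to a temporary location and reads them back; the contents, the column widths, the new variable names, and the use of 9 as a user-defined missing code are all assumptions made for illustration.

```r
# CSV file with non-mnemonic variable names (v1, v2)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,v1,v2",
             "1,2,35",
             "2,9,41",
             "3,1,29"), csv_path)
raw <- read.csv(csv_path)

# CSV carries no metadata, so a user-defined missing code
# (assumed here: 9 in v1) has to be set to NA by hand
raw$v1[raw$v1 == 9] <- NA

# Give the variables mnemonic names (hypothetical meanings)
names(raw)[names(raw) == "v1"] <- "vote"
names(raw)[names(raw) == "v2"] <- "age"

# Fixed-column file: the same data, with fields identified by position only
fwf_path <- tempfile(fileext = ".txt")
writeLines(c("01235", "02941", "03129"), fwf_path)
fixed <- read.fwf(fwf_path, widths = c(2, 1, 2),
                  col.names = c("id", "vote", "age"))
```

The manual recoding step illustrates why formats without metadata are more laborious to import than SPSS or Stata files, which can store value labels and missing-value definitions alongside the data.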

Abstract:: This chapter introduces the extension package memisc, which is specifically designed to address several of the challenges discussed in the previous chapter. It shows how system memory can be saved by importing subsets of variables and observations; how variables can be renamed so that results are easier to interpret; and how certain metadata that a basic R installation does not provide for, such as value labels and user-defined missing values, can be used. The chapter also provides examples of more complex recodings of variables, e.g. the construction of Goldthorpe class categories for households from ISCO-coded occupations of survey respondents, and of the creation of codebooks.

Abstract:: Data from social science surveys often come from complex samples. In order to achieve efficient or at least accurate inference one may need to take into account the sampling design in the computation of sample summaries, e.g. by the application of sampling weights. The chapter shows how the survey package can help to take into account the sampling design in data management and data analysis.
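
The effect of sampling weights on a simple summary can be sketched in base R. This is not the survey package's interface, only an illustration of the underlying idea with made-up incomes and assumed inverse-probability weights.

```r
# Hypothetical sample in which some respondents were sampled at a
# higher rate than others (values are made up)
income <- c(20, 30, 50, 70)      # in thousands
weight <- c(100, 100, 25, 25)    # assumed inverse-probability weights

# Ignoring the design treats every respondent alike
unweighted <- mean(income)

# Weighting lets each respondent stand for `weight` population members
weighted <- weighted.mean(income, weight)
```

A weighted mean only corrects the point estimate. Standard errors also depend on the sampling design (strata, clusters), which is the part that a dedicated package such as survey is meant to handle.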

Abstract:: Temporal data, consisting of dates and times, pose their own challenges. Time is measured in non-metric units, in hours, minutes and seconds. Dates can be recorded according to various calendar systems, and are complicated by leap days and leap seconds. R provides facilities to convert times and dates between different calendar systems, to format temporal data, and to import temporal data recorded in different formats. This is one topic of this chapter. The other topic is time series and similar data structures (such as panels). Basic time series consist of measurements conducted at regular temporal intervals, but besides these basic variants, R also supports irregular time series. The chapter therefore also discusses the construction and manipulation of regular and irregular time series.
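
A short base R sketch of the two topics of this chapter follows: importing and formatting dates, and building a regular time series. The dates and values are invented, and the irregular series is represented here simply as a vector of observation dates rather than with a dedicated time series class.

```r
# Import dates recorded in different formats
d1 <- as.Date("2020-02-28")                        # ISO format
d2 <- as.Date("01.03.2020", format = "%d.%m.%Y")   # day.month.year

# Date arithmetic accounts for the leap day 2020-02-29
gap <- as.numeric(d2 - d1)                         # difference in days

# Format a date for output
fmt <- format(d1, "%d/%m/%Y")

# Regular time series: quarterly measurements starting in 2019 Q1
y <- ts(c(5.1, 5.3, 5.0, 4.8, 4.9, 5.2), start = c(2019, 1), frequency = 4)
w <- window(y, start = c(2019, 3), end = c(2020, 1))

# Irregularly spaced observations: the intervals differ in length
obs_dates <- as.Date(c("2020-01-03", "2020-01-10", "2020-02-01"))
spacing <- as.numeric(diff(obs_dates))
```

The window() call selects the quarters from 2019 Q3 to 2020 Q1 by their time stamps rather than by position, which is the main convenience of storing the series as a `ts` object.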

Abstract:: While a single geographical location can be identified with a single pair of coordinates, the geographical extent of a trade route or a country is described by a series of such coordinate pairs and cannot easily be accommodated within the observation-by-variable format of data frames. Instead, such geographical data constitute a data type of their own, which is discussed in this chapter. The chapter also discusses how such spatial data can be connected with other data about geographical entities, e.g. the population density or GDP per capita of countries. Further, it discusses the import of geographical data from shape files and Google Maps KML files, and the definition of and conversion between cartographic projections.

Abstract:: Textual data have rapidly gained attention in the communication, social, and political sciences. Without their discussion, a companion to data management would be incomplete. This chapter starts with a discussion of basic operations on character strings, such as concatenation, search and replace. It then moves on to the management of corpora of text. It also discusses routine issues in the management of textual data, such as stemming, stop-word deletion, and the creation of term-frequency matrices.
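
The basic string operations and a term-frequency matrix can be sketched in base R as follows. The two example documents and the two-word stop list are made up, and stemming is left out because it would require an add-on package.

```r
docs <- c("The voters voted.", "Parties and voters disagree.")

# Basic operations on character strings
joined   <- paste("Data", "management")           # concatenation
replaced <- gsub("voters", "citizens", docs)      # search and replace
has_word <- grepl("Parties", docs, fixed = TRUE)  # search

# Tokenize: strip punctuation, convert to lower case, split on whitespace
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", docs)), "\\s+")

# Stop-word deletion with a toy stop list
stopwords <- c("the", "and")
tokens <- lapply(tokens, function(tk) tk[!tk %in% stopwords])

# Term-frequency matrix: one row per document, one column per term
vocab <- sort(unique(unlist(tokens)))
tf <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
rownames(tf) <- c("doc1", "doc2")
```

Converting the tokens to a factor with the full vocabulary as levels ensures that every document gets a count for every term, including zeros.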

Abstract:: The support of more complex data types, such as survey responses and geographical locations, rests on a more abstract and complex concept: classes and objects. An understanding of the different object-class systems (S3, S4, and R6) is not a requirement for applications involving e.g. geographical data, but readers who want a deeper understanding of the construction of complex data types in R may gain from this appendix.
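
How a complex data type can be built on top of classes may be sketched with S3, the simplest of the three systems. The class name `likert`, its fields, and its methods are hypothetical and only serve as an illustration.

```r
# Constructor for a hypothetical S3 class "likert":
# integer codes plus the labels they stand for
likert <- function(x, labels) {
  stopifnot(all(x %in% seq_along(labels)))
  structure(list(codes = as.integer(x), labels = labels),
            class = "likert")
}

# Methods are ordinary functions named generic.class
format.likert <- function(x, ...) x$labels[x$codes]

print.likert <- function(x, ...) {
  print(format(x), quote = FALSE)
  invisible(x)
}

summary.likert <- function(object, ...) {
  table(factor(format(object), levels = object$labels))
}

resp <- likert(c(1, 3, 3, 2), labels = c("disagree", "neutral", "agree"))
s <- summary(resp)   # dispatches to summary.likert
```

Calling the generic functions format(), print(), or summary() on `resp` selects the matching method from the object's class attribute, which is the mechanism that lets add-on packages make their own data types behave like built-in ones.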