Entering Data into R

## 21  Entering Data into R

It is very convenient to use built-in data sets, but at some point one wants to enter data into the session from outside of R. However, data can be found in many different places: on the web, in a spreadsheet, in a database, in a text file, on paper, and so on. As such, there are nearly as many ways to enter data. For the authoritative account on how to do this, consult the "R Data Import/Export" guide from http://cran.r-project.org.

What follows below is a much-shortened summary to illustrate quickly several different methods. Which method is best depends upon the context. Here, we will show you a variety of them and explain when they make sense to use.

### 21.1  Using c

The c function combines values. One of its simplest uses is to combine a sequence of values into a vector. For example,

> x = c(1,2,3,4)

stores the values 1, 2, 3, 4 in x. This is the easiest way to enter data quickly, but it becomes tedious if the data set is long.
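As a small sketch of how c combines values (the variable names y and z are illustrative and not from the original text), note that c also accepts existing vectors, so a longer data set can be built up in pieces:

```r
# Combine individual values into a numeric vector
x = c(1, 2, 3, 4)

# c() also accepts whole vectors, so data can be entered in pieces
y = c(x, 5, 6)           # append two more values to x
z = c(x, y)              # join two vectors end to end

length(z)                # number of values stored in z
```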

### 21.2  Using scan

The function scan at its simplest can do the same as c. It saves you having to type the commas though:

> x=scan()
1 2 3
4

Notice that we simply start typing the numbers in. If we hit the return key once, we continue on a new row; if we hit it twice in a row, scan stops. This can be fairly convenient when entering a few data points (10-40, say), but you might want to use a file if you have more.

The scan function has other options; a particularly useful one is the choice of separator, set with the sep argument.
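A minimal sketch of the separator option, assuming made-up sample values and using scan's text argument so the input can be given inline instead of typed at the prompt:

```r
# Read comma-separated numbers; sep tells scan what divides the fields
x = scan(text = "1,2,3,4", sep = ",", quiet = TRUE)

# With what = "", scan reads character strings instead of numbers
w = scan(text = "a;b;c", what = "", sep = ";", quiet = TRUE)
```

The quiet = TRUE argument only suppresses the "Read n items" message; it does not change what is read.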

### 21.3  Using scan with a file

If we have our numbers stored in a text file, then scan can be used to read them in. You just need to tell scan which file to open. Here are two examples.

Suppose the file ReadWithScan.txt has contents

1 2 3
4

Then the command

> x = scan(file = "ReadWithScan.txt")

reads the four values into x.


Now suppose there is some formatting between the numbers that you want to get rid of. For example, suppose this is now your file ReadWithScan.txt:

1,2,3,
4

then

> x = scan(file = "ReadWithScan.txt", sep = ",")

works.
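The two file-based cases above can be tried without creating files by hand. The following is only a sketch: the temporary file held in path is a hypothetical stand-in for ReadWithScan.txt, and the comma-separated lines here carry no trailing comma.

```r
# Write a small whitespace-separated file to a temporary location
path = tempfile(fileext = ".txt")      # hypothetical stand-in for ReadWithScan.txt
writeLines(c("1 2 3", "4"), path)
x = scan(file = path, quiet = TRUE)    # whitespace is the default separator

# Overwrite it with comma-separated values and read with sep = ","
writeLines(c("1,2,3", "4"), path)
y = scan(file = path, sep = ",", quiet = TRUE)

unlink(path)                           # remove the temporary file
```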

### 21.4  Editing your data

The data.entry command will let you edit existing variables and data frames with a spreadsheet-like interface. The only gotcha is that the variable you want to edit must already be defined. A simple usage is

> data.entry(x)                 # x already defined
> data.entry(x=c(NA))           # if x is not defined already

When the window is closed, the values are saved.

The R command edit will also open a simple window to edit data. This will let you edit functions easily. It can be used for data, but if you try, you'll see why it isn't recommended.

An important caveat: you must remember to store the results of the edit, or they vanish when you are done. For example,

> x = edit(x)                   ### NOT edit(x) alone!

The command fix will do the same thing but will automatically store the results.

### 21.5  Reading in tables of data

If you want to enter multivariate sets of data, you can do any of the above for each variable. However, it may be more convenient to read in tables of data at once.

Suppose your data is in tabular form, such as in this file, ReadWithReadTable.txt:


Age Weight Height Gender
18 150 65 F
21 160 68 M
45 180 65 M
54 205 69 M

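Notice the first row supplies the column names, and the second and following rows supply the data. The command read.table will read this in and store the results in a data frame. A data frame is a special matrix where all the variables are stored as columns and each has the same length. (Notice we need to specify that the headers are there in this case. In R 4.0.0 and later, stringsAsFactors = TRUE is also needed for text columns such as Gender to be read as factors.)

> x = read.table(file = "ReadWithReadTable.txt", header = TRUE, stringsAsFactors = TRUE)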


> x[['Gender']]             # a factor, it prints the levels
[1] F M M M
Levels: F M
> x[['Age']]                # a numeric vector
[1] 18 21 45 54
> x                         # default print out for a data.frame
  Age Weight Height Gender
1  18    150     65      F
2  21    160     68      M
3  45    180     65      M
4  54    205     69      M

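The read.table command treats the variables as numeric or, when stringsAsFactors = TRUE is given, as factors (since R 4.0.0 the default is to read text columns as character vectors). A factor is a special class in R and has a special print method. The "levels" of the factor are displayed after the values are printed. As well, the internal representation can be a bit surprising: a factor stores integer codes that index into its levels, not the labels themselves.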
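To make the remark about the internal representation concrete, here is a sketch that supplies the same table inline through read.table's text argument. The names txt, f, codes, and values are illustrative and not from the original text.

```r
# The same table, supplied inline instead of from a file
txt = "Age Weight Height Gender
18 150 65 F
21 160 68 M
45 180 65 M
54 205 69 M"

# stringsAsFactors = TRUE makes text columns factors (the default before R 4.0.0)
x = read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)

levels(x$Gender)         # the distinct labels
as.integer(x$Gender)     # the integer codes stored internally

# The surprise: converting a factor of digits gives the codes, not the numbers
f = factor(c("10", "20", "10"))
codes  = as.numeric(f)                 # positions in levels(f)
values = as.numeric(as.character(f))   # the numbers themselves
```

Going through as.character first is the usual way to recover the original numbers from such a factor.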

### 21.6  Fixed-width fields

Sometimes data comes without breaks between the fields, especially if you interface with old databases. Such data may be in fixed-width format (fwf). An example data set of student information at the College of Staten Island is of this form (say student.txt):


123456789MTH 2149872 A  0220002
314159319MTH 2149872 B+ 0220002
271828232MTH 2149872 A- 0220002

The first 9 characters are a student id, then 7 characters for the class, 4 for the section, 4 for the grade, 2 for the semester and 4 for the year. To read such a file in, we can use the read.fwf command. You need to tell it how wide the fields are, and optionally provide names. Here is how the example above could be read in if the file were titled student.txt:

> x = read.fwf(file = "student.txt", widths = c(9,7,4,4,2,4),
+      col.names = c("id","class","section","grade","sem","year"))

> x
         id   class section grade sem year
1 123456789 MTH 214    9872   A     2 2000
2 314159319 MTH 214    9872   B+    2 2000
3 271828232 MTH 214    9872   A-    2 2000
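A self-contained sketch of the fixed-width read described above. The temporary file in path is a hypothetical stand-in for student.txt, and strip.white = TRUE is an addition of this sketch (it is passed through to read.table) to trim the padded text fields.

```r
# Write the three sample records to a temporary file
path = tempfile(fileext = ".txt")      # hypothetical stand-in for student.txt
writeLines(c("123456789MTH 2149872 A  0220002",
             "314159319MTH 2149872 B+ 0220002",
             "271828232MTH 2149872 A- 0220002"), path)

# widths gives the number of characters in each field, in order;
# characters beyond the total width are ignored
x = read.fwf(path,
             widths = c(9, 7, 4, 4, 2, 4),
             col.names = c("id", "class", "section", "grade", "sem", "year"),
             strip.white = TRUE)

unlink(path)                           # remove the temporary file
```

Because the widths are counted in characters, a single misplaced space in the file shifts every later field, so it is worth checking a few rows after reading.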


### 21.7  Spreadsheet data

Alternatively, you may have data from a spreadsheet. The simplest way to enter this into R is through a file format that both applications can read and write. Typically, this is CSV format (comma-separated values). First, save the data from the spreadsheet as a CSV file, say data.csv. Then the R command read.csv will read it in as follows:

> x = read.csv(file = "data.csv")


If you use Windows, there is a developing package, RExcel, which allows you to do much more with R and the Excel spreadsheet. If you use Linux, there is a package for interfacing with the spreadsheet gnumeric.
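For the CSV route described above, a round trip can be sketched entirely within R. Here write.csv plays the part of the spreadsheet's export step, and the data frame d and the temporary file in path are hypothetical stand-ins for the spreadsheet and data.csv.

```r
# A small data frame standing in for the spreadsheet's contents
d = data.frame(Age = c(18, 21), Weight = c(150, 160), Gender = c("F", "M"))

# write.csv plays the role of the spreadsheet's "save as CSV" step
path = tempfile(fileext = ".csv")      # hypothetical stand-in for data.csv
write.csv(d, file = path, row.names = FALSE)

# read.csv assumes a header row and comma separators by default
x = read.csv(file = path)

unlink(path)                           # remove the temporary file
```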

### 21.8  XML, urls

XML, or extensible markup language, is a file storage format of the future. R has support for this, but you may need to add the XML package to your R installation. Many external applications can write in XML format. On UNIX, the gnumeric spreadsheet does so. The Microsoft .NET initiative does too.

R has a function url which will allow you to read in properly formatted web pages as though you were reading them with read.table. The syntax is identical, except that when one specifies the filename, it is replaced with a call to url. For example, the command might look like