Advanced R Usage - R Data Processing


The birth of S:

Becker and Chambers (AT&T Bell Labs) named the statistical programming language they newly developed in the 1980s "S"; it later evolved into the S-PLUS system.

The birth of R:

Ross Ihaka and Robert Gentleman (Univ. of Auckland, New Zealand) created "R & R", a reduced version of S, for teaching purposes.

The release of R:

In 1995, Martin Maechler persuaded Ross Ihaka and Robert Gentleman to release the R source code under the GPL (General Public License), the open source software license also used by the Linux system.

Formation of the R Core Team:

In August 1997, the international R Core Team was formed to advance the R system. It has since grown and consists of 21 members as of July 2015. R version 1.0.0 was released on February 29, 2000. As of July 2015, R version

※ Reference: www.r-project.org

Peter Dalgaard (2005), Introductory Statistics with R, Springer.

R Computing - Installing R and RStudio [link]

R environment settings [link]

Setting the working directory:

getwd()

setwd("d:/R")

setwd(choose.dir()) # Windows only: pick the folder in a dialog

RStudio > Session > Set Working....

Global setting: Tools > Global Options

Saving the work history:

savehistory(file="mywork.Rhistory")

history()

history(max.show=5)

Loading a .txt file:

Example file:

insurance.txt

text.data1 = read.table("insurance.txt", header=T)

text.data2 = read.table("insurance.txt", header=T, na.strings="-9")

Loading a .csv file:

Sample file: insurance.csv

csv.data = read.csv("insurance.csv")

Loading a tab-separated .txt file:

tab.data = read.table("insurance.txt", header=T, sep="\t")

Saving as .csv:

write.table(tab.data, "test_write.csv", row.names=F, quote=F, sep=",")

Saving as a tab-separated file:

write.table(tab.data, "test.write.txt", row.names=F, quote=F, sep="\t")
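The write-then-read round trip above can be sketched as follows. This is a minimal illustration, assuming a small made-up data frame and temporary files in place of the real insurance files; the names demo, tab_file and csv_file are hypothetical.

```r
# Hypothetical sample data standing in for insurance.txt
demo <- data.frame(id = 1:3, sex = c("m", "f", "m"), amount = c(10.5, -9, 7))

# Write the same data as tab-separated text and as CSV (temporary files)
tab_file <- tempfile(fileext = ".txt")
csv_file <- tempfile(fileext = ".csv")
write.table(demo, tab_file, row.names = FALSE, quote = FALSE, sep = "\t")
write.table(demo, csv_file, row.names = FALSE, quote = FALSE, sep = ",")

# Read both back; na.strings turns the missing-value code -9 into NA
tab_back <- read.table(tab_file, header = TRUE, sep = "\t", na.strings = "-9")
csv_back <- read.csv(csv_file)

print(tab_back)
print(csv_back)
```

Only the tab-separated read converts -9 to NA here, because read.csv() was called without na.strings.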

Loading a fixed-width file:

fwf.data = read.fwf(file="test_fwf.txt", widths=c(2,2,3,3,3,6,6),

col.names=c("id", "sex", "job", "religion", "edu", "amount", "salary"))

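Without delimiters, the data can easily get misaligned partway through, so read.fwf() should be used with care.

It is best used only when you are confident the data has regular spacing and a fixed structure, such as sensor data that a machine emits at steady intervals without drift.

Sample file: insurance_fwf.txt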

fwf.data[fwf.data$job==-9, "job"] = NA

head(fwf.data, n=3)

fwf2.data = read.fwf(file="test_fwf.txt", widths=c(2,-2,-3,3,3,6,6),

col.names=c("id", "religion", "edu", "amount", "salary"))
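To make the role of the widths argument concrete, here is a small sketch under assumed data: three made-up records with fields of width 2, 2, 3 and 6 written to a temporary file. The field layout and all names are hypothetical, not the layout of the sample file.

```r
# Hypothetical fixed-width records: id (2), sex (2), job (3), amount (6)
records <- c(" 1 m  1  12.5",
             " 2 f -9   8.0",
             " 3 m  2 100.0")
fwf_file <- tempfile(fileext = ".txt")
writeLines(records, fwf_file)

# Read every field by its width
all_cols <- read.fwf(fwf_file, widths = c(2, 2, 3, 6),
                     col.names = c("id", "sex", "job", "amount"))
all_cols[all_cols$job == -9, "job"] <- NA   # recode the missing-value code

# Negative widths skip fields: drop sex and job, keep id and amount
some_cols <- read.fwf(fwf_file, widths = c(2, -2, -3, 6),
                      col.names = c("id", "amount"))

print(all_cols)
print(some_cols)
```

If a single width is off by one character, every later field shifts, which is why the widths should be checked against a few raw lines first.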

Loading an .xlsx file:

Sample xlsx: insurance_xlsx.xlsx

install.packages("xlsx")

library(xlsx)

xlsx.data = read.xlsx("test_xlsx.xlsx", 1)

xlsx2.data = read.xlsx("test_xlsx.xlsx", 1, colIndex = c(1,2,6:7))

RDBMS

endterm.sql

midterm.sql

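Install MariaDB

Create database: Creating/listing/dropping a MySQL DB [link]

Create user: Creating, granting privileges to, listing/dropping a MySQL user [link]

Alter table: Changing MySQL TABLE structure and attributes [link]

Insert into: Inserting data into a MySQL table [link]

Install the MariaDB Connector driver

Control Panel > System > Administrative Tools > ODBC > enter ODBC DSN > MariaDB driver

Install LibreOffice 6.2.4

Launch LibreOffice > select Base database > connect to an existing database > MySQL > set up the MySQL connection > check ODBC > ODBC connection settings > Browse > select the ODBC DSN > set up user authentication > user name > check "Password required" > Test Connection > enter the password > Save and continue > Finish > enter the database name > the connected database and its structure should now be visible

Select a table and double-click to open it; you can then insert data.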

Sample database : student1.odb

install.packages("RODBC")

library(RODBC)

channel = odbcConnect("ODBC DSN")

sqlFetch(channel, "table_name") # meant to be equivalent to sqlQuery(channel, "select * from table_name")

However, this raises an SQLExecDirect error.

Use sqlQuery instead.

sqlQuery(channel, "select * from table_name where id > 10") # print rows whose id is greater than 10

If the number of columns and the variable names match, use rbind.

If the number of rows matches and the IDs or names match, use cbind.
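The two combining rules above can be sketched with made-up data frames. All names and values are hypothetical. The last step uses merge(), which the original text does not mention; it is shown only as a base R alternative when rows are not already in the same order.

```r
# Hypothetical data frames for illustration
a <- data.frame(id = 1:2, score = c(90, 80))
b <- data.frame(id = 3:4, score = c(70, 60))
stacked <- rbind(a, b)            # same columns and names: stack the rows

extra <- data.frame(grade = c("A", "B", "C", "D"))
widened <- cbind(stacked, extra)  # same number of rows: add the columns

# cbind() pairs rows purely by position; merge() matches on a key instead
info <- data.frame(id = c(4, 1, 3, 2), name = c("d", "a", "c", "b"))
merged <- merge(stacked, info, by = "id")

print(widened)
print(merged)
```

Because cbind() never checks IDs, it is only safe when both inputs are already sorted the same way.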

Reading SPSS data

install.packages("foreign")

library(foreign)

ex1 = read.spss("test_spss.sav", to.data.frame=T, use.value.labels=T)

Sample sav: ex1-1.sav

mouse.data = ex1[rep(1:nrow(ex1), ex1$count),] # expand each row by its count (frequency weight)

attach(mouse.data) # make the variables inside mouse.data directly accessible

mouse.table = table(shock, response) # shock: rows, response: columns

summary(mouse.table) # print summary information
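The row-expansion step is easier to follow on a tiny table. The sketch below assumes a made-up aggregated data frame with the same column names (shock, response, count); the values are invented. It uses with() in place of attach() and adds xtabs() as a base R cross-check, neither of which appears in the original.

```r
# Hypothetical aggregated data: one row per cell with its frequency
agg <- data.frame(shock    = c("yes", "yes", "no", "no"),
                  response = c("yes", "no", "yes", "no"),
                  count    = c(3, 1, 2, 4))

# Repeat each row 'count' times to get one row per observation
expanded <- agg[rep(seq_len(nrow(agg)), agg$count), ]

# Cross-tabulate; with() keeps the variable lookup local to this call
tab <- with(expanded, table(shock, response))
print(tab)
print(summary(tab))   # chi-squared test of independence

# xtabs() builds the same table directly from the aggregated counts
tab2 <- xtabs(count ~ shock + response, data = agg)
```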

Reading Stata:

read.dta()

Reading SPSS:

read.spss()

Reading a SAS XPORT file:

read.xport()

Reading Systat data:

read.systat()

Saving and loading RData:

save(ex1, file="ex1.RData")

rm(ex1)

load("ex1.RData")

Using a file chooser dialog: load(file = file.choose())
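A minimal sketch of the save, remove, load cycle, assuming a made-up object and a temporary file (the names scores and rdata_file are hypothetical):

```r
scores <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))  # hypothetical object
rdata_file <- tempfile(fileext = ".RData")

save(scores, file = rdata_file)   # store the object under its own name
rm(scores)                        # remove it from the workspace
restored <- load(rdata_file)      # load() returns the names it restored

print(restored)
print(scores)
```

load() restores objects under the names they were saved with, so it silently overwrites any object of the same name in the workspace.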

Changing variable values and handling missing values:

Sample files: wd.txt, wd.xlsx

>wd <- read.table("wd.txt", header=T, sep="\t")

>nwd <- wd

>nwd[nwd$M2 < 0.11, "M2"] = 99

>nwd[nwd == 99] = NA

>rowSums(is.na(nwd)) # count NAs per row

>colSums(is.na(nwd)) # count NAs per column

>mywd <- na.omit(nwd) # drop every row of nwd that contains an NA

>fix(nwd) # edit the data and variable names in the pop-up window

>names(nwd)[6] = 'ny' # rename the 6th variable to ny

>colnames(nwd) = c('x1', 'x2', 'x3', 'x4', 'x5', 'newy') # rename all variables at once

>install.packages('reshape') # rename variables using reshape

>library(reshape)

>names = c('kim', 'lee', 'pack')

>ages = c(50, 44, 35)

>frame.data = data.frame(names, ages)

>frame.data = rename(frame.data, c(names = 'name'))

>frame.data = rename(frame.data, c(ages = 'age'))
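The missing-value steps above can be tried without the sample file. This sketch assumes a small made-up data frame in which 99 is the missing-value code; the column names are hypothetical. Renaming is done in base R here, so the reshape package is not needed.

```r
# Hypothetical measurements; 99 is used as the missing-value code
d <- data.frame(x1 = c(0.05, 0.20, 0.30, 0.40),
                x2 = c(1, 99, 3, 4),
                y  = c(10, 20, 99, 40))

d[d$x1 < 0.11, "x1"] <- 99     # flag too-small values with the code
d[d == 99] <- NA               # turn every 99 into NA

print(rowSums(is.na(d)))       # NAs per row
print(colSums(is.na(d)))       # NAs per column
complete <- na.omit(d)         # keep only rows without any NA

names(d)[3] <- "ny"                   # rename by position
names(d)[names(d) == "x1"] <- "m1"    # rename by current name
print(d)
```

Renaming by current name is less fragile than renaming by position when columns may be reordered later.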

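Variable value labels

Example variables: job = [1: laborer, 2: office worker, 3: professional]

edu = [1: no schooling, 2: elementary school, 3: middle school, 4: high school, 5: college]

>insurance = read.table("insurance.txt", header=T)

>insurance$job = factor(insurance$job, levels = c(1:3), labels = c('laborer', 'office worker', 'professional'))

>insurance$edu = ordered(insurance$edu, levels = c(1:5), labels = c('no schooling', 'elementary school', 'middle school', 'high school', 'college'))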
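The difference between factor() and ordered() shows up once the values are compared. A minimal sketch with made-up codes (the vectors and shortened labels are hypothetical):

```r
codes <- c(1, 3, 2, 3, 1)   # hypothetical raw job codes

job <- factor(codes, levels = 1:3,
              labels = c("laborer", "office worker", "professional"))
edu <- ordered(c(2, 5, 3, 4, 1), levels = 1:5,
               labels = c("none", "elementary", "middle", "high", "college"))

print(table(job))          # counts are reported by label
print(edu >= "middle")     # order comparisons work only for ordered factors
print(as.integer(job))     # the underlying integer codes are still available
```

job is nominal, so only equality tests make sense for it, while edu carries a ranking that plots and models can use.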

Applying value labels to a bar plot

>job.freq = table(insurance$job)

>barplot(job.freq)

>title("Bar plot: job")

>insurance$job = factor(insurance$job, levels = c(1:3), labels = c('laborer', 'office worker', 'professional')) # assumes job still holds the raw codes 1-3

>job.freq2 = table(insurance$job)

>barplot(job.freq2)

>title("Bar plot 2: job")

Transforming variable values

>install.packages('xlsx')

>library(xlsx)

>drug = read.xlsx("drug.xlsx", 1)

>drug$agr = drug$age

>drug$agr[drug$agr >=20 & drug$agr <=40] = 1

>drug$agr[drug$agr >40 & drug$agr <= 60] = 2

>drug$agr[drug$agr >60] = 3

>drug$age_labels = ordered(drug$agr, levels = c(1:3), labels = c('young', 'middle-aged', 'senior'))

# Example using recode() from the car package

>install.packages("car")

>library(car)

>drug$agr2 = drug$age

>drug$agr2 = recode(drug$age, "20:40=1; 40:60=2; 60:hi=3") # where ranges overlap, the first matching rule applies
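The same binning can also be done in one step with base R's cut(). The original does not use cut(); it is shown here only as an alternative. The sketch assumes whole-number ages with made-up values.

```r
age <- c(20, 35, 40, 41, 60, 61, 75)   # hypothetical whole-number ages

# Intervals are right-closed: (19,40], (40,60], (60,Inf]
age_group <- cut(age, breaks = c(19, 40, 60, Inf),
                 labels = c("young", "middle-aged", "senior"),
                 ordered_result = TRUE)

print(data.frame(age, age_group))
```

Ages at or below the lowest break (19 here) become NA, whereas the step-by-step assignment above leaves them unchanged.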

Selecting cases

Extract cases where sex is female (m)

>insurance = read.table('insurance.txt', header = T)

>select1 = insurance[insurance$sex == 'm',]

Cases where sex is male (f) and job is office worker (2)

>select2 = insurance[insurance$sex == 'f' & insurance$job ==2,]

>select2 = insurance[which(insurance$sex == 'f' & insurance$job ==2),]
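The two forms of select2 give the same result unless the condition contains NA. The following sketch uses a made-up data frame with missing values (hypothetical data) to show the difference:

```r
# Hypothetical data with missing values in both columns
d <- data.frame(sex = c("m", "f", NA, "f"), job = c(1, 2, 2, NA))

cond <- d$sex == "f" & d$job == 2   # FALSE, TRUE, NA, NA

by_logical <- d[cond, ]             # each NA in cond yields an all-NA row
by_which   <- d[which(cond), ]      # which() keeps only the TRUE positions

print(by_logical)
print(by_which)
```

When the data may contain missing values, the which() form is the safer of the two.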

Using the dplyr package

Extracting data (filtering)

>install.packages('dplyr')

>library(dplyr)

>dim(insurance)

>as_tibble(insurance) # prints the data frame compactly as a tibble; replaces the deprecated tbl_df()

>install.packages('nycflights13')

>library(nycflights13)

>as_tibble(flights)

Extract data where sex is female (m) and education is middle school (3); AND can be written as , or &

>select3 = filter(insurance, sex=='m', edu == 3)

Selecting variables

select()

>select4 = select(insurance, sex, job, amount, salary)

>select5 = select(insurance, job:salary)

filter() + select()

>select6 = filter(select(insurance, sex, job, amount, salary), job ==1)

Adding variables

mutate()

>insu_add1 = mutate(insurance, amopersal1 = amount/salary)

>amopersal2 = insurance$amount / insurance$salary

>insu_add2 = cbind(insurance, amopersal2)

Sorting data

arrange()

>insu_sort = arrange(insurance, sex, job)

>insu_sort = arrange(insurance, desc(sex), desc(job))

Printing summary results

group_by()
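The original names only group_by() here. A minimal sketch of how it is typically paired with summarise() follows; it assumes the dplyr package is installed and uses a made-up stand-in for the insurance data. The data frame and the summary columns n, mean_amount and max_salary are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the insurance data
insurance_demo <- data.frame(
  sex    = c("m", "f", "m", "f", "m"),
  job    = c(1, 2, 1, 2, 3),
  amount = c(10, 20, 30, 40, 50),
  salary = c(100, 200, 300, 400, 500)
)

# One summary row per group: size, mean amount, maximum salary
by_sex <- insurance_demo %>%
  group_by(sex) %>%
  summarise(n = n(), mean_amount = mean(amount), max_salary = max(salary))

print(by_sex)
```

group_by() alone only marks the grouping; the per-group numbers come from the summarise() call that follows it.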
