Oracle® R Enterprise User's Guide Release 11.2 for Linux, Solaris, AIX, and Windows Part Number E26499-05 |
|
|
PDF · Mobi · ePub |
This chapter explains how to use Oracle R Enterprise to analyze data stored in tables or views in an Oracle Database. Before you analyze data in tables, you must connect to a database, as described in Tables in the Oracle Database.
This chapter discusses these topics:
We assume familiarity with R in the remainder of this section.
These examples were all created using R Console, the default graphical user interface for Open Source R.
For information about ore.connect
, ore.attach
, ore.sync
, and ore.ls
, see Start the Oracle R Enterprise Client on Microsoft Windows, Start the Oracle R Enterprise Client on Linux. Solaris, or AIX, and Connect to an Oracle Database.
Oracle R Enterprise also includes the Oracle R Enterprise Statistical Functions, described in Chapter 4.
Before you can use Oracle R Enterprise to analyze data stored in database tables, you must Start the Oracle R Enterprise Client on Microsoft Windows or Start the Oracle R Enterprise Client on Linux. Solaris, or AIX.
Objects created by Oracle R Enterprise are identified with the ore
prefix. Pick any object returned by ore.ls()
and type either class(OBJECTNAME)
or class(OBJECTNAME$COLUMN_NAME)
.
For example,
R> class(NARROW) [1] "ore.frame" attr(,"package") [1] "OREbase"
The prefix ore
is applied to the class names. This indicates that the object is an Oracle R Enterprise created object that holds metadata (instead of contents) of the corresponding object in Oracle Database.
Two important metadata objects are
ore.frame is the Oracle R Enterprise metadata object that maps to a database table. The ore.frame object is the counterpart to an R data.frame.
ore.matrix is the Oracle R Enterprise metadata object that maps to a database table storing a matrix. The ore.matrix object is the counterpart to an R matrix.
ore.frame
or ore.matrix
can be returned by the class()
function. For an example of creating ore.frame data, see Load Data into the Database.
Use this command to view the Oracle R Enterprise documentation library:
R> OREShowDoc()
When you install Oracle R Enterprise, two tables NARROW and ONTIME_S are loaded into the rquser
schema:
R> ore.ls() [1] "NARROW" "ONTIME_S"
Oracle R Enterprise includes these functions:
Follow these steps to load data from files on your system to the Oracle Database:
Load contents of the file to an R data frame using read.table()
or read.csv()
functions documented in the R manuals.
Then use ore.create()
to load a data frame to a table:
ore.create(data_frame, table="TABLE_NAME")
loads data_frame into the database table TABLE_NAME.
This example creates an R data frame df
consisting of pairs of numbers and letters and then loads the data frame into the Oracle table DF_TABLE. The example shows that the data frame and the table have the same dimensions and the same first few elements, but different values for class. The class for DF_TABLE is ore.frame.
R> df <- data.frame(A=1:26, B=letters[1:26]) R> dim(df) [1] 26 2 R> class(df) [1] "data.frame" R> head(df) A B 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 f R> ore.create(df, table="DF_TABLE") R> ore.ls() [1] "DF_TABLE" "NARROW" "ONTIME_S" R> class(DF_TABLE) [1] "ore.frame" attr(,"package") [1] "OREbase" R> dim(DF_TABLE) [1] 26 2 R> head(DF_TABLE) A B 0 1 a 1 2 b 2 3 c 3 4 d 4 5 e 5 6 f R>
ore.push(data-frame)
stores an R object in the database as a temporary object, and returns a handle to that object. It converts data frame, matrix, and vector to a table, and list, model, and others to a serialized object.
This example pushes the numerical vector created by the R command c(1,2,3,4,5)
to v, an Oracle R Enterprise object:
v <- ore.push(c(1,2,3,4,5)) R> class(v) [1] "ore.numeric" attr(,"package") [1] "OREbase" R> head(v) [1] 1 2 3 4 5
To drop a table in the database use
ore.drop(table="NAMEOFTABLE")
For example, this command drops DF_TABLE:
ore.drop(table="DF_TABLE")
To pull the contents of an Oracle Database table or view into an in-memory R data frame use ore.pull(OBJECT_NAME)
for the name of an object returned by ore.ls()
.
Note:
You can pull a table or view to an R frame only if the.if the size of the data can fit into R's memory.For example, use ore.pull()
to create the data frame df_narrow
from the table NARROW and then verify that df_narrow
is a data frame:
R> df_narrow <- ore.pull(NARROW) R> class(df_narrow) [1] "data.frame"
The Oracle R Enterprise transparency framework allows R users to continue to use R syntax to work directly with database-resident objects without having to pull data from Oracle into R's memory on the user's desktop.
R language constructs and syntax are supported for objects mapped to Oracle Database objects. The following R data types have been overloaded so that they are mapped to database objects and hence enabled for in-database execution:
Character, Integer, Numeric and Logical vectors
Factors
Data Frame
Matrix is overloaded in two situations:
Linear algebra cross-products
Creating input matrices for advanced analytics
class(object)
reports the data type of such mapped objects. For example,
R> class(NARROW$AGE) [1] "ore.numeric" attr(,"package") [1] "OREbase"
The following operators and functions are supported. See R documentation for syntax and semantics of these operators and functions. Syntax and semantics for these items remain unchanged when used on a corresponding database-mapped data type (also known as an Oracle R Enterprise data type):
Mathematical transformations: abs, sign, sqrt, ceiling, floor, trunc, cummax, cummin, cumprod, cumsum, log, log10, log2, log1p, acos, acosh, asin, asinh, atan, atanh, exp, expm1, cos, cosh, sin, sinh, tan, tanh, gamma, lgamma, digamma, trigamma, round, signif, pmin, pmax, zapsmall
Basic statistics: mean, summary, min, max, sum, any, all, median, range, IQR, fivenum, mad, quantile, sd, var, table, rowSums, colSums, rowMeans, colMeans
Arithmetic operators:+, -, *, /, ^, %%, %/%
Comparison operators: ==, >, <, !=, <=, >=
Logical operators: &, |, xor
Set operations: unique, %in%
Assignment: <-, =, ->
String operations: tolower, toupper, casefold, toString, chatr, sub, gsub, substr, substring, paste, nchar
Combine Data Frame: cbind, rbind, merge
Combine vectors: append
Vector creation: ifelse
Subset: [, [[, $, head, tail, window, subset, Filter, na.omit, na.exclude, complete.cases
Data reshaping: split, unlist
Data processing: eval, with, within, transform
Apply variants: tapply, aggregate, by
Regression: ore.lm()
- a variant of lm()
Special value checks: is.na, is.finite, is.infinite, is.nan
Metadata functions: attributes, nrow, NROW, ncol, NCOL, nlevels, names, row, col, dimnames, dim, length, row.names, col.names, levels, reorder
Graphics: hist, boxplot, plot, smoothScatter
Garbage collection: gc (removal of implicitly created temporary tables after errors and explicitly created temporary tables)
Conversion functions: as.ore.{character, factor, integer, logical, numeric, vector}
Test functions: is.ore.{character, factor, integer, logical, numeric, vector}
Save: ore.push (table is automatically refreshed in R memory)
The following additional categories of functions exist to accomplish conversions to/from and checks on Oracle R Enterprise data types:
Hypothesis testing: wilcox.test, ks.test, var.test, binom.test, chisq.test, t.test, bartlett.test
Bessel Functions: Bessel(I,J,K,Y)
Gamma Functions: gamma, lgamma, digamma, trigamma (part of mathematical functions group)
Various Distributions: Density, cumulative distribution, and quantile functions for standard distributions
Matrix Operations:%*% (matrix multiplication), crossprod (matrix cross-product), tcrossprod (matrix cross-product A times transpose of B)
The Oracle R Enterprise sample programs, described in Oracle R Enterprise Examples include several examples using each category of the functions listed above with Oracle R Enterprise data types.
Our design principle has been to support data pre-processing functionality extensively so all data preparation and analysis can take place directly in the database.If you need to use a statistical technique that is not available in Oracle R Enterprise, having used Oracle R Enterprise to preprocess and filter the data, a much smaller amount of data can be pulled into R.
If a specific function that you need is not in the list above, you must explicitly pull data from the database into the R engine memory using ore.pull()
to create an in-memory R object first.
The following examples illustrate using R with Oracle R Enterprise data types:
Simple column and row selection in R:
# Push built-in R data set iris to database ore.create(iris, table="IRIS") head(iris) iris_projected = IRIS[, c("PETAL_LENGTH", "SPECIES")] R> head (iris_projected) PETAL_LENGTH SPECIES 0 1.4 setosa 1 1.4 setosa 2 1.3 setosa 3 1.5 setosa 4 1.4 setosa 5 1.7 setosa
df1 <- data.frame(x1=1:5, y1=letters[1:5]) df2 <- data.frame(x2=5:1, y2=letters[11:15]) merge (df1, df2, by.x="x1", by.y="x2") x1 y1 y2 1 1 a o 2 2 b n 3 3 c m 4 4 d l 5 5 e k # Create database objects to correspond to in-memory R objects df1 and df2 ore.df1 <- ore.create(df1, table="DF1") ore.df2 <- ore.create(df2, table="DF2") # Compare results R> merge (DF1, DF2, by.x="X1", by.y="X2") X1 Y1 Y2 0 1 a o 1 2 b n 2 3 c m 3 4 d l 4 5 e k
# Push built-in data set iris to database ore.create(iris, table="IRIS") aggdata <- aggregate(IRIS, by = list(IRIS$SPECIES), FUN = summary) class(aggdata) head(aggdata)
Data formatting and creating derived columns in R
Note that adding derived columns does not change the database table. See Derived Columns in Oracle R Enterprise.
diverted_fmt <- function (x) { ifelse(x==0, 'Not Diverted', ifelse(x==1, 'Diverted','')) } cancellationCode_fmt <- function(x) { ifelse(x=='A', 'A CODE', ifelse(x=='B', 'B CODE', ifelse(x=='C', 'C CODE', ifelse(x=='D', 'D CODE', 'NOT CANCELLED')))) } delayCategory_fmt <- function(x) { ifelse(x>200,'LARGE', ifelse(x>=30,'MEDIUM','SMALL')) } zscore <- function(x) { (x-mean(x,na.rm=TRUE))/sd(x,na.rm=TRUE) # ONTIME_S is a database table ONTIME_S$DIVERTED <- diverted_fmt(DIVERTED) ONTIME_S$CANCELLATIONCODE <- cancellationCode_fmt(CANCELLATIONCODE) ONTIME_S$ARRDELAY <- delayCategory_fmt(ARRDELAY) ONTIME_S$DEPDELAY <- delayCategory_fmt(DEPDELAY) ONTIME_S$DISTANCE_ZSCORE <- zscore(DISTANCE)
The embedded R engine in the Oracle Database allows R users to off load desktop calculations that may require either more resources such as those available to Oracle Database or database-driven data parallelism. The embedded R engine also executes R scripts embedded in SQL or PL/SQL programs (lights-out processing).
These examples illustrate using Oracle R Enterprise embedded R engine with standard R packages downloaded from CRAN:
This example illustrates building a regression model using a CRAN package. Prepare the data used for training in the database (filtering out observations that are not of interest, selecting attributes, imputing missing values, etc.) to create the table ONTIME_S_PREPROCESSED_SUBSET. Pull the prepared training set (which is usually small enough to fit in desktop R memory) into R client to execute the model build. The resulting model is then used to score vast numbers of rows, in parallel, in the Oracle Database.
Note that scoring is a trivially parallelizable operation because one row can be scored independent of and in parallel with another row. The model built on the desktop is shipped to the database to perform scoring on vast number of rows in the database.
The computations are divided into these steps:
dat <- ore.pull(ONTIME_S_PREPROCESSED_SUBSET) mod <- glm(ARRDELAY ~ DISTANCE + DEPDELAY, dat) mod summary(mod)
Score in-parallel in the database using embedded R:
prd <- predict(mod, newdata=ONTIME_S_FINAL_DATA_TO_BE_SCORED) class(prd) # Add predictions as a new column res <- cbind(newdat, PRED = prd) head(res)
This example illustrates off-loading R computation to execute in the embedded R engine. The R user simply includes his code within a closure (that is, function() {}
) and invokes ore.doEval()
. ore.doEval()
schedules execution of the R code with the database-embedded R engine and returns the results back to the desktop for continued analysis:
mod <- ore.doEval( function() { library(biglm) dat <- ore.pull(ONTIME_S) mod <- biglm(ARRDELAY ~ DISTANCE + DEPDELAY, dat) mod }); print(mod) mod=ore.pull(mod) print(mod)
This example illustrates database-driven data parallelism at work in building a series of regression models using a CRAN package. One model is built per unique value of a factor. The database orchestrates parallel and concurrent building of the models, one per factor and bringing the list of all models built to the user desktop for further analysis:
modList <- ore.groupApply( # Organize input to the R script – This is always an Oracle R Enterprise # data frame X=ONTIME_S, # Specify the grouping column. Here we request one model per unique value of # ONTIME_S$DEST INDEX=ONTIME_S$DEST, # Model building code goes inside the closure. Input and grouping # conditions can be referenced as parameters to the function function(x, param) { library(biglm) biglm(ARRDELAY ~ DISTANCE + DEPDELAY, x) }); modList_local <- ore.pull(modList) # Print the model for just one destination - BOSTON summary(modList_local$BOS)
These functions are available to enable richer statistical analysis. See the Oracle R Enterprise Sample Library described in Oracle R Enterprise Examples for usage examples of each function. The functions all operate on an Oracle R Enterprise data frame:
ore.summary
: Enables powerful multiple aggregations of columns
ore.rank
: Enables flexible ranking across multiple columns
ore.sort
: Enables flexible sorting along one or more columns
ore.corr
: Enables correlation analysis of numeric columns
ore.crosstab
: Enables cross-column analysis
ore.freq
: Enables cross tabulation analysis of numeric columns
Oracle R Enterprise users who are allowed to execute R code via SQL queries must be granted the RQROLE role.
To enable execution of an R script in the database (lights-out processing), Oracle R Enterprise provides variants of ore.doEval()
, ore.groupApply()
, and ore.indexApply()
in SQL. (ore.doEval()
, ore.groupApply()
, and ore.indexApply()
are described in Oracle R Enterprise Database-Embedded R Engine.)
The SQL functions are
rqTableEval()
rqEval()
rqRowEval()
You can also code an rqGroupEval() Function function.
The rq*:Eval()
functions have the same syntax:
rq*Eval( cursor(select * from table-1, cursor(select * from table-2, 'select <column list> from table-3 t', <grouping col-name from table-1 or num_rows>, 'function(x,param) { registered-R-code ] ')
where
The first cursor is the input cursor: Input is passed as a whole table, group, or one row as a time to the R closure described in the fourth parameter.
The second cursor is the parameters cursor: One value can be passed (that is, collection of the models to be implemented.
The query specifies the output table definition; if this parameter is NULL, output is a BLOB; output can also be XML.
grouping col-name
is optional; it provides the name of the grouping column
num_rows
is optional; it provides then number of rows to provide to the functions at one time.
registeredR-code
is a registered version of the R function to execute. See Registering R Scripts for details.
The following examples illustrate using these functions:
This example uses all rows from the table fish
as input to the R function that takes no other parameters and produces output
that contains all input data plus the ROWSUM of values.
Note that both input(x)
and parameters (param
) to the R function is optional.
select * from table(rqTableEval( cursor(select * from fish), NULL, 'select t.*, 1 rowsum from fish t', 'function(x, param) { dat <- data.frame(x, stringsAsFactors=F) cbind(dat, ROWSUM = apply(dat,1,sum)) }'));
This example illustrates passing n=1 (4th parameter) row at a time from the table fish
to the R function. No parameters are required by the function. The function generates ROWSUM which is added as an extra column to fish
in the output.
select * from table(rqRowEval( cursor(select * from fish), NULL, 'select t.*, 1 rowsum from fish t', 1, 'function(x, param) { dat <- data.frame(x, stringsAsFactors=F) cbind(dat, ROWSUM = apply(dat,1,sum)+10) }'));
There is no rqGroupEval
() function as such. You must define a private version of rqGroupEval
() based on the data and grouping column. This is the limitation of the table function infrastructure.
Here is an example based on the ONTIME_S sample data. The data cursor uses all data, but you could also define cursors that use some columns using PL/SQL records. Then you must define as many table functions as the number of grouping columns that you are interested in using for a particular data cursor:
CREATE PACKAGE ontimePkg AS TYPE cur IS REF CURSOR RETURN ontime_s%ROWTYPE; END ontimePkg; / CREATE FUNCTION ontimeGroupEval( inp_cur ontimePkg.cur, par_cur SYS_REFCURSOR, out_qry VARCHAR2, grp_col VARCHAR2, exp_txt CLOB) RETURN SYS.AnyDataSet PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH (month)) CLUSTER inp_cur BY (month) USING rqGroupEvalImpl; /
At this time, only one grouping column is supported. If you have multiple columns you combine the columns into one column and use the new column as a grouping column. PARALLEL_ENABLE clause is optional but CLUSTER BY is not.
For security purposes, you must first register the R script under some system unique name and use new name instead of the actual script in the call to rq*Eval
table functions.
There are two administrative functions that create and drop scripts and a view that lists scripts:
sys.rqScriptCreate
sys.rqScriptDrop
sys.rq_scripts
view allows you to list and use scripts that were created
The scripts and the view require grants as described in Roles Required to Create and Use Scripts.
Here is an example of registering the scripts and using the registered scripts:
begin sys.rqScriptCreate('tmrqfun2', 'function() { ID <- 1:10 res <- data.frame(ID = ID, RES = ID / 100) res }'); end; / select * from table(rqEval( NULL, 'select 1 id, 1 res from dual', 'tmrqfun2')); begin sys.rqScriptDrop('tmrqfun2'); end;
To execute sys.rqScriptCreate
and sys.rqScriptDrop
, you must be granted the administrative role RQADMIN.
Select privilege for the sys.rq_scripts
view is granted to RQROLE role.
The RQADMIN and RQROLE role are created when you Install Server on Linux
Oracle R Enterprise is shipped with a collection of examples that illustrate how to use Oracle R Enterprise. These examples are a collection of self-contained R scripts.
Most of the sample programs use the data frame iris
, which is included in the R distribution. iris
is loaded into a table as described in Load Data Frame to a Table.
The rest of this section describes two examples in detail and includes a list of all examples:
Follow these steps to load an R data frame to a database table:
Starts R, load the ORE packages via library(ORE), and then connect to the database. The latter steps are automatic if Rprofile is in place.
Most of these examples use the R data frame iris
. iris is shipped with R. Use the R command class
to verify that iris
is an R data frame:
R> class(iris) [1] "data.frame"
iris
consist of measurements of parts of iris flowers. Use the R command head
to see a small sample of the data in iris
.
R> head(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa 5 5.0 3.6 1.4 0.2 setosa 6 5.4 3.9 1.7 0.4 setosa
Now load the data frame iris
into the database that you are connected to.
In these examples, the database table version of iris
is named IRIS_TABLE. Drop IRIS_TABLE to make sure that no table of this name exists in the connected schema:
ore.drop(table = "IRIS_TABLE")
If IRIS_TABLE doesn't exist, you do not get a message.
Now create a database table with the data contained in iris
:
ore.create(iris, table = "IRIS_TABLE")
Use ore.ls()
to verify that the table was created:
R> ore.ls() [1] "IRIS_TABLE" "NARROW" "ONTIME_S"
IRIS_TABLE is a database-resident table with just metadata on the R side:
R> class(IRIS_TABLE) [1] "ore.frame" attr(,"package") [1] "OREbase"
Use head
to see the column names and the first few values in IRIS_TABLE:
R> head(IRIS_TABLE) SEPAL_LENGTH SEPAL_WIDTH PETAL_LENGTH PETAL_WIDTH SPECIES 0 5.1 3.5 1.4 0.2 setosa 1 4.9 3.0 1.4 0.2 setosa 2 4.7 3.2 1.3 0.2 setosa 3 4.6 3.1 1.5 0.2 setosa 4 5.0 3.6 1.4 0.2 setosa 5 5.4 3.9 1.7 0.4 setosa
Use mode to see the data type of the column SPECIES.
mode(IRIS_TABLE$SPECIES) [1] "raw"
Some algorithms only work if all of the data is numerical. Follow these steps to create IRIS_TABLE_N that does not contain SPECIES, the nonnumeric column:
IRIS_TABLE_N=IRIS_TABLE[,c("SEPAL_LENGTH", "SEPAL_WIDTH", "PETAL_LENGTH", "PETAL_WIDTH")]
You can use R functions to analyze the data in the table. Here are some simple examples taken from the example basic.R
:
Use unique
to get a list of the unique entries in a column. This example find the unique SPECIES:
R> unique(IRIS_TABLE$SPECIES) [1] setosa versicolor virginica Levels: setosa versicolor virginica
Find the minimum, maximum, and mean of PETAL_LENGTH:
R> min(IRIS_TABLE$PETAL_LENGTH) [1] 1 R> max(IRIS_TABLE$PETAL_LENGTH) [1] 6.9 R> mean(IRIS_TABLE$PETAL_LENGTH) [1] 3.758
If you need information about an R function, use the command help(function-name)
.
The sample null.R
is the only sample that does not use iris
as data. null.R
compares the handling of NULLs in SQL with the handling of NAs in R.
In R, NA is a logical constant of length 1 which contains a missing value indicator. In the database, null refers to the absence of a value in a column of a row. Nulls indicate missing, unknown, or inapplicable data.
Follow these steps to execute the sample:
This example uses the data frame airquality
. Verify that the data set is a data frame and look at the few rows of the data frame:
R> class(airquality) [1] "data.frame" R> head(airquality) Ozone Solar.R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28 NA 14.9 66 5 6
Load airquality
into the database as "AIRQUALITY":
ore.drop(table = "AIRQUALITY") ore.create(airquality, table = "AIRQUALITY")
Use ore.ls()
to verify that the table was created. If you wish, use class(AIRQUALITY)
to verify that AIRQUALITY
is a database-resident table with just metadata on the R side.
Examine how R handles NAs. Return all observations where ozone < 30:
R> nrow(airquality[airquality$Ozone < 30,])[1] 92
Compare this with the results when NAs are explicitly excluded:
R> nrow(airquality[airquality$Ozone < 30 & !is.na(airquality$Ozone),]) [1] 55
The default behavior for SQL tables is to exclude NULLS in output:
nrow(AIRQUALITY[AIRQUALITY$OZONE < 30,]) [1] 55
To handle NULLs the same way that R handles NA, request the behavior explicitly:
options(ore.na.extract = TRUE) nrow(AIRQUALITY[AIRQUALITY$OZONE < 30,]) [1] 92
These scripts have been added as demos to the ORE package.
To access a complete listing of them type
R> demo(package = "ORE")
To run one of these scripts, specify the name of the demo in a demo
function call. For example, to run aggregate.R
, type
R> demo("aggregate", package = "ORE")
These examples are shipped with Oracle R Enterprise:
table_apply.R Execute R code on all rows of a table passed in at once aggregate.R Demonstrates aggregations. See also summary.R analysis.R Demonstrates basic analysis and data processing operations basic.R Demonstrates basic connectivity to database binning.R Demonstrates binning in R columnfns.R Demonstrates use of column functions corr.R Correlation matrix (Pearson's, Spearman/Kendalls) crosstab.R Frequency cross-tabulations. Also see freq.R derived.R Handling derived columns distributions.R Distribution, Density, and Quantile Functions doEval.R Demonstrates support for database-enabled parallel simulations freqanalysis.R Frequency cross-tabulations. Also see crosstab.R graphics.R Demonstrates visual analysis (boxplot, histogram) group_apply.R Execute R code for different sets of rows, one set per group hypothesis.R Hypothesis Testing Functions(binomial, chi square, T test, etc.) matrix.R Matrix operations nulls.R Demonstrates handling of nulls in SQL vs. NAs in R push_pull.R Demonstrates collaborative processing between database and client rank.R Ranking of observations (ranking, handling ties, etc.) reg.R Multivariate Regression row_apply.R Execute R code on each row sql_like.R Demonstrates how R commands map to SQL operations stepwise.R Stepwise Multivariate Regression summary.R Demonstrates summary functionality