Oracle® R Enterprise User's Guide Release 11.2 for Linux, Solaris, AIX, and Windows Part Number E26499-05 |
|
|
PDF · Mobi · ePub |
This chapter describes Oracle R Enterprise functions that perform most common or base statistical procedures. These functions are designed to help users who are converting from commercially available products to Oracle R Enterprise.
Oracle R Enterprise provides these collections of functions:
The use of the functions is illustrated with examples. Most of the examples use the same data, described in Data for Examples.
Most of the examples use the table NARROW, which is installed in your database when you install with Oracle R Enterprise.
NARROW is an ore.frame with 9 columns:
R> class(NARROW) [1] "ore.frame" attr(,"package") [1] "OREbase" R> names(NARROW) [1] "ID" "GENDER" "AGE" "MARITAL_STATUS" [5] "COUNTRY" "EDUCATION" "OCCUPATION" "YRS_RESIDENCE" [9] "CLASS"
ore.corr
performs correlation analysis across numeric columns in an ore.frame.
ore.corr
supports partial correlations with a control column.
ore.corr
enables aggregations prior to correlations.
ore.corr
allows post-processing of results and integration into an R code flow.
The output of ore.corr
can be made to conform to output of the R cor()
function; this allows the output of ore.corr to be post-processed by any R function or graphics.
See ore.corr Parameters for syntax and output and ore.corr Examples for examples.
ore.corr
has these parameters:
data: The data for which to compute correlation coefficients as an ore.frame.
var: The numeric column(s) of data for which to build correlation matrix
group.by: Indicates the correlation matrices to calculate; ore.corr
calculates as many correlation matrices as unique values in group.by columns; default value is NULL
weight: A column of the data whose numeric values provide a multiplicative factor for var columns; default value is NULL
partial: columns of data to use as control variables for partial correlation; default value is NULL
stats: The method of calculating correlations; one of pearson (default), spearman, kendall
ore.corr
returns an ore.frame as output in all cases except when group.by is used. If group.by is used, returns an Oracle R Enterprise list object.
To convert the output of ore.corr
into R cor()
-compatible output format, use:
OREeda:::ore.corr.as.matrix()
These examples show how to use ore.corr
:
These examples use the NARROW data set; for more information, see Data for Examples.
Before you can use ore.corr
, you must project out all non-numerical values:
R> names(NARROW) [1] "ID" "GENDER" "AGE" "MARITAL_STATUS" "COUNTRY" "EDUCATION" "OCCUPATION" [8] "YRS_RESIDENCE" "CLASS" "AGEBINS" R> NARROW=NARROW[,c(3,8,9)]
Now calculate correlation in several ways:
R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS') #Calculate using Spearman R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='spearman') # Calculate using Kendall R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='kendall') # Convert output so that it is compatible with output of R cor R> cor_compatible_matrix = OREeda:::.ore.corr.as.matrix(x) R> class(cor_compatible_matrix) [1] "matrix"
Use the version of NARROW with non-numeric values that was created in Basic Correlation Calculations.
Calculate partial correlation using Spearman's methods:
R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='spearman', partial='GENDER')
Use the version of NARROW with non-numeric values that was created in Basic Correlation Calculations.
Create several correlation matrices and then convert the output so that it is compatible with R output:
R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='pearson', partial='GENDER', group.by='COUNTRY') R> class(x) [1] "list" R> cor_compatible_matrix = OREeda:::.ore.corr.as.matrix(x[[1]])
If you calculate several matrices, you can use R packages to visualize them.
Cross tabulation is a statistical technique that finds an interdependent relationship between two tables of values.
ore.crosstab
enables cross column analysis of an ore.frame. This function is a sophisticated variant of the R table()
function.
ore.crosstab
must be performed before frequency analysis is done using ore.freq.
You can extend the cross tab calculation with various sums as described in ore.extend.
ore.crosstab
is written in R. The function is mapped to SQL that gets executed at the database server.
See ore.crosstab Parameters for syntax and output and ore.crosstab Examples for examples.
You can use ore.extend to augment crass tabulation.
ore.crosstab
has these parameters:
expr: the cross tabulation definition
[COLUMN_SPEC] ~ COLUMN_SPEC [*<WEIGHTING COLUMN>] [/<GROUPING COLUMN>] [^<STRATIFICATION COLUMN>] [|ORDER_SPECIFICATION] COLUMN_SPEC is <column-name>[+COLUMN_SET][+COLUMN_RANGE] COLUMN_SET is <column_name>[+COLUMN_SET] COLUMN_RANGE is <FROM COLUMN>-<TO COLUMN
where
COLUMN_SPEC is <column>[+COLUMN_SET][+COLUMN_RANGE] COLUMN_SET is <column>[+COLUMN_SET] COLUMN_RANGE is (<from column>-<to column>) ORDER_SPECIFICATION is one of [-]NAME, [-]DATA, [-]FREQ, or INTERNAL
The stratification column is used to cluster, or group, data. When used, the values contribute to the ORE$STRATA column of the resulting cross-tabulated table.
data: the ore.frame containing the data to cross tabulate
grouping column: as many cross tabulations as unique values in grouping columns; default value is NULL
order: defines optional sorting of output data. Specify [-]NAME to sort by tabulation columns, [-]FREQ to sort by frequency counts in table. Unspecified order is the most efficient. The optional '-' reverses the order direction.
weights: column of the data that indicates the frequency of the corresponding row; default value is NULL
partial: columns of data to use as control variables for partial correlation; default value is NULL
ore.crosstab
returns an ore.frame as output in all cases except when multiple tables are created. If multiple tables are created, ore.crosstab returns an Oracle R Enterprise list object.
These examples illustrate use of ore.crosstab
:
These examples use the NARROW data set; for more information, see Data for Examples.
The most basic use case is to create a single column frequency table. The following command filters NARROW) grouping by GENDER:
R> ct = ore.crosstab(AGE, data=NARROW) R> ct
This command analyses AGE by GENDER aid AGE by CLASS:
R> ct = ore.crosstab(AGE~GENDER+CLASS, data=NARROW) R> head(ct)
To weight rows, include count based on another column; this example weights values in AGE and GENDER using values in YRS_RESIDENCE:
R> ct = ore.crosstab(AGE~GENDER*YRS_RESIDENCE, data=NARROW) R> head(ct)
There are several possibilities:
Default or NAME- Order by the columns being analyzed
FREQ - Order by frequency counts
-NAME or -FREQ does reverse ordering
INTERNAL - Bypass ordering
Here are two examples:
R> ct = ore.crosstab(AGE~GENDER|FREQ, data=NARROW) R> head(ct) AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP R> ct = ore.crosstab(AGE~GENDER|-FREQ, data=NARROW) R> head(ct)
This is similar to what SQL GROUPING SETs accomplish:
ct = ore.crosstab(AGE+COUNTRY~GENDER, NARROW)
You can specify a range of columns instead of having to type all the column names, as illustrated in this example:
R> names(NARROW) [1] "ID" "GENDER" "AGE" "MARITAL_STATUS" [5] "COUNTRY" "EDUCATION" "OCCUPATION" "YRS_RESIDENCE" [9] "CLASS"
Since AGE, MARITAL_STATUS and COUNTRY are successive columns, you can simply use
ct = ore.crosstab(AGE-COUNTRY~GENDER, NARROW)
An equivalent version is
ct = ore.crosstab(AGE+MARITAL_STATUS+COUNTRY~GENDER, NARROW)
This command produces one cross table (AGE, GENDER) for each unique value of another column COUNTRY:
R> ct=ore.crosstab(~AGE/COUNTRY, data=NARROW) R> head(ct)
You can extend this to more than one column. For example, this command produces one (AGE,EDUCATION) table for each unique combination of (COUNTRY, GENDER):
R> ct = ore.crosstab(AGE~EDUCATION/COUNTRY+GENDER, data=NARROW)
All of the above cross tabs can be augmented with stratification. For example,
R> ct = ore.crosstab(AGE~GENDER^CLASS, data=NARROW) R> head(ct)
The command in this example is the same as
ct = ore.crosstab(~GENDER, NARROW, strata="CLASS")
First bin AGE, the calculate cross tabulation for GENDER and the bins:
R> NARROW$AGEBINS=ifelse(NARROW$AGE<20, 1, ifelse(NARROW$AGE<30,2, ifelse(NARROW$AGE<40,3,4))) R> ore.crosstab(GENDER~AGEBINS, NARROW)
The cross tabulation produced using ore.crosstab can be further augmented with these three basic statistics:
Row and Column Sums
crosstab = ore.extend.sum(crosstab)
Cumulative sums for each cell of the table
crosstab = ore.extend.cumsum(crosstab)
Total for the entire table
crosstab = ore.extend.total(crosstab)
The following example illustrates ore.extend
:
R> ct = ore.crosstab(~GENDER, NARROW) R> ct = ore.extend.sum(ct) R> ct GENDER ORE$FREQ ORE$STRATA ORE$GROUP ORE$SUM$GENDER 0 F 421 1 1 421 1 M 880 1 1 880
ore.crosstab
must be performed before frequency analysis is done using ore.freq.
ore.freq
analyses the output of ore.crossta
b and automatically determines the techniques that are relevant to an ore.crossta
b result. The techniques depend on the kind of cross tables:
1-way cross tables
Goodness-of-fit tests for equal proportions or specified null proportions, confidence limits and tests for equivalence.
2-way cross tables
Various statistics that describe relationships between columns in the cross tabulation
Chi-square tests, Cochran-Mantel-Haenzsel statistics, measures of association, strength of association, risk differences, odds ratio and relative risk for 2x2 tables, tests for trend
N-way cross tables
N 2-way cross tables
Statistics across and within strata
ore.freq
uses Oracle Database SQL functions when available.
See ore.freq Parameters for syntax and output and ore.freq Examples for examples.
ore.freq
supports these parameters:
crosstab: ore.frame output from ore.crosstab()
stats: List of statistics required; these statistics are supported:
Chi Square: AJCHI, LRCHI, MHCHI, PCHISQ
Kappa: KAPPA, WTKAP
Lambda: LAMCR, LAMRC, LAMDAS
Correlation: KENTB,PCORR, SCORR
Stuart's Tau, Somers: D|C, STUTC, SMDCR,SMDRC
Fisher's, Cochran's Q, FISHER, COCHQ
Odds Ratio: OR, MHOR, LGOR
Relative Risk: RR,MHRR,ALRR
Others: MCNEM, PHI, CRAMV, CONTGY, TSYM, TREND, GAMMA
The default value is NULL.
Params: Control parameters specific to the statistical function specified in stats:
SCORE: TABLE|RANK|RIDIT|MODRIDIT
ALPHA: number
WEIGHTS: number
The default value is NULL.
skip.missing: Either TRUE or FALSE; skip cells with missing values in the cross table; default value is FALSE
skip.failed: Either TRUE or FALSE; if a statistical test required fails on the cross table because it is found to be in-applicable to the table then return immediately; ; default value is FALSE
ore.freq
returns an ore.frame in all cases.
These examples use the NARROW data set; for more information, see Data for Examples.
Before you use ore.freq
, you must calculate cross tabs.
For example:
R> ct = ore.crosstab(~GENDER, NARROW) R> ore.freq(ct) METHOD FREQ DF PVALUE DESCR GROUP 0 PCHI 161.9377 1 0 Chi-Square 1
ore.rank
analyzes distribution of values in numeric columns of an ore.frame.
ore.rank
supports useful functionality, including:
Ranking within groups
Partitioning rows into groups based on rank tiles
Calculation of cumulative percentages and percentiles
Treatment of ties
Calculation of normal scores from ranks
ore.rank
syntax is simpler that the corresponding SQL queries.
See ore.rank Parameters for syntax and ore.rank Examples for examples.
ore.rank
supports these parameters:
data: The ore.frame containing the data to rank
var: numeric columns in data to rank
desc: If desc=TRUE
, rank in descending order; otherwise, rank in ascending order. (The default is to rank in ascending order.)
groups: Partition rows into #groups based on ranks. For percentiles, groups=100
, For deciles, groups=10
, For quartiles, groups=4.
The default value is NULL.
group.by: Rank each group identified by group.by
columns separately
The default value is NULL.
ties: Specify how to treat ties. Assign the largest of, or smallest of, or mean of corresponding ranks to tied values
The default value is NULL.
fraction: The rank of a column value divided by the number of non-missing column values; the default value is FALSE.
Use with nplus1
to estimate the cumulative distribution function
nplus1: fraction plus 1, that is, 1 plus the rank of a column value divided by the number of non-missing column values; the default value is FALSE.
Use with fraction
to estimate the cumulative distribution function
percent: fraction converted to a percent value, that is fraction * 100.
ore.rank
returns anore.frame in all instances.
You can use these R scoring methods with ore.rank
:
To compute exponential scores from ranks, use savage
.
To compute normal scores, use one of blom,
tukey
, or vw
(van der Waerden).
These examples illustrate using ore.rank
:
These examples use the NARROW data set; for more information, see Data for Examples.
This example ranks the two columns AGE and CLASS and reports the results as derived columns; values are ranked in the default order (ascending):
R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass')
This example ranks the two columns AGE and CLASS. If there is a tie, the smallest value is signed to all tied values:
R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', ties='low')
This example ranks the two columns AGE and CLASS and ranks the values according to COUNTRY:
R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', group.by='COUNTRY')
This example ranks the two columns AGE and CLASS and partitions the columns into deciles (10 partitions):
R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',groups=10)
To partition the columns into a different number of partitions, change the value of groups
. For example, groups=4
partitions into quartiles.
This example ranks the two columns AGE and CLASS and estimates the cumulative distribution function for both columns:
R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',nplus1=TRUE)
This example ranks the two columns AGE and CLASS and scores the ranks in tow different ways. The first command all partitions the columns into percentiles (100 groups). savage
calculates exponential scores and blom
calculates normal scores:
R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',score='savage', groups=100, group.by='COUNTRY') R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',score='blom')
ore.sort
enables flexible sorting of a data frame along one or more columns specified in a by clause.
ore.sort
can be used with other data pre-processing functions. The results of sorting can provide input to R visualization.
ore.sort
sorting takes places in the Oracle database. ore.sort
supports the database nls.sort
option.
See ore.sort Parameters for syntax and ore.sort Examplesfor examples.
ore.sort
supports these parameters:
data: ore.frame containing the data to be sorted; required
by: the column(s) in data by which to sort the data; required
stable: Relative order is maintained within sorted group (TRUE or FALSE); default value is FALSE
reverse: Optional reversal of collation order for character variables (TRUE or FALSE); default value is FALSE
unique.keys: Optional deletion of observations with duplicate values in the columns being sorted, TRUE or FALSE; default value is FALSE
unique.data: Optional deletion of observations duplicate values in all columns, TRUE or FALSE; default value is FALSE
data and by are required parameters; all other parameters are optional
ore.sort
returns an ore.frame.
The following examples illustrate using ore.sort
:
Most of these examples use the NARROW data set; for more information, see Data for Examples. There are also Examples Using ONTIME_S.
Sort the columns AGE and GENDER in descending order:
R> x=ore.sort(data=NARROW,by='AGE,GENDER', reverse=TRUE)
Sort AGE in descending order and GENDER in ascending order:
R> x=ore.sort(data=NARROW,by='-AGE,GENDER')
Sort by AGE and keep one row per unique value of AGE:
R> x=ore.sort(data=NARROW,by='AGE', unique.key=TRUE)
Sort by AGE and remove duplicate rows:
R> x=ore.sort(data=NARROW,by='AGE', unique.data=TRUE)
Sort by AGE. Also remove duplicate rows, and return one row per unique value of AGE:
R> x=ore.sort(data=NARROW,by='AGE', unique.data=TRUE,unique.key = TRUE)
Maintain the relative order in the sorted output:
R> x=ore.sort(data=NARROW,by='AGE', stable=TRUE)
These examples use the ONTIME_S airline data that is installed when you install Oracle R Enterprise:
Sort ONTIME_S by airline name in descending order and departure delay in ascending order:
R> sortedOnTime1 <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY')
Sort ONTIME_S by airline name and departure delay and select one of each combination (that is, return a unique key):
R> sortedOnTime1 <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY', unique.key=TRUE)
ore.summary
calculates descriptive statistics and supports extensive analysis of columns in an ore.frame, along with flexible row aggregations.
ore.summary
supports these statistics:
Mean, min., max, mode, number of missing values, sum, weighted sum
Corrected and uncorrected sum of squares, range of values, stddev, stderr, variance
t-test for testing the hypothesis that the population mean is 0
Kurtosis, skew, Coefficient of Variation
Quantiles: p1, p5, p10, p25, p50, p75, p90, p95, p99, qrange
1-sided and 2-sided Confidence Limits for the mean: clm, rclm, lclm
extreme value tagging
ore.summary
provides a relatively simple syntax compared with SQL queries for the same results.
See ore.summary Parameters for syntax and ore.summary Examples for examples.
ore.summary
supports these parameters:
data: the data to aggregate as an ore.frame
class: column(s) of data to aggregate (that is, SQL GROUP BY); default value is NULL
var: column(s) of data on which to apply statistics functions (SQL SELECT list)
stats: list of statistics functions to be applied on var columns
mean, min, max, cnt, n, nmiss, css, uss, cv, sum, sumwgt, range, stddev, stderr, var, t, kurt, skew,p1, p5, p10, p25, p50, p75, p90, p95, p99, qrange, lclm, rclm, clm, mode that can be requested on var columns.
The default value are n, mean, min, max.
weight: A column of data whose numeric values provide a multiplicative factor for var columns
maxid, minid: for each group optionally list maximum or minimum value from other columns in data; default value is NULL
ways: restrict output to only certain grouping levels of the class variables; default value is NULL
group.by: column(s) of data to stratify summary results across; default value is NULL
order: defines optional sorting of output data. Specify [-]NAME to sort by tabulation columns, [-]FREQ to sort by frequency counts in table. Unspecified order is the most efficient. The optional '-' reverses the order direction
_FREQ: frequency, number of observations in a group
_TYPE: identifies the grouping, binary code based
_LEVEL: identifies number of variables used in grouping
ore.summary
returns an ore.frame as output in all cases except when a group.by clause is used. If a group.by clause is used, ore.summary
returns a list of ore.frames, one frame per stratum.
These examples illustrate the use of ore.summary
:
These examples use the NARROW data set; for more information, see Data for Examples.
This example calculates mean, min, max for columns AGE and CLASS and rolls up (aggregates) GENDER:
R> ore.summary(NARROW, class='GENDER', var ='AGE,CLASS'. order='freq')
This example calculates skew for skew of AGE as column A and the t-test for CLASS as column B:
R> ore.summary(NARROW, class='GENDER', var='AGE,CLASS', stats='skew(AGE)=A, probt(CLASS)=B')
This example calculates weighted sum for AGE aggregated by GENDER with YRS_RESIDENCE as weights; in other words, it calculates sum(var*weight)
:
R> ore.summary(NARROW, class='GENDER', var='AGE', stat='sum=X', weight='YRS_RESIDENCE')
Group CLASS by GENDER and MARITAL_STATUS:
r> ore.summary(NARROW, class='GENDER, MARITAL_STATUS', var='CLASS', ways=1)
ore.univariate
provides distribution analysis of numeric variables in an ore.frame.
ore.univariate
provides these statistics:
All statistics reported by ore.summary
Signed rank test, Student's t-test
Extreme values reporting
See ore.univariate Parameters for syntax and ore.univariate Examples for examples.
ore.univariate
supports these parameters:
data: The data to aggregate as an ore.frame
var: Numerical column(s) of data to analyze
weight: A column of the data whose numeric values provide a multiplicative factor for var columns; the default value is NULL
stats: Optional specification of a subset of statistics to calculate and display:
moments: n, sumwgt, mean, sum, stddev, var, skew, kurt., uss.css.cv, stderr
measures: mean, stddev, median, var, mode, range, iqr
quantiles: p100, p99, p95, p90, p75, p50, p25, p10, p5, p1, p0
location: studentt, studentp, signt, signp, srankt, srankp
normality
loccount: loc<,loc>,loc!
extremes
The default value is NULL.
ore.univariate
returns anore.frame as output in all cases.
These examples illustrate the use of ore.univariate
:
These examples use the NARROW data set; for more information, see Data for Examples.
This example calculates the default univariate statistics for AGE, YRS_RESIDENCE, and CLASS:
R> ore.univariate(NARROW, var="AGE,YRS_RESIDENCE,CLASS")
This example calculates location statistics for YRS_RESIDENCE:
R> ore.univariate(NARROW, var="YRS_RESIDENCE",stats="location")
This example calculates complete quantile statistics for AGE and YRS_RESIDENCE:
R> ore.univariate(NARROW, var="AGE,YRS_RESIDENCE",stats="quantiles")