4 Oracle R Enterprise Statistical Functions

This chapter describes Oracle R Enterprise functions that perform most common or base statistical procedures. These functions are designed to help users who are converting from commercially available products to Oracle R Enterprise.

Oracle R Enterprise provides these collections of functions:

ore.corr
ore.crosstab
ore.extend
ore.freq
ore.rank
ore.sort
ore.summary
ore.univariate

The use of the functions is illustrated with examples. Most of the examples use the same data, described in Data for Examples.

Data for Examples

Most of the examples use the table NARROW, which is installed in your database when you install with Oracle R Enterprise.

NARROW is an ore.frame with 9 columns:

R> class(NARROW)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> names(NARROW)
[1] "ID"             "GENDER"         "AGE"            "MARITAL_STATUS"
[5] "COUNTRY"        "EDUCATION"      "OCCUPATION"     "YRS_RESIDENCE" 
[9] "CLASS"

ore.corr

ore.corr performs correlation analysis across numeric columns in an ore.frame.

ore.corr supports partial correlations with a control column.

ore.corr enables aggregations prior to correlations.

ore.corr allows post-processing of results and integration into an R code flow.

The output of ore.corr can be made to conform to output of the R cor() function; this allows the output of ore.corr to be post-processed by any R function or graphics.

See ore.corr Parameters for syntax and output and ore.corr Examples for examples.

ore.corr Parameters

ore.corr has these parameters:

data: The data for which to compute correlation coefficients as an ore.frame.
var: The numeric column(s) of data for which to build correlation matrix
group.by: Indicates the correlation matrices to calculate; ore.corr calculates as many correlation matrices as unique values in group.by columns; default value is NULL
weight: A column of the data whose numeric values provide a multiplicative factor for var columns; default value is NULL
partial: columns of data to use as control variables for partial correlation; default value is NULL
stats: The method of calculating correlations; one of pearson (default), spearman, kendall

ore.corr returns an ore.frame as output in all cases except when group.by is used. If group.by is used, returns an Oracle R Enterprise list object.

To convert the output of ore.corr into R cor()-compatible output format, use:

OREeda:::ore.corr.as.matrix()

ore.corr Examples

These examples show how to use ore.corr:

Basic Correlation Calculations
Partial Correlation
Create Several Correlation Matrices
Visualization of Correlations

These examples use the NARROW data set; for more information, see Data for Examples.

Basic Correlation Calculations

Before you can use ore.corr, you must project out all non-numerical values:

R> names(NARROW)
 [1] "ID"             "GENDER"     "AGE"    "MARITAL_STATUS" 
"COUNTRY"        "EDUCATION"      "OCCUPATION"
 [8] "YRS_RESIDENCE"  "CLASS"          "AGEBINS"
R> NARROW=NARROW[,c(3,8,9)]

Now calculate correlation in several ways:

R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS')
#Calculate using Spearman
R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='spearman')
# Calculate using Kendall
R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='kendall')
# Convert output so that it is compatible with output of R cor
R> cor_compatible_matrix = OREeda:::.ore.corr.as.matrix(x)
R> class(cor_compatible_matrix)
[1] "matrix"

Partial Correlation

Use the version of NARROW with non-numeric values that was created in Basic Correlation Calculations.

Calculate partial correlation using Spearman's methods:

R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='spearman', partial='GENDER')

Create Several Correlation Matrices

Use the version of NARROW with non-numeric values that was created in Basic Correlation Calculations.

Create several correlation matrices and then convert the output so that it is compatible with R output:

R> x=ore.corr(NARROW,var='AGE,YRS_RESIDENCE,CLASS', stats='pearson', partial='GENDER', group.by='COUNTRY')
R> class(x)
[1] "list"
R> cor_compatible_matrix = OREeda:::.ore.corr.as.matrix(x[[1]])

Visualization of Correlations

If you calculate several matrices, you can use R packages to visualize them.

ore.crosstab

Cross tabulation is a statistical technique that finds an interdependent relationship between two tables of values.

ore.crosstab enables cross column analysis of an ore.frame. This function is a sophisticated variant of the R table() function.

ore.crosstab must be performed before frequency analysis is done using ore.freq.

You can extend the cross tab calculation with various sums as described in ore.extend.

ore.crosstab is written in R. The function is mapped to SQL that gets executed at the database server.

See ore.crosstab Parameters for syntax and output and ore.crosstab Examples for examples.

You can use ore.extend to augment crass tabulation.

ore.crosstab Parameters

ore.crosstab has these parameters:

expr: the cross tabulation definition

[COLUMN_SPEC] ~ COLUMN_SPEC [*<WEIGHTING COLUMN>] [/<GROUPING COLUMN>]
 [^<STRATIFICATION COLUMN>] [|ORDER_SPECIFICATION]
        COLUMN_SPEC is <column-name>[+COLUMN_SET][+COLUMN_RANGE]
        COLUMN_SET is <column_name>[+COLUMN_SET]
        COLUMN_RANGE is <FROM COLUMN>-<TO COLUMN

where

COLUMN_SPEC is <column>[+COLUMN_SET][+COLUMN_RANGE]
COLUMN_SET is <column>[+COLUMN_SET]
COLUMN_RANGE is (<from column>-<to column>)
ORDER_SPECIFICATION is one of [-]NAME, [-]DATA, [-]FREQ, or INTERNAL

The stratification column is used to cluster, or group, data. When used, the values contribute to the ORE$STRATA column of the resulting cross-tabulated table.

data: the ore.frame containing the data to cross tabulate
grouping column: as many cross tabulations as unique values in grouping columns; default value is NULL
order: defines optional sorting of output data. Specify [-]NAME to sort by tabulation columns, [-]FREQ to sort by frequency counts in table. Unspecified order is the most efficient. The optional '-' reverses the order direction.
weights: column of the data that indicates the frequency of the corresponding row; default value is NULL
partial: columns of data to use as control variables for partial correlation; default value is NULL

ore.crosstab returns an ore.frame as output in all cases except when multiple tables are created. If multiple tables are created, ore.crosstab returns an Oracle R Enterprise list object.

ore.crosstab Examples

These examples illustrate use of ore.crosstab:

Single-Column Frequency Table
Analyze Two Columns
Weighting Rows
Order Rows in the Cross Tabulated Table
Analyze Three or More Columns
Specify a Range of Columns
Produce One Cross Table for Each Value of Another Column
Augment Cross Tabulation with Stratification
Custom Binning Followed by Cross Tabulation
ore.extend

These examples use the NARROW data set; for more information, see Data for Examples.

Single-Column Frequency Table

The most basic use case is to create a single column frequency table. The following command filters NARROW) grouping by GENDER:

R> ct = ore.crosstab(AGE, data=NARROW)
R> ct

Analyze Two Columns

This command analyses AGE by GENDER aid AGE by CLASS:

R> ct = ore.crosstab(AGE~GENDER+CLASS, data=NARROW)
R> head(ct)

Weighting Rows

To weight rows, include count based on another column; this example weights values in AGE and GENDER using values in YRS_RESIDENCE:

R> ct = ore.crosstab(AGE~GENDER*YRS_RESIDENCE, data=NARROW)
R> head(ct)

Order Rows in the Cross Tabulated Table

There are several possibilities:

Default or NAME- Order by the columns being analyzed
FREQ - Order by frequency counts
-NAME or -FREQ does reverse ordering
INTERNAL - Bypass ordering

Here are two examples:

R> ct = ore.crosstab(AGE~GENDER|FREQ, data=NARROW)
R> head(ct)
  AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP

R> ct = ore.crosstab(AGE~GENDER|-FREQ, data=NARROW)
R> head(ct)

Analyze Three or More Columns

This is similar to what SQL GROUPING SETs accomplish:

ct = ore.crosstab(AGE+COUNTRY~GENDER, NARROW)

Specify a Range of Columns

You can specify a range of columns instead of having to type all the column names, as illustrated in this example:

R> names(NARROW)
[1] "ID"             "GENDER"         "AGE"            "MARITAL_STATUS"
[5] "COUNTRY"        "EDUCATION"      "OCCUPATION"     "YRS_RESIDENCE" 
[9] "CLASS"

Since AGE, MARITAL_STATUS and COUNTRY are successive columns, you can simply use

ct = ore.crosstab(AGE-COUNTRY~GENDER, NARROW)

An equivalent version is

ct = ore.crosstab(AGE+MARITAL_STATUS+COUNTRY~GENDER, NARROW)

Produce One Cross Table for Each Value of Another Column

This command produces one cross table (AGE, GENDER) for each unique value of another column COUNTRY:

R> ct=ore.crosstab(~AGE/COUNTRY, data=NARROW)
R> head(ct)

You can extend this to more than one column. For example, this command produces one (AGE,EDUCATION) table for each unique combination of (COUNTRY, GENDER):

R> ct = ore.crosstab(AGE~EDUCATION/COUNTRY+GENDER, data=NARROW)

Augment Cross Tabulation with Stratification

All of the above cross tabs can be augmented with stratification. For example,

R> ct = ore.crosstab(AGE~GENDER^CLASS, data=NARROW) 
R> head(ct)

The command in this example is the same as

ct = ore.crosstab(~GENDER, NARROW, strata="CLASS")

Custom Binning Followed by Cross Tabulation

First bin AGE, the calculate cross tabulation for GENDER and the bins:

R> NARROW$AGEBINS=ifelse(NARROW$AGE<20, 1, ifelse(NARROW$AGE<30,2, ifelse(NARROW$AGE<40,3,4)))
R> ore.crosstab(GENDER~AGEBINS, NARROW)

ore.extend

The cross tabulation produced using ore.crosstab can be further augmented with these three basic statistics:

Row and Column Sums
```
crosstab = ore.extend.sum(crosstab)
```
Cumulative sums for each cell of the table
```
crosstab = ore.extend.cumsum(crosstab)
```
Total for the entire table
```
crosstab = ore.extend.total(crosstab)
```

The following example illustrates ore.extend:

R> ct = ore.crosstab(~GENDER, NARROW)
R> ct = ore.extend.sum(ct)
R> ct
  GENDER ORE$FREQ ORE$STRATA ORE$GROUP ORE$SUM$GENDER
0      F      421          1         1            421
1      M      880          1         1            880

ore.freq

ore.crosstab must be performed before frequency analysis is done using ore.freq.

ore.freq analyses the output of ore.crosstab and automatically determines the techniques that are relevant to an ore.crosstab result. The techniques depend on the kind of cross tables:

1-way cross tables

Goodness-of-fit tests for equal proportions or specified null proportions, confidence limits and tests for equivalence.
2-way cross tables
- Various statistics that describe relationships between columns in the cross tabulation
- Chi-square tests, Cochran-Mantel-Haenzsel statistics, measures of association, strength of association, risk differences, odds ratio and relative risk for 2x2 tables, tests for trend
N-way cross tables
- N 2-way cross tables
- Statistics across and within strata

ore.freq uses Oracle Database SQL functions when available.

See ore.freq Parameters for syntax and output and ore.freq Examples for examples.

ore.freq Parameters

ore.freq supports these parameters:

crosstab: ore.frame output from ore.crosstab()
stats: List of statistics required; these statistics are supported:
- Chi Square: AJCHI, LRCHI, MHCHI, PCHISQ
- Kappa: KAPPA, WTKAP
- Lambda: LAMCR, LAMRC, LAMDAS
- Correlation: KENTB,PCORR, SCORR
- Stuart's Tau, Somers: D|C, STUTC, SMDCR,SMDRC
- Fisher's, Cochran's Q, FISHER, COCHQ
- Odds Ratio: OR, MHOR, LGOR
- Relative Risk: RR,MHRR,ALRR
- Others: MCNEM, PHI, CRAMV, CONTGY, TSYM, TREND, GAMMA
The default value is NULL.
Params: Control parameters specific to the statistical function specified in stats:
- SCORE: TABLE|RANK|RIDIT|MODRIDIT
- ALPHA: number
- WEIGHTS: number
The default value is NULL.
skip.missing: Either TRUE or FALSE; skip cells with missing values in the cross table; default value is FALSE
skip.failed: Either TRUE or FALSE; if a statistical test required fails on the cross table because it is found to be in-applicable to the table then return immediately; ; default value is FALSE

ore.freq returns an ore.frame in all cases.

ore.freq Examples

These examples use the NARROW data set; for more information, see Data for Examples.

Before you use ore.freq, you must calculate cross tabs.

For example:

R> ct = ore.crosstab(~GENDER, NARROW)
R> ore.freq(ct)
  METHOD     FREQ DF PVALUE      DESCR GROUP
0   PCHI 161.9377  1      0 Chi-Square     1

ore.rank

ore.rank analyzes distribution of values in numeric columns of an ore.frame.

ore.rank supports useful functionality, including:

Ranking within groups
Partitioning rows into groups based on rank tiles
Calculation of cumulative percentages and percentiles
Treatment of ties
Calculation of normal scores from ranks

ore.rank syntax is simpler that the corresponding SQL queries.

See ore.rank Parameters for syntax and ore.rank Examples for examples.

ore.rank Parameters

ore.rank supports these parameters:

data: The ore.frame containing the data to rank
var: numeric columns in data to rank
desc: If desc=TRUE, rank in descending order; otherwise, rank in ascending order. (The default is to rank in ascending order.)
groups: Partition rows into #groups based on ranks. For percentiles, groups=100, For deciles, groups=10, For quartiles, groups=4.

The default value is NULL.
group.by: Rank each group identified by group.by columns separately

The default value is NULL.
ties: Specify how to treat ties. Assign the largest of, or smallest of, or mean of corresponding ranks to tied values

The default value is NULL.
fraction: The rank of a column value divided by the number of non-missing column values; the default value is FALSE.

Use with nplus1 to estimate the cumulative distribution function
nplus1: fraction plus 1, that is, 1 plus the rank of a column value divided by the number of non-missing column values; the default value is FALSE.

Use with fraction to estimate the cumulative distribution function
percent: fraction converted to a percent value, that is fraction * 100.

ore.rank returns anore.frame in all instances.

You can use these R scoring methods with ore.rank:

To compute exponential scores from ranks, use savage.
To compute normal scores, use one of blom, tukey, or vw(van der Waerden).

ore.rank Examples

These examples illustrate using ore.rank:

Rank Two Columns
Handle Ties
Rank Within Groups
Partition into Deciles
Estimate Cumulative Distribution Function

These examples use the NARROW data set; for more information, see Data for Examples.

Rank Two Columns

This example ranks the two columns AGE and CLASS and reports the results as derived columns; values are ranked in the default order (ascending):

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass')

Handle Ties

This example ranks the two columns AGE and CLASS. If there is a tie, the smallest value is signed to all tied values:

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', ties='low')

Rank Within Groups

This example ranks the two columns AGE and CLASS and ranks the values according to COUNTRY:

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', group.by='COUNTRY')

Partition into Deciles

This example ranks the two columns AGE and CLASS and partitions the columns into deciles (10 partitions):

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',groups=10)

To partition the columns into a different number of partitions, change the value of groups. For example, groups=4 partitions into quartiles.

Estimate Cumulative Distribution Function

This example ranks the two columns AGE and CLASS and estimates the cumulative distribution function for both columns:

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',nplus1=TRUE)

Score Ranks

This example ranks the two columns AGE and CLASS and scores the ranks in tow different ways. The first command all partitions the columns into percentiles (100 groups). savage calculates exponential scores and blom calculates normal scores:

R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge,            CLASS=RankOfClass',score='savage', groups=100, group.by='COUNTRY')
R> x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',score='blom')

ore.sort

ore.sort enables flexible sorting of a data frame along one or more columns specified in a by clause.

ore.sort can be used with other data pre-processing functions. The results of sorting can provide input to R visualization.

ore.sort sorting takes places in the Oracle database. ore.sort supports the database nls.sort option.

See ore.sort Parameters for syntax and ore.sort Examplesfor examples.

ore.sort Parameters

ore.sort supports these parameters:

data: ore.frame containing the data to be sorted; required
by: the column(s) in data by which to sort the data; required
stable: Relative order is maintained within sorted group (TRUE or FALSE); default value is FALSE
reverse: Optional reversal of collation order for character variables (TRUE or FALSE); default value is FALSE
unique.keys: Optional deletion of observations with duplicate values in the columns being sorted, TRUE or FALSE; default value is FALSE
unique.data: Optional deletion of observations duplicate values in all columns, TRUE or FALSE; default value is FALSE

data and by are required parameters; all other parameters are optional

ore.sort returns an ore.frame.

ore.sort Examples

The following examples illustrate using ore.sort:

Sort Columns in Descending Order
Sort Different Columns in Different Orders
Sort and Return One Row per Unique Value
Remove Duplicate Columns
Remove Duplicate Columns and Return One Row per Unique Value
Preserve Relative Order in Output

Most of these examples use the NARROW data set; for more information, see Data for Examples. There are also Examples Using ONTIME_S.

Sort Columns in Descending Order

Sort the columns AGE and GENDER in descending order:

R> x=ore.sort(data=NARROW,by='AGE,GENDER', reverse=TRUE)

Sort Different Columns in Different Orders

Sort AGE in descending order and GENDER in ascending order:

R> x=ore.sort(data=NARROW,by='-AGE,GENDER')

Sort and Return One Row per Unique Value

Sort by AGE and keep one row per unique value of AGE:

R> x=ore.sort(data=NARROW,by='AGE', unique.key=TRUE)

Remove Duplicate Columns

Sort by AGE and remove duplicate rows:

R> x=ore.sort(data=NARROW,by='AGE', unique.data=TRUE)

Remove Duplicate Columns and Return One Row per Unique Value

Sort by AGE. Also remove duplicate rows, and return one row per unique value of AGE:

R> x=ore.sort(data=NARROW,by='AGE', unique.data=TRUE,unique.key = TRUE)

Preserve Relative Order in Output

Maintain the relative order in the sorted output:

R> x=ore.sort(data=NARROW,by='AGE', stable=TRUE)

Examples Using ONTIME_S

These examples use the ONTIME_S airline data that is installed when you install Oracle R Enterprise:

Sort ONTIME_S by airline name in descending order and departure delay in ascending order:
```
R> sortedOnTime1 <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY')
```
Sort ONTIME_S by airline name and departure delay and select one of each combination (that is, return a unique key):
```
R> sortedOnTime1 <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY', unique.key=TRUE)
```

ore.summary

ore.summary calculates descriptive statistics and supports extensive analysis of columns in an ore.frame, along with flexible row aggregations.

ore.summary supports these statistics:

Mean, min., max, mode, number of missing values, sum, weighted sum
Corrected and uncorrected sum of squares, range of values, stddev, stderr, variance
t-test for testing the hypothesis that the population mean is 0
Kurtosis, skew, Coefficient of Variation
Quantiles: p1, p5, p10, p25, p50, p75, p90, p95, p99, qrange
1-sided and 2-sided Confidence Limits for the mean: clm, rclm, lclm
extreme value tagging

ore.summary provides a relatively simple syntax compared with SQL queries for the same results.

See ore.summary Parameters for syntax and ore.summary Examples for examples.

ore.summary Parameters

ore.summary supports these parameters:

data: the data to aggregate as an ore.frame
class: column(s) of data to aggregate (that is, SQL GROUP BY); default value is NULL
var: column(s) of data on which to apply statistics functions (SQL SELECT list)
stats: list of statistics functions to be applied on var columns

mean, min, max, cnt, n, nmiss, css, uss, cv, sum, sumwgt, range, stddev, stderr, var, t, kurt, skew,p1, p5, p10, p25, p50, p75, p90, p95, p99, qrange, lclm, rclm, clm, mode that can be requested on var columns.

The default value are n, mean, min, max.
weight: A column of data whose numeric values provide a multiplicative factor for var columns
maxid, minid: for each group optionally list maximum or minimum value from other columns in data; default value is NULL
ways: restrict output to only certain grouping levels of the class variables; default value is NULL
group.by: column(s) of data to stratify summary results across; default value is NULL
order: defines optional sorting of output data. Specify [-]NAME to sort by tabulation columns, [-]FREQ to sort by frequency counts in table. Unspecified order is the most efficient. The optional '-' reverses the order direction
_FREQ: frequency, number of observations in a group
_TYPE: identifies the grouping, binary code based
_LEVEL: identifies number of variables used in grouping

ore.summary returns an ore.frame as output in all cases except when a group.by clause is used. If a group.by clause is used, ore.summary returns a list of ore.frames, one frame per stratum.

ore.summary Examples

These examples illustrate the use of ore.summary:

Calculate Default Statistics
Skew and t Test
Weighted Sum
Two Separate Group By Columns

These examples use the NARROW data set; for more information, see Data for Examples.

Calculate Default Statistics

This example calculates mean, min, max for columns AGE and CLASS and rolls up (aggregates) GENDER:

R> ore.summary(NARROW, class='GENDER', var ='AGE,CLASS'. order='freq')

Skew and t Test

This example calculates skew for skew of AGE as column A and the t-test for CLASS as column B:

R> ore.summary(NARROW, class='GENDER', var='AGE,CLASS',  stats='skew(AGE)=A, probt(CLASS)=B')

Weighted Sum

This example calculates weighted sum for AGE aggregated by GENDER with YRS_RESIDENCE as weights; in other words, it calculates sum(var*weight):

R> ore.summary(NARROW, class='GENDER', var='AGE', stat='sum=X', weight='YRS_RESIDENCE')

Two Separate Group By Columns

Group CLASS by GENDER and MARITAL_STATUS:

r> ore.summary(NARROW, class='GENDER, MARITAL_STATUS', var='CLASS', ways=1)

All Possible Group By

This example groups CLASS in all possible ways by GENDER and MARTIAL_STATUS:

R> ore.summary(NARROW, class='GENDER, MARITAL_STATUS', var='CLASS', ways='nway')

ore.univariate

ore.univariate provides distribution analysis of numeric variables in an ore.frame.

ore.univariate provides these statistics:

All statistics reported by ore.summary
Signed rank test, Student's t-test
Extreme values reporting

See ore.univariate Parameters for syntax and ore.univariate Examples for examples.

ore.univariate Parameters

ore.univariate supports these parameters:

data: The data to aggregate as an ore.frame
var: Numerical column(s) of data to analyze
weight: A column of the data whose numeric values provide a multiplicative factor for var columns; the default value is NULL
stats: Optional specification of a subset of statistics to calculate and display:
- moments: n, sumwgt, mean, sum, stddev, var, skew, kurt., uss.css.cv, stderr
- measures: mean, stddev, median, var, mode, range, iqr
- quantiles: p100, p99, p95, p90, p75, p50, p25, p10, p5, p1, p0
- location: studentt, studentp, signt, signp, srankt, srankp
- normality
- loccount: loc<,loc>,loc!
- extremes
The default value is NULL.

ore.univariate returns anore.frame as output in all cases.

ore.univariate Examples

These examples illustrate the use of ore.univariate:

Default Univariate Statistics
Location Statistics
Complete Quantile Statistics

These examples use the NARROW data set; for more information, see Data for Examples.

Default Univariate Statistics

This example calculates the default univariate statistics for AGE, YRS_RESIDENCE, and CLASS:

R> ore.univariate(NARROW, var="AGE,YRS_RESIDENCE,CLASS")

Location Statistics

This example calculates location statistics for YRS_RESIDENCE:

R> ore.univariate(NARROW, var="YRS_RESIDENCE",stats="location")

Complete Quantile Statistics

This example calculates complete quantile statistics for AGE and YRS_RESIDENCE:

R> ore.univariate(NARROW, var="AGE,YRS_RESIDENCE",stats="quantiles")