exploreR and explorePy
As part of a small team in the MDS program, I worked on a collaborative software development project to create R and Python packages to explore the contents of a dataframe.
The exploreR package for R can be found here, and the explorePy package for Python can be found here.
The functions in explorePy are similar to those of the R package detailed below. Installation instructions and documentation for the Python package are available at the link noted above.
exploreR
Installation
- Input the following into the console:
devtools::install_github("UBC-MDS/exploreR", build_opts = c("--no-resave-data", "--no-manual"))
Functions and Example Usage
Load the package.
library(exploreR)
Function 1 | Variable summary
The function variable_summary takes a data frame as input and returns the number of variables of each type it contains. The output is a data frame of size 5 x 2, with one row per variable type and its corresponding count. The function identifies five types of variables: numerical, character, boolean, date, and a catch-all "other" category.
Example usage of variable_summary:
toy_data <- data.frame("letters" = c("a", "b", NA, "d"),
"numbers" = c(1, 4, 6, NA),
"logical" = c(NA, FALSE, NA, TRUE),
"dates" = as.Date(c("2003-01-02", "2002-02-02", "2004-03-03", "2005-04-04")),
"integers" = c(2L, 3L, 4L, 5L),
stringsAsFactors = FALSE)
variable_summary(toy_data)
Example output of variable_summary:
variable_type | count |
---|---|
numeric | 1 |
character | 1 |
logical | 1 |
date | 1 |
other | 1 |
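To illustrate the idea behind this kind of type count, here is a minimal base R sketch. It is not the exploreR implementation: the helper name count_variable_types is hypothetical, and the rule that integer columns fall into "other" is an assumption inferred from the example output above.

```r
# Hypothetical helper for illustration; not the exploreR source code.
count_variable_types <- function(df) {
  stopifnot(is.data.frame(df))
  # First class of each column, e.g. "numeric", "Date", "integer".
  first_class <- vapply(df, function(col) class(col)[1], character(1))
  # Map R class names onto the named categories. Classes that are not
  # listed (including "integer", by assumption) become "other".
  lookup <- c(numeric = "numeric", character = "character",
              logical = "logical", Date = "date")
  categories <- unname(lookup[first_class])
  categories[is.na(categories)] <- "other"
  # A factor with fixed levels guarantees all five rows, even when a count is 0.
  levels_out <- c("numeric", "character", "logical", "date", "other")
  counts <- table(factor(categories, levels = levels_out))
  data.frame(variable_type = levels_out,
             count = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_data <- data.frame(letters = c("a", "b", NA, "d"),
                       numbers = c(1, 4, 6, NA),
                       logical = c(NA, FALSE, NA, TRUE),
                       dates = as.Date(c("2003-01-02", "2002-02-02",
                                         "2004-03-03", "2005-04-04")),
                       integers = c(2L, 3L, 4L, 5L),
                       stringsAsFactors = FALSE)
result <- count_variable_types(toy_data)
print(result)
```

Using a factor with fixed levels keeps the output at 5 x 2 regardless of which types are present in the input.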
Function 2 | Missing values per variable
The function missing_values accepts a dataframe as input and, for each column/variable, counts the number of missing values and the fraction of rows they represent. It returns a dataframe with one row per column/variable of the input. If the input is of size n x d, the output size will be d x 3.
Example usage of missing_values:
toy_data <- data.frame("letters" = c("a", "b", NA, "d"),
"numbers" = c(1, 4, 6, NA),
"logical" = c(NA, FALSE, NA, TRUE),
"dates" = as.Date(c("2003-01-02", "2002-02-02", "2004-03-03", "2005-04-04")),
"integers" = c(2L, 3L, 4L, 5L),
stringsAsFactors = FALSE)
missing_values(toy_data)
Example output of missing_values:
variable | missing_values | percent_missing |
---|---|---|
letters | 1 | 0.25 |
numbers | 1 | 0.25 |
logical | 2 | 0.50 |
dates | 0 | 0.00 |
integers | 0 | 0.00 |
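A comparable per-column count can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: the helper name count_missing is hypothetical, and percent_missing is computed as a fraction of rows to match the example output above.

```r
# Hypothetical helper for illustration; not the exploreR source code.
count_missing <- function(df) {
  stopifnot(is.data.frame(df))
  # is.na() on a data frame returns a logical matrix; colSums() counts
  # the TRUE entries in each column.
  n_missing <- colSums(is.na(df))
  data.frame(variable = names(df),
             missing_values = as.integer(n_missing),
             # Fraction of rows that are missing (NaN for a zero-row input).
             percent_missing = as.numeric(n_missing) / nrow(df),
             stringsAsFactors = FALSE)
}

toy_data <- data.frame(letters = c("a", "b", NA, "d"),
                       numbers = c(1, 4, 6, NA),
                       logical = c(NA, FALSE, NA, TRUE),
                       dates = as.Date(c("2003-01-02", "2002-02-02",
                                         "2004-03-03", "2005-04-04")),
                       integers = c(2L, 3L, 4L, 5L),
                       stringsAsFactors = FALSE)
result <- count_missing(toy_data)
print(result)
```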
Function 3 | Dataset Size/Info
The function size takes in a dataframe and reports its shape (number of rows and columns) and how much memory it consumes in bytes. The output of the function will be a dataframe of size 1 x 3.
Example usage of size:
toy_data <- data.frame("letters" = c("a", "b", NA, "d"),
"numbers" = c(1, 4, 6, NA),
"logical" = c(NA, FALSE, NA, TRUE),
"dates" = as.Date(c("2003-01-02", "2002-02-02", "2004-03-03", "2005-04-04")),
"integers" = c(2L, 3L, 4L, 5L),
stringsAsFactors = FALSE)
size(toy_data)
Example output of size:
rows | columns | size_in_memory |
---|---|---|
4 | 5 | 1760 |
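The same three quantities can be gathered with base R, as in the sketch below. The helper name describe_size is hypothetical, and the use of utils::object.size() is an assumption about how memory might be measured, not a statement about how exploreR does it. The byte count depends on the R version and platform, so it may differ from the value shown above.

```r
# Hypothetical helper for illustration; not the exploreR source code.
describe_size <- function(df) {
  stopifnot(is.data.frame(df))
  data.frame(rows = nrow(df),
             columns = ncol(df),
             # object.size() estimates the memory used by the object, in bytes.
             size_in_memory = as.numeric(utils::object.size(df)))
}

toy_data <- data.frame(letters = c("a", "b", NA, "d"),
                       numbers = c(1, 4, 6, NA),
                       logical = c(NA, FALSE, NA, TRUE),
                       dates = as.Date(c("2003-01-02", "2002-02-02",
                                         "2004-03-03", "2005-04-04")),
                       integers = c(2L, 3L, 4L, 5L),
                       stringsAsFactors = FALSE)
result <- describe_size(toy_data)
print(result)
```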
Check out the package vignette for more information by entering the following in the console:
vignette("explorer")
for viewing inside the RStudio viewer
or
browseVignettes(package="exploreR")
for viewing in a browser
Comparable Functions Available in the R Ecosystem
The following are existing functions in R that are similar to those developed within our project.
dim(): used to obtain the shape of a dataframe.
ncol() and nrow(): used to get the number of columns and rows in a dataframe, respectively.
str(): provides summary information about the dataframe, including some of the same information as above (i.e. dim, ncol and nrow). str() provides descriptive information about variable and data types in the dataframe.
is.na(): flags each value of the data frame as missing or not. Combined with colSums(), as in colSums(is.na(df)), it gives the number of missing values per column.
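For comparison, the base R functions listed above can be called directly on a data frame. The snippet below is a small illustration; the two-column example_df and the variable names are made up for the example.

```r
# Small example data frame, defined here only for illustration.
example_df <- data.frame(letters = c("a", "b", NA, "d"),
                         numbers = c(1, 4, 6, NA),
                         stringsAsFactors = FALSE)

shape <- dim(example_df)       # c(rows, columns)
n_rows <- nrow(example_df)     # number of rows
n_cols <- ncol(example_df)     # number of columns
str(example_df)                # prints dimensions, column types and first values

# is.na() marks each cell as missing or not; colSums() turns that
# logical matrix into a count of missing values per column.
na_counts <- colSums(is.na(example_df))
print(na_counts)
```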
Collaborators on this project include myself, James Pushor, Milos Milic, and Arzan Irani.
Photo by Andrew Neel on Unsplash