elastacloud/automatic-data-explorer

Automatic Data Explorer

An R package for exploring and quality-checking data. It contains a variety of functions for automatically checking data quality, factors and numeric data, and correlations, including:

  • targetCorrelations()
  • ggdensity()
  • gghistogram()
  • SummaryStatsCat()
  • SummaryStatsNum()
  • autoMarkdown()

Using targetCorrelations

To get started, pass a data frame and the name of the column that you want target correlations for:

install.packages("purrr")
library(purrr)

data <- data.frame(A = rnorm(50,0,1),
                   B = runif(50,10,20),
                   C = seq(1,50,1),
                   D = rep(LETTERS[1:5], 10))

targetCorrelations(data, "B")

Because columns A and B are drawn at random, the exact values will differ between runs, but the report should look similar to:

         C          A 
0.40549008 0.01356416 
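
For readers who want to see what such a report involves, here is a minimal base R sketch. It is not the package's implementation: the helper name target_cor_sketch is hypothetical, and the choice to drop non-numeric columns and sort in decreasing order is an assumption inferred from the sample output above.

```r
# Hypothetical helper (not part of the package): correlate every other
# numeric column with the target column and sort the result.
target_cor_sketch <- function(df, target) {
  # Keep numeric columns only; the factor/character column D is skipped
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  predictors <- setdiff(numeric_cols, target)

  # Pearson correlation of each predictor with the target
  cors <- vapply(predictors,
                 function(col) cor(df[[col]], df[[target]]),
                 numeric(1))

  # Assumption: largest correlation first, as in the sample report
  sort(cors, decreasing = TRUE)
}

set.seed(1)  # only to make this sketch reproducible
data <- data.frame(A = rnorm(50, 0, 1),
                   B = runif(50, 10, 20),
                   C = seq(1, 50, 1),
                   D = rep(LETTERS[1:5], 10))

result <- target_cor_sketch(data, "B")
print(result)
```

The result is a named numeric vector with one entry per numeric column other than the target.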

Using autoMarkdown

The autoMarkdown() function automatically generates R Markdown files directly from one or more R scripts. The idea is to take the focus away from Markdown styling so that you can concentrate on the most important part of data science: the actual exploration and analysis.

The function requires the R script to follow some formatting conventions: code that you want placed in a code chunk must be enclosed between dividers, e.g.

#' # Summary
#' This is the summary of the mtcars dataset

#.#
summary(mtcars)
#.#

#' ## Histogram of mpg
#' This is a histogram of the mpg variable

#.#
autoHistogramPlot(mtcars, mpg, colour = "black", fill = "blue")
#.#

There are two things to note in this example:

  • #.# lines are the dividers; the code between a pair of them is treated as a code chunk
  • #' lines are recognised by autoMarkdown as Roxygen comments and treated accordingly
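
The two conventions can be illustrated with a rough line-by-line conversion. This is only a sketch of the idea, not how autoMarkdown() is actually implemented: the function name script_to_rmd is hypothetical, and the plain {r} chunk header is an assumption.

```r
# Hypothetical sketch: turn the lines of an annotated R script into
# R Markdown lines. Not the package's own code.
script_to_rmd <- function(lines) {
  fence <- strrep("`", 3)  # three backticks, built here to keep the sketch readable
  in_chunk <- FALSE
  out <- character(0)

  for (line in lines) {
    if (grepl("^#\\.#[[:space:]]*$", line)) {
      # A divider opens a chunk if none is open, otherwise closes it
      out <- c(out, if (in_chunk) fence else paste0(fence, "{r}"))
      in_chunk <- !in_chunk
    } else if (!in_chunk && grepl("^#'", line)) {
      # Roxygen-style comment: drop the marker, keep the text as Markdown
      out <- c(out, sub("^#' ?", "", line))
    } else {
      # Code inside a chunk and blank lines pass through unchanged
      out <- c(out, line)
    }
  }
  out
}

script <- c("#' # Summary",
            "#' This is the summary of the mtcars dataset",
            "",
            "#.#",
            "summary(mtcars)",
            "#.#")

rmd <- script_to_rmd(script)
cat(rmd, sep = "\n")
```

In this sketch the heading and sentence become ordinary Markdown text, and summary(mtcars) ends up inside a fenced R chunk.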

Suppose we have saved the above in an R script called mtcars.R. We can now write it as R Markdown to an existing mtcars.Rmd file with:

autoMarkdown("mtcars.R", "mtcars.Rmd")

Most projects will have multiple separate scripts, perhaps detailing different stages of the data science life-cycle. This makes our workflow much easier to follow and keeps code neat and tidy. However, when it comes to reporting, we most likely want just one report. Multiple scripts can all be written to the same .Rmd file with:

autoMarkdown(c("DataExploration.R", "DataCleaning.R", "Modelling.R"), "ProjectReport.Rmd", overwrite = TRUE)

Note the overwrite = TRUE argument: any existing Markdown in the .Rmd file will automatically be overwritten. This is useful in most circumstances but could be dangerous if you specify the wrong .Rmd file, so use it with caution.
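
One way to reduce that risk is to copy the existing report aside before overwriting it. The sketch below uses only base R; the helper backup_if_exists and the .bak suffix are hypothetical and not something the package provides.

```r
# Hypothetical helper: keep a copy of an existing report before it is
# overwritten. Returns the backup path, or NULL if there was nothing to copy.
backup_if_exists <- function(path) {
  if (!file.exists(path)) {
    return(invisible(NULL))
  }
  backup <- paste0(path, ".bak")
  file.copy(path, backup, overwrite = TRUE)
  invisible(backup)
}

# Demonstration in a temporary directory
report <- file.path(tempdir(), "ProjectReport.Rmd")
writeLines("old report", report)

backup <- backup_if_exists(report)
print(readLines(backup))

# With the backup in place, the overwriting call could follow, e.g.
# autoMarkdown(c("DataExploration.R", "DataCleaning.R", "Modelling.R"),
#              report, overwrite = TRUE)
```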

The default setting is to create code chunks that are "quiet"; that is, they display only the results of the code, not the code itself or any messages generated by it. Further development may include an option to specify a code chunk that also displays the code itself.
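
In knitr terms, a quiet chunk roughly corresponds to the echo and message chunk options being switched off. The helper below is a hypothetical illustration of that mapping; the exact options that autoMarkdown() writes are an assumption here.

```r
# Hypothetical helper: build a knitr chunk header.
# echo = FALSE hides the code; message = FALSE hides messages it generates.
chunk_header <- function(quiet = TRUE) {
  if (quiet) {
    "{r, echo=FALSE, message=FALSE}"
  } else {
    "{r}"
  }
}

quiet_header <- chunk_header()
loud_header <- chunk_header(quiet = FALSE)
print(quiet_header)
print(loud_header)
```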
