Stuff I have found super useful for work and life
So without further ado:
- tidyverse: I believe this should be bundled with R now. Actually if you are not using it for data wrangling you are probably doing something wrong (or you are super old school). It comes with a bundle of rich tools such as dplyr (transform, filter, aggregate), tidyr (long / wide table conversion), purrr (functional programming package), tibble (supercharged dataframes), etc.
- janitor: a great tool for data cleaning, works nicely with tidyverse.
- seplyr: A handy companion package for dplyr that adds some shortcuts such as group summary functions. You can always implement these yourself, but this is the lazy person’s choice.
- ggplot2: The plotting package you should be using, period. Python has a port as well, plotnine. Sometimes you may ask yourself why you need 10 lines of code to generate a line plot when you could just open the file in Excel and make one; other times you will feel grateful that you can compose anything without banging your head against the screen. The learning curve is definitely steep if you are new, but well worth the investment.
- DataExplorer: An easy to use package to generate ggplot-verse plots of data diagnostics, such as missing data / feature, etc. I use some shortcut functions for quick data validations.
- ggrepel: Its geom_text_repel is a drop-in replacement for geom_text if you have a massive number of labels and don’t want all the text to overlap. It may take a long time to run on large datasets, but the results will be much more visually pleasing.
- ggfortify: Diagnostics / visualization plots for various statistical models.
- sjPlot: Comes with a slew of neat diagnostic plots for statistical modeling. I especially like that it’s built on ggplot, so you can easily tweak the look and feel to be consistent with other ggplot-verse plots. jtools / interplot contain some nice functions to plot interaction effects. ggdistribute contains functions to plot posterior distributions.
- coefplot: Automatically plot model coefficients with confidence intervals.
- gghighlight: Very useful if you have multiple time series and would like to highlight a few while keeping the rest in the background.
- lvplot: Letter-value plots, which work like boxplots but are better at handling outliers and large datasets.
- stargazer: Generates HTML / LaTeX tables from data frames, which can be inserted into manuscripts. I use this to generate tables for Google Docs whenever I want the table style to be more classic. Otherwise I go with sjPlot. There’s also gt.
- ggthemes, ggsci, ggpubr, ggalt, hrbrthemes: publication ready themes and color templates for various journals. You may even get xkcd styled plots. extrafont allows you to use custom fonts in plots.
- waffle: Infographics are not always the best option for data presentation, but they may be helpful for a less data / tech savvy audience. Here’s a blog post with some examples.
- corrplot: One-stop-shop for plotting various correlations in various representations.
- alluvial, ggalluvial, riverplot: Packages to generate Sankey-style flow diagrams (ggalluvial is the ggplot-based one). They are sometimes useful for visualizing complex flows, assuming your flows won’t be intertwined like noodles.
- pheatmap, ggdendro: Packages to plot heatmaps and dendrograms.
- lattice, plotrix, gplots, plotly: Various plotting functions not in the ggplot universe.
- ggmap, mapview, leaflet, sp: Packages for geo-spatial visualization.
- patchwork, gridExtra: These packages work well for stitching multiple ggplots together. I personally prefer patchwork since the syntax is more natural and gg-ish.
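To make the ggrepel and patchwork entries above a bit more concrete, here is a minimal sketch, assuming ggplot2, ggrepel and patchwork are installed. The built-in mtcars data and the variable names (`cars`, `p_text`, `p_repel`, `combined`) are placeholders chosen only for illustration.

```r
library(ggplot2)
library(ggrepel)
library(patchwork)

# Illustrative data: mtcars with the row names turned into a label column
cars <- mtcars
cars$model <- rownames(mtcars)

# Plain geom_text: labels overlap wherever points are close together
p_text <- ggplot(cars, aes(wt, mpg, label = model)) +
  geom_point() +
  geom_text(size = 3) +
  ggtitle("geom_text")

# geom_text_repel: same aesthetics, but labels are pushed away from each other
p_repel <- ggplot(cars, aes(wt, mpg, label = model)) +
  geom_point() +
  geom_text_repel(size = 3) +
  ggtitle("geom_text_repel")

# patchwork: `+` places plots next to each other, `/` would stack them
combined <- p_text + p_repel
```

Printing `combined` should draw the two panels side by side, which makes the difference in label placement easy to see.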
Natural Language Processing
- formattable: Package to provide a nice suite of functions to facilitate formatting such as numbers, percentages, etc.
- scales: An R package to aid / prettify ggplot scales, which can be quite annoying and tedious to tweak. I rely on this lib to generate log scale ticks (example link).
- stringr, stringi: A nice suite of string manipulation functions. String processing in R is generally not pleasant, and these come to the rescue.
- RKEA: An R interface for KEA (Key Phrase Extraction); a good partner lib for other more comprehensive NLP packages such as topicmodels.
- udpipe: Trainable tokenization pipelines.
- stringdist, fuzzywuzzyR: Fuzzy / approximate string matching and string distance functions.
- fuzzyjoin: An interesting concept: this is an extension to the dplyr join methods that adds fuzzy join functions for tables. Can be handy, but use with caution since the behavior may not be predictable.
- topicmodels: A performant topic modeling package.
- formatR: Can be used to batch format multiple R source files if you are not using an IDE with a built-in auto-format function.
- lintr: Code linting for R sources.
- SentimentAnalysis: This is a similar package to Python’s VADER, if you’d like to run a quick sentiment analysis with multiple methods.
- modelr: A streamlined modeling interface to normalize several commonly used modeling frameworks.
- broom: Very handy package to convert modeling objects into more consumable data frames for batch reporting, persistence or visualization.
- fmsb, Mcomp: fmsb is the companion package to the book Practices of Medical and Health Data Analysis using R. The book is about 10 years old, but some of the data are useful for trying out various statistical / ML methods. Mcomp contains some time series data to play with.
- Hmisc, rcompanion, e1071: Excellent utility packages for various statistical analyses, such as sample size / power calculation, computing pseudo R², various statistical tests, etc.
- vegan: Statistical methods for ecologists.
- fit.models: This package is similar to caret in that it standardizes the interface for various parametric / nonparametric model fitting and comparisons.
- oem: A package developed specifically to tackle big tall data (small p, very large N; an example is Kaggle’s TalkingData) regression problems.
- BeSS: A best subset selection algorithm; can be used along with the standard AIC / BIC based model selection procedures.
- quantreg: Package w/ quantile regression functions.
- lavaan: Latent variable analysis and structural equation modeling [link].
- lme4: This package is great for fitting linear and generalized mixed-effects models.
- uplift: Uplift modeling to optimize the ROI of a particular treatment by identifying the best subset of targets [link].
- arm: Companion to the book Data Analysis Using Regression and Multilevel/Hierarchical Models. Contains handy methods for hierarchical modeling.
- smooth: Curve smoothing and time series forecasting.
- extRemes: Package for extreme / long tail value statistics.
- MixtureInf: Maximum Likelihood Estimate (MLE) methods.
- MatchIt: A great package for propensity score matching. It has some quirks, and extracting data from the results may not be very intuitive.
- CausalImpact: Google’s package for quick causal effect estimation; it is usually my first attempt at any causal inference analysis.
- causaleffect: Causal inference. I haven’t checked this package in detail, but it looks very promising.
- CompareCausalNetworks: This is the caret for causal networks, a unified interface to tap into many causal network frameworks.
- rdd: This package contains everything you need for regression discontinuity analysis.
- mediation: A package developed for causal mediation analysis. Uber has an excellent blog post on such analysis [link]. This framework can estimate the average treatment effect with and without the presence of a proposed mediator.
- meta, metafor: Starting packages for basic meta analysis, such as fitting fixed / random effect models for binary / continuous outcomes.
- forestmodel, metaviz: The built-in forest plotting functions for meta / metafor have not been upgraded to the ggplot universe yet. These packages do the facelifting, and the resulting plots can be updated with ggplot functions, such as changing captions, stitching with other plots, etc. I wrote an even better forest plot function, which I will share in another blog post.
- vcov: Faster methods to extract variance-covariance matrices. This can be used in many contexts and is especially handy for multivariate meta analysis.
- bayesmeta, CPBayes, bmeta: Certainly there will be Bayesian counterparts for meta analysis. I haven’t checked these in detail but probably will do more as I dig deeper into Bayesian inference.
- netmeta: A package for network meta analysis. Network meta analysis is a methodology to combine multiple studies and infer treatment effects between groups which were never compared directly.
- diptest: Dip test of multimodality / mixture distributions.
- normtest: Contains many handy tests of normality, skewness, kurtosis, etc., using Monte Carlo simulations; a good addition to stats::shapiro.test (the Shapiro-Wilk test of normality).
- seqtest, SPRT, ldbounds, gsDesign: These are the packages for sequential testing — computing bounds w/ various methods. I plan to write a comprehensive blog about this subject area.
- lmtest: Hypothesis testing for linear regression models.
- mhtboot: Multiple hypothesis testing correction.
- wBoot: Bootstrap methods as alternatives to hypothesis testing.
- coin: Permutation test framework.
- resample: Resampling methods which can be used along with many hypothesis testing frameworks.
- binom: Provide additional binomial confidence intervals since the standard estimation may not work well for rare events, heterogeneous p, etc.
- DescTools: Lazy person’s tools for descriptive statistics.
- tolerance: To think about the hypothesis testing in a different way — given alpha and percentage of recovery, what’s the lower bound and upper bound of data we may recover from a particular probability distribution? This package does the trick for all the common distributions.
- BSDA: Companion to the book Basic Statistics and Data Analysis; contains many interesting datasets and auxiliary hypothesis testing functions (such as computing a z-test from summary statistics vs. raw data, although it’s pretty easy to roll your own).
- WRS2: A collection of robust statistical modeling / testing methods.
- bayesAB: THE Bayesian hypothesis testing package you want to use. This is an awesome package developed by Frank Portman.
- rstanarm: Bayesian Applied Regression Modeling — can be used for Bayesian hypothesis testing in a modeling framework.
- hotelling: A package for Hotelling’s T² test.
- pwr: Basic functions for power analysis. Compute any of sample size, power, effect size, alpha from the 3 other parameters.
- samplesize: Similar to pwr but can compute sample size for nonparametric Wilcoxon Test as well.
- powerAnalysis: An older package that also does the trick.
- simr: A simulation based approach to power analysis.
- effsize, compute.es: Very handy packages to compute various effect size measures (Cohen d, etc.).
- survey: Name is very self-explanatory.
- SDaA: Sampling, Design and Analysis
- FactoMineR: Multivariate factor analysis, with Multiple Correspondence Analysis (MCA)
- ade4: Analysis of ecological data / environmental sciences w/ survey methods.
- ca, homals: Various correspondence analysis methods.
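As a quick illustration of the pwr entry above ("compute any of sample size, power, effect size, alpha from the 3 other parameters"), here is a minimal sketch, assuming the pwr package is installed. The effect size of 0.5 and the 80% power target are arbitrary numbers picked for the example, not recommendations.

```r
library(pwr)

# Leave `n` unset to solve for the per-group sample size of a
# two-sample t-test with Cohen's d = 0.5, alpha = 0.05, power = 0.80
res_n <- pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.80,
                    type = "two.sample")
n_per_group <- ceiling(res_n$n)

# Leave `power` unset instead to solve for the power achieved with that n
res_power <- pwr.t.test(n = n_per_group, d = 0.5, sig.level = 0.05,
                        type = "two.sample")
achieved_power <- res_power$power
```

The same pattern works for the other arguments: whichever one you leave out is the one pwr.t.test solves for.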
Time Series Analysis
- jmotif: Provides a comprehensive suite of symbolic transformation functions such as SAX (Symbolic Aggregate approXimation) to convert continuous time-series data into discrete string sequences, which can then be fed into a variety of feature engineering functions.
- seewave: Package for sound analysis, also offers SAX transformations
- prophet: Facebook’s time series analysis package, including forecasting and change point detection. Also comes in Python.
- imputeTS: Imputation methods designed specifically for time series data.
- anytime, timeDate, lubridate, hms: We all know time format conversion is a universal pain. These packages come to the rescue.
- fma: Time series datasets you may play with.
- timereg: Regression models for survival / time series data
- forecast: Forecasting functions for ts / linear models.
- TSA: General Time Series analysis accompanying the book Time Series Analysis with Applications in R
- astsa: Applied statistical time series analysis.
- spectral.methods: Spectral decomposition of time series data.
- pracma: Practical Numeric Math Functions.
- changepoint, cpm: Methods for change point / anomaly detection.
- bcp, ecp: Bayesian / Nonparametric methods for change point detection.
- TSClust, dtwclust: Specific methods developed for time series data clustering.
- survival: Survival analysis toolkit. Contains everything you need to begin, such as Cox proportional hazards models.
- robust, robustbase: Robust methods to overcome univariate / multidimensional outliers and generate more stable models / estimates.
- twitter/AnomalyDetection: This package hasn’t been updated for years but still seems to work. I would welcome recommendations on anomaly detection methods.
- forcats: Tools for categorical variable transformations.
- Boruta: Method for feature selection based on permutation of importance measures.
- MXM: Feature selection methods w/ Bayesian networks.
- fscaret: Automated feature selection from caret.
- EFS: Feature selection using ensemble methods.
- one_hot, onehot: Handy shortcuts for one-hot encoding of categorical variables.
- proxy: Distance functions can be scattered around many R packages with various function signatures, which makes them hard to hot-swap. This package normalizes distance definitions and makes it convenient to define any custom distance function.
- parallelDist: Computing a distance matrix on very large datasets (>5000 rows) can be very time consuming on a local machine; this package computes distances in parallel, which can dramatically reduce computation time (tested at up to a 20x speed boost).
- philentropy: Similarity distances between probability functions.
- wCorr, weights: Weighted statistics such as correlations.
- distances: Various distance metrics which can be used for ML / stat modeling.
- gower: Gower’s distance between records, often used in survey analysis with mixed numeric / categorical responses.
- Rtsne, tsne: t-SNE implementations in R.
- gmodels (fast.prcomp): A fast version of PCA.
- umap: Dimension reduction methodology, with a tutorial here.
- smacof: A comprehensive package for Multi-dimensional scaling, as a nice addition to MASS::isoMDS
- largeVis: Large data visualization with dimension reduction.
- RDRToolbox: Dimension reduction w/ isomap and LLE with a unified framework.
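The hot-swapping point in the proxy entry above can be sketched roughly as follows, assuming the proxy package is installed. The random matrix `x` and the helper `manhattan` are made up for illustration; `"cosine"` is one of the measures proxy ships in its registry.

```r
library(proxy)

set.seed(1)
x <- matrix(rnorm(20), nrow = 5)   # 5 observations, 4 features

# Built-in measure, referenced by name from proxy's registry
d_cos <- proxy::dist(x, method = "cosine")

# Custom measure: any function of two row vectors can be passed as `method`
# (here a hand-rolled Manhattan distance, purely for illustration)
manhattan <- function(a, b) sum(abs(a - b))
d_custom <- proxy::dist(x, method = manhattan)

# Sanity check against the stats implementation of the same metric
d_ref <- stats::dist(x, method = "manhattan")
```

Because both calls go through the same `dist(x, method = ...)` interface, swapping one distance for another is a one-argument change.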
Unsupervised Learning / Clustering
- mclust: Model Based clustering approach using Gaussian Mixtures. It automatically decides the optimal number of clusters based on maximum likelihood. Here’s a starter’s tutorial.
- fastcluster: A drop-in replacement for the built-in hierarchical clustering with a massive performance boost. Clustering thousands of data points takes less than a couple of seconds.
- flashClust: Another implementation of fast hierarchical clustering.
- NMF: Package for non-negative matrix factorization. This is a very useful technique for finding smaller matrices that can be multiplied to approximate the source matrix, while keeping all values non-negative (additive). Frequently used in clustering / image processing. [Nature Paper]
- cluster, fpc, clue: A suite of methods for cluster analysis and validation.
- pvclust: Uses bootstrap to estimate the uncertainty in hierarchical clustering and to search for optimal cuts.
- fastICA: Fast method for independent component analysis (ICA). A good Quora post explains the difference between PCA and ICA.
- EMCluster: Model based clustering w/ finite Gaussian mixtures. In short, it assumes the data were generated from a mixture of multivariate Gaussian distributions and tries to estimate the optimal number of clusters and cluster membership w/ the EM algorithm.
- clues, clusterSim: Automatic clustering methods to identify number of clusters, with diagnostics plots.
- RSKC: Robust K-means clustering algo for sparse data.
- dendextend: Advanced dendrogram drawing methods.
- NbClust: A really nice package to identify the optimal number of clusters — can provide ~ 30 metrics simultaneously.
- clValid: Compute a variety of cluster quality metrics, such as Dunn index.
- clustertend: Hopkins’ cluster tendency. You may apply a clustering algorithm to any dataset, but that doesn’t mean the result is meaningful. Specifically, the data need to contain some sort of clustering structure, and the Hopkins index is a good measure of that, with permutation tests.
- dbscan: Density based clustering methods [wiki]; may work when traditional distance based methods fail.
- cluMix: Clustering subjects with mixed data types; distances can be computed using Gower’s distance. Alternatively, you may use gower to compute the distance and then feed it to your preferred clustering algorithm.
- apcluster: Affinity propagation clustering. Similar to label propagation, closeness is passed via similarity networks.
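Here is a small sketch of what "drop-in" means for the fastcluster entry above, assuming the fastcluster package is installed. The simulated data and the choice of average linkage and k = 4 are arbitrary, only there to have something to cluster.

```r
library(fastcluster)

set.seed(42)
x <- matrix(rnorm(500 * 5), ncol = 5)   # 500 simulated points, 5 features
d <- dist(x)

# Same call signature for the base implementation and the fast one
hc_base <- stats::hclust(d, method = "average")
hc_fast <- fastcluster::hclust(d, method = "average")

# The result is a regular hclust object, so downstream tools keep working
clusters <- cutree(hc_fast, k = 4)
```

Since the output is an ordinary hclust object, things like `plot()`, `cutree()` or dendextend should not need any changes.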
Supervised Learning (General Machine Learning) / Deep Learning
- caret: The R equivalent to scikit-learn: feature processing, train / test split, cross validation, model performance metrics … you name it.
- mlbench: ML benchmark datasets and functions.
- xgboost: The well-known Kaggle winning algorithm. As a matter of fact this is almost the universal choice for high performance production level models. Fast, easy to use, easy to deploy.
- modelr: An initiative to make modeling syntax more interoperable with the tidyverse.
- recipes: Auxiliary package for building design matrices.
- mlr: Similar to caret, this is a universal framework for model training.
- h2o, mltools: A distributed machine learning framework, with a community version and a commercial version that includes an AutoML implementation.
- rstudio/keras: Keras implementation in R, go Deep Learning!
- smotefamily: Synthetic oversampling methods for class imbalance problems.
- MatchIt: Mentioned under causal inference, but I feel this package is also worthy of another nomination here for generating samples with balanced covariates.
- upclass: Recently archived by CRAN, this is another package to synthesize minority samples.
- igraph: The most comprehensive graph library for R — summary statistics, distances, community structure, clustering, visualization layout algos — you name it! Must have.
- qgraph: Contains various methods for graph data viz.
- BB: Solving large systems of linear and nonlinear equations. Very fast and handy.
- VIM: Visualization and Imputation of missing values. Swiss-knife package.
- mice, Amelia: Methods for multivariate imputation. The idea is to borrow as much neighbor data as possible to improve imputation accuracy.
- missForest: One of the model based approaches: we can treat a variable with missing values as the response, fit a model on the rest of the variables, and use the predictions as imputations.
- mi, mitools: Some older packages for missing value imputation.
- randomizeR: Randomization for clinical trials.
- MonteCarlo: Name explains it, a package for MonteCarlo simulation.
- paircompviz: A bioconductor package for visualizing multiple testing comparisons.
- msa: Multiple sequence alignment procedures for DNA / RNA / protein sequence alignment. The transition matrix may be redefined for custom sequence-type data.
- Biostrings: Efficient library for Biological Strings, it may be extended to custom character sets.
- gender: Guesstimate gender from English names, producing probabilities
- babynames: U.S. baby names over the years, from Social Security Administration data. I used this package to understand time-varying name popularity when I was trying to name my daughter.
- gcookbook: This package contains data for the book R Graphics Cookbook; I found it useful for testing visualization tools and it came with a few handy utility functions out-of-the-box.
- wbstats: This package offers programmatic access to World Bank data, such as GDP, income, crime rate, education and demographics, at various geo-granularities.
- wrapr: This package can be used to debug pipe (%>%) functions.
- validate: This package comes with a rich set of functions to validate function arguments; it can be used in the backend of web services such as plumber.
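For a taste of the igraph entry above (summary statistics and distances), here is a tiny sketch, assuming igraph is installed. The six-node ring is just a toy graph chosen so the numbers are easy to verify by hand.

```r
library(igraph)

# Toy graph: an undirected ring with 6 vertices
g <- make_ring(6)

deg  <- degree(g)      # number of neighbours of each vertex
diam <- diameter(g)    # length of the longest shortest path
d    <- distances(g)   # all-pairs shortest path matrix
```

On a ring every vertex has two neighbours, and the farthest vertex is the one directly opposite, which is what `deg` and `diam` should reflect.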
Dashboard / Interactive Viz
- R/Shiny: I am not a big fan of Shiny, but it’s a handy alternative to Tableau for creating quick interactive data visualization dashboards.
- htmlwidgets: A great companion to shiny, providing many interactive tools to visualize tabular / time-series / geo-spatial data
- dygraphs: One of the best packages to visualize time-series data interactively; you may plot multiple series simultaneously and decorate them with a variety of annotations.
- DataTables (DT): A simple wrapper to convert an R data frame into an interactive data table with sorting and filtering capabilities.
- leaflet: Best package to visualize geo-spatial data although I found the integration into Jupyter notebook can be quite clumsy
- foreach: Arguably the much more robust version of for loops, supports a couple of parallel processing frameworks with (%dopar%) syntax. I found it less performant than mclapply from parallel but I like its error handling and flexibility.
- mclapply: My go-to function now for single-box parallelization if the output can be condensed into lists / arrays.
- parallel, snow: Various parallel backends that can be used with mclapply / foreach.
- readr: If you are still using the built in read.csv … don’t. This package is so much superior and easy to use. Can’t live without.
- readxl: Even though I can’t remember the last time I worked on an Excel file (everything is Google Sheets now), knowing there’s some way to read directly from Excel is great, especially when the file contains multiple sheets.
- jsonlite: No need to explain; you need some way to parse JSON.
- xml2: Though XML is becoming a thing of the past, knowing it’s still supported gives me peace of mind.
- rDrop: Conveniently read files directly from Dropbox.
- pryr: Methods to peek under the hood of R objects and functions.
- devtools: Developer tools if you are into R development.
- plumber, httr: Packages for setting up http services and sending http requests.
- glue: A very handy tool to format strings with multiple variables (equivalent to Python’s str.format). I found it super handy for generating SQL queries or debugging messages.
- memo: An awesome implementation of an LRU cache; works great with http services such as plumber.
- reticulate: Allows R to directly access python libraries and objects, if you are dual-wielding!
- roxygen2: Generate R docs from inline annotations.
- testthat: Unit testing package for R
- knitr, bookdown: Create HTML reports from R Markdown documents.
- packrat: An R dependency management system.
- IRdisplay: Used to display images / text in Jupyter with R kernel
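To show what the glue entry above looks like in practice for SQL generation, here is a minimal sketch, assuming the glue package is installed. The table name, column names and date are hypothetical placeholders, not from any real schema.

```r
library(glue)

# Hypothetical inputs, for illustration only
table_name <- "events"
start_date <- "2020-01-01"

# Expressions inside {} are evaluated and interpolated into the string
query <- glue(
  "SELECT user_id, COUNT(*) AS n FROM {table_name} ",
  "WHERE event_date >= '{start_date}' GROUP BY user_id"
)

sql_text <- as.character(query)
```

For anything that takes user input, glue also provides glue_sql, which quotes values through a DBI connection; plain string interpolation like the above is probably best kept to trusted inputs and debugging messages.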
To Be Updated …