Motivation¶
Why build yet another data science environment? This page explains the problems this project solves and the philosophy behind its design.
The Problem with Local Installations¶
If you've done data science work for any length of time, you've probably experienced these pain points:
Dependency Hell¶
Your operating system ships one version of R. Homebrew installs another. RStudio wants a third. Python's situation is even worse — system Python, Homebrew Python, pyenv, conda, virtualenv... the combinations are endless.
When something breaks, you're left debugging your environment instead of doing actual work.
The "It Works on My Machine" Problem¶
You write a brilliant analysis. You share it with a colleague. They can't run it because:
- They have a different OS
- They're missing a system library
- Their package versions don't match
- Something is configured differently
Reproducing your environment on another machine often takes longer than writing the code did.
Polluting Your Main System¶
Every data science project brings its own dependencies. Over time, your laptop accumulates:
- Multiple R versions
- Conflicting Python environments
- System libraries you installed once for one package
- Configuration files scattered across your home directory
Eventually, something breaks in a way you can't fix without nuking everything and starting over.
Remote Access Limitations¶
RStudio Desktop and JupyterLab are designed for local use. If you want to:
- Work from a different computer
- Access your environment from a tablet
- Let a colleague look at your code
- Run long jobs on a more powerful machine
...you need to set up RStudio Server or JupyterHub, which is its own project.
The Container Solution¶
Docker containers solve these problems elegantly:
| Problem | Container Solution |
|---|---|
| Dependency conflicts | Each container is isolated |
| Reproducibility | Same image runs identically everywhere |
| System pollution | Nothing installed on host |
| Remote access | Built-in web interface |
But running RStudio Server or JupyterLab in Docker has traditionally been painful:
- Separate containers — Most setups run RStudio and Jupyter in different containers, complicating shared data and context switching
- Lost packages — By default, packages are lost when containers restart
- Configuration complexity — You need to understand Docker volumes, networking, and permissions
- Architecture issues — Many images don't support ARM64 (Apple Silicon)
What This Project Does Differently¶
One Container, Two IDEs¶
Instead of orchestrating multiple containers, DataSci Homelab runs both RStudio Server and JupyterLab in a single container. They share:
- The same filesystem
- The same user account
- The same installed packages (R and Python)
- The same `/data` directory
Switch between them by changing browser tabs.
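As a rough sketch of what that looks like in practice (the image name, ports, and volume name below are placeholders, not the project's published values; 8787 and 8888 are simply the conventional RStudio Server and Jupyter ports):

```bash
# One container, two web UIs. Image name, ports, and volume name are
# illustrative assumptions -- check the project README for the real ones.
docker run -d \
  --name datasci-homelab \
  -p 8787:8787 \
  -p 8888:8888 \
  -v datasci-data:/data \
  yourname/datasci-homelab:latest

# Then open both UIs in separate browser tabs:
#   http://localhost:8787   (RStudio Server)
#   http://localhost:8888   (JupyterLab)
```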
True Package Persistence¶
Packages are stored in Docker volumes that persist across:
- Container restarts
- Container rebuilds
- Image updates
Install a package once; it's there until you explicitly remove the volume.
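The mechanism is ordinary Docker named volumes mounted over the package library directories. A minimal sketch, assuming illustrative volume names and mount paths (the real image may keep its R and Python libraries elsewhere):

```bash
# Named volumes survive `docker rm` and image upgrades; only an explicit
# `docker volume rm` deletes them. Library paths here are assumptions.
docker volume create r-site-library
docker volume create py-site-packages

docker run -d \
  --name datasci-homelab \
  -v r-site-library:/usr/local/lib/R/site-library \
  -v py-site-packages:/usr/local/lib/python3/site-packages \
  yourname/datasci-homelab:latest

# Packages installed from inside the container land in the volumes, so
# recreating the container does not wipe them:
docker rm -f datasci-homelab
docker run -d --name datasci-homelab \
  -v r-site-library:/usr/local/lib/R/site-library \
  -v py-site-packages:/usr/local/lib/python3/site-packages \
  yourname/datasci-homelab:latest
```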
Pre-Configured Everything¶
The image ships with:
- All common data science packages pre-installed
- Sensible default configurations
- Health checks for reliability
- Authentication options for security
Pull and run. No configuration required.
Native Multi-Architecture¶
Built natively for both AMD64 and ARM64. If you're on Apple Silicon, you get a native image — no Rosetta emulation, no performance penalty.
Remote-Ready by Default¶
RStudio Server and JupyterLab are web applications. Access them from:
- Your laptop
- Your phone
- A different computer
- Anywhere in the world (with Cloudflare Tunnel)
Design Philosophy¶
Batteries Included, But Swappable¶
The base image includes everything you need for common data science work. But nothing prevents you from:
- Installing additional packages at runtime
- Mounting your own configuration files
- Building on top of this image for specialized needs
Persistence Over Reproducibility (For Packages)¶
Typical Docker wisdom says: "put everything in the image." But data scientists install packages constantly. Rebuilding an image every time you need a new package is impractical.
DataSci Homelab separates:
- The base environment (in the image) — R, Python, RStudio, Jupyter
- Your packages (in volumes) — Install once, persist forever
Local-First, Remote-Optional¶
The primary use case is running on your own machine. But the same setup works for:
- A home server
- A cloud VM
- A Kubernetes pod
The Cloudflare Tunnel integration is documented but not required.
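For the remote case, a quick tunnel is one way to expose the web UI without opening any ports; the port below is an assumption (RStudio Server's conventional 8787), and the project's own tunnel documentation may describe a different setup:

```bash
# Quick tunnel: cloudflared prints a public *.trycloudflare.com URL that
# forwards to the local port. No Cloudflare account configuration needed.
cloudflared tunnel --url http://localhost:8787
```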
Opinionated Defaults, Easy Overrides¶
The default configuration reflects best practices for data science work:
- UTF-8 encoding everywhere
- POSIX line endings
- CRAN mirror configured
- Common packages pre-installed
But every default can be overridden via environment variables or mounted config files — no image rebuild required.
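For example, overrides happen at `docker run` time rather than at build time. The variable names and config path below are hypothetical placeholders, not the image's documented options; consult the configuration reference for what the image actually reads:

```bash
# Hypothetical overrides -- environment variables and a mounted R config file
# replace defaults without rebuilding the image.
docker run -d \
  --name datasci-homelab \
  -e PASSWORD=changeme \
  -e TZ=Europe/Berlin \
  -v "$PWD/Rprofile.site:/usr/local/lib/R/etc/Rprofile.site:ro" \
  yourname/datasci-homelab:latest
```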
Who Built This and Why¶
This project was built by a data scientist who got tired of:
- Reinstalling packages every time macOS updated
- Explaining environment setup to new team members
- Debugging R/Python version conflicts
- Not being able to access work from different machines
The goal was simple: define the environment once, use it everywhere, never think about it again.
If that resonates with you, welcome aboard.
Ready to try it?