What Is a Reproducible Project?

Open workflows for ecology and wildlife research

Topics:

Folder structure
Raw vs processed data
Scripts and automation
Documentation
Portable workflows

Learning Goals

By the end of this session you should be able to:

Recognise the components of a reproducible project
Understand why project structure matters
Separate raw and processed data safely
Organise scripts logically
Use relative paths
Create a simple reproducible workflow in RStudio

Does this look familiar?

Desktop/
  analysis/
    final_analysis.R
    final_analysis_v2.R
    final_analysis_v2_FINAL.R
    final_analysis_v2_FINAL_REAL.R
    figure_new_final.png
    temp_data.csv
    old_data.csv

Common problems

Which script is correct?
Which data were used?
Which figures belong to which analysis?
Can another person rerun this?

Reproducibility Is Not Just About Publishing Code

Reproducibility means:

Someone else can:

Open the project
Understand the workflow
Run the analysis
Recreate the outputs

Ideally with minimal confusion

This includes:

Data organisation
Documentation
Scripts
Dependencies
File structure

Why Reproducibility Matters in Ecology

Ecology workflows are often complex

Examples:

Camera trap datasets
Spatial layers
Occupancy models
Meta-analysis workflows
Long-term monitoring
Multi-country collaborations

Complexity increases risk

Small workflow issues can make projects difficult to reproduce.

A typical reproducible workflow

Raw data
   ↓
Cleaning scripts
   ↓
Processed data
   ↓
Analysis scripts
   ↓
Figures and tables
   ↓
Reports/manuscripts

Core components of a reproducible project

A project usually contains:

Raw data
Processed data
Scripts
Outputs
Documentation
Metadata
A project file

The structure itself is part of the methodology.

Example project structure

my_project/
├── data_raw/
├── data_processed/
├── scripts/
├── outputs/
├── docs/
├── README.md
└── my_project.Rproj
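
A quick way to check that a project follows this layout is to ask R which folders are missing. This is only a sketch in base R: the helper name `missing_dirs` is made up for illustration, and the demonstration runs in a temporary directory rather than a real project.

```r
# Folder names taken from the example structure above
expected_dirs <- c("data_raw", "data_processed", "scripts", "outputs", "docs")

# Hypothetical helper: return the expected folders that do not exist under `root`
missing_dirs <- function(root = ".", expected = expected_dirs) {
  expected[!dir.exists(file.path(root, expected))]
}

# Demonstration in a throwaway directory that has only two of the folders
demo_root <- tempfile("my_project_")
dir.create(demo_root)
dir.create(file.path(demo_root, "data_raw"))
dir.create(file.path(demo_root, "scripts"))

still_missing <- missing_dirs(demo_root)
print(still_missing)
```

In a real project you would call `missing_dirs()` from the project root and expect an empty result.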

Benefits

Easier navigation
Easier collaboration (even with your future self!)
Easier automation
Easier handover

Raw vs Processed Data

Never manually edit raw data

Raw data

Original source files
Untouched
Read-only if possible

Processed data

Cleaned data
Derived variables
Analysis-ready datasets
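
One way to make "read-only if possible" concrete is to remove write permission from the raw files once they are in place. The sketch below uses base R's `Sys.chmod()`; the helper name `lock_raw_data` is hypothetical, and on Windows `Sys.chmod()` can only toggle the read-only flag, so treat this as a starting point rather than a guarantee.

```r
# Hypothetical helper: remove write permission from every file in a raw-data folder
lock_raw_data <- function(raw_dir = "data_raw") {
  raw_files <- list.files(raw_dir, full.names = TRUE, recursive = TRUE)
  # "0444" = read-only for owner, group and others; ignore the umask so the
  # mode is applied exactly as written
  Sys.chmod(raw_files, mode = "0444", use_umask = FALSE)
  invisible(raw_files)
}

# Demonstration on a throwaway folder containing one small file
demo_raw <- tempfile("data_raw_")
dir.create(demo_raw)
demo_file <- file.path(demo_raw, "camera_trap_data.csv")
writeLines("site,count", demo_file)

locked <- lock_raw_data(demo_raw)
file_mode <- format(file.info(demo_file)$mode)
print(file_mode)
```

Cleaning scripts can still read the files; they just cannot overwrite them by accident.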

Example ecology workflow

camera_trap_data.csv
        ↓
clean_camera_data.R
        ↓
camera_data_clean.csv
        ↓
occupancy_model.R
        ↓
occupancy_results.csv
        ↓
figures/

The scripts explain exactly what happened.
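
If scripts are numbered in the order they should run, the whole chain can be rerun with one call. The following is a minimal sketch, assuming scripts are named with a numeric prefix such as `01_`, `02_`; the helper `run_pipeline` is made up for illustration, and the demo scripts only print a message.

```r
# Hypothetical helper: run every numbered script in a folder, in order
run_pipeline <- function(script_dir = "scripts") {
  scripts <- sort(list.files(script_dir,
                             pattern = "^[0-9]+_.*\\.R$",
                             full.names = TRUE))
  for (script in scripts) {
    message("Running ", basename(script))
    # Each script gets a fresh environment so objects do not leak between steps
    source(script, local = new.env(parent = globalenv()))
  }
  invisible(basename(scripts))
}

# Demonstration with two placeholder scripts, written in the "wrong" order
demo_scripts <- tempfile("scripts_")
dir.create(demo_scripts)
writeLines('message("making plot")', file.path(demo_scripts, "02_make_plot.R"))
writeLines('message("cleaning data")', file.path(demo_scripts, "01_clean_data.R"))

ran <- run_pipeline(demo_scripts)
```

Because each step reads its input from disk and writes its output to disk, the order of the file names is the order of the workflow.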

Scripts Are Better Than Manual Steps

Manual workflows are fragile

Problems with manual editing:

Difficult to track changes
Easy to introduce errors
Impossible to reproduce precisely
Hard to scale

Scripts are documentation

Good scripts explain:

What was done
Why it was done
In what order

Documentation Matters

Good projects explain themselves

Minimum documentation:

A README.md should explain:

What the project is
Where the data came from
How to run the analysis
Folder structure
Required software/packages

Future-you will appreciate this.
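
For the "required software/packages" part, R can write the details for you. This sketch saves the output of `sessionInfo()` to a text file; the helper name `record_session` and the file name `session_info.txt` are assumptions, not part of the workshop material.

```r
# Hypothetical helper: write the R version and loaded packages to a text file
record_session <- function(path = file.path("docs", "session_info.txt")) {
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  writeLines(utils::capture.output(utils::sessionInfo()), path)
  invisible(path)
}

# Demonstration: write to a temporary folder instead of a real docs/ folder
demo_path <- record_session(file.path(tempfile("docs_"), "session_info.txt"))
session_lines <- readLines(demo_path)
print(head(session_lines, 3))
```

Running this at the end of the analysis records the versions that actually produced the outputs.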

Relative Paths

Avoid absolute paths!

Bad

read.csv("C:/Users/matt/Desktop/project/data.csv")

Better

read.csv("data_raw/data.csv")

Even better

read.csv(here::here("data_raw", "data.csv"))

Why Relative Paths Matter

Absolute paths break collaboration

Absolute paths:

Only work on your computer
Break when folders move
Cause problems across operating systems

Relative paths:

Improve portability
Improve collaboration
Improve reproducibility
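
Hard-coded absolute paths are easy to miss, so it can help to search for them. The sketch below scans R scripts for quoted strings that start like an absolute path (a drive letter, `/Users/`, `/home/` or `~/`). The helper `find_absolute_paths` is hypothetical and the pattern is deliberately simple, so expect it to miss some cases.

```r
# Hypothetical helper: flag lines in R scripts that contain a quoted absolute path
find_absolute_paths <- function(script_dir = "scripts") {
  scripts <- list.files(script_dir, pattern = "\\.[Rr]$", full.names = TRUE)
  # A quote followed by a drive letter (C:/ or C:\), /Users/, /home/ or ~/
  pattern <- "[\"'](?:[A-Za-z]:[/\\\\]|/Users/|/home/|~/)"
  hits <- lapply(scripts, function(script) {
    lines <- readLines(script, warn = FALSE)
    flagged <- grep(pattern, lines, perl = TRUE)
    data.frame(
      script = rep(basename(script), length(flagged)),
      line = flagged,
      code = trimws(lines[flagged]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, hits)  # NULL if the folder contains no scripts
}

# Demonstration: one script with an absolute and a relative path
demo_scripts <- tempfile("scripts_")
dir.create(demo_scripts)
writeLines(
  c('dat <- read.csv("C:/Users/matt/Desktop/project/data.csv")',
    'dat <- read.csv("data_raw/data.csv")'),
  file.path(demo_scripts, "01_clean_data.R")
)

flagged <- find_absolute_paths(demo_scripts)
print(flagged)
```

Only the first line of the demo script is flagged; the relative path passes.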

My biggest bugbear

rm(list=ls())

This is just rude: it wipes the workspace of whoever runs your script. Restart R for a clean session instead.
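
Beyond being rude, the line does less than people expect. The sketch below shows that it removes objects but leaves attached packages and changed options alone, so it does not give a fresh session. The call is wrapped in `local()` purely so the demonstration does not clear your own workspace; the `tools` package is used only because it ships with R.

```r
old_opts <- options(digits = 4)   # change a global option, keeping the old value
library(tools)                    # attach a package that ships with R

leftovers <- local({
  x <- 1:10                       # an object in the current environment
  rm(list = ls())                 # the "clean-up" line
  list(
    workspace_empty  = length(ls()) == 0,
    package_attached = "package:tools" %in% search(),
    digits_option    = getOption("digits")
  )
})

options(old_opts)                 # put the option back
str(leftovers)
```

Objects are gone, but the package is still attached and the option is still changed, so a script that starts this way can still depend on whatever happened earlier in the session.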

Gold-standard reproducibility

Red Fox IPM

Common Mistakes

Files on Desktop
Manual spreadsheet editing
Missing scripts
Missing metadata
Unclear filenames
Mixing raw and processed data
No version control
Hard-coded paths

Hands-on exercise

Goal

Build your first reproducible research project.

You will:

Create an RStudio project
Build a folder structure
Add a small dataset
Create a cleaning script
Generate a figure
Write a README

Exercise workflow

Create project
      ↓
Add folders
      ↓
Import raw data
      ↓
Create cleaning script
      ↓
Save processed data
      ↓
Create plot
      ↓
Document project

Step 1 — Create a project

In RStudio:

File → New Project
New Directory
New Project
Name it:

wildlife_reproducibility_project

Step 2 — Create folders

Create these folders:

data_raw/
data_processed/
scripts/
outputs/
docs/

Or create them using R:

folders <- c(
  "data_raw",
  "data_processed",
  "scripts",
  "outputs",
  "docs"
)

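# Create each folder; folders that already exist are left alone without a warning
for (folder in folders) {
  dir.create(folder, showWarnings = FALSE)
}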

Step 3 — Add raw data

Download the workshop dataset from GitHub.

Save it into:

data_raw/
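
Once the file is in place, a checksum gives you a way to confirm later that the raw data have not changed. This is a sketch using `tools::md5sum()`, which ships with R; the helper name `checksum_raw_data` and the tiny demo file are made up for illustration.

```r
# Hypothetical helper: one MD5 checksum per file in the raw-data folder
checksum_raw_data <- function(raw_dir = "data_raw") {
  raw_files <- list.files(raw_dir, full.names = TRUE)
  data.frame(
    file = basename(raw_files),
    md5 = unname(tools::md5sum(raw_files)),
    stringsAsFactors = FALSE
  )
}

# Demonstration with a made-up two-line file
demo_raw <- tempfile("data_raw_")
dir.create(demo_raw)
demo_file <- file.path(demo_raw, "wildlife_data.csv")
writeLines(c("habitat,count", "forest,3"), demo_file)
before <- checksum_raw_data(demo_raw)

# Simulate an accidental manual edit, then check again
writeLines(c("habitat,count", "forest,30"), demo_file)
after <- checksum_raw_data(demo_raw)

unchanged <- identical(before$md5, after$md5)
print(unchanged)
```

Saving the first table alongside the documentation gives collaborators something to compare against.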

Step 4 — Create a cleaning script

Create:

scripts/01_clean_data.R

Example:

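library(readr)
library(dplyr)

# Read the untouched raw data
wildlife <- read_csv(
  here::here("data_raw", "wildlife_data.csv")
)

# Drop records with a missing count
wildlife_clean <- wildlife %>%
  filter(!is.na(count))

# Save the cleaned data separately; the raw file is never overwritten
write_csv(
  wildlife_clean,
  here::here("data_processed", "wildlife_clean.csv")
)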
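
To see what the filtering step does without needing the workshop file, here is the same idea in base R on a made-up data frame. The values are invented, and the real dataset may have other columns; only `habitat` and `count` are taken from the scripts in this session. Reporting how many rows were dropped is an optional extra that makes the cleaning step easier to audit.

```r
# Made-up toy data; the workshop dataset's real contents may differ
wildlife <- data.frame(
  habitat = c("forest", "forest", "grassland", "wetland"),
  count = c(4, NA, 7, NA)
)

# Base R equivalent of filter(!is.na(count))
wildlife_clean <- wildlife[!is.na(wildlife$count), ]

# Report how many rows the cleaning step removed
rows_dropped <- nrow(wildlife) - nrow(wildlife_clean)
message(rows_dropped, " of ", nrow(wildlife),
        " rows dropped because count was missing")
```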

Step 5 — Create a figure

Create:

scripts/02_make_plot.R

Example:

library(ggplot2)
library(readr)

# Read the processed data, not the raw file
wildlife_clean <- read_csv(
  here::here("data_processed", "wildlife_clean.csv")
)

# Boxplot of counts by habitat
p <- ggplot(wildlife_clean,
            aes(x = habitat,
                y = count)) +
  geom_boxplot()

# Save the figure (width and height are in inches by default)
ggsave(
  filename = here::here("outputs", "habitat_plot.png"),
  plot = p,
  width = 6,
  height = 4
)

Step 6 — Create a README

Create:

README.md

Include:

Project title
Short description
Folder structure
How to run the analysis
Required packages

Keep it simple
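
If a blank page is off-putting, a skeleton README can be generated from R and then edited by hand. The wording below is only a suggestion, and the demonstration writes to a temporary folder; in a real project the path would simply be "README.md".

```r
# Suggested skeleton; edit the text to describe your own project
readme_lines <- c(
  "# wildlife_reproducibility_project",
  "",
  "Short description: (one or two sentences about the project)",
  "",
  "## Folder structure",
  "",
  "- data_raw/: original data, never edited",
  "- data_processed/: cleaned data created by the scripts",
  "- scripts/: numbered R scripts, run in order",
  "- outputs/: figures and tables",
  "- docs/: notes and documentation",
  "",
  "## How to run the analysis",
  "",
  "1. Open the .Rproj file in RStudio",
  "2. Run scripts/01_clean_data.R",
  "3. Run scripts/02_make_plot.R",
  "",
  "## Required packages",
  "",
  "readr, dplyr, ggplot2, here"
)

# In a real project use "README.md"; a temporary path is used for the demo
readme_path <- file.path(tempdir(), "README.md")
writeLines(readme_lines, readme_path)
```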

What did we build?

You now have:

A reproducible folder structure
Raw and processed data separation
Documented scripts
Reproducible outputs
Basic documentation

Useful Links