Using AI within Pipelines via ellmer

Last updated on 2026-02-09

Overview

Questions

  • What is ellmer and how does it integrate with AI models?
  • How can I set up ellmer with GitHub Copilot in RStudio?
  • How can I use AI prompts to process data within my analysis pipelines?
  • What are the best practices for integrating AI into data processing workflows?

Objectives

  • Understand the ellmer package and its capabilities
  • Install and configure ellmer for use with AI models
  • Integrate AI prompts into data processing pipelines
  • Apply AI-powered transformations to datasets
  • Develop reproducible AI-enhanced workflows

Introduction


The ellmer package provides a powerful interface for integrating large language models (LLMs) into R workflows. This enables you to leverage AI capabilities directly within your data processing pipelines, combining traditional statistical computing with modern AI assistance.

What is ellmer?


ellmer is an R package that provides a unified interface to various LLM providers, including:

  • GitHub Copilot (via GitHub Models)
  • OpenAI models
  • Anthropic Claude
  • Google Gemini
  • Local models via Ollama

It allows you to:

  • Send prompts to AI models from R code
  • Process text data with AI assistance
  • Generate structured outputs
  • Stream responses for interactive applications
Callout

Why Use ellmer?

  • Unified API: Work with different AI providers using consistent syntax
  • Pipeline Integration: Seamlessly incorporate AI into tidyverse workflows
  • Reproducible: Track prompts and model versions for reproducible research
  • Flexible: Switch between models without rewriting code
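
For example, switching providers only requires changing the constructor call; the rest of your code stays the same. A sketch (authentication is covered below, each constructor needs its own credentials, and constructor names may differ slightly across ellmer versions):

R

# Pick a provider; the downstream code is identical either way
chat <- chat_github()                      # GitHub Models
# chat <- chat_openai()                    # OpenAI
# chat <- chat_anthropic()                 # Anthropic Claude
# chat <- chat_ollama(model = "llama3.2")  # local model via Ollama (example model name)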

Installing ellmer


First, install the ellmer package from CRAN or GitHub:

R

# Install from CRAN
install.packages("ellmer")

# Or install development version from GitHub
# install.packages("remotes")
remotes::install_github("tidyverse/ellmer")

Load the package:

R

library(ellmer)
library(tidyverse)  # For data manipulation

Setting Up ellmer with GitHub Copilot


To use ellmer with GitHub Copilot (via GitHub Models), you need to set up authentication.

Step 1-pat: Getting Access to GitHub Models

If you have already registered your GitHub account in RStudio to use the GitHub Copilot features, you can use GitHub Models via ellmer without further action.

Step 1-explicit: Get a GitHub Token and Store It Securely

The following steps are only needed if you have not already set up GitHub authentication in RStudio.

  1. Go to github.com/settings/tokens
  2. Click “Generate new token” → “Generate new token (classic)”
  3. Give it a descriptive name (e.g., “ellmer-access”)
  4. Select the required scopes (typically repo and user)
  5. Click “Generate token” and copy it immediately

Store your token securely in your R environment:

R

# Option 1: Set for the current session
Sys.setenv(GITHUB_TOKEN = "your-token-here")

# Option 2: Store permanently in .Renviron
usethis::edit_r_environ()
# Add this line: GITHUB_TOKEN=your-token-here
# Save and restart R
Callout

Security Best Practice

Never commit tokens or API keys to your code! Always use environment variables and add .Renviron to your .gitignore file.
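
If you use the usethis package, the .gitignore entry can be added directly from R (a convenience sketch):

R

# Ensure .Renviron is never committed to version control
usethis::use_git_ignore(".Renviron")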

Step 2: Initialize a Chat Object

R

# Create a chat session with GitHub Copilot
chat <- chat_github()  # the default model in use will be printed

# Test the connection
chat$chat("Hi, please give me a joke!")
Challenge

Challenge 1: Setup and Test ellmer

  1. Install the ellmer package
  2. Create a chat object and send a test message
  3. Verify you receive a response and enjoy your joke.
  4. What do you see when you print your chat object?

Printing the chat object shows

  • the model being used
  • the number of tokens sent and received
  • the total cost incurred (if applicable)
  • the history of messages exchanged

Using AI for Data Processing


Basic Text Processing

Process individual text strings with AI:

R

# Classify sentiment
texts <- c(
  "I love this product!",
  "This is terrible.",
  "It's okay, not great."
)

# Use AI to classify sentiment
chat <- chat_github()
# Call the chat interface once per text, e.g. via purrr::map_chr() (sapply() works too)
results <- purrr::map_chr(texts, 
  function(text) {
    chat$chat(paste("Classify the sentiment (positive/negative/neutral):", text))
  })
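
Note that a Chat object keeps the conversation history, so each call above also sees the earlier texts and answers. If the classifications should be independent, one option is to work on a fresh copy per text; a sketch, assuming your ellmer version exposes R6's $clone() method on Chat objects:

R

# Classify each text in isolation: clone the chat so earlier
# answers cannot influence later ones
results <- purrr::map_chr(texts, function(text) {
  chat$clone()$chat(
    paste("Classify the sentiment (positive/negative/neutral):", text),
    echo = "none"
  )
})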

Processing Data in Pipelines

Integrate AI into tidyverse pipelines:

R

# Example: Customer feedback analysis
feedback_data <- tibble(
  id = 1:5,
  comment = c(
    "Great service, very helpful!",
    "Long wait times, not happy.",
    "Average experience.",
    "Excellent quality and fast delivery!",
    "Product arrived damaged."
  )
)

# Add AI-powered sentiment analysis
feedback_processed <- feedback_data %>%
  rowwise() %>% # ensures each row is processed individually
  mutate(
    sentiment = chat$chat(
      paste("Classify as positive/negative/neutral:", comment)
    ),
    key_themes = chat$chat(
      paste("Extract main themes (max 3 words):", comment)
    )
  )

Aggregated calls to reduce I/O and the number of prompts

So far, one chat() call (prompt) was made per row.

An alternative is to aggregate multiple inputs into a single prompt, reducing the number of API calls:

R

feedback_data |>
  mutate(
    mood =
      comment |>
      str_c(collapse = "#") |>
      chat$chat(
        "Assign to each product feedback answer (provided as a #-separated list) a category from (happy, unhappy) in a #-separated aggregated text output",
        echo = "none"
      ) |>
      str_split_1("#")
  )

That way, only one API call is made for the entire dataset.

But when working on larger datasets, be aware of token limits per prompt (both input and output tokens).

Callout

Rate Limits and Costs

Be aware of:

  • API rate limits for your chosen provider (limited number of requests per minute/hour)
  • Limited token quotas and prompt sizes
  • Potential costs for API calls
  • Processing time for large datasets
  • Consider batching requests when possible (fewer tokens spent)

Advanced: Structured Output Generation

Request structured data from AI models:

R

# Extract structured information
extract_info <- function(text) {
  prompt <- paste0(
    "Extract the following from this text and return as JSON:\n",
    "- sentiment (positive/negative/neutral)\n",
    "- urgency (high/medium/low)\n",
    "- category (product/service/delivery/other)\n\n",
    "Text: ", text
  )
  
  chat$chat(prompt)
}

# Apply to dataset
feedback_structured <- feedback_data %>%
  rowwise() %>%
  mutate(analysis = extract_info(comment))
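
Prompting for JSON works, but the model may still wrap the result in extra prose. ellmer also offers a dedicated structured-output interface based on type specifications; a minimal sketch (see ?chat_structured in your installed version — in older ellmer releases this method was called extract_data()):

R

# Define the expected structure; type_enum() constrains allowed values
type_feedback <- type_object(
  sentiment = type_enum(values = c("positive", "negative", "neutral")),
  urgency   = type_enum(values = c("high", "medium", "low")),
  category  = type_enum(values = c("product", "service", "delivery", "other"))
)

# Returns an R list matching the specification instead of free text
chat$chat_structured(
  "Extract sentiment, urgency, and category from: Product arrived damaged.",
  type = type_feedback
)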
Challenge

Challenge 2: Build a Data Processing Pipeline

Create a tidyverse pipeline that:

  1. Creates a tibble dataset (columns review_id and text) with the following product reviews:
  • The software is intuitive but lacks some features. Rating: 4/5
  • Terrible experience, crashed multiple times. Very disappointed.
  • Perfect for my needs! Easy to use and fast. Highly recommend.
  2. Uses AI to classify the main topic of each review in up to two words
  3. Extracts a numerical satisfaction score (1-5) from the text

R

library(ellmer)
library(tidyverse)

# Sample data
reviews <- tibble(
  review_id = 1:3,
  text = c(
    "The software is intuitive but lacks some features. Rating: 4/5",
    "Terrible experience, crashed multiple times. Very disappointed.",
    "Perfect for my needs! Easy to use and fast. Highly recommend."
  )
)

# Initialize chat
chat <- chat_github()

# Process with AI
reviews %>%
  rowwise() %>%
  mutate(
    topic = chat$chat(paste("Main topic (1-2 words):", text)),
    score = chat$chat(paste("Satisfaction score 1-5:", text))
  ) %>%
  ungroup()

Local LLMs


So far, we have focused on cloud-based AI models like GitHub Copilot, which require internet access and API tokens. Since cloud APIs may incur costs and impose rate limits, local models can be a viable alternative for some use cases, especially high-volume applications or sensitive data. They also reduce your dependency on external services.

To use local models with ellmer, you can leverage Ollama, which allows you to run LLMs on your local machine.

The general steps are:

  1. Install Ollama and set up a local model that fits both your computational resources and needs
  2. Install the ellmer package
  3. Initialize a chat object for the local model

A detailed guide for setting up Ollama with ellmer can be found in the ellmer documentation.
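
Once Ollama is running and a model has been pulled, initializing a local chat is analogous to chat_github(); a sketch ("llama3.2" is an example model name, substitute one you have installed):

R

library(ellmer)

# Assumes the Ollama server is running locally with "llama3.2" pulled
chat_local <- chat_ollama(model = "llama3.2")

chat_local$chat("Classify the sentiment (positive/negative/neutral): I love this product!")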

Best Practices for AI in Pipelines


1. Design Clear Prompts

R

# Good: Specific and constrained
prompt <- "Classify sentiment as: positive, negative, or neutral. 
          Return only one word."

# Less effective: Vague
prompt <- "What do you think about this?"

2. Handle Errors Gracefully

R

safe_chat <- function(chat, prompt) {
  tryCatch(
    chat$chat(prompt),
    error = function(e) {
      warning("API call failed: ", e$message)
      NA_character_  # keep the result type stable for character columns
    }
  )
}

# Use in pipeline (rowwise, so each row gets its own call)
data %>%
  rowwise() %>%
  mutate(ai_result = safe_chat(chat, prompt))

This is especially important when working with large datasets or unreliable network connections.

3. Cache Results

Write your AI-generated data to file to avoid redundant calls:

R

# Save processed results to avoid reprocessing
processed_data <- data %>%
  mutate(ai_field = process_with_ai(text))

# Save to disk
write_rds(processed_data, "cache/processed_data.rds")

# Load cached version later
processed_data <- read_rds("cache/processed_data.rds")

This reduces redundant API calls and speeds up data processing.

You might even want to write your script so that it first checks whether a cached version exists and only processes the data if not. That way, you can easily re-run your whole analysis without incurring additional costs or delays.
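
A minimal sketch of that check-first pattern, reusing the hypothetical process_with_ai() helper from above:

R

cache_file <- "cache/processed_data.rds"

if (file.exists(cache_file)) {
  # Reuse cached results; no API calls needed
  processed_data <- read_rds(cache_file)
} else {
  # Process once, then cache for future runs
  processed_data <- data %>%
    mutate(ai_field = process_with_ai(text))
  write_rds(processed_data, cache_file)
}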

4. Use Batch Processing for Large Datasets

R

# Process in chunks to manage rate limits
process_in_batches <- function(data, batch_size = 10) {
  data %>%
    mutate(batch = ceiling(row_number() / batch_size)) %>%
    group_by(batch) %>%
    mutate(ai_result = process_with_ai(text)) %>%
    ungroup() %>%
    select(-batch)
}

This approach is especially useful for large datasets to avoid hitting API rate limits.

Furthermore, it can be nicely combined with the “aggregated calls” approach shown earlier.
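
For illustration, here is a hypothetical process_with_ai() that applies the aggregated-call pattern within each batch, so every batch costs only one API call (the # separator convention follows the earlier example):

R

# Hypothetical helper: one API call per batch of texts
process_with_ai <- function(texts) {
  prompt <- paste(
    "Classify each of the following #-separated texts as",
    "positive/negative/neutral and return only the labels,",
    "#-separated, in the same order:",
    paste(texts, collapse = "#")
  )
  str_split_1(chat$chat(prompt, echo = "none"), "#")
}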

Reproducibility Considerations


Document your AI pipeline for reproducibility:

R

# Initialize chat
chat <- chat_github()

# Record model and version
metadata <- list(
  model = chat$get_model(),
  provider = "github",
  date = Sys.Date(),
  ellmer_version = packageVersion("ellmer"),
  prompt_template = "Classify sentiment: {text}"
)

# Save with results
list(
  data = processed_data,
  metadata = metadata
) %>%
  write_rds("results_with_metadata.rds")
Discussion

Group Discussion

Consider the following questions for discussion within the class:

  • What types of data processing tasks in your work could benefit from AI integration?
  • How would you balance reproducibility with using AI models that may change over time?
  • What ethical considerations arise when using AI to process data?
  • How do you validate the quality of AI-generated classifications or summaries?

Real-World Use Cases


Text Classification

R

# Categorize research abstracts
abstracts %>%
  rowwise() %>%
  mutate(field = chat$chat(
    paste("Research field (one word):", abstract)
  ))

Data Cleaning

R

# Standardize inconsistent entries
messy_data %>%
  rowwise() %>%
  mutate(cleaned = chat$chat(
    paste("Standardize company name:", company_name_raw)
  ))

Content Generation

R

# Generate descriptions
products %>%
  rowwise() %>%
  mutate(description = chat$chat(
    paste("Write 20-word product description for:", product_name)
  ))

Translation and Localization

R

# Translate content
content %>%
  rowwise() %>%
  mutate(translated = chat$chat(
    paste("Translate to Spanish:", english_text)
  ))
Challenge

Challenge 3: Implement a Complete Workflow

Choose a scenario and implement a complete AI-enhanced pipeline:

  • Option A: Analyze a dataset of tweets/social media posts
  • Option B: Process customer support tickets
  • Option C: Categorize research papers by topic

Include error handling, caching, and metadata tracking.

R

library(ellmer)
library(tidyverse)

# Initialize
chat <- chat_github()

# Process tickets
process_ticket <- function(ticket_text) {
  list(
    category = safe_chat(chat, 
      paste("Category (billing/technical/account):", ticket_text)),
    priority = safe_chat(chat,
      paste("Priority (high/medium/low):", ticket_text)),
    suggested_response = safe_chat(chat,
      paste("Suggest 2-sentence response:", ticket_text))
  )
}

# Apply to dataset
tickets_processed <- tickets %>%
  rowwise() %>%
  mutate(analysis = list(process_ticket(description))) %>%
  unnest_wider(analysis)

# Save with metadata
output <- list(
  data = tickets_processed,
  metadata = list(
    processed = Sys.time(),
    model = chat$get_model(),
    n_tickets = nrow(tickets)
  )
)

write_rds(output, "processed_tickets.rds")

Troubleshooting Common Issues


Authentication Errors

  • Verify your token is set correctly:
    • Sys.getenv("GITHUB_PAT_GITHUB_COM") for RStudio setup
    • Sys.getenv("GITHUB_TOKEN") for explicit token setup from above
  • Ensure token has required permissions
  • Check token hasn’t expired

Rate Limiting

  • Implement delays between requests, e.g. Sys.sleep(1) (see the retry sketch after this list)
  • Use batch processing
  • Consider caching results
  • Monitor API usage
  • Consider installing local models via Ollama for heavy workloads
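
A sketch of a simple retry helper with increasing delays (chat_with_retry() is a hypothetical name, not part of ellmer):

R

# Retry a prompt a few times, waiting longer after each failure
chat_with_retry <- function(chat, prompt, max_tries = 3, wait = 2) {
  for (i in seq_len(max_tries)) {
    result <- tryCatch(chat$chat(prompt, echo = "none"),
                       error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(wait * i)  # back off a little longer each attempt
  }
  NA_character_
}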

Inconsistent Results

  • Make prompts more specific
  • Add constraints to expected outputs
  • Use temperature parameter (if supported)
  • Validate and clean AI outputs

Future Developments


Stay updated with ellmer developments:

  • New model integrations
  • Enhanced streaming capabilities
  • Better error handling
  • Performance improvements

Check the ellmer documentation regularly for updates.

Key Points
  • ellmer provides a unified interface for integrating LLMs into R workflows
  • Set up authentication using environment variables for security
  • Integrate AI prompts seamlessly into tidyverse pipelines
  • Design clear, constrained prompts for consistent results
  • Implement error handling and caching for robust pipelines
  • Document models and prompts for reproducibility
  • Consider rate limits, costs, and ethical implications when using AI in data processing
  • Local model options are available for high-volume use cases