Best Practices

Optimize costs, improve performance, secure API keys, implement reliability patterns, and deploy SkillBoss-powered AI agents safely to production.

Cost Optimization

1. Choose the Right Model

Different models have vastly different costs. Match the model to your use case:

By Use Case
| Use Case | Recommended Model | Why |
|---|---|---|
| Complex reasoning | claude-4-5-sonnet | Best quality-to-cost ratio |
| Simple tasks | gemini-2.5-flash or gpt-4o-mini | 10-20x cheaper |
| Code generation | deepseek/deepseek-v3 | Excellent for code, low cost |
| Long documents | claude-4-5-sonnet | 200K context window |
| Ultra-fast responses | gemini-2.5-flash | Lowest latency |
By Cost
// Most Expensive → Cheapest

// Premium ($$$$)
model: "gpt-5"                    // ~$15/1M tokens

// High Quality ($$$)
model: "claude-4-5-sonnet"        // ~$3/1M input, $15/1M output

// Balanced ($$)
model: "gpt-4o"                   // ~$2.50/1M input, $10/1M output
model: "gemini-2.0-pro"           // ~$1.25/1M input, $5/1M output

// Economy ($)
model: "gemini-2.5-flash"         // ~$0.10/1M input, $0.40/1M output
model: "gpt-4o-mini"              // ~$0.15/1M input, $0.60/1M output
model: "deepseek/deepseek-v3"     // ~$0.27/1M tokens (cache-enabled)
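A quick sanity check on these rates: per-request cost is just tokens times price per million. The sketch below hard-codes the illustrative figures from the list above (not live pricing):

```typescript
// Rough per-request cost estimator. Rates are the illustrative
// $/1M-token figures listed above, not live pricing.
const rates: Record<string, { input: number; output: number }> = {
  'claude-4-5-sonnet': { input: 3.0, output: 15.0 },
  'gpt-4o-mini': { input: 0.15, output: 0.6 },
  'gemini-2.5-flash': { input: 0.1, output: 0.4 },
}

function estimateCostUSD(model: string, inputTokens: number, outputTokens: number): number {
  const r = rates[model]
  if (!r) throw new Error(`Unknown model: ${model}`)
  return (inputTokens * r.input + outputTokens * r.output) / 1_000_000
}

// 10K input + 1K output tokens:
// sonnet: (10000*3 + 1000*15)/1e6  = $0.045
// flash:  (10000*0.1 + 1000*0.4)/1e6 = $0.0014 (~32x cheaper)
```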

2. Optimize Token Usage

❌ Wasteful:

const prompt = `
I need you to analyze this very long text and provide a summary.
Please make sure the summary is comprehensive and covers all the key points.
Here's the text:
${veryLongText}

Please provide:
1. A summary
2. Key takeaways
3. Action items

Thank you!
`

✅ Efficient:

const prompt = `Summarize this text with key takeaways and action items:\n\n${veryLongText}`
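To compare prompt variants before shipping, a rough character-based estimate is usually enough. This chars/4 rule is a heuristic, not a real tokenizer:

```typescript
// Rough token estimate: ~4 characters per token for English text.
// Use this only to compare prompt variants, never for billing.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

const wasteful = `I need you to analyze this very long text and provide a summary.
Please make sure the summary is comprehensive and covers all the key points.
Here's the text:
`
const efficient = `Summarize this text with key takeaways and action items:\n\n`

// The instruction overhead alone shrinks substantially
console.log(estimateTokens(wasteful), estimateTokens(efficient))
```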

Saved: ~50 tokens per request

Don't waste credits on unused output:

// ❌ Wasteful (defaults to 4096 tokens)
await client.chat.completions.create({
  model: 'claude-4-5-sonnet',
  messages: [{role: 'user', content: 'Say hi'}]
})

// ✅ Efficient (only generate what you need)
await client.chat.completions.create({
  model: 'claude-4-5-sonnet',
  messages: [{role: 'user', content: 'Say hi'}],
  max_tokens: 20  // Enough for a greeting
})

DeepSeek supports prompt caching for repeated prefixes:

// First call: Pays full price
const response1 = await client.chat.completions.create({
  model: 'deepseek/deepseek-v3',
  messages: [
    {role: 'system', content: longSystemPrompt},  // Cached
    {role: 'user', content: 'Question 1'}
  ]
})

// Second call: System prompt is cached (98% cheaper!)
const response2 = await client.chat.completions.create({
  model: 'deepseek/deepseek-v3',
  messages: [
    {role: 'system', content: longSystemPrompt},  // From cache
    {role: 'user', content: 'Question 2'}
  ]
})

Savings: ~98% on cached tokens
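The arithmetic behind that figure can be sketched as follows; the ~$0.27/1M rate and 98% cache discount are the illustrative numbers above, not guaranteed pricing:

```typescript
// Back-of-envelope savings from prompt caching. Rate and discount are
// the illustrative figures above.
function cachedCostUSD(
  promptTokens: number,
  cachedTokens: number,
  pricePerMillion = 0.27,
  cacheDiscount = 0.98
): number {
  const fresh = promptTokens - cachedTokens
  const cached = cachedTokens * (1 - cacheDiscount)
  return ((fresh + cached) * pricePerMillion) / 1_000_000
}

// 10K-token system prompt:
const firstCall = cachedCostUSD(10_000, 0)       // full price
const laterCall = cachedCostUSD(10_000, 10_000)  // fully cached prefix
```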

3. Batch Similar Requests

// ❌ Inefficient: Multiple API calls
for (const product of products) {
  const description = await generateDescription(product)
}

// ✅ Efficient: Batch in one call
const prompt = `Generate descriptions for these products:\n${products.map(p => `- ${p.name}`).join('\n')}`
const response = await client.chat.completions.create({...})
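For very large product lists, one mega-prompt can exceed the context window; chunking into fixed-size batches bounds each call (the batch size of 10 is an arbitrary example, tune it to your items' length):

```typescript
// Split a large list into batches so each prompt stays well under the
// model's context window: N items in batches of `size` -> ceil(N/size) calls.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size))
  }
  return out
}

// 25 products in batches of 10 -> 3 API calls instead of 25
const batches = chunk(Array.from({ length: 25 }, (_, i) => `Product ${i}`), 10)
```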

Performance

1. Use Streaming for Better UX

// ❌ Slow: Wait for entire response
const response = await client.chat.completions.create({
  model: 'claude-4-5-sonnet',
  messages: [{role: 'user', content: 'Write a long story'}]
})
// User waits 10-30 seconds with no feedback

// ✅ Fast: Stream tokens as generated
const stream = await client.chat.completions.create({
  model: 'claude-4-5-sonnet',
  messages: [{role: 'user', content: 'Write a long story'}],
  stream: true
})

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '')
  // User sees words appear immediately
}

2. Parallel Requests

// ❌ Sequential: Takes 3x as long
const task1 = await client.chat.completions.create({...})
const task2 = await client.chat.completions.create({...})
const task3 = await client.chat.completions.create({...})

// ✅ Parallel: 3x faster
const [task1, task2, task3] = await Promise.all([
  client.chat.completions.create({...}),
  client.chat.completions.create({...}),
  client.chat.completions.create({...})
])
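The effect is easy to verify with mocked tasks standing in for API calls; setTimeout here simulates network latency:

```typescript
// Three independent tasks run concurrently under Promise.all: total
// wall time is roughly max(delays), not sum(delays).
function fakeTask(result: string, ms: number): Promise<string> {
  return new Promise(resolve => setTimeout(() => resolve(result), ms))
}

async function runParallel(): Promise<string[]> {
  return Promise.all([
    fakeTask('summary', 100),
    fakeTask('translation', 100),
    fakeTask('keywords', 100),
  ])
}
```

Note that Promise.all rejects as soon as any task fails; use Promise.allSettled if partial results are acceptable.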

3. Choose Fast Models

| Model | Typical Latency | Best For |
|---|---|---|
| gemini-2.5-flash | ~500ms | Real-time chat, autocomplete |
| gpt-4o-mini | ~800ms | Quick responses |
| claude-3-5-haiku | ~1s | Balanced speed/quality |
| claude-4-5-sonnet | ~2-4s | Quality over speed |

Security

1. Never Expose API Keys

⚠️ Never include your API key in:

  • Public Git repositories
  • Client-side JavaScript
  • Mobile app binaries
  • URL parameters
  • Logs or error messages

// ❌ DANGEROUS: Client-side usage
'use client'  // This runs in the browser!
export function ChatComponent() {
  const client = new OpenAI({
    baseURL: 'https://api.skillboss.co/v1',
    apiKey: process.env.NEXT_PUBLIC_SKILLBOSS_KEY  // ❌ EXPOSED TO USERS!
  })
}

// ✅ SAFE: Server-side only
// app/api/chat/route.ts
export async function POST(req: Request) {
  const client = new OpenAI({
    baseURL: 'https://api.skillboss.co/v1',
    apiKey: process.env.SKILLBOSS_KEY  // ✅ Server-only, secure
  })

  const response = await client.chat.completions.create({...})
  return Response.json(response)
}

2. Use Environment Variables

# .env (add to .gitignore!)
SKILLBOSS_KEY=sk-abc123...

# .gitignore
.env
.env.local

3. Implement Rate Limiting

Protect yourself from abuse:

import rateLimit from 'express-rate-limit'

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000,  // 15 minutes
  max: 100,  // Limit each IP to 100 requests per window
  message: 'Too many requests from this IP'
})

app.use('/api/', limiter)
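Under the hood this is just a per-key counter over a time window. A minimal, dependency-free sketch of the same fixed-window idea (express-rate-limit adds headers, pluggable stores, and more):

```typescript
// Fixed-window rate limiter: each key gets `max` requests per
// `windowMs`; the counter resets when a new window starts.
class RateLimiter {
  private hits = new Map<string, { count: number; windowStart: number }>()

  constructor(private windowMs: number, private max: number) {}

  allow(key: string, now = Date.now()): boolean {
    const entry = this.hits.get(key)
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(key, { count: 1, windowStart: now })
      return true
    }
    entry.count++
    return entry.count <= this.max
  }
}
```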

Reliability

1. Implement Retry Logic

import { OpenAI } from 'openai'

async function callWithRetry<T>(
  func: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await func()
    } catch (error: any) {
      // Don't retry on client errors (4xx)
      if (error.status >= 400 && error.status < 500) {
        throw error
      }

      // Last attempt - rethrow
      if (i === maxRetries - 1) {
        throw error
      }

      // Exponential backoff: 1s, 2s, 4s... capped at 10s
      const delay = Math.min(1000 * (2 ** i), 10000)
      await new Promise(resolve => setTimeout(resolve, delay))
    }
  }
  // Unreachable when maxRetries >= 1, but satisfies TypeScript's return-type check
  throw new Error('Retry limit reached')
}

// Usage
const response = await callWithRetry(() =>
  client.chat.completions.create({...})
)

2. Implement Fallback Models

const modelTiers = [
  'claude-4-5-sonnet',  // Try premium first
  'gpt-4o',             // Fallback to GPT
  'gemini-2.5-flash'    // Fallback to Gemini
]

async function callWithFallback(messages: any[]) {
  for (const model of modelTiers) {
    try {
      return await client.chat.completions.create({
        model,
        messages
      })
    } catch (error: any) {
      if (error.status === 503) {
        // Provider down, try next
        continue
      }
      throw error
    }
  }
  throw new Error('All providers unavailable')
}

3. Monitor Balance

const response = await client.chat.completions.create({...})

// Check for low balance warning
if (response._balance_warning) {
  console.warn(`Low balance: ${response._remaining_credits} credits`)

  // Send notification
  await sendEmail({
    to: 'admin@company.com',
    subject: 'SkillBoss Balance Low',
    body: `Only ${response._remaining_credits} credits remaining`
  })

  // Optionally trigger auto-recharge
}

Production Checklist

Before deploying to production:

  • API keys stored in environment variables
  • Keys not committed to Git
  • Server-side API calls only (not client-side)
  • Rate limiting implemented
  • Input validation on user prompts
  • Retry logic with exponential backoff
  • Fallback models configured
  • Error handling for all error types
  • Timeout handling
  • Health check endpoint
  • Error tracking (Sentry, etc.)
  • Usage monitoring
  • Balance alerts configured
  • Latency monitoring
  • Cost tracking per feature
  • Right model chosen for each use case
  • max_tokens set appropriately
  • Prompt caching utilized where applicable
  • Auto-recharge configured
  • Budget alerts set
  • Streaming enabled for long responses
  • Parallel requests where possible
  • Appropriate model for latency requirements
  • Caching implemented for repeated queries

Common Patterns

Chat Interface

// Store conversation history
const messages = [
  {role: 'system', content: 'You are a helpful assistant.'}
]

async function chat(userMessage: string) {
  // Add user message
  messages.push({role: 'user', content: userMessage})

  // Get response
  const response = await client.chat.completions.create({
    model: 'claude-4-5-sonnet',
    messages,
    max_tokens: 500
  })

  // Add assistant response to history
  const assistantMessage = response.choices[0].message.content
  messages.push({role: 'assistant', content: assistantMessage})

  return assistantMessage
}
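One caveat with this pattern: the history grows without bound, and every past turn is re-billed on each call. A sketch of trimming to the system prompt plus the most recent turns (maxTurns is an arbitrary example value):

```typescript
// Keep the system message(s) plus only the most recent turns so the
// prompt size -- and cost -- stays bounded as the conversation grows.
interface Message { role: string; content: string }

function trimHistory(messages: Message[], maxTurns = 10): Message[] {
  const system = messages.filter(m => m.role === 'system')
  const rest = messages.filter(m => m.role !== 'system')
  return [...system, ...rest.slice(-maxTurns)]
}
```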

Function Calling

const response = await client.chat.completions.create({
  model: 'gpt-5',
  messages: [{role: 'user', content: 'What\'s the weather in SF?'}],
  tools: [{
    type: 'function',
    function: {
      name: 'get_weather',
      description: 'Get current weather',
      parameters: {
        type: 'object',
        properties: {
          location: {type: 'string'},
          unit: {type: 'string', enum: ['celsius', 'fahrenheit']}
        },
        required: ['location']
      }
    }
  }]
})

// Handle function call
if (response.choices[0].message.tool_calls) {
  const toolCall = response.choices[0].message.tool_calls[0]
  const weather = await getWeather(JSON.parse(toolCall.function.arguments))
  // Send function result back...
}
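"Send function result back" means appending a tool message that echoes the tool_call_id, then calling the API again for the model's final answer. Sketched below with a mocked tool call (call_123 and the weather payload are made up for illustration):

```typescript
// The tool call below is mocked; in practice it comes from
// response.choices[0].message.tool_calls[0].
const toolCall = {
  id: 'call_123',
  function: { name: 'get_weather', arguments: '{"location":"SF"}' },
}

const args = JSON.parse(toolCall.function.arguments)
const toolResult = { temperature: 18, unit: 'celsius' }  // pretend getWeather(args)

// Tool results go back as role 'tool', referencing the tool_call_id,
// appended after the assistant message that requested the call.
const followUp = [
  { role: 'tool', tool_call_id: toolCall.id, content: JSON.stringify(toolResult) },
]
// Append followUp to the conversation and call create() again.
```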

Need Help?

  • 📚 API Reference: complete API documentation
  • 📄 Troubleshooting: common issues and solutions
