Best Practices
Optimize costs, improve performance, secure API keys, implement reliability patterns, and deploy SkillBoss-powered AI agents safely to production.
Cost Optimization
1. Choose the Right Model
Different models have vastly different costs. Match the model to your use case:
By Use Case
| Use Case | Recommended Model | Why |
|---|---|---|
| Complex reasoning | claude-4-5-sonnet | Best quality-to-cost ratio |
| Simple tasks | gemini-2.5-flash or gpt-4o-mini | 10-20x cheaper |
| Code generation | deepseek/deepseek-v3 | Excellent for code, low cost |
| Long documents | claude-4-5-sonnet | 200K context window |
| Ultra-fast responses | gemini-2.5-flash | Lowest latency |
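The table above can be encoded as a simple lookup so that model choice lives in one place. The use-case labels and the `pickModel` helper below are illustrative, not part of the SkillBoss API; adjust the mapping as models and pricing change:

```typescript
type UseCase =
  | 'complex-reasoning'
  | 'simple-task'
  | 'code-generation'
  | 'long-document'
  | 'ultra-fast'

// Lookup derived from the table above.
const MODEL_BY_USE_CASE: Record<UseCase, string> = {
  'complex-reasoning': 'claude-4-5-sonnet',
  'simple-task': 'gemini-2.5-flash',
  'code-generation': 'deepseek/deepseek-v3',
  'long-document': 'claude-4-5-sonnet',
  'ultra-fast': 'gemini-2.5-flash'
}

function pickModel(useCase: UseCase): string {
  return MODEL_BY_USE_CASE[useCase]
}
```

Centralizing the mapping means a pricing change is a one-line edit rather than a hunt through every call site.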
By Cost
// Most Expensive → Cheapest
// Premium ($$$$)
model: "gpt-5" // ~$15/1M tokens
// High Quality ($$$)
model: "claude-4-5-sonnet" // ~$3/1M input, $15/1M output
// Balanced ($$)
model: "gpt-4o" // ~$2.50/1M input, $10/1M output
model: "gemini-2.0-pro" // ~$1.25/1M input, $5/1M output
// Economy ($)
model: "gemini-2.5-flash" // ~$0.10/1M input, $0.40/1M output
model: "gpt-4o-mini" // ~$0.15/1M input, $0.60/1M output
model: "deepseek/deepseek-v3" // ~$0.27/1M tokens (cache-enabled)
2. Optimize Token Usage
❌ Wasteful:
const prompt = `
I need you to analyze this very long text and provide a summary.
Please make sure the summary is comprehensive and covers all the key points.
Here's the text:
${veryLongText}
Please provide:
1. A summary
2. Key takeaways
3. Action items
Thank you!
`
✅ Efficient:
const prompt = `Summarize this text with key takeaways and action items:\n\n${veryLongText}`
Saved: ~50 tokens per request
Don't waste credits on unused output:
// ❌ Wasteful (defaults to 4096 tokens)
await client.chat.completions.create({
  model: 'claude-4-5-sonnet',
  messages: [{role: 'user', content: 'Say hi'}]
})

// ✅ Efficient (only generate what you need)
await client.chat.completions.create({
  model: 'claude-4-5-sonnet',
  messages: [{role: 'user', content: 'Say hi'}],
  max_tokens: 20 // Enough for a greeting
})
DeepSeek supports prompt caching for repeated prefixes:
// First call: pays full price
const response1 = await client.chat.completions.create({
  model: 'deepseek/deepseek-v3',
  messages: [
    {role: 'system', content: longSystemPrompt}, // Cached
    {role: 'user', content: 'Question 1'}
  ]
})

// Second call: system prompt is served from cache (~98% cheaper)
const response2 = await client.chat.completions.create({
  model: 'deepseek/deepseek-v3',
  messages: [
    {role: 'system', content: longSystemPrompt}, // From cache
    {role: 'user', content: 'Question 2'}
  ]
})
Savings: ~98% on cached tokens
3. Batch Similar Requests
// ❌ Inefficient: one API call per product
for (const product of products) {
  const description = await generateDescription(product)
}

// ✅ Efficient: batch in one call
const prompt = `Generate descriptions for these products:\n${products.map(p => `- ${p.name}`).join('\n')}`
const response = await client.chat.completions.create({...})
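One pitfall when batching is matching outputs back to inputs. A sketch, assuming a hypothetical `Product` shape, that numbers each item so the model's numbered answers can be mapped back:

```typescript
// Hypothetical product shape for illustration.
interface Product { name: string }

// Number each product so responses can be matched back to inputs.
function buildBatchPrompt(products: Product[]): string {
  const list = products.map((p, i) => `${i + 1}. ${p.name}`).join('\n')
  return `Generate a one-sentence description for each product below.\nAnswer with the same numbering.\n${list}`
}

// Usage (client as configured earlier):
// const response = await client.chat.completions.create({
//   model: 'gpt-4o-mini',
//   messages: [{role: 'user', content: buildBatchPrompt(products)}]
// })
```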
Performance
1. Use Streaming for Better UX
// ❌ Slow: wait for the entire response
const response = await client.chat.completions.create({
  model: 'claude-4-5-sonnet',
  messages: [{role: 'user', content: 'Write a long story'}]
})
// User waits 10-30 seconds with no feedback

// ✅ Fast: stream tokens as they are generated
const stream = await client.chat.completions.create({
  model: 'claude-4-5-sonnet',
  messages: [{role: 'user', content: 'Write a long story'}],
  stream: true
})

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '')
  // User sees words appear immediately
}
2. Parallel Requests
// ❌ Sequential: takes 3x as long
const task1 = await client.chat.completions.create({...})
const task2 = await client.chat.completions.create({...})
const task3 = await client.chat.completions.create({...})

// ✅ Parallel: ~3x faster
const [res1, res2, res3] = await Promise.all([
  client.chat.completions.create({...}),
  client.chat.completions.create({...}),
  client.chat.completions.create({...})
])
3. Choose Fast Models
| Model | Typical Latency | Best For |
|---|---|---|
| gemini-2.5-flash | ~500ms | Real-time chat, autocomplete |
| gpt-4o-mini | ~800ms | Quick responses |
| claude-3-5-haiku | ~1s | Balanced speed/quality |
| claude-4-5-sonnet | ~2-4s | Quality over speed |
Security
1. Never Expose API Keys
⚠️ Never include your API key in:
- Public Git repositories
- Client-side JavaScript
- Mobile app binaries
- URL parameters
- Logs or error messages
// ❌ DANGEROUS: client-side usage
'use client' // This runs in the browser!

export function ChatComponent() {
  const client = new OpenAI({
    baseURL: 'https://api.skillboss.co/v1',
    apiKey: process.env.NEXT_PUBLIC_SKILLBOSS_KEY // ❌ EXPOSED TO USERS!
  })
}

// ✅ SAFE: server-side only
// app/api/chat/route.ts
export async function POST(req: Request) {
  const client = new OpenAI({
    baseURL: 'https://api.skillboss.co/v1',
    apiKey: process.env.SKILLBOSS_KEY // ✅ Server-only, secure
  })

  const response = await client.chat.completions.create({...})
  return Response.json(response)
}
2. Use Environment Variables
# .env (add to .gitignore!)
SKILLBOSS_KEY=sk-abc123...
# .gitignore
.env
.env.local
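A small fail-fast check at startup catches a missing key before the first request does. `requireEnv` is an illustrative helper, not part of any SDK:

```typescript
// Fail fast if a required environment variable is missing.
function requireEnv(name: string): string {
  const value = process.env[name]
  if (!value) {
    throw new Error(`${name} is not set; add it to your environment or .env file`)
  }
  return value
}

// At startup:
// const apiKey = requireEnv('SKILLBOSS_KEY')
```

Failing at boot turns a confusing mid-request 401 into an immediate, obvious configuration error.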
3. Implement Rate Limiting
Protect yourself from abuse:
import rateLimit from 'express-rate-limit'

const limiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // Limit each IP to 100 requests per window
  message: 'Too many requests from this IP'
})

app.use('/api/', limiter)
Reliability
1. Implement Retry Logic
import { OpenAI } from 'openai'

async function callWithRetry(
  func: () => Promise<any>,
  maxRetries = 3
): Promise<any> {
  for (let i = 0; i < maxRetries; i++) {
    try {
      return await func()
    } catch (error: any) {
      // Don't retry on client errors (except 429 rate limits, which are transient)
      if (error.status >= 400 && error.status < 500 && error.status !== 429) {
        throw error
      }
      // Last attempt - rethrow
      if (i === maxRetries - 1) {
        throw error
      }
      // Exponential backoff: 1s, 2s, 4s, ... capped at 10s
      const delay = Math.min(1000 * (2 ** i), 10000)
      await new Promise(resolve => setTimeout(resolve, delay))
    }
  }
}

// Usage
const response = await callWithRetry(() =>
  client.chat.completions.create({...})
)
2. Implement Fallback Models
const modelTiers = [
  'claude-4-5-sonnet', // Try premium first
  'gpt-4o', // Fall back to GPT
  'gemini-2.5-flash' // Fall back to Gemini
]

async function callWithFallback(messages: any[]) {
  for (const model of modelTiers) {
    try {
      return await client.chat.completions.create({
        model,
        messages
      })
    } catch (error: any) {
      if (error.status === 503) {
        // Provider down, try the next tier
        continue
      }
      throw error
    }
  }
  throw new Error('All providers unavailable')
}
3. Monitor Balance
const response = await client.chat.completions.create({...})

// Check for low balance warning
if (response._balance_warning) {
  console.warn(`Low balance: ${response._remaining_credits} credits`)

  // Send notification
  await sendEmail({
    to: 'admin@company.com',
    subject: 'SkillBoss Balance Low',
    body: `Only ${response._remaining_credits} credits remaining`
  })

  // Optionally trigger auto-recharge
}
Production Checklist
Before deploying to production:
Security
- API keys stored in environment variables
- Keys not committed to Git
- Server-side API calls only (not client-side)
- Rate limiting implemented
- Input validation on user prompts

Reliability
- Retry logic with exponential backoff
- Fallback models configured
- Error handling for all error types
- Timeout handling
- Health check endpoint

Monitoring
- Error tracking (Sentry, etc.)
- Usage monitoring
- Balance alerts configured
- Latency monitoring
- Cost tracking per feature

Cost
- Right model chosen for each use case
- max_tokens set appropriately
- Prompt caching utilized where applicable
- Auto-recharge configured
- Budget alerts set

Performance
- Streaming enabled for long responses
- Parallel requests where possible
- Appropriate model chosen for latency requirements
- Caching implemented for repeated queries
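Two checklist items, input validation and timeout handling, can be sketched briefly. The character limit is an illustrative assumption; tune it for your application:

```typescript
// Illustrative limit; tune for your application.
const MAX_PROMPT_CHARS = 8000

// Validate user-supplied prompts before they reach the API.
function validatePrompt(input: unknown): string {
  if (typeof input !== 'string') throw new Error('Prompt must be a string')
  const trimmed = input.trim()
  if (trimmed.length === 0) throw new Error('Prompt must not be empty')
  if (trimmed.length > MAX_PROMPT_CHARS) {
    throw new Error(`Prompt exceeds ${MAX_PROMPT_CHARS} characters`)
  }
  return trimmed
}

// Timeout handling: the openai Node SDK accepts per-request options,
// so a hung upstream call fails instead of blocking forever:
// await client.chat.completions.create({...}, { timeout: 30_000 })
```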
Common Patterns
Chat Interface
// Store conversation history
const messages = [
  {role: 'system', content: 'You are a helpful assistant.'}
]

async function chat(userMessage: string) {
  // Add user message
  messages.push({role: 'user', content: userMessage})

  // Get response
  const response = await client.chat.completions.create({
    model: 'claude-4-5-sonnet',
    messages,
    max_tokens: 500
  })

  // Add assistant response to history
  const assistantMessage = response.choices[0].message.content
  messages.push({role: 'assistant', content: assistantMessage})

  return assistantMessage
}
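The history above grows without bound, so long conversations eventually overflow the context window and cost more per request. A minimal trimming sketch (the cap of 20 messages is an arbitrary assumption):

```typescript
type Msg = { role: 'system' | 'user' | 'assistant'; content: string }

// Keep system messages plus the last `keep` non-system messages.
function trimHistory(messages: Msg[], keep = 20): Msg[] {
  const system = messages.filter(m => m.role === 'system')
  const rest = messages.filter(m => m.role !== 'system')
  return [...system, ...rest.slice(-keep)]
}
```

Call it before each request (e.g. `trimHistory(messages)`) so the system prompt always survives while old turns are dropped.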
Function Calling
const response = await client.chat.completions.create({
  model: 'gpt-5',
  messages: [{role: 'user', content: "What's the weather in SF?"}],
  tools: [{
    type: 'function',
    function: {
      name: 'get_weather',
      description: 'Get current weather',
      parameters: {
        type: 'object',
        properties: {
          location: {type: 'string'},
          unit: {type: 'string', enum: ['celsius', 'fahrenheit']}
        },
        required: ['location']
      }
    }
  }]
})

// Handle function call
if (response.choices[0].message.tool_calls) {
  const toolCall = response.choices[0].message.tool_calls[0]
  const weather = await getWeather(JSON.parse(toolCall.function.arguments))
  // Send function result back...
}
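To complete the loop, the standard OpenAI tool-calling flow appends the assistant message and a `role: 'tool'` message carrying the result, then calls the API again. The `buildToolResult` helper below is illustrative:

```typescript
// Build the tool-result message in the standard OpenAI chat shape:
// role 'tool' plus the id of the tool call it answers.
function buildToolResult(toolCallId: string, result: unknown) {
  return {
    role: 'tool' as const,
    tool_call_id: toolCallId,
    content: JSON.stringify(result)
  }
}

// Usage, continuing the example above:
// const followUp = await client.chat.completions.create({
//   model: 'gpt-5',
//   messages: [
//     {role: 'user', content: "What's the weather in SF?"},
//     response.choices[0].message, // the assistant message with the tool call
//     buildToolResult(toolCall.id, weather)
//   ]
// })
// followUp.choices[0].message.content is the natural-language answer
```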
Need Help?
- 📚 API Reference: Complete API documentation
- 📄 Troubleshooting: Common issues and solutions