Image and Vision APIs: Multi-Modal Applications

Build multi-modal applications with Claude and GPT vision APIs including image analysis, OCR, document extraction, and structured output in Node.js.

Overview

Vision capabilities in large language models have fundamentally changed what we can build with APIs. Instead of chaining together specialized OCR libraries, object detection models, and natural language processing pipelines, you can now send an image to Claude or GPT-4 and get back structured, intelligent analysis in a single API call. This article walks through the practical details of building multi-modal applications in Node.js, from encoding and sending images to building production-ready analysis pipelines with real cost and performance considerations.

Prerequisites

  • Node.js 18 or later installed
  • An Anthropic API key (for Claude vision)
  • An OpenAI API key (for GPT-4 vision)
  • Basic familiarity with Express.js and REST APIs
  • Understanding of base64 encoding and image formats

Install the dependencies you will need:

npm install @anthropic-ai/sdk openai express multer sharp

The Multi-Modal Landscape

The two dominant vision APIs right now are Claude (Anthropic) and GPT-4 Vision (OpenAI). Both accept images alongside text prompts, but they differ in meaningful ways that affect how you architect your applications.

Claude Vision (Claude 3 Opus, Sonnet, and Haiku) accepts images via base64-encoded data or direct URLs. Claude tends to be exceptionally strong at document understanding, chart interpretation, and following detailed extraction instructions. It supports up to 20 images per request and handles images up to approximately 5MB after encoding.

GPT-4 Vision (GPT-4o and GPT-4 Turbo) also accepts base64 and URL-based images. OpenAI introduced a detail parameter that gives you explicit control over how much processing power the model spends analyzing your image. GPT-4 Vision handles complex spatial reasoning well and has broad general-purpose vision capabilities.

Both models understand photographs, screenshots, diagrams, charts, handwritten text, documents, and more. The practical differences tend to show up in edge cases: Claude handles dense tabular data and multi-page documents more reliably, while GPT-4o tends to be faster for simple classification tasks.

Sending Images to Claude

Claude accepts images in the content array of a message. Each image is a content block with type image and either base64 data or a URL source.

Base64 Encoding

var Anthropic = require("@anthropic-ai/sdk");
var fs = require("fs");
var path = require("path");

var client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

function analyzeImageWithClaude(imagePath, prompt) {
  var imageBuffer = fs.readFileSync(imagePath);
  var base64Data = imageBuffer.toString("base64");
  var extension = path.extname(imagePath).toLowerCase().replace(".", "");

  var mediaTypeMap = {
    jpg: "image/jpeg",
    jpeg: "image/jpeg",
    png: "image/png",
    gif: "image/gif",
    webp: "image/webp"
  };

  var mediaType = mediaTypeMap[extension] || "image/jpeg";

  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: mediaType,
              data: base64Data
            }
          },
          {
            type: "text",
            text: prompt
          }
        ]
      }
    ]
  });
}

analyzeImageWithClaude("./receipt.jpg", "Extract the total amount, date, and merchant name from this receipt. Return JSON.")
  .then(function (response) {
    console.log(response.content[0].text);
  })
  .catch(function (err) {
    console.error("Analysis failed:", err.message);
  });

URL-Based Images

Claude also accepts publicly accessible image URLs directly, which avoids the overhead of base64 encoding:

function analyzeImageUrl(imageUrl, prompt) {
  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "url",
              url: imageUrl
            }
          },
          {
            type: "text",
            text: prompt
          }
        ]
      }
    ]
  });
}

URL-based images are simpler and reduce your request payload size, but the image must be publicly accessible. For private or uploaded images, base64 is the way to go.
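
If the image lives behind authentication, one workable pattern is to download it server-side and base64-encode it yourself before calling the API. The sketch below assumes Node 18+ (which ships a global fetch) and a bearer-token-protected URL; the header, URL, and helper name are illustrative, not part of either SDK.

// Download a protected image and convert it to base64 for a Claude request.
// The Authorization header and URL are placeholders for your own storage service.
function fetchAndEncode(imageUrl, token) {
  return fetch(imageUrl, {
    headers: { Authorization: "Bearer " + token }
  }).then(function (res) {
    if (!res.ok) throw new Error("Image fetch failed with status " + res.status);
    var mediaType = res.headers.get("content-type") || "image/jpeg";
    return res.arrayBuffer().then(function (arrayBuffer) {
      return {
        base64: Buffer.from(arrayBuffer).toString("base64"),
        mediaType: mediaType
      };
    });
  });
}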

Sending Images to OpenAI

OpenAI's vision API uses a slightly different structure. Images go inside the content array as image_url type blocks.

Detail Levels

OpenAI provides a detail parameter with three options:

  • low: The model receives a fixed 512x512 thumbnail. Costs 85 tokens per image regardless of size. Use this for simple classification, yes/no questions, or when speed matters more than accuracy.
  • high: The model receives the full-resolution image, tiled into 512x512 chunks. Costs 85 tokens base plus 170 tokens per tile. Use this for OCR, reading small text, or detailed analysis.
  • auto: The model decides based on the image size. This is the default.

Here is a GPT-4o call that takes the detail level as a parameter:

var OpenAI = require("openai");
var fs = require("fs");

var openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

function analyzeWithGPT(base64Data, mediaType, prompt, detail) {
  detail = detail || "auto";

  return openai.chat.completions.create({
    model: "gpt-4o",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image_url",
            image_url: {
              url: "data:" + mediaType + ";base64," + base64Data,
              detail: detail
            }
          },
          {
            type: "text",
            text: prompt
          }
        ]
      }
    ]
  });
}

// Low detail for quick classification
var imageData = fs.readFileSync("./photo.jpg").toString("base64");
analyzeWithGPT(imageData, "image/jpeg", "Is this a photo of food? Answer yes or no.", "low")
  .then(function (response) {
    console.log(response.choices[0].message.content);
  });

The detail parameter is one of OpenAI's genuinely useful features. In production, the difference between low and high can be a 10x cost difference per image, and for many use cases low is perfectly sufficient.

Supported Image Formats and Size Limits

Feature                 | Claude                | GPT-4 Vision
JPEG                    | Yes                   | Yes
PNG                     | Yes                   | Yes
GIF                     | Yes (first frame)     | Yes (first frame)
WebP                    | Yes                   | Yes
Max file size           | ~5 MB per image       | 20 MB
Max dimensions          | 8000x8000 px          | Scaled to fit 2048x2048, then short side to 768 px (high detail)
Max images per request  | 20                    | 10+

In practice, you almost never want to send a raw high-resolution image. A 4000x3000 JPEG from a phone camera is 3-5 MB and will eat through your token budget. Preprocessing is essential.
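
Before reaching for the preprocessing covered in the next section, it can help to inspect an image's metadata and decide whether it needs any work at all. This is a minimal sketch using Sharp's metadata() call; the thresholds mirror the limits in the table above and are a starting point, not exact provider rules.

var sharp = require("sharp");

// Rough pre-flight check against the limits above. Adjust the thresholds
// to match the provider you are targeting.
function checkImageLimits(buffer) {
  return sharp(buffer).metadata().then(function (meta) {
    var sizeMB = buffer.length / (1024 * 1024);
    return {
      format: meta.format,        // "jpeg", "png", "webp", "gif", ...
      width: meta.width,
      height: meta.height,
      sizeMB: Math.round(sizeMB * 10) / 10,
      needsResize: meta.width > 1568 || meta.height > 1568,
      tooLargeForClaude: sizeMB > 5,
      tooLargeForOpenAI: sizeMB > 20
    };
  });
}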

Image Preprocessing and Optimization

Sharp is the best image processing library in the Node.js ecosystem. It is fast, memory-efficient, and handles all the formats you need.

var sharp = require("sharp");

function preprocessImage(inputPath, options) {
  options = options || {};
  var maxWidth = options.maxWidth || 1568;
  var maxHeight = options.maxHeight || 1568;
  var quality = options.quality || 80;
  var format = options.format || "jpeg";

  return sharp(inputPath)
    .resize(maxWidth, maxHeight, {
      fit: "inside",
      withoutEnlargement: true
    })
    .toFormat(format, { quality: quality })
    .toBuffer()
    .then(function (buffer) {
      return {
        buffer: buffer,
        base64: buffer.toString("base64"),
        mediaType: "image/" + format,
        sizeKB: Math.round(buffer.length / 1024)
      };
    });
}

// Resize a large photo to a reasonable size for API analysis
preprocessImage("./large-photo.jpg", { maxWidth: 1024, quality: 75 })
  .then(function (result) {
    console.log("Optimized size:", result.sizeKB, "KB");
    // Now send result.base64 to Claude or GPT
  });

Claude recommends images no larger than 1568 pixels on any side for optimal performance. For most analysis tasks, 1024 pixels on the long side is more than enough and keeps costs down. I routinely resize to 1024 and compress to quality 75 JPEG for general analysis, and only go higher for OCR of small text.
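
One convenient way to apply those numbers is a small preset wrapper around preprocessImage() from above. The preset values reflect the recommendations in the previous paragraph; treat them as sensible defaults rather than hard rules, and the file name is a placeholder.

// Preset wrapper around preprocessImage(). "general" favors cost, "ocr" favors
// legibility of small text.
var PRESETS = {
  general: { maxWidth: 1024, maxHeight: 1024, quality: 75 },
  ocr: { maxWidth: 1568, maxHeight: 1568, quality: 85 }
};

function preprocessForTask(inputPath, task) {
  var preset = PRESETS[task] || PRESETS.general;
  return preprocessImage(inputPath, preset);
}

preprocessForTask("./invoice-scan.jpg", "ocr").then(function (result) {
  console.log("OCR-ready image:", result.sizeKB, "KB");
});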

Building an Image Analysis Pipeline

A real-world pipeline handles upload, validation, preprocessing, analysis, and storage. Here is a complete pipeline function:

var sharp = require("sharp");
var Anthropic = require("@anthropic-ai/sdk");
var crypto = require("crypto");

var client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

var ALLOWED_TYPES = ["image/jpeg", "image/png", "image/webp", "image/gif"];
var MAX_INPUT_SIZE = 10 * 1024 * 1024; // 10 MB input limit

function processAndAnalyze(fileBuffer, mimeType, analysisPrompt) {
  // Step 1: Validate
  if (ALLOWED_TYPES.indexOf(mimeType) === -1) {
    return Promise.reject(new Error("Unsupported image type: " + mimeType));
  }
  if (fileBuffer.length > MAX_INPUT_SIZE) {
    return Promise.reject(new Error("Image exceeds 10 MB input limit"));
  }

  var imageHash = crypto.createHash("sha256").update(fileBuffer).digest("hex").substring(0, 16);

  // Step 2: Preprocess
  return sharp(fileBuffer)
    .resize(1568, 1568, { fit: "inside", withoutEnlargement: true })
    .jpeg({ quality: 80 })
    .toBuffer()
    .then(function (optimizedBuffer) {
      var base64Data = optimizedBuffer.toString("base64");

      // Step 3: Analyze
      return client.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 4096,
        messages: [
          {
            role: "user",
            content: [
              {
                type: "image",
                source: {
                  type: "base64",
                  media_type: "image/jpeg",
                  data: base64Data
                }
              },
              {
                type: "text",
                text: analysisPrompt
              }
            ]
          }
        ]
      });
    })
    .then(function (response) {
      return {
        id: imageHash,
        analysis: response.content[0].text,
        model: response.model,
        inputTokens: response.usage.input_tokens,
        outputTokens: response.usage.output_tokens
      };
    });
}
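
A caller might use the pipeline like this. The file path and prompt are placeholders; the point is that the function accepts a raw upload buffer plus a MIME type and returns the hashed ID, the analysis text, and the token usage.

var fs = require("fs");

// Example caller for processAndAnalyze(). Path and prompt are illustrative.
var buffer = fs.readFileSync("./warehouse-shelf.jpg");

processAndAnalyze(buffer, "image/jpeg", "List every product visible on this shelf and estimate a count for each.")
  .then(function (result) {
    console.log("Analysis", result.id, "used", result.inputTokens, "input tokens");
    console.log(result.analysis);
  })
  .catch(function (err) {
    console.error("Pipeline failed:", err.message);
  });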

OCR and Document Extraction

Vision models have largely replaced traditional OCR for many use cases. They understand context, handle messy layouts, and can extract structured data without the brittle rule-based parsing that Tesseract or similar tools require.

function extractDocumentData(imageBuffer) {
  var base64Data = imageBuffer.toString("base64");

  var extractionPrompt = [
    "Analyze this document image and extract all text content.",
    "Preserve the document structure including:",
    "- Headers and section titles",
    "- Tables (format as markdown tables)",
    "- Lists and bullet points",
    "- Any form fields with their labels and values",
    "",
    "Return the extracted content in structured markdown format.",
    "If any text is unclear or partially visible, indicate this with [unclear]."
  ].join("\n");

  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: "image/jpeg",
              data: base64Data
            }
          },
          {
            type: "text",
            text: extractionPrompt
          }
        ]
      }
    ]
  });
}

// For structured data extraction like invoices
function extractInvoiceData(imageBuffer) {
  var base64Data = imageBuffer.toString("base64");

  var prompt = [
    "Extract the following fields from this invoice image.",
    "Return ONLY valid JSON with these fields:",
    "{",
    '  "vendor_name": "string",',
    '  "invoice_number": "string",',
    '  "invoice_date": "YYYY-MM-DD",',
    '  "due_date": "YYYY-MM-DD or null",',
    '  "line_items": [{"description": "string", "quantity": number, "unit_price": number, "total": number}],',
    '  "subtotal": number,',
    '  "tax": number,',
    '  "total": number,',
    '  "currency": "string (3-letter code)"',
    "}",
    "If a field is not visible, use null."
  ].join("\n");

  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/jpeg", data: base64Data }
          },
          { type: "text", text: prompt }
        ]
      }
    ]
  }).then(function (response) {
    var text = response.content[0].text;
    // Strip markdown code fences if the model wraps JSON in them
    text = text.replace(/```json\n?/g, "").replace(/```\n?/g, "").trim();
    return JSON.parse(text);
  });
}

This approach consistently outperforms Tesseract for documents with complex layouts, tables, or mixed content. The model understands what an invoice looks like, what fields to expect, and how to normalize the data. Traditional OCR gives you raw text; vision models give you understanding.
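
Because the model can still misread a figure or drop a field, it is worth running a defensive check on the parsed result before trusting it downstream. This is a minimal sketch: the field names come from the extraction prompt above, and the 0.05 tolerance on the arithmetic check is an arbitrary illustrative value.

// Sanity-check the parsed invoice JSON before it enters your system.
function validateInvoice(invoice) {
  var errors = [];

  ["vendor_name", "invoice_number", "total"].forEach(function (field) {
    if (invoice[field] === undefined || invoice[field] === null) {
      errors.push("Missing field: " + field);
    }
  });

  // Check that line items plus tax roughly add up to the stated total
  if (Array.isArray(invoice.line_items) && typeof invoice.total === "number") {
    var lineSum = invoice.line_items.reduce(function (sum, item) {
      return sum + (typeof item.total === "number" ? item.total : 0);
    }, 0);
    var expected = lineSum + (typeof invoice.tax === "number" ? invoice.tax : 0);
    if (Math.abs(expected - invoice.total) > 0.05) {
      errors.push("Line items + tax (" + expected.toFixed(2) + ") do not match total (" + invoice.total + ")");
    }
  }

  return { valid: errors.length === 0, errors: errors };
}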

Comparing Images and Detecting Differences

Multi-image requests let you ask the model to compare two or more images. This is useful for quality assurance, change detection, and A/B testing visual designs.

function compareImages(imageBufferA, imageBufferB, comparisonPrompt) {
  var base64A = imageBufferA.toString("base64");
  var base64B = imageBufferB.toString("base64");

  var defaultPrompt = [
    "Compare these two images carefully.",
    "Image 1 is the 'before' version and Image 2 is the 'after' version.",
    "List all visible differences between them.",
    "For each difference, describe:",
    "- What changed",
    "- Where in the image the change is located",
    "- Whether the change appears intentional or could be a defect"
  ].join("\n");

  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2048,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "text",
            text: "Image 1 (before):"
          },
          {
            type: "image",
            source: { type: "base64", media_type: "image/jpeg", data: base64A }
          },
          {
            type: "text",
            text: "Image 2 (after):"
          },
          {
            type: "image",
            source: { type: "base64", media_type: "image/jpeg", data: base64B }
          },
          {
            type: "text",
            text: comparisonPrompt || defaultPrompt
          }
        ]
      }
    ]
  });
}

I have found this pattern invaluable for automated visual regression testing. You can compare screenshots of your application before and after deployments, and the model will describe exactly what changed in plain English. It is not pixel-perfect, but it catches the kinds of visual regressions that matter to users.
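
A typical call for that use case looks something like the following. The screenshot file names are placeholders, and the prompt simply narrows the default comparison to layout-level changes.

var fs = require("fs");

// Compare two deployment screenshots of the same page. File names are placeholders.
var before = fs.readFileSync("./screenshots/checkout-before.jpg");
var after = fs.readFileSync("./screenshots/checkout-after.jpg");

compareImages(before, after, "These are screenshots of the same checkout page before and after a deployment. List any layout shifts, missing elements, or text changes.")
  .then(function (response) {
    console.log(response.content[0].text);
  });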

Video Frame Analysis

Vision APIs do not accept video directly, but you can extract key frames and analyze them individually. The ffmpeg command-line tool paired with Node.js works well for this.

var execFile = require("child_process").execFile;
var fs = require("fs");
var path = require("path");
var os = require("os");

function extractKeyFrames(videoPath, intervalSeconds) {
  intervalSeconds = intervalSeconds || 5;
  var outputDir = fs.mkdtempSync(path.join(os.tmpdir(), "frames-"));
  var outputPattern = path.join(outputDir, "frame-%04d.jpg");

  return new Promise(function (resolve, reject) {
    execFile("ffmpeg", [
      "-i", videoPath,
      "-vf", "fps=1/" + intervalSeconds,
      "-q:v", "2",
      "-frames:v", "20",
      outputPattern
    ], function (err) {
      if (err) return reject(err);

      var files = fs.readdirSync(outputDir)
        .filter(function (f) { return f.endsWith(".jpg"); })
        .sort()
        .map(function (f) { return path.join(outputDir, f); });

      resolve({ frames: files, outputDir: outputDir });
    });
  });
}

function analyzeVideo(videoPath, prompt) {
  return extractKeyFrames(videoPath, 5)
    .then(function (result) {
      var contentBlocks = [];

      result.frames.forEach(function (framePath, index) {
        var frameBuffer = fs.readFileSync(framePath);
        contentBlocks.push({
          type: "text",
          text: "Frame " + (index + 1) + " (at " + (index * 5) + " seconds):"
        });
        contentBlocks.push({
          type: "image",
          source: {
            type: "base64",
            media_type: "image/jpeg",
            data: frameBuffer.toString("base64")
          }
        });
      });

      contentBlocks.push({
        type: "text",
        text: prompt || "Describe what happens in this video based on these frames."
      });

      return client.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 4096,
        messages: [{ role: "user", content: contentBlocks }]
      });
    });
}

Keep in mind that 20 frames at 1568x1568 each will be expensive. For most video analysis tasks, resize frames to 768 pixels on the long side and be selective about which frames you send.
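
One way to act on that advice is to downscale each frame before it goes into the content blocks. This sketch could stand in for the readFileSync call inside analyzeVideo above; the 768-pixel target and quality setting are the rough values suggested in the previous paragraph.

var sharp = require("sharp");

// Downscale an extracted frame and return it as base64, ready for a content block.
function prepareFrame(framePath) {
  return sharp(framePath)
    .resize(768, 768, { fit: "inside", withoutEnlargement: true })
    .jpeg({ quality: 75 })
    .toBuffer()
    .then(function (buffer) {
      return buffer.toString("base64");
    });
}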

Cost Implications of Image Tokens

Image tokens are the hidden cost of vision APIs. Understanding how they work is critical for budgeting.

Claude calculates image tokens based on the image dimensions. The formula roughly works out to: (width * height) / 750 tokens. A 1568x1568 image consumes about 3,277 input tokens. A smaller 800x600 image is around 640 tokens. At Claude Sonnet pricing, a single high-resolution image costs roughly the same as 3,000 words of text input.

GPT-4 Vision with high detail first scales the image to fit within 2048x2048, then scales the short side down to 768 pixels, and finally tiles the result into 512x512 chunks. Each tile costs 170 tokens, plus a base of 85 tokens. A 2048x2048 image ends up at 768x768, which is 4 tiles and 85 + 4 × 170 = 765 tokens. With low detail, every image is a flat 85 tokens.

Here is a utility to estimate costs before sending:

function estimateClaudeImageTokens(width, height) {
  // Claude resizes images to fit within 1568x1568 maintaining aspect ratio
  var scale = Math.min(1568 / width, 1568 / height, 1);
  var scaledWidth = Math.round(width * scale);
  var scaledHeight = Math.round(height * scale);
  return Math.ceil((scaledWidth * scaledHeight) / 750);
}

function estimateGPTImageTokens(width, height, detail) {
  if (detail === "low") return 85;

  // High detail: GPT scales the image to fit within 2048x2048, then scales the
  // short side to 768. Images that already fit are never upscaled (cap at 1).
  var scale = Math.min(2048 / Math.max(width, height), 768 / Math.min(width, height), 1);
  var scaledWidth = Math.round(width * scale);
  var scaledHeight = Math.round(height * scale);

  var tilesX = Math.ceil(scaledWidth / 512);
  var tilesY = Math.ceil(scaledHeight / 512);

  return 85 + (170 * tilesX * tilesY);
}

// Example: a 4000x3000 photo
console.log("Claude tokens:", estimateClaudeImageTokens(4000, 3000)); // ~3277
console.log("GPT high tokens:", estimateGPTImageTokens(4000, 3000, "high")); // ~1445
console.log("GPT low tokens:", estimateGPTImageTokens(4000, 3000, "low")); // 85

My recommendation: always resize images before sending. The difference between a 4000x3000 raw photo and a 1024x768 resized version is negligible for most analysis tasks, but the raw photo costs more than twice the image tokens and carries a far larger request payload.
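
To make that concrete, run the Claude estimator above on a typical phone photo and on a 1024x768 resized copy:

// Token cost of a raw phone photo vs. a 1024x768 resized copy (Claude estimate)
console.log("Raw 4000x3000:", estimateClaudeImageTokens(4000, 3000), "tokens"); // ~2459
console.log("Resized 1024x768:", estimateClaudeImageTokens(1024, 768), "tokens"); // ~1049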

Multi-Image Conversations

Both APIs support sending multiple images in a single request. This enables powerful comparison, batch analysis, and contextual understanding.

function batchAnalyze(imagePaths, prompt) {
  var contentBlocks = [];

  imagePaths.forEach(function (imagePath, index) {
    var buffer = fs.readFileSync(imagePath);
    contentBlocks.push({
      type: "text",
      text: "Image " + (index + 1) + ":"
    });
    contentBlocks.push({
      type: "image",
      source: {
        type: "base64",
        media_type: "image/jpeg",
        data: buffer.toString("base64")
      }
    });
  });

  contentBlocks.push({ type: "text", text: prompt });

  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    messages: [{ role: "user", content: contentBlocks }]
  });
}

// Analyze a batch of product photos
var productImages = [
  "./product-front.jpg",
  "./product-back.jpg",
  "./product-label.jpg"
];

batchAnalyze(productImages, "These are three views of the same product. Extract the product name, all ingredients from the label, and describe any visible defects or quality issues.")
  .then(function (response) {
    console.log(response.content[0].text);
  });

Combining Vision with Tool Use

Claude supports combining vision with tool use, which lets the model see an image and then call structured functions based on what it observes. This is powerful for building automated workflows.

function analyzeAndRoute(imageBuffer) {
  var base64Data = imageBuffer.toString("base64");

  var tools = [
    {
      name: "create_support_ticket",
      description: "Create a support ticket for a damaged product",
      input_schema: {
        type: "object",
        properties: {
          product_name: { type: "string" },
          damage_description: { type: "string" },
          severity: { type: "string", enum: ["minor", "moderate", "severe"] }
        },
        required: ["product_name", "damage_description", "severity"]
      }
    },
    {
      name: "flag_for_review",
      description: "Flag an image for manual review when automated analysis is uncertain",
      input_schema: {
        type: "object",
        properties: {
          reason: { type: "string" },
          confidence: { type: "number" }
        },
        required: ["reason"]
      }
    }
  ];

  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    tools: tools,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/jpeg", data: base64Data }
          },
          {
            type: "text",
            text: "Analyze this product image. If you see damage, create a support ticket. If the image is unclear or you are unsure, flag it for manual review."
          }
        ]
      }
    ]
  });
}
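
The request above only gets you the model's decision; you still need to read the tool call out of the response and dispatch it. This is a minimal sketch of that step: the stop_reason check and the tool_use content block are part of the Messages API, while createTicket() and queueForReview() are hypothetical application functions standing in for your own logic.

// Dispatch on the model's decision after analyzeAndRoute() resolves.
function handleRoutingResponse(response) {
  if (response.stop_reason !== "tool_use") {
    // The model answered in plain text instead of calling a tool
    return { action: "none", text: response.content[0].text };
  }

  var toolBlock = response.content.find(function (block) {
    return block.type === "tool_use";
  });

  if (toolBlock.name === "create_support_ticket") {
    return createTicket(toolBlock.input);   // hypothetical
  }
  if (toolBlock.name === "flag_for_review") {
    return queueForReview(toolBlock.input); // hypothetical
  }
}

// Usage: analyzeAndRoute(imageBuffer).then(handleRoutingResponse)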

Building Accessible Image Descriptions

One of the most impactful applications of vision APIs is generating alt text and accessible descriptions for images. This is not just a nice-to-have; it is a legal requirement in many jurisdictions and the right thing to do.

function generateAltText(imageBuffer, context) {
  var base64Data = imageBuffer.toString("base64");

  var prompt = [
    "Generate an accessible alt text description for this image.",
    "The description should be:",
    "- Concise (under 125 characters for alt text)",
    "- Descriptive of the key visual content",
    "- Functional (if the image conveys information, describe the information)",
    "- Free of phrases like 'image of' or 'picture of'",
    "",
    "Also provide a longer description (2-3 sentences) for use as an extended description.",
    "",
    context ? "Context: This image appears on a page about " + context : "",
    "",
    "Return JSON: { \"alt_text\": \"...\", \"long_description\": \"...\" }"
  ].join("\n");

  return client.messages.create({
    model: "claude-haiku-3-5-20241022",
    max_tokens: 512,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: { type: "base64", media_type: "image/jpeg", data: base64Data }
          },
          { type: "text", text: prompt }
        ]
      }
    ]
  }).then(function (response) {
    var text = response.content[0].text;
    text = text.replace(/```json\n?/g, "").replace(/```\n?/g, "").trim();
    return JSON.parse(text);
  });
}

I use Claude Haiku for alt text generation because it is fast and cheap, and the task does not require the reasoning power of Sonnet or Opus. When you are generating alt text for thousands of images, model selection has a large impact on cost.

Complete Working Example: Image Analysis Service

Here is a complete Express.js service that accepts image uploads, processes them through Claude vision, extracts structured data, and stores results in memory (swap in your database of choice for production).

var express = require("express");
var multer = require("multer");
var sharp = require("sharp");
var Anthropic = require("@anthropic-ai/sdk");
var crypto = require("crypto");
var path = require("path");

var app = express();
var upload = multer({
  storage: multer.memoryStorage(),
  limits: { fileSize: 10 * 1024 * 1024 },
  fileFilter: function (req, file, cb) {
    var allowed = ["image/jpeg", "image/png", "image/webp", "image/gif"];
    if (allowed.indexOf(file.mimetype) === -1) {
      return cb(new Error("Unsupported file type: " + file.mimetype));
    }
    cb(null, true);
  }
});

var client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

// In-memory store (use a database in production)
var analysisResults = {};

// Preprocess image for API consumption
function optimizeImage(buffer) {
  return sharp(buffer)
    .resize(1568, 1568, { fit: "inside", withoutEnlargement: true })
    .jpeg({ quality: 80 })
    .toBuffer()
    .then(function (optimized) {
      return {
        buffer: optimized,
        base64: optimized.toString("base64"),
        sizeKB: Math.round(optimized.length / 1024)
      };
    });
}

// Call Claude vision API with retry logic
function callClaudeVision(base64Data, prompt, retries) {
  retries = retries || 3;

  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image",
            source: {
              type: "base64",
              media_type: "image/jpeg",
              data: base64Data
            }
          },
          { type: "text", text: prompt }
        ]
      }
    ]
  }).catch(function (err) {
    if (retries > 0 && (err.status === 429 || err.status === 529)) {
      var delay = (4 - retries) * 2000;
      console.log("Rate limited, retrying in " + delay + "ms...");
      return new Promise(function (resolve) {
        setTimeout(resolve, delay);
      }).then(function () {
        return callClaudeVision(base64Data, prompt, retries - 1);
      });
    }
    throw err;
  });
}

// POST /analyze - Upload and analyze an image
app.post("/analyze", upload.single("image"), function (req, res) {
  if (!req.file) {
    return res.status(400).json({ error: "No image file provided" });
  }

  var analysisType = req.body.type || "general";
  var prompts = {
    general: "Describe this image in detail. Include objects, people, text, colors, and overall composition. Return JSON with keys: description, objects (array), text_content (any visible text), dominant_colors (array).",
    receipt: "Extract all data from this receipt. Return JSON with keys: merchant, date, items (array of {name, price}), subtotal, tax, total, payment_method.",
    document: "Extract all text and structure from this document. Return JSON with keys: title, type (invoice/letter/form/other), content (structured text), tables (array of objects if any tables exist).",
    product: "Analyze this product image. Return JSON with keys: product_name, brand, category, visible_text, condition (new/used/damaged), description."
  };

  var prompt = prompts[analysisType] || prompts.general;
  var analysisId = crypto.randomBytes(8).toString("hex");

  optimizeImage(req.file.buffer)
    .then(function (optimized) {
      console.log("Image optimized: " + optimized.sizeKB + " KB");
      return callClaudeVision(optimized.base64, prompt);
    })
    .then(function (response) {
      var rawText = response.content[0].text;
      var parsedData;

      try {
        var cleaned = rawText.replace(/```json\n?/g, "").replace(/```\n?/g, "").trim();
        parsedData = JSON.parse(cleaned);
      } catch (e) {
        parsedData = { raw_text: rawText };
      }

      var result = {
        id: analysisId,
        type: analysisType,
        data: parsedData,
        usage: {
          input_tokens: response.usage.input_tokens,
          output_tokens: response.usage.output_tokens
        },
        model: response.model,
        created_at: new Date().toISOString()
      };

      analysisResults[analysisId] = result;
      res.json(result);
    })
    .catch(function (err) {
      console.error("Analysis failed:", err.message);
      res.status(500).json({
        error: "Analysis failed",
        message: err.message
      });
    });
});

// GET /analysis/:id - Retrieve a previous analysis
app.get("/analysis/:id", function (req, res) {
  var result = analysisResults[req.params.id];
  if (!result) {
    return res.status(404).json({ error: "Analysis not found" });
  }
  res.json(result);
});

// POST /analyze/batch - Analyze multiple images
app.post("/analyze/batch", upload.array("images", 10), function (req, res) {
  if (!req.files || req.files.length === 0) {
    return res.status(400).json({ error: "No images provided" });
  }

  var promises = req.files.map(function (file) {
    return optimizeImage(file.buffer)
      .then(function (optimized) {
        return callClaudeVision(
          optimized.base64,
          "Describe this image briefly. Return JSON with: description, category, objects (array)."
        );
      })
      .then(function (response) {
        var text = response.content[0].text.replace(/```json\n?/g, "").replace(/```\n?/g, "").trim();
        try { return JSON.parse(text); }
        catch (e) { return { raw_text: text }; }
      });
  });

  Promise.all(promises)
    .then(function (results) {
      res.json({
        count: results.length,
        results: results
      });
    })
    .catch(function (err) {
      res.status(500).json({ error: err.message });
    });
});

// POST /analyze/compare - Compare two images
app.post("/analyze/compare", upload.array("images", 2), function (req, res) {
  if (!req.files || req.files.length !== 2) {
    return res.status(400).json({ error: "Exactly 2 images required" });
  }

  Promise.all([
    optimizeImage(req.files[0].buffer),
    optimizeImage(req.files[1].buffer)
  ]).then(function (optimized) {
    return client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 2048,
      messages: [
        {
          role: "user",
          content: [
            { type: "text", text: "Image 1:" },
            {
              type: "image",
              source: { type: "base64", media_type: "image/jpeg", data: optimized[0].base64 }
            },
            { type: "text", text: "Image 2:" },
            {
              type: "image",
              source: { type: "base64", media_type: "image/jpeg", data: optimized[1].base64 }
            },
            {
              type: "text",
              text: "Compare these two images. Return JSON with: similarities (array), differences (array), overall_assessment (string)."
            }
          ]
        }
      ]
    });
  }).then(function (response) {
    var text = response.content[0].text.replace(/```json\n?/g, "").replace(/```\n?/g, "").trim();
    try { res.json(JSON.parse(text)); }
    catch (e) { res.json({ raw_analysis: text }); }
  }).catch(function (err) {
    res.status(500).json({ error: err.message });
  });
});

// Error handling middleware for multer
app.use(function (err, req, res, next) {
  if (err instanceof multer.MulterError) {
    if (err.code === "LIMIT_FILE_SIZE") {
      return res.status(400).json({ error: "File too large. Maximum size is 10 MB." });
    }
    return res.status(400).json({ error: err.message });
  }
  if (err.message && err.message.indexOf("Unsupported file type") > -1) {
    return res.status(400).json({ error: err.message });
  }
  next(err);
});

var PORT = process.env.PORT || 3000;
app.listen(PORT, function () {
  console.log("Image analysis service running on port " + PORT);
});

Test it with curl:

# General analysis
curl -X POST http://localhost:3000/analyze \
  -F "[email protected]" \
  -F "type=general"

# Receipt extraction
curl -X POST http://localhost:3000/analyze \
  -F "[email protected]" \
  -F "type=receipt"

# Compare two images
curl -X POST http://localhost:3000/analyze/compare \
  -F "[email protected]" \
  -F "[email protected]"

Common Issues and Troubleshooting

1. "Could not process image" or Invalid Base64 Errors

Error: Could not process image: Invalid base64 data

This usually means your base64 string contains a data URI prefix. Claude expects raw base64, not data:image/jpeg;base64,.... Strip the prefix:

function cleanBase64(data) {
  return data.replace(/^data:image\/\w+;base64,/, "");
}

2. Rate Limiting on Batch Operations

Error 429: Rate limit exceeded. Please retry after 30 seconds.

When processing multiple images, you will hit rate limits fast. Implement a queue with concurrency control instead of firing all requests at once:

function processWithConcurrency(items, concurrency, processFn) {
  var results = [];
  var index = 0;

  function next() {
    if (index >= items.length) return Promise.resolve();
    var currentIndex = index++;
    return processFn(items[currentIndex])
      .then(function (result) {
        results[currentIndex] = result;
        return next();
      });
  }

  var workers = [];
  for (var i = 0; i < Math.min(concurrency, items.length); i++) {
    workers.push(next());
  }

  return Promise.all(workers).then(function () {
    return results;
  });
}

// Process 50 images, 3 at a time
processWithConcurrency(imageBuffers, 3, function (buffer) {
  return optimizeImage(buffer).then(function (opt) {
    return callClaudeVision(opt.base64, "Describe this image.");
  });
});

3. Sharp Installation Failures on Linux

ERR! sharp EACCES: permission denied, mkdir '/usr/lib/node_modules/sharp/vendor'

Sharp includes native bindings and can fail during installation. On production Linux servers, make sure you have the build essentials installed:

sudo apt-get install build-essential libvips-dev
npm install sharp --unsafe-perm

On Docker, use a node image that includes build tools, or install libvips in your Dockerfile:

RUN apt-get update && apt-get install -y libvips-dev && rm -rf /var/lib/apt/lists/*

4. Unexpected JSON Parsing Failures from Model Responses

SyntaxError: Unexpected token 'H' at position 0

Even with explicit "Return JSON only" instructions, models sometimes prepend natural language like "Here is the JSON:" before the actual JSON. Always strip markdown fences and attempt to extract JSON:

function extractJSON(text) {
  // Try direct parse first
  try { return JSON.parse(text); } catch (e) {}

  // Strip markdown code fences
  var cleaned = text.replace(/```json\n?/g, "").replace(/```\n?/g, "").trim();
  try { return JSON.parse(cleaned); } catch (e) {}

  // Try to find JSON in the response
  var jsonMatch = text.match(/\{[\s\S]*\}/);
  if (jsonMatch) {
    try { return JSON.parse(jsonMatch[0]); } catch (e) {}
  }

  return null;
}

5. Images Appearing Rotated or Mirrored

Some JPEG files contain EXIF orientation data that browsers respect but raw pixel processing ignores. Sharp can correct this with its rotate() method (called with no arguments, it auto-rotates based on EXIF), but you have to call it explicitly:

sharp(buffer)
  .rotate() // Auto-rotate based on EXIF orientation
  .resize(1568, 1568, { fit: "inside" })
  .jpeg({ quality: 80 })
  .toBuffer();

Without .rotate(), a photo taken in portrait mode on a phone may be sent to the API sideways, and the model will analyze a sideways image.

Best Practices

  • Always resize before sending. There is no reason to send a 4000x3000 image when 1024x768 gives equivalent results for most tasks. The savings in token cost and upload size add up quickly.

  • Use the right model for the task. Haiku is fast and cheap for simple classification and alt text. Sonnet is the sweet spot for most structured extraction. Opus is for when you need maximum accuracy on complex documents or subtle visual analysis.

  • Implement retry logic with exponential backoff. Vision requests are heavier than text-only requests and more likely to hit rate limits. Always handle 429 and 529 status codes gracefully.

  • Cache analysis results aggressively. If the same image might be analyzed twice, hash the image content and check your cache first (see the sketch after this list). Vision API calls are expensive; cache lookups are not.

  • Validate and sanitize model output. Even with structured output instructions, models can return malformed JSON, extra commentary, or unexpected field types. Always parse defensively and have a fallback.

  • Strip EXIF data before storing images. EXIF metadata can contain GPS coordinates, camera serial numbers, and other sensitive information. Sharp's toBuffer() strips most EXIF by default when you re-encode, but verify this for your use case.

  • Use detail levels strategically with OpenAI. If you are just classifying images into categories, low detail at 85 tokens per image is 30x cheaper than high detail. Only use high when you need to read small text or analyze fine details.

  • Label your images in multi-image requests. When sending multiple images, add text blocks like "Image 1 (front label):" before each image. This helps the model reference specific images accurately in its response.

  • Handle animated GIFs explicitly. Both APIs only process the first frame of animated GIFs. If you need to analyze an animation, extract frames with Sharp or ffmpeg and send them as separate images.

  • Monitor your token usage. Image tokens add up fast. A service processing 1,000 images per day at 3,000 tokens each is consuming 3 million input tokens daily. Track usage per endpoint and set up alerts.
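
As a concrete example of the caching advice above, here is a minimal in-memory sketch that reuses optimizeImage() and callClaudeVision() from the complete service earlier. The plain object stands in for whatever store you actually use (Redis, your database), and keying on both image hash and prompt is one reasonable choice, not the only one.

var crypto = require("crypto");

// Minimal in-memory cache keyed by image content hash plus prompt.
var visionCache = {};

function analyzeWithCache(imageBuffer, prompt) {
  var key = crypto.createHash("sha256")
    .update(imageBuffer)
    .update(prompt)
    .digest("hex");

  if (visionCache[key]) {
    return Promise.resolve(visionCache[key]);
  }

  return optimizeImage(imageBuffer)
    .then(function (optimized) {
      return callClaudeVision(optimized.base64, prompt);
    })
    .then(function (response) {
      var result = {
        analysis: response.content[0].text,
        usage: response.usage
      };
      visionCache[key] = result;
      return result;
    });
}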
