JavaScript/TypeScript SDK
The official JavaScript SDK for LLMCrawl provides a simple, powerful way to scrape websites, crawl multiple pages, and extract structured data with AI, all from your JavaScript or TypeScript applications.
Installation
Install the SDK using npm or your preferred package manager:
bash
# npm
npm install @llmcrawl/llmcrawl-js

# yarn
yarn add @llmcrawl/llmcrawl-js

# pnpm
pnpm add @llmcrawl/llmcrawl-js
Quick Start
typescript
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
const client = new LLMCrawl({
apiKey: "your-api-key-here",
});
// Scrape a single page
const result = await client.scrape("https://example.com");
console.log(result.data?.markdown);
Authentication
Initialize the client with your API key:
typescript
const client = new LLMCrawl({
apiKey: "your-api-key",
baseUrl: "https://api.llmcrawl.dev", // Optional custom base URL
});
You can obtain an API key from your LLMCrawl Dashboard.
Core Features
🌐 Single Page Scraping
Extract content from individual web pages with multiple format options.
🕷️ Website Crawling
Crawl entire websites with customizable depth and filtering options.
🗺️ Site Mapping
Get all URLs from a website without scraping content.
🤖 AI-Powered Extraction
Extract structured data using custom JSON schemas and AI.
📷 Screenshot Capture
Take screenshots of web pages during scraping.
⚙️ Flexible Configuration
Extensive customization options for headers, timeouts, and more.
API Reference
Scraping Single Pages
The scrape() method extracts content from a single webpage:
typescript
const result = await client.scrape("https://example.com", {
formats: ["markdown", "html", "links"],
headers: {
"User-Agent": "Mozilla/5.0 (compatible; LLMCrawl)",
},
waitFor: 3000, // Wait 3 seconds for page to load
timeout: 30000, // 30 second timeout
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
price: { type: "number" },
description: { type: "string" },
},
required: ["title", "price"],
},
},
});
if (result.success) {
console.log("Markdown:", result.data.markdown);
console.log("Extracted data:", result.data.extract);
console.log("Links:", result.data.links);
}
Scrape Options
Option | Type | Description |
---|---|---|
formats | string[] | Output formats: 'markdown', 'html', 'rawHtml', 'links', 'screenshot', 'screenshot@fullPage' |
headers | object | Custom HTTP headers |
includeTags | string[] | HTML tags to include in output |
excludeTags | string[] | HTML tags to exclude from output |
timeout | number | Request timeout in milliseconds (1000-90000) |
waitFor | number | Delay before capturing content (0-60000ms) |
extract | ExtractOptions | AI extraction configuration |
webhookUrls | string[] | URLs to send results to |
metadata | object | Additional metadata to include |
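For instance, to keep only the main article markup, drop navigation chrome, and have the result pushed to your own endpoint, you can combine several of these options; a minimal sketch, in which the webhook URL and metadata values are placeholders:
typescript
const result = await client.scrape("https://example.com/article", {
  formats: ["markdown"],
  includeTags: ["article", "main"], // keep only the main content containers
  excludeTags: ["nav", "footer", "aside"], // strip navigation and page chrome
  webhookUrls: ["https://example.com/webhooks/scrape"], // placeholder endpoint to receive the result
  metadata: { source: "docs-example" }, // placeholder metadata to include with the job
});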
Website Crawling
Start a crawl job to scrape multiple pages:
typescript
const crawlResult = await client.crawl("https://example.com", {
limit: 100,
maxDepth: 3,
includePaths: ["/blog/*", "/docs/*"],
excludePaths: ["/admin/*", "/login"],
allowBackwardLinks: false,
allowExternalLinks: false,
scrapeOptions: {
formats: ["markdown"],
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
content: { type: "string" },
author: { type: "string" },
},
},
},
},
});
if (crawlResult.success) {
console.log("Crawl started with ID:", crawlResult.id);
// Monitor crawl progress
const status = await client.getCrawlStatus(crawlResult.id);
console.log(`Progress: ${status.completed}/${status.total}`);
}
Crawl Options
Option | Type | Description |
---|---|---|
limit | number | Maximum number of pages to crawl |
maxDepth | number | Maximum crawl depth |
includePaths | string[] | Path patterns to include (supports wildcards) |
excludePaths | string[] | Path patterns to exclude |
allowBackwardLinks | boolean | Allow crawling backward links |
allowExternalLinks | boolean | Allow crawling external domains |
ignoreSitemap | boolean | Skip the site's sitemap.xml when discovering pages |
scrapeOptions | ScrapeOptions | Scraping options for each page |
webhookUrls | string[] | URLs for webhook notifications |
webhookMetadata | object | Additional webhook metadata |
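A minimal sketch of the webhook options, assuming you run an endpoint that accepts LLMCrawl's notifications (the URL and metadata values below are placeholders):
typescript
const crawl = await client.crawl("https://example.com", {
  limit: 50,
  scrapeOptions: { formats: ["markdown"] },
  webhookUrls: ["https://example.com/webhooks/crawl"], // placeholder notification endpoint
  webhookMetadata: { project: "docs-crawl" }, // placeholder metadata attached to notifications
});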
Monitoring Crawl Jobs
Check the status of a running crawl:
typescript
const status = await client.getCrawlStatus("crawl-job-id");
if (status.success) {
console.log("Status:", status.status); // 'scraping', 'completed', 'failed'
console.log("Progress:", `${status.completed}/${status.total}`);
if (status.status === "completed") {
console.log("Scraped pages:", status.data.length);
status.data.forEach((page, index) => {
console.log(`Page ${index + 1}:`, page.metadata?.title);
});
}
}
Cancel a running crawl:
typescript
const result = await client.cancelCrawl("crawl-job-id");
if (result.success) {
console.log("Crawl cancelled:", result.message);
}
Site Mapping
Get all URLs from a website without scraping content:
typescript
const mapResult = await client.map("https://example.com", {
limit: 1000,
includeSubdomains: true,
search: "documentation",
includePaths: ["/docs/*", "/api/*"],
excludePaths: ["/internal/*"],
});
if (mapResult.success) {
console.log(`Found ${mapResult.links.length} URLs`);
mapResult.links.forEach((link) => {
console.log(link);
});
}
Map Options
Option | Type | Description |
---|---|---|
limit | number | Maximum number of links to return (1-5000) |
includeSubdomains | boolean | Include subdomain URLs |
search | string | Filter links by search query |
ignoreSitemap | boolean | Skip the site's sitemap.xml when discovering links |
includePaths | string[] | Path patterns to include |
excludePaths | string[] | Path patterns to exclude |
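If you prefer not to rely on the site's sitemap, the ignoreSitemap flag from the table above can be set; a minimal sketch:
typescript
const discovered = await client.map("https://example.com", {
  limit: 500,
  ignoreSitemap: true, // do not use sitemap.xml for URL discovery
});

if (discovered.success) {
  console.log(`Discovered ${discovered.links.length} URLs`);
}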
AI-Powered Data Extraction
LLMCrawl's most powerful feature is AI-powered structured data extraction using custom JSON schemas:
E-commerce Example
typescript
const result = await client.scrape("https://store.example.com/product/123", {
formats: ["markdown"],
extract: {
mode: "llm",
schema: {
type: "object",
properties: {
productName: { type: "string" },
price: { type: "number" },
originalPrice: { type: "number" },
discount: { type: "number" },
inStock: { type: "boolean" },
rating: { type: "number" },
reviewCount: { type: "number" },
description: { type: "string" },
specifications: {
type: "object",
properties: {
color: { type: "string" },
size: { type: "string" },
brand: { type: "string" },
},
},
images: {
type: "array",
items: { type: "string" },
},
},
required: ["productName", "price", "inStock"],
},
systemPrompt: "Extract product information from this e-commerce page.",
prompt: "Focus on getting accurate pricing and availability information.",
},
});
if (result.success && result.data.extract) {
const product = JSON.parse(result.data.extract);
console.log("Product:", product.productName);
console.log("Price:", product.price);
console.log("In Stock:", product.inStock);
}
News Article Example
typescript
const article = await client.scrape("https://news.example.com/article/123", {
formats: ["markdown"],
extract: {
schema: {
type: "object",
properties: {
headline: { type: "string" },
subheadline: { type: "string" },
author: { type: "string" },
publishDate: { type: "string", format: "date-time" },
content: { type: "string" },
tags: {
type: "array",
items: { type: "string" },
},
category: { type: "string" },
readTime: { type: "number" },
relatedArticles: {
type: "array",
items: {
type: "object",
properties: {
title: { type: "string" },
url: { type: "string" },
},
},
},
},
required: ["headline", "content", "publishDate"],
},
},
});
Complex Schema Example
typescript
const schema = {
type: "object",
properties: {
company: {
type: "object",
properties: {
name: { type: "string" },
description: { type: "string" },
founded: { type: "string" },
employees: { type: "number" },
location: {
type: "object",
properties: {
city: { type: "string" },
country: { type: "string" },
address: { type: "string" },
},
},
},
},
leadership: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
position: { type: "string" },
bio: { type: "string" },
},
},
},
products: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
description: { type: "string" },
price: { type: "number" },
},
},
},
},
};
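The schema above is passed to scrape() the same way as in the earlier examples; a brief sketch, where the URL is a placeholder:
typescript
const companyResult = await client.scrape("https://example.com/about", {
  formats: ["markdown"],
  extract: { schema },
});

if (companyResult.success && companyResult.data.extract) {
  // As in the earlier examples, the extracted data arrives as a JSON string
  const company = JSON.parse(companyResult.data.extract);
  console.log("Company:", company.company?.name);
  console.log("Leadership:", company.leadership?.length ?? 0, "people listed");
}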
Advanced Examples
Crawling Documentation Sites
typescript
// Start crawling documentation
const crawl = await client.crawl("https://docs.example.com", {
limit: 500,
maxDepth: 4,
includePaths: ["/docs/*", "/api/*", "/guides/*"],
excludePaths: ["/docs/internal/*", "/admin/*"],
scrapeOptions: {
formats: ["markdown"],
excludeTags: ["nav", "footer", "aside"],
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
section: { type: "string" },
content: { type: "string" },
codeExamples: {
type: "array",
items: {
type: "object",
properties: {
language: { type: "string" },
code: { type: "string" },
},
},
},
},
},
},
},
});
// Poll for completion
if (crawl.success) {
let status = await client.getCrawlStatus(crawl.id);
while (status.success && status.status === "scraping") {
console.log(`Progress: ${status.completed}/${status.total} pages`);
await new Promise((resolve) => setTimeout(resolve, 5000));
status = await client.getCrawlStatus(crawl.id);
}
if (status.success && status.status === "completed") {
console.log("Documentation crawl completed!");
console.log(`Total pages: ${status.data.length}`);
// Process the results
status.data.forEach((page) => {
if (page.extract) {
const pageData = JSON.parse(page.extract);
console.log(`Section: ${pageData.section}`);
console.log(`Title: ${pageData.title}`);
}
});
}
}
Batch Processing with Custom Headers
typescript
const urls = [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
];
const customHeaders = {
"User-Agent": "Mozilla/5.0 (compatible; LLMCrawl)",
"Accept-Language": "en-US,en;q=0.9",
};
const results = await Promise.all(
urls.map(async (url) => {
try {
const result = await client.scrape(url, {
formats: ["markdown", "links"],
headers: customHeaders,
timeout: 30000,
waitFor: 2000,
});
return { url, success: true, data: result.data };
} catch (error) {
return { url, success: false, error: error.message };
}
})
);
results.forEach((result) => {
if (result.success) {
console.log(`✅ ${result.url}: ${result.data?.markdown?.length} chars`);
} else {
console.log(`❌ ${result.url}: ${result.error}`);
}
});
Screenshot Capture
typescript
const result = await client.scrape("https://example.com", {
formats: ["screenshot@fullPage", "markdown"],
waitFor: 3000, // Wait for page to fully load
});
if (result.success && result.data.screenshot) {
  // The screenshot is returned as a base64-encoded string

  // Save to file (Node.js; requires `import fs from "fs/promises"`)
  const screenshotBuffer = Buffer.from(result.data.screenshot, "base64");
  await fs.writeFile("screenshot.png", screenshotBuffer);

  // Or create a download link in the browser, where Buffer is unavailable;
  // decode the base64 string with atob instead
  const bytes = Uint8Array.from(atob(result.data.screenshot), (c) => c.charCodeAt(0));
  const blob = new Blob([bytes], { type: "image/png" });
  const url = URL.createObjectURL(blob);
}
Error Handling
All SDK methods return a response object with a success field for consistent error handling:
typescript
const result = await client.scrape("https://example.com");
if (result.success) {
// Handle successful response
console.log("Content:", result.data.markdown);
console.log("Metadata:", result.data.metadata);
} else {
// Handle error
console.error("Error:", result.error);
console.error("Details:", result.details);
// Common error types:
// - Authentication errors (invalid API key)
// - Rate limiting
// - Network timeouts
// - Invalid URLs
// - Server errors
}
Error Types
Error Type | Description |
---|---|
Authentication | Invalid or missing API key |
RateLimit | Too many requests, try again later |
Timeout | Request exceeded timeout limit |
InvalidURL | Malformed or unreachable URL |
ServerError | Internal server error |
ValidationError | Invalid parameters or schema |
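These conditions surface through the error and details fields shown in the previous example. One way to route on them is to inspect the error message; the string patterns below are an assumption, so adjust them to the messages you actually receive:
typescript
const response = await client.scrape("https://example.com");

if (!response.success) {
  const message = response.error ?? "";
  // Assumption: the error message names the failure category
  if (/api key|unauthorized|authentication/i.test(message)) {
    console.error("Authentication problem - check your API key");
  } else if (/rate limit/i.test(message)) {
    console.error("Rate limited - back off and retry later");
  } else if (/timeout/i.test(message)) {
    console.error("Timed out - consider increasing the timeout option");
  } else {
    console.error("Request failed:", message, response.details);
  }
}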
Type Definitions
The SDK includes comprehensive TypeScript types for better development experience:
typescript
import type {
// Response types
ScrapeResponse,
CrawlResponse,
CrawlStatusResponse,
MapResponse,
// Data types
Document,
ExtractOptions,
ScrapeOptions,
CrawlerOptions,
MapOptions,
// Configuration types
LLMCrawlConfig,
} from "@llmcrawl/llmcrawl-js";
// Example usage with types
const scrapeOptions: ScrapeOptions = {
formats: ["markdown", "html"],
timeout: 30000,
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
},
},
},
};
const result: ScrapeResponse = await client.scrape(url, scrapeOptions);
Environment-Specific Usage
Node.js
typescript
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
import fs from "fs/promises";
const client = new LLMCrawl({
apiKey: process.env.LLMCRAWL_API_KEY,
});
const result = await client.scrape("https://example.com");
if (result.success) {
await fs.writeFile("output.md", result.data.markdown);
}
Browser/React
typescript
import { useState } from "react";
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";

function MyComponent() {
const [content, setContent] = useState("");
const handleScrape = async () => {
const client = new LLMCrawl({
apiKey: process.env.REACT_APP_LLMCRAWL_API_KEY,
});
const result = await client.scrape("https://example.com");
if (result.success) {
setContent(result.data.markdown);
}
};
return (
<div>
<button onClick={handleScrape}>Scrape Website</button>
<pre>{content}</pre>
</div>
);
}
Next.js API Route
typescript
// pages/api/scrape.ts
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
import type { NextApiRequest, NextApiResponse } from "next";
export default async function handler(
req: NextApiRequest,
res: NextApiResponse
) {
if (req.method !== "POST") {
return res.status(405).json({ error: "Method not allowed" });
}
const { url } = req.body;
const client = new LLMCrawl({
apiKey: process.env.LLMCRAWL_API_KEY!,
});
try {
const result = await client.scrape(url, {
formats: ["markdown"],
});
if (result.success) {
res.status(200).json({ content: result.data.markdown });
} else {
res.status(400).json({ error: result.error });
}
} catch (error) {
res.status(500).json({ error: "Internal server error" });
}
}
Best Practices
1. Rate Limiting
typescript
// Implement client-side rate limiting
const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
async function scrapeWithRateLimit(urls: string[]) {
const results = [];
for (const url of urls) {
const result = await client.scrape(url);
results.push(result);
// Wait 1 second between requests
await delay(1000);
}
return results;
}
2. Error Recovery
typescript
async function scrapeWithRetry(url: string, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const result = await client.scrape(url);
if (result.success) return result;
if (attempt < maxRetries) {
await delay(1000 * attempt); // Linear backoff, reusing delay() from the rate-limiting example
}
} catch (error) {
if (attempt === maxRetries) throw error;
await delay(1000 * attempt);
}
}
}
3. Schema Validation
typescript
import Ajv from "ajv";
const ajv = new Ajv();
const productSchema = {
type: "object",
properties: {
name: { type: "string" },
price: { type: "number" },
},
required: ["name", "price"],
};
const validate = ajv.compile(productSchema);
const result = await client.scrape(url, {
extract: { schema: productSchema },
});
if (result.success && result.data.extract) {
const data = JSON.parse(result.data.extract);
if (validate(data)) {
console.log("Valid product data:", data);
} else {
console.error("Invalid data:", validate.errors);
}
}
Migration Guide
From v0.x to v1.0.0
Before (v0.x):
typescript
import LLMCrawl from "@llmcrawl/llmcrawl-js";
const client = new LLMCrawl("your-api-key");
After (v1.0.0):
typescript
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
const client = new LLMCrawl({
apiKey: "your-api-key",
});
Support and Resources
- 📧 Email Support: [email protected]
- 📚 Documentation: https://docs.llmcrawl.dev
- 🎮 Playground: https://llmcrawl.dev/tools
- 🐛 Issues: GitHub Issues
- 💬 Community: Discord Server
License
MIT License - see LICENSE file for details.