
JavaScript/TypeScript SDK


The official JavaScript SDK for LLMCrawl provides a simple and powerful way to scrape websites, crawl multiple pages, and extract structured data using AI from your JavaScript or TypeScript applications.

Installation

Install the SDK using npm or your preferred package manager:

bash
npm install @llmcrawl/llmcrawl-js
bash
yarn add @llmcrawl/llmcrawl-js
bash
pnpm add @llmcrawl/llmcrawl-js

Quick Start

typescript
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";

const client = new LLMCrawl({
  apiKey: "your-api-key-here",
});

// Scrape a single page
const result = await client.scrape("https://example.com");
console.log(result.data?.markdown);

Authentication

Initialize the client with your API key:

typescript
const client = new LLMCrawl({
  apiKey: "your-api-key",
  baseUrl: "https://api.llmcrawl.dev", // Optional custom base URL
});

You can obtain an API key from your LLMCrawl Dashboard.

Core Features

🌐 Single Page Scraping

Extract content from individual web pages with multiple format options.

🕷️ Website Crawling

Crawl entire websites with customizable depth and filtering options.

🗺️ Site Mapping

Get all URLs from a website without scraping content.

🤖 AI-Powered Extraction

Extract structured data using custom JSON schemas and AI.

📷 Screenshot Capture

Take screenshots of web pages during scraping.

⚙️ Flexible Configuration

Extensive customization options for headers, timeouts, and more.

API Reference

Scraping Single Pages

The scrape() method extracts content from a single webpage:

typescript
const result = await client.scrape("https://example.com", {
  formats: ["markdown", "html", "links"],
  headers: {
    "User-Agent": "Mozilla/5.0 (compatible; LLMCrawl)",
  },
  waitFor: 3000, // Wait 3 seconds for page to load
  timeout: 30000, // 30 second timeout
  extract: {
    schema: {
      type: "object",
      properties: {
        title: { type: "string" },
        price: { type: "number" },
        description: { type: "string" },
      },
      required: ["title", "price"],
    },
  },
});

if (result.success) {
  console.log("Markdown:", result.data.markdown);
  console.log("Extracted data:", result.data.extract);
  console.log("Links:", result.data.links);
}

Scrape Options

| Option | Type | Description |
| --- | --- | --- |
| formats | string[] | Output formats: 'markdown', 'html', 'rawHtml', 'links', 'screenshot', 'screenshot@fullPage' |
| headers | object | Custom HTTP headers |
| includeTags | string[] | HTML tags to include in output |
| excludeTags | string[] | HTML tags to exclude from output |
| timeout | number | Request timeout in milliseconds (1000-90000) |
| waitFor | number | Delay before capturing content (0-60000 ms) |
| extract | ExtractOptions | AI extraction configuration |
| webhookUrls | string[] | URLs to send results to |
| metadata | object | Additional metadata to include |
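
Several of these options work well together. The sketch below combines tag filtering with a webhook notification and custom metadata; the URL and webhook endpoint are placeholders, not real services:

typescript
const result = await client.scrape("https://example.com/pricing", {
  formats: ["markdown", "links"],
  // Keep the main content; drop navigation and footer markup
  includeTags: ["main", "article"],
  excludeTags: ["nav", "footer", "script"],
  timeout: 45000,
  waitFor: 1000,
  // Hypothetical endpoint; replace with your own receiver
  webhookUrls: ["https://hooks.example.com/llmcrawl"],
  metadata: { source: "pricing-monitor" },
});

if (result.success) {
  console.log(result.data?.markdown);
}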

Website Crawling

Start a crawl job to scrape multiple pages:

typescript
const crawlResult = await client.crawl("https://example.com", {
  limit: 100,
  maxDepth: 3,
  includePaths: ["/blog/*", "/docs/*"],
  excludePaths: ["/admin/*", "/login"],
  allowBackwardLinks: false,
  allowExternalLinks: false,
  scrapeOptions: {
    formats: ["markdown"],
    extract: {
      schema: {
        type: "object",
        properties: {
          title: { type: "string" },
          content: { type: "string" },
          author: { type: "string" },
        },
      },
    },
  },
});

if (crawlResult.success) {
  console.log("Crawl started with ID:", crawlResult.id);

  // Monitor crawl progress
  const status = await client.getCrawlStatus(crawlResult.id);
  console.log(`Progress: ${status.completed}/${status.total}`);
}

Crawl Options

| Option | Type | Description |
| --- | --- | --- |
| limit | number | Maximum number of pages to crawl |
| maxDepth | number | Maximum crawl depth |
| includePaths | string[] | Path patterns to include (supports wildcards) |
| excludePaths | string[] | Path patterns to exclude |
| allowBackwardLinks | boolean | Allow crawling backward links |
| allowExternalLinks | boolean | Allow crawling external domains |
| ignoreSitemap | boolean | Ignore the site's sitemap.xml when discovering pages |
| scrapeOptions | ScrapeOptions | Scraping options for each page |
| webhookUrls | string[] | URLs for webhook notifications |
| webhookMetadata | object | Additional webhook metadata |
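
As a quick illustration of the webhook-related options, the sketch below starts a crawl that relies purely on link discovery and tags its webhook notifications; the webhook URL and metadata are placeholders:

typescript
const crawl = await client.crawl("https://example.com", {
  limit: 50,
  maxDepth: 2,
  ignoreSitemap: true, // discover pages by following links only
  allowExternalLinks: false,
  scrapeOptions: { formats: ["markdown"] },
  // Placeholder endpoint; notifications will include the metadata below
  webhookUrls: ["https://hooks.example.com/crawl-events"],
  webhookMetadata: { project: "docs-sync" },
});

if (crawl.success) {
  console.log("Started crawl:", crawl.id);
}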

Monitoring Crawl Jobs

Check the status of a running crawl:

typescript
const status = await client.getCrawlStatus("crawl-job-id");

if (status.success) {
  console.log("Status:", status.status); // 'scraping', 'completed', 'failed'
  console.log("Progress:", `${status.completed}/${status.total}`);

  if (status.status === "completed") {
    console.log("Scraped pages:", status.data.length);
    status.data.forEach((page, index) => {
      console.log(`Page ${index + 1}:`, page.metadata?.title);
    });
  }
}

Cancel a running crawl:

typescript
const result = await client.cancelCrawl("crawl-job-id");

if (result.success) {
  console.log("Crawl cancelled:", result.message);
}

Site Mapping

Get all URLs from a website without scraping content:

typescript
const mapResult = await client.map("https://example.com", {
  limit: 1000,
  includeSubdomains: true,
  search: "documentation",
  includePaths: ["/docs/*", "/api/*"],
  excludePaths: ["/internal/*"],
});

if (mapResult.success) {
  console.log(`Found ${mapResult.links.length} URLs`);
  mapResult.links.forEach((link) => {
    console.log(link);
  });
}

Map Options

| Option | Type | Description |
| --- | --- | --- |
| limit | number | Maximum number of links to return (1-5000) |
| includeSubdomains | boolean | Include subdomain URLs |
| search | string | Filter links by search query |
| ignoreSitemap | boolean | Ignore the site's sitemap.xml when discovering URLs |
| includePaths | string[] | Path patterns to include |
| excludePaths | string[] | Path patterns to exclude |
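
A common pattern is to map a site first and then scrape only the URLs you care about. A minimal sketch (the path filter and the number of pages scraped are just examples):

typescript
// Map the site, then scrape only the first few matching URLs
const mapResult = await client.map("https://example.com", {
  limit: 500,
  includePaths: ["/blog/*"],
});

if (mapResult.success) {
  const targets = mapResult.links.slice(0, 5);
  for (const url of targets) {
    const page = await client.scrape(url, { formats: ["markdown"] });
    if (page.success) {
      console.log(url, "->", page.data?.markdown?.length, "chars");
    }
  }
}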

AI-Powered Data Extraction

LLMCrawl's most powerful feature is AI-powered structured data extraction using custom JSON schemas:

E-commerce Example

typescript
const result = await client.scrape("https://store.example.com/product/123", {
  formats: ["markdown"],
  extract: {
    mode: "llm",
    schema: {
      type: "object",
      properties: {
        productName: { type: "string" },
        price: { type: "number" },
        originalPrice: { type: "number" },
        discount: { type: "number" },
        inStock: { type: "boolean" },
        rating: { type: "number" },
        reviewCount: { type: "number" },
        description: { type: "string" },
        specifications: {
          type: "object",
          properties: {
            color: { type: "string" },
            size: { type: "string" },
            brand: { type: "string" },
          },
        },
        images: {
          type: "array",
          items: { type: "string" },
        },
      },
      required: ["productName", "price", "inStock"],
    },
    systemPrompt: "Extract product information from this e-commerce page.",
    prompt: "Focus on getting accurate pricing and availability information.",
  },
});

if (result.success && result.data.extract) {
  const product = JSON.parse(result.data.extract);
  console.log("Product:", product.productName);
  console.log("Price:", product.price);
  console.log("In Stock:", product.inStock);
}

News Article Example

typescript
const article = await client.scrape("https://news.example.com/article/123", {
  formats: ["markdown"],
  extract: {
    schema: {
      type: "object",
      properties: {
        headline: { type: "string" },
        subheadline: { type: "string" },
        author: { type: "string" },
        publishDate: { type: "string", format: "date-time" },
        content: { type: "string" },
        tags: {
          type: "array",
          items: { type: "string" },
        },
        category: { type: "string" },
        readTime: { type: "number" },
        relatedArticles: {
          type: "array",
          items: {
            type: "object",
            properties: {
              title: { type: "string" },
              url: { type: "string" },
            },
          },
        },
      },
      required: ["headline", "content", "publishDate"],
    },
  },
});

Complex Schema Example

typescript
const schema = {
  type: "object",
  properties: {
    company: {
      type: "object",
      properties: {
        name: { type: "string" },
        description: { type: "string" },
        founded: { type: "string" },
        employees: { type: "number" },
        location: {
          type: "object",
          properties: {
            city: { type: "string" },
            country: { type: "string" },
            address: { type: "string" },
          },
        },
      },
    },
    leadership: {
      type: "array",
      items: {
        type: "object",
        properties: {
          name: { type: "string" },
          position: { type: "string" },
          bio: { type: "string" },
        },
      },
    },
    products: {
      type: "array",
      items: {
        type: "object",
        properties: {
          name: { type: "string" },
          description: { type: "string" },
          price: { type: "number" },
        },
      },
    },
  },
};
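
The schema can then be passed to scrape() like any other extraction schema, and the result is parsed the same way as in the earlier examples (the URL here is a placeholder):

typescript
const result = await client.scrape("https://company.example.com/about", {
  formats: ["markdown"],
  extract: { schema },
});

if (result.success && result.data.extract) {
  const data = JSON.parse(result.data.extract);
  console.log("Company:", data.company?.name);
  console.log("Products found:", data.products?.length ?? 0);
}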

Advanced Examples

Crawling Documentation Sites

typescript
// Start crawling documentation
const crawl = await client.crawl("https://docs.example.com", {
  limit: 500,
  maxDepth: 4,
  includePaths: ["/docs/*", "/api/*", "/guides/*"],
  excludePaths: ["/docs/internal/*", "/admin/*"],
  scrapeOptions: {
    formats: ["markdown"],
    excludeTags: ["nav", "footer", "aside"],
    extract: {
      schema: {
        type: "object",
        properties: {
          title: { type: "string" },
          section: { type: "string" },
          content: { type: "string" },
          codeExamples: {
            type: "array",
            items: {
              type: "object",
              properties: {
                language: { type: "string" },
                code: { type: "string" },
              },
            },
          },
        },
      },
    },
  },
});

// Poll for completion
if (crawl.success) {
  let status = await client.getCrawlStatus(crawl.id);

  while (status.success && status.status === "scraping") {
    console.log(`Progress: ${status.completed}/${status.total} pages`);
    await new Promise((resolve) => setTimeout(resolve, 5000));
    status = await client.getCrawlStatus(crawl.id);
  }

  if (status.success && status.status === "completed") {
    console.log("Documentation crawl completed!");
    console.log(`Total pages: ${status.data.length}`);

    // Process the results
    status.data.forEach((page) => {
      if (page.extract) {
        const pageData = JSON.parse(page.extract);
        console.log(`Section: ${pageData.section}`);
        console.log(`Title: ${pageData.title}`);
      }
    });
  }
}

Batch Processing with Custom Headers

typescript
const urls = [
  "https://example.com/page1",
  "https://example.com/page2",
  "https://example.com/page3",
];

const customHeaders = {
  "User-Agent": "Mozilla/5.0 (compatible; LLMCrawl)",
  "Accept-Language": "en-US,en;q=0.9",
};

const results = await Promise.all(
  urls.map(async (url) => {
    try {
      const result = await client.scrape(url, {
        formats: ["markdown", "links"],
        headers: customHeaders,
        timeout: 30000,
        waitFor: 2000,
      });
      return { url, success: true, data: result.data };
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      return { url, success: false, error: message };
    }
  })
);

results.forEach((result) => {
  if (result.success) {
    console.log(`✅ ${result.url}: ${result.data?.markdown?.length} chars`);
  } else {
    console.log(`❌ ${result.url}: ${result.error}`);
  }
});

Screenshot Capture

typescript
const result = await client.scrape("https://example.com", {
  formats: ["screenshot@fullPage", "markdown"],
  waitFor: 3000, // Wait for page to fully load
});

if (result.success && result.data.screenshot) {
  // The screenshot is returned as a base64-encoded string

  // Save to file (Node.js); requires: import fs from "fs/promises";
  const screenshotBuffer = Buffer.from(result.data.screenshot, "base64");
  await fs.writeFile("screenshot.png", screenshotBuffer);

  // Or create a download link (browser): decode the base64 string directly
  const bytes = Uint8Array.from(atob(result.data.screenshot), (c) => c.charCodeAt(0));
  const blob = new Blob([bytes], { type: "image/png" });
  const url = URL.createObjectURL(blob);
}

Error Handling

All SDK methods return a response object with a success field for consistent error handling:

typescript
const result = await client.scrape("https://example.com");

if (result.success) {
  // Handle successful response
  console.log("Content:", result.data.markdown);
  console.log("Metadata:", result.data.metadata);
} else {
  // Handle error
  console.error("Error:", result.error);
  console.error("Details:", result.details);

  // Common error types:
  // - Authentication errors (invalid API key)
  // - Rate limiting
  // - Network timeouts
  // - Invalid URLs
  // - Server errors
}

Error Types

| Error Type | Description |
| --- | --- |
| Authentication | Invalid or missing API key |
| RateLimit | Too many requests; try again later |
| Timeout | Request exceeded the timeout limit |
| InvalidURL | Malformed or unreachable URL |
| ServerError | Internal server error |
| ValidationError | Invalid parameters or schema |
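
The exact error strings returned by the API can vary, so treat the following as a sketch: it assumes result.error is a human-readable message and simply retries once when the failure looks like a rate limit.

typescript
// Sketch: retry once when the failure looks like a rate limit.
// Assumes result.error is a human-readable string; adjust the check
// to whatever error codes your responses actually contain.
async function scrapeOnceWithRateLimitRetry(url: string) {
  let result = await client.scrape(url);

  if (!result.success && /rate ?limit/i.test(result.error ?? "")) {
    await new Promise((resolve) => setTimeout(resolve, 5000));
    result = await client.scrape(url);
  }

  return result;
}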

Type Definitions

The SDK includes comprehensive TypeScript types for a better development experience:

typescript
import type {
  // Response types
  ScrapeResponse,
  CrawlResponse,
  CrawlStatusResponse,
  MapResponse,

  // Data types
  Document,
  ExtractOptions,
  ScrapeOptions,
  CrawlerOptions,
  MapOptions,

  // Configuration types
  LLMCrawlConfig,
} from "@llmcrawl/llmcrawl-js";

// Example usage with types
const scrapeOptions: ScrapeOptions = {
  formats: ["markdown", "html"],
  timeout: 30000,
  extract: {
    schema: {
      type: "object",
      properties: {
        title: { type: "string" },
      },
    },
  },
};

const result: ScrapeResponse = await client.scrape(url, scrapeOptions);

Environment-Specific Usage

Node.js

typescript
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
import fs from "fs/promises";

const client = new LLMCrawl({
  apiKey: process.env.LLMCRAWL_API_KEY,
});

const result = await client.scrape("https://example.com");
if (result.success) {
  await fs.writeFile("output.md", result.data.markdown);
}

Browser/React

typescript
import { useState } from "react";
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";

function MyComponent() {
  const [content, setContent] = useState("");

  const handleScrape = async () => {
    const client = new LLMCrawl({
      apiKey: process.env.REACT_APP_LLMCRAWL_API_KEY,
    });

    const result = await client.scrape("https://example.com");
    if (result.success) {
      setContent(result.data.markdown);
    }
  };

  return (
    <div>
      <button onClick={handleScrape}>Scrape Website</button>
      <pre>{content}</pre>
    </div>
  );
}

Next.js API Route

typescript
// pages/api/scrape.ts
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
import type { NextApiRequest, NextApiResponse } from "next";

export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse
) {
  if (req.method !== "POST") {
    return res.status(405).json({ error: "Method not allowed" });
  }

  const { url } = req.body;

  const client = new LLMCrawl({
    apiKey: process.env.LLMCRAWL_API_KEY!,
  });

  try {
    const result = await client.scrape(url, {
      formats: ["markdown"],
    });

    if (result.success) {
      res.status(200).json({ content: result.data.markdown });
    } else {
      res.status(400).json({ error: result.error });
    }
  } catch (error) {
    res.status(500).json({ error: "Internal server error" });
  }
}

Best Practices

1. Rate Limiting

typescript
// Implement client-side rate limiting
const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapeWithRateLimit(urls: string[]) {
  const results = [];

  for (const url of urls) {
    const result = await client.scrape(url);
    results.push(result);

    // Wait 1 second between requests
    await delay(1000);
  }

  return results;
}

2. Error Recovery

typescript
async function scrapeWithRetry(url: string, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const result = await client.scrape(url);
      if (result.success) return result;

      if (attempt < maxRetries) {
        await delay(1000 * attempt); // Back off longer after each failed attempt
      }
    } catch (error) {
      if (attempt === maxRetries) throw error;
      await delay(1000 * attempt);
    }
  }

  throw new Error(`Failed to scrape ${url} after ${maxRetries} attempts`);
}

3. Schema Validation

typescript
import Ajv from "ajv";

const ajv = new Ajv();

const productSchema = {
  type: "object",
  properties: {
    name: { type: "string" },
    price: { type: "number" },
  },
  required: ["name", "price"],
};

const validate = ajv.compile(productSchema);

const result = await client.scrape(url, {
  extract: { schema: productSchema },
});

if (result.success && result.data.extract) {
  const data = JSON.parse(result.data.extract);
  if (validate(data)) {
    console.log("Valid product data:", data);
  } else {
    console.error("Invalid data:", validate.errors);
  }
}

Migration Guide

From v0.x to v1.0.0

Before (v0.x):

typescript
import LLMCrawl from "@llmcrawl/llmcrawl-js";

const client = new LLMCrawl("your-api-key");

After (v1.0.0):

typescript
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";

const client = new LLMCrawl({
  apiKey: "your-api-key",
});


License

MIT License - see LICENSE file for details.