JavaScript/TypeScript SDK
The official JavaScript SDK for LLMCrawl provides a simple, powerful way to scrape websites, crawl multiple pages, and extract structured data with AI, all from your JavaScript or TypeScript applications.
Installation
Install the SDK using npm or your preferred package manager:
bash
# npm
npm install @llmcrawl/llmcrawl-js

# yarn
yarn add @llmcrawl/llmcrawl-js

# pnpm
pnpm add @llmcrawl/llmcrawl-js
Quick Start
typescript
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
const client = new LLMCrawl({
apiKey: "your-api-key-here",
});
// Scrape a single page
const result = await client.scrape("https://example.com");
console.log(result.data?.markdown);
Authentication
Initialize the client with your API key:
typescript
const client = new LLMCrawl({
apiKey: "your-api-key",
baseUrl: "https://api.llmcrawl.dev", // Optional custom base URL
});
You can obtain an API key from your LLMCrawl Dashboard.
Core Features
🌐 Single Page Scraping
Extract content from individual web pages with multiple format options.
🕷️ Website Crawling
Crawl entire websites with customizable depth and filtering options.
🗺️ Site Mapping
Get all URLs from a website without scraping content.
🤖 AI-Powered Extraction
Extract structured data using custom JSON schemas and AI.
📷 Screenshot Capture
Take screenshots of web pages during scraping.
⚙️ Flexible Configuration
Extensive customization options for headers, timeouts, and more.
API Reference
Scraping Single Pages
The scrape() method extracts content from a single webpage:
typescript
const result = await client.scrape("https://example.com", {
formats: ["markdown", "html", "links"],
headers: {
"User-Agent": "Mozilla/5.0 (compatible; LLMCrawl)",
},
waitFor: 3000, // Wait 3 seconds for page to load
timeout: 30000, // 30 second timeout
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
price: { type: "number" },
description: { type: "string" },
},
required: ["title", "price"],
},
},
});
if (result.success) {
console.log("Markdown:", result.data.markdown);
console.log("Extracted data:", result.data.extract);
console.log("Links:", result.data.links);
}
Scrape Options
Option | Type | Description |
---|---|---|
formats | string[] | Output formats: 'markdown', 'html', 'rawHtml', 'links', 'screenshot', 'screenshot@fullPage' |
headers | object | Custom HTTP headers |
includeTags | string[] | HTML tags to include in output |
excludeTags | string[] | HTML tags to exclude from output |
timeout | number | Request timeout in milliseconds (1000-90000) |
waitFor | number | Delay before capturing content (0-60000ms) |
extract | ExtractOptions | AI extraction configuration |
webhookUrls | string[] | URLs to send results to |
metadata | object | Additional metadata to include |
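For instance, to keep only the main article markup, drop navigation chrome, and have the result pushed to your own endpoint, you can combine several of these options; a minimal sketch, in which the webhook URL and metadata values are placeholders:
typescript
const result = await client.scrape("https://example.com/article", {
  formats: ["markdown"],
  includeTags: ["article", "main"], // keep only the main content containers
  excludeTags: ["nav", "footer", "aside"], // strip navigation and page chrome
  webhookUrls: ["https://example.com/webhooks/scrape"], // placeholder endpoint to receive the result
  metadata: { source: "docs-example" }, // placeholder metadata to include with the job
});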
Website Crawling
Start a crawl job to scrape multiple pages:
typescript
const crawlResult = await client.crawl("https://example.com", {
limit: 100,
maxDepth: 3,
includePaths: ["/blog/*", "/docs/*"],
excludePaths: ["/admin/*", "/login"],
allowBackwardLinks: false,
allowExternalLinks: false,
scrapeOptions: {
formats: ["markdown"],
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
content: { type: "string" },
author: { type: "string" },
},
},
},
},
});
if (crawlResult.success) {
console.log("Crawl started with ID:", crawlResult.id);
// Monitor crawl progress
const status = await client.getCrawlStatus(crawlResult.id);
console.log(`Progress: ${status.completed}/${status.total}`);
}
Crawl Options
Option | Type | Description |
---|---|---|
limit | number | Maximum number of pages to crawl |
maxDepth | number | Maximum crawl depth |
includePaths | string[] | Path patterns to include (supports wildcards) |
excludePaths | string[] | Path patterns to exclude |
allowBackwardLinks | boolean | Allow crawling backward links |
allowExternalLinks | boolean | Allow crawling external domains |
ignoreSitemap | boolean | Skip the site's sitemap.xml when discovering pages |
scrapeOptions | ScrapeOptions | Scraping options for each page |
webhookUrls | string[] | URLs for webhook notifications |
webhookMetadata | object | Additional webhook metadata |
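A minimal sketch of the webhook options, assuming you run an endpoint that accepts LLMCrawl's notifications (the URL and metadata values below are placeholders):
typescript
const crawl = await client.crawl("https://example.com", {
  limit: 50,
  scrapeOptions: { formats: ["markdown"] },
  webhookUrls: ["https://example.com/webhooks/crawl"], // placeholder notification endpoint
  webhookMetadata: { project: "docs-crawl" }, // placeholder metadata attached to notifications
});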
Monitoring Crawl Jobs
Check the status of a running crawl:
typescript
const status = await client.getCrawlStatus("crawl-job-id");
if (status.success) {
console.log("Status:", status.status); // 'scraping', 'completed', 'failed'
console.log("Progress:", `${status.completed}/${status.total}`);
if (status.status === "completed") {
console.log("Scraped pages:", status.data.length);
status.data.forEach((page, index) => {
console.log(`Page ${index + 1}:`, page.metadata?.title);
});
}
}
Cancel a running crawl:
typescript
const result = await client.cancelCrawl("crawl-job-id");
if (result.success) {
console.log("Crawl cancelled:", result.message);
}
Site Mapping
Get all URLs from a website without scraping content:
typescript
const mapResult = await client.map("https://example.com", {
limit: 1000,
includeSubdomains: true,
search: "documentation",
includePaths: ["/docs/*", "/api/*"],
excludePaths: ["/internal/*"],
});
if (mapResult.success) {
console.log(`Found ${mapResult.links.length} URLs`);
mapResult.links.forEach((link) => {
console.log(link);
});
}
Map Options
Option | Type | Description |
---|---|---|
limit | number | Maximum number of links to return (1-5000) |
includeSubdomains | boolean | Include subdomain URLs |
search | string | Filter links by search query |
ignoreSitemap | boolean | Skip the site's sitemap.xml when discovering links |
includePaths | string[] | Path patterns to include |
excludePaths | string[] | Path patterns to exclude |
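If you prefer not to rely on the site's sitemap, the ignoreSitemap flag from the table above can be set; a minimal sketch:
typescript
const discovered = await client.map("https://example.com", {
  limit: 500,
  ignoreSitemap: true, // do not use sitemap.xml for URL discovery
});

if (discovered.success) {
  console.log(`Discovered ${discovered.links.length} URLs`);
}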
AI-Powered Data Extraction
LLMCrawl's most powerful feature is AI-powered structured data extraction using custom JSON schemas:
E-commerce Example
typescript
const result = await client.scrape("https://store.example.com/product/123", {
formats: ["markdown"],
extract: {
mode: "llm",
schema: {
type: "object",
properties: {
productName: { type: "string" },
price: { type: "number" },
originalPrice: { type: "number" },
discount: { type: "number" },
inStock: { type: "boolean" },
rating: { type: "number" },
reviewCount: { type: "number" },
description: { type: "string" },
specifications: {
type: "object",
properties: {
color: { type: "string" },
size: { type: "string" },
brand: { type: "string" },
},
},
images: {
type: "array",
items: { type: "string" },
},
},
required: ["productName", "price", "inStock"],
},
systemPrompt: "Extract product information from this e-commerce page.",
prompt: "Focus on getting accurate pricing and availability information.",
},
});
if (result.success && result.data.extract) {
const product = JSON.parse(result.data.extract);
console.log("Product:", product.productName);
console.log("Price:", product.price);
console.log("In Stock:", product.inStock);
}
News Article Example
typescript
const article = await client.scrape("https://news.example.com/article/123", {
formats: ["markdown"],
extract: {
schema: {
type: "object",
properties: {
headline: { type: "string" },
subheadline: { type: "string" },
author: { type: "string" },
publishDate: { type: "string", format: "date-time" },
content: { type: "string" },
tags: {
type: "array",
items: { type: "string" },
},
category: { type: "string" },
readTime: { type: "number" },
relatedArticles: {
type: "array",
items: {
type: "object",
properties: {
title: { type: "string" },
url: { type: "string" },
},
},
},
},
required: ["headline", "content", "publishDate"],
},
},
});
Complex Schema Example
typescript
const schema = {
type: "object",
properties: {
company: {
type: "object",
properties: {
name: { type: "string" },
description: { type: "string" },
founded: { type: "string" },
employees: { type: "number" },
location: {
type: "object",
properties: {
city: { type: "string" },
country: { type: "string" },
address: { type: "string" },
},
},
},
},
leadership: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
position: { type: "string" },
bio: { type: "string" },
},
},
},
products: {
type: "array",
items: {
type: "object",
properties: {
name: { type: "string" },
description: { type: "string" },
price: { type: "number" },
},
},
},
},
};
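The schema above is passed to scrape() the same way as in the earlier examples; a brief sketch, where the URL is a placeholder:
typescript
const companyResult = await client.scrape("https://example.com/about", {
  formats: ["markdown"],
  extract: { schema },
});

if (companyResult.success && companyResult.data.extract) {
  // As in the earlier examples, the extracted data arrives as a JSON string
  const company = JSON.parse(companyResult.data.extract);
  console.log("Company:", company.company?.name);
  console.log("Leadership:", company.leadership?.length ?? 0, "people listed");
}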
Advanced Examples
Crawling Documentation Sites
typescript
// Start crawling documentation
const crawl = await client.crawl("https://docs.example.com", {
limit: 500,
maxDepth: 4,
includePaths: ["/docs/*", "/api/*", "/guides/*"],
excludePaths: ["/docs/internal/*", "/admin/*"],
scrapeOptions: {
formats: ["markdown"],
excludeTags: ["nav", "footer", "aside"],
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
section: { type: "string" },
content: { type: "string" },
codeExamples: {
type: "array",
items: {
type: "object",
properties: {
language: { type: "string" },
code: { type: "string" },
},
},
},
},
},
},
},
});
// Poll for completion
if (crawl.success) {
let status = await client.getCrawlStatus(crawl.id);
while (status.success && status.status === "scraping") {
console.log(`Progress: ${status.completed}/${status.total} pages`);
await new Promise((resolve) => setTimeout(resolve, 5000));
status = await client.getCrawlStatus(crawl.id);
}
if (status.success && status.status === "completed") {
console.log("Documentation crawl completed!");
console.log(`Total pages: ${status.data.length}`);
// Process the results
status.data.forEach((page) => {
if (page.extract) {
const pageData = JSON.parse(page.extract);
console.log(`Section: ${pageData.section}`);
console.log(`Title: ${pageData.title}`);
}
});
}
}
Batch Processing with Custom Headers
typescript
const urls = [
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
];
const customHeaders = {
"User-Agent": "Mozilla/5.0 (compatible; LLMCrawl)",
"Accept-Language": "en-US,en;q=0.9",
};
const results = await Promise.all(
urls.map(async (url) => {
try {
const result = await client.scrape(url, {
formats: ["markdown", "links"],
headers: customHeaders,
timeout: 30000,
waitFor: 2000,
});
return { url, success: true, data: result.data };
} catch (error) {
return { url, success: false, error: error.message };
}
})
);
results.forEach((result) => {
if (result.success) {
console.log(`✅ ${result.url}: ${result.data?.markdown?.length} chars`);
} else {
console.log(`❌ ${result.url}: ${result.error}`);
}
});
Screenshot Capture
typescript
const result = await client.scrape("https://example.com", {
formats: ["screenshot@fullPage", "markdown"],
waitFor: 3000, // Wait for page to fully load
});
if (result.success && result.data.screenshot) {
  // The screenshot is returned as a base64-encoded string

  // Save to file (Node.js; requires `import fs from "fs/promises"`)
  const screenshotBuffer = Buffer.from(result.data.screenshot, "base64");
  await fs.writeFile("screenshot.png", screenshotBuffer);

  // Or create a download link in the browser, where Buffer is unavailable;
  // decode the base64 string with atob instead
  const bytes = Uint8Array.from(atob(result.data.screenshot), (c) => c.charCodeAt(0));
  const blob = new Blob([bytes], { type: "image/png" });
  const url = URL.createObjectURL(blob);
}
Error Handling
All SDK methods return a response object with a success field for consistent error handling:
typescript
const result = await client.scrape("https://example.com");
if (result.success) {
// Handle successful response
console.log("Content:", result.data.markdown);
console.log("Metadata:", result.data.metadata);
} else {
// Handle error
console.error("Error:", result.error);
console.error("Details:", result.details);
// Common error types:
// - Authentication errors (invalid API key)
// - Rate limiting
// - Network timeouts
// - Invalid URLs
// - Server errors
}
Error Types
Error Type | Description |
---|---|
Authentication | Invalid or missing API key |
RateLimit | Too many requests, try again later |
Timeout | Request exceeded timeout limit |
InvalidURL | Malformed or unreachable URL |
ServerError | Internal server error |
ValidationError | Invalid parameters or schema |
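These conditions surface through the error and details fields shown in the previous example. One way to route on them is to inspect the error message; the string patterns below are an assumption, so adjust them to the messages you actually receive:
typescript
const response = await client.scrape("https://example.com");

if (!response.success) {
  const message = response.error ?? "";
  // Assumption: the error message names the failure category
  if (/api key|unauthorized|authentication/i.test(message)) {
    console.error("Authentication problem - check your API key");
  } else if (/rate limit/i.test(message)) {
    console.error("Rate limited - back off and retry later");
  } else if (/timeout/i.test(message)) {
    console.error("Timed out - consider increasing the timeout option");
  } else {
    console.error("Request failed:", message, response.details);
  }
}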
Type Definitions
The SDK includes comprehensive TypeScript types for better development experience:
typescript
import type {
// Response types
ScrapeResponse,
CrawlResponse,
CrawlStatusResponse,
MapResponse,
// Data types
Document,
ExtractOptions,
ScrapeOptions,
CrawlerOptions,
MapOptions,
// Configuration types
LLMCrawlConfig,
} from "@llmcrawl/llmcrawl-js";
// Example usage with types
const scrapeOptions: ScrapeOptions = {
formats: ["markdown", "html"],
timeout: 30000,
extract: {
schema: {
type: "object",
properties: {
title: { type: "string" },
},
},
},
};
const result: ScrapeResponse = await client.scrape(url, scrapeOptions);
Environment-Specific Usage
Node.js
typescript
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
import fs from "fs/promises";
const client = new LLMCrawl({
apiKey: process.env.LLMCRAWL_API_KEY,
});
const result = await client.scrape("https://example.com");
if (result.success) {
await fs.writeFile("output.md", result.data.markdown);
}
Browser/React
typescript
import { useState } from "react";
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";

function MyComponent() {
const [content, setContent] = useState("");
const handleScrape = async () => {
const client = new LLMCrawl({
apiKey: process.env.REACT_APP_LLMCRAWL_API_KEY,
});
const result = await client.scrape("https://example.com");
if (result.success) {
setContent(result.data.markdown);
}
};
return (
<div>
<button onClick={handleScrape}>Scrape Website</button>
<pre>{content}</pre>
</div>
);
}
Next.js API Route
typescript
// pages/api/scrape.ts
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
import type { NextApiRequest, NextApiResponse } from "next";
export default async function handler(
req: NextApiRequest,
res: NextApiResponse
) {
if (req.method !== "POST") {
return res.status(405).json({ error: "Method not allowed" });
}
const { url } = req.body;
const client = new LLMCrawl({
apiKey: process.env.LLMCRAWL_API_KEY!,
});
try {
const result = await client.scrape(url, {
formats: ["markdown"],
});
if (result.success) {
res.status(200).json({ content: result.data.markdown });
} else {
res.status(400).json({ error: result.error });
}
} catch (error) {
res.status(500).json({ error: "Internal server error" });
}
}
Best Practices
1. Rate Limiting
typescript
// Implement client-side rate limiting
const delay = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
async function scrapeWithRateLimit(urls: string[]) {
const results = [];
for (const url of urls) {
const result = await client.scrape(url);
results.push(result);
// Wait 1 second between requests
await delay(1000);
}
return results;
}
2. Error Recovery
typescript
async function scrapeWithRetry(url: string, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const result = await client.scrape(url);
if (result.success) return result;
if (attempt < maxRetries) {
await delay(1000 * attempt); // Linear backoff, reusing delay() from the rate-limiting example
}
} catch (error) {
if (attempt === maxRetries) throw error;
await delay(1000 * attempt);
}
}
}
3. Schema Validation
typescript
import Ajv from "ajv";
const ajv = new Ajv();
const productSchema = {
type: "object",
properties: {
name: { type: "string" },
price: { type: "number" },
},
required: ["name", "price"],
};
const validate = ajv.compile(productSchema);
const result = await client.scrape(url, {
extract: { schema: productSchema },
});
if (result.success && result.data.extract) {
const data = JSON.parse(result.data.extract);
if (validate(data)) {
console.log("Valid product data:", data);
} else {
console.error("Invalid data:", validate.errors);
}
}
Migration Guide
From v0.x to v1.0.0
Before (v0.x):
typescript
import LLMCrawl from "@llmcrawl/llmcrawl-js";
const client = new LLMCrawl("your-api-key");
After (v1.0.0):
typescript
import { LLMCrawl } from "@llmcrawl/llmcrawl-js";
const client = new LLMCrawl({
apiKey: "your-api-key",
});
Support and Resources
- 📧 Email Support: [email protected]
- 📚 Documentation: https://docs.llmcrawl.dev
- 🎮 Playground: https://llmcrawl.dev/tools
- 🐛 Issues: GitHub Issues
- 💬 Community: Discord Server
License
MIT License - see LICENSE file for details.