Web Crawler Documentation

TrellisBot

TrellisBot is the automated web crawler that powers TrellisSearch. This page explains how it works, what it looks for, how to verify it, and how to control its access to your site.

What is TrellisBot?

TrellisBot is an automated program that systematically browses the web to build and maintain the TrellisSearch index. It follows hyperlinks from page to page, fetches content, and extracts information that powers search results at trellissearch.com.

TrellisBot is designed to be a well-behaved, respectful crawler. It identifies itself clearly in every request, obeys robots.txt rules, respects crawl delays, and does not attempt to access password-protected or otherwise restricted content.

Open index: TrellisSearch is an independent search engine. Appearing in our index is separate from appearing in Google, Bing, or other search engines. Blocking TrellisBot only affects TrellisSearch.

User Agent String

TrellisBot identifies itself in every HTTP request using the following user agent string:

TrellisBot/1.0 (+https://trellissearch.com/bot.html)
Field | Value
Crawler name | TrellisBot
Version | 1.0
robots.txt token | TrellisBot
Documentation URL | https://trellissearch.com/bot.html
Operator | Trellis Group LLC
Contact | support@trellissearch.com
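
If you want to spot TrellisBot in your access logs, a small pattern match on the user agent string is usually enough for a first pass. The sketch below is illustrative only: the regular expression is an assumption derived from the single example string above, not an official grammar, and a matching header does not prove the request is genuine (see Verifying TrellisBot).

```python
import re

# Assumed shape, inferred from the documented example:
#   TrellisBot/<version> (+<documentation URL>)
UA_PATTERN = re.compile(
    r"^TrellisBot/(?P<version>\d+(?:\.\d+)*) \(\+(?P<url>https?://[^)\s]+)\)$"
)


def parse_trellisbot_ua(user_agent):
    """Return (version, url) if the header matches the assumed format, else None."""
    match = UA_PATTERN.match(user_agent.strip())
    if match is None:
        return None
    return match.group("version"), match.group("url")


if __name__ == "__main__":
    print(parse_trellisbot_ua("TrellisBot/1.0 (+https://trellissearch.com/bot.html)"))
```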

How TrellisBot Crawls

Discovery

TrellisBot discovers new pages primarily by following hyperlinks found on pages it has already visited. It also processes XML sitemaps referenced in robots.txt files via the Sitemap: directive.
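
To make link-based discovery concrete, here is a minimal sketch of how a crawler might collect followable links from a fetched page. It uses only Python's standard library; the class and function names are hypothetical, and this is not TrellisBot's actual code.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags, in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute.startswith(("http://", "https://")) and absolute not in self.links:
            self.links.append(absolute)


def discover_links(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Non-web schemes such as mailto: are ignored here, and duplicate URLs are collected only once.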

Crawl rate

TrellisBot is designed to crawl politely and avoid placing excessive load on web servers. It limits concurrent connections per server, introduces delays between requests, and fully respects Crawl-delay directives. If your server is struggling with crawl traffic from TrellisBot, set a crawl delay in your robots.txt.
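
Before relying on a crawl delay, it can help to confirm what your robots.txt actually advertises to the TrellisBot token. The sketch below does this with Python's standard urllib.robotparser module; the helper name is hypothetical, and the parser's behavior may differ in small ways from TrellisBot's own.

```python
from urllib.robotparser import RobotFileParser


def advertised_crawl_delay(robots_txt, user_agent="TrellisBot"):
    """Return the Crawl-delay in seconds that robots_txt sets for user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    rules = "User-agent: TrellisBot\nCrawl-delay: 10\n"
    print(advertised_crawl_delay(rules))
```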

Content size limits

TrellisBot fetches up to 5 MB of content per page and does not download anything beyond that limit. Most pages never reach it. Very large pages, such as those that embed large blocks of data or generated content, may be only partially indexed.
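
The effect of a size cap can be sketched as a bounded read. This example assumes the limit means 5 × 1024 × 1024 bytes; the documentation does not say whether the figure is binary or decimal, and the function name is hypothetical.

```python
import io

MAX_BYTES = 5 * 1024 * 1024  # assumption: the documented limit read as 5 MiB


def read_capped(stream, limit=MAX_BYTES, chunk_size=64 * 1024):
    """Read at most `limit` bytes from a binary stream.

    Returns (body, truncated), where truncated is True if more data remained.
    """
    chunks = []
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    # If the cap was reached, peek one byte to learn whether content was cut off.
    truncated = remaining == 0 and bool(stream.read(1))
    return b"".join(chunks), truncated


if __name__ == "__main__":
    body, truncated = read_capped(io.BytesIO(b"x" * 100), limit=40)
    print(len(body), truncated)
```

If your most important content sits near the end of a very large page, moving it earlier in the HTML makes it more likely to fall within the fetched portion.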

Supported content types

TrellisBot does not currently execute JavaScript. Pages that rely entirely on client-side rendering to display content may not be fully indexed. Server-side rendered or static HTML pages will be indexed most accurately.

Crawl frequency

How often TrellisBot revisits a page depends on how frequently that page changes. Pages that update often are checked more frequently; static pages are revisited less often. Revisit intervals adapt automatically based on observed change history, ranging from daily for very active pages to every 90 days for content that rarely changes.
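
One simple way to picture an adaptive revisit schedule is a multiplicative rule clamped to the documented range of 1 to 90 days. The halving and doubling factors below are assumptions chosen for illustration; the actual adaptation logic is not published here.

```python
MIN_DAYS = 1   # documented lower bound: daily
MAX_DAYS = 90  # documented upper bound: every 90 days


def next_revisit_interval(current_days, changed):
    """Shorten the interval when the page changed, lengthen it when it did not.

    Halving and doubling are illustrative assumptions, not the real policy.
    """
    proposed = current_days / 2 if changed else current_days * 2
    return max(MIN_DAYS, min(MAX_DAYS, proposed))
```

Under this sketch, a page on a 60-day schedule that has not changed moves to the 90-day ceiling, and a page already on a daily schedule stays daily when it changes again.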

Text extraction

TrellisBot extracts the visible text content of a page — headings, paragraphs, lists, and anchor text. It stores a portion of the page text for snippet generation and relevance scoring. Navigation menus, footers, and repeated boilerplate elements have less influence on indexing than the main body content of a page.
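
To preview roughly what a text-only crawler sees on your page, you can strip markup and script content yourself. The following is a minimal sketch using Python's standard html.parser; it does not attempt to down-weight navigation or boilerplate, and its names are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Gathers text nodes, skipping the contents of non-visible elements."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self.parts.append(text)


def visible_text(html):
    extractor = VisibleTextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any buffered trailing text
    return " ".join(extractor.parts)
```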

What Ranks Well

TrellisSearch uses a multi-signal ranking system that rewards genuine, human-readable content. The following characteristics positively influence how a page ranks:

✓ Content Quality

  • Original, well-written content
  • Sufficient depth and length
  • Clear, descriptive page titles
  • Logical heading structure
  • Natural, readable prose
  • Accurate meta descriptions
  • Descriptive alt text on images

✓ Technical Signals

  • Served over HTTPS
  • Fast server response times
  • Clean, readable URLs
  • Valid HTML structure
  • Mobile-friendly layout
  • Proper use of canonical tags

✓ Authority Signals

  • Links from other indexed sites
  • Consistent domain history
  • Established domain age
  • Relevant internal linking
  • Sitemap provided

✓ Freshness

  • Recently published content
  • Regularly updated pages
  • Active, maintained sites
  • Current, accurate information

Our philosophy: TrellisSearch intentionally favors smaller, honest pages over aggressively optimized ones. A well-written page on a personal site can outrank a keyword-stuffed page on a large domain.

What Ranks Poorly

TrellisSearch automatically detects and penalizes pages that attempt to manipulate rankings or provide little genuine value to users. The following will result in ranking suppression or removal:

✗ Content Problems

  • Thin or near-empty pages
  • Duplicate content across pages
  • Auto-generated or gibberish text
  • Misleading titles that don't match content
  • Pages with no meaningful text
  • Scraped or copied content
  • Important content only in images with no alt text

✗ Keyword Manipulation

  • Keyword stuffing in body text
  • Repeated keywords in titles
  • Unnatural keyword density
  • Keyword lists unrelated to content
  • Repeated phrases throughout page

✗ Hidden Content

  • Text hidden via CSS
  • White text on white backgrounds
  • Content hidden off-screen
  • Zero-size fonts
  • Invisible overlays

✗ Link Manipulation

  • Excessive outbound links
  • Link farms
  • Unrelated link clusters
  • Doorway pages
  • Thin pages designed to pass link equity

Penalty severity: Detected spam signals result in automatic ranking suppression. Severe cases — such as hidden text or egregious keyword stuffing — can reduce a page's ranking score by up to 99%, effectively removing it from results.

Technical Requirements for Indexing

To ensure your pages are indexed correctly, keep the following in mind:

Page accessibility

TrellisBot must be able to reach your page without authentication, CAPTCHA, or JavaScript-only rendering. Pages behind login walls or paywalls, and pages that require user interaction to display content, will not be fully indexed.

Response codes

HTTP Code | What happens
200 OK | Page is fetched and processed for indexing
301 / 302 | TrellisBot follows redirects to the final destination
404 Not Found | URL is marked as permanently gone and removed from queue
403 Forbidden | Page is skipped; repeated 403s may result in domain being deprioritized
429 Too Many Requests | TrellisBot backs off and retries later
500 Server Error | Retry attempted; persistent errors skip the URL
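
The table above can be read as a lookup from status code to crawler action. The sketch below encodes it that way; the action names are invented labels for illustration, and codes not listed in the table are reported as undocumented instead of being guessed.

```python
# Actions paraphrased from the response code table; the labels are illustrative.
ACTIONS = {
    200: "fetch_and_index",
    301: "follow_redirect",
    302: "follow_redirect",
    404: "remove_from_queue",
    403: "skip_page",
    429: "back_off_and_retry",
    500: "retry_then_skip",
}


def action_for(status):
    """Return the documented action label, or 'not_documented' for other codes."""
    return ACTIONS.get(status, "not_documented")
```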

Minimum content threshold

Pages with very little text content — typically fewer than 50 words — are considered low quality and may be skipped or ranked very low. This includes pages that are primarily navigation menus, error pages, or auto-generated index pages with no original content.
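
A rough self-check against this threshold is a whitespace word count on the page's extracted text. The sketch assumes that "words" means whitespace-separated tokens and treats 50 as a hard cutoff, although the text above describes it only as typical.

```python
MIN_WORDS = 50  # "typically fewer than 50 words", treated here as a fixed cutoff


def is_thin(text, min_words=MIN_WORDS):
    """Return True if the text has fewer whitespace-separated words than min_words."""
    return len(text.split()) < min_words
```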

Sitemaps

XML sitemaps significantly improve discovery speed. TrellisBot processes up to 10 sitemaps per domain. Sitemap index files are supported. Include your sitemap in robots.txt:

Sitemap: https://example.com/sitemap.xml
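
To check which sitemaps your robots.txt exposes, you can parse it with Python's standard urllib.robotparser (the site_maps method requires Python 3.8 or later). The cap of 10 mirrors the per-domain limit stated above; which 10 would be chosen when more are listed is not specified, so taking the first 10 is an assumption.

```python
from urllib.robotparser import RobotFileParser

MAX_SITEMAPS = 10  # documented per-domain limit


def sitemaps_from_robots(robots_txt, limit=MAX_SITEMAPS):
    """Return the Sitemap: URLs declared in robots_txt, keeping at most `limit`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return (parser.site_maps() or [])[:limit]
```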

Canonical URLs

Use canonical tags to indicate the preferred version of a page when duplicate or similar content exists across multiple URLs. TrellisBot respects <link rel="canonical"> tags and uses the canonical URL as the indexed version.
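
As a quick way to confirm which URL a page declares as canonical, the sketch below reads the first rel="canonical" link from the HTML and resolves it against the page URL. Falling back to the page's own URL when no tag is present is an assumption, and the names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalFinder(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> encountered."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical = attributes["href"]


def canonical_url(page_url, html):
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical is None:
        return page_url  # assumed fallback: the page is its own canonical
    return urljoin(page_url, finder.canonical)
```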

JavaScript rendering

TrellisBot does not execute JavaScript. If your site relies on JavaScript to render content, consider implementing server-side rendering (SSR) or providing static HTML fallbacks to ensure your content is indexable.

Images and alt text

TrellisBot cannot read text inside images. If your page uses images to display important information — such as infographics, charts, screenshots of text, banners with text, or logos — that content is invisible to the crawler. Use alt attributes on your <img> tags to describe the image content in plain text:

<img src="infographic.png" alt="Chart showing 40% growth in renewable energy from 2020 to 2025">

Alt text serves two purposes: it gives TrellisBot context about what the image contains, and it improves accessibility for users with screen readers. Descriptive, accurate alt text is always better than generic placeholders like "image" or leaving the attribute empty on meaningful images.

Text in images: If critical page content — headings, product names, descriptions, contact information — only exists inside images with no alt text or surrounding HTML text, TrellisBot will not index that content. Pages that rely heavily on image-based text may rank poorly due to low detected word count.
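
A small audit script can list images that carry no usable alt text. This sketch flags both missing and empty alt attributes; an empty alt is appropriate for purely decorative images, so treat the output as a review list, not a list of errors. The names are hypothetical.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Records the src of every <img> whose alt attribute is absent or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = (attributes.get("alt") or "").strip()
        if not alt:
            self.missing.append(attributes.get("src") or "(no src)")


def images_missing_alt(html):
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing
```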

Verifying TrellisBot

The User-Agent header in HTTP requests can be set to anything by anyone. To confirm that a request genuinely comes from TrellisBot rather than an impersonator, perform a reverse DNS lookup on the source IP address.

  1. Record the IP address of the request from your server access logs
  2. Run a reverse DNS lookup on that IP: host <IP_ADDRESS>
  3. The result should resolve to a hostname associated with TrellisSearch or its hosting provider
  4. Run a forward lookup on that hostname to confirm it resolves back to the same IP

Impersonators: If a request claims to be TrellisBot but the reverse DNS lookup does not confirm a TrellisSearch hostname, the request is not from TrellisBot. You may safely block it.
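
The four steps above can be sketched as a forward-confirmed reverse DNS check. The hostname suffixes to accept are not published on this page, so they are left as a parameter, and the suffix in the usage example is a placeholder. The resolver functions are injectable so the logic can be tested without network access; socket.gethostbyname_ex resolves IPv4 addresses only.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return True only if ip reverse-resolves to an allowed hostname
    and that hostname resolves forward to the same ip."""
    try:
        hostname = reverse(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    hostname = hostname.rstrip(".").lower()
    # Suffixes should start with "." so "evil-example.net" cannot match ".example.net".
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses


if __name__ == "__main__":
    # ".crawler.example.net" is a placeholder, not a real TrellisSearch hostname suffix.
    print(verify_crawler_ip("203.0.113.7", (".crawler.example.net",)))
```

The forward lookup matters because whoever controls an IP range also controls its reverse DNS records and can make them claim any hostname.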

Controlling TrellisBot

TrellisBot fully respects the robots exclusion protocol. You can manage its access to your site using robots.txt, HTML meta tags, or HTTP response headers.

Block TrellisBot from your entire site

User-agent: TrellisBot
Disallow: /

Block TrellisBot from specific paths

User-agent: TrellisBot
Disallow: /private/
Disallow: /members/

Set a crawl delay (seconds between requests)

User-agent: TrellisBot
Crawl-delay: 10

Allow TrellisBot, block all others

User-agent: *
Disallow: /

User-agent: TrellisBot
Disallow:
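
Before deploying rules like the ones above, you may want to test them locally. The sketch below uses Python's standard urllib.robotparser to evaluate a rule set for a given user agent; the helper name and the "OtherBot" agent are hypothetical, and TrellisBot's own parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The "allow TrellisBot, block all others" rule set from above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: TrellisBot
Disallow:
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "TrellisBot", "https://example.com/page"))
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/page"))
```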

Prevent indexing with a meta tag

To stop a specific page from appearing in TrellisSearch results, add a robots meta tag to the <head> of that page:

<meta name="robots" content="noindex">

Supported directives: noindex, nofollow, nosnippet, none.
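
The content attribute can hold several comma-separated directives. As a sketch of how such a value might be interpreted, the function below keeps only the four supported directives and expands none into noindex plus nofollow, which is its conventional meaning; that expansion and the function name are assumptions, not documented TrellisBot behavior.

```python
SUPPORTED = {"noindex", "nofollow", "nosnippet", "none"}


def parse_robots_directives(content):
    """Return the set of supported directives found in a robots meta content value."""
    tokens = {token.strip().lower() for token in content.split(",")}
    directives = tokens & SUPPORTED
    if "none" in directives:
        # Conventional meaning of "none": equivalent to noindex, nofollow.
        directives = (directives - {"none"}) | {"noindex", "nofollow"}
    return directives
```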

Using X-Robots-Tag for non-HTML files

X-Robots-Tag: noindex

Crawling vs. indexing: Blocking TrellisBot via robots.txt prevents it from fetching a page, but that URL may still appear in results if other sites link to it. To remove a URL from results entirely, use noindex — but TrellisBot must be able to fetch the page to read that directive.

Submitting Your Site

You do not need to submit your site for TrellisBot to find it — it discovers pages naturally by following links. However, you can speed up discovery by submitting your URL directly or by providing a sitemap.


Sitemaps

Reference your XML sitemap in robots.txt to help TrellisBot discover all your pages efficiently:

Sitemap: https://example.com/sitemap.xml

Tips for faster indexing

Contact

For questions about TrellisBot or to report a crawling issue, reach out at support@trellissearch.com.

If you believe TrellisBot is behaving incorrectly or causing server problems, please include your server logs and the specific URLs involved. We take such reports seriously and respond promptly.