TrellisBot — TrellisSearch Web Crawler

01 / Overview

What is TrellisBot?

TrellisBot is an automated program that systematically browses the web to build and maintain the TrellisSearch index. It follows hyperlinks from page to page, fetches content, and extracts information that powers search results at trellissearch.com.

TrellisBot is designed to be a well-behaved, respectful crawler. It identifies itself clearly in every request, obeys robots.txt rules, respects crawl delays, and does not attempt to access password-protected or otherwise restricted content.

Open index: TrellisSearch is an independent search engine. Appearing in our index is separate from appearing in Google, Bing, or other search engines. Blocking TrellisBot only affects TrellisSearch.

02 / Identification

User Agent String

TrellisBot identifies itself in every HTTP request using the following user agent string:

TrellisBot/1.0 (+https://trellissearch.com/bot.html)

Field	Value
Crawler name	`TrellisBot`
Version	1.0
robots.txt token	`TrellisBot`
Documentation URL	`https://trellissearch.com/bot.html`
Operator	Trellis Group LLC
Contact	`support@trellissearch.com`
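If you want to spot TrellisBot in your access logs, one option is to match the User-Agent header against the string documented above. The following is a minimal Python sketch, not an official tool. The function name is hypothetical, and the loose version pattern assumes future releases keep the same `TrellisBot/<version>` shape, which this page does not promise.

```python
import re
from typing import Optional

# Pattern built from the documented user agent string. Matching any dotted
# version number (not only 1.0) is an assumption made for this sketch.
TRELLISBOT_UA = re.compile(
    r"^TrellisBot/(\d+(?:\.\d+)*) \(\+https://trellissearch\.com/bot\.html\)$"
)


def claimed_trellisbot_version(user_agent: str) -> Optional[str]:
    """Return the version a request claims, or None if the header does not match."""
    match = TRELLISBOT_UA.match(user_agent.strip())
    return match.group(1) if match else None
```

A match only shows that a request claims to be TrellisBot. Because the header can be forged, combine it with the DNS check described under Verifying TrellisBot.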

03 / Behavior

How TrellisBot Crawls

Discovery

TrellisBot discovers new pages primarily by following hyperlinks found on pages it has already visited. It also processes XML sitemaps referenced in robots.txt files via the Sitemap: directive.

Crawl rate

TrellisBot is designed to crawl politely and avoid placing excessive load on web servers. It limits concurrent connections per server, introduces delays between requests, and fully respects Crawl-delay directives. If your server is struggling with crawl traffic from TrellisBot, set a crawl delay in your robots.txt.

Content size limits

TrellisBot fetches up to 5 MB of content per page. Content beyond this limit is not downloaded. For most pages this limit is never reached. Very large pages, such as those that embed large blocks of data or generated content, may be partially indexed.
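To illustrate what a per-page size cap means in practice, here is a hedged Python sketch of reading a response body up to a fixed limit and reporting whether anything was left behind. It is not TrellisBot's implementation. The function name is hypothetical, and reading the limit as 5,000,000 bytes (rather than 5 × 1024 × 1024) is an assumption.

```python
import io

MAX_BYTES = 5 * 1000 * 1000  # assumption: the documented limit read as 5,000,000 bytes


def read_capped(stream, limit=MAX_BYTES, chunk_size=64 * 1024):
    """Read at most `limit` bytes from a binary stream.

    Returns (data, truncated); truncated is True if more data remained.
    """
    chunks = []
    remaining = limit
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:
            # The stream ended before the limit was reached.
            return b"".join(chunks), False
        chunks.append(chunk)
        remaining -= len(chunk)
    # Limit reached: try one more byte to learn whether content was cut off.
    truncated = bool(stream.read(1))
    return b"".join(chunks), truncated


# Example with an in-memory stream and a small limit.
data, truncated = read_capped(io.BytesIO(b"x" * 100), limit=10)
```

If your important content sits near the end of a very large page, a cap like this is why it may not be indexed.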

Supported content types

HTML pages (text/html)
PDF documents (application/pdf)
Plain text (text/plain)

TrellisBot does not currently execute JavaScript. Pages that rely entirely on client-side rendering to display content may not be fully indexed. Server-side rendered or static HTML pages will be indexed most accurately.

Crawl frequency

How often TrellisBot revisits a page depends on how frequently that page changes. Pages that update often are checked more frequently; static pages are revisited less often. Revisit intervals adapt automatically based on observed change history, ranging from daily for very active pages to every 90 days for content that rarely changes.
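The exact scheduling rule is not published here. As a rough illustration of an interval that adapts to change history, the Python sketch below halves the revisit interval when a page has changed and doubles it when it has not, clamped to the documented range of 1 to 90 days. The halve/double rule and the function name are assumptions chosen for the example.

```python
MIN_INTERVAL_DAYS = 1    # "daily for very active pages"
MAX_INTERVAL_DAYS = 90   # "every 90 days for content that rarely changes"


def next_interval(current_days: float, changed: bool) -> float:
    """Return the next revisit interval in days.

    Assumed rule: halve after an observed change, double otherwise,
    then clamp to the documented 1 to 90 day range.
    """
    proposed = current_days / 2 if changed else current_days * 2
    return max(MIN_INTERVAL_DAYS, min(MAX_INTERVAL_DAYS, proposed))


# A page that keeps changing drifts toward daily visits,
# while a static page drifts toward the 90-day ceiling.
interval = 8
for changed in (True, True, True, True):
    interval = next_interval(interval, changed)
```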

Text extraction

TrellisBot extracts the visible text content of a page — headings, paragraphs, lists, and anchor text. It stores a portion of the page text for snippet generation and relevance scoring. Navigation menus, footers, and repeated boilerplate elements have less influence on indexing than the main body content of a page.

04 / Quality Signals

What Ranks Well

TrellisSearch uses a multi-signal ranking system that rewards genuine, human-readable content. The following characteristics positively influence how a page ranks:

✓ Content Quality

Original, well-written content
Sufficient depth and length
Clear, descriptive page titles
Logical heading structure
Natural, readable prose
Accurate meta descriptions
Descriptive alt text on images

✓ Technical Signals

Served over HTTPS
Fast server response times
Clean, readable URLs
Valid HTML structure
Mobile-friendly layout
Proper use of canonical tags

✓ Authority Signals

Links from other indexed sites
Consistent domain history
Established domain age
Relevant internal linking
Sitemap provided

✓ Freshness

Recently published content
Regularly updated pages
Active, maintained sites
Current, accurate information

Our philosophy: TrellisSearch intentionally favors smaller, honest pages over aggressively optimized ones. A well-written page on a personal site can outrank a keyword-stuffed page on a large domain.

05 / Spam & Quality Penalties

What Ranks Poorly

TrellisSearch automatically detects and penalizes pages that attempt to manipulate rankings or provide little genuine value to users. The following will result in ranking suppression or removal:

✗ Content Problems

Thin or near-empty pages
Duplicate content across pages
Auto-generated or gibberish text
Misleading titles that don't match content
Pages with no meaningful text
Scraped or copied content
Important content only in images with no alt text

✗ Keyword Manipulation

Keyword stuffing in body text
Repeated keywords in titles
Unnatural keyword density
Keyword lists unrelated to content
Repeated phrases throughout page

✗ Hidden Content

Text hidden via CSS
White text on white backgrounds
Content hidden off-screen
Zero-size fonts
Invisible overlays

✗ Link Manipulation

Excessive outbound links
Link farms
Unrelated link clusters
Doorway pages
Thin pages designed to pass link equity

Penalty severity: Detected spam signals result in automatic ranking suppression. Severe cases — such as hidden text or egregious keyword stuffing — can reduce a page's ranking score by up to 99%, effectively removing it from results.

06 / Technical Requirements

Technical Requirements for Indexing

To ensure your pages are indexed correctly, keep the following in mind:

Page accessibility

TrellisBot must be able to reach your page without authentication, CAPTCHA, or JavaScript-only rendering. Pages behind login walls or paywalls, and pages that require user interaction to display content, will not be fully indexed.

Response codes

HTTP Code	What happens
200 OK	Page is fetched and processed for indexing
301 / 302	TrellisBot follows redirects to the final destination
404 Not Found	URL is marked as permanently gone and removed from the crawl queue
403 Forbidden	Page is skipped; repeated 403s may result in the domain being deprioritized
429 Too Many Requests	TrellisBot backs off and retries later
500 Server Error	A retry is attempted; the URL is skipped if errors persist
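The table above can be read as a simple lookup from status code to crawler action. The Python sketch below restates it that way so you can check what your server's responses would trigger. The action labels and function name are hypothetical, and status codes the table does not list are reported as unspecified instead of guessed.

```python
# Actions restated from the response code table; the label strings are
# illustrative names for this sketch, not values used by TrellisBot.
ACTIONS = {
    200: "fetch_and_index",
    301: "follow_redirect",
    302: "follow_redirect",
    404: "mark_gone_and_remove_from_queue",
    403: "skip_page",
    429: "back_off_and_retry_later",
    500: "retry_then_skip_if_persistent",
}


def action_for_status(status: int) -> str:
    """Return the documented action for a status code, or 'unspecified'."""
    return ACTIONS.get(status, "unspecified")
```

For example, `action_for_status(429)` returns `"back_off_and_retry_later"`, which is the response to send if you need TrellisBot to slow down temporarily.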

Minimum content threshold

Pages with very little text content — typically fewer than 50 words — are considered low quality and may be skipped or ranked very low. This includes pages that are primarily navigation menus, error pages, or auto-generated index pages with no original content.
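To estimate whether a page clears the roughly 50-word threshold, you can count the words in its static HTML. The Python sketch below uses the standard library's `html.parser` and skips `<script>` and `<style>` content. It is an approximation under stated assumptions: TrellisBot's own extraction also discounts boilerplate such as menus and footers, which this sketch does not attempt, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

MIN_WORDS = 50                      # "typically fewer than 50 words"
SKIPPED_TAGS = {"script", "style"}  # never visible as page text


class VisibleText(HTMLParser):
    """Collect whitespace-separated words outside <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def visible_word_count(html: str) -> int:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(parser.words)


def looks_thin(html: str) -> bool:
    """True if the page has fewer words than the assumed threshold."""
    return visible_word_count(html) < MIN_WORDS
```

Because no JavaScript runs here, a page that renders its text client-side will report a very low count, which mirrors how a non-rendering crawler sees it.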

Sitemaps

XML sitemaps significantly improve discovery speed. TrellisBot processes up to 10 sitemaps per domain. Sitemap index files are supported. Include your sitemap in robots.txt:

Sitemap: https://example.com/sitemap.xml
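To check which sitemaps your robots.txt actually advertises, Python's standard `urllib.robotparser` can list the `Sitemap:` entries (its `site_maps()` method requires Python 3.8 or later). This is a sketch using Python's parser, not TrellisBot's. Keeping the first 10 entries in file order is an assumption, since this page states the limit but not which sitemaps are chosen when there are more.

```python
from urllib.robotparser import RobotFileParser

MAX_SITEMAPS = 10  # documented per-domain limit


def sitemap_urls(robots_txt: str, limit: int = MAX_SITEMAPS) -> list:
    """Return the Sitemap URLs declared in a robots.txt body, capped at `limit`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap lines are present.
    return (parser.site_maps() or [])[:limit]


ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
"""

found = sitemap_urls(ROBOTS_TXT)
```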

Canonical URLs

Use canonical tags to indicate the preferred version of a page when duplicate or similar content exists across multiple URLs. TrellisBot respects <link rel="canonical"> tags and uses the canonical URL as the indexed version.
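To confirm that a page declares the canonical URL you expect, you can extract the tag from its HTML. The Python sketch below returns the `href` of the first `<link rel="canonical">` it finds. Taking the first tag when several are present is an assumption of this example, and the names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attributes.get("href"):
            self.canonical = attributes["href"]


def canonical_url(html: str) -> Optional[str]:
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```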

JavaScript rendering

TrellisBot does not execute JavaScript. If your site relies on JavaScript to render content, consider implementing server-side rendering (SSR) or providing static HTML fallbacks to ensure your content is indexable.

Images and alt text

TrellisBot cannot read text inside images. If your page uses images to display important information — such as infographics, charts, screenshots of text, banners with text, or logos — that content is invisible to the crawler. Use alt attributes on your <img> tags to describe the image content in plain text:

<img src="infographic.png" alt="Chart showing 40% growth in renewable energy from 2020 to 2025">

Alt text serves two purposes: it gives TrellisBot context about what the image contains, and it improves accessibility for users with screen readers. Descriptive, accurate alt text is always better than generic placeholders like "image" or leaving the attribute empty on meaningful images.

Text in images: If critical page content — headings, product names, descriptions, contact information — only exists inside images with no alt text or surrounding HTML text, TrellisBot will not index that content. Pages that rely heavily on image-based text may rank poorly due to low detected word count.
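A quick way to audit a page for this problem is to list the images that have no usable alt text. The Python sketch below reports the `src` of every `<img>` whose `alt` attribute is missing or blank. It cannot tell decorative images (where an empty alt is appropriate) from meaningful ones, so treat the output as a list to review, not a list of errors. The names are hypothetical.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of <img> tags whose alt attribute is absent or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")  # None if absent or written without a value
        if alt is None or not alt.strip():
            self.missing.append(attributes.get("src") or "(no src)")


def images_without_alt(html: str) -> list:
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing
```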

07 / Verification

Verifying TrellisBot

The User-Agent header in HTTP requests can be set to anything by anyone. To confirm that a request genuinely comes from TrellisBot rather than an impersonator, perform a reverse DNS lookup on the source IP address.

Record the IP address of the request from your server access logs
Run a reverse DNS lookup on that IP: host <IP_ADDRESS>
The result should resolve to a hostname associated with TrellisSearch or its hosting provider
Run a forward lookup on that hostname to confirm it resolves back to the same IP
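The steps above can be sketched in Python with the standard `socket` module. This is an illustration, not an official verification tool. The function name is hypothetical, and `allowed_suffixes` must be supplied by you, because this page does not publish the hostname suffixes TrellisBot uses. Note that `socket.gethostbyname_ex` resolves IPv4 addresses only. The resolver functions are parameters so the logic can be exercised without network access.

```python
import socket


def is_verified_crawler(ip, allowed_suffixes,
                        reverse_lookup=socket.gethostbyaddr,
                        forward_lookup=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for a source IP address.

    allowed_suffixes: hostname suffixes you have decided to trust
    (hypothetical values; obtain the real ones from the operator).
    """
    try:
        hostname = reverse_lookup(ip)[0]          # step 2: reverse lookup
    except OSError:
        return False                              # no PTR record or lookup failure
    host = hostname.rstrip(".").lower()
    if not any(host == suffix or host.endswith("." + suffix)
               for suffix in allowed_suffixes):   # step 3: expected hostname?
        return False
    try:
        addresses = forward_lookup(hostname)[2]   # step 4: forward lookup
    except OSError:
        return False
    return ip in addresses                        # must resolve back to the same IP


# Offline demonstration with stand-in resolvers and documentation-only addresses.
def fake_reverse(ip):
    return ("crawl-1.bot.example.", [], [ip])


def fake_forward(hostname):
    return (hostname, [], ["192.0.2.10"])


verified = is_verified_crawler("192.0.2.10", ["bot.example"], fake_reverse, fake_forward)
```

Matching on a full label boundary (`"." + suffix`) prevents a hostname such as `evilbot.example` from passing a check for `bot.example`.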

Impersonators: If a request claims to be TrellisBot but the reverse DNS lookup does not return a hostname associated with TrellisSearch or its hosting provider, or the forward lookup does not resolve back to the same IP, the request is not from TrellisBot. You may safely block it.

08 / Control

Controlling TrellisBot

TrellisBot fully respects the robots exclusion protocol. You can manage its access to your site using robots.txt, HTML meta tags, or HTTP response headers.

Block TrellisBot from your entire site

User-agent: TrellisBot
Disallow: /

Block TrellisBot from specific paths

User-agent: TrellisBot
Disallow: /private/
Disallow: /members/

Set a crawl delay (seconds between requests)

User-agent: TrellisBot
Crawl-delay: 10
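Before deploying a crawl delay, you can sanity-check that the directive parses as intended. The sketch below uses Python's standard `urllib.robotparser`, which is a separate parser from TrellisBot's and may differ in edge cases. For instance, it ignores delay values that are not whole numbers. The function name is hypothetical.

```python
from urllib.robotparser import RobotFileParser
from typing import Optional


def crawl_delay_for(robots_txt: str, user_agent: str = "TrellisBot") -> Optional[int]:
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


ROBOTS_TXT = """\
User-agent: TrellisBot
Crawl-delay: 10
"""

delay = crawl_delay_for(ROBOTS_TXT)
```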

Allow TrellisBot, block all others

User-agent: *
Disallow: /

User-agent: TrellisBot
Disallow:
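You can test rule sets like the ones above before publishing them. The following Python sketch evaluates robots.txt rules with the standard `urllib.robotparser`. It reflects how Python's parser interprets the rules, which is assumed (not guaranteed) to match TrellisBot for simple cases like these. The helper name and the `OtherBot` agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate a robots.txt body for one user agent and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ALLOW_ONLY_TRELLISBOT = """\
User-agent: *
Disallow: /

User-agent: TrellisBot
Disallow:
"""

BLOCK_PATHS = """\
User-agent: TrellisBot
Disallow: /private/
Disallow: /members/
"""

trellis_ok = is_allowed(ALLOW_ONLY_TRELLISBOT, "TrellisBot", "https://example.com/page")
other_ok = is_allowed(ALLOW_ONLY_TRELLISBOT, "OtherBot", "https://example.com/page")
```

An empty `Disallow:` line means nothing is disallowed, which is why the TrellisBot group grants full access while the `*` group blocks everyone else.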

Prevent indexing with a meta tag

To stop a specific page from appearing in TrellisSearch results, add a robots meta tag to the <head> of that page:

<meta name="robots" content="noindex">

Supported directives: noindex, nofollow, nosnippet, none.
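As an illustration of how a `content` value with several directives can be interpreted, the Python sketch below splits the value on commas, keeps the supported directives, and expands `none` into `noindex` plus `nofollow`. That expansion is the conventional meaning of `none`. This page does not spell it out for TrellisBot, so treat it as an assumption. The function name is hypothetical.

```python
SUPPORTED = {"noindex", "nofollow", "nosnippet", "none"}


def parse_robots_directives(content: str) -> set:
    """Return the supported directives found in a robots meta content value."""
    tokens = {token.strip().lower() for token in content.split(",") if token.strip()}
    directives = tokens & SUPPORTED
    if "none" in directives:
        # Assumed conventional meaning: none == noindex, nofollow.
        directives = (directives - {"none"}) | {"noindex", "nofollow"}
    return directives
```

For example, `parse_robots_directives("NOINDEX, nofollow")` yields the set containing `noindex` and `nofollow`, and unsupported tokens such as `index` are dropped.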

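Using X-Robots-Tag for non-HTML files

For files where a meta tag cannot be added, such as PDF documents, send the directive as an HTTP response header: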

X-Robots-Tag: noindex
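How you attach this header depends on your server. As a framework-neutral illustration, the Python sketch below decides, from the request path alone, whether a response should carry `X-Robots-Tag: noindex`. The function name and the choice of file extensions are hypothetical; adapt them to the files you want kept out of the index.

```python
from pathlib import PurePosixPath

# Hypothetical policy for this sketch: keep PDF and plain-text files out of the index.
NOINDEX_SUFFIXES = {".pdf", ".txt"}


def extra_headers(url_path: str) -> dict:
    """Return response headers to add for the given URL path."""
    suffix = PurePosixPath(url_path).suffix.lower()
    if suffix in NOINDEX_SUFFIXES:
        return {"X-Robots-Tag": "noindex"}
    return {}
```

A server handler would merge the returned dictionary into its response headers, so `/reports/q3.pdf` is served with the header while `/index.html` is not.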

Crawling vs. indexing: Blocking TrellisBot via robots.txt prevents it from fetching a page, but that URL may still appear in results if other sites link to it. To remove a URL from results entirely, use noindex and leave the page crawlable: TrellisBot must be able to fetch the page to read that directive, so do not also disallow it in robots.txt.

09 / Discovery

Submitting Your Site

You do not need to submit your site for TrellisBot to find it — it discovers pages naturally by following links. However, you can speed up discovery by submitting your URL directly or by providing a sitemap.

Sitemaps

Reference your XML sitemap in robots.txt to help TrellisBot discover all your pages efficiently:

Sitemap: https://example.com/sitemap.xml

Tips for faster indexing

Submit your homepage URL — TrellisBot will follow internal links to discover other pages
Ensure your sitemap is up to date and referenced in robots.txt
Make sure TrellisBot is not blocked in your robots.txt
Link your pages to one another so TrellisBot can discover more content
Use descriptive anchor text on internal links

10 / Support

Contact

For questions about TrellisBot or to report a crawling issue, reach out at support@trellissearch.com.

If you believe TrellisBot is behaving incorrectly or causing server problems, please include your server logs and the specific URLs involved. We take such reports seriously and respond promptly.