SEOAI Crawler Policy

Effective May 27, 2026 · Version 1.0

SEOAI operates a single dedicated crawler, SEOAI-CorpusBot, which builds the AI-readiness dataset behind api.seoai.space. This page documents exactly how it behaves so that site operators, legal teams, and security reviewers can verify its behavior and, if they wish, block it.

1. Bot identification

Every request the corpus crawler makes carries this exact User-Agent header:

SEOAI-CorpusBot/1.0 (+https://seoai.space; contact=support@seoai.space)

Customer-facing audit fetches (initiated by a logged-in SEOAI customer to score their own site) use a browser-style user agent because they're loading the site as that customer would. Bulk corpus fetches are always SEOAI-CorpusBot.

2. robots.txt — fully honored

We fetch /robots.txt for every domain before crawling.
If robots.txt disallows the homepage for SEOAI-CorpusBot or for the wildcard agent *, we crawl zero pages and mark the domain as blocked in our database.
Every individual URL is filtered through RobotFileParser from Python's standard library (urllib.robotparser); paths under a Disallow rule are never requested.
We respect the Crawl-delay directive per host: a 200 ms floor applies by default, and any higher value in robots.txt is obeyed.

To block us specifically:

User-agent: SEOAI-CorpusBot
Disallow: /
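
Site operators who want to preview how a given robots.txt would be read can approximate the checks in this section with the same standard-library parser. The following is an illustrative sketch, not SEOAI's production code: the helper names (parse_robots, domain_blocked, effective_delay) are hypothetical, and the only values taken from this page are the User-Agent token and the 200 ms floor.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "SEOAI-CorpusBot"   # token from section 1
MIN_DELAY_SECONDS = 0.2          # the 200 ms floor stated in this section


def parse_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def domain_blocked(parser: RobotFileParser, base_url: str) -> bool:
    """True if the homepage itself is disallowed for the bot (or for *)."""
    return not parser.can_fetch(USER_AGENT, base_url.rstrip("/") + "/")


def effective_delay(parser: RobotFileParser) -> float:
    """Crawl-delay from robots.txt, never lower than the 200 ms floor."""
    delay = parser.crawl_delay(USER_AGENT)  # None when no directive applies
    if delay is None:
        return MIN_DELAY_SECONDS
    return max(MIN_DELAY_SECONDS, float(delay))


if __name__ == "__main__":
    rules = parse_robots("User-agent: *\nCrawl-delay: 2\nDisallow: /private/\n")
    print(domain_blocked(rules, "https://example.com"))
    print(rules.can_fetch(USER_AGENT, "https://example.com/private/report"))
    print(effective_delay(rules))
```

Pasting your own robots.txt into parse_robots shows which URLs a compliant crawler using this parser would skip.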

3. What we fetch

Publicly accessible HTML pages (HTTP 200 with no auth wall).
Maximum 10 pages per domain per crawl cycle.
At most one crawl cycle per domain every 30 days or more (we are not a real-time crawler).
URLs are taken from the sitemap and from links on the homepage only; there is no aggressive deep breadth-first crawl.
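
The limits in this section can be pictured as two small checks. This is a sketch under assumptions, not a description of the actual scheduler: the function names are hypothetical, and listing sitemap URLs ahead of homepage links is an assumed ordering that this page does not specify.

```python
from datetime import datetime, timedelta, timezone

MAX_PAGES_PER_CYCLE = 10                  # "Maximum 10 pages per domain"
MIN_CYCLE_INTERVAL = timedelta(days=30)   # "30+ days" between cycles


def due_for_crawl(last_crawl, now):
    """A domain is due if it was never crawled or the interval has elapsed."""
    return last_crawl is None or now - last_crawl >= MIN_CYCLE_INTERVAL


def select_pages(sitemap_urls, homepage_links):
    """De-duplicate candidates in order and keep at most the per-cycle cap.

    Assumption: sitemap URLs are considered before homepage links.
    """
    unique = dict.fromkeys([*sitemap_urls, *homepage_links])
    return list(unique)[:MAX_PAGES_PER_CYCLE]


if __name__ == "__main__":
    now = datetime(2026, 7, 1, tzinfo=timezone.utc)
    last = datetime(2026, 5, 27, tzinfo=timezone.utc)
    print(due_for_crawl(last, now))
    print(select_pages(["https://example.com/", "https://example.com/a"],
                       ["https://example.com/a", "https://example.com/b"]))
```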

4. What we do NOT fetch

Anything behind a login wall, paywall, or session token.
Files matching common admin / private patterns (/admin/, /wp-login.php, /.git/, etc.).
Non-HTML binaries we wouldn't analyze (videos, PDFs larger than 10 MB).
Email addresses, phone numbers, or other PII harvested for direct-contact use.
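
A path filter of the kind described above might look like the following. It is a minimal sketch that covers only the three patterns this page names; the full pattern list is not published here, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

# Only the patterns named on this page; the real list ("etc.") is not shown.
EXCLUDED_PATH_MARKERS = ("/admin/", "/wp-login.php", "/.git/")


def is_excluded_path(url: str) -> bool:
    """True if the URL path contains one of the admin/private markers."""
    path = urlsplit(url).path.lower()
    return any(marker in path for marker in EXCLUDED_PATH_MARKERS)


if __name__ == "__main__":
    for candidate in ("https://example.com/blog/post",
                      "https://example.com/admin/users",
                      "https://example.com/.git/config"):
        print(candidate, is_excluded_path(candidate))
```

Matching is done on the path component only, so a query string that happens to contain "/admin/" does not cause a page to be skipped.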

5. Rate limits

Per-host concurrency cap: 5 simultaneous requests.
Per-host spacing: minimum 200ms between requests, raised to whatever Crawl-delay asks for.
Hard per-domain wall-clock cap: 5 minutes — we move on if a site is slow.
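
To make the spacing and wall-clock rules concrete, here is a small deterministic sketch. HostPacer and within_budget are hypothetical names, and the code illustrates the stated numbers rather than the crawler's real scheduler; the concurrency cap appears only as a constant because enforcing it depends on the HTTP client in use.

```python
MAX_CONCURRENT_PER_HOST = 5      # per-host concurrency cap
MIN_SPACING_SECONDS = 0.2        # 200 ms floor between requests
DOMAIN_BUDGET_SECONDS = 5 * 60   # hard per-domain wall-clock cap


class HostPacer:
    """Hands out start times for one host so requests keep the minimum spacing."""

    def __init__(self, crawl_delay=None):
        # A Crawl-delay above the floor raises the spacing; lower values do not.
        self.spacing = max(MIN_SPACING_SECONDS, float(crawl_delay or 0.0))
        self.next_slot = 0.0

    def reserve(self, now: float) -> float:
        """Return the earliest time the next request may start and book it."""
        start = max(now, self.next_slot)
        self.next_slot = start + self.spacing
        return start


def within_budget(domain_start: float, now: float) -> bool:
    """False once the domain has used up its five-minute wall-clock budget."""
    return now - domain_start < DOMAIN_BUDGET_SECONDS


if __name__ == "__main__":
    pacer = HostPacer()
    for arrival in (0.0, 0.05, 0.1, 1.0):
        print(arrival, pacer.reserve(arrival))
```

Times are plain seconds on any monotonic clock, which keeps the logic testable without sleeping.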

6. Identifying us in logs

Filter your access logs for the literal string SEOAI-CorpusBot in the User-Agent to identify our crawler. For verification of a specific request, contact support@seoai.space.
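
As one way to do this filtering, the sketch below counts the paths requested by the crawler. It assumes your server writes the common "combined" access-log format, with the User-Agent as the last quoted field; the regular expression and the function name are illustrative, not something SEOAI provides. A User-Agent string can be spoofed, so a match only shows that a request claimed to be the crawler.

```python
import re
from collections import Counter

BOT_TOKEN = "SEOAI-CorpusBot"

# Assumed layout: ... "METHOD /path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def corpus_bot_paths(lines):
    """Count requested paths for lines whose User-Agent contains the bot token."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and BOT_TOKEN in match.group("ua"):
            hits[match.group("path")] += 1
    return hits


if __name__ == "__main__":
    sample = [
        '203.0.113.7 - - [27/May/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 '
        '"-" "SEOAI-CorpusBot/1.0 (+https://seoai.space; contact=support@seoai.space)"',
        '198.51.100.2 - - [27/May/2026:10:00:01 +0000] "GET /about HTTP/1.1" 200 900 '
        '"-" "Mozilla/5.0"',
    ]
    print(corpus_bot_paths(sample))
```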

7. Opt-out

Three options, all equally honored:

robots.txt: nothing to send us; the rule takes effect on our next visit (within 30 days).
Web form at seoai.space/corpus-opt-out: immediate deletion plus permanent exclusion.
Email opt-out@seoai.space from an address at the domain you want excluded.

8. Contact

Security reports, abuse complaints, or general questions: support@seoai.space.

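SEOAI Crawler Policy

Effective May 27, 2026 · Version 1.0

SEOAI operates a dedicated crawler, SEOAI-CorpusBot, to build the dataset behind api.seoai.space. This page spells out its behavior so that site administrators, legal teams, and security reviewers can verify it and block it.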

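1. Bot identifier

Every request from the corpus crawler carries the following User-Agent:

SEOAI-CorpusBot/1.0 (+https://seoai.space; contact=support@seoai.space)

Audit requests in which a customer (a logged-in SEOAI user) scores their own site use a browser-style UA, in order to reproduce how the site appears from the customer's point of view. Bulk corpus fetches always use SEOAI-CorpusBot.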

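2. robots.txt: fully honored

We always fetch /robots.txt before crawling.
If the homepage is disallowed for SEOAI-CorpusBot or for *, we crawl no pages and record the domain as blocked.
Individual URLs are filtered through Python's standard RobotFileParser; nothing under a Disallow rule is ever requested.
We also honor Crawl-delay per host (minimum 200 ms; a longer value in robots.txt is obeyed).

To block us specifically: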

User-agent: SEOAI-CorpusBot
Disallow: /

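3. What we fetch

Publicly accessible HTML pages (HTTP 200, no authentication).
At most 10 pages per domain per crawl cycle.
An interval of at least 30 days per domain (we are not a real-time crawler).
Sitemap and internal links only. We do not run a deep BFS.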

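4. What we do not fetch

Pages behind a login, paywall, or session token.
Admin and private paths (/admin/, /wp-login.php, /.git/, etc.).
Large binaries (videos, PDFs over 10 MB).
Email addresses, phone numbers, and other PII for contact purposes.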

5. Rate limits

Simultaneous connections per host: 5.
Spacing per host: minimum 200 ms (Crawl-delay is followed when specified).
Wall-clock cap per domain: 5 minutes.

6. Identifying us in logs

Filtering the User-Agent field of your access logs for SEOAI-CorpusBot identifies our crawler. If you need a specific request confirmed, contact support@seoai.space.

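7. Opt-out

robots.txt: immediate; applied on our next visit (within 30 days).
Web form at seoai.space/corpus-opt-out: immediate deletion plus permanent exclusion.
Email: write to opt-out@seoai.space from an address tied to the domain.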

8. Contact

Security reports, complaints, and general questions: support@seoai.space