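katana: a next-generation web crawling and spidering framework

Introduction to katana

katana is a web link crawling tool from the ProjectDiscovery project. It can also parse JavaScript files automatically.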

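Features

Fast and fully configurable web crawling
Standard and headless mode support
JavaScript parsing/crawling
Customizable automatic form filling
Scope control: preconfigured fields/regex
Customizable output: preconfigured fields
Input: STDIN, URL, and LIST
Output: STDOUT, file, and JSON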

Installation

katana requires Go 1.18 to install successfully.
To install it, just run the command below or download a precompiled binary from the releases page.

go install github.com/projectdiscovery/katana/cmd/katana@latest

Installing with Docker

Install or update the Docker image to the latest tag:

docker pull projectdiscovery/katana:latest

Run katana in standard mode using Docker:

docker run projectdiscovery/katana:latest -u https://tesla.com

Run katana in headless mode using Docker:

docker run projectdiscovery/katana:latest -u https://tesla.com -system-chrome -headless

Usage

katana -h

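This displays help for the tool. Here are all the switches it supports.

Usage:
  ./katana [flags]

Flags:
INPUT:
   -u, -list string[]  target url / list to crawl

CONFIGURATION:
   -d, -depth int                maximum depth to crawl (default 2)
   -jc, -js-crawl                enable endpoint parsing / crawling in javascript file
   -ct, -crawl-duration int      maximum duration to crawl the target for
   -kf, -known-files string      enable crawling of known files (all,robotstxt,sitemapxml)
   -mrs, -max-response-size int  maximum response size to read (default 2097152)
   -timeout int                  time to wait for request in seconds (default 10)
   -aff, -automatic-form-fill    enable optional automatic form filling (experimental)
   -retry int                    number of times to retry the request (default 1)
   -proxy string                 http/socks5 proxy to use
   -H, -headers string[]         custom header/cookie to include in request
   -config string                path to the katana configuration file
   -fc, -form-config string      path to custom form configuration file

HEADLESS:
   -hl, -headless                   enable headless hybrid crawling (experimental)
   -sc, -system-chrome              use the locally installed chrome browser instead of the one bundled with katana
   -sb, -show-browser               show the browser on the screen in headless mode
   -ho, -headless-options string[]  start headless chrome with additional options
   -nos, -no-sandbox                start headless chrome in no-sandbox mode

SCOPE:
   -cs, -crawl-scope string[]       in scope url regex to be followed by crawler
   -cos, -crawl-out-scope string[]  out of scope url regex to be excluded by crawler
   -fs, -field-scope string         pre-defined scope field (dn,rdn,fqdn) (default "rdn")
   -ns, -no-scope                   disables host based default scope
   -do, -display-out-scope          display external endpoint from scoped crawling

FILTER:
   -f, -field string                field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
   -sf, -store-field string         field to store in per-host output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
   -em, -extension-match string[]   match output for given extension (eg, -em php,html,js)
   -ef, -extension-filter string[]  filter output for given extension (eg, -ef png,css)

RATE-LIMIT:
   -c, -concurrency int          number of concurrent fetchers to use (default 10)
   -p, -parallelism int          number of concurrent inputs to process (default 10)
   -rd, -delay int               request delay between each request in seconds
   -rl, -rate-limit int          maximum requests to send per second (default 150)
   -rlm, -rate-limit-minute int  maximum number of requests to send per minute

OUTPUT:
   -o, -output string  file to write output to
   -j, -json           write output in JSONL(ines) format
   -nc, -no-color      disable output content coloring (ANSI escape codes)
   -silent             display output only
   -v, -verbose        display verbose output
   -version            display project version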

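katana usage examples

katana needs a URL or endpoint to crawl and accepts single or multiple inputs.

A URL can be provided with the -u option, and multiple values can be supplied as comma-separated input. File input is likewise supported with the -list option, as is piped input (stdin).

URL input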

katana -u https://tesla.com

Multiple URL input (comma-separated)

katana -u https://tesla.com,https://google.com

List input

$ cat url_list.txt
https://tesla.com
https://google.com

katana -list url_list.txt

STDIN (piped) input

echo https://tesla.com | katana

cat domains | httpx | katana

A run looks like this:

katana -u https://youtube.com

   __        __                
  / /_____ _/ /____ ____  ___ _
 /  '_/ _  / __/ _  / _ \/ _  /
/_/\_\\_,_/\__/\_,_/_//_/\_,_/ v0.0.1                     

      projectdiscovery.io

[WRN] Use with caution. You are responsible for your actions.
[WRN] Developers assume no liability and are not responsible for any misuse or damage.
https://www.youtube.com/
https://www.youtube.com/about/
https://www.youtube.com/about/press/
https://www.youtube.com/about/copyright/
https://www.youtube.com/t/contact_us/
https://www.youtube.com/creators/
https://www.youtube.com/ads/
https://www.youtube.com/t/terms
https://www.youtube.com/t/privacy
https://www.youtube.com/about/policies/
https://www.youtube.com/howyoutubeworks?utm_campaign=ytgen&utm_source=ythp&utm_medium=LeftNav&utm_content=txt&u=https%3A%2F%2Fwww.youtube.com%2Fhowyoutubeworks%3Futm_source%3Dythp%26utm_medium%3DLeftNav%26utm_campaign%3Dytgen
https://www.youtube.com/new
https://m.youtube.com/
https://www.youtube.com/s/desktop/4965577f/jsbin/desktop_polymer.vflset/desktop_polymer.js
https://www.youtube.com/s/desktop/4965577f/cssbin/www-main-desktop-home-page-skeleton.css
https://www.youtube.com/s/desktop/4965577f/cssbin/www-onepick.css
https://www.youtube.com/s/_/ytmainappweb/_/ss/k=ytmainappweb.kevlar_base.0Zo5FUcPkCg.L.B1.O/am=gAE/d=0/rs=AGKMywG5nh5Qp-BGPbOaI1evhF5BVGRZGA
https://www.youtube.com/opensearch?locale=en_GB
https://www.youtube.com/manifest.webmanifest
https://www.youtube.com/s/desktop/4965577f/cssbin/www-main-desktop-watch-page-skeleton.css
https://www.youtube.com/s/desktop/4965577f/jsbin/web-animations-next-lite.min.vflset/web-animations-next-lite.min.js
https://www.youtube.com/s/desktop/4965577f/jsbin/custom-elements-es5-adapter.vflset/custom-elements-es5-adapter.js
https://www.youtube.com/s/desktop/4965577f/jsbin/webcomponents-sd.vflset/webcomponents-sd.js
https://www.youtube.com/s/desktop/4965577f/jsbin/intersection-observer.min.vflset/intersection-observer.min.js
https://www.youtube.com/s/desktop/4965577f/jsbin/scheduler.vflset/scheduler.js
https://www.youtube.com/s/desktop/4965577f/jsbin/www-i18n-constants-en_GB.vflset/www-i18n-constants.js
https://www.youtube.com/s/desktop/4965577f/jsbin/www-tampering.vflset/www-tampering.js
https://www.youtube.com/s/desktop/4965577f/jsbin/spf.vflset/spf.js
https://www.youtube.com/s/desktop/4965577f/jsbin/network.vflset/network.js
https://www.youtube.com/howyoutubeworks/
https://www.youtube.com/trends/
https://www.youtube.com/jobs/
https://www.youtube.com/kids/

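Crawling modes

Standard mode

Standard crawling mode uses the standard Go HTTP library to handle HTTP requests/responses. This mode is much faster because it has no browser overhead. However, it analyzes the HTTP response body as is, without any JavaScript or DOM rendering. It may therefore miss endpoints that appear only after DOM rendering, or asynchronous endpoint calls that complex web applications make depending on, for example, browser-specific events.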
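To make that limitation concrete, here is a minimal Python sketch of link extraction from a raw response body without any rendering. It is not katana's implementation (katana is written in Go), and the page content and the /api/v1/items path are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href/src attribute values from raw HTML, with no JS execution."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def static_links(html):
    """Return the links visible in the markup itself."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical page: one endpoint is only requested when a browser runs the script.
PAGE = """
<html><body>
<a href="/about/">About</a>
<script src="/static/app.js"></script>
<script>
  fetch("/api/v1/items");
</script>
</body></html>
"""

print(static_links(PAGE))  # ['/about/', '/static/app.js']
```

The /api/v1/items call never shows up, because it exists only as script text. A browser-backed crawl, or JavaScript parsing, would be needed to find it.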

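Headless mode

Headless mode hooks internal headless calls to handle HTTP requests/responses directly within the browser context. This offers two advantages:

The HTTP fingerprint (TLS and user agent) fully identifies the client as a legitimate browser.
Better coverage, since endpoints are discovered by analyzing both the standard raw response, as in the previous mode, and the browser-rendered response with JavaScript enabled.

Headless crawling is optional and can be enabled with the -headless option.

Here are the other headless CLI options: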

katana -h headless

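Flags:
HEADLESS:
   -hl, -headless                   enable headless hybrid crawling (experimental)
   -sc, -system-chrome              use the locally installed chrome browser instead of the one bundled with katana
   -sb, -show-browser               show the browser on the screen in headless mode
   -ho, -headless-options string[]  start headless chrome with additional options
   -nos, -no-sandbox                start headless chrome in no-sandbox mode

`-no-sandbox`

Runs the headless Chrome browser with the no-sandbox option, which is useful when running as the root user.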

katana -u https://tesla.com -headless -no-sandbox

`-headless-options`

When crawling in headless mode, additional Chrome options can be specified with -headless-options, for example:

katana -u https://tesla.com -headless -system-chrome -headless-options --disable-gpu,proxy-server=http://127.0.0.1:8080

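Scope control

Crawling can be endless if it is not scoped, so katana provides several ways to define the crawl scope.

`-field-scope`

The handiest option for defining scope with a predefined field name; rdn is the default field scope.

rdn: crawling scoped to the root domain name and all subdomains (default)
fqdn: crawling scoped to the given sub(domain)
dn: crawling scoped to the domain name keyword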

katana -u https://tesla.com -fs dn
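The three field scopes can be pictured with a rough Python sketch. This is one reading of the descriptions above, not katana's actual logic. In particular, the root domain is naively taken to be the last two host labels (wrong for suffixes such as co.uk), and dn is assumed to mean "the domain keyword appears anywhere in the host".

```python
from urllib.parse import urlparse


def root_domain(host):
    # Naive assumption: root domain = last two labels of the host.
    # A real tool would consult the Public Suffix List instead.
    return ".".join(host.split(".")[-2:])


def in_field_scope(url, target, mode="rdn"):
    """Decide whether url is in scope for target under a given field scope."""
    host = urlparse(url).hostname or ""
    target_host = urlparse(target).hostname or ""
    if mode == "fqdn":
        # Only the exact (sub)domain that was given.
        return host == target_host
    root = root_domain(target_host)
    if mode == "rdn":
        # The root domain and any of its subdomains.
        return host == root or host.endswith("." + root)
    if mode == "dn":
        # Assumed meaning: the keyword (e.g. "tesla") occurs in the host.
        return root.split(".")[0] in host
    raise ValueError(f"unknown field scope: {mode}")


target = "https://www.tesla.com"
print(in_field_scope("https://shop.tesla.com/cart", target, "rdn"))   # True
print(in_field_scope("https://shop.tesla.com/cart", target, "fqdn"))  # False
print(in_field_scope("https://tesla.cn/", target, "dn"))              # True
```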

`-crawl-scope`

For advanced scope control, the -cs option can be used, which comes with regex support.

katana -u https://tesla.com -cs login

For multiple in-scope rules, file input with multiline strings/regexes can be passed.

$ cat in_scope.txt

login/
admin/
app/
wordpress/

katana -u https://tesla.com -cs in_scope.txt

`-crawl-out-scope`

To define what not to crawl, the -cos option can be used; it also supports regex input.

katana -u https://tesla.com -cos logout

For multiple out-of-scope rules, file input with multiline strings/regexes can be passed.

$ cat out_of_scope.txt

/logout
/log_out

katana -u https://tesla.com -cos out_of_scope.txt
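As a rough illustration of how in-scope and out-of-scope rules could combine, the Python sketch below loads one regex per line and tests URLs against both lists. Letting an out-of-scope match win over an in-scope match is an assumption of this sketch, not documented katana behavior.

```python
import re


def load_rules(text):
    """Compile one pattern per non-empty line, as in the rule files above."""
    return [re.compile(line.strip()) for line in text.splitlines() if line.strip()]


def should_crawl(url, in_scope=(), out_scope=()):
    # Assumption: exclusion rules take precedence over inclusion rules.
    if any(p.search(url) for p in out_scope):
        return False
    # With no in-scope rules, everything not excluded is allowed.
    if in_scope:
        return any(p.search(url) for p in in_scope)
    return True


IN_SCOPE = load_rules("login/\nadmin/\napp/\nwordpress/")
OUT_OF_SCOPE = load_rules("/logout\n/log_out")

print(should_crawl("https://tesla.com/admin/users", IN_SCOPE, OUT_OF_SCOPE))   # True
print(should_crawl("https://tesla.com/admin/logout", IN_SCOPE, OUT_OF_SCOPE))  # False
print(should_crawl("https://tesla.com/blog/", IN_SCOPE, OUT_OF_SCOPE))         # False
```

Because the rules are unanchored regexes, a rule such as app/ also matches paths like /webapp/. Anchor the pattern if that is not wanted.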

`-no-scope`

By default katana scopes the crawl to *.domain. The -ns option disables this scope and can also be used to crawl the wider internet.

katana -u https://tesla.com -ns

`-display-out-scope`

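By default, when a scope option is used, it also applies to the links shown in the output, so external URLs are excluded. To override this behavior, the -do option can be used to display all external URLs and endpoints found in the target's in-scope URLs.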

katana -u https://tesla.com -do

Here are all the CLI options for scope control:

katana -h scope

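Flags:
SCOPE:
   -cs, -crawl-scope string[]       in scope url regex to be followed by crawler
   -cos, -crawl-out-scope string[]  out of scope url regex to be excluded by crawler
   -fs, -field-scope string         pre-defined scope field (dn,rdn,fqdn) (default "rdn")
   -ns, -no-scope                   disables host based default scope
   -do, -display-out-scope          display external endpoint from scoped crawling

Crawler configuration

Katana comes with multiple options to configure and control the crawl the way we want.

`-depth`

Option to define the depth to which URLs are followed for crawling. The greater the depth, the more endpoints are crawled and the longer the crawl takes.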

katana -u https://tesla.com -d 5
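The trade-off between depth and the number of endpoints can be seen in a small depth-limited breadth-first crawl over an in-memory link graph. This is a generic Python sketch, not katana's crawler. The SITE graph is invented, and counting the start URL as depth 0 is an assumption that may not match how katana counts depth.

```python
from collections import deque


def crawl(links, start, max_depth):
    """Breadth-first traversal that stops following links beyond max_depth."""
    seen = {start}
    queue = deque([(start, 0)])  # assumption: the start URL sits at depth 0
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links found on pages at the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


# Hypothetical site: each key links to the listed pages.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/a/1": ["/a/1/x"],
    "/b": [],
}

for d in (1, 2, 3):
    print(d, len(crawl(SITE, "/", d)))  # 1 3, then 2 4, then 3 5
```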

`-js-crawl`

Option to enable JavaScript file parsing and crawling of the endpoints discovered in JavaScript files; disabled by default.

katana -u https://tesla.com -jc

`-crawl-duration`

Option to predefine the crawl duration; disabled by default.

katana -u https://tesla.com -ct 2

`-known-files`

Option to enable crawling of the robots.txt and sitemap.xml files; disabled by default.

katana -u https://tesla.com -kf robotstxt,sitemapxml

`-automatic-form-fill`

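Option to enable automatic form filling for known/unknown fields. Known field values can be customized as needed by updating the form config file at $HOME/.config/katana/form-config.yaml.

Automatic form filling is an experimental feature.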

   -aff, -automatic-form-fill  enable optional automatic form filling (experimental)

More options can be configured when needed; here are all the configuration-related CLI options:

katana -h config

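Flags:
CONFIGURATION:
   -d, -depth int                maximum depth to crawl (default 2)
   -jc, -js-crawl                enable endpoint parsing / crawling in javascript file
   -ct, -crawl-duration int      maximum duration to crawl the target for
   -kf, -known-files string      enable crawling of known files (all,robotstxt,sitemapxml)
   -mrs, -max-response-size int  maximum response size to read (default 2097152)
   -timeout int                  time to wait for request in seconds (default 10)
   -aff, -automatic-form-fill    enable optional automatic form filling (experimental)
   -retry int                    number of times to retry the request (default 1)
   -proxy string                 http/socks5 proxy to use
   -H, -headers string[]         custom header/cookie to include in request
   -config string                path to the katana configuration file
   -fc, -form-config string      path to custom form configuration file

Filters

`-field`

Katana comes with built-in fields that can be used to filter the output for the desired information; the -f option can be used to specify any of the available fields.

   -f, -field string  field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)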

Here is a table with an example of each field and the expected output when it is used:

Field	Description	Example
`url`	URL endpoint	`https://admin.projectdiscovery.io/admin/login?user=admin&password=admin`
`qurl`	URL including query parameters	`https://admin.projectdiscovery.io/admin/login.php?user=admin&password=admin`
`qpath`	Path including query parameters	`/login?user=admin&password=admin`
`path`	URL path	`https://admin.projectdiscovery.io/admin/login`
`fqdn`	Fully qualified domain name	`admin.projectdiscovery.io`
`rdn`	Root domain name	`projectdiscovery.io`
`rurl`	Root URL	`https://admin.projectdiscovery.io`
`file`	Filename in the URL	`login.php`
`key`	Parameter keys in the URL	`user,password`
`value`	Parameter values in the URL	`admin,admin`
`kv`	Keys=values in the URL	`user=admin&password=admin`
`dir`	URL directory name	`/admin/`
`udir`	URL with directory	`https://admin.projectdiscovery.io/admin/`
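For readers who want to see how several of these fields relate to the parts of a URL, here is a small Python sketch built on urllib.parse. It only approximates the field meanings from the table and is not how katana computes them. Fields whose table examples are ambiguous (url, path, qpath, rdn) are deliberately left out.

```python
import posixpath
from urllib.parse import parse_qsl, urlparse


def url_fields(url):
    """Derive a subset of the table's fields from standard URL components."""
    p = urlparse(url)
    pairs = parse_qsl(p.query, keep_blank_values=True)
    root = f"{p.scheme}://{p.netloc}"
    # Directory part of the path, always ending in a slash.
    directory = posixpath.dirname(p.path).rstrip("/") + "/"
    return {
        "fqdn": p.hostname,
        "rurl": root,
        "file": posixpath.basename(p.path),
        "key": [k for k, _ in pairs],
        "value": [v for _, v in pairs],
        "kv": p.query,
        "dir": directory,
        "udir": root + directory,
    }


fields = url_fields(
    "https://admin.projectdiscovery.io/admin/login.php?user=admin&password=admin"
)
print(fields["file"])  # login.php
print(fields["dir"])   # /admin/
print(fields["udir"])  # https://admin.projectdiscovery.io/admin/
```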

Here is an example of using the field option to display only the URLs that contain query parameters:

katana -u https://tesla.com -f qurl -silent

https://shop.tesla.com/en_au?redirect=no
https://shop.tesla.com/en_nz?redirect=no
https://shop.tesla.com/product/men_s-raven-lightweight-zip-up-bomber-jacket?sku=1740250-00-A
https://shop.tesla.com/product/tesla-shop-gift-card?sku=1767247-00-A
https://shop.tesla.com/product/men_s-chill-crew-neck-sweatshirt?sku=1740176-00-A
https://www.tesla.com/about?redirect=no
https://www.tesla.com/about/legal?redirect=no
https://www.tesla.com/findus/list?redirect=no

`-store-field`

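To complement the field option, which is useful for filtering output at run time, there is the -sf, -store-field option. It works exactly like the field option except that, instead of filtering, it stores all the information on disk under the katana_output directory, sorted by target URL.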

katana -u https://tesla.com -sf key,fqdn,qurl -silent

$ ls katana_output/

https_www.tesla.com_fqdn.txt
https_www.tesla.com_key.txt
https_www.tesla.com_qurl.txt

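Note: the store-field option is handy for collecting information to build target-aware wordlists for the following, among others:

Most/commonly used parameters
Most/commonly used paths
Most/commonly used files
Related/unknown sub(domains)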
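One way to turn such stored output into a frequency-ranked wordlist is sketched below in Python. It assumes each stored file holds one value per line, which is an assumption about the file layout rather than something stated above.

```python
from collections import Counter


def build_wordlist(lines, top=None):
    """Rank non-empty lines by how often they occur, most frequent first."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return [word for word, _ in counts.most_common(top)]


# Sample lines standing in for the contents of a stored key file.
sample = ["sku\n", "redirect\n", "sku\n", "\n", "id\n"]
print(build_wordlist(sample))         # ['sku', 'redirect', 'id']
print(build_wordlist(sample, top=1))  # ['sku']
```

With real output, the function can be fed an open file handle, for example one opened on katana_output/https_www.tesla.com_key.txt from the listing above.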

`-extension-match`

Crawl output can easily be matched with the -em option to ensure that only output containing the given extensions is displayed.

katana -u https://tesla.com -silent -em js,jsp,json

`-extension-filter`

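Crawl output can easily be filtered for specific extensions with the -ef option, which ensures that all URLs containing the given extensions are removed.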

katana -u https://tesla.com -silent -ef css,txt,md
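The difference between matching and filtering by extension can be sketched in a few lines of Python. This is an illustration of the idea only. Taking the extension from the URL path and ignoring the query string is an assumption of this sketch, and the sample URLs are invented.

```python
import posixpath
from urllib.parse import urlparse


def extension(url):
    """Lower-cased extension of the URL path, without the dot ('' if none)."""
    return posixpath.splitext(urlparse(url).path)[1].lstrip(".").lower()


def filter_urls(urls, match=None, exclude=None):
    """Keep URLs whose extension is in match (if given) and not in exclude."""
    kept = []
    for url in urls:
        ext = extension(url)
        if match and ext not in match:
            continue
        if exclude and ext in exclude:
            continue
        kept.append(url)
    return kept


URLS = [
    "https://tesla.com/app.js",
    "https://tesla.com/style.css?v=2",
    "https://tesla.com/about",
]

print(filter_urls(URLS, match={"js", "jsp", "json"}))   # only app.js
print(filter_urls(URLS, exclude={"css", "txt", "md"}))  # app.js and about
```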

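Here are the other filter options:

FILTER:
   -f, -field string                field to display in output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
   -sf, -store-field string         field to store in per-host output (url,path,fqdn,rdn,rurl,qurl,qpath,file,key,value,kv,dir,udir)
   -em, -extension-match string[]   match output for given extension (eg, -em php,html,js)
   -ef, -extension-filter string[]  filter output for given extension (eg, -ef png,css)

Rate limiting and delay

It is easy to get blocked or banned while crawling if a target website's limits are not respected. katana comes with multiple options to tune the crawl to go as fast or as slow as we want.

`-delay`

Option to introduce a delay in seconds between each new request katana makes while crawling; disabled by default.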

katana -u https://tesla.com -delay 20

`-concurrency`

Option to control the number of URLs per target to fetch at the same time.

katana -u https://tesla.com -c 20

`-parallelism`

Option to define the number of targets to process at the same time from list input.

katana -u https://tesla.com -p 20

`-rate-limit`

Option to define the maximum number of requests that can go out per second.

katana -u https://tesla.com -rl 100

`-rate-limit-minute`

Option to define the maximum number of requests that can go out per minute.

katana -u https://tesla.com -rlm 500
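A quick back-of-the-envelope calculation helps when choosing between these limits. The Python sketch below estimates the minimum time a crawl of a given size needs under a per-second or per-minute cap. The figure of 3000 requests is an arbitrary example, and treating the stricter limit as the binding one when both are supplied is an assumption of the sketch, not a statement about katana.

```python
def min_crawl_seconds(requests, per_second=None, per_minute=None):
    """Lower bound on crawl time imposed by the given request-rate caps."""
    bounds = []
    if per_second:
        bounds.append(requests / per_second)
    if per_minute:
        bounds.append(requests * 60 / per_minute)
    if not bounds:
        raise ValueError("at least one rate limit is required")
    # Assumption: when both caps are given, the stricter one dominates.
    return max(bounds)


print(min_crawl_seconds(3000, per_second=100))  # 30.0
print(min_crawl_seconds(3000, per_minute=500))  # 360.0
```

In other words, -rlm 500 allows roughly 8 requests per second, far gentler than -rl 100.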

Here are all the long/short CLI options for rate limit control:

katana -h rate-limit

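Flags:
RATE-LIMIT:
   -c, -concurrency int          number of concurrent fetchers to use (default 10)
   -p, -parallelism int          number of concurrent inputs to process (default 10)
   -rd, -delay int               request delay between each request in seconds
   -rl, -rate-limit int          maximum requests to send per second (default 150)
   -rlm, -rate-limit-minute int  maximum number of requests to send per minute

Output

`-json`

Katana supports file output in plain-text format as well as JSON, which includes additional information such as the source, tag, and attribute name to correlate the discovered endpoint.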

katana -u https://example.com -json -do | jq .

{
  "timestamp": "2022-11-05T22:33:27.745815+05:30",
  "endpoint": "https://www.iana.org/domains/example",
  "source": "https://example.com",
  "tag": "a",
  "attribute": "href"
}
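Since the JSON output is written one record per line, it is straightforward to post-process. The Python sketch below groups discovered endpoints by the page they were found on, using only the field names visible in the sample record above. Any other fields that katana may emit are not assumed.

```python
import json
from collections import defaultdict


def endpoints_by_source(lines):
    """Map each source URL to the list of endpoints discovered on it."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        grouped[record["source"]].append(record["endpoint"])
    return dict(grouped)


# The sample record from above, collapsed onto a single line.
SAMPLE = (
    '{"timestamp": "2022-11-05T22:33:27.745815+05:30", '
    '"endpoint": "https://www.iana.org/domains/example", '
    '"source": "https://example.com", "tag": "a", "attribute": "href"}'
)

print(endpoints_by_source([SAMPLE]))
```

To use it on live output, the function can be given sys.stdin while katana's -json output is piped into the script.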

Here are the other CLI options related to output:

katana -h output

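OUTPUT:
   -o, -output string  file to write output to
   -j, -json           write output in JSONL(ines) format
   -nc, -no-color      disable output content coloring (ANSI escape codes)
   -silent             display output only
   -v, -verbose        display verbose output
   -version            display project version

Testing on Kali

[1] Install Go

apt install golang-go

[2] Download katana_0.0.2_linux_amd64.zip directly, or install with the command below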

go install github.com/projectdiscovery/katana/cmd/katana@latest

[3] Grant execute permission

chmod +x katana

[4] Run

./katana

[5] Try it out on hackerone

./katana -u https://www.hackerone.com -d 5 -jc -kf all

Work out the best strategy for your own needs.

Project page:

GitHub:
https://github.com/projectdiscovery/katana

Please credit the source and include a link when reposting.
