
Last Update: March 27, 2025
By eric
Unleash the Power of Puppeteer: A Deep Dive into the tyo-crawler Web Crawler
The internet is a vast ocean of data, and web scraping is your trusty ship for navigating it. Web scraping has become an essential tool for extracting valuable information from the internet, enabling businesses and developers to harness data for various applications. Particularly in the age of AI, where data is the new oil, the ability to scrape and analyze web content is more important than ever. Whether it's for training chatbots, enhancing search capabilities, or developing AI-driven insights, high-quality data is essential. One of the key applications of web scraping is in building Retrieval-Augmented Generation (RAG) systems, where structured or unstructured data serves as the foundation for an intelligent knowledge base that powers AI models.
While many similar tools exist, tyo-crawler serves as a powerful alternative: a Node.js web crawler built on the shoulders of Puppeteer, designed to simplify the process and unlock the potential of web data extraction. This tutorial will guide you through its features and capabilities, helping you harness the full power of web scraping with ease.
What is tyo-crawler?
tyo-crawler is just another web crawler that leverages the capabilities of Puppeteer, a Node library that provides a high-level API for controlling headless Chrome or Chromium. This allows for seamless interaction with websites, handling JavaScript, and navigating complex page structures, something many simpler scraping tools struggle with. It also integrates with Redis for efficient caching, preventing duplicate crawls and optimizing performance.
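If you have not used Puppeteer before, the following minimal sketch (illustrative only, not code from tyo-crawler) shows what it does under the hood: launch headless Chrome, load a page, and read the fully rendered HTML after client-side JavaScript has executed.

```javascript
// Minimal Puppeteer sketch: fetch the rendered HTML of a page in headless Chrome.
// Illustrative only; this is not tyo-crawler's own code.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://www.example.com', { waitUntil: 'networkidle2' });
  const html = await page.content(); // HTML after client-side JavaScript has run
  console.log(`Fetched ${html.length} characters of rendered HTML`);
  await browser.close();
})();
```

tyo-crawler builds on exactly this kind of browser session, adding crawling logic, Redis-backed deduplication, and the configuration options described below.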
Key Features:
- Headless Browser Power: Runs in headless mode by default (no visible browser window), making it ideal for automated tasks. You can, however, enable a visible browser window for debugging or for watching the scraping process in real time.
- Redis Integration: Uses Redis to store and manage crawled URLs, significantly improving efficiency by avoiding redundant crawls. This is crucial for large-scale scraping projects.
- Highly Configurable: Offers a wide array of options to customize the crawling process, including:
  - Depth Control: Specify the maximum number of levels to crawl.
  - URL Filtering: Use regular expressions to include or exclude specific URLs.
  - Wait Times: Adjust wait times between page requests to avoid overloading servers.
  - Cookie Handling: Manage cookies for authenticated access to websites.
  - Local Storage Interaction: Interact with a website's local storage.
  - Action Execution: Perform custom actions on pages using a JSON configuration file (more on this below).
  - Website Cloning: Download entire websites or sections of websites.
  - Curl Integration: For faster downloads of static content, you can opt to use curl.
  - View Only Mode: Inspect pages without downloading content.
- Action Execution: This is a standout feature. You can define an actions.json file to automate complex interactions with websites. This allows you to log in, fill out forms, click buttons, and more, all within the crawling process.
Getting Started: Installation and Setup
- Clone the Repository:
  git clone https://github.com/e-tang/tyo-crawler.git
  cd tyo-crawler
- Install Dependencies:
  npm install
- Redis Setup: tyo-crawler relies on Redis for caching, so you'll need Redis installed and running. The README provides instructions for installing it with apt-get (for Ubuntu) or conveniently via Docker Compose; a minimal setup sketch is shown below.
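For reference, the following commands are one common way to get a Redis instance running locally. They are generic Redis setup steps, not copied from the tyo-crawler README, so defer to the README if it differs.

```bash
# Ubuntu/Debian: install Redis and start it as a service
sudo apt-get install redis-server
sudo systemctl start redis-server

# Alternatively, run Redis in a container (assumes Docker is available)
docker run -d --name redis -p 6379:6379 redis
```

By default Redis listens on port 6379, which is what most Node.js Redis clients expect out of the box.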
Usage and Configuration
The crawler is run from the command line with node index.js. A multitude of options are available, all clearly documented in the README. Here are a few examples:
Basic Crawl:
node index.js https://www.example.com
This crawls everything reachable from https://www.example.com and will not stop until no more links are found.
Clone a Website:
node index.js --clone true https://www.example.com
This downloads the entire website, including all assets on the same domain as www.example.com, to your local machine. Note that if you want to follow links outside www.example.com, you need to provide extra parameters:
node index.js --clone true --pattern "trove.example.com" --pattern "gold.example2.com" https://www.example.com
Crawl with Specific Options:
node index.js --show-window true --level 2 https://www.example.com
This crawls example.com to a depth of 2 levels, showing the browser window.
Using actions.json for Automated Interactions:
This is where the real power shines. Create an actions.json file (see the example provided in the repository) to define actions. Then run:
node index.js --actions-file actions.json https://www.example.com/login
This executes the actions defined in actions.json on the specified URL. The example actions.json shows how to handle login forms and other dynamic elements.
Understanding actions.json
The actions.json file is a crucial part of tyo-crawler. It's a JSON array where each object represents a conditional action. Each action object typically has:
- if: A CSS selector. If an element matching this selector is found, the actions in the then array are executed.
- then: An array of actions to perform. Each action specifies an action type (e.g., type, click, eval), a selector to target, and a value (if needed).
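To make this concrete, here is a rough sketch of what such a file could look like for a login page. The selectors, credentials, and even the property names used for the action type, selector, and value are assumptions based on the description above, so treat the example file shipped in the repository as the authoritative reference.

```json
[
  {
    "if": "#login-form",
    "then": [
      { "action": "type",  "selector": "#username", "value": "your-username" },
      { "action": "type",  "selector": "#password", "value": "your-password" },
      { "action": "click", "selector": "#login-form button[type='submit']" }
    ]
  }
]
```

Again, the exact key names may differ in practice; check the sample actions.json in the repository before relying on this shape.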
Conclusion
tyo-crawler is a versatile and powerful tool for web scraping. Its combination of Puppeteer's browser automation capabilities, Redis caching, and the flexible actions.json configuration makes it a highly effective solution for a wide range of web data extraction tasks. The detailed README and the examples provided make it relatively easy to get started, even for those new to web scraping. Give it a try and unlock the potential of web data!