
Last Update: March 27, 2025


By Eric



Unleash the Power of Puppeteer: A Deep Dive into the tyo-crawler Web Crawler

The internet is a vast ocean of data, and web scraping is your trusty ship to navigate it. It has become an essential tool for extracting valuable information from the web, enabling businesses and developers to harness data for a wide range of applications. Particularly in the age of AI, where data is the new oil, the ability to scrape and analyze web content is more important than ever. Whether it's for training chatbots, enhancing search capabilities, or developing AI-driven insights, high-quality data is essential. One of the key applications of web scraping is building Retrieval-Augmented Generation (RAG) systems, where structured or unstructured data serves as the foundation for an intelligent knowledge base that powers AI models.

While many similar tools exist, tyo-crawler serves as a powerful alternative: a Node.js web crawler built on the shoulders of Puppeteer, designed to simplify web data extraction and unlock its full potential. This tutorial will guide you through its features and capabilities, helping you harness the power of web scraping with ease.

What is tyo-crawler?

tyo-crawler is a web crawler that leverages the capabilities of Puppeteer, a Node.js library that provides a high-level API for controlling headless Chrome or Chromium. This allows it to interact seamlessly with websites, execute JavaScript, and navigate complex page structures, something many simpler scraping tools struggle with. It also integrates with Redis for efficient caching, preventing duplicate crawls and optimizing performance.
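
To get a feel for what Puppeteer brings to the table, here is a minimal, self-contained sketch of the kind of page interaction tyo-crawler automates. This is plain Puppeteer, not tyo-crawler's internal code, and the URL is just a placeholder:

    const puppeteer = require('puppeteer');

    (async () => {
      // Launch a headless Chromium instance (tyo-crawler's default mode)
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();

      // Navigate and wait for network activity to settle so that
      // JavaScript-rendered content is available
      await page.goto('https://www.example.com', { waitUntil: 'networkidle2' });

      // Collect every link rendered on the page, including ones added by scripts
      const links = await page.$$eval('a', anchors => anchors.map(a => a.href));
      console.log(links);

      await browser.close();
    })();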

Key Features:

  • Headless Browser Power: Runs in headless mode by default (no visible browser window), making it ideal for automated tasks. You can, however, enable a visible browser window for debugging purposes or to watch the scraping process in real time.

  • Redis Integration: Uses Redis to store and manage crawled URLs, significantly improving efficiency by avoiding redundant crawls. This is crucial for large-scale scraping projects.

  • Highly Configurable: Offers a wide array of options to customize the crawling process, including:

    • Depth Control: Specify the maximum number of levels to crawl.
    • URL Filtering: Use regular expressions to include or exclude specific URLs.
    • Wait Times: Adjust wait times between page requests to avoid overloading servers.
    • Cookie Handling: Manage cookies for authenticated access to websites.
    • Local Storage Interaction: Interact with a website's local storage.
    • Action Execution: Perform custom actions on pages using a JSON configuration file (more on this below).
    • Website Cloning: Download entire websites or sections of websites.
    • Curl Integration: For faster downloads of static content, you can opt to use curl.
    • View Only Mode: Inspect pages without downloading content.
  • Action Execution: This is a standout feature. You can define an actions.json file to automate complex interactions with websites. This allows you to log in, fill out forms, click buttons, and more, all within the crawling process.

Getting Started: Installation and Setup

  1. Clone the Repository:

    git clone https://github.com/e-tang/tyo-crawler.git
    cd tyo-crawler
    
  2. Install Dependencies:

    npm install
    
  3. Redis Setup: tyo-crawler relies on Redis for caching, so you'll need to have Redis installed and running. The README provides instructions for installation using apt-get (for Ubuntu) or, more conveniently, using Docker Compose.
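
For reference, either of the following typically gets a local Redis instance running on the default port 6379 (the README's own instructions, such as its Docker Compose file, may differ slightly):

    # On Ubuntu/Debian
    sudo apt-get install redis-server

    # Or with Docker, exposing the default port
    docker run -d --name redis -p 6379:6379 redis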

Usage and Configuration

The crawler is run from the command line using node index.js. A multitude of options are available, all clearly documented in the README. Here are a few examples:

Basic Crawl:

node index.js https://www.example.com

This will crawl everything reachable from https://www.example.com and will not stop until no new links are found.

Clone a Website:

node index.js --clone true https://www.example.com

This will download the entire website, including all assets, to your local machine, following any link that shares the www.example.com domain. Note that if you want to include links from domains other than www.example.com, you need to provide extra parameters:

node index.js --clone true --pattern "trove.example.com" --pattern "gold.example2.com" https://www.example.com

Crawl with Specific Options:

node index.js --show-window true --level 2 https://www.example.com

This crawls www.example.com to a depth of two levels while showing the browser window.

Using actions.json for Automated Interactions:

This is where the real power shines. Create an actions.json file (see the example provided in the repository) to define your actions, then run:

node index.js --actions-file actions.json https://www.example.com/login

This will execute the actions defined in actions.json on the specified URL. The example actions.json shows how to handle login forms and other dynamic elements.

Understanding actions.json

The actions.json file is a crucial part of tyo-crawler. It's a JSON array where each object represents a conditional action. Each action object typically has:

  • if: A CSS selector. If an element matching this selector is found, the actions in the then array are executed.
  • then: An array of actions to perform. Each action specifies an action type (e.g., type, click, eval), a selector to target, and a value (if needed).
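
As a rough illustration, an actions.json that fills in a login form might look something like the following. The if/then structure matches the description above, but the exact key names inside each action (action, selector, value) and the CSS selectors are assumptions for illustration; consult the example file in the repository for the authoritative format:

    [
      {
        "if": "#login-form",
        "then": [
          { "action": "type",  "selector": "#username",  "value": "your-username" },
          { "action": "type",  "selector": "#password",  "value": "your-password" },
          { "action": "click", "selector": "#login-button" }
        ]
      }
    ]

When the crawler visits a page containing an element that matches #login-form, it performs the listed actions in order, which is how the login scenario mentioned earlier works.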

Conclusion

tyo-crawler is a versatile and powerful tool for web scraping. Its combination of Puppeteer's browser automation capabilities, Redis caching, and the flexible actions.json configuration makes it a highly effective solution for a wide range of web data extraction tasks. The detailed README and the examples provided make it relatively easy to get started, even for those new to web scraping. Give it a try and unlock the potential of web data!
