Web Scraping - 22 Open Source Projects

Posted on: March 16, 2024

Python

fhamborg/news-please: An integrated web crawler and information extractor for news that just works.

codelucas/newspaper: News, full-text, and article metadata extraction in Python 3.

scrapy/scrapyd: A service daemon to run Scrapy spiders.

(Proxy UI + Scrape)

imWildCat/scylla: Intelligent proxy pool for Humans™ to extract content from the internet and build your own Large Language Models in this new AI era.

elliotgao2/toapi: Every web site provides APIs.

s0md3v/Photon: Incredibly fast crawler designed for OSINT.

DormyMo/SpiderKeeper: Admin UI for Scrapy, an open source Scrapinghub.

Gerapy/Gerapy: Distributed crawler management framework based on Scrapy, Scrapyd, Django and Vue.js.

my8100/scrapydweb: Web app for Scrapyd cluster management, Scrapy log analysis and visualization, auto packaging, timer tasks, monitoring and alerts, and a mobile UI.
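Gerapy and scrapydweb both build on Scrapyd, which exposes a small JSON API over HTTP. As a rough illustration of what such tools do under the hood, here is a minimal sketch that schedules a spider run through Scrapyd's `schedule.json` endpoint using only the standard library. The helper names `build_schedule_request` and `schedule_spider`, and the project and spider names, are assumptions made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(base_url: str, project: str, spider: str) -> Request:
    """Build (but do not send) a POST request for Scrapyd's schedule.json."""
    # Scrapyd expects form-encoded "project" and "spider" parameters.
    body = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(f"{base_url.rstrip('/')}/schedule.json", data=body, method="POST")


def schedule_spider(base_url: str, project: str, spider: str) -> dict:
    """Send the request and return Scrapyd's JSON reply as a dict."""
    request = build_schedule_request(base_url, project, spider)
    with urlopen(request, timeout=10) as response:
        return json.load(response)
```

Assuming a Scrapyd instance on its default port 6800 with a deployed project named `myproject` (a hypothetical name), `schedule_spider("http://localhost:6800", "myproject", "somespider")` would return Scrapyd's reply, which includes a status field and a job id on success.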

Golang

crawlab-team/crawlab: Distributed web crawler admin platform for managing spiders regardless of language or framework.

crawlab-team/crawlab-sdk: SDK for Crawlab, including SDKs for programming languages such as Python, Node.js and Java, and a CLI tool written in Python.

gocolly/colly: Elegant scraper and crawler framework for Golang.

MontFerret/ferret: Declarative web scraping.

slotix/dataflowkit: Extract structured data from web sites; web site scraping.

philippta/flyscrape: A standalone and scriptable web scraper in Go.

hakluke/hakrawler: Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application.

hakluke/hakcheckurl: Takes a list of URLs and returns their HTTP response codes.
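hakcheckurl itself is a Go command-line tool. The idea it describes (a list of URLs in, status codes out) can be sketched in Python with the standard library alone. This is not a port of the tool: the function names `fetch_status` and `check_urls`, and the convention of reporting 0 when no HTTP response arrives, are assumptions made for this example.

```python
from typing import Callable, Iterable, List, Tuple
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        # urlopen follows redirects, so this is the status of the final response.
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # 4xx and 5xx responses raise HTTPError but still carry a status code.
        return err.code
    except (URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return 0


def check_urls(
    urls: Iterable[str], fetch: Callable[[str], int] = fetch_status
) -> List[Tuple[str, int]]:
    """Pair each URL with its status code, preserving input order."""
    return [(url, fetch(url)) for url in urls]
```

The `fetch` parameter is injectable so the pairing logic can be exercised without network access; in normal use, `check_urls(["https://example.com/"])` would perform a real request.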

geziyor/geziyor: Geziyor, blazing fast web crawling and scraping framework for Go. Supports JS rendering.

Rust

spider-rs/spider: The fastest web crawler written in Rust.

JavaScript

apify/crawlee: Crawlee, a web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast.

bda-research/node-crawler: Web crawler/spider for Node.js + server-side jQuery ;-)

lewisdonovan/google-news-scraper: Lightweight scraper for Google News.

macsplit/urltomarkdown: Web service for web page to Markdown conversion.