Write Your Crawler Well and You'll Be Eating Prison Food Early: the LLM-Powered ScrapeGraphAI for High-Quality Scraping

智能科技扫地僧 2024-06-11 14:05:07
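ScrapeGraphAI is a Python library that uses large language models (LLMs) and direct graph logic to create scraping pipelines for websites, documents, and XML files. Its defining feature is that you only describe the information you want to extract, and the library carries out the scraping for you.

Installing ScrapeGraphAI
Install ScrapeGraphAI with pip:
    pip install scrapegraphai
Install Playwright, which is used for JavaScript-based scraping:
    playwright install
It is recommended to install the library in a virtual environment to avoid conflicts with other libraries.

Using ScrapeGraphAI
ScrapeGraphAI provides three main scraping pipelines:
SmartScraperGraph: a single-page scraper that needs only a user prompt and an input source.
SearchGraph: a multi-page scraper that extracts information from the top n results of a search engine.
SpeechGraph: a single-page scraper that extracts information from a website and generates an audio file.

Example use cases
SmartScraperGraph with a local model: make sure Ollama is installed and download the model with the ollama pull command. The example code shows how to create a SmartScraperGraph instance and run it to get a list of projects and their descriptions.
SearchGraph with mixed models: Groq serves as the LLM and Ollama as the embedding model. The example code shows how to create a SearchGraph instance and run it to get a list of traditional recipes from Chioggia.
SpeechGraph with OpenAI: only the OpenAI API key and the model name need to be passed. The example code shows how to create a SpeechGraph instance and run it to generate an audio file summarizing the projects.

Example output
The SmartScraperGraph output is a list of projects and their descriptions. The SearchGraph output is a list of recipes. The SpeechGraph output is an audio file summarizing the projects on the page.

Notes
Set up an OpenAI API key before running the OpenAI-based examples. The documentation and reference pages are available on ScrapeGraphAI's official pages. By simplifying the scraping process, ScrapeGraphAI lets users extract the information they need from a website without understanding the page structure in depth or writing complex scraping logic. That makes it a very useful tool for anyone who needs to collect data from multiple sources.

Quick install
The reference page for Scrapegraph-ai is available on PyPI.
    pip install scrapegraphai
You will also need to install Playwright for JavaScript-based scraping:
    playwright install
Note: it is recommended to install the library in a virtual environment to avoid conflicts with other libraries.

Demo
Follow the procedure on the following link to set up your OpenAI API key: link.

Documentation
The documentation for ScrapeGraphAI can be found here. Also check out the Docusaurus documentation.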
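
Since the OpenAI-based examples need an API key, one common way to avoid hard-coding it is to read it from an environment variable. The following is only a minimal sketch: the helper name build_openai_config and the variable name OPENAI_API_KEY are assumptions made for illustration and are not part of ScrapeGraphAI. The returned dictionary mirrors the "llm" layout ("api_key" plus "model") used in the OpenAI example in this article.

```python
import os


def build_openai_config(model="gpt-3.5-turbo", env=None):
    """Return an "llm" config dict with the API key read from the environment.

    Hypothetical helper: neither this function nor the OPENAI_API_KEY
    variable name comes from ScrapeGraphAI; both are illustrative.
    """
    # Allow a plain dict to be passed in, which makes the helper easy to test.
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError("OPENAI_API_KEY is not set")
    return {"llm": {"api_key": api_key, "model": model}}
```

The resulting dictionary could then be extended with further keys (for example "verbose") before being passed as config to one of the graph classes.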
Usage
There are three main scraping pipelines that can be used to extract information from a website (or local file):
SmartScraperGraph: single-page scraper that only needs a user prompt and an input source;
SearchGraph: multi-page scraper that extracts information from the top n search results of a search engine;
SpeechGraph: single-page scraper that extracts information from a website and generates an audio file.
It is possible to use different LLMs through APIs, such as OpenAI, Groq, Azure and Gemini, or local models using Ollama.

Case 1: SmartScraper using Local Models
Remember to have Ollama installed and download the models using the ollama pull command.

    from scrapegraphai.graphs import SmartScraperGraph

    graph_config = {
        "llm": {
            "model": "ollama/mistral",
            "temperature": 0,
            "format": "json",  # Ollama needs the format to be specified explicitly
            "base_url": "http://localhost:11434",  # set Ollama URL
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
            "base_url": "http://localhost:11434",  # set Ollama URL
        },
        "verbose": True,
    }

    smart_scraper_graph = SmartScraperGraph(
        prompt="List me all the projects with their descriptions",
        # also accepts a string with the already downloaded HTML code
        source="https://perinim.github.io/projects",
        config=graph_config
    )

    result = smart_scraper_graph.run()
    print(result)

The output will be a list of projects with their descriptions like the following:

    {'projects': [{'title': 'Rotary Pendulum RL', 'description': 'Open Source project aimed at controlling a real life rotary pendulum using RL algorithms'}, {'title': 'DQN Implementation from scratch', 'description': 'Developed a Deep Q-Network algorithm to train a simple and double pendulum'}, ...]}

Case 2: SearchGraph using Mixed Models
We use Groq for the LLM and Ollama for the embeddings.
    from scrapegraphai.graphs import SearchGraph

    # Define the configuration for the graph
    graph_config = {
        "llm": {
            "model": "groq/gemma-7b-it",
            "api_key": "GROQ_API_KEY",
            "temperature": 0
        },
        "embeddings": {
            "model": "ollama/nomic-embed-text",
            "base_url": "http://localhost:11434",  # set ollama URL arbitrarily
        },
        "max_results": 5,
    }

    # Create the SearchGraph instance
    search_graph = SearchGraph(
        prompt="List me all the traditional recipes from Chioggia",
        config=graph_config
    )

    # Run the graph
    result = search_graph.run()
    print(result)

The output will be a list of recipes like the following:

    {'recipes': [{'name': 'Sarde in Saòre'}, {'name': 'Bigoli in salsa'}, {'name': 'Seppie in umido'}, {'name': 'Moleche frite'}, {'name': 'Risotto alla pescatora'}, {'name': 'Broeto'}, {'name': 'Bibarasse in Cassopipa'}, {'name': 'Risi e bisi'}, {'name': 'Smegiassa Ciosota'}]}

Case 3: SpeechGraph using OpenAI
You just need to pass the OpenAI API key and the model name.

    from scrapegraphai.graphs import SpeechGraph

    graph_config = {
        "llm": {
            "api_key": "OPENAI_API_KEY",
            "model": "gpt-3.5-turbo",
        },
        "tts_model": {
            "api_key": "OPENAI_API_KEY",
            "model": "tts-1",
            "voice": "alloy"
        },
        "output_path": "audio_summary.mp3",
    }

    # Create the SpeechGraph instance and run it
    speech_graph = SpeechGraph(
        prompt="Make a detailed audio summary of the projects.",
        source="https://perinim.github.io/projects/",
        config=graph_config,
    )

    result = speech_graph.run()
    print(result)

The output will be an audio file with the summary of the projects on the page.
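
The dictionary results printed in Case 1 and Case 2 can be post-processed with ordinary Python. The sketch below is an illustration only: projects_to_rows and save_result are hypothetical helper names, and the keys 'projects', 'title', and 'description' are taken from the sample output above. Because the structure is produced by the LLM, it may differ between runs, so missing keys are tolerated.

```python
import json


def projects_to_rows(result):
    """Flatten a {'projects': [...]} result into (title, description) tuples.

    Assumes the shape shown in the sample output; missing keys fall back
    to empty strings rather than raising.
    """
    rows = []
    for item in result.get("projects", []):
        rows.append((item.get("title", ""), item.get("description", "")))
    return rows


def save_result(result, path):
    """Write a result dict to a UTF-8 JSON file.

    ensure_ascii=False keeps names such as 'Sarde in Saòre' readable
    instead of escaping the accented character.
    """
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(result, fh, ensure_ascii=False, indent=2)
```

For example, save_result(result, "recipes.json") would store the SearchGraph result for later use; the file name is arbitrary.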