WebScraping for RAG - Pragmatic solution (html2text)

Apr 11 (edited) in 💬 General


-- EDIT: A week after posting the message below, I found a solution: html2text, a Python package that works surprisingly well. --
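
For anyone who wants to try it, here is a minimal sketch of how html2text can be called. The function name html_to_markdown and the chosen option values (dropping links and images, no line wrapping) are assumptions for illustration, not settings taken from this post. html2text is a third-party package (pip install html2text) and it emits Markdown-flavoured text rather than bare plain text.

```python
def html_to_markdown(html: str) -> str:
    """Convert an HTML string to Markdown-style text with html2text.

    Hypothetical helper; the option values below are illustrative choices.
    """
    import html2text  # third-party: pip install html2text

    converter = html2text.HTML2Text()
    converter.ignore_links = True    # keep the link text, drop the URLs
    converter.ignore_images = True   # skip image placeholders
    converter.body_width = 0         # 0 disables hard line wrapping
    return converter.handle(html)


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>Hello <a href='https://example.com'>world</a></p>"
    print(html_to_markdown(sample))
```

Disabling line wrapping is probably useful for RAG, because hard-wrapped lines can otherwise split sentences when the text is chunked later.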

-- EDIT2: A bit later again, I learned about the Python library "unstructured", which does exactly what I am aiming for: it takes unstructured data formats and parses them into information usable for RAG. --

--- Original question (already solved) ---

Hello everyone,

I can imagine most of you have already given web scraping a try. In Python this typically means using either beautifulsoup / selectolax for static pages or selenium / playwright for dynamic pages.

This is all fine if you have a specific website and you adjust your code to get exactly what you want from that site.

But what if your goal is much broader and you just want to get the text from arbitrary Guardian, Financial Times and Wall Street Journal articles (or any other information source), for example to include them in a RAG system?

Assuming you have access, all you need is a conversion from HTML to plain text. This is a much broader and seemingly much simpler task.
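
To make the task concrete, here is a rough baseline sketch using only Python's standard library (html.parser). The names TextExtractor and naive_html_to_text are made up for illustration, and this is not how any of the libraries mentioned in this thread work internally. It drops script and style content and puts block elements on separate lines, but it has no notion of "main article content".

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "br", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")  # start block elements on a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def naive_html_to_text(html: str) -> str:
    """Return the text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    page = (
        "<html><head><style>p{color:red}</style></head><body>"
        "<nav>Home | Login</nav><h1>Title</h1><p>First paragraph.</p>"
        "</body></html>"
    )
    print(naive_html_to_text(page))
```

Note that the navigation text ("Home | Login") survives in the output. Separating the article body from menus, cookie banners and related-article teasers is the part a simple tag stripper cannot do.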

I have found out that it is unfortunately not easy at all. There are some browser extensions, but we want to use Python here. I have tried the readability library as well as the newspaper library; the results for both are rather poor.

Here is an overview article:

https://ujeebu.com/blog/how-to-extract-clean-text-from-html/

Ujeebu (a SaaS service) works rather well, even though some text is still omitted. It is still much better than the other libraries.

How are you guys approaching this topic?