html2llm

This project is an experiment in converting an HTML website into a format that large language models (LLMs) can understand. The output can be used for tasks such as website navigation or content reading. The project builds on parts of Microsoft's OmniParser release and runs in the browser using WebAssembly. Surprisingly, it performs quite efficiently: inference takes less than 300ms on my Mac M1.

Demos:

⭕ OmniParser WebAssembly - a demo of YOLOv8 icon detection using WebAssembly
📺 App Website - a demo of detecting UI elements by combining YOLOv8 with DOM tree traversal

🚧 Idea

The OmniParser released by Microsoft operates in three steps:

OCR -> Icon Detection -> Icon/Box Captioning

This approach enables control over almost any interface, but it comes at a significant computational cost. The final step is the most resource-intensive part of the pipeline: icon detection requires 6.1MB of weights, while icon captioning demands 1GB.

Interestingly, in a browser environment, the first and last steps can be skipped, because we can traverse the DOM tree to extract this information directly. The second step, which uses YOLOv8, performs surprisingly well in the browser thanks to WebAssembly.

From the universal approach, we derived the following process:

Screenshot Capturing -> Icon Detection (OmniParser WebAssembly) -> Icon/Box Captioning via Traversing DOM Tree
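As a rough illustration, the three stages can be wired together as plain functions. This is only a sketch of the data flow: `PipelineStages`, `runPipeline`, and the `Box` shape are hypothetical names chosen for this example, not the API of html2llm or OmniParser.

```typescript
// Hypothetical types for illustration only; not html2llm's actual API.
interface Box { x: number; y: number; width: number; height: number; }
interface DescribedBox { box: Box; description: string; }

interface PipelineStages {
  // Stage 1: produce an encoded screenshot of the visible page.
  captureScreenshot: () => Promise<Uint8Array>;
  // Stage 2: run icon detection on the screenshot and return bounding boxes.
  detectIcons: (screenshot: Uint8Array) => Promise<Box[]>;
  // Stage 3: describe a box using the DOM; null when nothing useful is found.
  describeBox: (box: Box) => Promise<string | null>;
}

async function runPipeline(stages: PipelineStages): Promise<DescribedBox[]> {
  const screenshot = await stages.captureScreenshot();
  const boxes = await stages.detectIcons(screenshot);
  const results: DescribedBox[] = [];
  for (const box of boxes) {
    const description = await stages.describeBox(box);
    // Boxes that cannot be resolved to anything in the DOM are dropped.
    if (description !== null) {
      results.push({ box, description });
    }
  }
  return results;
}
```

Passing the stages in as functions keeps the sketch independent of how the screenshot is taken or which detector is used.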

Now we have two problems:

How to capture a screenshot of the website (captureVisibleTab via a browser extension, getScreenshotAs via Selenium, etc.).
How to resolve the detected bounding boxes into useful information. This is definitely not trivial; in this project it is handled by the element extractor.
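To give a feel for the second problem, here is a minimal sketch of one possible strategy: pick the DOM element whose rectangle overlaps a detected box the most, measured by intersection over union (IoU). This is not the project's element extractor. `DomCandidate`, `resolveBox`, and the 0.5 threshold are assumptions made for this example, and the boxes are assumed to be already expressed in the same coordinate space as the element rectangles (for instance, CSS pixels as returned by `getBoundingClientRect()`).

```typescript
// Hypothetical types for illustration only; not html2llm's actual API.
interface Rect { x: number; y: number; width: number; height: number; }
interface DomCandidate { rect: Rect; label: string; }

// Intersection over union of two axis-aligned rectangles, in the range [0, 1].
function iou(a: Rect, b: Rect): number {
  const left = Math.max(a.x, b.x);
  const top = Math.max(a.y, b.y);
  const right = Math.min(a.x + a.width, b.x + b.width);
  const bottom = Math.min(a.y + a.height, b.y + b.height);
  const intersection = Math.max(0, right - left) * Math.max(0, bottom - top);
  const union = a.width * a.height + b.width * b.height - intersection;
  return union > 0 ? intersection / union : 0;
}

// Returns the candidate that overlaps the box best, or null when no
// candidate reaches the minimum IoU (assumed threshold: 0.5).
function resolveBox(box: Rect, candidates: DomCandidate[], minIou = 0.5): DomCandidate | null {
  let best: DomCandidate | null = null;
  let bestScore = minIou;
  for (const candidate of candidates) {
    const score = iou(box, candidate.rect);
    if (score >= bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

In a browser, the candidates could be built by walking the DOM and reading each element's `getBoundingClientRect()` together with a label such as its `aria-label` or text content. A real extractor would also have to deal with nested elements, hidden elements, and the device pixel ratio of the screenshot.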

This project is at a very early stage.

🚀 How to Run on Any Page?

You can do this with the Playwright App Demo.

Clone the repository.
Install all dependencies: pnpm install.
Run cd demos/playwright-app.
Run pnpm start <URL>. For example, pnpm start https://www.google.com.

💡 License

This project is released under the MIT license.

The part of OmniParser used in this project is released under the Creative Commons Attribution 4.0 International license.
