41 changes: 32 additions & 9 deletions .github/copilot-instructions.md
@@ -2,11 +2,18 @@ Purpose
-------
This file gives concise, actionable guidance for AI coding agents working on the `webinfo` Go module.

**What this project does**: Extracts metadata (title, description, canonical, image, etc.) from web pages and provides utilities to fetch and save representative images.
What this project does
----------------------
Extracts metadata (title, description, canonical, image, etc.) from web pages and provides utilities
to fetch and save representative images and create thumbnails.

Quick entry points
------------------
- **Primary package**: `webinfo` — key files: `fetch.go` (core `Fetch` function), `webinfo.go` (`Webinfo` struct and `DownloadImage`), `errs.go` (error sentinel values), `fetch_test.go` (behavioral tests).
- **Primary package**: `webinfo` — key files:
  - `fetch.go` (core `Fetch` function and encoding handling)
  - `webinfo.go` (`Webinfo` type, `DownloadImage`, and `DownloadThumbnail`)
  - `errs.go` (error sentinel values)
  - `fetch_test.go` (behavioral tests and examples)
- **Go module**: `go 1.25` (see `go.mod`).

Developer workflows
@@ -25,10 +32,24 @@ Project-specific conventions and patterns
- Default User-Agent: `getUserAgent("")` returns a dummy UA string. Functions accept a `userAgent` param but fall back to this default.
- Encoding: `Fetch` peeks the first 1024 bytes and uses `charset.DetermineEncoding` and `encoding.GetEncoding(name)` to decode response bodies before HTML parsing — preserve this approach when touching parsing logic.
- HTML parsing: `goquery` is used to select head elements and meta tags. Extraction precedence is explicit in `fetch.go` (title → `twitter:title`/`og:title`, description → `twitter:description`/`og:description`, image → `twitter:image`/`og:image`). Follow this precedence in code changes or tests.
- Image download (`DownloadImage` in `webinfo.go`):
  - Determines extension from URL path, `Content-Type` header, sniffing (up to 512 bytes), then fallback to `.img`.
  - If URL has no filename, `temporary` is forced true and `os.CreateTemp(destDir, "webinfo-image-*"+ext)` is used.
  - When sniffing bytes, the code prepends the read bytes back into the stream with `io.MultiReader` so the full image is written.

Image download and thumbnail notes
---------------------------------
- `DownloadImage` (in `webinfo.go`) downloads `w.ImageURL` and saves it to disk. It determines the output file extension using this order:
  1) extension from the URL path,
  2) extensions inferred from the response `Content-Type` header,
  3) sniffing up to the first 512 bytes via `http.DetectContentType`,
  4) fallback to `.img` if none found.
  When sniffing, the read bytes are prepended back into the response body with `io.MultiReader` so the full image is written.
- `DownloadThumbnail` (added to `webinfo.go`) downloads the original image (via `DownloadImage`), resizes it to a requested width (preserving aspect ratio) and writes a thumbnail. Implementation notes:
  - The code currently uses a local nearest-neighbor scaler (no external `x/image/draw` dependency) to avoid adding module requirements.
  - The method accepts `width` (default 150 when <= 0), `destDir`, and `temporary` flags. When `destDir` is empty the method forces creation of a temporary file.
  - When `temporary` is false, the thumbnail filename is derived from the original image basename with `-thumb` appended before the extension.
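
The nearest-neighbor scaling described in the notes above can be sketched with the standard library alone. This is a minimal illustration under assumptions, not the code in `webinfo.go`: the names `thumbSize`, `scaleNearest`, and `defaultWidth` are hypothetical, and only the default width of 150 and the aspect-ratio rule come from the notes.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
)

// defaultWidth mirrors the documented fallback used when width <= 0.
const defaultWidth = 150

// thumbSize returns the thumbnail dimensions for a srcW x srcH source
// scaled to the requested width while preserving the aspect ratio.
func thumbSize(srcW, srcH, width int) (int, int) {
	if width <= 0 {
		width = defaultWidth
	}
	if srcW <= 0 || srcH <= 0 {
		return 0, 0
	}
	height := srcH * width / srcW
	if height < 1 {
		height = 1 // keep at least one row for very wide images
	}
	return width, height
}

// scaleNearest is a hypothetical nearest-neighbor scaler; the scaler in
// webinfo.go may be written differently.
func scaleNearest(src image.Image, width int) *image.RGBA {
	b := src.Bounds()
	w, h := thumbSize(b.Dx(), b.Dy(), width)
	dst := image.NewRGBA(image.Rect(0, 0, w, h))
	for y := 0; y < h; y++ {
		sy := b.Min.Y + y*b.Dy()/h // nearest source row
		for x := 0; x < w; x++ {
			sx := b.Min.X + x*b.Dx()/w // nearest source column
			dst.Set(x, y, src.At(sx, sy))
		}
	}
	return dst
}

func main() {
	src := image.NewRGBA(image.Rect(0, 0, 400, 200))
	src.Set(0, 0, color.RGBA{R: 255, A: 255})
	thumb := scaleNearest(src, 0) // 0 selects the default width
	fmt.Println(thumb.Bounds().Dx(), thumb.Bounds().Dy())
}
```

Nearest-neighbor sampling trades output quality for zero extra dependencies, which matches the stated reason for not pulling in `x/image/draw`.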

I/O and cleanup
----------------
- Response bodies and files are closed; close errors are wrapped/joined with any existing error.
- Errors encountered while parsing the URL, fetching, reading, sniffing, creating directories/files, or copying data are wrapped with contextual information (e.g. `"url"`, `"path"`, `"dir"`, `"file"`) using the `errs` package.
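
The "close errors are joined with any existing error" convention can be illustrated with a deferred close and `errors.Join` from the standard library. This is only a sketch: the module itself wraps errors with the `errs` package, and `readAll`, `failingCloser`, and `newFailing` are hypothetical names introduced here for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// failingCloser wraps a reader and always fails on Close (illustration only).
type failingCloser struct{ io.Reader }

func (failingCloser) Close() error { return errors.New("close failed") }

// newFailing builds a ReadCloser over s whose Close always returns an error.
func newFailing(s string) io.ReadCloser {
	return failingCloser{strings.NewReader(s)}
}

// readAll reads everything from rc and closes it. A Close error is joined
// with any earlier error instead of being silently dropped.
func readAll(rc io.ReadCloser) (data []byte, err error) {
	defer func() {
		if cerr := rc.Close(); cerr != nil {
			err = errors.Join(err, fmt.Errorf("closing body: %w", cerr))
		}
	}()
	data, err = io.ReadAll(rc)
	if err != nil {
		return nil, fmt.Errorf("reading body: %w", err)
	}
	return data, nil
}

func main() {
	data, err := readAll(newFailing("payload"))
	fmt.Printf("data=%q err=%v\n", data, err)
}
```

Using a named return value is what lets the deferred function attach the close error after the read has already finished.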

Tests and examples
------------------
@@ -39,24 +60,26 @@ Tests and examples
- Example usage patterns to follow when adding code or tests:
  - Fetch: `info, err := Fetch(ctx, "https://example.com", "")` — empty UA uses the default.
  - Download image: `outPath, err := w.DownloadImage(ctx, "images", true)`
  - Download thumbnail: `thumbPath, err := w.DownloadThumbnail(ctx, "thumbnails", 150, false)`

External dependencies & integration points
----------------------------------------
- Key dependencies in `go.mod`: `github.com/goark/fetch`, `github.com/goark/errs`, `github.com/PuerkitoBio/goquery`, `golang.org/x/text` (encodings).
- The repository intentionally avoids adding `golang.org/x/image/draw` as a dependency; if you need higher-quality scaling consider adding it and updating `go.mod` and tests.
- The `Taskfile.yml` runs additional tools: `govulncheck`, `golangci-lint-v2`, and (optionally) `nancy` via `depm` — keep CI tool invocations in sync when adding dependencies.

When modifying public APIs
-------------------------
- Maintain existing error-wrapping conventions (`errs.Wrap`, `errs.WithContext`).
- Preserve encoding detection behavior and the 1024-byte peek in `Fetch` unless a clear, tested performance reason exists.
- Preserve `DownloadImage`'s extension-detection order and the behavior of `temporary` vs permanent files.
- Preserve `DownloadImage`'s extension-detection order and the behavior of `temporary` vs permanent files. When adding `DownloadThumbnail` behavior or changing file-naming semantics, update tests accordingly.
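
The 1024-byte peek mentioned above relies on reading a prefix without consuming it. A minimal standard-library sketch of that idea follows; it deliberately leaves out the actual charset detection (which the notes attribute to `charset.DetermineEncoding`), and `peekHead` and `peekDemo` are hypothetical names, not functions from `fetch.go`.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// peekHead returns a copy of up to n leading bytes of r together with a
// reader that still yields the complete stream, prefix included.
func peekHead(r io.Reader, n int) ([]byte, io.Reader, error) {
	br := bufio.NewReaderSize(r, n)
	head, err := br.Peek(n)
	if err != nil && err != io.EOF { // io.EOF only means the body is shorter than n
		return nil, nil, err
	}
	// Copy because the slice returned by Peek is only valid until the next read.
	return append([]byte(nil), head...), br, nil
}

// peekDemo shows that the peeked prefix is still present in the full read.
func peekDemo(s string, n int) (head, full string, err error) {
	h, r, err := peekHead(strings.NewReader(s), n)
	if err != nil {
		return "", "", err
	}
	all, err := io.ReadAll(r)
	if err != nil {
		return "", "", err
	}
	return string(h), string(all), nil
}

func main() {
	head, full, err := peekDemo("<html><head><meta charset=\"utf-8\"></head></html>", 1024)
	fmt.Println(len(head), len(full), err)
}
```

In a real fetch path the peeked bytes would be handed to the encoding detector and the returned reader wrapped in the matching decoder before HTML parsing.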

Where to look next (high-value files)
-------------------------------------
- `fetch.go` — how pages are fetched, decoded and parsed.
- `webinfo.go` — `Webinfo` type and `DownloadImage` implementation.
- `webinfo.go` — `Webinfo` type, `DownloadImage`, and `DownloadThumbnail` implementations.
- `fetch_test.go` — canonical tests and examples you should mirror for new behaviors.
- `errs.go` and `go.mod` — error constants and dependency hints.
- `Taskfile.yml` — canonical developer/test/lint workflow.

If anything above is unclear or you want more examples (small patches, test templates, or a CI-safe refactor suggestion), tell me which area to expand and I will iterate.
If anything above is unclear or you want small patches, test templates, or a CI-safe refactor suggestion, tell me which area to expand and I will iterate.
121 changes: 107 additions & 14 deletions README.md
@@ -1,33 +1,33 @@
# webinfo -- Extract metadata and structured information from web pages
# [webinfo] -- Extract metadata and structured information from web pages

[![lint status](https://github.com/goark/webinfo/workflows/lint/badge.svg)](https://github.com/goark/webinfo/actions)
[![GitHub license](https://img.shields.io/badge/license-Apache%202-blue.svg)](https://raw.githubusercontent.com/goark/webinfo/master/LICENSE)
[![GitHub release](http://img.shields.io/github/release/goark/webinfo.svg)](https://github.com/goark/webinfo/releases/latest)

webinfo is a small Go module that extracts common metadata from web pages and provides utilities
[`webinfo`][webinfo] is a small Go module that extracts common metadata from web pages and provides utilities
to download representative images and create thumbnails.

**Quick overview**
## Quick overview

- **Package**: `webinfo`
- **Repository**: `github.com/goark/webinfo`
- **Purpose**: fetch page metadata (title, description, canonical, image, etc.) and download images

**Features**
## Features

- Fetch page metadata with `Fetch` (handles encodings and meta tag precedence).
- Download an image referenced by `Webinfo.ImageURL` using `(*Webinfo).DownloadImage`.
- Create a thumbnail from the referenced image using `(*Webinfo).DownloadThumbnail`.

**Install**
## Install

Use Go modules (Go 1.25+ as used by the project):

```bash
go get github.com/goark/webinfo@latest
```

**Basic usage**
## Basic usage

Example showing fetch and download thumbnail (error handling omitted for brevity):

@@ -37,43 +37,136 @@ package main
import (
	"context"
	"fmt"
	"os"

	"github.com/goark/webinfo"
)

func main() {
	ctx := context.Background()
	// Fetch metadata for a page (empty UA uses default)
	info, err := webinfo.Fetch(ctx, "https://example.com", "")
	info, err := webinfo.Fetch(ctx, "https://text.baldanders.info/", "")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error fetching webinfo:", err)
		panic(err)
		fmt.Printf("error detail:\n%+v\n", err)
		return
	}

	// Download thumbnail: width 150, to directory "thumbnails", permanent file
	thumbPath, err := info.DownloadThumbnail(ctx, "thumbnails", 150, false)
	if err != nil {
		panic(err)
		fmt.Printf("error detail:\n%+v\n", err)
		return
	}
	fmt.Println("thumbnail saved:", thumbPath)
}
```

**API notes**
### API notes

- `Fetch(ctx, url, userAgent)` — Parse and extract metadata. Pass an empty userAgent to use the module default.
- `(*Webinfo).DownloadImage(ctx, destDir, temporary)` — Download the image in `Webinfo.ImageURL` and save it. If
`temporary` is true (or `destDir` is empty), a temporary file is created.
- `(*Webinfo).DownloadThumbnail(ctx, destDir, width, temporary)` — Download the referenced image and produce a
thumbnail resized to `width` pixels (height is scaled to preserve the aspect ratio). If `destDir` is empty the method
creates a temporary file; when `temporary` is false the thumbnail file is named based on the original image
name with `-thums` appended before the extension.
name with `-thumb` appended before the extension.
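
The `-thumb` naming rule above can be sketched as follows. This is an illustration under assumptions: `thumbName` is a hypothetical helper, and the actual method may derive the name differently (for example from the downloaded file path rather than a bare name).

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// thumbName inserts "-thumb" between the base name and the extension of orig.
// Any directory part of orig is dropped; only the file name is returned.
func thumbName(orig string) string {
	ext := filepath.Ext(orig)
	base := strings.TrimSuffix(filepath.Base(orig), ext)
	return base + "-thumb" + ext
}

func main() {
	fmt.Println(thumbName("images/photo.png"))
	fmt.Println(thumbName("noext"))
}
```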

Note on defaults and test hooks:

- **Default width**: If `width <= 0` is passed to `DownloadThumbnail`, the method uses a default width of 150 pixels.
- **Extension detection**: `DownloadImage` determines an output extension from the URL path, the response
`Content-Type` (via `mime.ExtensionsByType`), or by sniffing up to the first 512 bytes with `http.DetectContentType`.
- **Test hooks / injection points**: For easier testing the package exposes a few package-level variables that
tests can override:
  - `createFile`: used to create temporary or permanent files (wraps `os.CreateTemp` / `os.Create`). Override to
    simulate file-creation failures.
  - `decodeImage`: wrapper around `image.Decode` used by `DownloadThumbnail`; override to simulate decode results
    (for example, to return a zero-dimension image).
  - `outputImage`: encoder that writes the thumbnail image to disk (wraps `jpeg.Encode`, `png.Encode`, etc.).
    Override to simulate encoder failures.

These hooks are intended for tests and let callers reproduce rare I/O or encoding failures without changing
production behavior.

- **HTTP client timeout**: `DownloadImage` uses an HTTP client with a default 30-second `Timeout` for the whole
request; tests can override this by replacing the `newHTTPClient` package variable.
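
The extension-detection order noted above (URL path, then `Content-Type`, then sniffing, then `.img`) might look roughly like this when written against the standard library only. This is a sketch, not the package's implementation: `detectExt`, `extFromType`, and `detectString` are hypothetical names, and the extension chosen for a given MIME type can vary with the system's MIME tables.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime"
	"net/http"
	"net/url"
	"path"
	"strings"
)

// extFromType returns the first known extension for a MIME type, or "".
func extFromType(ct string) string {
	if ct == "" {
		return ""
	}
	exts, err := mime.ExtensionsByType(ct)
	if err != nil || len(exts) == 0 {
		return ""
	}
	return exts[0]
}

// detectExt picks a file extension and returns a reader that still yields
// the whole body, even when some bytes were read for sniffing.
func detectExt(rawURL, contentType string, body io.Reader) (string, io.Reader, error) {
	// 1) extension from the URL path
	if u, err := url.Parse(rawURL); err == nil {
		if ext := path.Ext(u.Path); ext != "" {
			return ext, body, nil
		}
	}
	// 2) extension inferred from the Content-Type header
	if ext := extFromType(contentType); ext != "" {
		return ext, body, nil
	}
	// 3) sniff up to the first 512 bytes
	buf := make([]byte, 512)
	n, err := io.ReadFull(body, buf)
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return "", nil, err
	}
	buf = buf[:n]
	// Prepend the sniffed bytes so the caller can still copy the full image.
	rest := io.MultiReader(bytes.NewReader(buf), body)
	if ext := extFromType(http.DetectContentType(buf)); ext != "" {
		return ext, rest, nil
	}
	// 4) fallback
	return ".img", rest, nil
}

// detectString runs detectExt on in-memory data and drains the returned reader.
func detectString(rawURL, contentType, data string) (ext, body string, err error) {
	ext, r, err := detectExt(rawURL, contentType, strings.NewReader(data))
	if err != nil {
		return "", "", err
	}
	all, err := io.ReadAll(r)
	return ext, string(all), err
}

func main() {
	ext, body, err := detectString("https://example.com/image", "", "\x89PNG\r\n\x1a\nrest-of-file")
	fmt.Println(ext, len(body), err)
}
```

The `io.MultiReader` step matters because sniffing consumes the start of the response body; without it the saved file would be missing its first bytes.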

## Test examples

Below are short examples showing how to override the package-level hooks from a test to simulate failures.
These snippets are intended for `*_test.go` files and assume the usual `testing` and `net/http/httptest` helpers.

1) Simulate thumbnail temporary-file creation failure (override `createFile`):

```go
// in your test function
orig := createFile
defer func() { createFile = orig }()
createFile = func(temp bool, dir, pattern string) (*os.File, error) {
	// fail only for thumbnail temp pattern
	if temp && strings.Contains(pattern, "webinfo-thumb-") {
		return nil, errors.New("simulated thumbnail temp create failure")
	}
	return orig(temp, dir, pattern)
}

// then call the method under test
_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true)
// assert err != nil
```

2) Simulate a zero-dimension decoded image (override `decodeImage`):

```go
origDecode := decodeImage
defer func() { decodeImage = origDecode }()
decodeImage = func(r io.Reader) (image.Image, string, error) {
	// return an image with zero width to hit the origW==0 error path
	return image.NewRGBA(image.Rect(0, 0, 0, 10)), "png", nil
}

_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true)
// assert err != nil
```

3) Simulate encoder failure when writing thumbnails (override `outputImage`):

```go
origOut := outputImage
defer func() { outputImage = origOut }()
outputImage = func(dst *os.File, src *image.RGBA, format string) error {
	return errors.New("simulated encode failure")
}

_, err := info.DownloadThumbnail(ctx, t.TempDir(), 50, true)
// assert err != nil
```

Notes:
- Ensure your test imports include `errors`, `io`, `image`, `os`, `strings`, `net/http`, and `time` as needed.
- Restore the original variables with `defer` to avoid cross-test interference.
- These examples are intentionally minimal — adapt them to your test fixtures (httptest servers, temp dirs, etc.).

4) Simulate HTTP client timeout by overriding `newHTTPClient`:

```go
origClient := newHTTPClient
defer func() { newHTTPClient = origClient }()
newHTTPClient = func() *http.Client {
	// short timeout for test
	return &http.Client{Timeout: 50 * time.Millisecond}
}

// then call DownloadImage which uses newHTTPClient()
_, err := info.DownloadImage(ctx, t.TempDir(), true)
// assert err != nil (expect timeout)
```

**Error handling**
### Error handling

The package uses `github.com/goark/errs` for wrapping errors with contextual keys (e.g. `url`, `path`, `dir`).
Callers should inspect returned errors accordingly.
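
One way callers might inspect a returned error is `errors.Is` against a sentinel value. The sketch below uses only the standard library and assumes the wrapped errors support unwrapping; `ErrNoImageURL`, `download`, and `isNoImage` are hypothetical names for illustration, and the real sentinel values are defined in `errs.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoImageURL is a hypothetical sentinel used only in this sketch.
var ErrNoImageURL = errors.New("no image URL")

// download stands in for a method that wraps a sentinel with context.
func download(imageURL string) error {
	if imageURL == "" {
		return fmt.Errorf("download image (url=%q): %w", imageURL, ErrNoImageURL)
	}
	return nil
}

// isNoImage reports whether err wraps the sentinel anywhere in its chain.
func isNoImage(err error) bool {
	return errors.Is(err, ErrNoImageURL)
}

func main() {
	err := download("")
	if isNoImage(err) {
		fmt.Println("page has no representative image:", err)
	}
}
```

Printing with `%+v`, as the basic usage example does, is a complementary way to see the attached context when debugging.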

**Tests & development**
### Tests & development

- Run all tests: `go test ./...`
- The repository includes `Taskfile.yml` tasks for common workflows; see that file for CI/test commands.
3 changes: 2 additions & 1 deletion Taskfile.yml
@@ -17,7 +17,8 @@ tasks:
    desc: Test and lint.
    cmds:
      - go mod verify
      - go test -shuffle on ./...
      - go test -shuffle on ./... -coverprofile=coverage.out -cover
      - go tool cover -func=coverage.out
      - govulncheck ./...
      - golangci-lint-v2 run --enable gosec --timeout 10m0s ./...
    sources:
62 changes: 62 additions & 0 deletions fetch_test.go
@@ -2,7 +2,9 @@ package webinfo

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"testing"
@@ -87,6 +89,66 @@ func TestFetch_DefaultUserAgent(t *testing.T) {
	}
}

func TestFetch_CustomUserAgent(t *testing.T) {
	uaCh := make(chan string, 1)
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		uaCh <- r.Header.Get("User-Agent")
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		_, _ = w.Write([]byte("<html><head><title>X</title></head><body></body></html>"))
	})
	srv := httptest.NewServer(handler)
	defer srv.Close()

	ctx := context.Background()
	customUA := "MyCustomAgent/1.0"
	info, err := Fetch(ctx, srv.URL, customUA)
	if err != nil {
		t.Fatalf("Fetch returned error: %v", err)
	}
	var gotUA string
	select {
	case gotUA = <-uaCh:
	default:
		t.Fatalf("server did not receive request")
	}
	if gotUA != customUA {
		t.Errorf("User-Agent: want %q, got %q", customUA, gotUA)
	}
	if info == nil {
		t.Fatalf("expected non-nil info")
	}
	if info.UserAgent != customUA {
		t.Errorf("info.UserAgent: want %q, got %q", customUA, info.UserAgent)
	}
}

func TestFetch_BodyCloseReturnsError(t *testing.T) {
	orig := http.DefaultTransport
	defer func() { http.DefaultTransport = orig }()

	rt := roundTripperFunc(func(req *http.Request) (*http.Response, error) {
		b := &failingBody{
			firstData: []byte("<html><head><title>X</title></head><body></body></html>"),
			firstErr:  io.ErrUnexpectedEOF,
			closeErr:  errors.New("simulated close error"),
		}
		return &http.Response{
			StatusCode: 200,
			Status:     "200 OK",
			Header:     make(http.Header),
			Body:       b,
			Request:    req,
		}, nil
	})
	http.DefaultTransport = rt

	ctx := context.Background()
	_, err := Fetch(ctx, "http://example.invalid/", "")
	if err == nil {
		t.Fatalf("expected error when response body Close returns error, got nil")
	}
}

func TestFetch_BadURL(t *testing.T) {
	ctx := context.Background()
	_, err := Fetch(ctx, "://bad-url", "")
23 changes: 23 additions & 0 deletions sample/sample1.go
@@ -0,0 +1,23 @@
//go:build run

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/goark/webinfo"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	info, err := webinfo.Fetch(ctx, "https://example.com", "")
	if err != nil {
		log.Fatalf("Fetch failed: %v", err)
	}
	fmt.Printf("Title: %s\nDescription: %s\nImage: %s\n", info.Title, info.Description, info.ImageURL)
}