Commit dcba7bc

Publish Latest 2025-11-15

Updates based on OWASP/wstg@0847e37

1 parent 50ded29 commit dcba7bc

File tree

1 file changed: +27 -2 lines changed

latest/4-Web_Application_Security_Testing/01-Information_Gathering/01-Conduct_Search_Engine_Discovery_Reconnaissance_for_Information_Leakage.md

Lines changed: 27 additions & 2 deletions
@@ -17,9 +17,9 @@ tags: WSTG
In order for search engines to work, computer programs (or `robots`) regularly fetch data (referred to as [crawling](https://en.wikipedia.org/wiki/Web_crawler)) from billions of pages on the web. These programs find web content and functionality by following links from other pages, or by looking at sitemaps. If a site uses a special file called `robots.txt` to list pages that it does not want search engines to fetch, then the pages listed there will be ignored. This is a basic overview - Google offers a more in-depth explanation of [how a search engine works](https://support.google.com/webmasters/answer/70897?hl=en).
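The `robots.txt` format itself is simple. As a purely illustrative example (the path is hypothetical), a file that asks all crawlers to skip one directory looks like:

```text
User-agent: *
Disallow: /internal/
```

Note that `robots.txt` is advisory: it keeps well-behaved crawlers from indexing the listed paths, but it does not protect them from direct access.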
-Testers can use search engines to perform reconnaissance on sites and web applications. There are direct and indirect elements to search engine discovery and reconnaissance: direct methods relate to searching the indexes and the associated content from caches, while indirect methods relate to learning sensitive design and configuration information by searching forums, newsgroups, and tendering sites.
+Testers can use search engines to perform reconnaissance on sites and web applications. There are direct and indirect elements to search engine discovery and reconnaissance: direct methods relate to searching the indices and the associated content from caches, while indirect methods relate to learning sensitive design and configuration information by searching forums, newsgroups, and tendering sites.
-Once a search engine robot has completed crawling, it commences indexing the web content based on tags and associated attributes, such as `<TITLE>`, in order to return relevant search results. If the `robots.txt` file is not updated during the lifetime of the site, and in-line HTML meta tags that instruct robots not to index content have not been used, then it is possible for indexes to contain web content not intended to be included by the owners. Site owners may use the previously mentioned `robots.txt`, HTML meta tags, authentication, and tools provided by search engines to remove such content.
+Once a search engine robot has completed crawling, it commences indexing the web content based on tags and associated attributes, such as `<TITLE>`, in order to return relevant search results. If the `robots.txt` file is not updated during the lifetime of the site, and in-line HTML meta tags that instruct robots not to index content have not been used, then it is possible for indices to contain web content not intended to be included by the owners. Site owners may use the previously mentioned `robots.txt`, HTML meta tags, authentication, and tools provided by search engines to remove such content.
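As an illustration, a page can instruct robots not to index it with an in-line HTML meta tag in its `<head>`:

```html
<meta name="robots" content="noindex">
```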
## Test Objectives
@@ -82,6 +82,31 @@ cache:owasp.org
![Google Cache Operation Search Result Example](images/Google_cache_Operator_Search_Results_Example_20200406.png)\
*Figure 4.1.1-2: Google Cache Operation Search Result Example*
#### Internet Archive Wayback Machine

The [Internet Archive Wayback Machine](https://archive.org/web/) is the most comprehensive tool for viewing historical snapshots of web pages. It maintains an extensive archive of web pages dating back to 1996.

To view archived versions of a site, visit `https://web.archive.org/web/*/` followed by the target URL:

```text
https://web.archive.org/web/*/owasp.org
```

This will display a calendar view showing all available snapshots of the site over time.
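Beyond the calendar view, the Wayback Machine also exposes its snapshot index through a CDX API endpoint (`https://web.archive.org/cdx/search/cdx`). A minimal Python sketch that builds such a query, assuming the commonly documented `url`, `output`, and `limit` parameters:

```python
from urllib.parse import urlencode

def cdx_query_url(target: str, limit: int = 5) -> str:
    """Build a Wayback Machine CDX API query URL for a target site."""
    params = urlencode({
        "url": target,       # site whose snapshots we want listed
        "output": "json",    # return JSON rows rather than plain text
        "limit": str(limit), # cap the number of snapshot entries
    })
    return f"https://web.archive.org/cdx/search/cdx?{params}"

# Fetching this URL (e.g. with urllib.request) returns one row per snapshot,
# including the capture timestamp used in web.archive.org/web/<timestamp>/ URLs.
print(cdx_query_url("owasp.org"))
```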
#### Bing Cache

Bing still provides cached versions of web pages. To view cached content, use the `cache:` operator. Alternatively, click the arrow next to a search result in Bing and select "Cached" from the dropdown menu.
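For example, to request Bing's cached copy of a site (using owasp.org as the target, as elsewhere in this section):

```text
cache:owasp.org
```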
#### Other Cached Content Services

Additional services for viewing cached or archived web pages include:

- [archive.ph](https://archive.ph) (also known as archive.md) - On-demand archiving service that creates permanent snapshots
- [CachedView](https://cachedview.com/) - Aggregates cached pages from multiple sources including Google Cache historical data, the Wayback Machine, and others
### Google Hacking or Dorking

Searching with operators can be a very effective discovery technique when combined with the creativity of the tester. Operators can be chained to effectively discover specific kinds of sensitive files and information. This technique, called [Google hacking](https://en.wikipedia.org/wiki/Google_hacking) or Dorking, is also possible using other search engines, as long as the search operators are supported.
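As an illustration of chaining (the target domains are placeholders), the following queries look for PDF files on a single site, and for exposed directory listings:

```text
site:owasp.org filetype:pdf
intitle:"index of" site:example.com
```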
