You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: latest/4-Web_Application_Security_Testing/01-Information_Gathering/01-Conduct_Search_Engine_Discovery_Reconnaissance_for_Information_Leakage.md
+27-2Lines changed: 27 additions & 2 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -17,9 +17,9 @@ tags: WSTG
17
17
18
18
In order for search engines to work, computer programs (or `robots`) regularly fetch data (referred to as [crawling](https://en.wikipedia.org/wiki/Web_crawler)) from billions of pages on the web. These programs find web content and functionality by following links from other pages, or by looking at sitemaps. If a site uses a special file called `robots.txt` to list pages that it does not want search engines to fetch, then the pages listed there will be ignored. This is a basic overview - Google offers a more in-depth explanation of [how a search engine works](https://support.google.com/webmasters/answer/70897?hl=en).
19
19
20
-
Testers can use search engines to perform reconnaissance on sites and web applications. There are direct and indirect elements to search engine discovery and reconnaissance: direct methods relate to searching the indexes and the associated content from caches, while indirect methods relate to learning sensitive design and configuration information by searching forums, newsgroups, and tendering sites.
20
+
Testers can use search engines to perform reconnaissance on sites and web applications. There are direct and indirect elements to search engine discovery and reconnaissance: direct methods relate to searching the indices and the associated content from caches, while indirect methods relate to learning sensitive design and configuration information by searching forums, newsgroups, and tendering sites.
21
21
22
-
Once a search engine robot has completed crawling, it commences indexing the web content based on tags and associated attributes, such as `<TITLE>`, in order to return relevant search results. If the `robots.txt` file is not updated during the lifetime of the site, and in-line HTML meta tags that instruct robots not to index content have not been used, then it is possible for indexes to contain web content not intended to be included by the owners. Site owners may use the previously mentioned `robots.txt`, HTML meta tags, authentication, and tools provided by search engines to remove such content.
22
+
Once a search engine robot has completed crawling, it commences indexing the web content based on tags and associated attributes, such as `<TITLE>`, in order to return relevant search results. If the `robots.txt` file is not updated during the lifetime of the site, and in-line HTML meta tags that instruct robots not to index content have not been used, then it is possible for indices to contain web content not intended to be included by the owners. Site owners may use the previously mentioned `robots.txt`, HTML meta tags, authentication, and tools provided by search engines to remove such content.
23
23
24
24
## Test Objectives
25
25
@@ -82,6 +82,31 @@ cache:owasp.org
82
82
\
83
83
*Figure 4.1.1-2: Google Cache Operation Search Result Example*
84
84
85
+
#### Internet Archive Wayback Machine
86
+
87
+
The [Internet Archive Wayback Machine](https://archive.org/web/) is the most comprehensive tool for viewing historical snapshots of web pages. It maintains an extensive archive of web pages dating back to 1996.
88
+
89
+
To view archived versions of a site, visit `https://web.archive.org/web/*/`
90
+
followed by the target URL:
91
+
92
+
```text
93
+
https://web.archive.org/web/*/owasp.org
94
+
```
95
+
96
+
This will display a calendar view showing all available snapshots of the site over time.
97
+
98
+
#### Bing Cache
99
+
100
+
Bing still provides cached versions of web pages. To view cached content, use the `cache:` operator:
101
+
Alternatively, click the arrow next to search results in Bing and select "Cached" from the dropdown menu.
102
+
103
+
#### Other Cached Content Services
104
+
105
+
Additional services for viewing cached or archived web pages include:
106
+
107
+
-[archive.ph](https://archive.ph) (also known as archive.md) - On-demand archiving service that creates permanent snapshots
108
+
-[CachedView](https://cachedview.com/) - Aggregates cached pages from multiple sources including Google Cache historical data, Wayback Machine, and others
109
+
85
110
### Google Hacking or Dorking
86
111
87
112
Searching with operators can be a very effective discovery technique when combined with the creativity of the tester. Operators can be chained to effectively discover specific kinds of sensitive files and information. This technique, called [Google hacking](https://en.wikipedia.org/wiki/Google_hacking) or Dorking, is also possible using other search engines, as long as the search operators are supported.
0 commit comments