# Content Discovery

* There are three main ways of discovering content: manual discovery, automated discovery, and OSINT (Open-Source Intelligence).

## 1] Manual Discovery :

* **Robots.txt**

  The robots.txt file is a document that tells search engine crawlers which pages they are and aren't allowed to index, so it can reveal locations the site owner would rather keep out of search results. ![](https://2855293502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FjpZ0GSi6rFzKo8D8arzs%2Fuploads%2FJefhW8bVKoHMYgqQxBEw%2Fimage.png?alt=media\&token=5bb2674d-b47f-4962-8252-76456dc4ceb2)
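
As a quick local illustration (not part of the original room), Python's standard library can parse a robots.txt body and show which paths a crawler is told to avoid — the sample rules below are made up:

```python
# A minimal sketch: parse an illustrative robots.txt body and check
# which paths a crawler is allowed to fetch.
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /staff-portal
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Disallowed entries are interesting to an attacker precisely because
# the owner does not want them indexed.
print(parser.can_fetch("*", "/admin/login"))  # disallowed by the rules above
print(parser.can_fetch("*", "/blog"))         # no rule matches, so allowed
```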
* **Favicon**

  The favicon is a small icon displayed in the browser's address bar or tab, used for branding a website. If the site uses a web framework's default favicon, we can identify the framework from it.

![](https://2855293502-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FjpZ0GSi6rFzKo8D8arzs%2Fuploads%2FCvzu9qQpeXAnr8SVPSR0%2Fimage.png?alt=media\&token=7de4b33c-dea7-47a0-8701-df7307d8be8d)  Inside the page source we find the favicon path.

```
curl https://static-labs.tryhackme.cloud/sites/favicon/images/favicon.ico | md5sum
```

The command above downloads the favicon and pipes it to md5sum; the resulting MD5 hash can then be looked up in the OWASP favicon database at <https://wiki.owasp.org/index.php/OWASP_favicon_database> to identify the framework.
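
The same hashing step can be reproduced in Python (a sketch — the placeholder byte string stands in for a real downloaded favicon):

```python
import hashlib

def favicon_hash(data: bytes) -> str:
    """MD5 hex digest of favicon bytes, the format the OWASP database indexes."""
    return hashlib.md5(data).hexdigest()

# In practice `data` is the content of the downloaded favicon.ico;
# the placeholder bytes here just keep the sketch self-contained.
print(favicon_hash(b"\x00\x00\x01\x00placeholder"))
```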

* **Sitemap.xml**

  A list of every file the website owner wishes to have listed on a search engine.
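
Sitemaps are plain XML, so extracting the listed URLs is straightforward; here's a sketch using an inline sample document (a real sitemap would be fetched from /sitemap.xml):

```python
# Parse a small illustrative sitemap and list the URLs it exposes.
import xml.etree.ElementTree as ET

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/hidden-page</loc></url>
</urlset>"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```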

* **HTTP Headers**

```
root@ip-10-10-249-10:~# curl http://10.10.72.82 -v
* Rebuilt URL to: http://10.10.72.82/
*   Trying 10.10.72.82...
* TCP_NODELAY set
* Connected to 10.10.72.82 (10.10.72.82) port 80 (#0)
> GET / HTTP/1.1
> Host: 10.10.72.82
> User-Agent: curl/7.58.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Server: nginx/1.18.0 (Ubuntu)
< Date: Sat, 01 Jul 2023 04:50:03 GMT
< Content-Type: text/html; charset=UTF-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< X-FLAG: THM{HEADER_FLAG}
< 
```
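
Pulling the interesting fields out of a raw response is simple string work; a sketch using the Server header from the capture above as sample data:

```python
# Parse raw HTTP response headers into a dict and pick out the
# fields that leak information about the target.
raw_headers = """HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Content-Type: text/html; charset=UTF-8
X-FLAG: THM{HEADER_FLAG}"""

headers = {}
for line in raw_headers.splitlines()[1:]:  # skip the status line
    name, _, value = line.partition(": ")
    headers[name] = value

# The Server header leaks the web server software and version.
print(headers["Server"])
print(headers["X-FLAG"])
```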

* **Framework Stack**

  Once you identify the framework a site is built on (from the favicon, page source comments, or a copyright footer), its official documentation can point you to default paths and files worth checking.

## 2] OSINT :

* **Google Hacking / Dorking**

```
site:tryhackme.com
  returns results only from the specified website address

inurl:admin
  returns results that have the specified word in the URL

filetype:pdf
  returns results which are a particular file extension

intitle:admin
  returns results that contain the specified word in the title
```
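
Operators can be combined into a single query; a small sketch of URL-encoding a combined dork for a search URL (the target domain is illustrative):

```python
# Build a URL-encoded Google search query from combined dork operators.
from urllib.parse import quote_plus

dork = "site:tryhackme.com filetype:pdf"
search_url = "https://www.google.com/search?q=" + quote_plus(dork)
print(search_url)
```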

* **Wappalyzer:** an online tool and browser extension that helps identify what technologies a website uses.
* **Wayback Machine :** a historical archive of websites that dates back to the late 90s.
* **GitHub :** a hosted version control service; searching public repositories for a company or domain name can reveal source code and sensitive files.
* **S3 Buckets :**

  Amazon S3 buckets follow the URL format http(s)://{name}.s3.amazonaws.com
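
Bucket names are often guessable from the target's name; a sketch that expands a company name with a few suffixes (the suffix list is an assumption for illustration, not from the original notes):

```python
# Generate candidate S3 bucket URLs from a company name and a small
# illustrative suffix wordlist.
company = "example"
suffixes = ["", "-assets", "-www", "-backup", "-dev"]

candidates = [
    f"https://{company}{suffix}.s3.amazonaws.com" for suffix in suffixes
]
for url in candidates:
    print(url)  # each candidate would then be requested to see if it exists
```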

## 3] Automated Discovery :

Automated discovery uses tools, usually driven by wordlists of common file and directory names, to find content rather than requesting pages manually.

**Automation Tools**

Although there are many different content discovery tools available, all with their features and flaws, we're going to cover three which are preinstalled on our attack box: **ffuf**, **dirb** and **gobuster**.

```
# Using ffuf:
ffuf -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt -u http://10.10.72.82/FUZZ

# Using dirb:
dirb http://10.10.72.82/ /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt

# Using gobuster (my preference):
gobuster dir --url http://10.10.72.82/ -w /usr/share/wordlists/SecLists/Discovery/Web-Content/common.txt
```
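
Under the hood these tools essentially substitute each wordlist entry into the URL and check the responses; a minimal Python sketch of that FUZZ substitution (the four words are a tiny made-up sample of common.txt):

```python
# Expand a FUZZ placeholder with wordlist entries, the core loop of
# directory brute-forcing tools like ffuf.
template = "http://10.10.72.82/FUZZ"
wordlist = ["admin", "backup", "robots.txt", "sitemap.xml"]

targets = [template.replace("FUZZ", word) for word in wordlist]
for t in targets:
    print(t)  # each URL would be requested and non-404 responses reported
```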
