How to find out what programming language a website is built in?
I think that it's fundamental for security testers to gather information about how a web application works and eventually what language it's written in.
I know that URL extensions, HTTP headers, session cookies, HTML comments and style sheets may reveal some information, but it's still hard and not guaranteed.
So I was wondering: is there a way to determine what technology and framework are behind a website?
@HagenvonEitzen If HTML had been a programming language it would have been named HTPL rather than HTML.
`I think that it's fundamental for security testers to gather information about how a web application works and what language it's written in.` I think that, if even a security tester can't figure out what language the site is built in, that makes it more secure because then no one will know which exploits to try. (Yes, there are occasionally valid use cases for security through obscurity.)
@MasonWheeler: figuring out what language the site is built in will only determine which exploits *not* to try. That won't make the site more secure.
@BenoitEsnard well, if an attacker uses it to determine which exploits *not* to try, then it would be a security improvement if a site successfully misleads the attacker into thinking it's something different and thus the attacker skips trying the "proper" exploits.
There's no way to be 100% sure if you don't have access to the server, so it's about guessing. Here are some clues:
- File extensions: `login.php` is most likely a PHP script.
- HTTP headers: they may leak some information about the language running on the server, and some additional details like the version: `X-Powered-By: PHP/7.0.0` means that the page was rendered by PHP.
- HTTP Parameter Pollution: if you managed to guess which server is running, you can refine the guess.
- Language limits: maximum POST data size, maximum number of variables in GET and POST data, etc. This may be useful if the webmaster kept the default values.
- Specific input: for example, PHP had some easter eggs.
- Errors: triggering errors may also leak the language. `Warning: Division by zero in /var/www/html/index.php on line 3` is PHP, for example.
- File uploads: libraries may add metadata if the file is being modified server-side. For example, most sites resize users' avatars, and checking for EXIF data will leak `CREATOR: gd-jpeg v1.0 (using IJG JPEG v90), default quality`, which may help to guess which language is used.
- Default filenames: check if `/` and `/index.php` are the same page.
- Exploits: reading a backup file, or executing arbitrary code on the server.
- Open source: the website may have been open-sourced and be available somewhere on the Internet.
- About page: the webmaster may have thanked the language community in a "FAQ" or "About" page.
- Jobs page: the development team may be recruiting, and they may have detailed the technologies they're using.
- Social Engineering: ask the webmaster!
- Public profiles: if you know who is working on the website (check LinkedIn and `/humans.txt`), you can check their public repos or their skills on online profiles (GitHub, LinkedIn, Twitter, ...).
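As a rough illustration of the first few clues (file extension, `X-Powered-By` header, error message), here is a minimal Python sketch. The rule tables and function names are my own assumptions for the example, not an exhaustive or authoritative list, and every result is only a guess since the server can fake any of these signals.

```python
import re

# Illustrative rules only (assumed for this sketch); a real list would be longer.
EXTENSION_HINTS = {".php": "PHP", ".jsp": "Java", ".aspx": "ASP.NET"}
HEADER_RULES = [
    (re.compile(r"^PHP/[\d.]+", re.I), "PHP"),
    (re.compile(r"^ASP\.NET", re.I), "ASP.NET"),
]
ERROR_RULES = [
    # Matches messages shaped like the PHP warning quoted in the list above.
    (re.compile(r"(Warning|Notice|Fatal error): .+ in \S+\.php on line \d+"), "PHP"),
]


def guess_from_path(path):
    """Guess a language from the file extension of a URL path, or return None."""
    path = path.split("?", 1)[0].lower()  # ignore the query string
    for extension, language in EXTENSION_HINTS.items():
        if path.endswith(extension):
            return language
    return None


def guess_from_headers(headers):
    """Guess a language from the X-Powered-By header, or return None."""
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("x-powered-by", "")
    for pattern, language in HEADER_RULES:
        if pattern.search(value):
            return language
    return None


def guess_from_error(body):
    """Guess a language from an error message found in a response body."""
    for pattern, language in ERROR_RULES:
        if pattern.search(body):
            return language
    return None
```

For example, `guess_from_headers({"X-Powered-By": "PHP/7.0.0"})` returns `"PHP"`, and so does `guess_from_path("/login.php")`.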
You may also want to know if the website is built with a framework or a CMS, since this will give information about the language used:
- URLs: directories and pages are specific to certain CMSs. For example, if some resources are located in the `/wp-content/` directory, it means that WordPress has been used.
- Session cookies: name and format.
- CSRF tokens: name and format.
- Rendered HTML: for example: meta tags order, comments.
Note that all information coming from the server may be altered to trick you. You should always try to use multiple sources to validate your guess.
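To make the "multiple sources" advice concrete, here is a small Python sketch that collects framework clues from resource URLs and cookie names, and only keeps a language guess when at least two different kinds of clue agree. The marker tables are assumptions for illustration (`/wp-content/` comes from the list above, the cookie names are common defaults), and the function names are hypothetical.

```python
# Illustrative markers only (assumed for this sketch).
# Each marker maps to (technology, language it suggests).
PATH_MARKERS = {"/wp-content/": ("WordPress", "PHP")}
COOKIE_MARKERS = {
    "JSESSIONID": ("Java servlet session", "Java"),
    "PHPSESSID": ("PHP session", "PHP"),
}


def collect_clues(resource_urls, cookie_names):
    """Return a list of (source kind, technology, language) clues."""
    clues = []
    for marker, (technology, language) in PATH_MARKERS.items():
        if any(marker in url for url in resource_urls):
            clues.append(("url", technology, language))
    for name, (technology, language) in COOKIE_MARKERS.items():
        if name in cookie_names:
            clues.append(("cookie", technology, language))
    return clues


def corroborated_languages(clues, minimum_sources=2):
    """Keep only languages supported by at least `minimum_sources` kinds of clue."""
    kinds_per_language = {}
    for kind, _technology, language in clues:
        kinds_per_language.setdefault(language, set()).add(kind)
    return sorted(
        language
        for language, kinds in kinds_per_language.items()
        if len(kinds) >= minimum_sources
    )
```

A single clue (say, only a `JSESSIONID` cookie) yields no corroborated language here, which is the point: one signal alone is easy to fake.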
You forgot to mention some examples from Java, which generally uses a JSESSIONID cookie for session management. The login URL can betray the underlying technology too, such as Spring's default URL. Those examples are for Java, but the same surely holds for some other languages.
Just a note: just because the http headers *say* they're powered by php, doesn't mean the site actually is. Although this example is more about the server platform, I know of a guy who would make his nginx server return Server: Microsoft-IIS/5.0 with every request so he could trick attackers into using the wrong attacks against the server. "It's too easy!" ~ *the attacker*. You're right about that! (This just goes to show that you can't trust headers)
I liked the Parameter Pollution technique .. I'm sure that there are many more ways though
Another good one is checking the source to see if there are tell-tale signs of the use of some templating engine specific to a language.
Nitpick: the first 9 will really only tell you what language was used to *deploy* the site, not to *build* it. E.g., if you determine that the site was deployed on a JVM, that doesn't tell you much, there are over 400 languages with implementations for the JVM, the site may have been built in Scala, Groovy, Clojure (which also has implementations for the CLI and ECMAScript), Fantom (ditto), Ruby (JRuby), Python (Jython), PHP (IBM P8, Quercus), ECMAScript (Mozilla Rhino, Oracle Nashorn, dyn.js). The same applies to the CLI (IronPython, IronRuby, IronJS, …). There are also many compilers that …
@mowwwalker: I've added that sign under the "rendered HTML" part. I'm not sure if you were thinking about another sign though, so let me know if I missed something!
To guess the programming language, you can follow the three-step approach detailed below:
STEP 1 - Search for evidence on the site itself
Look at the bottom of the site's pages for phrases like:
->"Powered by XXX"
->"Proudly Powered by XXX"
->"Running on XXX"
Check whether the site's team will attend any conference where they might talk about the website from a technical point of view
...or with the help of a tool
Read the HTML code downloaded by your browser
Fire up the `Network Tab` in the developer toolbar and study the exchanges made between the browser and the server.
Search for some known hidden page:
wget --spider -S http://the-site.com/private/admin
If you get 200, the site may be running on publicly available (free, paid, etc.) software.
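The same check can be scripted. Below is a minimal Python sketch that sends a HEAD request and maps a known path to a product when the server answers 200. The `KNOWN_PATHS` table and the function names are hypothetical and only for illustration. Keep in mind that a 200 can be a decoy.

```python
import urllib.error
import urllib.request

# Hypothetical path-to-product hints; extend with paths you actually know.
KNOWN_PATHS = {"/wp-login.php": "WordPress"}


def head_status(url, timeout=5.0):
    """Send a HEAD request and return the HTTP status code.

    Note: urlopen follows redirects, so the status is that of the final page.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses raise an exception but still carry a status code.
        return error.code


def hint_for(path, status):
    """Return the product suggested by a known path that answered 200, else None."""
    if status == 200:
        return KNOWN_PATHS.get(path)
    return None
```

Usage would look like `hint_for("/wp-login.php", head_status("http://the-site.com/wp-login.php"))`, run only against sites you are authorized to test.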
STEP 2 - Search for evidence on the web
Ask search engines for front-end errors
You can look for some errors produced by the website.
Some keywords to type in a search engine:
- Error 500 site:the-site.com
- Exception site:the-site.com
- <what ever> site:the-site.com
=> You can simply replace "<what ever>" with some known error message produced by the various web technologies.
Ask search engines for back-end errors
You can even guess the technologies used in the backend:
- ORA-12170 site:the-site.com
=> If you find something, the site may be using Oracle in its backend part.
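The queries above are easy to generate in bulk. Here is a small Python sketch that builds one `keyword site:domain` query per keyword and recognizes an Oracle error code in a result snippet. The function names are made up for the example, and the only backend rule included is the `ORA-` prefix mentioned above.

```python
import re

# ORA-nnnnn codes are Oracle database errors; other backends would need their own rules.
BACKEND_RULES = [(re.compile(r"\bORA-\d{5}\b"), "Oracle")]


def site_queries(domain, keywords):
    """Build one 'keyword site:domain' search query per keyword."""
    return [f"{keyword} site:{domain}" for keyword in keywords]


def backend_hint(snippet):
    """Return the backend suggested by an error code in a text snippet, else None."""
    for pattern, backend in BACKEND_RULES:
        if pattern.search(snippet):
            return backend
    return None
```

For instance, `site_queries("the-site.com", ["Error 500", "ORA-12170"])` gives the two query strings to paste into a search engine.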
Ask search engines for website competitors
Find what technology is popular in the website industry
Find what technology competitors are using
Find comparisons of the site with other competitors.
Those comparisons may talk about technologies in use
Technology survey sites
Those sites can provide great info about the site you target. They may have already done part of the job for you.
=> Enter the URL of the site you're targeting and see what technologies (client or server side) have been detected.
Note that the site must be in the top 1M Alexa ranking.
=> <keyword> can be anything: company name, website name, etc.
STEP 3 - Analyze your results
The evidence you found in step 1 may be wrong because the site owner can alter it. Look for contradictions within that evidence and eliminate the contradictory items.
Merge the evidence from step 2 across the various sources and your own findings. Again, eliminate contradictory evidence.
Summarize all your findings in a table like the one below.
+------------+---------+-----------------+-----+----------+-------+---------------+
| EVIDENCES  | ON SITE | Search Engine 1 | ... | SOURCE n | SCORE | PCT (%)       |
+------------+---------+-----------------+-----+----------+-------+---------------+
| PHP 7      | X       | X               | ... | X        | 3     | 300/n         |
+------------+---------+-----------------+-----+----------+-------+---------------+
| Wordpress  |         | X               | ... | X        | 2     | 200/n         |
+------------+---------+-----------------+-----+----------+-------+---------------+
| ...        |         |                 |     |          |       |               |
+------------+---------+-----------------+-----+----------+-------+---------------+
| EVIDENCE m |         |                 |     |          |       | (100*SCORE)/n |
+------------+---------+-----------------+-----+----------+-------+---------------+
Finally, you will be able to say "I'm confident at XX% that this site runs on YY (EVIDENCE i)".
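The SCORE and PCT columns can be computed mechanically. Here is a minimal Python sketch, assuming each piece of evidence is recorded together with the set of sources that reported it. The function name and data layout are my own choices. The percentage is simply `100 * SCORE / n`, i.e. how many of the `n` sources agree, not a true probability.

```python
def score_table(evidence, n_sources):
    """Return (evidence, score, percentage) rows, highest score first.

    `evidence` maps an evidence name to the set of sources that reported it.
    `n_sources` is the total number of sources consulted (n in the table).
    """
    rows = []
    for name, sources in evidence.items():
        score = len(sources)
        rows.append((name, score, 100 * score / n_sources))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Example with n = 4 sources consulted in total.
findings = {
    "Wordpress": {"search engine 1", "source n"},
    "PHP 7": {"on site", "search engine 1", "source n"},
}
for name, score, pct in score_table(findings, 4):
    print(f"{name}: score {score}, {pct:.0f}% of sources")
```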
This looks like a useful step-by-step guide, but it's probably a bad idea to present the arbitrary confidence score as a percentage. Even if a server gets a perfect score it could very well be a carefully assembled honeypot, so you shouldn't say you are 100% confident that it isn't.
It reports the programming language, server, analytics tools, and the CMS or frameworks a website is built on.
Give it a try, you will love it.
Yes, it's very accurate. I've been using it for the last 4 years, even on websites I developed myself. It's always accurate.
I don't think it can be considered accurate. We purposely fake our sent headers to return IIS. Have a wp-admin.php even though we don't use Wordpress. And several other honey pots. Our site is actually a Node.js application that returns static content.
I just downloaded it, and it is as accurate as it can be. Obviously it can't tell whether the headers are being spoofed or not.
The answer is that you can never "be assured". 99.9% of the time, the techniques in the highly upvoted answers will find the "tells" of the framework behind the site, but it's never a certainty.
If a website is serving content from wp-uploads/, it's a safe bet that it's running WordPress, but it's not a certainty. Perhaps the site was using WordPress, but when it was migrated to something else the wp-uploads/ path was kept to avoid breaking links and bookmarks.
Sometimes you can know, sometimes you cannot.
If the HTML is generated on the server side, you may not know which programming language generated it. Candidates include PHP, C++, and many other languages. On the server side, for as many ways as you can think of to guess which language it is, there are just as many ways for the technology to hide itself.
Suppose you are a web administrator that wants to hide the server-side technology. Pick one of the techniques listed in another question for attempting to identify the language. For example, the *.php extension for a file. Now, configure your web server to execute C code from a file with a *.php extension. Your users will have no way to view the source (since both languages are equally capable of producing the same output, by Turing completeness), but they will be misled into thinking you are running PHP.
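To show how little effort this takes, here is a minimal sketch of the same trick using Python instead of C: a WSGI application that answers on `/index.php` and sends a fake `X-Powered-By` header. The app name, page body and header value are assumptions chosen for the example.

```python
def decoy_app(environ, start_response):
    """WSGI app written in Python that presents itself as a PHP page."""
    if environ.get("PATH_INFO") == "/index.php":
        status = "200 OK"
        body = b"<html><body>Hello</body></html>"
    else:
        status = "404 Not Found"
        body = b"Not found"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
        # Deliberately misleading: no PHP is involved anywhere.
        ("X-Powered-By", "PHP/7.0.0"),
    ]
    start_response(status, headers)
    return [body]
```

It could be served locally with the standard library's `wsgiref.simple_server.make_server`. A visitor who sees the `.php` extension and the header has no reliable way to tell that Python produced the page.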
Why would someone want to obfuscate the server-side choice of technology? Because CGI languages have various vulnerabilities that are easier to target if the end-users know which of those languages you are using. Misleading the users about which server-side technologies you are using is a very reasonable security measure.
I didn't downvote, but this answer neglects the numerous techniques available for determining the server-side language and tech.