JavaScript Parser for Java


From https://github.com/google/caja/blob/master/src/com/google/caja/parser/js/Parser.java

The grammar below is a context-free representation of the grammar this
parser parses. It departs from ECMA-262 Edition 3 (ES3) wherever
implementations themselves depart from ES3. The rules for semicolon
insertion, and the backtracking in expressions needed to handle it, are
commented thoroughly in the code, since semicolon insertion requires
information from both the lexer and the parser and cannot be determined
with finite lookahead.

Noteworthy features

Reports warnings on a queue, so one error does not prevent later errors from being reported. This lets us report multiple errors in a single compile pass instead of forcing developers to play whack-a-mole.

Does not parse Firefox-style catch (<Identifier> if <Expression>), since those don't work on IE and many other interpreters.

Recognizes const, since many interpreters do (not IE), but warns.

Allows, but warns on, trailing commas in Array and Object constructors.

Allows keywords as identifier names but warns since different interpreters have different keyword sets. This allows us to use an expansive keyword set.

To parse strict code, pass in a PedanticWarningMessageQueue that
converts MessageLevel#WARNING and above to MessageLevel#FATAL_ERROR.

CajaTestCase.js shows how to set up a parser, and fromResource and fromString in the same class show how to get an input of the right kind.

Parsing HTML page containing JS in Java

Selenium's Webdriver is fantastic: http://docs.seleniumhq.org/docs/03_webdriver.jsp

See this answer for an example of what you are trying to do:
Using Selenium Web Driver to retrieve value of a HTML input

JavaScript: Parse Java source code, extract method

The AST is just another JSON object. Try jsonpath.

npm install jsonpath

To extract all methods, just filter on condition node=="MethodDeclaration":

var jp = require('jsonpath');
var methods = jp.query(ast, '$.types..[?(@.node=="MethodDeclaration")]');
console.log(methods);

See here for more JSON path syntax.
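
For readers more at home on the Java side, here is a rough sketch of what that recursive-descent filter does. It assumes the AST has already been loaded into nested java.util.Map and java.util.List objects (how it gets there is left open), and the class and method names are hypothetical, not part of jsonpath or any parser library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: walks a JSON-like tree of Maps and Lists. */
final class AstQuery {

    /** Collects every map at or below tree whose "node" entry equals nodeType. */
    static List<Map<?, ?>> findNodes(Object tree, String nodeType) {
        List<Map<?, ?>> found = new ArrayList<>();
        collect(tree, nodeType, found);
        return found;
    }

    private static void collect(Object value, String nodeType, List<Map<?, ?>> found) {
        if (value instanceof Map) {
            Map<?, ?> map = (Map<?, ?>) value;
            // Same condition as the filter expression: node == nodeType.
            if (nodeType.equals(map.get("node"))) {
                found.add(map);
            }
            // Keep descending, like the ".." operator.
            for (Object child : map.values()) {
                collect(child, nodeType, found);
            }
        } else if (value instanceof List) {
            for (Object child : (List<?>) value) {
                collect(child, nodeType, found);
            }
        }
        // Strings, numbers, booleans and nulls are leaves: nothing to do.
    }
}
```

Calling AstQuery.findNodes(ast.get("types"), "MethodDeclaration") would then return roughly the same nodes as the jsonpath query above.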

Java parser written in JavaScript

Have a look at ANTLR, which can generate parsers with JavaScript as a target, with the Java 1.5 grammar at http://www.antlr.org/grammar/1152141644268/Java.g

Edit: link stopped working - try https://github.com/antlr/grammars-v4/blob/master/java/Java.g4 :)

javascript parser in java

Since you are already parsing your HTML with jsoup, your next step is to traverse each element and check whether it contains JavaScript. Something like this code will check each element:

boolean validateHtml(String html) {
  Document doc = Jsoup.parse(html);
  for(Element e : doc.getAllElements()) {
      if(detectJavascript(e)) {
          return false;
      }
  }
  return true;
}

private boolean detectJavascript(Element e) {
  if(/* Check if element contains javascript */) {
      return true;
  }
  return false;
}

Then, there are several checks you should perform inside the detectJavascript function:

Of course, reject script elements: "script".equals(e.normalName())
Reject elements with a value in any on* attribute (onload, onclick, etc.). You have the complete list here, but it's probably enough to get all attributes with e.attributes() and reject if any of their names starts with "on".
Every attribute that accepts a URL (href, src, etc.) can contain a "javascript:" value that executes JavaScript. You should check all of those too. For a complete (?) list of these attributes, check this other SO question.
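
The two attribute checks above could be sketched as a small helper that needs only the standard library. The class name, the method name and, in particular, the set of URL-bearing attributes are assumptions made for illustration; the set is deliberately short and not exhaustive.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: decides whether a single attribute could execute JavaScript. */
final class AttributeCheck {

    // Assumed, non-exhaustive set of attributes that accept a URL.
    private static final Set<String> URL_ATTRIBUTES =
            Set.of("href", "src", "action", "formaction", "data", "poster");

    static boolean isDangerous(String name, String value) {
        String n = name.trim().toLowerCase(Locale.ROOT);
        // Event handlers: onload, onclick, and so on.
        if (n.startsWith("on")) {
            return true;
        }
        if (URL_ATTRIBUTES.contains(n)) {
            // Conservatively drop whitespace and control characters first,
            // since browsers tolerate some of them inside a URL.
            String v = value.replaceAll("[\\s\\p{Cntrl}]", "").toLowerCase(Locale.ROOT);
            return v.startsWith("javascript:");
        }
        return false;
    }
}
```

Inside detectJavascript, one way to use it would be to loop over e.attributes() and return true as soon as isDangerous(attr.getKey(), attr.getValue()) does.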

Finally, I advise against storing the original HTML in the database, even if it passes your validation. Instead, serialize the document parsed by jsoup back to HTML and store that. This way you make sure you have a well-formed document free of any "dangerous" elements.

Parse JavaScript with Java

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Class name is illustrative; the statements are wrapped so they compile as-is.
public class MoonHeaders {
    public static void main(String[] args) {
        // Each pattern captures the double-quoted value that follows the header name.
        Pattern p1 = Pattern.compile("X-MOON-EXPIRED', \"([^\"]*)\"");
        Pattern p2 = Pattern.compile("X-MOON-TOKEN', \"([^\"]*)\"");
        String html = "<script type=\"text/javascript\"> $(function() {   $.ajaxSetup({     beforeSend: function(xhr) {       xhr.setRequestHeader('X-MOON-EXPIRED', \"1445350653\");       xhr.setRequestHeader('X-MOON-TOKEN', \"10dafe974cc156d2d3b7fd9bb1e4e3ed\");     }   }); }); </script>";
        Matcher m1 = p1.matcher(html);
        Matcher m2 = p2.matcher(html);
        if (!m1.find() || !m2.find()) {
            throw new IllegalStateException("Didn't match");
        }
        System.out.println(String.format("X-MOON-EXPIRED=%s, X-MOON-TOKEN=%s", m1.group(1), m2.group(1)));
    }
}

Prints:

X-MOON-EXPIRED=1445350653, X-MOON-TOKEN=10dafe974cc156d2d3b7fd9bb1e4e3ed
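
If more headers need to be pulled out the same way, the pattern can be built from the header name. This is only a sketch: the class and method names are hypothetical, and it assumes every value is passed as a quoted string literal directly in a setRequestHeader call, as in the snippet above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls a header value out of a setRequestHeader(...) call. */
final class HeaderExtractor {

    static Optional<String> headerValue(String source, String headerName) {
        // Pattern.quote keeps characters in the header name from being read as regex syntax.
        // Either quote style is accepted around both the name and the value.
        Pattern p = Pattern.compile(
                "setRequestHeader\\(\\s*['\"]" + Pattern.quote(headerName)
                        + "['\"]\\s*,\\s*['\"]([^'\"]*)['\"]");
        Matcher m = p.matcher(source);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Returning an Optional leaves it to the caller to decide whether a missing header is an error. Like the original patterns, this is text matching rather than parsing, so it will miss values that are computed or concatenated in the script.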

Using Nashorn to parse JavaScript into a syntax tree

tree.getSourceElements() returns a list of Tree elements; each one has a getKind() method that returns the element's Tree.Kind:

Parser parser = Parser.create();
CompilationUnitTree tree = parser.parse(file, new InputStreamReader(stream), null);

// The loop variable needs its own name; reusing "tree" would not compile.
for (Tree element : tree.getSourceElements()) {
    System.out.println(element.getKind());

    switch (element.getKind()) {
        case FUNCTION:
            [...]
    }
}

If you want to walk the AST, you can then implement the TreeVisitor<R,D> interface to visit the nodes:

Parser parser = Parser.create();
CompilationUnitTree tree = parser.parse(file, new InputStreamReader(stream), null);

if (tree != null) {
    tree.accept(new BasicTreeVisitor<Void, Void>() {
        public Void visitFunctionCall(FunctionCallTree functionCallTree, Void v) {
            System.out.println("Found a functionCall: " + functionCallTree.getFunctionSelect().getKind());
            return null;
        }
    }, null);
}
