How to Detect the Presence of a URL in a String

How can I detect if a string contains a URL in Java?

Please see: http://download.oracle.com/javase/6/docs/api/java/net/URL.html

import java.net.URL;
import java.net.MalformedURLException;

// Replaces URLs with HTML anchor tags
public class URLInString {
    public static void main(String[] args) {
        String s = args[0];
        // Separate the input by whitespace (URLs don't contain spaces)
        String[] parts = s.split("\\s");

        // Attempt to convert each item into a URL.
        for (String item : parts) {
            try {
                URL url = new URL(item);
                // If it parses, replace it with an anchor...
                System.out.print("<a href=\"" + url + "\">" + url + "</a> ");
            } catch (MalformedURLException e) {
                // Not a URL: print the item unchanged.
                System.out.print(item + " ");
            }
        }

        System.out.println();
    }
}

Obtained from: How to detect the presence of URL in a string
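
The same split-and-parse idea can be sketched in JavaScript, the language used for the later examples on this page. This is only an illustration, not part of the original answer: `linkifyByParsing` is a hypothetical name, and the sketch assumes a runtime that provides the global WHATWG `URL` constructor (current browsers and Node.js do). Unlike `java.net.URL`, which rejects protocols it does not know, this parser accepts any scheme, so the sketch checks the protocol explicitly.

```javascript
// Hypothetical helper: split on whitespace and link every token that parses
// as an http(s) URL. Assumes a global WHATWG URL constructor is available.
function linkifyByParsing(text) {
  return text.split(/\s+/).map(function (item) {
    try {
      var url = new URL(item); // throws a TypeError if item is not an absolute URL
      // The parser accepts any scheme (mailto:, foo:, ...), so filter here.
      if (url.protocol !== 'http:' && url.protocol !== 'https:') {
        return item;
      }
      // Note: item is inserted as-is; no HTML escaping is done in this sketch.
      return '<a href="' + item + '">' + item + '</a>';
    } catch (e) {
      // Not a URL: keep the token unchanged.
      return item;
    }
  }).join(' ');
}

console.log(linkifyByParsing('see http://www.example.com now'));
```

Because tokens are rejoined with single spaces, runs of whitespace in the input are collapsed, just as in the Java version above.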

Detect URLs in text with JavaScript

First you need a good regex that matches URLs. This is hard to do:

...almost anything is a valid URL. There
are some punctuation rules for
splitting it up. Absent any
punctuation, you still have a valid
URL.

Check the RFC carefully and see if you
can construct an "invalid" URL. The
rules are very flexible.

For example ::::: is a valid URL.
The path is ":::::". A pretty
stupid filename, but a valid filename.

Also, ///// is a valid URL. The
netloc ("hostname") is "". The path
is "///". Again, stupid. Also
valid. This URL normalizes to "///"
which is the equivalent.

Something like "bad://///worse/////"
is perfectly valid. Dumb but valid.

Anyway, this answer is not meant to give you the best regex, but rather to demonstrate how to wrap the URLs inside the text with JavaScript.

OK, so let's just use this one: /(https?:\/\/[^\s]+)/g

Again, this is a bad regex. It will have many false positives. However it's good enough for this example.

function urlify(text) {
  var urlRegex = /(https?:\/\/[^\s]+)/g;
  return text.replace(urlRegex, function(url) {
    return '<a href="' + url + '">' + url + '</a>';
  });
  // or alternatively
  // return text.replace(urlRegex, '<a href="$1">$1</a>')
}

var text = 'Find me at http://www.example.com and also at http://stackoverflow.com';
var html = urlify(text);

console.log(html)
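
One concrete weakness of this regex is that `[^\s]+` also swallows sentence punctuation that directly follows a URL, and the matched text is placed into the HTML unescaped. The sketch below is one possible way to soften both problems; `urlifySafe` and `escapeHtml` are hypothetical names introduced here, and the set of characters treated as trailing punctuation is an assumption, not a rule.

```javascript
// Hypothetical helper: escape characters that are special in HTML.
function escapeHtml(s) {
  var map = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };
  return s.replace(/[&<>"']/g, function (c) {
    return map[c];
  });
}

// Hypothetical variant of urlify: same regex, but trailing punctuation is
// kept outside the link and the URL is escaped before being inserted.
function urlifySafe(text) {
  var urlRegex = /https?:\/\/[^\s]+/g;
  return text.replace(urlRegex, function (match) {
    // Assumed set of characters that usually belong to the sentence, not the URL.
    var trailing = (match.match(/[.,;:!?)\]'"]+$/) || [''])[0];
    var url = match.slice(0, match.length - trailing.length);
    var safe = escapeHtml(url);
    return '<a href="' + safe + '">' + safe + '</a>' + escapeHtml(trailing);
  });
}

console.log(urlifySafe('Find me at http://www.example.com.'));
```

Only the matched URLs are escaped here; text outside the matches is passed through unchanged, so the surrounding text still needs its own escaping if it comes from untrusted input.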

Detecting a (naughty or nice) URL or link in a text string

I'm concentrating my answer on trying to avoid spammers. This leads to two assumptions: the people using the system will be actively trying to circumvent your check, and your goal is only to detect the presence of a URL, not to extract the complete URL. This solution would look different if your goal is something else.

I think your best bet is going to be with the TLD. There are the two-letter ccTLDs and the (currently) comparatively small list of others. These need to be prefixed by a dot and suffixed by either a slash or some word boundary. As others have noted, this isn't going to be perfect. There's no way to get "buyfunkypharmaceuticals . it" without disallowing the legitimate "I tried again. it doesn't work" or similar. All of that said, this would be my suggestion:

[^\s]\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)(\b|/)

Things this will get:

  • buyfunkypharmaceuticals.it
  • google.com
  • http://stackoverflow.com/questions/700163/ (matching "w.com/")
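
The TLD check can be tried out with a few lines of JavaScript. This is a sketch under stated assumptions rather than the answer's exact pattern: the word boundary is written as `\b` outside of brackets, because inside a character class JavaScript reads `\b` as a backspace character, and the prefix is taken to mean "any non-whitespace character". `containsUrl` is a hypothetical name.

```javascript
// Adapted TLD pattern: a non-space character, a dot, a two-letter or listed
// TLD, then a word boundary or a slash. No "g" flag, so test() is stateless.
var tldRegex = /[^\s]\.([a-zA-Z]{2}|aero|asia|biz|cat|com|coop|edu|gov|info|int|jobs|mil|mobi|museum|name|net|org|pro|tel|travel)(\b|\/)/;

// Hypothetical helper: true if the text appears to contain a URL.
function containsUrl(text) {
  return tldRegex.test(text);
}

var samples = [
  'buyfunkypharmaceuticals.it',
  'google.com',
  'http://stackoverflow.com/questions/700163/',
  "I tried again. it doesn't work",
  'buyfunkypharmaceuticals . it'
];

samples.forEach(function (s) {
  console.log(containsUrl(s) + '  ' + s);
});
```

The first three samples are flagged and the last two are not. A sentence with a missing space after the period, such as "Done.It works", is also flagged, which is the kind of collateral damage this approach accepts.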

It will of course break as soon as people start obfuscating their URLs, replacing "." with " dot ". But, again assuming spammers are your goal here, if they start doing that sort of thing, their click-through rates are going to drop another couple of orders of magnitude toward zero. The set of people informed enough to deobfuscate a URL and the set of people uninformed enough to visit spam sites have, I think, a minuscule intersection. This solution should let you detect all URLs that are copy-and-pasteable to the address bar, whilst keeping collateral damage to a bare minimum.


