Programming Language Independent Model Validation

A programming language designed to be testable

Google is working on Noop, an object-oriented, Java-based language created to produce code that is always testable.

Noop is a new language that will run on the Java Virtual Machine, and in source form will look similar to Java. The goal is to build dependency injection and testability into the language from the beginning, rather than rely on third-party libraries as other languages do.

The language also explicitly forbids certain constructs that make code harder to test, such as statics.
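
A minimal Java sketch of the idea (the PaymentGateway and checkout classes below are invented for illustration; this is plain Java, not Noop syntax): a collaborator reached through a static field cannot be swapped for a fake in a unit test, while a constructor-injected one can.

    // Hypothetical collaborator used only for this illustration.
    interface PaymentGateway {
        boolean charge(int cents);
    }

    // A static singleton: the kind of construct Noop disallows.
    final class GlobalGateway {
        static final PaymentGateway INSTANCE = cents -> cents > 0; // stand-in implementation
        private GlobalGateway() {}
    }

    // Hard to test: the collaborator is reached through a static field,
    // so a unit test cannot substitute a fake gateway.
    class StaticCheckout {
        boolean pay(int cents) {
            return GlobalGateway.INSTANCE.charge(cents);
        }
    }

    // Easier to test: the collaborator is injected through the constructor,
    // which is the style Noop is meant to make the default.
    class InjectedCheckout {
        private final PaymentGateway gateway;

        InjectedCheckout(PaymentGateway gateway) {
            this.gateway = gateway;
        }

        boolean pay(int cents) {
            return gateway.charge(cents);
        }
    }

In a test, new InjectedCheckout(cents -> true) stands in for the real gateway, with no static state to set up or reset.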


Whose responsibility is it to check data validity?

Both: consumer-side (client) validation and provider-side (API) validation.

Clients should do it because it makes for a better user experience: why do a network round trip just to be told that you've got one bad text field?

Providers should do it because they should never trust clients (e.g. XSS and man-in-the-middle attacks). How do you know the request wasn't intercepted? Validate everything.

There are several levels of valid:

  1. All required fields are present and in the correct formats. This is what the client validates.
  2. #1, plus valid relationships between fields (e.g. if X is present then Y is required).
  3. #1 and #2, plus business validity: the data meets all business rules for proper processing.

Only the provider side can do #2 and #3.
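
As a rough sketch of the three levels in Java (the OrderRequest fields, formats, and rules below are invented purely for illustration, not taken from any particular API):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical request object used only to illustrate the three levels.
    class OrderRequest {
        String customerId;   // required
        String couponCode;   // optional
        String couponExpiry; // required only when couponCode is present
        int quantity;        // must also satisfy business rules (e.g. stock on hand)
    }

    class OrderValidator {
        // Level 1: required fields present and correctly formatted.
        // A client can (and should) run equivalent checks before sending the request.
        static List<String> validateFields(OrderRequest r) {
            List<String> errors = new ArrayList<>();
            if (r.customerId == null || !r.customerId.matches("C\\d{6}")) {
                errors.add("customerId missing or malformed");
            }
            if (r.quantity <= 0) {
                errors.add("quantity must be positive");
            }
            return errors;
        }

        // Level 2: relationships between fields.
        static List<String> validateRelationships(OrderRequest r) {
            List<String> errors = new ArrayList<>();
            if (r.couponCode != null && r.couponExpiry == null) {
                errors.add("couponExpiry is required when couponCode is present");
            }
            return errors;
        }

        // Level 3: business rules; only the provider has the data to check these.
        static List<String> validateBusinessRules(OrderRequest r, int unitsInStock) {
            List<String> errors = new ArrayList<>();
            if (r.quantity > unitsInStock) {
                errors.add("not enough stock to fulfil the order");
            }
            return errors;
        }
    }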

Where to start programming?

It seems like everyone who's posted an answer to this question names a different place to start. The wide variation in starting points is a perfect illustration of where you should really start: wherever works best for you.

Different people have different ways of approaching problems. Often the success of a project is independent of the initial approach that one takes. Take time to think about and try different areas to focus on first and find out what's right for you.

EDIT: On a more abstract level, this article by Paul Graham offers good insight on a Lisp-style, bottom-up approach to programming.

Multi-language input validation with UTF-8 encoding

You can approximate the Unicode derived property \p{Alphabetic} pretty succinctly with [\pL\pM\p{Nl}] if your language doesn’t support a proper Alphabetic property directly.

Don’t use Java’s \p{Alpha}, because that’s ASCII-only.
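
A quick Java check of the difference (the test string is arbitrary): by default \p{Alpha} is a POSIX-style, ASCII-only class, while \p{L} uses the Unicode Letter property; the UNICODE_CHARACTER_CLASS flag makes the POSIX names Unicode-aware.

    import java.util.regex.Pattern;

    public class AlphaDemo {
        public static void main(String[] args) {
            String name = "José";
            // POSIX-style class: ASCII-only by default, so the accented letter fails.
            System.out.println(name.matches("\\p{Alpha}+")); // false
            // Unicode Letter property: matches the accented letter.
            System.out.println(name.matches("\\p{L}+"));     // true
            // With UNICODE_CHARACTER_CLASS, \p{Alpha} becomes Unicode-aware.
            System.out.println(Pattern.compile("\\p{Alpha}+", Pattern.UNICODE_CHARACTER_CLASS)
                                      .matcher(name).matches()); // true
        }
    }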

But then you’ll notice that you’ve failed to account for dashes (\p{Pd} or DashPunctuation works, but that does not include most of the hyphens!), apostrophes (usually but not always one of U+27, U+2BC, U+2019, or U+FF07), comma, or full stop/period.

You probably had better include \p{Pc} (ConnectorPunctuation) as well, just in case.

If you have the Unicode derived property \p{Diacritic}, you should use that, too, because it includes things like the mid-dot needed for geminated L’s in Catalan and the non-combining forms of diacritic marks which people sometimes use.

But then you’ll find people who use ordinal numbers in their names in ways that \p{Nl} (LetterNumber) doesn’t accommodate, so you throw \p{Nd} (DecimalNumber) or even all of \pN (Number) into the mix.

Then you realize that Asian names often require the use of ZWJ or ZWNJ to be written correctly in their scripts, so then you have to add U+200D and U+200C to the mix, which are both \p{Cf} (Format) characters and indeed also JoinControl ones.
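
Pulling the pieces mentioned so far into one Java character class (a sketch of how the pattern keeps growing, not a recommendation to actually validate names this way):

    import java.util.regex.Pattern;

    public class NameCharacters {
        // Letters, marks, letter-numbers, plus the extras discussed above:
        // dashes, connector punctuation, decimal digits, a few apostrophe
        // code points, space, comma, full stop, and the ZWNJ/ZWJ join controls.
        private static final Pattern NAME = Pattern.compile(
            "[\\p{L}\\p{M}\\p{Nl}\\p{Pd}\\p{Pc}\\p{Nd}"
            + "'\\u2019\\u02BC\\uFF07"   // U+0027, U+2019, U+02BC, U+FF07
            + "\\u200C\\u200D"           // ZWNJ, ZWJ
            + " .,]+");

        public static void main(String[] args) {
            System.out.println(NAME.matcher("Nuño O'Brien-Løvgren, Jr.").matches()); // true
        }
    }

Even this misses things (as far as I know, Java has no escape for the \p{Diacritic} derived property, for instance), which is exactly the point of the paragraph that follows.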

By the time you’re done looking up the various Unicode properties for the various and many exotic characters that keep cropping up — or when you think you’re done, rather — you’re almost certain to conclude that you would do a much better job at this if you simply allowed them to use whatever Unicode characters for their name that they wish, as the link Tim cites advises. Yes, you’ll get a few jokers putting in things like “əɯɐuʇƨɐ⅂ əɯɐuʇƨɹᴉℲ”, but that just goes with the territory, and you can’t preclude silly names in any reasonable way.

Is there any regular expression engine that works with multiple programming languages?

Your impression is incorrect. "Perl-compatible regular expressions" are widely supported, largely by using the same engine in the background. In PHP, you get them with the preg_ function family. In Python, they're what the re module supports. Even MySQL supports this regular expression style (with RLIKE), in addition to traditional SQL syntax. Languages that don't support the full Perl syntax often support a compatible subset.
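
For instance, a Perl-style pattern typically carries over unchanged; here it is in Java (java.util.regex is not PCRE itself, but it supports most of the same syntax, and the pattern text is what you would hand to preg_match() or re.search()):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PerlStyleDemo {
        public static void main(String[] args) {
            // The same pattern text works in PHP's preg_match() and Python's re.search();
            // only the host language's string-escaping rules differ.
            Pattern date = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");
            Matcher m = date.matcher("Released on 2009-09-18.");
            if (m.find()) {
                System.out.println(m.group(1)); // 2009
            }
        }
    }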

I can't offer a full list of languages that support it; I'm not sure the question can even be answered: would you count gaming environments and the like if they embed regular expressions in their command language? But does it matter? Look for regexp support in the languages you are interested in, and if you don't find full PCRE support, chances are you'll find a good subset.

The main incompatible regexp families are SQL's LIKE syntax, shell-style "globs" (simpler than full regexps), and various Unix tools that, for historical reasons, stick to older variants of regexp syntax: grep (so-called "basic" regular expressions by default), sed, etc. (Keep in mind that grep and sed predate not only Perl, but the very culture of compatible implementations.)


