Language Detection with Data in Postgresql

Language detection with data in PostgreSQL

Try these:

http://code.google.com/p/language-detection/ (Java)
http://code.google.com/p/chromium-compact-language-detector/ (C++/Python)

This blog post shares some tests to compare the 2 libraries (along with a 3rd - the Language Identification module of Apache Tika, which really is a complete toolkit for Text Analysis).

Efficient approach for language tags (small tag set) in rdbms queries

To make it easy to query you can create a view to avoid the constant joining.

create table object (
    id serial unique,
    object text primary key
);

create table tag (
    id serial unique,
    tag text primary key
);

create table object_tag (
    object_id integer references object(id),
    tag_id integer references tag(id)
);

insert into tag (tag) values ('English'), ('French'), ('German');
insert into object (object) values ('o1'), ('o2');
insert into object_tag (object_id, tag_id) values (1, 1), (1, 2), (2, 3);

create view v_object_tag as
select o.id object_id, o.object, t.id tag_id, t.tag
from
    object o
    inner join
    object_tag ot on o.id = ot.object_id
    inner join
    tag t on t.id = ot.tag_id
;

Now query as if it were a single table:

select *
from v_object_tag
where tag in ('English', 'German')
;
 object_id | object | tag_id |   tag   
-----------+--------+--------+---------
         1 | o1     |      1 | English
         2 | o2     |      3 | German

How to select rows which are in russian?

You could check for certain ranges of characters.

For example, this query checks if the string contains a Cyrillic letter:

SELECT id FROM mytable
WHERE message ~ '[АаБбВвГгДдЕеЁёЖжЗзИиЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЪъЫыЬьЭэЮюЯя]';

plpgsql - column type auto detect by returning a resultset

I don't think this is possible in pl/pgsql because, it strongly depends on user defined types. Sadly this language is not smart enough for type auto detection... I think my first possible solution I'll use, it solves the problem partially because at least I won't need to refactor every function manually by type change of a table column.

1.) Possible solution with asking column types:

CREATE FUNCTION test ()
  RETURNS TABLE (id "user".user_id%TYPE, name "user".user_name%TYPE, email "user".user_email%TYPE)
AS
  $BODY$
  BEGIN
    return QUERY SELECT
      "user".user_id, "user".user_name, "user".user_email
    FROM
      "user";
  END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;

With this at least the type is not redundant. Not the best, but acceptable.

2.) Possible solution with SETOF RECORD:

CREATE FUNCTION test ()
  RETURNS SETOF RECORD
AS
  $BODY$
  BEGIN
    RETURN QUERY SELECT
      "user".user_id AS id, "user".user_name AS name, "user".user_email AS email
    FROM
      "user";
  END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;

Without column type definition list I got the following error:

ERROR: a column definition list is required for functions returning
"record"

So I have to use it like this:

SELECT * FROM test() AS (id INT, name VARCHAR, email VARCHAR);

Instead of this:

SELECT * FROM test()

I got every column in string by the php client, so the column type definition is more than useless for me... This solution would be the best without column type definition, but with it not acceptable.

It is possible to use this similar to table:

CREATE FUNCTION test (OUT id "user".user_id%TYPE, OUT name "user".user_name%TYPE, OUT email "user".user_email%TYPE)
  RETURNS SETOF RECORD
AS
  $BODY$
  BEGIN
    RETURN QUERY SELECT
      "user".user_id, "user".user_name, "user".user_email
    FROM
      "user";
  END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;

For example I could set everything to TEXT:

CREATE FUNCTION test (OUT id TEXT, OUT name TEXT , OUT email TEXT )
  RETURNS SETOF RECORD
AS
  $BODY$
  BEGIN
    RETURN QUERY SELECT
      "user".user_id::TEXT , "user".user_name::TEXT , "user".user_email::TEXT 
    FROM
      "user";
  END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;

This works, but this is far from type auto detection, and it would result a lot of redundant text converter code...

3.) Possible solution with refcursors:

CREATE FUNCTION test ()
  RETURNS SETOF "user"
AS
  $BODY$
  DECLARE
    refc "user";
  BEGIN
    FOR refc IN 
      SELECT
        "user".user_id, "user".user_name, "user".user_email
      FROM
        "user"
    LOOP
        RETURN NEXT refc;
    END LOOP ;
    RETURN ;
  END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;

This fills out the lacking columns with null values, I cannot name the columns, and sql loops are very slow... So this is not acceptable.

By refcursors there is another way: to return the refcursor itself, but it is not acceptable because I cannot use it as a normal variable, I have to give a string as cursor name... Btw I did not manage to use the refcursor itself as result in phpstorm. I got jdbc cursor not found error. Maybe I set the name wrong, I don't know, I don't think it's worth more effort.

Full text search index on a multilingual column

Is it possible to create an index that works for several languages ?

Yes, but you need a second column that identifies the language of the text. Say you added a column doc_language to the table; you could then write:

CREATE INDEX title_idx ON shows USING gin(to_tsvector(doc_language, title));

Of course, this requires that you know the language of the subject text, something that can be hard to do in practice. If you don't need stemming, etc, you can just use the language simple, but I'm guessing you would've done that already if it were an option.

As an alternative, if you have a fixed and limited set of languages, you can concatenate the vectors for the different languages. E.g.:

regress=> SELECT to_tsvector('english', 'cafés') || to_tsvector('french', 'cafés') || to_tsvector('simple', 'cafés');
          ?column?          
----------------------------
 'caf':2 'café':1 'cafés':3
(1 row)

That'll match a tsquery for cafés in any of those three languages.

As an index:

CREATE INDEX title_idx ON shows USING gin((
    to_tsvector('english', title) || 
    to_tsvector('french', title) || 
    to_tsvector('simple', title)
));

but this is clumsy to use in queries, as the planner isn't very smart about matching index quals. So I'd wrap it in a function:

CREATE FUNCTION to_tsvector_multilang(text) RETURNS tsvector AS $$
SELECT to_tsvector('english', $1) || 
       to_tsvector('french', $1) || 
       to_tsvector('simple', $1)
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX title_idx ON shows USING gin(to_tsvector_multilang(title));

If you want you can even get fancy: pass the list of languages as an array (but remember it'll have to be exactly the same order for an index qual match to work). Use priorities with setweight, so you prefer a match in English to one in French, say. All sorts of options.

Language Detection with Data in Postgresql