How to Use Group by on a Clob Column with Oracle

How to use GROUP BY on a CLOB column with Oracle?

After some fixes it seems that the major issue was in the group by

YOu have to use the same tables in the SELECT and in the GROUP BY

I also take only a substring of the CLOB to get it works. THe working request is :

    SELECT TABLE_A.ID,
       TABLE_A.VA,
       B.TRACE
FROM
(SELECT A.T_ID ID,
          MAX(A.V) VA
   FROM BDD.LOG A
   GROUP BY A.T_ID HAVING MAX(A.V) <= '1.00') TABLE_A,
                                                                BDD.T B
WHERE TABLE_A.ID = B.T_id;

Group by with CLOB in select-statement

A CLOB value cannot be used for grouping or inside a distinct clause.

The only chance you have is to convert the CLOB to a varchar but that means you cannot compare the complete contents of the column (note: those are columns, not rows). If you are certain that all your CLOB values are smaller than 8000 bytes, you can use something like this:

select min(dbms_lob.substr(column1)), column2
from foo
group by column2;

GROUP BY invloving a CLOB data

A fairly simple option is to modify your current query to add an analytic function that finds the maximum date for each ID, as something like:

..., max(mst.last_updated_date) over (partition by id) as max_updated_date

A quick demo of the general idea:

with cte (id, last_updated_date, sql_statement) as (
  select 1, date '2017-01-01', to_clob('stmt 1') from dual
  union all select 1, date '2017-01-02', to_clob('stmt 2') from dual
  union all select 1, date '2017-01-03', to_clob('stmt 3') from dual
  union all select 2, date '2017-01-02', to_clob('stmt 4') from dual
)
select id, last_updated_date, sql_statement
from (
  select id, last_updated_date, sql_statement,
    max(last_updated_date) over (partition by id) as max_updated_date
  from cte
)
where last_updated_date = max_updated_date;

        ID LAST_UPDAT SQL_STATEMENT                                                                   
---------- ---------- --------------------------------------------------------------------------------
         1 2017-01-03 stmt 3                                                                          
         2 2017-01-02 stmt 4

You could use a row_number() or rank() or dense_rank() to identify the row with the earliest date and filter on that instead, but the general idea is the same.

However, your current query isn't very clear (or valid before 12c) to begin with. Rather than trying to guess how to include such a function and filter, it's probably simpler to start again from your base tables, though this makes a lot of assumptions about what you're doing, and may ignore some things - like left and right joins - that may or may not actually be needed.

Making up some data via CTEs:

with test_1 (sr_id, last_updated_date) as (
  select 1108197, timestamp '2017-02-05 23:56:59' from dual
  union all select 1108217, timestamp '2017-02-14 00:37:39' from dual
  union all select 1108218, timestamp '2017-02-14 01:39:50' from dual
  union all select 1108220, timestamp '2017-02-14 03:39:07' from dual
),
test_2 (sm_id, segmentation_name, sql_statement) as (
  select 958, 'test_not_in', to_clob('select * from dual') from dual
),
test_3 (q_id, sr_id, answer) as (
  select 41, 1108197, 958 from dual
  union all select 42, 1108217, 958 from dual
  union all select 43, 1108218, 958 from dual
  union all select 44, 1108220, 958 from dual
),
test_4 (q_id, field_id) as (
  select 41, 'LM_LRE_Q6' from dual
  union all select 42, 'LM_LRE_Q6' from dual
  union all select 43, 'LM_LRE_Q6' from dual
  union all select 44, 'LM_LRE_Q6' from dual
)

then this gets the same output you showed in the question:

select t1.sr_id,
  t2.sm_id,
  t2.segmentation_name,
  to_char(t1.last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date,
  t2.sql_statement
from test_4 t4
join test_3 t3 on t3.q_id = t4.q_id
join test_2 t2 on t2.sm_id = t3.answer
join test_1 t1 on t1.sr_id = t3.sr_id;

     SR_ID SM_ID SEGMENTATIO LAST_UPDATED_DATE             SQL_STATEMENT                                                                   
---------- ----- ----------- ----------------------------- --------------------------------------------------------------------------------
   1108197   958 test_not_in 05-feb-2017 23:56:59          select * from dual                                                              
   1108217   958 test_not_in 14-feb-2017 00:37:39          select * from dual                                                              
   1108218   958 test_not_in 14-feb-2017 01:39:50          select * from dual                                                              
   1108220   958 test_not_in 14-feb-2017 03:39:07          select * from dual

On the wild assumption that is close to right, you could find the row with the latest date for each sm_id with something like this:

select sr_id,
  sm_id,
  segmentation_name,
  to_char(last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date,
  sql_statement
from (
  select t1.sr_id,
    t2.sm_id,
    t2.segmentation_name,
    t1.last_updated_date,
    t2.sql_statement,
    max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date
  from test_4 t4
  join test_3 t3 on t3.q_id = t4.q_id
  join test_2 t2 on t2.sm_id = t3.answer
  join test_1 t1 on t1.sr_id = t3.sr_id
)
where last_updated_date = max_updated_date;

     SR_ID SM_ID SEGMENTATIO LAST_UPDATED_DATE             SQL_STATEMENT                                                                   
---------- ----- ----------- ----------------------------- --------------------------------------------------------------------------------
   1108220   958 test_not_in 14-feb-2017 03:39:07          select * from dual

You will need to adapt that to deal with any other restrictions or requirements that weren't clear (including your left/right outer joins, for instance).

I've deliberately ignored the the subquery you were doing to split the 'answer' into multiple values. It's possible you have something horrible like a delimited list of IDs in there, which is a data model problem. If that is the case then you'll still need to extract the individual sm_id values; something like:

with answer_extraction as (
  select q_id, sr_id, regexp_substr(answer, '\d+', 1, level) as sm_id
  from test_3
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '\d+', 1, level) is not null
)
select sr_id,
  sm_id,
  segmentation_name,
  to_char(last_updated_date, 'dd-mon-yyyy hh24:mi:ss') as last_updated_date,
  sql_statement
from (
  select t1.sr_id,
    t2.sm_id,
    t2.segmentation_name,
    t1.last_updated_date,
    t2.sql_statement,
    max(t1.last_updated_date) over (partition by t2.sm_id) as max_updated_date
  from test_4 t4
  join answer_extraction t3 on t3.q_id = t4.q_id
  join test_2 t2 on t2.sm_id = t3.sm_id
  join test_1 t1 on t1.sr_id = t3.sr_id
)
where last_updated_date = max_updated_date;

Based on the actual contents of test3 you added, your regular expression isn't doing quite what you need. With the pattern you're using it finds 14 numeric values, i.e. any numbers:

with test_3 (q_id, sr_id, answer) as (
  select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual
  union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual
  union all select 1009330, 1108239, '972~test_emea' from dual
  union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual
  union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual
),
answer_extraction as (
  select q_id, sr_id, regexp_substr(answer, '\d+', 1, level) as sm_id
  from test_3
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '\d+', 1, level) is not null
)
select * from answer_extraction;

      Q_ID      SR_ID SM_ID     
---------- ---------- ----------
   1009330    1108239 972       
   1009330    1108240 972       
   1009330    1108240 827       
   1009330    1108246 976       
   1009330    1108246 24        
   1009330    1108246 941       
   1009330    1108247 941       
   1009330    1108247 2016      
   1009330    1108247 787       
   1009330    1108247 28        
   1009330    1108247 976       
   1009330    1108247 24        
   1009330    1108251 981       
   1009330    1108251 100579729

It appears you only want the bits between the ^ delimiters and the ~ markers. A common way to split a delimited string is:

with test_3 (q_id, sr_id, answer) as (
  select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual
  union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual
  union all select 1009330, 1108239, '972~test_emea' from dual
  union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual
  union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual
),
answer_extraction as (
  select q_id, sr_id, regexp_substr(answer, '[^^]+', 1, level) as sm_id
  from test_3
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '[^^]+', 1, level) is not null
)
select * from answer_extraction;

      Q_ID      SR_ID SM_ID                                   
---------- ---------- ----------------------------------------
   1009330    1108239 972~test_emea                           
   1009330    1108240 972~test_emea                           
   1009330    1108240 827~test_with_region_country            
   1009330    1108246 976~feb_24                              
   1009330    1108246 941~Test_regionwithcountry              
   1009330    1108247 941~Test_regionwithcountry_2016         
   1009330    1108247 787~Test_Request_28                     
   1009330    1108247 976~feb_24                              
   1009330    1108251 981~MSE100579729 testing.

but you then need to get the first part of that, e.g. borrowing your original pattern (others are available!):

column sm_id format a10
with test_3 (q_id, sr_id, answer) as (
  select 1009330, 1108246, '976~feb_24^941~Test_regionwithcountry' from dual
  union all select 1009330, 1108247, '941~Test_regionwithcountry_2016^787~Test_Request_28^976~feb_24' from dual
  union all select 1009330, 1108239, '972~test_emea' from dual
  union all select 1009330, 1108240, '972~test_emea^827~test_with_region_country' from dual
  union all select 1009330, 1108251, '981~MSE100579729 testing.' from dual
),
answer_extraction as (
  select q_id, sr_id,
    regexp_substr(regexp_substr(answer, '[^^]+', 1, level), '\d+') as sm_id
  from test_3
  connect by q_id = prior q_id
  and sr_id = prior sr_id
  and prior dbms_random.value is not null
  and regexp_substr(answer, '[^^]+', 1, level) is not null
)
select * from answer_extraction;

      Q_ID      SR_ID SM_ID     
---------- ---------- ----------
   1009330    1108239 972       
   1009330    1108240 972       
   1009330    1108240 827       
   1009330    1108246 976       
   1009330    1108246 941       
   1009330    1108247 941       
   1009330    1108247 787       
   1009330    1108247 976       
   1009330    1108251 981

Notice that extra regexp_substr() is only in the select list, not the connect-by clause; and that the extract sm_id is still a string. If test_2.sm_id is a number then add a to_number() call around the pair of substrings in that select list too.

Group by with a substr that is part of a clob

SUBSTR doesn't work with CLOBs - you need DBMS_LOB.SUBSTR:

SELECT 
  dbms_lob.substr(nd.nne_def,4,4) as bzn
FROM 
  core_def.nne_def nd
group by dbms_lob.substr(nd.nne_def,4,4);

How to Use Group by on a Clob Column with Oracle