Das heutige Blog-Posting ist nicht aus meiner Feder, vielmehr war
mein Kollege
Rainer Willems
so nett, ein Posting zum neuen Oracle12c-Feature
SQL Pattern Matching zu
schreiben. Meine Aufgabe ist nun lediglich, das Posting auf dem Blog zu veröffentlichen,
was ich hiermit gerne tue.
Heute wollen wir uns mit SQL Pattern Matching beschäftigen. Mit
Oracle 10g wurden reguläre Ausdrücke in der Datenbank eingeführt, um
nach Mustern in Zeichenketten zu suchen. SQL Pattern Matching bieten nun
die Möglichkeit, auch in über Datenströme deklarativ nach Mustern zu
suchen. Im folgenden sehen wir ein kleines Beispiel und entwickeln
anhand eines Aktienkursverlaufes Suchen nach Mustern darin. Beginnen wir
mit einer Tabelle mit Beispieldaten.
CREATE TABLE TICKER (
TSTAMP DATE,
SYMBOL VARCHAR2(5),
PRICE NUMBER );
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('01-JAN-14','DD-MON-RR'),'abc',13);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('02-JAN-14','DD-MON-RR'),'abc',13.4);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('03-JAN-14','DD-MON-RR'),'abc',13.8);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('04-JAN-14','DD-MON-RR'),'abc',13.1);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('05-JAN-14','DD-MON-RR'),'abc',13.3);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('06-JAN-14','DD-MON-RR'),'abc',13.9);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('07-JAN-14','DD-MON-RR'),'abc',14.1);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('08-JAN-14','DD-MON-RR'),'abc',14.7);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('09-JAN-14','DD-MON-RR'),'abc',14.7);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('10-JAN-14','DD-MON-RR'),'abc',14.6);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('11-JAN-14','DD-MON-RR'),'abc',14.8);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('12-JAN-14','DD-MON-RR'),'abc',14.9);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('13-JAN-14','DD-MON-RR'),'abc',14.6);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('14-JAN-14','DD-MON-RR'),'abc',14.5);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('15-JAN-14','DD-MON-RR'),'abc',14.7);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('16-JAN-14','DD-MON-RR'),'abc',14.9);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('17-JAN-14','DD-MON-RR'),'abc',14.3);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('18-JAN-14','DD-MON-RR'),'abc',14.4);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('19-JAN-14','DD-MON-RR'),'abc',14.6);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('01-JAN-14','DD-MON-RR'),'xyz',17);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('02-JAN-14','DD-MON-RR'),'xyz',16.5);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('03-JAN-14','DD-MON-RR'),'xyz',16.9);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('04-JAN-14','DD-MON-RR'),'xyz',16.5);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('05-JAN-14','DD-MON-RR'),'xyz',16.3);
/
Die Tabelle enthält nun Aktienkurse von 2 Symbolen. Als erstes
wollen wir ein simples SQL Pattern Matching Statement verwenden, welches
auf den ersten Blick keinen Mehrwert zu einer normalen Abfrage liefert:
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES e.tstamp as st, e.symbol as s, e.price as p
ONE ROW PER MATCH
PATTERN (e)
DEFINE e AS price=price);
- MATCH_RECOGNIZE ist das
Schlüsselwort für ein SQL Pattern Matching Statement.
- PARTITION BY gibt an, nach welchem
Attribut die Daten zur Analyse (logisch) partitioniert werden sollen. Pattern Matching
findet stets innerhalb einer Partition statt und gefundene Muster gehen niemals
über mehrere Partitionen. Hier ist
es das Stock-Symbol, da wir Ausagen über die Entwicklung eines solchen
treffen wollen.
- ORDER BY legt die Reihenfolge der
Datensätze fest, in der wir nach einem Muster suchen möchten. Dies ist
hier sonnvollerweise die Zeitachse.
- MEASURES gibt an, welche Informationen
zurückgegeben werden, also von der SQL Pattern Matching-Klausel an die
umgebende SQL-Query zurückgegeben werden. Der Alias e
referenziert hierbei einen Teil des Musters, der ...
- ... im Bereich PATTERN
definiert wird. Diese Aliasnamen sind frei wählbar und legen zunächst fest, aus
welchen Abschnitten ein Muster besteht.
- DEFINE legt fest, was konkret unter den benannten
Abschnitten zu verstehen ist. Ist ein in PATTERN verwendeter Abschnitt
nicht unter DEFINE definiert, so matcht dieser alle Zeilen.
- ONE ROW PER MATCH (Default) bestimmt,
dass pro vollständigen Treffer nur eine Zeile zurückgegeben werden soll.
SYMBOL ST S P
abc 01-JAN-14 abc 13
abc 02-JAN-14 abc 13.4
abc 03-JAN-14 abc 13.8
abc 04-JAN-14 abc 13.1
abc 05-JAN-14 abc 13.3
abc 06-JAN-14 abc 13.9
abc 07-JAN-14 abc 14.1
abc 08-JAN-14 abc 14.7
abc 09-JAN-14 abc 14.7
abc 10-JAN-14 abc 14.6
abc 11-JAN-14 abc 14.8
abc 12-JAN-14 abc 14.9
abc 13-JAN-14 abc 14.6
abc 14-JAN-14 abc 14.5
abc 15-JAN-14 abc 14.7
abc 16-JAN-14 abc 14.9
abc 17-JAN-14 abc 14.3
abc 18-JAN-14 abc 14.4
abc 19-JAN-14 abc 14.6
xyz 01-JAN-14 xyz 17
xyz 02-JAN-14 xyz 16.5
xyz 03-JAN-14 xyz 16.9
xyz 04-JAN-14 xyz 16.5
xyz 05-JAN-14 xyz 16.3
Im folgenden hängen wir jeweils noch einen Filter auf das Symbol an,
um nur das Symbol abc zu betrachten. Die Grafik zum Aktienkurs
sieht dann so aus:
In diesem Chart möchten wir nun ein V-Muster suchen, also eine
fallende Phase gefolgt von einer steigenden Phase. Die Grafik zeigt
deutlich vier solcher Muster im Chart. Als ersten Query-Versuch starten wir
mit ...
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up.tstamp) as enddate, strt.price startprice, LAST(up.price) as endprice
ONE ROW PER MATCH
PATTERN (strt down+ up+)
DEFINE down AS price < prev(price),
up AS price > prev(price))
where symbol='abc';
Das verwendete Pattern beginnt mit dem Alias strt;
dieser ist innerhalb von DEFINE nicht näher definiert, matcht per Definition
also alle Tabellenzeilen. Das bedeutet, dass ein Match prinzipiell überall beginnen
kann. Gefolgt wird strt vom Abschnitt down und danach
von up. Für "up" ist in DEFINE festgelegt, dass
der vorherige Kurs kleiner sein muss als der der aktuellen Zeile, für "down" muss es
entsprechend umgekehrt sein. Das Pluszeichen dahinter (+) ist ein Quantifizierer; damit
lassen sich Angaben machen, wie oft der Abschnitt vorkommen muss. Folgende
Möglichkeiten gibt es:
* kein, ein oder mehrmaliges Vorkommen
+ 1 oder mehrmaliges Vorkommen
? kein oder genau ein Vorkommen
{n} genau n Vorkommen
{n,} n oder mehr Vorkommen
{n,m} n bis m Vorkommen
{,m} maximal m Vorkommen
Zusätzlich lässt sich noch ein Fragezeichen (?) anhängen, damit
werden die Quantifizierer vom
Greedy- in den
Reluctant Modus
geschaltet.
Greedy bedeutet, dass der Muster-Abschnitt so viele Input-Daten
heranzieht wie möglich - wenn alle Input-Daten auf den ersten Abschnitt passen;
so werden auch alle herangezogen.
Reluctant bedeutet dagegen, dass gerade
soviele Daten herangezogen werden, wie zur Erfüllung der Bedingung minestens nötig
sind. Sobald der Abschnitt gerade erfüllt ist, werden die danach folgenden Abschnitte
geprüft. Mehr zu diesen Unterschieden findet sich in der
Oracle-Dokumentation.
Das Ergebniss der Abfrage ist wie folgt:
SYMBOL STARTDATE ENDDATE STARTPRICE ENDPRICE
abc 03-JAN-14 08-JAN-14 13.8 14.7
abc 09-JAN-14 12-JAN-14 14.7 14.9
abc 13-JAN-14 16-JAN-14 14.6 14.9
Wenn wir das mit dem Chart vergleichen, sind wir offensichtlich noch
nicht so ganz in Einklang mit unserem Ziel.
Das dritte V-Muster startet offensichtlich schon am 12. und nicht
erst am 13. Ein vierter Treffer (16. - 19.) erscheint gar nicht in
der Treffermenge. Dies liegt daran, dass wir nicht angegeben haben,
wo, nach einem gefunden Treffer, weiter gesucht werden soll.
Die Default-Einstellung hierfür ist AFTER MATCH SKIP TO PAST LAST ROW. Dies erklärt das Ergebnis: Nach
dem zweiten Treffer (endet am 12.) wird erst ab dem 13. weiter gesucht und der 12.
bleibt unberücksichtigt. Ebenso wird nach dem dritten Treffer erst
ab dem 17. weiter gesucht und somit gar kein Muster mehr gefunden. Diese
Einstellung kann verändert werden:
- AFTER MATCH SKIP TO NEXT ROW - Resume pattern matching at the row after the first row of the current match.
- AFTER MATCH SKIP PAST LAST ROW - Resume pattern matching at the next row after the last row of the current match.
- AFTER MATCH SKIP TO FIRST pattern_variable - Resume pattern matching at the first row that is mapped to the pattern variable.
- AFTER MATCH SKIP TO LAST pattern_variable - Resume pattern matching at the last row that is mapped to the pattern variable.
- AFTER MATCH SKIP TO pattern_variable - The same as AFTER MATCH SKIP TO LAST pattern_variable.
Um also den letzten Wert eines Treffers auch als möglichen ersten
Wert eines weiteren Treffers zu berücksichtigen, verwenden wir AFTER
MATCH SKIP TO LAST pattern_variable und somit die folgende Abfrage:
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up.tstamp) as enddate, strt.price startprice, LAST(up.price) as endprice
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST up
PATTERN (strt down+ up+)
DEFINE down AS price < prev(price),
up AS price > prev(price))
where symbol='abc';
Das Ergebnis dieser Abfrage sieht wie folgt aus:
SYMBOL STARTDATE ENDDATE STARTPRICE ENDPRICE
abc 03-JAN-14 08-JAN-14 13.8 14.7
abc 09-JAN-14 12-JAN-14 14.7 14.9
abc 12-JAN-14 16-JAN-14 14.9 14.9
abc 16-JAN-14 19-JAN-14 14.9 14.6
und gibt uns nun also die richtigen Werte zurück
Als nächstes lassen wir uns nun in der Anzeige auch Teilergebnisse
anzeigen, so dass schön dargestellt wird, wie ein Ergebnis entsteht.
Hierzu ersetzen wir das ONE ROW PER MATCH in der Abfrage mit einem
ALL ROWS PER MATCH.
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up.tstamp) AS enddate, strt.price startprice, LAST(up.price) AS endprice
ALL ROWS PER MATCH
AFTER MATCH SKIP TO LAST up
PATTERN (strt down+ up+)
DEFINE down AS price < prev(price),
up AS price > prev(price))
where symbol='abc';
Das Ergebnis ist wie folgt:
SYMBOL TSTAMP STARTDATE ENDDATE STARTPRICE ENDPRICE PRICE
abc 03-JAN-14 03-JAN-14 - 13.8 - 13.8
abc 04-JAN-14 03-JAN-14 - 13.8 - 13.1
abc 05-JAN-14 03-JAN-14 05-JAN-14 13.8 13.3 13.3
abc 06-JAN-14 03-JAN-14 06-JAN-14 13.8 13.9 13.9
abc 07-JAN-14 03-JAN-14 07-JAN-14 13.8 14.1 14.1
abc 08-JAN-14 03-JAN-14 08-JAN-14 13.8 14.7 14.7
abc 09-JAN-14 09-JAN-14 - 14.7 - 14.7
abc 10-JAN-14 09-JAN-14 - 14.7 - 14.6
abc 11-JAN-14 09-JAN-14 11-JAN-14 14.7 14.8 14.8
abc 12-JAN-14 09-JAN-14 12-JAN-14 14.7 14.9 14.9
abc 12-JAN-14 12-JAN-14 - 14.9 - 14.9
abc 13-JAN-14 12-JAN-14 - 14.9 - 14.6
abc 14-JAN-14 12-JAN-14 - 14.9 - 14.5
abc 15-JAN-14 12-JAN-14 15-JAN-14 14.9 14.7 14.7
abc 16-JAN-14 12-JAN-14 16-JAN-14 14.9 14.9 14.9
abc 16-JAN-14 16-JAN-14 - 14.9 - 14.9
abc 17-JAN-14 16-JAN-14 - 14.9 - 14.3
abc 18-JAN-14 16-JAN-14 18-JAN-14 14.9 14.4 14.4
abc 19-JAN-14 16-JAN-14 19-JAN-14 14.9 14.6 14.6
Ist das ENDDATE in der Ausgabe noch leer, haben wir bisher
noch keinen gültigen Treffer entdeckt. In der dritten Zeile sehen wir
den ersten Treffer, welcher sukzessive verlängert wird, bis
das Muster komplett ist. Sehr schön lässt sich dies
nachvollziehen, wenn man sich anzeigen läßt, in welchem Teil des Musters
man gerade ist (CLASSIFIER), und der wievielte Treffer (MATCH_NUMBER)
gerade betrachtet wird. Zusätzlich lassen wir uns UNMATCHED ROWS
anzeigen.
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up.tstamp) AS enddate, CLASSIFIER() AS c, MATCH_NUMBER() AS m
ALL ROWS PER MATCH WITH UNMATCHED ROWS
AFTER MATCH SKIP TO LAST up
PATTERN (strt down+ up+)
DEFINE down AS price < prev(price),
up AS price > prev(price))
where symbol='abc';
Die Ausgabe ist nun leicht verständlich und überschauber:
SYMBOL TSTAMP STARTDATE ENDDATE C M PRICE
abc 01-JAN-14 - - - - 13
abc 02-JAN-14 - - - - 13.4
abc 03-JAN-14 03-JAN-14 - STRT 1 13.8
abc 04-JAN-14 03-JAN-14 - DOWN 1 13.1
abc 05-JAN-14 03-JAN-14 05-JAN-14 UP 1 13.3
abc 06-JAN-14 03-JAN-14 06-JAN-14 UP 1 13.9
abc 07-JAN-14 03-JAN-14 07-JAN-14 UP 1 14.1
abc 08-JAN-14 03-JAN-14 08-JAN-14 UP 1 14.7
abc 08-JAN-14 - - - - 14.7
abc 09-JAN-14 09-JAN-14 - STRT 2 14.7
abc 10-JAN-14 09-JAN-14 - DOWN 2 14.6
abc 11-JAN-14 09-JAN-14 11-JAN-14 UP 2 14.8
abc 12-JAN-14 09-JAN-14 12-JAN-14 UP 2 14.9
abc 12-JAN-14 12-JAN-14 - STRT 3 14.9
abc 13-JAN-14 12-JAN-14 - DOWN 3 14.6
abc 14-JAN-14 12-JAN-14 - DOWN 3 14.5
abc 15-JAN-14 12-JAN-14 15-JAN-14 UP 3 14.7
abc 16-JAN-14 12-JAN-14 16-JAN-14 UP 3 14.9
abc 16-JAN-14 16-JAN-14 - STRT 4 14.9
abc 17-JAN-14 16-JAN-14 - DOWN 4 14.3
abc 18-JAN-14 16-JAN-14 18-JAN-14 UP 4 14.4
abc 19-JAN-14 16-JAN-14 19-JAN-14 UP 4 14.6
abc 19-JAN-14 - - - - 14.6
Die Spalte CLASSIFIER (C) zeigt das Teilmuster unseres Patterns an. Ist die Spalte
leer, befinden wir uns nicht in einem möglichen Treffer. Ab dem Wert
STRT könnte sich ein Treffer entwickeln, gefunden ist er sobald ENDDATE
einen Wert hat und zumindest ein "up" Vorkommen gefunden wurde. Zum Abschluß suchen wir noch nach einem "W-Muster". Hierfür verwenden wir
DOWN und UP einfach doppelt:
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up.tstamp) AS enddate
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST up
PATTERN (strt down+ up+ down+ up+)
DEFINE down AS price < prev(price),
up AS price > prev(price))
where symbol='abc';
Heraus kommen folgende Daten
SYMBOL STARTDATE ENDDATE
abc 09-JAN-14 16-JAN-14
Allerdings passt das Ergebnis nicht ganz zum Chart ...
Das erste W-Muster wurde zwar gefunden, nicht aber das zweite.
Die Einstellung AFTER MATCH SKIP TO LAST up bedeutet, dass nach
Finden des ersten Musters ab dem 16. weitergesucht wird - und ab
da gibt es kein Muster mehr. Die Weitersuche muss ab dem 12. erfolgen;
"W"-Muster sollen sich überlappen können. Das ist sehr einfach zu erreichen;
anstelle der doppelt verwendeten Abschnitte UP und DOWN benennen wir
die einzelnen Abschnitte nun explizit. In der Klausel AFTER MATCH SKIP
kann mit UP1 dann explizit das Ende der ersten "up"-Phase angegeben werden.
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up2.tstamp) AS enddate
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST up1
PATTERN (strt down+ up1+ down+ up2+)
DEFINE down AS price < prev(price),
up1 AS price > prev(price),
up2 AS price > prev(price))
where symbol='abc';
Und schon haben wir das gewünschte Ergebnis
SYMBOL STARTDATE ENDDATE
abc 09-JAN-14 16-JAN-14
abc 12-JAN-14 19-JAN-14
Today's blog posting has not been authored by me, it is from my colleague
Rainer Willems. He
will introduce
SQL Pattern Matching, the new Oracle12c feature. My task was
simply to publich Rainer's posting on my blog - and I'm happy to do this right now.
Some time ago, with Oracle10g. regular expressions (regex) have been
introduced to the Oracle Database. These allow to look up patterns within
Strings (VARCHAR2). The new feature, SQL Pattern Matching is also about
searching for patterns, but within data streams or simpler: within
a set of table rows. Let's start with a simple example: The following table
contains stock quotes.
CREATE TABLE TICKER (
TSTAMP DATE,
SYMBOL VARCHAR2(5),
PRICE NUMBER );
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('01-JAN-14','DD-MON-RR'),'abc',13);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('02-JAN-14','DD-MON-RR'),'abc',13.4);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('03-JAN-14','DD-MON-RR'),'abc',13.8);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('04-JAN-14','DD-MON-RR'),'abc',13.1);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('05-JAN-14','DD-MON-RR'),'abc',13.3);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('06-JAN-14','DD-MON-RR'),'abc',13.9);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('07-JAN-14','DD-MON-RR'),'abc',14.1);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('08-JAN-14','DD-MON-RR'),'abc',14.7);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('09-JAN-14','DD-MON-RR'),'abc',14.7);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('10-JAN-14','DD-MON-RR'),'abc',14.6);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('11-JAN-14','DD-MON-RR'),'abc',14.8);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('12-JAN-14','DD-MON-RR'),'abc',14.9);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('13-JAN-14','DD-MON-RR'),'abc',14.6);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('14-JAN-14','DD-MON-RR'),'abc',14.5);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('15-JAN-14','DD-MON-RR'),'abc',14.7);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('16-JAN-14','DD-MON-RR'),'abc',14.9);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('17-JAN-14','DD-MON-RR'),'abc',14.3);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('18-JAN-14','DD-MON-RR'),'abc',14.4);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('19-JAN-14','DD-MON-RR'),'abc',14.6);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('01-JAN-14','DD-MON-RR'),'xyz',17);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('02-JAN-14','DD-MON-RR'),'xyz',16.5);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('03-JAN-14','DD-MON-RR'),'xyz',16.9);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('04-JAN-14','DD-MON-RR'),'xyz',16.5);
Insert into TICKER (TSTAMP,SYMBOL,PRICE) values (to_date('05-JAN-14','DD-MON-RR'),'xyz',16.3);
/
As you can see, the table contains data for two stock symbols. The first SQL pattern
matching query sill not provide any additional value, but it's nice to introduce the syntax.
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES e.tstamp as st, e.symbol as s, e.price as p
ONE ROW PER MATCH
PATTERN (e)
DEFINE e AS price=price);
- MATCH_RECOGNIZE is the keyword
to start the SQL pattern matching syntax.
- PARTITION BY determines the column
which is being used for logical partitioning of the data. Pattern matching
is always being performed within a partition and a found pattern will never span
multiple partitions. In this example, we'll use the stock symbol.
- ORDER BY determines data ordering. It's obvious
that ordering is an essential requirement to do pattern matching. In this example,
the timestamp will be used.
- MEASURES specify the columns which the
Pattern Matching clause will return to the outer SQL query. The alias e
references the pattern declaration later on.
- PATTERN contains the pattern declaration. The
simple usage of an alias like in this example means always true, so the pattern
matches any row. This will change later.
- ONE ROW PER MATCH (which is default)
determines that the SQL pattern matching clause will return one row per
matched pattern.
That first "pattern matching" query will simply return all rows.
SYMBOL ST S P
abc 01-JAN-14 abc 13
abc 02-JAN-14 abc 13.4
abc 03-JAN-14 abc 13.8
abc 04-JAN-14 abc 13.1
abc 05-JAN-14 abc 13.3
abc 06-JAN-14 abc 13.9
abc 07-JAN-14 abc 14.1
abc 08-JAN-14 abc 14.7
: : : :
xyz 05-JAN-14 xyz 16.3
Now we'll add a filter - we only want to work with the qoutes for abc. A chart
would look as follows.
We want to search for a "V-formation", which is a phase of falling prices, followed by
a period of rising prices. In the chart, we can see clearly, that there are four such patterns. So
let's try a SQL pattern matching query.
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up.tstamp) as enddate, strt.price startprice, LAST(up.price) as endprice
ONE ROW PER MATCH
PATTERN (strt down+ up+)
DEFINE down AS price < prev(price),
up AS price > prev(price))
where symbol='abc';
The pattern declaration (keywords PATTERN and DEFINE) contains
the most interesting changes. Our pattern consists of three sections;
it begins with strt, followed by down and up. The definition
of these aliases is containes in the DEFINE clause. An alias which is not
defined, matched all rows by definition. So strt
matches all rows; thus a pattern can begin anywhere. down
is being defined as that the current price is below the previous price; stock
is going down. Up is the other way around: The current price is above the previous price.
In the PATTERN clause, a plus (+) sign is appended to down and up; which
means that both patterns have to appear at least once. You can use the following suffixes
to indicate how often a pattern part has to occur.
* none, one or multiple
+ one or multiple
? zero or one
{n} exact "n" times
{n,} "n" or more times
{n,m} "n" to "m" times
{,m} not more than "m" times
By default, these quantifiers are
greedy; appending an additional question
mark makes them
reluctant. A greedy quantifier tries to match as much of the
input data as possible. Subsequent pattern sections have to deal with the rest of the
data. A reluctant quantifier matches just as much data as needed by its definition.
More information on this can be found in the
Oracle documentation. Query results are as follows ...
SYMBOL STARTDATE ENDDATE STARTPRICE ENDPRICE
abc 03-JAN-14 08-JAN-14 13.8 14.7
abc 09-JAN-14 12-JAN-14 14.7 14.9
abc 13-JAN-14 16-JAN-14 14.6 14.9
Comparing this to the line chart shows, that we have not reached our goal so far.
The chart shows clearly that the third match begins at the 12th; the query results
indicate that its the 13th. The forth match is not being found at all. The reason
for this discrepancy is that the SQL pattern matching, after it has found a match, by default continues
with the first row after that match. So by default, one row cannot
be part of two matches. In reality, the stock price at one day could finish one
match and open another. So we have to change that default - we can choose from
the following:
- AFTER MATCH SKIP TO NEXT ROW - Resume pattern matching at the row after the first row of the current match
.
- AFTER MATCH SKIP PAST LAST ROW - Resume pattern matching at the next row after the last row of the current match
.
- AFTER MATCH SKIP TO FIRST pattern_variable - Resume pattern matching at the first row that is mapped to the pattern variable.
- AFTER MATCH SKIP TO LAST pattern_variable - Resume pattern matching at the last row that is mapped to the pattern variable.
- AFTER MATCH SKIP TO pattern_variable - The same as AFTER MATCH SKIP TO LAST pattern_variable.
So we change the query as follows ...
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up.tstamp) as enddate, strt.price startprice, LAST(up.price) as endprice
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST up
PATTERN (strt down+ up+)
DEFINE down AS price < prev(price),
up AS price > prev(price))
where symbol='abc';
... and we get these results ...
SYMBOL STARTDATE ENDDATE STARTPRICE ENDPRICE
abc 03-JAN-14 08-JAN-14 13.8 14.7
abc 09-JAN-14 12-JAN-14 14.7 14.9
abc 12-JAN-14 16-JAN-14 14.9 14.9
abc 16-JAN-14 19-JAN-14 14.9 14.6
... which nicely match the chart.
Now we want the query to return not only the matches as such, but also
detail information about a match: we want to have the individual rows of a match
in order to see how pattern matching works and how it finds the result. So
we replace the ONE ROW PER MATCH clause by ALL ROWS PER MATCH
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up.tstamp) AS enddate, strt.price startprice, LAST(up.price) AS endprice
ALL ROWS PER MATCH
AFTER MATCH SKIP TO LAST up
PATTERN (strt down+ up+)
DEFINE down AS price < prev(price),
up AS price > prev(price))
where symbol='abc';
The result is ...
SYMBOL TSTAMP STARTDATE ENDDATE STARTPRICE ENDPRICE PRICE
abc 03-JAN-14 03-JAN-14 - 13.8 - 13.8
abc 04-JAN-14 03-JAN-14 - 13.8 - 13.1
abc 05-JAN-14 03-JAN-14 05-JAN-14 13.8 13.3 13.3
abc 06-JAN-14 03-JAN-14 06-JAN-14 13.8 13.9 13.9
abc 07-JAN-14 03-JAN-14 07-JAN-14 13.8 14.1 14.1
abc 08-JAN-14 03-JAN-14 08-JAN-14 13.8 14.7 14.7
abc 09-JAN-14 09-JAN-14 - 14.7 - 14.7
abc 10-JAN-14 09-JAN-14 - 14.7 - 14.6
abc 11-JAN-14 09-JAN-14 11-JAN-14 14.7 14.8 14.8
abc 12-JAN-14 09-JAN-14 12-JAN-14 14.7 14.9 14.9
abc 12-JAN-14 12-JAN-14 - 14.9 - 14.9
abc 13-JAN-14 12-JAN-14 - 14.9 - 14.6
abc 14-JAN-14 12-JAN-14 - 14.9 - 14.5
abc 15-JAN-14 12-JAN-14 15-JAN-14 14.9 14.7 14.7
abc 16-JAN-14 12-JAN-14 16-JAN-14 14.9 14.9 14.9
abc 16-JAN-14 16-JAN-14 - 14.9 - 14.9
abc 17-JAN-14 16-JAN-14 - 14.9 - 14.3
abc 18-JAN-14 16-JAN-14 18-JAN-14 14.9 14.4 14.4
abc 19-JAN-14 16-JAN-14 19-JAN-14 14.9 14.6 14.6
As long as the ENDDATE result column is SQL NULL, SQL pattern matching
did not identify these rows as part of a pattern. Looking into the data, we can
see that the price is going down - the rows match the "down" section of our pattern
definition. We don't have "up" data so far, so SQL pattern matching does not identify
a match. This changes in the third row:
the price is now going up and the complete pattern definition is satisfied. So Oracle identified
a match and the match the ENDDATE is (for now) being set to JAN-05. This is
a partial match, not the final result.
While walking through the remaining
rows, match details can still change. And exactly this happens in the three subsequent
rows. The stock price is climbing further, so the ENDDATE result is shifted with
each row until JAN-09. At this point, Oracle has a found final match for
the pattern definition. To get even more insight into the process, we can
have the query return the name of the current pattern section (CLASSIFIER),
a match number and additionally all rows which don't belong to a match.
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up.tstamp) AS enddate, CLASSIFIER() AS c, MATCH_NUMBER() AS m
ALL ROWS PER MATCH WITH UNMATCHED ROWS
AFTER MATCH SKIP TO LAST up
PATTERN (strt down+ up+)
DEFINE down AS price < prev(price),
up AS price > prev(price))
where symbol='abc';
Die Ausgabe ist nun leicht verständlich und überschauber:
SYMBOL TSTAMP STARTDATE ENDDATE C M PRICE
abc 01-JAN-14 - - - - 13
abc 02-JAN-14 - - - - 13.4
abc 03-JAN-14 03-JAN-14 - STRT 1 13.8
abc 04-JAN-14 03-JAN-14 - DOWN 1 13.1
abc 05-JAN-14 03-JAN-14 05-JAN-14 UP 1 13.3
abc 06-JAN-14 03-JAN-14 06-JAN-14 UP 1 13.9
abc 07-JAN-14 03-JAN-14 07-JAN-14 UP 1 14.1
abc 08-JAN-14 03-JAN-14 08-JAN-14 UP 1 14.7
abc 08-JAN-14 - - - - 14.7
abc 09-JAN-14 09-JAN-14 - STRT 2 14.7
abc 10-JAN-14 09-JAN-14 - DOWN 2 14.6
abc 11-JAN-14 09-JAN-14 11-JAN-14 UP 2 14.8
abc 12-JAN-14 09-JAN-14 12-JAN-14 UP 2 14.9
abc 12-JAN-14 12-JAN-14 - STRT 3 14.9
abc 13-JAN-14 12-JAN-14 - DOWN 3 14.6
abc 14-JAN-14 12-JAN-14 - DOWN 3 14.5
abc 15-JAN-14 12-JAN-14 15-JAN-14 UP 3 14.7
abc 16-JAN-14 12-JAN-14 16-JAN-14 UP 3 14.9
abc 16-JAN-14 16-JAN-14 - STRT 4 14.9
abc 17-JAN-14 16-JAN-14 - DOWN 4 14.3
abc 18-JAN-14 16-JAN-14 18-JAN-14 UP 4 14.4
abc 19-JAN-14 16-JAN-14 19-JAN-14 UP 4 14.6
abc 19-JAN-14 - - - - 14.6
The classifier column (C) of the result set now shows the part of our
pattern definition which matches the current row. If this is empty, the
row definitively does not match the pattern. As soon as it contains
a value, we have a partial match - it depends on the subsequent data whether
it will become a final match. As soon as we have the sequence of STRT, DOWN and
UP, we have a final pattern match - the match number column (M) is being
increased by one.
Finally, we extend our pattern definition. We don't want to search for
"V" pattern any more, we are interested in the "W" pattern; which
means that we need a "down" phase, followed by "up", followed by "down" and by "up" again.
For this we simply need to extend the PATTERN clause within our query.
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up.tstamp) AS enddate
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST up
PATTERN (strt down+ up+ down+ up+)
DEFINE down AS price < prev(price),
up AS price > prev(price))
where symbol='abc';
We get the following result ...
SYMBOL STARTDATE ENDDATE
abc 09-JAN-14 16-JAN-14
... which (again) don't match the chart. What happened?
The first "W" pattern has been successfully found by the query. But we told
the database to continue at the last row of the "up" section. Thus two "W" patterns
cannot overlap and we'll find only one match. To let the query find overlapping
patterns as well, we need te be able to distinct between the two legs of the "W" pattern.
So we introduce explicit aliases for each of them. After that we can change
the AFTER MATCH SKIP clause to continue after the first "up" section and not after the second.
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES strt.tstamp startdate, LAST(up2.tstamp) AS enddate
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST up1
PATTERN (strt down+ up1+ down+ up2+)
DEFINE down AS price < prev(price),
up1 AS price > prev(price),
up2 AS price > prev(price))
where symbol='abc';
... and this query works as expected. It is able to find even overlapping matches.
SYMBOL STARTDATE ENDDATE
abc 09-JAN-14 16-JAN-14
abc 12-JAN-14 19-JAN-14