Regular expression - regularexpression

Ask here any questions regarding program functionality
philip707
Posts: 58
Joined: Wed Jan 09, 2008 10:19 am
Location: TN- Chennai - India

Post by philip707 »

Could anyone post some examples of regular expressions? I saw a help file on the forum, but could someone please give some examples using a verse or word from the Bible?
JG
Posts: 4604
Joined: Wed Jun 04, 2008 8:34 pm

Post by JG »

http://www.theword.gr/en/onlinehelp/

If you go to the page above and look under Search/Advanced, you will find examples.
I think the idea is that people who can use it or need it will be the ones who find the help file.
Hope this helps.
Jon

Post by philip707 »

Thanks. Can you tell me how to find words of a particular length? For example, "Jesus" contains 5 letters; I would like to find all the 5-letter words. How can I do that?

Post by philip707 »

Regular expressions are not working in the Unicode Bibles. Does anyone have any idea why? When I use \w{17} in the KJV, it finds Chushanrishathaim in Jdg 3.8, but when I try the same thing with the Unicode Tamil Bible, nothing is returned :(

Post by JG »

Try this simple example for five-letter words:
\s\w\w\w\w\w\s

Jon
csterg
Site Admin
Posts: 8627
Joined: Tue Aug 29, 2006 3:09 pm
Location: Corfu, Greece

Post by csterg »

Hello Philip,
this is true, and it has to do with the regular expression syntax. For Unicode Bibles, the \d \D \s \S \w \W escape sequences do not work; you will need to use the \p{} escape sequence instead. Copying from the regular expression manual:
Unicode character properties

When PCRE is built with Unicode character property support, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. They are:

\p{xx} a character with the xx property
\P{xx} a character without the xx property
\X an extended Unicode sequence

The property names represented by xx above are limited to the Unicode general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by including a circumflex between the opening brace and the property name. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case, in the absence of negation, the curly brackets in the escape sequence are optional; these two examples have the same effect:

\p{L}
\pL

The following property codes are supported:
C Other
Cc Control
Cf Format
Cn Unassigned
Co Private use
Cs Surrogate

L Letter
Ll Lower case letter
Lm Modifier letter
Lo Other letter
Lt Title case letter
Lu Upper case letter

M Mark
Mc Spacing mark
Me Enclosing mark
Mn Non-spacing mark

N Number
Nd Decimal number
Nl Letter number
No Other number

P Punctuation
Pc Connector punctuation
Pd Dash punctuation
Pe Close punctuation
Pf Final punctuation
Pi Initial punctuation
Po Other punctuation
Ps Open punctuation

S Symbol
Sc Currency symbol
Sk Modifier symbol
Sm Mathematical symbol
So Other symbol

Z Separator
Zl Line separator
Zp Paragraph separator
Zs Space separator

Extended properties such as "Greek" or "InMusicalSymbols" are not supported by PCRE.

Specifying caseless matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.

The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to

(?>\PM\pM*)

That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.
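
A rough way to picture what \X does, for anyone who wants to try it outside the program. This is only a sketch in Python, whose built-in re module is not PCRE and has no \X or \p{..}; it uses the standard unicodedata module to read each character's general category instead. The helper names is_mark and extended_sequences are made up for illustration. One difference from the PCRE pattern: a mark with no base character in front of it becomes its own group here.

```python
import unicodedata


def is_mark(ch):
    # General categories Mc, Me and Mn all start with "M" (the "mark" property).
    return unicodedata.category(ch).startswith("M")


def extended_sequences(text):
    # Attach every mark to the character before it, like \PM\pM*.
    clusters = []
    for ch in text:
        if is_mark(ch) and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a"
print(len(extended_sequences("e\u0301a")))  # 2

# A Tamil word written as 3 letters plus 2 vowel signs (marks)
word = "\u0b87\u0baf\u0bc7\u0b9a\u0bc1"
print(len(word))                      # 5
print(len(extended_sequences(word)))  # 3
```

So a word that looks like three characters on screen can be five code points, two of which are marks rather than letters.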

Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.
So, to find 5 letters you need to write:

\pL{5}
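
For experimenting with the same idea outside the program, here is a minimal Python sketch. Python's built-in re module is not PCRE and does not understand \p{..}, so the sketch tests each character's Unicode general category with the standard unicodedata module instead. The helper names is_letter and letter_runs are hypothetical. Note also that it keeps only runs of exactly n letters, whereas an unanchored \pL{5} would also match five letters inside a longer word.

```python
import unicodedata
from itertools import groupby


def is_letter(ch):
    # Categories Ll, Lm, Lo, Lt and Lu all start with "L", like \pL.
    return unicodedata.category(ch).startswith("L")


def letter_runs(text, n):
    # Split the text into runs of letters and non-letters,
    # then keep the letter runs that are exactly n characters long.
    runs = []
    for letters, group in groupby(text, key=is_letter):
        run = "".join(group)
        if letters and len(run) == n:
            runs.append(run)
    return runs


print(letter_runs("Jesus wept.", 5))    # ['Jesus']
print(letter_runs("Иисус плакал", 5))   # ['Иисус']
```

The second line uses Cyrillic text to show that the check does not depend on the Latin alphabet.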
Is this helpful?

Costas
Rubén Gómez
Posts: 106
Joined: Mon Mar 05, 2007 1:15 pm

Post by Rubén Gómez »

philip707 wrote: Thanks. Can you tell me how to find words of a particular length? For example, "Jesus" contains 5 letters; I would like to find all the 5-letter words. How can I do that?
Actually, a more "standard" way to find 5-letter words (non-Unicode) would be this:

\b\w{5}\b
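
To see how this differs from the \s\w\w\w\w\w\s pattern suggested earlier in the thread, here is a small Python sketch. Python's re module is not the PCRE engine the program uses, so treat it only as an illustration; the sample sentence is made up, and the re.ASCII flag is an assumption used to imitate a non-Unicode search.

```python
import re

text = "Jesus wept. Peter spoke again."

# re.ASCII restricts \w, \s and \b to ASCII characters.
with_spaces = re.findall(r"\s\w\w\w\w\w\s", text, flags=re.ASCII)
with_bounds = re.findall(r"\b\w{5}\b", text, flags=re.ASCII)

print(with_spaces)  # [' Peter ']
print(with_bounds)  # ['Jesus', 'Peter', 'spoke', 'again']
```

The whitespace version needs a real space on both sides, so it misses a word at the start of the text, a word followed by punctuation, and a word whose leading space was already used up by the previous match. \b matches a position rather than a character, so none of those cases is lost.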

Hope it helps.
Rubén Gómez

Post by philip707 »

Thanks, dear Costas, it is working fine. I would like to mention one thing: when I started using regular expressions, I did not check the "Regular expression" and "Ignore Case and diacritics" boxes in the Details tab (Search options). I mention it here so that my mistake can be a lesson to someone :) Anyone who wants to use regular expressions, please check both boxes, "Regular expression" and "Ignore Case and diacritics", in the search options.

Post by csterg »

philip707 wrote: Thanks, dear Costas, it is working fine. I would like to mention one thing: when I started using regular expressions, I did not check the "Regular expression" and "Ignore Case and diacritics" boxes in the Details tab (Search options). I mention it here so that my mistake can be a lesson to someone :) Anyone who wants to use regular expressions, please check both boxes, "Regular expression" and "Ignore Case and diacritics", in the search options.
Just one more comment: there was a bug that caused regex searches to fail at times when 'Ignore case and diacritics' was checked. The latest published build, 690, does not have this problem, so be sure to upgrade to that version.
Costas