Regular Expression Search Documentation


Intro

Regular expression search is a new, more powerful and flexible way to search grayhatwarfare.com. Although simple keyword search is efficient and sufficient for most uses, it has certain limitations that regular expression search solves. This flexibility comes at a cost: each search takes about a minute to complete and consumes considerable CPU resources. That is why we charge extra for the package that includes regular expressions: more resources are needed.

Limitations of simple search

Let's say we have a filename like: student_resumes/resume1-1_thebackup_1519380372187.docx

To index this, we replace all special characters with spaces and then index the resulting keywords independently.

So: student_resumes/resume1-1_thebackup_1519380372187.docx
Becomes: student resumes resume1 1 thebackup 1519380372187 docx

That means if you search for "thebackup", "1519380372187", "docx", or "student", you will get this file in the results.

But if you search only for "backup" or "15193", you will not find this entry, because there is no partial matching on keywords. Also, if for some reason you need to include special characters in the search, this will not work.
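The indexing behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the site's actual indexer: the function names are made up for this example, and treating every non-alphanumeric character (including the underscore) as "special" is an assumption.

```python
import re

def index_tokens(filename: str) -> set[str]:
    """Replace each run of special characters with a space, then split into keywords."""
    # Assumption: "special" means anything that is not a letter or a digit.
    cleaned = re.sub(r"[\W_]+", " ", filename)
    return set(cleaned.split())

def simple_search(filename: str, keyword: str) -> bool:
    """A keyword matches only when it equals a whole indexed token."""
    return keyword in index_tokens(filename)

name = "student_resumes/resume1-1_thebackup_1519380372187.docx"
print(sorted(index_tokens(name)))
print(simple_search(name, "thebackup"))  # whole token: found
print(simple_search(name, "backup"))     # partial token: not found
```

Because lookups compare whole tokens, "backup" and "15193" are never found inside "thebackup" or "1519380372187".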

Regular Expressions


We take a different approach on regular expressions:

  • special characters are not removed.

This gives you more control over searching. Some examples:

  • .*backup.*
    • Finds everything that contains backup, like thebackup, backup2, _backup_, etc.
  • .*2018[\-_\. ]11.*
    • Find everything related to November 2018
  • .*dump.*(gz|tar|zip)
    • Find all files that contain the keyword "dump" and end with gz, tar, or zip
  • backup.*
    • Find all files that BEGIN with backup
  • .*backup
    • Find all files that END with backup.
  • .*\.php
    • Find all files with php extension in the site
  • 19[0-9]{2}
    • Text that contains 19 and then exactly 2 digits from 0-9
  • .*"test.txt"
    • Everything inside double quotes is treated literally and is not interpreted by the regular expression engine.
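To experiment with the examples above offline, a rough approximation in Python is possible. Note the assumptions: Python's `re` module is a different engine from the one used by the site, `re.fullmatch` stands in for the whole-filename matching rule, the filenames are invented for illustration, and the double-quote literal syntax is left out because Python does not support it.

```python
import re

def matches(pattern: str, filename: str) -> bool:
    """True when the WHOLE lower-cased filename matches the pattern."""
    return re.fullmatch(pattern, filename.lower()) is not None

# (pattern, hypothetical filename) pairs based on the examples above
EXAMPLES = [
    (r".*backup.*", "TheBackup_2.zip"),
    (r".*2018[\-_\. ]11.*", "report_2018-11-03.csv"),
    (r".*dump.*(gz|tar|zip)", "db_dump_final.tar"),
    (r"backup.*", "mybackup.zip"),   # does not BEGIN with backup
    (r".*backup", "site.backup"),
    (r".*\.php", "index.php"),
    (r".*19[0-9]{2}.*", "archive_1987.txt"),
]

for pattern, filename in EXAMPLES:
    print(f"{pattern!r:28} {filename!r:26} {matches(pattern, filename)}")
```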

You can use https://regex101.com/ to test your regular expressions. While doing that, please keep in mind the minor differences and implementation details below:

Some implementation notes:

  • We convert all text to lower case, to make searching easier. No need for [AaBbCc] etc.
  • Notice that most of the examples above start with .* - that is because the whole filename must match the regular expression for an entry to be returned.
    • Our system automatically adds .* at the start and end when it detects that the input does not contain ^, $ or .*
    • If you do not want us to autocorrect the input regular expression, there is a "Do not autocorrect regex" option
  • Some characters are reserved as operators; see Reserved characters below.
  • If the Full Path option is selected in the search, the regular expression is run against the complete filename (including the directory, if any). Otherwise it is run only against the filename part.
    • Let's assume 2 files:
      (files/Metallica - Outlaw Torn.mp3)
      (files/Metallica/Bleeding Me.mp3)
    • Let's assume that we search for: .*Metallica.*\.mp3
      • With the Full Path option disabled, only Outlaw Torn will be returned.
      • With the Full Path option enabled, both files, Outlaw Torn AND Bleeding Me, will be returned.
      • And yes, Load is a masterpiece. So what if it is not metal enough? Grow up, good music is good music.
  • Sorting and other filters (Extensions, Exclude extensions) work together with regular expressions.
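The autocorrect and Full Path notes above can be modelled roughly as follows. This is a sketch under stated assumptions, not the site's implementation: `autocorrect` and `search` are hypothetical names, the "contains ^ $ or .*" check is taken literally, Python's `re` stands in for the real engine, and case-insensitive matching is used to imitate the lower-casing of text.

```python
import re
from pathlib import PurePosixPath

def autocorrect(pattern: str) -> str:
    """Wrap the pattern in .* unless it already contains ^, $ or .*"""
    if any(marker in pattern for marker in ("^", "$", ".*")):
        return pattern
    return ".*" + pattern + ".*"

def search(pattern: str, path: str, full_path: bool = False, fix: bool = True) -> bool:
    """Match against the full path, or only the filename part when full_path is False."""
    target = path if full_path else PurePosixPath(path).name
    if fix:
        pattern = autocorrect(pattern)
    # The whole target must match, so fullmatch is used instead of search.
    return re.fullmatch(pattern, target, flags=re.IGNORECASE) is not None

files = ["files/Metallica - Outlaw Torn.mp3", "files/Metallica/Bleeding Me.mp3"]
pattern = r".*Metallica.*\.mp3"
print([f for f in files if search(pattern, f)])                  # filename part only
print([f for f in files if search(pattern, f, full_path=True)])  # complete path
```

With `fix=False` the pattern is used exactly as typed, which mirrors the "Do not autocorrect regex" option.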


Some tips:

  • Searching for .*keyword1.*keyword2.* will return different results than searching for .*keyword2.*keyword1.*
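A small Python illustration of why keyword order matters, and of one way to accept either order by joining both orderings with the | operator. Python's `re` is used here only as a stand-in for the site's engine, and the filenames are invented.

```python
import re

def matches(pattern: str, filename: str) -> bool:
    """True when the whole filename matches the pattern."""
    return re.fullmatch(pattern, filename) is not None

name = "2018_backup_dump.zip"
print(matches(r".*backup.*dump.*", name))  # backup appears before dump
print(matches(r".*dump.*backup.*", name))  # wrong order for this filename

# Accept the two keywords in either order.
either_order = r"(.*backup.*dump.*)|(.*dump.*backup.*)"
print(matches(either_order, name))
print(matches(either_order, "dump_of_backup.zip"))
```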


Regular Expressions API


To use regular expressions with the API:

  • Base64-encode your regular expression.
  • Use the files filter API, but instead of keywords and stopwords, pass the Base64 text of the regular expression and add the parameter &regexp=1
  • An example is:
    • .*dump.*(gz|tar|zip)
    • Base64: LipkdW1wLiooZ3p8dGFyfHppcCk=
    • More info about the query in the Api Documentation page
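The encoding step can be done with Python's standard library. The sketch below only produces the Base64 text; the endpoint URL and the exact place where the encoded value goes are not shown here, so consult the API documentation page for those. `encode_regex` is a helper name chosen for this example.

```python
import base64

def encode_regex(pattern: str) -> str:
    """Return the standard Base64 encoding of a regular expression (UTF-8 bytes)."""
    return base64.b64encode(pattern.encode("utf-8")).decode("ascii")

encoded = encode_regex(".*dump.*(gz|tar|zip)")
print(encoded)  # LipkdW1wLiooZ3p8dGFyfHppcCk=

# Round trip, to confirm nothing was lost in the encoding.
print(base64.b64decode(encoded).decode("utf-8"))
```

Remember to add regexp=1 to the request so the value is treated as a regular expression rather than as keywords.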


Regular Expressions Manual


Reserved characters

The regular expression engine supports all Unicode characters. However, the following characters are reserved as operators:

. ? + * | { } [ ] ( ) " \

To use one of these characters literally, escape it with a preceding backslash or surround it with double quotes. For example:

\.                  # renders as a literal '.'
\\                  # renders as a literal '\'
"[email protected]"    # renders as '[email protected]'

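When a search term comes from user input or a script, the backslash rule above can be applied mechanically. The helper below is a minimal sketch: `escape_literal` is a hypothetical name, and the reserved set is copied from the list in this section.

```python
# Reserved operator characters, as listed above: . ? + * | { } [ ] ( ) " \
RESERVED = set('.?+*|{}[]()"\\')

def escape_literal(text: str) -> str:
    """Prefix every reserved character with a backslash so it is matched literally."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)

print(escape_literal("test.txt"))      # test\.txt
print(escape_literal("backup(1).zip")) # backup\(1\)\.zip
```

The escaped text can then be embedded in a larger pattern, for example by placing .* in front of it.
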
Operators

Our regular expression engine does not use the Perl Compatible Regular Expressions (PCRE) library, but it does support the following standard operators.

Operator, description, and examples (with comments):

  • .
    Matches any character.
      ab.          # matches 'aba', 'abb', 'abz', etc.
  • ?
    Repeat the preceding character zero or one times. Often used to make the preceding character optional.
      abc?         # matches 'ab' and 'abc'
  • +
    Repeat the preceding character one or more times.
      ab+          # matches 'ab', 'abb', 'abbb', etc.
  • *
    Repeat the preceding character zero or more times.
      ab*          # matches 'a', 'ab', 'abb', 'abbb', etc.
  • {}
    Minimum and maximum number of times the preceding character can repeat.
      a{2}         # matches 'aa'
      a{2,4}       # matches 'aa', 'aaa', and 'aaaa'
      a{2,}        # matches 'a' repeated two or more times
  • |
    OR operator. The match will succeed if the longest pattern on either the left side OR the right side matches.
      abc|xyz      # matches 'abc' and 'xyz'
  • ( … )
    Forms a group. You can use a group to treat part of the expression as a single character.
      abc(def)?    # matches 'abc' and 'abcdef' but not 'abcd'
  • [ … ]
    Match one of the characters in the brackets. Inside the brackets, - indicates a range unless - is the first character or escaped. A ^ before a character in the brackets negates the character or range.
      [abc]        # matches 'a', 'b', or 'c'
      [a-c]        # matches 'a', 'b', or 'c'
      [-abc]       # '-' is first character. Matches '-', 'a', 'b', or 'c'
      [abc\-]      # Escapes '-'. Matches 'a', 'b', 'c', or '-'
      [^abc]       # matches any character except 'a', 'b', or 'c'
      [^a-c]       # matches any character except 'a', 'b', or 'c'
      [^-abc]      # matches any character except '-', 'a', 'b', or 'c'
      [^abc\-]     # matches any character except 'a', 'b', 'c', or '-'
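The operators above can be tried out locally before running a (slow) search. The sketch below uses Python's `re` module, which is a different engine but behaves the same way for these basic operators; `re.fullmatch` is used because the whole text must match. The test strings are chosen for illustration.

```python
import re

# pattern -> (strings that should match, strings that should not match)
CASES = {
    r"ab.":       (["aba", "abz"], ["ab"]),
    r"abc?":      (["ab", "abc"], ["abcc"]),
    r"ab+":       (["ab", "abbb"], ["a"]),
    r"ab*":       (["a", "ab", "abb"], ["b"]),
    r"a{2,4}":    (["aa", "aaaa"], ["a", "aaaaa"]),
    r"abc|xyz":   (["abc", "xyz"], ["abcxyz"]),
    r"abc(def)?": (["abc", "abcdef"], ["abcd"]),
    r"[-abc]":    (["-", "a"], ["d"]),
    r"[abc\-]":   (["c", "-"], ["d"]),
    r"[^a-c]":    (["d", "-"], ["a", "b"]),
}

def full(pattern: str, text: str) -> bool:
    """True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

def failures() -> list[tuple[str, str]]:
    """Return every (pattern, text) pair that did not behave as listed in CASES."""
    bad = []
    for pattern, (should_match, should_not) in CASES.items():
        bad += [(pattern, s) for s in should_match if not full(pattern, s)]
        bad += [(pattern, s) for s in should_not if full(pattern, s)]
    return bad

print(failures())  # an empty list means every case behaved as described
```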

Notes

Copyright © 2018-2020 grayhatwarfare.com All rights reserved. Hand-crafted & made with Symfony PHP Framework, golang and all databases known to man 😉