
extracting e-mail strings (function works but have to be updated...)


2003-08-19 01:59:04 AM
cppbuilder19
Hi dear builders,
I have a question. I use the function below to extract e-mail addresses from a
Memo->Text.
It works fine, BUT only if the addresses come from a website, because it
searches for the leading string "mailto:".
My problem is: how can I extract ALL e-mail addresses that do NOT start with a
"mailto:" string?
The difficulty is that an address starts after a space " ", then can have a
first "." separating first name and last name, then the "@", then another "."
(or maybe more, as subdomains have).
I'm going crazy, because handling all of that with AnsiPos() gets totally
unwieldy... is there maybe a nicer way?
For example, here are the different e-mail addresses I need to extract:
________________________________________________
XXXX@XXXXX.COM // really easy one..
XXXX@XXXXX.COM // a little bit more difficult...
XXXX@XXXXX.COM // even more difficult...
________________________________________________
One more problem is that the e-mail string CAN begin AND end
with various delimiters (not only a space " ").
Here's a sample:
________________________________________________
.... give Thomas a reply on( XXXX@XXXXX.COM ) !!....
________________________________________________
So as you can see, leading and trailing spaces " " cannot be the whole solution.
Can someone PLEASE help me solve this problem?
Oren
/***********************************************/
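// Adds every address that follows "mailto:" (up to the next double quote)
// to Memo and returns how many were added. text is modified temporarily.
int retrieve_mailtos(TMemo *Memo, char *text)
{
    int count = 0; // must start at zero before counting
    char *ptr = strstr(text, "mailto:");
    while (ptr)
    {
        ptr += strlen("mailto:");
        char *quote = strchr(ptr, '"');
        if (!quote) break;
        *quote = '\0';         // terminate the address temporarily
        Memo->Lines->Add(ptr);
        *quote = '"';          // restore the original character
        count++;
        ptr = strstr(ptr, "mailto:");
    }
    return count;
}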
void __fastcall TfrmStringDetails::EMailAdressen_extrahieren1Click(TObject *Sender)
{
    frmExtract_Window->Show();
    // Caption text: "The following e-mail addresses were extracted..."
    frmExtract_Window->Caption = "Folgende E-Mail Adressen wurden extrahiert...";
    int textlength = Memo1->Text.Length();
    char *text = new char[textlength + 100];
    strncpy(text, frmStringDetails->Memo1->Text.c_str(), textlength);
    text[textlength] = '\0';
    frmExtract_Window->txtExtract->Clear();
    retrieve_mailtos(frmExtract_Window->txtExtract, text);
    delete[] text; // memory from new[] must be released with delete[]
}
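
For the delimiter problem described in this post, here is a minimal sketch of one possible approach: find each "@" and grow the candidate to the left and right while the characters are ones an address may contain, so it no longer matters which delimiter surrounds the address. The helper names extract_emails and is_addr_char are made up for this sketch, and the accepted character set (letters, digits, ". _ - +") is a simplifying assumption, much narrower than what the mail RFCs allow.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Characters accepted inside the simplified address form handled here.
// Assumption: letters, digits and . _ - + only; the RFCs allow more.
static bool is_addr_char(char c)
{
    return std::isalnum(static_cast<unsigned char>(c)) ||
           c == '.' || c == '_' || c == '-' || c == '+';
}

// Hypothetical helper: collect every "local@domain" candidate found in text.
std::vector<std::string> extract_emails(const std::string &text)
{
    std::vector<std::string> found;
    std::string::size_type at = text.find('@');
    while (at != std::string::npos)
    {
        // Grow left and right from the '@' while the characters are allowed.
        std::string::size_type begin = at;
        while (begin > 0 && is_addr_char(text[begin - 1])) --begin;
        std::string::size_type end = at + 1;
        while (end < text.size() && is_addr_char(text[end])) ++end;

        // Trim dots that belong to surrounding punctuation ("...x@y.com.").
        while (begin < at && text[begin] == '.') ++begin;
        while (end > at + 1 && text[end - 1] == '.') --end;

        std::string local = text.substr(begin, at - begin);
        std::string domain = text.substr(at + 1, end - at - 1);

        // Require a non-empty local part and a dot inside the domain.
        std::string::size_type dot = domain.find('.');
        if (!local.empty() && dot != std::string::npos && dot > 0)
            found.push_back(local + "@" + domain);

        at = text.find('@', end); // continue after this candidate
    }
    return found;
}
```

In a VCL form this could presumably be fed with Memo1->Text.c_str(), adding each result with Memo->Lines->Add(found[i].c_str()). It also picks up the "mailto:" case, because ":" and the closing quote simply act as delimiters.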
 
 

Re:extracting e-mail strings (function works but have to be updated...)

"Oren (Halvani.de)" <XXXX@XXXXX.COM> wrote:
Quote
[...] my problem is, how can I extract ALL available e-mails
that are NOT containing a beginning "mailto:" string ??
What you propose to do is a very large task. In essence, to work correctly, you have to validate the selected text as an address. As you have begun to realize, the formats vary widely.
One example that you missed includes comments:
no( don't even think of ) XXXX@XXXXX.COM
Furthermore, within the comments you can have characters that
would normally be invalid in an email address.
or
"This could be a valid address"@no( may your children have warts )spam.com
Another possibility is to replace the domain name with an IP address:
nospam@[216.80.243.82]
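
To illustrate the comment forms shown above, here is a rough sketch of how parenthesized comments might be removed before any further checking. The function name strip_comments is invented for this example, and the handling (nested comments, backslash quoting, parentheses left alone inside a quoted string) is my reading of RFC 822 rather than a complete implementation of it.

```cpp
#include <string>

// Hypothetical helper: drop RFC 822 style "( ... )" comments from an address.
// Comments may nest; parentheses inside a "quoted string" are left alone.
std::string strip_comments(const std::string &in)
{
    std::string out;
    int depth = 0;          // current comment nesting level
    bool in_quotes = false; // true while inside a "quoted string"
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        char c = in[i];
        if (c == '\\' && i + 1 < in.size())
        {
            // A backslash quotes the next character; keep the pair
            // unless it sits inside a comment.
            if (depth == 0) { out += c; out += in[i + 1]; }
            ++i;
            continue;
        }
        if (depth == 0 && c == '"')
        {
            in_quotes = !in_quotes;
            out += c;
        }
        else if (!in_quotes && c == '(')
            ++depth;        // comment opens (or nests deeper)
        else if (!in_quotes && c == ')' && depth > 0)
            --depth;        // comment closes
        else if (depth == 0)
            out += c;       // ordinary character outside any comment
    }
    return out;
}
```

Applied to the quoted example above, this should leave "This could be a valid address"@nospam.com, which can then be validated like any other address.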
Have a look at this RFC and decide whether you truly want to parse out all valid addresses.
www.ietf.org/rfc/rfc0822.txt
If you decide that you're going to continue, here's some code
that I grabbed off the web a while ago. It's JavaScript rather than C++, but
it's easy enough to read and should get you started.
~ JD
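/* The opening lines of the function are not part of the snippet as posted;
the function name and the checkTLD flag below are assumed so that the body
is complete. */
function emailCheck(emailStr) {
/* Set to 1 to require a known TLD or two-letter country code, 0 to skip that check. */
var checkTLD=1;
/* The following is the list of known TLDs that an e-mail address must end with. */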
var knownDomsPat=/^(com|net|org|edu|int|mil|gov|arpa|biz|aero|name|coop|info|pro|museum)$/;
/* The following pattern is used to check if the entered e-mail address
fits the user@domain format. It also is used to separate the username
from the domain. */
var emailPat=/^(.+)@(.+)$/;
/* The following string represents the pattern for matching all special
characters. We don't want to allow special characters in the address.
These characters include ( ) <>@ , ; : \ " . [ ] */
var specialChars="\\(\\)><@,;:\\\\\\\"\\.\\[\\]";
/* The following string represents the range of characters allowed in a
username or domainname. It really states which chars aren't allowed.*/
var validChars="\[^\\s" + specialChars + "\]";
/* The following pattern applies if the "user" is a quoted string (in
which case, there are no rules about which characters are allowed
and which aren't; anything goes). E.g. "jiminy cricket"@disney.com
is a legal e-mail address. */
var quotedUser="(\"[^\"]*\")";
/* The following pattern applies for domains that are IP addresses,
rather than symbolic names. E.g. joe@[123.124.233.4] is a legal
e-mail address. NOTE: The square brackets are required. */
var ipDomainPat=/^\[(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\]$/;
/* The following string represents an atom (basically a series of non-special characters.) */
var atom=validChars + '+';
/* The following string represents one word in the typical username.
For example, in XXXX@XXXXX.COM , john and doe are words.
Basically, a word is either an atom or quoted string. */
var word="(" + atom + "|" + quotedUser + ")";
// The following pattern describes the structure of the user
var userPat=new RegExp("^" + word + "(\\." + word + ")*$");
/* The following pattern describes the structure of a normal symbolic
domain, as opposed to ipDomainPat, shown above. */
var domainPat=new RegExp("^" + atom + "(\\." + atom +")*$");
/* Finally, let's start trying to figure out if the supplied address is valid. */
/* Begin with the coarse pattern to simply break up user@domain into
different pieces that are easy to analyze. */
var matchArray=emailStr.match(emailPat);
if (matchArray==null) {
/* Too many/few @'s or something; basically, this address doesn't
even fit the general mould of a valid e-mail address. */
alert("Email address seems incorrect (check @ and .'s)");
return false;
}
var user=matchArray[1];
var domain=matchArray[2];
// Start by checking that only basic ASCII characters are in the strings (0-127).
for (var i=0; i<user.length; i++) {
if (user.charCodeAt(i)>127) {
alert("The username contains invalid characters.");
return false;
}
}
for (i=0; i<domain.length; i++) {
if (domain.charCodeAt(i)>127) {
alert("The domain name contains invalid characters.");
return false;
}
}
// See if "user" is valid
if (user.match(userPat)==null) {
// user is not valid
alert("The username doesn't seem to be valid.");
return false;
}
/* if the e-mail address is at an IP address (as opposed to a symbolic
host name) make sure the IP address is valid. */
var IPArray=domain.match(ipDomainPat);
if (IPArray!=null) {
// this is an IP address
for (var i=1;i<=4;i++) {
if (IPArray[i]>255) {
alert("Destination IP address is invalid!");
return false;
}
}
return true;
}
// Domain is symbolic name. Check if it's valid.
var atomPat=new RegExp("^" + atom + "$");
var domArr=domain.split(".");
var len=domArr.length;
for (i=0;i<len;i++) {
if (domArr[i].search(atomPat)==-1) {
alert("The domain name does not seem to be valid.");
return false;
}
}
/* domain name seems valid, but now make sure that it ends in a
known top-level domain (like com, edu, gov) or a two-letter word,
representing country (uk, nl), and that there's a hostname preceding
the domain or country. */
if (checkTLD && domArr[domArr.length-1].length!=2 &&
domArr[domArr.length-1].search(knownDomsPat)==-1) {
alert("The address must end in a well-known domain or two letter " + "country.");
return false;
}
// Make sure there's a host name preceding the domain.
if (len<2) {
alert("This address is missing a hostname!");
return false;
}
// If we've gotten this far, everything's valid!
return true;
}
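
Since the original question was about C++, here is a rough sketch of how the same checks might look there, written without a regex library so it should also build on older compilers. It is not a line-by-line port: all function names are made up for this sketch, the address is split at the last "@", and atoms are limited to printable ASCII, which is slightly stricter than the script above.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>
#include <vector>

namespace {
// Split s on '.'; empty pieces are kept so "a..b" can be rejected later.
std::vector<std::string> split_dots(const std::string &s)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0, dot;
    while ((dot = s.find('.', start)) != std::string::npos)
    {
        parts.push_back(s.substr(start, dot - start));
        start = dot + 1;
    }
    parts.push_back(s.substr(start));
    return parts;
}

// Atom character: printable ASCII, no space, none of ( ) < > @ , ; : \ " . [ ]
bool is_atom_char(char c)
{
    unsigned char u = static_cast<unsigned char>(c);
    return u > 32 && u < 127 &&
           std::string("()<>@,;:\\\".[]").find(c) == std::string::npos;
}

bool is_atom(const std::string &s)
{
    std::string::size_type i = 0;
    while (i < s.size() && is_atom_char(s[i])) ++i;
    return !s.empty() && i == s.size();
}

// User part: words separated by dots, each an atom or a "quoted string".
bool is_valid_user(const std::string &user)
{
    std::string::size_type i = 0;
    for (;;)
    {
        if (i < user.size() && user[i] == '"')
        {
            std::string::size_type close = user.find('"', i + 1);
            if (close == std::string::npos) return false;
            i = close + 1;
        }
        else
        {
            std::string::size_type start = i;
            while (i < user.size() && is_atom_char(user[i])) ++i;
            if (i == start) return false; // empty word
        }
        if (i == user.size()) return true;
        if (user[i] != '.') return false;
        ++i; // step over the dot; another word must follow
    }
}

// Domain part: [a.b.c.d] with each number <= 255, or dotted atoms.
// The caller guarantees that domain is not empty.
bool is_valid_domain(const std::string &domain, bool check_tld)
{
    if (domain[0] == '[')
    {
        if (domain[domain.size() - 1] != ']') return false;
        std::vector<std::string> n = split_dots(domain.substr(1, domain.size() - 2));
        if (n.size() != 4) return false;
        for (std::string::size_type i = 0; i < 4; ++i)
        {
            if (n[i].empty() || n[i].size() > 3) return false;
            for (std::string::size_type k = 0; k < n[i].size(); ++k)
                if (!std::isdigit(static_cast<unsigned char>(n[i][k]))) return false;
            if (std::atoi(n[i].c_str()) > 255) return false;
        }
        return true;
    }
    std::vector<std::string> labels = split_dots(domain);
    if (labels.size() < 2) return false; // hostname missing
    for (std::string::size_type i = 0; i < labels.size(); ++i)
        if (!is_atom(labels[i])) return false;
    if (!check_tld) return true;
    static const std::string known =
        " com net org edu int mil gov arpa biz aero name coop info pro museum ";
    const std::string &tld = labels.back();
    return tld.size() == 2 || known.find(" " + tld + " ") != std::string::npos;
}
} // namespace

// Hypothetical entry point (name assumed); splits at the last '@'.
bool is_valid_address(const std::string &addr, bool check_tld = true)
{
    for (std::string::size_type i = 0; i < addr.size(); ++i)
        if (static_cast<unsigned char>(addr[i]) > 127) return false;
    std::string::size_type at = addr.rfind('@');
    if (at == std::string::npos || at == 0 || at + 1 == addr.size()) return false;
    return is_valid_user(addr.substr(0, at)) && is_valid_domain(addr.substr(at + 1), check_tld);
}
```

As in the script, the TLD list is compared case-sensitively, so an address would need to be lower-cased first if upper-case domains should pass. Passing false as the second argument skips the TLD check entirely.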
 

Re:extracting e-mail strings (function works but have to be updated...)

Thanks JD & Remy for your great help.
It's really a lot of trouble to extract ONLY VALID e-mail addresses..
Oren
 
