scraping email addresses from a local text file

schwim · May 12, 2009

Hi there everyone,

Well, some shortsightedness on my part has put me in a bit of a pickle. To view the whole of my pickle(ewww..), you can read this thread(goes to Mozillazine), but briefly, I sent 1,500 email addresses through a form and now need to retrieve them from a local mail file.

I've viewed many classes and tuts concerning scraping and from what I've found, they spend a lot of time and resources on connecting to a remote server and tricking servers into thinking they're a legitimate browser, both of which are things I don't need to do. I've got the file here and I've got a wamp server. I just need to grab the email addresses, and create a comma delimited file with them.

I don't know how to scrape anything, having never done it before. I would very much appreciate some insight and assistance on the various portions of this (regex, applying the loop and storing properly in a file). I'll continue searching, but you guys have always been pretty great at surpassing the tutorials in terms of helpfulness.

Thoughts would be greatly appreciated.

thanks,
json

sneakyimp · May 13, 2009

Scraping this file could be very easy or harder depending on its format. I didn't read that whole other thread...is it some kind of delimited file with email addresses separated by semicolons or is it a messy list of nonsense spattered with email addresses? Do the email addresses have the user friendly names (like John Smith <john.smith@example.com>) or are they all just well-formed email addresses (john.smith@example.com)?

If they are simple email addresses separated by semicolons, you could do it pretty simply. Let's assume the file is located at "C:\file.txt"

// single quotes, so that \f is not read as a form-feed escape
$file = 'C:\file.txt';
$contents = file_get_contents($file);
if ($contents === FALSE) {
  die('could not fetch file contents');
}

// get rid of all the new lines
$contents = implode('', explode("\r", $contents));
$contents = implode('', explode("\n", $contents));

// split file along semicolons (you could change this to commas if you want)
$pieces = explode(';', $contents);

$addresses = array();
foreach($pieces as $pc) {
  $addresses[] = trim($pc);
}

If your file is messier, you might need a pattern matching approach. Is the file HTML? This code will take an HTML file and extract all the email addresses that are linked like this:

<a href="mailto:some_email@domain.com">some text could contain email addresses</a>

// single quotes, so that \f is not read as a form-feed escape
$file = 'C:\file.txt';
$contents = file_get_contents($file);
if ($contents === FALSE) {
  die('could not fetch file contents');
}

$pattern = '#<a[^>]+href="mailto:([^"]+)"[^>]*?>#is';
$num_matches = preg_match_all($pattern, $contents, $matches);
if ($num_matches > 0) {
  $addresses = array();
  foreach($matches[1] as $m) {
    $addresses[] = $m;
  }
} else {
  echo "no matches\n";
}

Hope that helps.

schwim · May 13, 2009

Hi there Sneaky and thanks so much for your help.

It's not HTML, but fortunately there is a very strict format to the file.

I'm posting a representation of one saved email in the file. All other entries are identical, barring of course the timestamps, etc.

From - Sat Nov 15 01:27:35 2008
X-Account-Key: account36
X-UIDL: UID15911-116673486
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
X-Mozilla-Keys:

Return-path: <nobody@server.domain.com>
Envelope-to: renee@domain.net
Delivery-date: Sat, 15 Nov 2008 01:18:52 -0500
Received: from server.secureserver.net ([64.202.000.102])
by server.domain.com with smtp (Exim 4.69)
(envelope-from <nobody@server.domain.com>)
id 1L1EUb-0007eV-Hj
for renee@domain.net; Sat, 15 Nov 2008 01:18:49 -0500
Received: (qmail 23241 invoked from network); 15 Nov 2008 06:18:53 -0000
Received: from unknown (HELO server.domain.com) (208.109.118.110)
by server.secureserver.net (64.202.000.102) with ESMTP; 15 Nov 2008 06:18:53 -0000
Received: from nobody by server.domain.com with local (Exim 4.69)
(envelope-from <nobody@server.domain.com>)
id 1L1EOo-0008P5-Vt
for renee@domain.net; Fri, 14 Nov 2008 23:12:51 -0700
To: renee@domain.net
Subject: T&T Entry form
From: no_reply@domain.com
X-Mailer: PHP/5.2.5
Message-Id: <E1L1EOo-0008P5-Vt@server.domain.com>
Date: Fri, 14 Nov 2008 23:12:50 -0700
X-SchwimServer3-MailScanner-Information: Please contact the ISP for more information
X-SchwimServer3-MailScanner-ID: 1L1EUb-0007eV-Hj
X-SchwimServer3-MailScanner: Found to be clean
X-SchwimServer3-MailScanner-SpamCheck: not spam, SpamAssassin (not cached,
score=-2.599, required 5, autolearn=not spam, BAYES_00 -2.60)
X-SchwimServer3-MailScanner-From: nobody@server.domain.com
X-Spam-Status: No
X-Antivirus: AVG for E-mail 8.0.199 [270.9.4/1789]

An entry was submitted for the Shopping Spree. The submitted information is:

Name: melanee Lastname
Email: jsullins@domain.net

Thanks,
Your webserver

This repeats about 1,500 times in the same file.

Can I somehow retrieve the address after "Email: "? Also, am I right that the function you posted retrieves the address, but I still need to build a loop of some kind to write it to another file?

Thanks so much for your time,
json

sneakyimp · May 13, 2009

Hm. I didn't spend much time on this...just googled around for "preg_match_all scrape email" and found this regular expression:

/(\w+\.)*\w+@(\w+\.)*\w+(\w+\-\w+)*\.\w+/

I don't think it will match every email address but it should get most. The code might then look like this:

// single quotes, so that \f is not read as a form-feed escape
$file = 'C:\file.txt';
$contents = file_get_contents($file);
if ($contents === FALSE) {
  die('could not fetch file contents');
}

$pattern = '/(\w+\.)*\w+@(\w+\.)*\w+(\w+\-\w+)*\.\w+/';
$num_matches = preg_match_all($pattern, $contents, $matches);
// start with an empty list so the write below still works when nothing matches
$addresses = array();
if ($num_matches > 0) {
  foreach($matches[0] as $m) {
    if (!in_array($m, $addresses)) {
      $addresses[] = $m;
    }
  }
} else {
  echo "no matches\n";
}

// put the results in this file
$dest = 'C:\output.txt';
if (!file_put_contents($dest, implode("\r\n", $addresses))) {
  die("unable to write destination file\n");
} else {
  echo "everything seems ok, please check file $dest\n";
}

rikmoncur · May 13, 2009

This might help - I use a programme called 'mailbomber' and it pulls email addresses from local files. Google it, there's a free trial which may be all you need.

schwim · May 13, 2009

Hi there Sneaky,

Thanks again very much for your help. Can I ask if either of two things are possible?

Either grab only the email that comes after "Email: " or have it check for duplicates? I ask because I'll end up with 1,500 instances of all the addresses in the headers of the mails.

Am I correct that it would be preferable to not access the destination file to check for dupes to keep the overhead down?

@ rikmoncur: thanks very much for that. I'm going to give any of those types of programs a miss since they almost always make sure the crippled version doesn't do anything of value. You can invest a lot of time into getting one going only to find out that with just $70, you too can do what you need. My days of cracking appz are behind me, so I'm going to stick with the homegrown method. Thanks very much for the suggestion, though.

thanks,
json

sneakyimp · May 13, 2009

Have you even tried the code? If you look closely, you see that I check the array of emails for a given address before adding it:

    if (!in_array($m, $addresses)) {
      $addresses[] = $m;
    }

Yes it can be modified to only get the ones with email first, but I'm doing all the lifting here. Maybe you could take a stab at editing the regular expression pattern?
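On the duplicate check itself, here is a minimal sketch of doing it entirely in memory, so the destination file is only written once at the end. The sample addresses are made up and stand in for $matches[0]; lowercasing before comparing is an assumption, not something the original code does.

```php
<?php
// Hypothetical sample data standing in for $matches[0] from preg_match_all().
$found = array(
    'renee@domain.net',
    'jsullins@domain.net',
    'renee@domain.net',
    'JSullins@domain.net',
);

// Lowercase first so addresses differing only in case count as duplicates
// (an assumption; drop array_map() to keep the comparison case-sensitive).
$normalized = array_map('strtolower', $found);

// array_unique() keeps the first occurrence of each value;
// array_values() reindexes the result from 0.
$addresses = array_values(array_unique($normalized));

echo implode(",\r\n", $addresses), "\n";
```

This does the same job as the in_array() test inside the loop, just in one pass over the finished list.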

schwim · May 14, 2009

Hi there sneaky,

You're absolutely right that you've provided 100% of the working code. I could show you my attempts at both doing it myself and at modifying your offering, but they're so pathetic, that I thought I would leave them out.

I have run the code. The reason I thought it wasn't catching the duplicates is because thousands of server.com addresses were getting saved, but at closer look, the part before the @ is a little different each time.

Here is the code as I'm trying to use it right now. It saves all email addresses(barring dupes) to the file, so it's working as it should in that regard.

<?php

$file = "C:\\Program Files\\xampp\\htdocs\\entries.txt";
echo $file."<br>";
$contents = file_get_contents($file);
echo $contents."<br>";
if ($contents === FALSE) {
  die('could not fetch file contents');
}

$pattern = '/(\w+\.)*\w+@(\w+\.)*\w+(\w+\-\w+)*\.\w+/';
$num_matches = preg_match_all($pattern, $contents, $matches);
if ($num_matches > 0) {
  $addresses = array();
  foreach($matches[0] as $m) {
    if (!in_array($m, $addresses)) {
      $addresses[] = $m;
    }
  }
} else {
  echo "no matches\n";
}

// put the results in this file
$dest = 'C:\\Program Files\\xampp\\htdocs\\output.txt';
if (!file_put_contents($dest, implode(",\r\n", $addresses))) {
  die("unable to write destination file\n");
} else {
  echo "everything seems ok, please check file $dest\n";
}

?>

I'm editing this as I go, so forgive me if you're reading something different now

I've managed to get it to save only the emails after "Email: ". The problem I'm having now is that it's saving "Email: " with the addresses

$pattern = '/Email: (\w+\.)*\w+@(\w+\.)*\w+(\w+\-\w+)*\.\w+/';

Any help on getting it to dump that portion after it finds it would be very welcome. I'm going back to my tutorial now

I know I could get rid of the "Email: " with my text editor, but I feel I owe it to myself that I figure out how to do this. I am completely incompetent at regex.

thanks,
json

sneakyimp · May 14, 2009

I'm so proud of you for trying. And not just trying, but trying the hard way -- by reading the docs and tutorials and such. Your pattern is very close but has two problems.

1) You put your assertion outside the characters that start and end the regular expression. For some reason that completely escapes me, regular expression patterns start and end with the same character. Most often, slashes (/) are used. Like this:

/pattern/

I think there might be some official name for these characters...the preg_quote documentation calls them 'delimiters' but that doesn't sound quite right to me.

You can also put some flags after the ending delimiter like this:

/pattern/is

The i and s at the end are flags.
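A small sketch of delimiters and flags in action, using a made-up sample string. Any matching pair of non-alphanumeric characters can serve as delimiters; # is a common alternative when the pattern itself contains slashes.

```php
<?php
// Made-up sample text.
$text = "Email: JSullins@Domain.net";

// The same pattern written with two different delimiters; both are valid.
$with_slashes = '/email: /i';   // the i flag makes the match case-insensitive
$with_hashes  = '#email: #i';   // handy when the pattern contains slashes

var_dump(preg_match($with_slashes, $text)); // int(1)
var_dump(preg_match($with_hashes, $text));  // int(1)

// Without the i flag, lowercase "email" no longer matches "Email".
var_dump(preg_match('/email: /', $text));   // int(0)
```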

SO, given that I used slashes to start and end my regular expression, you need to put your assertion AFTER that first slash instead of before it.

2) I think your assertion is a lookahead assertion and would probably try to force the \w bit that follows it to be "Email: ". Use a lookbehind assertion instead like this:

$pattern = '/(?<=Email: )(\w+\.)*\w+@(\w+\.)*\w+(\w+\-\w+)*\.\w+/';
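To see what the lookbehind buys you, here is a rough sketch run against a made-up sample shaped like one entry from the file. It also shows one alternative, a capturing group, which is a different technique from the lookbehind above: the whole match still includes "Email: ", but group 1 holds only the address.

```php
<?php
// Made-up sample shaped like one entry from the mail file.
$sample = "To: renee@domain.net\nName: melanee Lastname\nEmail: jsullins@domain.net\n";

// Lookbehind: "Email: " must precede the match but is not part of it.
$lookbehind = '/(?<=Email: )(\w+\.)*\w+@(\w+\.)*\w+(\w+\-\w+)*\.\w+/';
preg_match_all($lookbehind, $sample, $m1);

// Alternative: match "Email: " normally and capture the address in group 1.
// (?:...) groups without capturing, so the address is the only numbered group.
$capture = '/Email: ((?:\w+\.)*\w+@(?:\w+\.)*\w+(?:\w+\-\w+)*\.\w+)/';
preg_match_all($capture, $sample, $m2);

print_r($m1[0]); // whole matches, without "Email: "
print_r($m2[1]); // contents of the first capture group
```

Either way the address on the "To:" line is skipped, because it is not preceded by "Email: ".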

schwim · May 14, 2009

Hi there Sneaky,

Wow, I'm looking at a comma delimited file of 1,430 email addresses without the "Email: ", yay!

I'm trying to find it in the tutorial, but does the "<" in (?<=Email: ) mean that it will discard that part after it finds it in the match?

I can't thank you enough for your help. I have always steered clear of both regex and working with flatfiles through PHP and it really bit me in the keester on this one.

thanks again,
json

sneakyimp · May 14, 2009

regex is amazingly powerful when you get a bit of skill with it. I use the script below to test my regular expressions. You enter the text to be searched in the big textarea and your pattern in the text input and submit and it shows you all the matches returned by preg_match_all.

the < in the assertion does not mean that anything is discarded. It turns a lookahead assertion like (?=pattern) into a lookbehind assertion. Assertions do not 'use up' any characters. This info is in the PHP docs on regex syntax.
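A quick sketch of the difference, with a made-up string. In both cases the text inside the assertion has to be there, but it never ends up in the match.

```php
<?php
// Made-up sample text.
$text = 'Email: someone@example.com';

// Lookahead (?=...): "Email" must be FOLLOWED by ": ", which is not consumed.
preg_match('/Email(?=: )/', $text, $ahead);

// Lookbehind (?<=...): the word must be PRECEDED by "Email: ".
preg_match('/(?<=Email: )\w+/', $text, $behind);

echo $ahead[0], "\n";  // Email
echo $behind[0], "\n"; // someone
```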

file_get_contents and file_put_contents are also your friends when working with flat files. They make things quite simple. The only things to worry about when you use them are pretty much a) do you have permission to read/write the file and b) do you have enough memory or disk space to get/put it.
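If memory ever does become a concern, one option is to read the file a line at a time instead. This is only a sketch under some assumptions: a temp file stands in for the real mail file, and it relies on each address sitting on its own line that starts with "Email: ", as in the sample posted earlier.

```php
<?php
// A temp file stands in for the real mail file (hypothetical contents).
$source = tempnam(sys_get_temp_dir(), 'mail');
file_put_contents($source, "To: renee@domain.net\nEmail: jsullins@domain.net\n");

// Permission check before reading.
if (!is_readable($source)) {
    die("cannot read $source\n");
}

$addresses = array();
$handle = fopen($source, 'r');
// Read one line at a time so the whole file never sits in memory.
while (($line = fgets($handle)) !== false) {
    // Only lines that start with "Email: " are of interest.
    if (strpos($line, 'Email: ') === 0) {
        $addresses[] = trim(substr($line, strlen('Email: ')));
    }
}
fclose($handle);
unlink($source); // clean up the temp file

print_r($addresses);
```

For a file of 1,500 short messages file_get_contents is almost certainly fine; the loop only starts to matter for much larger files.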

Here's my regex tester form:

<?php

if (isset($_POST['Submit'])) {
	$error = ''; // stays empty unless a field is missing
	$pattern = $_POST['pattern'];
	if (!$pattern) {
		$error = "You didn't enter a pattern!";
	}
	$string = $_POST['string'];
	if (!$string) {
		$error = "You didn't enter a string to be searched!";
	}

if ($error) {
	// re-show the form with the submitted values filled in
	show_form($error, $_POST);
} else {
 // do the pattern match

	$matches = Array();
	$num_matches = preg_match_all($pattern, $string, $matches);

	if ($num_matches == 0) {
		$error = "NO MATCHES";
		show_form($error, $_POST);
	} else {
		show_form('', $_POST, $matches);
	} // if matches found
}
} else {
	show_form();
}

function report_html3($str_arg) {
	echo "<pre style='background-color:#eeeeee;'>" . 
htmlentities($str_arg) . 
"</pre>\n";
}

function show_form($error='', $data=Array(), $matches=Array()) {
?>
<html>
<head>
<title>Regular Expression Tester</title>
</head>
<body>
<h1>Testing Regular Expressions</h1>
<?php
	if ($error) {
		echo "<div style='color:#ff0000'>" . $error . "</div>\n";
	}
?>
<p>This form allows you to test a regular expression using <a href="http://php.net/preg_match_all">preg_match_all()</a></p>
<form method=POST>
String:<br>
<textarea name='string' cols='100' rows='10'><?= htmlentities(isset($data['string']) ? $data['string'] : '') ?></textarea><br><br>
RegEx Pattern:<br>
<input type=text name="pattern" size=100 value="<?= htmlentities(isset($data['pattern']) ? $data['pattern'] : '') ?>"><br><br>
<input type=submit value="Test Pattern" name="Submit">
</form>

<?php
	// $matches is empty on the first page load, so guard the index
	$num_matches = isset($matches[0]) ? count($matches[0]) : 0;
	if ($num_matches > 0) {
		echo "<hr>";
		echo "<h2>" . $num_matches . " MATCHES</h2>";
		foreach($matches as $key1=>$matches1) {
			foreach ($matches1 as $key2=>$value) {
				echo "<b>Match " . $key1 . "-" . $key2 . "</b>\n";
				report_html3($value);
			} // for each match
		} // for MAIN MATCH
	}	// if matches
?>
</body></html>
<?php
} // show_form()
?>

EDIT: added some links and removed gratuitous include from script.

schwim · May 15, 2009

Hi Sneaky and thanks very much for the regex tester. I'm finding it very helpful when viewing the tutorial. You get to try additional strings to see where something does and doesn't work.

If I can ask one more question. Do people actually become skilled enough with regex that they can write expressions on the fly or do they become knowledgeable enough to simply know the basics and where to look to get the specifics?

I ask because I've tried to learn a lot of things in my life and so far, I'm finding regex to be one of the most befuddling endeavors I've ever undertaken. I can't fathom being able to write some of the expressions I'm finding in the tutorials. I use an email validation class in many of my scripts that uses a regex expression so large, that I can't imagine how they figured it out.

sneakyimp · May 15, 2009

Just like anything, regex becomes easier the more you do it. I'd be sunk without my testing form. That's where I go every single time I need to create an elaborate regex. I believe nrg_alpha, one of the forum members here, is quite the pro. Weedpacket is good at everything. Check out the expression in this thread. I can't imagine ever doing one that large myself, but I've come up with some that are a few hundred characters. They're kind of like Legos. Once you snap a few together, you can imagine how to build much larger ones.

Use the form, it will help.
