Unique array

dalecosp · Oct 2, 2014

I'm reading some code ... (full disclaimer, MINE ):

foreach ($list as $item) {
   if (preg_match($some_regexp,$item,$matched)) {
      $key = $matched[1];
      $some_array[$key] = $item;
   }
}
//this is necessary?
$some_array = array_unique($some_array);

It looks as if the array is constructed in such a way that its keys (and by extension, values*) will be unique by default. Can anyone think of any case where it would NOT be unique? Can I chuck the array_unique() call? It looks pretty danged expensive....

All my testing says "yeah, take out the expensive call". I'm just not trusting my brain this week, I guess. Ouch....

(* this can't be proven in the abstract ... or logically, without knowledge of the data set. However, I have that knowledge and the data will be unique IF the key is ... )

Weedpacket · Oct 3, 2014

Is there or was there a circumstance where $some_array is not empty before the loop?

(And I can see an argument for claiming that if the key is unique then so is the value: the key is part of the value - if given two items the same regex finds different keys, then the items themselves must be different.)
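
A minimal sketch of why the "empty before the loop" question matters. The regex, the URLs, and the build() helper are all made up for illustration; they are not from the original code.

```php
<?php
// Hypothetical pattern and data, for illustration only.
$pattern = '/_(\d+)\.html$/';
$list = [
    'site.com/red_widget_12345.html',
    'site.com/blue_widget_67890.html',
];

// Hypothetical helper wrapping the loop from the first post.
function build(array $list, string $pattern, array $some_array = []): array
{
    foreach ($list as $item) {
        if (preg_match($pattern, $item, $matched)) {
            $some_array[$matched[1]] = $item;
        }
    }
    return $some_array;
}

// Starting empty: every value contains its own key, so array_unique() removes nothing.
$clean = build($list, $pattern);
$cleanIsUnique = count($clean) === count(array_unique($clean));

// Starting with a leftover entry under an unrelated key: a duplicate value survives the loop.
$dirty = build($list, $pattern, ['stale' => 'site.com/red_widget_12345.html']);
$dirtyIsUnique = count($dirty) === count(array_unique($dirty));

var_dump($cleanIsUnique, $dirtyIsUnique);
```

So the argument holds only as long as nothing else has written into the array under a key that didn't come from the regex.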

NogDog · Oct 3, 2014

If you want to make sure the source array has unique values:

foreach(array_keys(array_flip($list)) as $item) {

Could be faster as no sorting is involved -- to keep in mind in case you actually want the sorting side-effect of array_unique().
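
For comparison, a small sketch of the flip trick next to array_unique(). The sample strings are invented, and this says nothing about which one is faster on real data; that would need measuring.

```php
<?php
// Made-up sample input with one repeated string.
$list = ['a.html', 'b.html', 'a.html', 'c.html'];

// array_flip() turns values into keys, so repeats collapse onto a single key;
// array_keys() then turns the surviving keys back into a list.
$viaFlip = array_keys(array_flip($list));

// array_unique() keeps the first occurrence under its original key,
// so array_values() reindexes the result for a fair comparison.
$viaUnique = array_values(array_unique($list));

var_dump($viaFlip === $viaUnique);
```

One caveat: array_flip() only handles integer and string values, and numeric strings such as "123" come back from array_keys() as integers, so the two approaches are not interchangeable for every input.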

dalecosp · Oct 3, 2014

NogDog;11043299 wrote:
If you want to make sure the source array has unique values:
foreach(array_keys(array_flip($list)) as $item) {
Could be faster as no sorting is involved -- to keep in mind in case you actually want the sorting side-effect of array_unique().

It's not really necessary to sort. That's a cool thing that might work. However, wouldn't it also have some expense? We're talking about an array of multiple tens of thousands of strings ... if I can be reasonably sure that the assignment will eliminate duplicates, I'd like to just do the assignment and go on.

Weedpacket wrote:
Is there or was there a circumstance where $some_array is not empty before the loop?

No.

(And I can see an argument for claiming that if the key is unique then so is the value: the key is part of the value - if given two items the same regex finds different keys, then the items themselves must be different.)

That's kind of, I think, what I'm assuming. These are lists of URLs, constructed something like this:

site.com/meaningful_search_term_combination_related_to_content_detail_12345.html

I'm extracting the numeric part (12345 in the example) to be the key. As near as I can tell, these URLs are created by mod_rewrite, and it's actually passing "12345" to the script as a page ID. I have found differing "meaningful_search_term" strings with the same numeric code (and therefore the same page content), and I'm trying to avoid duplicates as a result.
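
Roughly what that looks like, as a sketch. The pattern is a guess at the URL shape from the example above, and the URLs are invented; the real $some_regexp may well differ.

```php
<?php
// Assumed URL shape, based on the example: ..._<digits>.html
$pattern = '/_(\d+)\.html$/';

// Invented URLs: the first two differ in the search-term part but share a page ID.
$list = [
    'site.com/cheap_blue_widgets_12345.html',
    'site.com/best_widget_deals_12345.html',
    'site.com/garden_gnomes_67890.html',
];

$some_array = [];
foreach ($list as $item) {
    if (preg_match($pattern, $item, $matched)) {
        // A later URL with the same page ID overwrites the earlier one.
        $some_array[$matched[1]] = $item;
    }
}

print_r($some_array);
```

Three URLs go in and two entries come out, one per page ID, with the last URL seen for each ID winning.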

I'm fairly confident this will work unless we can prove otherwise here. Thanks for your time, guys.

NogDog · Oct 3, 2014

dalecosp;11043301 wrote:
...We're talking about an array of multiple tens of thousands of strings ...

Well, my first thought in that case is don't use arrays at all. They are not at all efficient in PHP, and before you even start worrying about manipulating arrays you may be dealing with the fact that that one initial array will be using a lot of RAM. It might be time to take one step back and see if you can accomplish whatever the larger objective is by either iterating through a query result or processing a file handle line by line (as opposed to loading it into an array).
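
A rough sketch of the line-by-line idea, assuming the URLs sit one per line in a file. The temp file, its contents, and the pattern are stand-ins so the example runs on its own.

```php
<?php
// Hypothetical setup: write a few URLs to a temp file so the sketch is runnable.
$path = tempnam(sys_get_temp_dir(), 'urls');
file_put_contents($path, "site.com/a_1.html\nsite.com/b_2.html\nsite.com/c_1.html\n");

$pattern = '/_(\d+)\.html$/'; // assumed URL shape
$seen = [];

$handle = fopen($path, 'r');
while (($line = fgets($handle)) !== false) {
    $url = rtrim($line, "\r\n");
    if (preg_match($pattern, $url, $matched)) {
        // Only the deduplicated entries stay in memory, never the full list.
        $seen[$matched[1]] = $url;
    }
}
fclose($handle);
unlink($path);

var_dump(count($seen));
```

The deduplicated array still lives in RAM, but the full source list never has to be loaded as a second array alongside it.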

dalecosp · Oct 6, 2014

Thanks. The data does have to be stored someplace and it's not a DB, so do you think we'd benefit from writing a temp file in this situation as opposed to just holding the URLs in RAM (an array) and then trashing it? Let's say ... 30K URLs?

NogDog · Oct 6, 2014

I guess that depends on what the actual source of the data is, and whether or not there's some way to iterate directly on the source? I mean, if you pull some API request into a big string in memory, then write it to a file (and unset() the string) so that you can then iterate through it, I suppose the question is how much more memory would the array of that data take up than that simple string variable? Then again, maybe you could do something like:

file_put_contents('tmp_file_name', $curl->exec());

Sorry if I'm just rambling ... sort of stream-of-consciousness stuff here.

sneakyimp · Oct 8, 2014

If your objective is just to collect all the numeric keys such that you have only one user-friendly string for each key, it seems to me pretty obvious that you'll have one and only one instance of each key and that you'll always end up with the user-friendly string of the last url you encounter with a given numeric key. Each time you encounter the key "12345" it should over-write any previous values you may have encountered that also had the key "12345"

$list = array(
	"1_url1",
	"2_url2",
	"3_url3",
	"1_foo1",
	"9_url9",
	"4_url4",
	"3_foo3"
);

$result = array();
foreach($list as $str) {
	$parts = explode("_", $str);
	$result[intval($parts[0])] = $parts[1];
}
var_dump($result);

The question I have is, "do you ever have one user-friendly url string that occurs with two different numeric keys."

$list = array(
  "12345_friendly-string-one",
  "67890_friendly-string-one"
);

dalecosp · Oct 8, 2014

sneakyimp;11043357 wrote:
If your objective is just to collect all the numeric keys such that you have only one user-friendly string for each key, it seems to me pretty obvious that you'll have one and only one instance of each key and that you'll always end up with the user-friendly string of the last url you encounter with a given numeric key. Each time you encounter the key "12345" it should over-write any previous values you may have encountered that also had the key "12345"
$list = array(
	"1_url1",
	"2_url2",
	"3_url3",
	"1_foo1",
	"9_url9",
	"4_url4",
	"3_foo3"
);

$result = array();
foreach($list as $str) {
	$parts = explode("_", $str);
	$result[intval($parts[0])] = $parts[1];
}
var_dump($result);
The question I have is, "do you ever have one user-friendly url string that occurs with two different numeric keys."
$list = array(
  "12345_friendly-string-one",
  "67890_friendly-string-one"
);

Thank you ... that is indeed the question I was attempting to determine the answer to, without dumping 30K URLs and doing a visual diff. Assuming the regexp works, as I hope and as Weedpacket observes:

Weedpacket wrote:
(And I can see an argument for claiming that if the key is unique then so is the value: the key is part of the value - if given two items the same regex finds different keys, then the items themselves must be different.)

Then I assume that yes, I'll have a unique array. But I wanted to be 99 44/100% sure of this fact, and I've not been sure how to guarantee this to myself.

Derokorian · Oct 8, 2014

// error_log() rather than log(): in PHP, log() is the natural logarithm
error_log("count before unique: " . count($array));
$array = array_unique($array);
error_log("count after unique: " . count($array));

Just throw that in there. If, after it's run a few times, the numbers never differ, then it seems safe to assume they never will, and the call can be removed.

Derokorian · Oct 8, 2014

sneakyimp;11043357 wrote:
The question I have is, "do you ever have one user-friendly url string that occurs with two different numeric keys."
$list = array(
  "12345_friendly-string-one",
  "67890_friendly-string-one"
);

The problem is that your sample code uses the two parts separately, but in the original post he uses the entire original string as the value. Therefore, if the key is unique, the value is unique, since the value contains the key. Any identical values will then have identical keys.
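
To make the difference concrete, here is a small sketch using the two strings from the quoted sample. The variable names $partOnly and $wholeItem are made up for the comparison.

```php
<?php
$list = [
    '12345_friendly-string-one',
    '67890_friendly-string-one',
];

$partOnly = [];  // value is only the friendly part, as in the quoted sample
$wholeItem = []; // value is the whole string, as in the original post
foreach ($list as $str) {
    $parts = explode('_', $str, 2);
    $key = intval($parts[0]);
    $partOnly[$key] = $parts[1];
    $wholeItem[$key] = $str;
}

// Storing only the friendly part: two keys share one value, so array_unique() drops one.
var_dump(count(array_unique($partOnly)));

// Storing the whole string: each value embeds its key, so nothing is dropped.
var_dump(count(array_unique($wholeItem)));
```

With the whole string stored, array_unique() has nothing to remove, which is the situation in the original loop.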
