Thursday, June 06, 2013

Find character frequency in a string

This is a very common problem we are going to solve. We have a string like "abcd1abcd1abcd1" and we need to find out how many times each character appears within the string. In the string "abcd1abcd1abcd1", we see that characters "a","b","c","d","1" appear 3 times each in the string. Let's write the basic logic to achieve this ::

i) Extract unique characters from the given string
ii) Create a second array DICT with unique characters
iii) Iterate the DICT array and for each character, find total number of appearances in given string.

Let's implement the above logic in C language.

#include "stdio.h"
#include "string.h"

main()
{

 /* Example String */
 char *str = "abcd1abcd1abcd1";

 /* Define an ARRAY Where unique chars will be stored */
 static char dict[26];

 int i,j,k=0;

 /* EXTRACT Unique chars in the string */
 for(i=0;i<strlen(str); i++)
 {
   /* Check if this appears in dict array */
   int found = 0;
   for(j=0;j<sizeof(dict);j++)
   {
    if(dict[j] == str[i])
     {
       found++;
       break;
     }
   }
 
   /* If not found, then ASSIGN to dict */
   if( !found )
     dict[k++] = str[i];
 }


 /* Now find count for each unique char */
 for(i=0; dict[i] ; i++)
 {
   int count = 0;
   for(j=0;j<strlen(str);j++)
    if( str[j] == dict[i] )
     count++;
  
   /* Print the Count for each character */
   printf("\n The character [%c] appears %d times", dict[i], count );
 }
}


Output :
The character [a] appears 3 times
The character [b] appears 3 times
The character [c] appears 3 times
The character [d] appears 3 times
The character [1] appears 3 times

The advantage in the above method is that it can tell us the count of all characters including alphanumeric, special chars. However, we have another small solution to the above problem with a constraint that all the characters in the given string must be alphabets. Here we use an array called freq_count whose length is 26 to keep count of 26 characters only. For example freq_count[0] holds count of char 'a', freq_count[1] holds count of char 'b' and so on. Check it out below..

#include "stdio.h"
#include "string.h"

main()
{

 /* Example String */
 char *str = "abcdabcdabcd";

 /* Define an ARRAY for keeping Frequency count */
 static int freq_count[26];

 char ch;
 int i;

 /* Iterate through the string */
 for(i=0;i<strlen(str); i++)
 {

   /* IF CAPS alphabets */
   if( ( str[i]>=65 && str[i]<=90 ) )
   {
       freq_count[ str[i] - 65 ] ++;
   }
   

   /* For small letters */   
   else if( str[i]>=97 && str[i]<=122 )
   {
      freq_count[ str[i] - 97 ] ++;
   }
 }

 /* Now PRINT result */
 for(i=0; i<sizeof(freq_count)/2 ; i++)
 {
   /* Show count if character appears */
   if( freq_count[i] > 0 )
     printf("\n The character %c appears %d times", i + 65, freq_count[i]  );
 }
}


Output ::
The character A appears 3 times
The character B appears 3 times
The character C appears 3 times
The character D appears 3 times

The checking str[i]>=65 && str[i]<=90 and str[i]>=97 && str[i]<=122 checks for alphabets and if so then only freq_count array is updated. So, if the given string has numeric characters like '1','2' etc, they would be ignored.

We see that ASCII codes between 32 and 126 (95 characters) are printable; they contain alphanumeric and special characters including space. Hence we can easily capture their frequency also if we increase the freq_count array length. Check out the implementation below :

#include "stdio.h"
#include "string.h"

main()
{

 /* Example String */
 char *str = " aa!~~ 99! ";

 /* Define an ARRAY for keeping Frequency count */
 static int freq_count[95];

 char ch;
 int i;

 /* Iterate through the string */
 for(i=0;i<strlen(str); i++)
 {

   /* FOR any character */
   freq_count[ str[i] - 32 ] ++;
 }

 /* Now PRINT result */
 for(i=0; i<sizeof(freq_count)/2 ; i++)
 {
   /* Show count if character appears */
   if( freq_count[i] > 0 )
     printf("\n The character [%c] appears %d times", i + 32, freq_count[i]  );
 }
}


Output :
The character [ ] appears 3 times
The character [!] appears 2 times
The character [9] appears 2 times
The character [a] appears 2 times
The character [~] appears 2 times

The first line of output says count of whitespace appearing within the string.

PHP Implementation : 


Let's try to write above implementation in PHP. Check out the code below ...

<?php
 // Example String
 $str = " aa!~~ 99! ";

 // Define an ARRAY for keeping Frequency count
 $freq_count = array();

  // Iterate through the string
 for($i=0;$i<strlen($str); $i++)
 {
   // FOR any character, GET ASCII
   $index = ord($str[$i]) ;
  
   // Create Associative array
   if( isset( $freq_count[ $index ] ) )
     $freq_count[ $index ]++;
   else
     $freq_count[ $index ] = 1;
   
 }

 // Now PRINT result

 foreach($freq_count as $key=>$val)
 {
   // Convert to character
   $ch = chr($key);
   // Print
   echo "<br> The character [$ch] appears $val times" ;
 }
?>


The code above runs jolly well. PHP has an advantage in its array structures. C Array starts with index 0 while PHP associative arrays can be built with key=>value structure and that key can be anything. So, in the first loop we create array indexes like $freq_count[32] (for space) or $freq_count[97] for character 'a' etc. After the $freq_count array is created, it would look like this :

Array ( [32] => 3 [97] => 2 [33] => 2 [126] => 2 [57] => 2 )

[97] => 2 means character with ASCII 97 ('a') appears twice,
[32] => 3 means character with ASCII 32 (space) appears thrice and so on.

Finally we iterate though this array using foreach() loop construct and convert those keys to corresponding characters using chr() method. PHP's arrays are more dynamic in nature than in C. We don't have to define an array of 95 integers like in C. However above C program can be modified to create dynamic linked list at runtime to keep track of characters within the string. This would definitely save us some memory.
 

JavaScript Implementation : 

Let us write the same program in JavaScript. Check out the code snippet below.

<script>
 // Example String
 var $str = " aa!~~ 99! ";

 // Define an ARRAY for keeping Frequency count
 $freq_count = []

 // Iterate through the string
 for(var $i=0;$i<$str.length; $i++)
 {
   // FOR any character, GET ASCII
   $index =  $str.charCodeAt($i) ;
  
   // Associative array
   if( $freq_count[ $index ] === undefined )
     $freq_count[ $index ] = 1;
   else
     $freq_count[ $index ]++;
   
 }

 // Now BUILD result string
 $msg = "";
 for(var $key in $freq_count)
 {
   // Convert to character using fromCharCode()
   $msg += "[" + String.fromCharCode($key) + "]";
   $msg += " appears " + $freq_count[$key] + " times \n";
 }

 // PRINT on Screen
 alert($msg);
 </script>


The output is shown below :


We use String.charCodeAt(position) method to get the ASCII code of the character at position within the string. And we use String.fromCharCode($key) method to convert any Integer back to character. The PHP and Javascript codes are quite similar and easy to understand.

Javascript array is same as C in one respect and that is array index always starts at 0. For example if we define an array and set its 100 index to 10, all previous indexed will be created and filled up with undefined.

var arr  = []; // Define array
arr[100] = 10; // Set index 100


Index 0 to 99 would be automatically created and their values would be undefined. However when iterating through for(x in var) loop, only those key which have values other than undefined will be displayed. PHP array are not like this, they are more dynamic in nature.

// Below JS LOOP would print only 10
for(var x in arr)
  console.log( arr[x] );

1 comment:

Anonymous said...

Unicode-aware version of the PHP code (Javascript code is already Unicode-aware, C code is not able to be easily fixed.

$freq_count = array();
$len = mb_strlen($str, 'UTF-8');
// Iterate through the string
for ($i = 0; $i < $len; $i++) {
$ch = mb_substr($str, $i, 1, 'UTF-8');
// FOR any character, GET ASCII
//$index = ord($str[$i]);

// Create Associative array
if (isset($freq_count[$ch]))
$freq_count[$ch] ++;
else
$freq_count[$ch] = 1;
}