Manipulate a String That Is 30 Million Characters Long

PHP is choking because it's running out of memory. Instead of having curl populate a PHP variable with the contents of the file, use the CURLOPT_FILE option to save the file directly to disk.

// Pseudo, untested code to give you the idea.
// $ch is your existing cURL handle.

$fp = fopen('path/to/save/file', 'w');
curl_setopt($ch, CURLOPT_FILE, $fp); // write the response body to $fp instead of returning it
curl_exec($ch);
curl_close($ch);
fclose($fp);

Then, once the file is saved, instead of using the file or file_get_contents functions (which would load the entire file into memory, killing PHP again), use fopen and fgets to read the file one line at a time.
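The line-at-a-time idea is not specific to PHP. As a rough illustration only, here is a minimal Python sketch (not the PHP code the answer refers to) that scans a file while holding a single line in memory. The function name `count_matching_lines` and the substring test are made up for the example.

```python
import os
import tempfile


def count_matching_lines(path, needle):
    """Count lines containing `needle`, holding only one line in memory at a time."""
    count = 0
    # Iterating over the file object reads it lazily, line by line,
    # much like a PHP fopen/fgets loop.
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if needle in line:
                count += 1
    return count


if __name__ == "__main__":
    # Build a small throwaway file so the sketch is runnable on its own.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as tmp:
        tmp.write("alpha\nbeta\nalphabet\n")
    print(count_matching_lines(tmp.name, "alpha"))
    os.remove(tmp.name)
```

Memory use stays roughly constant however large the file is, because no step ever asks for the whole contents at once.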

How can I use a very large string (500 million characters) in my program?

With the information available, I would declare a bit array (with an initial size of file.length), then open the file and read it in chunks of perhaps 4096 characters, looking through each chunk in turn.

In the loop, do a simple check on each character: if it is 1, set the bit to true; otherwise set it to false.

Repeat this until you reach the end of the file. You then have the whole thing in one large bit array variable.

From that point on, you just need to find your pattern.
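The chunked loop described above might look roughly like this in Python. This is a sketch under assumptions: the input is taken to contain only the characters 0 and 1, the helper names `load_bits` and `get_bit` are invented here, and a plain `bytearray` stands in for a dedicated bit array type.

```python
import io


def load_bits(stream, chunk_size=4096):
    """Read '0'/'1' characters from a text stream in chunks and pack them into a bytearray."""
    bits = bytearray()  # grows as needed; holds 8 flags per byte
    total = 0           # number of flags stored so far
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:   # an empty string means end of input
            break
        for ch in chunk:
            if total % 8 == 0:
                bits.append(0)  # start a new byte
            if ch == "1":       # anything other than '1' is stored as False
                bits[total // 8] |= 1 << (total % 8)
            total += 1
    return bits, total


def get_bit(bits, index):
    """Return True if the flag at `index` is set."""
    return bool(bits[index // 8] & (1 << (index % 8)))


if __name__ == "__main__":
    # io.StringIO stands in for an opened text file.
    packed, count = load_bits(io.StringIO("1011"), chunk_size=2)
    print(count, [get_bit(packed, i) for i in range(count)])
```

Packed this way, 500 million flags take about 60 MB, against roughly 1 GB for the same data held as a string of two-byte characters.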

Taking large (1 million) number of substring (100 character wide) reads from a long string (3 million characters)

Using a StringBuilder to assemble the string will get you a 600-fold speedup, as it avoids creating a new string object every time you append.

before loop (initialising capacity avoids recreating the backing array in StringBuilder):

StringBuilder sb = new StringBuilder(1000000 * ReadLength);

in loop:

sb.Append(all.Substring(randomPos, ReadLength) + Environment.NewLine);

after loop:

readString = sb.ToString();

Using a char array instead of a string to extract the values yields another 30% improvement, as you avoid the object creation incurred when calling Substring():

before loop:

char[] chars = all.ToCharArray();

in loop:

sb.Append(chars, randomPos, ReadLength);
sb.AppendLine();

Edit (final version which does not use StringBuilder and executes in 300ms):

char[] chars = all.ToCharArray();
var iterations = 1000000;
char[] results = new char[iterations * (ReadLength + 1)]; // +1 for the separator after each read
GetRandomStrings(chars.Length, iterations, ReadLength, chars, results, 0);
string s = new string(results);

private static void GetRandomStrings(int len, int iterations, int ReadLength, char[] chars, char[] result, int resultIndex)
{
    Random random = new Random();
    int maxStart = len - ReadLength; // exclusive upper bound for start positions (len is 3000000)
    if (maxStart <= 0) return;       // source is too short for even one read
    int index = resultIndex;
    for (int i = 0; i < iterations; i++)
    {
        int randomPos = random.Next(maxStart);

        Array.Copy(chars, randomPos, result, index, ReadLength);
        index += ReadLength;
        result[index] = '\n'; // single-char separator; Environment.NewLine[0] is '\r' on Windows
        index++;
    }
}
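For comparison, the same preallocate-and-copy technique can be sketched in Python. This is an illustration rather than a port of the C# above: the function name `random_reads` and the `seed` parameter are invented here, and the input is assumed to be ASCII-only so that one character maps to one byte.

```python
import random


def random_reads(source, iterations, read_length, seed=None):
    """Copy `iterations` random slices of `read_length` chars into one preallocated buffer."""
    max_start = len(source) - read_length
    if max_start <= 0:
        raise ValueError("source must be longer than read_length")
    rng = random.Random(seed)
    data = source.encode("ascii")         # assumption: ASCII-only input
    stride = read_length + 1              # each read plus a '\n' separator
    out = bytearray(iterations * stride)  # allocate the whole result once
    for i in range(iterations):
        start = rng.randrange(max_start)
        offset = i * stride
        out[offset:offset + read_length] = data[start:start + read_length]
        out[offset + read_length] = 0x0A  # '\n'
    return out.decode("ascii")


if __name__ == "__main__":
    print(random_reads("ACGT" * 10, 3, 5, seed=0), end="")
```

The output buffer is sized once up front and each read is written at a computed offset, so nothing is reallocated inside the loop.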

Serving downloads with PHP and cURL

Yes, exactly as it says, readfile expects a filename, and you're giving it a string of the curl result. Just use echo after setting the proper headers:

echo $megaupload->download_file("http://www.megaupload.com/?d=PXNDZQHM");

EDIT: Also you need to do some basic checks that you got something back:

public function download_file($link)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $link);
    curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . "/cookies/megaupload.txt");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
    // close before you return
    curl_close($ch);
    return $result;
}

Then:

$filecontents = $megaupload->download_file("http://www.megaupload.com/?d=PXNDZQHM");
if($filecontents) {
  echo $filecontents;
}
else {
  die("Yikes, nothing here!");
}

What is the easiest/best/most correct way to iterate through the characters of a string in Java?

I use a for loop to iterate the string and use charAt() to get each character to examine it. Since the String is implemented with an array, the charAt() method is a constant time operation.

String s = "...stuff...";

for (int i = 0; i < s.length(); i++){
    char c = s.charAt(i);        
    //Process char
}

That's what I would do. It seems the easiest to me.

As far as correctness goes, I don't believe that exists here. It is all based on your personal style.

Compress an Integer to a Base64 upper/lower case character in Ruby (to make an Encoded Short URL)

As others here have said, Ruby's Base64 encoding is not the same as converting an integer to a string using a base of 64. Ruby provides an elegant converter for this, but the maximum base is base-36. (See @jad's answer.)

The code below brings everything together into two methods for encoding and decoding as base-64.

def encode(int)
  chars = [*'A'..'Z', *'a'..'z', *'0'..'9', '_', '!']
  digits = int.digits(64).reverse
  digits.map { |i| chars[i] }.join
end

And to decode

def decode(str)
  chars = [*'A'..'Z', *'a'..'z', *'0'..'9', '_', '!']
  digits = str.chars.map { |char| chars.index(char) }.reverse
  output = digits.each_with_index.map do |value, index|
    value * (64 ** index)
  end
  output.sum
end

Give them a try:

puts output = encode(123456) #=> "eJA"
puts decode(output) #=> 123456

The compression is pretty good: any integer below 64^5 (1,073,741,824) encodes down to at most 5 characters. For example, 900,000,000 encodes to "1pOkA".
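To show that the scheme is plain positional notation and not tied to Ruby's `digits`, here is a minimal Python sketch of the same two operations. It assumes the same 64-symbol alphabet as the Ruby methods above; the names `ALPHABET`, `encode`, and `decode` are chosen for the example.

```python
import string

# Same 64-symbol alphabet as the Ruby methods: A-Z, a-z, 0-9, '_', '!'.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "_!"


def encode(number):
    """Convert a non-negative integer to its base-64 digit string."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, 64)  # peel off the lowest digit
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))            # most significant digit first


def decode(text):
    """Convert a base-64 digit string back to an integer."""
    value = 0
    for char in text:
        value = value * 64 + ALPHABET.index(char)
    return value


if __name__ == "__main__":
    token = encode(123456)
    print(token, decode(token))
```

Each extra character multiplies the representable range by 64, which is why the encoded form stays short even for large IDs.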

The extra compression comes from including both upper- and lower-case characters, so base-64 is inherently case-sensitive. If you want this to be case-insensitive, using the built-in base-36 method per Jad's answer is the way to go.

Credit to @stefan for help with this.

Most efficient way to remove special characters from string

Why do you think that your method is not efficient? It's actually one of the most efficient ways that you can do it.

You should of course read the character into a local variable or use an enumerator to reduce the number of array accesses:

public static string RemoveSpecialCharacters(this string str) {
   StringBuilder sb = new StringBuilder();
   foreach (char c in str) {
      if ((c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '.' || c == '_') {
         sb.Append(c);
      }
   }
   return sb.ToString();
}

One thing that makes a method like this efficient is that it scales well. The execution time is proportional to the length of the string, so there are no nasty surprises if you use it on a large string.

Edit:

I made a quick performance test, running each function a million times with a 24 character string. These are the results:

Original function: 54.5 ms.

My suggested change: 47.1 ms.

Mine with setting StringBuilder capacity: 43.3 ms.

Regular expression: 294.4 ms.

Edit 2:
I added the distinction between A-Z and a-z in the code above. (I reran the performance test, and there is no noticeable difference.)

Edit 3:

I tested the lookup+char[] solution, and it runs in about 13 ms.

The price to pay is, of course, the initialization of the huge lookup table and keeping it in memory. Well, it's not that much data, but it's a lot for such a trivial function...

private static bool[] _lookup;

static Program() {
   _lookup = new bool[65536];
   for (char c = '0'; c <= '9'; c++) _lookup[c] = true;
   for (char c = 'A'; c <= 'Z'; c++) _lookup[c] = true;
   for (char c = 'a'; c <= 'z'; c++) _lookup[c] = true;
   _lookup['.'] = true;
   _lookup['_'] = true;
}

public static string RemoveSpecialCharacters(string str) {
   char[] buffer = new char[str.Length];
   int index = 0;
   foreach (char c in str) {
      if (_lookup[c]) {
         buffer[index] = c;
         index++;
      }
   }
   return new string(buffer, 0, index);
}
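The lookup-table idea carries over to other languages. Below is a minimal Python sketch of it, offered as an illustration only: the names `ALLOWED` and `remove_special_characters` are chosen here, and the timings quoted above apply to the C# versions, not to this code.

```python
import string

# Precomputed table: ALLOWED[code] is True for characters that should be kept.
# 65536 entries mirrors the bool[65536] table in the C# version.
ALLOWED = [False] * 65536
for _ch in string.ascii_letters + string.digits + "._":
    ALLOWED[ord(_ch)] = True


def remove_special_characters(text):
    """Keep only 0-9, A-Z, a-z, '.' and '_' using one table lookup per character."""
    kept = []
    for ch in text:
        code = ord(ch)
        # Python strings can hold code points above 0xFFFF, so guard the index.
        if code < 65536 and ALLOWED[code]:
            kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    print(remove_special_characters("He!!o_W0rld. #42"))
```

The table replaces a chain of range comparisons with a single indexed read per character, at the cost of keeping the table in memory.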
