R Fork Bomb

So maybe I’m a strange guy, but I think fork bombs are really funny.  What’s a fork bomb?  The basic premise is that you spawn a process that spawns a process that spawns a process…, ad infinitum.

The most beautiful example of a fork bomb, and really one of the most beautiful lines of code ever, was created by Denis Roio:

:(){ :|:& };:

Aside from looking like the gnarliest smiley face ever, running the above in a Unix terminal will spawn processes forever, unless you’ve limited the number of processes that can be run (I think OS X does this by default, so any brave Mac users are encouraged to try).  After a whole lot have been spawned, you’ve basically locked up your system and will have to reboot.

There’s a thorough explanation of how and why the above code works over at the Wikipedia fork bomb page, but I almost feel like knowing how it works ruins the beauty somewhat.  But that page also has lots of examples of fork bombs in various languages.  Unfortunately, R is absent from that list.  Well I think that’s just a shame.

We could invoke the above (with some modifications due to a weird quirk) with the system() function.  But that’s basically just letting the shell have all the fun.  However, we can do one better using the multicore package:

library(multicore)

while (TRUE) fork()

So why would you want to do this?  Well, I think the whole point is that you wouldn’t want to.  It’s just neat.

One final remark on this.  I have a good friend with whom I regularly converse about programming.  He’s a hardcore C programmer, and as such, he finds my struggles with efficiency in R hilarious.  When I told him that I was going to make a blog post about an R fork bomb, he had a simple, two-word response:

“Slowest forkbomb.”

Who says programmers don’t have a sense of humor?

How Much of R is Written in R Part 2: Contributed Packages

So that mean old boss of mine is at it again.  This morning I came in beaming about how many people had read my post How Much of R is Written in R (thanks, by the way!).  He then asked me about one little line in that post: the one claiming that if you looked at the contributed packages, you’d overwhelmingly see them written in R, and that this is what really matters.

He asked if I had bothered to verify this.  I had not.  So off he sent me to do this too.  Can you believe the nerve of this character?  Making me do work.  I’m paid by a university; I’m not supposed to do work!

So think of this post as the quick cash-in sequel that probably no one wanted–the “Transformers 2” of R-bloggers posts, if you will.  I don’t mind admitting that I had to use every ounce of restraint to not call this “… Part 2:  Electric Boogaloo”.

This task is a little more cumbersome than the last one.  Unfortunately this means I have to actually be slightly careful in my shellscript, rather than just playing it fast and loose and making all the hardcore programmers angry by being so sloppy (spoiler alert:  they will still be angry).  I won’t go into the boring details about how I generated the data.  Let’s just have a look at the results; though, as always, all code and generated output will be attached.

Oh, but before we go on, if you thought Part 1 was interesting, you should go check out http://www.ohloh.net/p/rproject.  Someone linked it in the comments of Part 1, and it’s definitely impressive.  It has most of the information from that post and then some, in a much cooler graph.  From the drop-down menu, select “Languages”.  The graph is interactive, and if you (for instance) click on C (to hide it), the graph rescales.  Pretty cool!

Ok, so the results then.

We will again first look at the Language breakdown of the number of total files:

Here R is the clear victor by an overwhelming majority.  But the real challenge is looking at the language breakdown by percentage contribution to total lines of code (for our purposes here, we are looking at all ~3000 contributed packages and lumping them together as one huge mass of sourcecode).  The following graph provides that, but first a word of caution.  It’s a bit difficult to tell whether a .h, or even a .c, file is C or C++ (in fact, you can fool the Unix/Linux tool “file” into believing these files are all kinds of crazy things).  As such, some choices had to be made (see the script if you’re interested), and errors likely occurred in determining what is C code and what is C++ code.  That said, having made our choices, we have the following breakdown:

Here R is the clear victor, though in light of the doubt cast above, it is perhaps not as strong of a victory as we would hope.  To assuage all doubt, we can lump C and C++ together into one category:

And look at that; R is still winning.  Now, you probably see where this is headed:

Which is pretty darn good!  In short, the people who are outside of the core R team but who are still developing incredibly cool things with R…are making those things in R.  And why shouldn’t they?  R rocks!

Here’s all the boring crap nobody cares about:

First, just a quick heads up; all the sourcecode to follow will look much better on my blog than on R-bloggers, because I have everything formatted (with highlighted syntax) specially in WordPress.  So if you want to be able to read this, then I would recommend reading it directly from my blog.

The shell script to grab all the packages and generate the “csv” (output is tab delimited (because some screwy files have commas in them), changed to .xls because WordPress hates freedom, split into two pieces because xls is a dumb format:  piece 1    piece 2):

#!/bin/sh

outdir="wherever/"

wget -e robots=off -r -l1 --no-parent -A.tar.gz http://cran.r-project.org/src/contrib/

srcdir="cran.r-project.org/src/contrib"
cd $srcdir

echo -e "\n\n          This will take a LONG FREAKING TIME.\n          Please wait patiently.\n\n"

touch $outdir/r_contrib_loc_by_lang.csv

#echo "Language,File.Name,loc,Proj.Name,Proj.ID.Nmbr" >> $outdir/r_contrib_loc_by_lang.csv
echo -e "Language\tFile.Name\tloc\tProj.Name" >> $outdir/r_contrib_loc_by_lang.csv

for archive in *.tar.gz; do
  tar -zxf $archive
done

for found in `find . -name \*.r -or -name \*.R -or -name \*.pl -or -name \*.c -or -name \*.C -or -name \*.cpp -or -name \*.cc -or -name \*.h -or -name \*.hpp -or -name \*.f`; do
  loc=`wc -l $found | awk '{ print $1 }'`
  filename=`echo $found | sed -n 's:^\(.*/\)\([^/]*$\):\2:p'`
  proj=`echo $found | sed -e 's/.\///' -e 's/\/.*//'`
  lang=`echo $filename | sed 's/.*[.]//'`

  if [ $lang = "r" ]; then
    lang="R"
  elif [ $lang = "pl" ]; then
    lang="Perl"
  elif [ $lang = "C" ]; then
    lang="c"
  elif [ $lang = "cpp" ]; then
    lang="c++"
  elif [ $lang = "h" ]; then
    lang="c"
  elif [ $lang = "hpp" ]; then
    lang="c++"
  elif [ $lang = "f" ]; then
    lang="Fortran"
  elif [ $lang = "cc" ]; then
    # Use file for best guess; bad guesses we revert to c
    lang=`file $found | awk '{ print $3 }'`
    if [ $lang = "English" ] || [ $lang = "Unicode" ]; then
      lang="c"
    fi
  fi
  
  echo -e "$lang\t$filename\t$loc\t$proj"  >> $outdir/r_contrib_loc_by_lang.csv
done

echo -e "\n\n          ALL DONE\n\n"

And here’s the R script used to analyze everything and generate the barplots:

r.loc <- read.delim("r_contrib_loc_by_lang.csv", header=TRUE, stringsAsFactors=FALSE)

a <- r.loc[which(r.loc[1] == "R"), ][3]
b <- r.loc[which(r.loc[1] == "c"), ][3]
c <- r.loc[which(r.loc[1] == "c++"), ][3]
d <- r.loc[which(r.loc[1] == "Fortran"), ][3]
e <- r.loc[which(r.loc[1] == "Perl"), ][3]

lena <- length(a[, 1])
lenb <- length(b[, 1])
lenc <- length(c[, 1])
lend <- length(d[, 1])
lene <- length(e[, 1])

files.total <- lena + lenb + lenc + lend + lene
loc.total <- sum(a) + sum(b) + sum(c) + sum(d) + sum(e)

cat(sprintf("\nNumber R files:\t\t\t %d\nNumber C files:\t\t\t %d\nNumber C++ files:\t\t %d\nNumber Fortran files:\t\t %d\nNumber Perl files:\t\t %d\n", lena, lenb, lenc, lend, lene))
cat(sprintf("--------------------------------------"))
cat(sprintf("\nTotal source files examined:\t %d\n\n", files.total))

cat(sprintf("\nLines R code:\t\t %d\nLines C code:\t\t %d\nLines C++ code:\t\t %d\nLines Fortran code:\t %d\nLines Perl code:\t %d\n", sum(a), sum(b), sum(c), sum(d), sum(e)))
cat(sprintf("--------------------------------"))
cat(sprintf("\nTotal lines of code:\t %d\n\n", loc.total))

cat(sprintf("%% code in R:\t\t %f\n%% code in C:\t\t %f\n%% code in C++:\t\t %f\n%% code in Fortran:\t %f\n%% code in Perl:\t\t %f\n", 100*sum(a)/loc.total, 100*sum(b)/loc.total, 100*sum(c)/loc.total, 100*sum(d)/loc.total, 100*sum(e)/loc.total))

png("1_pct_contrib_source_files.png")
barplot(c(100*lena/files.total, 100*lenb/files.total, 100*lenc/files.total, 100*lend/files.total, 100*lene/files.total), main="Percent of Contrib Sourcecode Files By Language", names.arg=c("R","C","C++","Fortran","Perl"), ylim=c(0,70))
dev.off()

png("2_pct_contrib_code.png")
barplot(c(100*sum(a)/loc.total, 100*sum(b)/loc.total, 100*sum(c)/loc.total, 100*sum(d)/loc.total, 100*sum(e)/loc.total), main="Percent Contribution of Language to Contrib", names.arg=c("R","C","C++","Fortran","Perl"), ylim=c(0,50))
dev.off()

png("3_pct_contrib_code_c.cpp.combined.png")
barplot(c(100*sum(a)/loc.total, 100*(sum(b)+sum(c))/loc.total, 100*sum(d)/loc.total, 100*sum(e)/loc.total), main="Percent Contribution of Language to Contrib", names.arg=c("R","C/C++","Fortran","Perl"), ylim=c(0,50))
dev.off()

png("4_pct_contrib_code_r.vs.all.png")
barplot(c(100*sum(a)/loc.total, 100*(sum(b)+sum(c)+sum(d)+sum(e))/loc.total), main="Percent Contribution of Language to Contrib", names.arg=c("R","Everything Else"), ylim=c(0,60))
dev.off()

With output:

Number R files:			 46054
Number C files:			 9149
Number C++ files:		 9387
Number Fortran files:		 1684
Number Perl files:		 44
--------------------------------------
Total source files examined:	 66318

Lines R code:		 6009966
Lines C code:		 3519637
Lines C++ code:		 2214267
Lines Fortran code:	 645226
Lines Perl code:	 6910
--------------------------------
Total lines of code:	 12396006

% code in R:		 48.483084
% code in C:		 28.393315
% code in C++:		 17.862745
% code in Fortran:	 5.205112
% code in Perl:		 0.055744

How Much of R is Written in R?

My boss sent me an email (on my day off!) asking me just how much of R is written in the R language.  This is very simple if you use R and a Unix-like system.  It also gives me a good excuse to defend the title of this blog.  It’s librestats, not projecteulerstats, after all.

So I grabbed the R-2.13.1 source package from the cran and wrote up a little script that would look at all .R, .c, and .f files in the archive, record the language (R, C, or Fortran), number of lines of code, and the file the code came from; then it’s just a matter of dumping all that to a csv (converted to .xls (in LibreOffice) because WordPress hates freedom).

We’ll talk in a minute about just how you would generate that csv–but first let’s address the original question.

By a respectable majority, most of the source code files of core R are written in R:

At first glance, it seems like Fortran doesn’t give much of a contribution. However, when we look at the proportion of lines of code, we see something more reasonable:

So there you have it. Roughly 22% of R is written in R. I know some people want R to be written in R for some crazy reason; but really, if anything, that 22% is too high. Trust me, you really want C and Fortran to be doing all the heavy lifting so that things stay nice and peppy.

Besides, this is a fairly irrelevant issue, in my opinion.  What matters is that people outside of Core R are writing in R.  Look at the extra packages repo and you’ll see a very different story from the above graphic.  That’s something SAS certainly can’t say, since people who want to do anything other than call some cookie-cutter SAS proc have to use IML or that ridiculous SAS macro language–each of which is somehow even more of a hilarious mess than base SAS.

Ok, so how do we get that data?  I actually have a much better script than the one I’m about to describe.  The new one automatically grabs every source package from the cran that you don’t already have and starts digging in on them, dumping everything out into one big csv so you can watch the trend over time.  It’s interesting to see the transition from R being almost entirely (92%) in C to seeing it slowly drop down to ~52%.  But that’s a different post for a different day, because I have a few kinks to work out with that script before I would feel comfortable releasing it.

So here’s how this system works.  It’s basically the dumbest possible solution; I’m pretty good at those, if I may say so myself.  Basically the shell script hops across the R-version/src/ folder and gets a line count of each .R, .c, and .f file.  That’s it; here it is:

#!/bin/sh

outdir="/path/to/where/you/want/the/csv/dumped"

rdir="/path/to/R/source/root/directory/to/be/examined" #eg, ~/R-2.13.1/
cd $rdir/src

for rfile in `find . -name '*.R'`
do
	loc=`wc -l $rfile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///'`
	echo "R,$loc"  >> $outdir/r_source_loc.csv
done

for cfile in `find . -name '*.c'`
do
	loc=`wc -l $cfile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///'`
	echo "C,$loc"  >> $outdir/r_source_loc.csv
done

for ffile in `find . -name '*.f'`
do
	loc=`wc -l $ffile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///'`
	echo "Fortran,$loc"  >> $outdir/r_source_loc.csv
done
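Incidentally, if dropping to the shell offends your sensibilities, the same counting can be sketched in pure R.  This is just an illustration (the function name is mine), not the script I actually used:

```r
# Sketch: count lines of .R, .c, and .f files under src/, purely in R
count.loc <- function(rdir) {
  files <- list.files(file.path(rdir, "src"), pattern = "\\.(R|c|f)$",
                      recursive = TRUE, full.names = TRUE)
  langs <- c(R = "R", c = "C", f = "Fortran")
  loc <- vapply(files, function(f) length(readLines(f, warn = FALSE)),
                integer(1))
  data.frame(Language = unname(langs[sub(".*\\.", "", files)]),
             File.Name = basename(files), loc = loc, row.names = NULL)
}
```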

Then the R script just does exactly what you’d think, given the data (take a look at the “csv” for examples).

r.loc <- read.csv("r_source_loc.csv",header=FALSE)

a <-r.loc[which(r.loc[1] == "R"),][2]
b <-r.loc[which(r.loc[1] == "C"),][2]
c <-r.loc[which(r.loc[1] == "Fortran"),][2]

files.total <- length(a[,1])+length(b[,1])+length(c[,1])
loc.total <- sum(a)+sum(b)+sum(c)

cat(sprintf("\nNumber .R source files:\t\t %d\nNumber .c source files:\t\t %d\nNumber .f source files:\t\t %d\n",length(a[,1]),length(b[,1]),length(c[,1])))
cat(sprintf("-------------------------------------"))
cat(sprintf("\nTotal source files examined:\t %d\n\n",length(a[,1])+length(b[,1])+length(c[,1])))

cat(sprintf("\nLines of R code:\t %d\nLines of C code:\t %d\nLines of Fortran code:\t %d\n",sum(a),sum(b),sum(c)))
cat(sprintf("-------------------------------"))
cat(sprintf("\nTotal lines of code:\t %d\n\n",loc.total))

cat(sprintf("\nAmong all lines of code being either R, C, or Fortran:\n"))
cat(sprintf("%% code in R:\t\t %f\n%% code in C:\t\t %f\n%% code in Fortran:\t %f\n",100*sum(a)/loc.total,100*sum(b)/loc.total,100*sum(c)/loc.total))

png("pct_r_source_files.png")
barplot(c(length(a[,1])/files.total,length(b[,1])/files.total,length(c[,1])/files.total),main="Percent of Core R Sourcecode Files",names.arg=c("R","C","Fortran"))
dev.off()

png("pct_r_code.png")
barplot(c(100*sum(a)/loc.total,100*sum(b)/loc.total,100*sum(c)/loc.total),main="Percent of Core R Lines of Code",names.arg=c("R","C","Fortran"))
dev.off()

From the R script, we can get precise figures, which I prefer to pictures any day. But I seem to be an outlier in this regard…

Number .R source files:		 729
Number .c source files:		 586
Number .f source files:		 45
-------------------------------------
Total source files examined:	 1360

Lines of R code:	 149520
Lines of C code:	 346778
Lines of Fortran code:	 175409
-------------------------------
Total lines of code:	 671707

Among all lines of code being either R, C, or Fortran:
% code in R:		 22.259705
% code in C:		 51.626379
% code in Fortran:	 26.113916


Project Euler in R: Problem 10

Problem:  Find the sum of all the primes below two million.

Commentary:  If you’re not careful, this problem can really get you in R.  In poking around the internet, there are a few solutions to this problem, but all the ones I’ve tested are slow.  Some, even from people who are better at R than me, are VERY slow.  That linked one in particular took 6 minutes and 8.728000 seconds to complete.  My solution runs in under 2 seconds.

So we’ll be using that old prime algorithm I’ve already discussed to death.  Other than a few tricks in there that I never see get implemented, there’s a crucial trick to this solution.  Since we only want the sum, let’s not be slow about finding the primes we want.  I only have to find all primes below \sqrt{n}, where n=2000000.  It takes basically no time to find them.  From there, I start evaluating my sum, and like a good little R user, I vectorize by getting the list of possible primes above \sqrt{n}.  From there, I check if any of my known primes divide into those possible primes.  This is actually much faster than generating primes and storing them.

R Code:

ElapsedTime <- system.time({
##########################
n <- 2000000

### Function to get primes below specified n; option to get only those below sqrt(n)
PrimesBelow <- function(n, below.sqrt=FALSE){
  if (below.sqrt == TRUE){
    m <- ceiling(sqrt(n))
  } else {
    m <- n
  }

  primes <- c(2,3)
  i <- 3
  while (i < m-1){
    flag <- 0
    i <- i+2
    if (i%%6 == 3){
      flag <- 1
    }
    if (flag == 0){
      s <- sqrt(i)+1
      possible.primes <- primes[primes<s]
      for (prime in possible.primes){
        if ((i%%prime == 0)){
          flag <- 1
          break
        }
      }
      if (flag == 0){
        primes <- c(primes, i)
      }
    }
  }
  primes
}
###

primes <- PrimesBelow(n, below.sqrt=TRUE)

biglist <- ceiling(sqrt(n)):n
biglist <- biglist[-which(biglist%%2 == 0)]
biglist <- biglist[-which(biglist%%6 == 3)]

for (prime in primes){
  biglist <- biglist[which(biglist%%prime != 0)]
}

answer <- sum(primes)+sum(as.numeric(biglist))
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %f\nTotal elapsed time:  %d minutes and %f seconds\n", answer, ElapsedMins, ElapsedSecs))

Output:
The answer is:  142913828922.000000
Total elapsed time:  0 minutes and 1.915000 seconds


Project Euler in R: Problem 9

Problem:  There exists exactly one Pythagorean triplet for which a + b + c = 1000.  Find the product abc.

Commentary:  I’m not proud of this solution.  Every time I would look at this solution, I just knew there was something really obvious I was missing–that I was doing this in the most bone-headed way possible.  So then I finally just gave up and checked the writeup at Project Euler, and yep, I forgot that Pythagorean Triplets have a very nice parametrization.

I won’t even bother to reproduce the idea here, because the writeup over at Project Euler is actually very well done.  But it is a bit embarrassing to have a master’s in number theory and to forget that simple fact.  Oh well.
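For the curious (and so I can pretend I remembered it all along), here’s a quick sketch using that parametrization.  Euclid’s formula gives a = k(m^2 - n^2), b = 2kmn, c = k(m^2 + n^2), so a + b + c = 2km(m + n); setting that equal to 1000 means we just need km(m + n) = 500.  This sketch is not the solution below, which predates my enlightenment:

```r
# Search Euclid's formula for k*m*(m+n) = 500; m*(m+1) <= 500 bounds m at 21
answer <- NA
for (m in 2:21) {
  for (n in 1:(m - 1)) {
    if (500 %% (m * (m + n)) == 0) {
      k <- 500 %/% (m * (m + n))
      a <- k * (m^2 - n^2)
      b <- 2 * k * m * n
      answer <- a * b * k * (m^2 + n^2)  # a * b * c
    }
  }
}
answer
# 31875000, with essentially zero work
```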

R Code:

ElapsedTime <- system.time({
##########################
for (a in c(1:333)){
  for (b in c(334:998)){
    test <- a+b+sqrt(a^2+b^2)
    if (test==1000){
      win.a <- a
      win.b <- b
      break
    }
  }
}

answer <- win.a*win.b*sqrt(win.a^2+win.b^2)
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n", answer, ElapsedMins, ElapsedSecs))

Output:

The answer is:  31875000
Total elapsed time:  0 minutes and 0.460000 seconds


Prime testing function in R

I was hoping to begin tinkering a bit with the multicore package in R beyond some extremely trivial examples.  Thanks to a combination of R’s dumb quirkiness (for example, being worthless on loops), my poor planning, and general bad programming, my Saturday afternoon tinkering project is ultimately worthless in fulfilling that purpose.

I was really hoping to take the prime searcher I had been using to solve Project Euler problems and use it to make a prime testing function which would utilize the 4 cores on my adorable little core i5.  I succeeded in creating it, but it wasn’t until I had finished (which thankfully was only about an hour of time wasted) that I realized that my plan was stupid and I was going to have to loop over entries (or use sapply(), which is the same thing; don’t kid yourself).  This, as we all know, is where R is fantastically bad.

A very hasty test showed that relying on a single core, the original way I was doing things is a little over 2 times faster than this new implementation.  This doesn’t bode well for the possible speedup when scaling up to 4 cores, and since I have more pressing projects at the moment, I’m going to have to dump this one barring some illumination.

However, it is a workable prime tester.  And for testing against single primes, it’s honestly not that slow.  But, as pointed out above, if you’re using it to test primality of a (reasonably large) range of (reasonably large) values, then you’re going to have to loop and R is going to catch fire in the process.

The syntax is simple enough; simply conjure your integer of choice n and ask R

IsPrime(n)

which is like the old Maple syntax back when I used it extensively (which was ~ Maple 8 or 9, I believe).  Functions and sample output below:

IsPrime <- function(n){ # n=Integer you want to know if is/not prime
  if ((n-floor(n)) > 0){
    cat(sprintf("Error: function only accepts natural number inputs\n"))
  } else if (n < 1){
    cat(sprintf("Error: function only accepts natural number inputs\n"))
  } else
  # Prime list exists
  if (try(is.vector(primes), silent=TRUE) == TRUE){

    # Prime list is already big enough
    if (n %in% primes){
      TRUE
    } else
    if (n < tail(primes,1)){
      FALSE
    } else
    if (n <= (tail(primes,1))^2){
      flag <- 0
      for (prime in primes){
        if (n%%prime == 0){
          flag <- 1
          break
        }
      }
    
      if (flag == 0){
        TRUE
      }
      else {
        FALSE
      }
    }

    # Prime list is too small; get more primes
    else {
      last.known <- tail(primes,1)
      while ((last.known)^2 < n){
        assign("primes", c(primes,GetNextPrime(primes)), envir=.GlobalEnv)
        last.known <- tail(primes,1)
      }
      IsPrime(n)
    }
  } else {
    # Prime list does not exist
    assign("primes", PrimesBelow(n,below.sqrt=TRUE), envir=.GlobalEnv)
    IsPrime(n)
  }
}

# Get next prime
GetNextPrime <- function(primes){ # primes=Known prime list
  i <- tail(primes,1)
  while (TRUE){
    flag <- 0
    i <- i+2
    if (i%%6 == 3){
      flag <- 1
    }
    if (flag == 0){
      s <- sqrt(i)+1
      possible.primes <- primes[primes<s]
      for (prime in possible.primes){
        if ((i%%prime == 0)){
          flag <- 1
          break
        }
      }
      if (flag == 0){
        break
      }
    }
  }
  i
}

# Primes below specified integer n; optionally only those below sqrt(n)
PrimesBelow <- function(n, below.sqrt=FALSE){
  if (below.sqrt == TRUE){
    m <- ceiling(sqrt(n))
  } else {
    m <- n
  }
  
  primes <- c(2,3)
  i <- 3
  while (i < m-1){
    flag <- 0
    i <- i+2
    if (i%%6 == 3){
      flag <- 1
    }
    if (flag == 0){
      s <- sqrt(i)+1
      possible.primes <- primes[primes<s]
      for (prime in possible.primes){
        if ((i%%prime == 0)){
          flag <- 1
          break
        }
      }
      if (flag == 0){
        primes <- c(primes, i)
      }
    }
  }
  primes
}

Some sample output:

 > primes
Error: object 'primes' not found
> IsPrime(100)
[1] FALSE
> IsPrime(101)
[1] TRUE
> primes
[1]  2  3  5  7 11
>

After screwing around with the prime tester for a few minutes, I was able to find this adorable little gem

1> IsPrime(1234567891)
[1] TRUE

So 1234567891 is prime, but none of 1, 12, 123, 1234, 12345, 123456, 1234567, 12345678, 123456789, 12345678910, 123456789101, or 1234567891011 are (of course, nearly half of those are even, but the point stands).


Project Euler in R: Problem 8

Problem:  Find the greatest product of five consecutive digits in the 1000-digit number.

Commentary:  So this solution is probably a little non-standard.  I’ve actually got a huge blog post coming up about the little “trick” I use here, so I won’t go into it at length just yet. But before we go into the “trick”, I want to explain why it is that I’m doing the weird crap that I’m doing–afterwards we’ll talk a little about the “what”.

My philosophy on Project Euler problems is that they should be taken “as is”.  If I’m given a text file (as is the case in some problems), then I’d better figure out how to read in that text file.  Likewise, if I’m given a text dump (as is the case here), then no changes can be made by hand.  None at all.  I suspect this is how most people proceed.  However, after solving a problem in Project Euler, I enjoy looking at other people’s solutions, and in doing so, I have found some people who admit to doing some truly crazy things.  Things like adding commas after characters by hand in a huge text dump.  Don’t do that!

So in this problem, the goal is to copy/paste the text dump in a script and sandwich it somewhere between some usable code.  I do this by turning it into a vector in R using Unix sorcery.  Once you buy that this is what I’ve done, the rest is trivial.  Just check every 5 consecutive digits, but keep only the biggest.

So how did I get that thing into a vector?  You might argue that it was by cheating.  I couldn’t disagree more (more on this at the conclusion of the commentary).  I used the function system(), which on a Linux (and presumably Mac OS X, *bsd, …) system is a very powerful tool.  It’s essentially a wrapper for the very powerful Unix (or Unix-like) shell which lives under your operating system and makes everything awesome.  I don’t know what it does on Windows because I don’t give a shit.

So here, I pass off to the shell an echo which contains some unwanted (additional) line breaks, together with the text dump from Project Euler.  Next, I pipe off to tr and sed to do their voodoo.  In short, this is how they work here:

  • tr -d ‘\n’ removes line breaks
  • sed ‘s/\\([0-9]\\)/\\1\\n/g’ adds a line break after every numeral
  • sed ‘s/.$//’ removes last line break

Two additional things:  first, the -e’s allow me to chain together a bunch of sed calls at once.  This is handy because sed is the Stream EDitor, which is exactly what it sounds like–and chaining them like that (as opposed to piping the output to another sed call) is faster because (here) you only have to read the stream in once.  Second, you might notice that in the tr call, I use \n for line breaks, but in the sed call I use \\n (and \\( instead of \(, \\1 instead of \1, etc).  The reason is that R’s string parser eats one level of backslashes before system() ever hands the command to the shell: \\n reaches sed as the two characters \n, while the tr call’s \n arrives as an actual newline character, which tr is happy to delete anyway.
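You can watch R eat the backslashes without involving the shell at all:

```r
# R's string parser consumes one level of escaping before system()
# ever hands the command to the shell
nchar("\\n")  # 2: a backslash and an n (what sed ultimately receives)
nchar("\n")   # 1: an actual newline character (what tr receives and deletes)
```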

Ok, so that unpacks most of it.  The last bit is that intern=TRUE (default FALSE) captures the output and stores it in a character vector (as indicated in the help file).  So the last little leg of sorcery is to use as.numeric to convert those characters into the numbers I wanted all along.  That’s it; show’s over.  Go home.

So why wasn’t this cheating, in my opinion?  It’s a wrapper that’s built into core R–it’s not even an extra package I have to download.  In my mind, this is very clearly in the “not cheating” category, but I have some other solutions that, whenever I get the time to post the writeups, blur the line a bit–especially when I start invoking bc.

If you want to argue that there’s a “more R-y” way to do it, then I’m willing to concede (even though I don’t know how–I just drop down to the shell to do my dirty work and have never been left wanting).  If you think it violates the spirit of Project Euler, then I suggest you take a look at some of the Python kids who practically import entire solutions.

Besides, sed is really, really cool.  Deal with it, nerd.
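(That said, for the record, one candidate for a “more R-y” way is to let gsub and strsplit chop the dump into digits.  A sketch, and emphatically not the solution below:)

```r
# Shell-free alternative (sketch): strip line breaks, split the string
# into single characters, convert to numbers
digits.of <- function(txt) {
  as.numeric(strsplit(gsub("\n", "", txt), "")[[1]])
}
digits.of("73167\n17653")
# 7 3 1 6 7 1 7 6 5 3
```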

R Code:

ElapsedTime <- system.time({
##########################
big <- as.numeric(system("echo '
73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450
' | tr -d '\n' | sed -e 's/\\([0-9]\\)/\\1\\n/g' -e 's/.$//'",intern=TRUE))

big.prod <- 1

for (i in c(1:(length(big)-4))){
  test <- prod(big[i:(i+4)])
  if (test > big.prod){
    big.prod <- test
  }
}

answer <- big.prod
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n",
answer, ElapsedMins, ElapsedSecs))

Output:

The answer is:  40824
Total elapsed time:  0 minutes and 0.014000 seconds