# Monthly Archives: August 2011

## How Much of R is Written in R Part 2: Contributed Packages

So that mean old boss of mine is at it again.  This morning I came in beaming about how many people had read my post How Much of R is Written in R (thanks by the way!).  He then asks me about one little line in that post; the one about how if you looked at the contributed packages, you’d overwhelmingly see them written in R, and that this is what really matters.

He asked if I had bothered to verify this.  I had not.  So off he sent me to do this too.  Can you believe the nerve of this character?  Making me do work.  I’m paid by a university; I’m not supposed to do work!

So think of this post as the quick cash-in sequel that no one probably wanted–the “Transformers 2” of R-bloggers posts, if you will.  I don’t mind admitting that I had to use every ounce of constraint to not call this “… Part 2:  Electric Boogaloo”.

This task is a little more cumbersome than the last one.  Unfortunately this means I have to actually be slightly careful in my shellscript, rather than just playing it fast and loose and making all the hardcore programmers angry by being so sloppy (spoiler alert:  they will still be angry).  I won’t go into the boring details about how I generated the data.  Let’s just have a look at the results; though, as always, all code and generated output will be attached.

Oh, but before we go on, if you thought Part 1 was interesting, you should go check out  http://www.ohloh.net/p/rproject.  Someone linked it in the comments of Part 1, and it’s definitely impressive. It has most of that information from that blog post and then some, and in a much cooler graph.  From the drop down menu, select “Languages”.  The graph is interactive, and if you (for instance) click on C (to hide it), then the graph rescales.  Pretty cool!

Ok, so the results then.

We will again first look at the Language breakdown of the number of total files:

Here R is the clear victor by overwhelming majority.  But the real challenge is looking at the language breakdown by percentage contribution to total lines of code (for our purposes here, we are looking at all ~3000 contributed packages and lumping them together as one huge mass of sourcecode).  The following graph provides that, but first a word of caution.  It’s a bit difficult to tell when a .h, or even a .c file is a C or C++ (in fact, you can fool the Unix/Linux tool “file” into believing these are all kinds of crazy things).  As such, some choice had to be made (see the script if you’re interested).  Errors likely occurred in determining what is C code and what is C++ code.  That said, having made our choices, we have the following breakdown:

Here R is the clear victor, though in light of the doubt cast above, it is perhaps not as strong of a victory as we would hope.  To assuage all doubt, we can lump C and C++ together into one category:

And look at that; R is still winning.  Now, you probably see where this is headed:

Which is pretty darn good!  In short, the people who are outside of the core R team but who are still developing incredibly cool things with R…are making those things in R.  And why shouldn’t they?  R rocks!

Here’s all the boring crap nobody cares about:

First, just a quick heads up; all the sourcecode to follow will look much better on my blog than on R-bloggers, because I have everything formatted (with highlighted syntax) specially in WordPress.  So if you want to be able to read this, then I would recommend reading it directly from my blog.

The shell script to grab all the packages and generate the “csv” (output is tab delimited (because some screwey files have commas in them), changed to .xls because wordpress hates freedom, split into two pieces because xls is a dumb format:  piece 1    piece 2):

#!/bin/sh

outdir="wherever/"

wget -e robots=off -r -l1 --no-parent -A.tar.gz http://cran.r-project.org/src/contrib/

srcdir="cran.r-project.org/src/contrib"
cd $srcdir echo -e "\n\n This will take a LONG FREAKING TIME.\n Please wait paitently.\n\n" touch$outdir/r_contrib_loc_by_lang.csv

#echo "Language,File.Name,loc,Proj.Name,Proj.ID.Nmbr" >> $outdir/r_contrib_loc_by_lang.csv echo -e "Language\tFile.Name\tloc\tProj.Name" >>$outdir/r_contrib_loc_by_lang.csv

for archive in ls *.tar.gz;do
tar -zxf $archive done for found in find . -name \*.r -or -name \*.R -or -name \*.pl -or -name \*.c -or -name \*.C -or -name \*.cpp -or -name \*.cc -or -name \*.h -or -name \*.hpp -or -name \*.f; do loc=wc -l$found | awk '{ print $1 }' filename=echo$found | sed -n 's:^$$.*/$$$$[^/]*$$:\2:p'
proj=echo $found | sed -e 's/.\///' -e 's/\/.*//' lang=echo$filename | sed 's/.*[.]//'

if [ $lang = "r" ]; then lang="R" elif [$lang = "pl" ]; then
lang="Perl"
elif [ $lang = "C" ]; then lang="c" elif [$lang = "cpp" ]; then
lang="c++"
elif [ $lang = "h" ]; then lang="c" elif [$lang = "hpp" ]; then
lang="c++"
elif [ $lang = "f" ]; then lang="Fortran" elif [$lang = "cc" ]; then
# Use file for best guess; bad guesses we revert to c
lang=file $found | awk '{ print$3 }'
if [ $lang = "English" ] || [$lang = "Unicode" ]; then
lang="c"
fi
fi

echo -e "$lang\t$filename\t$loc\t$proj"  >> $outdir/r_contrib_loc_by_lang.csv done echo -e "\n\n ALL DONE\n\n"  And here’s the R script used to analyze everything and generate the barplots: r.loc <- read.delim("r_contrib_loc_by_lang.csv", header=TRUE, stringsAsFactors=FALSE) a <- r.loc[which(r.loc[1] == "R"), ][3] b <- r.loc[which(r.loc[1] == "c"), ][3] c <- r.loc[which(r.loc[1] == "c++"), ][3] d <- r.loc[which(r.loc[1] == "Fortran"), ][3] e <- r.loc[which(r.loc[1] == "Perl"), ][3] lena <- length(a[, 1]) lenb <- length(b[, 1]) lenc <- length(c[, 1]) lend <- length(d[, 1]) lene <- length(e[, 1]) files.total <- lena + lenb + lenc + lend + lene loc.total <- sum(a) + sum(b) + sum(c) + sum(d) + sum(e) cat(sprintf("\nNumber R files:\t\t\t %d\nNumber C files:\t\t\t %d\nNumber C++ files:\t\t %d\nNumber Fortran files:\t\t %d\nNumber Perl files:\t\t %d\n", lena, lenb, lenc, lend, lene)) cat(sprintf("--------------------------------------")) cat(sprintf("\nTotal source files examined:\t %d\n\n", files.total)) cat(sprintf("\nLines R code:\t\t %d\nLines C code:\t\t %d\nLines C++ code:\t\t %d\nLines Fortran code:\t %d\nLines Perl code:\t %d\n", sum(a), sum(b), sum(c), sum(d), sum(e))) cat(sprintf("--------------------------------")) cat(sprintf("\nTotal lines of code:\t %d\n\n", loc.total)) cat(sprintf("%% code in R:\t\t %f\n%% code in C:\t\t %f\n%% code in C++:\t\t %f\n%% code in Fortran:\t %f\n%% code in Perl:\t\t %f\n", 100*sum(a)/loc.total, 100*sum(b)/loc.total, 100*sum(c)/loc.total, 100*sum(d)/loc.total, 100*sum(e)/loc.total)) png("1_pct_contrib_source_files.png") barplot(c(100*lena/files.total, 100*lenb/files.total, 100*lenc/files.total, 100*lend/files.total, 100*lene/files.total), main="Percent of Contrib Sourcecode Files By Language", names.arg=c("R","C","C++","Fortran","Perl"), ylim=c(0,70)) dev.off() png("2_pct_contrib_code.png") barplot(c(100*sum(a)/loc.total, 100*sum(b)/loc.total, 100*sum(c)/loc.total, 100*sum(d)/loc.total, 100*sum(e)/loc.total), main="Percent Contribution of Language to Contrib", names.arg=c("R","C","C++","Fortran","Perl"), ylim=c(0,50)) dev.off() png("3_pct_contrib_code_c.cpp.combined.png") barplot(c(100*sum(a)/loc.total, 100*(sum(b)+sum(c))/loc.total, 100*sum(d)/loc.total, 100*sum(e)/loc.total), main="Percent Contribution of Language to Contrib", names.arg=c("R","C/C++","Fortran","Perl"), ylim=c(0,50)) dev.off() png("4_pct_contrib_code_r.vs.all.png") barplot(c(100*sum(a)/loc.total, 100*(sum(b)+sum(c)+sum(d)+sum(e))/loc.total), main="Percent Contribution of Language to Contrib", names.arg=c("R","Everything Else"), ylim=c(0,60)) dev.off()  With output: Number R files: 46054 Number C files: 9149 Number C++ files: 9387 Number Fortran files: 1684 Number Perl files: 44 -------------------------------------- Total source files examined: 66318 Lines R code: 6009966 Lines C code: 3519637 Lines C++ code: 2214267 Lines Fortran code: 645226 Lines Perl code: 6910 -------------------------------- Total lines of code: 12396006 % code in R: 48.483084 % code in C: 28.393315 % code in C++: 17.862745 % code in Fortran: 5.205112 % code in Perl: 0.055744  ## How Much of R is Written in R? My boss sent me an email (on my day off!) asking me just how much of R is written in the R language. This is very simple if you use R and a Unix-like system. It also gives me a good excuse to defend the title of this blog. It’s librestats, not projecteulerstats, afterall. So I grabbed the R-2.13.1 source package from the cran and wrote up a little script that would look at all .R, .c, and .f files in the archive, record the language (R, C, or Fortran), number of lines of code, and the file the code came from; then it’s just a matter of dumping all that to a csv (converted to .xls (in LibreOffice) because WordPress hates freedom). We’ll talk in a minute about just how you would generate that csv–but first let’s address the original question. By a respectable majority, most of the source code files of core R are written in R: At first glance, it seems like Fortran doesn’t give much of a contribution. However, when we look at the proportion of lines of code, we see something more reasonable: So there you have it. Roughly 22% of R is written in R. I know some people want R to be written in R for some crazy reason; but really, if anything, that 22% is too high. Trust me, you really want C and Fortran to be doing all the heavy lifting so that things stay nice and peppy. Besides, this is a fairly irrelevant issue, in my opinion. What matters is that people outside of Core R are writing in R. Look at the extra packages repo and you’ll see a very different story from the above graphic. That’s something SAS certainly can’t say, since people who want to do anything other than call some cookie-cutter SAS proc have to use IML or that ridiculous SAS macro language–each of which is somehow even more of a hilarious mess than base SAS. Ok, so how do we get that data? I actually have a much better script than the one I’m about to describe. The new one automatically grabs every source package from the cran that you don’t already have and starts digging in on them, dumping everything out into one big csv so you can watch trending. It’s interesting to see the transition from R being almost entirely (92%) in C to seeing it slowly drop down to ~52%. But that’s a different post for a different day because I have a few kinks to work out with that script before I would feel comfortable releasing it. So here’s how this system works. It’s basically the dumbest possible solution; I’m pretty good at those, if I may say so myself. Basically the shell script hops into across the R-version/src/ folder and gets a line count of each .R, .c, and .f file. That’s it; here it is: #!/bin/sh outdir="/path/to/where/you/want/the/csv/dumped" rdir="/path/to/R/source/root/directory/to/be/examined" #eg, ~/R-2.13.1/ cd$rdir/src

for rfile in find -name *.R
do
loc=wc -l $rfile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///' echo "R,$loc"  >> $outdir/r_source_loc.csv done for cfile in find -name *.c do loc=wc -l$cfile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///'
echo "C,$loc" >>$outdir/r_source_loc.csv
done

for ffile in find -name *.f
do
loc=wc -l $ffile | sed -e 's/ ./,/' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\//\//g' -e 's/\/[^/]*\///g' -e 's/\///' echo "Fortran,$loc"  >> $outdir/r_source_loc.csv done  Then the R script just does exactly what you’d think, given the data (take a look at the “csv” for examples). r.loc <- read.csv("r_source_loc.csv",header=FALSE) a <-r.loc[which(r.loc[1] == "R"),][2] b <-r.loc[which(r.loc[1] == "C"),][2] c <-r.loc[which(r.loc[1] == "Fortran"),][2] files.total <- length(a[,1])+length(b[,1])+length(c[,1]) loc.total <- sum(a)+sum(b)+sum(c) cat(sprintf("\nNumber .R source files:\t\t %d\nNumber .c source files:\t\t %d\nNumber .f source files:\t\t %d\n",length(a[,1]),length(b[,1]),length(c[,1]))) cat(sprintf("-------------------------------------")) cat(sprintf("\nTotal source files examined:\t %d\n\n",length(a[,1])+length(b[,1])+length(c[,1]))) cat(sprintf("\nLines of R code:\t %d\nLines of C code:\t %d\nLines of Fortran code:\t %d\n",sum(a),sum(b),sum(c))) cat(sprintf("-------------------------------")) cat(sprintf("\nTotal lines of code:\t %d\n\n",loc.total)) cat(sprintf("\nAmong all lines of code being either R, C, or Fortran:\n")) cat(sprintf("%% code in R:\t\t %f\n%% code in C:\t\t %f\n%% code in Fortran:\t %f\n",100*sum(a)/loc.total,100*sum(b)/loc.total,100*sum(c)/loc.total)) png("pct_r_source_files.png") barplot(c(length(a[,1])/files.total,length(b[,1])/files.total,length(c[,1])/files.total),main="Percent of Core R Sourcecode Files",names.arg=c("R","C","Fortran")) dev.off() png("pct_r_code.png") barplot(c(100*sum(a)/loc.total,100*sum(b)/loc.total,100*sum(c)/loc.total),main="Percent of Core R Lines of Code",names.arg=c("R","C","Fortran")) dev.off()  From the R script, we can get precise figures, which I prefer to pictures any day. But I seem to be an outlier in this regard… Number .R source files: 729 Number .c source files: 586 Number .f source files: 45 ------------------------------------- Total source files examined: 1360 Lines of R code: 149520 Lines of C code: 346778 Lines of Fortran code: 175409 ------------------------------- Total lines of code: 671707 Among all lines of code being either R, C, or Fortran: % code in R: 22.259705 % code in C: 51.626379 % code in Fortran: 26.113916  ## Project Euler in R: Problem 10 Problem: Find the sum of all the primes below two million. Commentary: If you’re not careful, this problem can really get you in R. In poking around the internet, there are a few solutions to this problem, but all the ones I’ve tested are slow. Some, even from people who are better at R than me, are VERY slow. That linked one in particular took 6 minutes and 8.728000 seconds to complete. My solutions runs in under 2 seconds. So we’ll be using that old prime algorithm I’ve already discussed to death. Other than a few tricks in there that I never see get implemented, there’s a crucial trick to this solution. Since we only want the sum, let’s not be slow about finding the primes we want. I only have to find all primes below $\sqrt{n}$, where $n=2000000$. It takes basically no time to find them. From there, I start evaluating my sum, and like a good little R user, I vectorize by getting the possible list of possible primes above $\sqrt{n}$. From there, I check if any of my known primes divide into those possible primes. This is actually much faster than generating primes and storing them. R Code: ElapsedTime <- system.time({ ########################## n <- 2000000 ### Function to get primes below specified n; option to get only those below sqrt(n) PrimesBelow <- function(n, below.sqrt=FALSE){ if (below.sqrt == TRUE){ m <- ceiling(sqrt(n)) } else { m <- n } primes <- c(2,3) i <- 3 while (i < m-1){ flag <- 0 i <- i+2 if (i%%6 == 3){ flag <- 1 } if (flag == 0){ s <- sqrt(i)+1 possible.primes <- primes[primes<s] for (prime in possible.primes){ if ((i%%prime == 0)){ flag <- 1 break } } if (flag == 0){ primes <- c(primes, i) } } } primes } ### primes <- PrimesBelow(n, below.sqrt=TRUE) biglist <- ceiling(sqrt(n)):n biglist <- biglist[-which(biglist%%2 == 0)] biglist <- biglist[-which(biglist%%6 == 3)] for (prime in primes){ biglist <- biglist[which(biglist%%prime != 0)] } answer <- sum(primes)+sum(as.numeric(biglist)) ########################## })[3] ElapsedMins <- floor(ElapsedTime/60) ElapsedSecs <- (ElapsedTime-ElapsedMins*60) cat(sprintf("\nThe answer is: %f\nTotal elapsed time: %d minutes and %f seconds\n", answer, ElapsedMins, ElapsedSecs))  Output: The answer is: 142913828922.000000 Total elapsed time: 0 minutes and 1.915000 seconds ## Project Euler in R: Problem 9 Problem: There exists exactly one Pythagorean triplet for which a + b + c = 1000. Find the product abc. Commentary: I’m not proud of this solution. Every time I would look at this solution, I just knew there was something really obvious I was missing–that I was doing this in the most bone-headed way possible. So then I finally just gave up and checked the writeup at Project Euler, and yep, I forgot that Pythagorean Triplets have a very nice parametrization. I won’t even bother to reproduce the idea here, because the writeup over at Project Euler is actually very well done. But it is a bit embarrassing to have a masters in number theory and to forget that simple fact. Oh well. R Code: ElapsedTime <- system.time({ ########################## for (a in c(1:333)){ for (b in c(334:998)){ test <- a+b+sqrt(a^2+b^2) if (test==1000){ win.a <- a win.b <- b break } } } answer <- win.a*win.b*sqrt(win.a^2+win.b^2) ########################## })[3] ElapsedMins <- floor(ElapsedTime/60) ElapsedSecs <- (ElapsedTime-ElapsedMins*60) cat(sprintf("\nThe answer is: %d\nTotal elapsed time: %d minutes and %f seconds\n", answer, ElapsedMins, ElapsedSecs))  Output: The answer is: 31875000 Total elapsed time: 0 minutes and 0.460000 seconds ## Prime testing function in R I was hoping to begin tinkering a bit with the multicore package in R beyond some extremely trivial examples. Thanks to a combination of R’s dumb quirkiness (for example, being worthless on loops), my poor planning, and general bad programming, my Saturday afternoon tinkering project is ultimately worthless in fulfilling that purpose. I was really hoping to take the prime searcher I had been using to solve Project Euler problems and use it to make a prime testing function which would utilize the 4 cores on my adorable little core i5. I succeeded in creating it, but it wasn’t until I had finished (which thankfully was only about an hour of time wasted) that I realized that my plan was stupid and I was going to have to loop over entries (or use sapply(), which is the same thing; don’t kid yourself). This, as we all know, is where R is fantastically bad. A very hasty test showed that relying on a single core, the original way I was doing things is a little over 2 times faster than this new implementation. This doesn’t bode well for the possible speedup when scaling up to 4 cores, and since I have more pressing projects at the moment, I’m going to have to dump this one barring some illumination. However, it is a workable prime tester. And for testing against single primes, it’s honestly not that slow. But, as pointed out above, if you’re using it to test primality of a (reasonably large) range of (reasonably large) values, then you’re going to have to loop and R is going to catch fire in the process. The syntax is simple enough; simply conjure your integer of choice $n$ and ask R IsPrime($n$) which is like the old Maple syntax back when I used it extensively (which was ~ Maple 8 or 9, I believe). Functions and sample output below: IsPrime <- function(n){ # n=Integer you want to know if is/not prime if ((n-floor(n)) > 0){ cat(sprintf("Error: function only accepts natural number inputs\n")) } else if (n < 1){ cat(sprintf("Error: function only accepts natural number inputs\n")) } else # Prime list exists if (try(is.vector(primes), silent=TRUE) == TRUE){ # Prime list is already big enough if (n %in% primes){ TRUE } else if (n < tail(primes,1)){ FALSE } else if (n <= (tail(primes,1))^2){ flag <- 0 for (prime in primes){ if (n%%prime == 0){ flag <- 1 break } } if (flag == 0){ TRUE } else { FALSE } } # Prime list is too small; get more primes else { last.known <- tail(primes,1) while ((last.known)^2 < n){ assign("primes", c(primes,GetNextPrime(primes)), envir=.GlobalEnv) last.known <- tail(primes,1) } IsPrime(n) } } else { # Prime list does not exist assign("primes", PrimesBelow(n,below.sqrt=TRUE), envir=.GlobalEnv) IsPrime(n) } } # Get next prime GetNextPrime <- function(primes){ # primes=Known prime list i <- tail(primes,1) while (TRUE){ flag <- 0 i <- i+2 if (i%%6 == 3){ flag <- 1 } if (flag == 0){ s <- sqrt(i)+1 possible.primes <- primes[primes<s] for (prime in possible.primes){ if ((i%%prime == 0)){ flag <- 1 break } } if (flag == 0){ break } } } i } # Primes below specified integer n; optionally only those below sqrt(n) PrimesBelow <- function(n, below.sqrt=FALSE){ if (below.sqrt == TRUE){ m <- ceiling(sqrt(n)) } else { m <- n } primes <- c(2,3) i <- 3 while (i < m-1){ flag <- 0 i <- i+2 if (i%%6 == 3){ flag <- 1 } if (flag == 0){ s <- sqrt(i)+1 possible.primes <- primes[primes<s] for (prime in possible.primes){ if ((i%%prime == 0)){ flag <- 1 break } } if (flag == 0){ primes <- c(primes, i) } } } primes }  Some sample output:  > primes Error: object 'primes' not found > IsPrime(100) [1] FALSE > IsPrime(101) [1] TRUE > primes [1] 2 3 5 7 11 >  After screwing around with the prime tester for a few minutes, I was able to find this adorable little gem 1> IsPrime(1234567891) [1] TRUE  So 1234567891 is prime, but none of 1, 12, 123, 1234, 12345, 123456, 1234567, 12345678, 123456789, 12345678910, 123456789101, or 1234567891011 are (of course, nearly half of those are even, but the point stands). ## Project Euler in R: Problem 8 Problem: Find the greatest product of five consecutive digits in the 1000-digit number. Commentary: So this solution is probably a little non-standard. I’ve actually got a huge blog post coming up about the little “trick” I use here, so I won’t go into it at length just yet. But before we go into the “trick”, I want to explain why it is that I’m doing the weird crap that I’m doing–afterwards we’ll talk a little about the “what”. My philosophy on Project Euler problems is that they should be taken “as is”. If I’m given a text file (as is the case in some problems), then I’d better figure out how to read in that text file. Likewise, if I’m given a text dump (as is the case here), then no changes can be made by hand. None at all. I suspect this is how most people proceed. However, after solving a problem in Project Euler, I enjoy looking at other people’s solutions, and in doing so, I have found some people who admit to doing some truly crazy things. Things like adding commas after characters by hand in a huge text dump. Don’t do that! So in this problem, the goal is to copy/paste the text dump in a script and sandwich it somewhere between some usable code. I do this by turning it into a vector in R using Unix sorcery. Once you buy that this is what I’ve done, the rest is trivial. Just check every 5 consecutive digits, but keep only the biggest. So how did I get that thing into a vector? You might argue that it was by cheating. I couldn’t disagree more (more on this at the conclusion of the commentary). I used the function system(), which on a Linux (and presumably Mac OS X, *bsd, …) system is a very powerful tool. It’s essentially a wrapper for the very powerful Unix (or Unix-like) shell which lives under your operating system and makes everything awesome. I don’t know what it does on Windows because I don’t give a shit. So here, I pass off to the shell an echo which contains some unwanted (additional) line breaks, together with the text dump from Project Euler. Next, I pipe off to tr and sed to do their voodoo. In short, this is how they work here: • tr -d ‘\n’ removes line breaks • sed ‘s/\$$[0-9]\$$/\\1\\n/g’ adds a line break after every numeral • sed ‘s/.$//’ removes last line break

Two additional things:  first, the -e’s allow me to chain together a bunch of sed calls at once.  This is handy because sed is the Stream EDitor, which is exactly what it sounds like–and chaining them like that (as opposed to piping the output to another sed call) is faster because (here) you only have to read in the stream once.  Second, you might notice that in the tr call, I use \n for line breaks, but in the sed call I use \\n (and \$$instead of \(, \\1 instead of \1, etc). The reason is that R’s system() function is really weird, and requires extra escaping of characters…sometimes. Ok, so that unpacks most of it. The last bit is the intern=TRUE part (default FALSE) captures the output and stores it into a character vector (as indicated in the help file). So the last little leg of sorcery is to use as.numeric to use the numbers for my intended purpose. That’s it; show’s over. Go home. So why wasn’t this cheating, in my opinion? It’s a wrapper that’s built into core R–it’s not even an extra package I have to download. In my mind, this is very clearly in the “not cheating” category, but I have some other solutions that, whenever I get the time to post the writeups, blur the line a bit–especially when I start invoking bc. If you want to argue that there’s a “more R-y” way to do it, then I’m willing to concede (even though I don’t know how–I just drop down to the shell to do my dirty work and have never been left wanting). If you think it violates the spirit of Project Euler, then I suggest you take a look at some of the Python kids who practically import entire solutions. Besides, sed is really, really cool. Deal with it, nerd. R Code: ElapsedTime <- system.time({ ########################## big <- as.numeric(system("echo ' 73167176531330624919225119674426574742355349194934 96983520312774506326239578318016984801869478851843 85861560789112949495459501737958331952853208805511 12540698747158523863050715693290963295227443043557 66896648950445244523161731856403098711121722383113 62229893423380308135336276614282806444486645238749 30358907296290491560440772390713810515859307960866 70172427121883998797908792274921901699720888093776 65727333001053367881220235421809751254540594752243 52584907711670556013604839586446706324415722155397 53697817977846174064955149290862569321978468622482 83972241375657056057490261407972968652414535100474 82166370484403199890008895243450658541227588666881 16427171479924442928230863465674813919123162824586 17866458359124566529476545682848912883142607690042 24219022671055626321111109370544217506941658960408 07198403850962455444362981230987879927244284909188 84580156166097919133875499200524063689912560717606 05886116467109405077541002256983155200055935729725 71636269561882670428252483600823257530420752963450 ' | tr -d '\n' | sed -e 's/\\([0-9]\$$/\\1\\n/g' -e 's/.\$//'",intern=TRUE))

big.prod <- 1

for (i in c(1:(length(big)-5))){
test <- prod(big[i:(i+4)])
if (test > big.prod){
big.prod <- test
}
}

##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n",



Output:

Total elapsed time:  0 minutes and 0.014000 seconds

## Project Euler in R: Problem 7

Problem:  What is the 10001st prime number?

Commentary:  So it’s finally time to discuss my prime-searching algorithm, as promised back in Problem 5.  We will find primes in the following way:

1. 2 and 3 are prime
2. We examine the odd integers (having already discovered the only even prime) starting with $i=5$.
3. First, check if $i$ is of the form $6k+3$ for some integer $k>0$.  If so, then $i$ can’t be prime, because then $i=6k+3=3(2k+1)$, and since $k>0$, $2k+1 > 1$ (so $i\neq 3$).  In this case, don’t even bother proceeding; replace $i$ with $i+2$ and start over.
4. Calculate $s(i)=\sqrt{i}$.  Among all known primes below $s(i)$  (going no further for the same reason as that outlined in Problem 3), check if these primes divide into $i$.  If we find one that does, then there is no reason to proceed; $i$ is not prime.  Go back to 3. and try again with $i$ replced with $i+2$.  If none of the known primes below $s(i)$ divide $i$, then stop; $i$ must be prime.  Add it to the list of known primes.
5. Repeat this process by replacing $i$ with $i+2$ until we have 10001 primes.

It’s nothing fancy, but it works.  As for the implementation, believe it or not, I actually don’t construct a list of length 10001 and insert into it as I gather my primes like a good little R user.  See the post-code commentary for an explanation.

R Code:

ElapsedTime <- system.time({
##########################
primes <- c(2, 3)
i <- 3
while (length(primes) < 10001){
flag <- 0
i <- i+2
if (i%%6 == 3){
flag <- 1
}
if (flag == 0){
s <- sqrt(i)+1
for (prime in primes){
if ((i%%prime == 0)){
flag <- 1
break
}
if (prime > s){
break
}
}
if (flag == 0){
primes <- c(primes, i)
}
}
}

##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n",


Output:
Total elapsed time:  0 minutes and 1.191000 seconds

Post-Code Commentary:  You may notice that I’m not pre-allocating the prime list and then storing new prime values as I go, like a good little R user.  The reason is that it’s actually slower to do it this way.  Now maybe I’m an idiot and I don’t really understand how R does things, but if I take this algorithm and only modify the prime storage part of it to include preallocating the list and then injecting into that list (replacing a 0) as I go, then the time is nearly a full order of magnitude slower.  Here’s the R code that you would produce if you were to do what I just described:

primes <- numeric(10001)
primes[1:2] <- c(2, 3)
i <- 3
prime.position <- 3

while (primes[10001] == 0){
flag <- 0
i <- i+2
if (i%%6 == 3){
flag <- 1
}
if (flag == 0){
s <- sqrt(i)+1
for (prime in primes[primes > 0]){
if ((i%%prime == 0)){
flag <- 1
break
}
if (prime > s){
break
}
}
if (flag == 0){
primes[prime.position] <- i
prime.position<-prime.position+1
}
}
}