Category Archives: Project Euler in R

Project Euler in R: Problem 10

Problem:  Find the sum of all the primes below two million.

Commentary:  If you’re not careful, this problem can really get you in R.  In poking around the internet, there are a few solutions to this problem, but all the ones I’ve tested are slow.  Some, even from people who are better at R than me, are VERY slow.  That linked one in particular took 6 minutes and 8.728000 seconds to complete.  My solutions runs in under 2 seconds.

So we’ll be using that old prime algorithm I’ve already discussed to death.  Other than a few tricks in there that I never see get implemented, there’s a crucial trick to this solution.  Since we only want the sum, let’s not be slow about finding the primes we want.  I only have to find all primes below \sqrt{n}, where n=2000000.  It takes basically no time to find them.  From there, I start evaluating my sum, and like a good little R user, I vectorize by getting the possible list of possible primes above \sqrt{n}.  From there, I check if any of my known primes divide into those possible primes.  This is actually much faster than generating primes and storing them.

R Code:

ElapsedTime <- system.time({
##########################
n <- 2000000

### Function to get primes below specified n; option to get only those below sqrt(n)
PrimesBelow <- function(n, below.sqrt=FALSE){
  if (below.sqrt == TRUE){
    m <- ceiling(sqrt(n))
  } else {
    m <- n
  }

  primes <- c(2,3)
  i <- 3
  while (i < m-1){
    flag <- 0
    i <- i+2
    if (i%%6 == 3){
      flag <- 1
    }
    if (flag == 0){
      s <- sqrt(i)+1
      possible.primes <- primes[primes<s]
      for (prime in possible.primes){
        if ((i%%prime == 0)){
          flag <- 1
          break
        }
      }
      if (flag == 0){
        primes <- c(primes, i)
      }
    }
  }
  primes
}
###

primes <- PrimesBelow(n, below.sqrt=TRUE)

biglist <- ceiling(sqrt(n)):n
biglist <- biglist[-which(biglist%%2 == 0)]
biglist <- biglist[-which(biglist%%6 == 3)]

for (prime in primes){
  biglist <- biglist[which(biglist%%prime != 0)]
}

answer <- sum(primes)+sum(as.numeric(biglist))
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %f\nTotal elapsed time:  %d minutes and %f seconds\n", answer, ElapsedMins, ElapsedSecs))

Output:
The answer is:  142913828922.000000
Total elapsed time:  0 minutes and 1.915000 seconds


Project Euler in R: Problem 9

Problem:  There exists exactly one Pythagorean triplet for which a + b + c = 1000.  Find the product abc.

Commentary:  I’m not proud of this solution.  Every time I would look at this solution, I just knew there was something really obvious I was missing–that I was doing this in the most bone-headed way possible.  So then I finally just gave up and checked the writeup at Project Euler, and yep, I forgot that Pythagorean Triplets have a very nice parametrization.

I won’t even bother to reproduce the idea here, because the writeup over at Project Euler is actually very well done.  But it is a bit embarrassing to have a masters in number theory and to forget that simple fact.  Oh well.

R Code:

ElapsedTime <- system.time({
##########################
for (a in c(1:333)){
  for (b in c(334:998)){
    test <- a+b+sqrt(a^2+b^2)
    if (test==1000){
      win.a <- a
      win.b <- b
      break
    }
  }
}

answer <- win.a*win.b*sqrt(win.a^2+win.b^2)
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n", answer, ElapsedMins, ElapsedSecs))

Output:

The answer is:  31875000
Total elapsed time:  0 minutes and 0.460000 seconds


Project Euler in R: Problem 8

Problem:  Find the greatest product of five consecutive digits in the 1000-digit number.

Commentary:  So this solution is probably a little non-standard.  I’ve actually got a huge blog post coming up about the little “trick” I use here, so I won’t go into it at length just yet. But before we go into the “trick”, I want to explain why it is that I’m doing the weird crap that I’m doing–afterwards we’ll talk a little about the “what”.

My philosophy on Project Euler problems is that they should be taken “as is”.  If I’m given a text file (as is the case in some problems), then I’d better figure out how to read in that text file.  Likewise, if I’m given a text dump (as is the case here), then no changes can be made by hand.  None at all.  I suspect this is how most people proceed.  However, after solving a problem in Project Euler, I enjoy looking at other people’s solutions, and in doing so, I have found some people who admit to doing some truly crazy things.  Things like adding commas after characters by hand in a huge text dump.  Don’t do that!

So in this problem, the goal is to copy/paste the text dump in a script and sandwich it somewhere between some usable code.  I do this by turning it into a vector in R using Unix sorcery.  Once you buy that this is what I’ve done, the rest is trivial.  Just check every 5 consecutive digits, but keep only the biggest.

So how did I get that thing into a vector?  You might argue that it was by cheating.  I couldn’t disagree more (more on this at the conclusion of the commentary).  I used the function system(), which on a Linux (and presumably Mac OS X, *bsd, …) system is a very powerful tool.  It’s essentially a wrapper for the very powerful Unix (or Unix-like) shell which lives under your operating system and makes everything awesome.  I don’t know what it does on Windows because I don’t give a shit.

So here, I pass off to the shell an echo which contains some unwanted (additional) line breaks, together with the text dump from Project Euler.  Next, I pipe off to tr and sed to do their voodoo.  In short, this is how they work here:

  • tr -d ‘\n’ removes line breaks
  • sed ‘s/\\([0-9]\\)/\\1\\n/g’ adds a line break after every numeral
  • sed ‘s/.$//’ removes last line break

Two additional things:  first, the -e’s allow me to chain together a bunch of sed calls at once.  This is handy because sed is the Stream EDitor, which is exactly what it sounds like–and chaining them like that (as opposed to piping the output to another sed call) is faster because (here) you only have to read in the stream once.  Second, you might notice that in the tr call, I use \n for line breaks, but in the sed call I use \\n (and \\( instead of \(, \\1 instead of \1, etc).  The reason is that R’s system() function is really weird, and requires extra escaping of characters…sometimes.

Ok, so that unpacks most of it.  The last bit is the intern=TRUE part (default FALSE) captures the output and stores it into a character vector (as indicated in the help file).  So the last little leg of sorcery is to use as.numeric to use the numbers for my intended purpose.  That’s it; show’s over.  Go home.

So why wasn’t this cheating, in my opinion?  It’s a wrapper that’s built into core R–it’s not even an extra package I have to download.  In my mind, this is very clearly in the “not cheating” category, but I have some other solutions that, whenever I get the time to post the writeups, blur the line a bit–especially when I start invoking bc.

If you want to argue that there’s a “more R-y” way to do it, then I’m willing to concede (even though I don’t know how–I just drop down to the shell to do my dirty work and have never been left wanting).  If you think it violates the spirit of Project Euler, then I suggest you take a look at some of the Python kids who practically import entire solutions.

Besides, sed is really, really cool.  Deal with it, nerd.

R Code:

ElapsedTime <- system.time({
##########################
big <- as.numeric(system("echo '
73167176531330624919225119674426574742355349194934
96983520312774506326239578318016984801869478851843
85861560789112949495459501737958331952853208805511
12540698747158523863050715693290963295227443043557
66896648950445244523161731856403098711121722383113
62229893423380308135336276614282806444486645238749
30358907296290491560440772390713810515859307960866
70172427121883998797908792274921901699720888093776
65727333001053367881220235421809751254540594752243
52584907711670556013604839586446706324415722155397
53697817977846174064955149290862569321978468622482
83972241375657056057490261407972968652414535100474
82166370484403199890008895243450658541227588666881
16427171479924442928230863465674813919123162824586
17866458359124566529476545682848912883142607690042
24219022671055626321111109370544217506941658960408
07198403850962455444362981230987879927244284909188
84580156166097919133875499200524063689912560717606
05886116467109405077541002256983155200055935729725
71636269561882670428252483600823257530420752963450
' | tr -d '\n' | sed -e 's/\\([0-9]\\)/\\1\\n/g' -e 's/.$//'",intern=TRUE))

big.prod <- 1

for (i in c(1:(length(big)-5))){
  test <- prod(big[i:(i+4)])
  if (test > big.prod){
    big.prod <- test
  }
}

answer <- big.prod
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n",
answer, ElapsedMins, ElapsedSecs))

Output:

The answer is:  40824
Total elapsed time:  0 minutes and 0.014000 seconds


Project Euler in R: Problem 7

Problem:  What is the 10001st prime number?

Commentary:  So it’s finally time to discuss my prime-searching algorithm, as promised back in Problem 5.  We will find primes in the following way:

  1. 2 and 3 are prime
  2. We examine the odd integers (having already discovered the only even prime) starting with i=5.
  3. First, check if i is of the form 6k+3 for some integer k>0.  If so, then i can’t be prime, because then i=6k+3=3(2k+1), and since k>0, 2k+1 > 1 (so i\neq 3).  In this case, don’t even bother proceeding; replace i with i+2 and start over.
  4. Calculate s(i)=\sqrt{i}.  Among all known primes below s(i)  (going no further for the same reason as that outlined in Problem 3), check if these primes divide into i.  If we find one that does, then there is no reason to proceed; i is not prime.  Go back to 3. and try again with i replced with i+2.  If none of the known primes below s(i) divide i, then stop; i must be prime.  Add it to the list of known primes.
  5. Repeat this process by replacing i with i+2 until we have 10001 primes.

It’s nothing fancy, but it works.  As for the implementation, believe it or not, I actually don’t construct a list of length 10001 and insert into it as I gather my primes like a good little R user.  See the post-code commentary for an explanation.

R Code:

ElapsedTime <- system.time({
##########################
primes <- c(2, 3)
i <- 3
while (length(primes) < 10001){
  flag <- 0
  i <- i+2
  if (i%%6 == 3){
    flag <- 1
  }
  if (flag == 0){
    s <- sqrt(i)+1
    for (prime in primes){
      if ((i%%prime == 0)){
        flag <- 1
        break
      }
      if (prime > s){
        break
      }
    }
    if (flag == 0){
      primes <- c(primes, i)
    }
  }
}

answer <- tail(primes, 1)
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n",
answer, ElapsedMins, ElapsedSecs))

Output:
The answer is:  104743
Total elapsed time:  0 minutes and 1.191000 seconds

Post-Code Commentary:  You may notice that I’m not pre-allocating the prime list and then storing new prime values as I go, like a good little R user.  The reason is that it’s actually slower to do it this way.  Now maybe I’m an idiot and I don’t really understand how R does things, but if I take this algorithm and only modify the prime storage part of it to include preallocating the list and then injecting into that list (replacing a 0) as I go, then the time is nearly a full order of magnitude slower.  Here’s the R code that you would produce if you were to do what I just described:

primes <- numeric(10001)
primes[1:2] <- c(2, 3)
i <- 3
prime.position <- 3

while (primes[10001] == 0){
  flag <- 0
  i <- i+2
  if (i%%6 == 3){
    flag <- 1
  }
  if (flag == 0){
    s <- sqrt(i)+1
    for (prime in primes[primes > 0]){
      if ((i%%prime == 0)){
        flag <- 1
        break
      }
      if (prime > s){
        break
      }
    }
    if (flag == 0){
      primes[prime.position] <- i
      prime.position<-prime.position+1
    }
  }
}

answer <- tail(primes, 1)

So how do the two run?  Well, on my work computer (which is a little slower than my home computer–also don’t tell my boss!), the original “bad R user” code runs in 1.852 seconds.  The “better” solution with pre-fetching the vector runs in 11.100 seconds.  That is a 9.248 second differential, which amounts to a nearly 500% increase in run time!

Of course, there are plenty of times when it’s objectively better to pre-allocate; but as it turns out, this isn’t one of them (though I have no idea why).


Project Euler in R: Problem 6

Problem:  Find the difference between the sum of the squares of the first one hundred natural numbers and the square of the sum.

Commentary:  I taught a variety of freshman math courses at my university for roughly 5 years.  As such, this problem is very near and dear to my heart.

I can’t tell you how many kids show up at university thinking they’re hot shits because they took Calc 1 in high school, but don’t know that

a^2+b^2\neq (a+b)^2

This is just one of many adorable things freshmen tend to belive, along with all their ridiculous political beliefs and thinking they’ll somehow change the world in any appreciable way.  Adorable.

Actually, students thinking that the sum of the squares is equal to the square of the sum is so common, that it has its own name.  We call it the “freshman dream”.  I’ve also heard it called “Baby’s Binomial Theorem”, which is way funnier, but much less commonly heard.  And since I guess I have to mention it, there are number-like structures out there where the freshman dream is true (fields of characteristic 2 for my algebros), but the real numbers certainly aren’t one of them.

What’s especially funny about this misunderstanding is that it’s literally almost never true (for real numbers).  Pick your favorite two numbers–they don’t even have to be integers.  They can even be the same number.  Unless one (or both) of them is zero, then tada, the sum of the squares isn’t equal to the square of the sum.

So what other crazy (mathematical) things do university students tend to come up with in freshman math class?  Just on the order of basic arithmetic, it’s almost shocking how little mastery they have over the subject.  And we’re not talking about “complicated” things like the limit, integration, or any other worthless calculus crap.  This is arithmetic.  Our university students don’t understand arithmetic.

Just off the top of my head, these are some of the most common misunderstandings, not counting that mentioned above.  Keep in mind, everything that follows is not true, and what’s more, they’re not true for practically any numbers you throw at them (try it yourself!)

\sqrt{a^2+b^2}=a+b

-(a+b)=-a+b

\frac{a}{c}+\frac{b}{d} = \frac{a+b}{c+d}

\frac{a+b}{a}=b

\frac{a}{a+b}=\frac{1}{b}

And that’s not even getting into logarithms or trigonometry, which none of them understand.

This has a lot to do with why I gave up teaching for the cushy office life.  I haven’t looked back since.

R Code:

ElapsedTime<-system.time({
##########################
answer <- abs(sum((1:100)^2)-sum((1:100))^2)
##########################
})[3]
ElapsedMins<-floor(ElapsedTime/60)
ElapsedSecs<-(ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n", answer, ElapsedMins, ElapsedSecs))

Output:
The answer is:  25164150
Total elapsed time:  0 minutes and 0.000000 seconds


Project Euler in R: Problem 5

Problem:  What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?

Commentary:  This problem again isn’t very difficult.  Basically we just exploit the existence of prime factorizations and construct the least common multiple (lcm) in the most intuitive way possible.   You might find something a bit strange in this solution, namely in the prime-y stuff; however, I don’t really want to get into that until Problem 7.  So I won’t.

R Code:

ElapsedTime <- system.time({
##########################
###
# Function to get primes below specified n; option to get only those below sqrt(n)
PrimesBelow <- function(n,below.sqrt=FALSE){
  if (below.sqrt==TRUE){
    m <- ceiling(sqrt(n))
  } else {
    m <- n
  }

  primes <- c(2, 3)
  i <- 3
  while (i < m-1){
    flag <- 0
    i <- i+2
    if (i%%6 == 3){
      flag <- 1
    }
    if (flag == 0){
      s <- sqrt(i)+1
      possibleprimes <- primes[primes < s]
      for (prime in possibleprimes){
        if ((i%%prime == 0)){
          flag <- 1
          break
          }
        }
        if (flag == 0){
          primes <- c(primes,i)
        }
      }
    }
  primes
}

# Function to get the prime factorization of all integers below
PrimeFact <- function(n){
list <- numeric(0)
for (i in 2:n){
ps <- numeric(0)
for (prime in primes){
p <- 1
while (i%%(prime^p) == 0){
p <- p+1
}
p <- p-1
ps <- c(ps, p)
}
if (length(list) == 0){
list <- data.frame(primes, ps)
} else {
list <- data.frame(list, data.frame(ps))
}
}
list
}
###

n <- 20

primes <- PrimesBelow(n)

list <- PrimeFact(n)

expnts <- numeric(0)
for (nump in 1:length(primes)){
temp <- numeric(0)
for (i in 2:n){
temp <- c(temp, list[[i]][nump])
}
expnts <- c(expnts, sort(temp, decreasing=TRUE)[1])
}

answer <- prod(primes^expnts)
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n",
answer, ElapsedMins, ElapsedSecs))

Output:
The answer is:  232792560
Total elapsed time:  0 minutes and 0.008600 seconds


Project Euler in R: Problem 4

Problem 4: Find the largest palindrome made from the product of two 3-digit numbers.

Commentary:  Believe it or not, I don’t have much to say about this problem.  There are cuter, more general ways to attack this problem, but computer scientists suffer from an acute case of overgeneralizationitis.  I had a math professor once who loved to say “never do a calculation before its time”.  This is good advice, even in programming, but I think there’s a useful addendum to this.  Never generalize before it’s necessary.  Encapsulating an idea into a function is a stupid waste of time unless you’re pretty sure there’s a good reason to do it. But in my experience, most programmers don’t bother thinking much about anything.  They literally behave like computers, as though some algorithm needs to be fed to them and that is then all they are capable of doing from that point forward.

My biggest problem with programmers is they have this mindset that there’s only one possible solution.  I hesitate to say this of computer scientists in general, but it’s certainly true of people who call themselves programmers:  these people lack imagination.  They talk about “right” and “wrong” solutions, when what they really mean is “good” and “bad”, or more often, “elegant” and “inelegant”, a completely subjective, ill-defined notion.

What’s especially sad about this is that the entire field exists because of  mathematicians and logicians.  Ask a computer scientist about the halting problem and he’ll go on and on about it; ask him about Goedel incompleteness and he’ll scratch his head and stare at you blankly.  What’s even funnier about this is that they even stole the concept of “elegance” in a solution from the mathematicians.  But in their theft, they managed to get something very wrong.  We don’t throw away an idea just because a more elegant one comes along.  Sometimes beautiful, elegant solutions give you absolutely no insight into why a thing is the way it is, when the clunky, inelegant one will.

I’m not advocating that we make every proof (or program) into a brutal slog of calculations and appeals to first principles.  But elegance without insight can be very costly.

And that is why programmers are terrible at mathematics.

R Code:

ElapsedTime <- system.time({
##########################
RearrangeVector <- function(vec, n=1){
  vec <- c(tail(vec, length(vec)-n), head(vec,n))
  vec
}

palindromes <- numeric(9*10*10)
i <- 1
for (a1 in 1:9){
  for (a2 in 0:9){
    for (a3 in 0:9){
      palindromes[i] <- a1*10^5 + a2*10^4 + a3*10^3
                          + a3*10^2 + a2*10^1 + a1
      i <- i+1
    }
  }
}

palindromes <- sort(palindromes, decreasing=TRUE)

prod <- numeric(0)
temp <- 899:999
for (n in 1:100){
  prod <- c(prod, temp*RearrangeVector(temp, n))
}

acceptable <- palindromes[palindromes %in% prod]

answer <- acceptable[1]
##########################
})[3]
ElapsedMins <- floor(ElapsedTime/60)
ElapsedSecs <- (ElapsedTime-ElapsedMins*60)
cat(sprintf("\nThe answer is:  %d\nTotal elapsed time:  %d minutes and %f seconds\n",
answer, ElapsedMins, ElapsedSecs))

Output:
The answer is:  906609
Total elapsed time:  0 minutes and 0.011000 seconds