Tag Archives: R

Consistent plotting with ggplot

Unlike other OPIGlets (looking at you, Claire), I have neither the skill nor the patience to make good figures from scratch. And making good figures — as well as remaking, rescaling and adapting them — is incredibly important, because they play a huge role in the way we communicate our research. So how does an aesthetically impaired DPhil student do her plotting?

Continue reading

Working with Jupyter notebook on a remote server

To celebrate the recent beta release of Jupyter Lab (try it out of you haven’t already), today we’re going to look at how to run a Jupyter session (Notebook or Lab) on a remote server.

Suppose you have lots of data which lives on a remote server and you want to play with it in a Jupyter notebook. You can’t copy the data to your local machine (well, you can, but you’re sensible so you won’t), but you can run your Jupyter session on the remote server. There’s just one problem – since Jupyter notebook is browser-based and works by connecting to the Jupyter session running locally, you can’t just run Jupyter remotely and forward X11 like you would a traditional graphical IDE. Fortunately, the solution is simple: we run Jupyter remotely, create an ssh tunnel connecting a local port to the one used by the Jupyter session, and connect directly to the Jupyter session using our local browser. The best part about this is that you can set up the Jupyter session once then connect to it from any browser on any machine once an ssh tunnel is created, without worrying about X11 forwarding.

Here’s how to do it.

1. First, connect to the remote server if you haven’t already

ssh fergus@funkyserver

1.5. Jupyter takes browser security very seriously, so in order to access a remote session from a local browser we need to set up a password associated with the remote Jupyter session. This is stored in jupyter_notebook_config.py which by default lives in ~/.jupyter. You can edit this manually, but the easiest option is to set the password by running Jupyter with the password argument:

jupyter notebook password
>>> Enter password:

This password will be used to access any Jupyter session running from this installation, so pick something sensible. You can set a new password at any time on the remote server in exactly the same way.

2: Launch a Jupyter session on the remote server. You can specify the access port using the --port option. This might be useful on a shared server where others might be doing the same thing. You’ll also want to run this without launching a browser on the remote server since this is of no use to you.

jupyter lab --port=9000 --no-browser &

Here I’m using Jupyter Lab, but this works in exactly the same way for Jupyter Notebook.

3: Now for the fun part. Jupyter is running on our remote server, but what we really want is to work in our favourite browser on our local machine. To do this we just need to create an ssh tunnel between a port on our machine and the port our Jupyter session is using on the remote server. On our local machine:

ssh -N -f -L 8888:localhost:9000 fergus@funkyserver

For those not familiar with ssh tunneling, we’ve just created a secure, encrypted connection between port 8888 on our local machine and port 9000 on our remote server.

  • -N tells ssh we won’t be running any remote processes using the connection. This is useful for situations like this where all we want to do is port forwarding.
  • -f runs ssh in the background, so we don’t need to keep a terminal session running just for the tunnel.
  • -L specifies that we’ll be forwarding a local port to a remote address and port. In this case, we’re forwarding port 8888 on our machine to port 9000 on the remote server. The name ‘localhost’ just means ‘this computer’. If you’re a Java programmer who lives for verbosity, you could equivalently pass -L localhost:8888:localhost:9000.

4: If you’ve done everything correctly, you should now be able to access your Jupyter session via port 8888 on your machine. Fire up your favourite browser and type localhost:8888 into the address bar. This should bring up a Jupyter session and prompt you for a password. Enter the password you specified for Jupyter on the remote server.

Congratulations! You now have a Jupyter session running remotely which you can connect to anytime, anywhere, from any machine.

Disclaimer: I haven’t tried this on Windows, nor do I intend to. I value my sanity.

aRrrrgh! or how to apply a fitted model to new data

Recently I’ve been battling furiously with R while analysing some loop modelling accuracy data. The idea was simple:

  1. Fit a general linear model to some data
  2. Get out a formula to predict a variable (let’s call it “accuracy”) based on some input parameters
  3. Apply this formula to new data and see how well the predictor does

It turns out, it’s not that simple to actually implement. Fitting a general linear model in R produces coefficients in a vector.

model <- glm(accuracy ~ param1 + param2 * param3, data=trainingset)
coef(model)
            (Intercept)                  param1                  param2 
            0.435395087            -0.093295388             0.148154339 
                 param3           param2:param3
            0.024399530             0.021100300

There seems to be no easy way to insert these coefficients into your formula and apply the resulting equation to new data. The only easy thing to do is to plot the fitted values against the variable we’re trying to predict, i.e. plot our predictions on the training set itself:

plot(model$fitted.values, trainingset$accuracy, xlab="score", ylab="accuracy", main="training set")

I’m sure there must be a better way of doing this, but many hours of Googling led me nowhere. So here is how I did it. I ended up writing my own parser function, which works only on very simple formulae using the + and * operators and without any R code inside the formula.

coefapply <- function(coefficients, row)
{
  result <- 0
  for (i in 1:length(coefficients))
  {
    subresult <- as.numeric(coefficients[i])
    if (!is.na(subresult))
    {
      name <- names(coefficients[i])
      if (name != "(Intercept)")
      {
        subnames <- strsplit(name, ":", fixed=TRUE)[[1]]
        for (n in subnames)
        {
          subresult <- subresult * as.numeric(row[n])
        }
      }
      result <- result + subresult
    }
  }
  return(result)
}

calculate_scores <- function(data, coefficients)
{
  scores <- vector(mode="numeric", length=nrow(data))
  for (i in 1:nrow(data))
  {
    row <- data[i,]
    scores[i] <- coefapply(coefficients, row)
  }
  return(scores)
}

Now we can apply our formula to a new dataset and plot the accuracy achieved on the new data:

model_coef <- coef(model)

# Test if our scores are the same values as the model's fitted values
training_scores <- calculate_scores(model_coef, trainingset)
sum((training_scores - model$fitted.values) < 0.000000000001) / length(scores)

# Calculate scores for our test set and plot them
test_scores <- calculate_scores(model_coef, testset)
plot(test_scores, testset$accuracy, xlab="score", ylab="accuracy", main="test set")

It works for my purpose. Maybe one day someone will see this post, chuckle, and then enlighten me with their perfectly simple and elegant alternative.