Twinkles Weekly: September 22, 2016

Our Run 3 sprint is well defined: the major deliverables are the instance catalog generation code, which is to be run at SLAC, and the capability of the phosim pipeline to run all jobs (i.e., with checkpointing of some kind). The former is almost ready, while the latter looks unlikely to happen. We discussed possible fallback positions, some as extreme as turning off the Moon!

Instance Catalogs at SLAC

Epic here (#320)

Quick note, since I know we had talked about starting the new runs during the first week of October. Construction on UW campus means they are going to have to cut power to our building for about eight hours on Sunday, October 2nd. UW IT plans to gently power fatboy down on Friday September 30. Presumably, fatboy will be back up sometime Monday October 3rd, but I don't know how long powering up every machine in the physics and astronomy building will take.

Update as of September 26: UW IT will actually not begin powering down servers (like fatboy) until Saturday, October 1, so fatboy should be up all day on Friday. They anticipate everything will be back up and ready to use by 8am PDT on Monday, October 3.

We believe we have a version of the instance catalog generator which is sufficiently ready that it would be worth trying at SLAC. Rahul now has an account at SLAC but does not know which machines to log in to. Heather can do software installs from source if necessary. The normal node to use at SLAC is rhel6-64.slac.stanford.edu.

Rahul will create a setup script for generating instance catalogs.

PhoSim Pipeline Preparation

Discussion here (#315), epic to follow (Tom).

Phosim preparation: the internal phosim checkpointing mechanism has been studied. It is not adequate to solve our issues, since time is not evenly spread among jobs. It is not clear that this is a high priority for the phosim team. We could decide to just go ahead and accept that some jobs will fail for lack of time (as in Run 1).

We could also look into dmtcp as a different mechanism for handling checkpointing. Heather is going to look into this. Unlikely to be ready on a very short timescale.

Should we consider other options, like dropping the Moon? Rahul makes the sensible suggestion that we consider removing it for visits that we would not otherwise be able to run to completion.
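
As a rough illustration of Rahul's suggestion, the per-visit decision could be sketched as below. This is only a sketch under stated assumptions: the Visit fields, the runtime estimates, and the wall-clock limit are all hypothetical placeholders, not values or names taken from the Twinkles or phosim code.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    obs_id: int           # hypothetical visit identifier
    est_hours: float      # hypothetical runtime estimate with the Moon left on


def plan_moon(visits, wall_limit_hours):
    """Map each obs_id to True (keep the Moon) or False (drop the Moon).

    The Moon is dropped only for visits whose estimated runtime exceeds
    the wall-clock limit, i.e. visits that would not otherwise finish.
    """
    return {v.obs_id: v.est_hours <= wall_limit_hours for v in visits}


# Example with made-up numbers: only the second visit exceeds the limit.
visits = [Visit(obs_id=1, est_hours=10.0), Visit(obs_id=2, est_hours=200.0)]
plan = plan_moon(visits, wall_limit_hours=120.0)
```

A rule like this keeps the Moon in every visit that can complete as is, so the change only affects visits that would otherwise be lost.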