Transient Expand-Archive Failures

Chris Oldwood from The OldWood Thing

[I’m sure there is something else going on here but on the off-chance someone else is also observing this and also lost at least they’ll know they’re not alone.]

We have a GitLab project pipeline that started out as a monolithic job but over the last 9 months has slowly been parallelized and now runs as over 150 jobs spread out across a cluster of 4 fairly decent [1] machines with 8 to 10 concurrent jobs per host. More recently we’ve started seeing the PowerShell Expand-Archive cmdlet failing randomly up to 5% of the time with the following error:

Remove-Item : Cannot find path {...} because it does not exist.

The line of code highlighted in the error is:

$expandedItems | % { Remove-Item $_ -Force -Recurse }

If you google this message it suggests this probably isn’t the real error but a problem with the cmdlet trying to clean-up after failing to extract the contents of the .zip file. Sadly the reason why the extraction might have failed in the first place is now lost.

Investigation

While investigating this error message I ran across two main hits – one from Stack Overflow and the other on the PowerShell GitHub project – both about hitting the classic long path problem in Windows. In our case the extracted paths, even including the build agent root, is still only 100 characters so well within the limit as the archive only has one subfolder and the filenames are short.

Also the archive is built with it’s companion cmdlet Compress-Archive so I doubt it’s an impedance mismatch in our choice of tools.

My gut reaction to anything spurious like this is that it’s the virus scanner (AV) [2]. Sadly I have no direct control over the virus scanner product choice or its configuration. In this instance the machines have Trend Micro whereas the other build agents I’ve built are VMs and have Windows Defender [3], but their load is also much lower. I managed to get the build folder excluded temporarily but that appears to have had no effect and nothing was logged in the AV to say it had blocked anything. (The “behaviour monitoring” in modern AV products often gets triggered by build tools which is annoying.)

After discounting the obvious and checking that memory exhaustion also wasn’t a factor as the memory load for the jobs is variable and the worst case loading can cause the page-file to be used, I wondered if there the problem lay with the GitLab runner cache somehow.

Corrupt Runner Cache?

To avoid downloading the .zip file artefact for every job run we utilise the GitLab runner local cache. This is effectively a .zip file of a packages folder in the project working copy that gets packed up and re-used in the other jobs on the same machine which, given our level of concurrency, means it’s constantly in use. Hence I wondered if our archive was being corrupted when the cache was being unpacked as I’ve seen embedded .zip files cause problems in the past for AV tools (even though it supposedly shouldn’t have been touching the folder). So I added a step to test our archive’s integrity before unpacking it by using 7-Zip as there doesn’t appear to be a companion cmdlet Test-Archive. I immediately saw the integrity test pass but the Expand-Archive step fail a few times so I’m pretty sure the problem is not archive corruption.

Workaround

The workaround which I’ve employed is to use 7-Zip for the unpacking step too and so far we’ve seen no errors at all but I’m left wondering why Expand-Archive was intermittently failing. Taking an extra dependency on a popular tool like 7-Zip is hardly onerous but it bumps the complexity up very slightly and needs to be accounted for in the docs / scripts.

In my 2017 post Fallibility I mentioned how I once worked with someone who was more content to accept they’d found an undocumented bug in the Windows CopyFile() function than believe there was a flaw in their code or analysis [4]. Hence I feel something as ubiquitous as Expand-Archive is unlikely to have a decompression bug and that there is some piece of the puzzle here that I’m missing. Maybe the AV is still interfering in some way that isn’t triggered by 7-Zip or the transient memory pressure caused by the heavier jobs is having an impact?

Given the low cost of the workaround (use 7-Zip instead) the time, effort and disruption needed to run further experiments to explore this problem further is sadly too high. For the time being annecdata is the best I can do.

 

[1] 8 /16 cores, 64 / 128 GB RAM, and NVMe based disks.

[2] I once did some Windows kernel debugging to help prove an anti-virus product update was the reason our engine processes where not terminating correctly under low memory conditions.

[3] Ideally servers shouldn’t need anti-virus tools at all but the principle of Defence in Depth suggests the minor performance impact is worth it to potentially help slow lateral movement.

[4] TL;DR: I quickly showed it was the latter at fault not the Windows API.

 

CI/CD Server Inline Scripts

Chris Oldwood from The OldWood Thing

As you might have already gathered if you’d read my 2014 post “Building the Pipeline - Process Led or Product Led?” I’m very much in favour of developing a build and deployment process locally first, then automating that, rather than clicking buttons in a dedicated CI/CD tool and hoping I can debug it later. I usually end up at least partially scripting builds anyway [1] to save time waiting for the IDE to open [2] when I just need some binaries for a dependency, so it’s not wasted effort.

Inline Scripts

If other teams prefer to configure their build or deployment through a tool’s UI I don’t really have a problem with that if I know I can replay the same steps locally should I need to test something out as the complexity grows. What I do find disturbing though is when some of the tasks use inline scripts to do something non-trivial, like perform the entire deployment. What’s even more disturbing is when that task script is then duplicated across environments and maintained independently.

Versioning

There are various reasons why we use a version control tool, but first and foremost they provide a history, which implies that we can trace back any changes that have been made and we have a natural backup should we need to roll back or restore the build server.

Admittedly most half-decent build and deployment tools come with some form of versioning built in which you gives that safety net. However having that code versioned in a separate tool and repository from the main codebase means that you have to work harder to correlate what version of the system requires what version of the build process. CI/CD tools tend to present you with a fancy UI for looking at the history rather than giving you direct access to, say, it’s internal git repo. And even then what the tool usually gives you is “what” changed, but does not also provide the commentary on “why” it was changed. Much of what I wrote in my “Commit Checklist” equally applies to build and deployment scripts as it does production code.

Although Jenkins isn’t the most polished of tools compared to, say, TeamCity it is pretty easy to configure one of the 3rd party plugins to yank the configuration files out and check them into the same repo as the source code along with a suitable comment. As a consequence any time the repo is tagged due to a build being promoted the Jenkins build configuration gets included for free.

Duplication

My biggest gripe is not with the versioning aspect though, which I believe is pretty important for any non-trivial process, but it’s when the script is manually duplicated across environments. Having no single point of truth, from a logic perspective, is simply asking for trouble. The script will start to drift as subtleties in the environmental differences become enshrined directly in the logic rather than becoming parameterised behaviours.

The tool’s text editor for inline script blocks is usually a simple edit box designed solely for trivial changes; anything more significant is expected to be handled by pasting into a real editor instead. But we all know different people like different editors and so this becomes another unintentional source of difference as tabs and spaces fight for domination.

Fundamentally there should be one common flow of logic that works for every environment. The differences between them should boil down to simple settings, like credentials, or cardinality of resources, e.g. the number of machines in the cluster. Occasionally there may be custom branches in logic, such as the need for a proxy server, but it should be treated as a minor deviation that could apply to any environment, but just happens to only be applicable to, say, one at the moment.

Testability

This naturally leads onto the inherent lack of testability outside of the tool and workflow. It’s even worse if the script makes use of some variable substitution system that the CI/CD tool provides because that means you have to manually fix-up the code before running it outside the tool, or keep running it in the tool and use printf() style debugging by looking at the task’s output.

All script engines I’m aware of accept arguments, so why not run the script as an external script and pass the arguments from the tool in the tried and tested way? This means the tool runs it pretty much the same way you do except perhaps for some minor environmental differences, like user account or current working directory which are all common problems and easily overcome. Most modern scripting languages come with a debugger too which seems silly to give up.

Of course this doesn’t mean that you have to make every single configuration setting a separate parameter to the script, that would be overly complicated too. Maybe you just provide one parameter which is a settings file for the environment with a bunch of key/value pairs. You can then tweak the settings as appropriate while you test and debug. While idempotence and the ideas behind Desired State Configuration (DSC) are highly desirable, there is no reason we can’t also borrow from the Design for Testability guidebook here too by adding features making it easier to test.

Don’t forget that scripting languages often come with unit test frameworks these days too which can allow you to mock out code which has nasty side-effects so you can check your handling and orchestration logic. For example PowerShell has Pester which really helps bring some extra discipline to script development; an area which has historically been tough due to the kinds of side-effects created by executing the code.

Complexity

When an inline script has grown beyond the point where Hoare suggests “there are obviously no deficiencies”, which is probably anything more than a trivial calculation or invocation of another tool, then it should be decomposed into smaller functional units. Then each of these units can be tested and debugged in isolation and perhaps the inline script then merely contains a couple of lines of orchestration code, which would be trivial to replicate at a REPL / prompt.

For example anything around manipulating configuration files is a perfect candidate for factoring out into a function or child script. It might be less efficient to invoke the same function a few times rather than read and write the file once, but in the grand scheme of things I’d bet it’s marginal in comparison to the rest of the build or deployment process.

Many modern scripting languages have a mechanism for loading some sort of module or library of code. Setting up an internal package manager is a pretty heavyweight option in comparison to publishing a .zip file of scripts but if it helps keep the script complexity under control and provides a versioned repository that can be reliably queried at execution time, then why not go for that instead?

Scripts are Artefacts

It’s easy to see how these things happen. What starts off as a line or two of script code eventually turns into a behemoth before anyone realises it’s not been versioned and there are multiple copies. After all, the deployment requirements historically come up at the end of the journey, after the main investment in the feature has already happened. The pressure is then on to get it live, and build & deployment, like tests, is often just another second class citizen.

The Walking Skeleton came about in part to push back against this attitude and make the build pipeline and tests part and parcel of the whole delivery process; it should not be an afterthought. This means it deserves the same rigour we apply elsewhere in our process.

Personally I like to see everything go through the pipeline, by which I mean that source code, scripts, configuration, etc. all enter the pipeline as versioned inputs and are passed along until the deployed product pops out the other end. The way you build your artefacts is inherently tied to the source code and project configuration that produces it. Configuration, whether it be infrastructure or application settings, is also linked to the version of the tools, scripts, and code which consumes it. It’s more awkward to inject version numbers into scripts, like you do with binaries, but even pushing them through the pipeline in a .zip file with version number in the filename makes a big difference to tracking the “glue”.

Ultimately any piece of the puzzle that directly affects the ability to safely deliver continuous increments of a product needs to be held in high regard and treated with the respect it deserves.

 

[1] See “Cleaning the Workspace” for more about why I don’t trust my IDE to clean up after itself.

[2] I’m sure I could load Visual Studio, etc. in “safe mode” to avoid waiting for all the plug-ins and extensions to initialise but it still seems “wrong” to load an entire IDE just to invoke the same build tool I could invoke almost directly from the command line myself.

Building Emacs 25.2 on XUbuntu 17.04

Timo Geusch from The Lone C++ Coder's Blog

I haven’t done much with Ubuntu recently, but had to set up a laptop with XUbuntu 17.04. That came with Emacs 24.5 as the default emacs package, and as skeeto pointed out in the comments, with a separate emacs25 package for Emacs 25.1. I tend to run the latest release Emacs everywhere out of habit, […]

The post Building Emacs 25.2 on XUbuntu 17.04 appeared first on The Lone C++ Coder's Blog.

Automate Only What You Need To

Chris Oldwood from The OldWood Thing

The meme tells us to “automate all the things” and it’s a noble cause which has sprung up as a backlash against the ridiculous amount of manual work we’ve often had to do in the past. However in our endeavour to embrace the meme we should not go overboard and lose sight of what we’re automating and why.

The Value

The main reason we tend to automate things is to save ourselves time (and by extension, money) by leveraging tools that can perform tasks quicker than we can, but also with more determinism and reliability, thereby saving even more time. For example, pasting a complex set of steps off a wiki page into a command prompt to perform a task is slower than an interpreter running a script and is fraught with danger as we might screw up at various points along the way and so end up not doing exactly what we’d intended. Ultimately computers are good at boring repetitive tasks whilst we humans are not.

However if we only do this operation once every six months and there are too many potential points of failure we might spend far longer trying to automate it than it actually takes to do carefully, manually. It’s a classic trade-off and like most things in IT there are some XKCD’s for that – “Automation” and “Is It Worth the Time”. They make sobering reading when you’re trying to work out how much time you might save automating something and therefore also gives a good indication of the maximum amount of time you should spend on achieving that.

Orchestration First, Actor Later

Where I think the meme starts to break down is when we get this balance wrong and begin to lose sight of where the real value is, thereby wasting time trying to automate not only all the steps but also wire it into some job scheduling system (e.g. CI server) so that once in a blue moon we can push a button and the whole task from start to finish is executed for us without further intervention.

The dream suggests at that point we can go off and do something else more valuable instead. Whilst this notion of autonomy is idyllic it can also come with a considerable extra up-front cost and any shortcuts are likely to buy us false security (i.e. it silently fails and we lose time investigating downstream failures instead).

For example there are many crude command prompt one-liners I’ve written in the past to pick up common mistakes that are trivial for me to run because they’ve been written to automate the expensive bit, not the entire problem. I often rely on my own visual system to filter out the noise and compensate for the impurities within the process. Removing these wrinkles is often where the proverbial “last 10% that takes 90% of the time” goes [1].

It’s all too easy to get seduced by the meme and believe that no automation task is truly complete until it’s fully automated.

An Example

In .Net when you publish shared libraries as NuGet packages you have a .nuspec file which lists the package dependencies. The library .csproj build file also has project dependencies for use with compilation. However these two sets of dependencies should be kept in sync [2].

Initially with only a couple of NuGet packages it was easy to do manually as I knew it was unlikely to change. However once the monolithic library got split it up the dependencies started to grow and manually comparing the relevant sections got harder and more laborious.

Given the text based nature of the two files (XML) it was pretty easy to write a simple shell one-liner to grep the values from the two sets of relevant XML tags, dump them in a file, and then use diff to show a side-by-side comparison. Then it just needed wrapping in a for loop to traverse the solution workspace.

Because the one-liner was mine I got to take various shortcuts like hardcoding the input path and temporary files along with “knowing” that a certain project was always misreported. At this point a previously manual process has largely been automated and as long as I run it regularly will catch any mistakes.

Of course it’s nice to share things like this so that others can take advantage after I’m gone, and it might be even better if the process can be added as a build step so that it’s caught the moment the problem surfaces rather than later in response to a more obscure issue. Now things begin to get tricky and we start to see diminishing returns.

First, the Gnu on Windows (GoW) toolset I used isn’t standard on Windows so now I need to make the one-liner portable or make everyone else match my tooling choice [3]. I also need to fix the hard coded paths and start adding a bit of error handling. I also need to find a way to remove the noise caused by the one “awkward” project.

None of this is onerous, but this all takes time and whilst I’m doing it I’m not doing something (probably) more valuable. The majority of the value was in being able to scale out this safety check, there is (probably) far less value in making it portable and making it run reliably as part of an automated build. This is because essentially it only needs to be run whenever the project dependencies change and that was incredibly rare once the initial split was done. Additionally the risk of not finding an impedance mismatch was small and should be caught by other automated aspects of the development process, i.e. the deployment and test suite.

Knowing When to Automate More

This scenario of cobbling something together and then finding you need to do it more often is the bread and butter of build & deployment pipelines. You often start out with a bunch of hacked together scripts which do just enough to allow the team to bootstrap itself in to an initial fluid state of delivery. This is commonly referred to as a walking skeleton because it forms the basis for the entire development process.

The point of starting with the walking skeleton rather than just diving headlong into features is to try and tackle some of the problems that historically got left until it was too late, such as packaging and deployment. In the modern era of continuous delivery we look to deliver a thin slice of functionality quickly and then build upon it piecemeal.

However it’s all too easy to get bogged down early on in a project and spend lots of time just getting the build pipeline up and running and have nothing functional to show for it. This has always made me feel a little uncomfortable as it feels as though we should be able to get away with far less than perhaps we think we need to.

In “Building the Pipeline - Process Led or Automation Led” and my even earlier post “Layered Builds” I’ve tried to promote a more organic approach that focuses on what I think really matters most which is a consistent and extensible approach. In essence we focus first on producing a simple, repeatable process that can be used locally to enable the application skeleton to safely evolve and then balance the need for automating this further along with the other features. If quality or speed of delivery drops and more automation looks to be the answer then it can be added with the knowledge that it’s being done for deliberate reasons, rather than because we’ve got carried away gold plating the build system based on what other people think it should do (i.e. a cargo cult mentality).

Technical Risk

The one caveat to being leaner about your automation is that you may (accidentally) put off addressing one or more technical risks because you don’t perceive them as risks. This leads us back to why the meme exists in the first place – failing to address certain aspects of software delivery until it’s too late. If there is a technical concern, address it, but only to the extent that the risk is understood, you may not need to do anything about it now.

With a team of juniors there is likely to be far more unknowns [4] than with a team of experienced programmers, therefore the set of perceived risks will be higher. Whilst you might not know the most elegant approach to solving a problem, knowing an approach already reduces the risk because you know that you can trade technical debt in the short term for something else more valuable if necessary.

Everything is Negotiable

The thing I like most about an agile development process is that every trade-off gets put front-and-centre, everything is now negotiable [5]. Every task now comes with an implicit question: is this the most valuable thing we could be doing?

Whilst manually building a private cloud for your production system using a UI is almost certainly not the most scalable approach, neither is starting day one of a project by diving into, say, Terraform when you don’t even know what you’re supposed to be building. There is nothing wrong with starting off manually, you just need to be diligent and ensure that your decision to only automate “enough of the things” is always working in your favour.

 

[1] See “The Curse of NTLM Based HTTP Proxies”.

[2] I’m not aware of Visual Studio doing this yet although there may now be extensions and tools written by others I’m not aware of.

[3] Yes, the Unix command line tools should be ubiquitous and maybe finally they will be with Bash on Windows.

[4] See “Turning Unconscious Incompetence to Conscious Incompetence”.

[5] See “Estimating is Liberating”.

Automated Integration Testing with TIBCO

Chris Oldwood from The OldWood Thing

In the past few years I’ve worked on a few projects where TIBCO has been the message queuing product of choice within the company. Naturally being a test-oriented kind of guy I’ve used unit and component tests for much of the donkey work, but initially had to shy away from writing any automated integration tests due to the inherent difficulties of getting the system into a known state in isolation.

Organisational Barriers

For any automated integration tests to run reliably we need to control the whole environment, which ideally is our development workstations but also our CI build environment (see “The Developer’s Sandbox”). The main barriers to this with a commercial product like TIBCO are often technological, but also more often than not, organisational too.

In my experience middleware like this tends to be proprietary, very expensive, and owned within the organisation by a dedicated team. They will configure the staging and production queues and manage the fault-tolerant servers, which is probably what you’d expect as you near production. A more modern DevOps friendly company would recognise the need to allow teams to test internally first and would help them get access to the product and tools so they can build their test scaffolding that provides the initial feedback loop.

Hence just being given the client access libraries to the product is not enough, we need a way to bring up and tear down the service endpoint, in isolation, so that we can test connectivity and failover scenarios and message interoperability. We also need to be able develop and test our logic around poisoned messages and dead-letter queues. And all this needs to be automatable so that as we develop and refactor we can be sure that we’ve not broken anything; manually testing this stuff is not just not scalable in a shared test environment at the pace modern software is developed.

That said, the TIBCO EMS SDK I’ve been working with (v6.3.0) has all the parts I needed to do this stuff, albeit with some workarounds to avoid needing to run the tests with administrator rights which we’ll look into later.

The only other thorny issue is licensing. You would hope that software product companies would do their utmost to get developers on their side and make it easy for them to build and test their wares, but it is often hard to get clarity around how the product can be used outside of the final production environment. For example trying to find out if the TIBCO service can be run on a developer’s workstation or in a cloud hosted VM solely for the purposes of running some automated tests has been a somewhat arduous task.

This may not be solely the fault of the underlying product company, although the old fashioned licensing agreements often do little to distinguish production and modern development use [1]. No, the real difficulty is finding the right person within the client’s company to talk to about such matters. Unless they are au fait with the role modern automated integrated testing takes place in the development process you will struggle to convince them your intended use is in the interests of the 3rd party product, not stealing revenue from them.

Okay, time to step down from the soap box and focus on the problems we can solve…

Hosting TIBEMSD as a Windows Service

From an automated testing perspective what we need access to is the TIBEMSD.EXE console application. This provides us with one or more TIBCO message queues that we can host on our local machine. Owning thing process means we can therefore create, publish to and delete queues on demand and therefore tightly control the environment.

If you only want to do basic integration testing around the sending and receiving of messages you can configure it as a Windows service and just leave it running in the background. Then your tests can just rely on it always being there like a local database or the file-system. The build machine can be configured this way too.

Unfortunately because it’s a console application and not written to be hosted as a service (at least v6.3 isn’t), you need to use a shim like SRVANY.EXE from the Windows 2003 Resource Kit or something more modern like NSSM. These tools act as an adaptor to the console application so that the Windows SCM can control them.

One thing to be careful of when running TIBEMSD in this way is that it will stick its data files in the CWD (Current Working Directory), which for a service is %SystemRoot%\System32, unless you configure the shim to change it. Putting them in a separate folder makes them a little more obvious and easier to delete when having a clear out [2].

Running TIBEMSD On Demand

Running the TIBCO server as a service makes certain kinds of tests easier to write as you don’t have to worry about starting and stopping it, unless that’s exactly the kinds of test you want to write.

I’ve found it’s all too easy when adding new code or during a refactoring to accidentally break the service so that it doesn’t behave as intended when the network goes up and down, especially when you’re trying to handle poisoned messages.

Hence I prefer to have the TIBEMSD.EXE binary included in the source code repository, in a known place so that it can be started and stopped on demand to verify the connectivity side is working properly. For those classes of integration tests where you just need it to be running you can add it to your fixture-level setup and even keep it running across fixtures to ensure the tests running at an adequate pace.

If, like me, you don’t run as an Administrator all the time (or use elevated command prompts by default) you will find that TIBEMSD doesn’t run out-of-the-box in this way. Fortunately it’s easy to overcome these two issues and run in a LUA (Limited User Account).

Only Bind to the Localhost

One of the problems is that by default the server will try and listen for remote connections from anywhere which means it wants a hole in the firewall for its default port. This of course means you’ll get that firewall popup dialog which is annoying when trying to automate stuff. Whilst you could grant it permission with a one-off NETSH ADVFIREWALL command I prefer components in test mode to not need any special configuration if at all possible.

Windows will allow sockets that only listen for connections from the local host to avoid generating the annoying firewall popup dialog (and this was finally extended to include HTTP too). However we need to tell the TIBCO server to do just that, which we can achieve by creating a trivial configuration file (e.g. localhost.conf) with the following entry:

listen=tcp://127.0.0.1:7222

Now we just need to start it with the –conf switch:

> tibemsd.exe -config localhost.conf

Suppressing the Need For Elevation

So far so good but our other problem is that when you start TIBEMSD it wants you to elevate its permissions. I presume this is a legacy thing and there may be some feature that really needs it but so far in my automated tests I haven’t hit it.

There are a number of ways to control elevation for legacy software that doesn’t have a manifest, like using an external one, but TIBEMSD does and that takes priority. Luckily for us there is a solution in the form of the __COMPAT_LAYER environment variable [3]. Setting this, either through a batch file or within our test code, supresses the need to elevate the server and it runs happily in the background as a normal user, e.g.

> set __COMPAT_LAYER=RunAsInvoker
> tibemsd.exe -config localhost.conf

Spawning TIBEMSD From Within a Test

Once we know how to run TIBEMSD without it causing any popups we are in a position to do that from within an automated test running as any user (LUA), e.g. a developer or the build machine.

In C#, the language where I have been doing this most recently, we can either hard-code a relative path [4] to where TIBEMSD.EXE resides within the repo, or read it from the test assembly’s app.config file to give us a little more flexibility.

<appSettings>
  <add key=”tibemsd.exe”
       value=”..\..\tools\TIBCO\tibemsd.exe” />
  <add key=”conf_file”
       value=”..\..\tools\TIBCO\localhost.conf” />
</appSettings>

We can also add our special .conf file to the same folder and therefore find it in the same way. Whilst we could generate it on-the-fly it never changes so I see little point in doing this extra work.

Something to be wary of if you’re using, say, NUnit to write your integration tests is that it (and ReSharper) can copy the test assemblies to a random location to aid in insuring your tests have no accidental dependencies. In this instance we do, and a rather large one at that, so we need the relative distance between where the test assemblies are built and run (XxxIntTests\bin\Debug) and the TIBEMSD.EXE binary to remain fixed. Hence we need to disable this copying behaviour with the /noshadow switch (or “Tools | Unit Testing | Shadow-copy assemblies being tested” in ReSharper).

Given that we know where our test assembly resides we can use Assembly.GetExecutingAssembly() to create a fully qualified path from the relative one like so:

private static string GetExecutingFolder()
{
  var codebase = Assembly.GetExecutingAssembly()
                         .CodeBase;
  var folder = Path.GetDirectoryName(codebase);
  return new Uri(folder).LocalPath;
}
. . .
var thisFolder = GetExecutingFolder();
var tibcoFolder = “..\..\tools\TIBCO”;
var serverPath = Path.Combine(
            thisFolder, tibcoFolder, “tibemsd.exe”);
var configPath = Path.Combine(
            thisFolder, tibcoFolder, “localhost.conf”);

Now that we know where the binary and config lives we just need to stop the elevation by setting the right environment variable:

Environment.SetEnvironmentVariable("__COMPAT_LAYER", "RunAsInvoker");

Finally we can start the TIBEMSD.EXE console application in the background (i.e. no distracting console window) using Diagnostics.Process:

var process = new System.Diagnostics.Process
{
  StartInfo = new ProcessStartInfo(path, args)
  {
    UseShellExecute = false,
    CreateNoWindow = true,
  }
};
process.Start();

Stopping the daemon involves calling Kill(). There are more graceful ways of remotely stopping a console application which you can try first, but Kill() is always the fall-back approach and of course the TIBCO server has been designed to survive such abuse.

Naturally you can wrap this up with the Dispose pattern so that your test code can be self-contained:

// Arrange
using (RunTibcoServer())
{
  // Act
}

// Assert

Or if you want to amortise the cost of starting it across your tests you can use the fixture-level set-up and tear down:

private IDisposable _server;

[FixtureSetUp]
public void GivenMessageQueueIsAvailable()
{
  _server = RunTibcoServer();
}

[FixtureTearDown]
public void StopMessageQueue()
{
  _server?.Dispose();
  _server = null;
}

One final issue to be aware of, and it’s a common one with integration tests like this which start a process on demand, is that the server might still be running unintentionally across test runs. This can happen when you’re debugging a test and you kill the debugger whilst still inside the test body. The solution is to ensure that the server definitely isn’t already running before you spawn it, and that can be done by killing any existing instances of it:

Process.GetProcessesByName(“tibemsd”)
       .ForEach(p => p.Kill());

Naturally this is a sledgehammer approach and assumes you aren’t using separate ports to run multiple disparate instances, or anything like that.

Other Gottchas

This gets us over the biggest hurdle, control of the server process, but there are a few other little things worth noting.

Due to the asynchronous nature and potential for residual state I’ve found it’s better to drop and re-create any queues at the start of each test to flush them. I also use the Assume.That construct in the arrangement to make it doubly clear I expect the test to start with empty queues.

Also if you’re writing tests that cover background connect and failover be aware that the TIBCO reconnection logic doesn’t trigger unless you have multiple servers configured. Luckily you can specify the same server twice, e.g.

var connection= “tcp://localhost,tcp://localhost”;

If you expect your server to shutdown gracefully, even in the face of having no connection to the queue, you might find that calling Close() on the session and/or connection blocks whilst it’s trying to reconnect (at least in EMS v6.3 it does). This might not be an expected production scenario, but it can hang your tests if something goes awry, hence I’ve used a slightly distasteful workaround where the call to Close() happens on a separate thread with a timeout:

Task.Run(() => _connection.Close()).Wait(1000);

Conclusion

Writing automated integration tests against a middleware product like TIBCO is often an uphill battle that I suspect many don’t have the appetite or patience for. Whilst this post tackles the technical challenges, as they are at least surmountable, the somewhat harder problem of tackling the organisation is sadly still left as an exercise for the reader.

 

[1] The modern NoSQL database vendors appear to have a much simpler model – use it as much as you like outside production.

[2] If the data files get really large because you leave test messages in them by accident they can cause your machine to really grind after a restart as the service goes through recovery.

[3] How to Run Applications Manifested as Highest Available With a Logon Script Without Elevation for Members of the Administrators Group

[4] A relative path means the repo can then exist anywhere on the developer’s file-system and also means the code and tools are then always self-consistent across revisions.