Don’t Let Your Microservice Get Put in Timeout
As a form of parental discipline, some children are put in “time-out” as a behavior modification technique. In response to unacceptable behavior, or a repeated and persistent “Puuhhhhleaase”, it may become necessary to temporarily separate a child from an environment. If you’ve seen an HTTP 408 Request Timeout status code coming back from a service, maybe you feel the same way.
A network timeout can exhibit itself in several ways that cause confusion. Perhaps one of these examples will seem familiar.
A socket in Java may time out and raise an exception like this:
java.net.SocketTimeoutException: Read timed out
A read timeout like this occurs when no data arrives on an open connection within the configured timeout. For file transfers, the timeout should be long enough to handle the largest file, but anything longer is time wasted waiting for an answer that is unlikely to come. Typically, you should treat the SLA of the downstream service API as your upper bound.
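To make the idea of a bounded wait concrete, here is a minimal Python sketch using only the standard library. The helper names (start_silent_server, read_with_timeout) and the timeout values are my own illustrative assumptions, not part of any SDK; the silent local server simply stands in for a downstream service that never answers.

```python
import socket
import threading

def start_silent_server():
    """Start a local TCP server that accepts one connection but never replies."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    held = []  # keep the accepted connection open so the client just waits

    def accept_once():
        conn, _ = server.accept()
        held.append(conn)

    threading.Thread(target=accept_once, daemon=True).start()
    return server, held

def read_with_timeout(address, connect_timeout, read_timeout):
    """Return the bytes read, or None if the peer stayed silent too long."""
    with socket.create_connection(address, timeout=connect_timeout) as sock:
        sock.settimeout(read_timeout)  # upper bound on each recv() call
        try:
            return sock.recv(1024)
        except socket.timeout:
            return None

if __name__ == "__main__":
    server, held = start_silent_server()
    # Hypothetical limits: in practice, derive them from the downstream SLA.
    print(read_with_timeout(server.getsockname(), connect_timeout=1.0, read_timeout=0.2))
    server.close()
```

Keeping the connect and read timeouts separate lets you fail quickly on an unreachable host while still giving a slow but healthy response room to finish.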
Perhaps you’ve seen a cryptic ECONNRESET exception raised in Python such as:
requests.exceptions.ConnectionError (Caused by <class 'socket.error'>: [Errno 104] Connection reset by peer)
This can happen when the server, or a proxy in between, gives up on the connection while an expensive operation is still running. Making a database connection, for instance, can take longer than expected. Connection pooling is often used to alleviate this, but even getting a connection from a pool can hit hiccups, or deadlock if no timeouts are enforced.
Or a proxy may log a dial error like this:
ERR proxy: error connecting to 10.131.16.144:8081: dial tcp 10.131.16.144:8081: i/o timeout
This could be a timeout related to routing to a host.
So why is this happening? Well, for starters I think most of us like to imagine the network looking something like this:
but in reality, it is probably closer to this:
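Even striving for five nines with 99.999% uptime allows for some failed requests. Multiply that small failure rate across a very large number of microservice requests, where every call in a chain has to succeed, and the compounded probability of at least one failure grows quickly. You have a recipe for failure.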
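A quick back-of-the-envelope calculation shows how fast this compounds. The sketch below assumes, purely for illustration, that each call succeeds independently with probability 0.99999; real failures are often correlated, so treat the numbers as a rough guide only.

```python
def chain_success_probability(per_call_success: float, calls: int) -> float:
    """Probability that every one of `calls` independent requests succeeds."""
    return per_call_success ** calls

if __name__ == "__main__":
    for calls in (1, 100, 10_000, 1_000_000):
        p_fail = 1 - chain_success_probability(0.99999, calls)
        print(f"{calls:>9} calls: {p_fail:.4%} chance of at least one failure")
```

Under that assumption, the chance of at least one failure is about 0.1% at 100 calls, about 9.5% at 10,000 calls, and above 99.99% at a million calls.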
If you are new to the world of cloud computing, web services, and distributed computing, you may be falling prey to one of the fallacies of distributed computing and seeing strange timeout errors like the above as a result.
“The more I looked around at networking inside and outside of Sun the more I thought I could see instances where making these assumptions got people into trouble.”
Predix is subject to some of the same assumptions about the nature of computer networks that have dogged the industry for the past 20 years. James Gosling (of Java fame), Bill Joy, Dave Lyon, and Peter Deutsch contributed to eight frequently cited examples of fallacies in the understanding of computer networks.
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn’t change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
There are better resources than myself if you want a deeper understanding of these fallacies, but the important take-away is that you need to deal with timeouts as a common occurrence.
There are some commonly recommended approaches to deal with network glitches that can occur, especially while transferring large data sets.
- Retry upon failure
- Investigate messaging architectures
- Check your proxy settings
If at first you don’t succeed
If a request to a service fails or times out, try again. You don’t give up the first time you face adversity, so don’t let your application either. Many libraries default to failing fast without any retries, so you may need to adjust configuration settings and properties for retries, timeouts, and so on.
For example, developers using the Apache jclouds SDK can configure retry and timeout properties for their calls. Tuning these properties to wait between retries and limit the total number of attempts can be extremely important.
Or give the async.retry library a try for your Node.js application; you can install it with npm.
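If your library has no retry support built in, the pattern is small enough to write by hand. Here is a minimal Python sketch of retry with exponential backoff and jitter; the function name, defaults, and the set of retryable exceptions are my own assumptions for illustration, not the API of jclouds or async.retry.

```python
import random
import time

def retry(operation, max_attempts=4, base_delay=0.5, max_delay=8.0,
          retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff capped at max_delay, with full jitter
            # so that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, delay))

if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Simulated service: times out twice, then answers.
        calls["count"] += 1
        if calls["count"] < 3:
            raise TimeoutError("simulated timeout")
        return "ok"

    print(retry(flaky, base_delay=0.01))  # prints "ok" on the third attempt
```

Only retry operations that are safe to repeat, and keep the attempt limit low enough that the total wait stays inside whatever deadline your own callers expect.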
The network is not reliable. You will have latency. You will need to do some additional research to find the right way to add resiliency for your framework, library, and language of choice like the above examples illustrate.
Take a Number
Back when these fallacies were being written, I dreaded getting disconnected from a BBS while downloading a 3 MB file over a 28.8 kbps modem. If you are transferring large binary files, or waiting for a long-lived request to process a big data problem, you may want to investigate a messaging architecture.
I won’t go too deep into messaging in this post, but your service can be designed to accept requests and give the requestor a reference ID they can use to check on the status later.
The request goes into a queue for processing and eventually the status will be complete and the result can be fetched. Predix services like EventHub, Redis and RabbitMQ can be valuable resources in this type of system. This could also take the form of a Retry-After header in your response.
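The accept-now, check-later flow can be sketched in a few lines of Python. This is an in-memory illustration only: the JobService class and its method names are hypothetical, and in a real system the queue would live in a broker such as RabbitMQ rather than inside the process.

```python
import queue
import threading
import uuid

class JobService:
    """Accepts work, hands back a reference ID, and processes in the background."""

    def __init__(self):
        self._jobs = {}
        self._lock = threading.Lock()
        self._queue = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, func, *args):
        """Queue the work and immediately return a reference ID."""
        job_id = str(uuid.uuid4())
        with self._lock:
            self._jobs[job_id] = {"status": "queued", "result": None}
        self._queue.put((job_id, func, args))
        return job_id

    def status(self, job_id):
        """Return a snapshot of the job's status and result."""
        with self._lock:
            return dict(self._jobs[job_id])

    def wait_all(self):
        """Block until every queued job has been processed."""
        self._queue.join()

    def _worker(self):
        while True:
            job_id, func, args = self._queue.get()
            self._set(job_id, status="running")
            try:
                self._set(job_id, status="complete", result=func(*args))
            except Exception as exc:
                self._set(job_id, status="failed", result=repr(exc))
            finally:
                self._queue.task_done()

    def _set(self, job_id, **fields):
        with self._lock:
            self._jobs[job_id].update(fields)

if __name__ == "__main__":
    service = JobService()
    ref = service.submit(sum, [1, 2, 3])
    service.wait_all()
    print(service.status(ref))  # {'status': 'complete', 'result': 6}
```

The caller never holds a connection open while the work runs; it only keeps the reference ID and polls the status when it is ready to.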
Bandwidth is not infinite and there is a cost of transferring data. Other tenants will appreciate it when you are kind to the network and only use it when you need to.
What To Do About Walls
A firewall is designed to protect the squishy center of an enterprise. Many industrial companies and traditional corporations will have firewalls in place blocking any traffic to networks that aren’t trusted.
If you see timeout errors you may simply need to find yourself a proxy to get through the wall to your destination.
Specific instructions will vary from place to place, but the Get Started with the Predix Hello World App guide goes into detail about dealing with proxy connections.
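As one illustration of what a proxy setting looks like in code, here is a small Python sketch using the standard library's urllib. The proxy address is a made-up placeholder and the helper name is my own; your organization's proxy host and port will differ.

```python
import urllib.request

def build_proxied_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(proxy_handler)

# Hypothetical proxy address: substitute the one your network team provides.
opener = build_proxied_opener("http://proxy.example.com:8080")
# opener.open("https://example.org/", timeout=10) would now go through the proxy.
```

By default urllib also picks up proxies from environment variables such as HTTP_PROXY and HTTPS_PROXY, which is often the least intrusive way to configure one.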
The network is not secure, despite best intentions.
Sometimes things just don’t work in complex systems.
Still, I hope some of the earlier advice helps you navigate some of the pitfalls in distributed computing.