In this article we will look at a "real life" example: how to parse a tag-based FIX message and how to improve the original parsing code. The second part of this article will be dedicated to implementing a simple gateway for FIX messages and finding out why parse-compose logic is very bad from a performance point of view.
FIX messages consist of a number of fields. Each field has a name (a decimal number in FIX) and a value (its datatype depends on the field). Fields are separated with 0x01 and the name is separated from the value with =. This is a textual message format, so field 45 with value 'test' will look like '45=test'. FIX also defines some binary fields, consisting of a field name, a field length and raw data, which may contain 0x01, but for the sake of simplicity we will not discuss them.
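The field layout described above can be illustrated with a minimal sketch. The class and method names here (`FixFormatDemo`, `fields`) are hypothetical and are not part of the article's source code; the sample tags and values are made up as well.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names in this class are hypothetical, not taken from the article's code.
final class FixFormatDemo {
    static final char SOH = '\u0001'; // the 0x01 field separator

    // Splits a raw message into "name=value" strings using the 0x01 separator.
    static List<String> fields( final String message ) {
        final List<String> res = new ArrayList<String>();
        int start = 0;
        for ( int i = 0; i < message.length(); ++i ) {
            if ( message.charAt( i ) == SOH ) {
                res.add( message.substring( start, i ) );
                start = i + 1;
            }
        }
        if ( start < message.length() )
            res.add( message.substring( start ) ); // last field without a trailing separator
        return res;
    }

    public static void main( String[] args ) {
        final String message = "45=test" + SOH + "44=12.5" + SOH;
        for ( final String field : fields( message ) ) {
            final int eq = field.indexOf( '=' );
            System.out.println( "field " + field.substring( 0, eq ) + " has value " + field.substring( eq + 1 ) );
        }
    }
}
```

The binary (length-prefixed) fields mentioned above are deliberately not handled by this sketch.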
Message parsing: naive approach
Let's start writing a message parser. Just for ease of reading, the field separator 0x01 was replaced by a semicolon in the source code. It doesn't change any logic, it only makes a message literal more readable. I've also replaced real FIX fields with very fake ones and left only date/int/double/string field formats. Adding more of them is straightforward, but not beneficial for this article.
The following code parses the message 20K times first, in order to let the JIT compile the test code, and then 10M times for the actual test. It parses a "FIX" message string into a list of Field objects, each of which is a field id plus a field value.
Note: the actual code for this article (see the link at the end of the article) is more object oriented.
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class FixTests {
        private static final int ITERS = 10000000;
        private static final String MESSAGE = "1=123;5=test data;7=20120815;8=another data field, this one is rather long;" +
                "8=and one more field, looks like a repeating one;14=20120101;9=4444;21=20111231;48=one more string field to parse;" +
                "5=another field 5, why does it repeat itself?;1=123;5=test data;7=20120815;8=another data field, this one is rather long;100=144.82;102=2.25";

        public static void main(String[] args) {
            test( 20000 ); //to compile a method
            test( ITERS );
        }

        private static void test( final int iters )
        {
            long cnt = 0;
            final long start = System.currentTimeMillis();
            for ( int i = 0; i < iters; ++i )
            {
                final List<Field> fields = parse( MESSAGE );
                cnt += fields.size();
            }
            final long time = System.currentTimeMillis() - start;
            if ( iters >= 100000 )
                System.out.println( "Time to parse " + iters + " messages = " + time / 1000.0 + " sec, cnt = " + cnt );
        }

        private static Set<Integer> set( final int... values )
        {
            final Set<Integer> res = new HashSet<Integer>( values.length );
            for ( final int i : values )
                res.add( i );
            return res;
        }

        //numbers of non-string fields
        private static final Set<Integer> DATE_FIELDS = set( 7, 14, 21 );
        //note: these are the same ids as DATE_FIELDS, so the int branch in parse() is never reached
        private static final Set<Integer> INT_FIELDS = set( 7, 14, 21 );
        private static final Set<Integer> DOUBLE_FIELDS = set( 100, 102 );

        private static final String FIELD_SEPARATOR = ";";
        private static final String VALUE_SEPARATOR = "=";

        private static final class Field
        {
            public final int id;
            public final Object value;

            private Field(int id, Object value) {
                this.id = id;
                this.value = value;
            }
        }

        //SimpleDateFormat objects are not threadsafe, so such wrapper will save us from multithreading issues
        private static final ThreadLocal<SimpleDateFormat> DATE_FORMAT = new ThreadLocal<SimpleDateFormat>()
        {
            @Override
            protected SimpleDateFormat initialValue() {
                final SimpleDateFormat sdf = new SimpleDateFormat( "yyyyMMdd" );
                sdf.setLenient( true );
                return sdf;
            }
        };

        private static List<Field> parse( final String str )
        {
            final String[] parts = str.split( FIELD_SEPARATOR );
            final List<Field> res = new ArrayList<Field>( parts.length );
            for ( final String part : parts )
            {
                final String[] subparts = part.split( VALUE_SEPARATOR );
                final int fieldId = Integer.parseInt( subparts[ 0 ] );
                if ( DATE_FIELDS.contains( fieldId ) )
                {
                    try {
                        res.add( new Field( fieldId, DATE_FORMAT.get().parse( subparts[ 1 ] ) ) );
                    } catch (ParseException e) {
                        //not production code, so ignore failure, like with numbers
                    }
                }
                else if ( INT_FIELDS.contains( fieldId ) )
                    res.add( new Field( fieldId, Integer.parseInt( subparts[1]) ) );
                else if ( DOUBLE_FIELDS.contains( fieldId ) )
                    res.add( new Field( fieldId, Double.parseDouble( subparts[ 1 ] ) ) );
                else //string
                    res.add( new Field( fieldId, subparts[ 1 ] ) );
            }
            return res;
        }
    }
Initially I ran this code using Java 1.6.0_30 on a Core i7-2630QM (2 GHz up to 2.8 GHz) CPU with 8G RAM. It took 138 seconds to process a message ten million times. I was a little surprised by this result (how could it be so bad) and ran it again with a profiler. It told me that 92% of time was spent in String.split and 6% in DateFormat.parse.
Luckily, from the Regexp-related methods of String article I know that the String.split method with a single character string pattern was optimized in Java 7. I reran this code in Java 1.7.0_02 and it finished in 82.6 seconds, 53% of which was spent in DateFormat.parse and 31% in String.split.
Do we need to parse each field?
Before continuing, we need to ask ourselves one question: do we need to analyze/process all message fields or just some of them? First case is some sort of processing engine. Second case – it may be some filtering gateway or simply a converter to a different message format.
If we need to process all/most of the fields, it will be easier to parse them in advance. Otherwise, we may create a more generic Field, which will store the original unparsed field value as a string and provide conversion methods (possibly caching a converted value). We will not discuss the second case now, because its performance directly depends on the number of fields you'll need to process.
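As a rough illustration of the second case, a generic field could keep the raw string and convert it only on demand. This is a minimal sketch under stated assumptions: `LazyField` and its method names are hypothetical and do not come from the article's source code, and the single cached slot is not thread safe.

```java
// Hypothetical sketch of a lazily converted field; not the article's actual class.
final class LazyField {
    final int id;
    private final String raw;  // original, unparsed value
    private Object converted;  // result of the most recent conversion, if any

    LazyField( final int id, final String raw ) {
        this.id = id;
        this.raw = raw;
    }

    String asString() {
        return raw;
    }

    int asInt() {
        // convert on first use, reuse the cached Integer afterwards
        if ( !( converted instanceof Integer ) )
            converted = Integer.valueOf( Integer.parseInt( raw ) );
        return (Integer) converted;
    }

    double asDouble() {
        if ( !( converted instanceof Double ) )
            converted = Double.valueOf( Double.parseDouble( raw ) );
        return (Double) converted;
    }
}
```

With such a class the parser only has to find field boundaries; the cost of Integer.parseInt or Double.parseDouble is paid only for the fields that are actually read.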
Optimizing full message parsing
Instead, we will see how to optimize the existing message parsing code. First of all, we need to optimize date parsing. We will use an idea from the java.util.Date, java.util.Calendar and java.text.SimpleDateFormat article: we will parse every new date and cache the results in a map. 3 years are just a thousand dates, so such a map will consume just a tiny amount of RAM. Even 30 years are still rather small. If you are worried about multithreading, you may keep such a map per thread (confine it either in a ThreadLocal or in a thread confined parser object), or you may preprocess the last 3-5-10 years and save them in an unmodifiable static map. All dates outside your range have to be parsed directly.
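The "preprocess the last few years" variant could look roughly like the sketch below. The class name `PrecomputedDates`, the 5 year range and the choice to rethrow parse failures as unchecked exceptions are all assumptions made for illustration; none of them come from the article's source code.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Collections;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a read-only map of precomputed dates, safe to share between threads.
final class PrecomputedDates {
    private static final int YEARS_BACK = 5; // assumed range, tune for your data

    private static final Map<String, Date> DATES = build();

    private static Map<String, Date> build() {
        final SimpleDateFormat sdf = new SimpleDateFormat( "yyyyMMdd" );
        final Calendar day = Calendar.getInstance();
        day.set( Calendar.HOUR_OF_DAY, 12 ); // iterate at noon to stay clear of DST shifts
        final Calendar last = (Calendar) day.clone();
        day.add( Calendar.YEAR, -YEARS_BACK );
        final Map<String, Date> map = new HashMap<String, Date>();
        try {
            while ( !day.after( last ) ) {
                final String key = sdf.format( day.getTime() );
                map.put( key, sdf.parse( key ) ); // same value a direct parse would return
                day.add( Calendar.DAY_OF_MONTH, 1 );
            }
        } catch ( ParseException e ) {
            throw new IllegalStateException( e );
        }
        return Collections.unmodifiableMap( map );
    }

    static Date parse( final String date ) {
        final Date cached = DATES.get( date );
        if ( cached != null )
            return cached;
        // outside the precomputed range: parse directly with a fresh formatter
        try {
            return new SimpleDateFormat( "yyyyMMdd" ).parse( date );
        } catch ( ParseException e ) {
            throw new IllegalArgumentException( "Not a yyyyMMdd date: " + date, e );
        }
    }
}
```

Note that java.util.Date is mutable, so callers sharing cached instances must treat them as read-only.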
Using the following tiny class decreases processing time from 82 sec to 36 sec in Java 7 (the same code takes 99 sec to complete in Java 6, because the worst problem in Java 6 is String.split method performance).
    private static class DateParser
    {
        private final Map<String, Date> m_cache = new HashMap<String, Date>( 100 );

        public Date parse( final String date ) throws ParseException {
            final Date cached = m_cache.get( date );
            if ( cached != null )
                return cached;
            final Date res = DATE_FORMAT.get().parse( date );
            m_cache.put( date, res );
            return res;
        }
    }
Now we have to return to the original problem: splitting a message into fields, names and values. The last "optimized" code spends 71% of its time in String.split even in Java 7. Let's make a small change which usually helps in such cases. Instead of using String.split to split a name=value field into a name and a value, we will find the position of the separator using String.indexOf(char) and use String.substring for cheap extraction of the name and the value (do not use String.indexOf(String), it is a little slower). Now it takes 19.1 sec to parse 10 million messages in Java 7 (String.split is responsible for 31% of runtime and Integer.parseInt for 22%). It takes 33.5 sec to run the same code in Java 6: 79% is spent in String.split and 10% in Integer.parseInt.
    //date parsing + inner split optimized
    //assumes the enclosing class also defines VALUE_SEPARATOR_CHAR ('=') and m_dateParser (a DateParser instance)
    private static List<Field> parse3(final String str)
    {
        final String[] parts = str.split( FIELD_SEPARATOR );
        final List<Field> res = new ArrayList<Field>( parts.length );
        for ( final String part : parts )
        {
            final int eq = part.indexOf( VALUE_SEPARATOR_CHAR );
            final int fieldId = Integer.parseInt( part.substring( 0, eq ) );
            res.add( makeField( fieldId, part.substring( eq + 1 ) ) );
        }
        return res;
    }

    private static Field makeField( final int fieldId, final String value )
    {
        if ( DATE_FIELDS.contains( fieldId ) )
        {
            try {
                return new Field( fieldId, m_dateParser.parse( value ) );
            } catch (ParseException e) {
                //not production code, so ignore failure, like with numbers
                return null;
            }
        }
        else if ( INT_FIELDS.contains( fieldId ) )
            return new Field( fieldId, Integer.parseInt( value ) );
        else if ( DOUBLE_FIELDS.contains( fieldId ) )
            return new Field( fieldId, Double.parseDouble( value ) );
        else //string
            return new Field( fieldId, value );
    }
Set of integers or bit set?
Now we need to pay attention to the definitions of int/double/date fields: 3 Set<Integer> objects. In this version, checking these sets contributes about two seconds to the total runtime. It was too little to worry about on the previous steps, but now it is worth 6% for Java 6 (33.5 -> 31.8 sec) and 10% for Java 7 (19.1 -> 17.1 sec). There are two better alternatives: replace the sets of integers with BitSets (as described in the Bit sets article) or use a byte array where the cell at index i contains a field type code. In our test case the performance of both of them will be nearly identical.
    private static BitSet set( final int... values )
    {
        final BitSet res = new BitSet( 120 );
        for ( final int i : values )
            res.set( i );
        return res;
    }

    private static final BitSet DATE_FIELDS = set( 7, 14, 21 );
    //note: same ids as DATE_FIELDS, so the int branch below is never reached
    private static final BitSet INT_FIELDS = set( 7, 14, 21 );
    private static final BitSet DOUBLE_FIELDS = set( 100, 102 );

    private static Field makeField( final int fieldId, final String value )
    {
        if ( DATE_FIELDS.get( fieldId ) )
        {
            try {
                return new Field( fieldId, m_dateParser.parse( value ) );
            } catch (ParseException e) {
                //not production code, so ignore failure, like with numbers
                return null;
            }
        }
        else if ( INT_FIELDS.get( fieldId ) )
            return new Field( fieldId, Integer.parseInt( value ) );
        else if ( DOUBLE_FIELDS.get( fieldId ) )
            return new Field( fieldId, Double.parseDouble( value ) );
        else //string
            return new Field( fieldId, value );
    }
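The byte array alternative mentioned above could be sketched as follows. The class name `FieldTypes`, the type codes and the choice of fields 1 and 9 as int fields are assumptions made for illustration only; the article's own code does not define them.

```java
// Hypothetical sketch of a per-field type table; codes and names are illustrative.
final class FieldTypes {
    static final byte STRING = 0; // default for every id not listed below
    static final byte DATE = 1;
    static final byte INT = 2;
    static final byte DOUBLE = 3;

    private static final byte[] TYPES = new byte[ 120 ];

    static {
        mark( DATE, 7, 14, 21 );
        mark( INT, 1, 9 ); // assumed int fields, chosen only for this example
        mark( DOUBLE, 100, 102 );
    }

    private static void mark( final byte type, final int... ids ) {
        for ( final int id : ids )
            TYPES[ id ] = type;
    }

    // One array read replaces up to three set lookups.
    static byte typeOf( final int fieldId ) {
        return fieldId >= 0 && fieldId < TYPES.length ? TYPES[ fieldId ] : STRING;
    }
}
```

A makeField method could then use a single `switch ( FieldTypes.typeOf( fieldId ) )` instead of a chain of lookups, which also makes it impossible for one field id to belong to two types at once.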
Final effort – manual parsing, last tunings
The next step in the optimization process is to get rid of String.split entirely. Instead, we will iterate over all message characters and act on the field separator, on the value separator and at the end of the string. We don't call String.split anymore, so we may expect Java 6 to work as fast as Java 7 (or a little slower due to Java 7 optimizations).
    private static List<Field> parse4( final String str )
    {
        final List<Field> res = new ArrayList<Field>( 20 );
        final char[] chars = str.toCharArray(); //optional, may use String.charAt as well
        int p = 0;
        int startField = 0;
        int startValue = 0;
        int fieldId = -1;
        while ( p < chars.length )
        {
            if ( chars[ p ] == FIELD_SEPARATOR_CHAR )
            {
                res.add( makeField( fieldId, str.substring( startValue, p ) ) );
                startField = p + 1;
            }
            else if ( chars[ p ] == VALUE_SEPARATOR_CHAR )
            {
                startValue = p + 1;
                fieldId = Integer.parseInt( str.substring( startField, p ) );
            }
            ++p;
        }
        if ( startField < chars.length )
            res.add( makeField( fieldId, str.substring( startValue ) ) );
        return res;
    }
Surprisingly, this method works faster in Java 6: 15.5 sec for 10M calls. It took 15.9 sec for the same number of method calls in Java 7. So this step gave only a small gain in Java 7 (17.1 -> 15.9 sec), but made the code nearly 2 times faster in Java 6 (31.8 -> 15.5 sec).
But now we can save even more. Instead of calling Integer.parseInt for field names, we may parse them incrementally in our loop. This should save us from looking back and from creating some temporary objects.
    //optimized date parsing + finite automaton + manual string-to-int conversion
    private static List<Field> parse5( final String str )
    {
        final List<Field> res = new ArrayList<Field>( 20 );
        final char[] chars = str.toCharArray(); //optional, may use String.charAt as well
        int p = 0;
        int startField = 0;
        int startValue = -1;
        int fieldId = 0;
        while ( p < chars.length )
        {
            if ( chars[ p ] == FIELD_SEPARATOR_CHAR )
            {
                res.add( makeField( fieldId, str.substring( startValue, p ) ) );
                startField = p + 1;
                startValue = -1;
                fieldId = 0;
            }
            else if ( chars[ p ] == VALUE_SEPARATOR_CHAR )
            {
                startValue = p + 1;
            }
            else if ( startValue == -1 )
            {
                //still inside the field name: accumulate its decimal digits
                final int digit = chars[ p ] - '0';
                fieldId = fieldId * 10 + digit;
            }
            ++p;
        }
        if ( startField < chars.length )
            res.add( makeField( fieldId, str.substring( startValue ) ) );
        return res;
    }
As a result, it now takes 13.3 sec to process 10M messages in Java 6 and the same 13.3 sec in Java 7. Hurray! That is about 10 times faster than the original version on Java 6 (138 -> 13.3 sec) and about 6 times faster on Java 7 (82.6 -> 13.3 sec).
Further optimizations: skip fields
We can't optimize this example further without a large effort. But we can achieve more by reviewing the task itself. As was previously mentioned, if you don't need to process every single field, then it is worth storing the original string values for all fields and processing them on demand.
If you know in advance which fields you will need, you may even skip creating the other ones in the parse method: just look for the next field separator if fieldId is not in your "favourites" list.
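The "favourites" idea could be sketched like this. It is a minimal illustration that assumes a well-formed message with purely numeric field names and the same ';' and '=' separators used throughout this article; `SelectiveParser`, `RawField` and the `wanted` parameter are hypothetical names, not part of the article's source code.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: only fields listed in "wanted" are materialized, the rest are skipped.
final class SelectiveParser {
    static final char FIELD_SEPARATOR_CHAR = ';';
    static final char VALUE_SEPARATOR_CHAR = '=';

    static final class RawField {
        final int id;
        final String value; // kept as a string, to be converted on demand

        RawField( final int id, final String value ) {
            this.id = id;
            this.value = value;
        }
    }

    static List<RawField> parse( final String str, final BitSet wanted ) {
        final List<RawField> res = new ArrayList<RawField>();
        final int len = str.length();
        int p = 0;
        while ( p < len ) {
            // accumulate the numeric field id up to the value separator
            int fieldId = 0;
            while ( p < len && str.charAt( p ) != VALUE_SEPARATOR_CHAR ) {
                fieldId = fieldId * 10 + ( str.charAt( p ) - '0' );
                ++p;
            }
            ++p; // step over '='
            if ( p > len )
                break; // trailing text without a value separator: stop
            int end = str.indexOf( FIELD_SEPARATOR_CHAR, p );
            if ( end == -1 )
                end = len;
            // a substring is created only for the fields we care about
            if ( wanted.get( fieldId ) )
                res.add( new RawField( fieldId, str.substring( p, end ) ) );
            p = end + 1;
        }
        return res;
    }
}
```

For unwanted fields the only work done is the id accumulation and one String.indexOf(char) call, so no objects are allocated for them.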
Summary
When processing messages, always try to cache parsed dates that do not have a time component: the number of distinct dates in modern financial data is very low.
String.split should usually be avoided. The only exception is a single character pattern in Java 7. You can still write faster code even in this case, but you will have to add some parsing logic into a splitting loop.
Never parse a "field=value" pair with String.split. String.indexOf(char) with the separator character is a far better alternative.
See QuickFIX library for a good open-source implementation of the FIX parsing infrastructure.
Source code for FIX parser/gateway articles
The post Use case: FIX message processing. Part 1: Writing a simple FIX parser appeared first on Java Performance Tuning Guide.