From 95bb778f57f55e02bd7fe6f86dfbb4f0a94d6ade Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lo=C3=AFc=20Hoguin?=
Date: Mon, 25 Nov 2013 20:35:17 +0100
Subject: Add an introductory chapter about parsing

---
 guide/parsers.md | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 92 insertions(+)
 create mode 100644 guide/parsers.md

diff --git a/guide/parsers.md b/guide/parsers.md
new file mode 100644
index 0000000..e6e9ece
--- /dev/null
+++ b/guide/parsers.md
@@ -0,0 +1,92 @@
+Writing parsers
+===============
+
+There are three kinds of protocols:
+
+ * Text protocols
+ * Schema-less binary protocols
+ * Schema-based binary protocols
+
+This chapter introduces the first two kinds. It will not cover
+more advanced topics such as continuations or parser generators.
+
+This chapter isn't specifically about Ranch; we assume here that
+you know how to read data from the socket. Data that has been
+read but not yet parsed is kept in a buffer. Every time you read
+from the socket, the new data is appended to the buffer. What
+happens next depends on the kind of protocol, as the following
+sections show.
+
+Parsing text
+------------
+
+Text protocols are generally line-based. This means that we can't
+do anything with the data until we have received a full line.
+
+A simple way to get a full line is to use `binary:split/{2,3}`.
+
+``` erlang
+case binary:split(Buffer, <<"\n">>) of
+	[_] ->
+		get_more_data(Buffer);
+	[Line, Rest] ->
+		handle_line(Line, Rest)
+end.
+```
+
+In the above example, we can have two results. Either there was
+a line break in the buffer and we get it split into two parts,
+the line and the rest of the buffer; or there was no line break
+in the buffer and we need to get more data from the socket.
+
+Next, we need to parse the line. The simplest way is to split
+again, this time on spaces. The difference is that we want to
+split on every space character, as we want to tokenize the whole line.
+
+``` erlang
+case binary:split(Line, <<" ">>, [global]) of
+	[<<"HELLO">>] ->
+		be_polite();
+	[<<"AUTH">>, User, Password] ->
+		authenticate_user(User, Password);
+	[<<"QUIT">>, Reason] ->
+		quit(Reason)
+	%% ...
+end.
+```
+
+Pretty simple, right? Match on the command name, get the rest
+of the tokens in variables and call the respective functions.
+
+After doing this, you will want to check if there is another
+line in the buffer, and handle it immediately if there is one.
+Otherwise wait for more data.
+
+Parsing binary
+--------------
+
+Binary protocols can be more varied, although most of them are
+pretty similar. The first four bytes of a frame tend to be
+the size of the frame, which is followed by a certain number
+of bytes for the type of frame and then various parameters.
+
+Sometimes the size of the frame includes the first four bytes,
+sometimes not. Other times this size is encoded over two bytes.
+And sometimes little-endian is used instead of big-endian.
+
+The general idea stays the same though.
+
+``` erlang
+case Buffer of
+	<< Size:32, _/bits >> when byte_size(Buffer) >= Size ->
+		<< Frame:Size/binary, Rest/bits >> = Buffer,
+		handle_frame(Frame, Rest);
+	_ ->
+		get_more_data(Buffer)
+end.
+```
+
+You will then need to parse this frame using binary pattern
+matching, and handle it. Then you will want to check if there
+is another frame fully received in the buffer, and handle it
+immediately if there is one. Otherwise wait for more data.
--
cgit v1.2.3
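
Both sections of the chapter end on the same step: once a line or frame has been handled, check whether the buffer already holds another complete one. A minimal sketch of that loop follows. The module name `parsers_sketch` and the functions `lines/1` and `frames/1` are hypothetical and not part of the patch, and the frame variant assumes the 32-bit big-endian size counts only the payload, which is just one of the conventions the chapter mentions.

``` erlang
-module(parsers_sketch).
-export([lines/1, frames/1]).

%% Hypothetical helper: split Buffer into every complete line plus
%% the unparsed remainder. Returns {Lines, Rest}.
lines(Buffer) ->
	lines(Buffer, []).

lines(Buffer, Acc) ->
	case binary:split(Buffer, <<"\n">>) of
		[_] ->
			%% No line break left: stop and keep the remainder.
			{lists:reverse(Acc), Buffer};
		[Line, Rest] ->
			%% A full line was found: collect it and look for another.
			lines(Rest, [Line|Acc])
	end.

%% Hypothetical helper: the same loop for size-prefixed frames.
%% Assumption: the size field counts the payload only, not its own
%% four bytes. Returns {Payloads, Rest}.
frames(Buffer) ->
	frames(Buffer, []).

frames(<< Size:32, Payload:Size/binary, Rest/bits >>, Acc) ->
	frames(Rest, [Payload|Acc]);
frames(Buffer, Acc) ->
	%% Either the size field or the payload is still incomplete.
	{lists:reverse(Acc), Buffer}.
```

Each call returns the complete lines or frames found so far together with the leftover bytes, which would go back into the buffer before the next read from the socket.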