Streamson reads the input JSON in parts, processes it, and optionally generates an output. It expects the input to be valid UTF-8 encoded JSON.
The JSON processing itself can be done using several strategies. Choose the strategy based on what you want to do with the JSON, e.g. remove some parts of it or split it into smaller parts.
To match a part of the data you need to define a matcher. The matcher examines the path and the data type (object, array, string, number, boolean, null) and based on these decides whether the data should be matched.
Once some data are matched, you may want to do something with them, so you need to define handlers. Handlers accept streams of data and may produce output data.
Strategies
All
Matches all data (no need to set matchers). Handlers can be used to convert the content of entire JSON or to perform some kind of analysis.
Convert
Alters the JSON by calling convert handlers on the matched parts.
Extract
Alters the JSON as well. It returns only the matched parts as output. Handlers are triggered over the matched stream, but the output is not converted.
Filter
Alters the JSON as well. If a path is matched, the matched part is removed from the output JSON. Handlers can be used here to e.g. store the removed parts into a file.
Trigger
Triggers handlers on matched JSON parts. It doesn't return any data as output, so it is not meant to convert the JSON.
Path
The path describes the position of the currently processed JSON part within the structure of the document.
{ // root path starts here
"users": [ // {"users"} path starts here
{ // {"users"}[0] path starts here
"name": "first", // {"users"}[0]{"name"}
"id": 1 // {"users"}[0]{"id"}
} // {"users"}[0] path ends here
] // {"users"} path ends here
} // root path ends here
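The notation above can be reproduced outside of streamson. The following is a minimal sketch in plain Python (standard json module only, not the streamson API; iter_paths is a hypothetical helper) that walks an already parsed document and yields the path of every part in document order.

```python
import json


def iter_paths(value, path=""):
    """Yield the path of `value` and of every part nested inside it."""
    yield path
    if isinstance(value, dict):
        for key, child in value.items():
            # Object keys are written as {"key"}
            yield from iter_paths(child, path + "{" + json.dumps(key) + "}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            # Array items are written as [index]
            yield from iter_paths(child, path + "[" + str(index) + "]")


document = json.loads('{"users": [{"name": "first", "id": 1}]}')
for path in iter_paths(document):
    print(repr(path))
```

The root part has an empty path. Unlike streamson, this sketch parses the whole document into memory first, so it only illustrates the notation, not the streaming behaviour.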
Matchers
Simple
Its definition is very similar to a path, but it supports a few additions:
- [] will match all items in an array
- [1,3-5] will match the second, fourth, fifth and sixth item in an array
- {} will match any key in an object
- ? will match all items in a dict or an array
- * will match all items in a dict or an array 0 or more times
Examples
- {"a"}[] matches paths {"a"}[0], {"a"}[1], ...
- {}[1] matches paths {"a"}[1], {"b"}[1], ...
- ?[1] matches the same as {}[1] and [][1]
- *[1] matches [1] and the same as ?[1], ??[1], ???[1], ...
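To make the matching rules concrete, here is a rough sketch of a simple matcher in plain Python. It is not streamson's implementation and does not use its API: all names are hypothetical, the whole path is assumed to have to match, and keys are assumed not to contain unusual escapes.

```python
import re

PATH_ELEMENT = re.compile(r'\{"((?:[^"\\]|\\.)*)"\}|\[(\d+)\]')
PATTERN_TOKEN = re.compile(r'\{"((?:[^"\\]|\\.)*)"\}|(\{\})|\[([\d,-]*)\]|(\?)|(\*)')


def parse_path(path):
    """Split '{"a"}[1]' into [("key", "a"), ("index", 1)]."""
    elements, pos = [], 0
    while pos < len(path):
        m = PATH_ELEMENT.match(path, pos)
        if m is None:
            raise ValueError(f"invalid path at offset {pos}")
        if m.group(1) is not None:
            elements.append(("key", m.group(1)))
        else:
            elements.append(("index", int(m.group(2))))
        pos = m.end()
    return elements


def parse_ranges(spec):
    """Turn '1,3-5' into {1, 3, 4, 5}; an empty spec means any index."""
    if not spec:
        return None
    allowed = set()
    for part in spec.split(","):
        first, _, last = part.partition("-")
        allowed.update(range(int(first), int(last or first) + 1))
    return allowed


def parse_pattern(pattern):
    tokens, pos = [], 0
    while pos < len(pattern):
        m = PATTERN_TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"invalid pattern at offset {pos}")
        key, any_key, ranges, any_one, _any_run = m.groups()
        if key is not None:
            tokens.append(("key", key))
        elif any_key:
            tokens.append(("any_key", None))
        elif ranges is not None:
            tokens.append(("index", parse_ranges(ranges)))
        elif any_one:
            tokens.append(("any_one", None))
        else:
            tokens.append(("any_run", None))
        pos = m.end()
    return tokens


def element_matches(token, element):
    kind, arg = token
    element_kind, value = element
    if kind == "key":
        return element_kind == "key" and value == arg
    if kind == "any_key":
        return element_kind == "key"
    if kind == "index":
        return element_kind == "index" and (arg is None or value in arg)
    return kind == "any_one"  # '?' accepts any single key or index


def tokens_match(tokens, elements):
    if not tokens:
        return not elements
    head, rest = tokens[0], tokens[1:]
    if head[0] == "any_run":
        # '*' may consume zero or more path elements
        return any(tokens_match(rest, elements[i:]) for i in range(len(elements) + 1))
    return bool(elements) and element_matches(head, elements[0]) and tokens_match(rest, elements[1:])


def simple_match(pattern, path):
    return tokens_match(parse_pattern(pattern), parse_path(path))


print(simple_match('{"a"}[]', '{"a"}[1]'))
print(simple_match('*[1]', '{"a"}[3][1]'))
```

The recursive handling of * is the simplest thing that works; a real matcher working on a stream would more likely keep incremental state instead of re-parsing the path each time.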
Depth
You can match based on the path depth.
Examples
- 2 would match {"a"}[1] and {"a"}[1]{"b"}, but would not match {"a"}
- 2-2 would match {"a"}[1], but would match neither {"a"}[1]{"b"} nor {"a"}
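The depth rule can be sketched in a few lines of plain Python. This is only an illustration under the assumption that a specification is either "min" or "min-max"; path_depth and depth_match are hypothetical names, not part of streamson.

```python
import re

# One path element is either {"key"} or [index]
ELEMENT = re.compile(r'\{"(?:[^"\\]|\\.)*"\}|\[\d+\]')


def path_depth(path):
    """Number of elements in a path; the root path has depth 0."""
    return len(ELEMENT.findall(path))


def depth_match(spec, path):
    """Check a path against a 'min' or 'min-max' depth specification."""
    low, _, high = spec.partition("-")
    depth = path_depth(path)
    if depth < int(low):
        return False
    # Without an upper bound any deeper path matches as well
    return not high or depth <= int(high)


print(depth_match("2", '{"a"}[1]{"b"}'))
print(depth_match("2-2", '{"a"}[1]{"b"}'))
```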
Regex
You can match a path based on a regular expression as well.
Examples
- ^\{"[Uu][Ss][Ee][Rr][Ss]"\}$ would match {"users"}, {"Users"}, {"USERS"}, ...
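The expression can be tried against path strings with any regex engine. The sketch below uses Python's re module purely for illustration; streamson uses its own regex engine, so details of the syntax may differ, and regex_match is a hypothetical helper.

```python
import re

USERS = re.compile(r'^\{"[Uu][Ss][Ee][Rr][Ss]"\}$')


def regex_match(regex, path):
    """A path matches when the expression is found in its textual form."""
    return regex.search(path) is not None


for path in ['{"users"}', '{"Users"}', '{"USERS"}', '{"users"}[0]']:
    print(path, regex_match(USERS, path))
```

Because the expression is anchored with ^ and $, nested paths such as {"users"}[0] are not matched.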
Combinator
It is also possible to combine two matchers or to negate a matcher. Such matchers need to be wrapped in a Combinator matcher.
The Combinator itself supports the following operations:
- negate (e.g. ~<matcher>)
- or (e.g. <matcher1> || <matcher2>)
- and (e.g. <matcher1> && <matcher2>)
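The three operations behave like ordinary boolean logic over matchers. A minimal sketch in plain Python, assuming a matcher is just a predicate on the path string (the Matcher class is hypothetical and is not the streamson API; Python's ~, | and & stand in for ~, || and &&):

```python
import re


class Matcher:
    """Wraps a predicate on a path so that matchers can be combined."""

    def __init__(self, predicate):
        self.predicate = predicate

    def __call__(self, path):
        return self.predicate(path)

    def __invert__(self):  # ~matcher
        return Matcher(lambda path: not self(path))

    def __or__(self, other):  # matcher1 | matcher2
        return Matcher(lambda path: self(path) or other(path))

    def __and__(self, other):  # matcher1 & matcher2
        return Matcher(lambda path: self(path) and other(path))


user_item = Matcher(lambda p: re.fullmatch(r'\{"users"\}\[\d+\]', p) is not None)
name_field = Matcher(lambda p: p.endswith('{"name"}'))

item_or_name = user_item | name_field
not_name = ~name_field

print(item_or_name('{"users"}[0]{"name"}'))
print(not_name('{"users"}[0]{"name"}'))
```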
Handlers
Analyser
It collects information about the JSON that is being processed. Basically, it counts the distinct paths with array indexes squashed.
This handler is useful only with the All strategy, because it needs to see the data of the entire JSON.
Example
Input JSON:
{
"users": [
{"name": "user1", "id": 1},
{"name": "user2", "id": 2, "is_admin": true},
{"name": "user3", "id": 3}
]
}
Collected data would look like this:
"": 1 (root element)
{"users"}: 1
{"users"}[]: 3
{"users"}[]{"id"}: 3
{"users"}[]{"is_admin"}: 1
{"users"}[]{"name"}: 3
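The counting can be imitated on an already parsed document. The following plain Python sketch is only an approximation of what the handler reports: it is not streaming, it does not use the streamson API, and squashed_paths is a hypothetical helper.

```python
import json
from collections import Counter


def squashed_paths(value, path=""):
    """Yield the path of every part, writing every array index as []."""
    yield path
    if isinstance(value, dict):
        for key, child in value.items():
            yield from squashed_paths(child, path + "{" + json.dumps(key) + "}")
    elif isinstance(value, list):
        for child in value:
            yield from squashed_paths(child, path + "[]")


document = json.loads("""
{
  "users": [
    {"name": "user1", "id": 1},
    {"name": "user2", "id": 2, "is_admin": true},
    {"name": "user3", "id": 3}
  ]
}
""")

counts = Counter(squashed_paths(document))
for path in sorted(counts):
    print(repr(path), counts[path])
```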
Buffer
Collects the data that are being matched. Note that it can process nested matches. The buffer can be popped after some data have been fed to the input, so this handler can be used to process a huge number of relatively small JSONs.
This handler is not present in the binary, because it wouldn't make much sense there, but it can still be useful in the Rust or Python bindings.
Example
Matcher (combinator of two simple matchers):
{"users"}[] || {"users"}[]{"name"}
Input JSON:
{
"users": [
{"name": "user1", "id": 1},
{"name": "user2", "id": 2, "is_admin": true},
{"name": "user3", "id": 3}
]
}
After consuming the entire input the buffer should contain:
{"users"}[0]{"name"} "user1"
{"users"}[0] {"name": "user1", "id": 1}
{"users"}[1]{"name"} "user2"
{"users"}[1] {"name": "user2", "id": 2, "is_admin": true}
{"users"}[2]{"name"} "user3"
{"users"}[2] {"name": "user3", "id": 3}
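The order of the buffered items follows from nesting: a nested match ends before the match that contains it. A small plain Python sketch of this behaviour, assuming an already parsed document and a matcher given as a predicate (collect_matches is a hypothetical helper, not the Buffer API; the real handler keeps the original bytes, while this sketch re-serializes the data):

```python
import json
import re


def collect_matches(value, is_match, path=""):
    """Return (path, data) pairs in the order in which the matches end."""
    found = []
    if isinstance(value, dict):
        for key, child in value.items():
            found += collect_matches(child, is_match, path + "{" + json.dumps(key) + "}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found += collect_matches(child, is_match, path + "[" + str(index) + "]")
    # A part is complete only after all of its children were processed
    if is_match(path):
        found.append((path, json.dumps(value)))
    return found


def users_or_names(path):
    # Stands in for the combinator of the two simple matchers
    return re.fullmatch(r'\{"users"\}\[\d+\](\{"name"\})?', path) is not None


document = json.loads(
    '{"users": [{"name": "user1", "id": 1},'
    ' {"name": "user2", "id": 2, "is_admin": true},'
    ' {"name": "user3", "id": 3}]}'
)
buffered = collect_matches(document, users_or_names)
for path, data in buffered:
    print(path, data)
```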
Indenter
Reformats the input data so that it is either more human readable (indented) or compressed. This handler is useful only with the All strategy.
Example
Input JSON:
{
"users": [
{"name": "user1", "id": 1},
{"name": "user2", "id": 2, "is_admin": true},
{"name": "user3", "id": 3}
]
}
Output JSON with indent=2:
{
"users": [
{
"name": "user1",
"id": 1
},
{
"name": "user2",
"id": 2,
"is_admin": true
},
{
"name": "user3",
"id": 3
}
]
}
Output JSON with undefined indent (compressed):
{"users":[{"name":"user1","id":1},{"name":"user2","id":2,"is_admin":true},{"name":"user3","id":3}]}
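For comparison, the same two layouts can be produced with Python's standard json module. This is not the Indenter handler itself (json.dumps needs the whole document in memory, whereas streamson works on a stream); it only shows what the indented and the compressed form look like.

```python
import json

raw = (
    '{"users": [{"name": "user1", "id": 1}, '
    '{"name": "user2", "id": 2, "is_admin": true}, '
    '{"name": "user3", "id": 3}]}'
)
document = json.loads(raw)

# Human readable form, two spaces per nesting level
pretty = json.dumps(document, indent=2)

# Compressed form without any optional whitespace
compact = json.dumps(document, separators=(",", ":"))

print(pretty)
print(compact)
```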
Indexer
Collects the indexes (start / end) of matched JSON parts.
Example
Simple matcher:
{"users"}[]{"name"}
Input JSON:
{
"users": [
{"name": "user1", "id": 1},
{"name": "user2", "id": 2, "is_admin": true},
{"name": "user3", "id": 3}
]
}
Collected indexes would look like this:
{"users"}[0]{"name"} Start(24)
{"users"}[0]{"name"} End(30)
{"users"}[1]{"name"} Start(56)
{"users"}[1]{"name"} End(62)
{"users"}[2]{"name"} Start(108)
{"users"}[2]{"name"} End(114)
Output
Writes matched data to given output (could be a file, stdout, ...).
Example
Simple matcher:
{"users"}[]
Input JSON:
{
"users": [
{"name": "user1", "id": 1},
{"name": "user2", "id": 2, "is_admin": true},
{"name": "user3", "id": 3}
]
}
With the output defined as a file (e.g. /tmp/out.json), its content should look like this:
{"name": "user1", "id": 1}{"name": "user2", "id": 2, "is_admin": true}{"name": "user3", "id": 3}
Regex
Uses a sed-style regex expression to convert data (e.g. s/user/User/).
Example
Simple matcher:
{"users"}[]{"name"}
Input JSON:
{
"users": [
{"name": "user1", "id": 1},
{"name": "user2", "id": 2, "is_admin": true},
{"name": "user3", "id": 3}
]
}
With the Regex handler s/user/User/ the output would look like this:
{
"users": [
{"name": "User1", "id": 1},
{"name": "User2", "id": 2, "is_admin": true},
{"name": "User3", "id": 3}
]
}
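The effect of such an expression on a single piece of matched data can be sketched in plain Python. This is a simplified stand-in rather than the handler's implementation: sed_substitute is a hypothetical helper, it does not support escaped slashes, and Python's regex syntax is not identical to sed's.

```python
import re


def sed_substitute(expression, text):
    """Apply a minimal 's/pattern/replacement/flags' expression to text."""
    command, pattern, replacement, flags = expression.split("/")
    if command != "s":
        raise ValueError("only the s command is supported in this sketch")
    # Like sed, replace only the first occurrence unless the g flag is set
    count = 0 if "g" in flags else 1
    return re.sub(pattern, replacement, text, count=count)


# The raw matched data of one name, including its quotes
print(sed_substitute("s/user/User/", '"user1"'))
```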
Replace
Replaces the entire matched data with other fixed data.
Example
Simple matcher:
{"users"}[]{"is_admin"}
Input JSON:
{
"users": [
{"name": "user1", "id": 1},
{"name": "user2", "id": 2, "is_admin": true},
{"name": "user3", "id": 3}
]
}
Output when the replacement is false:
{
"users": [
{"name": "user1", "id": 1},
{"name": "user2", "id": 2, "is_admin": false},
{"name": "user3", "id": 3}
]
}
Shorten
Makes the matched data shorter. Note that this handler should be applied to strings only.
Example
Simple matcher:
{"users"}[]{"name"}
Input JSON:
{
"users": [
{"name": "user1", "id": 1},
{"name": "user2", "id": 2, "is_admin": true},
{"name": "user3", "id": 3}
]
}
Shorten with a max size of 2 and ..." as the terminator:
{
"users": [
{"name": "us...", "id": 1},
{"name": "us...", "id": 2, "is_admin": true},
{"name": "us...", "id": 3}
]
}
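Note that the terminator carries the closing quote, because the original end of the string is cut away. A rough sketch of the idea in plain Python follows; shorten is a hypothetical function, not the handler's API, and leaving strings that already fit untouched is an assumption of this sketch.

```python
def shorten(raw, max_size, terminator):
    """Shorten a raw JSON string (quotes included) to max_size characters.

    Assumption of this sketch: strings that already fit are left as they are.
    Escape sequences are not handled, so a cut may land inside one.
    """
    body = raw[1:-1]  # strip the surrounding quotes
    if len(body) <= max_size:
        return raw
    # The terminator has to close the string, e.g. '..."'
    return '"' + body[:max_size] + terminator


print(shorten('"user1"', 2, '..."'))
```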
Unstringify
Unstringifies matched data, e.g. "{\"a\":5}" will be converted to {"a":5}.
Example
Simple matcher:
{"users"}[]{"is_admin"}
Input JSON:
{
"users": [
{"name": "user1", "id": 1},
{"name": "user2", "id": 2, "is_admin": "true"},
{"name": "user3", "id": 3}
]
}
After Unstringify was used the output should look like this:
{
"users": [
{"name": "user1", "id": 1},
{"name": "user2", "id": 2, "is_admin": true},
{"name": "user3", "id": 3}
]
}
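Unstringifying amounts to decoding one JSON string and using its content as raw JSON. A minimal plain Python sketch (unstringify is a hypothetical function built on the standard json module, not the handler's API):

```python
import json


def unstringify(raw):
    """Decode a raw JSON string and return its content as raw JSON text."""
    inner = json.loads(raw)
    if not isinstance(inner, str):
        raise ValueError("only JSON strings can be unstringified")
    return inner


print(unstringify('"true"'))
print(unstringify(r'"{\"a\":5}"'))
```

The sketch does not check that the decoded content is itself valid JSON; whether the real handler does is not stated here.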
Group handler
Handlers can also be grouped together. The grouping behaviour depends on whether the subhandlers convert data. You can imagine the grouped handlers as a list that is processed in the following way: if a handler does not convert data, the input data are passed to the handler itself and, unchanged, to the next handler. However, if a handler converts data, the input data are passed to the handler and the handler's output data are passed to the next handler.
Data1 -> handlerA(converts=false) -> Data1 -> handlerB(converts=true) -> Data2
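The flow can be sketched in plain Python as a loop over (handler, converts) pairs. This only models the rule described above on whole values; run_group and the two example handlers are hypothetical and are not the streamson API, which works on streamed chunks.

```python
def run_group(handlers, data):
    """Pass data through a list of (handler, converts) pairs."""
    for handler, converts in handlers:
        result = handler(data)
        if converts:
            # Only a converting handler changes what the next handler sees
            data = result
    return data


seen = []


def record(data):
    """Non-converting handler: just remembers what it was given."""
    seen.append(data)


def upper(data):
    """Converting handler: returns new data."""
    return data.upper()


result = run_group([(record, False), (upper, True), (record, False)], "user1")
print(seen, result)
```

The first record call sees the original data, while the second one, placed after the converting handler, sees the converted data.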
Gotchas
Although streamson is memory efficient, there are still situations in which it can consume a large amount of memory. Several such situations are described below. Therefore it is not wise to run streamson on JSON input that you don't trust.
Too long keys
Imagine that a key in a JSON object is extremely long. The key is part of the path, so streamson has to hold all of it in memory.
{
"3.14159...<thousands of numbers>": "pi"
}
However streamson should handle
{
"pi": "3.14159...<thousands of numbers>"
}
fine for non-buffering handlers.
Expanding
Now let's say we read a JSON file that is built by nesting one level on top of another.
[
1,
[
2,
[
3,
[
4,
...
]
]
]
]
At the very least, streamson has to store the path in some kind of stack, and this stack could become huge in such a situation.
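The growth of such a stack is easy to observe with a tiny scanner. The plain Python sketch below is an illustration only (max_nesting is a hypothetical function, not streamson code, and it assumes valid JSON); it keeps one stack entry per open array or object and reports the deepest level reached.

```python
def max_nesting(text):
    """Return the deepest array/object nesting level of a valid JSON text."""
    stack = []
    deepest = 0
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            # Brackets inside strings do not change the nesting
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            stack.append(ch)
            deepest = max(deepest, len(stack))
        elif ch in "]}":
            stack.pop()
    return deepest


# 1000 arrays nested inside each other, plus one empty array at the bottom
nested = "[1," * 1000 + "[]" + "]" * 1000
print(max_nesting(nested))
```

The stack size grows linearly with the nesting depth, no matter how small the individual values are.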