Select Object
Basic Introduction
The SelectObject interface lets users run SQL statements against the content of objects stored in BOS in a specified format (CSV/JSON). It filters and analyzes the object content with structured SQL queries and returns only the content the user needs.
Previously, a user who wanted to filter the content of an object stored in BOS had to download the whole object through the GetObject interface and then analyze and filter the data locally. The SelectObject interface moves this filtering into the BOS service layer, which reduces the network bandwidth and latency of downloading data from BOS, saves the CPU and memory the user would consume filtering the data, and lowers the cost of accessing data in BOS.
Applicable Scenarios
A typical application of SelectObject is to combine it with big data products, replacing the GetObject interface for processing BOS data, for example to extract specific content from log files or to filter data for analysis.
Use Conditions
To filter objects stored in BOS with the SelectObject interface, the following restrictions apply:
- Supported file types
  - Only RFC 4180-compliant CSV files (including TSV and similar variants) and JSON files encoded in UTF-8 can be selected;
  - The maximum length of a single row or column in a CSV file is 512 KB;
  - Two JSON layouts can be selected: DOCUMENT and LINES. DOCUMENT means the whole file is a single JSON object. LINES means the file consists of multiple lines, each of which is a JSON object, while the file as a whole is not a valid JSON object; lines are separated by a line break. Users can specify common row and column delimiters such as \n and \r\n;
  - Objects in the standard, infrequent-access, and cold storage classes can be selected;
  - Objects encrypted server-side with SSE-BOS, SSE-KMS, or SSE-C can be selected;
  - GZIP-compressed objects can be selected; the selected content is decompressed as a stream and returned. Raw deflate format is not supported; the supported standard is RFC 1952: Gzip Compression Standard Reference. A hedged request sketch follows.
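As a hedged illustration, a GZIP-compressed CSV object could be selected with the Go SDK roughly as follows. This is a minimal sketch: the `"GZIP"` CompressionType value is an assumption based on the restriction above, the bucket and object names are placeholders, and `parseMessages` refers to the full Golang example at the end of this page.

```go
// Minimal sketch: selecting a GZIP-compressed CSV object with the Go SDK.
// Imports assumed: encoding/base64 plus the bos and api packages used in the
// Golang example below.
func selectGzipCsv(bosClient *bos.Client) error {
	// The SDK expects the SQL expression base64-encoded.
	expr := base64.StdEncoding.EncodeToString([]byte("select * from BosObject limit 10"))
	args := &api.SelectObjectArgs{
		SelectType: "csv",
		SelectRequest: &api.SelectObjectRequest{
			Expression:     expr,
			ExpressionType: "SQL",
			InputSerialization: &api.SelectObjectInput{
				CompressionType: "GZIP", // assumption: content is decompressed as a stream server-side
				CsvParams:       map[string]string{"fileHeaderInfo": "IGNORE"},
			},
			OutputSerialization: &api.SelectObjectOutput{
				CsvParams: map[string]string{"recordDelimiter": "Cg=="}, // base64 of "\n"
			},
		},
	}
	res, err := bosClient.SelectObject("select-bucket", "test.csv.gz", args)
	if err != nil {
		return err
	}
	parseMessages(res) // see the full Golang example at the end of this page
	return nil
}
```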
- Supported SQL syntax
  - Only the SELECT syntax is supported at present. SQL statements take the form `Select field_list From source Where condition Limit number`;
  - The string, int (64-bit), float (64-bit), timestamp, and Boolean data types are supported;
  - Logical conditions (AND/OR/NOT), arithmetic expressions (+ - * / %), comparison operators (>, =, <, >=, <=, !=), matching operators (LIKE, BETWEEN ... AND, IN), and null checks (IS NULL / IS NOT NULL) are supported;
  - Aggregate functions (AVG, COUNT, MAX, MIN, SUM), the conversion function CAST, and the alias keyword AS are supported;
  - Only single-file queries are supported; JOIN, ORDER BY, GROUP BY, HAVING, OFFSET, and other such keywords are not supported. An example statement exercising the supported clauses follows below.
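As an illustration of the general form above, the following statement (made up for this sketch) exercises each supported clause:

```go
// One made-up statement exercising Select / From / Where / Limit together
// with CAST, a comparison, a logical AND, and a LIKE match.
const exampleQuery = "select _1, _2 from BosObject" +
	" where cast(_1 as int) >= 0 and _3 like 'a%' limit 100"
```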
- SQL statement limitations
  - The maximum length of a single SQL statement is 16 KB, the maximum number of columns is 1000, the maximum length of a column name is 1024, and the maximum number of aggregate operations (count, avg, etc.) is 100;
  - LIMIT is applied before aggregate functions. For example, `Select avg(cast(_1 as int)) from BosObject limit 100` averages the first 100 elements, which differs from MySQL semantics;
  - The COUNT function only accepts `*`, i.e. `count(*)`; the `count(1)` form is not supported;
  - The maximum length of the JSON node data designated by the json path after FROM is 512 KB, and its maximum depth is 10 levels;
  - The `[*]` array wildcard can only appear when selecting JSON files, and only in the json path after FROM; expressions after SELECT or WHERE must not contain `[*]`;
  - When selecting CSV files, the FROM keyword can only be followed by BosObject;
  - The WHERE clause cannot contain aggregate conditions; only logical operators are allowed;
  - A LIKE clause supports at most five `%` wildcards, where `%` matches zero or more arbitrary characters and `_` matches a single character; an IN clause supports at most 1024 constant items;
  - The fields after SELECT can be column names, CSV column indexes (_1, _2, etc.), or aggregate functions, e.g. `AVG(cast(_1 as int))`, but not a bare `cast(_1 as int)`; fields do not support binary expressions;
  - If a field after SELECT is `*`, no other field may appear, so `select *, _1 from s` is illegal; aggregate functions and plain column names cannot be mixed in the SELECT fields; all aliases in the SELECT fields must be distinct;
  - In JSON SQL, a field or source of the form `key[*]` or `key[1]` is treated as selecting elements of the array whose key is `key`; however, if the field or source has the form `key[a]`, the whole string `key[a]` is treated as the key when looking up the corresponding value in the JSON;
  - Matching of keys between JSON files and SQL is case sensitive; e.g. `select s.Age` and `select s.age` are different. Some of these limits are made concrete in the sketch below.
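To make a few of these limits concrete, here is a hedged set of legal and illegal statements (illustrative only, not exhaustive):

```go
// Statements that respect the limits above.
const (
	okCount = "select count(*) from BosObject"                        // count(*) is the only COUNT form
	okJson  = "select * from BosObject.projects[*].project_name"     // [*] only in the json path after FROM
	okLimit = "select avg(cast(_1 as int)) from BosObject limit 100" // LIMIT is applied before AVG
)
// Illegal by the rules above: "count(1)"; "select *, _1 from s" (no other
// field may accompany *); "[*]" inside a SELECT or WHERE expression; and a
// bare "cast(_1 as int)" as a SELECT field.
```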
Select Data Fault-tolerant Mechanism
(1) Missing Data Processing
- When a column of data is missing from a row of a CSV file: if that column is used for a condition in WHERE, the condition is treated as unsatisfied and the row is skipped; if the missing column is used in SELECT for an aggregate operation, e.g. `avg(cast(_1 as int))`, aggregating a non-existent column is considered illegal, and the request ends immediately with a corresponding error message;
- When a key is missing from a JSON file, it is handled in the same way;
- When a CSV column or a JSON key is missing, it is treated as NULL by default, i.e. IS NULL evaluates to true for it.
- Other cases:
  - When a JSON key or CSV column is used in an expression after WHERE, e.g. `... where _1 = ''` or `... where a.b = 'value'`, a missing key or column defaults to NULL;
  - When a JSON key or CSV column is used as a field after SELECT, e.g. `select _1 from ...` or `select a.b from ...`, a missing CSV column returns an empty string by default, and a missing JSON key likewise returns an empty string. The sketch below illustrates this defaulting.
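A hypothetical sketch of this behavior (the file content and queries are made up):

```go
// Given a CSV object whose content is "1,2,3\n4,5\n", the absent third
// column of row 2 defaults to NULL, so this query returns only "4":
const nullQuery = "select _1 from BosObject where _3 is null"
// Aggregating a column that is missing in some row, by contrast, ends the
// request with an error instead of skipping the row:
const badAggregate = "select avg(cast(_3 as int)) from BosObject"
```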
(2) Processing of Type-mismatched Data
- A CSV column value may be invalid for an operation, e.g. `cast(_1 as int)` where the value of _1 is a non-numeric string, causing the cast to fail. If `cast(_1 as int)` is used for a condition after WHERE, the condition is treated as unsatisfied and the row is skipped. But if the column is used in an aggregate operation after SELECT, e.g. `avg(cast(_1 as int))`, aggregating an invalid column is considered illegal, and the request ends immediately with a corresponding error message; a sketch follows below.
- If the data type of some key in a JSON file is invalid for an operation, it is processed in the same way.
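A hypothetical sketch of the type-mismatch rules (file content and queries made up):

```go
// With CSV content "1,2,3\na,b,c\n", the cast below fails on row 2 because
// "a" is not numeric, so that row is skipped and only row 1 is returned:
const skipRowQuery = "select * from BosObject where cast(_1 as int) = 1"
// The same failing cast inside an aggregate after SELECT ends the request
// with an error instead of skipping the row:
const failAggQuery = "select avg(cast(_1 as int)) from BosObject"
```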
CSV Object
Typically, selecting a CSV object means selecting designated columns by column ordinal or name, or running aggregate operations over certain columns. For example, the content of a test.csv file is shown below, containing columns of various data types. CSV treats every column value as a string by default, so corresponding CAST conversions are required. In addition, no whitespace is allowed around column delimiters:
header1,header2,header3
1,2,3.4
a,b,c
"d","e","f"
true,false,true
2006-01-02 15:04:06,"2006-01-02 16:04:06",2006-01-02 17:04:06
Frequently-used SQL Statements
SQL statement | Description | Comments |
---|---|---|
select * from BosObject limit 100 | Return the first 100 rows of the object | - |
select header1,header2 from BosObject | Return the columns named header1 and header2 in the object | The fileHeaderInfo parameter must be set to "USE" |
select _1,_3 from BosObject where cast(_1 as int)<= cast(_3 as int) | Return columns 1 and 3 of the rows where the integer value of column 1 is less than or equal to that of column 3 | Columns referenced as _1, _3 can only be CAST when they hold integer strings; otherwise the row is skipped for failing the condition |
select count(*) from BosObject | Return the total number of rows in the object | - |
select AVG(cast(_1 AS int)), MAX(cast(_1 AS int)), MIN(cast(_1 AS int)) from BosObject | Return the average, maximum, and minimum values of the first column of the object | The first column of every row must not contain non-integer strings, or the request fails immediately |
select SUM(cast(header1 AS float)) from BosObject WHERE cast(header1 AS float) != 1 | Return the sum of the header1 column over the rows where its value is not equal to 1 | The header1 column must not contain non-numeric strings |
select * from BosObject where _1 LIKE '%tree' | Return the rows whose _1 column ends with "tree"; for example, "apple tree" matches the condition, but "apple" does not | The string after the LIKE operator uses single quotes |
select * from BosObject where cast(_1 AS int) % 3 = 0 | Return all rows where the _1 column is evenly divisible by 3 | The % operator can only be used when _1 is an integer-typed string |
select * from BosObject where cast(_1 AS int) between 1 and 2 | Return all rows of the object where the _1 column lies in the range [1,2] | _1 must be an integer-typed string |
select * from BosObject where cast(_1 AS timestamp) NOT IN (cast('2006-01-02 15:04:06' as timestamp), cast('2006-01-03 15:04:06' as timestamp)) | Return all rows where the _1 column is outside the IN range | _1 must be a date string |
select * from BosObject where cast(_1 AS int) * cast(_2 AS int) > cast(_3 AS float) + 1 | Return all rows of the object where the conditional expression holds | _1, _2, _3 must be strings that satisfy the CAST conditions |
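As a hypothetical worked example against the test.csv content above, assuming fileHeaderInfo is set to "USE" so that the first line supplies the column names:

```go
// Select two named columns from test.csv; all values are returned as strings.
const headerQuery = "select header1,header2 from BosObject"
// Expected records, one per remaining row (the quotes around "d","e","f" are
// stripped by the configured quoteCharacter):
//   1,2
//   a,b
//   d,e
//   true,false
//   2006-01-02 15:04:06,2006-01-02 16:04:06
```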
JSON Object
When selecting a JSON object, data is generally selected by key. JSON files come in two layouts, LINES and DOCUMENT.
JSON DOCUMENT Object
{"name": "Smith",
"age": 16,
"weight": 65.5,
"org": null,
"projects":
[
{"project_name":"project1", "completed":false},
{"project_name":"project2", "completed":true}
]
}
JSON LINES Object
{"name": "Smith",
"age": 16,
"org": null,
"projects":
[
{"project_name":"project1", "completed":false},
{"project_name":"project2", "completed":true}
]
}
{"name": "charles",
"age": 17,
"org": "baidu",
"weight": 65.5,
"projects":
[
{"project_name":"project3", "completed":false},
{"project_name":"project4", "completed":true}
]
}
Frequently-used SQL Statements
- Basic form of a json path: `field0.field1[n].property1.attributes[*]` means: first find the nth element of the field1 array under the field0 node below the root of the JSON file, then take all contents of the attributes array inside that element's property1 field;
- SQL over JSON objects can also use aggregate functions, logical operations, arithmetic operations, and so on; values in JSON carry their own data types, so no CAST conversion is needed.
SQL statement | Description | Comments |
---|---|---|
select projects from BosObject where name='Smith' | Return the projects elements that satisfy the name='Smith' condition in the json file | |
select * from BosObject.projects[*].project_name | Return the project_name field of every element of the projects array under the root node of the json file | |
select s.completed from BosObject.projects[1] s where s.project_name='project2' | Return the completed field of the element at index 1 of the projects array, provided it satisfies project_name='project2' | |
select * from BosObject s where s.org IS NULL AND weight is null | Return the records in the json file whose org and weight are both null | A weight node that does not exist is also treated as null |
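As a hypothetical worked example against the JSON LINES object above (the exact output framing depends on the OutputSerialization settings):

```go
// Only the first record has org == null, so this query is expected to
// return the single value {"name":"Smith"}.
const jsonQuery = "select s.name from BosObject s where s.org IS NULL"
```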
Error Return Code
Error codes returned by the server may take the form of an HTTP status code, or be carried in the error-code field of the End message; which form is used depends on the specific error that occurred. A hedged sketch of inspecting the End message follows the table.
ErrorCode | Description | HTTP Status Code |
---|---|---|
InvalidFileType | Select only supports selecting the contents of csv and json objects | 400 |
RecordTooLarge | A single row of the csv file exceeds the 512 KB limit | 400 |
SqlSyntaxError | The sql statement is invalid and contains syntax errors | 400 |
InvalidSqlFields | A field of SELECT in the sql statement is invalid, e.g. it contains a binary operator or another unsupported operation | 400 |
InvalidSqlBinaryExpr | A binary operator is used invalidly; the types of the left and right operands do not match | 400 |
SqlFieldsNumExceedLimit | The number of fields in SELECT of the sql statement exceeds the limit | 400 |
AggregateInvalidField | An aggregate function in the sql statement is used invalidly; only numerical columns can be aggregated | 400 |
InvalidSqlJsonPathDepth | The depth of the selected json object node is invalid: it exceeds the 1024 restriction or is less than 1 | 400 |
SqlSourceNumExceedLimit | Only one source is allowed in FROM of the sql statement | 400 |
FieldNotExist | A field of SELECT in the sql statement does not exist in the file | 400 |
InappropriateJson | The content or format of the json object is incorrect | 400 |
HeaderNotExist | There is no header information in the csv object | 400 |
DecompressError | The object fails to be decompressed | 400 |
DataOverflowsType | The result of an aggregate column exceeds the limits of its type | 400 |
InvalidSqlSource | The source in FROM of the sql statement is invalid; check whether the source format meets the requirements | 400 |
InvalidSqlLimitValue | The LIMIT value in the sql statement is invalid; it must be a positive integer | 400 |
InvalidSqlNotOperator | The NOT operator in the sql statement is misused; it can only be used before BETWEEN/IN/LIKE to express negation | 400 |
InvalidSqlBetweenOperator | The BETWEEN operator in the sql statement is misused: BETWEEN and AND must be used together, and both sides of AND must have consistent types | 400 |
InvalidSqlInOperator | The IN operator in the sql statement is misused; the values inside IN must have consistent types | 400 |
InvalidSqlIsOperator | The IS operator in the sql statement is misused; it can only be used with NULL/NOT NULL | 400 |
InvalidSqlLikeOperator | The LIKE operator in the sql statement is misused | 400 |
InvalidSqlFunction | A function in the sql statement is misused; check the types and number of the function's parameters | 400 |
InvalidExpressionParameter | In Selectobject requests, the expression parameter is invalid | 400 |
InvalidExpressionTypeParameter | In Selectobject requests, the expressionType parameter is invalid | 400 |
InvalidCompressionTypeParameter | In Selectobject requests, the compressionType parameter is invalid | 400 |
InvalidJsonTypeParameter | In Selectobject requests, the json type parameter is invalid | 400 |
InvalidQuoteFieldsParameter | In Selectobject requests, the quote fields parameter is invalid | 400 |
InvalidSelectRequestJsonBody | In Selectobject requests, the json body parameter is invalid | 400 |
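As a hedged sketch of reading the error code from the End message: the message-type header below comes from the Golang example at the end of this page, while the error-code header key is an assumption based on the description above the table.

```go
// Inside the message-parsing loop of parseMessages (Golang example below):
if headersMap["message-type"] == "End" {
	// Assumption: a failed select carries its error code in the
	// End-message headers, as described above the table.
	if code, ok := headersMap["error-code"]; ok {
		fmt.Printf("select failed: %s\n", code)
	}
	break
}
```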
SDK Usage Examples
At present, [BOS Java SDK](Bos/SDK/Java SDK/file management.md), [BOS go SDK](Bos/SDK/go-sdk/file management.md), and [BOS Python SDK](Bos/SDK/Python SDK/file management.md) all support the SelectObject interface.
Java SDK example
public void selectCsv(BosClient client, String bucketName, String csvObject) {
System.out.println("------ select csv object ------");
SelectObjectRequest request = new SelectObjectRequest(bucketName, csvObject)
.withSelectType("csv")
.withExpression("select * from BosObject limit 3")
.withExpressionType(SelectObjectRequest.ExpressionType.SQL)
.withInputSerialization(new InputSerialization()
.withCompressionType("NONE")
.withFileHeaderInfo("NONE")
.withRecordDelimiter("\r\n")
.withFieldDelimiter(",")
.withQuoteCharacter("\"")
.withCommentCharacter("#"))
.withOutputSerialization(new OutputSerialization()
.withOutputHeader(false)
.withQuoteFields("ALWAYS")
.withRecordDelimiter("\n")
.withFieldDelimiter(",")
.withQuoteCharacter("\""))
.withRequestProgress(false);
SelectObjectResponse response = client.selectObject(request);
//Output query results
printRecords(response.getMessages());
}
public void selectJson(BosClient client, String bucketName, String jsonObject) {
System.out.println("------ select json object ------");
SelectObjectRequest request = new SelectObjectRequest(bucketName, jsonObject)
.withSelectType("json")
.withExpression("select * from BosObject where age > 20")
.withInputSerialization(new InputSerialization()
.withCompressionType("NONE")
.withJsonType("LINES"))
.withOutputSerialization(new OutputSerialization()
.withRecordDelimiter("\n"))
.withRequestProgress(false);
SelectObjectResponse response = client.selectObject(request);
//Output query results
printRecords(response.getMessages());
}
public void printRecords(SelectObjectResponse.Messages messages) {
if (messages == null) {
return;
}
while (messages.hasNext()) {
SelectObjectResponse.CommonMessage message = messages.next();
if (message.Type.equals("Records")) {
for (String record: message.getRecords()) {
System.out.println(record);
}
}
}
}
Golang example
package main
import (
"bufio"
"encoding/binary"
"fmt"
"io"
"strings"
)
import (
"github.com/baidubce/bce-sdk-go/services/bos"
"github.com/baidubce/bce-sdk-go/services/bos/api"
)
func main() {
selectBosObject()
}
func selectBosObject() {
// Initialize BosClient
AK, SK := "ak", "sk"
ENDPOINT := "bj.bcebos.com"
bosClient, _ := bos.NewClient(AK, SK, ENDPOINT)
// Ensure that the bucket and object already exist and that the object meets the csv/json file format requirements
bucket := "select-bucket"
csvObject := "test.csv"
fmt.Println("------ select csv object -------")
csvArgs := &api.SelectObjectArgs{
SelectType: "csv",
SelectRequest: &api.SelectObjectRequest{
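// Expression is base64-encoded SQL; this one decodes to:
// select * from BosObject where cast(_1 AS int) * cast(_2 AS int) > cast(_3 AS float) + 1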
Expression: "c2VsZWN0ICogZnJvbSBCb3NPYmplY3Qgd2hlcmUgY2FzdChfMSBBUyBpbnQpICogY2FzdChfMiBBUyBpbnQpID4gY2FzdChfMyBBUyBmbG9hdCkgKyAx",
ExpressionType: "SQL",
InputSerialization: &api.SelectObjectInput{
CompressionType: "NONE",
CsvParams: map[string]string{
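// The delimiter values are base64-encoded as well:
// "Cg==" is "\n", "LA==" is ",", "Ig==" is "\"", "Iw==" is "#".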
"fileHeaderInfo": "IGNORE",
"recordDelimiter": "Cg==",
"fieldDelimiter": "LA==",
"quoteCharacter": "Ig==",
"commentCharacter": "Iw==",
},
},
OutputSerialization: &api.SelectObjectOutput{
OutputHeader: false,
CsvParams: map[string]string{
"quoteFields": "ALWAYS",
"recordDelimiter": "Cg==",
"fieldDelimiter": "LA==",
"quoteCharacter": "Ig==",
},
},
RequestProgress: &api.SelectObjectProgress{
Enabled: true,
},
},
}
csvRes, err := bosClient.SelectObject(bucket, csvObject, csvArgs)
if err != nil {
fmt.Println(err)
return
}
parseMessages(csvRes)
fmt.Println("------ select json object -------")
jsonObject := "test.json"
jsonArgs := &api.SelectObjectArgs{
SelectType: "json",
SelectRequest: &api.SelectObjectRequest{
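// Expression decodes to: select * from BosObject.projects[*].project_name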
Expression: "c2VsZWN0ICogZnJvbSBCb3NPYmplY3QucHJvamVjdHNbKl0ucHJvamVjdF9uYW1l",
ExpressionType: "SQL",
InputSerialization: &api.SelectObjectInput{
CompressionType: "NONE",
JsonParams: map[string]string{
"type": "LINES",
},
},
OutputSerialization: &api.SelectObjectOutput{
JsonParams: map[string]string{
"recordDelimiter": "Cg==",
},
},
RequestProgress: &api.SelectObjectProgress{
Enabled: true,
},
},
}
jsonRes, err := bosClient.SelectObject(bucket, jsonObject, jsonArgs)
if err != nil {
fmt.Println(err)
return
}
parseMessages(jsonRes)
}
// Parse all headers and save them in map
func parseHeaders(headers []byte) map[string]string {
hm := make(map[string]string)
index := 0
for index < len(headers) {
// headers key length
keyLen := int(headers[index])
index += 1
// headers key
key := headers[index : index+keyLen]
index += keyLen
// headers value length
valLenByte := headers[index : index+2]
valLen := int(binary.BigEndian.Uint16(valLenByte))
index += 2
// headers value
val := headers[index : index+valLen]
index += valLen
hm[string(key)] = string(val)
}
return hm
}
func parseMessages(res *api.SelectObjectResult) {
defer res.Body.Close()
reader := bufio.NewReader(res.Body)
for {
// total length in prelude, 4 bytes
p := make([]byte, 4)
l, err := io.ReadFull(reader, p)
if err != nil || l < 4 {
fmt.Printf("read total length err: %+v, len: %d\n", err, l)
break
}
totalLen := binary.BigEndian.Uint32(p)
// headers length in prelude, 4 bytes
l, err = io.ReadFull(reader, p)
if err != nil || l < 4 {
fmt.Printf("read headers length err: %+v, len: %d\n", err, l)
break
}
headersLen := binary.BigEndian.Uint32(p)
// headers part
headers := make([]byte, headersLen)
l, err = io.ReadFull(reader, headers)
if err != nil || uint32(l) < headersLen {
fmt.Printf("read headers data err: %+v, len: %d\n", err, l)
break
}
// Parse the headers and determine the message type: stop reading on an End message,
// print progress information for a Cont message, and print record data for a Records message
headersMap := parseHeaders(headers)
if headersMap["message-type"] == "Records" {
// payload part
payloadLen := totalLen - headersLen - 12
payload := make([]byte, payloadLen)
if _, err := io.ReadFull(reader, payload); err != nil {
fmt.Printf("read payload data err: %+v\n", err)
}
// Split the payload on the record delimiter you set in the OutputSerialization field ("\n" here)
rs := strings.Split(string(payload), "\n")
_, err = io.ReadFull(reader, p)
crc := binary.BigEndian.Uint32(p)
recordsMsg := &api.RecordsMessage{
CommonMessage: api.CommonMessage{
Prelude: api.Prelude{
TotalLen: totalLen,
HeadersLen: headersLen,
},
Headers: headersMap,
Crc32: crc,
},
Records: rs,
}
fmt.Printf("RecordsMessage: %+v\n", recordsMsg)
continue
}
if headersMap["message-type"] == "Cont" {
// payload part, progress
bs := make([]byte, 8)
_, err = io.ReadFull(reader, bs)
bytesScanned := binary.BigEndian.Uint64(bs)
br := make([]byte, 8)
_, err = io.ReadFull(reader, br)
bytesReturned := binary.BigEndian.Uint64(br)
_, err = io.ReadFull(reader, p)
crc := binary.BigEndian.Uint32(p)
contMsg := &api.ContinuationMessage{
CommonMessage: api.CommonMessage{
Prelude: api.Prelude{
TotalLen: totalLen,
HeadersLen: headersLen,
},
Headers: headersMap,
Crc32: crc,
},
BytesScanned: bytesScanned,
BytesReturned: bytesReturned,
}
fmt.Printf("ContinuationMessage: %+v\n", contMsg)
continue
}
if headersMap["message-type"] == "End" {
_, err = io.ReadFull(reader, p)
crc := binary.BigEndian.Uint32(p)
endMsg := &api.EndMessage{
CommonMessage: api.CommonMessage{
Prelude: api.Prelude{
TotalLen: totalLen,
HeadersLen: headersLen,
},
Headers: headersMap,
Crc32: crc,
},
}
fmt.Printf("EndMessage: %+v\n", endMsg)
break
}
}
}