Select Object
Basic Introduction
The SelectObject interface lets users run SQL statements against the content of objects stored in BOS in a specified format (CSV/JSON). It filters and analyzes the object content with structured SQL queries and returns only the content the user needs.
Previously, a user who wanted to filter the content of an object stored in BOS had to download the whole object through the GetObject interface and then analyze and filter the data locally. The SelectObject interface moves this filtering into the BOS service layer, which reduces the network bandwidth and latency of downloading data from BOS, saves the CPU and memory the user would consume filtering the data, and lowers the cost of accessing data in BOS.
Applicable Scenarios
A typical application of SelectObject is to combine it with big data products, replacing the GetObject interface for processing BOS data, for example to extract specific content from log files or to filter data for analysis.
Use Conditions
To filter objects stored in BOS with the SelectObject interface, the following restrictions apply:
- Supported file types
  - Only RFC 4180-compliant CSV files (including TSV and similar variants) and JSON files encoded in UTF-8 can be selected;
  - The maximum length of a single row or column in a CSV file is 512 KB;
  - Two JSON layouts can be selected: DOCUMENT and LINES. DOCUMENT means the whole file is a single JSON object. LINES means the file consists of multiple lines, each of which is a JSON object, while the file as a whole is not a valid JSON object; lines are separated by a line break. Users can specify common row and column delimiters such as \n and \r\n;
  - Objects in the standard, infrequent-access, and cold storage classes can be selected;
  - Objects encrypted server-side with SSE-BOS, SSE-KMS, or SSE-C can be selected;
  - GZIP-compressed objects can be selected; the selected content is decompressed as a stream and returned. Raw deflate format is not supported; the supported standard is RFC 1952: Gzip Compression Standard Reference. A hedged request sketch follows.
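As a hedged illustration, a GZIP-compressed CSV object could be selected with the Go SDK roughly as follows. This is a minimal sketch: the `"GZIP"` CompressionType value is an assumption based on the restriction above, the bucket and object names are placeholders, and `parseMessages` refers to the full Golang example at the end of this page.

```go
// Minimal sketch: selecting a GZIP-compressed CSV object with the Go SDK.
// Imports assumed: encoding/base64 plus the bos and api packages used in the
// Golang example below.
func selectGzipCsv(bosClient *bos.Client) error {
	// The SDK expects the SQL expression base64-encoded.
	expr := base64.StdEncoding.EncodeToString([]byte("select * from BosObject limit 10"))
	args := &api.SelectObjectArgs{
		SelectType: "csv",
		SelectRequest: &api.SelectObjectRequest{
			Expression:     expr,
			ExpressionType: "SQL",
			InputSerialization: &api.SelectObjectInput{
				CompressionType: "GZIP", // assumption: content is decompressed as a stream server-side
				CsvParams:       map[string]string{"fileHeaderInfo": "IGNORE"},
			},
			OutputSerialization: &api.SelectObjectOutput{
				CsvParams: map[string]string{"recordDelimiter": "Cg=="}, // base64 of "\n"
			},
		},
	}
	res, err := bosClient.SelectObject("select-bucket", "test.csv.gz", args)
	if err != nil {
		return err
	}
	parseMessages(res) // see the full Golang example at the end of this page
	return nil
}
```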
- Supported SQL syntax
  - Only the SELECT syntax is supported at present. SQL statements take the form `Select field_list From source Where condition Limit number`;
  - The string, int (64-bit), float (64-bit), timestamp, and Boolean data types are supported;
  - Logical conditions (AND/OR/NOT), arithmetic expressions (+ - * / %), comparison operators (>, =, <, >=, <=, !=), matching operators (LIKE, BETWEEN ... AND, IN), and null checks (IS NULL / IS NOT NULL) are supported;
  - Aggregate functions (AVG, COUNT, MAX, MIN, SUM), the conversion function CAST, and the alias keyword AS are supported;
  - Only single-file queries are supported; JOIN, ORDER BY, GROUP BY, HAVING, OFFSET, and other such keywords are not supported. An example statement exercising the supported clauses follows below.
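As an illustration of the general form above, the following statement (made up for this sketch) exercises each supported clause:

```go
// One made-up statement exercising Select / From / Where / Limit together
// with CAST, a comparison, a logical AND, and a LIKE match.
const exampleQuery = "select _1, _2 from BosObject" +
	" where cast(_1 as int) >= 0 and _3 like 'a%' limit 100"
```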
- SQL statement limitations
  - The maximum length of a single SQL statement is 16 KB, the maximum number of columns is 1000, the maximum length of a column name is 1024, and the maximum number of aggregate operations (count, avg, etc.) is 100;
  - LIMIT is applied before aggregate functions. For example, `Select avg(cast(_1 as int)) from BosObject limit 100` averages the first 100 elements, which differs from MySQL semantics;
  - The COUNT function only accepts `*`, i.e. `count(*)`; the `count(1)` form is not supported;
  - The maximum length of the JSON node data designated by the json path after FROM is 512 KB, and its maximum depth is 10 levels;
  - The `[*]` array wildcard can only appear when selecting JSON files, and only in the json path after FROM; expressions after SELECT or WHERE must not contain `[*]`;
  - When selecting CSV files, the FROM keyword can only be followed by BosObject;
  - The WHERE clause cannot contain aggregate conditions; only logical operators are allowed;
  - A LIKE clause supports at most five `%` wildcards, where `%` matches zero or more arbitrary characters and `_` matches a single character; an IN clause supports at most 1024 constant items;
  - The fields after SELECT can be column names, CSV column indexes (_1, _2, etc.), or aggregate functions, e.g. `AVG(cast(_1 as int))`, but not a bare `cast(_1 as int)`; fields do not support binary expressions;
  - If a field after SELECT is `*`, no other field may appear, so `select *, _1 from s` is illegal; aggregate functions and plain column names cannot be mixed in the SELECT fields; all aliases in the SELECT fields must be distinct;
  - In JSON SQL, a field or source of the form `key[*]` or `key[1]` is treated as selecting elements of the array whose key is `key`; however, if the field or source has the form `key[a]`, the whole string `key[a]` is treated as the key when looking up the corresponding value in the JSON;
  - Matching of keys between JSON files and SQL is case sensitive; e.g. `select s.Age` and `select s.age` are different. Some of these limits are made concrete in the sketch below.
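To make a few of these limits concrete, here is a hedged set of legal and illegal statements (illustrative only, not exhaustive):

```go
// Statements that respect the limits above.
const (
	okCount = "select count(*) from BosObject"                        // count(*) is the only COUNT form
	okJson  = "select * from BosObject.projects[*].project_name"     // [*] only in the json path after FROM
	okLimit = "select avg(cast(_1 as int)) from BosObject limit 100" // LIMIT is applied before AVG
)
// Illegal by the rules above: "count(1)"; "select *, _1 from s" (no other
// field may accompany *); "[*]" inside a SELECT or WHERE expression; and a
// bare "cast(_1 as int)" as a SELECT field.
```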
Select Data Fault-tolerant Mechanism
(1) Missing Data Processing
- When a column of data is missing from a row of a CSV file: if that column is used for a condition in WHERE, the condition is treated as unsatisfied and the row is skipped; if the missing column is used in SELECT for an aggregate operation, e.g. `avg(cast(_1 as int))`, aggregating a non-existent column is considered illegal, and the request ends immediately with a corresponding error message;
- When a key is missing from a JSON file, it is handled in the same way;
- When a CSV column or a JSON key is missing, it is treated as NULL by default, i.e. IS NULL evaluates to true for it.
- Other cases:
  - When a JSON key or CSV column is used in an expression after WHERE, e.g. `... where _1 = ''` or `... where a.b = 'value'`, a missing key or column defaults to NULL;
  - When a JSON key or CSV column is used as a field after SELECT, e.g. `select _1 from ...` or `select a.b from ...`, a missing CSV column returns an empty string by default, and a missing JSON key likewise returns an empty string. The sketch below illustrates this defaulting.
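A hypothetical sketch of this behavior (the file content and queries are made up):

```go
// Given a CSV object whose content is "1,2,3\n4,5\n", the absent third
// column of row 2 defaults to NULL, so this query returns only "4":
const nullQuery = "select _1 from BosObject where _3 is null"
// Aggregating a column that is missing in some row, by contrast, ends the
// request with an error instead of skipping the row:
const badAggregate = "select avg(cast(_3 as int)) from BosObject"
```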
(2) Processing of Type-mismatched Data
- A CSV column value may be invalid for an operation, e.g. `cast(_1 as int)` where the value of _1 is a non-numeric string, causing the cast to fail. If `cast(_1 as int)` is used for a condition after WHERE, the condition is treated as unsatisfied and the row is skipped. But if the column is used in an aggregate operation after SELECT, e.g. `avg(cast(_1 as int))`, aggregating an invalid column is considered illegal, and the request ends immediately with a corresponding error message; a sketch follows below.
- If the data type of some key in a JSON file is invalid for an operation, it is processed in the same way.
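A hypothetical sketch of the type-mismatch rules (file content and queries made up):

```go
// With CSV content "1,2,3\na,b,c\n", the cast below fails on row 2 because
// "a" is not numeric, so that row is skipped and only row 1 is returned:
const skipRowQuery = "select * from BosObject where cast(_1 as int) = 1"
// The same failing cast inside an aggregate after SELECT ends the request
// with an error instead of skipping the row:
const failAggQuery = "select avg(cast(_1 as int)) from BosObject"
```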
CSV Object
Typically, selecting a CSV object means selecting designated columns by column ordinal or name, or running aggregate operations over certain columns. For example, the content of a test.csv file is shown below, containing columns of various data types. CSV treats every column value as a string by default, so corresponding CAST conversions are required. In addition, no whitespace is allowed around column delimiters:
header1,header2,header3
1,2,3.4
a,b,c
"d","e","f"
true,false,true
2006-01-02 15:04:06,"2006-01-02 16:04:06",2006-01-02 17:04:06
Frequently-used SQL Statements
SQL statement | Description | Comments |
---|---|---|
select * from BosObject limit 100 | Return the first 100 rows of the object | - |
select header1,header2 from BosObject | Return the columns named header1 and header2 in the object | The fileHeaderInfo parameter must be set to "USE" |
select _1,_3 from BosObject where cast(_1 as int)<= cast(_3 as int) | Return columns 1 and 3 of the rows where the integer value of column 1 is less than or equal to that of column 3 | Columns referenced as _1, _3 can only be CAST when they hold integer strings; otherwise the row is skipped for failing the condition |
select count(*) from BosObject | Return the total number of rows in the object | - |
select AVG(cast(_1 AS int)), MAX(cast(_1 AS int)), MIN(cast(_1 AS int)) from BosObject | Return the average, maximum, and minimum values of the first column of the object | The first column of every row must not contain non-integer strings, or the request fails immediately |
select SUM(cast(header1 AS float)) from BosObject WHERE cast(header1 AS float) != 1 | Return the sum of the header1 column over the rows where its value is not equal to 1 | The header1 column must not contain non-numeric strings |
select * from BosObject where _1 LIKE '%tree' | Return the rows whose _1 column ends with "tree"; for example, "apple tree" matches the condition, but "apple" does not | The string after the LIKE operator uses single quotes |
select * from BosObject where cast(_1 AS int) % 3 = 0 | Return all rows where the _1 column is evenly divisible by 3 | The % operator can only be used when _1 is an integer-typed string |
select * from BosObject where cast(_1 AS int) between 1 and 2 | Return all rows of the object where the _1 column lies in the range [1,2] | _1 must be an integer-typed string |
select * from BosObject where cast(_1 AS timestamp) NOT IN (cast('2006-01-02 15:04:06' as timestamp), cast('2006-01-03 15:04:06' as timestamp)) | Return all rows where the _1 column is outside the IN range | _1 must be a date string |
select * from BosObject where cast(_1 AS int) * cast(_2 AS int) > cast(_3 AS float) + 1 | Return all rows of the object where the conditional expression holds | _1, _2, _3 must be strings that satisfy the CAST conditions |
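As a hypothetical worked example against the test.csv content above, assuming fileHeaderInfo is set to "USE" so that the first line supplies the column names:

```go
// Select two named columns from test.csv; all values are returned as strings.
const headerQuery = "select header1,header2 from BosObject"
// Expected records, one per remaining row (the quotes around "d","e","f" are
// stripped by the configured quoteCharacter):
//   1,2
//   a,b
//   d,e
//   true,false
//   2006-01-02 15:04:06,2006-01-02 16:04:06
```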
JSON Object
When selecting a JSON object, data is generally selected by key. JSON files come in two layouts, LINES and DOCUMENT.
JSON DOCUMENT Object
{"name": "Smith",
"age": 16,
"weight": 65.5,
"org": null,
"projects":
[
{"project_name":"project1", "completed":false},
{"project_name":"project2", "completed":true}
]
}
JSON LINES Object
{"name": "Smith",
"age": 16,
"org": null,
"projects":
[
{"project_name":"project1", "completed":false},
{"project_name":"project2", "completed":true}
]
}
{"name": "charles",
"age": 17,
"org": "baidu",
"weight": 65.5,
"projects":
[
{"project_name":"project3", "completed":false},
{"project_name":"project4", "completed":true}
]
}
Frequently-used SQL Statements
- Basic form of a json path: `field0.field1[n].property1.attributes[*]` means: first find the nth element of the field1 array under the field0 node below the root of the JSON file, then take all contents of the attributes array inside that element's property1 field;
- SQL over JSON objects can also use aggregate functions, logical operations, arithmetic operations, and so on; values in JSON carry their own data types, so no CAST conversion is needed.
SQL statement | Description | Comments |
---|---|---|
select projects from BosObject where name='Smith' | Return the projects elements that satisfy the name='Smith' condition in the json file | |
select * from BosObject.projects[*].project_name | Return the project_name field of every element of the projects array under the root node of the json file | |
select s.completed from BosObject.projects[1] s where s.project_name='project2' | Return the completed field of the element at index 1 of the projects array, provided it satisfies project_name='project2' | |
select * from BosObject s where s.org IS NULL AND weight is null | Return the records in the json file whose org and weight are both null | A weight node that does not exist is also treated as null |
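As a hypothetical worked example against the JSON LINES object above (the exact output framing depends on the OutputSerialization settings):

```go
// Only the first record has org == null, so this query is expected to
// return the single value {"name":"Smith"}.
const jsonQuery = "select s.name from BosObject s where s.org IS NULL"
```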
Error Return Code
Error codes returned by the server may take the form of an HTTP status code, or be carried in the error-code field of the End message; which form is used depends on the specific error that occurred. A hedged sketch of inspecting the End message follows the table.
ErrorCode | Description | HTTP Status Code |
---|---|---|
InvalidFileType | Select only supports selecting the contents of csv and json objects | 400 |
RecordTooLarge | A single row of the csv file exceeds the 512 KB limit | 400 |
SqlSyntaxError | The sql statement is invalid and contains syntax errors | 400 |
InvalidSqlFields | A field of SELECT in the sql statement is invalid, e.g. it contains a binary operator or another unsupported operation | 400 |
InvalidSqlBinaryExpr | A binary operator is used invalidly; the types of the left and right operands do not match | 400 |
SqlFieldsNumExceedLimit | The number of fields in SELECT of the sql statement exceeds the limit | 400 |
AggregateInvalidField | An aggregate function in the sql statement is used invalidly; only numerical columns can be aggregated | 400 |
InvalidSqlJsonPathDepth | The depth of the selected json object node is invalid: it exceeds the 1024 restriction or is less than 1 | 400 |
SqlSourceNumExceedLimit | Only one source is allowed in FROM of the sql statement | 400 |
FieldNotExist | A field of SELECT in the sql statement does not exist in the file | 400 |
InappropriateJson | The content or format of the json object is incorrect | 400 |
HeaderNotExist | There is no header information in the csv object | 400 |
DecompressError | The object fails to be decompressed | 400 |
DataOverflowsType | The result of an aggregate column exceeds the limits of its type | 400 |
InvalidSqlSource | The source in FROM of the sql statement is invalid; check whether the source format meets the requirements | 400 |
InvalidSqlLimitValue | The LIMIT value in the sql statement is invalid; it must be a positive integer | 400 |
InvalidSqlNotOperator | The NOT operator in the sql statement is misused; it can only be used before BETWEEN/IN/LIKE to express negation | 400 |
InvalidSqlBetweenOperator | The BETWEEN operator in the sql statement is misused: BETWEEN and AND must be used together, and both sides of AND must have consistent types | 400 |
InvalidSqlInOperator | The IN operator in the sql statement is misused; the values inside IN must have consistent types | 400 |
InvalidSqlIsOperator | The IS operator in the sql statement is misused; it can only be used with NULL/NOT NULL | 400 |
InvalidSqlLikeOperator | The LIKE operator in the sql statement is misused | 400 |
InvalidSqlFunction | A function in the sql statement is misused; check the types and number of the function's parameters | 400 |
InvalidExpressionParameter | In Selectobject requests, the expression parameter is invalid | 400 |
InvalidExpressionTypeParameter | In Selectobject requests, the expressionType parameter is invalid | 400 |
InvalidCompressionTypeParameter | In Selectobject requests, the compressionType parameter is invalid | 400 |
InvalidJsonTypeParameter | In Selectobject requests, the json type parameter is invalid | 400 |
InvalidQuoteFieldsParameter | In Selectobject requests, the quote fields parameter is invalid | 400 |
InvalidSelectRequestJsonBody | In Selectobject requests, the json body parameter is invalid | 400 |
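As a hedged sketch of reading the error code from the End message: the message-type header below comes from the Golang example at the end of this page, while the error-code header key is an assumption based on the description above the table.

```go
// Inside the message-parsing loop of parseMessages (Golang example below):
if headersMap["message-type"] == "End" {
	// Assumption: a failed select carries its error code in the
	// End-message headers, as described above the table.
	if code, ok := headersMap["error-code"]; ok {
		fmt.Printf("select failed: %s\n", code)
	}
	break
}
```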
SDK Usage Examples
At present, [BOS Java SDK](Bos/SDK/Java SDK/file management.md), [BOS go SDK](Bos/SDK/go-sdk/file management.md), and [BOS Python SDK](Bos/SDK/Python SDK/file management.md) all support the SelectObject interface.
Java SDK example
public void selectCsv(BosClient client, String bucketName, String csvObject) {
System.out.println("------ select csv object ------");
SelectObjectRequest request = new SelectObjectRequest(bucketName, csvObject)
.withSelectType("csv")
.withExpression("select * from BosObject limit 3")
.withExpressionType(SelectObjectRequest.ExpressionType.SQL)
.withInputSerialization(new InputSerialization()
.withCompressionType("NONE")
.withFileHeaderInfo("NONE")
.withRecordDelimiter("\r\n")
.withFieldDelimiter(",")
.withQuoteCharacter("\"")
.withCommentCharacter("#"))
.withOutputSerialization(new OutputSerialization()
.withOutputHeader(false)
.withQuoteFields("ALWAYS")
.withRecordDelimiter("\n")
.withFieldDelimiter(",")
.withQuoteCharacter("\""))
.withRequestProgress(false);
SelectObjectResponse response = client.selectObject(request);
//Output query results
printRecords(response.getMessages());
}
public void selectJson(BosClient client, String bucketName, String jsonObject) {
System.out.println("------ select json object ------");
SelectObjectRequest request = new SelectObjectRequest(bucketName, jsonObject)
.withSelectType("json")
.withExpression("select * from BosObject where age > 20")
.withInputSerialization(new InputSerialization()
.withCompressionType("NONE")
.withJsonType("LINES"))
.withOutputSerialization(new OutputSerialization()
.withRecordDelimiter("\n"))
.withRequestProgress(false);
SelectObjectResponse response = client.selectObject(request);
//Output query results
printRecords(response.getMessages());
}
public void printRecords(SelectObjectResponse.Messages messages) {
if (messages == null) {
return;
}
while (messages.hasNext()) {
SelectObjectResponse.CommonMessage message = messages.next();
if (message.Type.equals("Records")) {
for (String record: message.getRecords()) {
System.out.println(record);
}
}
}
}
Golang example
package main
import (
"bufio"
"encoding/binary"
"fmt"
"io"
"strings"
)
import (
"github.com/baidubce/bce-sdk-go/services/bos"
"github.com/baidubce/bce-sdk-go/services/bos/api"
)
func main() {
selectBosObject()
}
func selectBosObject() {
// Initialize BosClient
AK, SK := "ak", "sk"
ENDPOINT := "bj.bcebos.com"
bosClient, _ := bos.NewClient(AK, SK, ENDPOINT)
// Ensure that the bucket and object already exist and that the object meets the csv/json file format requirements
bucket := "select-bucket"
csvObject := "test.csv"
fmt.Println("------ select csv object -------")
csvArgs := &api.SelectObjectArgs{
SelectType: "csv",
SelectRequest: &api.SelectObjectRequest{
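// Expression is base64-encoded SQL; this one decodes to:
// select * from BosObject where cast(_1 AS int) * cast(_2 AS int) > cast(_3 AS float) + 1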
Expression: "c2VsZWN0ICogZnJvbSBCb3NPYmplY3Qgd2hlcmUgY2FzdChfMSBBUyBpbnQpICogY2FzdChfMiBBUyBpbnQpID4gY2FzdChfMyBBUyBmbG9hdCkgKyAx",
ExpressionType: "SQL",
InputSerialization: &api.SelectObjectInput{
CompressionType: "NONE",
CsvParams: map[string]string{
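// The delimiter values are base64-encoded as well:
// "Cg==" is "\n", "LA==" is ",", "Ig==" is "\"", "Iw==" is "#".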
"fileHeaderInfo": "IGNORE",
"recordDelimiter": "Cg==",
"fieldDelimiter": "LA==",
"quoteCharacter": "Ig==",
"commentCharacter": "Iw==",
},
},
OutputSerialization: &api.SelectObjectOutput{
OutputHeader: false,
CsvParams: map[string]string{
"quoteFields": "ALWAYS",
"recordDelimiter": "Cg==",
"fieldDelimiter": "LA==",
"quoteCharacter": "Ig==",
},
},
RequestProgress: &api.SelectObjectProgress{
Enabled: true,
},
},
}
csvRes, err := bosClient.SelectObject(bucket, csvObject, csvArgs)
if err != nil {
fmt.Println(err)
return
}
parseMessages(csvRes)
fmt.Println("------ select json object -------")
jsonObject := "test.json"
jsonArgs := &api.SelectObjectArgs{
SelectType: "json",
SelectRequest: &api.SelectObjectRequest{
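// Expression decodes to: select * from BosObject.projects[*].project_name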
Expression: "c2VsZWN0ICogZnJvbSBCb3NPYmplY3QucHJvamVjdHNbKl0ucHJvamVjdF9uYW1l",
ExpressionType: "SQL",
InputSerialization: &api.SelectObjectInput{
CompressionType: "NONE",
JsonParams: map[string]string{
"type": "LINES",
},
},
OutputSerialization: &api.SelectObjectOutput{
JsonParams: map[string]string{
"recordDelimiter": "Cg==",
},
},
RequestProgress: &api.SelectObjectProgress{
Enabled: true,
},
},
}
jsonRes, err := bosClient.SelectObject(bucket, jsonObject, jsonArgs)
if err != nil {
fmt.Println(err)
return
}
parseMessages(jsonRes)
}
// Parse all headers and save them in map
func parseHeaders(headers []byte) map[string]string {
hm := make(map[string]string)
index := 0
for index < len(headers) {
// headers key length
keyLen := int(headers[index])
index += 1
// headers key
key := headers[index : index+keyLen]
index += keyLen
// headers value length
valLenByte := headers[index : index+2]
valLen := int(binary.BigEndian.Uint16(valLenByte))
index += 2
// headers value
val := headers[index : index+valLen]
index += valLen
hm[string(key)] = string(val)
}
return hm
}
func parseMessages(res *api.SelectObjectResult) {
defer res.Body.Close()
reader := bufio.NewReader(res.Body)
for {
// total length in prelude, 4 bytes
p := make([]byte, 4)
l, err := io.ReadFull(reader, p)
if err != nil || l < 4 {
fmt.Printf("read total length err: %+v, len: %d\n", err, l)
break
}
totalLen := binary.BigEndian.Uint32(p)
// headers length in prelude, 4 bytes
l, err = io.ReadFull(reader, p)
if err != nil || l < 4 {
fmt.Printf("read headers length err: %+v, len: %d\n", err, l)
break
}
headersLen := binary.BigEndian.Uint32(p)
// headers part
headers := make([]byte, headersLen)
l, err = io.ReadFull(reader, headers)
if err != nil || uint32(l) < headersLen {
fmt.Printf("read headers data err: %+v, len: %d\n", err, l)
break
}
// Parse the headers and determine the message type: stop reading on an End message,
// print progress information for a Cont message, and print record data for a Records message
headersMap := parseHeaders(headers)
if headersMap["message-type"] == "Records" {
// payload part
payloadLen := totalLen - headersLen - 12
payload := make([]byte, payloadLen)
if _, err := io.ReadFull(reader, payload); err != nil {
fmt.Printf("read payload data err: %+v\n", err)
}
// Split the payload on the record delimiter you set in the OutputSerialization field ("\n" here)
rs := strings.Split(string(payload), "\n")
_, err = io.ReadFull(reader, p)
crc := binary.BigEndian.Uint32(p)
recordsMsg := &api.RecordsMessage{
CommonMessage: api.CommonMessage{
Prelude: api.Prelude{
TotalLen: totalLen,
HeadersLen: headersLen,
},
Headers: headersMap,
Crc32: crc,
},
Records: rs,
}
fmt.Printf("RecordsMessage: %+v\n", recordsMsg)
continue
}
if headersMap["message-type"] == "Cont" {
// payload part, progress
bs := make([]byte, 8)
_, err = io.ReadFull(reader, bs)
bytesScanned := binary.BigEndian.Uint64(bs)
br := make([]byte, 8)
_, err = io.ReadFull(reader, br)
bytesReturned := binary.BigEndian.Uint64(br)
_, err = io.ReadFull(reader, p)
crc := binary.BigEndian.Uint32(p)
contMsg := &api.ContinuationMessage{
CommonMessage: api.CommonMessage{
Prelude: api.Prelude{
TotalLen: totalLen,
HeadersLen: headersLen,
},
Headers: headersMap,
Crc32: crc,
},
BytesScanned: bytesScanned,
BytesReturned: bytesReturned,
}
fmt.Printf("ContinuationMessage: %+v\n", contMsg)
continue
}
if headersMap["message-type"] == "End" {
_, err = io.ReadFull(reader, p)
crc := binary.BigEndian.Uint32(p)
endMsg := &api.EndMessage{
CommonMessage: api.CommonMessage{
Prelude: api.Prelude{
TotalLen: totalLen,
HeadersLen: headersLen,
},
Headers: headersMap,
Crc32: crc,
},
}
fmt.Printf("EndMessage: %+v\n", endMsg)
break
}
}
}