Running query on Qubole data source, I encounter error
ERROR: extra data after last expected column
My data might have tab characters.
This is a limitation of Qubole REST API, specifically the delimiters used in view results API. Please update your query to replace all tab characters with space via the function replace(string, search, replace) (doc), e.g. replace(column_with_tab_character, '\<Tab on your keyboard goes here>', ' ')
A detailed technical explanation is as follows:
1) Let's use this query as example data. The second column data have tabs.
select * from (
('2017-01-01', 'Saigon Phu Quoc Resort & Spa 985', '1'),
('2017-01-02', 'No Happy Tower 1', '2')
) t (a, bug_with_tab, b)
2) The view results API has a parameter raw to switch the delimiter used in response
3) Currently, we are using the default mode of the API, i.e. raw = false. For the above query, we get this response from Qubole
"2017-01-01\tSaigon Phu Quoc Resort & Spa\t\t\t985\t1\r\n2017-01-02\tNo\t\tHappy\t\tTower\t2\r\n"
Qubole uses tab character (\t) to delimit columns and '\r\n' sequence to delimit rows. We cannot distinguish whether the tab character in the response is Qubole delimiter or actual data. This leads to incorrect number of columns after parsing the response.
4) We tried raw mode to use a different delimiter. We get this response
"2017-01-01\u0001Saigon Phu Quoc Resort & Spa\t\t\t985\u00011\n2017-01-02\u0001Happy\tNo\t\tHappy\t\tTower\u00012\n\n"
Now, Qubole uses ^A character (\u0001) to delimit columns and newline (\n) to delimit rows. The tab character is safe but the newline character is not. Newline is a common character in description, address, so using this mode would break other queries using such data.
5) As we cannot work around the limitation of Qubole on our side, we would need your help to achieve a solution. Please replace the tab character in SQL query, or clean the data directly in the database.