Author: 元君 | Published: 2022-04-07

An Artificial Intelligence Browser Architecture (AIBA) For Our Kind and Others: A Voice Name System Speech implementation with two warrants, Wake Neutrality and Value Preservation of Privately Identifiable Information


Brian Subirana 

Massachusetts Institute of Technology

Harvard University


Abstract: Conversational commerce, first pioneered by Apple’s Siri, is the first of many applications based on always-on artificial intelligence systems that decide on their own when to interact with the environment, potentially collecting 24x7 longitudinal training data that is often Privately Identifiable Information (PII). A large body of scholarly papers, on the order of a million according to a simple Google Scholar search, suggests that the treatment of many health conditions, including COVID-19 and dementia, could be vastly improved by this data if the dataset is large enough, as has happened in other domains (e.g. GPT-3). In contrast, the current dominant systems are closed-garden solutions without wake neutrality that cannot fully exploit the PII data they hold because of IRB/COUHES-type constraints.

We present a voice browser-and-server architecture that aims to address these two limitations by offering wake neutrality and the ability to handle PII in a way that maximizes its value. We have implemented this browser for the collection of speech samples and have successfully demonstrated that it can capture over 200,000 samples of COVID-19 coughs. The architecture we propose is designed so it can grow beyond our kind into other domains, such as collecting sound samples from vehicles, video images from nature, ingestible robotics, multi-modal signals (EEG, EKG, ...), or even interacting with other kinds such as dogs and cats.

1. Introduction 

We have argued elsewhere for the need to design a technical architecture that can support an ecosystem of artificial intelligence players while optimizing value for those involved, which we understand to mean juggling many different design priorities:

- A Voice Name System to name any object in the world: Multiple devices may be present and multiple end-points may exist giving rise to conflicts that need to be resolved [5]. 

- A Wake Neutrality architecture: We would like the infrastructure to provide equal access to all players such as the phone system does once you know a phone number, a property we call Wake Neutrality [12] for its similarity in spirit to Net Neutrality. 

- A Set of Common Biomarkers: We would like the infrastructure to develop AI models that have some commonality across them [3,4].

- A Standardized Brain Model such as MIT’s CBMM Brain Model: For our kind [2] we would like to have a reference brain model so that we don’t have to reinvent everything from scratch every time. For other kinds we’ll need similar reference models, such as acoustic markers for automobile diagnosis.

- A Common Set of Use Cases such as those suggested by MIT Open Voice: By developing common use cases, people can grow services around them [14].

- A Common Language such as Huey: Interactions should have as much standardization as possible so that creators and consumers don’t have to relearn similar concepts [14]. 

- Application to other Kinds: Conversational commerce will be everywhere and cars, and even bricks will have their own personalities that will need to be supported [10]. 

- Legal Programming: In an Internet of Things world, there are many hurdles to legal compliance that need to be addressed [1, 6, 11, 12].

- Common Architectural Boundaries: To support the development of hardware and software products, clear boundaries need to be established [6, 7, 9, 14].

In the rest of this paper we provide the specs for a browser-and-server architecture that was the basis for an implemented system used in our laboratory to collect speech samples for COVID-19 research [3, 4]. We hope it will inspire others to create similar versions that can be open sourced and shared, and that compatibly grow the architecture into other applications, modalities and kinds.

2. AIBA 1.0: Highlights

Summary of functionality: The user opens the app and sees a record button, with (or without) instructions on what to record. The user can start recording by pressing the microphone icon as soon as the app is opened, except the first time, when the terms must be shown and accepted. The app tags all recordings with a unique hash_number, always the same for a given phone, that the user never sees. Everything else is optional and happens either in the configuration options (introduce personal non-identifiable information, select language) or on the server (set instructions on what to record).
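The per-phone hash_number described above can be sketched as follows. This is a minimal illustration, not the tested implementation: the file name `.aiba_hash` and the use of a random UUID are assumptions; the spec only fixes that the hash is unique, constant per phone, and never shown to the user.

```python
import os
import uuid

HASH_FILE = ".aiba_hash"  # hypothetical on-device storage location

def get_phone_hash(storage_dir: str) -> str:
    """Return the phone's persistent hash_number, creating it on first launch.

    The hash is generated once, stored invisibly on the device, and reused
    so that all samples from one phone share the same identifier.
    """
    path = os.path.join(storage_dir, HASH_FILE)
    if os.path.exists(path):
        with open(path) as f:
            return f.read().strip()
    phone_hash = uuid.uuid4().hex  # random identifier, carries no PII
    with open(path, "w") as f:
        f.write(phone_hash)
    return phone_hash
```

Because the identifier is random rather than derived from device attributes, it links recordings longitudinally without exposing any personally identifiable device information.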

Highlights of the suggested architecture: The server sends a number of instructions on what to record to the App (they are included in the <app_runtime_config_file>). The App runs the user through them. The App sends the voice samples to the server. In addition, if the user makes changes to the configuration, these are sent to the server. There are some exceptions, such as when no instructions are given (free recording), or when the server processes the audio and sends back feedback on whether the user has tested positive or not (the processing option will be introduced after the models have been built), behavior which will be defined by an <Engine_number>. Tests will be recorded in a subdirectory of the server app directory (the name of the subdirectory is the hash_number of the phone).
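How the App might walk the user through the server-sent instructions can be sketched as follows. This is a hypothetical sketch: only the pairing of a request text with a number of seconds comes from this spec; the field names (`instructions`, `text_request`, `seconds`) are assumptions.

```python
def next_instruction(config: dict, recording_count: int):
    """Pick the instruction for the next recording, round-robining
    through the list; None means free-recording mode (empty config)."""
    instructions = config.get("instructions", [])
    if not instructions:
        return None  # free recording: no specific instructions given
    return instructions[recording_count % len(instructions)]
```

With two instructions the app would alternate between them on successive recordings, and an empty runtime config would put the app in free-recording mode.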

Note on the tested version: We have developed two versions of the app. The MIT Voice App was written in React Native, and there is also a Web version of the app that was tested without interruption for two years at http://opensigma.mit.edu. It was used to collect over 200,000 samples of coughs during the first months of the COVID-19 pandemic.


Figure 1: Early Web Version of the VNS Browser Built in Our Lab


3. AIBA 1.0: Basic interface

AIBA 1.0 is centered on collecting speech and natural language input from our kind:

- Interface design 

o Screen opens and terms are accepted (this screen is only shown the first time) 

o Second screen (loads the default app language and default words from the server side; local defaults are used if there is no internet connection or it is slow)

§ Text output: <text_request> (e.g. “Digues els números del 0 al 10”, “Say the numbers from 0 to 10”)

§ Record button

§ Text input: <text_input_box> (e.g. “Digues com et trobes”, “Tell us how you feel”).

§ Settings button

§ A number (corresponding to the number of recordings, which keeps increasing)

§ Behavior of this screen:

· When pressing the record button some feedback is given so that the user knows recording is on. Recording happens while this feedback is given and for as long as the number of seconds given in the corresponding pair. Recording can be stopped by pressing the record button again.

· Settings sends you to settings

· The number increases after each recording

· The text input is a box where the user can type text associated with the sample (e.g., whether they have a fever or any comorbid conditions)

· If the <app_runtime_config_file> is empty, the user keeps recording whatever they want (a free recording mode for the app without specific instructions). Otherwise, the app round-robins around the options.

o Third screen –

§ If the <app_runtime_config_file> ends with a text without seconds, then this text is displayed (e.g. “thank you, please record again tomorrow”), and no other options are given to the user but a button that says “start over”. When the user presses it, we go back to screen two.

§ If the server directory has a text file and/or audio file, the text is displayed and/or the audio played. E.g., this text could be generated from the response of the <Engine_number> (e.g. results from a COVID-19 test). If there is no response from the server we keep recording.

§ In general, if there is no connection with the server we should keep collecting samples (and send them when there is internet coverage). This may happen in the basement of a hospital.
- Settings option 

o Choose language. Selection is added to the <local_config_status_file>

o Introduce password and study code (“codi d’estudi”).

o Tell us about you:

- Opens up a menu with text fields corresponding to the requests in the <personal_information_request> file. For a rapid implementation, a fixed set of questions can be given if this is too complicated; we need a fast turnaround.

o Generate single numbers to be shared with neighbors (i.e. different numbers to give to people related to you). You can press it indefinitely. These numbers are also added to the <local_config_status_file>

o Introduce neighbors’ numbers (e.g. different numbers that others give you). Numbers added to the <local_config_status_file>

o See terms

o Run_time_file_config_number. Number added to the <local_config_status_file>

o Reset number of recordings. Resets the number to zero. Two numbers are updated in the <local_config_status_file>: <Total_count> = <Total_count> + <Current_count>, and then <Current_count> = 0.

o Fix color

o <reset_time> is the time before a session becomes inactive and a reset is established.

o <Engine_number> identifies the engine that provides the answer as to whether the user has the virus or not according to the algorithm.

o <VNS_number> describes the server that will interact with the phone (it can be an IP address or just a name).

o <Dynamic_VNS_toggle> the default is “on” but it can be set to “off” by the user

o <Dynamic_VNS> is the server that will be used if the above toggle is on. By default, if none is established, the app should point to http://voice.mit.edu
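The “Reset number of recordings” behavior above can be sketched as follows. This is a minimal sketch assuming the <local_config_status_file> is held as a dictionary with Total_count and Current_count fields (the field names come from this spec; the in-memory representation is an assumption).

```python
def reset_recording_count(status: dict) -> dict:
    """Fold the current session count into the running total, then zero it:
    Total_count = Total_count + Current_count, then Current_count = 0."""
    status["Total_count"] = status.get("Total_count", 0) + status.get("Current_count", 0)
    status["Current_count"] = 0
    return status
```

Accumulating into <Total_count> before zeroing ensures the lifetime number of recordings survives a reset, so the displayed counter can restart without losing the phone's history.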

4. Configurable options from the server-side:

Any element of any of the configuration files can be changed / added in real time. E.g., a new file can be introduced to reflect a new collection strategy for another COVID-19 study, as well as a default one.
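For illustration, a hypothetical <app_runtime_config_file> fragment might look as follows. The JSON encoding and the field names are assumptions; the spec only fixes that each instruction pairs a request text with a number of seconds, and that a final text without seconds closes the session.

```json
{
  "default_language": "ca",
  "instructions": [
    { "text_request": "Cough three times", "seconds": 5 },
    { "text_request": "Say the numbers from 0 to 10", "seconds": 10 },
    { "text_request": "thank you, please record again tomorrow" }
  ]
}
```

Editing this file on the server immediately changes what every connected app asks its users to record, which is what enables a new collection strategy to be rolled out without redeploying the app.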

Other considerations:

- A back-end server is required to collect the samples

- All the recordings from the same phone should share a unique hash that is invisible to the user (stored on the phone, but the user cannot see it). Samples should also be saved with an associated timestamp.

- Support for multiple languages is a must. In our implementation we had English, Spanish and Catalan to start.

- Source code will be made available so that it can be updated and improved by others

- Server will provide a daily download of the audio files with the associated metadata. 
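The server-side storage layout implied by the considerations above can be sketched as follows: one subdirectory per phone, named by its invisible hash, with the timestamp carried in the file name. The directory naming comes from this spec; using epoch seconds and a `.wav` extension are assumptions.

```python
import os
import time

def sample_path(server_root: str, phone_hash: str, ext: str = "wav") -> str:
    """Server-side location for a new sample: a per-phone subdirectory
    named by the phone's hash_number, with a timestamped file name so
    every sample records when it was captured."""
    directory = os.path.join(server_root, phone_hash)
    os.makedirs(directory, exist_ok=True)  # create the phone's folder on first upload
    return os.path.join(directory, f"{int(time.time())}.{ext}")
```

A daily export job can then simply archive the whole tree: the hash subdirectory names and timestamped file names already carry the metadata needed to group samples longitudinally by phone.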

5. Files involved: